Special Characters

Numerical Character References

Every character has a numeric code, even characters that exist on the QWERTY keyboard, such as the letter d (which is a good thing, because you can't type d on a Russian keyboard). To create a numeric character reference, the ampersand must be followed directly by a number sign (#). If you omit the number sign in a numerical code, you'll likely see the character reference as plain text instead of a character.

There are two types of Numerical Character References: decimal and hexadecimal.

Decimal

Decimal (aka Base 10) is the numbering system we use every day, using the digits 0 - 9. For example, the decimal code of d is 100. To write that as a numerical code, you would write &#100;.
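As a minimal sketch of the rule above, the following Python snippet builds a decimal NCR from a character. The function name decimal_ncr is a hypothetical helper chosen for illustration; ord() is Python's built-in for looking up a character's numeric code.

```python
def decimal_ncr(char: str) -> str:
    """Return the decimal numeric character reference for one character."""
    # ord() gives the character's numeric code in Base 10.
    return f"&#{ord(char)};"


print(decimal_ncr("d"))  # &#100;
```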

Hexadecimal

Hexadecimal—also known as Base 16—is quite common in computing. Base 2—also known as binary—is what computers themselves work with, and 16 is equal to 2⁴, so hexadecimal notation serves quite nicely as a binary shorthand (each hexadecimal digit essentially serves as 4 bits, or binary digits). Hexadecimal digits are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. Below is a comparison between Decimal, Hexadecimal, and Binary numbers (the last included to show how Hexadecimal lines up with Binary).

Comparison of Number Systems
Dec	Hex	Bin	Dec	Hex	Bin	Dec	Hex	Bin	Dec	Hex	Bin	Dec	Hex	Bin
0	0	0	4	4	100	8	8	1000	12	C	1100	16	10	10000
1	1	1	5	5	101	9	9	1001	13	D	1101	17	11	10001
2	2	10	6	6	110	10	A	1010	14	E	1110	18	12	10010
3	3	11	7	7	111	11	B	1011	15	F	1111	19	13	10011
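The comparison can also be generated rather than typed by hand. This is a small sketch in Python; number_row and nibbles are hypothetical helper names, and the grouping into 4-bit chunks is only there to illustrate how each hexadecimal digit lines up with four binary digits.

```python
def number_row(n: int) -> tuple[int, str, str]:
    """Return n as (decimal, hexadecimal, binary)."""
    return n, format(n, "X"), format(n, "b")


def nibbles(n: int) -> str:
    """Write n in binary, padded so each hexadecimal digit gets 4 bits."""
    hex_digits = format(n, "X")
    bits = format(n, "b").zfill(4 * len(hex_digits))
    return " ".join(bits[i:i + 4] for i in range(0, len(bits), 4))


# Print the same 0 - 19 range as the comparison table.
for n in range(20):
    print(*number_row(n), sep="\t")

print(nibbles(0x64))  # 0110 0100
```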

Hexadecimal is used in a system of character encoding called Unicode. In the words of the Unicode Consortium, "Unicode is the universal character encoding, maintained by the Unicode Consortium. This encoding standard provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols" (Unicode Consortium, FAQ - Basic Questions).

Unicode codes consist of the characters U+ followed by a series of hexadecimal digits (usually four of them); this is known as a code point. Since 100 in Base 10 is only 64 in Base 16, the Unicode code for d has additional zeros on the left side of the number, so the code point is U+0064.
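A short Python sketch of that padding rule follows. The helper name code_point_label is an assumption made for this example; the format specifier 04X pads to at least four uppercase hexadecimal digits and simply grows when a character needs more.

```python
def code_point_label(char: str) -> str:
    """Return the U+ notation for a character, zero-padded to four digits."""
    return f"U+{ord(char):04X}"


print(code_point_label("d"))  # U+0064
print(code_point_label("π"))  # U+03C0
```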

When creating an NCR using a Unicode code point, U+ is replaced with &#x and a semicolon is added at the end. The &# means this is an NCR, and the following x means it's a hexadecimal number. Thus, the hexadecimal NCR for d is &#x0064;. A quick definition of each part of that sequence is below:

&: Beginning of reference.
#: The reference is a numeric character reference.
x: The NCR is in hexadecimal notation.
0064: The numerical code itself.
;: End of reference.

If you see that the Unicode code point for a character is U+003E, to use it as an HTML NCR, you would write it as &#x003E;. By the way, the result is >, the greater-than sign.
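The conversion from code point to NCR can be sketched in a few lines of Python. hex_ncr is a hypothetical helper name; html.unescape is the standard-library function that decodes character references, used here only to check the result.

```python
import html


def hex_ncr(code_point: str) -> str:
    """Turn a 'U+XXXX' code point label into a hexadecimal NCR."""
    if not code_point.upper().startswith("U+"):
        raise ValueError(f"not a code point label: {code_point!r}")
    # Swap the leading 'U+' for '&#x' and close with a semicolon.
    return f"&#x{code_point[2:]};"


ncr = hex_ncr("U+003E")
print(ncr)                 # &#x003E;
print(html.unescape(ncr))  # >
```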

Just one last note: Unicode code points are conventionally written with at least four digits, but in an NCR you may omit leading zeros. There's no difference between &#x03C0; and &#x3C0; (they're both the same reference). However, &#x3C0; and &#x3C; are not the same: while &#x03C0; and &#x3C0; both refer to the character π (the Greek letter pi), &#x003C;, &#x03C; and &#x3C; all refer to <.

It is also my habit to use capital letters when writing hexadecimal numbers, though case does not matter; the entity codes &#x03C0; and &#x03c0; both refer to π.
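These two rules (leading zeros are optional, and the case of the hexadecimal digits does not matter) can be checked with Python's standard html module. This is only a sketch; same_character is a hypothetical helper name.

```python
import html


def same_character(*references: str) -> bool:
    """True if every reference decodes to the same text."""
    decoded = {html.unescape(ref) for ref in references}
    return len(decoded) == 1


# Leading zeros and letter case make no difference...
print(same_character("&#x03C0;", "&#x3C0;", "&#x03c0;"))  # True
# ...but dropping a trailing zero changes the number itself.
print(same_character("&#x3C0;", "&#x3C;"))                # False
```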

Numerical Character References You May Not Use

Legacy has affected the development of HTML more than anything else, and the legacy of character encoding stretches all the way back to the dawn of telecommunications in the 19th century. Unicode, whose most common encoding is UTF-8 (short for 8-bit Unicode Transformation Format), was created so that its first 256 characters exactly matched the 256 characters of an earlier encoding called ISO-8859-1, still used for many webpages. The first half of ISO-8859-1, in turn, exactly matches the 128 characters of ASCII (American Standard Code for Information Interchange), which developed from 19th-century telegraph codes and is commonly used in computers. Thus, ISO-8859-1 had a number of codes that really have no place in a webpage and for the most part cannot be used; they are known as control codes or control characters. They do not appear as written symbols. Indeed, one (#7, if you must know) is known as the Bell Character; it makes a computer go beep. No, I am not making that up.

These numbers are:

Control Character NCRs
Hexadecimal	Decimal	Reason
0 - 1F	0 - 31	These are characters dealing with the display of text rather than actual characters, called Control Codes. All except 9₁₆ and A₁₆ (9₁₀ and 10₁₀) cause a warning to appear on the W3C's validator.
7F - 9F	127 - 159	These are a second set of control codes. All of these NCRs cause a warning to appear on the W3C's validator.

Most of these will show up as a question mark (since the browser has no idea how to display them) or as a square containing four hexadecimal digits; some may even show an actual character (though it's best to use the official NCR for that character).
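A simple range check is enough to catch these before they reach a webpage. The sketch below is written in Python; is_control_ncr is a hypothetical helper, and it assumes the two ranges run from 0 to 1F and from 7F to 9F in hexadecimal (0 to 31 and 127 to 159 in decimal), with the tab (9) and line feed (10) treated as acceptable.

```python
def is_control_ncr(code: int) -> bool:
    """True if an NCR with this numeric code points at a control character."""
    # Tab and line feed are the two exceptions in the first range.
    if code in (0x9, 0xA):
        return False
    return 0x00 <= code <= 0x1F or 0x7F <= code <= 0x9F


for code in (7, 9, 10, 100, 0x7F, 0x9F, 0xA0):
    print(f"&#{code};", "control" if is_control_ncr(code) else "ok")
```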

Unicode, which goes beyond ISO-8859-1, also has several reserved codes but I won't go into those, as they're all over the place and have ranges of one character to dozens; two Unicode code points, U+FFFE and U+FFFF, will never be assigned values.

Character Entity References

Which reference will create the symbol for a square root (√), also known as a radical sign?

A. &#8730;
B. &#x221A;
C. &radic;

The answer is D. All of the above. Now which of them is easier to remember?
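To see that a decimal NCR, a hexadecimal NCR, and a named reference really can land on the same character, here is a quick check using Python's standard html module. The three references in the list (&#8730;, &#x221A; and &radic;) are the decimal, hexadecimal, and named forms for the square root sign; decode_all is a hypothetical helper name.

```python
import html

REFERENCES = ["&#8730;", "&#x221A;", "&radic;"]


def decode_all(references):
    """Decode each character reference into the character it stands for."""
    return [html.unescape(ref) for ref in references]


print(decode_all(REFERENCES))  # ['√', '√', '√']
```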

Character Entity References (CERs) refer to a piece of coding found in the DTD, which I mentioned back in The Basics of Markup. An actual entity (which you're unlikely to see in an actual HTML webpage) looks something like this:

A Simplified Entity

<!ENTITY radic "&#8730;">

Since the browser can download the DTD (and in the case of HTML, probably already has it memorized), when the browser encounters the reference &radic;, it will know that this is a reference to an entity consisting of the NCR &#8730;. This makes it much easier to type Money = &radic;(All Evil), which displays as Money = √(All Evil).
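Python's standard library ships its own copy of the HTML 4 entity table, which makes the lookup easy to imitate. This is a sketch of the idea rather than of how any browser actually does it; expand_entity is a hypothetical helper, while html.entities.name2codepoint is the real standard-library dictionary mapping entity names to numeric codes.

```python
from html.entities import name2codepoint


def expand_entity(name: str) -> str:
    """Return the decimal NCR that a named entity stands for."""
    # Raises KeyError if the name is not a known HTML 4 entity.
    return f"&#{name2codepoint[name]};"


print(expand_entity("radic"))  # &#8730;
```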

By the way, all character entities in the HTML 4.01 DTD use decimal NCRs, rather than Unicode's hexadecimal NCRs.

Just a warning: If you misspell a character entity reference, one of three things will happen. If the misspelled reference refers to a different character, you will get that character. If it doesn't, in HTML you'll get the reference showing up as plain text, and in XHTML your webpage will simply not work due to an undefined entity error. Either way, the validator will say the reference is not defined and no system identifier could be generated.

Oh, and remember: CERs are case-sensitive: the CER &Dagger; results in ‡, while &dagger; becomes †.
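Both behaviours can be imitated with Python's standard library, which is a rough stand-in for a browser here, not a description of one. html.unescape follows HTML's lenient rules, and xml.etree.ElementTree follows XML's strict ones; xml_accepts is a hypothetical helper, and &radix; is a deliberately misspelled reference.

```python
import html
import xml.etree.ElementTree as ET


def xml_accepts(markup: str) -> bool:
    """True if a strict XML parser can parse the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Case matters: these are two different entities.
print(html.unescape("&Dagger; &dagger;"))  # ‡ †
# HTML-style decoding leaves an unknown reference as plain text.
print(html.unescape("&radix;"))            # &radix;
# A strict XML parser refuses the document instead.
print(xml_accepts("<p>&radix;</p>"))       # False
```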

Character Entity References Specific To HTML

HTML also has a number of character entity references unique to itself; some of those I have already demonstrated: &radic; (√), &dagger; (†), and &Dagger; (‡). Others include letters with diacritics (little marks under, over, or through them that affect their pronunciation). Here are a few of them:

Letters with Diacritics
Description	Upper Case	Upper Case Code	Lower Case	Lower Case Code
A with grave	À	&Agrave;	à	&agrave;
A with acute	Á	&Aacute;	á	&aacute;
A with circumflex	Â	&Acirc;	â	&acirc;
A with tilde	Ã	&Atilde;	ã	&atilde;
A with diaeresis/umlaut	Ä	&Auml;	ä	&auml;
A with ring	Å	&Aring;	å	&aring;
AE Ligature	Æ	&AElig;	æ	&aelig;
C with cedilla	Ç	&Ccedil;	ç	&ccedil;
E with grave	È	&Egrave;	è	&egrave;
E with acute	É	&Eacute;	é	&eacute;
E with circumflex	Ê	&Ecirc;	ê	&ecirc;
E with diaeresis/umlaut	Ë	&Euml;	ë	&euml;
I with grave	Ì	&Igrave;	ì	&igrave;
I with acute	Í	&Iacute;	í	&iacute;
I with circumflex	Î	&Icirc;	î	&icirc;
I with diaeresis/umlaut	Ï	&Iuml;	ï	&iuml;
ETH (Icelandic voiced th)	Ð	&ETH;	ð	&eth;
N with tilde	Ñ	&Ntilde;	ñ	&ntilde;
O with grave	Ò	&Ograve;	ò	&ograve;
O with acute	Ó	&Oacute;	ó	&oacute;
O with circumflex	Ô	&Ocirc;	ô	&ocirc;
O with tilde	Õ	&Otilde;	õ	&otilde;
O with diaeresis/umlaut	Ö	&Ouml;	ö	&ouml;
O with stroke	Ø	&Oslash;	ø	&oslash;
U with grave	Ù	&Ugrave;	ù	&ugrave;
U with acute	Ú	&Uacute;	ú	&uacute;
U with circumflex	Û	&Ucirc;	û	&ucirc;
U with diaeresis/umlaut	Ü	&Uuml;	ü	&uuml;
Y with acute	Ý	&Yacute;	ý	&yacute;
Thorn (Icelandic unvoiced th)	Þ	&THORN;	þ	&thorn;
German sz ligature	-	-	ß	&szlig;
Lower-case Y with diaeresis/umlaut	-	-	ÿ	&yuml;

There are other characters as well that are not letters:

Other HTML characters
Character	Code	Description
	&nbsp;	This is called the non-breaking space. It adds a space at which the browser will not break the line.
¡	&iexcl;	Spanish inverted exclamation mark
¿	&iquest;	Spanish inverted question mark
¢	&cent;	The cent sign
©	&copy;	Copyright sign
«	&laquo;	French left-pointing double angle quotation mark
»	&raquo;	French right-pointing double angle quotation mark
®	&reg;	Registered trade mark sign
°	&deg;	Degree sign
÷	&divide;	Division sign
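If you have a character in hand and want its HTML character entity reference, the lookup can be run in reverse. The sketch below uses Python's standard html.entities.codepoint2name dictionary, which covers the HTML 4 entities; cer_for is a hypothetical helper, and falling back to a decimal NCR for characters without a name is a design choice made for this example.

```python
from html.entities import codepoint2name


def cer_for(char: str) -> str:
    """Return the named reference for a character, or a decimal NCR if none exists."""
    code = ord(char)
    name = codepoint2name.get(code)
    return f"&{name};" if name else f"&#{code};"


for char in "Àßÿ©÷":
    print(char, cer_for(char), sep="\t")
```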

Character Entity References And XML-Derived Languages

There are four characters that are so common amongst the SGML-derived markup languages (this includes XML) that their character entity references are recognized by one and all:

<: Character entity reference: &lt;
>: Character entity reference: &gt;
&: Character entity reference: &amp;
": Character entity reference: &quot;

The apostrophe (') is an interesting exception. Its character entity reference (&apos;) is recognized by all markup languages (as it's part of the XML standard) with the sole exception of HTML (HTML 4.01 and earlier do not define it).
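Python's standard html.escape function applies exactly these references, and it sidesteps the apostrophe problem by writing the apostrophe as a numeric reference instead of a named one. This is a minimal sketch; escape_markup is a hypothetical wrapper name.

```python
import html


def escape_markup(text: str) -> str:
    """Replace <, >, & and quotation marks with character references."""
    return html.escape(text, quote=True)


print(escape_markup('<a href="x">Tom & Jerry</a>'))
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;
print(escape_markup("it's"))
# it&#x27;s
```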

Aside from the four universal CERs, XML-derived languages often have their own. Knowing what CERs they have is important—if they aren't defined in the DTD, the browser will display an error referring to an unknown character entity reference.

Note: this will not occur with NCRs; those are unambiguous and independent of any DTD. The NCR &#x2669; will always be the same character, no matter which markup language you use: ♩.
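As a rough demonstration, the same NCR can be fed to an XML parser (with no DTD at all) and to an HTML-style decoder, and both give the same character. The sketch uses Python's standard library; decode_in_xml and decode_in_html are hypothetical helper names, and the note element is just a placeholder wrapper.

```python
import html
import xml.etree.ElementTree as ET


def decode_in_xml(reference: str) -> str:
    """Parse the reference inside a tiny XML document that has no DTD."""
    return ET.fromstring(f"<note>{reference}</note>").text


def decode_in_html(reference: str) -> str:
    """Decode the reference using HTML's rules."""
    return html.unescape(reference)


print(decode_in_xml("&#x2669;"), decode_in_html("&#x2669;"))  # ♩ ♩
```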