Special Characters

The French phrase for How are you? is Ça va? The German phrase for What is your name? is Wie heißt du? And the Russian for I speak Russian is Я говорю на русском языке. Each phrase uses letters not found on a standard QWERTY keyboard. And if you want to show HTML code, you can't use the less-than (<) or greater-than (>) signs as you type them on your keyboard; your browser will read them as actual markup.

So how does one make these characters show up on a screen? You do it through a kind of code called a character reference. There are two types:

  1. Numerical Character Reference (NCR for short)
  2. Character Entity Reference (CER for short)

You may have heard these called entities, but that is incorrect. I'll tell you why when we get to character entity references.

How To Create A Character Reference

Every reference starts and ends with specific characters. The ampersand, or &, starts a character reference. Omit this, and your reference will be interpreted literally, rather than displaying what it's supposed to represent. The semicolon (;) ends the character reference. Omit this, and your browser will be confused about where the character reference is supposed to end; you might get a character (not necessarily the one you want) or simply the reference as plain text (if you've seen pages with &quot all over the place, this is why). Since it comes at the end, a semicolon outside of a character reference is simply rendered as text.
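To see both delimiters at work, here is a small sketch in Python, whose standard html module decodes references using HTML5 rules. Python and the variable names are my own choice for illustration; a browser does the equivalent internally, and older browsers may guess differently when the semicolon is missing.

```python
import html

# A complete reference: ampersand at the start, semicolon at the end.
complete = html.unescape("&gt;")

# No ampersand: nothing marks this as a reference, so it stays literal text.
no_ampersand = html.unescape("gt;")

# No semicolon: the decoder has to guess where the reference ends.
# Here it matches the shorter name "not" and leaves "it" behind as text.
no_semicolon = html.unescape("&notit")

# A semicolon outside of a reference is just text.
lone_semicolon = html.unescape("a; b")

print(complete, no_ampersand, no_semicolon, lone_semicolon)
```

The third case is the "not necessarily the one you want" situation: the decoder found a character, just not one anybody asked for.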

Numerical Character References

Every character has a numeric code, even characters that exist on the QWERTY keyboard, such as the letter d (which is a good thing, because you can't type d on a Russian keyboard). To create a numeric character reference, the ampersand must be followed directly by a number sign (#). If you omit the number sign in a numerical code, you'll likely see the character reference as plain text instead of a character.

There are two types of Numerical Character References: decimal and hexadecimal.

Decimal

Decimal (aka Base 10) is the numbering system we use every day, using the digits 0 - 9. For example, the decimal code for d is 100. To write that as a numerical character reference, you would write &#100;.
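If you would rather not look the number up, a few lines of Python can do it for you. This is only a sketch: decimal_ncr is a hypothetical helper name of mine, and it relies on Python's built-in ord(), which returns a character's numeric code.

```python
import html

def decimal_ncr(char):
    """Build a decimal numeric character reference for one character."""
    return f"&#{ord(char)};"

ncr = decimal_ncr("d")                  # the reference for the letter d
round_trip = html.unescape(ncr)         # decoding it gives the letter back

# Without the number sign, "&100;" is not a numeric reference at all,
# so the decoder leaves it as plain text.
missing_hash = html.unescape("&100;")

print(ncr, round_trip, missing_hash)
```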

Hexadecimal

Hexadecimal, also known as Base 16, is quite common in computing. Base 2, also known as binary, is what computers themselves work with, and 16 is equal to 2⁴, so hexadecimal notation serves quite nicely as a binary shorthand (each hexadecimal digit essentially serves as 4 bits, or binary digits). Hexadecimal digits are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. Below is a comparison between Decimal, Hexadecimal, and Binary numbers (the last included to show how Hexadecimal lines up with Binary).

Comparison of Number Systems
Dec  Hex  Bin     Dec  Hex  Bin     Dec  Hex  Bin     Dec  Hex  Bin     Dec  Hex  Bin
0    0    0       4    4    100     8    8    1000    12   C    1100    16   10   10000
1    1    1       5    5    101     9    9    1001    13   D    1101    17   11   10001
2    2    10      6    6    110     10   A    1010    14   E    1110    18   12   10010
3    3    11      7    7    111     11   B    1011    15   F    1111    19   13   10011
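You can produce the same three columns yourself. The sketch below uses Python's built-in format() function; number_row is a name I made up for the example.

```python
def number_row(n):
    """Return n written in decimal, hexadecimal, and binary."""
    return (format(n, "d"), format(n, "X"), format(n, "b"))

# Print the numbers 0 through 19 in all three systems.
for n in range(20):
    dec, hexa, binary = number_row(n)
    print(f"{dec:>3} {hexa:>3} {binary:>6}")

# One hexadecimal digit always fits in exactly four bits.
four_bits = format(0xF, "04b")
```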

Hexadecimal is used in a system of character encoding called Unicode. In the words of the Unicode Consortium, "Unicode is the universal character encoding, maintained by the Unicode Consortium. This encoding standard provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols." (Unicode Consortium, FAQ - Basic Questions)

Unicode codes consist of the characters U+ followed by a series of hexadecimal digits (usually four of them); this is known as a code point. Since 100 in Base 10 is only 64 in Base 16, the Unicode code has additional zeros on the left side of the number, so the code point is U+0064.

When creating an NCR from a Unicode code point, replace U+ with &#x and add a semicolon at the end. The &# means this is an NCR, and the x that follows means the number is hexadecimal. Thus, the hexadecimal NCR for d is &#x0064;. A quick definition of each part of that sequence is below:

&
Beginning of reference.
#
The reference is a numeric character reference.
x
The NCR is in hexadecimal notation.
0064
The numerical code itself.
;
End of reference.

If you see that the Unicode code point for a character is U+003E, to use it as an HTML NCR, you would write it as &#x003E;. By the way, the result is >—the greater-than sign.
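The swap from U+ to &#x is mechanical enough to sketch in a few lines of Python. The function name code_point_to_ncr is hypothetical, and the only check it makes is that the input starts with U+.

```python
import html

def code_point_to_ncr(code_point):
    """Turn a Unicode code point such as "U+003E" into a hexadecimal NCR."""
    if not code_point.upper().startswith("U+"):
        raise ValueError(f"not a code point: {code_point!r}")
    # Swap the "U+" prefix for "&#x" and close the reference with ";".
    return "&#x" + code_point[2:] + ";"

ncr_d = code_point_to_ncr("U+0064")
ncr_gt = code_point_to_ncr("U+003E")

print(ncr_d, html.unescape(ncr_d))
print(ncr_gt, html.unescape(ncr_gt))
```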

Just one last note: Unicode code points are written with at least four digits, but in an NCR you may omit leading zeros. The references &#x03C0; and &#x3C0; mean exactly the same thing. However, &#x3C0; and &#x3C; do not: while &#x03C0; and &#x3C0; both refer to the character π (the Greek letter pi), &#x003C;, &#x03C; and &#x3C; all refer to <.

It is also my habit to use capital letters when writing hexadecimal numbers, though case does not matter; the references &#x3C0; and &#x3c0; both refer to π.
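A quick way to convince yourself of both points (leading zeros and letter case) is to decode each spelling and compare the results. This is a sketch using Python's standard html module; the variable names are mine.

```python
import html

# Three spellings that should all mean pi, and three that should all mean <.
pi_forms = ["&#x03C0;", "&#x3C0;", "&#x3c0;"]
lt_forms = ["&#x003C;", "&#x03C;", "&#x3C;"]

# A set keeps only distinct results, so one element means "all the same".
pi_chars = {html.unescape(ref) for ref in pi_forms}
lt_chars = {html.unescape(ref) for ref in lt_forms}

print(pi_chars, lt_chars)
```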

Numerical Character References You May Not Use

More than anything else, legacy has shaped the development of HTML, and the legacy of character encoding stretches all the way back to the dawn of telecommunications in the 19th century. Unicode, whose most common encoding is UTF-8 (short for 8-bit Unicode Transformation Format), was created so that its first 256 code points exactly match the 256 characters of an earlier encoding called ISO-8859-1, still used for many webpages. The first half of ISO-8859-1, in turn, exactly matches the 128 characters of ASCII (American Standard Code for Information Interchange), which developed from 19th-century telegraph codes and is commonly used in computers. Thus, ISO-8859-1 has a number of codes that really have no place in a webpage and for the most part cannot be used; they are known as control codes or control characters. They do not appear as written symbols; indeed one (#7, if you must know) is known as the Bell Character: it makes a computer go beep. No, I am not making that up.

These numbers are:

Control Character NCRs
Hexadecimal  Decimal    Reason
0 - 1F       0 - 31     These are codes that deal with the display of text rather than actual characters, called Control Codes. All except hexadecimal 9 and A (decimal 9 and 10) cause a warning to appear on the W3C's validator.
7F - 9F      127 - 159  These are a second set of control codes. All of these NCRs cause a warning to appear on the W3C's validator.

Most of these will show up as a question mark (since the browser has no idea how to display them), or a square containing four hexadecimal digits, or it may even show an actual character (though it's best to use the official NCR for the display character).
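Unicode files these codes under a general category it calls Cc (control), so you can list them rather than memorize them. The sketch below uses Python's standard unicodedata module; is_control is a helper name I invented, and it says nothing about which codes a particular validator accepts.

```python
import unicodedata

def is_control(code):
    """True if the code point is in Unicode's control category (Cc)."""
    return unicodedata.category(chr(code)) == "Cc"

# Collect every control code among the first 256 code points.
controls = [n for n in range(256) if is_control(n)]
first_block = [n for n in controls if n < 0x20]     # hex 0 - 1F
second_block = [n for n in controls if n >= 0x7F]   # hex 7F - 9F

print(len(first_block), len(second_block))
```

Code 7, the Bell Character, is in the first block; the letter d (code 100) is in neither.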

Unicode, which goes beyond ISO-8859-1, also has several reserved codes but I won't go into those, as they're all over the place and have ranges of one character to dozens; two Unicode code points, U+FFFE and U+FFFF, will never be assigned values.

Character Entity References

Which reference will create the symbol for a square root (√), also known as the radical sign?

  A. &#x221A;
  B. &#8730;
  C. &radic;

The answer is D. All of the above. Now which of them is easier to remember?
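If you don't want to take my word for it, decode all three and compare. This is a sketch with Python's standard html module; the list name is mine.

```python
import html

# Hexadecimal NCR, decimal NCR, and character entity reference.
references = ["&#x221A;", "&#8730;", "&radic;"]
decoded = [html.unescape(ref) for ref in references]

print(decoded)
```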

Character Entity References (CERs) refer to a piece of coding found in the DTD, which I mentioned back in The Basics of Markup. An actual entity (which you're unlikely to see in an actual HTML webpage) looks something like this:

A Simplified Entity
<!ENTITY radic "&#8730;">

Since the browser can download the DTD (and in the case of HTML, probably already has it memorized), when the browser encounters the reference &radic;, it will know that this is a reference to an entity consisting of the NCR &#8730;. This makes it much easier to type Money = √(All Evil).
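Python happens to ship a copy of that "memorized" list: the table name2codepoint in its standard html.entities module maps the HTML 4 entity names to their decimal numbers. The sketch below imitates the lookup; expand_entity is a hypothetical helper, not how any browser is actually written.

```python
from html.entities import name2codepoint

def expand_entity(name):
    """Look up an entity name and return the decimal NCR it stands for."""
    return f"&#{name2codepoint[name]};"

radic_ncr = expand_entity("radic")
radic_char = chr(name2codepoint["radic"])

print(radic_ncr, radic_char)
```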

By the way, all character entities in the HTML 4.01 DTD use decimal NCRs, rather than Unicode's hexadecimal NCRs.

Just a warning: if you misspell a character entity reference, one of three things will happen. If the misspelled reference happens to be the name of a different character, you will get that character. If it doesn't, in HTML you'll get the reference showing up as plain text, and in XHTML your webpage will simply not work due to an undefined entity error. Either way, the validator will say the reference is not defined and no system identifier could be generated.

Oh, and remember that CERs are case-sensitive: the CER &Dagger; results in ‡, while &dagger; becomes †.
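Here is a sketch of the case-sensitive and misspelled cases, using Python's html module. It behaves like forgiving HTML rather than strict XHTML, so an unknown name is left as text instead of raising an error. The misspelling &dager; is one I made up for the example.

```python
import html

upper = html.unescape("&Dagger;")       # capital D: the double dagger
lower = html.unescape("&dagger;")       # small d: the single dagger

# "dager" is not a defined name, so the reference comes back untouched.
misspelled = html.unescape("&dager;")

print(upper, lower, misspelled)
```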

Character Entity References Specific To HTML

HTML also has a number of character entity references of its own; some of those I have already demonstrated: √, †, and ‡. Others include letters with diacritics (little marks under, over, or through them that affect their pronunciation). Here are a few of them:

Letters with Diacritics
Description Upper Case Upper Case Code Lower Case Lower Case Code
A with grave À &Agrave; à &agrave;
A with acute Á &Aacute; á &aacute;
A with circumflex  &Acirc; â &acirc;
A with tilde à &Atilde; ã &atilde;
A with diaeresis/umlaut Ä &Auml; ä &auml;
A with ring Å &Aring; å &aring;
AE Ligature Æ &AElig; æ &aelig;
C with cedilla Ç &Ccedil; ç &ccedil;
E with grave È &Egrave; è &egrave;
E with acute É &Eacute; é &eacute;
E with circumflex Ê &Ecirc; ê &ecirc;
E with diaeresis/umlaut Ë &Euml; ë &euml;
I with grave Ì &Igrave; ì &igrave;
I with acute Í &Iacute; í &iacute;
I with circumflex Î &Icirc; î &icirc;
I with diaeresis/umlaut Ï &Iuml; ï &iuml;
ETH (Icelandic voiced th) Ð &ETH; ð &eth;
N with tilde Ñ &Ntilde; ñ &ntilde;
O with grave Ò &Ograve; ò &ograve;
O with acute Ó &Oacute; ó &oacute;
O with circumflex Ô &Ocirc; ô &ocirc;
O with tilde Õ &Otilde; õ &otilde;
O with diaeresis/umlaut Ö &Ouml; ö &ouml;
O with stroke Ø &Oslash; ø &oslash;
U with grave Ù &Ugrave; ù &ugrave;
U with acute Ú &Uacute; ú &uacute;
U with circumflex Û &Ucirc; û &ucirc;
U with diaeresis/umlaut Ü &Uuml; ü &uuml;
Y with acute Ý &Yacute; ý &yacute;
Thorn (Icelandic unvoiced th) Þ &THORN; þ &thorn;
German sz ligature - - ß &szlig;
Lower-case Y with diaeresis/umlaut - - ÿ &yuml;
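With a table like this one, converting text is a simple lookup. The sketch below uses Python's standard html.entities.codepoint2name table (the HTML 4 names) to swap each non-ASCII letter for its reference; to_named_references is a hypothetical helper, and characters without a name, such as Russian letters, are left as they are.

```python
from html.entities import codepoint2name

def to_named_references(text):
    """Replace non-ASCII characters with named references where one exists."""
    out = []
    for ch in text:
        # Only look up characters beyond ASCII; leave everything else alone.
        name = codepoint2name.get(ord(ch)) if ord(ch) > 127 else None
        out.append(f"&{name};" if name else ch)
    return "".join(out)

french = to_named_references("\u00c7a va?")         # C with cedilla
german = to_named_references("Wie hei\u00dft du?")  # German sz ligature

print(french)
print(german)
```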

There are other characters as well that are not letters:

Other HTML characters
Character Code Description
  &nbsp; This is called the non-breaking space. It adds a space at which the browser will not break the line, keeping the words on either side together.
¡ &iexcl; Spanish inverted exclamation mark
¿ &iquest; Spanish inverted question mark
¢ &cent; The cent sign
© &copy; Copyright sign
« &laquo; French left-pointing double angle quotation mark
» &raquo; French right-pointing double angle quotation mark
® &reg; Registered trade mark sign
° &deg; Degree sign
÷ &divide; Division sign

Character Entity References And XML-Derived Languages

There are four characters that are so common amongst the SGML-derived markup languages (this includes XML) that their character entity references are recognized by one and all:

<
Character entity reference: &lt;
>
Character entity reference: &gt;
&
Character entity reference: &amp;
"
Character entity reference: &quot;

The apostrophe (') is an interesting exception. Its character entity reference (&apos;) is part of the XML standard, so it is recognized by every XML-derived markup language, but it is not defined in HTML 4.01; HTML only gained it with HTML5.
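Most programming languages have a helper that applies these references for you. As a sketch, Python's standard html.escape() replaces the four universal characters by name; for the apostrophe it uses a hexadecimal NCR instead of &apos;. The sample strings are my own.

```python
import html

snippet = '<a href="index.html">Tom & Jerry</a>'

# The four universal characters become named references.
escaped = html.escape(snippet)

# The apostrophe is written as a hexadecimal NCR rather than &apos;.
apostrophe = html.escape("it's")

print(escaped)
print(apostrophe)
```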

Aside from the four universal CERs, XML-derived languages often have their own. Knowing what CERs they have is important—if they aren't defined in the DTD, the browser will display an error referring to an unknown character entity reference.

Note: this will not occur with NCRs; those are unambiguous and independent of any DTD. The NCR &#x2669; will always be the same character, no matter which markup language you use: ♩.
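You can watch the difference with a strict XML parser. The sketch below uses Python's standard xml.etree.ElementTree on two tiny documents of my own invention, neither of which has a DTD: the NCR and the universal references parse fine, while &radic; is rejected.

```python
import xml.etree.ElementTree as ET

# NCRs and the universal references need no DTD at all.
note = ET.fromstring("<p>&#x2669; &lt; &amp;</p>").text

# A named reference that no DTD defines is an error in XML.
try:
    ET.fromstring("<p>&radic;</p>")
    radic_accepted = True
except ET.ParseError:
    radic_accepted = False

print(note, radic_accepted)
```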