The French phrase for How are you?
is Ça va?
. The German phrase for What is your name?
is Wie heißt du?
And Russian for I speak Russian
is Я говорю на русском языке
. Each phrase uses letters not found on the standard QWERTY keyboard. If you want to show HTML code, you can't use the standard less-than (<
) or greater-than (>
) signs as you see them on your keyboard; your browser will read that as actual markup.
So how does one make these characters show up on a screen? You do it through a kind of code called a character reference. There are two types: numeric character references (NCRs) and character entity references (CERs).
You may have heard these called entities
, but that is incorrect. I'll tell you why when we get to character entity references.
Every reference starts and ends with specific characters. The ampersand, or &
, starts a character reference. Omit this, and your reference will be interpreted literally, rather than displaying what it's supposed to represent. The semicolon (;
) ends the character reference. Omit this, and your browser will be confused about where the character reference is supposed to end; you might get a character (not necessarily the one you want) or simply the reference as plain text (if you've seen pages with unrendered references scattered all over the place, this is why). Since it comes at the end, a semicolon outside of a character reference is simply rendered as text.
Every character has a numeric code, even characters that exist on the QWERTY keyboard, such as the letter d
(which is a good thing, because you can't type d
on a Russian keyboard). To create a numeric character reference, the ampersand must be followed directly by a number sign (#
). If you omit the number sign in a numerical code, you'll likely see the character reference as plain text instead of a character.
There are two types of Numerical Character References: decimal and hexadecimal.
Decimal (aka Base 10) is the numbering system we use every day, using the digits 0 - 9. For example, the decimal code of d is 100. To write that as a numeric character reference, you would write &#100;, which displays as d.
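As a quick check of the decimal form, here is a minimal Python sketch. It assumes Python's standard html module, whose unescape function decodes references in roughly the way a browser would; Python is only a convenient test bench here and is not part of the original tutorial.

```python
import html

# ord() gives the decimal code of a character; for "d" that is 100.
code = ord("d")

# Build the decimal NCR: ampersand, number sign, digits, semicolon.
ncr = f"&#{code};"

# html.unescape() decodes character references in a string.
decoded = html.unescape(ncr)

# Without the number sign this is not a numeric reference, so it stays as text.
broken = html.unescape("&100;")

print(code, ncr, decoded, broken)
```

The last value shows what the omitted number sign looks like in practice: the reference comes back unchanged.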
Hexadecimal (also known as Base 16) is quite common in computing. Base 2 (also known as binary) is what computers themselves work with, and 16 is equal to 2^4, so hexadecimal notation serves quite nicely as a binary shorthand (each hexadecimal digit stands for exactly 4 bits, or binary digits). Hexadecimal digits are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. Below is a comparison between Decimal, Hexadecimal, and Binary numbers (the last included to show how Hexadecimal lines up with Binary).
Dec | Hex | Bin | Dec | Hex | Bin | Dec | Hex | Bin | Dec | Hex | Bin | Dec | Hex | Bin |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 4 | 4 | 100 | 8 | 8 | 1000 | 12 | C | 1100 | 16 | 10 | 10000 |
1 | 1 | 1 | 5 | 5 | 101 | 9 | 9 | 1001 | 13 | D | 1101 | 17 | 11 | 10001 |
2 | 2 | 10 | 6 | 6 | 110 | 10 | A | 1010 | 14 | E | 1110 | 18 | 12 | 10010 |
3 | 3 | 11 | 7 | 7 | 111 | 11 | B | 1011 | 15 | F | 1111 | 19 | 13 | 10011 |
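The comparison above can be regenerated rather than typed by hand. The following is a small Python sketch using only the built-in format function; the variable names are illustrative, not from the original article.

```python
# Decimal, hexadecimal, and binary forms of 0 through 19.
rows = [(n, format(n, "X"), format(n, "b")) for n in range(20)]
for dec, hex_digits, bits in rows:
    print(f"{dec:>3} | {hex_digits:>3} | {bits:>5}")

# Each hexadecimal digit maps to exactly four bits: 6 -> 0110, 4 -> 0100.
nibbles = " ".join(format(int(digit, 16), "04b") for digit in "64")
print(nibbles)
```

Reading the four-bit groups side by side is what makes hexadecimal a handy shorthand for binary.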
Hexadecimal is used in a system of character encoding called Unicode. In the words of the Unicode Consortium, "Unicode is the universal character encoding, maintained by the Unicode Consortium. This encoding standard provides the basis for processing, storage and interchange of text data in any language in all modern software and information technology protocols." (Unicode Consortium, FAQ - Basic Questions).
Unicode codes consist of the characters U+
followed by a series of hexadecimal digits (usually four of them); this is known as a code point. Since 100
in Base 10 is only 64
in Base 16, the Unicode code has additional zeros on the left side of the number, so the code point is U+0064
.
When creating an NCR using a Unicode code point, U+
is replaced with &#x
; &#
means this is an NCR, and the following x
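means it's a hexadecimal number. Thus, the hexadecimal NCR for d is &#x0064;. Each part of that sequence has a job: & starts the reference, # marks it as numeric, x marks the number as hexadecimal, 0064 is the code point's digits, and ; ends the reference.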
If you see that the Unicode code point for a character is U+003E
, to use it as an HTML NCR, you would write it as &#x003E;. By the way, the result is >, the greater-than sign.
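The conversion from a code point to a hexadecimal NCR is mechanical, so it can be sketched in a few lines of Python. The helper names hex_ncr and code_point_label are hypothetical, made up for this illustration, and html.unescape stands in for the browser.

```python
import html

def code_point_label(char: str) -> str:
    """Return the U+ notation for one character, padded to four hex digits."""
    return f"U+{ord(char):04X}"

def hex_ncr(char: str) -> str:
    """Return the hexadecimal NCR: the U+ prefix is swapped for &#x, plus a closing semicolon."""
    return f"&#x{ord(char):04X};"

for char in "d>":
    reference = hex_ncr(char)
    # Decoding the reference should give back the original character.
    print(code_point_label(char), reference, html.unescape(reference))
```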
Just one last note: Unicode code points are usually written with four digits, but you may omit leading zeros in an NCR. There's no difference between &#x03C0; and &#x3C0;; they're both the same reference. However, trailing zeros do matter, so &#x3C0; and &#x3C; are not the same: while &#x03C0; and &#x3C0; both refer to the character π (the Greek letter pi), &#x003C;, &#x03C; and &#x3C; all refer to <.
It is also my habit to use capital letters when writing hexadecimal numbers, though case does not matter; the codes &#x03c0; and &#x03C0; both refer to π.
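Both points (leading zeros are optional, and the case of the hexadecimal digits is ignored) can be checked with a short Python sketch, again assuming html.unescape as a stand-in for the browser's decoder.

```python
import html

# Leading zeros do not change the value: both decode to the Greek letter pi.
with_zero = html.unescape("&#x03C0;")
without_zero = html.unescape("&#x3C0;")

# A trailing zero is part of the number: 3C is a different code point from 3C0.
less_than = html.unescape("&#x3C;")

# Upper-case and lower-case hexadecimal digits are equivalent.
lower_case = html.unescape("&#x03c0;")

print(with_zero, without_zero, less_than, lower_case)
```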
Nothing has affected the development of HTML more than legacy, and the legacy of character encoding stretches all the way back to the dawn of telecommunications in the 19th century. Unicode, whose most common encoding is UTF-8 (short for 8-bit Unicode Transformation Format), was designed so that its first 256 code points exactly match the 256 characters of an earlier encoding called ISO-8859-1
, still used for many webpages. The first half of ISO-8859-1, in turn, exactly matches the first 128 characters of ASCII (American Standard Code for Information Interchange), which developed from 19th-century telegraph codes and is commonly used in computers. Thus, ISO-8859-1 had a number of codes that really have no place in a webpage and for the most part cannot be used; they are known as control codes
or control characters
. They do not appear as written symbols—indeed one (#7, if you must know) is known as the Bell Character; it makes a computer go beep
. No, I am not making that up.
These numbers are:
Hexadecimal | Decimal | Reason |
---|---|---|
0 - 1F | 0 - 31 | These deal with the display of text rather than being actual characters, and are called Control Codes. All except 9 and A in hexadecimal (9 and 10 in decimal) cause a warning to appear on the W3C's validator. |
7F - 9F | 127 - 159 | These are a second set of control codes. All of these NCRs cause a warning to appear on the W3C's validator. |
Most of these will show up as a question mark (since the browser has no idea how to display them), or a square containing four hexadecimal digits, or it may even show an actual character (though it's best to use the official NCR for the display character).
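To see which of the first 256 codes are control codes, one option is to ask Unicode's own character database. This Python sketch assumes the standard unicodedata module, where the general category Cc means "control"; is_control_code is a hypothetical helper name. The last line illustrates the "actual character" case: Python's html.unescape follows the HTML5 convention of remapping some codes in the 128 - 159 range to Windows-1252 characters, which is an assumption about that decoder, not a statement about every browser.

```python
import html
import unicodedata

def is_control_code(code_point: int) -> bool:
    """True if Unicode classifies the code point as a control character (category Cc)."""
    return unicodedata.category(chr(code_point)) == "Cc"

# Collect every control code among the first 256 code points.
controls = [n for n in range(256) if is_control_code(n)]
print(controls[0], controls[31], controls[32], controls[-1])

# Code 128 is a control code, yet this decoder hands back a visible character.
remapped = html.unescape("&#128;")
print(remapped)
```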
Unicode, which goes beyond ISO-8859-1, also has several reserved codes but I won't go into those, as they're all over the place and have ranges of one character to dozens; two Unicode code points, U+FFFE and U+FFFF, will never be assigned values.
Which reference will create the symbol for a square root (√), also known as a radical sign?
A. &#8730;
B. &#x221A;
C. &radic;
D. All of the above
The answer is D. All of the above
. Now which of them is easier to remember?
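If you'd like to confirm the answer yourself, this Python sketch decodes the decimal NCR, the hexadecimal NCR, and the named reference for the square root sign, assuming html.unescape behaves like a browser's decoder for these three forms.

```python
import html

# Decimal NCR, hexadecimal NCR, and character entity reference for the same symbol.
forms = ["&#8730;", "&#x221A;", "&radic;"]
decoded = [html.unescape(form) for form in forms]

for form, char in zip(forms, decoded):
    print(form, "->", char)
```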
Character Entity References (CERs) refer to a piece of coding found in the DTD, which I mentioned back in The Basics of Markup. An actual entity (which you're unlikely to see in an actual HTML webpage) looks something like this:
<!ENTITY radic CDATA "&#8730;">
Since the browser can download the DTD (and in the case of HTML, probably already has it memorized), when the browser encounters the reference &radic;, it will know that this is a reference to an entity consisting of the NCR &#8730;. This makes it much easier to type Money = &radic;(All Evil), which displays as Money = √(All Evil).
By the way, all character entities in the HTML 4.01 DTD use decimal NCRs, rather than Unicode's hexadecimal NCRs.
Just a warning: If you misspell a character entity reference, one of three things will happen. If the misspelled reference happens to be the name of a different character, you will get that character. If it doesn't, in HTML you'll get the reference showing up as plain text, and in XHTML your webpage will simply not work due to an undefined entity error. Either way, the validator will say the reference is not defined and no system identifier could be generated.
Oh, and remember: CERs are case-sensitive: the CER &Dagger; results in ‡, while &dagger; becomes †.
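Case sensitivity and misspellings are easy to demonstrate with a short Python sketch. It assumes the standard html.entities.name2codepoint table (Python's copy of the HTML 4.01 entity names) and html.unescape as a rough stand-in for an HTML browser; the misspelling &radik; is made up for the example.

```python
import html
from html.entities import name2codepoint

# The same name in different case refers to two different characters.
double_dagger = html.unescape("&Dagger;")
single_dagger = html.unescape("&dagger;")
print(double_dagger, hex(name2codepoint["Dagger"]))
print(single_dagger, hex(name2codepoint["dagger"]))

# A misspelled name matches nothing, so it comes back as plain text.
misspelled = html.unescape("&radik;")
print(misspelled)
```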
HTML also has a number of character entity references unique to itself; some of those I have already demonstrated: &radic;, &dagger;, and &Dagger;. Others include letters with diacritics (little marks under, over, or through them that affect their pronunciation). Here are a few of them:
Description | Upper Case | Upper Case Code | Lower Case | Lower Case Code |
---|---|---|---|---|
A with grave | À | &Agrave; | à | &agrave; |
A with acute | Á | &Aacute; | á | &aacute; |
A with circumflex | Â | &Acirc; | â | &acirc; |
A with tilde | Ã | &Atilde; | ã | &atilde; |
A with diaeresis/umlaut | Ä | &Auml; | ä | &auml; |
A with ring | Å | &Aring; | å | &aring; |
AE Ligature | Æ | &AElig; | æ | &aelig; |
C with cedilla | Ç | &Ccedil; | ç | &ccedil; |
E with grave | È | &Egrave; | è | &egrave; |
E with acute | É | &Eacute; | é | &eacute; |
E with circumflex | Ê | &Ecirc; | ê | &ecirc; |
E with diaeresis/umlaut | Ë | &Euml; | ë | &euml; |
I with grave | Ì | &Igrave; | ì | &igrave; |
I with acute | Í | &Iacute; | í | &iacute; |
I with circumflex | Î | &Icirc; | î | &icirc; |
I with diaeresis/umlaut | Ï | &Iuml; | ï | &iuml; |
ETH (Icelandic voiced th) | Ð | &ETH; | ð | &eth; |
N with tilde | Ñ | &Ntilde; | ñ | &ntilde; |
O with grave | Ò | &Ograve; | ò | &ograve; |
O with acute | Ó | &Oacute; | ó | &oacute; |
O with circumflex | Ô | &Ocirc; | ô | &ocirc; |
O with tilde | Õ | &Otilde; | õ | &otilde; |
O with diaeresis/umlaut | Ö | &Ouml; | ö | &ouml; |
O with stroke | Ø | &Oslash; | ø | &oslash; |
U with grave | Ù | &Ugrave; | ù | &ugrave; |
U with acute | Ú | &Uacute; | ú | &uacute; |
U with circumflex | Û | &Ucirc; | û | &ucirc; |
U with diaeresis/umlaut | Ü | &Uuml; | ü | &uuml; |
Y with acute | Ý | &Yacute; | ý | &yacute; |
Thorn (Icelandic unvoiced th) | Þ | &THORN; | þ | &thorn; |
German sz ligature | - | - | ß | &szlig; |
Lower-case Y with diaeresis/umlaut | - | - | ÿ | &yuml; |
There are other characters as well that are not letters:
Character | Code | Description |
---|---|---|
(space) | &nbsp; | This is called the non-breaking space. It is used to add a space where the browser must not break the line. |
¡ | &iexcl; | Spanish inverted exclamation mark |
¿ | &iquest; | Spanish inverted question mark |
¢ | &cent; | The cent sign |
© | &copy; | Copyright sign |
« | &laquo; | French left-pointing double angle quotation mark |
» | &raquo; | French right-pointing double angle quotation mark |
® | &reg; | Registered trade mark sign |
° | &deg; | Degree sign |
÷ | &divide; | Division sign |
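Rather than memorizing tables like these, you can look a name up from the character. The sketch below assumes Python's html.entities.codepoint2name table, which mirrors the HTML 4.01 entity list; cer_for is a hypothetical helper that falls back to a decimal NCR when the character has no name in that table.

```python
from html.entities import codepoint2name

def cer_for(char: str) -> str:
    """Return the HTML 4.01 entity reference for a character, or a decimal NCR if it has none."""
    name = codepoint2name.get(ord(char))
    if name is not None:
        return f"&{name};"
    # No named entity in the table: a numeric reference always works.
    return f"&#{ord(char)};"

for char in "éß©÷♩":
    print(char, cer_for(char))
```

The fallback is a design choice for this sketch: a numeric reference needs no entity table at all, so it is the safe default.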
There are four characters that are so common amongst the SGML-derived markup languages (this includes XML) that their character entity references are recognized by one and all:
&lt; for <
&gt; for >
&amp; for &
&quot; for "
The apostrophe (') is an interesting exception. Its character entity reference (&apos;) is recognized by all markup languages (as it's part of the XML standard) with the sole exception of HTML 4.01 and earlier; HTML5 added it.
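When you need to show markup as text, a library can apply these universal references for you. This Python sketch assumes the standard html.escape function; the sample string is invented. Note how it sidesteps the apostrophe problem by emitting a hexadecimal NCR instead of the named reference.

```python
import html

# A made-up snippet containing all five troublesome characters.
source = '<a href="x">Tom & Jerry\'s</a>'

# html.escape() replaces &, <, >, and (by default) both kinds of quote.
escaped = html.escape(source)
print(escaped)

# Decoding the escaped text gives the original string back.
round_trip = html.unescape(escaped)
print(round_trip == source)
```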
Aside from the four universal CERs, XML-derived languages often have their own. Knowing what CERs they have is important—if they aren't defined in the DTD, the browser will display an error referring to an unknown character entity reference.
Note: this will not occur with NCRs; those are unambiguous and independent of any DTD. The NCR &#9833; will always be the same character, no matter which markup language you use: ♩.