XML And HTML5

I have been calling XHTML an XML-derived language throughout the book, so it's long past time I explained XML and laid out its rules. I'm also going to talk about the latest update of HTML, HTML5 (which, as I mentioned earlier, isn't quite finished yet).

An Overview of XML

What does XML do?
As far as the browser is concerned, absolutely nothing. It does not impart any rendering information to the browser, nor does it let the browser know what the heck it's reading.
So what do you do with XML?
You use it as a basis for writing markup languages like XHTML; XML is a meta-language, a markup language used to create markup languages.

Rules Of XML

The basic rule of XML is that a document must be well-formed to work. If you break the rules of HTML, a validator might squawk, but your page will work. If you flout the rules of XML, the browser will squawk and your page won't work.

The list below shows what well-formed means.

Root elements: there can be only one.
Any markup outside of the root element (aside from the XML declaration, any XML processing instructions, comments, and the Doctype) will cause an error.
All elements must be explicitly opened and closed.
Start and end tags of non-empty elements are never optional; your start tags must have matching end tags and vice versa. Empty elements must be closed too—remember, you do this with a slash just before the >. (For example, an hr element would be written <hr />). The space before the slash isn't technically necessary, but it doesn't hurt.
All elements must be properly nested
I explained how to do that in The Basics of Markup.
All attributes must have values and quotation marks around those values
The shortcuts I showed in Coding Shortcuts And Other Things do not apply. The way I described as correct in Attributes is the only way that XML will accept them.
All attribute and element names are case-sensitive
And in XHTML, they're all lower-case.
Non-markup code must be placed in a CDATA section
A CDATA section begins with <![CDATA[ and ends with ]]>. Non-markup code (such as JavaScript or CSS) must be placed in one so the browser doesn't try to parse it as markup, which could get messy. For example:
  • < starts a tag in a markup language, but means less than in JavaScript.
  • > ends a tag in a markup language, means greater than in JavaScript, and specifies a child element in CSS.
  • & starts an entity reference or numeric character reference in a markup language, but && means and in JavaScript.
All ampersands must be encoded as character references, whether entity or numeric. All character entity references must be recognized by the markup language
Only five character entity references are recognized by XML itself: & (&amp;), < (&lt;), > (&gt;), " (&quot;), and ' (&apos;). All others are specific to their respective markup language, but XHTML recognizes all entity references associated with HTML.
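
To tie the rules above together, here is a minimal sketch of a well-formed XHTML page. The title, the paragraph text, and the one-line script are made up for illustration; the only point is to show what each rule looks like in practice.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- One root element: all other elements sit inside html -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Well-Formed Example</title>
<!-- Non-markup code goes in a CDATA section; the /* */ pairs hide the CDATA markers from JavaScript -->
<script type="text/javascript">/*<![CDATA[*/
if (1 < 2 && 3 > 2) { document.title = "Checked"; }
/*]]>*/</script>
</head>
<body>
<!-- Lower-case names, a quoted attribute value, an encoded ampersand, and proper nesting -->
<p class="intro">Salt &amp; pepper, <em>properly nested</em>.</p>
<!-- An empty element, closed with a slash -->
<hr />
</body>
</html>
```

Take away any one of these (drop the </p>, unquote the class value, or write a bare & in the paragraph) and an XML parser should refuse the whole page.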

These rules may seem to make things more rigid but in reality, they make things a lot more straightforward.

XML Processing Instructions

One thing I haven't talked about yet is the XML processing instruction. Processing instructions normally follow the XML declaration and precede the Doctype (if you're using one). The most common processing instruction links an XML document to a stylesheet of some kind. For example, instead of using the link element to link a stylesheet to an XHTML document, you can use an xml-stylesheet processing instruction.

This comes in really handy when you consider that XHTML is the only XML-derived language that has the link element (at least so far as I know). All other XML-derived languages use processing instructions.
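
As a rough sketch of what that looks like, the XHTML page below picks up its stylesheet through a processing instruction rather than a link element. The file name main.css and the page content are hypothetical. Keep in mind that browsers only act on the processing instruction when the page is served as XML (for example, as application/xhtml+xml); if the page is served as text/html, it gets ignored.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- The processing instruction follows the XML declaration and precedes the Doctype -->
<?xml-stylesheet type="text/css" href="main.css"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Styled By A Processing Instruction</title>
</head>
<body>
<p>No link element was needed to style this page.</p>
</body>
</html>
```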

An Example of XML

Another language created from XML is Scalable Vector Graphics, or SVG for short. I talked about it back in (X)HTML Objects, and now you get to see the code of an example. Note the XML processing instruction on the second line, which links the SVG document to a stylesheet.

A Diagram Created With SVG
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/css" href="./nesting.css"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="211" height="132" version="1.1" xmlns="http://www.w3.org/2000/svg">
<title>Chapter 2 Nesting Example 1</title>
<g class="element1">
<rect width="200" height="100" x="10" y="20" />
<rect width="106" height="20" x="15" y="10" class="text_back" />
<text x="16" y="25">&lt;element1&gt;</text>
<rect width="116" height="20" x="85" y="110" class="text_back" />
<text x="86" y="125">&lt;/element1&gt;</text>
</g>
<g class="element2">
<rect width="150" height="50" x="30" y="50" />
<rect width="106" height="20" x="35" y="40" class="text_back" />
<text x="36" y="55">&lt;element2&gt;</text>
<rect width="116" height="20" x="60" y="90" class="text_back" />
<text x="61" y="105">&lt;/element2&gt;</text>
</g>
</svg>

This creates the following diagram (yes, I used SVG to create every diagram in this book and screenshots for the rest of the graphics).

An SVG image, showing proper nesting

Right now, you may be thinking that this looks nothing like HTML, so I'm going to put you on familiar ground.

It's a markup language.
Tags begin with <, end with >, and / marks an end tag.
It has attributes.
And they are in the syntax you are familiar with. It just happens to use a lot of them.
It has a Doctype.
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
It has a root element.
In this case, the root element is svg.
It has nesting.
The rect and text elements are child elements of the g elements, the g elements and title element are sibling elements of each other, and the svg element is the ancestor of the lot. You've seen it all before.

The biggest differences between (X)HTML and SVG are vocabulary and purpose ((X)HTML is for structuring documents to be displayed on the World Wide Web; SVG is for creating graphics). Oh, and the fact that SVG is always an XML-based language and that it has its own CSS vocabulary, as seen below.

The Stylesheet Used By This SVG Document
svg{fill:#fff;}
rect{
fill:#fff;
fill-opacity:0;
stroke-width:2px;
}
rect.text_back{
fill:#fff;
fill-opacity:1;
stroke-width:0px;
}
.element1 rect{stroke:#f00;}
text{
font-family:verdana, sans-serif;
font-weight:bold;
font-size:15px;
}
.element1 text{fill:#f00;}
.element2 rect{stroke:#060;}
.element2 text{fill:#060;}

Please note that even though the properties and values for SVG-specific CSS are different, the syntax remains identical.

Combining XML Languages

As I mentioned earlier in Attributes, browsers rely on XML namespaces to know which markup language is which. This becomes doubly important because more than one XML language can be used in the same document; indeed, at least one type of document (eXtensible Stylesheet Language Transformations, or XSLT) requires at least three!

I'm not going to use that language for a demonstration as it's hairy hocus-pocus through and through, but I am going to show you a page using a combination of three languages so commonly used together they have their own Doctype and name: XHTML 1.1 plus MathML 2.0 plus SVG 1.1—and if you think the name's long, wait until you see the Doctype!

This page shows the two ways to keep the various languages distinct. The first way, used with the SVG, is to assign a prefix—an additional part of a tag or attribute name that associates it with a namespace other than the default. There are two prefixes that are predefined: xml: and xmlns:.

The prefix xmlns: assigns prefixes to namespaces in a fairly simple manner: you use xmlns: followed by the prefix you want as an attribute name, and the value of that attribute is the intended namespace. For example, in XHTML 1.1 + MathML 2.0 + SVG 1.1, all SVG elements get the prefix svg:. Therefore, the prefix would be set like this: xmlns:svg="http://www.w3.org/2000/svg". This attribute could be technically placed in the svg:svg element (which would be the containing element of an SVG image) or in any ancestor element thereof, but it's usually placed in the root element of the document (in this case, the html element); like any other attribute, a namespace expires when the element it's declared in ends, and a root element shouldn't end until the document does.
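
As a sketch of the less common placement just mentioned, the fragment below declares the svg: prefix on the svg:svg element itself instead of on the root element. It is only a fragment, meant to sit inside the body of an XHTML page, and the sizes are made up. It shows how far the declaration reaches as far as XML namespaces are concerned; whether a particular DTD accepts the attribute in that spot is a separate question.

```xml
<div>
<!-- The svg: prefix is declared here, so it is usable from this start tag to its matching end tag -->
<svg:svg xmlns:svg="http://www.w3.org/2000/svg" width="120" height="60" version="1.1">
<svg:rect width="100" height="40" x="10" y="10" />
</svg:svg>
<!-- An svg:rect placed here would be an error, because the prefix is no longer declared -->
</div>
```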

The other way (used by the MathML elements) is to place the namespace declaration in the element where the markup language changes (in this case, the math element or elements). MathML elements don't get a prefix because the DTD of this particular markup language says they don't. Instead, the containing element of each MathML segment gets its own xmlns attribute, setting the namespace to http://www.w3.org/1998/Math/MathML. Of course, when that math element ends, so also does the namespace.

Without further ado, here is a webpage in XHTML 1.1 plus MathML 2.0 plus SVG 1.1:

An XML Document Written In XHTML 1.1 plus MathML 2.0 plus SVG 1.1
XML Namespaces Are Highlighted
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN"
"http://www.w3.org/2002/04/xhtml-math-svg/xhtml-math-svg.dtd">
<html
xmlns="http://www.w3.org/1999/xhtml"
xmlns:svg="http://www.w3.org/2000/svg"
xml:lang="en-CA"
dir="ltr"
>
<!-- Notice that the start tag of the root element (<html>) defines both the default namespace AND the namespace for the svg: prefix. The other two prefixes in this document - xml: and xmlns: - are predefined in the XML standard itself. -->
<head>
<meta http-equiv="Content-Type" content="text/xml" />
<title>The Area Of A Rectangle</title>
</head>
<body>
<h1>The Area Of A Rectangle</h1>
<div>
<!-- SVG starts here. Note the use of the "svg:" prefix, which was defined in the root element. Furthermore, since the attributes here are ALSO part of SVG, they do NOT get prefixes of their own. Only if they were part of another XML standard (for example, the attributes xml:lang and xmlns:svg in the root element, part of the XML standard) would they get their own prefix. -->
<svg:svg width="300" height="150" version="1.1">
<svg:text x="85" y="20">200px</svg:text>
<svg:text x="210" y="55">50px</svg:text>
<svg:rect width="200" height="50" x="2" y="30" />
</svg:svg>
<!-- SVG Ends Here -->
</div>
<div>
<!-- MathML starts here. Note that the <math> element itself uses the xmlns attribute, which means that XML prefixes are not required. -->
<math xmlns="http://www.w3.org/1998/Math/MathML"><mrow>
<mn>200px</mn> <mo>×</mo>
<mn>50px</mn> <mo>=</mo>
<msup><mn>10,000px</mn><mn>2</mn></msup>
</mrow></math>
<!-- MathML Ends Here -->
</div>
</body>
</html>
A mixed XML Document, which uses XHTML, MathML, and SVG

It all works very nicely together, at least in most browsers. Sadly, Internet Explorer can't even begin to read it. Older versions would display the XML element tree; newer versions try to download processing information that's not included in this file, and so never display anything.

Combining XML And Validation

There are two very important things you should note about XHTML 1.1 + MathML 2.0 + SVG 1.1. First, it has its own Doctype and DTD. The Doctype and DTD are not used for the constituent languages, but for this particular combination as a whole.

The second thing you should know is that this combination is unusual in that it has a Doctype and DTD at all. XSLT (which, as I mentioned, requires at least three XML languages to be used) has no such set of rules; the only way its code can be confirmed as valid is if it actually works.

Namespaces In CSS

Since XML languages can be blended and some have elements and attributes that share a common name, CSS 3 introduced a way to tell the browser which element or attribute belongs to which language using namespaces.

As in a markup document, namespaces must be declared before they are used; you declare a namespace by starting out with @namespace, followed by the name you want, followed by the namespace identifier in quotes. In the example below, I declare the namespaces for XHTML, SVG, XLink and MathML. I've already demonstrated that XHTML, SVG, and MathML are used together in a single document. I should also mention that XHTML and SVG happen to share some common elements—for example, the a element. XLink is a means of creating hyperlinks; while XHTML has its own method of doing so, SVG needs XLink.

Declaring Namespaces In A Stylesheet
@namespace html "http://www.w3.org/1999/xhtml"; /* XHTML namespace */
@namespace svg "http://www.w3.org/2000/svg"; /* SVG namespace */
@namespace math "http://www.w3.org/1998/Math/MathML"; /* MathML namespace */
@namespace xlink "http://www.w3.org/1999/xlink"; /* XLink namespace */

The namespace names are separated from an element or attribute name by a vertical bar (|).

Using Namespaces In Selectors
html|a /* Refers to the XHTML <a> element */
svg|a /* Refers to the SVG <a> element */
*|a /* Refers to any <a> element */
|a /* Refers to an <a> element NOT bound to any namespace */
a /* Refers to an <a> element of the default namespace or any namespace if no default has been declared */
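[html|href] /* Refers to an href attribute in the XHTML namespace; the unprefixed href of an XHTML <a> element belongs to no namespace and is matched by [href] */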
[xlink|href] /* Refers to the XLink href attribute */
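
To make those selectors a little more concrete, here is a sketch of markup they could apply to: one XHTML link and one SVG link in the same document. The addresses and text are invented for illustration, and the xlink: prefix is declared on the root element in the same way as svg:.

```xml
<?xml version="1.0" encoding="utf-8"?>
<html
xmlns="http://www.w3.org/1999/xhtml"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
xml:lang="en"
>
<head>
<title>Two Kinds Of Links</title>
</head>
<body>
<!-- Matched by html|a and by *|a -->
<p><a href="http://www.example.com/">An XHTML link</a></p>
<div>
<svg:svg width="200" height="40" version="1.1">
<!-- Matched by svg|a and by *|a; its address is matched by [xlink|href] -->
<svg:a xlink:href="http://www.example.com/">
<svg:text x="10" y="25">An SVG link</svg:text>
</svg:a>
</svg:svg>
</div>
</body>
</html>
```

Note that the prefixes in the stylesheet and the prefixes in the document don't have to agree. What ties a selector to an element is the namespace identifier (the URL-like string), not the short name in front of it.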

HTML5

Perhaps in recognition that HTML was not going to be dislodged any time soon, the W3C reopened the HTML working group to create a new version of HTML.

This version has probably the shortest Doctype of any markup language. It reads:

The HTML5 Doctype
<!Doctype HTML>

The idea goes that the browser will see that it's HTML and automatically know what it's supposed to do with the markup language—not all that far-fetched a notion, when you think about it.

HTML5's status is still that of Editor's Draft (in other words, everyone is still bickering over everything, including pizza toppings), so it's useless to upgrade just yet. However, I will give you a list of some future proposed elements, taken straight from the HTML5 specification at http://dev.w3.org/html5/spec/Overview.html.

article
A section that forms an independent part of the page
aside
A section that is related to the document, but not quite part of it. For example, a sidebar.
audio
For audio files, such as mp3s.
canvas
This represents a bitmap canvas which can be used for graphs, graphics, etc.
dialog
Works like a definition list, but instead gives a list of names and quotes, like a script.
figure
Can be used to annotate illustrations, pictures, and so on.
footer
For by-lines, footnotes, and copyright notices.
header
Possibly contains an introduction to an article, or even heading elements.
hgroup
This is for a group of h1 to h6 elements.
mark
Highlights text for reference purposes.
meter
Used to show where a measurement falls between a minimum and a maximum value.
nav
For the part of a page with a navigation block (such as a list).
progress
This represents the completion progress of a task.
section
A section of a document.
source
Contains the address of media resources.
time
For dates and times.
video
For video content, such as mpegs.
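
Purely as a sketch of how a few of these proposed elements might fit together, and assuming they survive into the finished specification, here is a small page. It is written with the XML-style syntax used throughout this chapter, and the file names, date, and text are all invented.

```xml
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>A Sketch Of An HTML5 Page</title>
</head>
<body>
<!-- header: an introduction to the page, here holding the main heading -->
<header>
<h1>My Weblog</h1>
</header>
<!-- nav: the navigation block -->
<nav>
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="archive.html">Archive</a></li>
</ul>
</nav>
<!-- article: an independent part of the page -->
<article>
<h2>First Post</h2>
<p>Posted on <time datetime="2010-01-15">January 15, 2010</time>.</p>
<p>The post itself goes here.</p>
</article>
<!-- aside: related to the document, but not quite part of it -->
<aside>
<p>A sidebar goes here.</p>
</aside>
<!-- footer: by-lines and copyright notices -->
<footer>
<p>Copyright notice goes here.</p>
</footer>
</body>
</html>
```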

Of course, this list is very subject to change, and it could all be replaced with the single element element by the time the HTML group is done.