A common mistake made by newcomers to XML is to try to insert an HTML link in a text node like this:

<publisher><a href="http://www.apress.com/">Apress</a></publisher>

It won’t work. XML has no concept of what an <a> tag or href attribute means. Not only that, the <a> tag is treated as another element. “Apress” is no longer the text node of the <publisher> element. The <a> element is now a child of <publisher>, and “Apress” is the text node inside <a>.
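
The changed hierarchy is easy to confirm with any XML parser. As a minimal sketch, the following uses Python's standard xml.etree.ElementTree module (chosen here only for illustration; it is not part of the original example) to parse the snippet and inspect the resulting tree:

```python
import xml.etree.ElementTree as ET

# The snippet from the text, with an HTML link nested inside <publisher>
xml_source = '<publisher><a href="http://www.apress.com/">Apress</a></publisher>'
publisher = ET.fromstring(xml_source)

# <publisher> no longer has text of its own; it has one child element instead
print(publisher.text)   # None
print(len(publisher))   # 1

# "Apress" now belongs to the <a> element
link = publisher[0]
print(link.tag, link.text, link.get("href"))
```

To get at the publisher's name, code that expected a simple text node now has to descend one extra level.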

Although this is still well-formed XML, extracting information from this type of hierarchy rapidly becomes an exercise in frustration. When embedding HTML code in a text node, you need either to convert the angle brackets of the tags into &lt; and &gt; or to wrap the code in a CDATA section (CDATA stands for “character data”). Using a CDATA section is much simpler.
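
For comparison, the first option (converting the angle brackets) might look like the following sketch. It assumes Python's standard xml.sax.saxutils.escape function does the conversion; the variable names are illustrative only:

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

html_link = '<a href="http://www.apress.com/">Apress</a>'

# escape() converts &, <, and > into &amp;, &lt;, and &gt;
escaped = escape(html_link)
print(escaped)

# Once escaped, the link is plain text inside <publisher>
xml_source = "<publisher>" + escaped + "</publisher>"
publisher = ET.fromstring(xml_source)

# The parser decodes the entities, so the text node holds the original markup
print(publisher.text)
print(len(publisher))   # 0 child elements
```

This works, but the escaped source is harder to read and every tag has to be converted.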

A CDATA section begins with the following opening tag:

<![CDATA[

To close a CDATA section, use the following tag:

]]>

To embed the <a> tag in a CDATA section, change the previous example like this:

<publisher><![CDATA[<a href="http://www.apress.com/">Apress</a>]]></publisher>

This restores the XML tree to the original hierarchy. Everything between the opening and closing tags of a CDATA section is treated as raw data. The <a> tag is no longer treated as a separate XML element, but as part of the <publisher> text node.
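
A quick check with a parser bears this out. The sketch below again assumes Python's standard xml.etree.ElementTree module, used purely for illustration:

```python
import xml.etree.ElementTree as ET

# The CDATA version of the example from the text
xml_source = (
    '<publisher><![CDATA[<a href="http://www.apress.com/">Apress</a>]]>'
    '</publisher>'
)
publisher = ET.fromstring(xml_source)

# The whole link is now the text of <publisher>, and there are no child elements
print(publisher.text)
print(len(publisher))   # 0
```

The parser hands back the link as an ordinary string, with the CDATA markers themselves removed.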

You can put almost anything inside a CDATA section; the only thing that cannot appear inside one is the sequence of characters that forms the closing tag (]]>). An opening angle bracket is not treated as the start of a tag, and an ampersand is not regarded as the beginning of an entity or character reference.
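
If the content might contain ]]> (script code such as rows[index[0]]> 1 is one way it can creep in), a common workaround, not covered in the discussion above, is to end the CDATA section partway through the sequence and immediately open a new one. The following is a minimal sketch in Python; wrap_in_cdata is a hypothetical helper name, not a library function:

```python
import xml.etree.ElementTree as ET

def wrap_in_cdata(text):
    """Hypothetical helper: wrap text in CDATA, splitting any ']]>' it contains."""
    # Close the section after ']]' and reopen a new one before '>'
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"

# Contains <, &&, and the forbidden ]]> sequence
script = "if (a < b && rows[index[0]]> 1) { run(); }"

xml_source = "<script>" + wrap_in_cdata(script) + "</script>"
element = ET.fromstring(xml_source)

# The parser joins the adjacent CDATA sections back into one text node
print(element.text == script)
```

The angle bracket and the ampersands pass through untouched, and the split is invisible to whatever reads the text node.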

This is particularly important when you want to include JavaScript in an XML text node, because converting the less-than (<) and greater-than (>) operators to &lt; and &gt; breaks the code. With a CDATA section, they are safe.

After that whirlwind tour of XML, let’s take a look at how to extract information from an XML document with SimpleXML.