XML Frequently Asked Questions

What is an XML declaration?

An XML document should begin with an XML declaration, which forms part of the prolog. The role of the XML declaration is to:

  1. Declare that the document is an XML document
  2. State the version of XML used
  3. State the character encoding used
  4. State whether the document depends on external markup declarations, such as an external DTD, for accurate processing.

The XML declaration is part of the prolog, the part of an XML document that precedes the root element and tells computers how to interpret the document. In XML 1.0 the declaration is not mandatory, but it is recommended that you always use it (XML 1.1 documents must include it). If it is included, it must be the very first thing in the document: no other content or white space may precede it.

The following is an example of the XML declaration:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

As we can see, it starts with an opening angle bracket and a question mark, followed by the lowercase letters xml, then the version, an optional encoding declaration and an optional standalone declaration, and it closes with a question mark followed by a closing angle bracket. The letters xml at the start declare the document to be an XML document.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Next is the version number:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

The version number is mandatory within the declaration. As we can see, the version is written as the word version, followed by an equals sign, followed by the version number enclosed in quotation marks.

Currently there are two versions of XML. Version 1.0 was initially defined in 1998.  It has undergone minor revisions since then, without being given a new version number. It is currently in its fifth edition which was published on November 26, 2008. It is widely implemented and still recommended for general use.

The second version, XML 1.1, was initially published on February 4, 2004. It is currently in its second edition, which was published on August 16, 2006. It contains features that are intended to make some aspects of XML easier to use, including a number of features that are not found in XML 1.0. It is not very widely implemented and is recommended only for those who need its unique features.

The next part of the XML declaration is the encoding declaration:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

This is optional. If used, the encoding declaration must appear immediately after the version information in the XML declaration, and its value must name an existing character encoding. As we can see, the character encoding is declared in the same way as the version: the word encoding, followed by an equals sign, followed by the name of the character encoding used (in this case UTF-8) enclosed in quotation marks.

Lastly, there is the standalone declaration:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Like the encoding declaration, the standalone declaration is optional. If used, it must appear last in the XML declaration. The standalone declaration states whether the XML document stands alone or whether it relies on declarations in an external DTD for accurate processing. Briefly, a DTD (Document Type Definition) declares the elements, attributes and entities that a document may use.

As we can see, the standalone declaration is written the same way as the version and encoding declarations. It can take the value “yes” or “no”. The value yes means the XML document is standalone and does not require an external DTD to process it; likewise, the value no tells the XML processor that it may need to refer to an external DTD to process the document accurately.

As already mentioned, the XML declaration is optional. If it is not included in the XML document, an XML processor assumes XML version 1.0 and, unless a byte order mark or other external information indicates otherwise, the UTF-8 character encoding.

A summary of the rules for the XML declaration is as follows:

  • The XML declaration is not mandatory
  • If it is used, the XML declaration must be the first thing in the document
  • The XML version must be included within the XML declaration
  • The encoding declaration is optional
  • If the encoding declaration is used, it must appear immediately after the XML version
  • The standalone declaration is optional
  • If the standalone declaration is used, it must appear last in the XML declaration
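
The rules summarised above can be tried out against a real parser. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree module; the helper name parses_ok and the sample documents are made up for illustration only.

```python
import xml.etree.ElementTree as ET


def parses_ok(document: bytes) -> bool:
    """Return True if the parser accepts the document as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Declaration at the very start of the document: accepted.
good = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><note/>'

# No declaration at all: accepted, because the declaration is optional in XML 1.0.
no_declaration = b'<note/>'

# A single space before the declaration: rejected.
leading_space = b' <?xml version="1.0" encoding="UTF-8" standalone="yes"?><note/>'

# Declaration without the mandatory version: rejected.
no_version = b'<?xml encoding="UTF-8"?><note/>'

# standalone placed before encoding instead of last: rejected.
wrong_order = b'<?xml version="1.0" standalone="yes" encoding="UTF-8"?><note/>'

for name, doc in [("good", good), ("no_declaration", no_declaration),
                  ("leading_space", leading_space), ("no_version", no_version),
                  ("wrong_order", wrong_order)]:
    print(name, parses_ok(doc))
```

Only the first two sample documents should be accepted; each of the others breaks one of the rules in the summary.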

What is character encoding?

Words and sentences in text are created from characters. For the English language, these characters include 26 upper case letters, 26 lower case letters, punctuation marks and the other characters you would see on an English keyboard.

In computing, characters are stored using various character encodings, often chosen according to the language of the document. For example, Hebrew, Japanese and English text has traditionally been stored using different character encodings. If you use anything other than the most basic characters needed for English, people may not be able to read your text unless you say which character encoding you used. A character encoding defines a unique binary code for each different character used in an XML document.

It is important to clearly distinguish between the concepts of a character set versus a character encoding.

A character set, otherwise known as a repertoire, is the set of characters one might use for a particular purpose: for example, all the characters you would see every day in a document, a book or an email, or those a Russian child learns at school. In itself it has nothing to do with computers.

To store characters in a computer, each character in the set is assigned a particular number, called a code point; the result is called a coded character set. These code points are represented in the computer by one or more bytes. The character encoding is the key that maps code points to bytes in computer memory, and maps those bytes back to code points.
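
The distinction between a code point and its bytes can be seen in a few lines of code. This is a small sketch in Python, chosen here only for illustration; the sample character is an arbitrary choice.

```python
# One character from the repertoire: LATIN SMALL LETTER E WITH ACUTE.
character = "\u00e9"

# The coded character set (Unicode) assigns it a number, the code point.
code_point = ord(character)

# The character encoding (UTF-8 here) maps that code point to bytes...
encoded = character.encode("utf-8")

# ...and maps the bytes back to the same code point.
decoded = encoded.decode("utf-8")

print(f"U+{code_point:04X}")   # U+00E9
print(encoded)                 # b'\xc3\xa9'
print(decoded == character)    # True
```

The code point is a single number, while the encoding decides how many bytes are used to store it (two bytes in this case).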

One such character encoding is the American Standard Code for Information Interchange, abbreviated to ASCII and pronounced ass-kee. ASCII codes represent text in computers, communications equipment, and other devices that use text. Most modern character-encoding schemes are based on ASCII, though they support many additional characters. ASCII was the most common character encoding on the World Wide Web until December 2007, when it was surpassed by Unicode UTF-8, which includes ASCII as a subset.

XML’s character set is Unicode. Unicode is the standard for digital representation of the characters used in writing all of the world’s languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet.  XML allows the use of any of the Unicode-defined encodings, and any other encodings whose characters also appear in Unicode.

The encoding forms defined by Unicode are called UTF-8, UTF-16 and UTF-32; UTF-8 is the most commonly used. UTF stands for Unicode (or Universal Character Set) Transformation Format.

Some coded character sets, such as ISO/IEC 10646 and Unicode, include almost all characters of the world’s languages. In an XML document, the character encoding can be declared via the encoding declaration.
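
To make the difference between the three encoding forms concrete, the sketch below (Python, with arbitrarily chosen sample characters) counts the bytes each form needs for the same character, and checks that ASCII text is byte-for-byte identical in UTF-8. The function name encoded_lengths is illustrative only.

```python
# The "-be" (big-endian) codec names are used so that Python does not
# prepend a byte order mark, which would inflate the counts.
ENCODING_FORMS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_lengths(character: str) -> dict:
    """Return the number of bytes each encoding form needs for one character."""
    return {form: len(character.encode(form)) for form in ENCODING_FORMS}


latin_a = "A"              # U+0041, also an ASCII character
hebrew_alef = "\u05d0"     # U+05D0
japanese_kanji = "\u65e5"  # U+65E5

print(encoded_lengths(latin_a))         # {'utf-8': 1, 'utf-16-be': 2, 'utf-32-be': 4}
print(encoded_lengths(hebrew_alef))     # {'utf-8': 2, 'utf-16-be': 2, 'utf-32-be': 4}
print(encoded_lengths(japanese_kanji))  # {'utf-8': 3, 'utf-16-be': 2, 'utf-32-be': 4}

# ASCII is a subset of UTF-8: ASCII text produces identical bytes in both.
ascii_is_subset = "XML".encode("ascii") == "XML".encode("utf-8")
print(ascii_is_subset)  # True
```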
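
The effect of the encoding declaration can be observed with a parser. The following sketch assumes Python's standard xml.etree.ElementTree module and a made-up one-element document stored as ISO-8859-1 bytes; it is an illustration, not a recommendation to use that encoding.

```python
import xml.etree.ElementTree as ET

# The same text content, stored as ISO-8859-1 bytes (e-acute is the single byte 0xE9).
body = "<word>caf\u00e9</word>".encode("iso-8859-1")

# With an encoding declaration, the parser knows how to turn bytes into characters.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
word = ET.fromstring(declared)
print(word.text == "caf\u00e9")  # True

# Without it, the parser assumes UTF-8, and the byte 0xE9 is not valid there.
try:
    ET.fromstring(body)
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
print(undeclared_accepted)  # False
```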

What is an element?

Before we can put an XML document together, we must first understand how an XML document is structured.

So, what do we mean by structure?

Most documents that you encounter, for example books and magazines, are broken down into components such as chapters and articles. These can be broken down into smaller components, for example titles, sections, paragraphs, figures, tables and so forth. These in turn can be broken down into still smaller components, such as sentences and individual words.

Every document can be viewed this way, in that a complete document is the sum of its individual components. In XML, these individual components are called elements. An XML element is everything from and including the element’s start tag, up to and including the element’s end tag. Each element represents some logical component of a document.

The rules for elements are as follows. An element can contain:

  • text
  • attributes (See FAQ: What is an attribute?)
  • other elements
  • or a mix of the above
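
As an illustration of the list above, the sketch below parses a single element that has an attribute, text and a child element all at once. It uses Python's standard xml.etree.ElementTree module; the element and attribute names (message, priority, name) are hypothetical.

```python
import xml.etree.ElementTree as ET

# One element with an attribute, some text, and a nested element (mixed content).
document = '<message priority="high">Call <name>Tom</name> today</message>'
message = ET.fromstring(document)

attributes = message.attrib                    # attributes from the start tag
leading_text = message.text                    # text before the first child element
child_tags = [child.tag for child in message]  # nested elements
trailing_text = message[0].tail                # text that follows the child element

print(attributes)     # {'priority': 'high'}
print(leading_text)   # 'Call '
print(child_tags)     # ['name']
print(trailing_text)  # ' today'
```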

The root element

The first element in the document is called the root element, also known as the document element. All other elements in the document are nested within the root element. Every XML document has exactly one root element, and the root element cannot appear anywhere else in the document.
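
The single-root rule is enforced by XML parsers. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and a hypothetical helper named is_well_formed:

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# One root element that contains everything else: accepted.
one_root = "<note><to>Tom</to><from>Marie</from></note>"

# Two elements at the top level, so there is no single root: rejected.
two_roots = "<to>Tom</to><from>Marie</from>"

print(is_well_formed(one_root))   # True
print(is_well_formed(two_roots))  # False
```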

Nested elements are elements that are contained within other elements.

An example of a root element and its associated nested elements is shown here:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<note>
<to>Tom</to>
<from>Marie</from>
<heading>Reminder</heading>
<body>Don’t forget our meeting today at 2pm</body>
</note>

In this example, <note> is the root element of the document. As we can see, the elements <to>, <from>, <heading> and <body> are all contained or nested within the root element <note>.
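
The same relationship can be inspected programmatically. The sketch below, which assumes Python's standard xml.etree.ElementTree module, parses a well-formed copy of the note document and lists the elements nested directly inside the root.

```python
import xml.etree.ElementTree as ET

note_xml = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
    "<note>\n"
    "<to>Tom</to>\n"
    "<from>Marie</from>\n"
    "<heading>Reminder</heading>\n"
    "<body>Don't forget our meeting today at 2pm</body>\n"
    "</note>\n"
).encode("utf-8")

root = ET.fromstring(note_xml)

# Iterating over an element yields the elements nested directly inside it.
nested_tags = [element.tag for element in root]

print(root.tag)     # note
print(nested_tags)  # ['to', 'from', 'heading', 'body']
```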

An element that is contained or nested within another element is called a child element and the element containing the child element is known as the parent element.

All elements within the root element must nest properly.  This means that each child element must close before its parent closes.
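
A parser rejects documents whose elements overlap instead of nesting. This is a minimal sketch assuming Python's standard xml.etree.ElementTree module; the helper name nests_properly is illustrative.

```python
import xml.etree.ElementTree as ET


def nests_properly(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# The child <to> closes before its parent <note> closes: accepted.
proper = "<note><to>Tom</to></note>"

# The parent <note> closes while its child <to> is still open: rejected.
overlapping = "<note><to>Tom</note></to>"

print(nests_properly(proper))       # True
print(nests_properly(overlapping))  # False
```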

XML Tree Structure

XML elements in a document form a tree structure. The tree starts at the root element and branches out to the lowest-level child elements (sub-elements). Consider, for example, a document that records the employees of a company.

In this case, Company is the root element, Employee is a child element of the root element, and likewise FirstName, LastName, ContactNo, Email and Address are child elements of the Employee element.

FirstName, LastName, ContactNo, Email and Address share the same parent, Employee, and sit at the same level of the tree; they are therefore known as sibling elements.

The City, State and Zip elements are also sibling elements, and are child elements of their parent element, Address.

The following XML document demonstrates how this element tree structure should appear.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Company>
  <Employee>
    <FirstName>John</FirstName>
    <LastName>Smith</LastName>
    <ContactNo>1234567</ContactNo>
    <Email>JohnSmith@ABC.com</Email>
    <Address>
      <City>Sunnyvale</City>
      <State>California</State>
      <Zip>94089</Zip>
    </Address>
  </Employee>
</Company>
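
To see the tree that a parser builds from such a document, the sketch below walks the elements recursively and indents each one by its depth. It assumes Python's standard xml.etree.ElementTree module and a well-formed copy of the Company document; the function name outline is illustrative.

```python
import xml.etree.ElementTree as ET

COMPANY_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Company>
  <Employee>
    <FirstName>John</FirstName>
    <LastName>Smith</LastName>
    <ContactNo>1234567</ContactNo>
    <Email>JohnSmith@ABC.com</Email>
    <Address>
      <City>Sunnyvale</City>
      <State>California</State>
      <Zip>94089</Zip>
    </Address>
  </Employee>
</Company>"""


def outline(element, depth=0):
    """Return one line per element, indented two spaces per level of nesting."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(COMPANY_XML)
tree_lines = outline(root)
print("\n".join(tree_lines))

# Siblings are the elements that share a parent, here the children of Address.
address = root.find("Employee/Address")
address_children = [child.tag for child in address]
print(address_children)  # ['City', 'State', 'Zip']
```

The printed outline mirrors the indentation of the document itself, with Company at the top and City, State and Zip at the deepest level.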