The R Primer

Read data from an XML file

You want to import a dataset stored in the XML file format by manually coding how to extract the relevant information.

Solution: XML (eXtensible Markup Language) was designed to transport and store data, and it has seen widespread use for interchanging data over the Internet.

An XML file consists of a series of elements which form a document tree. The tree starts at the root and branches down to the lowest-level elements. An XML document must contain a root node (or element) which is "the parent" of all other nodes, and every node can have its own sub-nodes ("child elements").

The XML package provides numerous tools for parsing and generating XML in R. Since XML is such a flexible format, the XML package primarily consists of functions that must be combined to parse and extract information from a specific type of XML structure.

The xmlTreeParse function is the workhorse for importing general XML documents. xmlTreeParse parses an XML file and stores the tree in an R structure. We subsequently traverse the tree and extract data from the relevant nodes. xmlTreeParse takes the file name or location of the XML file as input, and it returns an R XML object holding the parsed document. The useInternalNodes option can be set to TRUE to increase parsing speed.
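As a minimal sketch of this step, the following writes a small made-up XML document to a temporary file and parses it with xmlTreeParse. The people/person/name/age structure and all variable names are assumptions chosen for illustration; they are not part of the original tip.

```r
library(XML)

# Hypothetical example data; the element names are made up for illustration
xml_text <- paste0(
  '<?xml version="1.0"?>',
  '<people>',
  '<person id="1"><name>Ann</name><age>34</age></person>',
  '<person id="2"><name>Bob</name><age>41</age></person>',
  '</people>'
)

# Write the document to a temporary file so that there is a file name to parse
xml_file <- tempfile(fileext = ".xml")
writeLines(xml_text, xml_file)

# Parse the file; useInternalNodes = TRUE keeps the tree as internal (C-level) nodes
doc <- xmlTreeParse(xml_file, useInternalNodes = TRUE)

# The result is an XML document object, not yet a data frame
class(doc)
```

If the XML is already held in a character string, asText = TRUE can be passed to xmlTreeParse so that the first argument is treated as XML content instead of a file name.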

First, xmlRoot should be called to get a pointer to the top-level node, the parent of all other nodes in the XML tree. The skip option can be set to FALSE to prevent R from skipping over document type definitions in the XML file if those are present. The XML tree behaves like a recursive list, and the individual nodes in the tree are accessed with named or numbered indices using the [[ ]] operator. The tree can be traversed with the proper indices, and for each node we can get its parent and its list of child nodes using the xmlParent and xmlChildren functions, respectively.
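The traversal described above could look roughly as follows. This is a sketch on a made-up document (the people/person/name/age elements and the variable names are assumptions for illustration), parsed from a string with asText = TRUE and with useInternalNodes = TRUE so that xmlParent can be used on the nodes.

```r
library(XML)

# Hypothetical example data; the element names are made up for illustration
xml_text <- paste0(
  '<people>',
  '<person id="1"><name>Ann</name><age>34</age></person>',
  '<person id="2"><name>Bob</name><age>41</age></person>',
  '</people>'
)
doc <- xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE)

# Pointer to the top-level <people> node
root <- xmlRoot(doc)

# Numbered index: the first child of the root, i.e. the first <person>
first <- root[[1]]

# Named indices: the first <person> child, then its <name> child
first_name <- root[["person"]][["name"]]

# List of the child nodes of the first <person> (here <name> and <age>)
kids <- xmlChildren(first)

# Step back up from <name> to the <person> node that contains it
parent <- xmlParent(first_name)

xmlName(root)    # name of the root node
xmlSize(root)    # number of children of the root
xmlName(parent)  # name of the parent of <name>
```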

Information can be extracted from a node using one of the xmlName, xmlValue, xmlGetAttr and xmlAttrs functions, which return the node name, node contents, a named attribute and all attributes, respectively.
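A short sketch of these four extraction functions, again on a made-up document (the element names, the id attribute and the variable names are assumptions for illustration). The last step shows one possible way to collect the extracted values into a data frame; it is not the only way to do so.

```r
library(XML)

# Hypothetical example data; the element and attribute names are made up
xml_text <- paste0(
  '<people>',
  '<person id="1"><name>Ann</name><age>34</age></person>',
  '<person id="2"><name>Bob</name><age>41</age></person>',
  '</people>'
)
doc <- xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE)
root <- xmlRoot(doc)
person <- root[[1]]

xmlName(person)              # node name
xmlValue(person[["name"]])   # text content of the <name> child
xmlGetAttr(person, "id")     # a single named attribute (as a character string)
xmlAttrs(person)             # all attributes as a named character vector

# Apply the same extraction to every <person> node to build a data frame;
# unname() drops the node names that sapply carries over from the list
persons <- xmlChildren(root)
people <- data.frame(
  id   = unname(sapply(persons, xmlGetAttr, "id")),
  name = unname(sapply(persons, function(p) xmlValue(p[["name"]]))),
  age  = as.numeric(unname(sapply(persons, function(p) xmlValue(p[["age"]])))),
  stringsAsFactors = FALSE
)
people
```

Note that xmlValue and xmlGetAttr return character strings, so numeric fields such as age need an explicit conversion.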

See rule 1.3 in The R Primer for a worked example which also shows the use of XPath.
