Learning to parse XML is often thought of as a complex undertaking, but it doesn’t have to be. XML is highly structured, so it is relatively predictable. There are also a number of tools that can help make this work manageable.

One of my favorite XML tools is XMLStarlet, an XML toolkit for your terminal. With XMLStarlet, you can validate, parse, edit, format, and transform XML data. XMLStarlet is a relatively minimal command, but XML browsing is full of potential, so this article demonstrates how to use it to query XML data.

Install XMLStarlet

XMLStarlet is available by default on CentOS, Fedora, and many other modern Linux distributions, so open a terminal and type xmlstarlet to access it. If XMLStarlet is not already installed, your operating system offers to install it for you.

Alternatively, you can install the xmlstarlet command from your package manager.

$ sudo dnf install xmlstarlet

On macOS, use MacPorts or Homebrew. On Windows, use Chocolatey.

If all else fails, you can build and install it manually from the source code on SourceForge.
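
Whichever route you take, it is worth confirming that the command is on your path. The following is a minimal sketch, not an official procedure. It assumes the binary is called xmlstarlet (as in distribution packages) or xml (the name an upstream source build may use); the xs variable is just a helper name for this example, and on some systems an unrelated program could also be called xml.

```shell
# Look for the command under either name: distribution packages usually
# install "xmlstarlet", while an upstream source build may install "xml".
xs=""
for candidate in xmlstarlet xml; do
  if command -v "$candidate" >/dev/null 2>&1; then
    xs="$candidate"
    break
  fi
done

if [ -n "$xs" ]; then
  # Print the installed version to confirm the command runs.
  "$xs" --version
else
  echo "XMLStarlet not found; install it with your package manager" >&2
fi
```

If the loop finds xml rather than xmlstarlet, substitute that name in the commands throughout this article.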

Parse XML with XMLStarlet

A number of tools are designed to help parse and transform XML data, including software libraries that let you write your own parser and complex commands like fop and xsltproc. However, sometimes you don’t need to process XML data; you just need a convenient way to extract important data from it, update it, or validate it. For spontaneous XML interactions, I use XMLStarlet, a classic “Swiss Army knife” application that performs the most common XML tasks. You can see what it provides by running the command with the --help option.

$ xmlstarlet --help
Usage: xmlstarlet [<options>] <command> [<cmd-options>]
where <command> is one of:
  ed    (or edit)      - Edit/Update XML document(s)
  sel   (or select)    - Select data or query XML document(s) (XPATH, etc)
  tr    (or transform) - Transform XML document(s) using XSLT
  val   (or validate)  - Validate XML document(s) (well-formed/DTD/XSD/RelaxNG)
  fo    (or format)    - Format XML document(s)
  el    (or elements)  - Display element structure of XML document
  c14n  (or canonic)   - XML canonicalization
  ls    (or list)      - List directory as XML
[...]

You can add --help to the end of any of these subcommands for further help.

$ xmlstarlet sel --help
  -Q or --quiet             - do not write anything to standard output.
  -C or --comp              - display generated XSLT
  -R or --root              - print root element <xsl-select>
  -T or --text              - output is text (default is XML)
  -I or --indent            - indent output
[...]
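
The val subcommand in the listing above covers the "just validate it" case. Here is a minimal sketch of a well-formedness check, assuming the --well-formed and --quiet options shown by xmlstarlet val --help. The check_xml function name and the scratch files are made up for this example.

```shell
# Report whether a file is well-formed XML.
# --quiet suppresses the per-file report, so only the exit status is used.
check_xml() {
  if xmlstarlet val --well-formed --quiet "$1"; then
    echo "well-formed"
  else
    echo "not well-formed"
  fi
}

# Two scratch files: one valid, one with an element that is never closed.
good=$(mktemp)
bad=$(mktemp)
printf '<xml><os>linux</os></xml>\n' > "$good"
printf '<xml><os>linux</xml>\n' > "$bad"

check_xml "$good"
check_xml "$bad"

rm -f "$good" "$bad"
```

Relying on the exit status rather than on the printed report makes the check easy to reuse in scripts.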

Select data with sel

You can view the data in an XML file with the xmlstarlet select (sel for short) command. Here is a simple XML file, saved as myfile.xml.

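<?xml version="1.0" encoding="UTF-8"?>
<xml>
  <os>
   <linux>
    <distribution>
      <name>Fedora</name>
      <release>7</release>
      <codename>Moonshine</codename>
      <spin>
        <name>Live</name>
        <name>Fedora</name>
        <name>Everything</name>
      </spin>
    </distribution>
    <distribution>
      <name>Fedora Core</name>
      <release>6</release>
      <codename>Zod</codename>
      <spin></spin>
    </distribution>
   </linux>
  </os>
</xml>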

When looking for data in an XML file, your first task is to focus on the node you want to explore. If you know the path to the node, specify the full path with the --value-of option. The higher in the Document Object Model (DOM) tree you start, the more information you see.

$ xmlstarlet select --template \
--value-of /xml/os/linux/distribution \
--nl myfile.xml
      Fedora
      7
      Moonshine
        Live
        Fedora
        Everything
      Fedora Core
      6
      Zod

The --nl option stands for “new line” and adds a line break to the output, so that your terminal prompt starts on its own line after the results. I have removed some of the excess white space in the sample output.

Narrow your focus by descending further into the DOM tree.

$ xmlstarlet select --template \
--value-of /xml/os/linux/distribution/name \
--nl myfile.xml
Fedora
Fedora Core
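
If you don’t know which paths a document contains, the el subcommand from the help listing can print them for you. This is a minimal sketch that assumes a trimmed copy of the sample data in a scratch file; the element names are inferred from the paths used in this article, and the -u option (sorted, unique paths) is the one described by xmlstarlet el --help.

```shell
# Scratch copy of a trimmed version of the sample data.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<xml>
  <os>
    <linux>
      <distribution>
        <name>Fedora</name>
        <release>7</release>
        <codename>Moonshine</codename>
      </distribution>
      <distribution>
        <name>Fedora Core</name>
        <release>6</release>
        <codename>Zod</codename>
      </distribution>
    </linux>
  </os>
</xml>
EOF

# Print each distinct element path once.
paths=$(xmlstarlet el -u "$xml")
echo "$paths"

rm -f "$xml"
```

Each line of the listing, such as xml/os/linux/distribution/name, can be used as a --value-of path once you add a leading slash.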

Conditional selection

One of the most powerful tools for navigating and parsing XML is called XPath. It governs the syntax used in XML searches and lets you call functions from XML libraries. XMLStarlet understands XPath expressions, so you can use XPath functions to make your selection conditional. XPath is rich in functionality and well documented by the W3C, but I find Mozilla’s XPath documentation more concise.

You can use square brackets as a test function to compare the contents of an element to a value. Here is a test on the value of the name element, which returns only the release numbers associated with a particular match.

Imagine that the sample XML file contains every Fedora release, starting with 1. To see all the release numbers associated with the old name “Fedora Core” (the project dropped “Core” from the name starting with release 7):

$ xmlstarlet sel --template \
--value-of '/xml/os/linux/distribution[name = "Fedora Core"]/release' \
--nl myfile.xml
6
5
4
3
2
1

You can also view all the code names for those releases by changing the --value-of path to /xml/os/linux/distribution[name = "Fedora Core"]/codename.
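
Predicates are not limited to exact string matches. The sketch below tries a numeric comparison and the standard XPath contains() function against a trimmed copy of the sample data. The scratch file and the newer and core variable names are assumptions made for this example.

```shell
# Scratch copy of a trimmed version of the sample data.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<xml><os><linux>
  <distribution><name>Fedora</name><release>7</release><codename>Moonshine</codename></distribution>
  <distribution><name>Fedora Core</name><release>6</release><codename>Zod</codename></distribution>
</linux></os></xml>
EOF

# Numeric comparison: names of distributions whose release is greater than 6.
newer=$(xmlstarlet sel --template \
  --value-of '/xml/os/linux/distribution[release > 6]/name' \
  --nl "$xml")
echo "$newer"

# String function: code names of distributions whose name contains "Core".
core=$(xmlstarlet sel --template \
  --value-of '/xml/os/linux/distribution[contains(name, "Core")]/codename' \
  --nl "$xml")
echo "$core"

rm -f "$xml"
```

With this data, the first query prints Fedora and the second prints Zod.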

Match paths and get values

One advantage of thinking of XML tags as nodes is that once you find a node, you can think of it as your current “directory” of data. It’s not really a directory, at least not in the filesystem sense, but it is a collection of data that you can query. To help you keep your destination separate from the data inside it, XMLStarlet uses the --match option for the node you want to match and the --value-of option for the values you want from it.

Suppose you know that the spin node contains several elements. That makes it your destination. Once you’re there, you can specify which element’s value you want with --value-of. To see all the elements, use a dot (.) to represent your current position.

$ xmlstarlet sel --template \
--match '/xml/os/linux/distribution/spin' \
--value-of '.' --nl myfile.xml
Live
Fedora
Everything

As when browsing the DOM, you can use XPath expressions to limit the range of data returned. In this example, I use the last() function to retrieve only the last element in the spin node.

$ xmlstarlet select --template \
--match '/xml/os/linux/distribution/spin' \
--value-of '*[last()]' --nl myfile.xml
Everything

In this example, I use the position() function to select a specific element in the spin node.

$ xmlstarlet select --template \
--match '/xml/os/linux/distribution/spin' \
--value-of '*[position() = 2]' --nl myfile.xml
Fedora
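
XPath functions can also return a single value rather than a node. As a hedged sketch, the standard count() function reports how many nodes a path matches, and last() works on any set of sibling nodes, not just the children of spin. The scratch file and the total and final variable names are assumptions made for this example.

```shell
# Scratch copy of a trimmed version of the sample data.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<xml><os><linux>
  <distribution><name>Fedora</name><release>7</release><codename>Moonshine</codename></distribution>
  <distribution><name>Fedora Core</name><release>6</release><codename>Zod</codename></distribution>
</linux></os></xml>
EOF

# count() returns the number of nodes matched by the path.
total=$(xmlstarlet sel --template \
  --value-of 'count(/xml/os/linux/distribution)' \
  --nl "$xml")
echo "$total"

# last() selects the final distribution node in document order.
final=$(xmlstarlet sel --template \
  --value-of '/xml/os/linux/distribution[last()]/name' \
  --nl "$xml")
echo "$final"

rm -f "$xml"
```

With this data, the first query prints 2 and the second prints Fedora Core.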

The --match and --value-of options can overlap, so it’s up to you to decide how to use them together. With the example XML, these two expressions do the same thing.

$ xmlstarlet select --template \
--match '/xml/os/linux/distribution/spin' \
--value-of '.' \
--nl myfile.xml
Live
Fedora
Everything
$ xmlstarlet select --template \
--match '/xml/os/linux/distribution' \
--value-of 'spin' \
--nl myfile.xml
Live
Fedora
Everything
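
Because --match sets the current node, you can follow it with more than one --value-of to pull several values from each match. The sketch below is one possible way to build a small report. It assumes the --output option (which prints a literal string) described by xmlstarlet sel --help, and it runs against a trimmed copy of the sample data in a scratch file; the report variable name is made up for this example.

```shell
# Scratch copy of a trimmed version of the sample data.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<xml><os><linux>
  <distribution><name>Fedora</name><release>7</release><codename>Moonshine</codename></distribution>
  <distribution><name>Fedora Core</name><release>6</release><codename>Zod</codename></distribution>
</linux></os></xml>
EOF

# For every distribution node, print its name, a literal separator,
# and its release, followed by a line break.
report=$(xmlstarlet sel --text --template \
  --match '/xml/os/linux/distribution' \
  --value-of 'name' --output ': release ' --value-of 'release' \
  --nl "$xml")
echo "$report"

rm -f "$xml"
```

With this data, the report has one line per distribution: "Fedora: release 7" and "Fedora Core: release 6". Here --nl sits inside the matched template, so it is emitted once per match rather than once at the end.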

Getting comfortable with XML

XML can sometimes seem overly verbose and cumbersome, but the tools built to interact with it never cease to amaze me. If you want to make good use of XML, XMLStarlet might be a good place to start. The next time you are about to open an XML file to look at structured data, try XMLStarlet and see if you can query the data instead. The more familiar you are with XML, the better it will serve you as a powerful and flexible data format.