Efficient XPath for Go

This week I found some time to work on another small spin-off from the juju project at Canonical, and I’m happy to make it openly available today: the xmlpath package, which implements an efficient and strict subset of the XPath specification for the Go language.

This new package will be used in an upcoming (and long due) revision of the goamz package API, which is currently limited by the fact that once the XML result returned by Amazon is unmarshalled into a static structure, any other data that the package wasn’t prepared to deal with becomes hard to access by clients. This problem is being solved by parsing the tree into an intermediary form which can then have XPath expressions conveniently and efficiently applied to it.

Path expressions currently supported by the package are in the following format, with all components being optional:

/axis-name::node-test[predicate]/axis-name::node-test[predicate]

Compatibility with the XPath specification goes to the following extent:

  • All axes are supported (“child”, “following-sibling”, etc)
  • All abbreviated forms are supported (“.”, “//”, etc)
  • All node types except for namespace are supported
  • Predicates are restricted to [N], [path], and [path=literal] forms
  • Only a single predicate is supported per path step
  • Richer expressions and namespaces are not supported

For example, consider this simple document:

<library>
  <!-- Great book. -->
  <book id="b0836217462">
    <isbn>0836217462</isbn>
    <title>Being a Dog Is a Full-Time Job</title>
    <author id="CMS">
      <name>Charles M Schulz</name>
      <born>1922-11-26</born>
    </author>
    <character id="PP">
      <name>Peppermint Patty</name>
      <born>1966-08-22</born>
    </character>
    <character id="Snoopy">
      <name>Snoopy</name>
      <born>1950-10-04</born>
    </character>
  </book>
</library>

The following expressions can be applied to it, with the indicated result as first match:

/library/book/isbn “0836217462”
/library/*/isbn “0836217462”
/library/book/../book/./isbn “0836217462”
/library/book/character[2]/name “Snoopy”
/library/book/character[born='1950-10-04']/name “Snoopy”
/library/book//node()[@id='PP']/name “Peppermint Patty”
//*[author/@id='CMS']/name “Charles M Schulz”
/library/book/preceding::comment() ” Great book. “

The API implemented allows compiled paths to be held and re-applied any number of times, concurrently or not. For example:

path := xmlpath.MustCompile("/library/book/isbn")
root, err := xmlpath.Parse(file)
if err != nil {
        log.Fatal(err)
}
if value, ok := path.String(root); ok {
        fmt.Println("Found:", value)
}

Result sets can also be optionally stepped over via an idiomatic iterator interface.

The performance of these operations is close to using the static unmarshaling currently implemented by Go’s encoding/xml package:

BenchmarkParse                 5000        613862 ns/op
BenchmarkSimplePathCompile     1000000     1983 ns/op
BenchmarkSimplePathString      1000000     1565 ns/op

As a reference, this is a similar encoding/xml operation, using a struct with a single nested field on the same document:

BenchmarkSimpleUnmarshal       5000        622519 ns/op

I’m hoping this will make our unavoidable XML interactions slightly less painful.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>