Parsing XML documents with CSS selectors

Fabien Potencier

Mar 31, 2010

HTML and XML documents are the bread and butter of web developers. On a day to day basis, you probably create a lot of HTML documents. And odds are you also need to parse some from time to time: because you consume a web service and want to extract some information, or because you want to gather data from scraped web pages, or just because you want to write functional tests for a website. Retrieving the document is quite easy, but how do you navigate through it to extract the information you need?

PHP already comes with a lot of useful tools for parsing XML documents: SimpleXML, DOM, and XMLReader, just to name a few. But as soon as you need to extract information deeply embedded in the document structure, things are not as easy as they should be. Of course, XPath is your best friend when you need to select elements, but the learning curve is really steep. Even expressions that should be easy can be complex. As an example, here is the XPath expression to retrieve all h1 tags that have a foo class:

h1[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]

The XPath expression is complex because a tag can have several classes:

<h1 class="foo">Foo</h1>
<h1 class="foo bar">Foo</h1>
<h1 class="foobar bar">Foo</h1>

The expression should match the first two h1 tags, but not the third one.

Of course, everybody knows that doing the same with a CSS selector is a piece of cake:

h1.foo

For Symfony 2 functional tests, I wanted a way to leverage the power and expressiveness of CSS selectors with the tools we already have in PHP. The first idea that came to my mind was to convert a CSS selector to its XPath equivalent. But is it possible? The answer is a surrounding ‘YES’.

As John Resig wrote in a blog post some time ago about the same topic: “The biggest thing to realize is that CSS Selectors are, typically, very short - but woefully underpowered, when compared to XPath.”

Writing a tokenizer, a parser, and a compiler able to convert CSS selectors to XPath is no trivial task. So, instead of reinventing the wheel, I looked for some existing libraries. I didn’t look too much before stumbling upon lxml, a Python library. The lxml.cssselect module of lxml does exactly this. So, I took the time to translate the Python code to PHP, added some unit tests, and voilĂ , the Symfony 2 CSS Selector component was born.

Note

symfony 1 has a sfDomCssSelector class, but it does not convert the CSS selector to XPath. It does the job nicely but it is limited to very simple CSS selectors and it cannot easily be used with standard XML tools.

The Symfony 2 CSS Selector component does only one thing, and it tries to do it well: converting CSS selectors to XPath expressions. Using it is dead simple:

use Symfony\Components\CssSelector\Parser;

$xpath = Parser::cssToXpath('h1.foo');

The $xpath variable should now contain h1[contains(concat(' ', normalize-space(@class), ' '), ' foo ')].

Let’s take an example to see how you can use it. Let’s say you want to retrieve all post titles and URLs for this blog (the information is available at https://fabien.potencier.org/articles).

use Symfony\Components\CssSelector\Parser;

$document = new \DOMDocument();
$document->loadHTMLFile('https://fabien.potencier.org/articles');

$xpath = new \DOMXPath($document);
foreach ($xpath->query(Parser::cssToXpath('div.item > h4 > a')) as $node)
{
  printf("%s (%s)\n", $node->nodeValue, $node->getAttribute('href'));
}

The code is straightforward and instead of using an XPath expression, we let the Parser class convert the CSS Selector to XPath for us:

$xpath->query(Parser::cssToXpath('div.item > h4 > a'))

Be warned that if you work with XML documents, you need to register the namespaces you use. Let’s use SimpleXMLElement, which only understand well-formed XML documents:

$document = new \SimpleXMLElement('https://fabien.potencier.org/articles', 0, true);
$document->registerXPathNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
foreach ($document->xpath(Parser::cssToXpath('xhtml|div.item > xhtml|h4 > xhtml|a')) as $node)
{
  printf("%s (%s)\n", $node, $node['href']);
}

As you can notice, CSS selectors support namespaces (xhtml|div).

This new CSS Selector component will be used in Symfony 2 for functional tests (but as you will see in the coming weeks, in a very different way than what we had in symfony 1).

The code is unit-tested and has a good code coverage, so feel free to use it (code is on Github: http://github.com/fabpot/symfony under the Symfony\Components\CssSelector namespace) and send me some feedback.

Stay tuned!