Parsing XML documents with CSS selectors

Fabien Potencier

March 31, 2010

HTML and XML documents are the bread and butter of web developers. On a day-to-day basis, you probably create a lot of HTML documents. And odds are you also need to parse some from time to time: because you consume a web service and want to extract some information, because you want to gather data from scraped web pages, or just because you want to write functional tests for a website. Retrieving the document is quite easy, but how do you navigate through it to extract the information you need?

PHP already comes with a lot of useful tools for parsing XML documents: SimpleXML, DOM, and XMLReader, just to name a few. But as soon as you need to extract information deeply embedded in the document structure, things are not as easy as they should be. Of course, XPath is your best friend when you need to select elements, but the learning curve is really steep. Even expressions that should be easy can be complex. As an example, here is the XPath expression to retrieve all h1 tags that have a foo class:

h1[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]
 

The XPath expression is complex because a tag can have several classes:

<h1 class="foo">Foo</h1>
<h1 class="foo bar">Foo</h1>
<h1 class="foobar bar">Foo</h1>
 

The expression should match the first two h1 tags, but not the third one.
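
To check that claim, here is a minimal sketch that runs the expression against the three tags with DOMDocument and DOMXPath. The wrapping div and the `//` prefix (so that the whole document is searched) are my additions for the example, and the naive expression is only there for comparison:

```php
<?php
// The three h1 tags from above, wrapped in a div for the example.
$html = <<<HTML
<div>
  <h1 class="foo">Foo</h1>
  <h1 class="foo bar">Foo</h1>
  <h1 class="foobar bar">Foo</h1>
</div>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// The expression from above, prefixed with // to search the whole document.
$expression = "//h1[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]";

$matched = array();
foreach ($xpath->query($expression) as $node) {
    $matched[] = $node->getAttribute('class');
}

// A naive substring test also matches "foobar bar", which is wrong.
$naive = $xpath->query("//h1[contains(@class, 'foo')]");

printf("correct expression: %d matches (%s)\n", count($matched), implode(', ', $matched));
printf("naive expression: %d matches\n", $naive->length);
```

The padded expression keeps only the `foo` and `foo bar` tags, while the naive one matches all three.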

Of course, everybody knows that doing the same with a CSS selector is a piece of cake:

h1.foo
 

For Symfony 2 functional tests, I wanted a way to leverage the power and expressiveness of CSS selectors with the tools we already have in PHP. The first idea that came to my mind was to convert a CSS selector to its XPath equivalent. But is it possible? The answer is a resounding 'YES'.

As John Resig wrote in a blog post some time ago about the same topic: "The biggest thing to realize is that CSS Selectors are, typically, very short - but woefully underpowered, when compared to XPath."

Writing a tokenizer, a parser, and a compiler able to convert CSS selectors to XPath is no trivial task. So, instead of reinventing the wheel, I looked for some existing libraries. I didn't have to look far before stumbling upon lxml, a Python library. The lxml.cssselect module of lxml does exactly this. So, I took the time to translate the Python code to PHP, added some unit tests, and voilà, the Symfony 2 CSS Selector component was born.

symfony 1 has a sfDomCssSelector class, but it does not convert the CSS selector to XPath. It does the job nicely but it is limited to very simple CSS selectors and it cannot easily be used with standard XML tools.

The Symfony 2 CSS Selector component does only one thing, and it tries to do it well: converting CSS selectors to XPath expressions. Using it is dead simple:

use Symfony\Components\CssSelector\Parser;
 
$xpath = Parser::cssToXpath('h1.foo');
 

The $xpath variable should now contain h1[contains(concat(' ', normalize-space(@class), ' '), ' foo ')].
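
To give a feel for the mapping, here is a toy sketch that handles only `tag` and `tag.class` selectors. The `toyCssToXpath` function is hypothetical and written for this illustration only; it is not how the component works (the component relies on a real tokenizer and parser), and it ignores everything else in the CSS specification:

```php
<?php
/**
 * Toy converter (hypothetical, for illustration only).
 * Supports "tag" and "tag.class1.class2" selectors, nothing else.
 */
function toyCssToXpath($selector)
{
    if (!preg_match('/^[a-zA-Z][a-zA-Z0-9]*(\.[a-zA-Z_][a-zA-Z0-9_-]*)*$/', $selector)) {
        throw new InvalidArgumentException(sprintf('Unsupported selector "%s".', $selector));
    }

    // "h1.foo.bar" becomes array("h1", "foo", "bar"): the tag, then its classes.
    $classes = explode('.', $selector);
    $xpath = array_shift($classes);

    // Each class becomes one predicate that tests for a whole, space-delimited word.
    foreach ($classes as $class) {
        $xpath .= sprintf("[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]", $class);
    }

    return $xpath;
}

echo toyCssToXpath('h1.foo'), "\n";
```

Combinators, attribute selectors, pseudo-classes, and namespaces are exactly the parts that make a real parser necessary.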

Let's take an example to see how you can use it. Let's say you want to retrieve all post titles and URLs for this blog (the information is available at http://fabien.potencier.org/articles).

use Symfony\Components\CssSelector\Parser;
 
$document = new \DOMDocument();
$document->loadHTMLFile('http://fabien.potencier.org/articles');
 
$xpath = new \DOMXPath($document);
foreach ($xpath->query(Parser::cssToXpath('div.item > h4 > a')) as $node)
{
  printf("%s (%s)\n", $node->nodeValue, $node->getAttribute('href'));
}
 

The code is straightforward: instead of writing an XPath expression by hand, we let the Parser class convert the CSS selector to XPath for us:

$xpath->query(Parser::cssToXpath('div.item > h4 > a'))
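
For comparison, this is roughly what you would have to write by hand for the same selector. It is a sketch against a small inline HTML sample that I made up for the example; the expression is my own equivalent of `div.item > h4 > a`, and the exact string produced by the component may differ:

```php
<?php
// Made-up sample markup standing in for the articles page.
$html = <<<HTML
<div class="item">
  <h4><a href="/article/1">First post</a></h4>
  <p><a href="/about">Not a title</a></p>
</div>
<div class="item featured">
  <h4><a href="/article/2">Second post</a></h4>
</div>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Hand-written equivalent of 'div.item > h4 > a':
// the child combinator (>) maps to a single "/" step in XPath.
$expression = "//div[contains(concat(' ', normalize-space(@class), ' '), ' item ')]/h4/a";

$links = array();
foreach ($xpath->query($expression) as $node) {
    $links[$node->getAttribute('href')] = $node->nodeValue;
}

foreach ($links as $href => $title) {
    printf("%s (%s)\n", $title, $href);
}
```

Only the two links inside an h4 are returned; the link inside the paragraph is skipped.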
 

Be warned that if you work with XML documents, you need to register the namespaces you use. Let's use SimpleXMLElement, which only understands well-formed XML documents:

$document = new \SimpleXMLElement('http://fabien.potencier.org/articles', 0, true);
$document->registerXPathNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
foreach ($document->xpath(Parser::cssToXpath('xhtml|div.item > xhtml|h4 > xhtml|a')) as $node)
{
  printf("%s (%s)\n", $node, $node['href']);
}
 

As you can see, CSS selectors support namespaces (xhtml|div).
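
Here is a small sketch of why the registration matters, using an inline XHTML sample that I made up and hand-written XPath. I am assuming that a namespaced selector such as `xhtml|div` ends up as a prefixed step such as `xhtml:div`; the expressions below are mine, not the output of the component:

```php
<?php
// Made-up XHTML sample: every element lives in the XHTML namespace.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="item"><h4><a href="/article/1">First post</a></h4></div>
  </body>
</html>
XML;

$document = new SimpleXMLElement($xml);

// Unprefixed names only match elements that are in no namespace,
// so this query finds nothing in an XHTML document.
$unprefixed = $document->xpath('//div/h4/a');

// Once the prefix is registered, the prefixed steps match.
$document->registerXPathNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$prefixed = $document->xpath('//xhtml:div/xhtml:h4/xhtml:a');

printf("%d without prefix, %d with prefix\n", count($unprefixed), count($prefixed));
foreach ($prefixed as $node) {
    printf("%s (%s)\n", $node, $node['href']);
}
```

The prefix itself is arbitrary; what has to match is the namespace URI passed to registerXPathNamespace().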

This new CSS Selector component will be used in Symfony 2 for functional tests (but as you will see in the coming weeks, in a very different way than what we had in symfony 1).

The code is unit-tested and has good code coverage, so feel free to use it (the code is on GitHub: http://github.com/fabpot/symfony under the Symfony\Components\CssSelector namespace) and send me some feedback.

Stay tuned!

Discussion

stereoscott — March 31, 2010 20:29 #1
Fabien, we all appreciate how much you have given the php community, and this looks like a great tool for both functional tests and for parsing out content from existing xhtml documents. thank you, and well done!
Rich — March 31, 2010 20:44 #2
thanks Fabien, i wish you post this a few months ago - i used to parse XML's a YML file which described the different nodes. of course this solution is much more elegant.
Tom Boutell — March 31, 2010 21:17 #3
Sweet! Like many people I'm sure, I've written halfassed reimplementations of this in the past. Thanks for doing the DRY thing and tracking down a solid existing implementation and porting it.
Toby — April 01, 2010 00:31 #4
This approach looks pretty similar to http://www.fluentdom.org/ which realizes a jQuery style access approach to the DOM tree.
Sebastian Golasch — April 01, 2010 01:32 #5
First of all, thanks to Fabien for another really useful component.
I'll give it a short try tomorrow.
But one question bothers me: what about performance and/or benchmarks?!
Are there any first experiences and/or short comments about this topic?

@Toby
FluentDom seems to work like a wrapper around the built-in XML functions of PHP, while the CssSelector component 'only' converts CSS expressions to XPath expressions.

It enables you to use the built-in XML functions of PHP directly.
No need to learn a new api,
less error prone because there is no library between you and your good old xml functions,
lower W.T.F factor while refactoring your 'xml-parsing' code, and so on...

I'm still curious about what's cooking in the 'symfony component kitchen' right now and in the near future...
kiang — April 01, 2010 02:32 #6
Just another similar solution: http://framework.zend.com/manual/en/zend.dom.query.html
Marijn Huizendveld — April 01, 2010 04:39 #7
Sometimes I wonder, are you a robot? Seriously...:-)

Anyway, congratulations!
Fabien — April 01, 2010 07:13 #8
@Toby: FluentDOM is very different as it only uses XPath and not CSS selectors.

@kiang: Zend_Dom_Query has very limited support for CSS. It is only able to parse simple CSS selectors. The Symfony 2 component supports the whole CSS specification (with just a few exceptions listed on the lxml.cssselect documentation page). Just have a look at the code and you will understand what I'm talking about ;)
Andris — April 01, 2010 10:15 #9
Another jQuery port worth mentioning would be phpQuery (has crazy stuff like plugins and event triggers).
Jordi Boggiano — April 01, 2010 11:05 #10
Great! I will let the dompdf guys know, hopefully they can replace their parser with this, less code to maintain and especially the current one is not liking CSS3 selectors too much.
Tom — April 01, 2010 13:01 #11
@Fabien for a document with namespaces, you could use:

*[local-name() = 'h1' and contains(concat(' ', normalize-space(@class), ' '), ' foo ')]

@Toby FluentDOM emulates the api (traversing and manipulation), but not the selectors (at least at the moment). You could use both projects together of course.
Toby — April 01, 2010 13:58 #12
Ah, OK. Thought the guys would have added that to FluentDOM already. That was the original idea I was using as an example in my workshop. :) Nevermind.
Stoyan — April 12, 2010 06:54 #13
@Fabien, you said that you hate to reinvent the wheel, but did you look at phpquery ? - http://code.google.com/p/phpquery/

It is exactly what you describe here as functionality and even more. It is also relatively mature project.
Fabien — April 12, 2010 08:57 #14
@Stoyan: Of course I had a look at phpquery! But it does something totally different. It does not convert CSS selectors to XPath... or I missed something.

It behaves more like the CSS selector we have in symfony 1 (in the sfDomCssSelector class). And this one was created in 2006!
Bob McConnell — April 19, 2010 20:42 #15
Re: YAML

I downloaded your YAML component from Symphony, but what's the next step to install it? I now have your tar.gz file in D:\Downloads, and php.exe in D:\PHP, so where do I go from here?