log in | about 

When it comes to data analysis data preparation is the most time-consuming task. According to one survey it takes 80% of time. For those who deal with semi-structured text collections such as Wikipedia, one of the most annoying problem is parsing. I am not talking about splitting documents into sentences, obtaining POS tags, and dependency trees. I mean a mundane task of extracting, e.g., Wikipedia title and text. Somewhat unexpectedly, this can be quite a pain in the ass.

Many of the text collections that I deal with have two things in common: (1) they are stored in XML format and (2) they have repeating entries of the same structure enclosed within a pair of unique tags (i.e., tags do not repeat inside the entry itself). In the case of Wikipedia, an entry is a Wikipedia article surrounded by the tags <page> and </page>. Because it is a large XML document, one has typically to resort to using an event-driven method called SAX.

Consider an example of such parsing code written by my fellow student Di Wang. As you can see, an event-driven approach is not easy to implement. A SAX parser tells you a few things like when it encounters a starting tag and everything else is your own headache. Basically, you need to keep some sort of a state variable and keep track of opening/closing tags. Not only is this tedious, but it is also error prone and fragile. You change the format of the document a bit and your code may stop working.

It would be a long post to explain everything what I hate about SAX parsing, but let me simply state that it sux in my opinion. What I would prefer instead is to parse everything using a DOM parser. Then, accessing necessary nodes would be a walk in the park. I do not have to care about parsing details, I can use things like XSLT and all sort of useful helper functions that work with an existing DOM tree. Buuuut, this approach is extremely memory inefficient.

Instead it would be nice to have something like an XML iterator that would go over the list of similarly-structured entities, parse one entry (e.g., a Wikipedia article) at a time, and generate a DOM tree only for this entry. How does one implement such a thing? Recall that each entry is enclosed by the pair of unique tags. Thus we can find the start/end of each entry and parse one entry using a DOM parser. Of course, there are some subtleties to be taken care of. For example, the enclosing tags may occasionally have attributes and document entries may have, e.g., CDATA sections. However, it should not be too complicated to implement such functionality.

This is exactly what I did when I got tired of using pesky SAX parsers. I have been using my "XML iterator" implementation for more than a year, but only recently did I extract the code so it can be used in a standalone fashion. The repository is on GitHub. It contains an XML iterator class as well as a Wikipedia parsing example. It can be executed by calling a script sample_run.sh. The code is in Java (8+). Feel free to (dis)like the code and send me pull requests should you find any problems.

The XML iterator does not do any deep XML parsing. It only extracts the text of document entries (one at a time). An entry should be enclosed by the unique tag. This means that the tag cannot be reused inside the document entry. On obtaining the next entry, you parse it using a DOM parser of your choice. You do not have to use the same DOM parser as I did. You can process DOM trees in a more elegant way than I did. For example, for complex documents, you can use XSLT/XPATH. To conclude I note that this approach is reasonably efficiently (and uses little memory), but it is not as efficient as the SAX parser. So, if the parsing speed is of paramount importance (which I doubt), then SAX is still your best friend.


The suggested approach has several (severe) disadvantages:
1. It assumes, that parsed XML is a sequence of entities of the same type.
2. In your specific case it misses the wikipedia header for example. The one, that explains the namespaces names etc. It is unimportant if you want to extract title and raw text only. But if you intend to parse wiki-markup (especially in non-English wikis) you will need these metadata info.
3. It has not thought out all corner cases of XML parsing. :-)

Perhaps, StAX is a more standard and convenient way of achieving this. You can easily parse one document a time without messing with nasty XML parsing. And you will simply need a procedure that constructs a DOM document from XML events. For example, you can build DOM documents from all elements at depth of 1 of the element stack.

Thank you for the feedback.

  1. I do mention this, but it is actually true in most interesting cases. Remember that you parse entities of the same type: pages, answers, etc... How do you propose to name enclosing tags? document1, document2, documentN, ... this would be strange.
  2. Do you mean XML header? This can be added if necessary. Or do you mean siteinfo? siteinfo can be extracted.
  3. Do you know any specific corner cases?
  4. Thank you for the reference. I had a quick glance. Maybe, I got it wrong, but it still has events like START and END_ELEMENT. Is there a simple sample code that works in the similar way as my XML iterator? I mean does StAX allow you to get a DOM tree between specified start/end tags easily?

1. It depends :-) For example the XML iterator approach is not applicable to the OpenStreetMap dump. This dump contains three different kinds of entities: nodes, ways and relations. And some other metadata as changesets.
2. I meant "siteinfo", I just forgot how it is named it the dump.
3. The encoding issues is a nice one. I didn't see any encoding handling then I skipped through the source code. The encoding isn't specified in the constructor of the XmlIterator and depends on the environment. For example, the Russian locale of Microsoft Windows assumes the "CP1251" encoding.
4. The StAX deals with low-level XML entities as START_ELEMENT, END_ELEMENT etc. It is similar to the SAX in this aspect. But in SAX you don't control how much of the XML you want to read. Sure, you can skip the unneeded parts of it, but you don't control the parsing machinery. StAX allows to read a part of XML then close it.

As for StAX/DOM hybrid this link http://stackoverflow.com/questions/9379868/reading-a-big-xml-file-using-stax-and-dom suggests a simple way, however I never tried it. I prefer to parse StAX to POJO directly. Or, if I need to perform some extraction - to use JSoup. CSS-like selectors are nicer than XPath for extraction, IMHO. And you'll need to deal this XML specific stuff like CDATA and even when using DOM interface anyway.

Good points, thanks. In any case, you should separate between two things: the idea of parsing XML differently (in a truly iterative partial-DOM fashion) and a specific implementation. To emphasize approach != code. The approach rulez, but the code may not yet.

More comments:
1. Will add multiple tags instead of one. This is a trivial change.
2. The siteinfo will be easy to extract.
3. Good point. I have not yet seen a non-utf8 XML file. But I can imagine you can easily recode cp1251 into utf8 in a couple of quirky cases. It is probably also a good idea to memorize XML header, this could help dealing with encoding more easily.
4. I guess the example that I reference in my post is StAX not SAX, but, anyways, this is the way I don't like dealing with data. In the simple case (two-level hierarchy), you can do this, but not in a more complicated ones.

Thank you for the reference again. The code seems to be weird though: It reacts to START_ELEMENT, but not to END_ELEMENT. Anyways, even if it doesn't work, I guess, you can make it a working code. If this is, indeed, possible, I will incorporate it in my code (say as version 2).