## A surprising novel stopword that appears if you use the Stanford NLP tokenizer

I recently learned a new stopword that seems to be missing from most of the standard stopword lists (for example, it is not on the list of the Lemur/Indri toolkit), which likely means it is pretty novel to the IR community. This stopword is a simple three-character combination: n't. How does it arise? Well, it is a result of tokenizing contractions such as can't or aren't. But don't blindly trust my words, check the tokenization results yourself, e.g., using the following sentences (as a reminder, this can be done using an online Stanford tool):

I ain't interested in this.
I can't attend this conference.
Aren't you hungry?
Don't trust me, verify!

Well, this may be correct linguistically, but this is not something the IR community is fully aware of. In particular, a stopword list may contain full contractions such as can't, ain't or don't, but not the suffix n't! If you work with a text where contractions are frequent, there you go, you have a new stopword! Including stopwords in a query may not necessarily have an effect on accuracy, but it will certainly hurt efficiency. BTW, other contractions may also produce interesting tokens, e.g., gonna is tokenized into gon and na. Yet, tokens like na seem to be far less frequent.
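To see the effect without running the Stanford tool, here is a minimal Python sketch. It is an assumption-laden imitation: the regular expression only mimics the way a PTB-style tokenizer splits off n't, it is not the Stanford tokenizer, and the names `tokenize`, `remove_stopwords`, and the tiny `STOPWORDS` set are made up for illustration.

```python
import re

# Hypothetical PTB-style splitter (NOT the Stanford tokenizer itself):
# a word followed by "n't" is cut right before the "n't" suffix.
_TOKEN_RE = re.compile(r"[A-Za-z]+(?=n't)|n't|[A-Za-z]+|\d+|[^\w\s]")

def tokenize(text):
    """Split text into word, number, "n't", and punctuation tokens."""
    return _TOKEN_RE.findall(text)

# Illustrative stopword list: it has full contractions, but not the suffix.
STOPWORDS = {"i", "this", "in", "you", "me", "can't", "ain't", "don't", "aren't"}

def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lowercased form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = tokenize("I can't attend this conference.")
print(tokens)                                         # "can't" became "ca" + "n't"
print(remove_stopwords(tokens, STOPWORDS))            # "n't" survives the filter
print(remove_stopwords(tokens, STOPWORDS | {"n't"}))  # "n't" is gone
```

The entry can't in the stopword list never fires, because the tokenizer has already broken the contraction apart; only adding n't itself removes the suffix.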

## Why SAX sucks and XML iterator rulez when you have to parse Wikipedia or a similar collection

When it comes to data analysis, data preparation is the most time-consuming task. According to one survey, it takes 80% of the time. For those who deal with semi-structured text collections such as Wikipedia, one of the most annoying problems is parsing. I am not talking about splitting documents into sentences or obtaining POS tags and dependency trees. I mean the mundane task of extracting, e.g., the Wikipedia title and text. Somewhat unexpectedly, this can be quite a pain in the ass.

Many of the text collections that I deal with have two things in common: (1) they are stored in XML format and (2) they have repeating entries of the same structure enclosed within a pair of unique tags (i.e., tags do not repeat inside the entry itself). In the case of Wikipedia, an entry is a Wikipedia article surrounded by the tags `<page>` and `</page>`. Because it is a large XML document, one typically has to resort to using an event-driven method called SAX.

Consider an example of such parsing code written by my fellow student Di Wang. As you can see, an event-driven approach is not easy to implement. A SAX parser tells you a few things, like when it encounters a starting tag, and everything else is your own headache. Basically, you need to keep some sort of a state variable and keep track of opening/closing tags. Not only is this tedious, but it is also error-prone and fragile. You change the format of the document a bit and your code may stop working.

It would be a long post to explain everything that I hate about SAX parsing, but let me simply state that it sux in my opinion. What I would prefer instead is to parse everything using a DOM parser. Then, accessing necessary nodes would be a walk in the park. I do not have to care about parsing details, and I can use things like XSLT and all sorts of useful helper functions that work with an existing DOM tree. Buuuut, this approach is extremely memory inefficient.

Instead it would be nice to have something like an XML iterator that would go over the list of similarly-structured entities, parse one entry (e.g., a Wikipedia article) at a time, and generate a DOM tree only for this entry. How does one implement such a thing? Recall that each entry is enclosed by the pair of unique tags. Thus we can find the start/end of each entry and parse one entry using a DOM parser. Of course, there are some subtleties to be taken care of. For example, the enclosing tags may occasionally have attributes and document entries may have, e.g., CDATA sections. However, it should not be too complicated to implement such functionality.

This is exactly what I did when I got tired of using pesky SAX parsers. I have been using my "XML iterator" implementation for more than a year, but only recently did I extract the code so it can be used in a standalone fashion. The repository is on GitHub. It contains an XML iterator class as well as a Wikipedia parsing example. It can be executed by calling a script sample_run.sh. The code is in Java (8+). Feel free to (dis)like the code and send me pull requests should you find any problems.

The XML iterator does not do any deep XML parsing. It only extracts the text of document entries (one at a time). An entry should be enclosed by the unique tag. This means that the tag cannot be reused inside the document entry. On obtaining the next entry, you parse it using a DOM parser of your choice. You do not have to use the same DOM parser as I did. You can process DOM trees in a more elegant way than I did. For example, for complex documents, you can use XSLT/XPath. To conclude, I note that this approach is reasonably efficient (and uses little memory), but it is not as efficient as the SAX parser. So, if the parsing speed is of paramount importance (which I doubt), then SAX is still your best friend.
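For readers who want to see the idea in a few lines, here is a rough Python sketch of such an iterator. To be clear, this is not the Java code from the repository: the function name `iter_entries` and the `chunk_size` parameter are my illustrative assumptions, and the sketch cuts corners (it does not handle a self-closing entry tag or a CDATA section that happens to contain the closing tag as text).

```python
import io
import re
import xml.etree.ElementTree as ET

def iter_entries(stream, tag, chunk_size=1 << 16):
    """Yield the raw text of each <tag ...>...</tag> entry, one at a time.

    Assumes the tag is unique, i.e., it never occurs inside an entry.
    """
    # The opening tag may carry attributes, so match "<tag" followed by
    # whitespace or ">" rather than the literal "<tag>".
    open_re = re.compile(r"<%s(?=[\s>])" % re.escape(tag))
    close = "</%s>" % tag
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        buf += chunk
        while True:
            m = open_re.search(buf)
            if m is None:
                # Keep only a short tail: it may hold a partial opening tag.
                buf = buf[-(len(tag) + 1):]
                break
            end = buf.find(close, m.start())
            if end < 0:
                # Entry is incomplete: drop text before it and read more.
                buf = buf[m.start():]
                break
            end += len(close)
            yield buf[m.start():end]
            buf = buf[end:]
        if not chunk:  # end of input
            break

# Usage: build a small DOM tree for one entry at a time.
dump = io.StringIO(
    '<mediawiki><page id="1"><title>A</title><text>alpha</text></page>'
    "<page><title>B</title><text>beta</text></page></mediawiki>"
)
for raw in iter_entries(dump, "page"):
    page = ET.fromstring(raw)  # any DOM-like parser would do here
    print(page.findtext("title"), page.findtext("text"))
```

Only one entry plus one read chunk is held in memory at any time, which is the whole point. The price is that an entry spanning several chunks gets rescanned while it is incomplete, one of the reasons such an approach is slower than plain SAX.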

## Text retrieval can and should benefit from using generic k-NN search algorithms

We believe that text retrieval can and should benefit from using generic k-NN search algorithms. To support our conjecture, we carried out a bunch of experiments, published a paper, as well as related software. A high-level summary of the paper is given in the talk, whose text we also post online (just in case, slides are also available).

What is all this about? Why should one use k-NN search? In a classic filter-and-refine pipeline, you would usually get a candidate result set filtered by TFxIDF. What if we replace TFxIDF with some expensive-to-compute but accurate similarity? Clearly, we will not be able to use text-based inverted files to answer queries efficiently. At the same time, a brute-force comparison of the query against every document would be terribly slow. However, we can try to answer queries using some distance-based approximate k-NN search algorithm. If such an approach is sufficiently fast, we might get a practical tool to find documents that are impossible or hard to find using TFxIDF-based retrieval.
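To make the brute-force baseline concrete, here is a toy Python sketch. It is only an illustration under my own assumptions: the names `cosine` and `brute_force_knn` are made up, plain cosine over word counts stands in for the "expensive but accurate" similarity, and nothing here is taken from the paper or the software.

```python
import heapq
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def brute_force_knn(query, docs, k, similarity=cosine):
    """Return indices of the k documents most similar to the query.

    Every document is compared with the query, so the cost grows linearly
    with the collection size: exactly what is too slow at scale.
    """
    q = Counter(query.lower().split())
    scored = ((similarity(q, Counter(d.lower().split())), i) for i, d in enumerate(docs))
    return [i for _, i in heapq.nlargest(k, scored)]

docs = ["cheap flights to paris", "paris hotel deals", "python xml parsing"]
print(brute_force_knn("xml parsing in python", docs, k=1))
```

An approximate k-NN index aims to return (nearly) the same neighbors while evaluating the similarity for only a small fraction of the documents.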

I would not claim that we have fully achieved our objective, but we have probably made a good step towards achieving it. In fact, the phrase "Let's replace" in the title of the paper means only that we see such a replacement as an important goal.

## An electric highway may be, indeed, nearer than we think

Frankly speaking, I have been a bit skeptical about electric cars coming to our highways in large numbers. So, when I first heard about Germans wanting to ban sales of new internal combustion engines by 2030, my first thought was that this Bundesrat initiative was absolutely nuts (for the record, the Bundesrat decision does not yet have legislative power). First, the number of plug-in cars is still laughable and not all of these plug-ins are fully electric. Second, the current infrastructure does not support en masse charging of electric vehicles. Tesla and Nissan have (I guess incompatible) superchargers here and there, but... Man, when was the last time you drove 500 miles? Imagine it is 700 now because you need to drive through a supercharging station. Last, but not least, I am not sure that battery technology is ready. These are all valid concerns, but, after doing some basic research, I have come to the conclusion that the era of electric cars may be closer than we thought.

Perhaps, my primary concern was the cost of a battery. The battery is, probably, the most expensive part of an electric car. For example, in 2010 you would pay 750 dollars per kWh of a Li-Ion battery. For an all-purpose electric car, one would need a 100+ kWh battery pack, which would cost a whopping $75,000 in 2010. However, somewhat miraculously, the cost of batteries has since dropped 5x. Furthermore, GM expects a further 1.5x reduction by the end of 2021. Wow, this means that already in 2021, the cost of a good battery would be only $10,000! This is still a lot. However, you have to remember that an all-electric car is a simpler gadget, which needs a simpler engine and a simpler transmission. So, without battery shortages potentially hiking the battery price (which is, of course, a serious unknown variable), electric cars will soon be quite affordable. Perhaps, even cheaper than gasoline cars, which are also more expensive to maintain! To sum up this paragraph, even Li-Ion batteries seem to be quite a viable option. Furthermore, one should not exclude potential alternative battery technologies kicking in by 2030-2040.
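The arithmetic behind these numbers is easy to double-check. The snippet below only reuses the figures quoted above (750 dollars per kWh in 2010, a 5x drop, a further 1.5x drop, and a 100 kWh pack); the function name `pack_cost` is just an illustrative helper.

```python
def pack_cost(price_per_kwh, capacity_kwh=100):
    """Cost of a battery pack in dollars for a given price per kWh."""
    return price_per_kwh * capacity_kwh

price_2010 = 750.0                    # dollars per kWh, figure quoted above
price_after_5x = price_2010 / 5       # 150 dollars per kWh
price_2021 = price_after_5x / 1.5     # 100 dollars per kWh (GM's expectation)

print(pack_cost(price_2010))      # 75000.0
print(pack_cost(price_after_5x))  # 15000.0
print(pack_cost(price_2021))      # 10000.0
```

So the two reductions together amount to a 7.5x drop, from $75,000 to $10,000 for the same 100 kWh pack.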

Another big concern is, of course, the lack of infrastructure. However, infrastructure would not necessarily be all that costly. For most commuter cars, charging can happen at home. In addition, it seems that it is actually much simpler to build superchargers than gas stations (credits to my neighbor Alex for this observation)! For example, gas stations require an underground fuel tank, but superchargers only require a reliable connection to the grid. A good question is where all the additional electricity would come from. It is a valid question, because powering electric cars with coal is not a good idea. Due to losses in, e.g., electricity transmission, the overall efficiency of such a system is not all that impressive compared to a fuel-efficient (e.g., hybrid) vehicle. In other words, we would likely only increase the amount of emissions by powering electric vehicles with new coal power plants. Natural gas would be a better option, yet it has its own issues. However, I also have high hopes for renewables. In particular, the price of solar panels has decreased to a point where utility companies are starting to lose money (due to people heavily relying on solar panels). At the very least, it would be affordable to use solar or wind or a combination thereof to power your local commute.

In conclusion, I note that, while adoption of electric vehicles is a process full of uncertainties, the electric highway now seems to be closer than I originally thought. Maybe not in 2030, but 2040-2050 does not look like an unrealistic date to me anymore.