log in | about 
 

A catch for "Min Number Should Match" in Solr's ExtendedDisMax parser.

One great feature of Solr is that you can employ different query parsers, even in the same query. There is a standard Solr/Lucene parser and there are number of extensions. One useful extension is the ExtendedDisMax parser. In this parser, it is possible to specify a percentage of query words (or blocks) that should appear in a document. This is some kind of fuzzy matching.

Consider an example of a two-word query "telegraph invent". To retrieve documents using a 80% threshold for the number of matching words, one can specify the following search request:

_query_: "{!edismax mm=80%} telegraph invent "

There is, however, a catch. One may expect that 80% of matching words in a two-word query means that retrieved documents contain both query words. However, this appears not be the case. Somewhat counter-intuitively, the minimum required number of matching keywords is computed by rounding down rather than by a more standard rounding half up. (or half way down)

What if you want to enforce the minimum number of words appearing in a document in a more transparent way? Turns out that you can still do this. To this end, one needs to specify the minimum number of words explicitly, rather than via a percentage. The above example would need to be rewritten as follows:

_query_: "{!edismax mm=2} telegraph invent "

It should apparently be possible to specify even more complex restricting conditions where, e.g., percentages and absolute thresholds are combined. More details on this can be found here. However, combining conditionals did not work out for me (I got a syntax error).



On the Size of the Deep Web

The World Wide Web (or simply Web) started as a tiny collection of several dozen web sites about 20 years ago. Since then, the number of Web pages grew tremendously and became quite segregated. There is a well-lit part of it, a surface Web, which is indexed by search engines, and there is a so-called deep-web, which is studied only slightly better than the outer deep space.

How many pages are on the surface? According to some of the measurements, there are several dozens of billions pages indexed. Were all of these pages created by humans manually? It is possible, but I doubt it. There are about 100 million books written by humans. Let us assume that a book has 100 pages each of which is published as a separate HTML page. This would give us only 10 billion pages. I think that during the 20 years of the existence of the Web, the number of manually created pages could have hardly surpassed this threshold. Consequently, it is not unreasonable to assume that most of the Web pages were automatically generated, e.g., for spamming purposes (two common generation approaches are: scrapping/mirroring contents from other web sites and generating gibberish text algorithmically).

Ok, but what is the size of the deep web? Six years ago, Google announced it knew about a trillion of Web pages. Assuming that the Web doubles each year, the size of the deep Web should be in the dozens of trillions of pages right now. This is supported by a more recent Google announcement: There are at least 60 trillion pages lurking in the depths of the Web!

What constitutes this massive dataset? There are allegations that the Deep Web is used for all kind of illegal activities. Well, there is definitely some illegal activity going on there, but I seriously doubt that humans could have manually created even a tiny fraction of the Deep Web directly. To make this possible, everybody would have to create about 10 thousand Web pages. This would be a tremendous enterprise even if each Web page were just a short status update on Facebook or Twitter. Anyways, most people write status updates probably once a year and not everybody is connected to the Web either.

Therefore, I conclude that the Deep Web should be mostly trash generated by (supposedly) spamming software. Any other thoughts regarding the origin of so many Web pages?



The first search algorithm based on user behavior was invented more than 60 years ago

The first search algorithm based on user behavior was invented more than 60 years ago. I learned this from a seminal paper authored by Yehoshua Bar-Hillel. Bar-Hillel was an influential logician, philosopher, and linguist. He was also known as a proponent of the categorial grammar. Being a logician, Bar-Hillel was very critical of statistical and insufficiently rigorous methods. So, he wrote opinionatedly:

A colleague of mine, a well-known expert in information theory, proposed recently, as a useful tool for literature search, the compiling of pair-lists of documents that are requested together by users of libraries. He even suggested, if I understood him rightly, that the frequency of such co-requests might conceivably serve as an indicator of the degree of relatedness of the topics treated in these documents. I believe that this proposal should be treated with the greatest reserve.

On one hand, Bar-Hillel was very critical. On the other hand, he was also politic and cited his friend invention anonymously. This left us wondering: Who was that prominent information theorist?



Not all date extractors are born equal: on using the right extractor in Stanford NLP toolkit

If you use a Stanford NLP toolkit, how do you extract dates? One may be tempted to directly use the statistical named entity recognizer, included in the toolkit. A demo of this NER is provided online. One immediate catch here is that there are several pre-trained statistical models. The demo code published online is using a 3-class model, which doesn't include dates! One should be careful enough to use the model english.muc.7class.distsim.crf.ser.gz.

The 7-class Muc-trained model is working ok, but there are a couple of issues. First of all, it often fails to detect a complete date. Go to the Stanford NER demo page, select the model english.muc.7class.distsim.crf.ser.gz and enter the text "Barack Hussein Obama was born on 4 August 1961.". The output would be like this:

Barack Hussein Obama was born on 4 August 1961.

Potential tags:
  LOCATION
  TIME
  PERSON
  ORGANIZATION
  MONEY
  PERCENT
  DATE

As you can see, the month and the year were tagged, but not the date of the month. BTW, not all of the Barack Obama's name was tagged either. Surely, I used a bit non-standard format of the date, but this format occurs frequently on the Web. Another issue is that the statistical tagger does not support date standardization. For example, given the dates August 1961 and 4 August 1961, the statistical NER cannot provide standardized date representations such as 1961-08 and 1961-08-04, which are easy to process and compare.

How big is the deal? My evidence is mostly anecdotal as I do not have a large enough sample to obtain reliable results. Yet, in one of our custom question answering pipeline, I gained about 20% in accuracy by using a rule-based Stanford Temporal Tagger (SUTime), instead of the statistical NER.

Interestingly, the SUTime is enabled automatically with the StanfordCoreNLP pipeline by including the NER annotator. The catch, again, is that it is not included when you use the statistical NER directly. Not only the SUTime has better recall and precision, but it also returns dates in the normalized form. An example of using the SUTime is provided by Stanford folks.



On teaching kids to program

What is a good framework to teach kids programming? For several years, I and my son were playing with Scratch. We have been making a slow progress and finally reached a point of understanding basic concepts. Scratch is a great environment, which hides a lot of implementation details from you. You have a few basic graphic primitives (the stage and the sprites) and a bunch of functions to manipulate them. Coding is easy because you create scripts by drag-and-dropping basic programming primitives: loops, conditionals, etc... Overall, Scratch is a powerful framework that allows one to create simple games/programs quickly (implementation speed is important here!).

The Scratch paradigm is not exactly event-driven programming. Yet, it involves parallel processing. For example, if one sprite (a hero) bumps into another sprite (a villain), it is easy to detect and process this event. However, there will be one script that controls the behavior of the hero and another script that controls the behavior of the villain. There are "naturally" occurring events (e.g, one sprite touches another) and programmatically initiated messages (broadcasts).

If you want to start with Scratch, there is a good project book. Beware, though, that there seems to be bugs in the program descriptions: be prepared to debug and fix them.

One big problem with the Scratch is that it is all parallel. As many of you know full well, parallelism entails race conditions. This is truly bad news, because even full-grown bearded software programmers have troubles with race conditions. How can we expect kids to tackle them easily? This is why it may be easier to start with a (mostly) single-thread programming.

There is a good a review of platforms/languages that can be taught after or instead of the Scratch (via Greg Linden). One option it suggests is JavaScript. After pondering for a while, I think this option is best. JavaScript is not terribly hard, there are exceptional interactive tutorials, and it is a production language, which is used in Web-development.

The latter reason seems to be quite appealing. Teens like to do cool things. And what can be cooler than an interactive HTML page that you create yourself?

UPDATE: interestingly, it seems that we will be studying Python instead. In that, it seems that the Code Academy is a good place to start.



Pages

Subscribe to RSS - srchvrs's blog