
Does arXiv really have a high citation index?

In a recent post, Daniel Lemire says that "... though unrefereed, arXiv has a better h-index than most journals". In particular, arXiv is included in Google's list of most cited venues, where it consistently beats most other journals and conferences. Take a look, e.g., at the Databases & Information Systems section. Daniel concludes by advising readers to subscribe to the arXiv Twitter stream.

Well, obviously, arXiv is a great collection of open-access, high-quality publications (at least a subset is great), but what implications does it have for a young researcher? Does she have to stop publishing in good journals and conferences? Likely not, because the high ranking of arXiv seems to be counterfactual.

Why is that? Simply because arXiv is not an independent venue: it mirrors papers published elsewhere. Consider, e.g., the top three papers in the Databases & Information Systems section:

  1. Low, Yucheng, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. "Distributed GraphLab: a framework for machine learning and data mining in the cloud." Proceedings of the VLDB Endowment.
  2. Hay, Michael, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. "Boosting the accuracy of differentially private histograms through consistency." Proceedings of the VLDB Endowment.
  3. Xiao, Xiaokui, Guozhang Wang, and Johannes Gehrke. "Differential privacy via wavelet transforms." IEEE Transactions on Knowledge and Data Engineering.

All of them appeared elsewhere, two of them at the prestigious VLDB conference. Perhaps this is just sampling bias, but of the top 10 papers in this section, all 10 were published elsewhere, mostly in the VLDB proceedings.

However, Daniel argues that only a small fraction of VLDB papers appear on arXiv, thus apparently implying that the high ranking of arXiv cannot be explained away by the fact that arXiv is not independent:

One could argue that the good ranking can be explained by the fact that arXiv includes everything. However, it is far from true. There are typically less than 30 new database papers every month on arXiv whereas big conferences often have more than 100 articles (150 at SIGMOD 2013 and over 200 at VLDB 2013).

But it absolutely can! Note that venues are ranked using the h5-index, which is the largest number h such that h articles published in 2009-2013 have at least h citations each. For a high h5-index, it is sufficient to have just a few dozen highly cited papers. And these papers could come from VLDB and other prestigious venues.
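To make the point concrete, here is a minimal sketch of the h-index computation. The citation counts are made up for illustration; the function simply follows the definition above and is not how Google computes its rankings internally.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical venue: 30 heavily cited papers plus 970 barely cited ones.
big_venue = [200] * 30 + [1] * 970
# The same 30 heavily cited papers on their own.
small_venue = [200] * 30

print(h_index(big_venue), h_index(small_venue))
```

Both hypothetical venues get the same index: the size of the venue does not matter, only the highly cited head of the distribution does.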

I have to disclaim that, aside from verifying the top 10 papers in the Databases & Information Systems section of arXiv, I did not collect solid statistics on the co-publishing of top arXiv papers. If anyone has such statistics and they show a low co-publishing rate, I will be happy to retract my arguments. However, so far the statement "arXiv has a high citation index" looks like the outcome of a regression that misses an important covariate.

The arguments in support of arXiv are in line with Daniel's other posts. Check, for example, his recent essay, where Daniel argues that a great paper should not necessarily be published in VLDB or SIGIR. While I absolutely agree that obsessing about top-tier conferences is outright harmful, I think that publishing some of your work there makes a lot of sense, and here is why.

If you are a renowned computer scientist and have a popular blog, disseminating your work is easy-peasy. You can inscribe your findings on the Great Wall of China and your colleagues will rush to buy airline tickets to see it. You can send an e-mail, or you can publish a paper on arXiv. Regardless of the delivery method, your paper will still get a lot of attention (as long as the content is good). For less known individuals, things are much more complicated. In particular, a young scientist has to play a close-to-zero-sum game and compete for the attention of readers. If she approaches her professor or employer and says, "I have done good work recently and published 10 papers on arXiv," this is almost guaranteed to have a merely comical effect. She will be sneered at and taught a lesson about promoting her work better.

People are busy and nobody wants to waste time on reading potentially uninteresting papers. One good time-saving strategy is to make other people read them first. Does this screening strategy have false positives and/or false negatives? It absolutely does, but, on average, it works well. At least, this is a common belief. In particular, Daniel himself will not read any P=NP proofs.

To conclude, Knuth and other luminaries may not care about prestigious conferences and journals, but for other people they mean a lot. I am pretty sure that co-publishing your paper online and promoting it in the blogs is a great supplementary strategy (I do recommend doing this, if you care about my lowly opinion), but this is likely not a replacement for traditional publishing approaches. In addition, I am not yet convinced that arXiv could have a high citation index on its own, without being a co-publishing venue.

A catch for "Min Number Should Match" in Solr's ExtendedDisMax parser.

One great feature of Solr is that you can employ different query parsers, even in the same query. There is a standard Solr/Lucene parser and there are a number of extensions. One useful extension is the ExtendedDisMax parser. In this parser, it is possible to specify a percentage of query words (or blocks) that should appear in a document. This is a kind of fuzzy matching.

Consider an example of a two-word query "telegraph invent". To retrieve documents using an 80% threshold for the number of matching words, one can specify the following search request:

_query_: "{!edismax mm=80%} telegraph invent "

There is, however, a catch. One may expect that 80% of matching words in a two-word query means that retrieved documents contain both query words. However, this appears not to be the case. Somewhat counter-intuitively, the minimum required number of matching keywords is computed by rounding down rather than by the more standard rounding half up (or half down). Here, 80% of two words is 1.6, which is rounded down to 1, so documents containing only one of the two words are retrieved as well.
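The effect of rounding down can be sketched in a few lines. This is only a model of the behavior described above for positive percentages, not Solr's actual code; the function name is made up for illustration.

```python
def min_should_match(num_words, percent):
    """Number of query words that must match when the percentage is rounded down."""
    # Integer arithmetic avoids floating-point surprises: floor(num_words * percent / 100).
    return (num_words * percent) // 100


for num_words in (2, 3, 5, 10):
    print(num_words, "words ->", min_should_match(num_words, 80), "required")
```

For a two-word query the 80% threshold demands only one matching word. Under this model, a threshold of 100% is needed before both words are required.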

What if you want to enforce the minimum number of words appearing in a document in a more transparent way? It turns out that you can still do this. To this end, one needs to specify the minimum number of words explicitly, rather than via a percentage. The above example would need to be rewritten as follows:

_query_: "{!edismax mm=2} telegraph invent "
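If the request is sent over HTTP, the nested query has to be URL-encoded. Below is a minimal sketch that only builds the request URL. The host, port, and core name are hypothetical placeholders, and the helper assumes that query words contain no double quotes.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint: replace host, port, and core name with your own.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_search_url(words, min_match):
    """Build a select URL whose q parameter wraps an edismax sub-query with an explicit mm."""
    nested = "{!edismax mm=%d} %s" % (min_match, " ".join(words))
    params = {"q": '_query_:"%s"' % nested}
    # urlencode takes care of escaping braces, quotes, and spaces.
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_search_url(["telegraph", "invent"], 2)
print(url)
```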

It should apparently be possible to specify even more complex restricting conditions where, e.g., percentages and absolute thresholds are combined. More details on this can be found here. However, combining conditionals did not work out for me (I got a syntax error).

On the Size of the Deep Web

The World Wide Web (or simply the Web) started as a tiny collection of several dozen web sites about 20 years ago. Since then, the number of Web pages has grown tremendously and the Web has become quite segregated. There is a well-lit part of it, the surface Web, which is indexed by search engines, and there is the so-called deep Web, which is studied only slightly better than outer deep space.

How many pages are on the surface? According to some measurements, there are several dozen billion pages indexed. Were all of these pages created by humans manually? It is possible, but I doubt it. There are about 100 million books written by humans. Let us assume that a book has 100 pages, each of which is published as a separate HTML page. This would give us only 10 billion pages. I think that during the 20 years of the existence of the Web, the number of manually created pages could hardly have surpassed this threshold. Consequently, it is not unreasonable to assume that most Web pages were automatically generated, e.g., for spamming purposes (two common generation approaches are scraping/mirroring content from other web sites and generating gibberish text algorithmically).

Ok, but what is the size of the deep Web? Six years ago, Google announced it knew about a trillion Web pages. Assuming that the Web doubles each year, the size of the deep Web should be in the dozens of trillions of pages right now. This is supported by a more recent Google announcement: There are at least 60 trillion pages lurking in the depths of the Web!

What constitutes this massive dataset? There are allegations that the Deep Web is used for all kinds of illegal activities. Well, there is definitely some illegal activity going on there, but I seriously doubt that humans could have manually created even a tiny fraction of the Deep Web directly. To make this possible, everybody would have to create about 10 thousand Web pages. This would be a tremendous enterprise even if each Web page were just a short status update on Facebook or Twitter. Anyway, most people probably write status updates once a year, and not everybody is connected to the Web either.
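The back-of-the-envelope estimates in this post are easy to reproduce. The sketch below uses the figures quoted in the text; the world population of 7 billion is my assumption and does not come from the announcements.

```python
# Surface Web: how many pages could humans plausibly have written by hand?
books = 100_000_000            # about 100 million books
pages_per_book = 100           # assumed number of pages per book
manual_pages = books * pages_per_book

# Deep Web: a trillion pages six years ago, doubling every year.
pages_six_years_ago = 10**12
years = 6
pages_now = pages_six_years_ago * 2**years

# Pages that each person would have to create by hand.
deep_web_pages = 60 * 10**12
world_population = 7 * 10**9   # assumption, not taken from the text
pages_per_person = deep_web_pages / world_population

print(manual_pages, pages_now, round(pages_per_person))
```

With these inputs, the doubling assumption gives 64 trillion pages, close to the announced 60 trillion, and the per-person load is on the order of 10 thousand pages.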

Therefore, I conclude that the Deep Web should be mostly trash generated by (supposedly) spamming software. Any other thoughts regarding the origin of so many Web pages?

The first search algorithm based on user behavior was invented more than 60 years ago

The first search algorithm based on user behavior was invented more than 60 years ago. I learned this from a seminal paper authored by Yehoshua Bar-Hillel. Bar-Hillel was an influential logician, philosopher, and linguist. He was also known as a proponent of categorial grammar. Being a logician, Bar-Hillel was very critical of statistical and insufficiently rigorous methods. So, he wrote opinionatedly:

A colleague of mine, a well-known expert in information theory, proposed recently, as a useful tool for literature search, the compiling of pair-lists of documents that are requested together by users of libraries. He even suggested, if I understood him rightly, that the frequency of such co-requests might conceivably serve as an indicator of the degree of relatedness of the topics treated in these documents. I believe that this proposal should be treated with the greatest reserve.

On the one hand, Bar-Hillel was very critical. On the other hand, he was also politic and cited his friend's invention anonymously. This leaves us wondering: Who was that prominent information theorist?
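For the curious, the proposal quoted above can be sketched in a few lines: count how often each pair of documents is requested together and treat the count as a relatedness score. The request logs below are made up, and the function is my reading of the quoted idea rather than a reconstruction of the original method.

```python
from collections import Counter
from itertools import combinations


def co_request_counts(requests):
    """Count how often each unordered pair of documents is requested together."""
    counts = Counter()
    for docs in requests:
        # Deduplicate and sort so that (a, b) and (b, a) map to the same pair.
        for pair in combinations(sorted(set(docs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical library logs: each inner list is one user's joint request.
requests = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_a", "doc_b"],
    ["doc_b", "doc_c"],
    ["doc_a", "doc_b", "doc_b"],
]

counts = co_request_counts(requests)
print(counts.most_common())
```

Pairs with higher counts would be considered more closely related, which is exactly the inference that Bar-Hillel advised treating with reserve.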

Not all date extractors are born equal: on using the right extractor in the Stanford NLP toolkit

If you use the Stanford NLP toolkit, how do you extract dates? One may be tempted to directly use the statistical named entity recognizer (NER) included in the toolkit. A demo of this NER is provided online. One immediate catch here is that there are several pre-trained statistical models. The demo code published online uses a 3-class model, which doesn't include dates! One should be careful to use the model english.muc.7class.distsim.crf.ser.gz.

The 7-class MUC-trained model works OK, but there are a couple of issues. First of all, it often fails to detect a complete date. Go to the Stanford NER demo page, select the model english.muc.7class.distsim.crf.ser.gz and enter the text "Barack Hussein Obama was born on 4 August 1961.". The output would look like this:

Barack Hussein Obama was born on 4 August 1961.


As you can see, the month and the year were tagged, but not the day of the month. BTW, not all of Barack Obama's name was tagged either. Admittedly, I used a slightly non-standard date format, but this format occurs frequently on the Web. Another issue is that the statistical tagger does not support date standardization. For example, given the dates August 1961 and 4 August 1961, the statistical NER cannot provide standardized date representations such as 1961-08 and 1961-08-04, which are easy to process and compare.
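To illustrate what standardization buys you, here is a toy normalizer for just the two formats mentioned above. It is not how SUTime or any Stanford tool works; it relies on Python's standard datetime parsing and assumes English month names (the default C locale).

```python
from datetime import datetime

# (input format, output format) pairs, tried from most to least specific.
FORMATS = [
    ("%d %B %Y", "%Y-%m-%d"),  # day, month name, year
    ("%B %Y", "%Y-%m"),        # month name, year
]


def normalize_date(text):
    """Return a standardized date string, or None if no known format matches."""
    for in_fmt, out_fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), in_fmt).strftime(out_fmt)
        except ValueError:
            continue
    return None


print(normalize_date("4 August 1961"))
print(normalize_date("August 1961"))
```

Once dates are in this form, comparing them reduces to plain string operations: a full date belongs to a month if it starts with that month's representation.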

How big a deal is this? My evidence is mostly anecdotal as I do not have a large enough sample to obtain reliable results. Yet, in one of our custom question answering pipelines, I gained about 20% in accuracy by using the rule-based Stanford Temporal Tagger (SUTime) instead of the statistical NER.

Interestingly, SUTime is enabled automatically in the StanfordCoreNLP pipeline when you include the NER annotator. The catch, again, is that it is not included when you use the statistical NER directly. Not only does SUTime have better recall and precision, but it also returns dates in a normalized form. An example of using SUTime is provided by the Stanford folks.

