In a recent post, Daniel Lemire says that "... though unrefereed, arXiv has a better h-index than most journals". In particular, arXiv is included in Google's list of most cited venues, where it consistently beats most journals and conferences. Take a look, e.g., at the section Databases & Information Systems. Daniel concludes by advising readers to subscribe to the arXiv Twitter stream.

Well, obviously, arXiv is a great collection of open-access high-quality publications (at least a subset is great), but what implications does it have for a young researcher? Does she have to stop publishing in good journals and conferences? Likely not, because the high ranking of arXiv seems to be misleading.

Why is that? Simply because arXiv is not an independent venue: it mirrors papers published elsewhere. Consider, e.g., the top 3 papers in the Databases & Information Systems section:

  1. Low, Yucheng, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. "Distributed GraphLab: a framework for machine learning and data mining in the cloud." Proceedings of the VLDB Endowment.
  2. Hay, Michael, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. "Boosting the accuracy of differentially private histograms through consistency." Proceedings of the VLDB Endowment.
  3. Xiao, Xiaokui, Guozhang Wang, and Johannes Gehrke. "Differential privacy via wavelet transforms." IEEE Transactions on Knowledge and Data Engineering.

All of them appeared elsewhere, two of them at the prestigious VLDB conference. Perhaps this is just sampling bias, but out of the top 10 papers in this section, all 10 were published elsewhere, mostly in the VLDB proceedings.

However, Daniel argues that only a small fraction of VLDB papers appear on arXiv, thus apparently implying that the high ranking of arXiv cannot be explained away by the fact that arXiv is not independent:

One could argue that the good ranking can be explained by the fact that arXiv includes everything. However, it is far from true. There are typically less than 30 new database papers every month on arXiv whereas big conferences often have more than 100 articles (150 at SIGMOD 2013 and over 200 at VLDB 2013).

But it absolutely can! Note that venues are ranked using the h5-index, which is equal to the largest number h such that h articles published in 2009-2013 have at least h citations each. For a high h5-index, it is sufficient to have just a few dozen highly cited papers. And these papers could come from VLDB and other prestigious venues.
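To make the definition concrete, here is a minimal sketch of the h-index computation in Python. The function name `h_index` and all citation counts are made up for illustration; they are not real arXiv or VLDB data, and the sketch skips the step of filtering articles to the 2009-2013 window. It shows that a few dozen highly cited papers fix the index, no matter how many rarely cited papers sit next to them.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break     # counts only decrease from here, so h cannot grow
    return h


# Hypothetical citation counts, invented purely for illustration.
highly_cited = [60] * 40   # 40 co-published papers with 60 citations each
long_tail = [2] * 1000     # 1000 rarely cited papers with 2 citations each

print(h_index(highly_cited))              # 40
print(h_index(highly_cited + long_tail))  # 40
print(h_index(long_tail))                 # 2
```

In this toy setup the 1000 rarely cited papers do not move the index at all, so the size of a venue says little about where its h5-index comes from.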

I have to admit that, aside from verifying the top 10 papers in the Databases & Information Systems section of arXiv, I did not collect solid statistics on the co-publishing of top arXiv papers. If anyone has such statistics and they show a low co-publishing rate, I will be happy to retract my arguments. However, so far the statement "arXiv has a high citation index" looks like the outcome of a regression that misses an important covariate.

The arguments in support of arXiv are in line with Daniel's other posts. Check, for example, his recent essay, where Daniel argues that a great paper should not necessarily be published in VLDB or SIGIR. While I absolutely agree that obsessing about top-tier conferences is outright harmful, I think that publishing some of the work there makes a lot of sense, and here is why.

If you are a renowned computer scientist and have a popular blog, dissemination of your work is an easy-peasy business. You can inscribe your findings on the Great Wall of China and your colleagues will rush to buy airline tickets to see it. You can send an e-mail, you can publish a paper on arXiv. Regardless of the delivery method, your paper will still get a lot of attention (as long as the content is good). For lesser-known individuals, things are much more complicated. In particular, a young scientist has to play a close-to-zero-sum game and compete for the attention of readers. If she approaches her professor or employer and says, "I have done good work recently and published 10 papers on arXiv", this is almost guaranteed to have a merely comical effect. She will be sneered at and taught a lesson about promoting her work better.

People are busy and nobody wants to waste time on reading potentially uninteresting papers. One good time-saving strategy is to make other people read them first. Does this screening strategy have false positives and/or false negatives? It absolutely does, but, on average, it works well. At least, this is a common belief. In particular, Daniel himself will not read any P=NP proofs.

To conclude, Knuth and other luminaries may not care about prestigious conferences and journals, but for other people they mean a lot. I am pretty sure that co-publishing your paper online and promoting it on blogs is a great supplementary strategy (I do recommend doing this, if you care about my lowly opinion), but this is likely not a replacement for traditional publishing approaches. In addition, I am not yet convinced that arXiv could have a high citation index on its own, without being a co-publishing venue.


You attribute to me the opinion that the "high ranking of arXiv cannot be explained away by the fact that arXiv is not independent". This is neither what I wrote nor what I believe.

Regarding the post you link to, you write "Daniel argues that a great paper should not necessarily be published in VLDB or SIGIR". It seems hard to disagree with this general statement, but the post you link to appears unrelated to this claim. The real implication from that post you link to is that "obsessing about top-tier conferences is outright harmful", something you "absolutely" agree with.

Other than that, I agree with everything you have written. It is pretty obvious stuff.

Thank you for the clarification, Daniel. Let the reader decide who made which claims.