log in | about 

theorists vs experimentalists

This was prompted by several recent posts, in particular, by Zach Lipton's tweet, where he complained that all ML culture has been revolving around hacking: "The dominant practice of the applied machine learnist has shifted from ad-hoc feature hacking (2000s) to ad-hoc architecture hacking (2010s) to ad-hoc pre-training hacking (2020s)."

This may seem to be just another (relatively innocent) complaint about the lack of rigor and scholarship in the machine learning field. However, in my opinion, it represents a much bigger issue, namely, a divide between theorists and experimentalists; between tinkerers and scholars. There are very different opinions on both sides of this divide. For example, my friend and co-author Daniel Lemire goes rather far by saying that scholarship is conservative while tinkering is progressive. On the other side, we have people eager to label tinkerers and experimentalists as tech bros or merely engineers.

I do not subscribe to any of these extremes. However, I believe that tinkering has been an essential, if not the primary engine of progress. There is an understandable desire to explain things, which amounts to building a theoretical model of the world. This is clearly super-useful, but there are limitations. First of all, theories are not bullet-proof. They aim to explain experimental data and they evolve over time. One example is a "contest" between Geo- vs Heliocentric systems. At some point, the Geocentric system was better supported by data, despite being wrong (in the modern understanding). Somewhat similarly, we had to amend Newton's physics and we will probably have to make many amendments to existing theoretical models as well.

Second, theories are limited, often to a great extent. One has to make assumptions (which never truly hold in practice) as well as a lot of simplifications. One of my most favorite examples is a theory of locality-sensitive hashing. Another example is parsing in natural language processing (NLP). Parsing is a crucial component of a rule-based (and hybrid) NLP. There was a lot of effort devoted to making parsing effective, in particular, by training (deep) neural networks models to do parsing. Despite being improved by deep learning, parsing is not particularly popular nowadays. One problem, in my opinion, is that linguistic theory behind parsing permits explaining only a limited number of language phenomena. Thus, these theories (so far) have been more useful to debugging existing neural networks rather than for building fully-functional applications such as question-answering or sentiment analysis systems.

In summary, I would emphasize that theory is certainly useful: not only to fully understand the world, but also to provide insights for tinkering. That said, I believe it is and will continue to be limited, so we cannot dismiss tinkering as some sort of inferior approach to do science/engineering. Daniel Lemire also notes that tinkering is dangerous and it is hard to disagree. The dangers need to be mitigated. However, I do not think it is realistic to expect people to wait till fully-formed useful theories appear, in particular, because this depends on tinkerers producing experimental results.

Traditional IR rivals neural models on the MS MARCO Document Ranking Leaderboard

A few days ago I launched a traditional IR system into (lower layers of) the Transformer cloud. Although inferior to most BERT-based models, it outperformed several neural submissions (as well as all non-neural ones), including two submissions that used a large pretrained Transformer model for re-ranking.

My objectives were:

  • To provide a stronger traditional baseline;
  • To develop an effective first-stage retrieval system,
    which can be efficient and effective without expensive index-time precomputation.

I have posted a short write-up on arxiv to describe the submitted system. The write-up comes with two notebooks, which can be used to reproduce results.

This work was possible largely due to using our own flexible retrieval toolkit FlexNeuART (intended pronunciation flex-noo-art), which was recently presented at the EMNLP OSS Workshop. FlexNeuART was also instrumental to achieving top spots on the MS MARCO document ranking leaderboard in August and November 2020.

Simple advice for runners

I would like to share a couple of tricks that may make running more pleasurable.

The first trick is obvious in the hindsight, but it took me quite a while to figure it out on my own. Take a wet small piece of cloth, which can fit into a side or chest pocket and wipe your face regularly. For longer runs, you may take more than one piece of cloth. The cloth should be pretty small: I find it is inconvenient to run with big towel-like wipes!

The second trick is concerned with the phone, which needs to be stored somewhere, ideally, where it can be accessed easily. Turns out, that one of the best holders is the so-called Running Buddy. With the Running Buddy , It is easy to open the cover and adjust the volume. It is also quite easy to take the phone in and out of this pouch.

I have been running with these pouches for several years already: They stick very well: If attached properly, they do not fall off. In the unlikely detachment event, it is hard to miss that the pouch is not there anymore. Note that nowadays phones have become very large and you will likely need to buy the largest size possible! I have a couple of these (just in case) and I buy them from the Running Buddy (BTW, I am not affiliated with them in any way!).

Last, but not least: If it is hot, I run without a t-shirt or only with a reflective vest if I run at night. It might offend somebody's feelings (hopefully not), but it is just a very practical way to reduce overheating during the warm season. Somewhat surprisingly, it is not easy to get sunburnt while running. As a side comment, I find the 20C (68F) weather to be pretty uncomfortable for running. I overheat very easily and my ideal running temperature is 5-10 degrees Centigrade (10-20 Fahrenheit) above the freezing point.

Dear childless employee

Preamble: This blog post is inspired by a recent outrage at Facebook and Twitter in regard to parents getting extra time off.

Dear childless employee. We are really sorry to hear that many of you feel so lonely and frustrated nowadays. I believe it can cause a lot of real distress and I also wish employers paid more attention to mental health issues. It should also be covered better through a short term disability insurance or a similar policy, which is regretfully lacking. Understandably, some of you are frustrated that parents have gotten a bit more time off. Remember, however, that this is not a permanent benefit, but rather a short-term measure.

Our family was able to work productively when our daycare was closed, but we are totally sympathetic to people who were not able to do so and we are ready to pick up the slack. We are ready despite we are not as young as a vast majority of Facebook employees and we have had our difficult times when we slept close to five hours a day for many years in a row.

Whether giving parents some preferential treatment is fair is a difficult question, which needs to be considered in a broader social context. Here, there is a typical conservative opinion, which is basically "screw you, you are totally on your own" and a more liberal one, which asserts that (some) redistribution of benefits is good for society in the long run. Whether for-profit companies should be responsible for solving any social issues is a tricky question too. We do not have a full agreement on this even in our family.

Understandably, one trend is to hire mostly young employees, who have lower salary expectations and can more readily put in longer hours. However, there is another trend to create healthier and diverse workplaces, which are welcoming women and minorities, because it may benefit us all in the long run. Remember that lack of adequate parental leave affects disproportionately women, who are often default caregivers.

From this perspective, there is nothing unfair in supporting parents through these difficult times: It is just an integral part of building a healthier workplace. Likewise, we should have support for overworked and overstressed people. I wish unexpected parental leaves were handled via a special insurance (or fund), which is similar to the disability insurance. However, we do not have such government policy and the current pandemic situation is unprecedented.

Being a parent is certainly a privilege and some of it is supported through your taxes. We greatly appreciate this help. However, let us also not forget that societies do love babies: They just do not like to put effort in their upbringing. In theory, we have an overpopulation threat, but, in practice, birth rates seem to be plummeting everywhere and especially in the developed countries. Among these US has been doing pretty well, but even here the average is 1.7 birth per woman.

To stay competitive, the US will need many more smart and hardworking people. I speculate that the US can easily absorb 100-200 million people over a period of three-five decades, but immigration is a difficult topic and it has become tricky to invite even highly qualified people. It is quite sad because a skilled workforce is not a burden but a driver of innovation and economic growth.

In conclusion, my dear childless employee, I would like to remind you that one day you may become a parent too. Whether this happens or not should certainly be your personal choice, which could come with a lot of work and years of sleep deprivation. It could also come with a long commute, because good schools are in the suburbs and not where the offices are. If this ever happens, I really hope that your future managers will have some sympathy for your long commute and will not insist you have to be in the office every day. On the plus side, if you are lucky, parenting can also be quite rewarding, so I hope you might enjoy it as we do now.

On the differences between CPU and GPU or why we cannot use GPU for everything

This is written in response to a Quora question. It is a somewhat vague question wondering why we cannot use GPU hardware for all computation tasks. Feel free to vote there for my answer on Quora!

CPU and GPU are fundamentally very different computational devices, but not many people realize it. CPU has a few low-latency cores, elaborate large caches and flow control (prefetching, branch prediction, etc) and a large relatively inexpensive RAM. GPU is a massively parallel device, which uses an expensive high-throughput memory. GPU memory is optimized for throughput, but not necessarily for latency.

Each GPU core is slow, but there can be thousands of them. When a GPU starts thousands of threads, each thread knows its “number” and uses this number to figure out which part of the “puzzle” it needs to solve (by loading and storing corresponding areas of memory). For example, to carry out a scalar product between two vectors, it is fine to start a GPU thread to multiply just two vector elements. However, it is quite unusual from the perspective of a software developer who has been programming CPUs all their life.

GPU designers make a number of trade-offs that are very different from the CPU trade-offs (in terms of flow control, cache size, and management, etc), which are particularly well suited for parallelizable tasks. However, it does not make GPU universally faster than CPU. GPU works well for massively parallel tasks such as matrix multiplication, but it can be quite inefficient for tasks where massive parallelization is impossible or difficult.

Given a large number of “data-hungry” cores, it is IMHO more important (than in the case of the CPU) to have a high-bandwidth memory (but higher memory latency can be tolerated). Yet, due to a high cost of the GPU memory, its amount is limited. Thus, GPU often relies on external lower-bandwidth memory (such as CPU RAM) to fetch data. If we did not have CPU memory, loading data directly from the disk (even from an SSD disk) would have slowed down many GPU workloads quite substantially. In some cases, this problem can be resolved by connecting GPUs using a fast interconnect (NVLink, Infiniband), but it comes with an extra cost and does not resolve all the issues related to having only very limited memory.

Some answers claim that all GPU cores can do only the same thing, but it is only partially correct. However, cores in the same group (warp) do operate in a lock-step. To process a branch operation, GPU needs to stop some of the cores in the warp and restart them when the branch finishes. Different warps can operate independently (e.g., execute different CUDA kernels).

Furthermore, GPU cores are simpler than CPU cores primarily in terms of the flow control. Yet, they are not primitive by far and support a wide range of arithmetic operations (including lower-precision fast operations). Unlike CPU that manages its caches automatically, GPU have fast shared memory, which is managed explicitly by a software developer (there is also a small L1 cache). Shared memory is essentially a manually-managed cache.

Note that not all GPUs support recursive calls (those that support seem to be pretty restrictive about the recursion depth) and none of the GPUs that I know support virtual memory. In particular, the current CUDA recursion depth seems to be 24. GPUs do not have interrupts and lack support for communication with external IO devices. All these limitations make it difficult or impossible to use GPU as the main processing unit that can run an operating system (See also the following paper for more details: GPUfs: the case for operating system services on GPUs. M Silberstein, B Ford, E Witch, 2014.) I am convinced that future computation systems are going to be hybrid systems that combine low-latency very generic processing units and high-throughput specialized units suitable for massively parallel tasks.


Subscribe to RSS - blogs