## Dear childless employee

Preamble: This blog post is inspired by a recent outrage at Facebook and Twitter regarding parents getting extra time off.

Dear childless employee, we are really sorry to hear that many of you feel so lonely and frustrated nowadays. I believe this can cause a lot of real distress, and I also wish employers paid more attention to mental health issues. These should also be better covered through short-term disability insurance or a similar policy, which is regrettably lacking. Understandably, some of you are frustrated that parents have gotten a bit more time off. Remember, however, that this is not a permanent benefit, but rather a short-term measure.

Our family was able to work productively when our daycare was closed, but we are totally sympathetic to people who were not able to do so, and we are ready to pick up the slack. We are ready even though we are not as young as the vast majority of Facebook employees and have had our own difficult times, when we slept close to five hours a day for many years in a row.

Whether giving parents some preferential treatment is fair is a difficult question, which needs to be considered in a broader social context. Here, there is a typical conservative opinion, which is basically "screw you, you are totally on your own" and a more liberal one, which asserts that (some) redistribution of benefits is good for society in the long run. Whether for-profit companies should be responsible for solving any social issues is a tricky question too. We do not have a full agreement on this even in our family.

Understandably, one trend is to hire mostly young employees, who have lower salary expectations and can more readily put in longer hours. However, there is another trend: to create healthier and more diverse workplaces that welcome women and minorities, because this may benefit us all in the long run. Remember that the lack of adequate parental leave disproportionately affects women, who are often the default caregivers.

From this perspective, there is nothing unfair in supporting parents through these difficult times: It is just an integral part of building a healthier workplace. Likewise, we should have support for overworked and overstressed people. I wish unexpected parental leaves were handled via a special insurance (or fund), which is similar to the disability insurance. However, we do not have such government policy and the current pandemic situation is unprecedented.

Being a parent is certainly a privilege, and some of it is supported through your taxes. We greatly appreciate this help. However, let us also not forget that societies do love babies: they just do not like to put effort into their upbringing. In theory, we face an overpopulation threat, but, in practice, birth rates seem to be plummeting everywhere, especially in developed countries. Among these, the US has been doing pretty well, but even here the average is 1.7 births per woman.

To stay competitive, the US will need many more smart and hardworking people. I speculate that the US could easily absorb 100-200 million people over a period of three to five decades, but immigration is a difficult topic, and it has become tricky to invite even highly qualified people. This is quite sad, because a skilled workforce is not a burden but a driver of innovation and economic growth.

In conclusion, my dear childless employee, I would like to remind you that one day you may become a parent too. Whether this happens or not should certainly be your personal choice, one that could come with a lot of work and years of sleep deprivation. It could also come with a long commute, because good schools are in the suburbs and not where the offices are. If this ever happens, I really hope that your future managers will have some sympathy for your long commute and will not insist that you be in the office every day. On the plus side, if you are lucky, parenting can also be quite rewarding, so I hope you might enjoy it as we do now.

## On the differences between CPU and GPU or why we cannot use GPU for everything

This is written in response to a Quora question, a somewhat vague one wondering why we cannot use GPU hardware for all computational tasks. Feel free to upvote my answer on Quora!

CPUs and GPUs are fundamentally very different computational devices, though not many people realize it. A CPU has a few low-latency cores, large and elaborate caches, sophisticated flow control (prefetching, branch prediction, etc.), and access to a large amount of relatively inexpensive RAM. A GPU is a massively parallel device that uses expensive high-throughput memory, which is optimized for throughput but not necessarily for latency.

Each GPU core is slow, but there can be thousands of them. When a GPU starts thousands of threads, each thread knows its “number” and uses it to figure out which part of the “puzzle” it needs to solve (by loading and storing the corresponding areas of memory). For example, to carry out a scalar product between two vectors, it is fine to start a GPU thread that multiplies just two vector elements. This is quite unusual from the perspective of a software developer who has been programming CPUs all their life.
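To make the thread-numbering idea concrete, here is a minimal sketch in plain Python (a simulation of the programming model, not actual GPU code) of how each thread of a dot-product kernel would use its own index to pick out its pair of elements:

```python
import numpy as np

def thread_body(tid, a, b, partial):
    # each "thread" multiplies just one pair of vector elements,
    # selected via the thread's own index
    partial[tid] = a[tid] * b[tid]

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])
partial = np.zeros_like(a)
for tid in range(len(a)):  # on a real GPU these bodies run concurrently
    thread_body(tid, a, b, partial)
result = partial.sum()     # the final reduction step
```

On a real GPU, the per-element products would be computed in parallel and the final sum would itself be a parallel reduction.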

GPU designers make a number of trade-offs that are very different from CPU trade-offs (in terms of flow control, cache size and management, etc.), and these trade-offs are particularly well suited for parallelizable tasks. However, this does not make the GPU universally faster than the CPU. The GPU works well for massively parallel tasks such as matrix multiplication, but it can be quite inefficient for tasks where massive parallelization is impossible or difficult.

Given a large number of “data-hungry” cores, it is IMHO more important (than in the case of the CPU) to have high-bandwidth memory, while higher memory latency can be tolerated. Yet, due to the high cost of GPU memory, its amount is limited. Thus, the GPU often relies on external lower-bandwidth memory (such as CPU RAM) to fetch data. If we did not have CPU memory, loading data directly from disk (even from an SSD) would slow down many GPU workloads quite substantially. In some cases, this problem can be mitigated by connecting GPUs with a fast interconnect (NVLink, InfiniBand), but this comes at extra cost and does not resolve all the issues related to having only very limited memory.

Some answers claim that all GPU cores can only do the same thing, but this is only partially correct. Cores in the same group (warp) do operate in lock-step: to process a branch, the GPU needs to stop some of the cores in the warp and restart them when the branch finishes. Different warps, however, can operate independently (e.g., execute different CUDA kernels).
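A toy simulation (plain Python, purely illustrative) of how a warp handles a branch: all threads evaluate the predicate in lock-step, and threads whose predicate is false are masked off while the others execute the branch body:

```python
def run_warp(values):
    # step 1: all threads in the warp evaluate the predicate in lock-step
    active_mask = [v > 0 for v in values]
    out = list(values)
    # step 2: only active threads execute the branch body; masked-off
    # threads sit idle (this is where throughput is lost to divergence)
    for tid, active in enumerate(active_mask):
        if active:
            out[tid] = out[tid] * 2
    # step 3: all threads reconverge here and continue in lock-step
    return out

# a warp where two threads take the branch and one is masked off
print(run_warp([1, -1, 2]))  # [2, -1, 4]
```

The key point is that the masked-off threads contribute nothing while the branch executes, which is why heavily divergent code wastes GPU throughput.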

Furthermore, GPU cores are simpler than CPU cores primarily in terms of flow control. Yet, they are far from primitive and support a wide range of arithmetic operations (including fast lower-precision operations). Unlike the CPU, which manages its caches automatically, the GPU has fast shared memory, which is managed explicitly by the software developer (there is also a small L1 cache). Shared memory is essentially a manually managed cache.

Note that not all GPUs support recursive calls (those that do seem to be pretty restrictive about the recursion depth) and none of the GPUs that I know of support virtual memory. In particular, the current CUDA recursion depth seems to be 24. GPUs do not have interrupts and lack support for communication with external IO devices. All these limitations make it difficult or impossible to use the GPU as the main processing unit that runs an operating system (see also the following paper for more details: GPUfs: the case for operating system services on GPUs. M. Silberstein, B. Ford, E. Witchel, 2014). I am convinced that future computation systems are going to be hybrid systems, which combine low-latency, very generic processing units with high-throughput specialized units suitable for massively parallel tasks.

## MNIST is super easy and few people know it!

One cannot help but be amazed by the phenomenal success of the MNIST dataset, which is used in so many image publications. But do people realize how easy this dataset is? One clear measure of hardness is the performance of a simplistic k-NN classifier with the vanilla L2 metric applied directly to pixels. As a variant: the performance of the k-NN classifier after some basic unsupervised transformation such as principal component analysis (PCA) or denoising.

I created a small poll to assess what people think about the accuracy of k-NN search on MNIST. I thank everybody for participating: fortunately, more than one hundred people responded (most of them, I assume, machine learning practitioners and enthusiasts). So, I think the results are rather reliable.

In summary, nearly 40% of the respondents think that the accuracy would be at most 80%, and 45% think the accuracy is 95%. Unfortunately, I did not create an option for 90%; I think it would have drawn quite a few responses as well. That said, vanilla k-NN search on pixels has 97% accuracy, and the combination of PCA and the k-NN classifier has nearly 98% accuracy (here is a notebook to back up the 98% claim). In fact, with a bit of additional pre-processing such as deskewing and denoising, one can get nearly 99% accuracy.
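The experiment is easy to reproduce. Here is a minimal sketch with scikit-learn; for speed, it uses the library's small built-in 8x8 digits dataset as a stand-in for MNIST (fetching the real MNIST, e.g., via fetch_openml, works the same way but is much slower, and the exact accuracy differs from the numbers above):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# vanilla k-NN with the L2 metric directly on raw pixels
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```

Even this tiny stand-in dataset yields well over 95% accuracy with no feature engineering at all, which is the whole point.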

It turns out that few people realize how effective the k-NN classifier is on MNIST: only 17% voted for 98%. That said, this does not mean that the k-NN classifier is such a good method overall (it can be good for tabular data, see, e.g., this paper by Shlomo Geva, but not for complex image data, see, e.g., the numbers for CIFAR and ImageNet). It means, however, that MNIST is very easy. Understandably, people need some toy dataset to play with and quickly get results on. One better alternative is Fashion-MNIST. However, it is not too hard either: a vanilla k-NN classifier has about 85% accuracy, and it is probably possible to push the accuracy close to 90% with a bit of preprocessing. Thus, we may need a comparably small, but much more difficult, dataset to replace both of them.

## Hello precision my old friend!

PREAMBLE: When dealing with retrieval, I have traditionally been using the TREC NIST evaluation tools (trec_eval and gdeval) for information retrieval. Although these tools are old, a good amount of effort has been invested into making them right. Unfortunately, you have to call them as external tools: your program forks and can run out of memory. Even though the Linux fork is lazy and does not really copy memory, this still happens. It happens even if you use the posix_spawn function and Python's spawn-type creation of new processes: multiprocessing.set_start_method('spawn')

The issue: I decided to switch to scikit-learn or similarly-interfaced code (e.g., MatchZoo classes) to compute the IR metrics. I cross-compared the results and came to the conclusion that, very likely, all scikit-learn-like packages are fundamentally broken when it comes to computing the mean average precision (MAP) and the normalized discounted cumulative gain (NDCG).

To compute both of the metrics, one needs two things:

1. The list of relevant documents, where the relevance label can be binary or graded;
2. The list of scored/ranked documents.

Ideally, an evaluation tool would ingest this data directly. However, sklearn and other libraries cut corners by accepting two arrays: y_score and y_true. Effectively, each returned document is paired with its relevance grade; see, e.g., scikit-learn MAP.

Unfortunately, such an evaluation ignores all relevant documents that are not returned by the system. The problem is that both NDCG and MAP have a normalizing factor that depends on the number of relevant documents. For example, in my understanding, if your system finds only 10% of all relevant documents, the scikit-learn MAP would produce a 10x larger MAP score compared to NIST trec_eval (and the Wikipedia formula). NDCG is affected by this issue too, but to a lesser degree, because the scores of omitted relevant documents would be heavily discounted anyway.
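A minimal sketch (a hypothetical one-query example) shows how the choice of the normalizing factor changes average precision. The helper below follows the standard trec_eval/Wikipedia definition, which divides by the total number of relevant documents in the collection, not only by the relevant documents that were retrieved:

```python
def average_precision(ranked_rels, num_relevant_total):
    """ranked_rels: binary relevance labels of the *returned* ranked list;
    num_relevant_total: total number of relevant docs in the collection."""
    hits, prec_sum = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / num_relevant_total

ranked = [1, 0, 1]  # the system returned 3 documents, 2 of them relevant

# normalizing only by the retrieved relevant docs, which is effectively
# what the y_true/y_score interface can see:
ap_retrieved_only = average_precision(ranked, 2)   # 5/6 ~ 0.83

# normalizing by all 10 relevant docs in the collection (trec_eval view):
ap_all_relevant = average_precision(ranked, 10)    # 1/6 ~ 0.17
```

Here the system retrieved only 2 of the 10 relevant documents, and the "retrieved-only" normalization inflates the score by exactly 10/2 = 5x, illustrating the discrepancy described above.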

I have created a notebook to illustrate this issue using a one-query example and the MAP metric. By the way, for some reason, scikit-learn refuses to compute NDCG on this data and fails with a weird error.

Related reading: MAP is often (but not always) a meaningless metric if you do intrinsic evaluation of the k-NN search.