log in | about 

NMSLIB 1.5 is out

We have recently released a new version of our Non-Metric Space Library (NMSLIB). NMSLIB is an efficient and extendible toolkit for searching in generic non-metric spaces. Our killer feature is generality. Despite being general, some of the methods may outperform the celebrated LSH in the spaces where LSH methods seem to work best. For more details, please, check my reflections on the current state of the library and future work.

Does IBM Watson have elements of deep learning in its code?

This post duplicates my Quora answer to a question: Does IBM Watson have elements of deep learning in its code?. Feel free to vote and comment there.

Let's assume that we talk about the original IBM Watson that won the Jeopardy! and not a series of diverse products collectively branded as Watson. Keeping this assumption in mind: no, I don't think that IBM Watson used deep learning.

At the very least it didn't seem to play any substantial role. IBM Watson seem to be relying on a classic approach to answer factoid questions, where answer-bearing passages and documents are found via a full-text search. A small fraction of answers came from knowledge bases, something around 1%. Another interesting technique (again it didn't account for many retrieved answers) is previously in my blog entry: On an inconsistency in IBM Watson papers.

To select answer candidates, IBM Watson used a bunch of techniques including filtering by an answer type (e.g., for a question "Who is a a president of the United States" an answer should be a person). It also helped that most answers (90+%) were either Wikipedia titles or a part of the Wikipedia title. Additional help came from redundancy in answer-bearing sentences, i.e., an answer was contained in many text passages.

Disclaimer: I have never worked for IBM and my understanding comes from reading several articles that the IBM Watson team published several years ago.

Is it necessary to assume a distribution of keys for hashing to obtain constant access time guarantees?

This post duplicates my Quora answer. The question is regarding distribution assumptions necessary to obtain constant time access to a hash table. Feel free to vote and comment there.

In the ideal world, you would assume that there is a distribution of keys and this would allow you to analyze an average case behavior. However, in the real world, this is extremely hard to do for two reasons: (1) it's not clear which distribution to use, (2) your math will be crazy complex if doable at all.

So, as Daniel Tunkelang noted, we make a Simple Uniform Hashing Assumption (SUHA) and blindly assume that things are going to be alright: That is, the keys will be miraculously distributed evenly among buckets. This is not the only simplifying assumption about hash functions, there are several others. For example, in the analysis of locality sensitive hashing, we assume that a hash function is randomly and independently reselected for each pair of data points (see my notes here: Does Locality Sensitive Hashing (LSH) analysis have a fatal flaw?)

Note, however, that the SUHA assumption doesn't allow you to tell anything about the worst case complexity (as noted by Mark Gritter). It allows you to establish an average-case complexity (guarantee). If you want to optimize for the worst case, you may opt to use a perfect hash function. For a prespecified set of values (from a static universe of possible hash keys), the perfect hash function always hashes different elements into different buckets. In other words, the perfect hash function is collision-free.

One catch here is that the hash function is not specified for keys outside a given domain. For example, you can hash perfectly integers from 0 to 1000, but you won't know how to deal with 1001. A Cuckoo hashing doesn't have this limitation, while still allowing to answer queries in O(1) worst case time.

Sounds good, eehh? Well, actually both the cuckoo and perfect hashing share a common disadvantage: indexing is a much more expensive procedure compared to the classic hashing scheme. AFAIK, there are only probabilistic guarantees of a success. In practice, I think it is very unlikely that you won't be able to create a hash table, but it may take you quite a while to do so.

Obviously, there are better and worse hash functions. With better hash functions, keys are distributed among buckets more or less uniformly. This is not necessarily true if your hash function is bad. In his famous book, Donald Knuth considers hash function testing in detail. Are bad functions completely useless? In my experience, this is not necessarily so (but better hash functions do lead to substantially better performance).

While even bad hash functions may be ok for many practical purposes, an adversarial selection of keys and their insertion order may cause a real performance problem. For many hash functions, a hacker who knows your hash function may select such a sequence of keys that would result in a nearly O(N) insertion time (N is the number of entries).

One solution to this problem (not yet adopted in all the mainstream languages) is randomized hashing: One may use the same hash function, but some hashing parameter will be selected randomly for each hash table that you create in your program (see, e.g., Use random hashing if you care about security? )

Relation extraction for question answering

This post duplicates my Quora answer. Feel free to vote and comment there.

Relation extraction is used to build knowledge bases. These can be used:

  1. to answer simple questions directly;
  2. to combine extracted pieces of knowledge to answer more complex questions;
  3. for answer typing.

For example, if you extract a ternary relation (frogs, eat, insects) from the sentence: "Adult frogs eat mainly small insects", you can answer a question "What do frog eat?". In the modern era of QA this approach was pioneered by the MIT system Start. See, e.g. a related publication: REXTOR a system for generating relations from natural language, 2000, Katz, Boris, and Jimmy Lin.

Furthermore, answers to more complex questions can be found by combining several ternary relations, however, this requires a complex (and often intractable) logical inference. Some approximate approaches are often applied here, see, e.g., the paper Relation extraction and scoring in DeepQA

Answer typing is a classic answer extraction/matching technology employed from the early days of QA in extractive QA systems (see a seminal paper by R. F. Simmons: Answering English questions by computer). For example, if the question is "What is the largest mammal?" (answer blue whale), a possible answer can only be an animal that feeds babies with its own milk.

To be able to deduce the correct answer blue whale, a QA system needs to know that the blue whale is a mammal. This information may come from human-crafted ontologies. However, human-created ontologies often have poor coverage. An alternative strategy to obtaining such knowledge is relation extraction. For more details, please, see the following IBM Watson paper: Automatic knowledge extraction from documents

My 'Aha!' moments with machine learning/data science

This post on aha moments related to statistical learning duplicates my Quora answer. Feel free to vote and comment there.

I had the same moment a couple of times. It is not the aha moment though, it is a duh moment. Machine learning, which is more appropriately called statistical learning, is so a rear-view window approach. It learns statistical patterns from data, but nothing else. Such a learning creates some sort of a lossy compressed representation of the "past". This compressed representation can be used to predict "the future" as long as the future has the same statistical patterns (as the past). As obvious as it may seem, a clear understanding of this fact helps greatly. In my opinion, this holds for at least the basic supervised learning.

Another duh-moment is that we have been using "machine learning" since the dawn of the civilization to explain natural phenomena. Scientists observed data and came up with some sort of rules to explain why one event follows another one. We clearly started with some basic logical rules (e.g., one can predict that it will be snowing tomorrow given how skies look like today) and progressed to more sophisticated ones that involved math.

Interestingly, human learning has essentially the same flaw as the so called machine learning: Human theories can overfit easily to the data. Given enough degrees of freedom, almost anything can be explained. Yet convoluted theories are rarely true. This is probably one reason why we prefer simple elegant ones: This is some sort of a regularizer that prevents theories from overfitting data.

One well known overfitting example is a Geocentric system, which did not quite agree with observations in the first place. However, it was fixed by introducing a complex scheme of how planets rotate. As a result, the theory predicted planet movements better than alternatives, in particular, better than the simpler Heliocentric system (which was also somewhat flawed in the beginning because it assumed a perfectly circular motion). Many more examples (sadly) arise in a social context, when people try to explain too much while knowing too little. Most of our beliefs and conspiracy theories are probably nothing more than overfitting.


Subscribe to RSS - srchvrs's blog