Bringing a large Russian QA data set to light

Blog

Directory

Submitted by srchvrs on Mon, 02/03/2020 - 14:59

"It is achingly apparent that an overwhelming amount of research in speech and language technologies considers exactly one human language: English." (Kyle Gorman) For this reason Emily Bender has been famously encouraging people to (1) explicitly name languages they work on (2) do more work on non-English-data. This has become known as a Bender rule.

Despite the importance of multilingual NLP, frankly speaking, it has been difficult to have an opportunity to work on non-English data (in the previous decade my only major opportunity was a stint on cross-lingual metaphor detection). I am therefore very pleased to have been recently participating in bringing to light a large Russian question-answering/reading-comprehension (QA) data set SberQuAD, which was created similarly to SQuAD.

I have been helping my co-authors Pavel Efimov and Pavel Braslavski (who did nearly all the work) to analyze and describe this data set. We have conducted a very thorough analysis and evaluated several powerful models. The full analysis is available online, but here I would like to highlight the following:

SberQuAD was created similarly to Stanford SQuAD. Yet, despite the similarities, all the models perform worse on SberQuAD than on SQuAD, which can be attributed to having only a single answer variant and fewer answers that are named entities. A lot of answers in SberQuAD still often contain an entity, but it is normally only a part of an answer. This stands in contrast to SQuAD where roughly half of the answers are named entities.

srchvrs's blog

You are here