More than 90% of machine learning applications improve with human feedback. For example, a model that classifies news articles into pre-defined topics has been trained on thousands of examples where humans have manually annotated the topics. However, if there are tens of millions of news articles, it might not be feasible to manually annotate even 1% of them. If we only sample randomly, we will mostly get popular topics like “politics” that the machine learning model can already identify accurately. So, we need to be smarter about how we sample. This talk is about “Active Learning”, the process of deciding which raw data is most valuable for human review, covering Uncertainty Sampling, Diversity Sampling, and some advanced methods like Active Transfer Learning.
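
To make the idea of Uncertainty Sampling concrete, here is a minimal sketch of one common variant, least-confidence sampling. It is an illustration, not the method from the talk: the article ids, the probabilities, and the helper names `least_confidence` and `select_for_review` are all made up for this example.

```python
def least_confidence(probs):
    """Uncertainty score: 1 minus the probability of the most likely class."""
    return 1.0 - max(probs)


def select_for_review(predictions, k):
    """Return the k item ids whose predictions are the least confident.

    predictions maps an item id to its list of class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda item: least_confidence(predictions[item]),
        reverse=True,  # most uncertain first
    )
    return ranked[:k]


# Hypothetical model outputs over three topics for four unlabeled articles.
predictions = {
    "article_a": [0.95, 0.03, 0.02],  # confident, little to gain from review
    "article_b": [0.40, 0.35, 0.25],
    "article_c": [0.55, 0.40, 0.05],
    "article_d": [0.34, 0.33, 0.33],  # almost a coin toss between topics
}

selected = select_for_review(predictions, 2)
print(selected)
```

The confidently classified article is skipped, and the annotation budget goes to the items the model is least sure about.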

(more…)

Corpus Linguistics is a neglected field of linguistics. Linguists tend to think that it cannot offer much beyond some methodological tools to support their ideas. However, they often blame it when it contradicts their results. Corpus Linguistics was often considered the historic predecessor of Natural Language Processing in the pre-Big Data era. In this post, we claim that Corpus Linguistics offers a unique perspective on language, and that it provides experts with a theoretical and practical framework for analyzing linguistic data. For the best resources on Corpus Linguistics, keep reading!

(more…)

Are you tired of talking about the trolley problem whenever you start a conversation on autonomous vehicles? Are you bored with the fear of robots? Do you want to be sure that, while you are working on your deep reinforcement learning startup, autocracies can’t use your technology to strengthen their power? What if your technology further widens the gap between the rich and the poor? Do you think that ethics is inseparable from development and that we have to care about moral questions? If your answer is “yes” to any of these questions, Mark Coeckelbergh’s AI Ethics is your book!

(more…)

Children and women had no rights for a long time in human history. Universal suffrage and women’s rights were unimaginable for centuries before the modern era. These days, most developed countries protect the rights of animals and the (living) environment to some extent. Technological development raises the question of whether we should give rights to machines. Should we stop beating our robots?

(more…)

City names in Hungary are like Lego parts. You can put together two or more words almost freely and get an existing city name. However, patterns are easy to spot: a sizeable share of the names end in the same word. Our project aims to map the most frequent endings of the municipality names of Hungary.
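
Counting endings like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual pipeline: the short list of city names, the candidate endings, and the helper `count_endings` are chosen for this example.

```python
from collections import Counter


def count_endings(names, endings):
    """Count how many names end with each candidate ending.

    Longer endings are tried first, so a name is credited to its
    longest matching ending only.
    """
    ordered = sorted(endings, key=len, reverse=True)
    counts = Counter()
    for name in names:
        lowered = name.lower()
        for ending in ordered:
            if lowered.endswith(ending):
                counts[ending] += 1
                break
    return counts


# A handful of Hungarian city names (illustrative sample, not the full list).
names = [
    "Székesfehérvár", "Kaposvár", "Sárvár",
    "Nyíregyháza", "Kiskunfélegyháza",
    "Dunaújváros", "Békéscsaba",
]
# Candidate endings are an assumption of this sketch: "vár" (castle),
# "háza" (house of), "város" (town).
endings = ["vár", "háza", "város"]

ending_counts = count_endings(names, endings)
print(ending_counts.most_common())
```

A real analysis would start from the complete list of municipality names and would need a way to discover the endings rather than supply them by hand.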

(more…)

“I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?” said Hal Varian, chief economist at Google, in 2009. These days machine learning and artificial intelligence are the sexiest fields, but their practitioners should be undercover statisticians. If you are looking for an intro to stats, this is a must-read post for you.

(more…)

Although it is a commonplace that we are living in the era of big data and experiencing a data deluge, tons of projects fail for lack of (labeled) data. If you don’t have enough data, you can’t make a great machine learning application, even if your team has the brightest minds in the universe. In this post, we present an example of label spreading to get more labeled data using Python. If you are not a Pythonista, scroll down for other resources on the topic.
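
As a taste of the idea, here is a minimal pure-Python sketch of the label spreading iteration on a toy dataset. It is not the example from the full post: the points, the parameter values, and the function name `label_spreading` are assumptions made for illustration, and a real project would more likely reach for a library implementation.

```python
import math


def label_spreading(points, labels, n_classes, alpha=0.8, sigma=1.0, n_iter=50):
    """Spread known labels to unlabeled points over a similarity graph.

    labels holds a class index for labeled points and -1 for unlabeled ones.
    Returns a predicted class index for every point.
    """
    n = len(points)
    # RBF affinities between all pairs of points, with a zero diagonal.
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                W[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    # Symmetric normalization: S = D^(-1/2) W D^(-1/2).
    deg = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # One-hot matrix of the known labels; unlabeled rows stay all zero.
    Y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)] for i in range(n)]
    F = [row[:] for row in Y]
    # Repeat F <- alpha * S * F + (1 - alpha) * Y.
    for _ in range(n_iter):
        F = [
            [alpha * sum(S[i][k] * F[k][c] for k in range(n)) + (1 - alpha) * Y[i][c]
             for c in range(n_classes)]
            for i in range(n)
        ]
    return [max(range(n_classes), key=lambda c: F[i][c]) for i in range(n)]


# Two well-separated toy clusters with one labeled point each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = [0, -1, -1, 1, -1, -1]

predicted = label_spreading(points, labels, n_classes=2)
print(predicted)
```

Each unlabeled point picks up the label of the labeled point in its own cluster, which is how a couple of manual annotations can turn into many labeled examples.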

(more…)