What does it mean? Does AI know the answer?

Recently, a linguistic search engine helped overturn the federal mask mandate in the US. The recent issue of The Economist says “Huge ‘foundation models’ are turbo-charging AI progress”, and Gary Marcus seems to be the only person who bets against the success of AI. But can we trust empiricism and big data? Can a large language model settle the debate over the common or right usage of words?

OK, we are in the era of (big) data. We know that “The Unreasonable Effectiveness of Data” still holds: pouring more training data into traditional machine learning methods produces better models, and increasing the number of parameters of large language models (or computer vision models, etc.) yields even better ones. One might think we have reached “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”. One might even think that one day an impartial AI will decide what “sanitation” and other terms really mean and how we should interpret the law. But wait a minute!

Source: https://www.economist.com/interactive/briefing/2022/06/11/huge-foundation-models-are-turbo-charging-ai-progress

We have to keep in mind that no corpus is unbiased, so we cannot say that any corpus represents all possible usages of a word or term. For example, most corpora are built from freely available sources such as online news and Wikipedia articles. Regional, class and other differences in usage are barely represented in an online corpus. For more on this, read this paper by top-notch NLP and AI researchers.
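To make the point about corpus composition concrete, here is a minimal sketch in Python. The two sub-corpora are tiny, made-up token lists (not real data), and the function names are our own. It only illustrates how a pooled frequency can hide usage that is common in a small, under-represented source.

```python
from collections import Counter

# Toy, made-up sub-corpora for illustration only (hypothetical data).
# The "news" part is deliberately nine times larger than the "forum" part.
toy_corpus = {
    "online_news": ["sanitation", "waste", "collection", "services"] * 9,
    "forum_posts": ["sanitation", "washing", "hands", "clean"],
}


def relative_frequency(word, tokens):
    """Share of tokens in `tokens` that are exactly `word`."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)


def frequency_by_source(word, corpus):
    """Relative frequency of `word` computed separately for each source."""
    return {
        source: relative_frequency(word, tokens)
        for source, tokens in corpus.items()
    }


def pooled_frequency(word, corpus):
    """Relative frequency of `word` after merging all sources into one corpus."""
    all_tokens = [token for tokens in corpus.values() for token in tokens]
    return relative_frequency(word, all_tokens)


for word in ("waste", "clean"):
    print(word, frequency_by_source(word, toy_corpus), pooled_frequency(word, toy_corpus))
```

In this toy setup, "clean" makes up a quarter of the forum tokens, yet its pooled frequency is tiny because the news source dominates the merged corpus. A real study would of course need real corpora, proper tokenization and documented sampling decisions.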

Corpus linguists are better at designing their corpora, so specialized corpora might be much better than big data. Still, some factors are hard to balance out (e.g. gender, age, etc.). Corpus linguists are usually well aware of the limits of their data, and they know that theory plays an important role in their work. At first, corpus linguistics seems to be a quantitative research program, but it is a qualitative one too. A corpus linguist, or anyone doing discourse analysis, interprets empirical results in some way (often by adhering to a research tradition). The only thing researchers can do is be honest about their predispositions and about the theoretical framework they use to interpret the results. Yes, stating absolute truths in this field is impossible, and you have to get used to this. This leads us to the philosophy of science, and we let the philosophically inclined reader discover Quine’s indeterminacy and underdetermination theses and the role of theory in gathering data.

If you’d like to learn more about corpus linguistics, we wrote a post about the best resources to get started with.

The source of the cover image: https://www.theverge.com/2022/6/7/23153218/legal-corpus-linguistics-mask-mandate-judges
