Corpus Linguistics – the theoretical minimum

Corpus Linguistics is a neglected field of linguistics. Linguists tend to think that it cannot offer much beyond some methodological tools to support their ideas. However, they often blame it when it contradicts their results. Corpus Linguistics has often been considered the historic predecessor of Natural Language Processing in the pre-Big Data era. In this post, we claim that Corpus Linguistics offers a unique perspective on language, and that it provides experts with a theoretical and practical framework to analyze linguistic data. For the best resources on Corpus Linguistics, read on!

The Corpus MOOC

Lancaster University is the epicenter of Corpus Linguistics, and you can take their superb Corpus Linguistics: Method, Analysis, Interpretation MOOC on FutureLearn for free! This is the easiest way to get into Corpus Linguistics. It is strongly recommended even for professional NLP and text/content analysis experts, since it gives a different perspective on linguistic data than other disciplines do.

Take a look at the ESRC Centre for Corpus Approaches to Social Science (CASS) website to get an idea of how corpus methods can be applied to content analysis. If you are a student, consider applying to the Lancaster Summer Schools in Corpus Linguistics. They have a reputation for giving students a fantastic experience.

Books on the theory and methodology of Corpus Linguistics

Corpus Linguistics by Tony McEnery and Andrew Hardie is a perfect intro to the field. OK, it is not the most exciting book on earth, because it has to deal with questions of data sources and ethics. It shines when it describes use cases in neo-Firthian/functional and cognitive linguistics – but don’t be afraid of those very technical terms! This is a textbook, so it explains everything that you need to know about the topics.

Oakes’s Statistics for Corpus Linguistics is our favorite book from the field. We first used it as a textbook during our studies in the early 2000s, and we have often opened it as a reference book since then.

Software tools for non-programmers

Laurence Anthony’s AntConc was the one and only free and comprehensive corpus analysis toolkit for non-programmers. The accompanying YouTube tutorials are the best resources for learning how to use it in practice. We’ve been using AntConc for years now. Although its user interface is spartan, we have learned to love it, since we haven’t found a better tool yet.
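If you are curious what a concordancer such as AntConc does at its core, here is a minimal sketch of a keyword-in-context (KWIC) listing in plain Python. It is only an illustration and not how AntConc is implemented: the function names `tokenize` and `kwic`, the naive regex tokenizer, and the sample sentence are all our own assumptions.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, English-only)."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens, keyword, window=3):
    """Return a (left context, node, right context) tuple for every hit of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

# Hypothetical toy "corpus" for demonstration only.
sample = "The cat sat on the mat. The dog chased the cat across the yard."

for left, node, right in kwic(tokenize(sample), "cat"):
    # Right-align the left context so the node word lines up in a column.
    print(f"{left:>25} | {node} | {right}")
```

A real tool adds sorting, regular-expression search, and proper tokenization on top, but the aligned-context idea is the same.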

#LancsBox: Lancaster University corpus toolbox is “a new-generation software package for the analysis of language data and corpora developed at Lancaster University.” Developed by the best corpus linguistics research center, #LancsBox seems to be the heir apparent to AntConc. Its interface is more user-friendly and its functionality is more versatile. We especially love its collocation network visualization capabilities.
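To give a feel for the data behind a collocation network, here is a rough sketch that counts which words co-occur with a node word inside a small window. This is not #LancsBox’s algorithm: the function name `collocates`, the window size, and the toy sentence are our own assumptions, and we use raw co-occurrence counts where real tools typically rank collocates with an association measure.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count the words that appear within `window` tokens of each hit of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts

# Hypothetical toy data, already tokenized by whitespace.
tokens = "strong tea and strong coffee but powerful computer and strong tea".split()

# Each (node, collocate, count) triple can be read as a weighted edge of a network.
for word, freq in collocates(tokens, "strong", window=1).most_common():
    print(f"strong -- {word} ({freq})")
```

Repeating the count for each collocate in turn, and joining the resulting edges, is what turns a flat collocate list into a network.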

Practical Programming for Corpus Linguistics

God knows why, but corpus linguists prefer the R programming language, so here we list the best sources to learn R and corpus linguistics hand in hand.

R. Harald Baayen is one of the early pioneers of quantitative linguistics. His Analyzing Linguistic Data is an excellent introduction to corpus/quantitative methods and to programming with R. This book came out in 2008 and shows its age now, so we don’t recommend it to complete beginners in R.

If you read only one book on corpus linguistics, and you are not afraid of coding, Gries’s Quantitative Corpus Linguistics with R should be that book. Gries is an exceptional teacher who has written a pedagogically brilliant textbook. It helps you acquire the skills needed to analyze linguistic data step by step, with lucid explanations at every stage. Read our interview with Gries from 2010 on our previous blog.

Written in the same vein as Quantitative Corpus Linguistics, Gries’s Statistics for Linguistics with R introduces the main statistical methods and their use in linguistics. Just like Baayen’s book, this one covers topics in both corpus and quantitative linguistics. Although it is a masterpiece, we only recommend it to those who have a strong interest in linguistics.


  • The header image was generated for the meetup on visualizing linguistic data. If you speak Hungarian, you can read more about it here.
  • Each book cover was downloaded from Amazon via Google Image Search.