A simple Markov chain text generator in Python

What does it mean to build a language model? How is it related to the Beat Generation’s cut-up technique, and how can it be used for text generation? We’ll shed light on these questions using Python!

Pick a book any book cut it up
cut up
prose
poems
newspapers
magazines
the bible
the koran
the book of moroni
lao-tzu
confucius
the bhagavad gita
anything
letters
business correspondence
ads
all the words
what you can do
cut it up slice down the middle dice into sections
according to taste
chop in some bible pour on some Madison Avenue
prose
shuffle like cards toss like confetti
taste it like piping hot alphabet soup

Brion Gysin: Minutes to Go (in Burroughs-Gysin: The Third Mind)

In the summer of 1959 Brion Gysin painter and writer cut newspaper articles into sections and rearranged the sections at random. “Minutes to Go” resulted from this initial cut-up experiment. “Minutes to Go” contains unedited unchanged cut-ups emerging as quite coherent and meaningful prose. […] The cut-up method brings to writers the collage, which has been used by painters for fifty years.

William S. Burroughs: The Cut-up Method of Brion Gysin (in Burroughs-Gysin: The Third Mind)

The most basic text generation technique is the cut-up. Although it is usually ascribed to the Beat Generation, earlier movements (like Dadaism) knew it well before the 1960s. As described above, it selects words randomly from one or more texts. However, pure randomness doesn’t yield good results. I’m quite sure that Gysin arranged the randomly selected words, so there is some structure behind the cut-up.

Just before the rise of the Beat Generation, Claude Shannon, the father of information theory, used his newly established science to generate texts! In his A Mathematical Theory of Communication, Shannon calls completely random series zero-order approximations. Putting randomly selected words after each other yields totally unintelligible lines. If one selects words according to their frequency in a huge corpus, the resulting text gets more natural. If we go further and take two-word, three-word or n-word sequences, we get better and better results. One can see these n-word sequences (or n-grams) as transitions from one word to the next. Do you remember the concept of conditional probability from school? What’s the probability that the sequence >>Don’t forget the<< will be followed by >>milk<<, i.e. P(milk|the)? We know that P(A|B) = P(A and B)/P(B), so we have to divide the frequency of >>the milk<< by the frequency of >>the<<. Historically, Markov was interested in questions like “What’s the probability that a randomly selected vowel is followed by another vowel in Pushkin’s Onegin?” His findings were published in a landmark paper titled An Example of Statistical Investigation of the Text Eugene Onegin Concerning the Connection of Samples in Chains (the link points to a password-protected version of the study).
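
To make this concrete, here is a tiny sketch of the bigram estimate over a few invented sentences (they are illustrative only, not taken from any corpus): we divide the count of >>the milk<< by the count of >>the<<.

```python
from collections import Counter

# Toy illustration of P(milk|the) = count("the milk") / count("the").
# The sentences are made up for the example, not drawn from a real corpus.
sentences = [
    "do not forget the milk",
    "the milk is on the table",
    "do not forget the keys",
]

unigram_counts = Counter(w for s in sentences for w in s.split())
bigram_counts = Counter(b for s in sentences
                        for b in zip(s.split(), s.split()[1:]))

print(bigram_counts[("the", "milk")] / unigram_counts["the"])  # 2 / 4 = 0.5
```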

In our case, we’d like to generate a text. We’ll start from a word that is either randomly selected or given by the user as a seed. The seed can be used to pick every bi-gram whose first element is the seed. We could select the second word of any such bi-gram at random, but it is more elegant to use the bi-gram frequencies for a weighted selection. We can repeat this procedure with the chosen word, even ad infinitum (though this wouldn’t be very practical). The method is easily extended to n-grams; let’s see how to implement it in Python!
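
As a warm-up, a single weighted step could be sketched like this; the toy frequency table and its numbers are placeholders, not values from a real model:

```python
import random

# Invented bigram relative frequencies, just for illustration.
bigram_freqs = {
    ("the", "milk"): 0.5,
    ("the", "keys"): 0.25,
    ("the", "table"): 0.25,
}

def next_word(seed, freqs):
    """Pick a continuation of `seed`, weighted by relative frequency."""
    candidates = [(second, p) for (first, second), p in freqs.items() if first == seed]
    words, weights = zip(*candidates)
    return random.choices(words, weights=weights, k=1)[0]

print(next_word("the", bigram_freqs))  # "milk" roughly half of the time
```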

We are going to use the GoodNewsEveryone corpus of English news headlines. Although the corpus contains not only the headlines but also their sentiments and other information, we only deal with the headlines here. First, we are going to clean up the headlines, e.g. strip all leading and trailing whitespace.
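
A possible clean-up step might look like the sketch below. It assumes the headlines have already been read into a list of strings; the exact file layout of the GoodNewsEveryone release may differ.

```python
def clean_headlines(raw_headlines):
    """Strip leading/trailing whitespace, collapse internal whitespace runs
    and drop empty lines. `raw_headlines` is assumed to be a list of strings."""
    cleaned = []
    for line in raw_headlines:
        headline = " ".join(line.split())
        if headline:
            cleaned.append(headline)
    return cleaned
```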

The next step is to build the language models. We are going to count the relative frequencies of words, bi-grams, tri-grams and four-grams. One big question arises here: what is a word? What should we do with punctuation marks? How should we treat contractions like >>don’t<<? We don’t answer these questions here. We use the word_tokenize function from the tokenize module of the NLTK toolkit, since its implementers spent lots of time answering these questions and put the answers into a consistent and well-tested function. We use Counter from the collections module to get the raw frequencies, and we turn them into relative frequencies using dictionary comprehensions.
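
A sketch of the counting step could look like this; the function and variable names are ours, not necessarily those of the repository, and word_tokenize may require downloading the NLTK punkt data once.

```python
from collections import Counter

from nltk.tokenize import word_tokenize  # may need nltk.download("punkt") first

def ngram_relative_freqs(headlines, n):
    """Count n-grams over the tokenized headlines and turn the raw counts
    into relative frequencies with a dictionary comprehension."""
    counts = Counter()
    for headline in headlines:
        tokens = word_tokenize(headline)
        counts.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(counts.values())
    return {ngram: count / total for ngram, count in counts.items()}

# word, bi-gram, tri-gram and four-gram models; `headlines` is the cleaned list
models = {n: ngram_relative_freqs(headlines, n) for n in (1, 2, 3, 4)}
```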

Having generated and saved the frequency tables, we have our language models and can start working on text generation! Our simple generator takes three arguments: the corpus, the seed and n. In this post we don’t deal with the Onegin and Alice corpora, but if you download our code, you can play with them. First, the function reads in the language model. The rest is very simple: it selects the n-grams starting with the seed. If the seed is not in the model, it returns a randomly selected word; otherwise it returns a weighted choice from the n-grams starting with the seed.
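
The sketch below shows one way such a generator could be written; the file naming and the JSON layout of the saved frequency tables are assumptions made for this example, not necessarily what our repository uses.

```python
import json
import random

def simple_generator(corpus, seed, n):
    """Return the next word after `seed` (a tuple of n-1 words) using the saved
    n-gram table of `corpus`. The file name pattern and the
    {"w1 w2 ... wn": relative_frequency} layout are assumptions."""
    with open(f"{corpus}_{n}gram.json", encoding="utf-8") as model_file:
        model = json.load(model_file)

    # keep the n-grams whose first n-1 words match the seed
    matching = [(key.split()[-1], freq) for key, freq in model.items()
                if tuple(key.split()[:-1]) == tuple(seed)]

    if not matching:
        # unseen seed: fall back to a randomly selected word
        return random.choice([key.split()[-1] for key in model])

    words, weights = zip(*matching)
    return random.choices(words, weights=weights, k=1)[0]
```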

If we are to generate texts using our language models, we need a seed. The seed should be one word shorter than the n of the n-gram model we’d like to use. We know that the average headline contains 13 words, and since we’d like to use a four-gram model, we need a seed of three words. We collect the return value of simple_generator into a list and call it again with the last three elements of this list until we have 13 words in our list.
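
Put together, the generation loop might look like this; the corpus name and the seed words are illustrative placeholders.

```python
# Feed the last three generated words back into the four-gram model
# until we reach the average headline length of 13 words.
seed = ["earlier", "gains", "rooms"]  # illustrative three-word seed
words = list(seed)
while len(words) < 13:
    words.append(simple_generator("headlines", tuple(words[-3:]), 4))
```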

The result of repeatedly calling simple_generator is a list containing words and punctuation marks. To get a well-formed sentence, we have to remove the extra whitespace between words and commas and other punctuation marks. This is called detokenization. We use the mosestokenizer package to detokenize the list and get our headlines.
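
With the mosestokenizer package, the call could look roughly like this (`words` is the token list built above):

```python
from mosestokenizer import MosesDetokenizer

# Glue the tokens back together, removing the extra spaces before punctuation.
with MosesDetokenizer("en") as detokenize:
    headline = detokenize(words)
print(headline)
```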

Let’s see a few examples using the bi-gram, tri-gram and four-gram models. The seeds start with the same word in each experiment.

Earlier gains 15 questions. you dont just did it misspent funds needed for release
Earlier gains rooms farages restoration topsoil well extremely reaction consumer alligator babson cemetery units
Earlier gains rooms journalist totally transporting photo improving steube stopgap steele philip stomach

Exceptions to address instead of herself look into force vote on effectiveness of 800 people
Exceptions to vote-exit premiere congresss reaffirms landmark rebel terrorists pleading waterways 1.35 rupert slice
Exceptions to vote-exit everything discovery grilled dolly homeownership typhoon dedication mundo rabalais crackdown

Visits after shunning guns seized at burger. ilhan omars racist former congressman recalls witnessing
Visits after scouts dollar aapl scandals myth owner grandma wife-killer matthijs contributed confronts buggies
Visits after scouts thief rattles television desire reviewer russell 13-year-old rikers yankees nurse

Our examples are far from perfect, but some of them could pass as cut-up-like sentences. The language models could be improved by using five-grams, six-grams or even longer n-grams. Another way is to use a bigger corpus. We could also fine-tune our language models by incorporating part-of-speech information. Next time we will go deeper into these questions, so stay tuned!

Resources

You can find our code on GitHub. The repo contains code for processing various other texts, e.g. Pushkin’s Onegin and Alice in Wonderland by Lewis Carroll. Warning! The code is designed for demonstration purposes and is neither efficient nor 100% Pythonic!

The cover image is a collage by Gysin. It was downloaded from this link.
