Verifying Veritasium's Markov Chains

The Veritasium chap made a video about Markov Chains. Veritasium outlines Markov’s original text-based experiment. The experiment seems like something that could be educational to duplicate.

As Veritasium described Markov’s original experiment, the experimenter chooses a sample of text, then:

  1. Removes all punctuation, including newlines.
  2. Classifies each letter as a vowel or a consonant.
  3. Counts “overlapping pairs” of classified letters.
  4. Determines probabilities of categories of pairs.

Take this as the text sample:

The experiment seems like something that could be educational to duplicate.

Step 1: remove any non-letters:

Theexperimentseemslikesomethingthatcouldbeeducationaltoduplicate

Step 2: classify as vowel or consonant

CCVVCCVCVCVCCCVVCCCVCVCVCVCCVCCCCVCCVVCCCVVCVCVCVVCVCCVCVCCVCVCV

Step 3: break into overlapping pairs

The first 10 “overlapping pairs” are:

CC, CV, VV, VC, CC, CV, VC, CV, VC, CC

Step 4: Count the overlapping pairs

There are 64 letters in the example, so there are 64 total pairs.

Pair Count Probability
CC 16 0.4324
CV 21 0.5676
VV 5 0.1852
VC 22 0.8148

Step 5: Draw a transition diagram

Each pair transforms into a transition from a state (C or V) to a state, which might be the same state.

Example text’s transition diagram

The probability for each pair gets calculated like this:

For the pair CC, that’s a transition while in C state back to C state. 16/(16+21) = 0.4324 . Each state should have a sum of exit probabilities of 1.00.

Andrey Markov must have been a very smart man. This is quite a procedure to invent.

Try the procedure

I think there are two questions to try to answer.

  1. For a given ratio of consonants-to-vowels, does a randomly-generated sequence of letters have the same CC, CV, VV, VC probabilities as human words?
  2. For the transition diagram calculated for a sample text, can the transition diagram be used to generate text that has overlapping pair probabilities the same or very similar to those of the original text?

Naturally, I wrote some programs to do all the counting, random letter choice, and “executing” the transition diagrams.

Code repo

Question 1 - overlapping pairs probabilities

I’m going to try the contents of all the markdown files of this very blog. I’m going to generate randomly-selected letters based on the consonant-to-vowel ratio of the markdown files’ contents.

Blog markdown Generated letters
Total letters 1293957 1293957
Consonants 813080 (0.6284) 812783 (0.6281)
Vowels 480877 (0.3716) 481174 (0.3719)
CC 398057 (0.4896) 510680 (0.6283)
CV 415023 (0.5104) 302103 (0.3717)
VV 65854 (0.1369) 179070 (0.3722)
VC 415023 (0.8631) 302104 (0.6278)
transition diagram for markdown files contents transition diagram for generated file

The generated letters column is derived entirely from the total letters, consonants, and vowel counts of the Blog markdown column. The generation matched the total letters in the markdown files, and the proportions of consonants and vowels to 3 significant places, which means the random selection of letters works.

The transitions of the randomly selected letters closely match the proportions of vowels and consonants. The randomly selected text has 63% consonants, 37% vowels. About 63% of each transition (from C or V) is to C. About 37% of each transition (from C or V) is to V. The transitions work out.

I used a given ratio of consonants-to-vowels, 813080/480877 = 1.6908. A randomly-chosen sequence of letters that has the consonant-to-vowel ratio of 1.6908 does not have the same CC, CV, VV, VC probabilities as human words.

Question 2 - transition diagram generating text

Here’s the transition diagram for the text in this blog’s markdown files:

transition diagram for markdown files contents

I wrote a small program that runs a state machine with two states, C and V. Based on the Golang built-in random number generator, the program decides whether to transition from C state to V state 51.04% of the transitions, and from C state back to C state 48.96% of the time. Similar for the transitions out of the V state. The program outputs a vowel (‘a’) when it transitions to the V state, and a consonant (‘b’) when it transitions to the C state. Running the program for the same number of transitions as there are letters in the markdown files will let me count overlapping pairs.

Because there’s a random number generator involved, individual runs of the transition diagram program won’t be exactly identical. Here’s the results of one run:

Blog markdown Transition diagram generated text
Total letters 1293957 1293957
Consonants 813080 (0.6284) 813513 (0.6287)
Vowels 480877 (0.3716) 480444 (0.3713)
CC 398057 (0.4896) 398784 (0.4902)
CV 415023 (0.5104) 414729 (0.5098)
VV 65854 (0.1369) 65715 (0.1368)
VC 415023 (0.8631) 414729 (0.8632)

A typical run of the transition diagram program converges to a consonant-to-vowel ratio at about 11,000 transitions. Veritasium says Andrey Markov used 20,000 characters. Markov probably got a converged ratio, but it must have been tedious.

A program that can “execute” a text sample’s transition diagram can generate more text that has overlapping pair probabilities that are almost the same as those of the original text.

Overall, it appears that Veritasium outlined a valid procedure, and is telling the truth about the transition diagram.