Verifying Veritasium's Markov Chains
The Veritasium chap made a video about Markov Chains. Veritasium outlines Markov’s original text-based experiment. The experiment seems like something that could be educational to duplicate.
As Veritasium described Markov’s original experiment, the experimenter chooses a sample of text, then:
- Removes all punctuation, including newlines.
- Classifies each letter as a vowel or a consonant.
- Counts “overlapping pairs” of classified letters.
- Determines probabilities of categories of pairs.
Take this as the text sample:
The experiment seems like something that could be educational to duplicate.
Step 1: remove any non-letters:
Theexperimentseemslikesomethingthatcouldbeeducationaltoduplicate
Step 2: classify as vowel or consonant
CCVVCCVCVCVCCCVVCCCVCVCVCVCCVCCCCVCCVVCCCVVCVCVCVVCVCCVCVCCVCVCV
Step 3: break into overlapping pairs
The first 10 “overlapping pairs” are:
CC, CV, VV, VC, CC, CV, VC, CV, VC, CC
Step 4: Count the overlapping pairs
There are 64 letters in the example, so there are 64 total pairs.
| Pair | Count | Probability |
|---|---|---|
| CC | 16 | 0.4324 |
| CV | 21 | 0.5676 |
| VV | 5 | 0.1852 |
| VC | 22 | 0.8148 |
Step 5: Draw a transition diagram
Each pair transforms into a transition
from a state (C or V) to a state,
which might be the same state.

The probability for each pair gets calculated like this:
For the pair CC, that’s a transition while in C state back to C state. 16/(16+21) = 0.4324 . Each state should have a sum of exit probabilities of 1.00.
Andrey Markov must have been a very smart man. This is quite a procedure to invent.
Try the procedure
I think there are two questions to try to answer.
- For a given ratio of consonants-to-vowels, does a randomly-generated sequence of letters have the same CC, CV, VV, VC probabilities as human words?
- For the transition diagram calculated for a sample text, can the transition diagram be used to generate text that has overlapping pair probabilities the same or very similar to those of the original text?
Naturally, I wrote some programs to do all the counting, random letter choice, and “executing” the transition diagrams.
Question 1 - overlapping pairs probabilities
I’m going to try the contents of all the markdown files of this very blog. I’m going to generate randomly-selected letters based on the consonant-to-vowel ratio of the markdown files’ contents.
| Blog markdown | Generated letters | |
|---|---|---|
| Total letters | 1293957 | 1293957 |
| Consonants | 813080 (0.6284) | 812783 (0.6281) |
| Vowels | 480877 (0.3716) | 481174 (0.3719) |
| CC | 398057 (0.4896) | 510680 (0.6283) |
| CV | 415023 (0.5104) | 302103 (0.3717) |
| VV | 65854 (0.1369) | 179070 (0.3722) |
| VC | 415023 (0.8631) | 302104 (0.6278) |
![]() |
![]() |
The generated letters column is derived entirely
from the total letters, consonants, and vowel counts
of the Blog markdown column.
The generation matched the total letters
in the markdown files,
and the proportions of consonants and vowels
to 3 significant places,
which means the random selection of letters works.
The transitions of the randomly selected letters closely match the proportions of vowels and consonants. The randomly selected text has 63% consonants, 37% vowels. About 63% of each transition (from C or V) is to C. About 37% of each transition (from C or V) is to V. The transitions work out.
I used a given ratio of consonants-to-vowels, 813080/480877 = 1.6908. A randomly-chosen sequence of letters that has the consonant-to-vowel ratio of 1.6908 does not have the same CC, CV, VV, VC probabilities as human words.
Question 2 - transition diagram generating text
Here’s the transition diagram for the text in this blog’s markdown files:

I wrote a small program that runs a state machine
with two states, C and V.
Based on the Golang built-in random number generator,
the program decides whether to transition
from C state to V state 51.04% of the transitions,
and from C state back to C state 48.96% of the time.
Similar for the transitions out of the V state.
The program outputs a vowel (‘a’) when it transitions
to the V state,
and a consonant (‘b’) when it transitions to the C state.
Running the program for the same number of transitions
as there are letters in the markdown files
will let me count overlapping pairs.
Because there’s a random number generator involved, individual runs of the transition diagram program won’t be exactly identical. Here’s the results of one run:
| Blog markdown | Transition diagram generated text | |
|---|---|---|
| Total letters | 1293957 | 1293957 |
| Consonants | 813080 (0.6284) | 813513 (0.6287) |
| Vowels | 480877 (0.3716) | 480444 (0.3713) |
| CC | 398057 (0.4896) | 398784 (0.4902) |
| CV | 415023 (0.5104) | 414729 (0.5098) |
| VV | 65854 (0.1369) | 65715 (0.1368) |
| VC | 415023 (0.8631) | 414729 (0.8632) |
A typical run of the transition diagram program converges to a consonant-to-vowel ratio at about 11,000 transitions. Veritasium says Andrey Markov used 20,000 characters. Markov probably got a converged ratio, but it must have been tedious.
A program that can “execute” a text sample’s transition diagram can generate more text that has overlapping pair probabilities that are almost the same as those of the original text.
Overall, it appears that Veritasium outlined a valid procedure, and is telling the truth about the transition diagram.
