Rob Pike's Simple Regular Expressions

I came across an article by Brian Kernighan. The article is about a piece of C code written by Rob Pike, apparently for a book Pike and Kernighan wrote together, The Practice of Programming.

The code implements a subset of regular expression text matching:

c    matches any literal character c
.    matches any single character
^    matches the beginning of the input string
$    matches the end of the input string
*    matches zero or more occurrences of the previous character

I believe Kernighan and Pike also used the code in an article in the legendary Dr Dobbs Journal.

I had read The Practice of Programming book, and seen the article back in the day. Discovering Kernighan’s article compelled me to try the code, because I firmly believe you have to do something to understand that thing.

In the case of source code and computer stuff, I will re-type what appears on line, so that I can understand it on a line-by-line basis.

I often do minor modifications to ensure I’m really understanding the code. Maybe I use SQLite instead of MySQL, or add an operation. This time, I decided to use Go instead of C. I also cut-n-pasted the C code into it’s own file, so I could see if some constructs (.*, a*b*) worked in the original.

Porting to Go

My source code repo.

Sometimes C code can easily transliterate to Go and vice versa, but this is not one of them. Pike’s code uses lots of C idioms, like == '\0' for end-of-string, and *text++, which even the best of us have to think hard about. It also uses two do-while loops, which are not part of Go. I had to do something different for both do-while loops. I believe I wrote the Go code in such a way that UTF-8 strings will work, not just ASCII one-byte-per-letter strings.

I got a wild hair and wrote Go unit testing that works with go test. I felt compelled to put all kinds of cases in table-driven testing, which uncovered a bug in my first cut at a Go version. Originally .* did not work, and two Kleene closures in a row failed (ab*c*d would not match abbbccccd). Part of the problem involved the transliteration of a do-while loop condition involving *text++. I had a traditional for-loop with an index variable, and I “advanced” the text to be matched via text = text[1:]. The interaction was incorrect.

Pike’s (ca 1999) C was compact, even by the C standards of the day. I count 32 lines including comments and 3 empty lines between functions. My Go transliteration is about 48 lines. Most of the extra lines are due to Go’s syntax requiring curly braces around the clauses of if statements, even for one line clauses. Pike’s code only has curly braces around the body of the do-while loops.