Dyck Languages
There’s an older programming job interview question:
Given a string of round, curly, and square open and closing brackets, return whether the brackets are balanced (well-formed).
For example, given the string “([])”, you should return true.
Given the string “([)]” or “((()”, you should return false.
I got this from the old Daily Coding Problem email list as “Daily Coding Problem: Problem #712 [Easy]”, and I’m certain those folks sent it out several times while they were a going concern.
When I initially tried this problem, I thought it was a coding problem that had been around for a while. I thought it was one of the coding problems that has a clever and a non-clever solution, and you had to have seen the clever solution before, because it isn’t obvious. In that way, I thought it was similar to the Xor linked list coding problem.
I googled for the “Given a string of round, …” problem definition. I found out that deciding whether strings of parentheses, brackets, braces, and so on are balanced is the problem of recognizing words in Dyck Languages. It’s a well-studied problem with applications in a lot of fields.
Definition
The definitions of Dyck Languages are almost comical in their complexity.
Formal Language
Wikipedia says:
Let Σ = { ‘[’, ‘]’ } be the alphabet consisting of the symbols ‘[’ and ‘]’. Let Σ* denote its Kleene Closure. The Dyck Language is defined as:
{u ∈ Σ* | all prefixes of u contain no more ‘]’ than ‘[’, and the number of ‘[’ in u equals the number of ‘]’}
The “all prefixes” clause gets written a lot of ways by different people. Wolfram MathWorld has a particularly twisty variation. Planet Math has a definition that looks like this but for multiple types of parens/braces/brackets.
The above defines the Dyck-1 language, which has only matching balanced parentheses: no square brackets, braces, or other symbols. The coding problem is an example of a “typed” Dyck language, according to Wikipedia.
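The one-counter idea behind that definition fits in a few lines of Go. This is just a sketch of the definition, not any of my programs, and the function name `isDyck1` is made up for illustration:

```go
package main

import "fmt"

// isDyck1 reports whether s is a word of the Dyck-1 language:
// every prefix has at least as many '(' as ')', and the totals are equal.
func isDyck1(s string) bool {
	count := 0
	for _, c := range s {
		switch c {
		case '(':
			count++
		case ')':
			count--
			if count < 0 {
				return false // some prefix has more ')' than '('
			}
		default:
			return false // character outside the alphabet
		}
	}
	return count == 0 // equal numbers of '(' and ')'
}

func main() {
	fmt.Println(isDyck1("(()())")) // true
	fmt.Println(isDyck1("((()"))   // false
}
```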
Abstract Algebra
The (λ-free) Dyck languages Dn (Dn+) on the alphabet An = {a1, a1’, …, an, an’} are the sets of all (nonnull) words in the free semigroup generated by An which reduce to the identity λ, the null word, under the sole relations ai ai’ = λ, 1 ≤ i ≤ n
That’s from Language Recognition by Marking Automata, R. Ritchie, F. Springsteel, Information and Control, 1 May 1972. That’s the only paper I can find about “marking automata”; marking automata must have been a dead end.
There are two parallel definitions in that: a definition of Dn, which includes the zero-length word, and the λ-free Dn+, which does not. They use λ to denote a zero-length word, where the formal language folks use ε in similar circumstances.
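The relation ai ai’ = λ suggests an algorithm directly: keep deleting adjacent matched pairs, and accept if the word reduces to the empty string. A quadratic-time sketch of that idea in Go (illustrative only, with a made-up function name):

```go
package main

import (
	"fmt"
	"strings"
)

// reducesToEmpty repeatedly deletes adjacent matched pairs,
// applying the relation ai ai' = λ until nothing changes.
// A word is in the (typed) Dyck language iff it reduces to "".
func reducesToEmpty(s string) bool {
	for {
		next := s
		for _, pair := range []string{"()", "[]", "{}"} {
			next = strings.ReplaceAll(next, pair, "")
		}
		if next == s { // no reduction applied this pass
			return s == ""
		}
		s = next
	}
}

func main() {
	fmt.Println(reducesToEmpty("([])")) // true
	fmt.Println(reducesToEmpty("([)]")) // false
}
```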
Context Free Grammar
The grammar has 6 tokens ‘(’, ‘)’, ‘{’, ‘}’, ‘[’, ‘]’. The grammar has a few productions:
- start → PAIRLIST
- PAIRLIST → ε | PAIR | PAIRLIST PAIR
- PAIR → ‘(’ PAIRLIST ‘)’ | ‘[’ PAIRLIST ‘]’ | ‘{’ PAIRLIST ‘}’
That works for the coding problem above. You could certainly have more pairs of parenthesis-like symbols: angle brackets, say, or guillemets, or opening and closing “smart quotes”.
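A recognizer for this grammar is usually written as a stack-based scan. This is a generic sketch of that shape (all names are illustrative), not any particular program of mine:

```go
package main

import "fmt"

// pairs maps each closing character to its opening partner.
var pairs = map[rune]rune{')': '(', ']': '[', '}': '{'}

// balanced reports whether s is derivable from the PAIRLIST grammar.
func balanced(s string) bool {
	var stack []rune
	for _, c := range s {
		switch c {
		case '(', '[', '{':
			stack = append(stack, c) // expect the matching closer later
		case ')', ']', '}':
			if len(stack) == 0 || stack[len(stack)-1] != pairs[c] {
				return false // closer without a matching opener on top
			}
			stack = stack[:len(stack)-1] // pop the matched opener
		default:
			return false // character outside the six tokens
		}
	}
	return len(stack) == 0 // ε production: every opener was closed
}

func main() {
	fmt.Println(balanced("([])")) // true
	fmt.Println(balanced("([)]")) // false
}
```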
Implementations
Naturally, I have tried this problem a few times:
Code repo 1 is my 2020 first cut at the problem. My interview analysis is still good, but now I think the interviewer has to be prepared for some nerd to look at the problem and give an oral dissertation on Dyck Languages.
Code repo 2 has four programs in it:
- reduction.go finds adjacent matching pairs of parens/braces/brackets, then cuts them out of the original string. This is exactly Ritchie and Springsteel’s definition of a Dyck language as abstract algebra.
- counter.go has a single counter, and recognizes Dyck-1 words. It iterates over the characters of the original string, left to right. For every ‘(’ character, it increments the counter; for every ‘)’ character, it decrements the counter. If the counter ever becomes negative, there are too many ‘)’ characters. If the counter is not 0 (zero) after the program has examined all characters, there are too many ‘(’ characters. The left-to-right iteration examines all prefixes of the putative Dyck-1 word, as the Wikipedia definition describes: the count staying non-negative ensures that no prefix contains more ‘)’ than ‘(’, and a zero count at the end of the string ensures the same number of ‘(’ and ‘)’ characters.
- balanced2.go: if you squint hard enough, you can see a formal language recognizer for the context free grammar (above) in this code, albeit a primitive one. This is the solution you find most often by googling the coding problem statement; Wikipedia hints at it on the Dyck Language page. This version handles arbitrary numbers of arbitrary pairs of matching characters.
- depth.go: Ritchie and Springsteel’s 1972 paper suggested this algorithm in their sketch proof of Theorem 2, although the “expanded” proof of Theorem 2 in the appendix doesn’t describe it at all. The only place I can find a similar implementation is a Stack Exchange answer. Bjartur Thorlacius didn’t get the correct-answer check mark, and he’s only accumulated 2 points since 2017. No good deed goes unpunished.
My depth.go program relies on the observation that an opening character
increases the “depth” of nesting, and a closing character decreases the depth
of nesting.
Balanced parens/braces/brackets have an opening character
and the matching closing character at the same depth of nesting.
You can imagine [{}([])] (a balanced Dyck Language word) having nesting
depth like this:
1 [ ]
2 { } ( )
3 [ ]
The program iterates left-to-right across the putative Dyck word’s characters. When it finds an opening character, it saves the nesting depth and the matching closing character. It continues to iterate until it finds the next character at the same depth. If that same-depth character isn’t the matching closing character, the putative Dyck word isn’t a Dyck word and is unbalanced.
For putative Dyck word ((([{}{}{()})))), the nesting depth works out like this:
1 ( )
2 ( )
3 ( )
4 [ )
5 {}{}{ }
6 ()
The mismatch occurs at nesting depth 4: [ does not match )
The American Mathematical Society Blog has an interesting visualization of nesting depth.
You could view this as a variation on my third program, balanced2.go. Instead of pushing an opening character onto a stack, the program keeps track of nesting depth. Finding a character at the same nesting depth is analogous to finding a closing character that matches the top of the stack: all the intervening characters have been pushed (current depth incremented) and then popped (closing characters found and matched against the top of the stack).
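A sketch of the depth idea in Go (illustrative names, an O(n²) inner scan, and not the actual depth.go): compute each character’s nesting depth in one pass, then check that the next same-depth character after each opener is that opener’s matching closer.

```go
package main

import "fmt"

// closers maps each opening character to its matching closing character.
var closers = map[rune]rune{'(': ')', '[': ']', '{': '}'}

// balancedByDepth checks a word by nesting depth: the next character at
// the same depth as an opener must be that opener's matching closer.
func balancedByDepth(s string) bool {
	runes := []rune(s)
	depths := make([]int, len(runes))
	d := 0
	for i, c := range runes {
		switch c {
		case '(', '[', '{':
			d++
			depths[i] = d // opener sits at the new, deeper level
		case ')', ']', '}':
			depths[i] = d // closer sits at the same level as its opener
			d--
			if d < 0 {
				return false // too many closers in some prefix
			}
		default:
			return false
		}
	}
	if d != 0 {
		return false // unclosed openers remain
	}
	for i, c := range runes {
		want, isOpener := closers[c]
		if !isOpener {
			continue
		}
		for j := i + 1; j < len(runes); j++ {
			if depths[j] == depths[i] { // next character at the same depth
				if runes[j] != want {
					return false // e.g. '[' answered by ')'
				}
				break
			}
		}
	}
	return true
}

func main() {
	fmt.Println(balancedByDepth("[{}([])]"))         // true
	fmt.Println(balancedByDepth("((([{}{}{()}))))")) // false, mismatch at depth 4
}
```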
Learning Considerations
Grinding leetcode or doing many “Daily Coding Problems” seems to mostly promote a surface-level understanding. You get code for a single toy problem, isolated from all other knowledge, divorced from learning.
I fell into this trap in 2020. I didn’t follow up on this problem at all. I found a reference, Language Recognition by Marking Automata, by idly googling the coding problem prompt 5 years after I first did the problem. That paper was mentioned in the answers to a Stack Exchange question. Trying to understand that paper led me to Dyck Languages, and a true third algorithm.
It appears that some coding problems can increase understanding, but you’ve got to dig deep. In the case of this problem, I did several things:
- Generalize: I rewrote my programs to use more types of parens.
- Explore: look for and understand other algorithms, read references.
- Discover: find theoretical justifications for the algorithms, and similarities in implementations.
Software Engineering Aspects
Line counts aren’t a great measure of difficulty or complexity, but for simple programs, they’re what we’ve got.
| Program | Line count |
|---|---|
| counter.go | 36 |
| balanced2.go | 55 |
| reduction.go | 66 |
| depth.go | 86 |
depth.go was the hardest algorithm to figure out.
I was surprised to notice that balanced2.go and reduction.go
(without comments) had about the same line count.
Reduction of matched pairs to zero-length strings had seemed heavy-handed
and simplistic.
The line count says otherwise.