TF-IDF

I read an article about Term Frequency-Inverse Document Frequency on John D. Cook Consulting’s blog.

I decided to understand Term Frequency-Inverse Document Frequency by writing code to calculate TF-IDF myself. I needed some text divided into “documents” or “books” or something. John D. Cook used the now public domain King James Version Bible. I used this blog. I took each published post as an individual document.

I also decided to do all the TF-IDF calculation and text manipulation in bash shell scripts. I’ve written enough programs in enough languages to know that shell scripts are almost always good enough, and their flexibility helps when you know how to program, but don’t know exactly what to program.


Code repo


I had some difficulties, both technical and conceptual.

Technical difficulty - text from Hugo markdown format files

This blog is a set of static files generated by Hugo software. Hugo blog posts are text files, containing pieces in two different formats. The front matter in my particular blog is YAML format, a human-friendly data serialization format. The rest of the blog post, the actual text of a post, is a semi-structured plain text format, markdown. I don’t want to consider words in the front matter, and I want to remove the characters and strings used to denote formatting in markdown.

I looked around, considered using a bunch of sed stream editor text modifications, but settled on pandoc. pandoc has worked well for me in other endeavors, and I already had it installed. Running pandoc on all 346 (at the time) blog post markdown files was a little time-consuming. I intended to do all the text manipulation and calculation in a single script file, but the almost 2 minutes it took to run pandoc on every post made me break the single large script into 3 smaller scripts.
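As a minimal sketch, the pandoc step might look something like the loop below. The directory names and the sample post are mine, not from the original scripts; pandoc’s markdown reader treats the leading YAML block as metadata, which is why the front matter drops out of the plain-text output.

```shell
# Sample post with YAML front matter, standing in for a real blog post
mkdir -p posts plaintext
printf -- '---\ntitle: sample\n---\n\nHello world from markdown.\n' > posts/sample.md

# Skip gracefully if pandoc isn't installed
command -v pandoc >/dev/null 2>&1 || exit 0

# Convert each post to plain text; the YAML metadata block does not
# appear in the plain writer's output, only the post body does.
for f in posts/*.md; do
    pandoc -f markdown -t plain "$f" -o "plaintext/$(basename "$f" .md).txt"
done
```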

Conceptual difficulty - understanding what “term frequency” means

I initially did not read the Term Frequency-Inverse Document Frequency article closely enough. “Term frequency” is the relative frequency of term t within document d. I missed the “within document d” part. I wrote calculations that used a count of occurrences of a word and a count of all words across every post. Dividing the count of occurrences by the count of all words yields the overall term frequency, not the frequency within a particular document. Frequencies of terms or words over my entire 346 blog posts are pretty small, which caused some arithmetic problems in shell scripts, as well as giving nonsensical “important” terms. I had to rewrite one of my scripts to do per-document frequencies.
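A per-document term frequency fits in a short pipeline. This is a sketch, not the original script; the sample document stands in for one converted post.

```shell
# Per-document term frequency: occurrences of a word divided by the
# document's total word count. doc.txt stands in for one converted post.
printf 'the cat and the dog\n' > doc.txt

tr -cs 'A-Za-z' '\n' < doc.txt |   # split into one word per line
  tr 'A-Z' 'a-z' |                 # fold case
  sort | uniq -c |                 # count occurrences of each word
  awk '{count[$2] = $1; total += $1}
       END {for (w in count) printf "%s %.6f\n", w, count[w] / total}'
# "the" appears 2 times out of 5 words, so its line reads: the 0.400000
```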

The Benefits of Shell Scripts

I still believe in shell scripts. The Linux shells have some goofy syntax, and they emphasize processing text files, but in this case, that’s exactly what’s required.

File handling and file I/O are vastly simpler in these shell scripts than they would be if written in any other programming language. It was quite easy to break up the calculations into multiple files once I made the decision to do that.

Shell scripts are succinct: the 3 files are only 66 lines total, which includes some boilerplate and some blank lines.

I think there are a few aspects of note in my code.

The scripts do counting 3 different ways: uniq -c, wc -l, and a small awk script, awk '{sum += $1} END {print sum}'. uniq -c counts runs of sequentially identical strings, wc -l counts items where each item is a line in a file, and the awk script sums a stream of text representations of numbers.
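The three idioms, each run on a toy input rather than the real blog data:

```shell
# uniq -c counts runs of adjacent identical lines (hence the sort first)
printf 'dog\ncat\ndog\n' | sort | uniq -c        # 1 cat, 2 dog

# wc -l counts lines, i.e. items in a one-item-per-line stream
printf 'dog\ncat\ndog\n' | wc -l                 # prints 3

# the awk one-liner sums a stream of numbers
printf '1\n2\n3\n' | awk '{sum += $1} END {print sum}'   # prints 6
```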

I used join where I might otherwise use a map Go type, or a Perl/PHP/awk associative array. In this case, I wanted join to match the terms in the IDF file with terms in per-document TF files, associating two numerical values with a single term. Other times, I’ve wanted join to match all terms in both files, where a line missing from one file or the other indicates a problem. Shell scripting is still an art, and requires imagination.
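As a sketch with made-up terms and values, joining an IDF table against one document’s TF table looks like this. join requires both inputs sorted on the join field, and by default emits only lines whose key appears in both files.

```shell
# Two tables keyed by term; file names and values are made up for illustration.
printf 'cat 2.0\ndog 1.0\n' | sort > idf.txt    # term, idf value
printf 'cat 0.5\n'          | sort > tf.txt     # term, tf value in one document

# join matches lines on the first field; awk multiplies tf by idf
join idf.txt tf.txt | awk '{printf "%s %.6f\n", $1, $2 * $3}'
# prints: cat 1.000000
```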

Conclusions

TF-IDF ratings do pick out the words that seem important to me.

“hawkeye” is the highest TF-IDF word in my post about M*A*S*H.

“fortune” is a word with high TF-IDF in my fortune cookies posts, which is appropriate.

“list” or “lists” or “linked” is the highest TF-IDF word in my mergesort posts, also quite appropriate.

I’m not completely certain what one would do with these per-document important words. Maybe the top N words from each document could serve as part of an index.