Text files are a very versatile and convenient data format. Linux offers a set of tools that make it easy to manipulate them and are a great addition to your arsenal. Furthermore, with the Windows Subsystem for Linux these are readily available on Windows.
I didn’t use Linux much for a long time. Those were the old Microsoft days when it was all about Windows. But a few years ago I decided to make a change and worked for Facebook for two and a half years. There I got re-acquainted with the Linux shell and worked extensively with the tools I’ll discuss. I’m now back at Microsoft (why?) and still use these tools all the time!
All of these tools take their input from either a file name or the standard input and print results to the standard output. This makes them very easy to combine through the shell pipe operator (|). For the examples below we’ll be using the Jeopardy dataset, converted to TSV.
Cut
Cut lets you slice "vertically" through data. A simple case is selecting a subset of columns out of a TSV file, but cut also supports custom separators and even character ranges as we’ll see below.
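As a minimal sketch of those last two features, here is cut run against a small colon-separated file (the file and its contents are made up just for illustration):

```shell
# Create a tiny colon-separated file for illustration
printf 'alice:x:1000\nbob:x:1001\n' > users.txt
cut -d ':' -f 1 users.txt   # -d sets the separator; prints: alice, bob
cut -c 1-3 users.txt        # character range; prints: ali, bob
rm users.txt
```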
Head/Tail
These are two complementary utilities that let you print the first (head) or last (tail) n lines from a file. By default these will take 10 lines but you can specify a different number.
Now using cut and head we can display the first three columns and 5 rows:
$ cut -f 1-3 jeop.tsv | head -5
Show Number Air Date Round
4680 12/31/2004 Jeopardy!
4680 12/31/2004 Jeopardy!
4680 12/31/2004 Jeopardy!
4680 12/31/2004 Jeopardy!
Hmm, but what if we don’t want the header? Tail has a handy option for starting output at a given line; -n +2 starts at line 2:
$ cut -f 1-3 jeop.tsv | head -5 | tail -n +2
4680 12/31/2004 Jeopardy!
4680 12/31/2004 Jeopardy!
4680 12/31/2004 Jeopardy!
4680 12/31/2004 Jeopardy!
Another very useful option in tail is -f or ‘follow’. In this mode tail will continuously monitor a file and print any lines that are added to it, so it can be used to watch a log file that another process is generating.
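A minimal sketch of following a log file (the path here is hypothetical; press Ctrl-C to stop following):

```shell
# Continuously print new lines as they are appended to the file;
# piping into grep surfaces only matching lines as they arrive.
# The log path is made up for illustration.
tail -f /var/log/app.log | grep -i "error"
```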
Grep
Grep lets you evaluate regular expressions over a body of text. Let’s try to find questions about sports:
$ cut -f 6 jeop.tsv | grep -E " sports? " | head -5 | cut -c 1-80
This sport has an under-17 World Cup every 2 years; Haris Seferovic starred for
Ronaldo Luiz Nazario de Lima began playing this sport for Brazil's national team
In 1986 Mexico scored as the first country to host this international sports com
"Named for a sport that embodies high society, this Ralph Lauren co. was hacked
They're the sports teams of Fresno State as well as Georgia
First we cut to get the question column (6), then apply the regular expression, and finally display the first 80 characters (cut -c) of the first 5 matches (head -5).
wc
wc stands for word count, and by default it displays the number of newlines, words, and bytes in a file:
$ wc jeop.tsv
216931 5156810 33633170 jeop.tsv
A very useful switch is -l, which outputs just the number of lines. Combining with our previous example we can find how many questions contain ‘sport’ or ‘sports’:
$ cut -f 6 jeop.tsv | grep -E " sports? " | wc -l
603
Sort
This one is pretty self-explanatory: it sorts its input line by line. We can take a peek at the questions sorted by show number:
$ sort -n jeop.tsv | cut -f 1-4 | head -5
Show Number Air Date Round Category
1 9/10/1984 Double Jeopardy! 4-LETTER WORDS
1 9/10/1984 Double Jeopardy! 4-LETTER WORDS
1 9/10/1984 Double Jeopardy! 4-LETTER WORDS
1 9/10/1984 Double Jeopardy! 4-LETTER WORDS
Here -n instructs sort to interpret the leading string of each line as a number for ordering purposes.
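Sort can also key on a specific column with -k, using -t to set the field separator. A small sketch on a made-up tab-separated file:

```shell
# Create a tiny TSV for illustration and sort it by its
# second column, numerically
printf 'b\t3\na\t1\nc\t2\n' > scores.tsv
sort -t "$(printf '\t')" -k 2 -n scores.tsv
rm scores.tsv
```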
Uniq
Given a sequence of lines, uniq prints only unique occurrences, collapsing adjacent duplicates. My favorite flag is -c, which outputs the count for each value. Note that uniq requires its input to be sorted, since only adjacent duplicates are collapsed. We can use it to find the top 10 categories:
$ cut -f 4 jeop.tsv | sort | uniq -c | sort -r -n | head -10
547 BEFORE & AFTER
519 SCIENCE
496 LITERATURE
418 AMERICAN HISTORY
401 POTPOURRI
377 WORLD HISTORY
371 WORD ORIGINS
351 COLLEGES & UNIVERSITIES
349 HISTORY
342 SPORTS
Sed
Sed (stream editor) is most commonly used for string replacement operations, through its s (substitution) command:
sed "s/<search regex>/<replacement>/g" file.txt
The final g instructs sed to replace all occurrences of the regex on each line, not just the first one. For example, I could replace all double-double quotes in the Jeopardy questions, such as:
"In 1963, live on ""The Art Linkletter Show"", this company …"
with single quotes:
$ sed "s/\"\"/'/g" jeop.tsv | head -5 | cut -f 6 | cut -c 1-80
Question
"For the last 8 years of his life, Galileo was under house arrest for espousing
"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons wi
"The city of Yuma in this state has a record average of 4,055 hours of sunshine
"In 1963, live on 'The Art Linkletter Show', this company served its billionth b
sed has many other uses and options, such as operating only on certain lines or even adding/deleting lines from a file.
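As a quick hedged sketch of that line addressing, here is sed deleting and substituting by line number on a throwaway file (created here just for illustration):

```shell
# Create a tiny file for illustration
printf 'header\nrow1\nrow2\n' > demo.txt
sed '1d' demo.txt            # delete line 1 (drop the header)
sed '2s/row/ROW/' demo.txt   # substitute only on line 2
rm demo.txt
```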
I hope this has whetted your appetite, and I encourage you to explore the various options and uses for these tools. They are powerful, fun, and should be part of every engineer’s tool set!