Text files are a very versatile and convenient data format. Linux offers a set of tools that make it easy to manipulate them and are a great addition to your arsenal. Furthermore, with the Windows Subsystem for Linux these are readily available on Windows.
I didn’t use Linux much for a long time. Those were the old Microsoft days when it was all about Windows. But a few years ago I decided to make a change and worked for Facebook for two and a half years. There I got re-acquainted with the Linux shell and worked extensively with the tools I’ll discuss. I’m now back at Microsoft (why?) and still use these tools all the time!
All of these tools take their input from either a file name or the standard input and print results to the standard output. This makes them very easy to combine through the shell pipe operator (|). For the examples below we’ll be using the Jeopardy dataset, converted to TSV.
Cut
Cut lets you slice "vertically" through data. A simple case is selecting a subset of columns out of a TSV file, but cut also supports custom separators and even character ranges as we’ll see below.
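As a minimal sketch of those last two features, here is cut run against a small colon-separated file (the file and its contents are made up just for illustration):

```shell
# Create a tiny colon-separated file for illustration
printf 'alice:x:1000\nbob:x:1001\n' > users.txt
cut -d ':' -f 1 users.txt   # -d sets the separator; prints: alice, bob
cut -c 1-3 users.txt        # character range; prints: ali, bob
rm users.txt
```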
Head/Tail
These are two complementary utilities that let you print the first (head) or last (tail) n lines from a file. By default these will take 10 lines but you can specify a different number.
Now using cut and head we can display the first three columns and 5 rows:
$ cut -f 1-3 jeop.tsv | head -5
Show Number Air Date Round
4680 12/31/2004 Jeopardy!
4680 12/31/2004 Jeopardy!
4680 12/31/2004 Jeopardy!
4680 12/31/2004 Jeopardy!
Hmm, but what if we don’t want the header? Tail has a handy option for starting output at a given line; -n +2 starts at line 2:
$ cut -f 1-3 jeop.tsv | head -5 | tail -n +2
4680 12/31/2004 Jeopardy!
4680 12/31/2004 Jeopardy!
4680 12/31/2004 Jeopardy!
4680 12/31/2004 Jeopardy!
Another very useful option in tail is -f or ‘follow’. In this mode tail will continuously monitor a file and print any lines that are added to it, so it can be used to watch a log file that another process is generating.
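A minimal sketch of following a log file (the path here is hypothetical; press Ctrl-C to stop following):

```shell
# Continuously print new lines as they are appended to the file;
# piping into grep surfaces only matching lines as they arrive.
# The log path is made up for illustration.
tail -f /var/log/app.log | grep -i "error"
```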
Grep
Grep lets you evaluate regular expressions over a body of text. Let’s try to find questions about sports:
$ cut -f 6 jeop.tsv | grep -E " sports? " | head -5 | cut -c 1-80
This sport has an under-17 World Cup every 2 years; Haris Seferovic starred for
Ronaldo Luiz Nazario de Lima began playing this sport for Brazil's national team
In 1986 Mexico scored as the first country to host this international sports com
"Named for a sport that embodies high society, this Ralph Lauren co. was hacked
They're the sports teams of Fresno State as well as Georgia
First we cut to get the question column (6), then apply the regular expression, and finally display the first 80 characters (cut -c) of the first 5 matches (head -5).
wc
wc stands for word count, and by default it displays the number of newlines, words, and bytes in a file:
$ wc jeop.tsv
216931 5156810 33633170 jeop.tsv
A very useful switch is -l, which outputs just the number of lines. Combining with our previous example we can find how many questions contain ‘sport’ or ‘sports’:
$ cut -f 6 jeop.tsv | grep -E " sports? " | wc -l
603
Sort
This one is pretty self-explanatory: it sorts its input line by line. We can take a peek at the questions sorted by show number:
$ sort -n jeop.tsv | cut -f 1-4 | head -5
Show Number Air Date Round Category
1 9/10/1984 Double Jeopardy! 4-LETTER WORDS
1 9/10/1984 Double Jeopardy! 4-LETTER WORDS
1 9/10/1984 Double Jeopardy! 4-LETTER WORDS
1 9/10/1984 Double Jeopardy! 4-LETTER WORDS
Here -n instructs sort to interpret the leading string of each line as a number for ordering purposes.
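Sort can also key on a specific column with -k, using -t to set the field separator. A small sketch on a made-up tab-separated file:

```shell
# Create a tiny TSV for illustration and sort it by its
# second column, numerically
printf 'b\t3\na\t1\nc\t2\n' > scores.tsv
sort -t "$(printf '\t')" -k 2 -n scores.tsv
rm scores.tsv
```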
Uniq
Given a sequence of lines, uniq prints only unique occurrences, collapsing adjacent duplicates. My favorite flag is -c, which outputs the count for each value. Note that uniq requires its input to be sorted, since only adjacent duplicates are collapsed. We can use it to find the top 10 categories:
$ cut -f 4 jeop.tsv | sort | uniq -c | sort -r -n | head -10
547 BEFORE & AFTER
519 SCIENCE
496 LITERATURE
418 AMERICAN HISTORY
401 POTPOURRI
377 WORLD HISTORY
371 WORD ORIGINS
351 COLLEGES & UNIVERSITIES
349 HISTORY
342 SPORTS
Sed
Sed (stream editor) is most commonly used for string replacement operations, through its s (substitution) command:
sed "s/<search regex>/<replacement>/g" file.txt
The final g instructs sed to replace all occurrences of the regex on each line, not just the first one. For example, I could replace all double-double quotes in the Jeopardy questions, such as:
"In 1963, live on ""The Art Linkletter Show"", this company …"
with single quotes:
$ sed "s/\"\"/'/g" jeop.tsv | head -5 | cut -f 6 | cut -c 1-80
Question
"For the last 8 years of his life, Galileo was under house arrest for espousing
"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons wi
"The city of Yuma in this state has a record average of 4,055 hours of sunshine
"In 1963, live on 'The Art Linkletter Show', this company served its billionth b
sed has many other uses and options, such as operating only on certain lines or even adding/deleting lines from a file.
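As a quick hedged sketch of that line addressing, here is sed deleting and substituting by line number on a throwaway file (created here just for illustration):

```shell
# Create a tiny file for illustration
printf 'header\nrow1\nrow2\n' > demo.txt
sed '1d' demo.txt            # delete line 1 (drop the header)
sed '2s/row/ROW/' demo.txt   # substitute only on line 2
rm demo.txt
```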
I hope this has whetted your appetite, and I encourage you to explore the various options and uses for these tools. They are powerful, fun, and should be part of every engineer’s tool set!