Building Tools with the Unix Shell
Wisdom comes from experience. Experience is often a result of lack of wisdom.
— Terry Pratchett
The shell’s greatest strength is that it lets us combine programs to create pipelines that can handle large volumes of data. This lesson shows how to do that, and how to repeat commands to process as many files as we want automatically.
We’ll be continuing to work in the zipf project, which should contain the following files after the previous chapter (zipf is located in the projects folder of the directory in which this chapter resides):
zipf/
└── data
├── README.md
├── dracula.txt
├── frankenstein.txt
├── jane_eyre.txt
├── moby_dick.txt
├── sense_and_sensibility.txt
├── sherlock_holmes.txt
└── time_machine.txt
Combining Commands
To see how the shell lets us combine commands,
let’s go into the zipf/data
directory
and count the number of lines in each file once again:
$ cd projects/zipf/data
$ wc -l *.txt
15975 dracula.txt
7832 frankenstein.txt
21054 jane_eyre.txt
22331 moby_dick.txt
13028 sense_and_sensibility.txt
13053 sherlock_holmes.txt
3582 time_machine.txt
96855 total
Which of these books is shortest? We can check by eye when there are only seven files, but what if there were eight thousand?
Our first step toward a solution is to run this command:
$ wc -l *.txt > lengths.txt
The greater-than symbol >
tells the shell
to redirect the command’s output to a file instead of printing it.
Nothing appears on the screen;
instead,
everything that would have appeared has gone into the file lengths.txt.
The shell creates this file if it doesn’t exist,
or overwrites it if it already exists.
We can print the contents of lengths.txt
using cat
(man page),
which is short for concatenate
(because if we give it the names of several files
it will print them all in order):
$ cat lengths.txt
15975 dracula.txt
7832 frankenstein.txt
21054 jane_eyre.txt
22331 moby_dick.txt
13028 sense_and_sensibility.txt
13053 sherlock_holmes.txt
3582 time_machine.txt
96855 total
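Two quick sketches may make these points concrete. They use echo, which simply prints its arguments and which we will meet properly later in this lesson, plus a throwaway file called demo.txt. First, because > overwrites, only the second line survives:
$ echo first > demo.txt
$ echo second > demo.txt
$ cat demo.txt
second
Second, true to its name, cat concatenates when given several filenames, so cat lengths.txt lengths.txt would print the file’s eight lines twice in a row.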
We can now use sort
(man page) to sort the lines in this file:
$ sort -n lengths.txt
3582 time_machine.txt
7832 frankenstein.txt
13028 sense_and_sensibility.txt
13053 sherlock_holmes.txt
15975 dracula.txt
21054 jane_eyre.txt
22331 moby_dick.txt
96855 total
Just to be safe, we use sort’s -n option to specify that we want to sort numerically. Without it, sort would order things alphabetically, so that 10 would come before 2.
sort does not change lengths.txt. Instead, it sends its output to the screen just as wc did. We can therefore put the sorted list of lines in another temporary file called sorted-lengths.txt using > once again:
$ sort -n lengths.txt > sorted-lengths.txt
Redirecting to the Same File
It’s tempting to send the output of sort back to the file it reads:
$ sort -n lengths.txt > lengths.txt
However, all this does is wipe out the contents of lengths.txt. The reason is that when the shell sees the redirection, it opens the file on the right of the > for writing, which erases anything that file contained. It then runs sort, which finds itself reading from a newly empty file.
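If we genuinely need to sort a file in place, a safer route is sort’s -o option (available in both the GNU and BSD implementations), which names the output file explicitly; because sort reads all of its input before writing any output, the input and output files may safely be the same:
$ sort -n -o lengths.txt lengths.txt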
Creating intermediate files with names like lengths.txt
and sorted-lengths.txt
works,
but keeping track of those files and cleaning them up when they’re no longer needed is a burden.
Let’s delete the two files we just created:
$ rm lengths.txt sorted-lengths.txt
We can produce the same result more safely and with less typing using a pipe:
$ wc -l *.txt | sort -n
3582 time_machine.txt
7832 frankenstein.txt
13028 sense_and_sensibility.txt
13053 sherlock_holmes.txt
15975 dracula.txt
21054 jane_eyre.txt
22331 moby_dick.txt
96855 total
The vertical bar |
between the wc
and sort
commands
tells the shell that we want to use the output of the command on the left
as the input to the command on the right.
Running a command with a file as input has a clear flow of information: the command performs a task on that file and prints the output to the screen (see part a below). When using pipes, however, the information flows differently after the first (upstream) command. The downstream command doesn’t read from a file. Instead, it reads the output of the upstream command (see part b).
We can use |
to build pipes of any length.
For example,
we can use the command head
(man page) to get just the first three lines of sorted data,
which shows us the three shortest books:
$ wc -l *.txt | sort -n | head -n 3
3582 time_machine.txt
7832 frankenstein.txt
13028 sense_and_sensibility.txt
Options Can Have Values
When we write head -n 3, the value 3 is not input to head. Instead, it is associated with the option -n. Many options take values like this, such as the names of input files or the background color to use in a plot. Some versions of head may allow you to use head -3 as a shortcut, though this can be confusing if other options are included.
We could always redirect the output to a file
by adding > shortest.txt
to the end of the pipeline,
thereby retaining our answer for later reference.
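For example:
$ wc -l *.txt | sort -n | head -n 3 > shortest.txt
Nothing would appear on the screen; the three lines shown above would be in shortest.txt instead.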
In practice, most Unix users would create this pipeline step by step, just as we have: by starting with a single command and adding others one by one, checking the output after each change. The shell makes this easy by letting us move up and down in our command history with the ↑ and ↓ keys. We can also edit old commands to create new ones, so a very common sequence is:
1. Run a command and check its output.
2. Use ↑ to bring it up again.
3. Add the pipe symbol | and another command to the end of the line.
4. Run the pipe and check its output.
5. Use ↑ to bring it up again.
6. And so on.
How Pipes Work
In order to use pipes and redirection effectively, we need to know a little about how they work. When a computer runs a program—any program—it creates a process in memory to hold the program’s instructions and data. Every process in Unix has an input channel called standard input (stdin) and an output channel called standard output (stdout). (By now you may be surprised that their names are so memorable, but don’t worry: most Unix programmers call them “stdin” and “stdout”, which are pronounced “stuh-din” and “stuh-dout”.)
The shell is a program like any other, and like any other, it runs inside a process. Under normal circumstances its standard input is connected to our keyboard and its standard output to our screen, so it reads what we type and displays its output for us to see (a). When we tell the shell to run a program, it creates a new process and temporarily reconnects the keyboard and screen to that process’s standard input and output (b).
If we provide one or more files for the command to read,
as with sort lengths.txt
,
the program reads data from those files.
If we don’t provide any filenames,
though,
the Unix convention is for the program to read from standard input.
We can test this by running sort
on its own,
typing in a few lines of text,
and then pressing Ctrl+D to signal the end of input.
sort
will then sort and print whatever we typed:
$ sort
one
two
three
four
^D
four
one
three
two
Redirection with >
tells the shell to connect the program’s standard output to a file
instead of the screen (see c in the above figure).
When we create a pipe like wc *.txt | sort
,
the shell creates one process for each command so that wc
and sort
will run simultaneously,
and then connects the standard output of wc
directly to the standard input of sort
(see d in the above figure).
wc doesn’t know whether its output is going to the screen, another program, or to a file via >.
Equally,
sort
doesn’t know if its input is coming from the keyboard or another process;
it just knows that it has to read, sort, and print.
Why Isn’t It Doing Anything?
What happens if a command is supposed to process a file but we don’t give it a filename? For example, what if we type:
$ wc -l
but don’t type *.txt (or anything else) after the command? Since wc doesn’t have any filenames, it assumes it is supposed to read from the keyboard, so it waits for us to type in some data. It doesn’t tell us this; it just sits and waits.
This mistake can be hard to spot, particularly if we put the filename at the end of the pipeline:
$ wc -l | sort moby_dick.txt
In this case, sort ignores standard input and reads the data in the file, but wc still just sits there waiting for input.
If we make this mistake, we can end the program by typing Ctrl+C. We can also use this to interrupt programs that are taking a long time to run or are trying to connect to a website that isn’t responding.
Just as we can redirect standard output with >, we can connect standard input to a file using <.
In the case of a single file,
this has the same effect as providing the file’s name to the wc
command:
$ wc < moby_dick.txt
22331 215832 1276222
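One small difference is worth noting: when wc is given a filename as an argument, it echoes that name back along with the counts, whereas with < it has no way of knowing the name, so none appears (column spacing may vary between systems):
$ wc moby_dick.txt
22331 215832 1276222 moby_dick.txt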
If we try to use redirection with a wildcard, though, the shell doesn’t concatenate all of the matching files:
$ wc < *.txt
-bash: *.txt: ambiguous redirect
It also doesn’t print the error message to standard output, which we can prove by redirecting:
$ wc < *.txt > all.txt
-bash: *.txt: ambiguous redirect
$ cat all.txt
cat: all.txt: No such file or directory
Instead, every process has a second output channel called standard error (stderr). Programs use it for error messages so that their attempts to tell us something has gone wrong don’t vanish silently into an output file. There are ways to redirect standard error, but doing so is almost always a bad idea.
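There is rarely a good reason to do it, but for completeness, here is a sketch of what redirecting standard error looks like in bash, where file descriptor 2 refers to stderr (nonexistent.txt is a hypothetical missing file, and the exact error wording varies between systems):
$ head -n 1 nonexistent.txt 2> errors.log
$ cat errors.log
head: cannot open 'nonexistent.txt' for reading: No such file or directory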
Repeating Commands on Many Files
A loop is a way to repeat a set of commands for each item in a list. We can use them to build complex workflows out of simple pieces, and (like wildcards) they reduce the typing we have to do and the number of mistakes we might make.
Let’s suppose that we want to take a section out of each book
whose name starts with the letter “s” in the data
directory.
More specifically,
suppose we want to get the first 8 lines of each book
after the 9 lines of license information that appear at the start of the file.
If we only cared about one file, we could write a pipeline to take the first 17 lines and then take the last 8 of those (i.e., lines 10 through 17 of the file):
$ head -n 17 sense_and_sensibility.txt | tail -n 8
Title: Sense and Sensibility
Author: Jane Austen
Editor:
Release Date: May 25, 2008 [EBook #161]
Posting Date:
Last updated: February 11, 2015
Language: English
If we try to use a wildcard to select files, we only get 8 lines of output, not the 16 we expect:
$ head -n 17 s*.txt | tail -n 8
Title: The Adventures of Sherlock Holmes
Author: Arthur Conan Doyle
Editor:
Release Date: April 18, 2011 [EBook #1661]
Posting Date: November 29, 2002
Latest Update:
Language: English
The problem is that head
is producing a single stream of output
containing 17 lines for each file
(along with a header telling us the file’s name):
$ head -n 17 s*.txt
==> sense_and_sensibility.txt <==
The Project Gutenberg EBook of Sense and Sensibility, by ...
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it ...
re-use it under the terms of the Project Gutenberg License ...
with this eBook or online at www.gutenberg.net
Title: Sense and Sensibility
Author: Jane Austen
Editor:
Release Date: May 25, 2008 [EBook #161]
Posting Date:
Last updated: February 11, 2015
Language: English
==> sherlock_holmes.txt <==
Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur
Conan Doyle
This eBook is for the use of anyone anywhere at no cost and ...
almost no restrictions whatsoever. You may copy it, give ...
re-use it under the terms of the Project Gutenberg License ...
with this eBook or online at www.gutenberg.net
Title: The Adventures of Sherlock Holmes
Author: Arthur Conan Doyle
Editor:
Release Date: April 18, 2011 [EBook #1661]
Posting Date: November 29, 2002
Latest Update:
Language: English
Let’s try this instead:
$ for filename in sense_and_sensibility.txt sherlock_holmes.txt
> do
> head -n 17 $filename | tail -n 8
> done
Title: Sense and Sensibility
Author: Jane Austen
Editor:
Release Date: May 25, 2008 [EBook #161]
Posting Date:
Last updated: February 11, 2015
Language: English
Title: The Adventures of Sherlock Holmes
Author: Arthur Conan Doyle
Editor:
Release Date: April 18, 2011 [EBook #1661]
Posting Date: November 29, 2002
Latest Update:
Language: English
As the output shows, the loop runs our pipeline once for each file. There is a lot going on here, so we will break it down into pieces:
- The keywords for, in, do, and done create the loop, and must always appear in that order.
- filename is a variable just like a variable in Python. At any moment it contains a value, but that value can change over time.
- The loop runs once for each item in the list. Each time it runs, it assigns the next item to the variable. In this case, filename will be sense_and_sensibility.txt the first time around the loop and sherlock_holmes.txt the second time.
- The commands that the loop executes are called the body of the loop and appear between the keywords do and done. Those commands use the current value of the variable filename, but to get it, we must put a dollar sign $ in front of the variable’s name. If we forget and use filename instead of $filename, the shell will think that we are referring to a file that is actually called filename (see the sketch after this list).
- The shell prompt changes from $ to a continuation prompt > as we type in our loop to remind us that we haven’t finished typing a complete command yet. We don’t type the >, just as we don’t type the $. The continuation prompt > has nothing to do with redirection; it’s used because there are only so many punctuation symbols available.
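Here is a sketch of that forget-the-dollar-sign mistake in action; the loop runs once per file, and each head fails because no file is literally called filename (the exact error wording depends on your system):
$ for filename in s*.txt
> do
> head -n 17 filename | tail -n 8
> done
head: cannot open 'filename' for reading: No such file or directory
head: cannot open 'filename' for reading: No such file or directory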
Continuation Prompts May Differ Too
As mentioned in Chapter The Basics of the Unix Shell, there is variation in how different shells look and operate. If you noticed the second, third, and fourth code lines in your for loop were prefaced with for, it’s not because you did something wrong! That difference is one of the ways in which zsh differs from bash.
It is very common to use a wildcard to select a set of files and then loop over that set to run commands:
$ for filename in s*.txt
> do
> head -n 17 $filename | tail -n 8
> done
Title: Sense and Sensibility
Author: Jane Austen
Editor:
Release Date: May 25, 2008 [EBook #161]
Posting Date:
Last updated: February 11, 2015
Language: English
Title: The Adventures of Sherlock Holmes
Author: Arthur Conan Doyle
Editor:
Release Date: April 18, 2011 [EBook #1661]
Posting Date: November 29, 2002
Latest Update:
Language: English
Variable Names
We should always choose meaningful names for variables,
but we should remember that those names don’t mean anything to the computer.
For example,
we have called our loop variable filename
to make its purpose clear to human readers,
but we could equally well write our loop as:
$ for x in s*.txt
> do
> head -n 17 $x | tail -n 8
> done
or as:
$ for username in s*.txt
> do
> head -n 17 $username | tail -n 8
> done
Don’t do this.
Programs are only useful if people can understand them,
so meaningless names like x
and misleading names like username
increase the odds of misunderstanding.
Redoing Things
Loops are useful if we know in advance what we want to repeat,
but we can also repeat commands that we have run recently.
One way is to use ↑ and ↓
to go up and down in our command history as described earlier.
Another is to use history
(man page)
to get a list of the last few hundred commands we have run:
$ history
551 wc -l *.txt | sort -n
552 wc -l *.txt | sort -n | head -n 3
553 wc -l *.txt | sort -n | head -n 1 > shortest.txt
We can use an exclamation mark !
followed by a number
to repeat a recent command:
$ !552
wc -l *.txt | sort -n | head -n 3
3582 time_machine.txt
7832 frankenstein.txt
13028 sense_and_sensibility.txt
The shell prints the command it is going to re-run to standard error before executing it, so that (for example) !552 > results.txt puts the command’s output in a file without also writing the command to the file.
Having an accurate record of the things we have done and a simple way to repeat them are two of the main reasons people use the Unix shell. In fact, being able to repeat history is such a powerful idea that the shell gives us several ways to do it:
- !head re-runs the most recent command starting with head, while !wc re-runs the most recent starting with wc.
- If we type Ctrl+R (for reverse search) the shell searches backward through its history for whatever we type next. If we don’t like the first thing it finds, we can type Ctrl+R again to go further back.
If we use history
, ↑, or Ctrl+R
we will quickly notice that loops don’t have to be broken across lines.
Instead,
their parts can be separated with semi-colons:
$ for filename in s*.txt; do head -n 17 $filename | tail -n 8;
done
This is fairly readable,
though it becomes more challenging if our for loop includes multiple commands.
For example,
we may choose to include the echo
(man page) command,
which prints its arguments to the screen,
so we can keep track of progress or for debugging.
Compare this:
$ for filename in s*.txt
> do
> echo $filename
> head -n 17 $filename | tail -n 8
> done
with this:
$ for filename in s*.txt; do echo $filename; head -n 17 $filename
| tail -n 8; done
Even experienced users have a tendency to (incorrectly)
put the semi-colon after do
instead of before it.
If our loop contains multiple commands,
though,
the multi-line format is much easier to read and troubleshoot.
Note that (depending on the size of your shell window)
the format separated by semi-colons may be printed onto more than one line,
as shown in the previous code example.
You can tell whether code entered into your shell is intended to be run as a single line by looking at the prompts: the original command prompt ($) together with the continuation prompt (>) indicates the code is on separate lines, while the absence of either at the start of a wrapped line indicates a single line of code.
Creating New Filenames Automatically
Suppose we want to create a backup copy of each book whose name ends in “e”.
If we don’t want to change the files’ names,
we can do this with cp:
$ cd ~/zipf
$ mkdir backup
$ cp data/*e.txt backup
$ ls backup
jane_eyre.txt time_machine.txt
Warnings
If you attempt to re-execute the code chunk above, you’ll end up with an error after the second line:
mkdir: backup: File exists
This warning isn’t necessarily a cause for alarm. It lets you know that the command couldn’t be completed, but will not prevent you from proceeding.
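If we would rather the mkdir succeed quietly whether or not the directory already exists, its -p option (standard across Unix systems) does exactly that:
$ mkdir -p backup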
But what if we want to append the extension .bak
to the files’ names?
cp
can do this for a single file:
$ cp data/time_machine.txt backup/time_machine.txt.bak
but not for all the files at once:
$ cp data/*e.txt backup/*e.txt.bak
cp: target 'backup/*e.txt.bak' is not a directory
backup/*e.txt.bak doesn’t match anything (those files don’t yet exist), so after the shell expands the * wildcard, what we are actually asking cp to do is:
$ cp data/jane_eyre.txt data/time_machine.txt backup/*e.txt.bak
This doesn’t work because cp
only understands how to do two things:
copy a single file to create another file,
or copy a bunch of files into a directory.
If we give it more than two names as arguments,
it expects the last one to be a directory.
Since backup/*e.txt.bak is not, cp reports an error.
Instead,
let’s use a loop to copy files to the backup directory
and append the .bak
suffix:
$ cd data
$ for filename in *e.txt
> do
> cp $filename ../backup/$filename.bak
> done
$ ls ../backup
jane_eyre.txt.bak time_machine.txt.bak
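When in doubt about a loop like this one, it can be reassuring to preview the commands it will run by putting echo in front of them, so that each cp command is printed instead of executed (a technique the exercises below explore further):
$ for filename in *e.txt
> do
> echo cp $filename ../backup/$filename.bak
> done
cp jane_eyre.txt ../backup/jane_eyre.txt.bak
cp time_machine.txt ../backup/time_machine.txt.bak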
Summary
The shell’s greatest strength is the way it combines a few powerful ideas with pipes and loops. The next chapter will show how we can make our work more reproducible by saving commands in files that we can run over and over again.
Exercises
The exercises below involve creating and moving new files, as well as considering hypothetical files. Please note that if you create or move any files or directories in your Zipf’s Law project, you may want to reorganize your files following the outline at the beginning of the next chapter. If you accidentally delete necessary files and have cloned the book from the git repository, you can revert changes under a given subdirectory using:
$ git checkout -- .
The . specifies the current directory of the shell session.
What does >> mean?
We have seen the use of >
, but there is a similar operator >>
which works slightly differently.
We’ll learn about the differences between these two operators by printing some strings.
We can use the echo
command to print strings as shown below:
$ echo The echo command prints text
The echo command prints text
Now test the commands below to reveal the difference between the two operators:
$ echo hello > testfile01.txt
and:
$ echo hello >> testfile02.txt
Hint: Try executing each command twice in a row and then examining the output files.
Appending data
Given the following commands, what will be included in the file extracted.txt:
$ head -n 3 dracula.txt > extracted.txt
$ tail -n 2 dracula.txt >> extracted.txt
1. The first three lines of dracula.txt
2. The last two lines of dracula.txt
3. The first three lines and the last two lines of dracula.txt
4. The second and third lines of dracula.txt
Piping commands
In our current directory, we want to find the 3 files that have the fewest lines. Which command listed below would work?
1. wc -l * > sort -n > head -n 3
2. wc -l * | sort -n | head -n 1-3
3. wc -l * | head -n 3 | sort -n
4. wc -l * | sort -n | head -n 3
Why does uniq only remove adjacent duplicates?
The command uniq
(man page) removes adjacent duplicated lines from its input.
Consider a hypothetical file genres.txt
containing the following data:
science fiction
fantasy
science fiction
fantasy
science fiction
science fiction
Running the command uniq genres.txt
produces:
science fiction
fantasy
science fiction
fantasy
science fiction
Why do you think uniq
only removes adjacent duplicated lines?
(Hint: think about very large datasets.) What other command could
you combine with it in a pipe to remove all duplicated lines?
Pipe reading comprehension
A file called titles.txt
contains a list of book titles and publication years:
Dracula,1897
Frankenstein,1818
Jane Eyre,1847
Moby Dick,1851
Sense and Sensibility,1811
The Adventures of Sherlock Holmes,1892
The Invisible Man,1897
The Time Machine,1895
Wuthering Heights,1847
What text passes through each of the pipes and the final redirect in the pipeline below?
$ cat titles.txt | head -n 5 | tail -n 3 | sort -r > final.txt
Hint: build the pipeline up one command at a time to test your understanding.
Pipe construction
For the file titles.txt
from the previous exercise, consider the following command:
$ cut -d , -f 2 titles.txt
What does the cut
(man page) command (and its options) accomplish?
Which pipe?
Consider the same titles.txt
from the previous exercises.
The uniq
command has a -c
option which gives a count of the
number of times a line occurs in its input.
If titles.txt
was in your working directory,
what command would you use to produce
a table that shows the total count of each publication year in the file?
1. sort titles.txt | uniq -c
2. sort -t, -k2,2 titles.txt | uniq -c
3. cut -d, -f 2 titles.txt | uniq -c
4. cut -d, -f 2 titles.txt | sort | uniq -c
5. cut -d, -f 2 titles.txt | sort | uniq -c | wc -l
Doing a dry run
A loop is a way to do many things at once—or to make many mistakes at
once if it does the wrong thing. One way to check what a loop would do
is to echo
the commands it would run instead of actually running them.
Suppose we want to preview the commands the following loop will execute
without actually running those commands
(analyze
is a hypothetical command):
$ for file in *.txt
> do
> analyze $file > analyzed-$file
> done
What is the difference between the two loops below, and which one would we want to run?
$ for file in *.txt
> do
> echo analyze $file > analyzed-$file
> done
or:
$ for file in *.txt
> do
> echo "analyze $file > analyzed-$file"
> done
Variables in loops
Given the files in data/
,
what is the output of the following code?
$ for datafile in *.txt
> do
> ls *.txt
> done
Now, what is the output of the following code?
$ for datafile in *.txt
> do
> ls $datafile
> done
Why do these two loops give different outputs?
Limiting sets of files
What would be the output of running the following loop in your data/
directory?
$ for filename in d*
> do
> ls $filename
> done
How would the output differ from using this command instead?
$ for filename in *d*
> do
> ls $filename
> done
Saving to a file in a loop
Consider running the following loop in the data/
directory:
$ for book in *.txt
> do
> echo $book
> head -n 16 $book > headers.txt
> done
Why would the following loop be preferable?
$ for book in *.txt
> do
> head -n 16 $book >> headers.txt
> done
Why does history record commands before running them?
If you run the command:
$ history | tail -n 5 > recent.sh
the last command in the file is the history
command itself, i.e.,
the shell has added history
to the command log before actually
running it. In fact, the shell always adds commands to the log
before running them. Why do you think it does this?
Key Points
- cat displays the contents of its inputs.
- head displays the first few lines of its input.
- tail displays the last few lines of its input.
- sort sorts its inputs.
- Use the up-arrow key to scroll up through previous commands to edit and repeat them.
- Use history to display recent commands and !number to repeat a command by number.
- Every process in Unix has an input channel called standard input (stdin) and an output channel called standard output (stdout).
- > redirects a command’s output to a file, overwriting any existing content.
- >> appends a command’s output to a file.
- The < operator redirects input to a command.
- A pipe | sends the output of the command on the left to the input of the command on the right.
- A for loop repeats commands once for every thing in a list.
- Every for loop must have a variable to refer to the thing it is currently operating on and a body containing commands to execute.
- Use $name or ${name} to get the value of a variable.
Acknowledgments and License
This section has largely been taken from Research Software Engineering with Python: Building Software that Makes Research Possible (available on GitHub) by Damien Irving, Kate Hertweck, Luke Johnston, Joel Ostblom, Charlotte Wickham, and Greg Wilson, used under a Creative Commons Attribution 4.0 International License (CC-BY 4.0).