Unix text filters, part 1 of 3: grep
After the preliminary post on regular expressions, we are ready to begin this series on text filters.
This time we’ll explore
grep, the simplest kind of filter:
given a bunch of lines of text, print out only those that match a
given pattern.
I will only describe a few basic options. All that I mention here
is POSIX-standard, with the exception of the option
-o. This means
that the content of this post is valid in pretty much any UNIX-like
OS, but check your manual pages before copy-pasting my code - I can
always make mistakes.
Without further ado, let’s dive in!
If you are familiar with how (UNIX) programs read from standard
input and write to standard output, the idea behind
grep is easily explained: the command
$ grep PATTERN
will read lines from standard input and write to standard output
only those that contain the given
PATTERN. If you specify file
names after the pattern
$ grep PATTERN file1 file2 ...
grep will read those files instead of standard input. The
PATTERN can also be a regular expression.
In other words, you can use
grep to look for certain pieces of
text in a file or in the output of another command. If you do not
understand what all of this is about, start reading from the Examples
section below to get an idea.
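As a small illustration of the two invocations above, the sketch below feeds the same three lines to grep, once through a pipe and once through a file. The sample text and the file name are made up for the example.

```shell
# Made-up sample input: three lines of text, also saved to a temporary file.
tmpdir=$(mktemp -d)
printf 'apple pie\nbanana bread\napple juice\n' > "$tmpdir/food.txt"

# Read from standard input: only the lines containing "apple" are printed.
from_stdin=$(printf 'apple pie\nbanana bread\napple juice\n' | grep apple)

# Read from a file given after the pattern: same result.
from_file=$(grep apple "$tmpdir/food.txt")

printf '%s\n' "$from_stdin"
rm -r "$tmpdir"
```

Both invocations should select the two lines that mention "apple" and skip the one about bananas.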
Now let’s see how you can tune
grep’s behavior to your needs.
What to match:
A common use of
grep, especially for non-programming tasks, is
to look for occurrences of a specific word in a long text. In
this case one usually does not care if the word is all lowercase
or capitalized, for example because it appears at the beginning of a sentence.
If you find yourself in this situation, you can use the -i option.
Sometimes it is easier to spell out what you do not want to match -
for example, say you want all non-empty lines of a given file. In
this case you can use the
-v option to invert the behavior of
grep, such as:
$ grep -v "^$" file
"^$" is a regular expression that matches all lines where the
beginning of the line (in regex language,
^) is immediately followed
by the end of the line (
$); in other words, empty lines.
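Here is a quick sketch of case-insensitive matching with -i and inverted matching with -v. The sample text is made up for the example.

```shell
# Made-up sample text with mixed capitalization and one empty line.
text='Grep is a filter.

I use grep daily.
Nothing to see here.'

# -i: match "grep" regardless of case.
any_case=$(printf '%s\n' "$text" | grep -i grep)

# -v: invert the match, here dropping the empty line.
non_empty=$(printf '%s\n' "$text" | grep -v '^$')

printf '%s\n' "$any_case"
printf '%s\n' "$non_empty"
```

The first command should keep both the capitalized and the lowercase occurrence, while the second should print every line except the empty one.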
More on patterns:
Up to now I have not specified what kind of regular expression
grep uses. By default it uses basic regular expressions, but it
uses extended regular expressions if called with the
-E option. Equivalently, you can use the command
egrep. If you want to
turn off regular expressions altogether, you can use
grep -F (or
fgrep).
If you want to select lines that match any of a number of patterns,
you can use the
-e option:
$ grep -e PATTERN1 -e PATTERN2 -e ... [file1 file2 ...]
Alternatively you can write your pattern in a file, one per line, and use:
$ grep -f PATTERN_FILE [file1 file2 ...]
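To see the difference between the pattern flavors and the multi-pattern options in practice, here is a small sketch. The word list and the file names are made up for the example.

```shell
# Made-up sample lines.
tmpdir=$(mktemp -d)
printf 'cat\ncot\nc.t\ndog\n' > "$tmpdir/words.txt"

# Basic regex (default): "." matches any single character.
basic=$(grep 'c.t' "$tmpdir/words.txt")

# Extended regex (-E): "|" means "either one or the other".
extended=$(grep -E 'cat|dog' "$tmpdir/words.txt")

# Fixed string (-F): "." is taken literally.
fixed=$(grep -F 'c.t' "$tmpdir/words.txt")

# Several patterns with -e, or the same patterns read from a file with -f.
multi_e=$(grep -e cot -e dog "$tmpdir/words.txt")
printf 'cot\ndog\n' > "$tmpdir/patterns.txt"
multi_f=$(grep -f "$tmpdir/patterns.txt" "$tmpdir/words.txt")

printf '%s\n' "$fixed"
rm -r "$tmpdir"
```

With the default basic regular expressions the pattern c.t should select cat, cot and c.t, while with -F only the literal c.t line is left.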
Grepping multiple files:
Sometimes I use
grep to find occurrences of a certain string in
a bunch of files, for example with
$ grep "word" *
When used with multiple input files like this,
grep will precede
each output line with the name of the file that contains it. If the
option -n is used, the line number is also shown. If
-l is used,
only the name of the file is shown, and each file is shown at most
once.
If you do not want to print the file names at all, you can always use:
$ cat file1 file2 ... | grep PATTERN
But if anyone asks, you did not learn this from me - UUOC (Useless Use Of Cat) is considered a crime in some circles.
Update 2023-09-02: I have just discovered that the
-h option can
be used to hide the file names, so no need for piping cats. However,
though present both in OpenBSD’s and GNU’s versions of
grep, this option is not POSIX standard.
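The sketch below shows what the output looks like in each of these cases. The three small files are made up for the example; only a.txt and b.txt contain the word we look for.

```shell
# Create three made-up files in a temporary directory.
tmpdir=$(mktemp -d)
oldpwd=$PWD
cd "$tmpdir" || exit 1
printf 'one word\nnothing\n' > a.txt
printf 'nothing\nword again\nword thrice\n' > b.txt
printf 'nothing at all\n' > c.txt

# Multiple files: each matching line is preceded by its file name.
with_names=$(grep word a.txt b.txt c.txt)

# -n: the line number is added after the file name.
with_numbers=$(grep -n word a.txt b.txt c.txt)

# -l: only file names, each at most once (b.txt has two matching lines).
only_names=$(grep -l word a.txt b.txt c.txt)

# Piping through cat: grep reads standard input, so no file names at all.
no_names=$(cat a.txt b.txt c.txt | grep word)

printf '%s\n' "$with_numbers"
cd "$oldpwd" || exit 1
rm -r "$tmpdir"
```

The -n variant should print lines such as b.txt:2:word again, while -l should list just a.txt and b.txt.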
Matching only part of a line:
You may not always want the full line containing a piece of text.
Sometimes you just want a specific part of a line, and you know
exactly how to match it with a regular expression. In this case you can
use the -o option - we’ll see an example below.
-o is not POSIX-standard. It is ubiquitous though, and it
should be present in pretty much any version of
grep.
Now that we know the basics, let’s see some exciting applications.
Nah, I am kidding, they are not exciting. But they are useful. Boring, but useful.
Filter command output
Probably my first use of
grep was to filter out irrelevant parts of
some command’s output. Say for example you are troubleshooting a
problem with your webcam: you can use
dmesg to check what your
operating system knows about it, but most of the output is useless
to your specific problem. No worries, you can pipe it into grep:
$ dmesg | grep video
acpivideo0 at acpi0: VGA_
acpivout0 at acpivideo0: LCDD
uvideo0 at uhub0 port 6 configuration 1 interface 0 "JMICRON TECHNOLOGIES CO., LTD. USB2.0 UVC VGA WebCam" rev 2.00/2.04 addr 2
video0 at uvideo0
Look stuff up in files
Sometimes you may want to search something in a bunch of files. Let’s say for example I want to check in which of my old blog posts I have mentioned “Linux”:
$ grep -l Linux src/blog/*/*
src/blog/2022-05-29-man/man.md
src/blog/2022-08-14-website/website.md
src/blog/2022-09-10-netbooks/netbooks.md
src/blog/2023-01-28-windows-desktop/windows-desktop.md
src/blog/2023-02-25-job-control/job-control.md
src/blog/2023-02-25-job-control/jobs-diagram.pdf
Or say I am working on one of my software projects, and I do not remember where a certain function is defined:
$ grep -n "^apply_move(" src/*.c
src/moves.c:206:apply_move(Move m, Cube cube)
Note: the command above works because, when I write C code, I put
function names at the start of a new line. See also
this older post for another example
that takes advantage of this, this time using
Looking for URLs in a piece of text is a common enough operation
for me that I saved it into a script,
which I called
urlgrep, for ease of use. URLs can be complicated,
so for a long time I used a regular expression copied from somewhere
on the internet.
Now that I am more familiar with
grep and regular expressions, I have
written my own - it does not work perfectly, but at least I understand it
and I can keep tweaking it if I find errors.
Let’s build it together! What does a URL look like? It usually starts with
either a protocol followed by a colon, or with
www.. Then a bunch of
valid characters follow. There are probably more rules to it, but to keep
it simple we can start like this (using extended regular expressions):
For protocols we can use
I have thrown
mailto in there because it is quite common in links on web
pages. The valid characters are:
(Yes, these ones I actually copied from somewhere online). Finally we can find all URLs with
$ egrep -o "$regex"
As I mentioned above, there are some problems with this. If a URL is not terminated by a space, the characters following it may be grepped too. For example:
$ urlgrep <src/blog/2022-05-21-blogs/blogs.md
https://en.wikipedia.org/wiki/Hypertext).
https://caseymuratori.com/blog_0031)
https://en.wikipedia.org/wiki/Netbook)
https://developer.mozilla.org/en-US/Learn)
https://www.romanzolotarev.com/website.html).
This is not technically a problem, because parentheses and dots are allowed as part of a URL. But it is practically a problem, because most URLs will only contain matching pairs of parentheses.
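To make the construction described above concrete, here is a rough sketch of what such a script could look like. The protocol list and the set of valid characters are illustrative assumptions, not necessarily the ones in the actual urlgrep script, and the function name urlgrep_sketch is made up. It also relies on the non-POSIX -o option.

```shell
# Illustrative assumptions: a short protocol list and a simplified set of
# characters allowed in a URL. The real urlgrep may use different ones.
protocols='(https?|ftp|mailto)'
valid='[A-Za-z0-9_~/:.?#@!$&()*+,;=%-]'

# A URL starts with "protocol:" or "www." and continues with valid characters.
regex="(${protocols}:|www\.)${valid}+"

# Hypothetical stand-in for urlgrep: print only the matching parts of each line.
urlgrep_sketch() {
    grep -E -o "$regex"
}

# Made-up sample line, with a URL followed by a parenthesis and a dot.
sample='See (https://en.wikipedia.org/wiki/Hypertext). Also www.example.org is nice.'
found=$(printf '%s\n' "$sample" | urlgrep_sketch)
printf '%s\n' "$found"
```

Because both ) and . are in the set of valid characters, the first match should include the trailing "). " punctuation minus the space, which is the same problem seen in the output above.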
grep is a must-know for anyone who wants to be proficient with the
UNIX command line. Luckily, it is also pretty easy to learn.
Moreover, being familiar with
grep makes it easy to learn more
advanced tools, such as
sed and awk: the “read one line, process
it, print something” idea is common to all three of them.
Stay tuned for part 2: