UNIX text filters, part 1 of 3: grep
After the preliminary post on regular expressions, we are ready to begin this series on text filters.
This time we’ll explore grep, the simplest kind of filter: given a bunch of lines of text, print out only those that match a certain criterion.
I will only describe a few basic options. All that I mention here is POSIX-standard, with the exception of the option -o. This means that the content of this post is valid in pretty much any UNIX-like OS, but check your manual pages before copy-pasting my code - I can always make mistakes.
Without further ado, let’s dive in!
Standard usage
If you are familiar with how (UNIX) programs read from standard input and write to standard output, the idea behind grep is easily explained: the command
$ grep PATTERN
will read lines from standard input and write to standard output only those that contain the given PATTERN. If you specify file names after the pattern
$ grep PATTERN file1 file2 ...
grep will read those files instead of standard input. The PATTERN can also be a regular expression.
In other words, you can use grep to look for certain pieces of text in a file or in the output of another command. If you do not understand what all of this is about, start reading from the Examples section below to get an idea.
Now let’s see how you can tune grep’s behavior to your needs.
What to match: -i, -v
A common use of grep, especially for non-programming tasks, is to look for occurrences of a specific word in a long text. In this case one usually does not care if the word is all lowercase or capitalized, for example because it appears at the beginning of a sentence. If you find yourself in this situation, you can use the -i option to make grep case-insensitive.
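For example, to find every mention of "unix" regardless of capitalization (the file name notes.txt is just a placeholder):
$ grep -i unix notes.txt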
Sometimes it is easier to spell out what you do not want to match - for example, say you want all non-empty lines of a given file. In this case you can use the -v option to invert the behavior of grep, like this:
$ grep -v "^$" file
Here "^$" is a regular expression that matches all lines where the beginning of the line (in regex language, ^) is immediately followed by the end of the line ($); in other words, empty lines.
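Another handy variation (with a made-up file name): skip comment lines in a configuration file by inverting a match on lines that start with #:
$ grep -v "^#" config.conf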
More on patterns: -E, -e, -F, -f
Up to now I have not specified what kind of regular expression grep uses. By default it uses basic regular expressions, but it uses extended regular expressions if called with the -E option. Equivalently, you can use the command egrep. If you want to turn off regular expressions altogether, you can use grep -F (or fgrep).
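A quick, made-up illustration of the difference: with -F the pattern 1+1 only matches the literal string, while with -E the + acts as a quantifier ("one or more"):
$ echo "1+1=2" | grep -F "1+1"
1+1=2
$ echo "111=3" | grep -E "1+1"
111=3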
If you want to select lines that match any of a number of patterns, you can use the -e option:
$ grep -e PATTERN1 -e PATTERN2 -e ... [file1 file2 ...]
Alternatively you can write your patterns in a file, one per line, and use:
$ grep -f PATTERN_FILE [file1 file2 ...]
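For instance, the following two searches should be equivalent (the file names are made up for illustration): the first spells the patterns out on the command line, the second reads them from patterns.txt:
$ grep -e error -e warning logfile
$ printf 'error\nwarning\n' > patterns.txt
$ grep -f patterns.txt logfile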
Grepping multiple files: -l, -n
Sometimes I use grep to find occurrences of a certain string in a bunch of files, for example with
$ grep "word" *
When used with multiple input files like this, grep will precede each output line with the name of the file that contains it. If the option -n is used, the line number is also shown. If -l is used, only the name of the file is shown, and each file is shown at most once.
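With -n, the output looks something like this (file names and contents are made up):
$ grep -n "word" *
notes.txt:12:a sentence containing the word
todo.txt:3:another line with the word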
If you do not want to print the file names at all, you can always cat into grep:
$ cat file1 file2 ... | grep PATTERN
But if anyone asks, you did not learn this from me - UUOC (Useless Use Of Cat) is considered a crime in some circles.
Update 2023-09-02: I have just discovered that the -h option can be used to hide the file names, so no need for piping cats. However, though present in both OpenBSD’s and GNU’s versions of grep, this option is not POSIX-standard.
Matching only part of a line: -o
You may not always want the full line containing a piece of text. Sometimes you just want a specific part of a line, and you know exactly how to match it with a regular expression. In this case you can use the -o option - we’ll see an example below.
The -o option is not POSIX-standard. It is ubiquitous though, and it should be present in pretty much any version of grep.
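As a tiny preview (with a made-up input line), -o prints only the part of the line that matches, instead of the whole line:
$ echo "error code 42 returned" | grep -o "[0-9][0-9]*"
42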
Examples
Now that we know the basics, let’s see some exciting applications of grep!
Nah, I am kidding, they are not exciting. But they are useful. Boring, but useful.
Filter command output
Probably my first use of grep was to filter out the irrelevant parts of some command’s output. Say for example you are troubleshooting a problem with your webcam: you can use dmesg to check what your operating system knows about it, but most of the output is useless for your specific problem. No worries, you can pipe dmesg into grep:
$ dmesg | grep video
acpivideo0 at acpi0: VGA_
acpivout0 at acpivideo0: LCDD
uvideo0 at uhub0 port 6 configuration 1 interface 0 "JMICRON TECHNOLOGIES CO., LTD. USB2.0 UVC VGA WebCam" rev 2.00/2.04 addr 2
video0 at uvideo0
Look stuff up in files
Sometimes you may want to search for something in a bunch of files. Let’s say for example I want to check in which of my old blog posts I have mentioned “Linux”:
$ grep -l Linux src/blog/*/*
src/blog/2022-05-29-man/man.md
src/blog/2022-08-14-website/website.md
src/blog/2022-09-10-netbooks/netbooks.md
src/blog/2023-01-28-windows-desktop/windows-desktop.md
src/blog/2023-02-25-job-control/job-control.md
src/blog/2023-02-25-job-control/jobs-diagram.pdf
Or say I am working on one of my software projects, and I do not remember where a certain function is defined:
$ grep -n "^apply_move(" src/*.c
src/moves.c:206:apply_move(Move m, Cube cube)
Note: the command above works because, when I write C code, I put function names at the beginning of a new line. See also this older post for another example that takes advantage of this, this time using sed.
Grepping URLs
Looking for URLs in a piece of text is a common enough operation for me that I saved it into a script, which I called urlgrep, for ease of use. URLs can be complicated, so for a long time I used a regular expression copied from somewhere on the internet.
Now that I am more familiar with grep and regular expressions, I have written my own - it does not work perfectly, but at least I understand it and I can keep tweaking it if I find errors.
Let’s build it together! What does a URL look like? It usually starts with either a protocol followed by a colon, or with "www.". Then a bunch of valid characters follow. There are probably more rules to it, but to keep it simple we can start like this (using extended regular expressions):
regex="(($protocols):|www\.)[$valid_chars]+"
For protocols we can use
protocols='http|https|ftp|sftp|gemini|mailto'
I have thrown mailto in there because it is quite common in links on web pages. The valid characters are:
valid_chars="][a-zA-Z0-9_~/?#@!$&'()*+=.,;:-"
(Yes, these ones I actually copied from somewhere online). Finally we can find all URLs with
$ egrep -o "$regex"
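Putting the pieces together, a minimal sketch of the whole script could look like this (my actual urlgrep may differ a bit; grep -E is equivalent to the egrep call above):
#!/bin/sh
# urlgrep (sketch): print every URL found on standard input, one per line
protocols='http|https|ftp|sftp|gemini|mailto'
valid_chars="][a-zA-Z0-9_~/?#@!$&'()*+=.,;:-"
regex="(($protocols):|www\.)[$valid_chars]+"
grep -E -o "$regex"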
As I mentioned above, there are some problems with this: if a URL is not terminated by a space, the characters following it may be grepped too. For example:
$ urlgrep <src/blog/2022-05-21-blogs/blogs.md
https://en.wikipedia.org/wiki/Hypertext).
https://caseymuratori.com/blog_0031)
https://en.wikipedia.org/wiki/Netbook)
https://developer.mozilla.org/en-US/Learn)
https://www.romanzolotarev.com/website.html).
This is not technically a problem, because parentheses and dots are allowed as part of a URL. But it is practically a problem, because most URLs will only contain matching pairs of parentheses.
Conclusion
grep is a must-know for anyone who wants to be proficient with the UNIX command line. Luckily, it is also pretty easy to learn. Moreover, being familiar with grep makes it easy to learn more advanced tools, such as sed and awk: the “read one line, process it, print something” idea is common to all three of them.
Stay tuned for part 2: sed!