UNIX text filters, part 0 of 3: regular expressions

This post is part of a series

One of the most important features of UNIX and its descendants, if not the most important feature, is input / output redirection: the output of a command can be displayed to the user, written to a file or used as the input for another command seamlessly, without the the program knowing which of these things is happening. This is possible because most UNIX programs use plain text as their input/output language, which is understood equally well by the three types of users - humans, files and other running programs.

Since this is such a fundamental feature of UNIX, I thought it would be nice to go through some of the standard tools that help the user take advantage of it. At first I thought of doing this as part of my man page reading club series, but in the end I decided to give them their own space. My other series has also been going on for more than a year now, so it is a good time to end it and start a new one.

Let me then introduce you to: UNIX text filters.

Text filters

For the purpose of this blog series, a text filter is a program that reads plain text from standard input and writes a modified, or filtered version of the same text to standard output. According to the introductory paragraph, this definition includes most UNIX programs; but we are going to focus on the following three, in increasing order of complexity:

grep
sed
awk

In order to unleash the true power of these tools, we first need to grasp the basics of regular expressions. And what better way to do it than following the dedicated OpenBSD manual page?

(Extended) regular expressions

Regular expressions, or regexes for short, are a convenient way to describe text patterns. They are commonly used to solve genering string matching problems, such as determining if a given piece of text is a valid URL. Many standard UNIX tools, including the three we are going to cover in this series, support regexes.

Let’s deal with the nasty part first: even within POSIX, there is not one single standard for regular expressions; there are at least two of them: Basic Regular Expressions (BREs) and Extended Regular Expressions (ERE). As it always happens when there is more than one standard for the same thing, other people decided to come up with another version to replace all previous “standards”, so we have also PCREs, and probably more. Things got out of hand quickly.

In this post I am going to follow the structure of re_format(7) and present extended regular expresssions first. After that I’ll point out the differences with basic regular expressions.

The goal is not to provide a complete guide to regexes, but rather an introduction to the most important features, glossing over the nasty edge-cases. Keep also in mind that I am in no way an expert on the subject: we are learning together, here!

The basics

You can think of a regular expression as a pattern, or a rule, that describes which strings are “valid” (they are matched by the regular expression) and which are not. As a trivial example, the regular expression hello matches only the string “hello”. A less trivial example is the regex .* that matches any string. I’ll explain why in a second.

Beware not to confuse regular expressions with shell globs, i.e. the rules for shell command expansion. Although they use similar symbols to achieve a similar goal, they are not the same thing. See my post on sh(1) or glob(7) for an explanation on shell globs.

General structure and terminology

A general regex looks something like this:

piece piece piece ... | piece piece piece ... | ...

A sequence of pieces is called a branch, and a regex is a sequence of branches separated by pipes |. Pieces are not separated by spaces, they are simply concatenated.

The pipes | are read “or”: a regex matches a given string if any of its branches does. A branch matches a given string if the latter can be written as a sequence of strings, each matching one of the pieces, in the given order.

Before going into what pieces are exactly, consider the following example:

hello|world

This regex matches both the string “hello” and the string “world”, and nothing else. The pieces are the single letters composing the two words, and as you can see they are juxtaposed without spaces.

But what else is a valid piece? In general, a piece is made up of an atom, optionally followed by a multiplier.

Atoms

As we have already seen, the most simple kind of atom is a single character. The most general kind of atom, on the other hand, is a whole regular expression enclosed in parentheses (). Yes, regexes are recursive.

There are some special characters: for example, a single dot . matches any single character. The characters ^ and $ match an empty string at the beginning and at the end of a line, respectively. If you want to match a special character as if it was regular, say because you want to match strings that represent values in the dollar currency, you can escape them with a backslash. For example \$ matches the string “$”.

The last kind of atoms are bracket expressions, which consist of lists of characters enclosed in brackets []. A simple list of characters in brackets, like [xyz], matches any character in the list, unless the first character is a ^, in which case it matches every character not in the list. Two characters separated by a dash - denote a range: for example [a-z] matches every lowercase letter and [1-7] matches all digits from 1 to 7.

You can also use cetain special sets of characters, like [:lower:] to match every lowercase letter (same as [a-z]), [:alnum:] to match every alphanumeric character or [:digit:] to match every decimal digit. Check the man page for the full list.

Multipliers

The term “multiplier” does not appear anywhere in the manual page, I made it up. But I think it fits, so I’ll keep using it.

Multipliers allow you to match an atom repeating a specified or unspecified amount of times. The most general one is the bound multiplier, which consists of one or two comma-separated numbers enclosed in braces {}.

In its most simple form, the multiplier {n} repeats the multiplied atom n times. For example, the regex a{7} is equivalent to the regex aaaaaaa (and it matches the string “aaaaaaa”).

The form {n,m} matches any number between n and m of copies of the preceeding atom. For example a{2,4} is equivalent to aa|aaa|aaaa. If the integer m is not specified, the multiplied atom matches any string that consists of at least n copies of the atom.

Now we can explain very quickly the more common multipliers +, * and ?: they are equivalent to {1,}, {0,} and {0,1} respectively. That is to say, + matches at least one copy of the atom, * matches any number of copies (including none) and ? matches either one copy or none.

Basic regular expressions

Basic regular expressions are less powerful than their extended counterpart (with one exception, see below) and require more backslashes, but it is worth knowing them, because they are used by default in some programs (for example ed(1)). The main differences between EREs and BREs are:

BREs consist of one single branch, i.e. there is no |.
Multipliers + and ? do not exist.
You need to escape parentheses  and braces \{\} to use them with their special meaning.

There is one feature of BREs, called back-reference, that is absent in EREs. Apparently it makes the implementation much more complex, and it makes BREs more powerful. I noticed the author of the manual page despises back-references, so I am not going to learn them out of respect for them.

Conclusion

Regexes are a powerful tool, and they are more than worth knowing. But, quoting from the manual page:

     Having two kinds of REs is a botch.

I hope you enjoyed this post, despite the lack of practical examples. If you want to see more applications of regular expressions, stay tuned for the next entries on grep, sed and awk!

Next in the series: grep