UNIX text filters, part 2.4 of 3: cut
This post is part of a series
Have you ever had to extract a bunch of data from a
CSV file?
CSV is a common file format where multiple values are stored
in each line of a plain text file, separated by a comma or some
other separator.
In most cases it is quite a simple file format to deal with, unless
you want to write a generic parser that has to take into account
all the special cases. But let’s say you just want to write a quick
and dirty shell script to read some values out of a single file.
With cut you can get the job done pretty quickly!
cut
Getting straight to the point, if you want to print columns 1, 3
and 4 of each line of myfile.csv you can use:
$ cut -f 1,3,4 -d , myfile.csv
Let’s break this down.
Fields, characters and bytes
The -f option tells cut that you want to read lines field-by-field,
where fields are separated by the argument to the -d option.
In our example the separator is a comma, but you can use any
single character. If unspecified, the separator defaults to a TAB.
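A quick way to try out the separator, assuming a POSIX-like shell where printf is available (the input lines are made up for illustration):

```shell
# Explicit separator: fields are split on ':' and fields 1 and 3 are kept.
colon_out=$(printf 'a:b:c\n' | cut -d : -f 1,3)
echo "$colon_out"    # prints a:c

# No -d option: fields are split on TAB, the default separator.
tab_out=$(printf 'a\tb\tc\n' | cut -f 2)
echo "$tab_out"      # prints b
```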
Instead of -f you could use -c (character) or -b (byte). If
you pick one of these, you do not specify a separator, and the rows
are read character-by-character or byte-by-byte instead of
field-by-field. The difference between a byte and a character
depends on your locale, more specifically on the value of the
environment variable LC_CTYPE.
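For plain ASCII text every character is exactly one byte, so -c and -b select the same thing. Here is a small sketch with a made-up input. With multi-byte characters the two options may differ, and how -c treats them depends on your locale and on the cut implementation you have installed:

```shell
# On ASCII input, characters 2-4 and bytes 2-4 are the same three letters.
by_char=$(printf 'hello\n' | cut -c 2-4)
by_byte=$(printf 'hello\n' | cut -b 2-4)
echo "$by_char"    # prints ell
echo "$by_byte"    # prints ell
```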
Picking columns
Columns are 1-based, hence the argument 1,3,4 gets, surprise
surprise, the first, third and fourth columns of each line. The
order in which you write the column indices does not matter: if you
write 3,4,1 you still get the columns in the order they appear in the
original file. If you repeat some indices, e.g. 1,3,4,1, the
repeated column is printed only once.
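Both behaviours are easy to check on a one-line input. The line below is made up for illustration:

```shell
line='a,b,c,d'

# Indices given out of order: output still follows the order in the line.
reordered=$(printf '%s\n' "$line" | cut -d , -f 3,4,1)
echo "$reordered"    # prints a,c,d

# Index 1 given twice: the first column is still printed only once.
repeated=$(printf '%s\n' "$line" | cut -d , -f 1,3,4,1)
echo "$repeated"     # prints a,c,d
```

If you actually need to reorder or duplicate columns, cut is not the right tool; something like awk is probably a better fit.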
You can also use ranges: for example 1,2,5-10 will print the first
column, the second, and all the ones from the fifth to the tenth.
Ranges can also be unbounded on one side, in which case they are
interpreted as “from the start” or “until the end”: -3 will print
the first 3 columns, and 5- will print every column from the fifth
onwards.
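A short sketch of the three kinds of range, using a made-up line with twelve comma-separated fields:

```shell
line='c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12'

# Single indices mixed with a bounded range.
mixed=$(printf '%s\n' "$line" | cut -d , -f 1,2,5-10)
echo "$mixed"        # prints c1,c2,c5,c6,c7,c8,c9,c10

# Range with no lower bound: from the first field up to the third.
from_start=$(printf '%s\n' "$line" | cut -d , -f -3)
echo "$from_start"   # prints c1,c2,c3

# Range with no upper bound: from the tenth field to the end of the line.
to_end=$(printf '%s\n' "$line" | cut -d , -f 10-)
echo "$to_end"       # prints c10,c11,c12
```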
Examples
Let’s see some examples!
Simple CSV parsing
Let’s say myfile.csv is the following:
2024-01-13,-,4.50,out
2024-02-04,groceries,52.42,out
2024-02-20,reimbursement,89.99,in
2024-03-10,stuff,1.01,out
Then running the following command:
$ cut -f 3,4 -d , myfile.csv
will result in:
4.50,out
52.42,out
89.99,in
1.01,out
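This works because none of the values contains a comma. One of the special cases mentioned at the start is a quoted field with a comma inside it, and cut knows nothing about quoting: it splits on every separator it finds. A small sketch with a made-up line:

```shell
# The first field is quoted and contains a comma, but cut splits on it anyway.
second=$(printf '"Smith, John",42\n' | cut -d , -f 2)
echo "$second"    # prints  John" (with a leading space), not 42
```

So for files that may contain quoted separators, a proper CSV parser is the safer choice.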
Fixed-width table
Say you have a table like this in table.txt:
| WCA ID     | Type   | Result | Days |
---------------------------------------
| 1982THAI01 | Single | 22.95  | 7749 |
| 2014CZAP01 | Single | 0.49   | 2443 |
| 2011TRON02 | Single | 16     | 1747 |
| 2015GORN01 | Single | 0.91   | 1673 |
| 2015DUYU01 | Single | 3.47   | 1660 |
| 2009ZEMD01 | Single | 6.88   | 1617 |
and you want to print out only the first and last columns. These
columns are from character 2 to 13 and 33 to 38 respectively, or
1-14 and 32-39 if you include the borders. Stopping the first range
at 13 avoids printing the border at position 14 right next to the one
at position 32. So you can select them with the -b or -c option (they
are equivalent in this case) like this:
$ cut -c 1-13,32-39 table.txt
and you will get:
| WCA ID     | Days |
---------------------
| 1982THAI01 | 7749 |
| 2014CZAP01 | 2443 |
| 2011TRON02 | 1747 |
| 2015GORN01 | 1673 |
| 2015DUYU01 | 1660 |
| 2009ZEMD01 | 1617 |
Since the first range starts at the first character and the second one ends at the last character of each line, the following command would produce the same result:
$ cut -c -13,32- table.txt
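To check that the two forms agree, here is a sketch that runs both on a shortened copy of the table held in a shell variable. It assumes every line is padded to exactly 39 characters, as the column positions above imply:

```shell
# Header, rule and one data row, each exactly 39 characters wide.
table='| WCA ID     | Type   | Result | Days |
---------------------------------------
| 1982THAI01 | Single | 22.95  | 7749 |'

explicit=$(printf '%s\n' "$table" | cut -c 1-13,32-39)
open_ended=$(printf '%s\n' "$table" | cut -c -13,32-)

echo "$explicit"
# The two selections agree because no line extends past character 39.
[ "$explicit" = "$open_ended" ] && echo "same output"
```

If some lines were longer than 39 characters, the open-ended range 32- would print the extra characters as well, while 32-39 would not.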
Conclusion
I have not used cut much until today, the main reason being that
the rare times I needed to parse a CSV file I usually had to do
something more complicated with the data than just printing it out.
For this reason I have always relied on more complete languages,
like C or Python, rather than shell scripting. But cut is definitely
a convenient tool to be familiar with, given how simple it is!
Next in the series: expand and unexpand