UNIX text filters, part 2.4 of 3: cut

This post is part of a series

Have you ever had to extract a bunch of data from a CSV file? CSV is a common file format where multiple values are stored in each line of a plain text file, separated by a comma or some other separator. In most cases it is quite a simple file format to deal with, unless you want to write a generic parser that has to take into account all the special cases. But let’s say you just want to write a quick and dirty shell script to read some values out of a single file. With cut you can get the job done pretty quickly!

cut

Getting straight to the point, if you want to print columns 1, 3 and 4 of each line of myfile.csv you can use:

$ cut -f 1,3,4 -d , myfile.csv

Let’s break this down.

Fields, characters and bytes

The -f option tells cut that you want to read lines field-by-field, where fields are are separated by the argument to the -d option. In our example the separator is a comma, but you can use any character. If unspecified, the separator defaults to a TAB.

Instead of -f you could use -c (character) or -b (byte). If you pick one of these, the separator is not to be specified, and instead of field-by-field the rows are read character-by-character or byte-by-byte. The difference between a byte and a character depends on your locale, more specifically on the value of the environment variable LC_CTYPE.

Picking columns

Columns are 1-based, hence the argument 1,3,4 gets, surprise surprise, the first, third and fourth columns of each line. The order you write the column indices does not matter: if you write 3,4,1 you still get the columns in the order they appear in the original file. If you repeat some indices, e.g. 1,3,4,1, the repeated column is printed only once.

You can also use ranges: for example 1,2,5-10 will print the first column, the second, and all the ones from the fifth to the tenth; as another example, -3 will print the first 3 columns - unbounded ranges are interpreted as “from the start” and “until the end”.

Examples

Let see some examples!

Simple csv parsing

Let’s say myfile.csv is the following:

2024-01-13,-,4.50,out
2024-02-04,groceries,52.42,out
2024-02-20,reimbursement,89.99,in
2024-03-10,stuff,1.01,out

Then running the following command command:

$ cut -f 3,4 -d , myfile.csv

will result in:

4.50,out
52.42,out
89.99,in
1.01,out

Fixed-width table

Say you have a table like this in table.txt:

|   WCA ID   |  Type  | Result | Days |
---------------------------------------
| 1982THAI01 | Single |  22.95 | 7749 |
| 2014CZAP01 | Single |   0.49 | 2443 |
| 2011TRON02 | Single |     16 | 1747 |
| 2015GORN01 | Single |   0.91 | 1673 |
| 2015DUYU01 | Single |   3.47 | 1660 |
| 2009ZEMD01 | Single |   6.88 | 1617 |

and you want to print out only the first and last columns. These columns are from character 2 to 13 and 33 to 38 respectively, or 1-14 and 32-29 if you include the borders. So you can select them with the -b or -c option (they are equivalent in this case) like this:

$ cut -c 1-13,32-39 table.txt

and you will get:

|   WCA ID   | Days |
---------------------
| 1982THAI01 | 7749 |
| 2014CZAP01 | 2443 |
| 2011TRON02 | 1747 |
| 2015GORN01 | 1673 |
| 2015DUYU01 | 1660 |
| 2009ZEMD01 | 1617 |

Since the ranges start at 1 and end at the last index, the following command would produce the same result:

$ cut -c -13,32- table.txt

Conclusion

I have not used cut much until today, the main reason being that the rare times I needed to parse a csv file I usually had to do something more complicated with the data than just printing it out. For this reason I have always relied on more complete languages, like C or Python, rather than shell scripting. But cut is definitely a convenient tool to be familiar with, given how simple it is!

Next in the series: expand and unexpand