Text manipulation - cut, awk and sed

Tokenize strings using cut

If the fields are separated by a single known character (e.g. a comma or a tab) you can use cut. If no delimiter is given, cut splits on tabs. The format is as follows:

cut -d[delimiter] -f[field_number] [filename]

Where [delimiter] is the delimiter, [field_number] is the number of the field we want and [filename] is the name of the file containing the text.

E.g. To get the third field in some text, where the fields are separated by a comma:

me@pc ~/tmp $ cat cut-test.txt 
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
me@pc ~/tmp $ cut -d, -f3 cut-test.txt 
3
8
13

cut also reads from standard input, so you can pipe text into it: echo "1,2,3" | cut -d, -f2 will output 2.
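cut can also return several fields at once, since -f accepts a comma-separated list as well as ranges. A small sketch, assuming a line in the same comma-separated layout as cut-test.txt (the variable names are only for illustration):

```shell
# Sample line in the same format as cut-test.txt (illustrative data).
line="1,2,3,4,5"

# A list of fields: the first and the third.
first_and_third=$(printf '%s\n' "$line" | cut -d, -f1,3)

# A closed range: fields two to four.
two_to_four=$(printf '%s\n' "$line" | cut -d, -f2-4)

# An open range: field four to the end of the line.
four_onwards=$(printf '%s\n' "$line" | cut -d, -f4-)

printf '%s\n' "$first_and_third" "$two_to_four" "$four_onwards"
```

The selected fields are printed joined by the same delimiter that was used to split them.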

Gotcha - repeated delimiters

cut treats every single occurrence of the delimiter as a field boundary, and the delimiter must be exactly one character. So if you specify a single space as the delimiter and give cut input in which the fields are separated by 2 spaces, it sees an empty field between each pair of spaces and the field numbers no longer line up with what you expect.

E.g. When the fields are separated by 2 spaces:

me@pc ~/tmp $ echo "1  2  3" | cut -d' ' -f2

me@pc ~/tmp $ echo "1  2  3" | cut -d' ' -f3
2
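One possible workaround, if you want to stay with cut, is to squeeze the repeated spaces down to one before splitting. This is a sketch using tr -s and the same two-space input as above (the variable names are illustrative):

```shell
# Input with two spaces between fields, as in the example above.
input="1  2  3"

# tr -s ' ' squeezes each run of spaces into a single space,
# so cut sees exactly one delimiter between fields.
second=$(printf '%s\n' "$input" | tr -s ' ' | cut -d' ' -f2)

echo "$second"
```

Note that leading spaces on a line are squeezed to one space but not removed, so such a line would still start with an empty first field.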

Tokenize strings using awk

If you’ve got some text where the fields are separated by a varying amount of whitespace, you can use awk. By default awk treats any run of spaces or tabs as a single separator.

For example, the output of the df command separates the fields using spaces:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda6             10080488   7564840   2003580  80% /
none                    256964         0    256964   0% /dev/shm
/dev/hda5             10231392   9026744   1204648  89% /mnt/store

Because the number of delimiting spaces differs from line to line, cut cannot pick out the fields reliably. Instead, awk '{print $[n]}' can be used to output the [n]th field.

E.g:

me@pc ~/tmp $ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda6             10080488   7564840   2003580  80% /
none                    256964         0    256964   0% /dev/shm
/dev/hda5             10231392   9026744   1204648  89% /mnt/store
me@pc ~/tmp $ df | awk '{print $3}'
Used
7564840
0
9026744
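Two small variations that are often useful here: skipping the header line with NR, and setting a different field separator with -F. The sketch below reuses the df output shown above as fixed sample text so the result is repeatable; the variable names are illustrative only.

```shell
# Sample df output copied from the example above.
df_output='Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda6             10080488   7564840   2003580  80% /
none                    256964         0    256964   0% /dev/shm
/dev/hda5             10231392   9026744   1204648  89% /mnt/store'

# NR is the current line number; NR > 1 skips the header line.
used=$(printf '%s\n' "$df_output" | awk 'NR > 1 {print $3}')

# -F sets the field separator, here a comma instead of whitespace.
third=$(printf '1,2,3,4,5\n' | awk -F, '{print $3}')

printf '%s\n' "$used" "$third"
```

Against live output the first command would be written as df | awk 'NR > 1 {print $3}'.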

Remove a line from some text using sed

You can use the sed command to remove lines from a file or from piped input. sed '[n]d' prints everything except the [n]th line; the input itself is not changed.

E.g: To remove the second line from the df output given above…

me@pc ~/tmp $ df | sed '2d'
Filesystem           1K-blocks      Used Available Use% Mounted on
none                    256964         0    256964   0% /dev/shm
/dev/hda5             10231392   9026744   1204648  89% /mnt/store
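The d command also works with other addresses, such as a range of lines or $ for the last line. A short sketch on some made-up numbered lines (the input and variable names are illustrative):

```shell
# Five numbered lines of illustrative input.
input=$(printf 'line1\nline2\nline3\nline4\nline5\n')

# An address range: delete lines 2 to 4.
without_middle=$(printf '%s\n' "$input" | sed '2,4d')

# $ addresses the last line of the input.
without_last=$(printf '%s\n' "$input" | sed '$d')

# Deleting line 1 is a common way to drop a header, e.g. df | sed '1d'.
without_first=$(printf '%s\n' "$input" | sed '1d')

printf '%s\n' "$without_middle"
```

Keep the sed script in single quotes so that the shell does not try to expand $d as a variable.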

Escape spaces with sed

To escape all space characters in [file] by preceding them with a backslash:

sed "s/ /\\\ /g" [file]

The result is written to stdout; [file] itself is not modified.

E.g:

me@pc ~/tmp $ cat sed-test.txt 
one two three
four five
me@pc ~/tmp $ sed "s/ /\\\ /g" sed-test.txt 
one\ two\ three
four\ five
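The three backslashes are needed because the shell processes the double-quoted string before sed sees it. With single quotes the shell passes the script through unchanged, so two backslashes are enough. A sketch comparing the two forms on one line of the sample text (variable names are illustrative):

```shell
# Same illustrative text as the first line of sed-test.txt.
input='one two three'

# Double quotes: the shell turns \\ into \ and leaves "\ " alone,
# so sed receives the script s/ /\\ /g.
double_quoted=$(printf '%s\n' "$input" | sed "s/ /\\\ /g")

# Single quotes: the shell changes nothing, so sed receives s/ /\\ /g directly.
single_quoted=$(printf '%s\n' "$input" | sed 's/ /\\ /g')

printf '%s\n' "$double_quoted" "$single_quoted"
```

In both cases sed reads \\ in the replacement as one literal backslash, followed by the space.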

Last modified: 10/04/2016

This website is a personal resource. Nothing here is guaranteed correct or complete, so use at your own risk and try not to delete the Internet. -Stephan
