Tokenize strings using cut
If you know the exact delimiter (e.g. a tab) you can use cut. The format is as follows:
cut -d[delimiter] -f[field_number] [filename]
Where [delimiter] is the delimiter, [field_number] is the number of the field we want and [filename] is the name of the file containing the text.
E.g. To get the third field in some text, where the fields are separated by a comma:
me@pc ~/tmp $ cat cut-test.txt
1,2,3,4,5
6,7,8,9,10
11,12,13,14,15
me@pc ~/tmp $ cut -d, -f3 cut-test.txt
3
8
13
Of course you can also pipe text into the command, so echo "1,2,3" | cut -d, -f2 will output 2.
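As a minimal sketch of two related uses (the sample strings are made up for illustration): when -d is omitted, cut splits on tabs, and -f also accepts a comma-separated list or a range of field numbers.

```shell
# Hypothetical sample data; printf turns \t into a real tab character.
tab_second=$(printf 'a\tb\tc\n' | cut -f2)         # no -d given: tab is the default delimiter
list_fields=$(echo "1,2,3,4,5" | cut -d, -f1,3)    # fields 1 and 3
range_fields=$(echo "1,2,3,4,5" | cut -d, -f2-4)   # fields 2 through 4

echo "$tab_second"     # b
echo "$list_fields"    # 1,3
echo "$range_fields"   # 2,3,4
```

When several fields are selected, cut joins them in the output with the same delimiter it split on.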
Gotcha - exact delimiter length
cut treats every single delimiter character as a field boundary and does not collapse runs of them. So, for example, you can’t specify a delimiter consisting of a single space, give cut input in which the fields are separated by 2 spaces, and expect it to accurately pick out the fields: each extra space produces an empty field.
E.g. When the fields are separated by 2 spaces:
me@pc ~/tmp $ echo "1  2  3" | cut -d' ' -f2

me@pc ~/tmp $ echo "1  2  3" | cut -d' ' -f3
2
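One possible workaround, sketched here as an assumption rather than something the original recommends, is to squeeze each run of spaces down to a single space with tr -s before handing the text to cut:

```shell
# Without squeezing, field 2 is the empty string between the two spaces.
raw=$(echo "1  2  3" | cut -d' ' -f2)

# tr -s ' ' collapses each run of spaces into one, so cut sees "1 2 3".
squeezed=$(echo "1  2  3" | tr -s ' ' | cut -d' ' -f2)

echo "raw=[$raw]"             # raw=[]
echo "squeezed=[$squeezed]"   # squeezed=[2]
```

This only helps when the separator is a run of one repeated character; leading spaces on a line would still produce an empty first field.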
Tokenize strings using awk
If you’ve got some text where the fields are separated by spaces or tabs, but the number of separating characters is unknown or varies, you can use awk. By default awk treats any run of spaces and tabs as a single field separator.
For example, the output of the df command separates the fields using spaces:
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda6             10080488   7564840   2003580  80% /
none                    256964         0    256964   0% /dev/shm
/dev/hda5             10231392   9026744   1204648  89% /mnt/store
Because the number of delimiting spaces differs on each line, and cannot be guaranteed, cut cannot be used. Instead, awk '{print $[n]}' can be used to output the [n]th field.
E.g:
me@pc ~/tmp $ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda6             10080488   7564840   2003580  80% /
none                    256964         0    256964   0% /dev/shm
/dev/hda5             10231392   9026744   1204648  89% /mnt/store
me@pc ~/tmp $ df | awk '{print $3}'
Used
7564840
0
9026744
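Two common variations, shown as a rough sketch on made-up sample data rather than live df output: an NR > 1 condition skips the header line, and -F sets a field separator other than whitespace.

```shell
# Hypothetical sample imitating df output, with uneven runs of spaces.
sample='Filesystem  1K-blocks     Used
/dev/hda6    10080488  7564840
none           256964        0'

# NR is the current line number, so NR > 1 skips the header row.
used=$(printf '%s\n' "$sample" | awk 'NR > 1 {print $3}')
printf '%s\n' "$used"

# -F, makes awk split on commas instead of whitespace.
third=$(echo "1,2,3,4,5" | awk -F, '{print $3}')
echo "$third"   # 3
```

The first command prints 7564840 and 0, one per line, without the Used header.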
Remove a line from some text using sed
You can use the sed command to remove lines from a file or from piped input. sed '[n]d' will remove the [n]th line.
E.g: To remove the second line from the df output given above:
me@pc ~/tmp $ df | sed '2d'
Filesystem           1K-blocks      Used Available Use% Mounted on
none                    256964         0    256964   0% /dev/shm
/dev/hda5             10231392   9026744   1204648  89% /mnt/store
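The same d command also accepts other addresses. As a small sketch on made-up input, [m],[n] selects a range of lines and $ stands for the last line:

```shell
# Hypothetical four-line input.
sample=$(printf 'line1\nline2\nline3\nline4\n')

# 2,3d deletes lines 2 through 3.
no_middle=$(printf '%s\n' "$sample" | sed '2,3d')

# $d deletes the last line; single quotes keep the shell from expanding $d.
no_last=$(printf '%s\n' "$sample" | sed '$d')

printf '%s\n' "$no_middle"   # line1, line4
printf '%s\n' "$no_last"     # line1, line2, line3
```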
Escape spaces with sed
To escape all space characters in [file] by preceding them with a backslash:
sed "s/ /\\\ /g" [file]
The result is written to stdout; [file] itself is not changed.
E.g:
me@pc ~/tmp $ cat sed-test.txt
one two three
four five
me@pc ~/tmp $ sed "s/ /\\\ /g" sed-test.txt
one\ two\ three
four\ five
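The triple backslash is only needed because of the double quotes: the shell reduces \\\ followed by a space to \\ followed by a space before sed sees it. A sketch of the equivalent command with single quotes, which pass the script to sed untouched:

```shell
# In single quotes the shell leaves backslashes alone, so sed receives s/ /\\ /g.
# In the replacement, \\ is a literal backslash, followed by the space.
escaped=$(echo "one two three" | sed 's/ /\\ /g')

# printf '%s\n' prints the value as-is, without interpreting the backslashes.
printf '%s\n' "$escaped"   # one\ two\ three
```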