sandra_henrystocker
Unix Dweeb

Extracting substrings on Linux

How-To
May 16, 2022 · 5 mins
Linux

[Image: puppeteer hands manipulating strings. Credit: Spencer Whalen / Getty Images]

There are many ways to extract substrings from lines of text on Linux, and doing so is especially useful when you are writing scripts that process large amounts of data. This post describes several commands that make extracting substrings easy.

Using bash parameter expansion

When using bash parameter expansion, you specify the starting position and the number of characters that you want to extract, using the form ${variable:offset:length}. For example, you can create a variable by assigning it a value and then use syntax like that shown below to select a portion of it.

$ string="Happy days are here again"
$ echo "${string:1:9}"
appy days
$ echo "${string:0:10}"
Happy days

As the example above shows, this technique starts position numbering at 0. A negative length is treated as a position counted back from the end of the string rather than as a character count (this requires bash 4.2 or later). So, in the next example, the 7 refers to the eighth character in the string and the -2 means to drop the last 2 characters. As a result, the substring in the first example below has a single character and the second has all but the last two.

$ string="1234567890"
$ echo ${string:7:-2}
8
$ echo ${string:0:-2}
12345678
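
Two related forms are worth knowing, shown here as a small sketch (the variable name "serial" and its value are only illustrations). Leaving out the length takes everything from the offset to the end of the string, and a negative offset counts back from the end. The space before the minus sign is required, because ${serial:-3} without the space means "use 3 as a default value" instead.

```shell
#!/bin/bash
serial="1234567890"      # hypothetical sample value

rest="${serial:7}"       # offset only: from the eighth character to the end
last3="${serial: -3}"    # negative offset (note the space): the last three characters
mid="${serial: -5:2}"    # start five characters from the end, take two

echo "$rest"     # 890
echo "$last3"    # 890
echo "$mid"      # 67
```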

In this next example, we first use "set --" to assign a string to the first positional parameter ($1) and then use echo to display its eighth and ninth characters. In other words, the expansion starts with the eighth character (offset 7) and then displays two characters.

$ set -- 01234567890abcdef
$ echo ${1:7:2}
78

NOTE: You could display the string created with the set command by simply using the command “echo $1”. This is what is referenced by the “1” in the example above.

$ set -- 01234567890abcdef
$ echo $1
01234567890abcdef

Using cut

The cut command can be used in several ways to yank substrings from text. The -c option allows you to select the character positions to be displayed. For cut, character numbering starts at 1.

$ echo "12345" | cut -c 1-3
123

In this next example, we select the last two words by character position. If the range you specify extends past the end of the line, cut simply displays the characters that are there.

$ echo "Have some fun" | cut -c 6-13
some fun
$ cut -c 6-13 <<< "Have some fun"
some fun

In addition, you can pipe text to the cut command or use the cut command to work with text in a file. Just be sure that the positions work for every line.

$ cat myfile                        $ cut -c 6-15 myfile
Have some fun                       some fun
Grab your lunch                     your lunch
Take nice nap                       nice nap
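
When the lines differ in length, an open-ended range saves you from guessing at an end position. The sketch below feeds the same three sample lines to cut with printf rather than from a file. "6-" selects from character 6 to the end of each line and "-4" selects from the start of the line through character 4. The variable names are only illustrative.

```shell
#!/bin/bash
# Sample lines taken from the myfile example above
lines=$(printf '%s\n' "Have some fun" "Grab your lunch" "Take nice nap")

# "6-" means from character 6 to the end of each line, however long it is
tail_part=$(printf '%s\n' "$lines" | cut -c 6-)

# "-4" means from the start of each line through character 4
head_part=$(printf '%s\n' "$lines" | cut -c -4)

echo "$tail_part"
echo "$head_part"
```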

The cut command can also work with delimiters and this often makes it a lot easier to use with files in which the words or fields don't line up precisely. To work with a file of mailing addresses, for example, you could do this to pull out the third field in the comma-separated addresses:

$ cat addresses                     $ cut -d, -f3 addresses
6803 Gravel Road,Hurlock,MD         MD
121 Blueberry Drive,Outback,VA      VA
1427 N 12th Street,Reading,PA       PA
2001 Turtle Road,Baker,WV           WV
264 Dakota Street,Groton,CT         CT
111 Mindless Circle,Celery,TX       TX
1089 Plymouth Drive,Rahway,NJ       NJ
949 Endless Lane,Hoboken,NJ         NJ
2001 Turtle Road,Outback,VA         VA

You can select multiple fields by specifying a range (e.g., "2-3") or a sequence (e.g., "2,3") as shown below.

$ cut -d, -f2-3 addresses           $ cut -d, -f2,3 addresses
Hurlock,MD                          Hurlock,MD
Outback,VA                          Outback,VA
Reading,PA                          Reading,PA
Baker,WV                            Baker,WV
Groton,CT                           Groton,CT
Celery,TX                           Celery,TX
Rahway,NJ                           Rahway,NJ
Hoboken,NJ                          Hoboken,NJ
Outback,VA                          Outback,VA
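
One quirk to keep in mind: cut always prints fields in the order in which they appear in the input, no matter how you list them, so it cannot be used to swap columns. The short sketch below runs against a single address line from the file above (the variable names are only illustrative) and also shows an open-ended field range.

```shell
#!/bin/bash
addr="6803 Gravel Road,Hurlock,MD"   # one line from the addresses example

street=$(echo "$addr" | cut -d, -f1)       # first field only
city_state=$(echo "$addr" | cut -d, -f2-)  # field 2 through the last field
reordered=$(echo "$addr" | cut -d, -f3,2)  # listed as 3,2 but printed in input order

echo "$street"       # 6803 Gravel Road
echo "$city_state"   # Hurlock,MD
echo "$reordered"    # Hurlock,MD
```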

Using awk

The awk command can also be used to extract substrings. Here's an example of pulling text from a supplied phrase:

$ awk '{print substr($0,6,8)}' <<< "Have some fun"
some fun

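The $0 represents the complete line of input, the 6 is the starting position (awk counts from 1) and the 8 is the number of characters to display.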
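
As a small sketch of two variations, you can leave out the third argument to get everything from the start position to the end of the line, and you can pass the positions in from the shell with awk's -v option instead of hard-coding them. The names "start" and "len" are only illustrative.

```shell
#!/bin/bash
phrase="Have some fun"

# Two-argument substr: from position 6 to the end of the line
from_six=$(echo "$phrase" | awk '{print substr($0, 6)}')

# Positions supplied as awk variables with -v
word=$(echo "$phrase" | awk -v start=6 -v len=4 '{print substr($0, start, len)}')

echo "$from_six"   # some fun
echo "$word"       # some
```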

To work with a file with delimited fields, use the -F (field delimiter) option. In this case, the delimiter is a comma. Use -F':' if the file is colon-delimited.

$ awk -F',' '{print $3}' addresses | sort | uniq
CT
MD
NJ
PA
TX
VA
WV

If the fields in your file are separated with both a comma and a space (e.g., "6803 Gravel Road, Hurlock, MD"), that is no problem for awk. Just specify the two-character separator in the command like this:

$ awk -F', ' '{print $3}' addresses | sort | uniq
CT
MD
NJ
PA
TX
VA
WV

In fact, if you want the awk command to work regardless of whether fields are separated with just commas or both commas and blanks, you can use a regular expression as the separator. The question mark makes the blank optional:

$ awk -F', ?' '{print $3}' addresses | sort | uniq
CT
MD
NJ
PA
TX
VA
WV

Using awk, you can also display two fields by using syntax like this:

$ awk -F',' '{print $2,$3}' addresses | sort | uniq
Baker WV
Celery TX
Groton CT
Hoboken NJ
Hurlock MD
Outback VA
Rahway NJ
Reading PA
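
The blank between the two fields in that output is awk's default output field separator. As a sketch, you can change it by setting OFS with the -v option, and, unlike cut, awk will also print fields in whatever order you list them. The example uses one line from the addresses file and illustrative variable names.

```shell
#!/bin/bash
line="949 Endless Lane,Hoboken,NJ"   # one line from the addresses example

# Default output separator is a single blank
default_sep=$(echo "$line" | awk -F',' '{print $2,$3}')

# OFS replaces the blank that the comma in the print statement produces
custom_sep=$(echo "$line" | awk -F',' -v OFS=', ' '{print $2,$3}')

# Fields can be printed in any order
swapped=$(echo "$line" | awk -F',' -v OFS=',' '{print $3,$2}')

echo "$default_sep"   # Hoboken NJ
echo "$custom_sep"    # Hoboken, NJ
echo "$swapped"       # NJ,Hoboken
```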

Using expr

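To use the expr command, type "expr substr" followed by your string, the start position (counting from 1) and the substring length. Note that substr is a GNU extension to expr, so it may not be available on non-Linux systems.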

$ expr substr "Have some fun" 6 8
some fun
$ str="Have some fun"
$ expr substr "$str" 6 8
some fun
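
Because expr counts positions from 1 and bash parameter expansion counts from 0, it is easy to be off by one when switching between the two. The sketch below defines a hypothetical helper function, substr1 (not a standard command), that accepts the same 1-based arguments as expr substr but does the work with parameter expansion, so no external command is run.

```shell
#!/bin/bash
# substr1 is a hypothetical helper: it takes a string, a 1-based start position
# and a length, like "expr substr", and converts the start to bash's 0-based offset
substr1() {
    local text=$1 start=$2 len=$3
    printf '%s\n' "${text:start-1:len}"
}

str="Have some fun"
substr1 "$str" 6 8     # some fun
substr1 "$str" 1 4     # Have
```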

Wrap-Up

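There are lots of ways to extract substrings on Linux, but each of the commands you might use has its own quirks and its own advantages. Bash parameter expansion needs no external command but numbers positions from 0, while cut, awk and expr start at 1, and cut and awk can also split lines on delimiters.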

Sandra Henry-Stocker has been administering Unix systems for more than 30 years. She describes herself as "USL" (Unix as a second language) but remembers enough English to write books and buy groceries. She lives in the mountains in Virginia where, when not working with or writing about Unix, she's chasing the bears away from her bird feeders.

The opinions expressed in this blog are those of Sandra Henry-Stocker and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.