CHAPTER 4

TRANSFORMING DATA WITH sed

In the prior chapter, we learned how to reduce a stream of data to only the contents that interested us. In this chapter, we learn how to transform that data using the Unix sed utility, which is an acronym for “stream editor.”

The first part of this chapter contains basic examples of the sed command, such as replacing and deleting strings, numbers, and letters. The second part of this chapter discusses various switches that are available for the sed command, along with an example of replacing multiple delimiters with a single delimiter in a dataset.

In the final section you will see a number of examples of how to perform stream-oriented processing on datasets, bringing the capabilities of sed together with the commands and regular expressions from prior chapters to accomplish difficult tasks with relatively simple code.

What Is the sed Command?

The name sed is an acronym for “stream editor,” and the utility derives many of its commands from the ed line-editor (ed was the first UNIX text editor). The sed command is a “non-interactive” stream-oriented editor that can be used to automate editing via shell scripts. This ability to modify an entire stream of data (which can be the contents of multiple files, in a manner similar to how grep behaves) as if you were inside an editor is not common in modern programming languages. This behavior allows some capabilities not easily duplicated elsewhere, while behaving exactly like any other command (grep, cat, ls, find, and so forth) in how it can accept data, output data, and pattern match with regular expressions.

Some of the more common uses for sed include: print matching lines, delete matching lines, and find/replace matching strings or regular expressions.

The sed Execution Cycle

Whenever you invoke the sed command, it repeats an execution cycle for each line of input, applying the commands that you specified, until the end of the file/input is reached. Specifically, an execution cycle performs the following steps:

Reads an entire line from stdin/file.

Removes any trailing newline.

Places the line in its pattern buffer.

Modifies the pattern buffer according to the supplied commands.

Prints the pattern buffer to stdout.
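The cycle can be observed with a minimal sketch, assuming a POSIX-compatible sed. The = command (not otherwise used in this chapter) prints the current line number once per cycle, before the pattern buffer itself is printed at the end of that cycle. The variable name cycle_out is only for illustration:

```shell
# Two input lines mean two cycles; each cycle prints the line number (=)
# and then, at the end of the cycle, the pattern buffer.
cycle_out=$(printf 'first\nsecond\n' | sed '=')
printf '%s\n' "$cycle_out"
```

Each line number appears immediately before its line, which shows that sed finishes one line before it reads the next.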

Matching String Patterns Using sed

The sed command requires you to specify a string in order to match the lines in a file. For example, suppose that the file numbers.txt contains the following lines:

1
2
123
3
five
4

The following sed command prints all the lines that contain the string 3:

cat numbers.txt |sed -n "/3/p"

Another way to produce the same result:

sed -n "/3/p" numbers.txt

In both cases the output of the preceding commands is as follows:

123
3

As we saw earlier with other commands, it is always more efficient to just read in the file using the sed command than to pipe it in with a different command. You can “feed” it data from another command, provided that other command adds value (such as adding line numbers, removing blank lines, or other similar helpful activities).

The -n option suppresses the automatic printing of each line, and the p option prints the matching line. If you omit the -n option, then every line is printed, and the p option causes the matching line to be printed again. Hence, you can issue the following command:

sed "/3/p" numbers.txt

The output (the data to the right of the colon) is as follows. Note that the labels to the left of the colon show the source of the data, to illustrate the “one row at a time” behavior of sed.

Basic stream output :1
Basic stream output :2
Basic stream output :123
Pattern Matched text:123
Basic stream output :3
Pattern Matched text:3
Basic stream output :five
Basic stream output :4

It is also possible to match two patterns and print everything between the lines that match:

sed -n "/123/,/five/p" numbers.txt

The output of the preceding command (all lines between 123 and five, inclusive) is here:

123
3
five
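A range can also mix a pattern with a line number, or with $, which stands for the last line of the input. The following is a small sketch of that idea; the temporary file simply recreates numbers.txt, and the variable names are hypothetical:

```shell
# Recreate the sample file from this section in a temporary location.
numbers=$(mktemp)
printf '1\n2\n123\n3\nfive\n4\n' > "$numbers"

# Start at the first line matching 123 and continue to the last line ($).
range_out=$(sed -n '/123/,$p' "$numbers")
printf '%s\n' "$range_out"
rm -f "$numbers"
```

This prints 123 and every line after it, which is handy when you know where a section starts but not where it ends.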

Substituting String Patterns Using sed

The examples in this section illustrate how to use sed to substitute new text for an existing text pattern.

x="abc"
echo $x |sed "s/abc/def/"

The output of the preceding code snippet is here:

def

In the prior command you have instructed sed to substitute ("s) the first text pattern (/abc) with the second pattern (/def) and no further instructions (/").

Deleting a text pattern is simply a matter of leaving the second pattern empty:

echo "abcdefabc" |sed "s/abc//"

The result is here:

defabc

As you see, this only removes the first occurrence of the pattern. You can remove all the occurrences of the pattern by adding the “global” terminal instruction (/g"):

echo "abcdefabc" |sed "s/abc//g"

The result of the preceding command is here:

def

Note that we are operating directly on the main stream with this command, as we are not using the -n tag. You can also suppress the main stream with -n and print the substitution, achieving the same output if you use the terminal p (print) instruction:

echo "abcdefabc" |sed -n "s/abc//gp"
def

For substitutions, either syntax will do, but that is not always true of other commands.
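Between "first occurrence only" and "every occurrence" there is a third possibility: a number in place of g selects one specific occurrence. The sketch below assumes a POSIX-compatible sed, and the variable name second_only is only for illustration:

```shell
# A number after the final delimiter selects which occurrence to replace.
second_only=$(echo "abcdefabc" | sed "s/abc/XYZ/2")
echo "$second_only"
```

Here only the second abc is replaced, so the first one is left untouched.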

You can also remove digits instead of letters, by using the numeric metacharacters as your regular expression match pattern (from Chapter 1):

ls svcc1234.txt |sed "s/[0-9]//g"
ls svcc1234.txt |sed -n "s/[0-9]//gp"

The result of either of the two preceding commands is here:

svcc.txt

Recall that the file columns4.txt contains the following text:

123 ONE TWO
456 three four
ONE TWO THREE FOUR
five 123 six
one two three
four five

The following sed command is instructed to identify the rows between 1 and 3, inclusive ("1,3), and delete (d") them from the output:

cat columns4.txt | sed "1,3d"

The output is here:

five 123 six
one two three
four five

The following sed command deletes a range of lines, starting from the line that matches 123 and continuing through the file until reaching the line that matches the string five (and also deleting all the intermediate lines). The syntax should be familiar from the earlier matching example:

sed "/123/,/five/d" columns4.txt

The output is here:

one two three
four five

Replacing Vowels in a String or a File

The following code snippet shows you how simple it is to replace multiple vowels in a string using the sed command:

echo "hello" | sed "s/[aeio]/u/g"

The output from the preceding code snippet is here:

hullu
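When each character should map to a different replacement, rather than all of them mapping to one, the y command is an alternative to s. This is a brief sketch, assuming a POSIX-compatible sed; the input string and the variable name vowels_upper are hypothetical:

```shell
# y/source/dest/ maps each character in the first list to the character
# at the same position in the second list.
vowels_upper=$(echo "hello world" | sed 'y/aeiou/AEIOU/')
echo "$vowels_upper"
```

Unlike s, the y command takes no regular expression and no g instruction; it always applies to every matching character on the line.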

Deleting Multiple Digits and Letters from a String

Suppose that we have a variable x that is defined as follows:

x="a123zAB 10x b 20 c 300 d 40w00"

Recall that an integer consists of one or more digits, so it matches the regular expression [0-9]+. However, you do not need to match whole numbers in order to remove every number from the variable x: because the g instruction repeats the substitution for every match on the line, the single-digit expression [0-9] is enough:

echo $x | sed "s/[0-9]//g"

The output of the preceding command is here:

azAB x b  c  d w

The following command removes all lowercase letters from the variable x:

echo $x | sed "s/[a-z]*//g"

The output of the preceding command is here:

123AB 10  20  300  4000

The following command removes all lowercase and uppercase letters from the variable x:

echo $x | sed "s/[a-zA-Z]//g"

The output of the preceding command is here:

123 10  20  300  4000
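The difference between matching one digit and matching a whole number becomes visible as soon as the replacement is not empty. The following sketch replaces with the placeholder letter N (an arbitrary choice for illustration) so that you can count how many matches were made:

```shell
x="a123zAB 10x b 20 c 300 d 40w00"

# One match per digit: every digit becomes its own N.
per_digit=$(echo "$x" | sed "s/[0-9]/N/g")

# One match per run of digits: each whole number becomes a single N.
per_number=$(echo "$x" | sed "s/[0-9][0-9]*/N/g")

echo "$per_digit"
echo "$per_number"
```

The expression [0-9][0-9]* is the basic regular expression equivalent of [0-9]+ and requires at least one digit, which avoids the empty matches that [0-9]* alone would allow.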

Search and Replace with sed

The previous section showed you how to delete a range of rows of a text file, based on a start line and end line, using either a numeric range or a pair of strings. As deleting is just substituting an empty result for what you match, it should now be clear that a replace activity involves populating that part of the command with something that achieves your desired outcome. This section contains various examples that illustrate how to get the exact substitution you desire.

The following examples illustrate how to convert lowercase abc to uppercase ABC in sed:

echo "abc" |sed "s/abc/ABC/"

The output of the preceding command is here (only the first occurrence of abc is replaced):

ABC

The next command adds the g instruction:

echo "abcdefabc" |sed "s/abc/ABC/g"

The output of the preceding command is here (/g" means that every occurrence of abc is replaced):

ABCdefABC

The following sed expression performs three consecutive substitutions, using -e to string them together. It changes exactly one (the first) a to A, one b to B, one c to C:

echo "abcde" |sed -e "s/a/A/" -e "s/b/B/" -e "s/c/C/"

The output of the preceding command is here:

ABCde

Obviously, you can use the following sed expression that combines the three substitutions into one substitution:

echo "abcde" |sed "s/abc/ABC/"

Nevertheless, the -e switch is useful when you need to perform more complex substitutions that cannot be combined into a single substitution.
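As a small illustration of substitutions that cannot be merged, the following sketch changes the first and the last letter of a string, which are not adjacent. It also shows the semicolon form, which many sed implementations accept as an alternative to repeating -e; the variable names are hypothetical:

```shell
# Two unrelated substitutions, chained with -e.
ends_upper=$(echo "abcde" | sed -e "s/a/A/" -e "s/e/E/")

# The same two commands in a single script, separated by a semicolon.
same_result=$(echo "abcde" | sed "s/a/A/;s/e/E/")

echo "$ends_upper"
echo "$same_result"
```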

The “/” character is not the only delimiter that sed supports, which is useful when strings contain the “/” character. For example, you can reverse the order of /aa/bb/cc with this command:

echo "/aa/bb/cc" |sed -n "s#/aa/bb/cc#/cc/bb/aa/#p"

The output of the preceding sed command is here:

/cc/bb/aa/

The following example illustrates how to use the “w” terminal instruction to write the result of a successful substitution to a named file (upper1) in addition to standard output:

echo "abcdefabc" |sed "s/abc/ABC/wupper1"

The output is here:

ABCdefabc

If you examine the contents of the text file upper1 you will see that it contains the same string ABCdefabc that is displayed on the screen. This two-stream behavior that we noticed earlier with the print (“p”) terminal command is unusual, but sometimes useful. It is more common to simply send the standard output to a file using the “>” syntax, as shown in the following (both syntaxes work for a replace operation), but in that case nothing is written to the terminal screen. The previous syntax allows both at the same time:

echo "abcdefabc" | sed "s/abc/ABC/" > upper1
echo "abcdefabc" | sed -n "s/abc/ABC/p" > upper1

Listing 4.1 displays the contents of update2.sh that replace the occurrence of the string hello with the string goodbye in the files with the suffix txt in the current directory.

LISTING 4.1 update2.sh
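for f in *txt
do
  newfile="${f}_new"
  # No -n/p here: with -n and the p flag, only the lines that changed
  # would be written, and every other line of the file would be lost.
  sed "s/hello/goodbye/g" "$f" > "$newfile"
  mv "$newfile" "$f"
done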

Listing 4.1 contains a for loop that iterates over the files with the txt suffix in the current directory. For each such file, the loop initializes the variable newfile by appending the string _new to the name of the file (represented by the variable f). Next, it replaces the occurrences of hello with the string goodbye in file f and redirects the output to $newfile. Finally, it renames $newfile to $f using the mv command, which overwrites the original file.
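Because a loop like this overwrites files, it is worth trying it in a scratch directory first. The following is a sketch under stated assumptions: the directory, the two sample files, and their contents are all made up for the test, and the sed command is written without -n so that lines that do not contain hello are kept:

```shell
# Work in a scratch directory so that no real files are touched.
workdir=$(mktemp -d)
printf 'hello world\nunrelated line\n' > "$workdir/a.txt"
printf 'say hello twice: hello\n' > "$workdir/b.txt"

for f in "$workdir"/*txt
do
  newfile="${f}_new"
  # Every line is written to the new file, changed or not.
  sed "s/hello/goodbye/g" "$f" > "$newfile"
  mv "$newfile" "$f"
done

a_after=$(cat "$workdir/a.txt")
b_after=$(cat "$workdir/b.txt")
printf '%s\n%s\n' "$a_after" "$b_after"
rm -rf "$workdir"
```

The line "unrelated line" survives the update, which is the behavior you normally want from a search-and-replace over whole files.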

If you want to perform the update in matching files in all subdirectories, replace the “for” statement with the following:

for f in `find . -print |grep "txt$"`

Datasets with Multiple Delimiters

Listing 4.2 displays the contents of the dataset delimiter1.txt, which contains multiple delimiters “|”, “:”, and “^”. Listing 4.3 displays the contents of delimiter1.sh, which illustrates how to replace the various delimiters in delimiter1.txt with a single comma delimiter “,”.

LISTING 4.2 delimiter1.txt
1000|Jane:Edwards^Sales
2000|Tom:Smith^Development
3000|Dave:Del Ray^Marketing

LISTING 4.3 delimiter1.sh
inputfile="delimiter1.txt"
cat $inputfile | sed -e 's/:/,/' -e 's/|/,/' -e 's/\^/,/'

As you can see, the second line in Listing 4.3 is simple yet very powerful: you can extend the sed command with as many delimiters as you require in order to create a dataset with a single delimiter between values. The output from Listing 4.3 is shown here:

1000,Jane,Edwards,Sales
2000,Tom,Smith,Development
3000,Dave,Del Ray,Marketing

Do keep in mind that this kind of transformation can be a bit unsafe unless you have checked that your new delimiter is not already in use. For that a grep command is useful (you want the resulting count to be zero):

grep -c ',' $inputfile
0
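The delimiters can also be collected into a single bracket expression, which together with the g instruction handles lines where the same delimiter appears more than once. This is a minimal sketch using two rows in the style of delimiter1.txt; the variable name delim_out is hypothetical:

```shell
# Inside brackets, | and ^ are ordinary characters (as long as ^ is not
# first), so one substitution covers all three delimiters.
delim_out=$(printf '1000|Jane:Edwards^Sales\n2000|Tom:Smith^Development\n' | sed 's/[|:^]/,/g')
printf '%s\n' "$delim_out"
```

Whether you prefer one bracket expression or several -e substitutions is largely a matter of readability; the separate form is easier to extend with multi-character delimiters.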

Useful Switches in sed

The three command line switches -n, -e, and -i are useful when you specify them with the sed command.
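The -i switch edits a file in place instead of writing the result to standard output. It is not part of the POSIX standard, and its exact syntax differs between implementations, so treat the following as a sketch: it uses the attached-suffix form (-i.bak), which both GNU sed and BSD sed accept and which keeps a backup copy of the original. The file and variable names are made up for the example:

```shell
scratch=$(mktemp)
printf 'foo one\nkeep me\n' > "$scratch"

# -i.bak rewrites the file itself and saves the original as "$scratch.bak".
sed -i.bak 's/foo/bar/' "$scratch"

edited=$(cat "$scratch")
backup=$(cat "$scratch.bak")
printf '%s\n' "$edited"
rm -f "$scratch" "$scratch.bak"
```

Keeping the backup is a sensible default while you are still testing an expression, because an in-place edit cannot otherwise be undone.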

As a review, specify -n when you want to suppress the printing of the basic stream output:

sed -n 's/foo/bar/'

Specify -n and end with /p' when you want to print only the lines on which a substitution was made:

sed -n 's/foo/bar/p'

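We briefly touched on using -e to do multiple substitutions, but it can also be used to combine other commands. For example, the following syntax applies the substitution and then runs p as a separate command. Note that this is not identical to the previous example: a separate p prints every line (after any substitution), not just the lines that changed: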

sed -n -e 's/foo/bar/' -e 'p'
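One way to see what each form prints is to feed both of them the same two lines, only one of which contains foo. The input and the variable names in this sketch are hypothetical:

```shell
# p as a flag of s: printed only when the substitution succeeds.
flag_out=$(printf 'foo\nbaz\n' | sed -n 's/foo/bar/p')

# p as its own command: runs on every line, changed or not.
cmd_out=$(printf 'foo\nbaz\n' | sed -n -e 's/foo/bar/' -e 'p')

printf 'flag form:\n%s\n' "$flag_out"
printf 'command form:\n%s\n' "$cmd_out"
```

The flag form prints only bar, while the command form prints bar and baz.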

A more advanced example that hints at the flexibility of sed involves the insertion of a character after a fixed number of positions. For example, consider the following code snippet:

echo "ABCDEFGHIJKLMNOPQRSTUVWXYZ" | sed "s/.\{3\}/&\n/g"

The output from the preceding command is here:

ABCnDEFnGHInJKLnMNOnPQRnSTUnVWXnYZ

While the above example does not seem especially useful, consider a large text stream with no line breaks (everything on one line). You could use something like this to insert newline characters, or something else to break the data into easier to process chunks. It is possible to work through exactly what sed is doing by looking at each element of the command and comparing it to the output, even if you don’t know the syntax. (Tip: sometimes you will encounter very complex instructions for sed without any documentation in the code: try not to be that person when coding.)

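The output changes after every three characters, and we know that a dot (.) matches any single character, so .\{3\} must be telling sed to match three characters at a time (the braces are escaped with backslashes because sed's basic regular expression syntax treats an unescaped { as an ordinary character, so .{3} would not be interpreted as a repetition count). In the replacement, & stands for the text that was matched, which is why the original three characters are kept rather than replaced. In the output shown above, \n was inserted as a plain letter n; some versions of sed, including GNU sed, interpret \n in the replacement as a newline and print each group of three characters on its own line instead. The terminal g command of course means to repeat. To clarify and confirm guesses like these, take what you could infer and check the sed documentation or perform an Internet search.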
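If the goal really is to insert line breaks, a form that should behave the same across sed implementations is a backslash followed by an actual line break in the replacement. The sketch below uses a shorter input string and a hypothetical variable name:

```shell
# A backslash followed by a real line break puts a newline into the
# replacement text; & is the three characters that were matched.
chunks=$(echo "ABCDEFGH" | sed 's/.\{3\}/&\
/g')
printf '%s\n' "$chunks"
```

The single quotes matter here: inside double quotes the shell would remove the backslash and the line break before sed ever saw them.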

Working with Datasets

The sed utility is very useful for manipulating the contents of text files. For example, you can print ranges of lines, and subsets of lines that match a regular expression. You can also perform search-and-replace on the lines in a text file. This section contains examples that illustrate how to perform such functionality.

Printing Lines

Listing 4.4 displays the contents of test4.txt (double-spaced lines) that are used for several examples in this section.

LISTING 4.4 test4.txt
abc

def

abc

abc

The following code snippet prints the first 3 lines in test4.txt (we used this syntax before when deleting rows; it is equally useful for printing):

cat test4.txt |sed -n "1,3p"

The output of the preceding code snippet is here (the second line is blank):

abc

def

The following code snippet prints lines 3 through 5 in test4.txt:

cat test4.txt |sed -n "3,5p"

The output of the preceding code snippet is here:

def

abc

The following code snippet takes advantage of the basic output stream and the second match stream to duplicate every line (including blank lines) in test4.txt:

cat test4.txt |sed "p"

The output of the preceding code snippet is here:

abc
abc


def
def


abc
abc


abc
abc

The following code snippet prints the first three lines and then capitalizes the string abc. ABC appears twice in the final output because the second sed command omits -n and ends with /p". Remember that /p" prints only the lines on which a substitution was made, whereas the basic output prints every line, which is why def is not duplicated:

cat test4.txt |sed -n "1,3p" |sed "s/abc/ABC/p"

The output of the preceding code snippet is here:

ABC
ABC

def
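A double-spaced file such as test4.txt is also a natural candidate for the opposite operation, removing the blank lines. The following sketch recreates the file in a temporary location (the variable names are hypothetical) and deletes every empty line:

```shell
spaced=$(mktemp)
printf 'abc\n\ndef\n\nabc\n\nabc\n' > "$spaced"

# /^$/ matches a line with nothing between its start and its end.
compact=$(sed '/^$/d' "$spaced")
printf '%s\n' "$compact"
rm -f "$spaced"
```

Note that /^$/ matches only lines that are completely empty; a line containing spaces or tabs would need a pattern such as /^[[:blank:]]*$/.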

Character Classes and sed

You can also use regular expressions with sed. As a reminder, here are the contents of columns4.txt:

123 ONE TWO
456 three four
ONE TWO THREE FOUR
five 123 six
one two three
four five

As our first example involving sed and character classes, the following code snippet illustrates how to match lines that contain digits:

cat columns4.txt | sed -n '/[0-9]/p'

The output from the preceding snippet is here:

123 ONE TWO
456 three four
five 123 six

The following code snippet illustrates how to match lines that contain lowercase letters:

cat columns4.txt | sed -n '/[a-z]/p'

The output from the preceding snippet is here:

456 three four
five 123 six
one two three
four five

The following code snippet illustrates how to match lines that contain the numbers 4, 5, or 6:

cat columns4.txt | sed -n '/[4-6]/p'

The output from the preceding snippet is here:

456 three four

The following code snippet illustrates how to match lines that start with any two characters followed by one or more E characters (EE* means an E followed by zero or more additional E characters):

cat columns4.txt | sed -n '/^.\{2\}EE*/p'

The output from the preceding snippet is here:

ONE TWO THREE FOUR
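Two variations are worth knowing: the POSIX named classes such as [[:digit:]], and the ! modifier, which applies a command to the lines that do not match. The sketch below recreates columns4.txt in a temporary file; the variable names are hypothetical:

```shell
cols=$(mktemp)
printf '123 ONE TWO\n456 three four\nONE TWO THREE FOUR\nfive 123 six\none two three\nfour five\n' > "$cols"

# [[:digit:]] is the named-class equivalent of [0-9].
with_digits=$(sed -n '/[[:digit:]]/p' "$cols")

# The ! after the address inverts it: print lines WITHOUT a digit.
without_digits=$(sed -n '/[[:digit:]]/!p' "$cols")

printf 'with digits:\n%s\n' "$with_digits"
printf 'without digits:\n%s\n' "$without_digits"
rm -f "$cols"
```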

Removing Control Characters

Listing 4.5 displays the contents of controlchars.txt that we used before in Chapter 2. Control characters of any kind can be removed by sed just like any other character.

LISTING 4.5 controlchars.txt
1 carriage return^M
2 carriage return^M
1 tab character^I

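The following command removes the carriage return and the tab characters from the text file controlchars.txt (the ^M and the tab inside the two sed expressions must be typed as literal control characters, not as a caret followed by M):

cat controlchars.txt | sed "s/^M//" |sed "s/   //"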

You cannot see the tab character in the second sed command in the preceding code snippet; however, if you redirect the output to the file nocontrol1.txt, you can see that there are no embedded control characters in this new file by typing the following command:

cat -t nocontrol1.txt
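Since literal control characters are invisible on the page and easy to lose when copying a command, one alternative is to let printf generate them. The following is a sketch rather than the method used above; the two input lines imitate controlchars.txt, and the variable names cr, tab, and cleaned are hypothetical:

```shell
# printf produces the real control characters, so nothing invisible
# has to be typed into the sed expressions.
cr=$(printf '\r')
tab=$(printf '\t')

cleaned=$(printf '1 carriage return\r\n1 tab character\t\n' | sed -e "s/$cr//" -e "s/$tab//")
printf '%s\n' "$cleaned"
```

The double quotes around the sed expressions are required here so that the shell expands $cr and $tab before sed runs.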

Counting Words in a Dataset

Listing 4.6 displays the contents of wordcountinfile.sh, which illustrates how to combine various bash commands in order to count the words (and their occurrences) in a file.

LISTING 4.6 wordcountinfile.sh

# xargs -n1 places each word of the file on its own line
# tr changes uppercase letters to lowercase
# sed removes periods and commas, then changes any remaining space to a newline
# sort groups identical words together, because uniq only counts adjacent lines
# uniq -c reduces each group to one line, prefixed by the number of occurrences
# the final sort orders the data by the word count, largest first
cat "$1" | xargs -n1 | tr A-Z a-z | \
sed -e 's/\.//g' -e 's/,//g' -e 's/ /\
/g' | \
sort | uniq -c | sort -nr

The previous command performs the following operations:

List each word in each line of the file,

shift characters to lowercase,

filter out periods and commas,

change space between words to linefeed, and

remove duplicates, prefix occurrence count, and sort numerically.
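To see these steps produce an actual result, the pipeline can be run on a one-line sample instead of a file. In the sketch below the sample sentence is made up, and the final awk command is an addition that only trims the padding that uniq -c places in front of each count:

```shell
# Same stages as the listing, fed from printf instead of a file.
word_counts=$(printf 'The cat. The dog, the cat\n' | xargs -n1 | tr A-Z a-z | \
sed -e 's/\.//g' -e 's/,//g' | \
sort | uniq -c | sort -nr | awk '{print $1, $2}')
printf '%s\n' "$word_counts"
```

The word "the" is counted three times because tr lowercases the input before the words are compared.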

Back References in sed

In the chapter describing grep you learned about back references, and similar functionality is available with the sed command. The main difference is that the back references can also be used in the replacement section of the command.

The following sed command matches the consecutive “a” letters and prints four of them:

echo "aa" |sed -n "s#\([a-z]\)\1#\1\1\1\1#p"

The output of the preceding code snippet is here:

aaaa

The following sed command captures aa, bb, and cc as three groups and then uses only the first back reference, so that bb and cc are replaced with aa:

echo "aa/bb/cc" |sed -n "s#\(aa\)/\(bb\)/\(cc\)#\1/\1/\1/#p"

The output of the previous sed command is here (note the trailing “/” character):

aa/aa/aa/

The following command inserts a comma in a four-digit number:

echo "1234" |sed -n "s@\([0-9]\)\([0-9]\)\([0-9]\)\([0-9]\)@\1,\2\3\4@p"

The preceding sed command uses the @ character as a delimiter. The character class [0-9] matches one single digit. Since there are four digits in the input string 1234, the character class [0-9] is repeated 4 times, and the value of each digit is stored in \1, \2, \3, and \4. The output from the preceding sed command is here:

1,234

A more general sed expression that can insert a comma in five-digit numbers is here:

echo "12345" | sed 's/\([0-9]\{3\}\)$/,\1/g;s/^,//'

The output of the preceding command is here:

12,345
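For numbers of any length, a commonly seen approach repeats a substitution until it no longer matches. It relies on two commands that this chapter does not otherwise cover, a label (:a) and a conditional branch (ta, "go back to a if the last substitution succeeded"), so treat the following as a sketch to explore rather than a full explanation; the variable name grouped is hypothetical:

```shell
# Each pass inserts one comma before the last group of three digits
# that is still preceded by a digit; ta repeats until nothing changes.
grouped=$(echo "1234567" | sed -e ':a' -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/' -e 'ta')
echo "$grouped"
```

The first pass produces 1234,567 and the second produces 1,234,567, after which the pattern no longer matches and the loop ends.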

Displaying Only “Pure” Words in a Dataset

In the previous chapter we solved this task using the egrep command, and this section shows you how to solve this task using the sed command. For simplicity, let’s work with a text string, and that way we can see the intermediate results as we work toward the solution. The approach will be similar to the code block shown earlier that counted unique words. Let’s initialize the variable x as shown here:

x="ghi abc Ghi 123 #def5 123z"

The first step is to split x into one word per line by replacing space with newlines:

echo $x |tr -s ' ' '\n'

The output is here:

ghi
abc
Ghi
123
#def5
123z

The second step is to invoke sed with the regular expression ^[a-zA-Z][a-zA-Z]*$, which matches any line consisting of one or more uppercase and/or lowercase letters (and nothing else). Note that the -E switch is needed so that the unescaped parentheses are treated as a group, as this is part of the newer/modern regular expression syntax that was not available when sed was new.

echo $x |tr -s ' ' '\n' |sed -nE "s/(^[a-zA-Z][a-zA-Z]*$)/\1/p"

The output is here:

ghi
abc
Ghi

If you also want to sort the output and print only the unique words, pipe the result to the sort and uniq commands:

echo $x |tr -s ' ' '\n' |sed -nE "s/(^[a-zA-Z][a-zA-Z]*$)/\1/p"|sort|uniq

The output is here:

Ghi
abc
ghi

If you want to extract only the integers in the variable x, use this command:

echo $x |tr -s ' ' '\n' |sed -nE "s/(^[0-9][0-9]*$)/\1/p"|sort|uniq

The output is here:

123

If you want to extract alphanumeric words from the variable x, use this command:

echo $x |tr -s ' ' '\n' |sed -nE "s/(^[0-9a-zA-Z][0-9a-zA-Z]*$)/\1/p"|sort|uniq

The output is here:

123
123z
Ghi
abc
ghi

Now you can replace echo $x with a dataset in order to retrieve only alphabetic strings from that dataset.
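Because the substitution above replaces each matching line with itself, the same filtering can be sketched with a plain pattern match and the p command, as in the first examples of this chapter. This variant needs neither -E nor a back reference; the variable name pure_words is hypothetical:

```shell
x="ghi abc Ghi 123 #def5 123z"

# Print only the lines that consist entirely of letters.
pure_words=$(echo "$x" | tr -s ' ' '\n' | sed -n '/^[a-zA-Z][a-zA-Z]*$/p')
printf '%s\n' "$pure_words"
```

The substitution form becomes useful again when you want to change the matching words while you select them.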

One-Line sed Commands

This section is intended to show a lot of the more useful problems you can solve with a single line of sed, and to expose you to yet more switches and arguments that you can mix and match to solve related tasks.

Moreover, sed supports other options (which are beyond the scope of this book) to perform many other tasks, some of which are sophisticated and correspondingly complex. If you encounter something that none of the examples in this chapter cover, but it seems like the sort of thing sed might do, the odds are decent that it does: an Internet search along the lines of “how do I do <xxx> in sed” will likely either point you in the right direction, or at least to an alternative bash command that will be helpful.

Listing 4.7 displays the contents of data4.txt that are referenced in some of the sed commands in this section. Note that some examples contain options that have not been discussed earlier in this chapter: they are included in case you need the desired functionality (and you can find more details by reading online tutorials).

LISTING 4.7 data4.txt
hello world4
        hello world5 two
 hello world6 three
                hello world4 four
line five
line six
line seven

Print the first line of data4.txt with this command:

sed q < data4.txt

The output is here:

  hello world4

Print the first three lines of data4.txt with this command:

sed 3q < data4.txt

The output is here:

  hello world4
   hello world5 two
hello world6 three

Print the last line of data4.txt with this command:

sed '$!d' < data4.txt

The output is here:

line seven

You can also use this snippet to print the last line:

sed -n '$p' < data4.txt

Print the last two lines of data4.txt with this command:

sed '$!N;$!D' <data4.txt

The output is here:

line six
line seven

Print the lines of data4.txt that do not contain world with this command:

sed '/world/d' < data4.txt

The output is here:

line five
line six
line seven

Print duplicates of the lines in data4.txt that contain the word world with this command:

sed '/world/p' < data4.txt

The output is here:

  hello world4
  hello world4
   hello world5 two
   hello world5 two
hello world6 three
hello world6 three
          hello world4 four
          hello world4 four
line five
line six
line seven

Print the fifth line of data4.txt with this command:

sed -n '5p' < data4.txt

The output is here:

line five

Print the contents of data4.txt and duplicate line five with this command:

sed '5p' < data4.txt

The output is here:

  hello world4
   hello world5 two
hello world6 three
          hello world4 four
line five
line five
line six
line seven

Print lines four through six of data4.txt with this command:

sed -n '4,6p' < data4.txt

The output is here:

        hello world4 four
line five
line six

Delete lines four through six of data4.txt with this command:

sed '4,6d' < data4.txt

The output is here:

  hello world4
   hello world5 two
 hello world6 three
line seven

Delete the section of lines between world6 and six in data4.txt with this command:

sed '/world6/,/six/d' < data4.txt

The output is here:

   hello world4
    hello world5 two
line seven

Print the section of lines between world6 and six of data4.txt with this command:

sed -n '/world6/,/six/p' < data4.txt

The output is here:

hello world6 three
           hello world4 four
line five
line six

Print the contents of data4.txt and duplicate the section of lines between world6 and six with this command:

sed '/world6/,/six/p' < data4.txt

The output is here:

  hello world4
   hello world5 two
hello world6 three
hello world6 three
          hello world4 four
          hello world4 four
line five
line five
line six
line six
line seven

Delete the even-numbered lines in data4.txt with this command:

sed 'n;d;' <data4.txt

The output is here:

   hello world4
 hello world6 three
line five
line seven

Replace letters a through m with a “,” with this command:

sed "s/[a-m]/,/g" <data4.txt

The output is here:

  ,,,,o wor,,4
    ,,,,o wor,,5 two
,,,,o wor,,6 t,r,,
           ,,,,o wor,,4 ,our
,,n, ,,v,
,,n, s,x
,,n, s,v,n

Replace letters a through m with the characters “,@#” with this command:

sed "s/[a-m]/,@#/g" <data4.txt

The output is here:

  ,@#,@#,@#,@#o wor,@#,@#4
    ,@#,@#,@#,@#o wor,@#,@#5 two
 ,@#,@#,@#,@#o wor,@#,@#6 t,@#r,@#,@#
         ,@#,@#,@#,@#o wor,@#,@#4 ,@#our
,@#,@#n,@# ,@#,@#v,@#
,@#,@#n,@# s,@#x
,@#,@#n,@# s,@#v,@#n

Not every implementation of sed recognizes escape sequences such as \t (GNU sed does, but some other versions do not), so the portable approach is to supply a literal tab in the command. In the case of the bash shell, enter the control character ^V and then press the <TAB> key in order to insert a <TAB> character.

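Delete the tab characters in data4.txt with this command (printf '\t' supplies the literal tab, which would otherwise be invisible on the page):

sed "s/$(printf '\t')//g" <data4.txt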

The output is here:

   hello world4
hello world5 two
 hello world6 three
hello world4 four
line five
line six
line seven

Delete the tab characters and blank spaces in data4.txt with this command (the [[:blank:]] character class matches both spaces and tabs):

sed 's/[[:blank:]]//g' <data4.txt

The output is here:

helloworld4
helloworld5two
helloworld6three
helloworld4four
linefive
linesix
lineseven
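Often you want to remove only the indentation and keep the spaces between words. The following sketch anchors the match to the start of the line with ^; the input imitates the first four lines of data4.txt, and the variable name trimmed is hypothetical:

```shell
# ^[[:blank:]]* matches the run of spaces and tabs at the start of a
# line only, so spaces between words are left alone.
trimmed=$(printf 'hello world4\n\thello world5 two\n hello world6 three\n\t\thello world4 four\n' | sed 's/^[[:blank:]]*//')
printf '%s\n' "$trimmed"
```

No g instruction is needed, because the anchored pattern can match at most once per line.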

Replace every line of data4.txt with the word pasta with this command:

sed 's/.*/pasta/' < data4.txt

The output is here:

pasta
pasta
pasta
pasta
pasta
pasta
pasta

Insert two blank lines after the third line and one blank line after the fifth line in data4.txt with this command:

sed '3G;3G;5G' < data4.txt

The output is here:

  hello world4
   hello world5 two
hello world6 three


       hello world4 four
line five

line six
line seven

Insert a blank line after every line of data4.txt with this command:

sed G < data4.txt

The output is here:

  hello world4

    hello world5 two

hello world6 three

       hello world4 four

line five

line six

line seven

Insert a blank line after every other line of data4.txt with this command:

sed n\;G < data4.txt

The output is here:

  hello world4
   hello world5 two

 hello world6 three
        hello world4 four

line five
line six

line seven

Reverse the lines in data4.txt with this command:

sed '1! G; h;$!d' < data4.txt

The output of the preceding sed command is here:

line seven
line six
line five
         hello world4 four
 hello world6 three
    hello world5 two
   hello world4

Summary

This chapter introduced you to the sed utility, illustrating the basic tasks of data transformation: allowing additions, removal, and mutation of data by matching individual patterns, or matching the position of the rows in a file, or a combination of the two.

Moreover, we showed that sed not only uses regular expressions to match data, similar to the grep command, but can also use regular expressions to describe how to transform the data. Finally, there was a list of examples showing both the versatility of the sed command, and hopefully communicating the sense that it is an even more flexible and powerful utility than we can show in a single chapter.