sed
In the prior chapter, we learned how to reduce a stream of data to only the contents that interested us. In this chapter, we learn how to transform that data using the Unix sed utility, whose name is short for “stream editor.”
The first part of this chapter contains basic examples of the sed command, such as replacing and deleting strings, numbers, and letters. The second part of this chapter discusses various switches that are available for the sed command, along with an example of replacing multiple delimiters with a single delimiter in a dataset.
In the final section you will see a number of examples of how to perform stream-oriented processing on datasets, bringing the capabilities of sed together with the commands and regular expressions from prior chapters to accomplish difficult tasks with relatively simple code.
What Is the sed Command?
The name sed is short for “stream editor,” and the utility derives many of its commands from the ed line editor (ed was the first UNIX text editor). The sed command is a “non-interactive” stream-oriented editor that can be used to automate editing via shell scripts. This ability to modify an entire stream of data (which can be the contents of multiple files, in a manner similar to how grep behaves) as if you were inside an editor is not common in modern programming languages. It allows some capabilities not easily duplicated elsewhere, while behaving exactly like any other command (grep, cat, ls, find, and so forth) in how it accepts data, outputs data, and matches patterns with regular expressions.
Some of the more common uses for sed include printing matching lines, deleting matching lines, and finding and replacing matching strings or regular expressions.
The sed Execution Cycle
Whenever you invoke the sed command, it repeats an execution cycle, applying the commands that you specified, until the end of the file/input is reached. Specifically, each execution cycle performs the following steps:
Reads an entire line from stdin/file.
Removes any trailing newline.
Places the line in its pattern buffer.
Modifies the pattern buffer according to the supplied commands.
Prints the pattern buffer to stdout.
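The five steps above can be imitated with an ordinary shell loop. The following is only a rough sketch for illustration, not a description of how sed is implemented: the function name emulate_cycle and the variable pattern_buffer are hypothetical, and the loop hard-codes the single command s/abc/ABC/.

```shell
# Hypothetical helper: imitates the sed cycle for the one command s/abc/ABC/.
emulate_cycle() {
  # Steps 1-3: read one line; "read" drops the trailing newline and the
  # variable plays the role of the pattern buffer.
  while IFS= read -r pattern_buffer || [ -n "$pattern_buffer" ]; do
    # Step 4: modify the pattern buffer (replace the first "abc" only).
    case $pattern_buffer in
      *abc*) pattern_buffer="${pattern_buffer%%abc*}ABC${pattern_buffer#*abc}" ;;
    esac
    # Step 5: print the pattern buffer, restoring the newline.
    printf '%s\n' "$pattern_buffer"
  done
}

# Both commands should print the same two lines.
printf 'abcdefabc\nxyz\n' | emulate_cycle
printf 'abcdefabc\nxyz\n' | sed 's/abc/ABC/'
```

The loop makes the "one line at a time" nature of sed explicit: no command ever sees more than the current line unless you deliberately collect lines yourself.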
sed
The sed command requires you to specify a pattern in order to match the lines in a file. For example, suppose that the file numbers.txt contains the following lines:
1
2
123
3
five
4
The following sed command prints all the lines that contain the string 3:
cat numbers.txt |sed -n "/3/p"
Another way to produce the same result:
sed -n "/3/p" numbers.txt
In both cases the output of the preceding commands is as follows:
123
3
As we saw earlier with other commands, it is more efficient to let sed read the file directly than to pipe the file in with a different command. You can “feed” it data from another command, provided that other command adds value (such as adding line numbers, removing blank lines, or other similar helpful activities).
The -n option suppresses the automatic printing of each line, and the p command prints the matching line. If you omit the -n option, then every line is printed, and the p command causes each matching line to be printed a second time. Hence, you can issue the following command:
sed "/3/p" numbers.txt
The output (the data to the right of the colon) is as follows. Note that the labels to the left of the colon show the source of the data, to illustrate the “one row at a time” behavior of sed.
Basic stream output :1
Basic stream output :2
Basic stream output :123
Pattern Matched text:123
Basic stream output :3
Pattern Matched text:3
Basic stream output :five
Basic stream output :4
It is also possible to match two patterns and print everything between the lines that match:
sed -n "/123/,/five/p" numbers.txt
The output of the preceding command (all lines between 123 and five, inclusive) is here:
123
3
five
sed
The examples in this section illustrate how to use sed to substitute new text for an existing text pattern.
x="abc"
echo $x |sed "s/abc/def/"
The output of the preceding code snippet is here:
def
In the prior command you have instructed sed to substitute (s) the first text pattern (/abc) with the second pattern (/def), with no further instructions after the final delimiter (/).
Deleting a text pattern is simply a matter of leaving the second pattern empty:
echo "abcdefabc" |sed "s/abc//"
The result is here:
defabc
As you see, this only removes the first occurrence of the pattern. You can remove all the occurrences of the pattern by adding the “global” terminal instruction (/g):
echo "abcdefabc" |sed "s/abc//g"
The result of the preceding command is here:
def
Note that we are operating directly on the main stream with this command, as we are not using the -n option. You can also suppress the main stream with -n and print the substitution, achieving the same output if you use the terminal p (print) instruction:
echo "abcdefabc" |sed -n "s/abc//gp"
def
For substitutions, either syntax will do, but that is not always true of other commands.
You can also remove digits instead of letters, by using a digit character class as your regular expression match pattern (from Chapter 1):
ls svcc1234.txt |sed "s/[0-9]//g"
ls svcc1234.txt |sed -n "s/[0-9]//gp"
The result of either of the two preceding commands is here:
svcc.txt
Recall that the file columns4.txt contains the following text:
123 ONE TWO
456 three four
ONE TWO THREE FOUR
five 123 six
one two three
four five
The following sed command is instructed to identify the rows between 1 and 3, inclusive (1,3), and delete (d) them from the output:
cat columns4.txt | sed "1,3d"
The output is here:
five 123 six
one two three
four five
The following sed command deletes a range of lines, starting from the line that matches 123 and continuing through the file until reaching the line that matches the string five (and also deleting all the intermediate lines). The syntax should be familiar from the earlier matching example:
sed "/123/,/five/d" columns4.txt
The output is here:
one two three
four five
The following code snippet shows you how simple it is to replace multiple vowels in a string using the sed command:
echo "hello" | sed "s/[aeio]/u/g"
The output from the preceding code snippet is here:
hullu
Suppose that we have a variable x that is defined as follows:
x="a123zAB 10x b 20 c 300 d 40w00"
Recall that an integer consists of one or more digits, so it matches the regular expression [0-9]+. However, you do not need a repetition operator in order to remove every number from the variable x: with the g instruction, the single-digit expression [0-9] matches (and removes) each digit in turn:
echo $x | sed "s/[0-9]//g"
The output of the preceding command is here:
azAB x b c d w
The following command removes all lowercase letters from the variable x:
echo $x | sed "s/[a-z]*//g"
The output of the preceding command is here:
123AB 10 20 300 4000
The following command removes all lowercase and uppercase letters from the variable x:
echo $x | sed "s/[a-zA-Z]//g"
The output of the preceding command is here:
123 10 20 300 4000
sed
The previous section showed you how to delete a range of rows of a text file, based on a start line and end line, using either a numeric range or a pair of strings. As deleting is just substituting an empty result for what you match, it should now be clear that a replace activity involves populating that part of the command with something that achieves your desired outcome. This section contains various examples that illustrate how to get the exact substitution you desire.
The following examples illustrate how to convert lowercase abc to uppercase ABC in sed:
echo "abc" |sed "s/abc/ABC/"
The output of the preceding command is here (the substitution applies only to the first occurrence of abc):
ABC
echo "abcdefabc" |sed "s/abc/ABC/g"
The output of the preceding command is here (/g means that the substitution applies to every occurrence of abc):
ABCdefABC
The following sed expression performs three consecutive substitutions, using -e to string them together. It changes exactly one (the first) a to A, one b to B, and one c to C:
echo "abcde" |sed -e "s/a/A/" -e "s/b/B/" -e "s/c/C/"
The output of the preceding command is here:
ABCde
For this input, you can obviously use the following sed expression that combines the three substitutions into one substitution:
echo "abcde" |sed "s/abc/ABC/"
Nevertheless, the -e switch is useful when you need to perform more complex substitutions that cannot be combined into a single substitution.
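As one illustration of commands that cannot be merged into a single substitution, the following sketch chains two delete commands with two unrelated substitutions. The function name clean_lines and the sample input are made up for this example.

```shell
# Hypothetical helper: four independent commands, each given with its own -e.
clean_lines() {
  # 1) drop comment lines, 2) drop empty lines,
  # 3) capitalize the first abc on each line, 4) squeeze runs of spaces.
  sed -e '/^#/d' -e '/^$/d' -e 's/abc/ABC/' -e 's/  */ /g'
}

# Made-up input: a comment, an empty line, and two data lines.
printf '# header\n\nabc   one\ndef  abc\n' | clean_lines
```

The commands run in the order given, and each one sees the result of the commands before it, so the squeeze in the last command also applies to lines that the third command has already changed.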
The “/” character is not the only delimiter that sed supports, which is useful when strings contain the “/” character. For example, you can reverse the order of /aa/bb/cc with this command:
echo "/aa/bb/cc" |sed -n "s#/aa/bb/cc#/cc/bb/aa/#p"
The output of the preceding sed command is here:
/cc/bb/aa/
The following example illustrates how to use the “w” terminal instruction to write the sed output both to standard output and to a named file upper1 if the match succeeds:
echo "abcdefabc" |sed "s/abc/ABC/wupper1"
ABCdefabc
If you examine the contents of the text file upper1 you will see that it contains the same string ABCdefabc that is displayed on the screen. This two-stream behavior that we noticed earlier with the print (“p”) terminal command is unusual, but sometimes useful. It is more common to simply send the standard output to a file using the “>” syntax, as shown in the following (both syntaxes work for a replace operation), but in that case nothing is written to the terminal screen. The previous syntax allows both at the same time:
echo "abcdefabc" | sed "s/abc/ABC/" > upper1
echo "abcdefabc" | sed -n "s/abc/ABC/p" > upper1
Listing 4.1 displays the contents of update2.sh that replaces the occurrences of the string hello with the string goodbye in the files with the suffix txt in the current directory.
for f in *txt
do
  newfile="${f}_new"
  # No -n switch and no p instruction here: every line must be written to
  # the new file, not only the lines in which a substitution took place.
  sed "s/hello/goodbye/g" "$f" > "$newfile"
  mv "$newfile" "$f"
done
Listing 4.1 contains a for loop that iterates over the list of text files with the txt suffix. For each such file, the loop initializes the variable newfile by appending the string _new to the name of the current file (represented by the variable f). Next, it replaces the occurrences of hello with the string goodbye in each file f, and redirects the output to $newfile. Finally, it renames $newfile to $f using the mv command.
If you want to perform the update in matching files in all subdirectories, replace the “for” statement with the following:
for f in `find . -print |grep "txt$"`
Listing 4.2 displays the contents of the dataset delimiter1.txt, which contains multiple delimiters “|”, “:”, and “^”. Listing 4.3 displays the contents of delimiter1.sh, which illustrates how to replace the various delimiters in delimiter1.txt with a single comma delimiter “,”.
1000|Jane:Edwards^Sales
2000|Tom:Smith^Development
3000|Dave:Del Ray^Marketing
inputfile="delimiter1.txt"
cat $inputfile | sed -e 's/:/,/' -e 's/|/,/' -e 's/\^/,/'
As you can see, the second line in Listing 4.3 is simple yet very powerful: you can extend the sed command with as many delimiters as you require in order to create a dataset with a single delimiter between values. The output from Listing 4.3 is shown here:
1000,Jane,Edwards,Sales
2000,Tom,Smith,Development
3000,Dave,Del Ray,Marketing
Do keep in mind that this kind of transformation can be a bit unsafe unless you have checked that your new delimiter is not already in use. For that a grep command is useful (you want the result to be zero):
grep -c ',' $inputfile
0
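The safety check and the conversion can be combined so that the conversion never runs on a file that already contains a comma. The following is a minimal sketch under stated assumptions: the function name to_csv is hypothetical, and unlike Listing 4.3 it adds the g instruction so that repeated delimiters on one line are also converted.

```shell
# Hypothetical helper: convert |, :, and ^ to commas only when the input
# file contains no comma yet; otherwise report the problem and fail.
to_csv() {
  inputfile=$1
  if grep -q ',' "$inputfile"; then
    echo "refusing to convert: $inputfile already contains a comma" >&2
    return 1
  fi
  sed -e 's/:/,/g' -e 's/|/,/g' -e 's/\^/,/g' "$inputfile"
}

# Demonstration with a temporary copy of the dataset in Listing 4.2.
tmpfile=$(mktemp)
printf '1000|Jane:Edwards^Sales\n2000|Tom:Smith^Development\n3000|Dave:Del Ray^Marketing\n' > "$tmpfile"
to_csv "$tmpfile"
rm -f "$tmpfile"
```

Returning a nonzero status lets a calling script stop before it overwrites anything with ambiguous data.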
sed
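The three command-line switches -n, -e, and -i are the ones you are most likely to use with the sed command. The -n and -e switches are reviewed below; the -i switch (available in GNU and BSD versions of sed) edits a file in place instead of writing the result to standard output.
As a review, specify -n when you want to suppress the printing of the basic stream output:
sed -n 's/foo/bar/'
Specify -n and end with /p' when you want to print only the lines in which a substitution took place:
sed -n 's/foo/bar/p'
We briefly touched on using -e to do multiple substitutions, but it can also be used to combine other commands. The following syntax places the substitution and the print in separate commands. Note that a standalone p prints every line, not just the lines in which the substitution succeeded, so the result is the same as sed 's/foo/bar/' rather than the previous example:
sed -n -e 's/foo/bar/' -e 'p'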
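One caveat is worth checking on sample input: a p given as its own command is unconditional, whereas the p instruction at the end of a substitution prints only when a substitution took place. The two-line input below is made up for the comparison.

```shell
# p attached to the substitution: prints only the changed line.
printf 'foo\nbaz\n' | sed -n 's/foo/bar/p'
# prints: bar

# p as a separate command: prints every line, changed or not.
printf 'foo\nbaz\n' | sed -n -e 's/foo/bar/' -e 'p'
# prints: bar, then baz

# Giving the separate p an address restricts it again (here: lines containing bar).
printf 'foo\nbaz\n' | sed -n -e 's/foo/bar/' -e '/bar/p'
# prints: bar
```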
A more advanced example that hints at the flexibility of sed involves the insertion of a character after a fixed number of positions. For example, consider the following code snippet:
echo "ABCDEFGHIJKLMNOPQRSTUVWXYZ" | sed "s/.\{3\}/&\n/g"
The output from the preceding command is here:
ABCnDEFnGHInJKLnMNOnPQRnSTUnVWXnYZ
This output comes from a version of sed that treats \n in the replacement as a plain n (as the BSD sed on macOS does); GNU sed interprets \n there as a newline and prints each three-letter group on its own line. While the above example does not seem especially useful, consider a large text stream with no line breaks (everything on one line). You could use something like this to insert newline characters, or something else to break the data into easier to process chunks. It is possible to work through exactly what sed is doing by looking at each element of the command and comparing it to the output, even if you don’t know the syntax. (Tip: sometimes you will encounter very complex instructions for sed without any documentation in the code: try not to be that person when coding.)
The output is changing after every three characters and we know dot (.) matches any single character, so .{3} must be telling it to do that (with escape slashes \ because braces must be escaped in the basic regular expression syntax that sed uses, and it won’t interpret the expression properly if we just leave it as .{3}). The “n” is clear enough in the replacement column, so the “&\” must be somehow telling it to insert a character instead of replacing the match (in fact, & stands for the text that was matched, and the backslash belongs to the \n). The terminal g command of course means to repeat. To clarify and confirm those guesses, take what you could infer and perform an Internet search.
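A quick way to confirm what & does, without depending on how a particular sed treats \n, is to run the same pattern with a visible character in the replacement. In the sketch below the hyphen and the square brackets are arbitrary choices, and the input is shortened for readability.

```shell
# & is replaced by whatever the pattern matched (here, each group of three characters).
echo "ABCDEFGHIJ" | sed 's/.\{3\}/&-/g'
# prints: ABC-DEF-GHI-J

# The matched text can appear anywhere in the replacement.
echo "ABCDEFGHIJ" | sed 's/.\{3\}/[&]/g'
# prints: [ABC][DEF][GHI]J
```

The trailing J is left alone in both cases because it does not belong to a complete group of three characters.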
The sed utility is very useful for manipulating the contents of text files. For example, you can print ranges of lines, and subsets of lines that match a regular expression. You can also perform search-and-replace on the lines in a text file. This section contains examples that illustrate how to perform such functionality.
Listing 4.4 displays the contents of test4.txt (double-spaced lines) that are used for several examples in this section.
abc

def

abc

abc
The following code snippet prints the first 3 lines in test4.txt (we used this syntax before when deleting rows; it is equally useful for printing):
cat test4.txt |sed -n "1,3p"
The output of the preceding code snippet is here (the second line is blank):
abc

def
The following code snippet prints lines 3 through 5 in test4.txt:
cat test4.txt |sed -n "3,5p"
The output of the preceding code snippet is here:
def

abc
The following code snippet takes advantage of the basic output stream and the second match stream to duplicate every line (including blank lines) in test4.txt:
cat test4.txt |sed "p"
The output of the preceding code snippet is here:
abc
abc


def
def


abc
abc


abc
abc
The following code snippet prints the first three lines and then capitalizes the string abc, duplicating ABC in the final output because we did not use -n and did end with /p" in the second sed command. Remember that /p" only prints the text that matched the sed command, whereas the basic output prints every line, which is why def does not get duplicated:
cat test4.txt |sed -n "1,3p" |sed "s/abc/ABC/p"
ABC
ABC

def
sed
You can also use regular expressions with sed. As a reminder, here are the contents of columns4.txt:
123 ONE TWO
456 three four
ONE TWO THREE FOUR
five 123 six
one two three
four five
As our first example involving sed and character classes, the following code snippet illustrates how to match lines that contain digits:
cat columns4.txt | sed -n '/[0-9]/p'
The output from the preceding snippet is here:
123 ONE TWO
456 three four
five 123 six
The following code snippet illustrates how to match lines that contain lowercase letters:
cat columns4.txt | sed -n '/[a-z]/p'
The output from the preceding snippet is here:
456 three four
five 123 six
one two three
four five
The following code snippet illustrates how to match lines that contain the digits 4, 5, or 6:
cat columns4.txt | sed -n '/[4-6]/p'
The output from the preceding snippet is here:
456 three four
The following code snippet illustrates how to match lines that start with any two characters followed by one or more E characters:
cat columns4.txt | sed -n '/^.\{2\}EE*/p'
The output from the preceding snippet is here:
ONE TWO THREE FOUR
Listing 4.5 displays the contents of controlchars.txt that we used before in Chapter 2. Control characters of any kind can be removed by sed just like any other character.
1 carriage return^M
2 carriage return^M
1 tab character^I
The following command removes the carriage return and the tab characters from the text file controlchars.txt:
cat controlchars.txt | sed "s/^M//" |sed "s/ //"
You cannot see the tab character in the second sed command in the preceding code snippet; however, if you redirect the output to the file nocontrol1.txt, you can see that there are no embedded control characters in this new file by typing the following command:
cat -t nocontrol1.txt
Listing 4.6 displays the contents of WordCountInFile.sh, which illustrates how to combine various bash commands in order to count the words (and their occurrences) in a file.
# The file is fed to the "tr" command, which changes uppercase to lowercase
# sed removes commas and periods, then changes whitespace to newlines
# uniq needs each word on its own line to count the words properly
# uniq converts data to unique words and the number of times they appeared
# The final sort orders the data by the wordcount.
cat "$1" | xargs -n1 | tr A-Z a-z | \
sed -e 's/\.//g' -e 's/\,//g' -e 's/ /\
/g' | \
sort | uniq -c | sort -nr
The previous command performs the following operations:
List each word in each line of the file,
shift characters to lowercase,
filter out periods and commas,
change space between words to linefeed, and
remove duplicates, prefix occurrence count, and sort numerically.
sed
In the chapter describing grep you learned about back references, and similar functionality is available with the sed command. The main difference is that the back references can also be used in the replacement section of the command.
The following sed command matches two consecutive “a” letters and prints four of them:
echo "aa" |sed -n "s#\([a-z]\)\1#\1\1\1\1#p"
The output of the preceding code snippet is here:
aaaa
The following sed command replaces each pair of letters with the letters aa:
echo "aa/bb/cc" |sed -n "s#\(aa\)/\(bb\)/\(cc\)#\1/\1/\1/#p"
The output of the previous sed command is here (note the trailing “/” character):
aa/aa/aa/
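Because each group keeps its own number, back references can also reorder the matched pieces rather than repeat one of them. The two commands below are a small sketch of that idea; the name "Smith, Tom" is made-up sample data.

```shell
# Reverse the three path components by emitting the groups in the order 3, 2, 1.
echo "/aa/bb/cc" | sed 's#/\(aa\)/\(bb\)/\(cc\)#/\3/\2/\1#'
# prints: /cc/bb/aa

# Turn "Last, First" into "First Last" by swapping two groups.
echo "Smith, Tom" | sed 's/\([A-Za-z]*\), \([A-Za-z]*\)/\2 \1/'
# prints: Tom Smith
```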
The following command inserts a comma in a four-digit number:
echo "1234" |sed -n "s@\([0-9]\)\([0-9]\)\([0-9]\)\([0-9]\)@\1,\2\3\4@p"
The preceding sed command uses the @ character as a delimiter. The character class [0-9] matches one single digit. Since there are four digits in the input string 1234, the character class [0-9] is repeated 4 times, and the value of each digit is stored in \1, \2, \3, and \4.
The output from the preceding sed command is here:
1,234
A more general sed expression that can insert a comma in five-digit numbers is here:
echo "12345" | sed 's/\([0-9]\{3\}\)$/,\1/g;s/^,//'
The output of the preceding command is here:
12,345
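For numbers of arbitrary length, a widely used approach is to repeat a substitution until it no longer matches. The sketch below relies on two standard sed commands that this chapter has not covered: a label (:a) and the t command, which jumps back to the label whenever the preceding substitution succeeded. The function name add_commas is hypothetical.

```shell
# Hypothetical helper: insert thousands separators into a number of any length.
add_commas() {
  # Each pass finds the last place where a digit is followed by three more
  # digits and inserts a comma there; "ta" repeats until nothing changes.
  sed -e :a -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/;ta'
}

echo "1234567" | add_commas
# prints: 1,234,567
echo "12345" | add_commas
# prints: 12,345
echo "123" | add_commas
# prints: 123
```

The loop stops on its own because, once the commas are in place, no digit is followed by three consecutive digits.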
In the previous chapter we solved this task (retrieving only the alphabetic strings from a dataset) using the egrep command, and this section shows you how to solve this task using the sed command. For simplicity, let’s work with a text string, and that way we can see the intermediate results as we work toward the solution. The approach will be similar to the code block shown earlier that counted unique words. Let’s initialize the variable x as shown here:
x="ghi abc Ghi 123 #def5 123z"
The first step is to split x into one word per line by replacing spaces with newlines:
echo $x |tr -s ' ' '\n'
The output is here:
ghi
abc
Ghi
123
#def5
123z
The second step is to invoke sed with the regular expression ^[a-zA-Z][a-zA-Z]*$ (equivalent to ^[a-zA-Z]+$), which matches any string consisting of one or more uppercase and/or lowercase letters (and nothing else). Note that the -E switch is needed to parse this kind of regular expression in sed, as the unescaped parentheses belong to the newer/modern regular expression syntax that was not available when sed was new.
echo $x |tr -s ' ' '\n' |sed -nE "s/(^[a-zA-Z][a-zA-Z]*$)/\1/p"
The output is here:
ghi
abc
Ghi
If you also want to sort the output and print only the unique words, pipe the result to the sort and uniq commands:
echo $x |tr -s ' ' '\n' |sed -nE "s/(^[a-zA-Z][a-zA-Z]*$)/\1/p"|sort|uniq
The output is here:
Ghi
abc
ghi
If you want to extract only the integers in the variable x, use this command:
echo $x |tr -s ' ' '\n' |sed -nE "s/(^[0-9][0-9]*$)/\1/p" |sort|uniq
The output is here:
123
If you want to extract alphanumeric words from the variable x, use this command:
echo $x |tr -s ' ' '\n' |sed -nE "s/(^[0-9a-zA-Z][0-9a-zA-Z]*$)/\1/p"|sort|uniq
The output is here:
123
123z
Ghi
abc
ghi
Now you can replace echo $x with a dataset in order to retrieve only alphabetic strings from that dataset.
One-Line sed Commands
This section is intended to show a lot of the more useful problems you can solve with a single line of sed, and to expose you to yet more switches and arguments that you can mix and match to solve related tasks.
Moreover, sed supports other options (which are beyond the scope of this book) to perform many other tasks, some of which are sophisticated and correspondingly complex. If you encounter something that none of the examples in this chapter cover, but it seems like the sort of thing sed might do, the odds are decent that it does: an Internet search along the lines of “how do I do <xxx> in sed” will likely either point you in the right direction, or at least to an alternative bash command that will be helpful.
Listing 4.7 displays the contents of data4.txt that are referenced in some of the sed commands in this section. Note that some examples contain options that have not been discussed earlier in this chapter: they are included in case you need the desired functionality (and you can find more details by reading online tutorials).
hello world4
hello world5 two
hello world6 three
hello world4 four
line five
line six
line seven
Print the first line of data4.txt with this command:
sed q < data4.txt
The output is here:
hello world4
Print the first three lines of data4.txt with this command:
sed 3q < data4.txt
The output is here:
hello world4
hello world5 two
hello world6 three
Print the last line of data4.txt with this command:
sed '$!d' < data4.txt
The output is here:
line seven
You can also use this snippet to print the last line:
sed -n '$p' < data4.txt
Print the last two lines of data4.txt with this command:
sed '$!N;$!D' <data4.txt
The output is here:
line six
line seven
Print the lines of data4.txt that do not contain world with this command:
sed '/world/d' < data4.txt
The output is here:
line five
line six
line seven
Print duplicates of the lines in data4.txt that contain the word world with this command:
sed '/world/p' < data4.txt
The output is here:
hello world4
hello world4
hello world5 two
hello world5 two
hello world6 three
hello world6 three
hello world4 four
hello world4 four
line five
line six
line seven
Print the fifth line of data4.txt with this command:
sed -n '5p' < data4.txt
The output is here:
line five
Print the contents of data4.txt and duplicate line five with this command:
sed '5p' < data4.txt
The output is here:
hello world4
hello world5 two
hello world6 three
hello world4 four
line five
line five
line six
line seven
Print lines four through six of data4.txt with this command:
sed -n '4,6p' < data4.txt
The output is here:
hello world4 four
line five
line six
Delete lines four through six of data4.txt with this command:
sed '4,6d' < data4.txt
The output is here:
hello world4
hello world5 two
hello world6 three
line seven
Delete the section of lines between world6 and six in data4.txt with this command:
sed '/world6/,/six/d' < data4.txt
The output is here:
hello world4
hello world5 two
line seven
Print the section of lines between world6 and six of data4.txt with this command:
sed -n '/world6/,/six/p' < data4.txt
The output is here:
hello world6 three
hello world4 four
line five
line six
Print the contents of data4.txt and duplicate the section of lines between world6 and six with this command:
sed '/world6/,/six/p' < data4.txt
The output is here:
hello world4
hello world5 two
hello world6 three
hello world6 three
hello world4 four
hello world4 four
line five
line five
line six
line six
line seven
Delete the even-numbered lines in data4.txt with this command:
sed 'n;d;' <data4.txt
The output is here:
hello world4
hello world6 three
line five
line seven
Replace letters a through m with a “,” with this command:
sed "s/[a-m]/,/g" <data4.txt
The output is here:
,,,,o wor,,4
,,,,o wor,,5 two
,,,,o wor,,6 t,r,,
,,,,o wor,,4 ,our
,,n, ,,v,
,,n, s,x
,,n, s,v,n
Replace letters a through m with the characters “,@#” with this command:
sed "s/[a-m]/,@#/g" <data4.txt
The output is here:
,@#,@#,@#,@#o wor,@#,@#4
,@#,@#,@#,@#o wor,@#,@#5 two
,@#,@#,@#,@#o wor,@#,@#6 t,@#r,@#,@#
,@#,@#,@#,@#o wor,@#,@#4 ,@#our
,@#,@#n,@# ,@#,@#v,@#
,@#,@#n,@# s,@#x
,@#,@#n,@# s,@#v,@#n
Not every version of the sed command recognizes escape sequences such as \t in a regular expression (GNU sed does, but the BSD sed that ships with macOS does not), which means that for portability you must literally insert a tab on your console. In the case of the bash shell, enter the control character ^V and then press the <TAB> key in order to insert a <TAB> character.
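An alternative to typing the tab with ^V is to let the shell generate it and pass it to sed inside a double-quoted script. The sketch below assumes nothing beyond printf and sed; the variable name tab and the sample input are arbitrary.

```shell
# Store a literal tab in a variable; printf understands \t even when sed does not.
tab=$(printf '\t')

# Double quotes let the shell expand $tab before sed sees the script.
printf 'hello\tworld\n' | sed "s/$tab/ /g"
# prints: hello world
```

This keeps the script readable, because the tab is visible as a named variable instead of an invisible character inside the quotes.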
Delete the tab characters in data4.txt with this command (where <TAB> stands for a literal tab character, entered as described above):
sed 's/<TAB>//g' <data4.txt
The output is here:
hello world4
hello world5 two
hello world6 three
hello world4 four
line five
line six
line seven
Delete the tab characters and blank spaces in data4.txt with this command:
sed 's/ //g' <data4.txt
The output is here:
helloworld4
helloworld5two
helloworld6three
helloworld4four
linefive
linesix
lineseven
Replace every line of data4.txt with the word pasta with this command:
sed 's/.*/pasta/' < data4.txt
The output is here:
pasta
pasta
pasta
pasta
pasta
pasta
pasta
Insert two blank lines after the third line and one blank line after the fifth line in data4.txt with this command:
sed '3G;3G;5G' < data4.txt
The output is here:
hello world4
hello world5 two
hello world6 three


hello world4 four
line five

line six
line seven
Insert a blank line after every line of data4.txt with this command:
sed G < data4.txt
The output is here:
hello world4

hello world5 two

hello world6 three

hello world4 four

line five

line six

line seven

Insert a blank line after every other line of data4.txt with this command:
sed n\;G < data4.txt
The output is here:
hello world4
hello world5 two

hello world6 three
hello world4 four

line five
line six

line seven
Reverse the lines in data4.txt with this command:
sed '1!G;h;$!d' < data4.txt
The output of the preceding sed command is here:
line seven
line six
line five
hello world4 four
hello world6 three
hello world5 two
hello world4
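The reversal depends on the hold space, a second buffer that sed keeps alongside the pattern buffer and that this chapter has not otherwise described. The sketch below annotates each command; the function name reverse_lines and the three-line input are made up for illustration.

```shell
# Hypothetical helper: print the input lines in reverse order.
reverse_lines() {
  # 1!G : on every line except the first, append the hold space (the lines
  #       seen so far, already in reverse order) after the current line
  # h   : copy the combined text back into the hold space
  # $!d : discard the output on every line except the last, where the
  #       complete reversed text is finally printed
  sed '1!G;h;$!d'
}

printf 'one\ntwo\nthree\n' | reverse_lines
# prints: three, two, one (each on its own line)
```

Because the whole input accumulates in the hold space, this approach is best suited to inputs that fit comfortably in memory.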
This chapter introduced you to the sed utility, illustrating the basic tasks of data transformation: allowing additions, removal, and mutation of data by matching individual patterns, or matching the position of the rows in a file, or a combination of the two.
Moreover, we showed that sed not only uses regular expressions to match data, similar to the grep command, but can also use regular expressions to describe how to transform the data. Finally, there was a list of examples showing both the versatility of the sed command, and hopefully communicating the sense that it is an even more flexible and powerful utility than we can show in a single chapter.