| ^ | Matches the beginning of the line |
| $ | Matches the end of the line |
| . | Matches any single character |
| (character)* | Matches arbitrarily many occurrences of (character), including none |
| (character)\? | Matches 0 or 1 instance of (character). This is a GNU extension; a plain ? is only special in extended regular expressions. |
| [abcdef] | Matches any character enclosed in [] (in this instance, a, b, c, d, e or f). Ranges of characters such as [a-z] are permitted. The behaviour of this deserves more description; see the page on grep for more details about the syntax of lists. |
| [^abcdef] | Matches any character NOT enclosed in [] (in this instance, any character other than a, b, c, d, e or f) |
| (character)\{m,n\} | Matches between m and n repetitions of (character) |
| (character)\{m,\} | Matches m or more repetitions of (character) |
| (character)\{,n\} | Matches n or fewer (possibly 0) repetitions of (character) |
| (character)\{n\} | Matches exactly n repetitions of (character) |
| \(expression\) | Group operator |
| \n | Backreference: matches the nth group |
| expression1\|expression2 | Matches expression1 or expression2. Works with GNU sed, but this feature might not work with other forms of sed. |
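As a rough illustration of a few of the operators in the table, the following sketch feeds some made-up strings through sed. The sample strings and the variable names are assumptions chosen only for this illustration.

```shell
#!/bin/sh
# Interval: replace a run of exactly three a's with X.
interval=$(echo 'caaat' | sed -e 's/a\{3\}/X/')
echo "$interval"   # cXt

# Grouping and backreferences: swap two words separated by a space.
swapped=$(echo 'hello world' | sed -e 's/\([a-z]*\) \([a-z]*\)/\2 \1/')
echo "$swapped"    # world hello

# Negated list: replace every character that is not a digit with a dash.
digits=$(echo 'a1b22c' | sed -e 's/[^0-9]/-/g')
echo "$digits"     # -1-22-
```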
/ is a special character in sed. The reason for this will become very clear when studying sed commands.
The s command. s means "substitute", or search and replace. The format is
s/regular-expression/replacement text/{flags}
We won't discuss all the flags yet. The one we use below is g, which means "replace all matches".
>cat file
I have three dogs and two cats
>sed -e 's/dog/cat/g' -e 's/cat/elephant/g' file
I have three elephants and two elephants
>
OK. So what happened? Firstly, sed read in the line of the file and executed
s/dog/cat/g
which produced the following text:
I have three cats and two cats
and then the second command was performed on the edited line, and the result was
I have three elephants and two elephants
We actually have a name for the "current text": it is called the pattern space. So a precise definition of what sed does is as follows:
sed reads its input one line at a time into the pattern space, performs a sequence of editing commands on the pattern space, then writes the pattern space to STDOUT.
>sed -e 'command1' -e 'command2' -e 'command3' file
>{shell command}|sed -e 'command1' -e 'command2'
>sed -f sedscript.sed file
>{shell command}|sed -f sedscript.sed
So sed can read from a file or STDIN, and the commands can be specified in a file or on the command line. Note that if the commands are read from a file, trailing whitespace can be fatal: it can cause scripts to fail for no apparent reason. I recommend editing sed scripts with an editor such as vim, which can show end of line characters so that you can "see" trailing whitespace at the end of a line.
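The two ways of supplying commands can be compared side by side. The sketch below is only an illustration: the temporary script file created with mktemp is an assumption, not a name used by this tutorial.

```shell
#!/bin/sh
# Hypothetical script file; any writable path would do.
script=$(mktemp)
printf '%s\n' 's/dog/cat/g' 's/cat/elephant/g' > "$script"

# The same two commands, once on the command line and once read from the file.
from_cli=$(echo 'I have three dogs and two cats' | sed -e 's/dog/cat/g' -e 's/cat/elephant/g')
from_file=$(echo 'I have three dogs and two cats' | sed -f "$script")

echo "$from_cli"
echo "$from_file"
rm -f "$script"
```

Both invocations should print the same line, since sed applies the commands in the same order either way.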
[address1[,address2]]s/pattern/replacement/[flags]
The flags can be any of the following:
| n | Replace the nth instance of pattern with replacement |
| g | Replace all instances of pattern with replacement |
| p | Write the pattern space to STDOUT if a successful substitution takes place |
| w file | Write the pattern space to file if a successful substitution takes place |
If no flags are specified, the first match on the line is replaced. Note that we will almost always use the s command with either the g flag or no flag at all.
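The effect of the flags can be seen on a small made-up line. This is a sketch under stated assumptions: the sample text is invented, and the -n option (which suppresses sed's automatic printing of the pattern space) is used with the p flag even though it has not been introduced above.

```shell
#!/bin/sh
line='one cat, two cat, three cat'

# No flag: only the first match on the line is replaced.
first=$(echo "$line" | sed -e 's/cat/dog/')
# Numeric flag: only the second match is replaced.
second=$(echo "$line" | sed -e 's/cat/dog/2')
# g flag: every match is replaced.
all=$(echo "$line" | sed -e 's/cat/dog/g')
# p flag together with -n: print only the lines where a substitution happened.
changed=$(printf 'cat\nbird\n' | sed -n -e 's/cat/dog/p')

echo "$first"
echo "$second"
echo "$all"
echo "$changed"
```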
If one address is given, then the substitution is applied only to lines matching that address. An address can be either a regular expression enclosed by forward slashes, /regex/, or a line number. The $ symbol can be used in place of a line number to denote the last line.
If two addresses are given separated by a comma, then the substitution is applied to all lines from the line matching the first address to the line matching the second address, inclusive.
This requires some clarification in the case where both addresses are patterns, as there is some ambiguity here. More precisely, the substitution is applied to all lines from the first match of address1 to the first match of address2, and then to all lines from the next match of address1 following that match of address2 to the next match of address2, and so on. Don't worry if this seems very confusing (it is); the examples will clarify this.
[address1[,address2]]d
The d command deletes the content of the pattern space. All following commands are skipped (after all, there's very little you can do with an empty pattern space), and a new line is read into the pattern space.
>cat file
http://pegasus.rutgers.edu/
>sed -e 's@http://pegasus.rutgers.edu/@http://andromeda.rutgers.edu/@' file
http://andromeda.rutgers.edu/
Note that we used a different delimiter, @, for the substitution command. Sed permits several delimiters for the s command, including @ % , ; and :. These alternative delimiters are good for substitutions which include strings such as filenames, as they make your sed code much more readable.
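To see why this matters, compare the same substitution written with / and with @ as the delimiter. The path below is a made-up example, not one taken from this tutorial.

```shell
#!/bin/sh
# Hypothetical path, used only for illustration.
path='/usr/local/bin/tool'

# With / as the delimiter, every slash in the pattern must be escaped.
with_slash=$(echo "$path" | sed -e 's/\/usr\/local/\/opt/')
# With @ as the delimiter, the slashes can be written as they are.
with_at=$(echo "$path" | sed -e 's@/usr/local@/opt@')

echo "$with_slash"
echo "$with_at"
```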
>cat file
the black cat was chased by the brown dog
>sed -e 's/black/white/g' file
the white cat was chased by the brown dog
That was pretty straightforward. Now we move on to something more interesting.
>cat file
the black cat was chased by the brown dog.
the black cat was not chased by the brown dog.
>sed -e '/not/s/black/white/g' file
the black cat was chased by the brown dog.
the white cat was not chased by the brown dog.
In this instance, the substitution is only applied to lines matching the regular expression not. Hence it is not applied to the first line.
>cat file
line 1 (one)
line 2 (two)
line 3 (three)
Example 4a
>sed -e '1,2d' file
line 3 (three)
This was pretty simple: we just deleted lines 1 to 2.
Example 4b
>sed -e '3d' file
line 1 (one)
line 2 (two)
Example 4c
>sed -e '1,2s/line/LINE/' file
LINE 1 (one)
LINE 2 (two)
line 3 (three)
Example 4d
>sed -e '/^line.*one/s/line/LINE/' -e '/line/d' file
LINE 1 (one)
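The first line matches the regular expression ^line.*one. So the substitution is carried out, and the resulting pattern space looks like this:
LINE 1 (one)
So now the second command is executed, but since the pattern space does not match the regular expression line, the delete command is not executed. The other two lines do not match the first address, so they are left unchanged and are then deleted by the second command.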
>cat file
hello
this text is wiped out
Wiped out
hello (also wiped out)
WiPed out TOO!
goodbye
(1) This text is not deleted
(2) neither is this ... ( goodbye )
(3) neither is this
hello
but this is
and so is this
and unless we find another g**dbye
every line to the end of the file gets deleted
>sed -e '/hello/,/goodbye/d' file
(1) This text is not deleted
(2) neither is this ... ( goodbye )
(3) neither is this
This illustrates how the addressing works when two pattern addresses are specified. sed finds the first match of the expression "hello", and deletes every line read into the pattern space up to and including the first line matching the expression "goodbye". It doesn't apply the delete command to any more lines until it comes across the expression "hello" again. Since the expression "goodbye" is not on any subsequent line, the delete command is applied to all remaining lines.
The q command is very simple. It simply quits. The current pattern space is written out (unless automatic printing has been suppressed), no more lines are read into the pattern space, and the program terminates and produces no more output.
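A line number address combined with q gives a rough equivalent of printing only the first lines of the input. The sample input here is made up for illustration.

```shell
#!/bin/sh
# Quit when line 2 is reached; line 2 itself is still printed before sed exits.
first_two=$(printf 'line 1\nline 2\nline 3\n' | sed -e '2q')
echo "$first_two"
```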
In sed, curly braces { } are used to group commands. They are used as follows:
address1[,address2]{
commands
}
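As a minimal sketch of grouping, the two substitutions below are applied only to lines matching the address /change/. The input lines are invented for this illustration.

```shell
#!/bin/sh
# Both s commands inside the braces run only on lines that match /change/.
result=$(printf 'keep dog\nchange dog here\nkeep cat\n' | sed -e '/change/{
s/dog/cat/
s/here/there/
}')
echo "$result"
```

Only the middle line is edited; the other two pass through untouched because the address does not select them.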
The following example makes very good use of all the concepts outlined above. For this, we use a shell script, since we need to state the one long string X several times (otherwise, we'd need to repeat ourselves three times with a somewhat lengthy expression). Notice that we use double quotes. This is so that $X is expanded to the value of the shell variable (which would not happen if we used single quotes). Also notice the $1 on the end. The syntax to run this script is
script search_filename
where script is whatever you decided to call it and search_filename is the file you are trying to search. $1 is the name the shell gives to the first command line argument.
#!/bin/sh
X='word1\|word2\|word3\|word4\|word5'
sed -e "
/$X/!d
/$X/{
s/\($X\).*/\1/
s/.*\($X\)/\1/
q
}" "$1"
An important note: it is tempting to think of this:
s/\($X\).*/\1/
s/.*\($X\)/\1/
as redundant, and to try to shorten it with this:
s/.*\($X\).*/\1/
This is unlikely to work. Why? Suppose we have a line
word1 word2 word3
We have no way of knowing whether $X is going to match word1, word2 or word3, so when we quote it (\1), we don't know what we are quoting.
What makes sure there are no such problems in the correct implementation is this: the * operator is greedy. That is, when there is ambiguity as to what (expression)* can match, it tries to match as much as possible. So in the example,
s/\($X\).*/\1/
.* tries to swallow as much of the line as possible. In particular, if the line looks like
word1 word2 word3
then we can be sure that .* matches " word2 word3" and hence $X matches word1.
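The greedy behaviour can be checked directly. In this sketch the pattern word[0-9] stands in for the longer alternation so that it does not depend on the GNU \| extension; that substitution, and the variable names, are assumptions made for illustration.

```shell
#!/bin/sh
line='word1 word2 word3'
X='word[0-9]'

# .* after the group swallows the rest of the line, so the group is the first word.
first=$(echo "$line" | sed -e "s/\($X\).*/\1/")
# .* before the group swallows as much as it can, so the group is the last word.
last=$(echo "$line" | sed -e "s/.*\($X\)/\1/")

echo "$first"   # word1
echo "$last"    # word3
```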
The substitution s/pattern1/replacement/ does not work if the string to be matched spans more than one line. Suppose we want to replace every instance of Microsoft Windows 95 with Linux (I mean, just replace the text!). Our first attempt is this:
s/Microsoft Windows 95/Linux/g
Unfortunately, the script fails if our file looks like this:
Microsoft
Windows 95
Since neither line matches the pattern Microsoft Windows 95, we need to do better. We need the "multiline next" or N command.
The next command N appends the next line to the pattern space. So our second attempt is this:
N
N
s/Microsoft[ \t\n]*Windows[ \t\n]*95/Linux/g
Now note that we have made reference to \t and \n. These are the tab and end of line characters respectively. The end of line character only appears in multiline patterns. In multiline patterns, it should also be noted that ^ and $ match the beginning and end of the pattern space, not of each line within it.
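A small sketch of what N does to the pattern space, using made-up two-line inputs. The first command shows \n matching the embedded end of line character; the second shows that ^ and $ anchor to the whole pattern space.

```shell
#!/bin/sh
# After N the pattern space holds both lines, joined by an embedded newline.
joined=$(printf 'Microsoft\nWindows 95\n' | sed -e 'N' -e 's/Microsoft\nWindows 95/Linux/')
echo "$joined"

# ^ and $ match the start and end of the two-line pattern space, so only one
# opening and one closing bracket are added.
anchored=$(printf 'a\nb\n' | sed -e 'N' -e 's/^/[/' -e 's/$/]/')
echo "$anchored"
```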
The above is a start, but it breaks if we have a file that looks like this:
Foo
Microsoft
Windows
95
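Why does it break? Let's look at what the script does. The first line is read into the pattern space, and the first N command appends the second line, so the pattern space looks like this:
Foo\nMicrosoft
The second N command appends the third line:
Foo\nMicrosoft\nWindows
This doesn't match the search pattern, so no substitution is performed, and the pattern space is written out unchanged.
This is the main error in the script: once the end of the script is reached, the first line that *has not been read into the pattern space already* is read. It is NOT true that the Nth iteration of the script reads from the Nth line of the file. So the next cycle starts with the line 95. The N command that follows fails because there are no more lines, and the script exits without writing '95' to STDOUT (GNU sed does print the pattern space before exiting in this case, unless POSIXLY_CORRECT is set).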
So there are two things to be learned from this:
A better version is as follows:
/Microsoft[ \t]*$/{
N
}
/Microsoft[ \t\n]*Windows[ \t]*$/{
N
}
s/Microsoft[ \t\n]*Windows[ \t\n]*95/Linux/g
This only performs the search on extra lines when necessary.
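The improved version can be tried on the sample file that broke the second attempt. This sketch assumes GNU sed, since \t and \n inside a bracket list are not understood by every sed; the variable names are made up for illustration.

```shell
#!/bin/sh
# The script from above, stored in a single-quoted string so the backslashes survive.
script='/Microsoft[ \t]*$/{
N
}
/Microsoft[ \t\n]*Windows[ \t]*$/{
N
}
s/Microsoft[ \t\n]*Windows[ \t\n]*95/Linux/g'

# The four-line sample file: Foo, Microsoft, Windows, 95.
out=$(printf 'Foo\nMicrosoft\nWindows\n95\n' | sed -e "$script")
echo "$out"
```

Here the line Foo is printed on its own, and extra lines are only pulled into the pattern space once a line ending in Microsoft has been seen.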
Suppose we want to eliminate all text enclosed by a matching pair of delimiters. This is a problem that comes up frequently, for example when removing html commands from html documents. We will use <angle brackets> in this example. So the task then is to eliminate anything between matching pairs of these brackets.
Our first attempt is as follows:
s/<[^>]*>//g
But this might break: the angle brackets might span more than one line, or there may be nested angle brackets. (Actually, the latter is unlikely to happen if the html is correct. It is only possible to nest angle brackets inside html comments.) But we will assume that it might happen anyway (since it makes the problem more fun). So here is the improved version.
:top
/<.*>/{
s/<[^<>]*>//g
t top
}
/</{
N
b top
}
A fine point: why didn't we replace the third line of the script with
s/<[^>]*>//g
and remove the t command that follows? Well, consider this sample file:
<<hello>
hello>
The desired output would be the empty set, since everything is enclosed in angled brackets. However, the output will look like this (after a blank first line):
hello>
since the first line matches the expression <[^>]*> and is removed in its entirety, leaving the second line with an unmatched bracket. So the point is that we have set up the script to recursively remove the contents of the innermost matching pair of delimiters.
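As a rough check, the improved script can be run on a small input in which one tag spans two lines. The sample html is made up for illustration, and the sketch relies on . and a negated list matching an embedded newline, as GNU sed does.

```shell
#!/bin/sh
# The improved script from above, with its label, test and branch commands.
script=':top
/<.*>/{
s/<[^<>]*>//g
t top
}
/</{
N
b top
}'

# The second tag is split across two input lines.
out=$(printf 'a <b>bold</b> word\nsplit <a\nhref="x">link</a> end\n' | sed -e "$script")
echo "$out"
```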