Many of the tasks a Systems Administrator will perform involve the manipulation of textual information. Some examples include manipulating system log files to generate reports and modifying shell programs. UNIX is quite good at manipulating textual information and provides a number of tools which make tasks like this quite simple, once you understand how to use the tools. The aim of this chapter is to provide you with an understanding of these tools.
By the end of this chapter you should be
§ familiar with using regular expressions,
§ able to use regular expressions and ex commands to perform powerful text manipulation tasks.
Regular expressions provide a powerful method for matching patterns of characters. Regular expressions (REs) are understood by a number of commands including ed, ex, sed, awk, grep, egrep, expr and even vi.
Some examples of regular expressions include
§
david
Will match any occurrence of the word david
§
[Dd]avid
Will match either david or David
§
.avid
Will match any single character (.) followed by avid
§
^david$
Will match any line that contains only david
§
d*avid
Will match avid, david, ddavid, dddavid and any other word with repeated ds followed by avid
§
^[^abcef]avid$
Will match any line with only five characters on the line, where the last four characters must be avid and the first character can be any character except a, b, c, e or f.
Each regular expression is a pattern. That pattern is used to match other text. The simplest example of how regular expressions are used by commands is the grep command.
The grep command was introduced in a previous chapter and is used to search through a file and find lines that contain particular patterns of characters. Once it finds such a line, by default, the grep command will display that line onto standard output. In that previous chapter you were told that grep stood for global regular expression pattern match. Hopefully you now know what a regular expression is.
This means that the patterns that grep searches for are regular expressions.
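The example REs listed earlier can be checked with grep itself. The following is a minimal sketch: the sample lines, the variable names and the count_matches helper are made up for illustration, and only the patterns come from the list above.

```shell
# Made-up sample input, one candidate string per line.
sample='david
David
xavid
avid
dddavid
a david here'

# Helper: count the sample lines matched by the RE given as $1.
count_matches() {
    printf '%s\n' "$sample" | grep -c "$1"
}

either=$(count_matches '[Dd]avid')   # david or David anywhere on the line
anychar=$(count_matches '.avid')     # any one character, then avid
exact=$(count_matches '^david$')     # the line is exactly david
star=$(count_matches 'd*avid')       # zero or more d's, then avid

echo "[Dd]avid: $either  .avid: $anychar  ^david\$: $exact  d*avid: $star"
```

Note that .avid does not select the line avid, because the . must match some character in front of it, while d*avid selects every line that contains avid.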
The following are some example command lines making use of the grep command and regular expressions
§
grep unix tmp.doc
Find any lines containing unix.
§
grep '[Uu]nix' tmp.doc
Find any lines containing either unix or Unix. Notice that the regular expression must be quoted. This is to prevent the shell from treating the [] as shell special characters and performing file name substitution.
§
grep '[^aeiouAEIOU]*' tmp.doc
Match any number of characters that are not vowels. Because * also accepts zero occurrences, this RE actually matches every line in the file, including lines that contain vowels.
§
grep '^abc$' tmp.doc
Match any line that contains only abc.
§
grep 'hel.' tmp.doc
Match hel followed by any other character.
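The effect of * accepting zero occurrences is easy to miss, so here is a small sketch that compares an unanchored and an anchored version of the vowel pattern. The contents of tmp.doc and the variable names are assumptions made up for this illustration.

```shell
# Build a small stand-in for tmp.doc in a scratch directory.
workdir=$(mktemp -d)
doc="$workdir/tmp.doc"
printf '%s\n' 'unix is fun' 'Unix too' 'aeiou' 'abc' 'hello' 'rhythm' > "$doc"

# Zero repetitions satisfy *, so this RE selects all six lines.
unanchored=$(grep -c '[^aeiouAEIOU]*' "$doc")

# Anchoring both ends forces the whole line to be free of vowels.
vowel_free=$(grep '^[^aeiouAEIOU]*$' "$doc")

# A line consisting of abc and nothing else.
only_abc=$(grep -c '^abc$' "$doc")

echo "unanchored matches:      $unanchored"
echo "vowel-free lines:        $vowel_free"
echo "lines that are only abc: $only_abc"

rm -r "$workdir"
```

Adding ^ and $ is one way to turn "may contain non-vowels" into "contains nothing but non-vowels".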
It is important that you realise that regular expressions are different from filename substitution. If you look in the previous examples using grep you will see that the regular expressions are sometimes quoted. One example of this is the command
grep '[^aeiouAEIOU]*' tmp.doc
Remember that [^] and * are all shell special characters. If the quote characters (') were not there the shell would perform filename substitution and replace the pattern with any matching filenames.
In this example command we do not want this to happen. We want the shell to ignore these special characters and pass them to the grep command. The grep command understands regular expressions and will treat them as such.
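To see what the shell does to an unquoted pattern, the sketch below creates a file whose name happens to match [Uu]nix and compares the quoted and unquoted forms. The scratch directory, file names and variable names are assumptions for illustration, and the behaviour shown is that of a Bourne-style shell such as sh or bash with default settings.

```shell
# Work in a scratch directory so the demonstration cannot touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' 'Unix rules' 'unix rules' > tmp.doc
touch unix    # a file whose name happens to match the shell pattern [Uu]nix

# Unquoted: the shell replaces the pattern with the matching filename.
unquoted_arg=$(echo [Uu]nix)
# Quoted: the pattern reaches the command untouched.
quoted_arg=$(echo '[Uu]nix')

# grep therefore searches for different things in the two cases.
unquoted_hits=$(grep -c [Uu]nix tmp.doc)
quoted_hits=$(grep -c '[Uu]nix' tmp.doc)

echo "unquoted argument: $unquoted_arg (lines matched: $unquoted_hits)"
echo "quoted argument:   $quoted_arg (lines matched: $quoted_hits)"
```

Without the quotes grep is handed the literal text unix, so the line containing Unix is silently missed.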
Regular expressions have nothing to do with filename substitution; they are in fact completely different. Table 7.1 highlights the differences between regular expressions and filename substitution.
| Filename substitution | Regular expressions |
| Performed by the shell | Performed by individual commands |
| Used to match filenames | Used to match patterns of characters in data files |
Table 7.1 Regular expressions versus filename substitution
Regular expressions use a number of special characters to match patterns of characters. Table 7.2 outlines these special characters and the patterns they match.
| Character | Matches |
| c | if c is any character other than \ [ . * ^ ] $ then it will match a single occurrence of that character |
| \ | remove the special meaning from the following character |
| . | any one character |
| ^ | the start of a line |
| $ | the end of a line |
| * | 0 or more matches of the previous RE |
| [chars] | any one character in chars, a list of characters |
| [^chars] | any one character NOT in chars, a list of characters |
Table 7.2 Regular expression characters
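The difference between a special character and its backslash-quoted form can be sketched with grep as follows. The sample lines and the count_matches helper are made up for this illustration.

```shell
# Made-up sample lines containing characters that are special in REs.
sample='a.c
abc
a*b
aab'

# Helper: count the sample lines matched by the RE given as $1.
count_matches() {
    printf '%s\n' "$sample" | grep -c "$1"
}

dot_special=$(count_matches 'a.c')     # . matches any one character
dot_literal=$(count_matches 'a\.c')    # \. matches only a real full stop
star_special=$(count_matches 'a*b')    # zero or more a's, then b
star_literal=$(count_matches 'a\*b')   # a, then a real *, then b

echo "a.c: $dot_special  a\\.c: $dot_literal  a*b: $star_special  a\\*b: $star_literal"
```

In each pair the quoted form selects fewer lines, because the backslash removes the special meaning and leaves an ordinary character to match.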
7.1
What will the following simple regular expressions match?
fred
[^D]aily
..^end$
he..o
he\.\.o
\$fred
$fred
Regular expressions are one area in which the heterogeneous nature of UNIX becomes apparent. Regular expressions can be divided into a number of different categories. Different programs on different platforms recognise different subsets of regular expressions.
Under Linux the commands that use regular expressions recognise three basic flavours of regular expressions
§
basic regular expressions,
Those listed in Table 7.2 plus the tagging concept introduced below.
§
extended regular expressions, and
Basic REs plus some additional constructs from Table 7.3.
§ command specific extensions. For example, under Linux sed recognises two additions to REs.
Extended regular expressions add the symbols in Table 7.3 to regular expressions.
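| Construct | Purpose |
| + | match one or more occurrences of the previous RE |
| ? | match zero or one occurrences of the previous RE |
| | (vertical bar) | match either one of two REs separated by the | |
| {n} | match exactly n occurrences of the previous RE |
| {n,} | match at least n occurrences of the previous RE |
| {n,m} | match between n and m occurrences of the previous RE |
Table 7.3 Extended regular expressions
In extended REs the repetition counts are written with plain braces, as shown in Table 7.3. Basic REs, as used by grep, sed and ed, write the same constructs with backslashes: \{n\}, \{n,\} and \{n,m\}.
Some examples with extended REs include
§
egrep 'a?' pattern
Match any line from pattern with 0 or 1 a's (all lines in pattern, because zero occurrences always matches).
§
egrep '(a|b)+' pattern
Match any line that contains one or more occurrences of an a or a b.
§
egrep '.{2}' pattern
Match any line that contains any two characters in a row, that is, any line at least two characters long. The two characters do not have to be the same.
§
egrep '.{2,}' pattern
Match any line that contains two or more characters in a row. Because grep only reports whether a line matches, this selects the same lines as the previous example.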
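The extended constructs can be tried out with grep -E, which is the standard spelling of egrep. This is a minimal sketch on made-up sample lines; the helper names count_ere and count_bre are assumptions for illustration.

```shell
# Made-up sample lines for this sketch.
sample='a
ab
bb
xyz
q'

count_ere() {    # count lines matched by the extended RE in $1
    printf '%s\n' "$sample" | grep -E -c "$1"
}
count_bre() {    # count lines matched by the basic RE in $1
    printf '%s\n' "$sample" | grep -c "$1"
}

optional_a=$(count_ere 'a?')          # zero a's is allowed, so every line matches
a_or_b=$(count_ere '(a|b)+')          # lines containing an a or a b
two_chars=$(count_ere '.{2}')         # lines at least two characters long
two_chars_bre=$(count_bre '.\{2\}')   # the same test written as a basic RE

echo "a?: $optional_a  (a|b)+: $a_or_b  .{2}: $two_chars  basic form: $two_chars_bre"
```

The last two counts agree, which shows that the braces need backslashes in a basic RE but not in an extended one.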
7.2
Write grep commands that use REs to carry out the following.
1. Find any line starting with j in the file /etc/passwd (equivalent to asking to find any username that starts with j).
2. Find any user that has a username that starts with j and uses bash as their login shell (if they use bash their entry in /etc/passwd will end with the full path for the bash program).
3. Find any user that belongs to a group with a group ID between 0 and 99 (group id is the fourth field on each line in /etc/passwd).
Tagging is an extension to regular expressions which allows you to recognise a particular pattern and store it away for future use. For example, consider the regular expression
da\(vid\)
The portion of the RE surrounded by the \( and \) is being tagged. Any pattern of characters that matches the tagged RE, in this case vid, will be stored in a register. The commands that support tagging provide a number of registers in which character patterns can be stored.
It is possible to use the contents of a register in a RE. For example,
\(abc\)\1\1
The first part of this RE defines the pattern that will be tagged and placed into the first register (remember this pattern can be any regular expression). In this case the first register will contain abc. Each of the two \1 that follow will be replaced by the contents of register number 1. So this particular example will match abcabcabc.
The \ characters must be used to remove the other meaning which the brackets and numbers have in a regular expression.
Some example REs using tagging include
§
\(david\)\1
This RE will match daviddavid. It first matches david and stores it into the first register (\(david\)). It then matches the contents of the first register (\1).
§
\(.\)oo\1
Will match words such as noon, moom.
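The tagging examples above can be tried with grep, which supports \( \) and \1 in basic REs. The sample lines and the find_matches helper below are made up for illustration.

```shell
# Made-up sample lines, some of which repeat a tagged pattern.
sample='daviddavid
david
noon
moom
moon
abcabcabc
abcabc'

# Helper: print the sample lines matched by the RE given as $1.
find_matches() {
    printf '%s\n' "$sample" | grep "$1"
}

doubled=$(find_matches '\(david\)\1')     # david followed by david again
same_ends=$(find_matches '\(.\)oo\1')     # same character before and after oo
tripled=$(find_matches '\(abc\)\1\1')     # abc three times in a row

printf 'doubled:\n%s\nsame ends:\n%s\ntripled:\n%s\n' "$doubled" "$same_ends" "$tripled"
```

The line moon is not selected by the second RE because the character after oo differs from the one stored in register 1.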
For the remaining RE examples and exercises I'll be referring to a file called pattern. The following is the contents of pattern.
a
hellohello
goodbye
friend how hello
there how are you how are you
ab
bb
aaa
lll
Parameters
param
7.3
What will the following commands do?
grep '\(a\)\1' pattern
grep '\(.*\)\1' pattern
grep '\( .*\)\1' pattern
So far you’ve been introduced to what regular expressions do and how they work. In this section you will be introduced to some of the commands which allow you to use regular expressions to achieve some quite powerful results.
In the days of yore UNIX did not have full screen editors. Instead the users of the day used the line editor ed. ed was the first UNIX editor and its impact can be seen in commands such as sed, awk, grep and a collection of editors including ex and vi.
vi was written by Bill Joy while he was a graduate student at the University of California at Berkeley (a University responsible for many UNIX innovations). Bill went on to do other things including being involved in the creation of Sun Microsystems.
vi is actually the full-screen mode of ex, which is itself an extended version of ed. Whenever you use :wq to save and quit out of vi you are using an ed command.
All very exciting stuff but what does it mean to you, a trainee Systems Administrator? It actually has at least three major impacts
§ by using vi you can become familiar with the ed commands
§ ed commands allow you to use regular expressions to manipulate and modify text
§ those same ed commands, with regular expressions, can be used with sed to perform all these tasks non-interactively (this means they can be automated).
Why would anyone ever want to use a line editor like ed?
Well in some instances the Systems Administrator doesn't have a choice. There are circumstances where you will not be able to use a full screen editor like vi. In these situations a line editor like ed or ex will be your only option.
One example of this is when you boot a Linux machine with installation boot and root disks. These disks usually don't have space for a full screen editor but they do have ed.
ed is a line editor that recognises a number of commands that can manipulate text. Both vi and sed recognise these same commands. In vi whenever you use the : command you are using ed commands. ed commands use the following format.
[ address [, address]] command [parameters]
(you should be aware that anything between [] is optional)
This means that every ed command consists of
§ 0 or more addresses that specify which lines the command should be performed upon,
§ a single character command, and
§ an optional parameter (depending on the command)
Some example ed commands include
§
1,$s/old/new/g
The address is 1,$ which specifies all lines. The command is the substitute command, and the text that follows it forms the parameters to the command. This particular command will substitute all occurrences of the word old with the word new for all lines within the current file.
§
4d3
The address is line 4. The command is delete. The parameter 3 specifies how many lines to delete. This command will delete 3 lines starting from line 4.
§
d
Same command, delete, but no address or parameters. The default address is the current line and the default number of lines to delete is one. So this command deletes the current line.
§
1,10w /tmp/hello
The address is from line 1 to line 10. The command is write to file. This command will write lines 1 to 10 into the file /tmp/hello.
The ed family of editors keep track of the current line. By default any ed command is performed on the current line. Using the address mechanism it is possible to specify another line or a range of lines on which the command should be performed.
Table 7.4 summarises the possible formats for ed addresses.
| Address | Purpose |
| . | the current line |
| $ | the last line |
| 7 | line 7, any number matches that line number |
| 'a | the line that has been marked as a |
| /RE/ | the next line matching the RE moving forward from the current line |
| ?RE? | the next line matching the RE moving backward from the current line |
| address+n | the line that is n lines after the line specified by address |
| address-n | the line that is n lines before the line specified by address |
| address1,address2 | a range of lines from address1 to address2 |
| , | the same as 1,$, i.e. the entire file from line 1 to the last line ($) |
| ; | the same as .,$, i.e. from the current line (.) to the last line ($) |
Table 7.4 ed addresses
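Several of these address forms can be tried non-interactively with sed, which accepts line numbers, $, /RE/ and ranges (it has no current line, so . and marks are not available). The sketch below uses sed -n together with the p (print) command to show which lines an address selects; the sample text and the select_lines helper are assumptions for illustration.

```shell
# Made-up six-line sample.
sample='one
two
three
four
five
six'

# Helper: print only the lines selected by the sed address given as $1.
select_lines() {
    printf '%s\n' "$sample" | sed -n "${1}p"
}

first_three=$(select_lines '1,3')      # a range of line numbers
last_line=$(select_lines '$')          # $ is the last line
from_four=$(select_lines '/four/,$')   # from the first line matching four to the end

printf '1,3:\n%s\n$:\n%s\n/four/,$:\n%s\n' "$first_three" "$last_line" "$from_four"
```

The address is kept in single quotes when the helper is called so that the shell does not try to interpret the $.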
Regular users of vi will be familiar with the ed commands w and q (write and quit). ed also recognises commands to delete lines of text, to replace characters with other characters and a number of other functions.
Table 7.5 summarises some of the ed commands and their formats. In Table 7.5 range can match any of the address formats outlined in Table 7.4.
| Command | Purpose |
| line a | the append command, allows the user to add text after line number line |
| range d buffer count | the delete command, delete the lines specified by range and count and place them into the buffer buffer |
| range j count | the join command, takes the lines specified by range and count and makes them one line |
| q | quit |
| line r file | the read command, read the contents of the file file and place them after the line line |
| sh | start up a new shell |
| range s/RE/characters/options | the substitute command, find any characters that match RE and replace them with characters but only in the range specified by range |
| u | the undo command, reverses the most recent change |
| range w file | the write command, write to the file file all the lines specified by range |
Table 7.5 ed commands
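The substitute and delete commands can be tried without an interactive editor by handing them to sed, which prints the edited text instead of changing a file. This is a minimal sketch; the sample text and the run_sed helper are made up for illustration.

```shell
# Made-up four-line sample.
sample='hello hello
hello hello
hello hello
bye'

# Helper: apply the single editing command in $1 to the sample text.
run_sed() {
    printf '%s\n' "$sample" | sed "$1"
}

first_only=$(run_sed '1,2s/hello/HELLO/')   # first match on each of lines 1 to 2
every_one=$(run_sed '1,2s/hello/HELLO/g')   # g option: every match on lines 1 to 2
deleted=$(run_sed '2,3d')                   # delete lines 2 and 3

printf 'without g:\n%s\n\nwith g:\n%s\n\nafter 2,3d:\n%s\n' "$first_only" "$every_one" "$deleted"
```

Line 3 is left alone by both substitutions because it falls outside the range 1,2.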
Some more examples of ed commands include
§
5,10s/hello/HELLO/
replace the first occurrence of hello with HELLO on each line between 5 and 10
§
5,10s/hello/HELLO/g
replace all occurrences of hello with HELLO on each line between 5 and 10
§
1,$s/^\(.\{20,20\}\)\(.*\)$/\2\1/
for all lines in the file, take the first 20 characters and put them at the end of the line
The last example deserves a bit more explanation. Let's break it down into its components
§
1,$s
The 1,$ is the range for the command. In this case it is the whole file (from line 1 to the last line). The command is substitute so we are going to replace some text with some other text.
§
/^
The / indicates the start of the RE. The ^ is a RE pattern and it is used to match the start of a line (see Table 7.2).
§
\(.\{20,20\}\)
This RE fragment .\{20,20\} will match any 20 characters. By surrounding it with \( \) those 20 characters will be stored in register 1.
§
\(.*\)$
The .* says match any number of characters and surrounding it with \( \) means those characters will be placed into the next available register (register 2). The $ is the RE character that matches the end of the line. So this fragment takes all the characters after the first 20 until the end of the line and places them into register 2.
§
/\2\1/
This specifies what text should replace the characters matched by the previous RE. In this case the \2 and the \1 refer to registers 2 and 1. Remember from above that the first 20 characters on the line have been placed into register 1 and the remainder of the line into register 2. Writing \2 before \1 therefore puts the remainder of the line first and the original first 20 characters last. Lines with fewer than 20 characters do not match the RE and are left unchanged.
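The same substitution can be tried on a small scale with sed. The sketch below is an illustration only: it moves the first 3 characters rather than the first 20 so that the effect is easy to see, and the sample strings and variable names are made up.

```shell
# Move the first 3 characters of a line to the end of the line.
# \(.\{3\}\) fills register 1 and \(.*\) fills register 2.
swap='s/^\(.\{3\}\)\(.*\)$/\2\1/'

long_line=$(printf '%s\n' 'abcdefgh' | sed "$swap")
short_line=$(printf '%s\n' 'ab' | sed "$swap")

echo "$long_line"     # defghabc
echo "$short_line"    # ab (fewer than 3 characters, so the RE does not match)
```

Keeping the command in a single-quoted variable stops the shell from touching the backslashes and the $ before sed sees them.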
sed is a non-interactive version of ed. sed is given a sequence of ed commands and then performs those commands on its standard input or on files passed as parameters. It is an extremely useful tool for a Systems Administrator. The ed and vi commands are interactive which means they require a human being to perform the tasks. On the other hand sed is non-interactive and can be used in shell programs which means tasks can be automated.
By default the sed command acts like a filter. It takes input from standard input and places output onto standard output. sed can be run using a number of different formats.
sed command [file-list]
sed [-e command] [-f command_file] [filelist]
command is one of the valid ed commands.
The -e command option can be used to specify multiple sed commands. For example,
sed -e '1,$s/david/DAVID/' -e '1,$s/bash/BASH/' /etc/passwd
The -f command_file tells sed to take its commands from the file command_file. That file will contain ed commands one to a line.
Some of the tasks you might use sed for include
§ change the username DAVID in the /etc/passwd to david
§ for any users that are currently using bash as their login shell change them over to the csh.
You could also use vi or ed to perform these same tasks. In the commands that follow, note how the / characters in /bin/bash and /bin/csh have been quoted with a \. This is because the / character is used by the substitute command to separate the text to find from the text to replace it with. It is necessary to quote the / character so sed will treat it as a normal character.
sed 's/DAVID/david/' /etc/passwd
sed -e 's/DAVID/david/' -e 's/\/bin\/bash/\/bin\/csh/' /etc/passwd
sed -f commands /etc/passwd
The last example assumes that there is a file called commands that contains the following
s/DAVID/david/
s/\/bin\/bash/\/bin\/csh/
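These commands can be tried safely on a copy rather than on the real /etc/passwd. The sketch below makes several assumptions for illustration: the two sample entries, the scratch directory, and the ^ and $ anchors, which are added so that only the username field and the shell field can change. It also uses | as the delimiter in the command file, since sed accepts a character other than / after the s, which avoids quoting every / in a path.

```shell
# Scratch copy with two made-up passwd-style entries.
workdir=$(mktemp -d)
passwd_copy="$workdir/passwd"
printf '%s\n' \
    'DAVID:x:1000:1000:David:/home/david:/bin/bash' \
    'jones:x:1001:100:Jones:/home/jones:/bin/sh' > "$passwd_copy"

# Two commands on the command line, each introduced by -e.
# The / characters inside the RE and the replacement are quoted with \.
with_e=$(sed -e 's/^DAVID:/david:/' -e 's/\/bin\/bash$/\/bin\/csh/' "$passwd_copy")

# The same two commands in a command file, one per line, using | as the
# delimiter of the second command so the paths need no quoting.
commands="$workdir/commands"
printf '%s\n' 's/^DAVID:/david:/' 's|/bin/bash$|/bin/csh|' > "$commands"
with_f=$(sed -f "$commands" "$passwd_copy")

echo "$with_e"
rm -r "$workdir"
```

Both forms produce the same text, and the copy itself is not modified because sed writes its result to standard output.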
7.4
Perform the following tasks with both vi and sed.
1. You have just written a history of the UNIX operating system but you referred to UNIX as unix throughout. Replace all occurrences of unix with UNIX.
2. You've just written a Pascal procedure using Write instead of Writeln. The procedure is part of a larger program. Replace Write with Writeln for all lines between the next occurrence of BEGIN and the following END.
3. When you forward a mail message using the elm mail program it automatically adds > to the beginning of every line. Delete all occurrences of > that start a line.
7.5
What do the following ed commands do?
.+1,$d
1,$s/OSF/Open Software Foundation/g
1,/end/s/\([a-z]*\) \([0-9]*\)/\2 \1/
7.6
What are the following commands trying to do? Will they work? If not why not?
sed -e 1,$s/^:/fred:/g /etc/passwd
sed '1,$s/david/DAVID/' '1,$s/bash/BASH/' /etc/passwd
Regular expressions (REs) are a powerful mechanism for matching patterns of characters. REs are understood by a number of commands including vi, grep, sed, ed, awk and Perl.
vi is just one of a family of editors starting with ed and including ex and sed. This entire family recognise ed commands that support the use of regular expressions to manipulate text.
You have been given responsibility for maintaining the 85321 WWW pages. These pages are spread through a large collection of directories and sub-directories. There are some modifications that must be made. Write commands using your choice of awk, sed, find or vi to
§ change the extensions of all .html files to .htm
§ wherever bl_ball.gif appears in a file, change it to rd_ball.gif
§ move all the files that haven't been modified for 28 days into the /usr/local/old directory
§ count the number of times the word 85321 occurs in all files ending in .html
It is often the case that specific users on a system continually use too much disk space. There are a number of solutions to this problem including quotas (talked about in a later chapter).
In the meantime you are going to implement another solution along the following lines. Maintain a file called disk.hog, each line of this file contains a username and the amount of disk space they are allowed to have. For example
jonesd 50000
okellys 10
Write a script called find_hog that is run once a day and performs the following tasks
§ for each user in disk.hog discover how much disk space they are using
§ if the amount of disk space exceeds the allowed amount write their username to a file offender
Hints: Users should only own files under their home directory. The command du -s directoryname can be used to find out how much disk space the directory directoryname and all its child directories use. The file /etc/passwd records the home directory for each user.
Use vi and awk to perform the following tasks with the file 85321.txt (the student numbers have been changed to protect the innocent). This file is available from the 85321 Web site/CD-ROM under the resource materials section for week 3. Unless specified assume each task starts with the original file.
§ remove the student number
§ switch the order for first name, last name
§ remove any student with the name david
Write commands to perform the four tasks outlined in the introduction to this chapter. They were
§ calculate how much disk space each user is using
§ calculate the amount of time each user has spent logged in (try the command last username and see what happens)
§ delete all the files owned by a particular user (be careful doing this one)
§ find all the files that are setuid