Chapter 7

Text Manipulation

Introduction

Many of the tasks a Systems Administrator performs involve the manipulation of textual information.  Some examples include manipulating system log files to generate reports and modifying shell programs.  Manipulating textual information is something UNIX is quite good at, and it provides a number of tools that make such tasks quite simple once you understand how to use them.  The aim of this chapter is to provide you with an understanding of these tools.

By the end of this chapter you should be

§         familiar with using regular expressions,

§         able to use regular expressions and ex commands to perform powerful text manipulation tasks.

Regular expressions

Regular expressions provide a powerful method for matching patterns of characters. Regular expressions (REs) are understood by a number of commands including ed, ex, sed, awk, grep, egrep, expr and even vi.

Some examples of regular expressions include

§         david
Will match any occurrence of the word david

§         [Dd]avid
Will match either david or David

§         .avid
Will match any single character (.) followed by avid

§         ^david$
Will match any line that contains only david

§         d*avid
Will match avid, david, ddavid, dddavid and any other word with repeated ds followed by avid

§         ^[^abcef]avid$
Will match any line with only five characters on the line, where the last four characters must be avid and the first character can be any character except abcef.

Each regular expression is a pattern.  That pattern is used to match other text.  The simplest example of how regular expressions are used by commands is the grep command.

The grep command was introduced in a previous chapter and is used to search through a file and find lines that contain particular patterns of characters.  Once it finds such a line, by default, the grep command will display that line onto standard output. In that previous chapter you were told that grep stood for global regular expression pattern match. Hopefully you now know what a regular expression is.

This means that the patterns that grep searches for are regular expressions.

The following are some example command lines making use of the grep command and regular expressions

§         grep unix tmp.doc
find any lines containing unix

§         grep '[Uu]nix' tmp.doc
find any lines containing either unix or Unix. Notice that the regular expression must be quoted. This is to prevent the shell from treating the [] as shell special characters and performing file name substitution.

§         grep '[^aeiouAEIOU]*' tmp.doc
Match any run of zero or more characters that are not vowels. Because * also accepts zero occurrences, this RE matches every line in the file.

§         grep '^abc$' tmp.doc
Match any line that contains only abc.

§         grep 'hel.' tmp.doc
Match hel followed by any other character.
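The grep examples above are easy to try on a small file of your own. The following is only a sketch: the contents written to the sample file are invented for illustration, and a temporary file stands in for tmp.doc.

```shell
# Create a throwaway sample file (its contents are made up for this sketch).
f=$(mktemp)
printf '%s\n' 'unix is fun' 'Unix is old' 'abc' 'abcd' 'help me' 'hel' > "$f"

# Count the lines containing unix (lower case only).
lower=$(grep -c 'unix' "$f")

# Count the lines containing either unix or Unix.
either=$(grep -c '[Uu]nix' "$f")

# Show the lines that contain only abc.
only_abc=$(grep '^abc$' "$f")

# Count the lines containing hel followed by at least one more character.
hel_dot=$(grep -c 'hel.' "$f")

echo "unix: $lower  [Uu]nix: $either  ^abc\$: $only_abc  hel.: $hel_dot"
```

Notice that the line containing just hel is not counted by the last command, because the . must match one more character.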

REs versus filename substitution

It is important that you realise that regular expressions are different from filename substitution.  If you look in the previous examples using grep you will see that the regular expressions are sometimes quoted.  One example of this is the command

grep '[^aeiouAEIOU]*' tmp.doc

Remember that [^] and * are all shell special characters.  If the quote characters (‘’) were not there the shell would perform filename substitution and replace these special characters with matching filenames.

In this example command we do not want this to happen.  We want the shell to ignore these special characters and pass them to the grep command.  The grep command understands regular expressions and will treat them as such.

Regular expressions have nothing to do with filename substitution; they are completely different. Table 7.1 highlights the differences between regular expressions and filename substitution.


 

Filename substitution        Regular expressions
Performed by the shell       Performed by individual commands
Used to match filenames      Used to match patterns of characters in data files

Table 7.1
Regular expressions versus filename substitution
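A short experiment makes the difference visible. This is a minimal sketch run in an empty scratch directory; the file names unix and notes.txt are invented for the demonstration.

```shell
# Work in an empty scratch directory so the shell's matches are predictable.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch unix                                   # a file whose NAME matches [Uu]nix
printf '%s\n' 'I like Unix' 'I like unix' 'I like VMS' > notes.txt

# Unquoted: the shell performs filename substitution first, so the command
# receives the matching filename rather than the pattern.
unquoted=$(echo [Uu]nix)

# Quoted: the shell passes the characters through untouched.
quoted=$(echo '[Uu]nix')

# With the quotes in place grep receives the RE and applies it to the data.
matches=$(grep -c '[Uu]nix' notes.txt)

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
echo "lines matched by grep: $matches"
```

Without the quotes, grep would have been asked to search for the filename unix, which is not the same thing as the regular expression [Uu]nix.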

How they work

Regular expressions use a number of special characters to match patterns of characters. Table 7.2 outlines these special characters and the patterns they match.

Character    Matches
c            if c is any character other than \ [ . * ^ ] $ then it will match a single occurrence of that character
\            remove the special meaning from the following character
.            any one character
^            the start of a line
$            the end of a line
*            0 or more matches of the previous RE
[chars]      any one character in chars, a list of characters
[^chars]     any one character NOT in chars, a list of characters

Table 7.2
Regular expression characters
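The characters in Table 7.2 can be explored with grep. The sketch below counts matching lines in a small sample file; the sample lines are invented purely to show each character at work.

```shell
# Sample data chosen so that each RE matches a different set of lines.
sample=$(mktemp)
printf '%s\n' 'a.c' 'abc' 'ac' 'aXXc' '$5' > "$sample"

dot=$(grep -c '^a.c$' "$sample")       # . is any one character: a.c and abc
literal=$(grep -c '^a\.c$' "$sample")  # \. is only a real full stop: a.c
star=$(grep -c '^aX*c$' "$sample")     # X* is zero or more X: ac and aXXc
notb=$(grep -c '^a[^b]c$' "$sample")   # [^b] is any one character except b: a.c
dollar=$(grep -c '\$5' "$sample")      # \$ is a literal dollar sign: $5

echo "dot=$dot literal=$literal star=$star notb=$notb dollar=$dollar"
```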

Exercises

7.1         What will the following simple regular expressions match?
  fred
  [^D]aily
  ..^end$
  he..o
  he\.\.o
  \$fred
  $fred

Extensions to regular expressions

Regular expressions are one area in which the heterogeneous nature of UNIX becomes apparent. Regular expressions can be divided into a number of different categories. Different programs on different platforms recognise different subsets of regular expressions.

Under Linux the commands that use regular expressions recognise three basic flavours of regular expressions

§         basic regular expressions,
Those listed in Table 7.2 plus the tagging concept introduced below.

§         extended regular expressions, and
Basic REs plus some additional constructs from Table 7.3.

§         command specific extensions. For example, under Linux sed recognises two additions to REs.

Extended regular expressions add the symbols in Table 7.3 to regular expressions.

Construct    Purpose
+            match one or more occurrences of the previous RE
?            match zero or one occurrences of the previous RE
|            match either one of two REs separated by the |
\{n\}        match exactly n occurrences of the previous RE
\{n,\}       match at least n occurrences of the previous RE
\{n,m\}      match between n and m occurrences of the previous RE

Table 7.3
Extended regular expressions

Examples

Some examples with extended REs include

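§         egrep 'a?' pattern
Match any line from pattern with 0 or 1 a's. Because zero occurrences also count as a match, this selects all lines in pattern.

§         egrep '(a|b)+' pattern
Match any line that contains one or more occurrences of an a or a b.

§         egrep '.\{2\}' pattern
Match any line that contains at least two characters in a row. The . matches any character, so the two characters need not be the same. Note that POSIX extended REs write this interval as {2}; the backslashed form \{2\} is the basic RE spelling used by grep, so some versions of egrep may require '.{2}' instead.

§         egrep '.\{2,\}' pattern
Match any line that contains two or more characters in a row, which selects the same lines as the previous example.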
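You can check how these constructs behave with a few lines of test data. The sketch below uses grep -E, which POSIX defines as the equivalent of egrep, and writes the interval as {2}, the POSIX extended RE spelling; the sample words are invented for illustration.

```shell
# Sample words, one per line.
words=$(mktemp)
printf '%s\n' 'color' 'colour' 'colouur' 'cat' 'dog' 'x' > "$words"

optional=$(grep -E -c '^colou?r$' "$words")   # ? : zero or one u
oneplus=$(grep -E -c '^colou+r$' "$words")    # + : one or more u
either=$(grep -E -c '^(cat|dog)$' "$words")   # | : either alternative
two=$(grep -E -c '.{2}' "$words")             # {2}: any two characters in a row

echo "optional=$optional oneplus=$oneplus either=$either two=$two"
```

Only the single-character line x fails the last test, since every other line is at least two characters long.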

Exercises

7.2         Write grep commands that use REs to carry out the following.
1.  Find any line starting with j in the file /etc/passwd (equivalent to asking to find any username that starts with j).
2.  Find any user that has a username that starts with j and uses bash as their login shell (if they use bash their entry in /etc/passwd will end with the full path for the bash program).
3.  Find any user that belongs to a group with a group ID between 0 and 99 (group id is the fourth field on each line in /etc/passwd).

Tagging

Tagging is an extension to regular expressions which allows you to recognise a particular pattern and store it away for future use. For example, consider the regular expression

da\(vid\)

The portion of the RE surrounded by the \( and \) is being tagged. Any pattern of characters that matches the tagged RE, in this case vid, will be stored in a register. The commands that support tagging provide a number of registers in which character patterns can be stored.

It is possible to use the contents of a register in a RE. For example,

\(abc\)\1\1

The first part of this RE defines the pattern that will be tagged and placed into the first register (remember this pattern can be any regular expression).  In this case the first register will contain abc. The 2 following \1 will be replaced by the contents of register number 1. So this particular example will match abcabcabc.

The \ characters must be used to remove the other meaning which the brackets and numbers have in a regular expression.

For example

Some example REs using tagging include

§         \(david\)\1
This RE will match daviddavid. It first matches david and stores it into the first register (\(david\)). It then matches the contents of the first register (\1).

§         \(.\)oo\1
Will match words such as noon, moom.
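The two tagging examples above can be tried with grep, which supports tagging in its basic REs. This is a small sketch; the word list is invented for the demonstration.

```shell
# A few words, some of which repeat a tagged pattern.
list=$(mktemp)
printf '%s\n' 'noon' 'moom' 'moon' 'daviddavid' 'david' > "$list"

# \(david\)\1 needs david twice in a row, so plain david is not enough.
doubled=$(grep '\(david\)\1' "$list")

# \(.\)oo\1 needs the SAME character on both sides of oo: noon and moom
# qualify, moon does not because m and n differ.
same_ends=$(grep -c '\(.\)oo\1' "$list")

echo "doubled: $doubled"
echo "words with matching ends: $same_ends"
```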

For the remaining RE examples and exercises I'll be referring to a file called pattern. The following is the contents of pattern.

a
hellohello
goodbye
friend how hello
there how are you how are you
ab
bb
aaa
lll
Parameters
param


Exercises

7.3         What will the following commands do
grep '\(a\)\1' pattern
grep '\(.*\)\1' pattern
grep '\( .*\)\1' pattern

ex, ed, sed and vi

So far you’ve been introduced to what regular expressions do and how they work.  In this section you will be introduced to some of the commands which allow you to use regular expressions to achieve some quite powerful results.

In the days of yore UNIX did not have full screen editors. Instead the users of the day used the line editor ed. ed was the first UNIX editor and its impact can be seen in commands such as sed, awk, grep and a collection of editors including ex and vi.

vi was written by Bill Joy while he was a graduate student at the University of California at Berkeley (a University responsible for many UNIX innovations). Bill went on to do other things including being involved in the creation of Sun Microsystems.

vi is actually a full-screen version of ed. Whenever you use :wq to save and quit out of vi you are using an ed command.

So???

All very exciting stuff, but what does it mean to you, a trainee Systems Administrator? It actually has at least three major impacts

§         by using vi you can become familiar with the ed commands

§         ed commands allow you to use regular expressions to manipulate and modify text

§         those same ed commands, with regular expressions, can be used with sed to perform all these tasks non-interactively (this means they can be automated).

Why use ed ?

Why would anyone ever want to use a line editor like ed?

Well in some instances the Systems Administrator doesn't have a choice. There are circumstances where you will not be able to use a full screen editor like vi. In these situations a line editor like ed or ex will be your only option.

One example of this is when you boot a Linux machine with installation boot and root disks. These disks usually don't have space for a full screen editor but they do have ed.


ed commands

ed is a line editor that recognises a number of commands that can manipulate text. Both vi and sed recognise these same commands.  In vi whenever you use the : command you are using ed commands. ed commands use the following format.

[ address [, address]] command [parameters]

(you should be aware that anything between [] is optional)

This means that every ed command consists of

§         0 or more addresses that specify which lines the command should be performed upon,

§         a single character command, and

§         an optional parameter (depending on the command)

For example

Some example ed commands include

§         1,$s/old/new/g
The address is 1,$ which specifies all lines. The command is the substitute command, with the following text forming the parameters to the command. This particular command will substitute all occurrences of the word old with the word new for all lines within the current file.

§         4d3
The address is line 4. The command is delete. The parameter 3 specifies how many lines to delete. This command will delete 3 lines starting from line 4.

§         d
Same command, delete but no address or parameters. The default address is the current line and the default number of lines to delete is one. So this command deletes the current line.

§         1,10w/tmp/hello
The address is from line 1 to line 10. The command is write to file. This command will write lines 1 to 10 into the file /tmp/hello
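If you would like to see the effect of commands like these without starting an interactive editor, sed can run them from the command line. The following is a sketch only: sed is used here as a non-interactive stand-in for ed, the sample file is invented, and the delete is written as the range 4,6 because sed's d command takes no count.

```shell
# A seven-line sample file and a file to receive written lines.
doc=$(mktemp)
out=$(mktemp)
printf '%s\n' 'old one' 'old old two' 'three' 'four' 'five' 'six' 'seven' > "$doc"

# 1,$s/old/new/g : replace every occurrence of old on every line.
replaced=$(sed '1,$s/old/new/g' "$doc")

# Delete three lines starting at line 4, expressed as the range 4,6.
deleted=$(sed '4,6d' "$doc")

# 1,2w file : write lines 1 to 2 into another file (-n stops sed from also
# printing every line to standard output).
sed -n "1,2w $out" "$doc"

echo "$replaced"
echo "$deleted"
cat "$out"
```

None of these commands change the sample file itself; sed writes its results to standard output or to the file named after w.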

The current line

The ed family of editors keep track of the current line. By default any ed command is performed on the current line. Using the address mechanism it is possible to specify another line or a range of lines on which the command should be performed.

Table 7.4 summarises the possible formats for ed addresses.


 

Address              Purpose
.                    the current line
$                    the last line
7                    line 7, any number matches that line number
'a                   the line that has been marked as a
/RE/                 the next line matching the RE moving forward from the current line
?RE?                 the next line matching the RE moving backward from the current line
address+n            the line that is n lines after the line specified by address
address-n            the line that is n lines before the line specified by address
address1,address2    a range of lines from address1 to address2
,                    the same as 1,$, i.e. the entire file from line 1 to the last line ($)
;                    the same as .,$, i.e. from the current line (.) to the last line ($)

Table 7.4
ed addresses
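Several of the address formats in Table 7.4 can be tried out with sed and its p (print) command. Treat this as a rough sketch rather than a complete tour: sed has no current line, so only line numbers, $, /RE/ and ranges are shown, and in sed a /RE/ address selects every matching line rather than just the next one. The sample file is invented.

```shell
# Five sample lines.
lines=$(mktemp)
printf '%s\n' 'one' 'two' 'three' 'four' 'five' > "$lines"

third=$(sed -n '3p' "$lines")          # a line number: line 3
last=$(sed -n '$p' "$lines")           # $ : the last line
by_re=$(sed -n '/^f/p' "$lines")       # /RE/ : lines starting with f
range=$(sed -n '2,/^fo/p' "$lines")    # from line 2 to the next line matching ^fo

echo "$third"
echo "$last"
echo "$by_re"
echo "$range"
```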

ed commands

Regular users of vi will be familiar with the ed commands w and q (write and quit). ed also recognises commands to delete lines of text, to replace characters with other characters and a number of other functions.

Table 7.5 summarises some of the ed commands and their formats. In Table 7.5 range can match any of the address formats outlined in Table 7.4.


 

Command                          Purpose
line a                           the append command, allows the user to add text after line number line
range d buffer count             the delete command, delete the lines specified by range and count and place them into the buffer buffer
range j count                    the join command, takes the lines specified by range and count and makes them one line
q                                quit
line r file                      the read command, read the contents of the file file and place them after the line line
sh                               start up a new shell
range s/RE/characters/options    the substitute command, find any characters that match RE and replace them with characters but only in the range specified by range
u                                the undo command
range w file                     the write command, write to the file file all the lines specified by range

Table 7.5
ed commands

For example

Some more examples of ed commands include

§         5,10s/hello/HELLO/
replace the first occurrence of hello with HELLO for all lines between 5 and 10

§         5,10s/hello/HELLO/g
replace all occurrences of hello with HELLO for all lines between 5 and 10

§         1,$s/^\(.\{20,20\}\)\(.*\)$/\2\1/
for all lines in the file, take the first 20 characters and put them at the end of the line

The last example

The last example deserves a bit more explanation. Let's break it down into its components

§         1,$s
The 1,$ is the range for the command. In this case it is the whole file (from line 1 to the last line). The command is substitute so we are going to replace some text with some other text.

§         /^
The / indicates the start of the RE. The ^ is a RE pattern and it is used to match the start of a line (see Table 7.2).

§         \(.\{20,20\}\)
This RE fragment .\{20,20\} will match any 20 characters. By surrounding it with \( \) those 20 characters will be stored in register 1.

§         \(.*\)$
The .* says match any number of characters and surrounding it with \( \) means those characters will be placed into the next available register (register 2). The $ is the RE character that matches the end of the line. So this fragment takes all the characters after the first 20 until the end of the line and places them into register 2.

§         /\2\1/
This specifies what text should replace the characters matched by the previous RE. In this case the \2 and the \1 refer to registers 2 and 1 respectively. Remember from above that the first 20 characters on the line have been placed into register 1 and the remainder of the line into register 2, so the replacement writes the remainder of the line first, followed by the first 20 characters.
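To watch this substitution at work you can feed it a single line through sed. This is a minimal sketch; the 26-character sample line is invented so that the first 20 characters (digits) are easy to tell apart from the rest (letters).

```shell
# 20 digits followed by 6 letters.
line='01234567890123456789ABCDEF'

# The same substitute command, applied by sed to each line of its input.
swapped=$(printf '%s\n' "$line" | sed 's/^\(.\{20,20\}\)\(.*\)$/\2\1/')

# A line with fewer than 20 characters does not match the RE at all,
# so it passes through unchanged.
short=$(printf '%s\n' 'too short' | sed 's/^\(.\{20,20\}\)\(.*\)$/\2\1/')

echo "$swapped"
echo "$short"
```

The first echo shows the letters moved to the front of the line, followed by the 20 digits.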

The sed command

sed is a non-interactive version of ed. sed is given a sequence of ed commands and then performs those commands on its standard input or on files passed as parameters.  It is an extremely useful tool for a Systems Administrator.  The ed and vi commands are interactive, which means they require a human being to perform the tasks.  On the other hand, sed is non-interactive and can be used in shell programs, which means tasks can be automated.

sed command format

By default the sed command acts like a filter. It takes input from standard input and places output onto standard output. sed can be run using a number of different formats.

sed command [file-list]
sed [-e command] [-f command_file] [file-list]

command is one of the valid ed commands.

The -e command option can be used to specify multiple sed commands.  For example,

sed -e '1,$s/david/DAVID/' -e '1,$s/bash/BASH/' /etc/passwd

The -f command_file tells sed to take its commands from the file command_file. That file will contain ed commands one to a line.

For example

Some of the tasks you might use sed for include

§         change the username DAVID in the /etc/passwd to david

§         for any users that are currently using bash as their login shell change them over to the csh.

You could also use vi or ed to perform these same tasks. Note how each / in /bin/bash and /bin/csh in the commands below has been quoted with a \. This is because the / character is used by the substitute command to separate the text to find from the text to replace it with. It is necessary to quote the / character so sed will treat it as a normal character.

sed 's/DAVID/david/' /etc/passwd
sed -e 's/DAVID/david/' -e 's/\/bin\/bash/\/bin\/csh/' /etc/passwd
sed -f commands /etc/passwd

The last example assumes that there is a file called commands that contains the following

s/DAVID/david/
s/\/bin\/bash/\/bin\/csh/
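It is safer to experiment with commands like these on a copy of the data rather than on the real /etc/passwd. The sketch below works on two invented passwd-style entries. As assumptions of this sketch, the REs are anchored with ^ and $ so that only the username and shell fields can match, and the command file uses | instead of / as the delimiter of its second substitute command, which sed accepts and which removes the need to quote each /.

```shell
work=$(mktemp -d)

# Two made-up entries in /etc/passwd format.
printf '%s\n' \
  'DAVID:x:1000:1000:David:/home/david:/bin/bash' \
  'jane:x:1001:1001:Jane:/home/jane:/bin/sh' > "$work/passwd"

# Two commands given with -e; each / inside the RE is quoted with a \.
with_e=$(sed -e 's/^DAVID:/david:/' -e 's/\/bin\/bash$/\/bin\/csh/' "$work/passwd")

# The same two commands stored one per line in a command file and run with -f.
printf '%s\n' 's/^DAVID:/david:/' 's|/bin/bash$|/bin/csh|' > "$work/commands"
with_f=$(sed -f "$work/commands" "$work/passwd")

echo "$with_e"
echo "$with_f"
```

Both forms produce the same output, and neither changes the sample file. To keep the result you would redirect the output to a new file.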

Exercises

7.4         Perform the following tasks with both vi and sed.
You have just written a history of the UNIX operating system but you referred to UNIX as unix throughout. Replace all occurrences of unix with UNIX
You've just written a Pascal procedure using Write instead of Writeln. The procedure is part of a larger program. Replace Write with Writeln for all lines between the next occurrence of BEGIN and the following END
When you forward a mail message using the elm mail program it automatically adds > to the beginning of every line. Delete all occurrences of > that start a line.

7.5         What do the following ed commands do?
.+1,$d
1,$s/OSF/Open Software Foundation/g
1,/end/s/\([a-z]*\) \([0-9]*\)/\2 \1/

7.6         What are the following commands trying to do?  Will they work?  If not why not?
sed -e 1,$s/^:/fred:/g /etc/passwd
sed '1,$s/david/DAVID/' '1,$s/bash/BASH/' /etc/passwd

Conclusions

Regular expressions (REs) are a powerful mechanism for matching patterns of characters. REs are understood by a number of commands including vi, grep, sed, ed, awk and Perl.

vi is just one of a family of editors starting with ed and including ex and sed. This entire family recognise ed commands that support the use of regular expressions to manipulate text.


Review Questions

7.1

You have been given responsibility for maintaining the 85321 WWW pages. These pages are spread through a large collection of directories and sub-directories. There are some modifications that must be made. Write commands using your choice of awk, sed, find or vi to

§         change the extensions of all .html files to .htm

§         wherever bl_ball.gif appears in a file, change it to rd_ball.gif

§         move all the files that haven't been modified for 28 days into the /usr/local/old directory

§         count the number of times the word 85321 occurs in all files ending in .html

7.2

It is often the case that specific users on a system continually use too much disk space. There are a number of solutions to this problem including quotas (talked about in a later chapter).

In the meantime you are going to implement another solution along the following lines. Maintain a file called disk.hog, each line of this file contains a username and the amount of disk space they are allowed to have. For example

jonesd 50000
okellys 10

Write a script called find_hog that is run once a day and performs the following tasks

§         for each user in disk.hog discover how much disk space they are using

§         if the amount of disk space exceeds the allowed amount write their username to a file offender

Hints: Users should only own files under their home directory. The command du -s directoryname can be used to find out how much disk space the directory directoryname and all its child directories use. The file /etc/passwd records the home directory for each user.

7.3

Use vi and awk to perform the following tasks with the file 85321.txt (the student numbers have been changed to protect the innocent). This file is available from the 85321 Web site/CD-ROM under the resource materials section for week 3. Unless specified assume each task starts with the original file.

§         remove the student number

§         switch the order for first name, last name

§         remove any student with the name david

 

7.4

Write commands to perform the four tasks outlined in the introduction to this chapter. They were

§         calculate how much disk space each user is using

§         calculate the amount of time each user has spent logged in (try the command last username and see what happens)

§         delete all the files owned by a particular user (be careful doing this one)

§         find all the files that are setuid