Sed and awk are power tools for editing.


The key is to learn to recognize situations when using one of these tools can pay off.


Sed is a non-interactive stream editor. Sed is best for:


1) automating editing actions to be performed on one or more files

2) simplifying the task of performing the same edits on multiple files

3) writing conversion programs


Awk is a pattern-matching programming language. It is best suited for processing data that has some type of structure.


Awk allows a user to:


1) View a text file as a textual database made up of records and fields

2) Use variables to manipulate the database.

3) Use arithmetic and string operators

4) Use common programming constructs such as loops and conditionals

5) Generate formatted reports

6) Define functions

7) Execute UNIX commands from a script

8) Process the result of UNIX commands

9) Process command-line arguments more gracefully

10) Work more easily with multiple input streams


REGULAR EXPRESSIONS! REGULAR EXPRESSIONS!


Basic: . * [...] [^...] ^ $ \

Extended: + ? | ( )


Writing a sed script:


Think through what you want to do before you do it.

Describe, unambiguously, a procedure to do it.

Test the procedure repeatedly before committing to any final changes.


How Sed Works


All editing commands in a script are applied in order to each line of input. Commands are applied to all lines (globally) unless line addressing restricts the lines affected by editing commands. The original input file is unchanged; the editing commands modify a copy of each original input line, and the copy is sent to standard output.


Sed applies the entire script to the first input line before reading the second input line and applying the editing script to it. Because sed is always working with the latest version of the original line, any edit that is made changes the line for subsequent edits by the script. This means that a pattern that might have matched the original input line may no longer match the line after an edit has been made.
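
A small sketch of this behavior, using a made-up one-line input: the second command only matches because the first one has already changed the line, and swapping the two commands gives a different result.

```shell
# Each command sees the line as the previous command left it.
chained=$(printf 'pig\n' | sed -e 's/pig/cow/' -e 's/cow/horse/')
echo "$chained"      # prints: horse

# With the order reversed, the first command finds nothing to match,
# and the second turns "pig" into "cow".
reordered=$(printf 'pig\n' | sed -e 's/cow/horse/' -e 's/pig/cow/')
echo "$reordered"    # prints: cow
```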


Syntax


A line address is optional with any command. It can be a pattern described as a regular expression surrounded by slashes, a line number, or a line addressing symbol. Most sed commands can accept two comma-separated addresses that indicate a range of lines; these are written [address]. Some commands accept only a single address and cannot be applied to a range of lines; these are written [line-address].


[address]command can be applied to a single line, a pattern match, or a range of lines, e.g. 5,7

[line-address]command accepts a pattern match or a single line address; cannot be applied to a range of lines

[address]{

command1

command2

command3

}
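
The address forms above can be tried on a small made-up input. This is only a sketch; the five-line sample and the variable names are assumptions for illustration.

```shell
sample=$(printf 'one\ntwo\nthree\nfour\nfive\n')

# A line-number range: print only lines 2 through 3.
by_range=$(printf '%s\n' "$sample" | sed -n '2,3p')

# A regular-expression address: delete every line containing "t".
by_pattern=$(printf '%s\n' "$sample" | sed '/t/d')

# The $ addressing symbol selects the last line.
last_line=$(printf '%s\n' "$sample" | sed -n '$p')

# Braces group several commands under one address.
grouped=$(printf '%s\n' "$sample" | sed -n '/^f/{
s/^f/F/
p
}')

printf '%s\n' "$by_range" "$by_pattern" "$last_line" "$grouped"
```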


There are 25 commands:


Substitution

[address]s/pattern/replacement/flags flags for s are n (replace only the nth occurrence), g (global), p (print), and w file (write to file)

The following are some examples of using sed to perform substitutions. Remember that "." is the regular expression symbol that matches any one character.

ls -la Output of the ls -la command, which is the input to the following examples.

ls -la | sed -e 's/r.x/asd/g' Example output, which replaces all occurrences of r.x

ls -la | sed -e 's/r.x/asd/1' Example output, which replaces the first occurrence of r.x

ls -la | sed -e 's/r.x/asd/2' Example output, which replaces the second occurrence of r.x

ls -la | sed -ne 's/rwx/asd/p' Example output, where the -n command-line switch is used with the p flag.

ls -la | sed -ne 's/r.x/(&)/3pw ./ex/s3pw' Example output. The -n command-line switch suppresses the default output and is used with the p flag to print only the lines where a substitution is made. The "&" metacharacter represents the character string that was matched. The w flag also writes those lines to the file ./ex/s3pw (they are still sent to standard output as well as written to the file).

ls -la | sed -e '/\./s/rwx/asd/g' Example output. This shows a pattern address that precedes the "s" command. The "\" escape character is used to escape the ".", which normally has the meaning of a wildcard character. Here the escape causes it to be interpreted as a literal ".". The substitution will only be run on lines that match the pattern (in this case, lines containing a period).
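
Because the output of ls -la differs from system to system, the sketch below applies the same flags to one fixed, made-up listing line so the results can be checked by eye. The sample line and variable names are assumptions for illustration.

```shell
line='drwxr-xr-x docs'

# g: replace every occurrence of r.x
all=$(printf '%s\n' "$line" | sed -e 's/r.x/asd/g')

# 2: replace only the second occurrence
second=$(printf '%s\n' "$line" | sed -e 's/r.x/asd/2')

# 3 with p and &: wrap the third match in parentheses and print the line
wrapped=$(printf '%s\n' "$line" | sed -ne 's/r.x/(&)/3p')

# /\./ address: the line contains no literal period, so nothing changes
guarded=$(printf '%s\n' "$line" | sed -e '/\./s/rwx/asd/g')

printf '%s\n' "$all" "$second" "$wrapped" "$guarded"
```

The four lines printed are dasdasdasd docs, drwxasdr-x docs, drwxr-x(r-x) docs, and the unchanged drwxr-xr-x docs.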


Delete (lower case)

[address]d

ls -la | sed -e '1,6d' Example output, which deletes the first through sixth lines


The next three commands must be specified over multiple lines. This is a different syntax than all other sed commands.
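
As a rough illustration of that multi-line form, the sketch below runs append (a), insert (i), and change (c) against two made-up passwd-style lines. Inside a script the text simply goes on the line after the backslash; a ">" at the start of a continued line in an interactive session is the shell's continuation prompt, not something to type. The sample data and variable names are assumptions.

```shell
passwd_sample=$(printf 'root:/bin/bash\ndaemon:/sbin/nologin\n')

# a\ adds the text after each matching line.
appended=$(printf '%s\n' "$passwd_sample" | sed -e '/bash/a\
previous line had bash in it')

# i\ adds the text before each matching line.
inserted=$(printf '%s\n' "$passwd_sample" | sed -e '/bash/i\
next line has bash in it')

# c\ replaces each matching line with the text.
changed=$(printf '%s\n' "$passwd_sample" | sed -e '/nologin/c\
account disabled')

printf '%s\n---\n%s\n---\n%s\n' "$appended" "$inserted" "$changed"
```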


append

[line-address]a\

text

sed -e '/bash/a\

> previous line had bash in it'


insert

[line-address]i\

text

ls -la | sed -e '/bin/i\

> text_string'


change

[address]c\

text

ls -la | sed -e '/bin/c\

> text_string'


list Displays the contents of the pattern space. Non-printable characters are shown as backslash escapes or three-digit octal codes.

l

ls -la | sed -ne 'l'


transform

[address]y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/

ls -la | sed -e 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/' Example output, which transforms all lowercase letters to upper case.

ls -la | sed -e 'y/zabcdefghijklmnopqrstuvwxy/ABCDEFGHIJKLMNOPQRSTUVWXYZ/' Example output with the source list shifted by one position, so each letter is transformed to the upper-case form of the letter that follows it (z becomes A).


print

p

ls -la | sed -ne '/^d/p' Example output, which prints only lines that match "^d". (^ is the regular expression symbol for matching the beginning of the line.) This will print all lines that begin with a d, which in our case are directories.

ls -la | sed -ne '/^[^d].w/p' Example output, which prints only lines that match "^[^d].w". (^ is the regular expression symbol for matching the beginning of the line. [ ] is the regular expression method of listing a group of characters to match; if the first character inside the brackets is ^, the match is inverted.) This will print all lines that DO NOT begin with a d and have w as the third character. In this case that means files (not directories) that are writable by the owner.


print line number

[line-address]= The following script will print all lines containing "if", each preceded by its line number

/if/{

=

p

}

ls -la | sed -nf ex/lnscr Example output


next

outputs the contents of the pattern space and then reads the next line of input without returning to the top of the script. Next is a flow-control command: when [address] matches, the current input line is output and the remaining commands in the script are applied to the newly read line, so the commands that precede n are never applied to that line.

[address]n

ls -la | nl | sed -f /home/dbartley/nextscr Example output. Notice that all lines are output, but some lines have the first substitution executed and some lines have the last.
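
The nextscr script itself is not reproduced in these notes, so the following is only a sketch of the same effect on a made-up four-line input. In the second command, line 2 receives only the first substitution and line 3 receives only the last.

```shell
# With -n, "n" discards the current line and "p" prints the one read next,
# so only the even-numbered lines appear.
evens=$(printf '1\n2\n3\n4\n' | sed -n -e 'n' -e 'p')

# Line 2 gets the prefix, is output by n, and line 3 is then read in the
# middle of the script, so line 3 gets only the suffix.
mixed=$(printf '1\n2\n3\n4\n' | sed -e 's/^/A:/' -e '2n' -e 's/$/:Z/')

printf '%s\n---\n%s\n' "$evens" "$mixed"
```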


reading and writing file

[line-address]r file reads the contents of file and sends them to the output after the addressed line (the contents do not pass through the pattern space)

[address]w file writes the pattern space to file


quit need I say more?? If the flow of the program gets to the q command, execution is done.

q


Advanced sed Commands (Three categories)


Working with a multiline pattern space (N,D,P)


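Multiline Next Appends the next line of input to the pattern space, separated from the existing contents by a newline. \n can now match that embedded newline. ^ and $ still match only the beginning and end of the whole pattern space, not of each line within it.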

N


Multiline Delete Deletes the portion of the pattern space up to and including the first embedded newline. It then returns to the top of the script without reading a new line of input, unless the pattern space is now empty.

D


Multiline Print Prints the pattern space up to the first embedded newline

P
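
A minimal sketch of the three multiline commands on made-up input. The last command is the well-known idiom for squeezing a run of blank lines down to one; the inputs and variable names are assumptions.

```shell
# N joins each pair of lines; the s command then replaces the embedded newline.
pairs=$(printf 'a\nb\nc\nd\n' | sed -e 'N' -e 's/\n/ /')

# P prints only up to the embedded newline, i.e. the first line of each pair.
firsts=$(printf 'a\nb\nc\nd\n' | sed -n -e 'N' -e 'P')

# D drops the first of two blank lines and restarts the script without
# reading new input, so a run of blank lines collapses to a single one.
squeezed=$(printf 'a\n\n\n\nb\n' | sed '/^$/{
N
/^\n$/D
}')

printf '%s\n---\n%s\n---\n%s\n' "$pairs" "$firsts" "$squeezed"
```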


Commands using the hold space to preserve the contents of the pattern space and make it available for subsequent commands. (H,h,G,g,x)


Hold h or H Copy or append contents of pattern space to hold space.

Get g or G Copy or append contents of hold space to pattern space.

Exchange x Swap the contents of hold space and pattern space.


The lowercase version overwrites the destination (the hold space for h, the pattern space for g). The uppercase version appends to the destination, after a newline.
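
One common way to see the hold space at work is the idiom that prints a file in reverse line order. This is a sketch on a made-up three-line input.

```shell
# 1!G  on every line but the first, append the hold space to the pattern space
# h    copy the (growing) pattern space back into the hold space
# $p   on the last line, print the accumulated result
reversed=$(printf 'a\nb\nc\n' | sed -n -e '1!G' -e 'h' -e '$p')
printf '%s\n' "$reversed"
```

The output is c, b, a on separate lines.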


Writing scripts that use branching and conditional instructions to change the flow of control. (:,b,t)


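[address]b[label] Branch: transfers control unconditionally to label

[address]t[label] Test: branches to label only if a substitution has succeeded on the current line, so it usually appears directly after an s command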

[address]{

s/pattern/string/1

t label

...

}

:label put this where you want control in the script to go

The label is optional for b and t; if it is omitted, control passes to the end of the script.
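
A minimal sketch of both branching commands on made-up input; the label name "again" is an arbitrary choice.

```shell
# t loops back to :again for as long as the substitution keeps succeeding,
# collapsing every run of spaces to a single space.
collapsed=$(printf 'a   b    c\n' | sed -e ':again' -e 's/  / /' -e 't again')

# b with no label jumps to the end of the script, so lines containing
# "keep" skip the substitution that follows.
skipped=$(printf 'x keep\nx\n' | sed -e '/keep/b' -e 's/x/y/')

printf '%s\n---\n%s\n' "$collapsed" "$skipped"
```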





AWK Named for Aho, Weinberger and Kernighan


Awk is like sed in that it applies a script, from beginning to end, to each line of the input until it reaches the end of the input. Both sed and awk complete the actions associated with one input line before proceeding to the next. In addition, awk gives the script writer the option of running commands before any input is read and after all input has been read, by placing them in BEGIN and END blocks.
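
A minimal sketch of that flow, summing a made-up column of numbers:

```shell
report=$(printf '3\n4\n5\n' | awk '
BEGIN { print "start" }          # runs once, before any input is read
      { total += $1 }            # runs for every input line
END   { print "total", total }   # runs once, after the last line
')
printf '%s\n' "$report"
```

This prints "start" followed by "total 12".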



Pattern Matching


A command can be executed based on a regular expression pattern match. For example:

/[0-9]+/ { print "That is an integer" }

/[A-Za-z]+/ { print "That is a character" }

/^$/ { print "This is a blank line" }


Comments in a script begin with a '#' and end with a new line and are allowed anywhere in the script.


Records and Fields


The input to awk is usually structured. The input is separated into words by delimiters. The default delimiter is one or more spaces or tabs. A script writer can refer to these words using the field operator $. For example:

echo a b c | awk '{print $2, $3, $1 }'

b c a


The field separator can be configured in a script with the command FS = "regular_expression"

for example

BEGIN { FS = ":" }

or on a command line by placing a -F followed by the field separator. For example

echo a,b,c | awk -F, '{print $2, $3, $1 }'

b c a

echo axbyc | awk -F"[xy]" '{print $2, $3, $1 }'

b c a # a bracket expression lets any one of several characters act as the field separator.

# -F"\t+" one or more tab characters is the field separator


A pattern can also be matched in a field using the tilde (~) operator or the bang tilde (!~) operator. For example


$5 ~ /regexp/ { print "match found in 5th field" }

$5 !~ /regexp/ { print "NO match found in 5th field." }


Example to record the time and internet address of my ppp connection.

First I need a command to extract the ppp0 internet address from the command ifconfig. (Remember -F: makes the colon the field separator)

/sbin/ifconfig ppp0 | awk -F: '/inet/{print $2}' | awk '{print $1}'

Then I need an /etc/ppp/ip-up.local and an /etc/ppp/ip-down.local. These create a log file, /etc/ppp/ifsend, that logs uptime and downtime and its IP address when it connects.

Example updown now to calculate how long the connection stays up and stays down.

sed -f sedupdownscr ifsend | awk -f awkscrif > updownout



Multiline records


A record can be interpreted across multiple lines by using the system variable RS. For example

BEGIN { FS = "\n"; RS = "" } would cause newlines to be the field separator and blank lines to be the record separator.
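
A minimal sketch of multiline records, using two made-up address cards separated by a blank line:

```shell
cards=$(printf 'Ann\n12 Elm St\n\nBob\n9 Oak Ave\n' | awk '
BEGIN { FS = "\n"; RS = "" }       # one record per paragraph, one field per line
{ print $1 " lives at " $2 }')
printf '%s\n' "$cards"
```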


Other system variables are listed here. The separators are typically set in the 'BEGIN' command; NR, NF, FILENAME and FNR are maintained by awk as it reads input:

OFS output field separator: space by default, generated when a comma is used in a print statement

ORS output record separator: newline by default

NR number of the current input record, used to number the records when outputting

NF number of fields for the current record, $NF is the value of the last field

FILENAME name of the current input file

FNR number of the current input record relative to the current input file
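
A short sketch showing several of these variables together on made-up input:

```shell
# OFS is inserted wherever a comma appears in print;
# NR counts records, NF counts fields, and $NF is the last field.
numbered=$(printf 'a b c\nd e\n' | awk 'BEGIN { OFS = "-" } { print NR, NF, $NF }')
printf '%s\n' "$numbered"
```

The output is 1-3-c and then 2-2-e.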


Arithmetic operators


+ - * / % (modulo) ^ or ** (exponentiation)


Assignment operators


++ -- += -= *= /= %= ^= or **=


Relational operators

< > <= >= == != ~ (match) !~ (not match)


For example

NF == 6 { print "Number of fields = 6 and the last field's value is ",$6 } # The comma outputs OFS


Boolean Operators

|| (logical OR) && (logical AND) ! (logical NOT)


($1 > 2) || (NF >= 3) { print "value of the first field is greater than 2 OR the number of fields is greater than or equal to 3" }


Formatted Printing

For example

printf( "%10s", "hello" ) right-justifies the string in a block of ten characters; use "%-10s" to left-justify

printf( "%-15s\t%10d\n", $9, $5 ) to left justify a string in a block of 15 characters and right justify a decimal in a block of 10.
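
A sketch of the second format on a made-up two-field line. A "|" is used in place of the tab so the column edges are visible; that substitution is an assumption made only for readability.

```shell
row=$(printf 'notes.txt 512\n' | awk '{ printf("%-15s|%10d|\n", $1, $2) }')
printf '%s\n' "$row"
```

The name is padded on the right to 15 characters and the number is padded on the left to 10.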


Parameters can be passed to the script on the command line.


awk -f scriptfile parm1=234 parm2=string datafile

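Command-line parameters passed this way are not available in the BEGIN procedure, because awk performs each assignment only when it reaches that argument while working through the argument list, after BEGIN has already run.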
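
A small sketch of that timing, using the parameter name parm1 from the example and a made-up one-line input on standard input:

```shell
timing=$(printf 'x\n' | awk '
BEGIN { print "in BEGIN: [" parm1 "]" }   # parm1 has not been assigned yet
      { print "in main: [" parm1 "]" }    # assigned before input is read
' parm1=234)
printf '%s\n' "$timing"
```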


Conditional Statements


if( expression )

action1

else

action2


if( expression ){

statement1

statement2

}

else{

statement3

statement4

}


Example of an awk script with conditional if then else.

profile is a bash script that takes either "in" or "out" as the first command-line argument. If "in" is used, it will replace the export statement with an exit statement in the .bash_profile file of all normal users, which will effectively lock them out.

Here is a standard .bash_profile.

Here is a .bash_profile after the command has been run.

Here is the output if you forget to put an in or out on the command line.

Here is the output if you enter the command correctly with an in.

Compare this with a shell script which accomplishes the same thing but requires a second sedremin script.


while(condition)

action


do

action

while( condition )


for( set_counter; test_counter; increment_counter )

action


for( set_counter; test_counter; increment_counter ){

statement

if( expression ){

statement

break

}

statement

}


for( set_counter; test_counter; increment_counter ){

statement

if( expression ){

statement

continue

}

statement

}
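
A sketch of break and continue inside a for loop over the fields of a made-up record: negative fields are skipped and the loop stops at the first zero.

```shell
picked=$(printf '4 -1 7 0 9\n' | awk '{
    out = ""
    for (i = 1; i <= NF; i++) {
        if ($i < 0)
            continue                          # skip negative fields
        if ($i == 0)
            break                             # stop at the first zero
        out = (out == "" ? $i : out " " $i)   # collect the field
    }
    print out
}')
printf '%s\n' "$picked"    # prints: 4 7
```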


Array variable types are allowed.


Arrays are associative: unlike in most programming languages, subscripts are not required to be numeric. The variable MyArray[ "textvalue" ] is legitimate. The keyword in tests for membership: the expression ( "textvalue" in MyArray ) returns 1 if that element exists and 0 if it does not. This gives the script writer the ability to determine whether an element of an array has been set. An element of an array can be deleted using the delete command: delete MyArray[ "textvalue" ]. If you set the value of an array element to the null string "", then ( "textvalue" in MyArray ) will still return 1. Command-line parameters are elements of the array ARGV, which has ARGC elements. These array elements run from ARGV[ 0 ] to ARGV[ ARGC-1 ]. The array ENVIRON holds the environment variables. ENVIRON[ "PATH" ] evaluates to the value of the PATH variable in the shell that awk was run from.


for ( variable in array )

do something with array[ variable ]
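
A sketch that counts words in made-up input with an associative array, then uses delete and in. The order in which for ( variable in array ) visits elements is not defined, so the output is piped through sort.

```shell
counts=$(printf 'red blue red\ngreen red\n' | awk '
{ for (i = 1; i <= NF; i++) seen[$i]++ }      # subscripts are the words themselves
END {
    delete seen["green"]
    if (!("green" in seen)) print "green removed"
    for (word in seen) print word, seen[word]
}' | sort)
printf '%s\n' "$counts"
```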


Functions


Nine arithmetic functions are built in.


cos(x), exp(x), int(x), log(x), sin(x), sqrt(x), atan2(y,x), rand(), srand(x)


String functions


gsub(r,s,t) Globally substitutes s for regexp r in string t. Returns the number of substitutions

index(s,t) Returns position of substring t in string s, or 0 if not present

length(s) Returns length of string s

match(s,r) Returns position in s where a match for regexp r begins, or 0 if none is present.

Sets RSTART to the start position and RLENGTH to the length of the match

split(s,a,sep) Parses string s into elements of array a using field separator sep;

returns number of elements. sep defaults to FS

sprintf("fmt",expr) Returns expr formatted according to the printf format specification fmt

sub(r,s,t) Substitutes s for first match of regexp r in string t. Returns 1 if successful, else 0

substr(s,p,n) Returns substring of string s starting at position p, up to max length n.

If n is not supplied, the remainder of the string from position p is returned

tolower(s) Translates string s to all lower case

toupper(s) Translates string s to all upper case
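
A sketch that exercises most of these functions on one made-up string; the comments show the value each line prints.

```shell
summary=$(awk 'BEGIN {
    s = "hello world"
    print index(s, "world")                               # 7
    print length(s)                                       # 11
    print substr(s, 1, 5)                                 # hello
    print toupper(substr(s, 7))                           # WORLD
    n = gsub(/o/, "0", s); print n, s                     # 2 hell0 w0rld
    if (match(s, /w[0-9a-z]+/)) print RSTART, RLENGTH     # 7 5
    k = split("a:b:c", parts, ":"); print k, parts[2]     # 3 b
}')
printf '%s\n' "$summary"
```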


Writing your own function


function function_name( parameter list ){

statement

statement

}
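
A minimal sketch of a user-defined function; the name max and the two-column input are made up for illustration.

```shell
largest=$(printf '3 9\n12 5\n' | awk '
function max(a, b) {
    return (a > b) ? a : b
}
{ print max($1, $2) }')
printf '%s\n' "$largest"
```

This prints 9 and then 12.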


Awk allows the use of multiple -f entries on the command line so that a function can be kept in its own file

awk -f /home/dbartley/awkfuncts/myfunct -f scriptfile inputfile


Special functions getline, close and system


Getline is used to read another line of input. Getline can read from the regular input data stream; it can also handle input from files and pipes.


Getline can be used in a similar way to next: both cause the next line to be read. However, next passes control back to the top of the script, while getline gets the next line without changing control in the script. Getline returns 1 on success, 0 at end of file, and -1 on error.


/regexpmatch/{

getline

print $1

}

Use getline to retrieve data from a file (other than an input file) getline < "datafile"

while ( (getline < "datafile") > 0 )

print

Use getline to retrieve data from standard input:

BEGIN{ printf "enter your name: "

getline < "-"

print

}

Use getline to input a value into a variable

BEGIN{ printf "enter your name: "

getline name < "-"

print name

}

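Assigning the input to a variable does not affect the current input line: $0, $1, and NF are unaffected. When the variable form reads from the regular input stream (getline name with no redirection), NR and FNR are still incremented.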


Reading input from a pipe.

"command line command" | getline

"command line command" | getline assignment_variable (above rules apply)

"who am i" | getline me


This example shows "who" being executed once; getline is executed once for each line of output from "who". Comparing the result with 0 ends the loop on both end of file (0) and error (-1).

while (("who" | getline) > 0)

who_out[++i] = $0
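
Since the output of "who" varies, the sketch below uses a made-up command with fixed output to show the same loop. The command string and variable names are assumptions.

```shell
collected=$(awk 'BEGIN {
    cmd = "echo one; echo two; echo three"   # stands in for "who"
    while ((cmd | getline line) > 0)         # one iteration per output line
        lines[++n] = line
    close(cmd)
    print n, lines[1], lines[n]
}')
printf '%s\n' "$collected"    # prints: 3 one three
```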


Close


Using close() may be necessary in order to get an output pipe to finish its work.

{ some processing of $0 | "sort > tmpfile" }

END{

close("sort > tmpfile")

while ((getline < "tmpfile") > 0)

statement

}
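
A runnable sketch of the same pattern. It assumes mktemp is available to create the temporary file and passes its name to awk with -v; those details and the sample words are not from the notes above.

```shell
tmpfile=$(mktemp)
sorted=$(printf 'pear\napple\nfig\n' | awk -v out="$tmpfile" '
{ print | ("sort > " out) }            # every record goes down the same pipe
END {
    close("sort > " out)               # lets sort finish and write the file
    while ((getline line < out) > 0)   # now the sorted file can be read back
        result = (result == "" ? line : result " " line)
    print result
}')
rm -f "$tmpfile"
printf '%s\n' "$sorted"    # prints: apple fig pear
```

The string given to close() must be exactly the string that was used to open the pipe.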


Closing open files is necessary to keep you from exceeding your system's limit on simultaneously open files.


System


The system() function executes a command supplied as an expression. It does not make the output of the command available within the program for processing.


BEGIN{ if (system("mkdir dale") != 0)

print "Command Failed" }

system returns exit status of 0 on success.
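
A side-effect-free sketch of checking the exit status, using the standard true and false commands in place of mkdir:

```shell
status_report=$(awk 'BEGIN {
    if (system("true") == 0)  print "true succeeded"   # exit status 0
    if (system("false") != 0) print "false failed"     # non-zero exit status
}')
printf '%s\n' "$status_report"
```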


Directing output to Files and Pipes


> and >> are the same as in a shell. > truncates the file when opening and >> preserves the file and appends to it.

print > "data.out"


print | command

print | "wc -w"


if $1 is a valid filename awk allows

print $0 > $1
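
A sketch that splits made-up input into one file per value of the first field. It assumes mktemp -d is available for a scratch directory. The file name expression is parenthesized because an unparenthesized concatenation after > is ambiguous.

```shell
workdir=$(mktemp -d)
printf 'fruit apple\nveg carrot\nfruit pear\n' |
    (cd "$workdir" && awk '{ print $2 > ($1 ".txt") }')

fruit=$(cat "$workdir/fruit.txt")   # apple and pear, one per line
veg=$(cat "$workdir/veg.txt")       # carrot
rm -rf "$workdir"
printf '%s\n---\n%s\n' "$fruit" "$veg"
```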


limitations may vary per implementation

here is a ballpark


# of fields per record 100

characters per input record 3000

characters per output record 3000

characters per field 1024

characters per printf string 3000

characters in literal string 400

characters in character class 400

files open 15

pipes open 1


invoking awk in a script using #! Syntax


in the first line of an awk script

#!/bin/awk -f