The AWK programming language


Preface

Computer users spend a lot of time doing simple, mechanical data manipulation. All of these jobs ought to be mechanized, but it's a real nuisance to have to write a special-purpose program in a standard language like C or Pascal each time such a task comes up.
Awk makes it possible to handle such tasks with very short programs, often only one or two lines long.

An Awk program is a sequence of patterns and actions that tell what to look for in the input data and what to do when it's found.

1 - An AWK tutorial

1.1 Getting started

Beth    4.00    0
Dan    3.75    0
Kathy    4.00    10
Mark    5.00    20
Mary    5.50    22
Susie    4.25    18

The three columns are: name, hourly rate, and number of hours worked.
Suppose you want to print the name and the pay (rate times hours) of everyone who worked more than zero hours.
This is easy to do with awk:

awk '$3 > 0 { print $1, $2 * $3 }' emp.data

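The part inside the quotes is the complete awk program. It consists of a single pattern-action statement: the pattern $3 > 0 selects the lines, and the action prints the name and the pay.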
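A quick way to try this, assuming a POSIX shell: save the sample rows in a file and run the program on it. The file name emp.data is only a choice made for this sketch.

```shell
# Save the sample data; the file name emp.data is an arbitrary choice.
cat > emp.data <<'EOF'
Beth    4.00    0
Dan     3.75    0
Kathy   4.00    10
Mark    5.00    20
Mary    5.50    22
Susie   4.25    18
EOF

# Print name and pay for everyone who worked more than zero hours.
awk '$3 > 0 { print $1, $2 * $3 }' emp.data
# Kathy 40
# Mark 100
# Mary 121
# Susie 76.5
```

Beth and Dan are skipped because their third field is 0, so the pattern does not match.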

The structure of an AWK program

Each awk program in this chapter is a sequence of one or more pattern-action statements:
pattern { action }

The basic operation of awk is to scan a sequence of input lines one after another, searching for lines that are matched by any of the patterns in the program.
Every input line is tested against each of the patterns in turn. For each pattern that matches, the corresponding action (which may involve multiple steps) is performed. Then the next line is read and the matching starts over. This continues until all the input has been read.

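Either the pattern or the action (but not both) may be omitted.
If a pattern has no action, every line that matches it is printed. Example: $3 == 0 prints each line whose third field is zero.
If an action has no pattern, it is performed for every input line. Example: { print $1 } prints the first field of every line.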
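Both short forms can be tried on a few rows of the sample data. A small sketch, assuming a POSIX shell; the file name sample.txt is made up for the illustration.

```shell
# Three rows of the sample data in a scratch file (hypothetical name).
printf 'Beth 4.00 0\nDan 3.75 0\nKathy 4.00 10\n' > sample.txt

# Pattern without action: matching lines are printed unchanged.
awk '$3 == 0' sample.txt
# Beth 4.00 0
# Dan 3.75 0

# Action without pattern: the action runs for every line.
awk '{ print $1 }' sample.txt
# Beth
# Dan
# Kathy
```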

Running an AWK program

awk 'program' input-files runs the program on each of the specified input files, in order.
awk 'program' with no file names reads standard input: whatever you type, until you signal end of file (Ctrl-D on Unix systems).

This behavior makes it easy to experiment with awk: type your program, then type data at it and see what happens.

awk -f progfile optional-list-of-input-files reads the program from the file progfile instead of from the command line.

1.2 Simple output

There are only two types of data in awk: numbers and strings of characters.
Awk reads its input one line at a time and splits each line into fields, where, by default, a field is a sequence of characters that doesn't contain any blanks or tabs.
$1 for field 1, $2 for field 2, $0 for the whole line.
The number of fields can vary from line to line.

{ print } does the same as { print $0 }
{ print $1, $3 } to print the first and third field

Expressions separated by a comma in a print statement are, by default, separated by a single blank when they are printed. Each line produced by print ends with a newline character. Both these defaults can be changed.
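The two defaults are held in the built-in variables OFS (output field separator) and ORS (output record separator). A minimal sketch of changing them, assuming a POSIX awk; it sets them in a BEGIN action, which these notes describe later under "Begin and End".

```shell
# Separate printed values with ":" and end each output line with two newlines.
printf 'Beth 4.00 0\nKathy 4.00 10\n' |
awk 'BEGIN { OFS = ":"; ORS = "\n\n" } { print $1, $3 }'
# Beth:0
#
# Kathy:10
#
```

OFS only applies where print has a comma between expressions; without the comma the values are concatenated.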

NF, the number of fields

Awk counts the number of fields in the current input line and stores the count in a built-in variable called NF.
{ print NF, $1, $NF } prints the number of fields, the first field, and the last field

Computing and printing

{ print $1, $2 * $3 }

Printing Line Numbers

Awk provides another built-in variable, called NR, that counts the number of lines read so far. We can use NR and $0 to prefix each line with its line number:

{ print NR, $0 }

Putting Text in the Output

{ print "total pay for", $1, "is", $2 * $3 }

1.3 Fancier Output

The print statement is meant for quick and easy output. To format the output exactly the way you want it, you may have to use the printf statement.

Lining Up Fields

printf(format, value1, value2, ..., valuen)

format is a string that contains text to be printed verbatim, interspersed with specifications of how each of the values is to be printed.
A specification is a % followed by a few characters that control the format of a value.
The first specification tells how value1 is to be printed, the second how value2 is to be printed and so on.
Thus, there must be as many specifications in format as values to be printed.

{ printf("total pay for %s is $%.2f\n", $1, $2 * $3) }
%.2f says to print the second value, $2 * $3, as a number with 2 digits after the decimal point.

With printf, no blanks or newlines are produced automatically. You must create them yourself.

{ printf("%-8s $%6.2f\n", $1, $2 * $3) }
%-8s prints a name as a string of characters left-justified in a field 8 characters wide.
%6.2f prints the pay as a number with two digits after the decimal point, in a field 6 characters wide.
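To see the alignment, the format above can be run on a few sample rows. A sketch assuming a POSIX shell; the rows are taken from the sample data.

```shell
# Names are padded to 8 columns; pay is right-aligned in 6 columns.
printf 'Kathy 4.00 10\nMark 5.00 20\nSusie 4.25 18\n' |
awk '{ printf("%-8s $%6.2f\n", $1, $2 * $3) }'
# Kathy    $ 40.00
# Mark     $100.00
# Susie    $ 76.50
```

The blank between the two specifications and the dollar sign are literal text in the format string, so they are printed verbatim.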

Sorting the output

Easiest way, using sort:
awk '{ printf("%6.2f %s\n", $2 * $3, $0) }' emp.data | sort

1.4 Selection

Selection by comparison

$2 >= 5

Selection by computation

$2 * $3 > 50

Selection by text content

$1 == "Susie"

All lines that contain Susie anywhere:
/Susie/

Combination of patterns

Patterns can be combined with parentheses and the logical operators ||, && and !, which stand for OR, AND and NOT.

$2 >= 4 || $3 >= 20
A line that satisfies both conditions is printed only once. Contrast this with a program of two separate patterns, which prints such a line twice:
$2 >= 4
$3 >= 20
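The difference shows up on a line that meets both conditions. A small sketch, assuming a POSIX shell; the file name two.txt is made up for the illustration.

```shell
# Mark satisfies both conditions; Dan satisfies neither.
printf 'Dan 3.75 0\nMark 5.00 20\n' > two.txt

# One combined pattern: each matching line is printed once.
awk '$2 >= 4 || $3 >= 20' two.txt
# Mark 5.00 20

# Two separate patterns: a line is printed once per pattern it matches.
awk '$2 >= 4
$3 >= 20' two.txt
# Mark 5.00 20
# Mark 5.00 20
```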

Data validation

There are always errors in real data. Awk is an excellent tool for checking that data has reasonable values and is in the right format (data validation).
The idea is to print only the suspicious lines, each followed by a message:

NF != 3 { print $0, "number of fields is not equal to 3" }
$2 < 3.35 { print $0, "rate is below minimum wage" }
$2 > 10 { print $0, "rate exceeds $10 per hour" }
$3 < 0 { print $0, "negative hours worked" }
$3 > 60 { print $0, "too many hours worked" }

If there are no errors, there is no output.
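A short run on deliberately bad input shows the idea. This is a sketch assuming a POSIX shell; the three test rows are made up and only two of the checks are used.

```shell
# The second row lacks a field and the third has a low rate (made-up test rows).
printf 'Beth 4.00 0\nDan 3.75\nKathy 2.00 10\n' |
awk 'NF != 3   { print $0, "number of fields is not equal to 3" }
     $2 < 3.35 { print $0, "rate is below minimum wage" }'
# Dan 3.75 number of fields is not equal to 3
# Kathy 2.00 10 rate is below minimum wage
```

The first row passes both checks, so nothing is printed for it.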

Begin and End

The special pattern BEGIN matches before the first line of the first input file is read, and END matches after the last line of the last file has been processed.

BEGIN { print "NAME    RATE    HOURS"; print "" }
{ print }

You can put several statements on a single line if you separate them by semicolons.

1.5 Computing with AWK

An action is a sequence of statements separated by newlines or semicolons.
In awk, user-created variables are not declared.

Counting

$3 > 15 { emp = emp + 1 }
END { print emp, "employees worked more than 15 hours" }

Awk variables used as numbers begin life with the value 0, so we didn't need to initialize emp.
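One caveat, stated here as a general property of awk rather than something from the text above: a variable that was never assigned prints as an empty string, so if no line matches, print emp shows nothing in front of the message. Adding 0 forces a number. A small sketch, assuming a POSIX shell:

```shell
# No row has more than 15 hours, so emp is never assigned.
# emp + 0 makes print show 0 instead of an empty string.
printf 'Beth 4.00 0\nDan 3.75 0\n' |
awk '$3 > 15 { emp = emp + 1 }
     END { print emp + 0, "employees worked more than 15 hours" }'
# 0 employees worked more than 15 hours
```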

Computing Sums and Averages

{ pay = pay + $2 * $3 }
END { print NR, "employees"
      print "total pay is", pay
      print "average pay is", pay/NR
}
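A worked run of the program above on four rows of the sample data, as a sketch assuming a POSIX shell. Total pay is 0 + 40 + 100 + 121 = 261, and the average is 261 / 4 = 65.25.

```shell
# Sum rate * hours over all lines, then report count, total and average.
printf 'Beth 4.00 0\nKathy 4.00 10\nMark 5.00 20\nMary 5.50 22\n' |
awk '{ pay = pay + $2 * $3 }
     END { print NR, "employees"
           print "total pay is", pay
           print "average pay is", pay/NR
     }'
# 4 employees
# total pay is 261
# average pay is 65.25
```

Note that the average is taken over all lines read (NR), including Beth, who worked zero hours.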

Handling Text

One of the strengths of awk is its ability to handle strings of characters as conveniently as most languages handle numbers.
Awk variables can hold strings of characters as well as numbers.

$2 > maxrate { maxrate = $2; maxemp = $1 }
END { print "highest hourly rate", maxrate, "for", maxemp }

String concatenation

{ names = names $1 " " }
END { print names }

Printing the last input line

Although NR retains its value in an END action, $0 does not, so each line has to be saved in a variable as it is read:

{ last = $0 }
END { print last }

Built-in functions

We have already seen that awk provides built-in variables that maintain frequently used quantities, like the number of fields and the input line number.
Similarly, there are built-in functions for computing other useful values.
Examples include functions for square roots, logarithms, random numbers, and the length of a string.

{ print $1, length($1) }

Counting Lines, Words and Characters

{ nc = nc + length($0) + 1
  nw = nw + NF
}
END { print NR, "lines,", nw, "words,", nc, "characters" }
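A small run makes the counting concrete. This is a sketch assuming a POSIX shell and plain ASCII input. The + 1 accounts for the newline at the end of each line, which is not part of $0.

```shell
# Two lines: "a b" (3 characters, 2 words) and "cd" (2 characters, 1 word).
printf 'a b\ncd\n' |
awk '{ nc = nc + length($0) + 1
       nw = nw + NF
     }
     END { print NR, "lines,", nw, "words,", nc, "characters" }'
# 2 lines, 3 words, 7 characters
```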

1.6 Control-flow statements

Awk provides an if-else statement for making decisions and several statements for writing loops, all modeled on those found in the C programming language.
They can only be used in actions.

If-else statement

$2 > 6 { n = n + 1; pay = pay + $2 * $3 }
END { if (n > 0)
        print n, "employees, total pay is", pay,
                 "average pay is", pay/n
      else
        print "no employees are paid more than $6/hour"
    }

Note that we can continue a long statement over several lines by breaking it after a comma.

While statement

# compute compound interest
# input: amount rate years
# output: compounded value at the end of each year

{ i = 1
  while (i <= $3) {
    printf("\t%.2f\n", $1 * (1 + $2) ^ i)
    i = i + 1
  }
}

For statement

# compute compound interest
# input: amount rate years
# output: compounded value at the end of each year

{ for (i = 1; i <= $3; i = i + 1)
    printf("\t%.2f\n", $1 * (1 + $2) ^ i)
}

The body of the loop is a single statement, so no braces are needed to enclose it.
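A worked run of the for version, as a sketch assuming a POSIX shell. The input line means 1000 invested at 6% for 5 years, so year i prints 1000 * 1.06 ^ i.

```shell
# Each output line starts with a tab, as requested by "\t" in the format.
printf '1000 .06 5\n' |
awk '{ for (i = 1; i <= $3; i = i + 1)
         printf("\t%.2f\n", $1 * (1 + $2) ^ i)
     }'
#	1060.00
#	1123.60
#	1191.02
#	1262.48
#	1338.23
```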

1.7 Arrays

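Awk provides arrays for storing groups of related values. The program below stores each input line in an array element indexed by its line number, then prints the elements in reverse order in the END action.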

# reverse - print input in reverse order by line

{ lines[NR] = $0 }
END { for (i = NR; i > 0; i = i - 1)
        print lines[i]
}
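A short run of the reverse program, as a sketch assuming a POSIX shell and three made-up input lines:

```shell
# Store every line under its line number, then walk the array backwards.
printf 'one\ntwo\nthree\n' |
awk '{ lines[NR] = $0 }
     END { for (i = NR; i > 0; i = i - 1)
             print lines[i]
     }'
# three
# two
# one
```

The whole input is held in memory until END, which is fine for small files but worth keeping in mind for large ones.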

1.8 A handful of useful "One-liners"

1. Print the total number of input lines

END { print NR }

2. Print the tenth input line

NR == 10

3. Print the last field for every input line

{ print $NF }

4. Print the last field of the last line

{ lf = $NF }
END { print lf }

5. Print every input line with more than four fields

NF > 4

6. Print every input line in which the last field is more than 4

$NF > 4

7. Print the total number of fields in all input files

{ n = n + NF }
END { print n }

8. Print the total number of lines that contain Beth

/Beth/{ n = n + 1 }
END { print n }

9. Print the largest first field and the line that contains it

$1 > max { max = $1; l = $0 }
END { print max, l }

10. Print every line that has at least one field

NF >= 1

11. Print every line longer than 80 characters

length($0) > 80

12. Print the number of fields in every line, followed by the line itself

{ print NF, $0 }

13. Print the first two fields, in opposite order

{ print $2, $1 }

14. Exchange the first two fields

{ t = $1; $1 = $2; $2 = t; print }

15. Print every line with the 1st field replaced by the line number

{ $1 = NR; print }

16. Print every line after erasing the second field

{ $2 = ""; print }
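Assigning to a field makes awk rebuild $0 with the output field separator between fields, so the emptied field still leaves its separators behind. A one-line sketch, assuming a POSIX shell:

```shell
# The second field becomes empty, but both separators around it remain.
printf 'Beth 4.00 0\n' | awk '{ $2 = ""; print }'
# Beth  0
```

The output has two blanks between Beth and 0, one on each side of the now empty field.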

17. Print in reverse order the fields of every line

{ for (i = NF; i > 0; i = i - 1)
    printf("%s ", $i)
  printf("\n")
}

or:

{ 
  for (i = 1; i <= NF; i = i + 1)
    c[i] = $i
  for (i = NF; i > 0; i = i - 1)
    $(NF-i+1) = c[i]
  print
}

18. Print the sums of the fields of every line

{
  s = 0
  for (i = 1; i <= NF; i = i + 1)
    s = s + $i
  print s
}

19. Add up all fields in all lines and print the sum

{
  for (i = 1; i <= NF; i = i + 1)
    s = s + $i
}
END { print s }

20. Print every line after replacing each field by its absolute value

{ 
  for (i = 1;  i <= NF; i = i + 1)
    if ($i < 0) $i = -$i
  print 
}

1.9 What's next?

You have now seen the essentials of awk.
The rest of the book elaborates on these basic ideas.
Nothing answers questions so well as some simple experiments.
You should browse through the whole book; each example conveys something about the language.