Tuesday, January 18, 2011

GAWK Notes

Here are some bits and pieces I picked up about GAWK recently. This post is somewhat rough, so you might need to study it in detail to get a good grasp of it.
Gawk is, essentially, a tool for processing text data. It can stand in for simple commands such as cat or grep, and it can also be used to write powerful scripts and data filters. A Gawk instruction generally consists of a pattern and an action. It can operate on text files as well as standard input. A gawk command looks like:
$ gawk 'pattern {action}'

Example 1, to look for the name "aditya" in a contacts file, you can run:
$ gawk '/aditya/{print}' contacts.txt

Here Gawk looks for the pattern "aditya" and prints every line that contains it, as specified in the action.
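A minimal sketch of the same idea reading from standard input instead of a file. The contact lines are made-up sample data, and the script falls back to the system awk if gawk is not installed. It also shows that leaving out the action gives the same result, since printing the matching line is the default.

```shell
# Prefer gawk; fall back to the system awk (only POSIX awk features are used).
awk_bin=$(command -v gawk || command -v awk)

# Hypothetical contact data, fed through a pipe instead of a file.
contacts='aditya home 555-0101
rahul office 555-0102
aditya office 555-0103'

# Explicit action: print every line that matches /aditya/.
with_action=$(printf '%s\n' "$contacts" | "$awk_bin" '/aditya/ {print}')

# No action given: the default action is to print the matching line.
without_action=$(printf '%s\n' "$contacts" | "$awk_bin" '/aditya/')

printf '%s\n' "$with_action"
```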

Example 2, to list the products whose price ends in .99, such as 9.99, you can do:
$ gawk '/[0-9]*\.99/{print}' prices.txt

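[] matches a class of characters, so [0-9] matches any single digit. * repeats the preceding item zero or more times, which lets the pattern match 9.99 as well as 11.99 (and, since zero digits are allowed, a bare .99 too). The backslash in \. makes the period match a literal dot. An unescaped period (.) matches any single character, while plus (+) and question mark (?), neither used in this example, match one or more and zero or one occurrences of the preceding item respectively.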
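To see how these operators differ, here is a small sketch that runs three variants of the pattern over the same made-up price list. The product names and prices are invented for illustration, and the script falls back to the system awk if gawk is missing.

```shell
# Prefer gawk; fall back to the system awk (only POSIX awk features are used).
awk_bin=$(command -v gawk || command -v awk)

# Hypothetical price list.
prices='pen 9.99
book 11.99
gum .99
lamp 12.50
code 1x99'

# * allows zero digits before the literal dot, so ".99" matches too.
star_matches=$(printf '%s\n' "$prices" | "$awk_bin" '/[0-9]*\.99/ {print}')

# + requires at least one digit before the literal dot.
plus_matches=$(printf '%s\n' "$prices" | "$awk_bin" '/[0-9]+\.99/ {print}')

# An unescaped dot matches any character, so "1x99" sneaks in.
dot_matches=$(printf '%s\n' "$prices" | "$awk_bin" '/[0-9]+.99/ {print}')

printf 'star:\n%s\nplus:\n%s\ndot:\n%s\n' "$star_matches" "$plus_matches" "$dot_matches"
```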

Example 3, to list home or office contacts, I can do the following:
$ gawk '/home|office/ {print}' contacts.txt

Here the pipe (|) acts as "or".
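One thing to watch with alternation is that the pattern is tested against the whole line, so a name like "homer" matches too. The sketch below uses made-up contacts to show this, and then restricts the test to the second field with the ~ operator, assuming the contact type is always stored in that column.

```shell
# Prefer gawk; fall back to the system awk (only POSIX awk features are used).
awk_bin=$(command -v gawk || command -v awk)

# Hypothetical contact data: name, type, number.
contacts='aditya home 555-0101
rahul office 555-0102
priya mobile 555-0104
homer mobile 555-0105'

# Whole-line match: also picks up "homer" because it contains "home".
line_matches=$(printf '%s\n' "$contacts" | "$awk_bin" '/home|office/ {print}')

# Field match: only lines whose second field is exactly home or office.
field_matches=$(printf '%s\n' "$contacts" | "$awk_bin" '$2 ~ /^(home|office)$/ {print}')

printf 'whole line:\n%s\nsecond field only:\n%s\n' "$line_matches" "$field_matches"
```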

Example 4, to print a particular field of each matching line, or the entire line, do the following:
$ gawk '/aditya/ {print $2; print $0}' contacts.txt

$n refers to the nth field (by default, a whitespace-separated word) of the line, while $0 refers to the entire line. NR and NF are special variables that hold the number of the current record and the number of fields in the current record, respectively.
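A short sketch of NR and NF in action, using two made-up lines with different numbers of fields. Since NF holds the field count, $NF gives the last field of the line.

```shell
# Prefer gawk; fall back to the system awk (only POSIX awk features are used).
awk_bin=$(command -v gawk || command -v awk)

# Hypothetical data: the second line has one field fewer.
contacts='aditya home 555-0101
rahul office'

# Print record number, field count, first field and last field of each line.
summary=$(printf '%s\n' "$contacts" | "$awk_bin" '{print NR, NF, $1, $NF}')

printf '%s\n' "$summary"
```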

Example 5, to print the length of each line alongside the line itself:
$ gawk '{print length($0), $0}' contacts.txt

Some string functions:
  • length(str): returns the number of characters in the string.
  • index(str1, str2): returns the position in str1 where str2 first begins, or 0 if str2 does not occur.
  • split(str, arr, delim): splits str at each delim, stores the pieces in the array arr, and returns the number of elements.
  • substr(str, pos, len): returns the substring of length len starting at position pos (positions start at 1).
  • match(str, pattern): returns the position where pattern first matches in str, or 0 if there is no match.
  • sub(pattern, replacement, str) and gsub(pattern, replacement, str): substitute the replacement string for text matching pattern in str. sub replaces only the first match, while gsub (global sub) replaces every match.
  • toupper(str) and tolower(str): return str converted to upper case and lower case respectively.
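The functions above can be tried out in a BEGIN block, which runs without needing any input. This is only a sketch on a made-up string, with the output I would expect noted in the comments. It also uses RSTART and RLENGTH, which match() sets to the start and length of the match.

```shell
# Prefer gawk; fall back to the system awk (only POSIX awk features are used).
awk_bin=$(command -v gawk || command -v awk)

out=$("$awk_bin" 'BEGIN {
    s = "hello world"

    print length(s)                  # 11
    print index(s, "world")          # 7

    n = split("a:b:c", parts, ":")
    print n, parts[1], parts[3]      # 3 a c

    print substr(s, 7, 5)            # world

    pos = match(s, /o w/)
    print pos, RSTART, RLENGTH       # 5 5 3

    t = s
    sub(/o/, "0", t)                 # first match only
    print t                          # hell0 world

    u = s
    k = gsub(/o/, "0", u)            # every match; returns the count
    print k, u                       # 2 hell0 w0rld

    print toupper(s), tolower("ABC") # HELLO WORLD abc
}')

printf '%s\n' "$out"
```

Note that sub and gsub change the variable passed as the third argument in place, which is why the string is copied into t and u first.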