>> Noel O'Boyle

awk code

awk is a pattern scanning and processing language (according to gawk's man page).

Editing SD files

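If you set the record separator (RS) to '$$$$\n' and the field separator (FS) to '\n', awk reads an SD file one molecule at a time. In this case, $0 is the complete record for a single molecule, excluding the terminating '$$$$\n' but including the newline that precedes it, and $1 is the first line of that record (commonly used as a title). The dollar signs have to be escaped because gawk treats a multi-character RS as a regular expression. POSIX leaves the behaviour of a multi-character RS unspecified, so other awk implementations may behave differently.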
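A quick way to check how these separators split a file is to print the title of each record. The following is a minimal sketch that assumes an awk with regular-expression RS support (such as gawk); the file names and the two dummy records are made up for illustration.

```shell
# Build a tiny two-record SD-like file (dummy content, not real molecules).
printf 'mol_A\n  x\n\nM  END\n$$$$\nmol_B\n  x\n\nM  END\n$$$$\n' > titles_sample.sd

# One record per molecule, one field per line: print record number and title.
awk 'BEGIN {RS="\\$\\$\\$\\$\n"; FS="\n"} {print NR ": " $1}' titles_sample.sd > titles.txt
cat titles.txt
```

If the separators work as described, there is one output line per molecule.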

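For example, to extract a subset of an SD file by title (printf is used rather than print because $0 already ends with a newline, so print would add a blank line before '$$$$'):

awk 'BEGIN {RS="\\$\\$\\$\\$\n"; FS="\n"} $1==201491 || $1==282672 || $1==151254 || $1==154838 ||
     $1==66716 || $1==192373 || $1==235700 {printf "%s$$$$\n", $0}' largefile.sd > subset.sd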
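When the list of titles grows, it may be easier to keep them in a separate file than to chain comparisons. The sketch below assumes a hypothetical ids.txt with one title per line and, as above, an awk with regular-expression RS support; the sample file names and records are invented for the demonstration.

```shell
# Hypothetical list of wanted titles, one per line.
printf '201491\n282672\n' > ids.txt
# Dummy SD-like file with three records.
printf '201491\n  x\n\nM  END\n$$$$\n111111\n  x\n\nM  END\n$$$$\n282672\n  x\n\nM  END\n$$$$\n' > big_sample.sd

awk -v idfile=ids.txt '
BEGIN {
    # Read the wanted titles while RS still has its default value (newline).
    while ((getline id < idfile) > 0) want[id] = 1
    close(idfile)
    RS = "\\$\\$\\$\\$\n"; FS = "\n"
}
# $0 already ends with a newline, so printf is used to avoid adding another.
$1 in want { printf "%s$$$$\n", $0 }
' big_sample.sd > subset_sample.sd
```

Titles are matched as exact strings here, so leading or trailing spaces in either file would prevent a match.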

SD files exported from Catalyst do not adhere well to the SD specification. If you want to use the CDK to import these files, you need to add a line containing 'M  END' before every terminating '$$$$'. This can be done as shown below. Note that the output field separator (OFS) must be set to the empty string: with the default OFS (a single space), print would put a space in front of 'M  END', and with OFS="\n" it would insert a blank line, because $0 already ends with a newline.

awk 'BEGIN {RS="\\$\\$\\$\\$\n"; OFS=""} {print $0,"M  END\n$$$$"}' catalyst.sd > fixed.sd
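If your awk does not support a multi-character RS, or some records already contain 'M  END', a line-oriented variant may be more convenient. This is only a sketch under the assumption that the file uses Unix line endings and that 'M  END' and '$$$$' each appear alone on a line; the sample file is dummy data.

```shell
# Dummy file: the first record lacks "M  END", the second already has it.
printf 'mol_A\n\n\n  0  0\n$$$$\nmol_B\n\n\n  0  0\nM  END\n$$$$\n' > catalyst_sample.sd

awk '
$0 == "M  END" { have_end = 1 }      # this record already has the line
$0 == "$$$$" {                        # end of the current record
    if (!have_end) print "M  END"     # add the missing line just before $$$$
    have_end = 0                      # reset for the next record
}
{ print }                             # copy every input line through unchanged
' catalyst_sample.sd > fixed_sample.sd
```

Because the line is only added when it is missing, running the command twice on the same file should not duplicate it.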

Tabular data

Tab-separated data files are very common. The following example extracts the first two columns from a tab-separated file and writes the result in the same format. The field separator should be set to "\t" explicitly; by default awk splits fields on any run of spaces or tabs, so values that contain spaces would be broken up. The output field separator is set as well, because its default is a single space.

awk 'BEGIN {FS="\t"; OFS="\t"} {print $1,$2}' large.txt > small.txt
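If the file has a header row, columns can also be picked by name instead of position. The sketch below assumes the first line holds the column names; the names id, name and mw and the sample data are hypothetical.

```shell
# Dummy tab-separated file with a header row.
printf 'id\tname\tmw\n1\tethanol\t46.07\n2\tbenzene\t78.11\n' > cols_sample.tsv

awk 'BEGIN {FS="\t"; OFS="\t"}
     NR==1 {for (i=1; i<=NF; i++) col[$i]=i}   # map each column name to its index
     {print $col["id"], $col["mw"]}            # header line included
' cols_sample.tsv > picked.tsv
```

A misspelled column name would make col[...] empty, so $col[...] would silently become $0; checking the header first is advisable.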

The next example extracts all lines whose second field is less than 25. Because the field separator is left at its default, this assumes that the first field does not contain spaces. No action is given, so the default action "{print $0}" applies; that is, the whole line is printed.

awk '$2<25' large.txt > small.txt
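When fields may contain spaces, or the file starts with a header line, a slightly longer form can help. The following sketch sets the tab separator explicitly, keeps the first line, and adds 0 to the field to force a numeric comparison; the column layout, threshold and sample data are assumptions for illustration.

```shell
# Dummy data: the name column contains a space in the last row.
printf 'id\tname\tmw\n1\tethanol\t46.07\n2\tbenzene\t78.11\n3\tacetic acid\t60.05\n' > filter_sample.tsv

# Keep the header (NR==1) and every row whose third column is below 65.
awk 'BEGIN {FS="\t"} NR==1 || $3+0 < 65' filter_sample.tsv > light.tsv
```

With the default field separator the "acetic acid" row would be split into four fields and the comparison would look at the wrong value.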