Manipulating SD files with Open Source tools

In cheminformatics, molecular data ('name', coordinates of some type, properties) is commonly stored in MDL's SD file format. The ability to extract, summarise or otherwise manipulate this information can be quite useful. Here we describe and compare some of the options, as well as providing examples on how to use various Open Source cheminformatics toolkits. The emphasis will be on the use of Python to access the various toolkits, whether they are implemented in C, Java or Python, as these tasks are suited to writing scripts.

How many molecules?

'grep' is your friend. We need to count up the number of times '$$$$' appears. There is a catch, though. Dollar signs are special in grep (they normally match the end of a line) so we need to un-special them ('escape' them) by placing a backslash before each one.

grep -c '\$\$\$\$' myfile.sd

Find the molecular weights

Let's say we want to print out the molecular weights of every molecule in an SD file. Why? Well, we might want to plot a histogram of the distribution, or see whether the average of the distribution is significantly different (in the statistical sense) compared to another SD file.

Here I will discuss 3 choices of toolkit:

frowns - a cheminformatics toolkit geared towards SD files and SMARTS strings implemented in 100% Python. (No longer under development)
OpenBabel - a cheminformatics toolkit geared towards file format conversion, but which does much more. It's implemented in C++, but has Python wrappers.
The CDK - a cheminformatics toolkit in Java.

It is silly to compare toolkits on the basis of speed unless speed is important, but I have quoted times below for subset 1 ('xsaa.sdf') of ZINC's drug-like compounds in SD format (100K molecules). You may want to compare on the basis of how easy a toolkit is to use from Python. Or how quickly you can write a program that uses a particular toolkit in whatever language. The functionality of the toolkit will also be an important factor. Frowns has much less functionality (and it may be broken as it is not under development) than OpenBabel or the CDK. However, they can all be used to solve the current problem.

By combining various ways of using Python (with/without the optimiser Psyco), various ways of accessing Java libraries from Python (JPype/Jython), and the two alternatives for accessing OpenBabel from Python (openbabel.py/pyopenabel.py), there are a number of possibilities which are enumerated here:

1. frowns
2. frowns+psyco
3. frowns+notransforms
4. frowns+notransforms+psyco
5. frownsjython
6. openbabel.py
7. pyopenbabel.py
8. CDKjava
9. CDKjpype
10. CDKjython

In the results below, times quoted are the user times from the following command 'time python2.4 myprog.py > output.txt'. A Dell Pentium 4 3.33GHz PC with 1GB RAM was used, running Python2.4.1 and Jython2.2a1 on Debian GNU/Linux. In general, the latest SVN revision of the CDK and OpenBabel was used (20Apr2006).

Choices 1-5

Psyco is an optimiser for Python code that optimises code (using a memory tradeoff) on Intel processors. A modest improvement in speed is observed in this case. Normally, frowns calculates the SSSR (smallest set of smallest rings) and aromaticity when a molecule is read in. By passing the 'transforms=[]' option, this is avoided resulting in a speed increase. Finally, Jython can run most CPython code, so it was used to run the frowns example without transforms (after some minor edits involving incompatible import statements) - this was a lot slower.

	1	2	3	4	5
Time	3m48s	3m07s	2m44s	2m32s	14m29s

Frowns

from frowns import MDL

for mol, text, error in MDL.sdin(open("../xsaa.sdf")):
    print sum([x.mass for x in mol.atoms])

Frowns+Psyco

import psyco
psyco.full()
# The remainder of the lines are the same as for Frowns (above)

Frowns+Notransforms

from frowns import MDL

for mol, text, error in MDL.sdin(open("../xsaa.sdf"),transforms=[]):
    print sum([x.mass for x in mol.atoms])

Choices 6-7

openbabel.py is a SWIG-generated Python wrapper for the OpenBabel C++ library. pyopenbabel.py is a Python wrapper around openbabel.py that attempts to simplify access to OpenBabel. As seen below, the code for pyopenbabel.py is much simpler, but there is a cost in speed. (As a developer of pyopenbabel.py, I may improve this - it is due to the fact that all of the attributes of a molecule are calculated automatically when a molecule is created, whereas with openbabel.py only one method is called: the GetMolWt() method.)

	6	7
Time	2m19s	11m39s

openbabel.py

from openbabel import *

obconversion = OBConversion()
obconversion.SetInFormat("sdf")
obmol = OBMol()

notatend = obconversion.ReadFile(obmol,"../xsaa.sdf")
while notatend:
    print obmol.GetMolWt()
    obmol = OBMol()
    notatend = obconversion.Read(obmol)

pyopenbabel.py

from pyopenbabel import *

for molecule in readfile("sdf","../xsaa.sdf"):
    print molecule.molwt

Choices 8-10

Using Jython to access the CDK is considerably simpler than using Java itself, and not a lot slower. In contrast, while using Jpype has its advantages (can use CPython2.4 syntax for example, and additional python libraries without any difficulties), Jpype doesn't appear to free up memory in the loop over molecules. As a result, the JVM ran out of heap memory space after about 4000 molecules. The result quoted below is scaled up from 10000 molecules (this required passing the switch -Xmx400m to the JVM during startup which increases heap space from the default(?) to 400MB). As can be seen, Jython wins out over Jpype in terms of speed.

	8	9	10
Time	2m59s	~14m	3m27s

cdkjava

import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.*;
import org.openscience.cdk.interfaces.IChemObjectBuilder;
import org.openscience.cdk.io.iterator.*;

import java.io.*;
import java.util.*;

class readall {
  public static void main(String[] args) throws CDKException {
    BufferedReader sdfile=null;
    try {
      sdfile= new BufferedReader (new FileReader(new File("../xsaa.sdf")));
    }
    catch (FileNotFoundException e) {
      System.err.println("File not found");
      System.exit(1);
    }
    IteratingMDLReader myiter = new IteratingMDLReader(sdfile,DefaultChemObjectBuilder.getInstance());

    Molecule mol = null;
    while (myiter.hasNext()) {
      mol = (Molecule) myiter.next();
      float tot = 0;
      for (Enumeration e = mol.atoms(); e.hasMoreElements(); ) {
          Atom a = (Atom) e.nextElement();
          tot += a.getExactMass();
      }
      System.out.println(tot);
    }
  }
}

cdkjpype

# Requires JPype 0.5.1
from jpype import *
startJVM("/home/no228/Tools/Java/jdk1.5.0_03/jre/lib/i386/server/libjvm.so","-Djava.class.path=/home/no228/Tools/CDK/cdk-svn-6039.jar","-Xmx400M")

cdk = JPackage("org").openscience.cdk
MDLReader = cdk.io.iterator.IteratingMDLReader

sdfile = java.io.FileReader("../just1000.sdf")
builder = cdk.DefaultChemObjectBuilder.getInstance()

reader = MDLReader(sdfile,builder)

for molecule in reader:
    print sum([x.exactMass for x in molecule.atoms])

cdkjython

from org.openscience.cdk.io.iterator import IteratingMDLReader
from org.openscience.cdk import DefaultChemObjectBuilder
from java.io import InputStreamReader,BufferedReader

sdfile = BufferedReader(InputStreamReader(open("../xsaa.sdf","r")))
builder = DefaultChemObjectBuilder.getInstance()

for mol in IteratingMDLReader(sdfile,builder):
    print sum([x.exactMass for x in mol.atoms()])