Python, through the SciPy numerical extension, offers an alternative to using R in cheminformatics. You will understand what I mean if you have ever tried to code an algorithm in R.
Advantages of R
Disadvantages of R
To compare the two, I used a Pentium 4 with 1 GB of RAM to read the same data files in each language and calculate the principal components (after scaling). The files contained different numbers of molecules (300K, 600K and 1.6M compounds), with 10 descriptor values for each. In the tables below, all times are in seconds.
For R jobs, .RData was removed and then the command R CMD BATCH myscript.R was used.
The data is in the format required for the R command read.table. The data is read in using:
Method | 300K cmpds | 600K cmpds | 1.6M cmpds |
Python | 6.8 | 13.9 | 41 |
R (read.table) | 42 | 105 | NA |
R (scan) | 9 | 20 | 56 |
The scan method in R requires more parameters. In addition, the data is either read in as a list or a single vector, and requires a transformation to a data frame before it is of use.
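For comparison, here is a minimal sketch of how a file in read.table format might be read on the Python side. It is not necessarily the code used for the timings above. It assumes the file has a single header line of column names followed by whitespace-delimited numeric columns; the helper name `read_descriptors` and the sample data are made up for illustration.

```python
import io

import numpy as np


def read_descriptors(source, skip_header=1):
    """Read whitespace-delimited numeric descriptors into a 2-D array.

    `source` may be a filename or an open file-like object.
    Assumes `skip_header` header lines and purely numeric columns
    (no row names), which is an assumption about the file layout.
    """
    return np.loadtxt(source, skiprows=skip_header, ndmin=2)


# Tiny in-memory stand-in for a real descriptor file (hypothetical data).
sample = io.StringIO("d1 d2 d3\n1.0 2.0 3.0\n4.0 5.0 6.0\n")
data = read_descriptors(sample)
print(data.shape)
```

Unlike the result of scan in R, the array returned here is already two-dimensional (molecules by descriptors), so no further transformation is needed before doing the PCA.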
R uses singular value decomposition (SVD) to do the PCA, whereas in Python I found the eigenvalues of the covariance matrix using SciPy.
Method | 300K cmpds | 600K cmpds | 1.6M cmpds |
Python | 2.2 | 3.6 | 42 |
R (read.table) | 5 | 10 | NA |
R (scan) | 3 | 5 | 29 |
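The two routes to PCA mentioned above (eigenvalues of the covariance matrix versus SVD) can be sketched as follows. This is an illustrative sketch rather than the code that produced the timings: it uses NumPy's `numpy.linalg` routines (SciPy provides an analogous `scipy.linalg.eigh`), it assumes "scaling" means centring each descriptor and dividing by its sample standard deviation, and the function names `pca_eig` and `pca_svd` are invented for the example.

```python
import numpy as np


def scale(data):
    """Centre each column and scale it to unit sample variance (assumed scaling)."""
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)


def pca_eig(data):
    """PCA via eigen-decomposition of the covariance matrix of the scaled data."""
    scaled = scale(data)
    cov = np.cov(scaled, rowvar=False)       # descriptors are columns
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = scaled @ eigvecs                # project molecules onto the PCs
    return eigvals, eigvecs, scores


def pca_svd(data):
    """PCA via SVD of the scaled data matrix, for comparison."""
    scaled = scale(data)
    _, s, vt = np.linalg.svd(scaled, full_matrices=False)
    variances = s ** 2 / (scaled.shape[0] - 1)
    return variances, vt.T


# Random stand-in data: 200 "molecules" with 10 descriptors each.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 10))

eigvals, eigvecs, scores = pca_eig(data)
svd_vars, svd_vecs = pca_svd(data)
print(np.allclose(eigvals, svd_vars))
```

Both routes give the same component variances. The eigenvalue route only has to decompose a 10 x 10 matrix once the covariance matrix has been formed, whereas the SVD works on the full data matrix.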