>> Noel O'Boyle

Clustering

Taylor-Butina Clustering

Also referred to as Leader clustering, Taylor-Butina clustering was first described by Taylor (J. Chem. Inf. Comput. Sci., 1995, 35, 59) and later by Butina (J. Chem. Inf. Comput. Sci., 1999, 39, 747). It is an unsupervised non-hierarchical clustering method that guarantees that every cluster contains molecules which are within a distance cutoff of the central molecule. The algorithm is quite simple to understand, and the interested reader is referred to Butina's paper for a full description, as well as a comparison to Jarvis-Patrick clustering (which is very popular in cheminformatics).

My implementation (below) requires an n x n similarity matrix, and thus is restricted to clusters of about 4000 molecules (on a 1GB machine, anyway). (C is probably more suited to handling this problem for larger sets.) It should also be noted that the result has a small dependence on the sorting method used, where molecules with nearest neighbours in common have the same number of nearest neighbours (because quicksort doesn't preserve the order, whereas shell sort does). This ambiguity, referred to as the proximity problem, can be significant (MacCuish et al., J. Chem. Inf. Comput. Sci, 2001, 41, 134 - Wards clustering, but still relevant I think), and a possible solution is suggested by the same author (http://www.mesaac.com/ClusteringAmbiguity.ppt) - to deal with the 'tightest cluster' first, that is, the one with the smallest sum of minimum distances (this is my interpretation anyway [confirmed by J. MacCuish, personal communication]). This is not implemented at the moment.

Further notes: Taylor's method (Cluster Sampling) contained an additional step, and focused on ordering molecules by diversity - indeed, he wasn't even interested in the clusters that were a byproduct of the method. Another point to note is that in some implementations of the method, false singletons are assigned to their nearest neighbour after the initial clustering process is finished. Overlapping clusters are also possible.

[butina.R]

# butina.R
# 
# Example:
# source("butina.R")
# nn <- findnn(r)
# butina(r, nn, output="butina.out")
#
# Reason for division into two functions:
# findnn is slow enough, like 5 minutes.
# This is a pain during the development of butina.
# Final version will include findnn at start of butina.

findnn <- function(r, cutoff=0.8) {
    mylen <- length(r[1,])
    nn <- rep(0,mylen)
    for (i in 1:mylen) {
        # Count up the nearest neighbours
        # (subtract 1 because it includes itself)
        nn[i] <- sum(r[,i] > cutoff) - 1
        }
    nn    
    }
    
butina <- function(r, nn, cutoff=0.8, output="butina.out") {
    sorted <- sort(nn,decreasing=TRUE,index.return=TRUE)$ix
    
    mylen <- length(r[1,])
    seen <- rep(FALSE,mylen)
    clusterno <- 0
    sink(output)

    i <- 1
    while (i<=mylen) {
        clusterno <- clusterno+1
        # After loop, i will either be >mylen or seen[sorted[i]]==FALSE
        while (i<=mylen & seen[sorted[i]]) { i <- i+1 }
        if (i<=mylen) {
            cat("Cluster no. ",clusterno,"\n")
            ans <- (r[sorted[i],] > cutoff) & !seen
            print(which(ans))
            seen[ans] <- TRUE
            }
        i <- i+1 # Since we have just set seen[sorted[i]]==TRUE
        }   
    sink(NULL)    
    }        

Commercial implementations: You may be interested in commercial implementations of Taylor-Butina clustering from Darko Butina himself (dbclus, at ChemoMine Consultancy), or John MacCuish (at Mesa Analytics & Computing). If you ask nicely, you may find that an academic license is available for a low price/free.