\name{cmp.duplicated}
\alias{cmp.duplicated}
\title{Quickly detect compound duplication in a descriptor database}
\description{
'cmp.duplicated' detects duplicated compounds in a descriptor database
generated by 'cmp.parse'. Two compounds are considered duplicates of each
other when their descriptors are the same.
}
\usage{
cmp.duplicated(db, sort = FALSE)
}
\arguments{
  \item{db}{The descriptor database, in the format returned by 'cmp.parse'.}
  \item{sort}{Whether to sort the descriptors of each compound before
  comparing them. See details.}
}
\details{
'cmp.duplicated' takes the descriptors in the descriptor database,
concatenates all descriptors of the same compound into a string, and uses this
string as the identification of the compound. If two compounds share the same
identification string, they are considered duplicates of each other.

In most cases the method identifies duplicates correctly. However, users
should be aware that the atom pair algorithm treats isomers, conformers and
other small structural variants as identical compounds. If it is important to
retain those variants in the data set, 'cmp.duplicated' should not be used.
Support for InChI strings will overcome this limitation in the future.

'cmp.duplicated' assumes that the database passed in as argument follows the
format generated by 'cmp.parse'. That is, 'db' is a list, 'db$descdb' is a
list, and each entry of 'db$descdb' is an array of numeric values that gives
the descriptors of one compound.

By default, 'cmp.duplicated' assumes that the descriptors of each compound are
already sorted, that is, each entry in 'db$descdb' is a sorted array. This is
true for databases generated by 'cmp.parse'. If you generate the database with
another tool, you may want to enable sorting by setting 'sort = TRUE'.
}
\value{
Returns a logical vector indicating, for each compound in the database,
whether it is a duplicate of a compound appearing before it.
For example, if the i-th element of the vector is TRUE, the i-th compound in
the database is a duplicate of a compound listed earlier in the database.

The returned vector can be used to remove duplicates: simply use its negation
to index the descriptor database. To find out which compound is duplicated,
search the database with 'cmp.search' and the cutoff set to 1.
}
\author{Y. Eddie Cao}
\seealso{\code{\link{cmp.parse}}, \code{\link{cmp.search}}}
\examples{
# load sample database from web
db <- cmp.parse("http://bioweb.ucr.edu/ChemMineV2/static/example_db.sdf")

# manually create a duplication
# note that we ignore the other information in the database and only consider
# the descriptor information
db$descdb[[89]] <- db$descdb[[10]]
length(db$descdb)

# find duplication
dup <- cmp.duplicated(db)

# locate duplicated compound using search
cmp.search(db, db$descdb[dup][[1]], cutoff=1, quiet=TRUE)

# remove duplication from db
db$descdb <- db$descdb[!dup]
# normally you should also clear the entries in db$cids and db$sdfsegs
length(db$descdb)
}
\keyword{utilities}
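\section{Sketch of the duplicate test}{
The idea described in the details (one identification string per compound,
then a check for repeated strings) can be illustrated in base R. This is only
a minimal sketch under stated assumptions, not the package's implementation:
\code{toy_db} and \code{make_keys} are hypothetical names introduced for
illustration, and the toy descriptors are made up.
\preformatted{
# Hypothetical toy database mimicking the layout described in the details:
# a list whose 'descdb' element holds one numeric vector per compound.
toy_db <- list(descdb = list(
    c(11, 42, 97),   # compound 1
    c(5, 42),        # compound 2
    c(97, 11, 42),   # compound 3: same descriptors as compound 1, unsorted
    c(5, 42)         # compound 4: exact copy of compound 2
))

# Build one identification string per compound, optionally sorting first.
make_keys <- function(db, sort = FALSE) {
    vapply(db$descdb, function(d) {
        if (sort) d <- base::sort(d)
        paste(d, collapse = " ")
    }, character(1))
}

# duplicated() flags every element that equals an earlier element.
dup_unsorted <- duplicated(make_keys(toy_db))
dup_sorted   <- duplicated(make_keys(toy_db, sort = TRUE))
}
In this sketch, compound 3 is flagged only when sorting is enabled, which is
why sorting may matter for databases whose descriptors are not already sorted.
}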