\name{cmp.duplicated}
\alias{cmp.duplicated}
\title{Quickly detect duplicated compounds in a descriptor database}
\description{
    'cmp.duplicated' detects duplicated compounds in a descriptor
        database generated by 'cmp.parse'. Two compounds are considered
        duplicates of each other when their descriptors are identical.
}
\usage{
    cmp.duplicated(db, sort = FALSE)
}
\arguments{
  \item{db}{The descriptor database, in the format returned by 'cmp.parse'.}
  \item{sort}{Whether to sort the descriptors of each compound before
    comparing them. See Details.}
}
\details{
    'cmp.duplicated' takes the descriptors in the descriptor database,
    concatenates all descriptors for the same compound into a string, and uses
    this string as the identifier of that compound. If two compounds share
    the same identification string, they are considered duplicates.

    In most cases the method will identify the duplicates correctly. However,
    users should be aware that the atom pair algorithm treats isomers,
    conformers and other minor structural variants as identical compounds. If
    it is important to retain those variants in the data set, then
    'cmp.duplicated' should not be used. Support for InChI strings is
    expected to overcome this limitation in the future.

    'cmp.duplicated' assumes that the database passed in as an argument
    follows the format generated by 'cmp.parse'. That is, 'db' is a list,
    'db$descdb' is a list, and each entry of 'db$descdb' is an array of numeric
    values that gives the descriptors for one compound.

    By default, 'cmp.duplicated' assumes that the descriptors for each
    compound are already sorted. That is, each entry in 'db$descdb' is a sorted
    array. This is true for databases generated by 'cmp.parse'. If you generate
    the database using some other tool, you might want to enable sorting by
    setting 'sort = TRUE'.
}
\value{
    Returns a logical array that tells whether each compound in the database
    is a duplicate of a compound appearing before it. For example, if the
    i-th element of the array is TRUE, the i-th compound in the database is a
    duplicate of a compound listed earlier in the database.

    The returned array can be used to remove duplicates: negate it and use
    the result to index the descriptor database.

    If you are interested in which compound a given entry duplicates, you can
    search the database with 'cmp.search' and the cutoff set to 1.
}
\author{Y. Eddie Cao}
\seealso{\code{\link{cmp.parse}}, \code{\link{cmp.search}}}
\examples{
# load sample database from web
db <- cmp.parse("http://bioweb.ucr.edu/ChemMineV2/static/example_db.sdf")
# manually create a duplicate
# note that we ignore the other information in the database and only consider
# the descriptor information
db$descdb[[89]] <- db$descdb[[10]]
length(db$descdb)
# find duplicates
dup <- cmp.duplicated(db)
# locate duplicated compound using search
cmp.search(db, db$descdb[dup][[1]], cutoff=1, quiet=TRUE)
# remove duplicates from db
db$descdb <- db$descdb[!dup]
# normally you should also clear the entries in db$cids and db$sdfsegs
length(db$descdb)
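
# A minimal sketch of the idea described in Details, using base R only.
# It illustrates the identification-string approach under assumptions and is
# not the actual implementation of cmp.duplicated; 'toy' and 'ids' are
# hypothetical names introduced for this illustration.
toy <- list(descdb = list(c(1, 2, 3), c(2, 5), c(3, 1, 2)))
# sort each compound's descriptors (as with sort = TRUE), then concatenate
# them into one identification string per compound
ids <- sapply(toy$descdb, function(d) paste(sort(d), collapse = " "))
# the third entry matches the first once sorted, so it is flagged
duplicated(ids)  # FALSE FALSE TRUE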
}
\keyword{utilities}