Summary
- R Language
- Large data requires efficient programming.
- Efficient programming benefits from understanding more of how R
works 'under the hood'. Correctness is always more important than
speed.
- Parallel evaluation is a secondary approach to gaining
performance.
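As a minimal sketch of the points above (the function names and the size n are hypothetical, chosen only for illustration), the following computes the same vector three ways and checks that the results agree before speed is considered:

```r
## Hypothetical task: the squares of 1..n.
n <- 10000

## Inefficient: the result is copied and extended on every iteration.
f_grow <- function(n) {
    x <- numeric()
    for (i in seq_len(n))
        x <- c(x, i^2)
    x
}

## Better: pre-allocate the result, then fill it in place.
f_prealloc <- function(n) {
    x <- numeric(n)
    for (i in seq_len(n))
        x[i] <- i^2
    x
}

## Idiomatic: a single vectorized operation, no explicit loop.
f_vec <- function(n)
    seq_len(n)^2

## Correctness first: all three must return exactly the same value.
stopifnot(
    identical(f_grow(n), f_prealloc(n)),
    identical(f_prealloc(n), f_vec(n))
)
```

Wrapping each call in system.time() gives a rough comparison of speed; the identical() checks come first because a fast but wrong answer is of no use.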
- Objects
- Objects allow co-ordinated manipulation of complex inter-related
data; objects are pervasive in R.
- Formal S4 objects provide structure that benefits
interoperability between related classes, while enabling
experienced users and package developers to rapidly re-use
existing concepts and code.
- S4 objects are used extensively (and to good effect) in
Bioconductor; it pays to understand key classes and their
manipulation.
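The structure and re-use described above can be sketched with a small S4 class. The class names (Intervals, LabelledIntervals), the slot names and the widths() generic are hypothetical and are not taken from any Bioconductor package:

```r
library(methods)

## Hypothetical class: closed integer intervals with a validity check.
setClass("Intervals",
    representation(start = "integer", end = "integer"),
    validity = function(object) {
        if (length(object@start) != length(object@end))
            return("'start' and 'end' must have the same length")
        if (any(object@end < object@start))
            return("'end' must not be less than 'start'")
        TRUE
    })

## Constructor: coerce the arguments so users need not write 1L, 5L, ...
Intervals <- function(start, end)
    new("Intervals", start = as.integer(start), end = as.integer(end))

## A generic and a method; users call widths() without touching slots.
setGeneric("widths", function(x, ...) standardGeneric("widths"))
setMethod("widths", "Intervals", function(x, ...) x@end - x@start + 1L)

## A derived class inherits the slots, the validity check and the method.
setClass("LabelledIntervals", contains = "Intervals",
    representation(label = "character"))

iv <- Intervals(c(1, 10), c(5, 12))
liv <- new("LabelledIntervals", iv, label = c("a", "b"))

widths(iv)     # method defined for "Intervals"
widths(liv)    # same method, found through inheritance
```

An invalid object, for instance Intervals(5, 1), is rejected by the validity function when new() is called, so methods can rely on the slots being consistent.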
- C (and other) languages
- Two reasons for writing C code are (1) to interface with existing
libraries and (2) to implement high-performance algorithms.
- Writing C code has many drawbacks, including large time
investment to develop the code, implementations that often
undermine R concepts such as handling of NAs, and introduction of
catastrophic or subtle memory bugs. These considerations should
discourage us from embarking on projects that involve C code
except as a last resort.
- For algorithm implementation, one quickly graduates from the
relative simplicity of the .C interface to the flexibility of the
.Call interface (requiring significant understanding of R's
internal representation) to Rcpp-style programming that masks
some of the complexity of interacting with R while exposing the
object-oriented facilities of C++.
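The NA issue mentioned above can be illustrated with a short Rcpp-style sketch, driven from R. It assumes the Rcpp package and a working C++ compiler are available; the function name sum_int is hypothetical. In compiled code an integer NA is just a particular int value, so it has to be tested for explicitly:

```r
## Assumption: Rcpp and a C++ toolchain are installed; skipped otherwise.
if (requireNamespace("Rcpp", quietly = TRUE)) {
    Rcpp::cppFunction('
        int sum_int(IntegerVector x) {
            int total = 0;
            for (int i = 0; i < x.size(); ++i) {
                // Without this test the NA would be added as an ordinary
                // (very negative) integer and silently corrupt the result.
                if (IntegerVector::is_na(x[i]))
                    return NA_INTEGER;
                total += x[i];   // note: integer overflow is not checked
            }
            return total;
        }')

    sum_int(c(1L, 2L, 3L))    # an ordinary sum
    sum_int(c(1L, NA, 3L))    # NA, as sum() in R would return
}
```

Even this small function leaves integer overflow unhandled, which sum() in R does check; that is one example of the subtle differences that make C-level code expensive to get right.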
- Data bases and external data representations
- Processing data from non-R formats can be efficient and
powerful. Bioconductor packages use SQL data bases to store gene
and genome annotations, and XML to query web-based resources.
- SQL is well suited to querying relational
data. Straightforward solutions scale easily to data with 100k
rows but, as with R, exploiting larger SQL data resources requires
non-trivial understanding of SQL and the data base engine in use.
- XML, and in particular XPath, provides a very flexible way to query
web-based resources or to interoperate with other software. The
XML package has a unique event-parsing mechanism for
iterating through large XML objects.
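Both query styles can be sketched on toy data. The example assumes the DBI, RSQLite and XML packages are installed; the table, column, element and gene names are hypothetical and do not correspond to any Bioconductor annotation resource:

```r
## SQL: let the data base filter and sort; only the result enters R.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
    con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    genes <- data.frame(
        gene_id = c("g1", "g2", "g3"),
        chrom = c("chr1", "chr1", "chr2"),
        pos = c(500L, 100L, 200L),
        stringsAsFactors = FALSE)
    DBI::dbWriteTable(con, "genes", genes)
    chr1 <- DBI::dbGetQuery(con,
        "SELECT gene_id, pos FROM genes WHERE chrom = 'chr1' ORDER BY pos")
    DBI::dbDisconnect(con)
}

## XPath: select nodes by path and attribute rather than walking the tree.
if (requireNamespace("XML", quietly = TRUE)) {
    txt <- "<genes>
              <gene chrom='chr1'><symbol>ABC1</symbol></gene>
              <gene chrom='chr2'><symbol>XYZ2</symbol></gene>
            </genes>"
    doc <- XML::xmlParse(txt, asText = TRUE)
    symbols <- XML::xpathSApply(doc, "//gene[@chrom='chr1']/symbol",
                                XML::xmlValue)
}
```

In both cases the query string describes what is wanted and the engine decides how to find it, which is why simple cases stay simple while large resources reward a closer look at indexes and query plans.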