---
title: "Class notes"
author: "Martin Morgan"
date: "2/4/2015"
output: html_document
---

```{r setup, echo=FALSE}
library(UseBioconductor)
stopifnot(BiocInstaller::biocVersion() == "3.1")
```

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
```

# Intro

## R

Vectors

- everything is a vector: `integer()`, `character()`, `numeric()`, `logical()`, `raw()`, `complex()`
- sometimes called 'atomic'

```{r}
x = rnorm(1000)
y = x + rnorm(sd=.5, 1000)
```

- 'API' (Application Programming Interface) -- how can you work with a vector?
    - `[` -- single bracket subset; 'endomorphism'
    - `length()`
    - `c()`
    - `[<-` -- subset-assign
    - (`names()`)

functions: argument names;

- can be optional arguments
- named (`sd`; can be partial, e.g., `s=`) -- matched before unnamed
- positional -- unnamed are matched by position

```{r}
## anonymous function: each element of 0:3 is used as the mean
rnorms = lapply(0:3, function(mean) {
    rnorm(1000, mean)
})
## n= and mean= are matched by name, so each element of 0:3 becomes sd
rnorms = lapply(0:3, rnorm, n=1000, mean=0)
```

`matrix()`

- atomic vector with a 'dim' attribute (the class is implicit, not stored as an attribute)
- 'API' -- two- (n-) dimensional `[`, `[<-`

```{r}
m = matrix(1:6, 2)
dput(m)
```

`factor()` -- decorated `integer()` vector

`list()`

- recursive data structure
- heterogeneous elements
- 'API'
    - 'inherits' (very loose sense) from vector
    - `[[`, `$` -- extract element of list
    - `[[<-`, `$<-` -- assign new element
    - (`unlist()`)
    - (assign `NULL` to remove an element)

`data.frame()`

- list of vectors, all vectors the same length
- 'class' attribute
- inherits 'list' API, and also 'matrix' API

closures

```{r}
acctFactory = function() {
    balance <- 0
    list(deposit=function(amt) {
        ## <<- updates balance in the enclosing environment
        balance <<- balance + amt
    }, currBalance=function() {
        balance
    })
}
```

## S3 classes and methods

```{r}
x = rnorm(1000)
y = x + rnorm(sd=.5, 1000)
df = data.frame(X=x, Y=y)
```

Use of `data.frame()`:

- groups vectors in a useful way
    - e.g., avoiding bookkeeping errors when subsetting
- ensures conformance with `data.frame()` 'contract'
- motivates data structures more elaborate than vector

```{r}
fit = lm(Y ~ X, df)
plot(Y ~ X, df)
abline(fit, lwd=4, col="red")
anova(fit)
```

- `fit` is an S3 object (instance, class)
    - `list()` with a `class` attribute
    - structure is visible, but irrelevant to the user
    - `class()` to discover the class(!)
- `anova` is a generic, with a method appropriate for the class of `fit`
    - discovery: `methods("anova")`, `methods(class="lm")`
    - help: `?plot` (for the generic), `?plot.lm` (for the method)

# Bioconductor

## S4

```{r}
suppressPackageStartupMessages({
    library(IRanges)
})
start <- as.integer(runif(1000, 1, 1e4))
width <- as.integer(runif(length(start), 50, 100))
ir <- IRanges(start, width=width)
coverage(ir)
```

- S4 is more formal than S3
    - Specify class structure
    - Complicated inheritance
    - Multiple dispatch possible
- discovery
    - `class(ir)`; could look at (but why bother?) structure using `getClass(class(ir))`
        - Especially useful for inheritance
    - `showMethods("coverage")`, `showMethods(class=class(ir), where=search())`
- help
    - `?coverage` -- help on the generic
    - `?IRanges` -- Constructor; recent convention: also documents class & important methods
    - `selectMethod("coverage", signature=class(ir))` to figure out method dispatch, and to see the function definition
    - `method?"coverage,Ranges"` (tab completion!)
    - `class?IRanges` (tab completion!)

## Essential classes

Sequences

- `DNAString`, `DNAStringSet`

Ranges

- `GRanges`, `GRangesList`

Integrated containers

- `SummarizedExperiment`

# Working with large data

Brief review of [lecture material](A01.4_LargeData.html)

Efficient `R` code

- R programming sins and corrections; correctness is of primary importance
- Important to ask how an algorithm scales with problem size; many naive approaches scale quadratically (bad!).
- Compiler (`compiler::cmpfun()`) surprisingly effective at improving `f1()` -- better than `sapply()`.
- Explored `vapply()`. Faster and safer than `sapply()`, so should be a best practice
- Large gains available from writing effective `R` code; makes appeal to C++ / parallel evaluation less compelling

`r Biocpkg("GenomicFiles")` and `r Biocpkg("BiocParallel")`

- Extended development of `reduceByYield()` to iterate through files
- Easy to parallelize across files via `bplapply()`.
- Oops, RStudio swallows `bplapply()` output. :(

# Annotation

Brief review of [lecture material](A01.5_Annotation.html)

General importance of `select()` interface, including to on-line resources such as `r Biocpkg("biomaRt")`

`r Biocpkg("AnnotationHub")`

- _Very_ easy to wrangle web-based genome annotation files, e.g., UCSC chain files
- Simplified discovery, download, and local management
- Easy use in _Bioconductor_ work flows
- Role of `r Biocpkg("AnnotationHub")` in deploying more complicated and heavily curated resources, like the GRASP2 data base of GWAS variants
- `r CRANpkg("dplyr")` makes working with data bases fun.
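
A minimal sketch of the `select()` interface noted at the top of this section. It assumes the `org.Hs.eg.db` annotation package is installed; that package and the gene symbols are illustrative choices, not something these notes name. `keytypes()` and `columns()` list what can be queried and what can be returned. `select()` is called with its namespace because `dplyr` also exports a function called `select()`. The chunk is not evaluated, so knitting does not depend on the assumed package.

```{r eval=FALSE}
suppressPackageStartupMessages({
    library(AnnotationDbi)
    library(org.Hs.eg.db)    # assumed example annotation package
})

## discover what can be used as a key, and what can be returned
keytypes(org.Hs.eg.db)
columns(org.Hs.eg.db)

## map gene symbols (illustrative) to Entrez identifiers and gene names
symbols <- c("BRCA1", "TP53")
res <- AnnotationDbi::select(
    org.Hs.eg.db, keys=symbols,
    columns=c("ENTREZID", "GENENAME"), keytype="SYMBOL"
)
res    # a data.frame, one row per key / result combination
```

The same `keys` / `columns` / `keytype` pattern should carry over to other annotation objects that implement the interface.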