---
title: "miRmine dataset as RangedSummarizedExperiment"
author:
  name: Dusan Randjelovic
  email: dusan.randjelovic@sbgenomics.com
package: miRmine
output:
  BiocStyle::html_document
abstract: |
  miRmine is a data package built for easier use of the miRmine dataset
  (Panwar et al. (2017) miRmine: A Database of Human miRNA Expression.
  Bioinformatics, btx019. doi: 10.1093/bioinformatics/btx019).
  In its current version, miRmine contains 304 publicly available experiments
  from NCBI SRA. Annotations used are from miRBase v21
  (miRBase: Annotating high confidence microRNAs using deep sequencing data.
  Kozomara A, Griffiths-Jones S. NAR 2014 42:D68-D73).
vignette: |
  %\VignetteIndexEntry{miRmine}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Data preparation

The miRmine dataset contains rich metadata for 304 selected, publicly available
miRNA-Seq experiments. The authors processed the data with miRdeep2 using
annotation files from miRBase v21. In preparing this dataset as a
RangedSummarizedExperiment, that metadata is used as colData and the miRBase
annotations, as GRanges, are used as rowRanges. The data used for preprocessing
and constructing the `miRmine` RangedSummarizedExperiment are available in the
`extdata` folder. Details of this process can be followed in the data help
file: `?miRmine`.

```{r}
#library(GenomicRanges)
#library(rtracklayer)
#library(SummarizedExperiment)
#library(Biostrings)
#library(Rsamtools)

# Locate the raw files shipped with the package and list them
ext.data <- system.file("extdata", package = "miRmine")
list.files(ext.data)
```

The number of ranges in the miRBase GFF and the number of features output by
miRdeep2 are not the same (2813 vs. 2822). On closer inspection, 2 rows from
either the **tissues** or **cell.lines** data are duplicated (same mature miRNA
and same precursor miRNA) and 7 rows don't correspond to any mirna/precursor
combination in miRBase v21. These 9 rows were removed for all samples, as seen
in `?miRmine`.

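
The row filtering described above can be illustrated with a small base R sketch. It is not the code used to build the package (see `?miRmine` for that); the data frames `quant` and `annot`, their column names, and all identifiers in them are hypothetical stand-ins for the miRdeep2 output and the miRBase annotation. The sketch assumes that the first copy of a duplicated mature/precursor pair is kept.

```r
# Hypothetical stand-in for miRdeep2 output: one row per mature/precursor pair
quant <- data.frame(
  mature    = c("miR-a", "miR-a", "miR-b", "miR-c", "miR-d"),
  precursor = c("mir-a-1", "mir-a-1", "mir-b-1", "mir-c-9", "mir-d-1"),
  rpm       = c(10, 10, 5, 7, 0),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the mature/precursor pairs present in the annotation
annot <- data.frame(
  mature    = c("miR-a", "miR-b", "miR-d"),
  precursor = c("mir-a-1", "mir-b-1", "mir-d-1"),
  stringsAsFactors = FALSE
)

# Build one key per row from the mature and precursor names
pair.key <- function(df) paste(df$mature, df$precursor, sep = "|")

# Rows that repeat an earlier mature/precursor pair (assumption: keep the first copy)
is.dup <- duplicated(pair.key(quant))

# Rows whose mature/precursor pair exists in the annotation
in.annot <- pair.key(quant) %in% pair.key(annot)

# Keep only unique rows that can be matched to the annotation
filtered <- quant[!is.dup & in.annot, ]
nrow(filtered)
```

Counting `sum(is.dup)` and `sum(!in.annot)` on real inputs is one way to check how many rows fall into each category before dropping them.
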
# Usage

To load this dataset use:

```{r}
library("miRmine")
data(miRmine)
miRmine
```

You may want to subset the data further on any of the colData features
(Tissue, Cell Line, Disease, Sex, Instrument) or output specifics of particular
experiment(s) (Accession #, Description, Publication):

```{r}
# Keep only the samples annotated as adenocarcinoma
adenocarcinoma = miRmine[ , miRmine$Disease %in% c("Adenocarcinoma")]
adenocarcinoma
as.character(adenocarcinoma$Sample.Accession)
```

The rowRanges data is also rich in metadata: it contains all the features from
miRBase hsa.gff3, with the addition of the actual miRNA sequence as a DNAString
instance. For example, to read the sequence of the top expressed miRNA over a
subset of samples:

```{r}
# Sort in decreasing order so that the first element is the most expressed miRNA
top.mirna = names(sort(rowSums(assays(adenocarcinoma)$counts), decreasing = TRUE)[1])
rowRanges(adenocarcinoma)$mirna_seq[[top.mirna]]
```

`miRmine` can be used directly with DESeq2. Note that the expression values are
RPM, not raw read counts, so they are rounded up to integers below because
`DESeqDataSet` requires integer counts:

```{r}
library("DESeq2")

# Compare two tissues
mirmine.subset = miRmine[, miRmine$Tissue %in% c("Lung", "Saliva")]

# Rebuild the object with RPM values rounded up to integers
mirmine.subset = SummarizedExperiment(
    assays = SimpleList(counts=ceiling(assays(mirmine.subset)$counts)),
    colData=colData(mirmine.subset),
    rowRanges=rowRanges(mirmine.subset),
    rowData=NULL
)

ddsSE <- DESeqDataSet(mirmine.subset, design = ~ Tissue)
# Drop miRNAs with almost no signal across the selected samples
ddsSE <- ddsSE[ rowSums(counts(ddsSE)) > 1, ]

dds <- DESeq(ddsSE)
res <- results(dds)
res
```

# Session info {.unnumbered}

```{r sessionInfo, echo=FALSE}
sessionInfo()
```