---
title: "Using the `RTCGA` package to download miRNASeq data included in the `RTCGA.miRNASeq` package"
subtitle: "Date of datasets release: 2015-11-01"
author: "Witold Chodor"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using RTCGA to download miRNASeq data as included in RTCGA.miRNASeq}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE}
library(knitr)
options(width = 150)
opts_chunk$set(comment = "", message = FALSE, warning = FALSE, tidy.opts = list(keep.blank.line = TRUE, width.cutoff = 150), eval = FALSE)
```

# RTCGA package

> The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes. The key is to understand genomics to improve cancer care.

The `RTCGA` package downloads and integrates the wide variety and volume of TCGA data using the patient barcode as a key, which makes the data easier to obtain. This may benefit the development of science and the improvement of patients' treatment. `RTCGA` is an open-source R package, available from Bioconductor:

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("RTCGA")
```

or use the code below to install the development version from GitHub, which is likely to include more recent bug fixes than the Bioconductor release:

```{r, eval=FALSE}
if (!require(devtools)) {
    install.packages("devtools")
    require(devtools)
}
install_github("RTCGA/RTCGA")
```

Furthermore, the `RTCGA` package transforms TCGA data into a form that is convenient to use in the R statistical environment. These transformations can be part of a statistical analysis pipeline, which `RTCGA` helps make more reproducible.

Use cases and examples are shown in the `RTCGA` package vignettes:
```{r, eval=FALSE}
browseVignettes("RTCGA")
```

# How to download miRNASeq data to obtain the same datasets as in the RTCGA.miRNASeq package?

TCGA data have been released on many dates. To list them all, run:
```{r, eval=FALSE}
library(RTCGA)
library(magrittr) # provides the %>% pipe used throughout this vignette
checkTCGA('Dates')
```

Version 1.0 of the `RTCGA.miRNASeq` package contains miRNASeq datasets released on `2015-11-01`. They were downloaded in the following way (mostly copied from [http://rtcga.github.io/RTCGA/](http://rtcga.github.io/RTCGA/)):

## Available cohorts

All cohort names can be checked using:
```{r, eval=FALSE}
(cohorts <- infoTCGA() %>% 
   rownames() %>% 
   sub("-counts", "", x=.))
```
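
As a small illustration of the string clean-up in the last step of that pipeline, the sketch below applies the same `sub()` call to a hypothetical vector of row names. The values are made up for illustration and are not actual `infoTCGA()` output.

```{r, eval=FALSE}
# Hypothetical row names, standing in for rownames(infoTCGA())
example_rownames <- c("ACC-counts", "BRCA-counts", "OV-counts")

# Strip the "-counts" suffix to leave bare cohort codes
example_cohorts <- sub("-counts", "", x = example_rownames)
example_cohorts
```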

For all cohorts the following code downloads the miRNASeq data.

## Downloading tarred files

For miRNASeq data we download datasets produced by
two different machines: Illumina Genome Analyzer and Illumina HiSeq 2000.

```{r, eval=FALSE}
# create the destination directory if it does not exist yet
if (!dir.exists("data2")) dir.create("data2")
releaseDate <- "2015-11-01"

# data produced with Illumina Genome Analyzer machine
sapply(cohorts, function(element){
   tryCatch({
      downloadTCGA(cancerTypes = element,
                   dataSet = "Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3",
                   destDir = "data2",
                   date = releaseDate)
   }, error = function(cond){
      cat("Error: Maybe there were no miRNASeq data for ", element, " cancer.\n")
   })
})

# data produced with Illumina HiSeq 2000 machine
sapply(cohorts, function(element){
   tryCatch({
      downloadTCGA(cancerTypes = element,
                   dataSet = "Merge_mirnaseq__illuminahiseq_mirnaseq__bcgsc_ca__Level_3__miR_gene_expression__data.Level_3",
                   destDir = "data2",
                   date = releaseDate)
   }, error = function(cond){
      cat("Error: Maybe there were no miRNASeq data for ", element, " cancer.\n")
   })
})
```
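
The loops above only print a message when a download fails. If it is useful to know afterwards which cohorts were skipped, the same `tryCatch` pattern can return a success flag. The sketch below shows this with a hypothetical stand-in, `fake_download()`, instead of `downloadTCGA()`, and with made-up cohort codes (`DEMO1`, `DEMO2`), so it runs without network access; it is an illustration of the pattern, not the procedure used to build the package.

```{r, eval=FALSE}
# Hypothetical stand-in for downloadTCGA(): fails for the made-up DEMO cohorts
fake_download <- function(cohort) {
  if (grepl("^DEMO", cohort)) {
    stop("no matching dataset for ", cohort)
  }
  invisible(cohort)
}

example_cohorts <- c("ACC", "DEMO1", "BRCA", "DEMO2")

# TRUE when the download succeeded, FALSE when an error was caught
download_ok <- vapply(example_cohorts, function(cohort) {
  tryCatch({
    fake_download(cohort)
    TRUE
  }, error = function(cond) {
    message("Skipping ", cohort, ": ", conditionMessage(cond))
    FALSE
  })
}, logical(1))

# Cohorts that would need a second look
failed_cohorts <- names(download_ok)[!download_ok]
failed_cohorts
```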

## Reading downloaded miRNASeq dataset

### Shortening paths and directories 

```{r, eval=FALSE}
list.files( "data2") %>% 
   file.path( "data2", .) %>%
   file.rename( to = substr(.,start=1,stop=70))
```
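
To see what the 70-character cut-off does, the sketch below applies the same `substr()` call to a hypothetical folder name. The `EXAMPLE_ACC.` prefix is made up; only the dataset string is taken from the download step. It also checks two things that the later steps rely on: that truncated names stay unique and that the platform tag (`illuminaga`) survives the truncation.

```{r, eval=FALSE}
# Hypothetical long folder name (the "EXAMPLE_ACC." prefix is invented)
long_path <- file.path(
  "data2",
  paste0("EXAMPLE_ACC.",
         "Merge_mirnaseq__illuminaga_mirnaseq__bcgsc_ca__Level_3__",
         "miR_gene_expression__data.Level_3")
)
short_path <- substr(long_path, start = 1, stop = 70)

nchar(long_path)   # longer than 70 characters
nchar(short_path)  # truncated to 70 characters

# Truncation keeps only the leading characters, so names that share their
# first 70 characters would collide; a quick check guards against that
paths <- c(long_path, sub("ACC", "BRCA", long_path))
no_collisions <- anyDuplicated(substr(paths, 1, 70)) == 0

# The platform tag must still be present for the later grepl("illuminaga")
tag_kept <- grepl("illuminaga", short_path)
```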


### Removing `NA` files from data2

For cohorts without miRNASeq data, the steps above can leave an entry named `NA` in `data2`; it is removed here.

```{r, eval=FALSE}
list.files( "data2") %>%
   file.path( "data2", .) %>%
   sapply(function(x){
      if (x == "data2/NA")
         file.remove(x)      
   })
```

### Paths to miRNASeq data

Below is the code that removes the unneeded `MANIFEST.txt` file from each miRNASeq cohort folder.

```{r, eval=FALSE}
list.files( "data2") %>% 
   file.path( "data2", .) %>%
   sapply(function(x){
      file.path(x, list.files(x)) %>%
         # fixed = TRUE matches the literal file name rather than a regex
         grep(pattern = "MANIFEST.txt", x = ., value = TRUE, fixed = TRUE) %>%
         file.remove()
      })
```
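
As a possible alternative, base R can locate the manifest files in a single call with `list.files(recursive = TRUE)`. The sketch below runs on a throwaway directory tree under `tempdir()`; the folder and file names (`ACC_demo`, `expression_demo.txt`, and so on) are invented for illustration, and this is not the code that was used to prepare the package.

```{r, eval=FALSE}
# Build a throwaway directory tree that mimics two cohort folders
demo_root <- file.path(tempdir(), "data2_demo")
cohort_dirs <- file.path(demo_root, c("ACC_demo", "BRCA_demo"))
for (d in cohort_dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(d, "MANIFEST.txt"))
  file.create(file.path(d, "expression_demo.txt"))
}

# Find every MANIFEST.txt below demo_root in one call; the pattern is a
# regular expression, so the dot is escaped and the name is anchored
manifests <- list.files(demo_root, pattern = "^MANIFEST\\.txt$",
                        recursive = TRUE, full.names = TRUE)
removed <- file.remove(manifests)

# Only the data files should be left
remaining <- list.files(demo_root, recursive = TRUE)
remaining

# Clean up the demo tree
unlink(demo_root, recursive = TRUE)
```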

The code below creates, for every miRNASeq cohort downloaded to the `data2` folder, a variable holding the path to its data file.

```{r, eval=FALSE}
data2_files <- list.files("data2")

# Logical mask: TRUE for folders with Illumina Genome Analyzer data.
# Unlike negative indexing, !is_illuminaga also works when nothing matches.
is_illuminaga <- grepl("illuminaga", x = data2_files)

# Paths to data produced with Illumina Genome Analyzer machine
data2_files[is_illuminaga] %>%
   file.path("data2", .) %>%
   sapply(function(y){
      file.path(y, list.files(y)) %>%
         assign( value = .,
                 x = paste0(list.files(y) %>%
                            gsub(x = .,pattern = "\\..*",replacement = "") %>%
                            gsub(x=., pattern="-", replacement = "_"), 
                            ".miRNASeq_illuminaga.path"),
                 envir = .GlobalEnv)
      })
# Paths to data produced with Illumina HiSeq 2000 machine
data2_files[!is_illuminaga] %>%
   file.path("data2", .) %>%
   sapply(function(y){
      file.path(y, list.files(y)) %>%
         assign( value = .,
                 x = paste0(list.files(y) %>%
                            gsub(x = .,pattern = "\\..*",replacement = "") %>%
                            gsub(x=., pattern="-", replacement = "_"),
                            ".miRNASeq_illuminahiseq.path"),
                 envir = .GlobalEnv)
      })
```
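
The variable names are derived from the file names in three steps. The sketch below walks through them for one hypothetical file name (invented for illustration, including a hyphen to show the second step); the two `gsub()` calls and the suffix are the same as in the chunk above.

```{r, eval=FALSE}
# Hypothetical file name inside a cohort folder (made up for illustration)
example_file <- "KIPAN-demo.mirnaseq__illuminahiseq__demo.data.txt"

# 1. Drop everything from the first dot onwards to keep the cohort prefix
prefix <- gsub(pattern = "\\..*", replacement = "", x = example_file)
# 2. Replace hyphens, which are awkward in R object names
prefix <- gsub(pattern = "-", replacement = "_", x = prefix)
# 3. Append the platform-specific suffix
object_name <- paste0(prefix, ".miRNASeq_illuminahiseq.path")
object_name
```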

### Reading miRNASeq data using `readTCGA`

Because miRNASeq data are transposed in the downloaded files, the `readTCGA` function was prepared to read and transpose them automatically. The code is below:

```{r, eval=FALSE}
path_vector <- ls() %>%
   grep("miRNASeq.*path", x = ., value = TRUE)

# First we read miRNASeq data produced by both Illumina Genome Analyzer and
# Illumina HiSeq 2000 machines
path_vector %>% 
   sapply(function(element){
      tryCatch({
         readTCGA(get(element, envir = .GlobalEnv),
                  dataType = "miRNASeq") %>%
            assign(value = .,
                   x = sub("\\.path", "", x = element),
                   envir = .GlobalEnv )
      }, error = function(cond){
         cat("Could not read data for", element, "\n")
      }) 
      invisible(NULL)
   })

# Now we add a `machine` column to miRNASeq data, depending on
# the kind of machine which produced the data
sapply(cohorts, function(element){
   w <- grep(paste0("^",element, "\\."), x = path_vector, value = TRUE)
   if (length(w) == 0) {
      # No miRNASeq data for this cohort: nothing to assign. Returning here
      # avoids assigning an undefined object under the cohort's name.
      return(invisible(NULL))
   } else if ((length(w) == 1) && grepl("illuminaga", x = w)){
      cohort_data <- get(paste0(element,".miRNASeq_illuminaga"), envir = .GlobalEnv)
      cohort_data <- cbind(machine = "Illumina Genome Analyzer", cohort_data)
   } else if ((length(w) == 1) && grepl("illuminahiseq", x = w)){
      cohort_data <- get(paste0(element,".miRNASeq_illuminahiseq"), envir = .GlobalEnv)
      cohort_data <- cbind(machine = "Illumina HiSeq 2000", cohort_data)
   } else if ((length(w) == 2) && grepl("illuminaga|illuminahiseq", x=w[1]) && grepl("illuminaga|illuminahiseq", x=w[2])){
      data_illuminaga <- get(paste0(element,".miRNASeq_illuminaga"), envir = .GlobalEnv)
      data_illuminaga <- cbind(machine = "Illumina Genome Analyzer", data_illuminaga)
      data_illuminahiseq <- get(paste0(element,".miRNASeq_illuminahiseq"), envir = .GlobalEnv)
      data_illuminahiseq <- cbind(machine = "Illumina HiSeq 2000", data_illuminahiseq)
      cohort_data <- rbind(data_illuminaga, data_illuminahiseq)
   }
   assign(value = cohort_data, x = paste0(element, ".miRNASeq"), envir = .GlobalEnv )
   invisible(NULL)
})
```
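
The two ideas in this section, transposing so that samples become rows and tagging each table with its platform before stacking, can be tried on toy data. The sketch below uses made-up identifiers and values and a hypothetical helper, `to_samples_in_rows()`; it does not call `readTCGA` and does not reproduce the exact layout of the real files.

```{r, eval=FALSE}
# Toy input in a "features in rows, samples in columns" layout
# (identifiers and values are made up)
raw_ga <- matrix(c(10, 0, 5, 7), nrow = 2,
                 dimnames = list(c("mir-demo-1", "mir-demo-2"),
                                 c("SAMPLE-A", "SAMPLE-B")))
raw_hiseq <- matrix(c(3, 8), nrow = 2,
                    dimnames = list(c("mir-demo-1", "mir-demo-2"),
                                    "SAMPLE-C"))

# Hypothetical helper: transpose so that each row is a sample and each
# column a feature, keeping the sample identifier as a regular column
to_samples_in_rows <- function(m) {
  data.frame(sample = colnames(m), t(m),
             check.names = FALSE, stringsAsFactors = FALSE)
}
ga <- to_samples_in_rows(raw_ga)
hiseq <- to_samples_in_rows(raw_hiseq)

# Tag each table with its platform, then stack them
ga <- cbind(machine = "Illumina Genome Analyzer", ga)
hiseq <- cbind(machine = "Illumina HiSeq 2000", hiseq)
combined <- rbind(ga, hiseq)
combined
```

Because both tables share the same columns, `rbind` can stack them directly; the `machine` column then records which platform each sample came from.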

# Saving miRNASeq data to `RTCGA.miRNASeq` package


```{r, eval=FALSE}
# Print the comma-separated names of the combined per-cohort objects.
# use_data() expects unquoted names of existing objects, so the printed
# list was pasted into the call below (can one do it better?)
grep( "miRNASeq", x=ls(), value = TRUE) %>%
   grep("illuminahiseq|illuminaga", x = ., value = TRUE, invert = TRUE) %>%
   cat( sep="," )

# Note: in recent devtools releases use_data() has moved to the usethis
# package (usethis::use_data())
devtools::use_data(ACC.miRNASeq,BLCA.miRNASeq,BRCA.miRNASeq,CESC.miRNASeq,
                   CHOL.miRNASeq,COAD.miRNASeq,COADREAD.miRNASeq,DLBC.miRNASeq,
                   ESCA.miRNASeq,FPPP.miRNASeq,GBM.miRNASeq,GBMLGG.miRNASeq,
                   HNSC.miRNASeq,KICH.miRNASeq,KIPAN.miRNASeq,KIRC.miRNASeq,
                   KIRP.miRNASeq,LAML.miRNASeq,LGG.miRNASeq,LIHC.miRNASeq,
                   LUAD.miRNASeq,LUSC.miRNASeq,MESO.miRNASeq,OV.miRNASeq,
                   PAAD.miRNASeq,PCPG.miRNASeq,PRAD.miRNASeq,READ.miRNASeq,
                   SARC.miRNASeq,SKCM.miRNASeq,STAD.miRNASeq,STES.miRNASeq,
                   TGCT.miRNASeq,THCA.miRNASeq,THYM.miRNASeq,UCEC.miRNASeq,
                   UCS.miRNASeq,UVM.miRNASeq,
                   overwrite = TRUE,
                   compress="xz")
```
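
One possible way to avoid typing the object names by hand is base R's `save()`, which accepts a character vector of names through its `list` argument. The sketch below writes one xz-compressed `.rda` file per object into a temporary folder; the objects (`ACC.demo`, `BRCA.demo`) and the output folder are made up for illustration, and this is not how the package data were actually saved.

```{r, eval=FALSE}
# Two toy objects standing in for the cohort data frames
ACC.demo <- data.frame(machine = "Illumina HiSeq 2000", value = 1)
BRCA.demo <- data.frame(machine = "Illumina Genome Analyzer", value = 2)

# Collect object names programmatically instead of typing them out
object_names <- grep("\\.demo$", ls(), value = TRUE)

# Write one xz-compressed .rda file per object into a temporary folder
out_dir <- file.path(tempdir(), "data_demo")
dir.create(out_dir, showWarnings = FALSE)
for (nm in object_names) {
  save(list = nm,
       file = file.path(out_dir, paste0(nm, ".rda")),
       compress = "xz")
}
list.files(out_dir)
```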