The Rcwl package aims to provide a simple and user-friendly way to manage command line tools and build data analysis pipelines in R using the Common Workflow Language (CWL). It can serve as a bridge between heavy bioinformatics tools and pipelines and R/Bioconductor. More details about CWL can be found at http://www.commonwl.org.

1 Installation

  1. Download the package.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Rcwl")

The development version is also available from GitHub.

BiocManager::install("hubentu/Rcwl")

  2. Load the package into the R session.
library(Rcwl)

2 First Example

The main class and constructor function is cwlParam, which wraps a command line tool and its parameters in a cwlParam object. Let’s start with a simple example, echo hello world.

First, we load the package and then define the input parameter for “echo”, a string without a prefix. Only the id option is required.

input1 <- InputParam(id = "sth")

Second, create a cwlParam object with baseCommand for the command to execute and InputParamList for the input parameters.

echo <- cwlParam(baseCommand = "echo", inputs = InputParamList(input1))

Now we have a command object to run. Let’s send a string “Hello World!” to the object. Without defining the outputs, it will stream standard output to a temporary file by default.

echo$sth <- "Hello World!"
echo
## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: echo 
## inputs:
##   sth (string):  Hello World!
## outputs:
## output:
##   type: stdout

Let’s run it. runCWL returns a list containing the command executed, the temporary output, and the logs. The output directory is the current folder by default, but it can be changed with the outdir option. The standard output and standard error streams can also be printed by setting stderr = "".

r1 <- runCWL(echo, outdir = tempdir())
## Final process status is success
r1
## List of length 3
## names(3): command output logs
readLines(r1$output)
## [1] "Hello World!"

The writeCWL function writes the cwlParam object to a CWL file for the command and a YML file for the inputs. runCWL then invokes cwl-runner by default to execute the two files, so the cwl-runner command line tool must be installed and available on the system PATH.
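
Because execution depends on an external runner, it can help to confirm that one is visible from R before running the examples. The following is a minimal sketch using only base R; the helper name has_tool is hypothetical and not part of Rcwl.

```r
# Hypothetical helper (not part of Rcwl): TRUE if an executable with the
# given name can be found on the system PATH.
has_tool <- function(name) {
    path <- Sys.which(name)   # "" when no executable of that name is found
    unname(nzchar(path))
}

if (!has_tool("cwl-runner")) {
    message("cwl-runner was not found on PATH; runCWL() may fail to execute.")
}
```

The same check can be applied to other external programs used later, such as docker or node.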

3 Input Parameters

3.1 Essential Input parameters

For the input parameters, three options usually need to be defined: id, type, and prefix. The type can be string, int, long, float, double, and so on. More details can be found at: https://www.commonwl.org/v1.0/CommandLineTool.html#CWLType.

Here is an example from the CWL user guide. We define an echo tool with different types of input parameters using InputParam. The stdout option can be used to capture the standard output stream to a file.

e1 <- InputParam(id = "flag", type = "boolean", prefix = "-f")
e2 <- InputParam(id = "string", type = "string", prefix = "-s")
e3 <- InputParam(id = "int", type = "int", prefix = "-i")
e4 <- InputParam(id = "file", type = "File", prefix = "--file=", separate = FALSE)
echoA <- cwlParam(baseCommand = "echo",
                  inputs = InputParamList(e1, e2, e3, e4),
                  stdout = "output.txt")

Then we give it a try by setting values for the inputs.

echoA$flag <- TRUE
echoA$string <- "Hello"
echoA$int <- 1

tmpfile <- tempfile()
write("World", tmpfile)
echoA$file <- tmpfile

r2 <- runCWL(echoA, outdir = tempdir())
## Final process status is success
r2$command
## [1] "[job file54a627add457.cwl] /tmp/E2TKEr$ echo \\"                                      
## [2] "    --file=/tmp/tmpyDHZ_d/stgca52a5dc-85cf-48ca-8c24-c71f61a15609/file54a66d493772 \\"
## [3] "    -f \\"                                                                            
## [4] "    -i \\"                                                                            
## [5] "    1 \\"                                                                             
## [6] "    -s \\"                                                                            
## [7] "    Hello > /tmp/E2TKEr/output.txt"                                                   
## [8] "Could not collect memory usage, job ended before monitoring began."
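
The way prefix and separate shape the command line can be illustrated with a small base R sketch. It only mimics the binding rules visible in the command above (a boolean contributes its prefix alone, and separate = FALSE glues the prefix to the value); bind_input is a hypothetical helper, not an Rcwl function, and "/path/to/file" is a placeholder path.

```r
# Illustrative only: mimics how one input binding becomes command-line tokens.
bind_input <- function(value, prefix = NULL, separate = TRUE) {
    if (is.logical(value)) {
        # Boolean inputs contribute only their prefix, and only when TRUE.
        return(if (isTRUE(value)) prefix else character(0))
    }
    value <- as.character(value)
    if (is.null(prefix)) return(value)
    # separate = TRUE gives two tokens; separate = FALSE gives one joined token.
    if (separate) c(prefix, value) else paste0(prefix, value)
}

tokens <- c("echo",
            bind_input(TRUE, "-f"),
            bind_input("Hello", "-s"),
            bind_input(1, "-i"),
            bind_input("/path/to/file", "--file=", separate = FALSE))
cat(paste(tokens, collapse = " "), "\n")
```

In this sketch the tokens simply follow the order of the calls; in the real run shown above, the runner applies its own ordering to the inputs.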

3.2 Array Inputs

This example is similar to one in the CWL user guide. We can define three different types of array inputs.

a1 <- InputParam(id = "A", type = "string[]", prefix = "-A")
a2 <- InputParam(id = "B",
                 type = InputArrayParam(items = "string",
                                        prefix="-B=", separate = FALSE))
a3 <- InputParam(id = "C", type = "string[]", prefix = "-C=",
                 itemSeparator = ",", separate = FALSE)
echoB <- cwlParam(baseCommand = "echo",
                 inputs = InputParamList(a1, a2, a3))

Then set values for the three inputs.

echoB$A <- letters[1:3]
echoB$B <- letters[4:6]
echoB$C <- letters[7:9]
echoB
## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: echo 
## inputs:
##   A (string[]): -A a b c
##   B:
##     type: array
##     prefix: -B= d e f
##   C (string[]): -C= g h i
## outputs:
## output:
##   type: stdout

Now we can check whether the command behaves as we expected.

r3 <- runCWL(echoB, outdir = tempdir())
## Final process status is success
r3$command
##  [1] "[job file54a6332f44aa.cwl] /tmp/9_WrKx$ echo \\"                    
##  [2] "    -A \\"                                                          
##  [3] "    a \\"                                                           
##  [4] "    b \\"                                                           
##  [5] "    c \\"                                                           
##  [6] "    -B=d \\"                                                        
##  [7] "    -B=e \\"                                                        
##  [8] "    -B=f \\"                                                        
##  [9] "    -C=g,h,i > /tmp/9_WrKx/dd96ceaee1d6aa6446eeac74f10bdd91b61bbfb0"
## [10] "Could not collect memory usage, job ended before monitoring began."
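
The three array styles differ only in where the prefix is attached. The base R sketch below reproduces the argument tokens seen in the command above; the three helper functions are hypothetical names introduced for illustration and are not part of Rcwl.

```r
# Illustrative only: three ways an array input can be rendered as arguments.

# Prefix bound to the whole array: the prefix appears once, then each item.
bind_array <- function(items, prefix) c(prefix, items)

# Prefix bound to each item, with no separator between prefix and item.
bind_each <- function(items, prefix) paste0(prefix, items)

# Items joined by an item separator into a single token after the prefix.
bind_joined <- function(items, prefix, itemSeparator) {
    paste0(prefix, paste(items, collapse = itemSeparator))
}

args_A <- bind_array(letters[1:3], "-A")
args_B <- bind_each(letters[4:6], "-B=")
args_C <- bind_joined(letters[7:9], "-C=", ",")
print(c(args_A, args_B, args_C))
```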

4 Output Parameters

4.1 Capturing Output

The outputs, like the inputs, are a list of output parameters. Three options, id, type, and glob, can be defined. The glob option is used to define a pattern to find files relative to the output directory.

Here is an example that decompresses a gzip-compressed file. First, we generate a compressed R script file.

zzfil <- file.path(tempdir(), "sample.R.gz")
zz <- gzfile(zzfil, "w")
cat("sample(1:10, 5)", file = zz, sep = "\n")
close(zz)

We define a cwlParam object that uses “gzip” to uncompress an input file.

ofile <- "sample.R"
z1 <- InputParam(id = "uncomp", type = "boolean", prefix = "-d")
z2 <- InputParam(id = "out", type = "boolean", prefix = "-c")
z3 <- InputParam(id = "zfile", type = "File")
o1 <- OutputParam(id = "rfile", type = "File", glob = ofile)
gz <- cwlParam(baseCommand = "gzip",
               inputs = InputParamList(z1, z2, z3),
               outputs = OutputParamList(o1),
               stdout = ofile)

Now the gz object can be used to uncompress the previously generated compressed file.

gz$uncomp <- TRUE
gz$out <- TRUE
gz$zfile <- zzfil
r4 <- runCWL(gz, outdir = tempdir())
## Final process status is success
r4$output
## [1] "/tmp/Rtmp0LOMLU/sample.R"

Or we can use arguments to set some default parameters.

z1 <- InputParam(id = "zfile", type = "File")
o1 <- OutputParam(id = "rfile", type = "File", glob = ofile)
Gz <- cwlParam(baseCommand = "gzip",
               arguments = list("-d", "-c"),
               inputs = InputParamList(z1),
               outputs = OutputParamList(o1),
               stdout = ofile)
Gz
## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: gzip 
## arguments: -d -c 
## inputs:
##   zfile (File):  
## outputs:
## rfile:
##   type: File
##   outputBinding:
##     glob: sample.R
## stdout: sample.R
Gz$zfile <- zzfil
r4a <- runCWL(Gz, outdir = tempdir())
## Final process status is success

To make the tool more general, we can define a pattern with JavaScript to glob the output, which requires node to be installed and available on your system PATH.

pfile <- "$(inputs.zfile.path.split('/').slice(-1)[0].split('.').slice(0,-1).join('.'))"

Or we can use the CWL built-in file property, nameroot, directly.

pfile <- "$(inputs.zfile.nameroot)"
o2 <- OutputParam(id = "rfile", type = "File", glob = pfile)
req1 <- list(class = "InlineJavascriptRequirement")
GZ <- cwlParam(baseCommand = c("gzip", "-d", "-c"),
               requirements = list(), ## assign list(req1) if node installed.
               inputs = InputParamList(z1),
               outputs = OutputParamList(o2),
               stdout = pfile)
GZ$zfile <- zzfil
r4b <- runCWL(GZ, outdir = tempdir())
## Final process status is success
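
Both patterns above derive the output name by dropping the final extension from the input file name. As a rough base R analogue, the sketch below follows the JavaScript expression step by step for ordinary file names; name_root is a hypothetical helper and is not part of Rcwl or CWL.

```r
# Hypothetical helper: take the last path component, split on ".",
# drop the final piece, and rejoin the rest.
name_root <- function(path) {
    base  <- basename(path)
    parts <- strsplit(base, ".", fixed = TRUE)[[1]]
    if (length(parts) < 2) return(base)   # no extension to strip
    paste(head(parts, -1), collapse = ".")
}

print(name_root("/tmp/sample.R.gz"))
```

For the compressed script used in this section, the result is the name of the R script that gzip writes to standard output.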

4.2 Array Outputs

We can also capture multiple output files with a glob pattern.

a <- InputParam(id = "a", type = InputArrayParam(items = "string"))
b <- OutputParam(id = "b", type = OutputArrayParam(items = "File"), glob = "*.txt")
touch <- cwlParam(baseCommand = "touch", inputs = InputParamList(a), outputs = OutputParamList(b))
touch$a <- c("a.txt", "b.gz", "c.txt")
r5 <- runCWL(touch, outdir = tempdir())
## Final process status is success
r5$output
## [1] "/tmp/Rtmp0LOMLU/a.txt" "/tmp/Rtmp0LOMLU/c.txt"
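
The selection made by the "*.txt" pattern can be reproduced with base R globbing. This is only an analogy run in a scratch directory (the directory name glob_demo is arbitrary); it does not call the CWL runner.

```r
# Illustrative only: create the same three file names and glob for "*.txt".
scratch <- file.path(tempdir(), "glob_demo")
dir.create(scratch, showWarnings = FALSE)
invisible(file.create(file.path(scratch, c("a.txt", "b.gz", "c.txt"))))

# Only the files whose names match the pattern are returned.
matched <- basename(Sys.glob(file.path(scratch, "*.txt")))
print(matched)
```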

5 Running Tools in Docker

CWL can work with Docker to simplify software management and to pass files between the host and the container. The Docker container can be defined with the hints or requirements option.

d1 <- InputParam(id = "rfile", type = "File")
req1 <- list(class = "DockerRequirement",
             dockerPull = "r-base")
doc <- cwlParam(baseCommand = "Rscript",
                inputs = InputParamList(d1),
                stdout = "output.txt",
                hints = list(req1))
doc$rfile <- r4$output
r6 <- runCWL(doc)

Tools defined with Docker requirements can also be run locally by disabling the docker option. If your R script depends on local libraries, the cwltool option “--preserve-entire-environment” can be used to pass all environment variables.

r6a <- runCWL(doc, docker = FALSE, outdir = tempdir(),
              Args = "--preserve-entire-environment")
## Final process status is success

6 Running Tools on a Cluster Server

CWL can also work on high-performance clusters with batch-queuing systems, such as SGE, PBS, and SLURM, using the Bioconductor package BiocParallel. Here is an example of submitting jobs with “Multicore” and “SGE”. A more detailed example can be found at https://hubentu.github.io/others/Rcwl_RNASeq.html.

library(BiocParallel)
sth.list <- as.list(LETTERS)
names(sth.list) <- LETTERS

## submit with multicore
result1 <- runCWLBatch(cwl = echo, outdir = tempdir(), inputList = list(sth = sth.list),
                       BPPARAM = MulticoreParam(26))

## submit with SGE
result2 <- runCWLBatch(cwl = echo, outdir = tempdir(), inputList = list(sth = sth.list),
                       BPPARAM = BatchtoolsParam(workers = 26, cluster = "sge",
                                                 resources = list(queue = "all.q")))

7 Writing Pipeline

We can connect multiple tools together into a pipeline. Here is an example to uncompress an R script and execute it with Rscript.

Here we define a simple Rscript tool without using docker.

d1 <- InputParam(id = "rfile", type = "File")
Rs <- cwlParam(baseCommand = "Rscript",
               inputs = InputParamList(d1))
Rs
## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: Rscript 
## inputs:
##   rfile (File):  
## outputs:
## output:
##   type: stdout

Test run:

Rs$rfile <- r4$output
tres <- runCWL(Rs, outdir = tempdir())
## Final process status is success
readLines(tres$output)
## [1] "[1] 9 6 5 7 8"

The pipeline includes two steps: decompression by GZ and execution by Rs. The input is a compressed file, and the output is the standard output of Rs (its output parameter, referenced as Compile/output).

First we need to define the direct inputs and outputs from GZ and Rs.

i1 <- InputParam(id = "cwl_zfile", type = "File")
o1 <- OutputParam(id = "cwl_cout", type = "File", outputSource = "Compile/output")

The input “cwl_zfile” refers to the GZ input zfile. The output “cwl_cout” is the output of Rs, referenced by the outputSource “Compile/output”.

The cwlStepParam function defines the inputs and outputs of the whole pipeline. The Step function then defines each step: the run option refers to the corresponding cwlParam object, and the In option links each input of that tool either to a pipeline input defined in cwlStepParam or to an output of a previous step. Finally, we use + to connect all steps.

cwl <- cwlStepParam(inputs = InputParamList(i1),
                    outputs = OutputParamList(o1))
s1 <- Step(id = "Uncomp", run = GZ,
           In = list(zfile = "cwl_zfile"))
s2 <- Step(id = "Compile", run = Rs,
           In = list(rfile = "Uncomp/rfile"))
cwl <- cwl + s1 + s2
cwl
## class: cwlStepParam 
##  cwlClass: Workflow 
##  cwlVersion: v1.0 
## inputs:
##   cwl_zfile (File):  
## outputs:
## cwl_cout:
##   type: File
##   outputSource: Compile/output
## steps:
##   Uncomp:
##     run: Uncomp.cwl
##     zfile: cwl_zfile
##     out: rfile
##   Compile:
##     run: Compile.cwl
##     rfile: Uncomp/rfile
##     out: output
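
The step links printed above also determine which step must run first. The sketch below models the two steps as plain R lists and extracts the upstream step from each "step/output" source; this is an illustration of the wiring only, not Rcwl's internal representation, and the upstream helper is hypothetical.

```r
# Illustrative only: a plain-list model of the two steps and their In links.
steps <- list(
    Uncomp  = list(In = c(zfile = "cwl_zfile")),
    Compile = list(In = c(rfile = "Uncomp/rfile"))
)

# A source containing "/" refers to another step's output; the part before
# the slash names the upstream step. Other sources are pipeline-level inputs.
upstream <- function(step) {
    src <- step$In[grepl("/", step$In, fixed = TRUE)]
    unique(sub("/.*$", "", unname(src)))
}

deps <- lapply(steps, upstream)
str(deps)
```

Here Uncomp has no upstream step because it reads the pipeline input directly, while Compile depends on Uncomp.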

Let’s run the pipeline.

cwl$cwl_zfile <- zzfil
r7 <- runCWL(cwl, outdir = tempdir())
## Final process status is success
readLines(r7$output)
## [1] "[1] 9 8 2 5 6"

7.1 Scattering pipeline

The scattering feature specifies that the associated workflow step or subworkflow should execute separately over a list of input elements. To use this feature, ScatterFeatureRequirement must be specified in the workflow requirements. Different scatter methods can be used in the associated step to decompose the input into a discrete set of jobs. More details can be found at: https://www.commonwl.org/v1.0/Workflow.html#WorkflowStep.

Here is an example that executes multiple R scripts. First, we need to set the input and output types to be arrays of “File” and add the requirements. In the “Compile” step, the input to scatter over must be set with the scatter option.

i2 <- InputParam(id = "cwl_rfiles", type = "File[]")
o2 <- OutputParam(id = "cwl_couts", type = "File[]", outputSource = "Compile/output")
req1 <- list(class = "ScatterFeatureRequirement")

cwl2 <- cwlStepParam(requirements = list(req1),
                     inputs = InputParamList(i2),
                     outputs = OutputParamList(o2))
s1 <- Step(id = "Compile", run = Rs,
           In = list(rfile = "cwl_rfiles"),
           scatter = "rfile")
cwl2 <- cwl2 + s1
cwl2
## class: cwlStepParam 
##  cwlClass: Workflow 
##  cwlVersion: v1.0 
## requirements:
## - class: ScatterFeatureRequirement
## inputs:
##   cwl_rfiles (File[]):  
## outputs:
## cwl_couts:
##   type: File[]
##   outputSource: Compile/output
## steps:
##   Compile:
##     run: Compile.cwl
##     rfile: cwl_rfiles
##     out: output
##     scatter: rfile

Multiple R scripts can be assigned to the workflow inputs and executed.

cwl2$cwl_rfiles <- c(r4b$output, r4b$output)
r8 <- runCWL(cwl2, outdir = tempdir())
## Final process status is success
r8$output
## [1] "/tmp/Rtmp0LOMLU/1a890ba91ed644d3e0ff3686dc2207e9e5f6cb5d"
## [2] "/tmp/Rtmp0LOMLU/1a890ba91ed644d3e0ff3686dc2207e9e5f6cb5d"
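
Conceptually, scattering over a single array input resembles applying the step once per element and collecting the results in the same order. The base R sketch below shows that idea only; run_step is a hypothetical stand-in for one tool invocation, and the file names are made up.

```r
# Illustrative only: one job per input element, results collected in order.
run_step <- function(rfile) paste0("output_of_", basename(rfile))

scatter <- function(step, inputs) {
    vapply(inputs, step, character(1), USE.NAMES = FALSE)
}

rfiles  <- c("first.R", "second.R")
outputs <- scatter(run_step, rfiles)
print(outputs)
```

This is why the workflow output type above is an array of “File”: each element of the input array yields one output.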

7.2 Pipeline plot

The function plotCWL can be used to visualize the relationship of inputs, outputs and the analysis for a tool or pipeline.

plotCWL(cwl)

8 Web Application

8.1 cwlParam example

Here we build a tool with different types of input parameters.

e1 <- InputParam(id = "flag", type = "boolean",
                 prefix = "-f", doc = "boolean flag")
e2 <- InputParam(id = "string", type = "string", prefix = "-s")
e3 <- InputParam(id = "option", type = "string", prefix = "-o")
e4 <- InputParam(id = "int", type = "int", prefix = "-i", default = 123)
e5 <- InputParam(id = "file", type = "File",
                 prefix = "--file=", separate = FALSE)
e6 <- InputParam(id = "array", type = "string[]", prefix = "-A",
                 doc = "separated by comma")
mulEcho <- cwlParam(baseCommand = "echo", id = "mulEcho",
                 label = "Test parameter types",
                 inputs = InputParamList(e1, e2, e3, e4, e5, e6),
                 stdout = "output.txt")
mulEcho
## class: cwlParam 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: echo 
## inputs:
##   flag (boolean): -f 
##   string (string): -s 
##   option (string): -o 
##   int (int): -i 123
##   file (File): --file= 
##   array (string[]): -A 
## outputs:
## output:
##   type: stdout
## stdout: output.txt

8.2 cwlParam to Shiny App

Some input parameters can be predefined in a list, which will be converted to select options in the web app. The upload parameter determines whether to generate an upload interface for File type inputs. If FALSE, the upload field will be a text input (file path) instead of a file input.

inputList <- list(option = c("option1", "option2"))
app <- cwlShiny(mulEcho, inputList, upload = TRUE)
runApp(app)

Figure: shinyApp

9 Working with R functions

We can wrap an R function in a cwlParam object by simply assigning the R function to baseCommand. This can be useful for summarizing results from other tools in a pipeline. It can also be used to benchmark different parameters for a method written in R. Please note that this feature is implemented only by Rcwl and is not available in the Common Workflow Language itself.

fun1 <- function(x)x*2
testFun <- function(a, b){
    cat(fun1(a) + b^2, sep="\n")
}
assign("fun1", fun1, envir = .GlobalEnv)
assign("testFun", testFun, envir = .GlobalEnv)
p1 <- InputParam(id = "a", type = "int", prefix = "a=", separate = FALSE)
p2 <- InputParam(id = "b", type = "int", prefix = "b=", separate = FALSE)
o1 <- OutputParam(id = "o", type = "File", glob = "rout.txt")
TestFun <- cwlParam(baseCommand = testFun,
                    inputs = InputParamList(p1, p2),
                    outputs = OutputParamList(o1),
                    stdout = "rout.txt")
TestFun$a <- 1
TestFun$b <- 2
r1 <- runCWL(TestFun, Args = "--preserve-entire-environment")
## Final process status is success
readLines(r1$output)
## [1] "6"

The runCWL function automatically writes the testFun function and its dependencies into an R script file and calls Rscript to run the script with the parameters. Each parameter requires a prefix made of the corresponding argument name in the R function followed by “=”, with no separator. Here we assigned the R function and its dependencies to the global environment because a new environment is started when the vignette is compiled.
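
To make the “name=value” convention concrete, the sketch below shows one way a script could turn such arguments into a function call. It is an assumption-laden illustration, not the script that Rcwl actually generates; parse_named_args is a hypothetical helper, and the arguments are hard-coded here instead of being read from the command line.

```r
# Illustrative only: the two functions from the example above.
fun1 <- function(x) x * 2
testFun <- function(a, b) {
    cat(fun1(a) + b^2, sep = "\n")
}

# Hypothetical parser: split each argument at the first "=".
parse_named_args <- function(args) {
    keys   <- sub("=.*$", "", args)
    values <- sub("^[^=]*=", "", args)
    # Values are converted to numbers here; the tool above declares them int.
    setNames(as.list(as.numeric(values)), keys)
}

# In a real script these would come from commandArgs(trailingOnly = TRUE).
cli_args <- c("a=1", "b=2")
result <- capture.output(do.call(testFun, parse_named_args(cli_args)))
print(result)
```

Because the argument names become the names of the list passed to do.call, the prefixes must match the function's argument names exactly.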

10 Resources

10.1 RcwlPipelines

The Rcwl package can be used to develop pipelines that follow best practices for reproducible research, especially in bioinformatics. Multiple bioinformatics pipelines, such as RNASeq alignment, quality control and quantification, and DNASeq alignment and variant calling, have been developed with Rcwl in the R package RcwlPipelines, which contains the CWL recipes and the scripts to create the pipelines. Examples that analyze real data are also included.

The package is currently available on GitHub. To install the package:

BiocManager::install("hubentu/RcwlPipelines")

More recipes will be collected in this package, and we invite the community to submit more pipelines to it.

10.2 Tool collections in CWL format

Plenty of bioinformatics tools and workflows in CWL format can be found on GitHub. They can be imported into cwlParam objects with the readCWL function or used directly.

10.3 Docker for Bioinformatics tools

Most bioinformatics software is available in Docker containers, which makes it convenient to build portable CWL tools and pipelines.

11 SessionInfo

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] Rcwl_1.0.14         S4Vectors_0.22.1    BiocGenerics_0.30.0
## [4] yaml_2.2.0          BiocStyle_2.12.0   
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.5.1       tidyr_1.0.0         jsonlite_1.6       
##  [4] viridisLite_0.3.0   R.utils_2.9.0       shiny_1.4.0        
##  [7] assertthat_0.2.1    BiocManager_1.30.7  base64url_1.4      
## [10] progress_1.2.2      pillar_1.4.2        backports_1.1.5    
## [13] glue_1.3.1          downloader_0.4      digest_0.6.21      
## [16] RColorBrewer_1.1-2  promises_1.1.0      checkmate_1.9.4    
## [19] colorspace_1.4-1    htmltools_0.4.0     httpuv_1.5.2       
## [22] R.oo_1.22.0         XML_3.98-1.20       pkgconfig_2.0.3    
## [25] bookdown_0.14       DiagrammeR_1.0.1    purrr_0.3.2        
## [28] xtable_1.8-4        scales_1.0.0        brew_1.0-6         
## [31] later_1.0.0         BiocParallel_1.18.1 tibble_2.1.3       
## [34] ggplot2_3.2.1       influenceR_0.1.0    withr_2.1.2        
## [37] lazyeval_0.2.2      rgexf_0.15.3        magrittr_1.5       
## [40] crayon_1.3.4        mime_0.7            evaluate_0.14      
## [43] R.methodsS3_1.7.1   Rook_1.1-1          tools_3.6.1        
## [46] data.table_1.12.4   prettyunits_1.0.2   hms_0.5.1          
## [49] lifecycle_0.1.0     stringr_1.4.0       munsell_0.5.0      
## [52] compiler_3.6.1      rlang_0.4.0         grid_3.6.1         
## [55] rstudioapi_0.10     rappdirs_0.3.1      htmlwidgets_1.5.1  
## [58] visNetwork_2.0.8    igraph_1.2.4.1      rmarkdown_1.16     
## [61] codetools_0.2-16    gtable_0.3.0        R6_2.4.0           
## [64] gridExtra_2.3       knitr_1.25          dplyr_0.8.3        
## [67] fastmap_1.0.1       zeallot_0.1.0       readr_1.3.1        
## [70] stringi_1.4.3       Rcpp_1.0.2          vctrs_0.2.0        
## [73] batchtools_0.9.11   tidyselect_0.2.5    xfun_0.10