---
title: "rSWeeP: Functions for creating low-dimensional comparative matrices of amino acid sequence occurrences"
author:
- name: Danrley Rafael Fernandes
  affiliation: Federal University of Paraná, Graduate in Biological Sciences, Curitiba, Paraná, Brazil.
-  name: Camilla Reginatto De Pierri
   affiliation: 
   - Federal University of Paraná, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.
   - Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil.
- name: Mariane Gonçalves Kulik 
  affiliation: Federal University of Paraná, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.
- name: Roberto Raittz
  affiliation: 
   - Federal University of Paraná, Graduate Program in Bioinformatics, Curitiba, Paraná, Brazil.
   - Federal University of Paraná, Department of Biochemistry and Molecular Biology, Curitiba, Paraná, Brazil.
output:
  BiocStyle::html_document
package: rSWeeP
abstract: |
 This package provides functions that make the sWeeP method available in R. The method was developed to support the analysis of amino acid sequences and to assist alignment-free phylogenetic studies. It is based on the concept of spaced words, which is applied to scan biological sequences and convert them into a matrix of occurrences, with the aim of generating low-dimensional matrices of amino acid sequence occurrences.
vignette: >
  %\VignetteIndexEntry{rSWeeP}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
# Overview
“Spaced Words Projection (sWeeP)” is a method for representing biological sequences using relatively compact vectors. It applies the spaced-words concept: sequences are scanned and indices are generated to build a high-dimensional vector, which is then projected onto a smaller, randomly oriented orthonormal base. The method is suitable for making high-quality comparisons between sequences, allowing analyses that are otherwise not feasible because of the computational limitations of traditional techniques. It is available at [sWeeP](https://sourceforge.net/projects/spacedwordsprojection/) (Pierri *et al*., 2019). The tool's main speed gain comes from the constancy of its processing time: response time grows linearly with the number of inputs, whereas in other methods it grows exponentially.
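
The scan-and-project idea can be illustrated with a minimal base-R sketch. It is not the rSWeeP implementation: the mask `c(1, 1, 0, 1, 1)`, the helper names `spaced_word_index` and `toy_sweep`, the example sequence, and the use of a Gaussian matrix orthonormalized by QR as the random base are all assumptions made for illustration only.

```r
# Toy illustration of spaced-word projection (NOT the rSWeeP implementation).
amino <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]  # 20 standard amino acids
mask  <- c(1, 1, 0, 1, 1)                           # assumed mask: 1 = read, 0 = skip

# Hypothetical helper: map one window of residues to an index in 1..20^4
spaced_word_index <- function(window, mask) {
  digits <- match(window[mask == 1], amino) - 1     # base-20 digits
  sum(digits * 20^(seq_along(digits) - 1)) + 1
}

# Hypothetical helper: count spaced words, then project onto `base`.
# Assumes the sequence is at least as long as the mask and uses only
# the 20 standard residues.
toy_sweep <- function(sequence, base, mask) {
  residues <- strsplit(sequence, "")[[1]]
  counts <- numeric(nrow(base))                     # high-dimensional vector
  for (start in seq_len(length(residues) - length(mask) + 1)) {
    window <- residues[start:(start + length(mask) - 1)]
    idx <- spaced_word_index(window, mask)
    counts[idx] <- counts[idx] + 1
  }
  drop(counts %*% base)                             # low-dimensional projection
}

set.seed(1)
# Assumed stand-in for a random orthonormal base: QR of a Gaussian matrix
base <- qr.Q(qr(matrix(rnorm(20^4 * 5), nrow = 20^4, ncol = 5)))
vec <- toy_sweep("MKTAYIAKQRQISFVKSHFSRQ", base, mask)
length(vec)  # 5, one value per column of the base
```

Whatever the length of the input sequence, the result has as many entries as the base has columns, which is what makes the projected vectors directly comparable.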

## Functions
The package has two functions: `orthBase`, which generates an orthonormal matrix of a chosen size, and `sWeeP`, which applies the sWeeP method.

# Quick Start
The `orthBase` function can create a quasi-orthonormal matrix of any desired size. Here it is used to create the matrix onto which the sWeeP method projects the sequences, so it must have 160,000 rows and as many columns as the desired size of the projection.
```{r}
library(rSWeeP)
baseMatrix <- orthBase(160000,10)
```
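One way to see what "quasi-orthonormal" means in practice is to measure how far `t(M) %*% M` is from the identity matrix. The sketch below does this in base R. The helper name `orthonormality_error` is hypothetical, a QR-based matrix stands in for the output of `orthBase`, and the reading of 160,000 as 20^4 (spaced words made of four amino acids) is an assumption rather than something stated in this vignette.

```r
# Hypothetical helper: largest deviation of t(M) %*% M from the identity
orthonormality_error <- function(M) {
  max(abs(crossprod(M) - diag(ncol(M))))
}

n_rows <- 20^4   # assumed origin of 160,000: words of four amino acids
set.seed(42)
# Stand-in for an orthBase() result; replace M with baseMatrix when
# rSWeeP is loaded to run the same check on the real matrix
M <- qr.Q(qr(matrix(rnorm(n_rows * 10), nrow = n_rows)))
orthonormality_error(M)   # close to zero when the columns are orthonormal
```

A small but non-zero value would be the expected signature of a quasi-orthonormal matrix.
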
The **exdna.fas** dataset consists of three strings that simulate DNA sequences; it is used for demonstration purposes only.
```{r}
path <- system.file(package = "rSWeeP", "extdata", "exdna.fas")
```
The sWeeP method is then applied. It returns a matrix in which each sequence is represented as a vector, so the sequences can be compared with vector-based methods. The result can then be displayed graphically as a phylogenetic tree.
```{r}
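# Project the sequences in the file onto the base matrix
projected <- sWeeP(path, baseMatrix)
# Euclidean distances between the projected vectors
distances <- dist(projected, method = "euclidean")
# Hierarchical clustering (Ward's method), drawn as a tree
tree <- hclust(distances, method = "ward.D")
plot(tree, hang = -1, cex = 1)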
```
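
The distance and clustering steps rely only on base R, so they can be tried on any numeric matrix with one row per sequence. The sketch below uses a made-up 4 x 3 matrix (the values and row names are invented, not rSWeeP output) and adds `cutree`, which is not used elsewhere in this vignette, to show how groups could be read off the tree.

```r
# Made-up projected vectors: one row per sequence (values are illustrative)
toy <- rbind(
  seqA = c(0.10, 0.20, 0.30),
  seqB = c(0.11, 0.19, 0.31),
  seqC = c(0.90, 0.80, 0.70),
  seqD = c(0.88, 0.82, 0.69)
)
toy_dist <- dist(toy, method = "euclidean")      # pairwise distances between rows
toy_tree <- hclust(toy_dist, method = "ward.D")  # same clustering call as in the text
groups <- cutree(toy_tree, k = 2)                # split the tree into two groups
groups
```

Here `seqA` and `seqB` fall into one group and `seqC` and `seqD` into the other, because their rows are close in Euclidean distance.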

# Session information 
```{r label='Session information', eval=TRUE, echo=FALSE}
sessionInfo()
```

# References
- Pierri, C. R. *et al*. **sWeeP: Representing large biological sequences data sets in compact vectors**. Scientific Reports, accepted in December 2019. doi: 10.1038/s41598-019-55627-4.