---
title: "`MungeSumstats`: Getting started"
author: "
Authors: Alan Murphy, Brian Schilder and Nathan Skene
"
date: "Updated: `r format(Sys.Date(), '%b-%d-%Y')`
"
csl: nature.csl
output:
BiocStyle::html_document:
vignette: >
%\VignetteIndexEntry{MungeSumstats}
%\usepackage[utf8]{inputenc}
%\VignetteEngine{knitr::rmarkdown}
bibliography: MungeSumstats.bib
editor_options:
markdown:
wrap: 72
---
# Citation
If you use the *MungeSumstats* package, please cite
[Murphy et al. MungeSumstats: A Bioconductor package for the
standardisation and quality control of many GWAS summary
statistics](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab665/6380562).
# Overview
The *MungeSumstats* package is designed to facilitate the
standardisation of GWAS summary statistics as utilised in our Nature
Genetics paper [@Skene2018].
The package is designed to handle the lack of standardisation of output
files by the GWAS community. There is a group who have now manually
standardised many GWAS: [R interface to the IEU GWAS database API •
ieugwasr](https://mrcieu.github.io/ieugwasr/) and
[gwasvcf](https://github.com/MRCIEU/gwasvcf) but because a lot of GWAS
remain closed access, these repositories are not all encompassing.
The [GWAS-Download
project](https://github.com/mikegloudemans/gwas-download) has collated
summary statistics from 200+ GWAS. This repository has been utilsed to
identify the most common formats, all of which can be standardised with
*MungeSumstats*.
Moreover, there is an emerging standard of VCF format for summary
statistics files with multiple, useful, associated R packages such as
*vcfR*. However, there is currently no method to convert VCF formats to
a standardised format that matches older approaches.
The *MungeSumstats* package standardises both VCF and the most common
summary statistic file formats to enable downstream integration and
analysis.
*MungeSumstats* also offers comprehensive Quality Control (QC) steps
which are important prerequisites for downstream analysis like Linkage
disequilibrium score regression (LDSC) and MAGMA.
Moreover, *MungeSumstats* is efficiently written resulting in all
reformatting and quality control checks completing in minutes for GWAS
summary statistics with 500k SNPs on a standard desktop machine. This
speed can be increased further by increasing the number of threads
(nThread) for `data.table` to use.
Currently *MungeSumstats* only works on data from humans, as it uses
human-based genome references.
# Aim
*MungeSumstats* will ensure that the all essential columns for analysis
are present and syntactically correct. Generally, summary statistic
files include (but are not limited to) the columns:
- SNP : SNP ID (rs IDs)
- CHR : Chromosome number
- BP : Base pair positions
- A1 : reference allele
- A2 : alternative allele
- Z : Z-score
- BETA : Effect size estimate relative to the alternative allele
- P : Unadjusted p-value for SNP
- SE : The standard error
- N : Sample size
- INFO: The imputation information score
- FRQ: The minor/effect allele frequency (MAF/EAF) of the SNP
*MungeSumstats* uses a mapping file to infer the inputted column names
(run `data("sumstatsColHeaders")` to view these). This mapping file is
far more comprehensive than any other publicly available munging tool
containing more than 200 unique mappings at the time of writing this
vignette. However, if your column headers are missing or if you want to
change the mapping, you can do so by passing your own mapping file (see
`format_sumstats(mapping_file)`).
*MungeSumstats* offers unmatched levels of quality control to ensure,
for example, consistency of allele assignment and direction of effects.
Tests run by *MungeSumstats* include:
- Check VCF format
- Check tab, space or comma delimited, zipped, csv or tsv file
- Check for header name synonyms
- Check for multiple models or traits in GWAS
- Check for uniformity in SNP ID - no mix of rs/missing rs/chr:bp
- Check for CHR:BP:A2:A1 all in one column
- Check for CHR:BP in one column
- Check for A1/A2 in one column
- Check if CHR and/or BP is missing (infer from reference genome)
- Check if SNP ID is missing (infer from reference genome)
- Check if A1 and/or A2 are missing (infer from reference genome)
- Check that vital columns are present (SNP,CHR,BP,P,A1,A2)
- Check for one signed/effect column
(Z,OR,BETA,LOG_ODDS,SIGNED_SUMSTAT)
- Check for missing data
- Check for duplicated columns
- Check for small p-values (lower than 5e-324)
- Check N column is an integer
- Check for SNPs with N greater than 5 times standard dev. plus the
mean
- Check SNPs are RS ID's
- Check for uniformity of SNP ID format
- Check for duplicated rows, based on SNP ID
- Check for duplicated rows, based on base-pair position
- Check for SNPs on reference genome. Correct not found SNP IDs using
CHR and BP (infer from reference genome)
- Check INFO score
- Check FRQ value
- Check FRQ is minor allele frequency (MAF)
- Check that the SNPs' standard error (SE) is positive
- Check that SNPs' effect columns (like BETA) aren't equal to 0
- Check for strand-ambiguous SNPs
- Check for non-biallelic SNPs (infer from reference genome)
- Check for allele flipping
- Check for SNPs with nonstandard chromosome names
- Check for SNPs on excluded chromosomes (removes non-autosomal SNPs by default)
- Check for z-score (Z) and impute if missing
- Check for N and impute if missing
- Check output format is LDSC ready
- Check output format is IEU OpenGWAS ready
- Check and perform liftover to desired reference genome if necessary
- Check for indels in the sumstats and drop them if found (not run by default)
Users can specify which checks to run on their data. A **note** on the
allele flipping check: **MungeSumstats** infers the effect allele will
always be the A2 allele, this is the approach done for [IEU GWAS
VCF](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7805039/) and has such
also been adopted here. This inference is first from the inputted file's
column headers however, the allele flipping check ensures this by
comparing A1, what should be the reference allele, to the reference
genome. If a SNP's A1 DNA base doesn't match the reference genome but
it's A2 (what should be the alternative allele) does, the alleles will
be flipped along with the effect information (e.g. Beta, Odds Ratio,
signed summary statistics, FRQ, Z-score\*).
\*-by default the Z-score is assumed to be calculated off the effect
size not the P-value and so will be flipped. This can be changed by a
user.
If a test is failed, the user will be notified and if possible, the
input will be corrected. The QC steps from the checks above can also be
adjusted to suit the user's analysis, see
`MungeSumstats::format_sumstats`.
*MungeSumstats* can handle VCF, txt, tsv, csv file types or .gz/.bgz
versions of these file types. The package also gives the user the
flexibility to export the reformatted file as tab-delimited, VCF or R
native objects such as data.table, GRanges or VRanges objects. The
output can also be outputted in an **LDSC ready** format which means the
file can be fed directly into LDSC without the need for additional
munging. **NOTE** - If LDSC format is used, the naming convention of A1 as the
reference (genome build) allele and A2 as the effect allele will be reversed
to match LDSC (A1 will now be the effect allele). See more info on this
[here](https://groups.google.com/g/ldsc_users/c/S7FZK743w68). Note that any
effect columns (e.g. Z) will be inrelation to A1 now instead of A2.
Please read carefully through our [FAQ Website](https://github.com/Al-Murphy/MungeSumstats/wiki/FAQ)
to gain insight on how best to run MungeSumstats on your data.
# Data
The *MungeSumstats* package contains small subsets of GWAS summary
statistics files. Firstly, on Educational Attainment by Okbay et al
2016: PMID: 27898078 PMCID: PMC5509058 DOI: 10.1038/ng1216-1587b.
Secondly, a VCF file (VCFv4.2) relating to the GWAS Amyotrophic lateral
sclerosis from ieu open GWAS project. Dataset: ebi-a-GCST005647:
These datasets will be used to showcase *MungeSumstats* functionality.
# Running *MungeSumstats*
*MungeSumstats* is available on Bioconductor. To install the package on
Bioconductor run the following lines of code:
if (!require("BiocManager")) install.packages("BiocManager")
BiocManager::install("MungeSumstats")
Once installed, load the package:
```{r setup}
library(MungeSumstats)
```
To standardise the summary statistics' file format, simply call
`format_sumstats()` passing in the path to your summary statistics file
or directly pass the summary statistics as a dataframe or datatable. You
can specify which genome build was used in the GWAS(GRCh37 or GRCh38)
or, as default, infer the genome build from the data.The reference
genome is used for multiple checks like deriving missing data such
SNP/BP/CHR/A1/A2 and for QC steps like removing non-biallelic SNPs,
strand-ambiguous SNPs or ensuring correct allele and direction of SNP
effects. The path to the reformatted summary statistics file can be
returned by the function call, the user can specify a location to save
the file or the user can return an R native object for the data:
data.table, VRanges or GRanges object.
Note that for a number of the checks implored by *MungeSumstats* a
reference genome is used. If your GWAS summary statistics file of
interest relates to *GRCh38*, you will need to install
`SNPlocs.Hsapiens.dbSNP155.GRCh38` and `BSgenome.Hsapiens.NCBI.GRCh38`
from Bioconductor as follows:
```
#increase permissible time to download data, in case of slow internet access
options(timeout=2000)
BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh38")
BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38")
```
If your GWAS summary statistics file of interest relates to *GRCh37*,
you will need to install `SNPlocs.Hsapiens.dbSNP155.GRCh37` and
`BSgenome.Hsapiens.1000genomes.hs37d5` from Bioconductor as follows:
```
BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh37")
BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")
```
These may take some time to install and are not included in the package
as some users may only need one of *GRCh37*/*GRCh38*.
The Educational Attainment by Okbay GWAS summary statistics file is
saved as a text document in the package's external data folder so we can
just pass the file path to it straight to *MungeSumstats*.
**NOTE** - By default, Formatted results will be saved to `tempdir()`.
This means all formatted summary stats will be deleted upon ending the R
session if not copied to a local file path. Otherwise, to keep formatted
summary stats, change `save_path` (
e.g.`file.path('./formatted',basename(path))`), or make sure to copy
files elsewhere after processing (
e.g.`file.copy(save_path, './formatted/' )`.
```{r, eval=FALSE, message=TRUE}
eduAttainOkbayPth <- system.file("extdata","eduAttainOkbay.txt",
package="MungeSumstats")
reformatted <-
MungeSumstats::format_sumstats(path=eduAttainOkbayPth,
ref_genome="GRCh37")
```
```{r,echo=FALSE}
#don't run time intensive checks
eduAttainOkbayPth <- system.file("extdata","eduAttainOkbay.txt",
package="MungeSumstats")
reformatted <-
MungeSumstats::format_sumstats(path=eduAttainOkbayPth,
on_ref_genome = FALSE,
strand_ambig_filter = FALSE,
bi_allelic_filter = FALSE,
allele_flip_check = FALSE,
ref_genome="GRCh37")
```
Here we know the summary statistics are based on the reference genome GRCh37,
GRCh38 can also be inputted. Moreover, if you are unsure of the genome build,
leave it as `NULL` and Mungesumstats will infer it from the data.
Also note that the default dbSNP version used along with the reference genome is
the latest version available on Bioconductor (currently dbSNP 155) but older
versions are also availble. Use the `dbSNP` input parameter to control this.
The arguments `format_sumstats` in that control the level of QC
conducted by *MungeSumstats* are:
- **convert_small_p** Binary, should `p-values < 5e-324` be converted
to 0? Small p-values pass the R limit and can cause errors with
LDSC/MAGMA and should be converted. Default is TRUE.
- **convert_large_p** Binary, should p-values \>1 be converted to 1?
P-values \>1 should not be possible and can cause errors with
LDSC/MAGMA and should be converted. Default is TRUE.
- **convert_neg_p** Binary, should p-values \<0 be converted to 0?
Negative p-values should not be possible and can cause errors with
LDSC/MAGMA and should be converted. Default is TRUE.
- **compute_z** Whether to compute Z-score column from P. Default is
FALSE. **Note** that imputing the Z-score for every SNP will not
correct be perfectly correct and may result in a loss of power. This
should only be done as a last resort.
- **force_new_z** When a "Z" column already exists, it will be used by
default. To override and compute a new Z-score column from P set to
TRUE.
- **compute_n** Whether to impute N. Default of 0 won't impute, any
other integer will be imputed as the N (sample size) for every SNP
in the dataset. **Note** that imputing the sample size for every SNP
is not correct and should only be done as a last resort. N can also
be inputted with "ldsc", "sum", "giant" or "metal" by passing one of
these for this field or a vector of multiple. Sum and an integer
value creates an N column in the output whereas giant, metal or ldsc
create an Neff or effective sample size. If multiples are passed,
the formula used to derive it will be indicated.
- **convert_n\_int** Binary, if N (the number of samples) is not an
integer, should this be rounded? Default is TRUE. analysis_trait If
multiple traits were studied, name of the trait for analysis from
the GWAS. Default is NULL.
- **impute_beta** Binary, whether BETA should be imputed using other
effect data if it isn't present in the sumstats. Note that this
imputation is an approximation so could have an effect on downstream
analysis. Use with caution. The different methods MungeSumstats will
try and impute beta (in this order or priority) are: 1. log(OR) 2. Z
x SE. Default value is FALSE.
- **es_is_beta** Binary, whether to map ES to BETA. We take BETA to be any
BETA-like value (including Effect Size). If this is not the case for your
sumstats, change this to FALSE. Default is TRUE.
- **impute_se** Binary, whether the standard error should be imputed
using other effect data if it isn't present in the sumstats. Note
that this imputation is an approximation so could have an effect on
downstream analysis. Use with caution. The different methods
MungeSumstats will try and impute se (in this order or priority)
are: 1. BETA / Z
2. abs(BETA/ qnorm(P/2)). Default value is FALSE.
- **analysis_trait** If multiple traits were studied, name of the
trait for analysis from the GWAS. Default is NULL.
- **INFO_filter** 0-1 The minimum value permissible of the imputation
information score (if present in sumstatsfile). Default 0.9
- **FRQ_filter** 0-1 The minimum value permissible of the
frequency(FRQ) of the SNP (i.e. Allele Frequency (AF)) (if present
in sumstats file). By default no filtering is done, i.e. value of
0.\
- **pos_se** Binary Should the standard Error (SE) column be checked
to ensure it is greater than 0? Those that are, are removed (if
present in sumstats file). Default TRUE.
- **effect_columns_nonzero** Binary should the effect columns in the
data BETA,OR (odds ratio),LOG_ODDS,SIGNED_SUMSTAT be checked to
ensure no SNP=0. Those that do are removed(if present in sumstats
file). Default TRUE.\
- **N_std** Numeric, the number of standard deviations above the mean
a SNP's N is needed to be removed. Default is 5. **N_dropNA**
controls whether the SNPs with a missing N value are dropped or not
(Default is TRUE).
- **N_dropNA** Drop rows where N is missing.Default is TRUE.
- **chr_style** Chromosome naming style to use in the formatted summary
statistics file ("NCBI", "UCSC", "dbSNP", or "Ensembl"). The NCBI and
Ensembl styles both code chromosomes as `1-22, X, Y, MT`; the UCSC
style is `chr1-chr22, chrX, chrY, chrM`; and the dbSNP style is
`ch1-ch22, chX, chY, chMT`. Default is Ensembl.
- **rmv_chr** Chromosomes to exclude from the formatted summary
statistics file. Use NULL if no filtering is necessary. Default is
`c("X", "Y", "MT")` which removes all non-autosomal SNPs.
- **on_ref_genome** Binary, should a check take place that all SNPs
are on the reference genome by SNP ID. Any SNPs not on the reference
genome, will be corrected from the reference genome (if possible)
using the chromosome and base pair position data. Default is TRUE
- **convert_ref_genome** name of the reference genome to convert to
("GRCh37" or "GRCh38"). This will only occur if the current genome
build does not match. Default is not to convert the genome build
(NULL).\
- **strand_ambig_filter** Binary, should SNPs with strand-ambiguous
alleles be removed. Default is FALSE
- **allele_flip_check** Binary, should the allele columns be checked
against reference genome to infer if flipping is necessary. Default
is TRUE. **allele_flip_drop** controls whether the SNPs for which
neither their A1 or A2 base pair values match a reference genome be
dropped. Default is TRUE. **allele_flip_z** controls whether the
Z-score value should be flipped along with effect and FRQ columns
(e.g. Beta). Default is TRUE. **allele_flip_frq** controls whether
the frequency (FRQ) value should be flipped along with effect and
Z-score columns (e.g. Beta). Default is TRUE.
- **bi_allelic_filter** Binary, should non-biallelic SNPs be removed.
Default is TRUE
- **flip_frq_as_biallelic** Binary, Should non-bi-allelic SNPs frequency
values be flipped as 1-p despite there being other alternative alleles?
Default is FALSE but if set to TRUE, this allows non-bi-allelic SNPs to be
kept despite needing flipping.
- **snp_ids_are_rs_ids** Binary, should the SNP IDs inputted be
inferred as RS IDs or some arbitrary ID. Default is TRUE.\
- **remove_multi_rs_snp** Binary Sometimes summary statistics can have
multiple RSIDs on one row (i.e. related to one SNP), for example
"rs5772025_rs397784053". This can cause an error so by default, the
first RS ID will be kept and the rest removed e.g."rs5772025". If
you want to just remove these SNPs entirely, set it to TRUE. Default
is FALSE.
- **frq_is_maf** Binary, conventionally the FRQ column is intended to
show the minor/effect allele frequency (MAF) but sometimes the major
allele frequency can be inferred as the FRQ column. This logical
variable indicates that the FRQ column should be renamed to
MAJOR_ALLELE_FRQ if the frequency values appear to relate to the
major allele i.e. \>0.5. By default mapping won't occur i.e. is
TRUE.
- **indels** Binary does your Sumstats file contain Indels? These
don't exist in our reference file so they will be excluded from
checks if this value is TRUE. Further information -the reference
dataset we use in MSS (dbSNP) does not include indels so any
checks like is the SNP on the reference genome, attempts to impute
any missing data for indels or check the direction of the effect
columns can not be done for indels. Indels will be kept in the
dataset if possible but certain situations (like if there is missing
data) can cause an indel to be removed. See the printed information by
MSS during your run to know if this affects you. Default is TRUE.
- **drop_indels** Binary should any indels found in the sumstats be
dropped? These can not be checked against a reference dataset and will have
the same RS ID and position as SNPs which can affect downstream analysis.
Default is False.
- **drop_na_cols** A character vector of column names to be checked for
missing values. Rows with missing values in any of these columns (if present
in the dataset) will be dropped. If `NULL`, all columns will be checked for
missing values. Default columns are SNP, chromosome, position, allele 1,
allele 2, effect columns (frequency, beta, Z-score, standard error,
log odds, signed sumstats, odds ratio), p value and N columns.
- **dbSNP** The dbSNP version to use as a reference - defaults to the most
recent version available (155). Note that with the 9x more SNPs in dbSNP
155 vs 144, run times will increase.
- **sort_coordinates** Whether to sort by coordinates of resulting
sumstats.\
- **nThread** Number of threads to use for parallel processes.
- **write_vcf** Whether to write as VCF (TRUE) or tabular file
(FALSE). While **tabix_index** is a binary input for whether to
index the formatted summary statistics with
[tabix](http://www.htslib.org/doc/tabix.html) for fast querying.
- **return_data** Return `data.table`, `GRanges` or `VRanges`directly
to user. Otherwise, return the path to the save data. Default is
FALSE.
- **return_format** If return_data is TRUE. Object type to be returned
("data.table","vranges","granges").
- **save_format** Ensure that output format meets all requirements to
be passed directly into LDSC ("ldsc") without the need for additional
munging or for IEU OpenGWAS format ("opengwas") before saving as a VCF.
**NOTE** - If LDSC format is used, the naming convention of A1 as the
reference (genome build) allele and A2 as the effect allele will be reversed
to match LDSC (A1 will now be the effect allele). See more info on this
[here](https://groups.google.com/g/ldsc_users/c/S7FZK743w68). Note that any
effect columns (e.g. Z) will be inrelation to A1 now instead of A2.
- **log_folder_ind** Should log files be stored containing all
filtered out SNPs (separate file per filter). The data is outputted
in the same format specified for the resulting sumstats file.
- **log_mungesumstats_msgs** Binary Should a log be stored containing
all messages and errors printed by MungeSumstats in a run.
- **imputation_ind** Binary Should a column be added for each
imputation step to show what SNPs have imputed values for differing
fields. This includes a field denoting SNP allele flipping
(flipped). On the flipped value, this denoted whether the alelles
where switched based on MungeSumstats initial choice of A1, A2 from
the input column headers and thus may not align with what the
creator intended.**Note** these columns will be in the formatted
summary statistics returned.
- **log_folder** File path to the directory for the log files and the
log of MungeSumstats messages to be stored. Default is a temporary
directory.
- **force_new** If a formatted file of the same names as
\code{save_path} exists, formatting will be skipped and this file
will be imported instead (default). Set \code{force_new=TRUE} to
override this.
- **mapping_file** MungeSumstats has a pre-defined column-name mapping
file which should cover the most common column headers and their
interpretations. However, if a column header that is in youf file is
missing of the mapping we give is incorrect you can supply your own
mapping file. Must be a 2 column dataframe with column names
"Uncorrected" and "Corrected". See `data(sumstatsColHeaders)` for
default mapping and necessary format.
See `?MungeSumstats::format_sumstats()` for the full list of parameters
to control MungeSumstats QC and standardisation steps.
VCF files can also be standardised to the same format as other summary
statistic files. A subset of the Amyotrophic lateral sclerosis GWAS from
the ieu open GWAS project (a .vcf file) has been added to
*MungeSumstats* to demonstrate this functionality.Simply pass the path
to the file in the same manner you would for other summary statistic
files:
```{r, message=TRUE}
#save ALS GWAS from the ieu open GWAS project to a temp directory
ALSvcfPth <- system.file("extdata","ALSvcf.vcf", package="MungeSumstats")
```
```{r,eval=FALSE}
reformatted_vcf <-
MungeSumstats::format_sumstats(path=ALSvcfPth,
ref_genome="GRCh37")
```
You can also get more information on the SNPs which have had data
imputed or have been filtered out by *MungeSumstats* by using the
`imputation_ind` and `log_folder_ind` parameters respectively. For
example:
```{r, eval=FALSE, message=FALSE}
#set
reformatted_vcf_2 <-
MungeSumstats::format_sumstats(path=ALSvcfPth,
ref_genome="GRCh37",
log_folder_ind=TRUE,
imputation_ind=TRUE,
log_mungesumstats_msgs=TRUE)
```
```{r,echo=FALSE,message=FALSE}
#don't run time intensive checks
reformatted_vcf_2 <-
MungeSumstats::format_sumstats(path=ALSvcfPth,
ref_genome="GRCh37",
log_folder_ind=TRUE,
imputation_ind=TRUE,
log_mungesumstats_msgs=TRUE,
on_ref_genome = FALSE,
strand_ambig_filter = FALSE,
bi_allelic_filter = FALSE,
allele_flip_check = FALSE)
```
Check the file `snp_bi_allelic.tsv.gz` in the `log_folder` directory you supply
(by default a temp directory), for a list of SNPs removed as they are
non-bi-allelic. The text files containing the console output and messages are
also stored in the same directory.
Note you can also control the dbSNP version used as a reference dataset by
MungeSumstats using the `dbSNP` parameter. By default this will be set to the
most recent dbSNP version available (155).
Note that using `log_folder_ind` returns a list from `format_sumstats`
which includes the file locations of the differing classes of removed
SNPs. Using `log_mungesumstats_msgs` saves the messages to the console
to a file which is returned in the same list. Note that not all the
messages will also print to screen anymore when you set
`log_mungesumstats_msgs`:
```{r, message=TRUE}
names(reformatted_vcf_2)
```
A user can load a file to view the excluded SNPs.
In this case, SNPs were filtered based on non-bi-allelic criterion:
```{r, message=TRUE}
print(reformatted_vcf_2$log_files$snp_bi_allelic)
```
The different types of exclusion which lead to the names are explained
below:
- **snp_multi_rs_one_row** - Where the SNP (RS ID) contained more than
one RS ID.
- **snp_missing_rs** - Where the SNP (RS ID) was missing the rs
prefix. Note that these are only removed when other snps have an rs
prefix.
- **snp_multi_colon** - Where the SNP ID has mutliple colons (":") in
one SNP.
- **snp_not_found_from_bp_chr** - Where the RS ID was attempted to be
imputed from the CHR and BP (Base-Pair) information, using the
reference genome, but wasn't successful.
- **chr_bp_not_found_from_snp** - Where the CHR and BP (Base-Pair) was
attempted to be imputed from the SNP (RS ID), using the reference
genome, but wasn't successful.
- **alleles_not_found_from_snp** - Where the alleles (A1 and/or A2)
was attempted to be imputed from the SNP (RS ID), using the
reference genome, but wasn't successful.
- **alleles_dont_match_ref_gen** - Where the alleles (A1 and/or A2)
don't match what's on the reference genome.
- **missing_data** - Where there is data missing across the inputted
columns.
- **dup_snp_id** - Where the SNP ID is duplicated in the input.
- **dup_base_pair_position** - Where the base-pair position is
duplicated in the input.
- **info_filter** - SNP INFO value below the specified threshold.
- **se_neg** - SNPs SE (Standard Error) value is 0 or negative.
- **effect_col_zero** - SNPs effect column(s) value is zero e.g.
BETA=0.
- **n_large** - SNPs N is N standard deviations greater than the mean.
- **n_null** - SNPs N is null.
- **chr_excl** - SNP has an unrecognized chromosome name or is on a
chromosome that was specified to be excluded.
- **snp_strand_ambiguous** - SNP is strand ambiguous.
- **snp_bi_allelic** - SNP is not bi-allelic.
- **MungeSumstats_log_msg** - Text file of all messages to the console
created during MungeSumstats run.
- **MungeSumstats_log_output** - Text file of all errors to the
console created during MungeSumstats run.
Note to export to another type such as R native objects including
data.table, GRanges, VRanges or save as a VCF file, set `return_data=TRUE` and
choose your `return_format`:
```{r, message=FALSE,eval=FALSE}
#set
reformatted_vcf_2 <-
MungeSumstats::format_sumstats(path=ALSvcfPth,
ref_genome="GRCh37",
log_folder_ind=TRUE,
imputation_ind=TRUE,
log_mungesumstats_msgs=TRUE,
return_data=TRUE,
return_format="GRanges")
```
Also you can now output a VCF compatible with [IEU OpenGWAS](https://gwas.mrcieu.ac.uk/)
format (Note that currently all IEU OpenGWAS sumstats are GRCh37, MungeSumstats
will throw a warning if your data isn't GRCh37 when saving):
```{r, message=FALSE,eval=FALSE}
#set
reformatted_vcf_2 <-
MungeSumstats::format_sumstats(path=ALSvcfPth,
ref_genome="GRCh37",
write_vcf=TRUE,
save_format ="openGWAS")
```
See our publication for further discussion of these checks and options:
[Murphy et al. MungeSumstats: A Bioconductor package for the
standardisation and quality control of many GWAS summary
statistics](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab665/6380562).
# Extra Functionality
## Get genome builds
*MungeSumstats* also contains a function to quickly infer the genome
build of multiple summary statistic files. This can be called separately
to `format_sumstats()` which is useful if you want to just quickly check
the genome build:
```{r, message=FALSE,eval=FALSE}
# Pass path to Educational Attainment Okbay sumstat file to a temp directory
eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
package = "MungeSumstats")
ALSvcfPth <- system.file("extdata","ALSvcf.vcf", package="MungeSumstats")
sumstats_list <- list(ss1 = eduAttainOkbayPth, ss2 = ALSvcfPth)
ref_genomes <- MungeSumstats::get_genome_builds(sumstats_list = sumstats_list)
```
## Liftover
*MungeSumstats* exposes the `liftover()` function as a general utility
for users.
Useful features include: - Automatic standardisation of genome build
names (i.e. "hg19", "hg37", and "GRCh37" will all be recognized as the
same genome build.) - Ability to specify `chrom_col` as well as both
`start_col` and `end_col` (for variants that span \>1bp). - Ability to
return in `data.table` or `GRanges` format. - Ability to specify which
chromosome format (e.g. "chr1" vs. 1) to return `GRanges` as.
```{r}
sumstats_dt <- MungeSumstats::formatted_example()
sumstats_dt_hg38 <- MungeSumstats::liftover(sumstats_dt = sumstats_dt,
ref_genome = "hg19",
convert_ref_genome = "hg38")
knitr::kable(head(sumstats_dt_hg38))
```
## Quick formatting
In some cases, users may not want to run the full munging pipeline
provided by\
`MungeSumstats::format_sumstats`, but still would like to take advantage
of the file type conversion and column header standardisation features.
This will not be nearly as robust as the full pipeline, but can still be
helpful.
### From disk
To do this, simply run the following:
```{r}
eduAttainOkbayPth <- system.file("extdata", "eduAttainOkbay.txt",
package = "MungeSumstats")
formatted_path <- tempfile(fileext = "_eduAttainOkbay_standardised.tsv.gz")
#### 1. Read in the data and standardise header names ####
dat <- MungeSumstats::read_sumstats(path = eduAttainOkbayPth,
standardise_headers = TRUE)
knitr::kable(head(dat))
#### 2. Write to disk as a compressed, tab-delimited, tabix-indexed file ####
formatted_path <- MungeSumstats::write_sumstats(sumstats_dt = dat,
save_path = formatted_path,
tabix_index = TRUE,
write_vcf = FALSE,
return_path = TRUE)
```
### From `data.table`
If you already have your data imported as an `data.table`, you can also
standardise its headers like so:
```{r}
#### Mess up some column names ####
dat_raw <- data.table::copy(dat)
data.table::setnames(dat_raw, c("SNP","CHR"), c("rsID","Seqnames"))
#### Add a non-standard column that I want to keep the casing for ####
dat_raw$Support <- runif(nrow(dat_raw))
dat2 <- MungeSumstats::standardise_header(sumstats_dt = dat_raw,
uppercase_unmapped = FALSE,
return_list = FALSE )
knitr::kable(head(dat2))
```
# Future Enhancements
The *MungeSumstats* package aims to be able to handle the most common
summary statistic file formats including VCF. If your file can not be
formatted by *MungeSumstats* feel free to report the bug on github:
along with your summary
statistic file header.
We also encourage people to edit the code to resolve their particular
issues too and are happy to incorporate these through pull requests on
github. If your summary statistic file headers are not recognised by
*MungeSumstats* but correspond to one of:
SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS,
SIGNED_SUMSTAT, N, N_CAS, N_CON, NSTUDY, INFO or FRQ
feel free to update the `MungeSumstats::sumstatsColHeaders` following
the approach in the data.R file and add your mapping. Then use a pull
request on github and we will incorporate this change into the package.
A note on `MungeSumstats::sumstatsColHeaders` for summary statistic
files with A0/A1. The mapping in `MungeSumstats::sumstatsColHeaders`
converts A0 to A\*, this is a special case so that the code knows to map
A0/A1 to A1/A2 (ref/alt). The special case is needed since ordinarily A1
refers to the reference not the alternative allele.
A note on `MungeSumstats::sumstatsColHeaders` for summary statistic
files with Effect Size (ES). By default, MSS takes BETA to be any BETA-like
value (including ES). This is coded into the mapping file -
`MungeSumstats::sumstatsColHeaders`. If this isn't the case for your sumstats,
you can set the `es_is_beta` parameter in `MungeSumstats::format_sumstats()` to
FALSE to avoid this. Note this is done to try and capture most use cases of MSS.
# Further functionality
See the [Open GWAS
vignette](https://neurogenomics.github.io/MungeSumstats/articles/OpenGWAS.html)
for how MungeSumstats can be used along with data from the MRC IEU Open
GWAS Project and also Mungesumstats' functionality to handle lists of
summary statistics files.
# Session Information
```{r, message=TRUE, echo=FALSE}
utils::sessionInfo()
```
# References