Welcome to jass_preprocessing’s documentation!

What is JASS preprocessing?

JASS preprocessing is a command line tool that takes heterogeneous GWAS summary statistics as input and performs standardization and quality checks. It outputs standardized summary statistic files that can be used as input for the JASS Python package and the RAISS imputation package.

Overview

The QC and preprocessing steps proceed as follows:

Map the columns of a heterogeneous GWAS input file to standardized names
Select the GWAS SNPs that are in the input reference panel
Align the coded allele of the GWAS data with the reference panel
Infer the per-SNP sample size if it is not present in the input data (from the MAF, standard deviation and genetic effect)
Filter out SNPs with a heterogeneous sample size (the JASS and RAISS packages assume the sample size to be constant across SNPs)
Normalize the effect size to Z-scores
Save the output by chromosome, as in the following example:

rsID	pos	A0	A1	Z
rs6548219	30762	A	G	-1.133

(Optional step) Save the output to one file with a chromosome column (the input format needed to perform LD-score regression). The additional output columns are P, the p-value, and N_effective, the effective sample size estimated or retrieved from the data. The effective sample size is the total sample size for a continuous trait and 1 / (1/Ncases + 1/Ncontrol) for a binary trait.

chrom	rsID	pos	A0	A1	Z	P	N_effective
1	rs4075116	1003629	C	T	0.3	0.76	10220.98
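The definition of N_effective given above can be written as a small helper. This is a minimal sketch of the formula only; the function name and its parameters are hypothetical and are not part of the jass_preprocessing API.

```python
from typing import Optional


def effective_sample_size(n_total: Optional[float] = None,
                          n_cases: Optional[float] = None,
                          n_controls: Optional[float] = None) -> float:
    """Return N_effective as defined in the text (hypothetical helper).

    Binary trait: 1 / (1/Ncases + 1/Ncontrol).
    Continuous trait: the total sample size.
    """
    if n_cases is not None and n_controls is not None:
        return 1.0 / (1.0 / n_cases + 1.0 / n_controls)
    if n_total is None:
        raise ValueError("provide n_total, or both n_cases and n_controls")
    return float(n_total)


# Continuous trait: the total sample size is returned unchanged.
print(effective_sample_size(n_total=253288))
# Binary trait with made-up case and control counts.
print(effective_sample_size(n_cases=5000, n_controls=15000))
```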

Installation

In a terminal, execute the following lines:

pip3 install git+https://gitlab.pasteur.fr/statistical-genetics/JASS_Pre-processing

Input

A reference panel in the format below. The user is expected to provide a reference panel in tsv format, without header, with the following columns in this order: chromosome, rsID, minor allele frequency, position, reference allele, alternative allele.

1	rs62635286	0.0970447	13116	T	G
1	rs63125786	0.0970447	15116	T	A
1	rs5686	0.1970447	17116	A	G
1	rs892586	0.7670447	23116	C	G
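To check that a panel file follows this layout before running the tool, the rows can be parsed with the standard library. The sketch below only illustrates the column order described above; `RefSnp` and `read_reference_panel` are hypothetical names, not functions of the package.

```python
import csv
import io
from typing import List, NamedTuple


class RefSnp(NamedTuple):
    """One reference panel row, in the column order described above."""
    chrom: str
    rsid: str
    maf: float
    pos: int
    ref: str
    alt: str


def read_reference_panel(handle) -> List[RefSnp]:
    """Parse a headerless, tab-separated reference panel (hypothetical helper)."""
    snps = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        chrom, rsid, maf, pos, ref, alt = row  # fails loudly on a wrong column count
        snps.append(RefSnp(chrom, rsid, float(maf), int(pos), ref, alt))
    return snps


# The four example rows shown above, held in memory instead of a file.
example = (
    "1\trs62635286\t0.0970447\t13116\tT\tG\n"
    "1\trs63125786\t0.0970447\t15116\tT\tA\n"
    "1\trs5686\t0.1970447\t17116\tA\tG\n"
    "1\trs892586\t0.7670447\t23116\tC\tG\n"
)
panel = read_reference_panel(io.StringIO(example))
by_rsid = {snp.rsid: snp for snp in panel}
print(by_rsid["rs5686"])
```

For a real panel, an open file handle can be passed instead of the `io.StringIO` object.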

The GWAS folder containing all raw GWAS data (corresponds to the --input-folder command line parameter): all chromosomes in one file, compressed or uncompressed
A tab-separated descriptor file (see the example below) that describes each GWAS summary statistic file (corresponds to the --gwas-info command line parameter). It contains a header and 1 line per study. The field categories are:

category	field name
path to the data	filename
study info fields	Consortium,Outcome,FullName,Nsample,Ncase,Ncontrol,Nsnp
names of the header in the GWAS file	snpid,a1,a2,freq,pval,n,z,OR,se,code,imp,ncas,ncont

Study field definition:

filename: name of the GWAS summary statistic file as it appears in the GWAS folder
Consortium: the consortium of the study (can also be the category of the trait), in upper case and without _ characters
Outcome: a short tag for the outcome of the study, in upper case and without _ characters
FullName: full description of the trait (for your own information, not used in the cleaning process)
Nsample: number of samples in the study
Ncase: number of cases in the study (left empty if the trait is continuous)
Ncontrol: number of controls in the study (left empty if the trait is continuous)

Fields corresponding to column names in the summary statistic file

snpid: name of the column storing the rsID in the GWAS file
POS: name of the column storing the position in the GWAS file
CHR: name of the column storing the chromosome in the GWAS file
a1: name of the column storing the effect allele
a2: name of the column storing the other allele
freq: name of the column storing the minor allele frequency in the GWAS file
pval: name of the column storing the p-value in the GWAS file. This column will be used to derive the Z-score of the genetic effect.
n: name of the column storing the sample size per variant (optional, will be inferred from the MAF, genetic effect and standard deviation if absent)
ncas: for binary traits, name of the column storing the number of cases per variant (optional)
ncont: for binary traits, name of the column storing the number of controls per variant (optional)
beta_or_Z: name of the column storing the genetic effect (beta) in the GWAS file. This column will be used only to retrieve the sign of the genetic effect with respect to the reference and effect alleles.
se: name of the column storing the standard error
OR: for binary traits, name of the column storing the odds ratio when available. Not to be confused with the genetic effect size or ‘beta’.
index_type: specifies the type of index used to identify SNPs (rs-number in the example below)
imputation_quality: (optional) name of the column containing the individual-based imputation quality. It will be used to filter low quality imputed data from the GWAS if the option --imputation-quality-treshold is used
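The pval and beta_or_Z fields are combined as described above: the p-value gives the magnitude of the Z-score and the genetic effect gives its sign. The sketch below assumes a two-sided p-value from a standard normal test; it illustrates the principle and is not necessarily the exact computation performed by jass_preprocessing. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def z_from_pvalue(pval: float, effect: float) -> float:
    """Convert a two-sided p-value to a Z-score signed like `effect`."""
    if not 0.0 < pval <= 1.0:
        raise ValueError("p-value must be in (0, 1]")
    # The lower-tail quantile of p/2 is -|Z|; negate it to get the magnitude.
    magnitude = -NormalDist().inv_cdf(pval / 2.0)
    # Only the sign of the effect is used, never its value.
    return math.copysign(magnitude, effect)


print(z_from_pvalue(0.05, -0.2))
print(z_from_pvalue(0.76, 0.01))
```

Working from the lower tail (p/2) rather than from 1 - p/2 avoids losing precision for the very small p-values that are common in GWAS.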

Warning

Note that the concatenation of Consortium and Outcome must be unique because it is used as an index in the cleaning process. Both Outcome and Consortium must be in upper case and contain no _ characters.

Here is an example of a descriptor file. Fields that are irrelevant for the study (for example the odds ratio for a continuous trait) must be filled with na or left empty. Some fields, such as imputation_quality, are optional; if not used, they can be filled with na. Column separators must be tabulations.

GWAS information table
filename	Consortium	Outcome	FullName	Type	Nsample	Ncase	Ncontrol	Nsnp	snpid	POS	a1	a2	freq	pval	n	beta_or_Z	OR	se	code	imp	ncas	ncont	imputation_quality	index_type
GIANT_HEIGHT_Wood_et_al.txt	GIANT	HEIGHT	Height	Anthropometry	253288	na	na	2550858	MarkerName	position	Allele1	Allele2	Freq.Allele1.HapMapCEU	p	N	b	na	SE	na	na	na	na	imputationInfo	rs-number
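The naming constraints stated in the warning above can be checked before launching the tool. The following is a minimal sketch under the assumption that the descriptor is tab-separated with the header shown above; `check_descriptor` is a hypothetical helper, and the example is trimmed to three columns with one made-up invalid row.

```python
import csv
import io
from collections import Counter
from typing import List


def check_descriptor(handle) -> List[str]:
    """Return the problems found in a tab-separated descriptor file."""
    problems = []
    seen = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        for field in ("Consortium", "Outcome"):
            value = row[field]
            if value != value.upper() or "_" in value:
                problems.append(
                    f"{row['filename']}: {field}={value!r} "
                    "must be upper case without '_'"
                )
        # The concatenation of the two fields is the index that must be unique.
        seen[row["Consortium"] + row["Outcome"]] += 1
    for key, count in seen.items():
        if count > 1:
            problems.append(f"index {key!r} appears {count} times")
    return problems


example = (
    "filename\tConsortium\tOutcome\n"
    "GIANT_HEIGHT_Wood_et_al.txt\tGIANT\tHEIGHT\n"
    "hypothetical_study.txt\tgiant\tBODY_MASS\n"  # made-up row breaking both rules
)
for problem in check_descriptor(io.StringIO(example)):
    print(problem)
```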

Command line usage example:

It is possible to launch the complete preprocessing (all the steps described in the Overview section) from the command line:

usage: jass_preprocessing [-h] --gwas-info GWAS_INFO --ref-path REF_PATH
                          --input-folder INPUT_FOLDER --diagnostic-folder
                          DIAGNOSTIC_FOLDER --output-folder OUTPUT_FOLDER
                          [--output-folder-1-file OUTPUT_FOLDER_1_FILE]
                          [--percent-sample-size PERCENT_SAMPLE_SIZE]
                          [--minimum-MAF MINIMUM_MAF] [--mask-MHC MASK_MHC]
                          [--additional-masked-region ADDITIONAL_MASKED_REGION]
                          [--imputation-quality-treshold IMPUTATION_QUALITY_TRESHOLD]
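As an illustration, a call that sets every required argument can be assembled from Python. The flags are those listed in the usage above; all paths are hypothetical placeholders, and the two optional values simply repeat the documented defaults.

```python
import shlex

# Hypothetical paths: replace them with your own locations.
command = [
    "jass_preprocessing",
    "--gwas-info", "descriptor.tsv",
    "--ref-path", "reference_panel.tsv",
    "--input-folder", "raw_gwas/",  # must end with '/'
    "--diagnostic-folder", "diagnostics/",
    "--output-folder", "clean_gwas/",
    "--percent-sample-size", "0.7",
    "--minimum-MAF", "0.01",
]

# Print the equivalent shell command line without running it.
print(shlex.join(command))
```

The same list can be passed to `subprocess.run(command, check=True)` to launch the preprocessing once the tool is installed and the paths exist.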

Named Arguments

--gwas-info

Path to the file describing the format of the individual GWAS files, with the correct header

--ref-path

Reference panel location (notably used to harmonize reference and alternative alleles across SNPs)

--input-folder

Path to the folder containing the raw GWAS summary statistic files, must end with ‘/’

--diagnostic-folder

Path to the folder where diagnostic information on the preprocessing is reported, such as the distribution of the SNP sample sizes

--output-folder

Location of the main output folder for preprocessed GWAS files (split by chromosome)

--output-folder-1-file

Optional location to store the preprocessed data in one tabular file with a chromosome column (useful to compute LDSC correlation, for instance)

--percent-sample-size

The proportion (between 0 and 1) of the 90th percentile of the sample size, used as the threshold to filter the SNPs

Default: 0.7

--minimum-MAF

Filter the reference panel by a minimum minor allele frequency

Default: “0.01”

--mask-MHC

Whether the MHC region should be masked or not

Default: “False”

--additional-masked-region

List of dictionaries containing the coordinates of the regions to mask. For example: [{'chr':6, 'start':50000000, 'end': 70000000}, {'chr':6, 'start':100000000, 'end': 120000000}]

Default: “None”

--imputation-quality-treshold

Minimum imputation quality required in the summary statistics

Default: “None”
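The sample size filter controlled by --percent-sample-size can be sketched as follows. The sketch assumes that a SNP is kept when its sample size is at least the given proportion of the 90th percentile of all sample sizes; the exact percentile interpolation and boundary handling in jass_preprocessing may differ. The function name and the rsIDs are hypothetical.

```python
from statistics import quantiles
from typing import Dict


def filter_by_sample_size(n_by_snp: Dict[str, float],
                          percent: float = 0.7) -> Dict[str, float]:
    """Keep SNPs whose sample size reaches percent * (90th percentile)."""
    # quantiles(..., n=10) returns the nine deciles; the last is the 90th percentile.
    p90 = quantiles(n_by_snp.values(), n=10, method="inclusive")[-1]
    threshold = percent * p90
    return {rsid: n for rsid, n in n_by_snp.items() if n >= threshold}


# Made-up per-SNP sample sizes: rs4 was genotyped in far fewer samples.
sample_sizes = {"rs1": 10000, "rs2": 9800, "rs3": 9900, "rs4": 4000, "rs5": 10100}
kept = filter_by_sample_size(sample_sizes, percent=0.7)
print(sorted(kept))
```

Anchoring the threshold on a high percentile rather than on the maximum makes the filter less sensitive to a few SNPs with unusually large reported sample sizes.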

Indices and tables

`map_gwas`	Map GWAS
`dna_utils`	A few functions to compute the DNA complement
`map_reference`	Module of function
`compute_score`
`save_output`
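To illustrate how a DNA complement is used when aligning GWAS alleles with the reference panel, here is a minimal sketch. It does not reproduce the API of the modules listed above: the function names are hypothetical, and it assumes that the aligned Z-score is expressed with respect to the alternative allele of the panel.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(allele: str) -> str:
    """Return the base-wise complement of an allele (A<->T, C<->G)."""
    return allele.upper().translate(_COMPLEMENT)


def align_z(z: float, effect: str, other: str,
            ref: str, alt: str) -> Optional[float]:
    """Align a Z-score on the panel alleles, or return None if they mismatch.

    Assumption: the output Z refers to the panel's alternative allele.
    """
    effect, other = effect.upper(), other.upper()
    # Try the alleles as given, then on the opposite strand.
    for a1, a2 in ((effect, other), (complement(effect), complement(other))):
        if (a1, a2) == (alt, ref):
            return z   # same orientation as the panel
        if (a1, a2) == (ref, alt):
            return -z  # alleles swapped: flip the sign
    return None        # alleles do not match the panel


print(align_z(1.5, "A", "G", ref="A", alt="G"))
print(align_z(1.5, "C", "T", ref="A", alt="G"))
```

Strand-ambiguous SNPs (A/T and C/G pairs) cannot be resolved by this comparison alone, because their complement is identical to the swapped allele pair.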
