Welcome to jass_preprocessing’s documentation!
What is jass preprocessing?
Jass preprocessing is a command-line tool that takes heterogeneous GWAS summary statistics as input, standardizes them and applies quality checks. It outputs standardized summary statistic files that can be used as input to the JASS Python package and the RAISS imputation package.
Overview
The QC and preprocessing steps are as follows:
Map the columns of a heterogeneous GWAS input file to standardized names
Select the GWAS SNPs that are in the input reference panel
Align the coded allele of the GWAS data with the reference panel
Infer the sample size per SNP if it is not present in the input data (from the MAF, standard deviation and genetic effect)
Filter out SNPs with a heterogeneous sample size (the JASS and RAISS packages assume the sample size to be constant across SNPs)
Normalize the effect size to Z-scores
Save the output by chromosome, as in the following example:
| rsID | pos | A0 | A1 | Z |
|---|---|---|---|---|
| rs6548219 | 30762 | A | G | -1.133 |
(Optional step) Save the output to a single file with a chromosome column (the input format needed to perform LD-score regression). The additional output columns are P, the p-value, and N_effective, the effective sample size estimated or retrieved from the data. The effective sample size is the total sample size for a continuous trait and 1 / (1/Ncases + 1/Ncontrol) for a binary trait.
| chrom | rsID | pos | A0 | A1 | Z | P | N_effective |
|---|---|---|---|---|---|---|---|
| 1 | rs4075116 | 1003629 | C | T | 0.3 | 0.76 | 10220.98 |
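The definition of N_effective given above can be written as a small helper. This is a minimal sketch, not the package's own code: the function name `effective_sample_size` and its arguments are assumptions made for illustration.

```python
def effective_sample_size(n_sample=None, n_case=None, n_control=None):
    """Effective sample size as defined above (hypothetical helper)."""
    if n_case is not None and n_control is not None:
        # Binary trait: 1 / (1/Ncases + 1/Ncontrol)
        return 1.0 / (1.0 / n_case + 1.0 / n_control)
    if n_sample is None:
        raise ValueError("provide n_sample, or both n_case and n_control")
    # Continuous trait: the total sample size
    return float(n_sample)


# Continuous trait with 253288 samples
print(effective_sample_size(n_sample=253288))
# Binary trait with 5000 cases and 15000 controls (about 3750)
print(effective_sample_size(n_case=5000, n_control=15000))
```

With this definition, a strongly unbalanced case/control design has an effective sample size close to the size of its smaller group.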
Installation
In a terminal, run the following command:
pip3 install git+https://gitlab.pasteur.fr/statistical-genetics/JASS_Pre-processing
Input
A reference panel in the format below. The user is expected to provide the reference panel in TSV format, without a header, with the following columns in this order: chromosome, rsID, minor allele frequency, position, reference allele, alternative allele.
| 1 | rs62635286 | 0.0970447 | 13116 | T | G |
|---|---|---|---|---|---|
| 1 | rs63125786 | 0.0970447 | 15116 | T | A |
| 1 | rs5686 | 0.1970447 | 17116 | A | G |
| 1 | rs892586 | 0.7670447 | 23116 | C | G |
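As an illustration of the reference panel layout, the sketch below reads a headerless, tab-separated panel with the standard `csv` module. The column labels in `REF_COLUMNS` and the function name are assumptions of this example (the file itself has no header), and this is not how the package necessarily loads the panel.

```python
import csv
import io

# Labels chosen for this sketch only; the reference panel file has no header
REF_COLUMNS = ["chr", "rsID", "MAF", "pos", "ref", "alt"]


def read_reference_panel(handle):
    """Parse a headerless TSV reference panel into a list of dicts."""
    panel = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        record = dict(zip(REF_COLUMNS, row))
        record["MAF"] = float(record["MAF"])
        record["pos"] = int(record["pos"])
        panel.append(record)
    return panel


# Two rows taken from the example table above
example = (
    "1\trs62635286\t0.0970447\t13116\tT\tG\n"
    "1\trs63125786\t0.0970447\t15116\tT\tA\n"
)
panel = read_reference_panel(io.StringIO(example))
print(panel[0]["rsID"], panel[0]["pos"])
```

The chromosome is kept as a string so that non-numeric labels, if any, are not rejected.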
The GWAS folder containing all raw GWAS data (corresponds to the --input-folder command line parameter): all chromosomes in one file, compressed or uncompressed
A tab-separated descriptor file (see the example below) that describes each GWAS summary statistic file (corresponds to the --gwas-info command line parameter). It contains:
- a header
- 1 line per study
- the field categories listed below:
| category | field name |
|---|---|
| path to the data | filename |
| study info fields | Consortium, Outcome, fullName, Nsample, Ncase, Ncontrol, Nsnp |
| names of the header in the GWAS file | snpid, a1, a2, freq, pval, n, z, OR, se, code, imp, ncas, ncont |
Study field definitions:
filename: name of the GWAS summary statistic file as it appears in the GWAS folder
Consortium: the consortium of the study (can also be the category of the trait), in upper case and without _ characters
Outcome: a short tag for the outcome of the study, in upper case and without _ characters
FullName: full description of the trait (for your own information; not used in the cleaning process)
Nsample: number of samples in the study
Ncase: number of cases in the study (leave empty if the trait is continuous)
Ncontrol: number of controls in the study (leave empty if the trait is continuous)
Fields corresponding to column names in the summary statistics file:
snpid: name of the column storing the rsID in the GWAS file
POS: name of the column storing the position in the GWAS file
CHR: name of the column storing the chromosome in the GWAS file
a1: name of the column storing the effect allele
a2: name of the column storing the other allele
freq: name of the column storing the minor allele frequency in the GWAS file
pval: name of the column storing the p-value in the GWAS file. This column is used to derive the Z-score of the genetic effect.
n: name of the column storing the sample size per variant (optional; inferred from the MAF, genetic effect and standard deviation if absent)
ncas: for binary traits, name of the column storing the number of cases per variant (optional)
ncont: for binary traits, name of the column storing the number of controls per variant (optional)
beta_or_Z: name of the column storing the genetic effect (beta) in the GWAS file. This column is used only to retrieve the sign of the genetic effect with respect to the reference and effect alleles.
se: name of the column storing the standard error
OR: for binary traits, name of the column storing the odds ratio, when available. Not to be confused with the genetic effect size or 'beta'.
index_type: specifies the type of index (rs-number in the example below)
imputation_quality: (optional) name of the column containing the individual-based imputation quality. It is used to filter out low-quality imputed data from the GWAS if the option --imputation-quality-treshold is set.
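The pval and beta_or_Z descriptions above say that the p-value gives the Z-score and that the effect column only contributes its sign. One common way to do this is sketched below, assuming a two-sided p-value under a standard normal distribution. The function name is hypothetical and the package's exact computation may differ.

```python
from statistics import NormalDist


def z_from_pvalue(pval, beta):
    """Signed Z-score from a two-sided p-value and the sign of the effect."""
    if not 0.0 < pval <= 1.0:
        raise ValueError("p-value must be in (0, 1]")
    # |Z| is the standard normal quantile of the upper tail p/2
    magnitude = abs(NormalDist().inv_cdf(pval / 2.0))
    # Only the sign of beta is used; a zero effect is treated as positive here
    return -magnitude if beta < 0 else magnitude


# Roughly 0.3, in line with the example row above (Z = 0.3, P = 0.76)
print(round(z_from_pvalue(0.76, 0.02), 2))
# A negative effect gives a negative Z-score
print(z_from_pvalue(0.05, -0.2))
```

The sign returned here is relative to the effect allele of the GWAS file; aligning it with the reference panel is a separate step of the pipeline.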
Warning
Note that the concatenation of Consortium and Outcome must be unique because it is used as an index in the cleaning process. Both Outcome and Consortium must be in upper case and contain no _ characters.
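The rules in this warning can be checked before launching the preprocessing. The sketch below is a hypothetical helper, not part of the package: it assumes the descriptor rows are available as dictionaries and that Consortium and Outcome are joined with '_' to form the index (the separator is an assumption).

```python
def check_study_tags(studies):
    """Return a list of messages describing rule violations (empty if none)."""
    problems = []
    seen = set()
    for row in studies:
        for field in ("Consortium", "Outcome"):
            value = row[field]
            if "_" in value or value != value.upper():
                problems.append(f"{field} '{value}' must be upper case without '_'")
        # The '_' separator is an assumption of this sketch
        tag = row["Consortium"] + "_" + row["Outcome"]
        if tag in seen:
            problems.append(f"duplicate study index '{tag}'")
        seen.add(tag)
    return problems


studies = [
    {"Consortium": "GIANT", "Outcome": "HEIGHT"},
    {"Consortium": "GIANT", "Outcome": "BMI"},  # hypothetical second study
]
print(check_study_tags(studies))  # prints [] because both rows follow the rules
```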
Here is an example of a descriptor file (a downloadable example is also available). Fields that are irrelevant for a study (for example the odds ratio for a continuous trait) must be filled with na or left empty. Some fields, such as imputation_quality, are optional; if unused, they can be filled with na. Column separators must be tabs.
| filename | Consortium | Outcome | FullName | Type | Nsample | Ncase | Ncontrol | Nsnp | snpid | POS | a1 | a2 | freq | pval | n | beta_or_Z | OR | se | code | imp | ncas | ncont | imputation_quality | index_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GIANT_HEIGHT_Wood_et_al.txt | GIANT | HEIGHT | Height | Anthropometry | 253288 | na | na | 2550858 | MarkerName | position | Allele1 | Allele2 | Freq.Allele1.HapMapCEU | p | N | b | na | SE | na | na | na | na | imputationInfo | rs-number |
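To illustrate how a descriptor row relates a GWAS file's own column names to standardized ones, the sketch below builds a renaming map from the example row above and applies it to a header. It is an assumption of this example that the descriptor field names themselves serve as the standardized names. The helper name and the `raw_header` list are hypothetical.

```python
# Descriptor fields that name GWAS columns, taken from the example row above
descriptor_row = {
    "snpid": "MarkerName", "POS": "position", "a1": "Allele1", "a2": "Allele2",
    "freq": "Freq.Allele1.HapMapCEU", "pval": "p", "n": "N", "beta_or_Z": "b",
    "OR": "na", "se": "SE", "ncas": "na", "ncont": "na",
    "imputation_quality": "imputationInfo",
}


def build_rename_map(row):
    """Map each GWAS column name to its descriptor field, skipping unused fields."""
    return {
        column: field
        for field, column in row.items()
        if column.strip() and column.strip().lower() != "na"
    }


rename_map = build_rename_map(descriptor_row)

# Hypothetical header of a raw GWAS file
raw_header = ["MarkerName", "Allele1", "Allele2", "Freq.Allele1.HapMapCEU", "b", "SE", "p", "N"]
standard_header = [rename_map.get(name, name) for name in raw_header]
print(standard_header)
# prints ['snpid', 'a1', 'a2', 'freq', 'beta_or_Z', 'se', 'pval', 'n']
```

Fields filled with na or left empty are skipped, so they never shadow a real column.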
Command line usage example:
The complete preprocessing (all steps described in the Overview section) can be launched from the command line:
usage: jass_preprocessing [-h] --gwas-info GWAS_INFO --ref-path REF_PATH
--input-folder INPUT_FOLDER --diagnostic-folder
DIAGNOSTIC_FOLDER --output-folder OUTPUT_FOLDER
[--output-folder-1-file OUTPUT_FOLDER_1_FILE]
[--percent-sample-size PERCENT_SAMPLE_SIZE]
[--minimum-MAF MINIMUM_MAF] [--mask-MHC MASK_MHC]
[--additional-masked-region ADDITIONAL_MASKED_REGION]
[--imputation-quality-treshold IMPUTATION_QUALITY_TRESHOLD]
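As a concrete illustration of the usage line above, the sketch below assembles one possible invocation from Python. Only flags listed in the usage are used. Every path is a placeholder invented for this example, and the command is printed rather than executed.

```python
import shlex

# All paths below are placeholders, not files shipped with the package
command = [
    "jass_preprocessing",
    "--gwas-info", "descriptor.tsv",
    "--ref-path", "reference_panel.tsv",
    "--input-folder", "raw_gwas/",  # must end with '/'
    "--diagnostic-folder", "diagnostics/",
    "--output-folder", "preprocessed/",
    "--percent-sample-size", "0.7",
    "--minimum-MAF", "0.01",
]

# Show the command as it would be typed in a terminal
print(shlex.join(command))

# To actually launch it (requires the tool to be installed):
# import subprocess
# subprocess.run(command, check=True)
```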
Named Arguments
- --gwas-info
Path to the file describing the format of the individual GWAS files, with the correct header
- --ref-path
Reference panel location (notably used to harmonize the reference and alternative alleles across SNPs)
- --input-folder
Path to the folder containing the raw GWAS summary statistic files; must end with '/'
- --diagnostic-folder
Path to the folder storing diagnostic information on the preprocessing, such as the SNP sample size distribution
- --output-folder
Location of the main output folder for the preprocessed GWAS files (split by chromosome)
- --output-folder-1-file
Optional location to store the preprocessed output in one tabular file with a chromosome column (useful for computing LDSC correlations, for instance)
- --percent-sample-size
The proportion (between 0 and 1) of the 90th percentile of the sample size used to filter the SNPs
Default: 0.7
- --minimum-MAF
Filter the reference panel by minimum minor allele frequency (MAF)
Default: “0.01”
- --mask-MHC
Whether the MHC region should be masked
Default: “False”
- --additional-masked-region
List of dictionaries containing the coordinates of the regions to mask. For example: [{'chr':6, 'start':50000000, 'end': 70000000}, {'chr':6, 'start':100000000, 'end': 120000000}]
Default: “None”
- --imputation-quality-treshold
Minimum imputation quality in the summary statistics
Default: “None”
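Two of the filtering options above, --percent-sample-size and --additional-masked-region, can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: SNPs are kept when their sample size is at least the given proportion of the 90th percentile, region bounds are treated as inclusive, and the region list is parsed as a Python literal. All function names and SNP identifiers are hypothetical.

```python
import ast
from statistics import quantiles


def filter_by_sample_size(snps, proportion=0.7):
    """Keep SNPs whose N is at least proportion * (90th percentile of N)."""
    sizes = [snp["N"] for snp in snps]
    # quantiles(..., n=10) returns the 9 deciles; the last is the 90th percentile
    p90 = quantiles(sizes, n=10, method="inclusive")[-1]
    threshold = proportion * p90
    return [snp for snp in snps if snp["N"] >= threshold]


def parse_masked_regions(text):
    """Parse a region list written as a Python literal (assumed input format)."""
    return ast.literal_eval(text)


def is_masked(chrom, pos, regions):
    """True if the position falls inside one of the regions (bounds inclusive)."""
    return any(
        r["chr"] == chrom and r["start"] <= pos <= r["end"] for r in regions
    )


# Hypothetical SNPs: one with a very low sample size, the others near 10000
sizes = [1000, 9000] + [10000] * 9
snps = [{"rsID": f"snp{i}", "N": n} for i, n in enumerate(sizes)]
kept = filter_by_sample_size(snps, proportion=0.7)
print(len(snps), "->", len(kept))

regions = parse_masked_regions("[{'chr':6, 'start':50000000, 'end': 70000000}]")
print(is_masked(6, 60000000, regions), is_masked(6, 80000000, regions))
```

Using `ast.literal_eval` rather than `eval` keeps the parsing restricted to plain literals such as lists, dictionaries and numbers.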
Indices and tables
Map GWAS
A few functions to compute the DNA complement
Module of functions