

Automating with Snakemake

We are currently developing a Snakemake-driven workflow for automating short-read mapping and variant calling. For more information, see snpArcher on GitHub.

Table of Contents

Whole-genome Pop Gen Sequencing Overview
Experimental Design
Compute Access / Cannon
Sequence Reads
Quality Control
Preprocessing
Base Quality Score Recalibration
Variant Calling
Data Filtering
Next Steps
References

Whole-genome resequencing population genomics overview

Population genetics can be used to identify genetic variation within and between populations, and with DNA sequencing becoming less expensive, more researchers are turning to whole-genome resequencing to understand genome-wide variation. The objective of this tutorial is to familiarize users with the process of obtaining analysis-ready VCF files from population genomic whole-genome resequencing data. The tutorial is based on the GATK best practices pipeline for Germline SNP and Indel Discovery, but is geared toward non-human organisms. We also address low-coverage whole-genome resequencing data in the tutorial, as we expect this data type to be common for our users. In addition to fastq sequencing data files, it is also necessary to have a reference genome fasta file for this pipeline. If the reference exists but you don't have it in hand, you can download the fasta file from that organism's genome page on NCBI.

Experimental design

There are numerous preparation methods (e.g. Nextera, Kapa) to construct sequencing libraries from DNA extractions. These laboratory methods are beyond the scope of this tutorial. Nevertheless, we will address a few aspects of study design when designing an experiment.

1. Pooled sequencing vs. individually barcoding samples

One decision researchers need to make when designing their resequencing experiments is whether to pool unbarcoded individuals together in a single sequencing library (termed Pool-seq), or to individually barcode each individual, enabling researchers to demultiplex these individuals for downstream analyses even if they are pooled for sequencing itself. There are pros and cons to both approaches, but essentially the decision comes down to cost and research objective. Many papers have been written about the pros and cons of pooled sequencing, and Schlötterer et al. 2014 provides a nice review and comparison with other methods. Briefly, we outline some of the pros and cons below:

  • Pooled sequencing: The main advantage of this approach is cost savings on library preparation. If large sample sizes are required for the research objectives, library preparation costs can quickly become a limiting factor. By pooling large numbers of individuals from a single population, researchers only need to prepare a single sequencing library per pool. However, this method has limitations on possible downstream analyses and potential sequencing biases. This method can yield estimates of allele frequencies from a pooled population, but few statistics beyond that (e.g. haplotype information, linkage disequilibrium). Pool-seq also works best when large numbers of individuals (>40) are pooled together, with pools on the order of hundreds or thousands of individuals being ideal. One of the biggest drawbacks of Pool-seq is that unequal individual representation will produce biases in allele frequency estimates, and without barcodes it is impossible to know if this has occurred. This is less likely to happen with larger sample sizes.

  • Individually barcoded sequencing: The primary advantage of this approach is that individually barcoding reads means that variants can be called for individuals, and with sufficient coverage (see below), it is possible to obtain haplotype information or other useful statistics. As mentioned above, the main drawback of this method is the cost of library preparation. This is becoming less expensive, however, either because of new kits becoming available, or the ability to split library preparation reagents into multiple microreactions (e.g. Baym et al. 2015). Therefore, in this tutorial, we focus on methods for creating VCFs from individually barcoded samples.

2. Sample sizes

Determining how many individuals you need to sequence depends on what types of analysis you wish to conduct downstream. If the objective of the study is to describe population structure and genetic diversity, very few individuals per population are needed because whole-genome sequencing provides so much data per individual (e.g. Nazareno et al. 2017). If the goal of the study is to perform detailed demographic inference (e.g. with the site frequency spectrum via dadi or fastsimcoal2), small numbers of individuals may be sufficient for detecting older events or testing different models, but larger numbers of individuals may be necessary to detect recent events or estimate parameters (e.g. Robinson et al. 2014).

Identifying allele frequency shifts at specific sites (e.g. looking for Fst outliers) or GWAS-type analyses require larger population sizes to have enough power to detect significant differences with FDR corrections for millions of sites.

3. Sequencing depth

Ideally, to confidently call variants from whole-genome resequencing data, diploid organisms should be sequenced to 30x coverage. However, due to limited budgets and different study goals, it is often possible to sequence to much lower coverage. For many population genomic goals, given a set amount of sequencing to work with (e.g. 100x coverage of the target genome), it is often more advantageous to sequence more individuals (100 individuals to 1x coverage) in order to more accurately infer population genetic parameters (Buerkle and Gompert 2012). Because of limited budgets despite falling sequencing costs, an increasing number of tools are available to make use of low-coverage whole-genome resequencing for population genomic inference. For example, the ANGSD and NGSTools packages allow one to calculate site frequency spectra, diversity statistics, PCA, and admixture analyses, among others, based entirely on genotype likelihood scores. Other packages, like MAPGD, allow one to calculate linkage disequilibrium and relatedness using genotype likelihoods. By not actually calling genotypes, and instead inferring parameters from genotype likelihoods across individuals in a population, these programs avoid many of the biases associated with low-coverage genome data (e.g. Han et al. 2014).

4. Sequencing mode

For whole-genome resequencing studies, it is nearly always recommended to use paired-end sequencing. As genome coverage is generally a limiting factor, the cost per base is much lower for paired-end than single-end data. In addition, paired-end data generally provide better ability to map reads to reference genomes, which is highly advantageous, especially for low-coverage data.

Compute access / Cannon

This tutorial assumes that you have an account on the Cannon computer cluster, which can be requested here.

Programs like those listed below (e.g. FastQC, BWA, GATK) are run on Cannon by submitting jobs via the SLURM management system. The jobs take the form of shell scripts, which are submitted with the sbatch command. The shell scripts request computational resources (time, memory, and number of cores) for a job; it is better to request more resources than expected, rather than risk having a job terminated prematurely for exceeding its limits. When running many jobs, it is also good practice to run a small subset first to better understand the resource use for these jobs, and tailor your requests for the full set.
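For illustration, a minimal SLURM submission script might look like the sketch below (the partition name, resource requests, and file names are placeholders, not Cannon-specific recommendations):

    #!/bin/bash
    #SBATCH -J fastqc_sample1          # job name
    #SBATCH -n 1                       # number of cores
    #SBATCH -t 0-02:00                 # runtime (D-HH:MM)
    #SBATCH --mem=4000                 # memory in MB
    #SBATCH -p shared                  # partition (placeholder)
    #SBATCH -o fastqc_%j.out           # standard output file
    #SBATCH -e fastqc_%j.err           # standard error file

    module load fastqc
    fastqc sample.R1.fastq.gz

The script would then be submitted with sbatch, e.g. sbatch fastqc_job.sh.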

Running the GATK/PicardTools Pipeline on Cannon

A few notes on running GATK and PicardTools commands on Cannon.

GATK and PicardTools are built with Java, so when running the jar file (e.g. java -jar picard.jar <PicardTool>), you can include a few extra options to pass to Java that are especially applicable to running these programs on Cannon. First, you can add a memory limit to Java, for example requiring Java to use no more than 4GB of memory: -Xmx4g. This can help ensure your program does not use more memory than you requested, which would result in job termination on Cannon.

The second is: -XX:ParallelGCThreads=1. This option limits the number of Java "Garbage Collector" threads running for each task. Based on what you set this number to be (we recommend 1 or 2), you should make sure to request that number +1 for the number of cores in your submission script (e.g. -n 2). If you do not request this, Java will start using many threads, and may cause your script to fail unexpectedly.

Finally, you must be using Java version 8 (or development kit 1.8). So, be sure to load a Java module before running any commands. While there are GATK modules installed on Cannon, it is simple to download the latest versions yourself. Just be sure to download and specify the picard.jar and GenomeAnalysisTK.jar files when running commands.

An example PicardTools command with these two options is:

    module load java/1.8.0_45-fasrc01
    java -Xmx4g -XX:ParallelGCThreads=1 -jar ~/path/to/picard.jar <PicardTool> \
        I=<Input> \
        <OPTION1>=<Value1> \
        <OPTION2>=<Value2> \
        O=<Output>

Note that the options associated with the program can all go on a single line without the \ at the end, but breaking the command across lines makes a script easier to read and interpret. Options that require user input are indicated by <>. You should fill in these options with your own filenames or commands.

Sequence reads

The fastq format is a common way to store sequence reads; each read is represented by four lines:

  1. @SeqID (instrument, flowcell lane, then positioning info for where on the lane, sequence ID if multiplexed, and pair info (1/2))
  2. Sequence (ATCG)
  3. + (may have sequence identifier)
  4. Quality scores (Phred score, probability that a base call is incorrect), encoded by characters

The @SeqID can be used to glean information about the sequencing run. For example, from this header created on the Illumina HiSeq 2500 here at the Bauer Core:

    @HISEQ2500:148:C9ECWANXX:8:1101:1338:2248 1:N:0:ATGACT

Breaking down this data, we have:

    INSTRUMENT:RUN_ID:FLOWCELL_BARCODE:LANE:TILE_NUMBER:X_COORD_CLUSTER:Y_COORD_CLUSTER
    MEMBER_OF_PAIR(paired-end only):IS_READ_FILTERED(Y=yes,N=no):ARE_CONTROL_BITS_ON(0=No,even number otherwise):INDEX_SEQUENCE

Quality control

Before starting analysis, it is a good idea to assess the general quality of the raw sequence data. One fast and easy-to-use program for this is FastQC. To analyze a fastq file on Cannon:

    module load fastqc
    fastqc sample.R1.fastq.gz

This will produce an HTML file as output, which can be viewed in any web browser. It will provide a number of useful summary statistics, including graphical representations of your data. One of the most useful metrics in this context is to see how the quality of your sequence data varies along the length of the read. It is normal to see quality drop along a sequence, but large drops into the "red" zone, especially for older sequence data, may indicate that uniform trimming to exclude those low-quality bases is a good idea. See more about trimming below.

Preprocessing

Preprocessing sequencing reads is much simpler when dealing with resequencing data compared to building a genome. Most resequencing libraries will be created with large enough insert sizes that paired-end reads will not overlap, and adapter contamination is not an issue (however, if this is not the case with your data, please visit the options for adapter removal in the ATAC-seq workflow).

Note that if you do have many overlapping reads, in addition to trimming adapters you might also want to merge the reads into single fragments. If you plan to call variants with a tool like HaplotypeCaller from the GATK pipeline as we describe below, this is not necessary, as the program will account for overlapping bases and not inflate the variant quality scores. However, if the program you plan to use does not account for overlapping bases (e.g. ANGSD), you may want to use software like NGmerge in stitch mode, created by the Informatics group, to merge the reads.

    module load NGmerge
    NGmerge -1 <fastq_read_1> -2 <fastq_read_2> -o <output_fasta_name>

See the documentation for NGmerge (NGmerge -h) for additional parameter options, such as the number of threads to use.

Note that additional trimming of resequencing data is not normally necessary, as many variant callers (e.g. HaplotypeCaller) take quality scores into account. Others (e.g. ANGSD) can trim reads during the data filtering step. For that reason, we do not recommend trimming here.

Figure 1. Visual depiction of workflow

Read Group Identifiers

Read group identifiers are used to identify sequence data by sequencing technology (e.g. Illumina), flow cell, lane, sample ID, and library. Using these identifiers ensures that batch effects, or biases in the data that might have been introduced at different stages of the sequencing process, can be properly accounted for. Detailed documentation on Read Groups can be found on the GATK website.

The most common and recommended read groups are:

  • ID : Read Group Identifier
    A unique identifier for each read group. The convention for Illumina data is {FLOWCELL}.{LANE}.

  • PU: Platform Unit
    A sample/library-specific identifier, specified as {FLOWCELL_BARCODE}.{LANE}.{SAMPLE}. The flowcell barcode is a unique identifier for a flow cell, lane is the lane of that flowcell, and sample is the sample- or library-specific identifier.

  • SM: Sample
    The name of the sample represented by this read group. This will be the name used in the sample column of the VCF file.

  • PL: Platform
    The sequencing technology used to create the data. Current valid values: ILLUMINA, SOLID, LS454, HELICOS, and PACBIO.

  • LB: Data Preparation Library Identifier
    The library preparation identifier. This is used by MarkDuplicates to identify which read groups may contain molecular (e.g. PCR) duplicates.

The read group information can be found in the file header (look for @RG) and in the RG:Z tag for each sequence record. This information is not automatically added to fastq files following sequencing, but needs to be added either when mapping with BWA or separately after mapping with Picard's AddOrReplaceReadGroups tool.

If you don't know the flowcell and lane information for your data, you can derive it from the sequence headers found in a fastq file, as described in the Sequence reads section.

Mapping reads to a reference genome

Once you have your reads, you need to map them to a reference genome. There are many different aligners out there (e.g. BWA or bowtie2), but we recommend using BWA so that read group information can be added during the alignment stage, without requiring a separate step with Picard's AddOrReplaceReadGroups tool.

Before you can align your reads to a reference genome, you need to create an index. This only needs to be completed once per reference genome. BWA indexes are made from a FASTA genome file using bwa index:

    bwa index -p <genome_prefix> <reference.fasta>

The genome prefix should be a short identifier to be used as the prefix for all output files (e.g. prefix.bwt).

For most resequencing data, we want to use the bwa mem algorithm (for 70bp to 1Mbp query sequences) to map our reads. A typical command would be:

    bwa mem -M -t 1 \
        -R '@RG\tID:{FLOWCELL}.{LANE}\tPU:{FLOWCELL_BARCODE}.{LANE}.{SAMPLE}\tSM:{SAMPLE}\tPL:{PLATFORM}\tLB:{LIBRARY}' \
        <genome_prefix> <reads_1.fq> <reads_2.fq> > <samplename_bwa.sam>

There are many arguments available to use, as you can read in the manual. Some of the key arguments for our purposes are:

Argument    Description
-M          Mark shorter split hits as secondary (mandatory for Picard compatibility)
-t <int>    Number of threads (default 1)
-R <str>    Read group information (see above for description)
-p          Specifies that the fastq read 1 and read 2 files are interleaved; if only one fastq is specified and this option is not used, single-end data is assumed

The output file format from BWA is the SAM (Sequence Alignment/Map) format. This is a common file format, and detailed documentation can be found on the Samtools website. Samtools is part of a useful set of programs written to interact with high-throughput sequencing data. The details of all you can do with this program are beyond the scope of this tutorial, but it can be used to view, merge, calculate the depth of coverage, calculate other statistics, and index SAM-format files, among other things.
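As a quick sketch of typical Samtools usage (file names are placeholders; see the Samtools documentation for the full option set):

    module load samtools
    # convert SAM to BAM
    samtools view -b samplename_bwa.sam > samplename_bwa.bam
    # basic alignment statistics (mapped, properly paired, duplicates, etc.)
    samtools flagstat samplename_bwa.bam
    # per-site depth of coverage (requires a coordinate-sorted BAM; see the next section)
    samtools depth -a samplename_sorted.bam > samplename_depth.txt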

Sorting and Indexing

Following alignment, you will need to sort the SAM file. We also recommend you store these alignment files in BAM format, which is similar to SAM format but is its binary equivalent, and therefore compressed and more efficient. As mentioned above, Samtools can be used to convert among file types, but we are working within the GATK pipeline for this tutorial, and so will work within the GATK/PicardTools universe.

You can automatically sort your SAM file by coordinate position (required for downstream analyses) and output the file in BAM format with the PicardTools SortSam command.

    java -jar ~/path/to/picard.jar SortSam \
        I=samplename_bwa.sam \
        O=samplename_sorted.bam \
        SORT_ORDER=coordinate \
        CREATE_INDEX=true

To use this BAM file, you also need to create a BAM index, so that software can efficiently access the compressed file. You will notice that we automatically include index creation when sorting by specifying CREATE_INDEX=true. This is a universal option that can be applied to any PicardTools program. However, if you need to create an index separately, you can do this with the BuildBamIndex command.

    java -jar ~/path/to/picard.jar BuildBamIndex \
        I=samplename_sorted.bam

Alignment Metrics

It may also be useful to calculate metrics on the aligned sequences. We can easily do this with the CollectAlignmentSummaryMetrics tool. Note that you can collect the alignment metrics at several different levels. In the example below, we have included metrics at both the sample and read group level. You also need to include the reference fasta file.

    java -jar ~/path/to/picard.jar CollectAlignmentSummaryMetrics \
        I=samplename_sorted.bam \
        R=reference.fasta \
        METRIC_ACCUMULATION_LEVEL=SAMPLE \
        METRIC_ACCUMULATION_LEVEL=READ_GROUP \
        O=samplename.alignment_metrics.txt

Deduplication

After alignment, sorting, and indexing, it is necessary to identify any duplicate sequences from the same DNA fragment in your files that occur due to sample preparation (e.g. during PCR) or incorrect optical cluster identification during sequencing. This possibility is why it is important to identify read groups for different lanes of the same sample. This is also a useful point to merge together any BAM files from the same sample that are currently separated (demonstrated in the example below). We identify duplicate sequences with MarkDuplicates, and additional details on how this is performed can be found in the tool documentation.

Note that it is not recommended to actually remove the duplicate sequences from the file, but simply to mark the flags appropriately in the BAM file, so that those sequences are ignored downstream. If using tools other than those we recommend here, make sure they can identify these flags. These sequences can also be removed later should the need arise.

    java -jar ~/path/to/picard.jar MarkDuplicates \
        TMP_DIR=tmp \
        I=samplename_sorted_file1.bam \
        I=samplename_sorted_file2.bam \
        O=samplename.dedup.bam \
        METRICS_FILE=samplename.dedup.metrics.txt \
        REMOVE_DUPLICATES=false \
        TAGGING_POLICY=All

We also recommend creating a deduplication metrics file, which will report the proportion and type of duplicate sequences in your sample and read groups.

Following deduplication, make sure to sort and index your file, as shown in the section above.

Validating BAM files

Once you are done with the above steps, it is best practice to validate your BAM file, to make sure there were no issues or mistakes associated with the previous analyses. This is done with ValidateSamFile.

    java -jar ~/path/to/picard.jar ValidateSamFile \
        I=sample.dedup.sorted.bam \
        O=sample.validate.txt \
        MODE=SUMMARY

Base Quality Score Recalibration

The next step recommended by the GATK developers is base quality score recalibration, or BQSR. This step corrects base quality scores in the data for systematic technical errors based on a set of known true variants, such as those generated for humans in the 1000 Genomes Project. If you are working with an organism that does not yet have a truth set of variant calls (see this GATK forum post for some available resources), but you have sequenced genomes to a moderate depth of coverage (~15x), it is still possible to perform BQSR by iteratively calling variants and using the highest-scoring set as the input for BQSR. That is:

  1. Complete variant calling (see below) on the original data.
  2. Take the SNPs with the highest confidence (e.g. >15x coverage), and use that VCF as the database of known SNPs for BQSR.
  3. Perform variant calling again on the recalibrated BAM files.
  4. Repeat as needed until convergence occurs.

To run the BQSR tool, first create the recalibration table:

    ~/path/to/gatk BaseRecalibrator \
        -R reference.fasta \
        -I sample_1.dedup.sorted.bam \
        --known-sites SNPdb.vcf \
        -O sample_1_recal_data.table

Then, create the recalibrated BAM by giving the recalibration table to the ApplyBQSR tool:

    ~/path/to/gatk ApplyBQSR \
        -R reference.fasta \
        -I sample_1.dedup.sorted.bam \
        --bqsr-recal-file sample_1_recal_data.table \
        -O sample_1.recal.bam

Variant calling

There are multiple options for variant calling, including programs like FreeBayes, Samtools, and the GATK. For this tutorial, we are focusing on the HaplotypeCaller program from the GATK pipeline. Calling variants with HaplotypeCaller is essentially a two-step process (similar to indel realignment). First, you call genotypes individually for each sample. Second, you perform joint genotyping across samples to produce a multi-sample VCF call-set. The advantage of this strategy is that the most computationally intensive step, calling genotypes for each sample, only needs to be performed once, even if additional samples will be added later. The joint genotyping, which is less computationally intensive, can be performed as many times as needed as individuals are added to the dataset.

Note that even if you are not planning on using SNP calls in downstream analyses (e.g. due to low-coverage sequencing), it is possible to use the genotype likelihood scores from the resulting VCF files (discussed below), and take advantage of the active development on these variant callers.

1. Calling variants for each sample

For each sample, the HaplotypeCaller program is used to call variants. The minimum options needed are a reference genome, BAM files for that sample, and an output file name. Note that this procedure can be computationally intensive, so to speed the process up, you may wish to use a scatter-gather approach (Figure 2) and perform the variant calling on non-overlapping segments of the genome, specified with the -L scaffold_list option.

The output of this program will be a GVCF, which has raw, unfiltered SNP and indel calls for all sites, variant or invariant, unlike a typical VCF file (see below for descriptions of the VCF file format). This is specified by the --emit-ref-confidence GVCF option, with an example below. See the program page for additional parameter options.

Note, for low-coverage data, nosotros recommend changing the defaults for ii options: --min-pruning 1 and --min-dangling-branch-length i. These commands ensure that any paths in the sample graph (meet detailed documentation on the model) are only dropped if in that location is no coverage. Otherwise the defaults of two and 4 respectively, will drop depression-coverage regions. Come across the documentation for details on these and other available options.

Example command:

    ~/path/to/gatk HaplotypeCaller \
        --java-options "-Xmx4g -XX:ParallelGCThreads=1" \
        -R reference.fasta \
        -I sample_1.dedup.sorted.bam \
        -O sample_1.raw.g.vcf \
        --emit-ref-confidence GVCF
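For low-coverage data, the same command can include the two options discussed above (a sketch; adjust values and file names for your own data):

    ~/path/to/gatk HaplotypeCaller \
        --java-options "-Xmx4g -XX:ParallelGCThreads=1" \
        -R reference.fasta \
        -I sample_1.dedup.sorted.bam \
        -O sample_1.raw.g.vcf \
        --emit-ref-confidence GVCF \
        --min-pruning 1 \
        --min-dangling-branch-length 1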

Figure 2. Scatter-gather approach

2. Joint genotyping across samples

Once you have run HaplotypeCaller on your cohort of samples, the resulting GVCFs need to be combined into a single file using GenomicsDBImport before we can use them to call variants across all of our samples. If you split your genome into intervals for haplotype calling above, you can continue to run variant calling on those intervals by adding the -L scaffold_list option, but GenomicsDBImport requires that at least one interval is provided.

    ~/path/to/gatk GenomicsDBImport \
        --java-options "-Xmx4g -XX:ParallelGCThreads=1" \
        -V sample_1.raw.g.vcf \
        -V sample_2.raw.g.vcf \
        ...
        -V sample_N.raw.g.vcf \
        --genomicsdb-workspace-path my_database \
        -L scaffold_0,scaffold_1

This generates a directory called my_database that includes the combined GVCF data for the first two scaffolds of our reference sequence. The name of this directory is then given to GenotypeGVCFs to perform joint genotyping and produce a multi-sample variant call-set from your GVCF files.

    ~/path/to/gatk GenotypeGVCFs \
        --java-options "-Xmx4g -XX:ParallelGCThreads=1" \
        -R reference.fasta \
        -V gendb://my_database \
        -O samples.final.vcf \
        --tmp-dir=/path/to/large/tmp

Also, for non-human organisms, you may wish to vary the heterozygosity prior from the default value of 0.001. You can do this with the heterozygosity option, for example with a value of 0.005: --heterozygosity 0.005. If you wish to include all sites, both variant and invariant, you need to use the --include-non-variant-sites true option. See the program page for additional parameter options.
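For example, a joint-genotyping command using these options might look like the following sketch (the heterozygosity value of 0.005 is only illustrative):

    ~/path/to/gatk GenotypeGVCFs \
        --java-options "-Xmx4g -XX:ParallelGCThreads=1" \
        -R reference.fasta \
        -V gendb://my_database \
        -O samples.final.vcf \
        --heterozygosity 0.005 \
        --include-non-variant-sites true \
        --tmp-dir=/path/to/large/tmp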

Note, if you specify output file names with .gz extensions, GATK will automatically compress your output files and create an index with tabix.

VCF File Format

A standard way to store variant information for single or multiple samples is a Variant Call Format, or VCF, file. The general format is a header, with information about the dataset, references, and annotations, whose lines start with ##. Following the header, each variant is represented on a separate line with tabs separating each field. It starts with information on the chromosome (CHROM), position (POS), variant ID (ID), reference allele (REF), alternate allele (ALT), quality score (QUAL), filter designation (FILTER, e.g. PASS), annotations (INFO), individual representation format (FORMAT), and finally genotype information for each sample.

The sample-level information can vary depending on the program used, but with the GATK pipeline it is shown as GT:AD:DP:GQ:PL (an illustrative example follows the list below), where:

  • GT : Genotype
    The genotype for the sample at each site. For a diploid, 0/0 indicates homozygous for the reference allele, 0/1 indicates a heterozygous sample with one reference and one alternate allele, 1/1 indicates homozygous for the alternate allele, and ./. indicates no sequence at that site for that individual. Note that samples with different ploidies will have an appropriate number of alleles.

  • AD: Allele Depth
    The number of reads that support each allele. If diploid, ref,alt.

  • DP: Depth of Coverage
    The filtered depth of coverage at the sample level.

  • GQ: Genotype Quality
    The quality score of the genotype for that individual. This is usually measured on the Phred scale, as with the fastq file format described above. Higher values are better.

  • PL: "Normalized" Phred-scaled likelihoods of possible genotypes
    For each possible genotype (three in the case of a diploid), the normalized, Phred-scaled likelihood score, with the most likely genotype set to 0. The other values are scaled relative to this most likely genotype.
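As a purely hypothetical illustration (values invented for this example), a diploid sample entry in GT:AD:DP:GQ:PL format might look like:

    0/1:12,10:22:99:230,0,310

This would be read as: a heterozygous call (GT 0/1), 12 reads supporting the reference allele and 10 supporting the alternate allele (AD), 22 filtered reads total (DP), a genotype quality of 99 (GQ), and Phred-scaled likelihoods of 230, 0, and 310 for the 0/0, 0/1, and 1/1 genotypes, respectively (PL).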

Combining VCF files

If you have been working with the scatter-gather approach (Figure 2), an easy way to combine non-overlapping VCF files is to use the GatherVcfs tool from PicardTools.

    java -jar ~/path/to/picard.jar GatherVcfs \
        I=all_samples_interval_1.vcf \
        I=all_samples_interval_2.vcf \
        I=all_samples_interval_3.vcf \
        O=all_samples_combined.vcf

Data filtering

Once you have produced a VCF file with all of your samples, it is necessary to evaluate and filter the dataset for quality. If you are working with an organism that has a variant truth set, or set of variants that are thought to be correct, prior to any downstream analyses you should perform Variant Quality Score Recalibration (VQSR). As VQSR is a tricky process to get right (and few organisms outside of humans have appropriate truth and training datasets), the GATK has a detailed tutorial. Here, we will focus on using hard filters, with a tutorial based on the GATK recommendations.

There are a number of parameters recommended for use in hard-filtering both SNP and INDEL data. The GATK developers provide a number of recommended values, give a detailed description of why they chose these values, and explain how you might choose parameters appropriate for your dataset following these recommendations. These parameters are:

  • QD : QualByDepth
    Variant quality score divided by the depth of the alternate allele. Recommendation: SNPs: 2.0, INDELs: 2.0

  • FS: FisherStrand
    Phred-scaled p-value from Fisher's Exact Test for strand bias. Higher values are more likely to be false positive calls. Recommendation: SNPs: 60.0, INDELs: 200.0

  • MQ: RMSMappingQuality
    Root mean square of the mapping quality of reads across samples. Recommendation: SNPs: 40.0

  • MQRankSum: MappingQualityRankSumTest
    U-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities, comparing reads with reference bases and those with alternate alleles. Recommendation: SNPs: -12.5

  • ReadPosRankSum: ReadPosRankSumTest
    U-based z-approximation from the Mann-Whitney Rank Sum Test for distance from the end of reads for reads carrying an alternate allele. As bases near the ends of reads are more likely to contain errors, if all reads with the allele are near the end of the reads this may be indicative of an error. Recommendation: SNPs: -8.0, INDELs: -20.0

  • SOR: StrandOddsRatio
    High values indicate strand bias in the data. Recommendation: SNPs: 3.0, INDELs: 10.0

Note: we also recommend filtering by sequencing coverage, particularly if the reference genome is a draft genome that may contain repeat regions, etc. We recommend filtering out all sites with an average depth of coverage less than (the mean depth across all sequenced individuals / 2) or greater than (the mean sequencing depth x 2).
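One way to obtain the depth values needed for these thresholds is with VCFTools (a sketch; output prefixes are placeholders):

    # mean depth of coverage per individual (writes samples_depth.idepth)
    vcftools --vcf samples.final.vcf --depth --out samples_depth
    # mean depth per site, averaged across individuals (writes samples_depth.ldepth.mean)
    vcftools --vcf samples.final.vcf --site-mean-depth --out samples_depth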

Ideally, to decide on the best parameter values for your data, you will decide on filters appropriate for both SNPs and INDELs by:

  1. Create VCF files for only SNPs and only INDELs using SelectVariants:

    ~/path/to/gatk SelectVariants \
        -R reference.fasta \
        -V samples.final.vcf \
        --select-type-to-include SNP \
        -O raw_snps.vcf

    ~/path/to/gatk SelectVariants \
        -R reference.fasta \
        -V samples.final.vcf \
        --select-type-to-include INDEL \
        -O raw_indels.vcf

  2. Plot distributions of each parameter, and compare to the distributions here.

VCFtools can be used to extract parameter values, which can then be plotted in a program like R.

Example:

    vcftools --vcf raw_snps.vcf --out raw_snps_MQ --get-INFO MQ

  3. Apply filters to SNPs and INDELs. You can apply more than one filter here (see the example below), and assign names to each different filter using VariantFiltration. The advantage of this is that it is later possible to remove specific subsets from the data depending on which filters the sites pass. Note that the filters are defined based on JEXL.

    ~/path/to/gatk VariantFiltration \
        -R reference.fasta \
        -V samples_combined.vcf \
        --filter-expression "!vc.hasAttribute('DP')" \
        --filter-name "noCoverage" \
        --filter-expression "vc.hasAttribute('DP') && DP < MINDEPTH" \
        --filter-name "MinCov" \
        --filter-expression "vc.hasAttribute('DP') && DP > MAXDEPTH" \
        --filter-name "MaxCov" \
        --filter-expression "(vc.isSNP() && (vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum < -8.0)) || ((vc.isIndel() || vc.isMixed()) && (vc.hasAttribute('ReadPosRankSum') && ReadPosRankSum < -20.0)) || (vc.hasAttribute('QD') && QD < 2.0)" \
        --filter-name "badSeq" \
        --filter-expression "(vc.isSNP() && ((vc.hasAttribute('FS') && FS > 60.0) || (vc.hasAttribute('SOR') && SOR > 3.0))) || ((vc.isIndel() || vc.isMixed()) && ((vc.hasAttribute('FS') && FS > 200.0) || (vc.hasAttribute('SOR') && SOR > 10.0)))" \
        --filter-name "badStrand" \
        --filter-expression "vc.isSNP() && ((vc.hasAttribute('MQ') && MQ < 40.0) || (vc.hasAttribute('MQRankSum') && MQRankSum < -12.5))" \
        --filter-name "badMap" \
        -O samples_filtered.vcf

Once the variants have filter flags, you can again use SelectVariants to create hard-filtered VCF files. For example, to only include variants that pass all filters:

    ~/path/to/gatk SelectVariants \
        -R reference.fasta \
        -V samples_filtered.vcf \
        --exclude-filtered true \
        -O samples_passed_sites.vcf

Next steps

Additional Quality Checks

Once you have filtered your variants, you are ready for downstream analyses in any of your favorite programs that accept VCF files. However, we recommend that before charging full steam into downstream analyses, you conduct additional quality checks on sample and overall dataset quality. A quick PCA can inform you whether your population is structured, whether samples cluster as they should (e.g. all individuals of the same species cluster together), or whether anything weird is going on. In many population genetic analyses, samples are assumed to be unrelated individuals from a large population, so a quick relatedness test can help ensure that there are no related samples. You also may want to make sure you have similar depths of coverage across your samples, or that SNP counts generally correspond to chromosome length. Abnormalities in these statistics can help identify potential problems in the library preparation or variant calling pipeline.

PCA

There are many options for conducting a quick PCA analysis. One quick option is SNPRelate, a Bioconductor package. However, if you are working with low-coverage data and want to base your PCA on genotype likelihood scores rather than actual variant calls, ANGSD and NGSTools might be a better choice. A tutorial for running these analyses is available here. Note that to perform a PCA in ANGSD from a VCF file, you must first convert the VCF to a Beagle GL file.

There are several programs that can be used to compute relatedness between pairs of individuals, and one fairly quick program is KING. Alternatively, if you have low-coverage data, NGSRelate can compute relatedness values based on genotype likelihoods. These analyses should be run on individuals from a single population, so we recommend calculating relatedness after you have conducted a PCA.

Manipulating VCF files:

There are many software packages out there to handle VCF files, and below we mention some of the most common ones for manipulating them or calculating basic statistics.

  • VCFTools
    VCFTools is a program specifically written to incorporate utilities for dealing with VCF files, and is a bit like the swiss-army knife of VCF manipulation. Not only can it filter VCF files by site or sample, it can also calculate basic statistics like Hardy-Weinberg equilibrium, depth of coverage, LD statistics, the transition/transversion ratio, etc. We recommend getting to know VCFTools for overall VCF manipulation.

  • Tabix
    Tabix is part of the Samtools package of programs, and is most useful for indexing bgzipped VCF files. This index (.tbi) is required by many programs in order to use bgzipped VCF files as input. It can also be used for extremely fast data retrieval (e.g. to extract particular regions of the genome). A short usage sketch follows this list.

  • BedTools
    Bedtools is a useful program for working not only with VCF files, but also with other genome file formats including BAM, GFF/GTF, and BED files. Among other utilities, this program can intersect genome files (e.g. to extract coding sequences in a GFF file from a VCF file), calculate depth of coverage, identify the closest genomic element to variants, etc.
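As a quick usage sketch for bgzip/Tabix and VCFTools (file, region, and output names are placeholders):

    # compress and index a VCF, then extract a region
    bgzip samples_passed_sites.vcf
    tabix -p vcf samples_passed_sites.vcf.gz
    tabix samples_passed_sites.vcf.gz scaffold_0:1-100000 > region.vcf

    # example VCFTools statistic: Hardy-Weinberg test per site (writes samples_hwe.hwe)
    vcftools --gzvcf samples_passed_sites.vcf.gz --hardy --out samples_hwe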

References

Baym, M., Kryazhimskiy, S., Lieberman, T. D., Chung, H., Desai, M. M. & Kishony, R. 2015. Inexpensive Multiplexed Library Preparation for Megabase-Sized Genomes. PLoS ONE 10(5): e0128036.

Buerkle, C. A. & Gompert, Z. 2013. Population genomics based on low coverage sequencing: how low should we go? Molecular Ecology 22, 3028–3035.

DePristo, M. A. et al. 2011. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 491–498.

Fumagalli, M., Vieira, F. G., Linderoth, T. & Nielsen, R. 2014. ngsTools: methods for population genetics analyses from next-generation sequencing data. Bioinformatics 30, 1486–1487.

Han, E., Sinsheimer, J. S. & Novembre, J. 2014. Characterizing Bias in Population Genetic Inferences from Low-Coverage Sequencing Data. Molecular Biology and Evolution 31, 723–735.

Li, H. & Durbin, R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760.

Li, H. et al. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079.

McKenna, A. et al. 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20, 1297–1303.

Nazareno, A. G., Bemmels, J. B., Dick, C. W. & Lohmann, L. G. 2017. Minimum sample sizes for population genomics: an empirical study from an Amazonian plant species. Molecular Ecology Resources 17, 1136–1147.

Poplin, R. et al. 2017. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv 201178.

Robinson, J. D., Coffman, A. J., Hickerson, M. J. & Gutenkunst, R. N. 2014. Sampling strategies for frequency spectrum-based population genomic inference. BMC Evolutionary Biology 14, 254.

Korneliussen, T. S., Albrechtsen, A. & Nielsen, R. 2014. ANGSD: Analysis of Next Generation Sequencing Data. BMC Bioinformatics 15, 356.

Schlötterer, C., Tobler, R., Kofler, R. & Nolte, V. 2014. Sequencing pools of individuals - mining genome-wide polymorphism data without big funding. Nature Reviews Genetics 15, 749–763.


Source: https://informatics.fas.harvard.edu/whole-genome-resquencing-for-population-genomics-fastq-to-vcf.html