GATK Reference
About GATK Reference
The GATK Reference is a comprehensive command-line cheat sheet for the Genome Analysis Toolkit (GATK), the industry-standard software suite for variant discovery in high-throughput sequencing data. It covers all essential GATK tools organized into seven categories: Variant Calling (HaplotypeCaller, Mutect2, FilterMutectCalls, GenotypeGVCFs, GenomicsDBImport), BQSR (BaseRecalibrator, ApplyBQSR, AnalyzeCovariates), Preprocessing (MarkDuplicates, MarkDuplicatesSpark, SplitNCigarReads, AddOrReplaceReadGroups), Filtering, Utilities, Pipelines, and Common Arguments.
Each tool entry provides complete command-line syntax with all key flags, input/output file specifications, and practical usage patterns. The variant calling section covers both germline (HaplotypeCaller in GVCF mode with joint genotyping via GenomicsDBImport) and somatic (Mutect2 with tumor-normal pairs, Panel of Normals, and FilterMutectCalls) workflows. The filtering section documents both hard filtering (VariantFiltration with QD, FS, MQ, SOR thresholds) and machine-learning-based VQSR using HapMap, Omni, 1000G, and dbSNP training resources.
The reference includes three complete best-practice pipelines: Germline variant calling (BWA-MEM2 alignment, MarkDuplicates, BQSR, HaplotypeCaller GVCF, joint genotyping), Somatic variant calling (Panel of Normals creation, Mutect2, LearnReadOrientationModel, FilterMutectCalls), and RNA-seq variant calling (STAR 2-pass alignment, SplitNCigarReads, BQSR, HaplotypeCaller). VCF annotation fields (QD, FS, SOR, MQ, GT, AD, DP, GQ, PL) are also documented with clear descriptions.
Key Features
- Complete HaplotypeCaller commands for germline SNP/Indel calling in both single-sample and GVCF modes with region-specific analysis
- Mutect2 somatic variant calling with tumor-normal pair analysis, tumor-only mode, germline resource, and Panel of Normals integration
- Full BQSR workflow: BaseRecalibrator with known-sites (dbSNP, Mills, 1000G), ApplyBQSR with static quantization, and AnalyzeCovariates comparison plots
- Documents both hard filtering (VariantFiltration with QD/FS/MQ/SOR thresholds for SNPs and Indels) and VQSR with training resources and tranches
- GenomicsDBImport for consolidating multi-sample GVCFs with sample map files, followed by GenotypeGVCFs for joint genotyping
- Preprocessing tools: MarkDuplicates/MarkDuplicatesSpark, SplitNCigarReads for RNA-seq, and AddOrReplaceReadGroups with @RG tag descriptions
- Three complete best-practice pipelines with step-by-step commands: Germline, Somatic, and RNA-seq variant calling
- VCF INFO/FORMAT field reference: QD, FS, SOR, MQ, MQRankSum, ReadPosRankSum, GT, AD, DP, GQ, PL with example records
Frequently Asked Questions
What is the difference between HaplotypeCaller and Mutect2?
HaplotypeCaller is designed for germline variant calling: it detects inherited SNPs and Indels present in the sample's genome. Mutect2 is designed for somatic variant calling: it detects mutations acquired in tumor cells that are absent from matched normal tissue. HaplotypeCaller supports GVCF mode for scalable joint genotyping across many samples. Mutect2 supports tumor-normal pair analysis with a Panel of Normals, and its raw calls are then filtered by FilterMutectCalls, which can take contamination estimates produced by CalculateContamination.
What is BQSR and why is it needed?
Base Quality Score Recalibration (BQSR) corrects systematic errors in the quality scores assigned by the sequencing machine. BaseRecalibrator builds a recalibration model using known variant sites (dbSNP, Mills indels, 1000G SNPs) to learn error patterns, then ApplyBQSR writes a new BAM with corrected quality scores. This improves variant calling accuracy by ensuring quality scores accurately reflect the true probability of sequencing errors.
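The two BQSR steps can be sketched as a small Python helper that only builds the argument lists and runs nothing. The function name, file names, and default output names are hypothetical; the flags are the standard BaseRecalibrator and ApplyBQSR arguments.

```python
from typing import List

def bqsr_commands(ref: str, bam: str, known_sites: List[str],
                  table: str = "recal.table",
                  out_bam: str = "recal.bam") -> List[List[str]]:
    """Return argv lists for the two BQSR steps (nothing is executed here)."""
    # Step 1: model systematic quality-score errors, skipping known variant sites.
    base_recal = ["gatk", "BaseRecalibrator", "-R", ref, "-I", bam, "-O", table]
    for vcf in known_sites:
        base_recal += ["--known-sites", vcf]
    # Step 2: write a new BAM whose quality scores are adjusted by the model.
    apply_bqsr = ["gatk", "ApplyBQSR", "-R", ref, "-I", bam,
                  "--bqsr-recal-file", table, "-O", out_bam]
    return [base_recal, apply_bqsr]

# Hypothetical file names, for illustration only.
cmds = bqsr_commands("ref.fasta", "dedup.bam",
                     ["dbsnp.vcf.gz", "mills.vcf.gz"])
```

Each list can be passed to subprocess.run(cmd, check=True) in order; the original BAM is left unchanged.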
When should I use hard filtering vs VQSR?
VQSR (Variant Quality Score Recalibration) is preferred for whole-genome and large whole-exome datasets because it uses machine learning with known variant training resources (HapMap, Omni, 1000G, dbSNP) to learn the profile of true variants. Hard filtering with VariantFiltration is recommended when you have too few variants for VQSR to build a reliable model, such as small targeted panels or single-gene studies. The reference includes both SNP and Indel hard filter thresholds.
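As a sketch of the hard-filtering route, the helper below assembles a VariantFiltration command from a table of named filter expressions. The threshold values are an assumption: they are the commonly cited GATK starting points for SNPs, not values taken from this page, and should be tuned for your data. The helper and filter names are hypothetical.

```python
from typing import Dict, List

# Assumed SNP thresholds (commonly cited starting points, not from this page).
SNP_FILTERS: Dict[str, str] = {
    "QD2": "QD < 2.0",
    "FS60": "FS > 60.0",
    "MQ40": "MQ < 40.0",
    "SOR3": "SOR > 3.0",
}

def variant_filtration_cmd(in_vcf: str, out_vcf: str,
                           filters: Dict[str, str] = SNP_FILTERS) -> List[str]:
    """Build a VariantFiltration argv with one named filter per expression."""
    cmd = ["gatk", "VariantFiltration", "-V", in_vcf, "-O", out_vcf]
    for name, expression in filters.items():
        # A record matching the expression gets `name` written to its FILTER column.
        cmd += ["--filter-expression", expression, "--filter-name", name]
    return cmd

cmd = variant_filtration_cmd("snps.vcf.gz", "snps.filtered.vcf.gz")
```

Using one named filter per annotation, rather than a single combined expression, makes it visible in the output which threshold a record failed.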
How does the GVCF joint genotyping workflow work?
First, HaplotypeCaller runs on each sample individually in GVCF mode (-ERC GVCF), producing per-sample GVCF files that include reference confidence blocks. Then GenomicsDBImport consolidates all GVCFs into a GenomicsDB workspace over the intervals given with -L. Finally, GenotypeGVCFs performs joint genotyping across all samples simultaneously. This approach is scalable because a new sample only needs its own HaplotypeCaller run before it is added to the cohort; existing samples are not re-called, and only the joint genotyping step is repeated.
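The three stages can be sketched as a helper that returns the commands in execution order without running them. The function name, sample names, interval, and output paths are hypothetical.

```python
from typing import Dict, List

def joint_genotyping_commands(ref: str, sample_bams: Dict[str, str],
                              interval: str,
                              workspace: str = "cohort_db",
                              out_vcf: str = "cohort.vcf.gz") -> List[List[str]]:
    """Return argv lists: per-sample GVCF calls, the import, then joint genotyping."""
    commands, gvcfs = [], []
    # Stage 1: one HaplotypeCaller run per sample in GVCF mode.
    for sample, bam in sample_bams.items():
        gvcf = f"{sample}.g.vcf.gz"
        commands.append(["gatk", "HaplotypeCaller", "-R", ref, "-I", bam,
                         "-O", gvcf, "-ERC", "GVCF"])
        gvcfs.append(gvcf)
    # Stage 2: consolidate the GVCFs into a GenomicsDB workspace.
    importer = ["gatk", "GenomicsDBImport",
                "--genomicsdb-workspace-path", workspace, "-L", interval]
    for gvcf in gvcfs:
        importer += ["-V", gvcf]
    commands.append(importer)
    # Stage 3: joint genotyping reads the workspace through a gendb:// URI.
    commands.append(["gatk", "GenotypeGVCFs", "-R", ref,
                     "-V", f"gendb://{workspace}", "-O", out_vcf])
    return commands

cmds = joint_genotyping_commands(
    "ref.fasta", {"s1": "s1.bam", "s2": "s2.bam"}, "chr20")
```

For larger cohorts, listing every GVCF with -V becomes unwieldy; a sample map file passed to GenomicsDBImport is the usual alternative.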
What are the key VCF annotation fields I should understand?
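Key INFO fields include QD (variant quality normalized by depth), FS (Phred-scaled strand bias from Fisher's exact test), SOR (strand odds ratio, another strand bias measure), MQ (root-mean-square mapping quality), and MQRankSum and ReadPosRankSum (rank sum tests comparing reference-supporting and alternate-supporting reads by mapping quality and by position in the read). Key FORMAT fields include GT (genotype: 0/0, 0/1, 1/1), AD (allelic depth for ref and alt), DP (read depth), GQ (genotype quality), and PL (Phred-scaled genotype likelihoods). The INFO annotations are the inputs to both hard filtering and VQSR.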
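To make the FORMAT fields concrete, here is a minimal Python sketch that pairs a FORMAT column with one sample column and converts the numeric fields. The record is invented for illustration, and the parser is an assumption-laden toy: it does not handle missing values ("."), and a VCF library would be the better choice for real files.

```python
from typing import Dict, List, Union

Field = Union[str, int, List[int]]

def parse_sample_fields(format_col: str, sample_col: str) -> Dict[str, Field]:
    """Pair FORMAT keys with one sample's values and convert numeric fields."""
    raw = dict(zip(format_col.split(":"), sample_col.split(":")))
    parsed: Dict[str, Field] = {}
    for key, value in raw.items():
        if key in ("AD", "PL"):      # comma-separated integer lists
            parsed[key] = [int(v) for v in value.split(",")]
        elif key in ("DP", "GQ"):    # single integers
            parsed[key] = int(value)
        else:                        # GT and any other key stay as text
            parsed[key] = value
    return parsed

# Hypothetical heterozygous call at a biallelic site.
call = parse_sample_fields("GT:AD:DP:GQ:PL", "0/1:12,10:22:99:255,0,300")

# Fraction of reads supporting the alt allele (AD is ordered ref, alt).
alt_fraction = call["AD"][1] / sum(call["AD"])

# PL is ordered 0/0, 0/1, 1/1 at a biallelic site; the most likely genotype has PL 0.
best_genotype_index = call["PL"].index(min(call["PL"]))
```

In this made-up record the lowest PL sits at index 1, which matches the 0/1 genotype in GT.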
How do I call variants from RNA-seq data with GATK?
The RNA-seq variant calling pipeline follows the DNA-seq workflow with two preprocessing changes: STAR 2-pass alignment replaces BWA-MEM2, and SplitNCigarReads is added. The steps are: 1) Align with STAR in 2-pass mode, 2) Run MarkDuplicates, 3) Run SplitNCigarReads to handle splice junction reads with N in the CIGAR string, 4) Perform BQSR, 5) Run HaplotypeCaller. SplitNCigarReads is the critical RNA-seq-specific step: it splits each read that spans an intron into separate reads at the splice junction, so that the intronic gap is not mistaken for sequence evidence.
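The two steps between alignment and BQSR can be sketched as follows. The helper only builds argument lists; its name and the file names are hypothetical, and the STAR alignment is assumed to have already produced the input BAM.

```python
from typing import List

def rnaseq_preprocessing_commands(ref: str, aligned_bam: str,
                                  dedup_bam: str = "dedup.bam",
                                  metrics: str = "dedup.metrics.txt",
                                  split_bam: str = "split.bam") -> List[List[str]]:
    """Return argv lists for MarkDuplicates followed by SplitNCigarReads."""
    # Mark duplicate reads in the STAR-aligned BAM and write a metrics file.
    mark_duplicates = ["gatk", "MarkDuplicates", "-I", aligned_bam,
                       "-O", dedup_bam, "-M", metrics]
    # Split reads whose CIGAR contains N (reads that span a splice junction).
    split_reads = ["gatk", "SplitNCigarReads", "-R", ref,
                   "-I", dedup_bam, "-O", split_bam]
    return [mark_duplicates, split_reads]

cmds = rnaseq_preprocessing_commands("ref.fasta", "star.aligned.bam")
```

BQSR and HaplotypeCaller then take the split BAM as their input.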
What is a Panel of Normals and when is it used?
A Panel of Normals (PoN) is a VCF created from Mutect2 calls on multiple normal (non-tumor) samples. It captures recurrent technical artifacts and germline variants. During somatic calling, Mutect2 uses the PoN (--panel-of-normals) to filter out these artifacts, improving the specificity of true somatic variant detection. CreateSomaticPanelOfNormals merges the individual normal VCFs into the final PoN.
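A rough sketch of where the PoN enters the somatic workflow: Mutect2 is first run on each normal sample, and the finished panel is later passed to the tumor-normal call, whose output goes to FilterMutectCalls. The helper names, sample names, and file names are hypothetical. The merge step with CreateSomaticPanelOfNormals is left out because the input form it expects depends on the GATK version.

```python
from typing import Dict, List

def normal_calls_for_pon(ref: str, normal_bams: Dict[str, str]) -> List[List[str]]:
    """One Mutect2 run per normal sample; the outputs feed PoN creation."""
    return [["gatk", "Mutect2", "-R", ref, "-I", bam,
             "--max-mnp-distance", "0", "-O", f"{name}.vcf.gz"]
            for name, bam in normal_bams.items()]

def tumor_normal_commands(ref: str, tumor_bam: str, normal_bam: str,
                          normal_sample: str, pon: str,
                          unfiltered: str = "unfiltered.vcf.gz",
                          filtered: str = "filtered.vcf.gz") -> List[List[str]]:
    """Somatic calling against a matched normal and a PoN, then filtering."""
    # -normal names the sample (from the BAM read group) treated as the matched normal.
    call = ["gatk", "Mutect2", "-R", ref, "-I", tumor_bam, "-I", normal_bam,
            "-normal", normal_sample, "--panel-of-normals", pon,
            "-O", unfiltered]
    # FilterMutectCalls marks the raw calls as passing or failing.
    filt = ["gatk", "FilterMutectCalls", "-R", ref,
            "-V", unfiltered, "-O", filtered]
    return [call, filt]

pon_inputs = normal_calls_for_pon("ref.fasta", {"n1": "n1.bam", "n2": "n2.bam"})
somatic = tumor_normal_commands("ref.fasta", "tumor.bam", "normal.bam",
                                "NORMAL_SM", "pon.vcf.gz")
```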
What memory and threading options are available in GATK?
Use --java-options "-Xmx16g" to set the Java heap size (adjust based on your data size and available RAM); the option goes between gatk and the tool name. For threading, HaplotypeCaller supports --native-pair-hmm-threads for PairHMM parallelization. MarkDuplicatesSpark uses Spark-based parallelism, configured by passing Spark properties such as spark.executor.cores through --conf. The --tmp-dir flag specifies a temporary directory for intermediate files, useful when the default /tmp has limited space.
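These options can be combined as in the sketch below, which builds a HaplotypeCaller command and renders it as a shell line. The helper name, default values, and paths are hypothetical.

```python
import shlex
from typing import List, Optional

def haplotypecaller_cmd(ref: str, bam: str, out_vcf: str,
                        heap_gb: int = 16, hmm_threads: int = 4,
                        tmp_dir: Optional[str] = None) -> List[str]:
    """Build a HaplotypeCaller argv with heap, PairHMM thread, and tmp-dir settings."""
    # --java-options is placed before the tool name.
    cmd = ["gatk", "--java-options", f"-Xmx{heap_gb}g", "HaplotypeCaller",
           "-R", ref, "-I", bam, "-O", out_vcf,
           "--native-pair-hmm-threads", str(hmm_threads)]
    if tmp_dir is not None:
        cmd += ["--tmp-dir", tmp_dir]
    return cmd

cmd = haplotypecaller_cmd("ref.fasta", "recal.bam", "out.vcf.gz",
                          tmp_dir="/scratch/tmp")
# shlex.join renders the list as one shell-safe command line for logging or copying.
shell_line = shlex.join(cmd)
```

When the list is passed to subprocess.run directly, the heap setting needs no extra quoting because each element is handed over as a single argument.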