liminfo

snpEff/VEP Reference

About snpEff/VEP Reference

The snpEff/VEP Reference is a searchable guide to genomic variant annotation and effect prediction tools used in bioinformatics and clinical genomics. It covers snpEff commands for annotating VCF files against genome databases such as GRCh38.105, and it explains the ANN field, whose pipe-delimited subfields include variant type, impact level, gene name, transcript ID, and HGVS notation for both coding DNA (c.) and protein (p.) changes.
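The pipe-delimited layout of the ANN field can be illustrated with a small parser. This is a minimal sketch, not part of snpEff itself: the subfield names follow the standard ANN format order, while `parse_ann` and the example value (gene and transcript IDs included) are made up for illustration.

```python
# Subfield names in the order used by the ANN annotation format.
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank",
    "HGVS.c", "HGVS.p", "cDNA.pos/cDNA.length", "CDS.pos/CDS.length",
    "AA.pos/AA.length", "Distance", "ERRORS/WARNINGS/INFO",
]


def parse_ann(ann_value):
    """Split the value of an ANN INFO entry into one dict per annotation."""
    records = []
    for entry in ann_value.split(","):  # annotations are comma-separated
        parts = entry.split("|")
        # Pad short entries so every subfield name is present in the dict.
        parts += [""] * (len(ANN_FIELDS) - len(parts))
        records.append(dict(zip(ANN_FIELDS, parts)))
    return records


# Hypothetical ANN value, invented for this example.
example = (
    "T|missense_variant|MODERATE|GENE1|ENSG00000000001|transcript|"
    "ENST00000000001|protein_coding|2/5|c.100C>T|p.Arg34Cys|"
    "150/1200|100/900|34/299||"
)
first = parse_ann(example)[0]
print(first["Annotation"], first["Annotation_Impact"], first["HGVS.p"])
```

A dictionary per annotation keeps downstream code readable (`first["HGVS.p"]` rather than a positional index), at the cost of assuming the subfield order does not change.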

This reference provides detailed breakdowns of variant effect classifications across the four impact tiers: HIGH (frameshift_variant, stop_gained, splice_donor_variant), MODERATE (missense_variant, inframe_deletion), LOW (synonymous_variant, splice_region_variant), and MODIFIER (intron_variant, intergenic_region). It also covers functional prediction scores including SIFT (damaging below 0.05), PolyPhen-2 with both HumDiv and HumVar models, and integrated pathogenicity scores like CADD (PHRED-scaled), REVEL (ensemble of 13 tools), and DANN (deep learning-based).

For practical workflow support, the reference includes Ensembl VEP (Variant Effect Predictor) commands with SIFT and PolyPhen predictions and plugin integration for scores such as CADD, the --pick option for reporting a single consequence per variant, VEP REST API usage, and the consequence severity hierarchy. It also covers SnpSift commands for filtering annotated VCFs by impact and allele frequency, extracting specific fields to tab-delimited output, annotating with dbSNP/dbNSFP, and integrating ClinVar clinical significance data.
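As a rough illustration of VEP REST API usage, the sketch below builds a request for the Ensembl `vep/:species/hgvs/:hgvs_notation` endpoint. The helper names `build_vep_hgvs_url` and `fetch_vep_hgvs` are hypothetical, the HGVS string is only an example, and the network call is defined but not executed here; check the Ensembl REST documentation for current parameters and rate limits.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def build_vep_hgvs_url(hgvs, species="human"):
    """Return the VEP REST URL for one HGVS notation (percent-encoded)."""
    return f"{ENSEMBL_REST}/vep/{species}/hgvs/{quote(hgvs, safe='')}"


def fetch_vep_hgvs(hgvs, species="human"):
    """Query the endpoint and return the decoded JSON (requires network)."""
    request = Request(
        build_vep_hgvs_url(hgvs, species),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return json.load(response)


print(build_vep_hgvs_url("ENST00000366667:c.803C>T"))
```

Separating URL construction from the request makes the encoding step easy to check offline before any call is made against the public server.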

Key Features

  • snpEff annotation commands with genome database management (download, list, annotate) and statistics generation
  • ANN field structure decoder explaining all pipe-delimited subfields in snpEff VCF output
  • Four-tier impact classification (HIGH/MODERATE/LOW/MODIFIER) with complete variant effect type listings
  • SIFT and PolyPhen-2 score interpretation with HumDiv and HumVar model thresholds
  • CADD PHRED-scaled scoring guide with percentile cutoffs (PHRED 20 = top 1%, PHRED 25 = top 0.3%)
  • Ensembl VEP command reference including plugin integration, --pick options, REST API, and consequence hierarchy
  • SnpSift filter expressions for impact-based and frequency-based VCF filtering with field extraction
  • ClinVar and dbNSFP integration commands for adding clinical significance and functional prediction annotations

Frequently Asked Questions

What is the difference between snpEff and VEP?

Both snpEff and VEP (Variant Effect Predictor) annotate genomic variants with predicted functional effects, but they differ in implementation. snpEff is a Java-based standalone tool that uses its own gene model databases and writes annotations to the ANN field of VCF files. VEP is a Perl-based tool maintained by Ensembl; it reports SIFT and PolyPhen predictions, supports plugins for additional scores such as CADD, and offers both command-line and REST API interfaces. Some pipelines run both tools and compare the results, since the two can disagree on the transcript or consequence they report.

How do I interpret the snpEff impact levels?

HIGH impact variants (frameshift_variant, stop_gained, splice_donor_variant) are likely to truncate the protein or cause loss of function. MODERATE variants (missense_variant, inframe_deletion) change the protein sequence but may or may not affect function. LOW impact variants (synonymous_variant, splice_region_variant) are unlikely to change protein behavior. MODIFIER variants (intron_variant, intergenic_region) lie in non-coding regions, where the functional impact is hard to predict.
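The tiers can be expressed as a lookup table. The sketch below covers only the effect types named in this answer, written with their full term names, and is not a complete list of snpEff effects. The function name `highest_impact` is hypothetical; the "&" handling reflects how snpEff joins multiple effects for one annotation.

```python
# Effect -> tier, limited to the examples given in this guide.
IMPACT_OF_EFFECT = {
    "frameshift_variant": "HIGH",
    "stop_gained": "HIGH",
    "splice_donor_variant": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_deletion": "MODERATE",
    "synonymous_variant": "LOW",
    "splice_region_variant": "LOW",
    "intron_variant": "MODIFIER",
    "intergenic_region": "MODIFIER",
}

# Tiers from most to least severe.
TIER_ORDER = ["HIGH", "MODERATE", "LOW", "MODIFIER"]


def highest_impact(effects):
    """Return the most severe tier among the given effects, or None."""
    tiers = []
    for effect in effects:
        # One annotation may combine effects, e.g. "a_variant&b_variant".
        for term in effect.split("&"):
            if term in IMPACT_OF_EFFECT:
                tiers.append(IMPACT_OF_EFFECT[term])
    if not tiers:
        return None  # none of the effects is in the partial table
    return min(tiers, key=TIER_ORDER.index)


print(highest_impact(["intron_variant", "missense_variant&splice_region_variant"]))
```

In practice the ANN field already carries the impact tier, so a table like this is mainly useful for annotations that list only the effect term.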

What does a SIFT score of 0.02 mean?

SIFT scores range from 0 to 1. A score below 0.05 is classified as "Damaging," meaning the amino acid substitution is predicted to affect protein function based on sequence conservation across species. A score of 0.02 falls below that cutoff, so the substitution is predicted to be deleterious; the score reflects how poorly the change is tolerated at that position, not a probability of disease. Scores at or above 0.05 are classified as "Tolerated." SIFT4G is an improved version optimized for genome-scale analysis.

How do I use CADD scores for variant prioritization?

CADD PHRED-scaled scores rank variants by deleteriousness. A PHRED score of 10 means the variant is in the top 10% most deleterious, 20 is top 1%, 25 is top 0.3%, and 30 is top 0.1%. For clinical variant filtering, a CADD PHRED threshold of 15-20 is commonly used as a starting point, though the optimal cutoff depends on your specific analysis goals.
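The percentile figures follow from the PHRED scaling, in which a score of s corresponds to the top 10^(-s/10) fraction of ranked variants. A minimal sketch of that arithmetic, with hypothetical function names:

```python
import math


def cadd_top_fraction(phred):
    """Fraction of variants ranked at least as deleterious as this PHRED score."""
    return 10 ** (-phred / 10)


def cadd_phred_from_fraction(fraction):
    """Inverse mapping: PHRED score corresponding to a top fraction."""
    return -10 * math.log10(fraction)


for phred in (10, 20, 25, 30):
    print(f"PHRED {phred}: top {cadd_top_fraction(phred):.2%}")
```

For PHRED 25 the formula gives about 0.32%, which is the "top 0.3%" figure quoted above.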

What is the difference between PolyPhen-2 HumDiv and HumVar?

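HumDiv was trained on Mendelian disease alleles from UniProtKB versus differences between human proteins and their closely related mammalian homologs, which are assumed to be non-damaging. The PolyPhen-2 authors recommend it for evaluating rare alleles at loci potentially involved in complex phenotypes, where even mildly deleterious alleles matter. HumVar was trained on known human disease-causing mutations versus common polymorphisms, making it more suitable for Mendelian disease diagnostics, where variants with drastic effects must be separated from the rest of human variation. Commonly quoted score thresholds are >0.908 probably damaging, 0.446-0.908 possibly damaging, and <0.446 benign; some resources, such as dbNSFP, apply slightly higher cutoffs to HumDiv scores, so check the documentation of the source you annotate from.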
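The three-way classification can be sketched as a small function. The default cutoffs are the 0.908 and 0.446 values quoted above; they are exposed as parameters because the exact boundaries depend on the model and on the annotation source. `classify_polyphen` is a hypothetical helper, not part of PolyPhen-2.

```python
def classify_polyphen(score, probably_cutoff=0.908, possibly_cutoff=0.446):
    """Map a PolyPhen-2 score in [0, 1] to a prediction category."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"PolyPhen-2 scores lie in [0, 1], got {score}")
    if score > probably_cutoff:
        return "probably damaging"
    if score >= possibly_cutoff:
        return "possibly damaging"
    return "benign"


for score in (0.95, 0.60, 0.10):
    print(score, classify_polyphen(score))
```

Unlike SIFT, higher PolyPhen-2 scores indicate a more damaging prediction, so the comparison direction is reversed between the two tools.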

How do I filter snpEff-annotated VCFs with SnpSift?

Use SnpSift filter with an expression followed by the input VCF, for example: java -jar SnpSift.jar filter "(ANN[*].IMPACT = 'HIGH') & (AF < 0.01)" annotated.vcf > filtered.vcf to select high-impact variants with allele frequency below 1%. Use extractFields to pull specific columns into tab-delimited output: java -jar SnpSift.jar extractFields annotated.vcf CHROM POS REF ALT "ANN[0].GENE" "ANN[0].IMPACT" "ANN[0].HGVS_P" > fields.txt.
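The logic of that filter expression (any annotation with HIGH impact, and AF below 0.01) can be mirrored in plain Python for readers who want to see what it tests. This is a simplified sketch and not a replacement for SnpSift: it assumes AF and ANN are present in the INFO column, uses only the first AF value of multi-allelic records, and all function names and example records are hypothetical.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flags without '=' map to ''."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def keep_record(vcf_line, max_af=0.01):
    """True if any ANN entry has HIGH impact and AF is below max_af."""
    columns = vcf_line.rstrip("\n").split("\t")
    info = parse_info(columns[7])  # INFO is the eighth VCF column
    if "AF" not in info or "ANN" not in info:
        return False
    try:
        af = float(info["AF"].split(",")[0])  # assumption: first allele only
    except ValueError:
        return False  # missing value such as "."
    impacts = [
        entry.split("|")[2]  # impact is the third ANN subfield
        for entry in info["ANN"].split(",")
        if entry.count("|") >= 2
    ]
    return "HIGH" in impacts and af < max_af


def filter_vcf(lines, max_af=0.01):
    """Yield header lines unchanged plus the records that pass the filter."""
    for line in lines:
        if line.startswith("#") or keep_record(line, max_af):
            yield line


# Hypothetical records for illustration.
records = [
    "##fileformat=VCFv4.2",
    "\t".join(["1", "100", ".", "C", "T", ".", "PASS",
               "AF=0.001;ANN=T|stop_gained|HIGH|GENE1"]),
    "\t".join(["1", "200", ".", "G", "A", ".", "PASS",
               "AF=0.2;ANN=A|stop_gained|HIGH|GENE1"]),
]
for line in filter_vcf(records):
    print(line)
```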

What is the REVEL score and when should I use it?

REVEL is an ensemble pathogenicity prediction score that combines 13 individual tools (including SIFT, PolyPhen, and MutationAssessor) and is designed specifically for missense variants. It ranges from 0 to 1, and higher scores indicate a greater likelihood of pathogenicity; 0.75 is one commonly used cutoff, with lower cutoffs trading specificity for sensitivity. REVEL often outperforms individual tools because it leverages the strengths of multiple prediction methods.

How do I integrate ClinVar annotations with snpEff results?

Use SnpSift annotate to add ClinVar data: java -jar SnpSift.jar annotate clinvar.vcf.gz annotated.vcf > clinvar_annotated.vcf. This copies ClinVar INFO fields into the output, including CLNSIG with clinical significance classifications (Pathogenic, Likely_pathogenic, Benign, Likely_benign, Uncertain_significance). dbNSFP functional prediction scores are added with a separate command, SnpSift dbnsfp, which can be run before or after the ClinVar step.
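Once CLNSIG is present, it can be read from the INFO column like any other key. The sketch below is an illustration only: the helper names are hypothetical, and the assumption that multiple terms are joined with "/", "|" or "," should be checked against the ClinVar VCF release in use.

```python
import re

PATHOGENIC_TERMS = {"Pathogenic", "Likely_pathogenic"}


def clnsig_terms(info):
    """Return the CLNSIG terms found in a VCF INFO string (empty if absent)."""
    for item in info.split(";"):
        key, _, value = item.partition("=")
        if key == "CLNSIG":
            # Assumption: several terms may be joined by "/", "|" or ",".
            return [term for term in re.split(r"[/|,]", value) if term]
    return []


def is_reported_pathogenic(info):
    """True if any CLNSIG term is Pathogenic or Likely_pathogenic."""
    return any(term in PATHOGENIC_TERMS for term in clnsig_terms(info))


# Hypothetical INFO strings for illustration.
print(is_reported_pathogenic("AF=0.001;CLNSIG=Pathogenic/Likely_pathogenic"))
print(is_reported_pathogenic("AF=0.2;CLNSIG=Benign"))
```

Matching whole terms rather than substrings avoids counting values such as Conflicting_interpretations_of_pathogenicity as pathogenic.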