DESeq2 Reference
About DESeq2 Reference
The DESeq2 Reference is a searchable guide to the DESeq2 R/Bioconductor package for RNA-Seq differential gene expression analysis. It covers data input from count matrices, HTSeq-count files, and tximport (Salmon/Kallisto), and the full analysis pipeline: size factor normalization, dispersion estimation, and Wald/LRT statistical testing.
This reference includes visualization techniques such as MA plots, volcano plots, PCA sample clustering, dispersion diagnostic plots, and heatmaps of top differentially expressed genes using pheatmap. It also covers count transformations (VST, rlog), batch effect handling in the design formula and with SVA, interaction models, time series analysis, and multiple comparison strategies.
Designed for bioinformatics researchers, genomics scientists, and graduate students performing RNA-Seq experiments, this tool provides ready-to-use R code snippets with practical examples for every step from data preparation through result interpretation and downstream gene set enrichment analysis with clusterProfiler.
Key Features
- DESeqDataSet creation from count matrices, HTSeq-count files, and tximport (Salmon/Kallisto) with design formula specification
- Complete DESeq() pipeline: estimateSizeFactors, estimateDispersions, and nbinomWaldTest with parameter details
- Log2 fold change shrinkage methods: apeglm (recommended), ashr, and normal with coef/contrast specification
- Visualization gallery: MA plot, volcano plot, PCA (plotPCA with ggplot2 customization), dispersion plot, and pheatmap heatmaps
- VST and rlog normalization comparison with guidance on when to use each (sample count threshold at n=30)
- Advanced experimental designs: batch effect correction (~batch + condition), interaction models (genotype:treatment), and time series LRT
- Result interpretation with column definitions (baseMean, log2FoldChange, padj) and gene filtering strategies
- Downstream analysis integration: gene symbol annotation with org.Hs.eg.db, enrichGO, GSEA with ranked gene lists, and CSV export
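The pipeline steps named in the feature list can be sketched as follows. This is a minimal illustration on simulated counts from makeExampleDESeqDataSet(), not output from a real experiment; the three explicit calls correspond to what DESeq(dds) runs with its default Wald test.

```r
library(DESeq2)

set.seed(1)
# Simulated dataset: 500 genes, 6 samples, condition levels A and B
dds <- makeExampleDESeqDataSet(n = 500, m = 6, betaSD = 1)

# The three steps that DESeq(dds) wraps for the default Wald test
dds <- estimateSizeFactors(dds)   # per-sample normalization factors
dds <- estimateDispersions(dds)   # gene-wise, fitted, and final dispersions
dds <- nbinomWaldTest(dds)        # coefficient estimates and Wald statistics

res <- results(dds, alpha = 0.05)
summary(res)

# Columns most often used for interpretation and filtering
head(res[order(res$padj), c("baseMean", "log2FoldChange", "padj")])
```

Here baseMean is the mean of normalized counts across samples, log2FoldChange is the estimated effect for the last design variable, and padj is the Benjamini-Hochberg adjusted p-value.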
Frequently Asked Questions
How do I create a DESeqDataSet from a count matrix?
Use DESeqDataSetFromMatrix() with three arguments: countData (a matrix of raw integer counts with genes as rows and samples as columns), colData (a data.frame of sample information such as condition and batch), and design (a formula like ~ condition whose variables are columns of colData). Ensure the count data contains raw counts, not normalized values, and that colData row names match countData column names in the same order.
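A minimal sketch of this construction, assuming a small simulated count matrix; the gene and sample names and the counts themselves are made up for illustration.

```r
library(DESeq2)

set.seed(1)
# Simulated raw counts: 100 genes (rows) x 6 samples (columns)
counts <- matrix(rnbinom(600, mu = 100, size = 2), nrow = 100,
                 dimnames = list(paste0("gene", 1:100),
                                 paste0("sample", 1:6)))
storage.mode(counts) <- "integer"

# Sample table: row names must match the count matrix columns, in order
coldata <- data.frame(
  condition = factor(rep(c("control", "treated"), each = 3),
                     levels = c("control", "treated")),
  row.names = colnames(counts)
)
stopifnot(identical(rownames(coldata), colnames(counts)))

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)
dds
```

Setting the factor levels explicitly makes "control" the reference level, so fold changes are reported as treated versus control.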
What is the difference between lfcShrink methods (apeglm, ashr, normal)?
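The apeglm method (recommended by the DESeq2 developers) applies adaptive shrinkage that pulls noisy estimates toward zero while largely preserving large, well-supported fold changes; it works only with the coef argument. The ashr method also shrinks adaptively and accepts either coef or contrast, so it handles arbitrary contrasts. The normal method is the original DESeq2 estimator and legacy default, based on a zero-centered normal prior. For routine analyses, use apeglm with coef = "condition_treated_vs_control". Check available coefficients with resultsNames(dds); if the comparison you need is not listed, relevel the factor and rerun DESeq(), or use ashr with contrast.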
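The following sketch compares the two adaptive methods on simulated data. It assumes the apeglm and ashr packages are installed; the simulated condition factor has levels A and B, so the coefficient is named condition_B_vs_A rather than condition_treated_vs_control.

```r
library(DESeq2)

set.seed(1)
# Simulated dataset with true fold changes (betaSD > 0)
dds <- makeExampleDESeqDataSet(n = 500, m = 6, betaSD = 1)
dds <- DESeq(dds)

# List the coefficients that can be passed to coef
resultsNames(dds)

# apeglm: coefficient-based shrinkage (requires the apeglm package)
res_apeglm <- lfcShrink(dds, coef = "condition_B_vs_A", type = "apeglm")

# ashr: contrast-based shrinkage (requires the ashr package)
res_ashr <- lfcShrink(dds, contrast = c("condition", "B", "A"), type = "ashr")

# Compare unshrunken and shrunken estimates for the first few genes
head(cbind(unshrunk = results(dds)$log2FoldChange,
           apeglm   = res_apeglm$log2FoldChange,
           ashr     = res_ashr$log2FoldChange))
```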
When should I use the Likelihood Ratio Test (LRT) instead of the Wald test?
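Use the LRT when testing for any difference across multiple conditions (three or more groups) rather than a specific pairwise comparison. Call DESeq(dds, test = "LRT", reduced = ~ 1), where the reduced formula is the full design with the variable of interest removed (~ 1 when the design has no other terms). The LRT is also well suited to time series analysis (design = ~ time, reduced = ~ 1) for identifying genes whose expression changes at any time point. The p-values come from the comparison of the full and reduced models, while the log2FoldChange column still reports a single pairwise comparison. The Wald test is preferred for specific two-group comparisons.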
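A minimal sketch of an LRT across three groups, using simulated counts. The group factor and its levels are hypothetical and are attached to the simulated object only for illustration.

```r
library(DESeq2)

set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 9)

# Hypothetical three-level factor replacing the simulated two-level condition
dds$group <- factor(rep(c("A", "B", "C"), each = 3))
design(dds) <- ~ group

# Full model ~ group versus reduced model ~ 1
dds <- DESeq(dds, test = "LRT", reduced = ~ 1)
res <- results(dds)

# p-values test "any difference among groups";
# log2FoldChange still refers to one pairwise comparison
mcols(res)[c("log2FoldChange", "pvalue"), "description"]
```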
Should I use VST or rlog for normalization?
Use VST (vst()) for datasets with 30 or more samples, as it is much faster. Use rlog (rlog()) for smaller datasets (fewer than 30 samples), where it tends to work well and can outperform VST when sequencing depth varies widely across samples. Both produce log2-scale values with stabilized variance, suitable for PCA, heatmaps, and clustering; neither should be used as input to DESeq(), which requires raw counts. Set blind=FALSE when the transformed values feed downstream analysis (the design is used when estimating dispersions) and blind=TRUE for QC that should not be informed by the design.
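Both transformations can be tried side by side on simulated data, as sketched below. The dataset size is an arbitrary choice; vst() fits its trend on a subset of genes, so very small gene sets may need the slower varianceStabilizingTransformation() instead.

```r
library(DESeq2)

set.seed(1)
dds <- makeExampleDESeqDataSet(n = 2000, m = 8, betaSD = 1)

# VST, blind to the design: suitable for exploratory QC
vsd <- vst(dds, blind = TRUE)

# rlog, using the design when estimating dispersions
rld <- rlog(dds, blind = FALSE)

# Transformed values live in the assay slot (genes x samples)
dim(assay(vsd))
dim(assay(rld))

# PCA coordinates for the samples, ready for custom ggplot2 plotting
pca <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
head(pca)
```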
How do I correct for batch effects in DESeq2?
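Include batch in the design formula alongside the variable of interest: design = ~ batch + condition. DESeq2 then accounts for batch variation when testing the condition effect; the counts themselves are not altered. By convention, nuisance variables come first and the variable of interest last, because results(dds) reports the last variable in the design by default; the fitted model is the same in either order. For hidden batch effects, use the sva package: run svaseq() on normalized counts to estimate surrogate variables, add them to colData, and include them in the design formula.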
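A minimal sketch of a batch-adjusted design on simulated counts. The batch labels are hypothetical and are assigned so that every batch contains both conditions.

```r
library(DESeq2)

set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 12, betaSD = 1)

# Hypothetical batch labels, crossed with the simulated condition (A/B)
dds$batch <- factor(rep(c("b1", "b2", "b3"), times = 4))
table(dds$batch, dds$condition)

design(dds) <- ~ batch + condition
dds <- DESeq(dds)

# Default: the last variable in the design (condition, B vs A)
res_default <- results(dds)

# Equivalent explicit request, independent of term order
res_explicit <- results(dds, contrast = c("condition", "B", "A"))
```

Requesting the comparison explicitly with contrast avoids relying on the position of condition in the formula.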
How do I create a volcano plot from DESeq2 results?
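Convert the results to a data.frame, drop rows with NA padj, add a significance column (padj < 0.05 and abs(log2FoldChange) > 1), then use ggplot2 with aes(x = log2FoldChange, y = -log10(pvalue)). Color points by significance, add dashed vertical lines at log2FC thresholds of -1 and 1, and a horizontal line at -log10(0.05). Because the y-axis shows raw p-values while significance is defined on padj, the horizontal line is only a visual guide; plot -log10(padj) instead if the line should coincide with the cutoff. Results from lfcShrink() produce cleaner volcano plots with reduced noise from low-count genes.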
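The steps above can be sketched as follows, using simulated results in place of a real experiment. The colors and point sizes are arbitrary choices.

```r
library(DESeq2)
library(ggplot2)

set.seed(1)
dds <- DESeq(makeExampleDESeqDataSet(n = 1000, m = 6, betaSD = 1))
res <- results(dds)

# Results as a plain data.frame, without genes lacking an adjusted p-value
df <- as.data.frame(res)
df <- df[!is.na(df$padj), ]
df$significant <- df$padj < 0.05 & abs(df$log2FoldChange) > 1

p <- ggplot(df, aes(x = log2FoldChange, y = -log10(pvalue),
                    color = significant)) +
  geom_point(alpha = 0.6, size = 1) +
  scale_color_manual(values = c("FALSE" = "grey60", "TRUE" = "firebrick")) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  labs(x = "log2 fold change", y = "-log10(p-value)") +
  theme_minimal()
p
```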
How do I perform gene set enrichment analysis with DESeq2 results?
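For over-representation analysis (ORA), extract significant gene IDs (padj < 0.05) and pass them to clusterProfiler enrichGO() with the appropriate OrgDb and keyType, ideally supplying all tested genes as the universe. For GSEA, create a ranked gene list by sorting log2FoldChange values in decreasing order with gene IDs as names, then pass it to gseGO(). If your row names are Ensembl IDs, either set keyType = "ENSEMBL" or convert them to Entrez IDs or symbols with mapIds() from AnnotationDbi, using the matching annotation package (for example org.Hs.eg.db). Drop unmapped and duplicated IDs before ranking.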
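The ID conversion and list preparation might look like the sketch below. The toy table stands in for as.data.frame(res): the Ensembl IDs are real human genes, but the fold changes and adjusted p-values are invented for illustration, and eight genes are far too few for a meaningful enrichment result.

```r
library(clusterProfiler)
library(org.Hs.eg.db)

# Toy stand-in for a DESeq2 results table (values are invented)
toy <- data.frame(
  log2FoldChange = c(2.1, -1.4, 0.3, 1.8, -2.6, 0.9, -0.2, 1.2),
  padj           = c(0.001, 0.02, 0.60, 0.004, 0.0005, 0.20, 0.90, 0.03),
  row.names = c("ENSG00000141510", "ENSG00000012048", "ENSG00000139618",
                "ENSG00000171862", "ENSG00000146648", "ENSG00000136997",
                "ENSG00000133703", "ENSG00000157764")
)

# Ensembl -> Entrez; multiVals = "first" keeps one Entrez ID per Ensembl ID
entrez <- AnnotationDbi::mapIds(org.Hs.eg.db, keys = rownames(toy),
                                column = "ENTREZID", keytype = "ENSEMBL",
                                multiVals = "first")
keep <- !is.na(entrez) & !duplicated(entrez)

# ORA input: significant genes (in a real analysis, also pass universe =)
sig_genes <- unname(entrez[keep & toy$padj < 0.05])
ego <- enrichGO(gene = sig_genes, OrgDb = org.Hs.eg.db,
                keyType = "ENTREZID", ont = "BP")

# GSEA input: named vector sorted in decreasing order
ranked <- sort(setNames(toy$log2FoldChange[keep], entrez[keep]),
               decreasing = TRUE)

# With a genome-wide ranked list, the GSEA call would be:
# gse <- gseGO(geneList = ranked, OrgDb = org.Hs.eg.db,
#              keyType = "ENTREZID", ont = "BP")
```

Ranking by shrunken fold changes or by the Wald statistic is a common alternative to raw log2FoldChange, since it reduces the influence of noisy low-count genes.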
How do I analyze interaction effects between two experimental variables?
Use the design formula ~ genotype + treatment + genotype:treatment to model main effects and the interaction term. Run DESeq(dds) and extract the interaction effect with results(dds, name = "genotypeKO.treatmenttreated"); the exact name depends on your factor levels, so check resultsNames(dds). Significant genes in this result respond to treatment differently in the two genotypes. The design requires samples, ideally replicates, for every combination of factor levels; the groups do not need to be equal in size.
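A minimal sketch of the interaction design on simulated counts. The genotype and treatment factors are hypothetical labels attached to the simulated samples, with three replicates per combination.

```r
library(DESeq2)

set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 12)

# Hypothetical 2 x 2 layout: WT/KO genotype by control/treated treatment
dds$genotype  <- factor(rep(c("WT", "KO"), each = 6),
                        levels = c("WT", "KO"))
dds$treatment <- factor(rep(rep(c("control", "treated"), each = 3), times = 2),
                        levels = c("control", "treated"))
table(dds$genotype, dds$treatment)

design(dds) <- ~ genotype + treatment + genotype:treatment
dds <- DESeq(dds)
resultsNames(dds)

# Interaction: is the treatment effect different in KO than in WT?
res_int <- results(dds, name = "genotypeKO.treatmenttreated")

# Treatment effect within KO: main effect plus interaction term
res_ko <- results(dds, contrast = list(c("treatment_treated_vs_control",
                                         "genotypeKO.treatmenttreated")))
```

With WT and control as reference levels, treatment_treated_vs_control alone is the treatment effect in WT.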