SAMtools Reference

About SAMtools Reference

The SAMtools Reference is a comprehensive command cheat sheet for SAMtools, the essential toolkit for manipulating SAM/BAM/CRAM alignment files in next-generation sequencing workflows. It covers over 25 commands organized across file conversion and sorting (view, sort, index, merge, collate), filtering and extraction (region selection, FLAG-based filtering with -f/-F, MAPQ filtering with -q, fastq/fasta extraction, read group management), and statistics and quality control (flagstat, idxstats, stats, depth, coverage, flags).

The reference includes detailed variant calling entries for mpileup with BCFtools pipelines, consensus sequence generation, and MD/NM tag recalculation via calmd. Utility commands cover the complete duplicate removal pipeline using collate, fixmate, sort, and markdup in a streaming workflow. FASTA indexing with faidx, sequence dictionary creation with dict, and the terminal-based alignment viewer tview are also documented with practical examples.

Each entry provides real command-line examples with commonly used flags, including multi-threaded operation (-@ threads), memory allocation (-m), BED file region filtering (-L), output format selection (-b for BAM, -C for CRAM), and piped workflows. A complete BWA-MEM to variant-calling pipeline example ties all the commands together for whole-genome sequencing analysis. All content runs locally in your browser with no data sent to any server.

Key Features

  • Full SAM/BAM/CRAM format conversion with samtools view including FLAG (-f/-F) and MAPQ (-q) filtering
  • Coordinate and name sorting, BAI/CSI/CRAI indexing, and multi-file merge commands with threading options
  • Complete duplicate removal pipeline: collate, fixmate -m, sort, markdup with streaming syntax
  • Alignment statistics via flagstat, idxstats, stats, depth, and coverage with JSON and plot output
  • Per-position pileup with mpileup for BCFtools variant calling and consensus sequence generation
  • FASTQ/FASTA extraction from BAM with paired-end read handling and unmapped read filtering
  • FASTA indexing (faidx), sequence dictionary (dict), and read group management (addreplacerg, split)
  • End-to-end BWA-MEM + SAMtools WGS pipeline from reference indexing through duplicate removal and QC

Frequently Asked Questions

What is SAMtools and what file formats does it handle?

SAMtools is a suite of command-line tools for reading, writing, editing, and indexing SAM (Sequence Alignment/Map), BAM (compressed binary SAM), and CRAM (reference-based compressed) format files. It is a core component of nearly every NGS bioinformatics pipeline, handling tasks from format conversion to statistics and variant calling preparation.

How do I filter reads by FLAG values using samtools view?

Use -f to include reads with specific FLAG bits set and -F to exclude them. For example, -F 0x4 excludes unmapped reads, -f 0x2 keeps only properly paired reads, and -F 0x900 removes secondary and supplementary alignments. Common FLAG values include 0x4 (unmapped), 0x100 (secondary), 0x400 (PCR duplicate), and 0x800 (supplementary).
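
The FLAG arithmetic above can be checked directly in the shell. The following is a minimal sketch, not part of the reference itself: flag_mask and keep_primary_mapped are hypothetical helper names, and the file arguments are placeholders.

```shell
# Combine FLAG bits into a single hexadecimal mask using shell arithmetic.
flag_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( $mask | $bit ))
    done
    printf '0x%X\n' "$mask"
}

# Hypothetical wrapper: drop unmapped (0x4), secondary (0x100),
# duplicate (0x400), and supplementary (0x800) reads in one pass.
# $1 = input BAM, $2 = output BAM (placeholder file names).
keep_primary_mapped() {
    samtools view -b -F "$(flag_mask 0x4 0x100 0x400 0x800)" -o "$2" "$1"
}

flag_mask 0x100 0x800   # prints 0x900
```

Building the mask from named bits keeps the intent visible; the same value can of course be passed to -F or -f as a literal.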

What is the correct pipeline for duplicate removal with SAMtools?

The recommended streaming pipeline is: samtools collate -O input.bam | samtools fixmate -m - - | samtools sort - | samtools markdup - dedup.bam. The collate step groups reads by name, fixmate -m adds the mate score tags that markdup relies on, sort reorders the reads by coordinate, and markdup marks duplicates. By default duplicates are only flagged (FLAG 0x400); add -r to markdup to remove them from the output.
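
The streaming pipeline can be wrapped in a small function so it is easy to reuse. This is a sketch under stated assumptions: mark_duplicates is a hypothetical name, the thread default of 4 is arbitrary, and the file names are placeholders.

```shell
# Hypothetical wrapper around the collate | fixmate | sort | markdup pipeline.
# $1 = input BAM, $2 = output BAM, $3 = threads per step (assumed default: 4)
mark_duplicates() {
    in_bam=$1
    out_bam=$2
    threads=${3:-4}

    # collate: group reads by name; fixmate -m: add mate score tags;
    # sort: back to coordinate order; markdup: flag duplicates.
    samtools collate -@ "$threads" -O "$in_bam" |
        samtools fixmate -@ "$threads" -m - - |
        samtools sort -@ "$threads" - |
        samtools markdup -@ "$threads" - "$out_bam"
}

# Usage (placeholder file names):
#   mark_duplicates input.bam dedup.bam 8
```

Because every step reads from and writes to a pipe, no intermediate BAM files are written apart from the temporary files that sort may create.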

How do I get alignment statistics from a BAM file?

Use samtools flagstat for a quick FLAG-based summary (mapped%, properly paired%, duplicates). Use samtools idxstats for per-chromosome read counts. Use samtools stats for comprehensive metrics including error rate, insert size distribution, and quality scores. The stats output can be visualized with plot-bamstats.
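
As an illustration, the three statistics commands can be collected into one helper. This is a rough sketch: qc_report and the output file naming are assumptions made for the example, and idxstats expects the BAM to be indexed already.

```shell
# Hypothetical helper: write the three reports next to each other.
# $1 = indexed BAM file, $2 = output prefix (assumed default: qc)
qc_report() {
    bam=$1
    prefix=${2:-qc}

    samtools flagstat "$bam" > "$prefix.flagstat.txt"
    samtools idxstats "$bam" > "$prefix.idxstats.txt"
    samtools stats "$bam" > "$prefix.stats.txt"

    # The summary numbers section of samtools stats is prefixed with SN.
    grep '^SN' "$prefix.stats.txt" | cut -f 2-
}

# Usage (placeholder names):
#   qc_report sample.bam sample
#   plot-bamstats -p sample_plots/ sample.stats.txt
```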

How do I calculate coverage depth across genomic regions?

Use samtools depth for per-position coverage (add -a to include zero-depth positions; -d 0 only matters on older releases, where it lifts the maximum depth cap). Use samtools coverage for per-chromosome summary statistics including mean depth, coverage breadth percentage, and mean mapping quality. To restrict the analysis, samtools depth accepts a BED file with -b, while samtools coverage takes a region string with -r.
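
A common follow-up is to reduce the per-position output of samtools depth to a single mean. The sketch below assumes the usual three-column output (reference name, position, depth); mean_depth and coverage_summary are hypothetical helper names.

```shell
# Mean depth over every reported position; -a includes zero-depth positions,
# so the denominator is the full reference length covered by the output.
# $1 = BAM file (placeholder)
mean_depth() {
    samtools depth -a "$1" |
        awk '{ sum += $3 } END { if (NR > 0) printf "%.2f\n", sum / NR }'
}

# Per-reference summary table (mean depth, breadth, mean MAPQ).
coverage_summary() {
    samtools coverage "$1"
}

# Usage (placeholder file name):
#   mean_depth sample.bam
#   coverage_summary sample.bam
```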

How do I extract FASTQ reads from a BAM file?

Use samtools fastq -1 R1.fq.gz -2 R2.fq.gz for paired-end extraction. Name-sort the BAM first (samtools sort -n) so that mates are adjacent in the input. To extract only unmapped reads, pipe samtools view -b -f 0x4 into samtools fastq. Single-end reads need no -1/-2 options and are written to standard output or to the file given with -0. For FASTA instead of FASTQ output, use samtools fasta.
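
A possible way to script the extraction is shown below. Note that this sketch swaps in samtools collate for the full name sort, since fastq only needs mates to be adjacent; bam_to_fastq, unmapped_to_fastq, and the output naming scheme are hypothetical.

```shell
# Paired-end extraction. Reads that are not part of a proper READ1/READ2
# pair (-0) and singletons (-s) are discarded here by sending them to /dev/null.
# $1 = input BAM, $2 = output prefix (placeholders)
bam_to_fastq() {
    in_bam=$1
    prefix=$2
    samtools collate -u -O "$in_bam" |
        samtools fastq -1 "${prefix}_R1.fq.gz" -2 "${prefix}_R2.fq.gz" \
            -0 /dev/null -s /dev/null -n -
}

# Unmapped reads only (FLAG 0x4), written as a single FASTQ stream.
# $1 = input BAM, $2 = output FASTQ (placeholders)
unmapped_to_fastq() {
    samtools view -b -f 0x4 "$1" | samtools fastq - > "$2"
}

# Usage (placeholder names):
#   bam_to_fastq sample.bam sample
```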

What is the difference between BAI and CSI indexes?

BAI is the standard index format for BAM files and supports chromosomes up to 2^29 bases (536,870,912 bp, about 512 Mb). CSI (Coordinate Sorted Index) supports longer chromosomes and is needed for genomes in which any single reference sequence exceeds that limit. Create BAI with samtools index, or CSI with samtools index -c. CRAM files use .crai indexes.
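
Whether a CSI index is needed can be decided from the @SQ lines of the BAM header. The following sketch assumes standard tab-separated header lines with an LN: field; index_format and index_bam are hypothetical helper names.

```shell
# Longest reference sequence a BAI index can address: 2^29 bases.
BAI_LIMIT=$(( 1 << 29 ))

# Print "csi" if any reference in the header is longer than the BAI limit,
# otherwise "bai". $1 = BAM file (placeholder)
index_format() {
    samtools view -H "$1" |
        awk -F '\t' -v limit="$BAI_LIMIT" '
            $1 == "@SQ" {
                for (i = 2; i <= NF; i++)
                    if ($i ~ /^LN:/) {
                        len = substr($i, 4) + 0
                        if (len > max) max = len
                    }
            }
            END { if (max > limit) print "csi"; else print "bai" }'
}

# Build the matching index (-b for BAI, -c for CSI).
index_bam() {
    if [ "$(index_format "$1")" = csi ]; then
        samtools index -c "$1"
    else
        samtools index -b "$1"
    fi
}

echo "$BAI_LIMIT"   # prints 536870912
```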

How do I set up a complete WGS analysis pipeline with BWA and SAMtools?

First index the reference with bwa index, samtools faidx, and samtools dict. Then align with bwa mem piped into samtools sort. Index the sorted BAM, run the duplicate removal pipeline (collate, fixmate, sort, markdup), index again, and verify with flagstat and stats. The reference includes the complete pipeline with thread and read group parameters.
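
The first two stages of that workflow might look like the sketch below. It is not the pipeline from the reference itself: the function names, the default of 4 threads, and the read group values (ID, SM, PL) are placeholders chosen for illustration.

```shell
# Build the three reference indexes. $1 = reference FASTA (placeholder)
prepare_reference() {
    ref=$1
    bwa index "$ref"
    samtools faidx "$ref"
    samtools dict -o "${ref%.*}.dict" "$ref"
}

# Align paired-end reads and write a coordinate-sorted, indexed BAM.
# $1 = reference, $2 = R1 FASTQ, $3 = R2 FASTQ, $4 = output BAM,
# $5 = threads (assumed default: 4)
align_and_sort() {
    ref=$1
    r1=$2
    r2=$3
    out=$4
    threads=${5:-4}

    # The read group string is a made-up example; adjust it per sample.
    bwa mem -t "$threads" -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
        "$ref" "$r1" "$r2" |
        samtools sort -@ "$threads" -o "$out" -
    samtools index "$out"
}

# Usage (placeholder file names):
#   prepare_reference ref.fa
#   align_and_sort ref.fa R1.fq.gz R2.fq.gz sample.sorted.bam 8
```

Duplicate marking and the flagstat/stats checks then run on the sorted BAM as described above.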