liminfo

UniProt Reference

About UniProt Reference

The UniProt Reference is a searchable cheat sheet for the Universal Protein Resource (UniProt), a comprehensive database of protein sequences and functional annotation. It covers UniProtKB identifiers (6- or 10-character accession numbers such as P04637 for human p53, and entry names in PROTEIN_SPECIES format such as P53_HUMAN) and the search syntax for querying by gene name, organism ID, EC number, and reviewed status (reviewed:true for the ~570K manually curated Swiss-Prot entries vs. reviewed:false for the ~250M automatically annotated TrEMBL entries).
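As a rough illustration of the two accession lengths, the check below uses the accession pattern as it appears in UniProt's help pages, to the best of my knowledge. The helper name is_accession is hypothetical, and the pattern should be verified against the current documentation before it is relied on.

```python
import re

# Assumed pattern for UniProtKB accessions (6 or 10 characters).
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_accession(text: str) -> bool:
    """Return True if the whole string looks like a UniProtKB accession."""
    return ACCESSION_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("P04637", "A0A024R161", "P53_HUMAN"):
        print(candidate, is_accession(candidate))
```

An entry name such as P53_HUMAN fails the check because it contains an underscore and does not follow the accession layout.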

The reference documents the UniProt REST API at rest.uniprot.org with endpoints for protein retrieval (JSON, FASTA, TXT, XML, GFF formats), programmatic search with field filtering and pagination, ID mapping between external databases (GeneCards, Ensembl, PDB) and UniProt accessions, and bulk data streaming for downloading entire proteomes. It covers protein information sections including Function (biological activity and GO terms), Subcellular Location (signal peptides, transmembrane regions), Protein Existence levels (PE 1-5), Feature types (DOMAIN, BINDING, ACT_SITE, MOD_RES, VARIANT, MUTAGEN), post-translational modifications, and disease-associated variants with dbSNP/ClinVar cross-references.

Data format sections explain the FASTA header structure (sp|accession|entry_name followed by OS, OX, GN, PE, SV fields), UniProt flat file line codes (ID, AC, DE, GN, OS, DR, FT), and structured XML output for programmatic parsing. The reference also covers related databases — UniRef clusters (UniRef100/90/50 for sequence redundancy reduction), UniParc (non-redundant sequence archive), reference proteomes (like UP000005640 for Homo sapiens), Gene Ontology annotations (Molecular Function, Biological Process, Cellular Component), and external cross-references to PDB, Pfam, InterPro, KEGG, and Ensembl. Python code examples are included for REST API access. All content runs entirely in your browser.

Key Features

  • UniProtKB identifier formats: accession numbers, entry names, and Swiss-Prot vs TrEMBL distinction
  • Complete search syntax with gene name, organism ID, EC number, and reviewed status filters
  • REST API documentation: protein lookup, search, ID mapping, streaming, and field selection endpoints
  • Protein annotation sections: Function, Subcellular Location, PE levels, Features, PTMs, and disease variants
  • Data format reference: FASTA header fields, flat file line codes, GFF, and XML structures
  • Related databases: UniRef clusters, UniParc, proteomes, GO annotations, and external cross-references
  • Python code examples for programmatic REST API access and response parsing
  • Searchable and filterable across categories: Identifiers, REST API, Protein Info, Formats, Related DBs

Frequently Asked Questions

What is the difference between Swiss-Prot and TrEMBL?

Swiss-Prot (reviewed:true) contains approximately 570,000 protein entries that have been manually curated and reviewed by expert curators, with annotations drawn from the literature and from curator-evaluated computational analysis. TrEMBL (reviewed:false) contains approximately 250 million entries that are annotated automatically by computational pipelines without manual review. Both counts change with each release. Swiss-Prot entries are considered high-confidence and are preferred when annotation quality matters, while TrEMBL provides broader coverage, including computationally predicted proteins.

How do I search UniProt programmatically?

Use the REST API at https://rest.uniprot.org/uniprotkb/search with query parameters. For example, ?query=(gene:BRCA1)+AND+(organism_id:9606)&format=json&size=10 returns the first 10 human BRCA1 entries as JSON. The query parameter filters entries, format selects the output (json, tsv, fasta, xml), and size and cursor control pagination. The fields parameter (for example &fields=accession,protein_name,gene_names,organism_name) limits the response to the selected columns, which reduces bandwidth for large queries.
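The example query above can be assembled in Python along these lines. This is a minimal sketch using only the standard library. The helper name build_search_url and its default arguments are hypothetical, and it only builds the URL without sending a request.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(gene, organism_id, reviewed=None,
                     fields=("accession", "gene_names"),
                     fmt="json", size=10):
    """Build a UniProtKB search URL for one gene in one organism."""
    clauses = [f"(gene:{gene})", f"(organism_id:{organism_id})"]
    if reviewed is not None:
        # reviewed:true restricts to Swiss-Prot, reviewed:false to TrEMBL.
        clauses.append(f"(reviewed:{str(reviewed).lower()})")
    params = {
        "query": " AND ".join(clauses),
        "format": fmt,
        "size": size,
        "fields": ",".join(fields),
    }
    # urlencode percent-encodes parentheses and colons and turns spaces into "+".
    return f"{SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("BRCA1", 9606, reviewed=True))
```

Building the query from separate clauses keeps the boolean logic readable and leaves the encoding to urlencode.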

What do the Protein Existence (PE) levels mean?

PE levels indicate the type of evidence supporting a protein's existence: PE 1 = experimental evidence at protein level (mass spectrometry, X-ray, etc.), PE 2 = evidence at transcript level (mRNA detected but protein not directly confirmed), PE 3 = inferred by homology from a characterized protein in another species, PE 4 = predicted by gene prediction algorithms, PE 5 = uncertain/dubious. For high-confidence analyses, filter for PE 1 entries.

How do I map IDs between UniProt and other databases?

Use the ID mapping API: POST to https://rest.uniprot.org/idmapping/run with three parameters: from (the source database, e.g. GeneCards, Ensembl, PDB, or RefSeq), to (the target, e.g. UniProtKB), and ids (a comma-separated list of identifiers). The API returns a job ID. Poll https://rest.uniprot.org/idmapping/status/{jobId} until the job has finished, then fetch the mappings from https://rest.uniprot.org/idmapping/results/{jobId}. The service handles batch conversions and supports dozens of external databases, and the response includes the mapped accessions with metadata.
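The submit, wait, and fetch cycle might look like the sketch below, written with the standard library only. Several details are assumptions rather than confirmed API behavior: the ids form field, the status endpoint and its jobStatus values, and the shape of the results payload (a results list of from/to rows, where to may be a plain string or an entry object with primaryAccession). All function names are hypothetical.

```python
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://rest.uniprot.org/idmapping"


def submit_job(ids, source="Ensembl", target="UniProtKB"):
    """POST the ID list and return the job ID from the response."""
    body = urlencode({"from": source, "to": target, "ids": ",".join(ids)})
    with urlopen(f"{BASE}/run", data=body.encode()) as resp:
        return json.load(resp)["jobId"]


def wait_until_done(job_id, delay=3.0, max_polls=20):
    """Poll the (assumed) status endpoint until the job is no longer pending."""
    for _ in range(max_polls):
        with urlopen(f"{BASE}/status/{job_id}") as resp:
            status = json.load(resp).get("jobStatus")
        if status in (None, "FINISHED"):
            return  # No status field is treated as "results are ready".
        if status not in ("NEW", "RUNNING"):
            raise RuntimeError(f"job {job_id} ended with status {status}")
        time.sleep(delay)
    raise TimeoutError(f"job {job_id} still pending after {max_polls} polls")


def fetch_results(job_id):
    """Fetch the first page of results for a finished job."""
    with urlopen(f"{BASE}/results/{job_id}") as resp:
        return json.load(resp)


def extract_pairs(payload):
    """Return (source_id, target_id) pairs from a results payload."""
    pairs = []
    for row in payload.get("results", []):
        target = row["to"]
        # Assumption: UniProtKB targets may arrive as entry objects.
        if isinstance(target, dict):
            target = target.get("primaryAccession")
        pairs.append((row["from"], target))
    return pairs


if __name__ == "__main__":
    job = submit_job(["ENSG00000141510"])
    wait_until_done(job)
    print(extract_pairs(fetch_results(job)))
```

Results are paginated, so this sketch reads only the first page. A fuller client would also follow the pagination links and report any identifiers that failed to map.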

What is the UniProt FASTA header format?

A UniProt FASTA header follows this structure: >db|accession|entry_name Description OS=Organism OX=TaxonomyID GN=GeneName PE=ProteinExistence SV=SequenceVersion. For example: >sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4. The "sp" prefix indicates Swiss-Prot (manually reviewed), while "tr" indicates TrEMBL (auto-annotated). This header format is standardized and parseable by most bioinformatics tools.
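A header in this layout can be split into its fields with a regular expression. The sketch below is one possible parser, not an official one: the name parse_header is hypothetical, GN is treated as optional, and gene names are assumed to contain no spaces.

```python
import re

HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)"
    r"\s+(?P<description>.*?)"
    r"\s+OS=(?P<organism>.*?)\s+OX=(?P<taxon_id>\d+)"
    r"(?:\s+GN=(?P<gene>\S+))?"
    r"\s+PE=(?P<pe>\d)\s+SV=(?P<sv>\d+)\s*$"
)


def parse_header(line):
    """Parse a UniProtKB FASTA header into a dict, or return None."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Numeric fields are converted so they can be compared and filtered.
    for key in ("taxon_id", "pe", "sv"):
        record[key] = int(record[key])
    return record


if __name__ == "__main__":
    header = (">sp|P04637|P53_HUMAN Cellular tumor antigen p53 "
              "OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4")
    print(parse_header(header))
```

With the PE value available as an integer, keeping only entries with protein-level evidence becomes a simple comparison against 1.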

What are Feature types in UniProt annotations?

Feature types annotate specific regions or positions in the protein sequence. Key types include: DOMAIN (functional domain boundaries), BINDING (ligand/substrate binding sites), ACT_SITE (catalytic active site residues), MOD_RES (post-translational modifications like phosphoserine), VARIANT (natural sequence variants with dbSNP IDs and clinical significance from ClinVar), MUTAGEN (experimentally introduced mutations and their effects), CHAIN (mature protein after signal peptide cleavage), and TRANSMEM (transmembrane helix regions).
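When working with an entry retrieved as JSON, features can be grouped by type for quick inspection. The sketch below assumes each feature is an object with type, description, and location.start.value / location.end.value keys. That shape and the type strings in the toy data are assumptions to check against a real response, and the positions shown are made up for illustration.

```python
from collections import defaultdict


def features_by_type(entry):
    """Group an entry's features as {type: [(start, end, description), ...]}."""
    grouped = defaultdict(list)
    for feature in entry.get("features", []):
        location = feature.get("location", {})
        start = location.get("start", {}).get("value")
        end = location.get("end", {}).get("value")
        grouped[feature.get("type")].append(
            (start, end, feature.get("description", ""))
        )
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical toy entry, not real annotation data.
    toy_entry = {
        "features": [
            {"type": "Domain", "description": "Example domain",
             "location": {"start": {"value": 10}, "end": {"value": 50}}},
            {"type": "Modified residue", "description": "Phosphoserine",
             "location": {"start": {"value": 15}, "end": {"value": 15}}},
        ]
    }
    print(features_by_type(toy_entry))
```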

What are UniRef clusters and when should I use them?

UniRef provides pre-computed sequence clusters at three identity thresholds: UniRef100 (identical sequences and sub-fragments merged), UniRef90 (at least 90% identity), and UniRef50 (at least 50% identity). Use UniRef90 or UniRef50 to reduce redundancy in sequence databases for faster BLAST searches, to create non-redundant training sets for machine learning, or to cut computational cost in large-scale analyses. Each cluster has a representative sequence, selected mainly on annotation quality, so reviewed Swiss-Prot entries are favored where the cluster contains one.

How do I access UniProt data with Python?

Use the requests library to call the REST API: import requests; resp = requests.get("https://rest.uniprot.org/uniprotkb/P04637.json"); data = resp.json(). The JSON response contains structured data including proteinDescription, genes, organism, comments (function, subcellular location), features (domains, variants), and cross-references. For bulk operations, use the stream endpoint with format=tsv or format=fasta and write directly to files. Search results are paginated: each response carries a Link header with the URL of the next page, which requests exposes as resp.links, and the client has to follow it until no next link remains.
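Following the Link header through a paginated search can be sketched as below. This version uses the standard library instead of requests so that it has no dependencies. The function names are hypothetical, and the header layout (<url>; rel="next") and the results key in the JSON body are assumptions about the response.

```python
import json
import re
from urllib.request import urlopen

NEXT_LINK_RE = re.compile(r'<([^>]+)>;\s*rel="next"')


def next_page_url(link_header):
    """Extract the rel="next" URL from a Link header, or return None."""
    if not link_header:
        return None
    match = NEXT_LINK_RE.search(link_header)
    return match.group(1) if match else None


def iter_results(url):
    """Yield every result, following next-page links until none remain."""
    while url:
        with urlopen(url) as resp:
            payload = json.load(resp)
            link = resp.headers.get("Link")
        yield from payload.get("results", [])
        url = next_page_url(link)


if __name__ == "__main__":
    start = ("https://rest.uniprot.org/uniprotkb/search"
             "?query=gene:BRCA1+AND+organism_id:9606"
             "&fields=accession&format=json&size=5")
    for item in iter_results(start):
        print(item.get("primaryAccession"))
```

Writing the loop as a generator lets the caller stop early without downloading the remaining pages.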