Universal Dependencies Reference
About Universal Dependencies Reference
The Universal Dependencies Reference is a searchable guide to the UD annotation framework, the cross-linguistically consistent treebank standard used in NLP research and computational linguistics. It covers the 17 UPOS (Universal Part-of-Speech) tags including NOUN, VERB, ADJ, ADV, ADP, DET, PRON, and PROPN with CoNLL-U format examples showing tag columns, morphological features, and dependency head assignments.
The dependency relations section documents the most frequently used syntactic relations: nsubj (nominal subject), obj (direct object), iobj (indirect object), obl (oblique nominal, such as a prepositional phrase), advmod (adverbial modifier), amod (adjectival modifier), det (determiner), and conj (conjunct in a coordination). Each relation is illustrated with both English and Korean examples, using tree notation such as nsubj(laughed, child) alongside practical CoNLL-U annotations.
Additional sections cover the CoNLL-U 10-column tab-separated format (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), morphological features (Number, Case, Tense, VerbForm with pipe-separated values), language-specific XPOS tags (Penn Treebank for English, Sejong for Korean), enhanced dependencies for shared arguments and relative clause resolution, multi-word token handling, and the validate.py validation tool for checking format, tag validity, tree structure, and punctuation-related non-projectivity.
Key Features
- Most commonly used UPOS tags (8 of the 17): NOUN, PROPN, VERB, ADJ, ADV, ADP, DET, PRON with language-specific examples
- Core dependency relations: nsubj, obj, iobj, obl, advmod, amod, det, conj with tree notation
- Full CoNLL-U 10-column format specification with sent_id, text metadata, and tab-separated fields
- Morphological features: Number (Sing/Plur), Case (Nom/Acc/Dat/Gen), Tense, VerbForm with pipe syntax
- XPOS cross-reference for Penn Treebank (NN, VBZ, JJ) and Sejong tagset (NNG, VV, VA, MAG)
- FORM/LEMMA mapping examples: running->run, went->go, better->good across multiple languages
- Enhanced dependencies for relative clause resolution, shared arguments, and ellipsis restoration
- Multi-word token handling (contractions like don't -> do + n't, where n't has the lemma not) and validate.py usage guide
Frequently Asked Questions
What UPOS tags does Universal Dependencies define?
UD defines 17 universal POS tags. This reference covers the most commonly used: NOUN (common nouns), PROPN (proper nouns), VERB (verbs), ADJ (adjectives), ADV (adverbs), ADP (adpositions/prepositions/postpositions), DET (determiners like the/a/this), and PRON (pronouns). Each tag includes CoNLL-U column examples and language-specific notes.
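As a quick illustration of checking a tag against the universal set, here is a minimal sketch in Python. The set lists all 17 UD v2 tags, including the nine not covered in detail by this reference; the names UPOS_TAGS and is_valid_upos are hypothetical helpers, not part of any UD tool.

```python
# The 17 universal POS tags defined by UD v2.
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def is_valid_upos(tag: str) -> bool:
    """Return True if tag is one of the 17 universal POS tags."""
    return tag in UPOS_TAGS


# A Penn Treebank tag such as "NN" belongs in XPOS, so it is rejected here.
print(is_valid_upos("PROPN"), is_valid_upos("NN"))
```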
What are the key dependency relations in Universal Dependencies?
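The core argument relations are nsubj (nominal subject, e.g., nsubj(laughed, child), where laughed is the head and child is the dependent), obj (direct object), and iobj (indirect object). Modifier relations include advmod (adverbial modifier), amod (adjectival modifier), and det (determiner). Coordination uses conj for each non-first conjunct and cc for the coordinating conjunction. Oblique nominals use obl, which covers prepositional or adverbial noun phrases with case marking.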
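The relation(head, dependent) notation used in this reference, such as nsubj(laughed, child), can be derived mechanically from the ID, FORM, HEAD, and DEPREL columns. The following is a small sketch under that assumption; relation_triples and the example sentence are illustrative, not taken from a treebank.

```python
def relation_triples(tokens):
    """Render (id, form, head, deprel) tuples as 'deprel(head_form, form)' strings."""
    forms = {token_id: form for token_id, form, _, _ in tokens}
    forms[0] = "ROOT"  # HEAD 0 denotes the artificial root
    return [f"{deprel}({forms[head]}, {form})" for _, form, head, deprel in tokens]


sentence = [
    (1, "The", 2, "det"),
    (2, "child", 3, "nsubj"),
    (3, "laughed", 0, "root"),
]
print(relation_triples(sentence))
# ['det(child, The)', 'nsubj(laughed, child)', 'root(ROOT, laughed)']
```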
How is the CoNLL-U format structured?
CoNLL-U uses 10 tab-separated columns: ID (token index from 1), FORM (surface form), LEMMA (base form), UPOS (universal POS), XPOS (language-specific POS), FEATS (morphological features), HEAD (dependency head index, 0 for the root), DEPREL (dependency relation), DEPS (enhanced dependencies), and MISC (any other annotation). Sentences are separated by blank lines, and lines beginning with # carry metadata such as sent_id and text.
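To make the column layout concrete, here is a minimal reader sketch in plain Python. It assumes well-formed input and keeps every column as a string; FIELDS, SAMPLE, and parse_conllu are hypothetical names chosen for this example, and the sample sentence is hand-written rather than copied from a treebank.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

SAMPLE = (
    "# sent_id = demo-1\n"
    "# text = The child laughed.\n"
    "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n"
    "2\tchild\tchild\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_\n"
    "3\tlaughed\tlaugh\tVERB\tVBD\tMood=Ind|Tense=Past|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_\n"
    "\n"
)


def parse_conllu(text):
    """Yield (comments, rows) per sentence; each row is a dict keyed by FIELDS."""
    comments, rows = [], []
    for line in text.splitlines():
        if not line.strip():           # a blank line ends the current sentence
            if rows:
                yield comments, rows
            comments, rows = [], []
        elif line.startswith("#"):     # metadata comment, e.g. sent_id or text
            comments.append(line[1:].strip())
        else:
            cols = line.split("\t")
            if len(cols) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            rows.append(dict(zip(FIELDS, cols)))
    if rows:                           # tolerate a missing final blank line
        yield comments, rows


for comments, rows in parse_conllu(SAMPLE):
    print(comments)
    for row in rows:
        print(row["id"], row["form"], row["upos"], row["head"], row["deprel"])
```

An underscore marks an empty column, so the sketch leaves it untouched instead of converting it to None.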
How do morphological features work in UD?
Features are pipe-separated key=value pairs in alphabetical order in the FEATS column: Number=Sing|Person=3, Case=Acc|Number=Plur, Mood=Ind|Tense=Past|VerbForm=Fin. Common features include Number (Sing/Plur/Dual), Case (Nom/Acc/Dat/Gen), Tense (Past/Pres/Fut), and VerbForm (Fin/Inf/Part/Ger).
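The FEATS syntax maps naturally onto a dictionary. The sketch below assumes single-valued features and an underscore for an empty column; parse_feats and format_feats are hypothetical helper names, and the case-insensitive sort is an assumption about how feature names are ordered.

```python
def parse_feats(feats: str) -> dict:
    """Split a FEATS string such as 'Number=Sing|Person=3' into a dict."""
    if feats == "_":  # underscore means no features
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(features: dict) -> str:
    """Serialize a feature dict with names in alphabetical order."""
    if not features:
        return "_"
    ordered = sorted(features.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


print(parse_feats("Mood=Ind|Tense=Past|VerbForm=Fin"))
print(format_feats({"Tense": "Past", "VerbForm": "Fin", "Mood": "Ind"}))
```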
What is the difference between UPOS and XPOS?
UPOS is the universal tag set (17 tags) consistent across all languages. XPOS is the language-specific tag set that varies by corpus: English uses Penn Treebank (NN, VBZ, JJ, RB), Korean uses Sejong (NNG, VV, VA, MAG), German uses STTS (VVFIN, ADJA). UPOS occupies CoNLL-U column 4 and XPOS occupies column 5.
How does FORM differ from LEMMA?
FORM is the surface (inflected) word as it appears in the text, while LEMMA is its dictionary or base form. Examples: running (FORM) -> run (LEMMA), went -> go, better -> good. In Korean, morphologically rich verb forms are lemmatized to their base verb forms.
What are enhanced dependencies?
Enhanced dependencies (column 9, DEPS) extend basic trees to handle shared arguments, relative clause resolution, and ellipsis restoration. They use the format head:relation, with a pipe separating multiple dependencies; for example, 2:nsubj|4:nsubj indicates that the token is the subject of both predicate 2 and predicate 4.
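A DEPS value can be unpacked with a few lines of Python. This sketch keeps the head as a string, on the assumption that enhanced graphs may point to empty nodes with decimal IDs such as 8.1, and splits on the first colon only because relation labels can themselves contain colons (for example conj:and). parse_deps is a hypothetical helper name.

```python
def parse_deps(deps: str) -> list:
    """Turn '2:nsubj|4:nsubj' into [('2', 'nsubj'), ('4', 'nsubj')]."""
    if deps == "_":  # underscore means no enhanced dependencies are annotated
        return []
    pairs = []
    for item in deps.split("|"):
        head, relation = item.split(":", 1)  # split once: labels may hold colons
        pairs.append((head, relation))
    return pairs


print(parse_deps("2:nsubj|4:nsubj"))
# [('2', 'nsubj'), ('4', 'nsubj')]
```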
How do I validate a CoNLL-U file?
Use the official validate.py tool: python validate.py --lang en file.conllu. It checks format correctness (tab separation, column count), tag validity against the UD tagset, and tree structure (single root, no cycles). Non-projective trees are permitted in UD, so the projectivity checks are limited to punctuation attachment. The --lang flag enables language-specific validation rules.
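The tree-structure conditions named above (single root, no cycles) are simple enough to sketch independently. The function below is an illustration of those two checks only, not a reimplementation of validate.py; check_tree and its input shape (a dict from token ID to HEAD) are assumptions made for this example.

```python
def check_tree(heads: dict) -> list:
    """Return a list of problems for a {token_id: head} mapping (0 = root)."""
    problems = []
    roots = [token_id for token_id, head in heads.items() if head == 0]
    if len(roots) != 1:
        problems.append(f"expected exactly one root, found {len(roots)}")
    for token_id in heads:
        seen = set()
        node = token_id
        while node != 0:               # walk up the tree until the root is reached
            if node in seen:
                problems.append(f"cycle reachable from token {token_id}")
                break
            if node not in heads:
                problems.append(f"token {token_id} leads to unknown head {node}")
                break
            seen.add(node)
            node = heads[node]
    return problems


print(check_tree({1: 2, 2: 3, 3: 0, 4: 3}))  # a well-formed tree
print(check_tree({1: 2, 2: 1, 3: 0}))        # tokens 1 and 2 form a cycle
```

For real treebank work the official validator remains the reference, since it also applies the tagset and language-specific rules.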