Encodingref
About Encodingref
The Character Encoding Reference is a developer-focused cheat sheet covering the full spectrum of text encoding standards used in modern software. It includes detailed entries for UTF-8 byte structures, ASCII control characters, EUC-KR and CP949 Korean encodings, Base64 and Base64url schemes, URL percent-encoding rules, and Unicode code points. Each entry includes a concise description and runnable code examples in Python, JavaScript, and shell commands.
This reference is used by web developers, backend engineers, security researchers, and data scientists who regularly deal with encoding issues — from fixing garbled Korean text to constructing JWT tokens, embedding images as Data URIs, or handling internationalized domain names (IDN/Punycode). The entries are organized by encoding category so you can quickly jump to the specific standard you need.
The reference covers six major encoding categories:
- UTF-8: variable-length Unicode encoding, BOM handling, surrogate validation
- ASCII: 7-bit character table, control codes, case-conversion bit tricks
- EUC-KR/CP949: Korean legacy encodings, iconv conversion, broken character recovery
- Base64/Base32: encoding principles, Data URI, TOTP secrets
- URL Encoding: percent-encoding, encodeURIComponent vs encodeURI, form encoding
- Unicode: planes, normalization forms NFC/NFD, surrogate pairs, escape sequences
Key Features
- UTF-8 byte structure tables for 1-byte through 4-byte sequences with binary bit patterns
- ASCII control character codes (NUL, LF, CR, ESC, DEL) and printable character ranges
- EUC-KR vs CP949 comparison — 2,350 vs 11,172 Korean syllable coverage
- Base64 encoding principle, Base64url (JWT-safe), Data URI format, and Base32 for TOTP
- URL percent-encoding rules with Python urllib and JavaScript encodeURIComponent examples
- Unicode planes 0–16 overview, NFC/NFD normalization, and surrogate pair calculation
- Encoding conversion snippets using Python's codecs module and the Linux iconv command
- Garbled text recovery — how to fix text mis-read in the wrong encoding
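As a taste of the case-conversion bit trick listed above: ASCII upper- and lowercase letters differ only in bit 5 (0x20), so a single bitwise operation flips case. A minimal Python sketch (the function names are illustrative, not from the reference):

```python
# ASCII case conversion via bit 5 (0x20): 'A' is 0x41, 'a' is 0x61,
# and the only difference between them is that one bit.

def to_upper(ch: str) -> str:
    """Clear bit 5 to turn an ASCII lowercase letter into uppercase."""
    if "a" <= ch <= "z":
        return chr(ord(ch) & ~0x20)
    return ch

def to_lower(ch: str) -> str:
    """Set bit 5 to turn an ASCII uppercase letter into lowercase."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) | 0x20)
    return ch

print(to_upper("a"))  # A
print(to_lower("Q"))  # q
print(to_upper("3"))  # 3  (non-letters pass through unchanged)
```

The range guards matter: applying the bit trick blindly would also flip characters like `[` (0x5B) into `{` (0x7B).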
Frequently Asked Questions
What is the difference between UTF-8 and UTF-16?
UTF-8 is a variable-length encoding that uses 1–4 bytes per character, making it space-efficient for ASCII-heavy text and the dominant encoding on the web. UTF-16 uses 2 bytes for most characters (including all Korean, Chinese, and Japanese characters in the Basic Multilingual Plane) and 4 bytes for characters outside it. UTF-8 is preferred for files, APIs, and web content; UTF-16 is used internally by JavaScript, Java, and Windows APIs.
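The size trade-off is easy to check directly; the sketch below compares UTF-8 with big-endian UTF-16 (no BOM) byte counts:

```python
# Compare encoded sizes for an ASCII letter, a Korean syllable,
# and an emoji outside the Basic Multilingual Plane.
for text in ("A", "한", "😀"):
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")   # big-endian, no BOM prepended
    print(f"{text!r}: UTF-8 = {len(utf8)} bytes, UTF-16 = {len(utf16)} bytes")

# 'A'  : 1 byte in UTF-8, 2 in UTF-16
# '한' : 3 bytes in UTF-8, 2 in UTF-16
# '😀' : 4 bytes in both (UTF-16 needs a surrogate pair)
```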
Why does Korean text appear as garbled characters?
Garbled Korean (mojibake) occurs when text is read with the wrong encoding. For example, if a UTF-8 file is opened as EUC-KR, or a CP949 file is decoded as Latin-1, the byte values map to wrong characters. To fix it, re-encode the raw bytes with the correct codec. The reference includes Python snippets for recovering text mis-read as EUC-KR or Latin-1.
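The recovery idea can be sketched in a few lines of Python. Latin-1 plays the "wrong codec" here because it maps every byte value, which makes the damage reversible; the reference's own snippets may differ:

```python
original = "안녕하세요"

# UTF-8 bytes mis-decoded as Latin-1 produce mojibake like 'ì\x95\x88...'
garbled = original.encode("utf-8").decode("latin-1")

# Recovery: re-encode with the codec that was wrongly applied,
# then decode with the codec the bytes were actually written in.
fixed = garbled.encode("latin-1").decode("utf-8")
assert fixed == original
```

Note that recovery only works when the wrong decode was lossless; if the bytes were decoded with errors="replace" and saved, the original byte values are gone for good.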
What is the difference between EUC-KR and CP949?
EUC-KR (KS X 1001) is the legacy Korean standard encoding covering 2,350 complete Hangul syllables using 2 bytes per character. CP949 (MS949) is a Microsoft extension of EUC-KR that expands coverage to all 11,172 modern Hangul syllables by adding additional byte ranges (0x81–0xFE for the first byte). Modern systems should use UTF-8; CP949 remains relevant for legacy Windows files.
When should I use Base64url instead of standard Base64?
Standard Base64 uses the + and / characters, which are not safe in URLs or filenames. Base64url replaces + with - and / with _ and makes the = padding optional. Use Base64url whenever encoding data for JWTs, OAuth tokens, URL query parameters, or filenames. Python's base64.urlsafe_b64encode() handles the character substitution automatically (the padding must still be stripped by hand).
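For illustration, here is an input chosen so its Base64 encoding needs both problem characters:

```python
import base64

raw = b"\xfb\xff"  # two bytes whose Base64 output contains '+' and '/'

print(base64.b64encode(raw))          # b'+/8='  -- unsafe in a URL
print(base64.urlsafe_b64encode(raw))  # b'-_8='  -- URL/filename safe

# JWT segments additionally drop the '=' padding (helper name is ours):
def b64url_nopad(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

print(b64url_nopad(raw))              # -_8
```

To decode an unpadded value, re-append "=" until the length is a multiple of 4 before calling base64.urlsafe_b64decode().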
What is the difference between encodeURI and encodeURIComponent in JavaScript?
encodeURI is designed to encode a complete URL and preserves characters like : / ? # & = that have structural meaning in URLs. encodeURIComponent encodes everything except letters, digits, and - _ . ! ~ * ' ( ), making it safe for encoding individual query parameter values or path segments. Always use encodeURIComponent when encoding user-provided values.
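Python's closest analog is urllib.parse.quote, whose safe parameter controls which characters survive unescaped; with safe="" it behaves much like encodeURIComponent (the two differ on a few marks such as ! * ' ( )). A sketch of both modes:

```python
from urllib.parse import quote

value = "서울&강남"   # user-provided query value containing '&'

# Roughly encodeURIComponent: escape everything but unreserved characters.
component = quote(value, safe="")
print(component)   # %EC%84%9C%EC%9A%B8%26%EA%B0%95%EB%82%A8

# Roughly encodeURI: leave URL-structural characters intact.
full = quote("https://example.com/search?q=" + value,
             safe=":/?#[]@!$&'()*+,;=")
print(full)
# Note: the '&' inside value survives here unescaped, which is exactly
# why user values must be encoded with the component form first.
```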
What are Unicode normalization forms NFC and NFD?
Unicode allows some characters to be represented in multiple ways. The Korean syllable "가" can be stored as a single code point U+AC00 (NFC, composed form) or as two separate code points, the conjoining jamo U+1100 (ㄱ) + U+1161 (ㅏ) (NFD, decomposed form). NFC is compact and preferred for storage and transmission; NFD is used in macOS filenames and some legacy systems. Always normalize to the same form before comparing strings.
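Python's standard unicodedata module makes the two forms easy to inspect:

```python
import unicodedata

nfc = "가"                                   # one code point, U+AC00
nfd = unicodedata.normalize("NFD", nfc)      # decomposes to jamo

print([f"U+{ord(c):04X}" for c in nfd])      # ['U+1100', 'U+1161']
print(nfc == nfd)                            # False: different sequences
print(unicodedata.normalize("NFC", nfd) == nfc)  # True once normalized
```

This is why a filename copied from macOS (NFD) can fail an equality check against the "same" NFC string: normalize both sides first.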
What is a surrogate pair and when does it occur?
UTF-16 represents characters with code points above U+FFFF (outside the Basic Multilingual Plane, such as emoji and rare CJK characters) using two 16-bit code units called a surrogate pair. The high surrogate is in range 0xD800–0xDBFF and the low surrogate is in 0xDC00–0xDFFF. JavaScript strings use UTF-16 internally, so emoji like 😀 occupy 2 positions in a string and require string.codePointAt() instead of charCodeAt() for correct processing.
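The surrogate values can be derived arithmetically: subtract 0x10000, then split the remaining 20 bits into two halves of 10. A small Python sketch of the calculation (the helper name is illustrative):

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high/low surrogates."""
    if cp <= 0xFFFF:
        raise ValueError("BMP code points need no surrogate pair")
    v = cp - 0x10000             # 20 payload bits remain
    high = 0xD800 + (v >> 10)    # top 10 bits
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F600)   # 😀
print(hex(high), hex(low))            # 0xd83d 0xde00

# UTF-16 really stores those two 16-bit units:
assert "😀".encode("utf-16-be") == b"\xd8\x3d\xde\x00"
```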
How does Punycode encoding work for internationalized domain names?
Internationalized domain names (IDN) allow non-ASCII characters in domain names, but DNS itself only supports ASCII. Punycode converts each Unicode label to an ASCII-compatible encoding (ACE) prefixed with xn--. For example, "한글.kr" becomes "xn--bj0bj06e.kr". Python's str.encode("idna") performs this conversion. Be aware that visually similar Unicode characters can be used in homograph phishing attacks, which is why browsers display the Punycode form for suspicious domains.
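The conversion from the example above, using Python's built-in idna codec (which implements the older IDNA 2003 rules; the third-party idna package implements IDNA 2008):

```python
domain = "한글.kr"

# Label-by-label: nameprep, then Punycode with the xn-- ACE prefix.
ace = domain.encode("idna")
print(ace)                         # b'xn--bj0bj06e.kr'

# The raw Punycode step alone, without the prefix:
print("한글".encode("punycode"))   # b'bj0bj06e'

# Decoding reverses the conversion:
assert ace.decode("idna") == domain
```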