Remembering Text Encoding
July 03, 2020
Text encoding is one of those things that I engage with rarely enough that I can never quite remember the details of how it works. I enjoyed hearing Surma and Jake explain some of the details on HTTP 203 recently, and decided it would be useful to outline some of the important points of text encoding for my own future reference.
At a fundamental level, computers can only really deal with numbers, represented as sequences of 1s and 0s (binary data). This means that in order to represent any sequence of characters (i.e., text), we must agree on a mapping, whereby each unique character is assigned a number that the computer can use to represent it.
ASCII is not the first such mapping, but it is a reasonably old one (dating from the early 1960s), containing 128 unique code points. The regular, unaccented characters required by American English start with the “space” character at 32. Code points below this are unprintable control characters.
Since you only need 7 bits to represent 128 code points, and most computers deal in 8-bit bytes, there was room for an additional 128 code points, but no universally agreed upon standard for how to use them.
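To make the 7-bit limit and the "no agreed standard" problem concrete, here is a minimal Python sketch (Python is my choice for illustration, and the helper name `ascii_code` is hypothetical). It relies only on Python's built-in `ascii`, `latin-1` and `cp437` codecs, the latter two being just two of the many competing uses of the upper 128 values:

```python
def ascii_code(ch: str) -> int:
    """Return the number that the 7-bit mapping assigns to one character."""
    return ch.encode("ascii")[0]

space = ascii_code(" ")      # 32, the first printable character
letter_a = ascii_code("A")   # 65

# The largest of the 128 code points is 127, which fits in 7 bits.
bits_needed = (127).bit_length()

# Characters outside those 128 code points cannot be encoded at all.
try:
    "é".encode("ascii")
    accented_ok = True
except UnicodeEncodeError:
    accented_ok = False

# The same byte above 127 means different things under different
# 8-bit extensions, which is exactly the "no standard" problem.
high_byte = bytes([0xE9])
as_latin1 = high_byte.decode("latin-1")  # "é"
as_cp437 = high_byte.decode("cp437")     # a different character entirely

print(space, letter_a, bits_needed, accented_ok, as_latin1, as_cp437)
```

The point of the last few lines is that a file full of bytes above 127 is ambiguous unless you also know which extension its author had in mind.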
Unicode is a standard that aims to create a mapping for every possible character in every conceivable language. Unicode code points are numbers that each represent an agreed-upon character. They are conventionally written in hexadecimal with a U+ prefix, and look like this: U+1F3BE. Unicode is not opinionated about how these numbers are stored on disk or in memory.
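As a quick way to poke at code points, Python's built-in `ord` and `chr` convert between a character and its number. This is only a sketch, and the helper name `code_point_label` is hypothetical:

```python
def code_point_label(ch: str) -> str:
    """Format one character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

tennis = chr(0x1F3BE)                   # build the character from its number
tennis_label = code_point_label(tennis)
letter_label = code_point_label("A")

print(tennis, tennis_label)  # 🎾 U+1F3BE
print("A", letter_label)     # A U+0041
```

Note that nothing here says how many bytes the character occupies. `ord` returns an abstract number, which matches the way Unicode itself stays out of storage decisions.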
A UTF (Unicode Transformation Format) is an “algorithmic mapping from every Unicode code point to a unique byte sequence”. This is the part that determines how those Unicode numbers are represented on disk or in memory.
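There are several variants of UTF. The most common on the web today is UTF-8, which stores the code points from 0-127 in a single 8-bit byte and encodes code points beyond that using two, three or four bytes. (The original UTF-8 design allowed sequences of up to 6 bytes, but the current standard caps it at 4, which is enough to cover every Unicode code point.)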
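To see the variable width for myself, a small Python sketch helps (the helper name `utf8_bytes` and the sample characters are my own picks). It uses the standard `str.encode("utf-8")`, and `bytes.hex` with a separator, which needs Python 3.8 or later:

```python
def utf8_bytes(ch: str) -> str:
    """Return the UTF-8 encoding of one character as spaced hex bytes."""
    return ch.encode("utf-8").hex(" ")

samples = ["A", "é", "€", chr(0x1F3BE)]

# Map each character to the number of bytes UTF-8 uses for it.
widths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch in samples:
    print(f"U+{ord(ch):04X} -> {widths[ch]} byte(s): {utf8_bytes(ch)}")
# U+0041 -> 1 byte(s): 41
# U+00E9 -> 2 byte(s): c3 a9
# U+20AC -> 3 byte(s): e2 82 ac
# U+1F3BE -> 4 byte(s): f0 9f 8e be
```

The first line is the useful part to remember: anything in the 0-127 range encodes to the same single byte it always had, so plain English text is byte-for-byte identical in the old 7-bit mapping and in UTF-8.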