Remembering Text Encoding

July 03, 2020

Text encoding is one of those things that I engage with rarely enough that I can never quite remember the details of how it works. I enjoyed hearing Surma and Jake explain some of the details on HTTP 203 recently, and decided it would be useful to outline some of the important points of text encoding for my own future reference.

There are many great resources out there that cover text encoding in depth. With this, I’m only really intending to jog my own memory on the basics every time I forget them.

Why encode text?

At a fundamental level, computers can only really deal with numbers, represented as sequences of 1s and 0s (binary data). This means that in order to represent any sequence of characters (i.e., text), we must agree on a mapping, whereby each unique character is assigned a number that the computer can use to represent it.
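As a quick memory aid, here is a minimal sketch of that mapping in JavaScript. The sample string and variable names are my own; `codePointAt` and `String.fromCodePoint` are standard string methods.

```javascript
// Look up the number assigned to each character, then map back again.
const text = "Hi!";

// Array.from walks the string one character at a time;
// codePointAt(0) returns the number agreed upon for that character.
const numbers = Array.from(text, (ch) => ch.codePointAt(0));
console.log(numbers); // 72, 105, 33

// The same numbers as the 1s and 0s the computer actually stores.
const bits = numbers.map((n) => n.toString(2).padStart(8, "0"));
console.log(bits); // "01001000", "01101001", "00100001"

// String.fromCodePoint reverses the mapping.
const roundTrip = String.fromCodePoint(...numbers);
console.log(roundTrip); // "Hi!"
```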

American Standard Code for Information Interchange (ASCII)

ASCII is not the first such mapping, but it is a reasonably old one (first published in 1963), containing 128 unique code points. The printable characters needed for American English start with the “space” character at 32 and run up to “~” at 126. Code points below 32, along with 127 (delete), are unprintable control characters.

Since you only need 7 bits to represent 128 code points, and most computers deal in 8-bit bytes, there was room for an additional 128 code points, but no universally agreed upon standard for how to use them.
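To make the 7-bit limit concrete, here is a small sketch. `isAscii` is a hypothetical helper of my own, not a built-in; it just checks that every character's number fits in the range 0-127.

```javascript
// Hypothetical helper: true when every character fits in ASCII's 7 bits (0-127).
function isAscii(str) {
  for (const ch of str) {
    if (ch.codePointAt(0) > 127) return false;
  }
  return true;
}

console.log(" ".codePointAt(0)); // 32, the "space" character
console.log((127).toString(2)); // "1111111", the largest value 7 bits can hold
console.log((255).toString(2)); // "11111111", the largest value in an 8-bit byte

console.log(isAscii("tennis")); // true
console.log(isAscii("caf\u00E9")); // false: "é" is 233, which needs the 8th bit
```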

Unicode

Unicode is a standard that aims to create a mapping for every possible character in every conceivable language. Each Unicode code point is a number assigned to an agreed-upon character, conventionally written in hexadecimal with a U+ prefix, like this: U+1F3BE. Unicode is not opinionated about how these numbers are stored on disk or in memory.
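The U+ notation is just a number written in hexadecimal. A rough sketch of converting a character to that notation follows; `toUPlus` is a hypothetical helper name, and padding to at least four hex digits follows the usual convention.

```javascript
// Hypothetical helper: format a character's code point in U+ notation.
function toUPlus(ch) {
  const hex = ch.codePointAt(0).toString(16).toUpperCase();
  // By convention, code points are written with at least four hex digits.
  return "U+" + hex.padStart(4, "0");
}

console.log(toUPlus("A")); // "U+0041"
console.log(toUPlus("\u{1F3BE}")); // "U+1F3BE"

// The hex digits are an ordinary number underneath.
console.log(0x1F3BE); // 127934
```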

Unicode Transformation Format (UTF)

A UTF is an “algorithmic mapping from every Unicode code point to a unique byte sequence”. This is the part that determines how those Unicode numbers are represented on disk or in memory.

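There are several variants of UTF. The most common on the web today is UTF-8, which stores the code points from 0-127 in a single 8-bit byte (so plain ASCII text is already valid UTF-8) and uses two to four bytes for code points beyond that. The original design allowed sequences of up to 6 bytes, but the current standard caps them at 4.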
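A sketch of inspecting the UTF-8 bytes for a few characters, assuming an environment with the standard `TextEncoder` (modern browsers and Node.js both have it, and it always encodes to UTF-8). `utf8Bytes` is a hypothetical helper name.

```javascript
const encoder = new TextEncoder(); // TextEncoder only ever produces UTF-8

// Hypothetical helper: the UTF-8 bytes of a string as two-digit hex strings.
function utf8Bytes(str) {
  return Array.from(encoder.encode(str), (b) => b.toString(16).padStart(2, "0"));
}

console.log(utf8Bytes("A")); // "41": one byte, identical to ASCII
console.log(utf8Bytes("\u00E9")); // "c3", "a9": two bytes for "é"
console.log(utf8Bytes("\u20AC")); // "e2", "82", "ac": three bytes for "€"
console.log(utf8Bytes("\u{1F3BE}")); // "f0", "9f", "8e", "be": four bytes for U+1F3BE
```

The further a code point is from the ASCII range, the more bytes it takes, which is why English-heavy text stays compact in UTF-8.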

UTF-16 is another variant, which represents each code point with either one or two 16-bit units. JavaScript strings are sequences of UTF-16 code units.
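This is the detail that tends to bite in practice, so here is a short sketch of how the 16-bit units show up in JavaScript. The variable names are my own; `length`, `charCodeAt`, and string iteration are standard.

```javascript
const ball = "\u{1F3BE}"; // a single code point, U+1F3BE

// length counts 16-bit units, not characters.
console.log(ball.length); // 2

// Iterating a string (here via spread) walks code points instead.
console.log([...ball].length); // 1

// The two 16-bit units that together represent U+1F3BE.
const units = [ball.charCodeAt(0), ball.charCodeAt(1)].map((u) => u.toString(16));
console.log(units); // "d83c", "dfbe"

// Characters in the lower range still fit in a single unit.
console.log("A".length); // 1
```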

Bonus

JavaScript character escape sequences allow you to use Unicode code points directly in strings:

'\u{1F3BE}'
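For future reference, a sketch of three ways to write that same string. The variable names are mine; the older `\uXXXX` form only takes four hex digits, so a code point this high has to be spelled as two 16-bit units.

```javascript
const byCodePoint = "\u{1F3BE}"; // code point escape (ES2015 and later)
const byUnits = "\uD83C\uDFBE"; // the same character as two 16-bit unit escapes
const byFunction = String.fromCodePoint(0x1F3BE); // built from the number at runtime

console.log(byCodePoint === byUnits); // true
console.log(byCodePoint === byFunction); // true
```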