Forget ISO-8859-1

January 16, 2017

UTF-8 was first presented in 1993. One would assume that 24 years is enough to time for it to become ubiquitous, especially given that the Internet is global. ASCII doesn’t even cover French letters, not to mention Cyrillic or Devanagari (the Hindi script). That’s why ASCII was replaced by ISO-8859-1, which kind of covers most of western languages’ orthographies.

88.3% of the websites use UTF-8. That’s not enough, but let’s assume these 11.7% do not accept any input and are just English-language static websites. The problem of the still-pending adoption of UTF-8 is how entrenched ASCII/ISO-8859-1 is. I’ll try to give a few examples:

  • UTF-8 isn’t the default encoding in many core Java classes. FileReader, for example. It’s similar for other languages and runtimes. The default encoding of these Java classes is the JVM default, which is most often ISO-8859-1. It is allegedly taken from the OS, but I don’t remember configuring any encoding on my OS. Just locale, which is substantially different.
  • Many frameworks, tools and containers don’t use UTF-8 by default (and don’t try to remedy the JVM not using UTF-8 by default). Tomcat’s default URL encoding I think is still ISO-8859-1. Eclipse doesn’t make the files UTF-8 by default (on my machine it’s somethings even windows-1251 (Cyrillic), which is horrendous). And so on. I’ve asked for having UTF-8 as default in the past, and I repeat my call
  • Regex examples and tutorials always give you the [a-zA-Z0-9]+ regex to “validate alphanumeric input”. It is built-in in many validation frameworks. And it is so utterly wrong. This is a regex that must never appear anywhere in your code, unless you have a pretty good explanation. Yet, the example is ubiquitous. Instead, the right regex is [\p{Alpha}0-9]+. Using the wrong regex means you won’t be able to accept any special character. Which is something you practically never want. Unless, probably, due to the next problem.
  • Browsers have issues with UTF-8 URLs. Why? It’s complicated. And it almost works when it’s not part of the domain name. Almost, because when you copy the URL, it gets screwed (pardon me – encoded).
  • Microsoft Excel doesn’t work properly with UTF-8 in CSV. I was baffled to realize the UTF-8 CSVs become garbage. Well, not if you have a BOM (byte order mark), but come on, it’s [the current year].

As Jon Skeet rightly points out – we have issues with the most basic data types – strings, numbers and dates. This is partly because the real world is complex. And partly because we software engineers tend to oversimplify it. This is what we’ve done with ASCII and other latin-only encodings. But let’s forget ASCII and ISO-8859-1. It’s not even okay to call them “legacy” after 24 years of UTF-8. After 24 years they should’ve died.

Let’s not give regex examples that don’t work with UTF-8, let’s not assume any default different than UTF-8 is a good idea and let’s sort the URL mess.

Maybe I sound dogmatic. Maybe I exaggerate because my native script is non-latin. But if we want our software to be global (and we want that, in order to have a bigger market), then we have to sort our basic encoding issues. Having UTF-8 as a standard is not enough. Let’s forget ISO-8859-1.

Share Button

3 Responses to “Forget ISO-8859-1”

  1. I live and work in a country that uses the Roman alphabet (Spain) and that probably makes it worse because stuff “just works”—well, at least most of the time, so you don’t have to care about anything, let’s just finish the project and we will “fix accents” later. In 2017 I still see now and then the typical misencodings caused by the MS-DOS 850 codepage.

    An alarming amount of IT professionals do not really understand or care about how text is stored in computers and some of them build major programs and libraries.

  2. Having an ö-Umlaut in my name (Köhler) I already got used to see encoding errors around the Internet. The thing is that i see Köhler way more than Khler which shows that at some point it was encoded as UTF-8 and later was interpreted as ISO-8859-1.

  3. Couldn’t agree more!

    Note that Unicode allows other digits besides 0-9. To be completely general the regex should be [\p{L}\p{N}]+

Leave a Reply