TechHui

Hawaiʻi's Technology Community

Software developers of the world - please stick to UTF-8 encoding from your UIs to your DBs. I'm starting to favor capital punishment for those who use the unholy encodings (i.e., anything other than UTF-8). If I had a dime for every hour I've spent dealing with supposedly i18n-ready software that monkeys up characters at some point in the pipeline... I suppose it's OK to use other encodings if you prefix the file with
   
THIS_FILE_USES_THE_UNHOLY_ENCODING_(encoding)_(file name)

Some developers use ASCII instead of UTF-8 because they think it will save space. They say, "I don't want to use a double-byte encoding because most of the content is English." Actually, UTF-8 is a variable byte-length encoding that uses only one byte for ASCII, so you don't have to worry about this. It will store your ASCII characters efficiently and ensure Chinese characters don't blow up your system :-) Whether you are deciding on the encoding for your DB, HTML or XML config file, if you find yourself asking, "What encoding should I use?" the answer is always UTF-8!
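A quick sketch of the byte-width point, in Java (class and method names are mine, for illustration): ASCII text really does cost one byte per character in UTF-8, while CJK characters take three.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Width {
    // How many bytes a string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes("Aloha")); // ASCII: 1 byte per char -> 5
        System.out.println(utf8Bytes("中文"));   // each CJK char is 3 bytes -> 6
    }
}
```

So for mostly-English content, UTF-8 costs exactly what ASCII would.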

I feel better now. Going to my happy place...


Replies to This Discussion

What's your happy place? A lemur-filled cavern?

I think part of the blame resides with how most languages/OSes don't decouple these sorts of issues from the developer. They fall into the category of things you shouldn't have to worry about, like what brand of mouse the user is operating (or, really, what input devices they are or aren't using at all).

Just program in Go ;) that'll fix everything.
What's your happy place? A lemur-filled cavern?

That's pretty close. It's Manoa, but with lemurs and software developers who only use UTF-8.

I think part of the blame resides with how most languages/OSes don't decouple these sorts of issues from the developer. They fall into the category of things you shouldn't have to worry about, like what brand of mouse the user is operating (or, really, what input devices they are or aren't using at all).

Part of the problem is that so much software comes from the monolingual US where programmers live in blissful ignorance of other writing systems. The unholy encodings are prevalent everywhere from the world's most popular IDE (Eclipse) to most text editors. They all default to antiquated encodings such as the OS-specific Cp1250 and MacRoman. This serves to perpetuate the encoding problems that have plagued us since the first computers showed up outside the US. In this day and age everything should default to UTF-8.
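The damage those antiquated defaults cause is easy to reproduce. A minimal sketch (class and method names are mine): encode text as UTF-8 but decode it with a legacy single-byte charset, and you get the classic mojibake every mis-configured editor produces.

```java
import java.nio.charset.StandardCharsets;

public class Mojibake {
    // Encode with one charset, decode with another -- the classic mojibake recipe
    // that a wrong editor/IDE default silently applies to your files.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(misdecode("é")); // prints "Ã©"
    }
}
```

One round trip through the wrong default and "é" becomes "Ã©"; a few more and the text is unrecoverable.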

Microsoft has a long and distinguished history of screwing up encodings. For years Outlook couldn't handle subject lines in Japanese. As a result it became common practice in Japan to write subject lines in roman characters, which isn't particularly natural or readable. Java people can't be smug. To this day javac defaults to the platform encoding rather than UTF-8. How do you write Mika's name portably in Java source? \u5b9f\u4f73. Of course you can use the -encoding switch, but a non-UTF-8 default is just plain wrong, especially given that UTF-8 transparently handles ASCII.
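To make the \u escape point concrete (compiled with `javac -encoding UTF-8` so the literal survives): the compiler turns \uXXXX escapes into the same characters as the literal, which is why escapes are the only portable fallback when you can't trust the compiler's default encoding.

```java
public class Escapes {
    public static void main(String[] args) {
        // Unicode escapes are processed by the compiler itself, so this
        // works even when javac runs with a non-UTF-8 default encoding.
        String escaped = "\u5b9f\u4f73";
        String literal = "実佳"; // safe only if the source file is read as UTF-8
        System.out.println(escaped.equals(literal)); // prints "true"
    }
}
```

Readable to exactly nobody, which is the point: the escapes are a workaround for a bad default, not a feature.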
I'm pretty sure email with different encodings for the subject line and body (neither of which is UTF-8) was somehow introduced to Japan by Kim Jong-il.
I agree completely, Dan - more or less what I was trying (poorly) to say: the plethora of English/ASCII-centric APIs, OS roots, protocol roots (DNS, anyone?), file systems, languages, etc. has created this situation.

Even today on non-English-language sites where EVERYTHING is in, say, Japanese - the "file names" (more properly, the targets of the URL) are in English (or some romanized form).

Since people have a tendency to ignore the problem and just sorta pray/hope it goes away, or that their software won't have to deal with it - I really believe it has to be resolved at deeper OS/language levels. In Flex, for example, there's the extra step of localizing your app using localization maps, etc.

This is stupid! All modern HLLs should take care of the details for you - done cleanly through smart dev tools/implementations such that, say, if I'm a Java developer and all I "ever" want to write are English-language apps, then let me pick a default locale and easily add the relevant character strings - but don't make me jump through hoops to "localize" my app.

Of course.. the code will still be in ASCII ;) Care to open that door?
Brian: This is stupid! All modern HLLs/etc should take care of the details for you.

Exactly. The problem is lazy programmers. It's not good enough for text editors, IDEs, consoles, DBs, etc. to support UTF-8. They need to start defaulting to UTF-8. People are lazy. They will use the defaults 99% of the time. If the defaults don't change, the encoding nightmare will never end.
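Until the defaults change, the defensive habit is to never rely on them. A minimal sketch (names mine): always pass the charset explicitly at every I/O boundary.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class Utf8IO {
    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("demo", ".txt");
        // Name the charset explicitly; never lean on the platform default.
        Files.write(p, Arrays.asList("Hawaiʻi"), StandardCharsets.UTF_8);
        String back = Files.readAllLines(p, StandardCharsets.UTF_8).get(0);
        System.out.println(back.equals("Hawaiʻi")); // round-trips cleanly
        Files.delete(p);
    }
}
```

Grep your codebase for `new String(bytes)` and `getBytes()` with no charset argument - each one is a latent encoding bug waiting for a machine with a different default.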

Note: I'm sure someone will bring this up, so I'll preempt the argument by saying that in some cases, such as manipulation of Strings in memory, it makes sense to use a simpler encoding such as UTF-16 internally for performance reasons. Java and .NET do this. When I say all text should be encoded in UTF-8 I'm referring to inputs and outputs. It doesn't matter what happens internally.
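The internal-vs-wire distinction is visible from plain Java (example mine): a String is UTF-16 under the hood, so a character outside the Basic Multilingual Plane occupies two `char`s in memory but four bytes once serialized as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Internals {
    public static void main(String[] args) {
        String s = "𝕏"; // U+1D54F, outside the BMP
        System.out.println(s.length());                      // 2: a UTF-16 surrogate pair in memory
        System.out.println(s.codePointCount(0, s.length())); // 1: one actual character
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4: bytes on the wire
    }
}
```

The in-memory representation is the runtime's business; what crosses a file, socket, or DB boundary is where UTF-8 matters.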

Brian: Of course.. the code will still be in ASCII ;) Care to open that door?

Are you referring to the use of non-ASCII characters for things like keywords and operators? There is plenty of precedent. The DARPA-funded experimental language Fortress makes full use of Unicode. Professor Philip Johnson from the University of Hawaii worked on this project. There are also internationalized languages, such as ALGOL, that support the use of non-English keywords. There are programming languages from Russia that use Russian keywords written in Cyrillic.

Even in languages that use only ASCII characters for keywords, the source should be UTF-8 to support the use of non-ASCII characters in Strings, comments, etc. English comments are of limited use to Chinese programmers who don't speak English (or speak it only as a second language).
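Java already goes further than comments and strings - identifiers can be non-ASCII too, provided the source is read as UTF-8. A small sketch (identifiers mine):

```java
public class UnicodeSource {
    public static void main(String[] args) {
        // Java identifiers may use any characters that
        // Character.isJavaIdentifierStart/Part accept, which includes CJK,
        // so Japanese variable names compile fine -- if the source is UTF-8.
        String 名前 = "実佳"; // 名前 means "name"
        System.out.println(名前);
    }
}
```

Compile with `javac -encoding UTF-8` and it works; compile with a legacy default and the identifier itself gets mangled, which is the whole argument in one file.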
I was referring more to keywords/operators. That's interesting to see there are some languages that support this.

Of course the code itself should be in UTF-8 rather than actual ASCII (though typically only the ASCII-compatible range gets used for the code itself).

(Massive segue inc..)

Thinking this whole topic through a few steps, you realize how clumsy our efforts remain to capture our business logic and transform it into actual "code" - or machine input in general. I realize this on my phone - this whole "pushing buttons" paradigm really starts to break down.

Likewise with software development: the objective is to create a machine-grokkable representation of a workflow/logic pattern that is locked within our bony skulls. Anecdotally, I think there's often a backlash against things like making code by drawing pictures and connecting dots. I was reminded of this when reading forum comments about Google's App Inventor. TechCrunch even had an article asking whether it was a Gateway Drug or a Doomsday Device. "Oh noes, myriad crap apps coming along." Yeah, because, uh, there is so much quality software out there already ;)

[Going on the record to say that 99% of software sucks]

I wonder what the impact has been over the past decades of computing being English-dominant? For European countries I don't think it's been a big deal, as their writing systems are similar - both conceptually and structurally - and practically you can support many of them by just adding a few special characters.

Bringing this full circle.. getting back to your adventures in Japanese word segmentation.

In Japanese and Chinese there are (and correct me if I'm wrong) no clear word boundaries - so you need to find them by identifying known patterns of how words are structured. Likewise, in languages such as Lao and Khmer there may be phrase segmentation but no word segmentation per se.
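Java ships a rough tool for exactly this - a sketch (class and method names mine), with the caveat that segmentation quality for Japanese varies a lot by JDK version and its underlying break data:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class Segmenter {
    // Collect candidate word-boundary offsets for a locale.
    // Quality varies by JDK; dedicated libraries (e.g. ICU) do much better for CJK.
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<Integer> out = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            out.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        // "私は学生です" = "I am a student" -- no spaces anywhere in the text.
        System.out.println(boundaries("私は学生です", Locale.JAPANESE));
    }
}
```

The boundary list always brackets the full text (offset 0 to its length); everything in between is the hard, dictionary-and-heuristics part you're describing.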

So I wonder, for speakers of those languages - or even more unusual ones such as Khoe, Wajarri or Sandawe - would they easily understand how to program (from a conceptual standpoint) in our "modern" languages? Does the concept that you would have a keyword and throw it some variables even make sense to them?

I don't have any of these answers, but it does seem to highlight how far we have to go before we can really have computing for everyone. My hope is that 10-20 years from now we will have completely solved all these problems along with clean water for all and ethnic/religious harmony ;) But seriously, I think we will look back at objections to "drawing" programs into life to be as ridiculous as those (still, but infrequently) given to intermediate languages/virtual runtimes.. etc.

And now I've completely derailed your conversation ;)
Ha, localization "controversy"? But that NEVER happens! The funny thing about this conversation is that this is not the first time I've heard it, but I've never actually seen anyone defend using ASCII as a default other than in hyperbole. Every conversation in every forum I've ever watched on this topic seems to reach universal consensus that UTF-8 should be the standard and the default, and yet it's still not.

Not to besmirch any major software conglomerates that might or might not be signing my paycheck, but it amazes me how often localization-type issues are forgotten, especially when said corporations employ such a culturally diverse set of employees.
@Clifford Well said. Everyone agrees it should be done, but even the software giants continue to use antiquated encodings. It is maddening.


© 2017   Created by Daniel Leuck.