If you're building a web application or software for an international audience that speaks and reads languages other than English, then utf8 is one of the character sets you must know about.
I was working with a MySQL database and wondered what the differences are between the collations utf8_unicode_ci and utf8_general_ci.
This is perhaps the best explanation and comparison I've found, from the MySQL forums:
utf8_general_ci is a very simple collation. For each character, it just
- removes all accents
- then converts to upper case
and uses the code of the resulting “base letter” for the comparison.
For example, these Latin letters: ÀÁÅåāă (and all other Latin letters “a” with any accents and in any cases) are all compared as equal to “A”.
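To make the two steps above concrete, here is a minimal Python sketch that approximates them with the standard unicodedata module. It is only an illustration, not how MySQL implements the collation: the function name general_ci_like_key is hypothetical, and MySQL uses its own weight table rather than Unicode decomposition.

```python
import unicodedata


def general_ci_like_key(text: str) -> str:
    """Approximate the 'strip accents, then upper-case' idea (illustrative only)."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (the accents), keeping only base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case so that upper and lower case compare as equal.
    return stripped.upper()


variants = "ÀÁÅåāă"
keys = {general_ci_like_key(ch) for ch in variants}
print(keys)  # {'A'}
```

All six letters collapse to the same key, which is why they compare as equal under this kind of collation.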
utf8_unicode_ci uses the default Unicode collation element table (DUCET).
The main differences are:
1. utf8_unicode_ci supports so-called expansions and ligatures, for example:
German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near “ss”.
Letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near “OE”.
utf8_general_ci does not support expansions or ligatures. It sorts all these letters as single characters, and sometimes in the wrong order.
2. utf8_unicode_ci is *generally* more accurate for all scripts.
For example, in the Cyrillic block:
utf8_unicode_ci is fine for all these languages:
Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian.
utf8_general_ci, by contrast, is fine only for the Russian and Bulgarian subset of Cyrillic.
The extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted accurately.
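The expansion behaviour from point 1 can be sketched in Python as follows. This is a toy model under stated assumptions, not MySQL's code: the tables EXPANSIONS and SINGLE and both key functions are hypothetical names, and they cover only the letters mentioned above. The single-character mapping of ß to "s" follows what the MySQL manual documents for utf8_general_ci; real utf8_unicode_ci works from Unicode collation weights rather than string substitution.

```python
# Hypothetical toy tables covering only the letters discussed above.
EXPANSIONS = {"ß": "ss", "Œ": "OE", "œ": "oe"}  # one character -> several
SINGLE = {"ß": "s"}                             # one character -> one


def expanding_key(text: str) -> str:
    """Sort key that expands ß and Œ, in the spirit of utf8_unicode_ci."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in text).lower()


def single_char_key(text: str) -> str:
    """Sort key that treats ß as a single letter, in the spirit of utf8_general_ci."""
    return "".join(SINGLE.get(ch, ch) for ch in text).lower()


# With expansions, "Straße" and "Strasse" get the same key.
print(expanding_key("Straße") == expanding_key("Strasse"))      # True
# Without expansions, "Straße" is treated like "Strase" instead.
print(single_char_key("Straße") == single_char_key("Strasse"))  # False
print(single_char_key("Straße") == single_char_key("Strase"))   # True
```

Passing either function as the key argument to sorted() shows how the same word list can come out in a different order under each approach.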
The disadvantage of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci.
So when you need a better sorting order, use utf8_unicode_ci,
and when performance is your overriding concern, use utf8_general_ci.