Topic: Locale-based sorting

Uffe · « **on:** October 12, 2015, 09:11:07 pm »

Hi all,

Unicode support is coming along and that's nice. So let's make it even nicer and add locale support too.

A problem with the Unicode table is that outside of the 26 characters of the rather impoverished English alphabet, the characters have just been dumped in likely-looking piles. And fair enough, really, because the additional characters are treated differently in different languages; sometimes they're proper letters in their own right, sometimes they're just diacritic variations.

But that should be taken into consideration when sorting things in EA's project browser, search results, etc. Right now, everything's sorted in Unicode code point order, and that's not the same as alphabetical order, which is what people expect.

There are two issues here. The first is that letters with diacritical marks (accents etc) should be sorted with the unmarked letters. Right now, they all end up after Z because that's where they are in the Unicode table.
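
To illustrate the first issue, here is a minimal Python sketch (not EA's code; the names `strip_marks` and `base_letter_key` are made up for the example). It compares a plain code point sort with one that sorts marked letters together with their unmarked base letters, using the standard `unicodedata` module to drop combining marks.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Return text with combining marks removed, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def base_letter_key(text: str):
    # Primary key: the base letters only.
    # Secondary key: the original text, so 'e' and 'é' still get a stable order.
    return (strip_marks(text), text)


names = ["zebra", "éclair", "apple", "eel", "écu"]

# Code point order: everything starting with 'é' lands after 'z'.
by_code_point = sorted(names)

# Base-letter order: 'é' is sorted together with 'e'.
by_base_letter = sorted(names, key=base_letter_key)

print(by_code_point)
print(by_base_letter)
```

Note that this folding also turns å, ä and ö into a and o, which is only right for locales where those are diacritic variations. It is exactly what should not happen for the Scandinavian letters discussed next.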

The other is that proper letters should be sorted into their correct alphabetical positions.

Scandinavian languages have more letters than English, appearing at the end of the alphabet. These are proper letters, not umlauts as in German. There are two main Scandinavian alphabets, Danish/Norwegian and Swedish, and just for fun, the characters are not identical in the two, nor are they sorted in the same order.

The Danish/Norwegian alphabet ends with Z Æ Ø Å, the Swedish with Z Å Ä Ö. EA will sort everything by Unicode code point, and the following table shows you how that clashes with expectations.

Code point (decimal, upper/lower)	Character	Position in DK/NO alphabet	Position in SE alphabet
196/228	Ä/ä	--	28
197/229	Å/å	29	27
198/230	Æ/æ	27	--
214/246	Ö/ö	--	29
216/248	Ø/ø	28	--
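
The positions in the table can be turned into a per-locale sort key. The following Python sketch assumes nothing about EA's internals; the locale labels `da_no` and `sv` and the helper `make_key` are hypothetical names chosen for the example, and the alphabets are just the 26 English letters plus the three extra letters listed above.

```python
BASE = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical locale labels; the alphabets follow the post above.
ALPHABETS = {
    "da_no": BASE + "æøå",  # Danish/Norwegian: ... Z Æ Ø Å
    "sv": BASE + "åäö",     # Swedish:          ... Z Å Ä Ö
}


def make_key(locale_name: str):
    """Build a sort key function that ranks letters by alphabet position."""
    rank = {ch: i for i, ch in enumerate(ALPHABETS[locale_name])}

    def key(text: str):
        # Letters outside the alphabet sort after it, ordered by code point.
        return [(rank.get(ch, len(rank)), ch) for ch in text.casefold()]

    return key


danish_letters = ["å", "z", "ø", "æ"]
swedish_letters = ["ö", "z", "ä", "å"]

print(sorted(danish_letters))                          # code point order
print(sorted(danish_letters, key=make_key("da_no")))   # Danish/Norwegian order
print(sorted(swedish_letters))                         # code point order
print(sorted(swedish_letters, key=make_key("sv")))     # Swedish order

# 'å' is letter 29 in Danish/Norwegian but letter 27 in Swedish.
print(ALPHABETS["da_no"].index("å") + 1, ALPHABETS["sv"].index("å") + 1)
```

A real implementation would take its tailorings from a collation library rather than hand-written strings; the point here is only that the same character needs a different rank per locale.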

With Finnish I'm not entirely sure, but I think you can treat it the same as Swedish. "Å" (197/229) isn't part of Finnish proper, but there is a Swedish-speaking minority in Finland and Swedish is an official language there.

So what it all boils down to is that the same Unicode character has different meanings in different locales (in some a proper letter with its own alphabetical position, and in others a mere diacritic variation), and the sorting should take that into consideration. The current implementation is simple enough, but not good enough.

I'm sure you'll be pleased to know that Unicode recently published the 32nd revision to its Unicode Collation Algorithm. So if EA simply implements that, we should be OK.

/Uffe

skiwi · « **Reply #1 on:** October 13, 2015, 09:24:59 am »

Could I throw in a request to stop sorts being case sensitive?
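
As a small Python illustration of the difference (the element names are invented for the example), a code point sort puts every uppercase name before every lowercase one, while a sort keyed on `str.casefold` ignores case:

```python
names = ["alpha", "Bravo", "charlie", "Delta"]

# Code point order: 'B' and 'D' (66, 68) come before 'a' and 'c' (97, 99).
case_sensitive = sorted(names)

# Case-insensitive order: compare the case-folded form of each name.
case_insensitive = sorted(names, key=str.casefold)

print(case_sensitive)
print(case_insensitive)
```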

qwerty · « **Reply #2 on:** October 13, 2015, 10:08:16 am »

You opened a can of worms here. "Simply implement" looks like quite a bit of work that would be better spent fixing bugs (hope dies last).

q.

KP · « **Reply #3 on:** October 14, 2015, 02:50:04 pm »

IANADBA, but any chance this might be a DBMS setting? At least for those situations where EA uses an SQL "ORDER BY" clause to define sort order.
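
To show what a DBMS-level collation does to "ORDER BY", here is a hedged Python sketch using the standard `sqlite3` module and an in-memory database. It is not how EA or any particular DBMS is configured: the table name `elements`, the collation name `swedish_sketch` and the hand-written Swedish alphabet are all assumptions made for the example.

```python
import sqlite3

# Hand-written Swedish alphabet, for illustration only.
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH)}


def swedish_key(text: str):
    # Letters outside the alphabet sort after it, ordered by code point.
    return [(RANK.get(ch, len(RANK)), ch) for ch in text.casefold()]


def swedish_collation(a: str, b: str) -> int:
    """Comparison callback: negative, zero or positive, as SQLite expects."""
    ka, kb = swedish_key(a), swedish_key(b)
    return (ka > kb) - (ka < kb)


conn = sqlite3.connect(":memory:")
conn.create_collation("swedish_sketch", swedish_collation)
conn.execute("CREATE TABLE elements (Name TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO elements (Name) VALUES (?)",
    [(n,) for n in ["ö", "ä", "å", "z", "a"]],
)

# Default collation: byte-wise comparison, i.e. code point order here.
default_order = [
    row[0] for row in conn.execute("SELECT Name FROM elements ORDER BY Name")
]

# Same query with the custom collation applied to the ORDER BY clause.
swedish_order = [
    row[0]
    for row in conn.execute(
        "SELECT Name FROM elements ORDER BY Name COLLATE swedish_sketch"
    )
]

print(default_order)
print(swedish_order)
conn.close()
```

Server databases normally ship with ready-made locale collations, so there the choice would be a matter of configuration rather than a callback like this one.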

Uffe · « **Reply #4 on:** October 14, 2015, 07:45:01 pm »

Hi KP,

Good point. But it depends on which part of EA you're in. The project browser does not sort its contents based on an 'order by Name' clause, but the doc generator does (if you tell it to).

I ran the following test:
1) Created an empty package, and reset its sort order.
2) Added six classes with single-letter names; ö, ä, å, z, á, a.
They were added in that order, so the 'a' one got the highest object ID.
3) The project browser sorts the classes a, z, á, ä, å, ö (Unicode order)
4) An SQL query with 'order by Name' sorts them a, á, z, å, ä, ö (correct alphabetical order for the locale).
I was using an MS SQL Server database, presumably set up with the Swedish locale but I couldn't verify this.
5) Created a doc template which only prints out the names of elements.
6) Generated a document, specifying 'Elements by Tree Order' => a, z, á, ä, å, ö (browser/Unicode).
7) Generated a document, specifying 'Elements by Name' => a, á, z, å, ä, ö (alphabetical).

So it works correctly in document generation, but the project browser has its own sorting algorithm, and it looks like that's just a plain old code point less-than comparison (well, after element type).
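
The two orderings from the test can be reproduced with a short Python sketch. This is a guess at the behaviour, not EA's or SQL Server's actual algorithm: `sorted` with no key stands in for the browser's code point comparison, and the hypothetical `swedish_key` uses two levels, first the letter's position in the Swedish alphabet (with marks folded away only for letters that are not part of it, such as á), then the exact characters as a tie-break.

```python
import unicodedata

# Hand-written Swedish alphabet, for illustration only.
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
RANK = {ch: i for i, ch in enumerate(SWEDISH)}


def primary(ch: str) -> int:
    """Alphabet position of ch; marks are folded only for non-Swedish letters."""
    if ch in RANK:
        return RANK[ch]
    base = unicodedata.normalize("NFD", ch)[0]  # e.g. 'á' -> 'a'
    return RANK.get(base, len(RANK))


def swedish_key(text: str):
    folded = text.casefold()
    # Level 1: letter positions. Level 2: exact characters, so 'a' < 'á'.
    return ([primary(ch) for ch in folded], folded)


# The six class names, in the order they were added in step 2.
names = ["ö", "ä", "å", "z", "á", "a"]

browser_order = sorted(names)                   # matches step 3
alphabetical = sorted(names, key=swedish_key)   # matches step 4

print(browser_order)
print(alphabetical)
```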

/Uffe

Sparx Systems Forum
