KonradJankowski/Collation

Multibyte collation support in FreeBSD

Overview

This project aims to bring correct sorting of national characters encoded in UTF-8 to FreeBSD. Currently, strings that begin with such characters always end up at the end of the sorted list, which is obviously wrong. Some workarounds were tried and worked, for example using the ICU library for Postgres 8.3. Although appealing, such approaches don't give us correct support in the base system, for example in sort(1) and ls(1). They also mean that we (or the authors) have to manually patch every affected program. A lower-level solution is clearly needed, and this project is that solution.
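The symptom described above can be reproduced with a plain byte-wise sort. The sketch below assumes that the old behaviour is equivalent to comparing the raw UTF-8 bytes with strcmp(3); the helper names compare_bytes and sort_bytewise are hypothetical and are not part of the project's code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * qsort(3) callback: compares two strings byte by byte and ignores the
 * locale entirely. This is an assumed stand-in for the old behaviour.
 */
static int
compare_bytes(const void *a, const void *b)
{
	const char *const *sa = a;
	const char *const *sb = b;

	return (strcmp(*sa, *sb));
}

/* Sorts an array of UTF-8 strings in raw byte order. */
static void
sort_bytewise(const char **names, size_t count)
{
	assert(names != NULL);
	qsort(names, count, sizeof(names[0]), compare_bytes);
}
```

Every lead byte of a multibyte UTF-8 sequence is 0xC2 or higher, which is greater than any ASCII letter. Sorting "Zebra", "Łódź" and "Adam" this way therefore puts "Łódź" last, although Polish ordering places Ł between L and M.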

Situation before the project started

Two level comparison
Collation table size of only 256 elements (UCHAR_MAX + 1)
colldef(1) file format differing from the POSIX standard, but elegant

What needs to be done

Three level comparison
Collation table size of at least 65536 elements
Enhance the colldef(1) utility to accept 16-bit character values and to store 3 levels of data for them
It seems the input file format will have to be changed completely
Write converter programs (or port existing ones) that read the vast CLDR database and emit the new, enhanced colldef(1) format
Change the libc part to use the new table format and do 3-level comparisons
Other parts of the UCA algorithm, like variable weighting, should be dealt with while generating the binary table with colldef(1)
Contractions/expansions still need some thought: they can destroy the current O(1) lookup time
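The three-level comparison listed above can be sketched as follows. This is a simplified illustration, not the project's libc code. The struct weights layout, the toy_table values and the function names are all invented for the example. A real table would be indexed directly by code point to keep the O(1) lookup, and the sketch ignores contractions, expansions and variable weighting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical weight record: three levels per code point. */
struct weights {
	uint32_t cp;		/* code point */
	uint16_t level[3];	/* primary, secondary, tertiary */
};

/* Toy table; the weight values are invented for illustration only. */
static const struct weights toy_table[] = {
	{ 0x0061, { 1, 1, 1 } },	/* a */
	{ 0x0041, { 1, 1, 2 } },	/* A: differs from a at level 3 */
	{ 0x00E1, { 1, 2, 1 } },	/* a with acute: differs at level 2 */
	{ 0x0062, { 2, 1, 1 } },	/* b: differs at level 1 */
};

/* Returns the weight of cp at the given level (0, 1 or 2). */
static uint32_t
weight_of(uint32_t cp, int level)
{
	size_t i;

	for (i = 0; i < sizeof(toy_table) / sizeof(toy_table[0]); i++)
		if (toy_table[i].cp == cp)
			return (toy_table[i].level[level]);
	/* Unknown code points sort after the table, in code point order. */
	return (level == 0 ? 0x10000u + cp : 1);
}

/*
 * Compares two zero-terminated code point arrays. A lower level is
 * consulted only when all higher levels compare equal.
 */
static int
compare3(const uint32_t *s1, const uint32_t *s2)
{
	size_t i;
	int level;

	assert(s1 != NULL && s2 != NULL);
	for (level = 0; level < 3; level++) {
		for (i = 0; s1[i] != 0 && s2[i] != 0; i++) {
			uint32_t w1 = weight_of(s1[i], level);
			uint32_t w2 = weight_of(s2[i], level);

			if (w1 != w2)
				return (w1 < w2 ? -1 : 1);
		}
		/* A string that is a prefix of the other sorts first. */
		if (s1[i] != s2[i])
			return (s1[i] == 0 ? -1 : 1);
	}
	return (0);
}
```

With this toy table, "áa" sorts before "ab": the accent is only a secondary difference, so the primary difference between the second letters decides first. The accent and the letter case matter only when everything before them ties.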

What will be used:
UCA - Unicode Collation Algorithm
data files from CLDR - Common Locale Data Repository posix.zip
regression test data tests.zip
Apple's colldef: http://www.opensource.apple.com/darwinsource/tarballs/apsl/adv_cmds-119.tar.gz
Apple's Libc: http://www.opensource.apple.com/darwinsource/tarballs/apsl/Libc-498.1.1.tar.gz

Current project status:

converter scripts for CLDR data	done
my version of the program generating the LC_COLLATE table (colldef)	done
porting Apple's colldef program	done
porting collation support from Apple's libc	done
writing regression tests	done
add support for expansions needed for some languages	in progress
documenting everything	to be done

Links to working parts

my colldef
port of Apple's colldef: colldef.apple
Scripts preparing input for the programs mentioned above: scripts
libc integrating multibyte collation support: libc
CLDR data version 1.6, with some of my fixes: cldr
Also, my patch to FreeBSD 7.0's libc (you also need the colldef.apple program and the locale data files): fbsd_7.0_collation.patch
Hot and new patch against 9-CURRENT (works against 8.0 too): fbsd_9CURRENT_collation.patch
Regression tests

Implementation rationale

When porting parts of Apple's libc, I faced a choice between importing xlocale (http://developer.apple.com/documentation/Darwin/Reference/ManPages/man3/xlocale.3.html) and throwing it out. The xlocale changes are spread widely throughout libc, and I felt that importing them was beyond the scope of this project. Just the number of affected files made me feel uneasy:

19:15|versus@vspredator:libc% grep -l -R locale_t * | grep -v FreeBSD | wc -l
218

Affected functions include vfprintf itself, which now has an additional locale_t argument. The diff for vfprintf.c is 900 lines long. Also, personally, I don't like adding things for which I don't see immediate gains in functionality, and I have never seen a program that used, or could benefit from, two different locales in two different threads at the same time.

Also see

SummerOfCode2014/Unicode

KonradJankowski/Collation (last edited 2017-09-18T13:06:47+0000 by KubilayKocak)