Multibyte collation support in FreeBSD
Overview
This project aims to bring to FreeBSD the correct sorting of national characters encoded in UTF-8. Currently, strings that begin with those characters always end up at the end of the sorted list, which is obviously wrong. Some workarounds have been tried and work, for example using the ICU library with Postgres 8.3. Although appealing, such approaches do not give us correct support in the base system, for example in sort(1) and ls(1). They also mean that we (or the authors) have to manually patch every affected program. A lower-level solution is clearly needed, and this project is that solution.
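The symptom described above can be reproduced with a plain byte-wise sort. The following is a minimal sketch of the symptom only, not the libc code path; the helper names cmp_bytes and sort_bytewise are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * qsort(3) callback: compares two strings byte by byte.
 * strcmp(3) interprets bytes as unsigned char, so every UTF-8 lead
 * byte (0xC2 and above) compares greater than any ASCII letter.
 */
static int
cmp_bytes(const void *a, const void *b)
{
	const char *const *sa = a;
	const char *const *sb = b;

	return (strcmp(*sa, *sb));
}

/* Sort an array of NUL-terminated UTF-8 strings by raw byte values. */
static void
sort_bytewise(const char **words, size_t n)
{
	assert(words != NULL);
	qsort(words, n, sizeof(words[0]), cmp_bytes);
}
```

Sorting "zebra", "été" and "apple" this way puts "été" last, because its first byte (0xC3) is greater than the byte value of any ASCII letter.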
Situation before the project started
- Two level comparison
- Collation table size of only 256 elements (UCHAR_MAX + 1)
- colldef(1) file format differing from the POSIX standard, but elegant
What needs to be done
- Three level comparison
- Collation table size of at least 65536 elements
- Enhance the colldef(1) utility to accept 16-bit character values and to store 3 levels of data for each of them
- It seems the input file format will have to be changed completely
- Write converter programs/port existing ones to use the vast CLDR database and target them to the new enhanced colldef(1) format
- Change the libc part to use the new table format and do 3-level comparisons
- Other parts of the UCA algorithm, like variable weighting, should be dealt with while generating the binary table with colldef(1)
- Need to think about what to do with contractions/expansions - they can destroy the current O(1) lookup time
- What will be used:
- UCA - Unicode Collation Algorithm
- data files from CLDR - Common Locale Data Repository: posix.zip
- regression test data: tests.zip
- Apple's colldef: http://www.opensource.apple.com/darwinsource/tarballs/apsl/adv_cmds-119.tar.gz
- Apple's Libc: http://www.opensource.apple.com/darwinsource/tarballs/apsl/Libc-498.1.1.tar.gz
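The three-level comparison over a 65536-element table listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the format produced by colldef(1) or the code in the patches: the names coll_weights, coll_compare, COLL_TABLE_SIZE and COLL_LEVELS are hypothetical, input is assumed to be already decoded into 16-bit character values, and ignorable weights, contractions and expansions are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COLL_TABLE_SIZE	65536	/* one slot per 16-bit character value */
#define COLL_LEVELS	3	/* primary, secondary, tertiary */

/* Hypothetical table entry: one weight per comparison level. */
struct coll_weights {
	uint16_t w[COLL_LEVELS];
};

/*
 * Compare two strings of 16-bit character values level by level.
 * A lower level is consulted only if all higher levels tie.
 * Each weight lookup is a direct array index, hence O(1).
 * Returns -1, 0 or 1.
 */
static int
coll_compare(const struct coll_weights *table,
    const uint16_t *a, size_t alen, const uint16_t *b, size_t blen)
{
	size_t n = alen < blen ? alen : blen;

	assert(table != NULL);
	for (int level = 0; level < COLL_LEVELS; level++) {
		for (size_t i = 0; i < n; i++) {
			uint16_t wa = table[a[i]].w[level];
			uint16_t wb = table[b[i]].w[level];

			if (wa != wb)
				return (wa < wb ? -1 : 1);
		}
		/* All shared positions tie: the shorter string sorts first. */
		if (alen != blen)
			return (alen < blen ? -1 : 1);
	}
	return (0);
}
```

With a table in which "a", "A" and "á" share a primary weight but differ at the secondary or tertiary level, "á" sorts before "b" and is distinguished from "a" only once the primary level ties. A contraction or expansion would break the one-character-to-one-entry assumption that makes the direct index possible, which is the concern raised in the list above.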
Current project status:
converter scripts for CLDR data | done
my version of the program generating the LC_COLLATE table (colldef) | done
porting Apple's colldef program | done
porting collation support from Apple's libc | done
writing regression tests | done
add support for expansions needed for some languages | in progress
documenting everything | to be done
Links to working parts
my colldef
port of Apple's colldef: colldef.apple
Scripts preparing input for the aforementioned programs: scripts
libc integrating multibyte collation support: libc
CLDR data version 1.6, with some of my fixes: cldr
Also, my patch to FreeBSD 7.0's libc (you also need the colldef.apple program and locale data files): fbsd_7.0_collation.patch
Hot and new patch against 9-CURRENT (works against 8.0 too): fbsd_9CURRENT_collation.patch
Implementation rationale
When porting parts of Apple's libc, I faced a choice between importing xlocale (http://developer.apple.com/documentation/Darwin/Reference/ManPages/man3/xlocale.3.html) and throwing it out. The xlocale changes are spread widely throughout the libc, and I felt that importing them was beyond the scope of this project. Just the number of affected files made me feel uneasy:
19:15|versus@vspredator:libc% grep -l -R locale_t * | grep -v FreeBSD | wc -l
218
Affected functions include vfprintf itself, which now has an additional locale_t argument. The diff for vfprintf.c is 900 lines long. Also, personally, I don't like adding things for which I don't see immediate gains in functionality, and I have never seen a program that used (or could benefit from) two different locales in two different threads at the same time.