Multibyte collation support in FreeBSD

Overview

This project aims to bring correct sorting of national characters encoded in UTF-8 to FreeBSD. Currently, strings that begin with such characters always end up at the end of the sorted list, which is obviously wrong. Some workarounds have been tried and work, for example using the ICU library with Postgres 8.3. Although appealing, such approaches do not give us correct support in the base system, for example in sort(1) and ls(1). They also mean that we (or the authors) have to patch every affected program by hand. A lower-level solution is clearly needed, and this project provides it.
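The sorting problem described above can be reproduced with a few lines of C. The sketch below is only an illustration, not code from this project: sort_words() is a hypothetical helper, and the locale name a caller passes in (for example "pl_PL.UTF-8") is assumed to be installed on the system. With a NULL locale name it sorts by raw byte value, which is how a UTF-8 word such as "Łódź" ends up after "zebra"; with a locale name it sorts using strcoll(), which is the libc function that depends on the LC_COLLATE data.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: plain byte-wise comparison, ignores the locale. */
static int cmp_bytes(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* qsort comparator: uses the LC_COLLATE rules of the current locale. */
static int cmp_collate(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/*
 * Hypothetical helper: sort an array of strings in place.
 * If locale_name is NULL, sort by byte value. Otherwise select that
 * locale for LC_COLLATE (this changes the process-wide setting) and
 * sort with strcoll().
 * Returns 0 on success, -1 if the locale is not available.
 */
int sort_words(const char **words, size_t n, const char *locale_name)
{
    if (locale_name == NULL) {
        qsort(words, n, sizeof(words[0]), cmp_bytes);
        return 0;
    }
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;
    qsort(words, n, sizeof(words[0]), cmp_collate);
    return 0;
}
```

Called with NULL on the words "zebra", "Łódź" and "apple", this yields apple, zebra, Łódź, because the first byte of "Ł" in UTF-8 (0xC5) is larger than any ASCII letter. With a Polish UTF-8 locale and working multibyte collation, "Łódź" would be expected to sort between "apple" and "zebra" instead.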

Situation before the project started

What needs to be done


Current project status:

converter scripts for CLDR data: done
my version of the program generating the LC_COLLATE table (colldef): done
porting Apple's colldef program: done
porting collation support from Apple's libc: done
writing regression tests: done
add support for expansions needed for some languages: in progress
documenting everything: to be done

Implementation rationale

When porting parts of Apple's libc, I had to choose between importing xlocale (http://developer.apple.com/documentation/Darwin/Reference/ManPages/man3/xlocale.3.html) and leaving it out. The xlocale changes are spread throughout libc, and I felt that importing them was beyond the scope of this project. The number of affected files alone made me uneasy:

19:15|versus@vspredator:libc% grep -l -R locale_t * | grep -v FreeBSD | wc -l
218

Affected functions include vfprintf itself, which now takes an additional locale_t argument. The diff for vfprintf.c is 900 lines long. Personally, I also don't like adding things that bring no immediate gain in functionality, and I have never seen a program that used, or could benefit from, two different locales in two different threads at the same time.
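For readers unfamiliar with the locale_t style of interface, here is a minimal sketch of what an explicit locale argument looks like from the caller's side. It uses the POSIX.1-2008 functions newlocale(), strcoll_l() and freelocale(), which follow the same pattern as the xlocale functions; it is not part of this project, which left xlocale out. compare_in_locale() is a hypothetical helper, and the conditional include of xlocale.h is an assumption about where some systems declare strcoll_l().

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <string.h>
#if defined(__APPLE__) || defined(__FreeBSD__)
#include <xlocale.h>    /* assumed location of strcoll_l() on these systems */
#endif

/*
 * Hypothetical helper: compare two strings using the collation rules of
 * an explicitly named locale, without touching the process-wide locale
 * set by setlocale().
 * Stores the strcoll_l() result in *result and returns 0, or returns -1
 * if the locale object cannot be created.
 */
int compare_in_locale(const char *locale_name, const char *a,
                      const char *b, int *result)
{
    /* Build a locale object that carries only collation settings. */
    locale_t loc = newlocale(LC_COLLATE_MASK, locale_name, (locale_t)0);

    if (loc == (locale_t)0)
        return -1;
    *result = strcoll_l(a, b, loc);   /* the locale is passed as an argument */
    freelocale(loc);
    return 0;
}
```

Because the locale travels as a parameter, two threads could compare strings under different locales at the same time. That is the capability whose cost, in terms of changes across libc, is weighed above.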

KonradJankowski/Collation (last edited 2011-04-06 18:19:55 by KonradJankowski)