Improve Unicode support in FreeBSD

Description

Many functions in libc are either unimplemented, buggy, or could be significantly improved. The best-known functions that need an implementation are strcoll and strcoll_l; their absence is the reason some projects rely on external libraries (PostgreSQL, for example, uses ICU). Functions that may need reimplementation include the wc/mb family and the uchar.h-related functions; they could be improved significantly, to be at least as fast as their counterparts in glibc or the ICU library.

Approach

Implement the string collation algorithm described in UTS #10 (the Unicode Collation Algorithm). Fix various bugs in libc's Unicode support.
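As a rough illustration of the multi-level comparison that UTS #10 describes, the sketch below builds a sort key from primary, secondary and tertiary weights and compares keys level by level. It is not the project's code: the names toy_ce, toy_table, toy_sort_key and toy_compare are hypothetical, the weights are invented, and each code point maps to exactly one collation element (real collation tables also handle decompositions, contractions and implicit weights).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy collation element with three weight levels. */
struct toy_ce { uint32_t cp; uint16_t primary, secondary, tertiary; };

/* Invented weights: base letter = primary, accent = secondary, case = tertiary. */
static const struct toy_ce toy_table[] = {
    { 'a',    1, 1, 1 },
    { 'A',    1, 1, 2 },
    { 0x00E1, 1, 2, 1 },   /* LATIN SMALL LETTER A WITH ACUTE */
    { 'b',    2, 1, 1 },
    { 'B',    2, 1, 2 },
};

#define TOY_KEY_MAX 64

static const struct toy_ce *toy_lookup(uint32_t cp)
{
    for (size_t i = 0; i < sizeof toy_table / sizeof toy_table[0]; i++)
        if (toy_table[i].cp == cp)
            return &toy_table[i];
    return NULL;   /* unknown code points are simply ignored in this sketch */
}

/* Build the key: all primaries, 0, all secondaries, 0, all tertiaries. */
size_t toy_sort_key(const uint32_t *s, size_t n, uint16_t *key, size_t cap)
{
    size_t len = 0;
    for (int level = 0; level < 3; level++) {
        if (level > 0 && len < cap)
            key[len++] = 0;   /* level separator, lower than any weight */
        for (size_t i = 0; i < n; i++) {
            const struct toy_ce *ce = toy_lookup(s[i]);
            if (ce == NULL)
                continue;
            uint16_t w = level == 0 ? ce->primary
                       : level == 1 ? ce->secondary
                       : ce->tertiary;
            if (w != 0 && len < cap)
                key[len++] = w;
        }
    }
    return len;
}

/* Compare two code point strings by their sort keys; returns -1, 0 or 1. */
int toy_compare(const uint32_t *a, size_t na, const uint32_t *b, size_t nb)
{
    uint16_t ka[TOY_KEY_MAX], kb[TOY_KEY_MAX];
    assert(3 * na + 2 <= TOY_KEY_MAX && 3 * nb + 2 <= TOY_KEY_MAX);
    size_t la = toy_sort_key(a, na, ka, TOY_KEY_MAX);
    size_t lb = toy_sort_key(b, nb, kb, TOY_KEY_MAX);
    size_t n = la < lb ? la : lb;
    for (size_t i = 0; i < n; i++)
        if (ka[i] != kb[i])
            return ka[i] < kb[i] ? -1 : 1;
    return (la > lb) - (la < lb);
}
```

With these toy weights, "ab" sorts before "B" because the primary level is compared first, even though 'B' precedes 'a' in code point order; "a" and "A" differ only at the tertiary level.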

Deliverables

Implement the strcoll and strcoll_l functions, and improve the wc/mb functions as well as other C11 and POSIX standard functions. One goal of this project is to let projects such as PostgreSQL be built without ICU as a dependency (which makes strcoll and strcoll_l absolutely necessary). However, I personally treat this project as an opportunity to provide better Unicode support wherever we can.
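For readers unfamiliar with the interface in question, here is a minimal sketch of how an application might call strcoll_l through the POSIX.1-2008 locale API (newlocale, strcoll_l, freelocale). The wrapper name collate_in_locale is hypothetical, and the sketch assumes the declarations are visible through <locale.h> and <string.h>; some systems may need <xlocale.h> instead.

```c
#define _POSIX_C_SOURCE 200809L   /* request the POSIX.1-2008 locale API */
#include <assert.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: compare a and b using the collation rules of the
 * named locale. Returns 0 on success and stores -1, 0 or 1 in *result;
 * returns -1 if the locale cannot be created.
 */
int collate_in_locale(const char *name, const char *a, const char *b,
                      int *result)
{
    assert(result != NULL);
    locale_t loc = newlocale(LC_COLLATE_MASK, name, (locale_t)0);
    if (loc == (locale_t)0)
        return -1;
    int r = strcoll_l(a, b, loc);
    freelocale(loc);
    *result = (r > 0) - (r < 0);   /* normalize to the sign only */
    return 0;
}
```

In the "C" locale the comparison is equivalent to strcmp, so "B" sorts before "a". With a language-specific locale name, the ordering depends entirely on the quality of the system's collation data, which is the part this project aims to improve.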

Checkpoints

Functions for normalization and canonicalization are implemented in source files under the lib/libc/unicode directory. A new collation database interface, together with its Python bindings, was provided in order to extend collation support. Right now its source code is placed under the lib/libcolldb directory, though it is likely to be moved elsewhere.
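To give a sense of what canonicalization involves, the sketch below shows the canonical ordering step from UAX #15: within a run of combining marks, marks are stably reordered by canonical combining class, and never moved across a starter (class 0). This is only an illustration and not the code under lib/libc/unicode; the names toy_ccc and toy_canonical_order are hypothetical, and only three combining marks are covered, where a real implementation would read the classes from the Unicode Character Database.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy subset of canonical combining classes. */
static uint8_t toy_ccc(uint32_t cp)
{
    switch (cp) {
    case 0x0327: return 202;   /* COMBINING CEDILLA */
    case 0x0323: return 220;   /* COMBINING DOT BELOW */
    case 0x0301: return 230;   /* COMBINING ACUTE ACCENT */
    default:     return 0;     /* everything else is a starter here */
    }
}

/*
 * Reorder combining marks in place. This is a stable insertion sort that
 * moves a mark left only past marks with a strictly higher class, so it
 * never crosses a starter and never swaps marks of equal class.
 */
void toy_canonical_order(uint32_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        uint32_t cp = s[i];
        uint8_t c = toy_ccc(cp);
        if (c == 0)
            continue;
        size_t j = i;
        while (j > 0 && toy_ccc(s[j - 1]) > c) {
            s[j] = s[j - 1];
            j--;
        }
        s[j] = cp;
    }
}
```

For example, the sequence 'a', U+0301, U+0323 becomes 'a', U+0323, U+0301, so that two canonically equivalent spellings end up as identical code point sequences before comparison.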

The Code

https://socsvn.freebsd.org/socsvn/soc2014/ghostmansd

https://reviews.freebsd.org/D2736

http://www.unicode.org/reports/tr10

http://www.unicode.org/unicode/reports/tr15

http://userguide.icu-project.org/collation/api

http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm

https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

SummerOfCode2014/Unicode (last edited 2015-06-05T16:48:47+0000 by PedroGiffuni)