Improve Unicode support in FreeBSD
Student: DmitrySelyutin (ghostmansd@)
Mentor: PedroGiffuni (pfg@)
Description
There are a lot of functions in libc that are either unimplemented, buggy, or could be significantly improved. The best-known functions that need an implementation are strcoll and strcoll_l; this gap is the reason some projects rely on external libraries (PostgreSQL, for example, uses ICU). Functions that may need reimplementation include the wc/mb and uchar.h-related functions; their support can be significantly improved, to the point of being at least as fast as their counterparts in glibc or the ICU library.
Approach
Implement the string collation algorithm described in UTS #10 (the Unicode Collation Algorithm). Fix various bugs in libc Unicode support.
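To make the approach above more concrete, here is a minimal sketch of the multi-level comparison idea behind UTS #10: strings are compared on primary weights first, and lower levels only break ties. The weight table is a toy assumption that covers ASCII letters only, and the names toy_weight, compare_level and toy_collate are hypothetical. It is not the project's code and does not use real DUCET data.

```c
#include <stddef.h>

/*
 * Toy weights (an assumption for illustration, not DUCET values):
 *   level 0 (primary):   letter identity, case-insensitive
 *   level 1 (secondary): constant, since the toy table has no accents
 *   level 2 (tertiary):  lowercase sorts before uppercase
 * A weight of 0 means the character is ignored at that level.
 */
static unsigned toy_weight(unsigned char c, int level)
{
    int lower = (c >= 'a' && c <= 'z');
    int upper = (c >= 'A' && c <= 'Z');

    if (!lower && !upper)
        return 0;
    switch (level) {
    case 0:
        return 1u + (unsigned)(lower ? c - 'a' : c - 'A');
    case 1:
        return 1;
    default:
        return lower ? 1 : 2;
    }
}

/* Compare two strings using only the weights of one level. */
static int compare_level(const char *a, const char *b, int level)
{
    for (;;) {
        /* Skip characters that are ignorable at this level. */
        while (*a != '\0' && toy_weight((unsigned char)*a, level) == 0)
            a++;
        while (*b != '\0' && toy_weight((unsigned char)*b, level) == 0)
            b++;

        /* End of string counts as weight 0, below every real weight. */
        unsigned wa = *a != '\0' ? toy_weight((unsigned char)*a, level) : 0;
        unsigned wb = *b != '\0' ? toy_weight((unsigned char)*b, level) : 0;

        if (wa != wb)
            return wa < wb ? -1 : 1;
        if (wa == 0)
            return 0;   /* both strings are exhausted */
        a++;
        b++;
    }
}

/* Primary differences win; secondary and tertiary only break ties. */
int toy_collate(const char *a, const char *b)
{
    for (int level = 0; level < 3; level++) {
        int r = compare_level(a, b, level);
        if (r != 0)
            return r;
    }
    return 0;
}
```

With these toy weights, toy_collate("apple", "Banana") is negative even though strcmp would order the uppercase "B" first, and "Apple" sorts after "apple" only because of the tertiary level. A real implementation would also handle contractions, expansions and variable weighting, all of which this sketch leaves out.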
Deliverables
Implement the strcoll and strcoll_l functions, and improve the wc/mb functions as well as the C11 and POSIX standard functions. One goal of this project is to let projects such as PostgreSQL be built without ICU as a dependency, which makes strcoll and strcoll_l absolutely necessary. However, I personally treat this project as an opportunity to provide better Unicode support wherever we can.
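As an illustration of how applications typically consume strcoll, the following sketch sorts an array of strings under a named locale. It uses only standard C calls (setlocale, strcoll, qsort); the helper names cmp_strcoll and sort_with_locale are hypothetical and are not part of libc or of this project.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: orders C strings by the current LC_COLLATE rules. */
static int cmp_strcoll(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/*
 * Sort an array of strings under the named locale.
 * Returns 0 on success, or -1 if the locale is not available.
 */
int sort_with_locale(const char **items, size_t n, const char *locale_name)
{
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -1;
    qsort(items, n, sizeof(items[0]), cmp_strcoll);
    return 0;
}
```

In the "C" locale this orders strings like strcmp does. With a UTF-8 locale name (for example "en_US.UTF-8", assuming it is installed), the resulting order depends entirely on the collation data and strcoll implementation the system provides.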
Checkpoints
Functions for normalization and canonicalization are implemented in source files under the lib/libc/unicode directory. A new collation database interface, together with its Python bindings, was provided in order to extend collation support. Right now the source code lives under the lib/libcolldb directory, though it is likely to be moved elsewhere.
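The sketch below gives a rough idea of what canonical decomposition involves. It is not the code from lib/libc/unicode: the three-entry table and the names toy_decompose and toy_equivalent are assumptions made for illustration. It also skips recursive decomposition, canonical reordering of combining marks and Hangul handling, all of which a full implementation following UAX #15 needs.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-entry excerpt of canonical decomposition mappings. */
struct toy_decomp {
    uint32_t composed;
    uint32_t base;
    uint32_t mark;
};

static const struct toy_decomp toy_table[] = {
    { 0x00C5, 0x0041, 0x030A }, /* A with ring  -> A + combining ring above */
    { 0x00E9, 0x0065, 0x0301 }, /* e with acute -> e + combining acute */
    { 0x00F1, 0x006E, 0x0303 }, /* n with tilde -> n + combining tilde */
};

/*
 * Decompose n code points from src into dst (capacity cap).
 * Returns the number of code points written, or (size_t)-1 if dst is too small.
 */
size_t toy_decompose(const uint32_t *src, size_t n, uint32_t *dst, size_t cap)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        const struct toy_decomp *hit = NULL;

        for (size_t j = 0; j < sizeof(toy_table) / sizeof(toy_table[0]); j++) {
            if (toy_table[j].composed == src[i])
                hit = &toy_table[j];
        }
        if (out + (hit != NULL ? 2 : 1) > cap)
            return (size_t)-1;
        if (hit != NULL) {
            dst[out++] = hit->base;
            dst[out++] = hit->mark;
        } else {
            dst[out++] = src[i];
        }
    }
    return out;
}

/* Two sequences are treated as equivalent if their decompositions match. */
int toy_equivalent(const uint32_t *a, size_t na, const uint32_t *b, size_t nb)
{
    uint32_t da[64], db[64];
    size_t la = toy_decompose(a, na, da, 64);
    size_t lb = toy_decompose(b, nb, db, 64);

    if (la == (size_t)-1 || lb == (size_t)-1 || la != lb)
        return 0;
    for (size_t i = 0; i < la; i++) {
        if (da[i] != db[i])
            return 0;
    }
    return 1;
}
```

Under this toy table, the precomposed sequence U+0063 U+0061 U+0066 U+00E9 and the decomposed sequence U+0063 U+0061 U+0066 U+0065 U+0301 compare as equivalent, which is the property collation relies on when it normalizes input before looking up weights.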
The Code
https://socsvn.freebsd.org/socsvn/soc2014/ghostmansd
https://reviews.freebsd.org/D2736
Useful links
http://www.unicode.org/reports/tr10
http://www.unicode.org/unicode/reports/tr15
http://userguide.icu-project.org/collation/api
http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm