New approach to the FreeBSD locale database

Background

Over the years the FreeBSD locale database (share/colldef, share/monetdef, share/msgdef, share/numericdef, share/timedef) has accumulated a total of 165 definitions (language - country-code - character-set triplets). For Western European languages the contents of these files are mostly low-ASCII (7-bit), but for Eastern European and Asian languages they are partly or fully high-ASCII (8-bit). Without knowing how to display or interpret a given character set, it is difficult for a general audience to verify that the definitions for a language - country-code pair are displayed properly in each of its character sets.

Solution

If every definition (language - country-code) had a single low-ASCII file that describes the characters of each field, the files for the various character sets of that language could be generated from it.

What do we need

Gotchas

Examples

In the en_US definition (language - country-code), the word for the last day of the week would be written in the Unicode format as:

<LATIN CAPITAL LETTER S><LATIN SMALL LETTER A><LATIN SMALL LETTER T><LATIN SMALL LETTER U><LATIN SMALL LETTER R><LATIN SMALL LETTER D><LATIN SMALL LETTER A><LATIN SMALL LETTER Y>

Converted into UTF-8 this will be:

Saturday

Converted into ISO-8859-1 this will be:

Saturday

In the ru_RU definition (language - country-code), the word for the last day of the week would be written in the Unicode format as:

<CYRILLIC SMALL LETTER ES><CYRILLIC SMALL LETTER U><CYRILLIC SMALL LETTER BE><CYRILLIC SMALL LETTER BE><CYRILLIC SMALL LETTER O><CYRILLIC SMALL LETTER TE><CYRILLIC SMALL LETTER A>

Converted into UTF-8 this will be:

<D1><81><D1><83><D0><B1><D0><B1><D0><BE><D1><82><D0><B0>

Converted into KOI8-R this will be:

<D3><D5><C2><C2><CF><D4><C1>
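
The ru_RU conversions above can be checked mechanically. What follows is a minimal Python sketch and not the project's actual tooling. It assumes that the Unicode format is a run of <UNICODE CHARACTER NAME> tokens exactly as in the examples; the helper names names_to_text and show_bytes are invented for this illustration.

```python
import re
import unicodedata


def names_to_text(definition: str) -> str:
    """Turn '<LATIN CAPITAL LETTER S><...>' into the string it names.

    Assumption: every token between '<' and '>' is an official Unicode
    character name. unicodedata.lookup() raises KeyError for unknown names.
    """
    names = re.findall(r"<([^<>]+)>", definition)
    return "".join(unicodedata.lookup(name) for name in names)


def show_bytes(data: bytes) -> str:
    """Render bytes like the examples: printable ASCII as-is, the rest as <XX>."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else f"<{b:02X}>" for b in data
    )


saturday_ru = (
    "<CYRILLIC SMALL LETTER ES><CYRILLIC SMALL LETTER U>"
    "<CYRILLIC SMALL LETTER BE><CYRILLIC SMALL LETTER BE>"
    "<CYRILLIC SMALL LETTER O><CYRILLIC SMALL LETTER TE>"
    "<CYRILLIC SMALL LETTER A>"
)

text = names_to_text(saturday_ru)
print(show_bytes(text.encode("utf-8")))
# <D1><81><D1><83><D0><B1><D0><B1><D0><BE><D1><82><D0><B0>
print(show_bytes(text.encode("koi8_r")))
# <D3><D5><C2><C2><CF><D4><C1>
```

Because unknown character names raise KeyError, a misspelled name in a definition would be caught at this stage rather than producing wrong output.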

Careful!

Current status

Finished:

Pending:

Pending third parties:

SCM

(Currently the SCM contains all the definitions (language - country-code - character-set) in low- and high-ASCII. To keep the SCM history, we will do a one-time move of these files to their .unicode extension and then overwrite them with the Unicode-encoded definitions.)

The .unicode files are stored in the SCM and will, in the long term, be the only source in the SCM. Right now, because the base operating system lacks bsdiconv, we also have to store the character-map source (.src) files in the SCM. Once bsdiconv is in the base system, these files can be removed and the whole database can be made self-hosted.

Testing (before the move to src/tools/tools/locale)

To test the current system, you need the following data:

Local configuration:

CLDRDIR=        /home/edwin/unicode/cldr/1.7.1
LOCALE_DESTDIR= /home/edwin/locale/new
LOCALE_SHAREOWN=edwin
LOCALE_SHAREGRP=edwin

Test it out:

#
# All targets for TARGET_CHARACTERMAP
#
# .unicode -> .utf-8.src -> .utf-8.out
#                 \__ .iso8859-1.src -> .iso8859-1.out
# <----1---><--2---><------3--------><----4----->
#
# 1. The .unicode files are stored in the SCM and are the source
#    for everything that follows
# 2. The Perl script converts the .unicode files and the Unicode
#    CLDR database into UTF-8 code
# 3. The UTF-8 gets converted by libiconv or bsdiconv into the specific
#    character-map.
# 4. Get rid of the comments.
#
# As long as there is no bsdiconv, the files with the extension
# .unicode and .src must be stored in the SCM and will not be
# generated as part of the build process.
#
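
Steps 3 and 4 of the comment above can be illustrated in a few lines. This is a hedged Python sketch, not the build's real implementation: Python's built-in codecs stand in for libiconv or bsdiconv, comment lines are assumed to start with '#', the target character map is assumed to be ASCII-compatible, and the function name utf8_src_to_out and the sample content are hypothetical.

```python
def utf8_src_to_out(src_utf8: bytes, charset: str) -> bytes:
    """Transcode the body of a .utf-8.src file and drop its comment lines."""
    text = src_utf8.decode("utf-8")

    # Step 3: convert to the target character map. This raises
    # UnicodeEncodeError if the charset cannot represent a character.
    converted = text.encode(charset)

    # Step 4: get rid of the comments. Assumption: a comment is any line
    # whose first non-blank byte is '#', and the charset is ASCII-compatible
    # so that '#' and newline keep their ASCII byte values.
    kept = [
        line
        for line in converted.splitlines(keepends=True)
        if not line.lstrip().startswith(b"#")
    ]
    return b"".join(kept)


# Hypothetical .utf-8.src content: one comment line and one value.
sample_src = "# Weekday name, long form\nсуббота\n".encode("utf-8")

out = utf8_src_to_out(sample_src, "koi8_r")
print(out)  # b'\xd3\xd5\xc2\xc2\xcf\xd4\xc1\n'
```

Asking for a character map that cannot hold the text, for example utf8_src_to_out(sample_src, "iso8859_1"), fails with UnicodeEncodeError instead of silently writing wrong bytes.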

LocaleNewApproach (last edited 2009-10-04T21:15:30+0000 by EdwinGroothuis)