Multilingual Web sites

MULTILINGUAL WEB SITES AT FAO OF THE UN
Introduction
Since the onset of the Internet, FAO has been using the Web as a major platform for
disseminating information and knowledge in multiple languages. The Organization’s official
languages are English, French, Spanish, Arabic and Chinese. In some cases, other
languages are used, such as Portuguese, Russian and Italian. There are over 3 million
HTML files indexed on www.fao.org, over 50,000 full text FAO publications, and more than
4 million user visits per month.
The issues related to multilingual Web sites can be divided into three main areas:
1. Technical aspects
2. Translation costs
3. Maintenance
1. Technical aspects
The main technical issue when developing a Web site in multiple languages is the encoding
scheme. For many years, the only way to instruct a browser how to display a page in a
specific language was to refer to a character set.
For pages in English, French and Spanish, it was relatively easy to encode, because they
used the same character set (charset=iso-8859-1). Even if the character set was not
specified, most computers recognized the characters.
For Arabic and Chinese languages, it was much more complicated. Firstly, the correct
character set had to be selected. The use of Chinese was made possible by referring to the
character set “charset=gb_2312-80”. For Arabic, the character set would vary according to
the platform: “charset=windows-1256” for PCs and “charset=ISO-8859-6” for Macintosh
computers. In addition, as Arabic reads from right to left, this had to be specified in the
encoding.
Secondly, on a computer without a multilingual interface (or without Arabic and/or Chinese
fonts) the text would appear as incomprehensible accented characters (for example:
ÇáÊãÇÓ äÞÇØ ÇáÇÊÝÇÞ). This, of course, makes it very difficult to modify Arabic and
Chinese content.
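The garbled sample above is consistent with Arabic text stored as windows-1256 bytes and displayed through a Western European charset. The following minimal Python sketch assumes that is how the sample was produced; the helper name `reinterpret` is illustrative and does not come from any FAO system. It reverses the misreading and shows why the declared charset has to match the stored one.

```python
# Sample text as quoted above: Arabic bytes shown through a Latin-1 charset.
GARBLED = "ÇáÊãÇÓ äÞÇØ ÇáÇÊÝÇÞ"


def reinterpret(text: str, shown_as: str, stored_as: str) -> str:
    """Recover text whose bytes were decoded with the wrong charset."""
    raw = text.encode(shown_as)   # back to the original bytes
    return raw.decode(stored_as)  # decode with the charset actually used


# Assumption: the page was stored as windows-1256 and shown as iso-8859-1.
arabic = reinterpret(GARBLED, "iso-8859-1", "windows-1256")
print(arabic)

# The same Arabic text maps to different bytes under the two Arabic
# charsets mentioned above, so a page saved for one platform is
# misread on the other unless the declared charset matches.
print(arabic.encode("windows-1256") != arabic.encode("iso-8859-6"))
```

The first `print` shows readable Arabic only on a terminal with Arabic fonts, which is the same dependency the paragraph above describes for browsers.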
Lastly, it was extremely complicated to display more than one language on the same Web
page (for example: English and Arabic, or worse, Arabic and Chinese). This would create
many problems, especially for dynamic queries to databases or repositories where
information was stored in various languages and had to be displayed on the same results
page.
Another important technical issue was the storage of text in database tables. For database-driven
Web sites and metadata repositories, it was a complex process to store data in
different languages, because the correct encoding had to be used when inserting and
retrieving text. Procedures in different programming languages were developed to
circumvent this issue, increasing the complexity of the code of the information system, which
resulted in a decrease in the performance and the stability of the system.
Since Windows NT and Windows 2000, the use of Unicode has made our job much easier, as it
is a universal standard that enables the encoding of all characters used in all of the world’s
written languages. UTF-8 is an encoding form of the Unicode standard, which is now widely
used on the Internet. The main advantage to the Organization of using UTF-8 is that multiple
languages can now be displayed on the same page, regardless of the platform. There are
still some technical issues to consider, but they are much easier to address than those we
had when we worked with different character sets.
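The advantage of UTF-8 for mixed-language pages can be checked directly. The sketch below is illustrative only: the sample words and the helper `can_encode` are assumptions, not FAO content or code. It tests whether one line of text in three languages fits in each legacy charset and in UTF-8.

```python
# Sample words in three of the official languages (illustrative only).
ENGLISH = "Food"
ARABIC = "الغذاء"
CHINESE = "粮食"
MIXED = " / ".join([ENGLISH, ARABIC, CHINESE])

LEGACY_CHARSETS = ["iso-8859-1", "windows-1256", "gb2312"]


def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Each legacy charset covers its own language but not the mixed line.
for charset in LEGACY_CHARSETS:
    print(charset, can_encode(MIXED, charset))

# UTF-8 covers all three at once and round-trips without loss.
print("utf-8", MIXED.encode("utf-8").decode("utf-8") == MIXED)
```

Each legacy charset fails on the mixed line because it lacks the characters of at least one of the other languages, which is why a results page combining several languages needed workarounds before UTF-8.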
Unicode is now supported by all database management systems (DBMS); content is stored in
UTF-8 for all languages, simplifying data insertion and retrieval. With Unicode in the DBMS,
the speed of applications has improved and the management of database schemas has been
simplified. However, the conversion from legacy databases to UTF-8 based schemas is a
major undertaking and unsuspected problems can always occur.
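A conversion of this kind can be sketched as follows. This is a hypothetical illustration, not FAO's migration procedure: the table names `legacy_docs` and `docs`, the language-to-charset mapping and the sample rows are all assumptions, and SQLite (which stores text as UTF-8 by default) stands in for whichever DBMS is actually used. Decoding is strict so that rows with unexpected bytes are reported rather than silently corrupted.

```python
import sqlite3

# Assumed mapping from language code to the charset the legacy rows used.
LEGACY_CHARSET = {"en": "iso-8859-1", "ar": "windows-1256", "zh": "gb2312"}


def migrate(conn: sqlite3.Connection) -> list:
    """Copy legacy byte rows into a Unicode text table; return failed ids."""
    failed = []
    rows = conn.execute("SELECT id, lang, body FROM legacy_docs").fetchall()
    for doc_id, lang, body in rows:
        try:
            text = body.decode(LEGACY_CHARSET[lang])  # strict: errors surface
        except (KeyError, UnicodeDecodeError):
            failed.append(doc_id)  # keep for manual review
            continue
        conn.execute(
            "INSERT INTO docs (id, lang, body) VALUES (?, ?, ?)",
            (doc_id, lang, text),
        )
    return failed


# Hypothetical in-memory schema and sample rows for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE legacy_docs (id INTEGER, lang TEXT, body BLOB)")
conn.execute("CREATE TABLE docs (id INTEGER, lang TEXT, body TEXT)")
samples = [
    (1, "en", "Food security".encode("iso-8859-1")),
    (2, "ar", "الغذاء".encode("windows-1256")),
    (3, "zh", "粮食".encode("gb2312")),
    (4, "zh", b"\xff\xff"),  # deliberately invalid bytes
]
conn.executemany("INSERT INTO legacy_docs VALUES (?, ?, ?)", samples)
failed = migrate(conn)
print(failed)
```

Collecting failed row ids instead of stopping at the first error is one possible design choice; it gives a list of records to inspect by hand, which suits the kind of unsuspected problems mentioned above.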
One technical issue remains with Arabic text, because it reads from right to left. From a
graphical point of view the interface has to be flipped horizontally. To accommodate this
structure the Web page has to be adapted. This often creates problems that have to be
addressed on a case-by-case basis. The issue is compounded when Open Source
tools are used, because few of them have been designed to manage right-to-left text.
Therefore, the management of Arabic Web pages, especially with Open Source content
management systems, is difficult.
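At the markup level, the base direction of a page is declared with the standard HTML `dir` attribute. The sketch below is a minimal illustration, not an FAO template: the function `render_page` and the set `RTL_LANGUAGES` are hypothetical names.

```python
from html import escape

# Languages written right to left; only Arabic is relevant to the text above.
RTL_LANGUAGES = {"ar"}


def render_page(lang: str, title: str, body: str) -> str:
    """Return a minimal UTF-8 HTML page with the direction set per language."""
    direction = "rtl" if lang in RTL_LANGUAGES else "ltr"
    return (
        "<!DOCTYPE html>\n"
        f'<html lang="{escape(lang)}" dir="{direction}">\n'
        '<head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head>\n"
        f"<body><p>{escape(body)}</p></body>\n"
        "</html>"
    )


print(render_page("ar", "الغذاء", "الغذاء والزراعة"))
```

Setting `dir="rtl"` only establishes the base text direction. Mirroring navigation, columns and images still has to be handled in the page design, which is where the case-by-case work described above arises.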
Another important technical issue to be considered is multilingual support in full text search
engines, especially for Arabic and Chinese. Several tests were carried out using different
tools and methodologies before selecting an engine to index FAO’s Web sites and systems.
FAO currently has two search engines: Google Custom Search for FAO Web sites and an
Open Source search tool for embedded specialized searches (for example: searching FAO’s
Corporate Document Repository). Both search engines support multiple languages.
2. Translation costs
There is always the cost of developing multilingual Web pages and systems; however,
content also has to be made available in the official languages of the Organization for our
Member Countries. Translation costs are a major issue for the Organization when preparing
a Web site; translating text into four languages incurs a cost that technical divisions are not
always willing to pay. Most of the time a compromise is reached: a Web site will be fully
available in English, French and Spanish, while only the first level of information will be
translated into Arabic and Chinese.
3. Maintenance
The issue of maintenance is related to the issue of translation costs. When new content is
inserted into a Web site that exists in multiple languages, it must be translated into each
language. Requesting and tracking the translation of content is labour intensive and incurs
further costs. It is very difficult to update a Web site in all its languages – it requires good
procedures and a clear policy to ensure that multilingual Web sites are developed and
maintained in an efficient way.
4. Conclusions
Running a multilingual site is a constant challenge, which grows exponentially with the
number of languages to be included. While there have been giant improvements in the
techniques and technologies available for the management of multiple languages in technical
terms, the biggest challenge that remains is the mobilization and synchronization of content
across the languages, particularly if the goal is, as it is with us, to obtain parity of language
coverage. In fact, the better the tools, the more evident the discrepancy between language
versions becomes.
During a prolonged period of budgetary reductions, such as the one most U.N. institutions
have been subjected to, achieving and then maintaining parity of language coverage within
a corporate Web site has become increasingly difficult, if not impossible.