MULTILINGUAL WEB SITES AT FAO OF THE UN

Introduction

Since the early days of the Internet, FAO has used the Web as a major platform for disseminating information and knowledge in multiple languages. The Organization’s official languages are English, French, Spanish, Arabic and Chinese. In some cases other languages are used, such as Portuguese, Russian and Italian. There are over 3 million HTML files indexed on www.fao.org, over 50,000 full-text FAO publications, and more than 4 million user visits per month.

The issues related to multilingual Web sites can be divided into three main areas:

1. Technical aspects
2. Translation costs
3. Maintenance

1. Technical aspects

The main technical issue when developing a Web site in multiple languages is the encoding scheme. For many years, the only way to instruct a browser how to display a page in a specific language was to refer to a character set. Pages in English, French and Spanish were relatively easy to encode, because they used the same character set (charset=iso-8859-1). Even if the character set was not specified, most computers displayed the characters correctly.

For Arabic and Chinese, it was much more complicated. Firstly, the correct character set had to be selected. The use of Chinese was made possible by referring to the character set “charset=gb_2312-80”. For Arabic, the character set varied according to the platform: “charset=windows-1256” for PCs and “charset=ISO-8859-6” for Macintosh computers. In addition, as Arabic reads from right to left, the text direction also had to be declared in the page.

Secondly, on a computer without a multilingual interface (or without Arabic and/or Chinese fonts) the text would appear as incomprehensible accented characters (for example: ÇáÊãÇÓ äÞÇØ ÇáÇÊÝÇÞ). This, of course, made it very difficult to modify Arabic and Chinese content.
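
The garbled sample quoted above is consistent with Arabic text stored as windows-1256 bytes but displayed as if it were iso-8859-1. The following minimal sketch reproduces and reverses that effect. It is an illustration only: Python is used for convenience, the helper name reinterpret is hypothetical, and the charset pairing is an assumption inferred from the sample rather than a description of FAO’s systems.

```python
# Illustrative sketch: undo and reproduce the garbled Arabic sample from the text.
# Assumption: the sample is windows-1256 (Arabic) bytes rendered as iso-8859-1.

def reinterpret(text: str, shown_as: str, actually: str) -> str:
    """Re-encode text with the charset it was shown in, then decode the
    resulting bytes with the charset they were actually written in."""
    return text.encode(shown_as).decode(actually)


garbled = "ÇáÊãÇÓ äÞÇØ ÇáÇÊÝÇÞ"  # sample quoted in the text above

# Recover the Arabic text by undoing the wrong charset guess.
recovered = reinterpret(garbled, "iso-8859-1", "windows-1256")

# Going the other way reproduces the garbling a reader would have seen.
regarbled = reinterpret(recovered, "windows-1256", "iso-8859-1")

print(recovered)             # Arabic text, if the terminal can display it
print(regarbled == garbled)  # whether the round trip gives back the sample
```

The bytes never change in either direction; only the character set used to interpret them does. This is why every page and every database field had to carry the correct charset label in the pre-Unicode setup described here.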
Lastly, it was extremely complicated to display more than one language on the same Web page (for example, English and Arabic or, worse, Arabic and Chinese). This created many problems, especially for dynamic queries to databases or repositories where information was stored in various languages and had to be displayed on the same results page.

Another important technical issue was the storage of text in database tables. For database-driven Web sites and metadata repositories, storing data in different languages was a complex process, because the correct encoding had to be used when inserting and retrieving text. Procedures were developed in different programming languages to work around this issue, but they increased the complexity of the code of the information system and reduced its performance and stability.

Since Windows NT and Windows 2000, the use of Unicode has made our job much easier, as it is a universal standard that enables the encoding of all characters used in all of the world’s written languages. UTF-8 is an encoding form of the Unicode standard and is now widely used on the Internet. The main advantage to the Organization of using UTF-8 is that multiple languages can now be displayed on the same page, regardless of the platform. There are still some technical issues to consider, but they are much easier to address than those we faced when working with different character sets.

Unicode is now supported by all Database Management Systems (DBMSs); content is stored in UTF-8 for all languages, which simplifies data insertion and retrieval. With Unicode in the DBMSs, the speed of applications has improved and the management of database schemas has been simplified. However, converting legacy databases to UTF-8-based schemas is a major undertaking, and unexpected problems can always occur.

One technical issue remains with Arabic text, because it reads from right to left.
From a graphical point of view, the interface has to be flipped horizontally, and the Web page has to be adapted to accommodate this structure. This often creates problems that have to be addressed on a case-by-case basis. The issue is compounded when using Open Source tools, because few of them have been designed to manage right-to-left text. Therefore, the management of Arabic Web pages, especially with Open Source content management systems, is difficult.

Another important technical issue is multilingual support in full-text search engines, especially for Arabic and Chinese. Several tests were carried out using different tools and methodologies before selecting an engine to index FAO’s Web sites and systems. FAO currently has two search engines: Google Custom Search for FAO Web sites, and an Open Source search tool for embedded specialized searches (for example, searching FAO’s Corporate Document Repository). Both support multiple languages.

2. Translation costs

Developing multilingual Web pages and systems always has a cost; in addition, content has to be made available in the official languages of the Organization for our Member Countries. Translation costs are a major issue for the Organization when preparing a Web site; translating text into four languages incurs a cost that technical divisions are not always willing to pay. Most of the time a compromise is reached: a Web site is fully available in English, French and Spanish, while only the first level of information is translated into Arabic and Chinese.

3. Maintenance

The issue of maintenance is related to the issue of translation costs. When new content is inserted into a Web site that exists in multiple languages, it must be translated into each language. Requesting and tracking the translation of content is labour-intensive and incurs further costs.
It is very difficult to update a Web site in all its languages; good procedures and a clear policy are required to ensure that multilingual Web sites are developed and maintained efficiently.

4. Conclusions

Running a multilingual site is a constant challenge, which grows exponentially with the number of languages included. While there have been giant improvements in the techniques and technologies available for managing multiple languages, the biggest remaining challenge is the mobilization and synchronization of content across the languages, particularly if the goal is, as it is for us, to achieve parity of language coverage. In fact, the better the tools, the more evident the discrepancy between language versions becomes. During a prolonged period of budgetary reductions, such as the one most U.N. institutions have been subjected to, achieving and then maintaining parity of language coverage within a corporate Web site has become increasingly difficult, if not impossible.