Using Hebrew language with Microsoft SQL Server 7.0
Version 1.0
2/16/2016

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.

© 1998 Microsoft Corporation. All rights reserved.

Microsoft, ActiveX, Visual Basic, Windows, and Windows NT are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. Other trademarks and tradenames mentioned herein are the property of their respective owners.

The names of companies, products, people, characters, and/or data mentioned herein are fictitious and are in no way intended to represent any real individual, company, product, or event, unless otherwise noted.

Draft – Do not distribute

Contents

Introduction
Code Pages
    What is a code page?
    Types of code pages
    Unicode
SQL Server multilingual support
    SQL Server Code page
    Conversions between Code Pages
    ODBC / OLEDB / DBLib conversion
    Changing translation behavior
Tests which we have done
    General configuration description
    VB client accessing SQL Server 7.0 Using ODBC or OLEDB
    Access accessing SQL Server 7.0
    VB Client accessing SQL Server 7.0 Thru a COM MTS Object
    Web Client accessing SQL Server 7.0 Thru IIS 4.0
    DTS Accessing Text File
Windows 2000 – A possible solution
Conclusions
Finding More Information

Introduction

With the rapidly growing number of SQL Server applications and users in Israel, many Hebrew-related issues have been raised. Most of these issues relate to the operating system or to ODBC rather than to SQL Server specifically. In addition, the number of possible configurations grows dramatically with the different versions of SQL Server, Windows, ODBC drivers, and MDAC, and with the service packs available for each of these products.

With that in mind, we decided to run several tests that represent typical scenarios. We tested SQL Server 7.0 only, and chose a specific version for each component in the configuration. We hope that the results of these scenarios will give readers an understanding of how it all works.

I want to thank Dan Shamir, Gal Cohen, and Rona Lustig for the work they have done in creating this document.

Yours,
Tomer Ben-Moshe
Database technologies Manager
Microsoft Israel

Code Pages

What is a code page?

A code page is the set of letters, digits, and symbols that make up the possible character values. It is sometimes referred to as a character set; we use the two terms interchangeably. A code page contains 256 values. The printable characters of the first 128 values are the same in all character sets. The last 128 characters, referred to as extended characters, differ from set to set and often represent language-specific letters and symbols. These code pages are sometimes called ANSI code pages.
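The way a single byte value maps to different characters in different code pages can be sketched in Python. This is an illustration only and was not part of the tests in this document: it assumes that Python's built-in codecs cp1255, cp1252, cp850, and cp437 match the corresponding Windows and OEM code page tables, and the helper names are hypothetical.

```python
# Sketch only: Python's built-in codecs stand in for the code page tables.
CODE_PAGES = ["cp1255", "cp1252", "cp850", "cp437"]


def decode_byte(value, code_page):
    """Return the character that one byte value maps to in a code page."""
    return bytes([value]).decode(code_page)


def printable_ascii_is_shared(code_pages):
    """Check that byte values 32..126 decode identically in every code page."""
    printable = bytes(range(32, 127))
    decoded = {printable.decode(cp) for cp in code_pages}
    return len(decoded) == 1


if __name__ == "__main__":
    print("first 128 printable values shared:",
          printable_ascii_is_shared(CODE_PAGES))
    for cp in CODE_PAGES:
        low = decode_byte(97, cp)    # a value from the first 128
        high = decode_byte(224, cp)  # an extended character
        # Print code points rather than glyphs so any console can show them.
        print(f"{cp}: 97 -> U+{ord(low):04X}, 224 -> U+{ord(high):04X}")
```

Under these assumptions, value 97 decodes to the same letter "a" everywhere, while value 224 decodes to the Hebrew Alef (U+05D0) in 1255 but to accented Latin letters in 1252 and 850.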
Each code page is defined by a table that gives the integer representation of every character in the code page. For example, in the 1252 (ISO) code page the character “a” is 97, “b” is 98, and “A” is 65.

Some important code pages are:

1252  ISO character set
850   Multilingual
437   U.S. English
1255  Hebrew
1256  Arabic

Note: The ISO (1252) and Multilingual (850) code pages include most European languages, but not Hebrew.

Types of code pages

There are several types of code pages:

OEM - the code page that the computer manufacturer implemented in the computer's hardware. DOS uses this code page. Windows can translate between ANSI and OEM. When Windows is installed, the setup program identifies the OEM code page of the computer and installs an OEM-to-ANSI translation table and the matching fonts. If the user changes the OEM code page after installing Windows, Windows has to be reinstalled so that the correct translation table is installed. This is the main difference between English Windows and Hebrew Enabled Windows, and it is the reason for most conversion problems.

ASCII - a historical code page that includes control characters in addition to the regular alphanumeric characters.

ANSI - the common code page today; it contains most of the characters from ASCII.

UTF-7 / UTF-8 - used for Internet applications.

Unicode

Using ANSI code pages can cause conversion problems when building multilingual systems, because a specific code page has to be defined for each language. The way to deal with this problem is to define a new code page that contains all known languages. This is a two-byte code page called Unicode. Unicode stores every character in two bytes instead of the one byte used by ANSI code pages, so the range of characters is 2^16 (65,536) instead of 2^8 (256). This range should be sufficient to store all characters of most languages in the world.
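The difference between one-byte ANSI storage and two-byte Unicode storage can be illustrated with a short Python sketch. It is an assumption-laden illustration, not part of the original tests: it uses Python's "utf-16-le" codec as a stand-in for two-byte Unicode storage (the two agree for the Hebrew, Arabic, and Latin characters used here), and the function names are hypothetical.

```python
def ucs2_bytes(text):
    """Encode text as two-byte little-endian code units."""
    return text.encode("utf-16-le")


def fits_in_code_page(text, code_page):
    """Return True if every character of text exists in the code page."""
    try:
        text.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    hebrew_and_arabic = "\u05d0\u0627"  # Hebrew Alef followed by Arabic Alef
    print("two-byte range:", 2 ** 16)
    print("bytes used for 2 characters:", len(ucs2_bytes(hebrew_and_arabic)))
    for cp in ("cp1255", "cp1256"):
        print(cp, "can hold both characters:",
              fits_in_code_page(hebrew_and_arabic, cp))
```

Neither the Hebrew nor the Arabic ANSI code page can hold both characters at once, while the two-byte encoding stores them side by side in four bytes.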
Microsoft SQL Server 7.0 introduces new data types, NVARCHAR, NCHAR, and NTEXT, which store data in Unicode format. When using Unicode data in SQL Server 7.0, remember that every Transact-SQL statement that uses a constant containing Unicode data must prefix the constant with N. For example, to find all records whose Unicode first_name field begins with “d”, the correct SQL statement is:

    SELECT * FROM TableNames WHERE first_name LIKE N'd%'

If you omit the N prefix, the system treats the constant as regular CHAR or VARCHAR data, based on the underlying code page.

SQL Server multilingual support

SQL Server Code page

When installing SQL Server 7.0, the user decides which code page to install on that server. This code page applies to all CHAR, VARCHAR, and TEXT fields. It is independent of the code page of the operating system on which SQL Server is installed. For example, you can install SQL Server with a Hebrew character set (1255) on an English NT machine (1252), even without the Hebrew language pack installed. There can be other implications for auxiliary applications such as DTS.

The character set definition is the same for all data in that SQL Server. SQL Server uses this definition to identify the characters stored in it. This affects the translation of received data, the physical storage structure, and the way SQL Server executes string manipulation.

Changing the character set after SQL Server setup requires rebuilding the master database from scratch with the rebuildm.exe utility, which can be found in the binn subdirectory. Changing the character set changes the physical storage of databases, so databases that were stored in the previous code page cannot be attached to a SQL Server that now uses a different code page.
The same rule applies to backup files: a backup taken from a SQL Server with one code page cannot be restored on a SQL Server with a different code page. If you want to change the code page of a SQL Server but keep the data stored in it, you can bcp the data out (not in native format), change the code page, and then bcp the data back in. The bcp utility in SQL Server 7.0 introduces a new flag that lets you control the code page it uses when exporting data.

Regardless of the code page defined in SQL Server setup, you can always use Unicode data. This data is stored in two bytes per character and can hold characters from most of the world's languages in the same field. The Unicode data types in SQL Server are NCHAR, NVARCHAR, and NTEXT.

The sort order is another option specified during setup. The sort order specifies the rules used by SQL Server to collate, compare, and present character data. It also specifies whether SQL Server is case-sensitive. SQL Server 7.0 uses two sort orders, one for the character code page and one for Unicode data. Both sort orders are specified during setup. The available sort orders are different for each SQL Server code page.

Conversions between Code Pages

SQL Server translates the bit patterns in CHAR, VARCHAR, and TEXT columns to characters using the definitions in the code page installed with SQL Server. Client computers use the code page installed with the operating system to interpret character bit patterns. There are many different code pages. Some characters appear on some code pages but not on others. Some characters are defined with one bit pattern on some code pages and with a different bit pattern on other code pages. When you build international systems that must handle different languages, it becomes difficult to pick code pages for all the computers that meet the language requirements of multiple countries.
It is also difficult to ensure that every computer performs the correct translations when interfacing with a system that uses a different code page.

ODBC / OLEDB / DBLib conversion

Data can be sent from a client machine to SQL Server in several ways: through an application, or from a SQL Server utility such as Query Analyzer or bcp. A developer writing an application has three ways to access SQL Server:

1. DB-Lib
2. ODBC
3. OLE DB

Each method has an option to convert data from the client's code page to the SQL Server code page. This conversion is executed on the client machine. On the server, SQL Server always tries to translate incoming data for ANSI data types to the installed code page.

The code page conversion used by the SQL Server 7.0 access methods converts to the client code page from any server code page, as long as the server's code page is installed on the client. This differs from the AutoAnsiToOem conversion used by the SQL Server 6.5 utilities, which converts characters only when the client has an ANSI code page and the server has an OEM code page, or vice versa.

When all parts of the system (client, middle tier, and SQL Server) use the same code page, no conversion is needed. When the client's code page differs from the server's, conversion may be needed.

Some systems try to store data from one code page in a server that uses a different code page. These applications tell the client not to perform translation; the data from the client is transferred as is to the server and stored with the same byte values, but in the server's code page.

For example, an application tries to store Hebrew data on a server that uses the multilingual code page (850):

1. The client application sends an “Alef” character to the server. The byte value of “Alef” in the Hebrew code page (1255) is 224.
2. There is no translation on the client, so byte 224 is received at the server and stored as 224 in code page 850 (where it is “Ó”, an accented O).
3. When the client retrieves the data, byte 224 is sent to the client, which displays it using the code page of the client operating system (1255) as the Hebrew “Alef”.

We can call this method a “double error”: the client makes one error when inserting the data and another when retrieving it, so the result looks correct. There are several problems, however. First, the server does not have the right information about the data stored in it; this can affect string manipulation and bcp out. Second, in many cases conversion is executed even when conversion is turned off (see KB article Q234748).

Auto-translation is not the only mechanism that can result in code page conversion. The SQL Server 7.0 ODBC driver and OLEDB provider introduce a new behavior when connecting to MSDE 1.0, SQL Server 7.0, or later versions of either. All SQL statements sent as a language event are converted to Unicode on the client before being sent to the server. The end effect is similar to auto-translation of all data flowing from the client to the server through a language event, regardless of the current auto-translation setting for the connection. This does not introduce any difficulties except when trying to store non-translated character data from a code page other than SQL Server's code page.

Changing translation behavior

Note: It is highly recommended not to change the defaults in any of the transport methods, and to use the system defaults.

Db-Lib applications - DB-Lib was used frequently in older applications. To change translation, start the “Client Network Utility” on the client machine, choose the “DB Library options” tab, and mark or unmark the “Automatic ANSI to OEM conversion” box. This option applies to all applications using DB-Lib on this machine.
ODBC DSNs - When defining an ODBC Data Source Name, the user can mark or unmark the option “Perform translation for character data” in the fourth step of the definition process. This behavior applies to all applications using this DSN.

OLE DB connections - The programmer should include the phrase “autotranslate=true;” or “autotranslate=false;” in the connection string. This applies to that specific application. (If you use a data link file, it applies to all applications using that data link.)

SQL Server Enterprise Manager - From the Tools drop-down menu, select Options. On the Connection tab, toggle the "Perform translation for character data" checkbox.

Query Analyzer (ISQLW.EXE) - To turn character translation on or off for the current connection, select Current Connection Options from the Query drop-down menu and toggle the "Perform translation for character data" checkbox. To change the default setting for all future connections, select Configure under the File drop-down menu. On the "New Connections" tab, toggle the "Perform translation for character data" checkbox.

BCP - The new /C command-line switch controls how ODBC character translation affects the data transfer. For more information on this parameter, see: Q199819 INF: SQL Server 7.0 BCP and Code Page Conversion

Tests which we have done

General configuration description

Several applications were written to simulate real-world applications. Since the number of different configurations is large, a number of scenarios were selected, and a detailed description of each scenario follows.

Data Server - Unless mentioned otherwise, we used SQL Server 7.0 with Service Pack 1 on English Windows NT 4 with SP5. The Windows NT installation did not have language pack support.

Client - The application was written in VB 6.0 SP3 using MDAC version 2.1. The client was always executed from a localized Hebrew Windows 98 machine.
Middle tier servers - When writing 3-tier Internet applications, we used IIS 4.0 from the Windows NT 4.0 Option Pack on English Windows NT 4.0 with SP5, with the Hebrew language pack installed from the NT setup disk. For 3-tier applications we used Microsoft Transaction Server on English Windows NT Server 4.0 SP5 with the Hebrew language pack. The COM objects were written with VB 6.0 SP3.

VB client accessing SQL Server 7.0 Using ODBC or OLEDB

The first example we will discuss is a 2-tier application in which a VB client communicates directly with SQL Server 7.0. The VB application can connect to the server using Remote Data Objects (RDO) over ODBC, ActiveX Data Objects (ADO) using the native SQL Server OLEDB provider, or ADO using the ODBC provider. The application used “select * from MyTable” to populate a form, then added data to a recordset. The application lets the user filter the data with the SQL command “select * from MyTable where MyField like 'd%'”. The user can set the character to filter by.

First we tested a SQL Server with the Hebrew code page 1255. In this situation the SQL Server code page is the same as that of the client's operating system (1255), so no translation is needed. We found that all data (CHAR, TEXT, Unicode, and Unicode TEXT) is saved correctly and can be accessed properly regardless of how it was accessed: ODBC or OLEDB, with or without translation. Note that the server machine was running English Windows NT Server without any Hebrew language support.

The second situation we tested is when the SQL Server code page differs from the code page of the client's operating system. The server was installed with code page 1252 or 850, while the client was running code page 1255. We found that if Unicode data is used, everything works correctly, including searches and string manipulation.
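This second situation can be simulated roughly in Python. The sketch below is an illustration under stated assumptions, not a reproduction of the drivers' behavior: Python's cp1255, cp1252, and cp850 codecs stand in for the client and server code pages, "utf-16-le" stands in for Unicode column storage, Python's errors="replace" handler (which substitutes "?") stands in for a conversion that cannot find a matching character, and all function names are hypothetical.

```python
CLIENT_CODE_PAGE = "cp1255"  # Hebrew client operating system


def unicode_round_trip(text):
    """Unicode column path: two-byte code units are stored and read back."""
    stored = text.encode("utf-16-le")
    return stored.decode("utf-16-le")


def translated_ansi_write(text, server_code_page):
    """ANSI column path with translation: convert to the server code page.

    Characters missing from the server code page become "?" (byte 63).
    """
    return text.encode(server_code_page, errors="replace")


def untranslated_ansi_round_trip(text, server_code_page):
    """ANSI column path without translation (the "double error").

    Returns (what the server believes it stores, what the client reads back).
    """
    raw = text.encode(CLIENT_CODE_PAGE)         # bytes sent as is
    server_view = raw.decode(server_code_page)  # server's interpretation
    client_view = raw.decode(CLIENT_CODE_PAGE)  # client's interpretation
    return server_view, client_view


if __name__ == "__main__":
    alef = "\u05d0"
    print(unicode_round_trip(alef) == alef)
    for server_cp in ("cp1252", "cp850"):
        print(server_cp, translated_ansi_write(alef, server_cp))
        server_view, client_view = untranslated_ansi_round_trip(alef, server_cp)
        print(server_cp, f"U+{ord(server_view):04X}", client_view == alef)
```

In this simulation only the Unicode path leaves the server holding the character the client actually meant; the untranslated path looks correct on the client while the server sees an unrelated Latin letter.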
When we used the CHAR, VARCHAR, or TEXT data types with translation, the following occurred: the client writes 1255 characters, OLEDB or ODBC translates them to Unicode as defined, and the Unicode data is transferred to SQL Server. SQL Server then tries to translate the Unicode data to its installed code page. It cannot find the relevant character in that code page, so it writes ANSI character 63, which is “?”, into the database.

Data can be transferred to SQL Server in a way that misleads the developer into thinking it is correct. One can use OLEDB or ODBC without translation. In this scenario the data is not translated from the client's code page (1255) to the code page installed on SQL Server (850 or 1252), and the byte value is transferred as is to the server. For example, assume a client application sends the Hebrew character “Alef” to the server, which is 224 in code page 1255. Since we have chosen not to translate the data, the code 224 is transferred to the server and stored as 224, but in code page 1252 (or 850). When the client reads that data with no translation, the code 224 is transferred back to the client as 224, where it is interpreted as code page 1255 value 224, the Hebrew “Alef” character.

This “double error” method raises a problem: SQL Server does not recognize the character the way we think it should, so if we try to sort or filter the data, we may get unpredictable results. The other problem occurs when the data is translated anyway in some language events, as described in the section above.

Access accessing SQL Server 7.0

Access 2000 can connect to a SQL Server 7.0 database using an ODBC Data Source Name through a linked table. A developer working with Access 2000 can also connect to SQL Server from a VBA program using OLEDB.
In this case, the results are the same as with a VB application and are described in the previous section.

When defining a DSN, the user can decide whether to perform translation for character data. This definition applies to all applications using the same DSN. The results are similar to those of a VB application accessing SQL Server via ODBC. If the user's machine runs an operating system with the same code page as the one installed in SQL Server, there are no problems with Hebrew data. If the code page of the user's operating system differs from the SQL Server code page, the developer should use Unicode data and not CHAR data in an ANSI code page.

In Access 97 the results are the same, but the developer has to use a workaround to access SQL Server. When linking a table in SQL Server, Access can show the Unicode data columns, but the user cannot update those fields. The workaround is:

1. Rename the original table on SQL Server 7.0.
2. Create a new table with the original table name. Create all the fields as they were in the original table, replacing all Unicode columns with ANSI data types.
3. In Access 97, link to the new table, which carries the original table name.
4. On the SQL Server side, delete the new table and rename the original table back to its original name.

This method tricks Access 97, which predates Unicode support, into treating the data as regular ANSI CHAR data.

VB Client accessing SQL Server 7.0 Thru a COM MTS Object

The next scenario is a 3-tier application that accesses SQL Server 7.0. The client is a VB application running on a Hebrew Windows 98 machine. The intermediate layer is a COM object on a Microsoft Transaction Server installed on an English Windows NT 4.0 machine. We had to install Hebrew language pack support on the intermediate machine, for reasons explained below. The COM object had several methods that returned a recordset to the client's VB application.
The object allows the user to retrieve the entire table, filter the records returned from the table, and add new records to the table. The object was built both with OLEDB using the ODBC provider and with the native SQL Server OLEDB provider.

In this scenario the COM object opens a connection to the database, but the client has no direct connection to SQL Server. The COM object gets a recordset from the server and transfers it as is to the client, and vice versa.

If translation is defined by the developer, it happens at two points. The first is on the SQL Server side, where the server translates ANSI data received from its client (the intermediate server) to the server's installed code page. The second may be in the COM object on the intermediate MTS machine. Because the data sent from the client is translated to Unicode on the intermediate server, that server has to recognize the code page of the client's operating system. For example, if the client uses the Hebrew code page (1255), the server on which MTS is installed has to have the Hebrew language pack installed to ensure accurate translation. Installing the language pack does not enable the user to type Hebrew characters on the intermediate server, but it allows the server's operating system to recognize the Hebrew code page.

The language pack is installed from the NT 4.0 installation disk. In a directory named “langPack” you can find a file with the required language name and an “inf” suffix. Right-click that file and choose “Install” from the popup menu.

If the language pack is not installed and translation is chosen, the intermediate server will try to translate the ANSI characters to Unicode, but without knowledge of the right code page it will give wrong results. After installing the language pack we got the expected results.

If the data is Unicode data, it is always transferred correctly.
Because all elements of the system use Unicode, there is no translation. The character is sent as is from the VB client (which stores variables in Unicode) to the COM object in MTS, which is not required to translate, and from there to the server, which accepts the data and stores it in Unicode without any processing.

If the data is not Unicode, that is, it uses the CHAR data type, there are two possibilities. In the first, the client's operating system uses the same code page as the one installed in SQL Server. In this situation there should not be any problems. In the second, the code pages differ, and only Unicode data can be used; with any other method, with or without translation, the data reaches the database garbled.

Web Client accessing SQL Server 7.0 Thru IIS 4.0

In a web environment the scenario is a bit different. The players are the client, using a browser on a machine whose code page we do not know; the Internet Information Server machine, which acts as the intermediate machine; and the SQL Server machine. Using MTS and IIS together is a very common scenario, in which ASP pages call COM objects on MTS and these objects handle the interaction with the database.

We tested a system that included SQL Server 7.0 with SP1 on an English NT 4.0 SP machine without Hebrew language pack support. Internet Information Server ran on an NT 4.0 SP5 machine with the Hebrew language pack installed. The client was a Hebrew Windows 98 machine with Internet Explorer 5.0.

When a client requests data, an ASP script running on IIS opens a connection to the SQL Server machine, which builds a recordset and returns it to the IIS machine, which builds an HTML page based on the data in the recordset. The page is sent to the client machine, which is responsible for presenting it to the user.

Although the IIS server can use Unicode data in a format named UTF-8, SQL Server uses a different Unicode scheme, called UCS-2.
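The difference between the two Unicode formats can be seen in a small Python sketch. It is illustrative only: it assumes little-endian byte order for UCS-2 and uses Python's "utf-16-le" codec as a stand-in, which matches UCS-2 for characters such as Hebrew letters that fit in a single two-byte unit. The function names are hypothetical and are not part of any SQL Server or IIS API.

```python
def utf8_to_ucs2(utf8_bytes):
    """Re-encode UTF-8 bytes as two-byte little-endian code units."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le")


def ucs2_to_utf8(ucs2_bytes):
    """Re-encode two-byte little-endian code units as UTF-8 bytes."""
    return ucs2_bytes.decode("utf-16-le").encode("utf-8")


if __name__ == "__main__":
    alef = "\u05d0"
    as_utf8 = alef.encode("utf-8")
    as_ucs2 = utf8_to_ucs2(as_utf8)
    # Same character, two different byte sequences.
    print("UTF-8:", as_utf8.hex(), "UCS-2:", as_ucs2.hex())
    print("round trip ok:", ucs2_to_utf8(as_ucs2) == as_utf8)
```

Both encodings happen to use two bytes for a Hebrew letter, but the byte values differ, so UTF-8 bytes cannot be handed to a two-byte Unicode column unchanged.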
Data in UTF-8 should be translated to SQL Server's UCS-2 before it is transferred to SQL Server, or else stored as binary data in SQL Server (see KB article Q232580).

The client can use whatever code page it wants. We can read the client's code page and check whether it is correct for the data we expect to receive. We can also control the session code page, which defines the code page used for data in our ASP programs. If the combination of all these factors is correct, we can enter, update, and filter Hebrew data.

The safest way to use Hebrew data was to use code page 1255 on the SQL Server together with Unicode data types. In this way the data was correct using OLEDB regardless of translation and of the provider chosen. When using a different code page on the SQL Server, we could not transfer ANSI data correctly to the SQL Server and back to the client.

DTS Accessing Text File

Data Transformation Services (DTS) is a set of utilities that is part of SQL Server 7.0 and is used very often to import and export data. We tested how DTS behaves when transferring data into a text file.

A DTS package can be defined using a wizard or via the Designer. The only difference between the two with regard to Hebrew is that the wizard uses a default of “auto translation off” for the connection object. This can affect the data when transferring non-Unicode data. When building a package manually using the Designer, the developer can set the connection object property “auto translate” to 0 or 1, for off or on. (To change this property, right-click the connection icon, choose Properties, click the Advanced button at the bottom right of the window, and change the value in the table that appears.)

We saw that it matters which machine runs the DTS package. DTS packages can be run from Enterprise Manager or with the command prompt utility “dtsrun”. There is no difference between the two.
In both cases, if the package is configured to execute a translation, it has to run on a machine with the appropriate code page. So, if we have Hebrew character data and want to export it to an ANSI code page file, we have to run the package from a Hebrew-enabled machine or from a Windows NT machine that has the Hebrew language pack installed.

An interesting issue arises when using an English Windows NT 4 machine without Hebrew language support and trying to export Unicode data in Hebrew: the data is garbled. Another problem we found was that when we tried to transfer TEXT fields to Unicode, the transfer failed regardless of the data in the TEXT field.

There is no way to tell DTS to auto translate data when connecting to a SQL Server database using ODBC. If not defined elsewhere, it executes the transfer as if auto translate were false. So, when transferring to an ANSI file, as defined in the package definition, we could transfer the ANSI data (CHAR, VARCHAR, TEXT) correctly.

When transferring Unicode data using the OLEDB provider, the data was garbled. Transferring data to a Unicode file (this requires the Hebrew language pack) transferred all the Unicode data correctly. When we defined “auto translation = off”, the data was garbled. When the translation was off, we got the Unicode data correct, but the ANSI data came out with a space character between every character in the database.

Another problem we encountered is that when we asked for column headers in Hebrew, they appeared in the file from right to left, while the data appeared from left to right.

Windows 2000 – A possible solution

All of the above language problems result from the fact that the ANSI to OEM code page conversion in Windows NT/95/98 is fixed and is determined at installation. However, Windows 2000 introduces a new behavior that, in our tests, solved all the language problems, whether we used ANSI or Unicode data.
Windows 2000 uses Unicode internally, so support for all languages is part of the operating system. More than that, you can change the ANSI to OEM code page conversion without reinstalling the operating system. This is done by changing the default System Locale.

How to change the default System Locale: select Regional Options in the Control Panel, then in the “language setting for the system” section press the “Set Default” button and choose “Hebrew”.

Conclusions

The results of all the tests are very clear and lead to the following conclusions:

When writing multilingual applications, it is recommended to use Unicode data types. It is important to put the N prefix before the constant value; otherwise it is treated as an ANSI data type.

When writing a Hebrew application, it is recommended to use the SQL Server Hebrew code page, 1255.

It is recommended to have Hebrew support on the operating system of the machine where the connection to the database is made.

Installing the Hebrew language pack solves many Hebrew problems without the need to reinstall a Hebrew-enabled version of Windows.

In most cases, Windows 2000 solves the Hebrew problem.

Finding More Information

You can use these additional resources to find out more about migrating to SQL Server:

Client/Server development information
SQL Server Books Online
Microsoft Knowledge Base: http://support.microsoft.com