Encoding on the Internet
Elizabeth J. Pyatt
CETS
© 2001, Penn State University

Computers Do Numbers
• All data on computers are ultimately stored as numbers
• Letters are assigned numbers via an encoding system
• Numbers in the encoding system determine the alphabetical order of the letters
• Keyboards input the number which corresponds to each letter

ASCII Encoding
• ASCII = American Standard Code for Information Interchange
• Invented in the 1960s
• Limited to 128 (2^7) characters (English only)
• ASCII encoding is supported on all modern computers
• ASCII encodes letters, digits, punctuation and the blank space character
• Distinguishes capital letters from lower case

ASCII Chart (Excel)

First Steps Beyond ASCII
• Vendors add an additional 128 characters for 256 total characters (2^8, or “8-bit”)
• Characters #0-127 = ASCII
• Characters #128-255 = non-English letters and punctuation
• Each accented letter (e.g. á, â or Á) is a separate character
• Multiple vendors = multiple standards

ISO-8859-1 / Latin 1
• Internet standard for English and Western European languages is ISO-8859-1
• ISO = International Organization for Standardization
• 8859 = encoding standard
• 1 = first one registered at ISO
• Latin / Roman = English alphabet
• Almost identical to Windows-1252 encoding
• Differs from “MacRoman” on Macintosh

Latin-1 vs. Mac Roman (GIF)

Encoding Non-Roman Scripts
• Alternate encodings were developed for other scripts like Russian, Arabic, Greek, Hebrew
• Template is:
  – Characters #0-127 = ASCII
  – Characters #128-255 = non-Roman script
• Some scripts also developed multiple encodings, typically an ISO version and a Windows version (e.g. Hebrew = ISO-8859-8 or Windows-1255)

Encoding Schemes
            Hebrew       Greek        Cent Eur                   Arabic
Encoding    ISO-8859-8   ISO-8859-7   ISO-8859-2                 ISO-8859-6
#0-127      ASCII        ASCII        ASCII                      ASCII
#128-255    Hebrew       Greek        Special Accented Letters   Arabic

16 Bit and Beyond
• Chinese, Japanese and Korean have more than 256 characters
• 16-bit encodings with 2^16, or 65,536, characters were developed
• Unicode, which attempts to combine all modern scripts into one super encoding block, is currently being developed
• Unicode support is increasing on Windows and Macintosh, but is still limited in applications

How browsers read a site
• Web site specifies encoding to browser
• Browser matches encoding with the right font on your machine
• Browser displays the appropriate characters (English and non-English)

How to mess up the browser
• Web site doesn’t specify encoding, so browser stays on default (usually Latin-1)
• Web site specifies a font not on the user’s machine
• Font doesn’t match encoding
• Font doesn’t have all the right characters (e.g. €, the Euro currency symbol)

Keyboards & Fonts
• Normal fonts (e.g. Times, Arial) match the character to its ASCII/Latin-1 number based on the keyboard
• “Dingbat Fonts” (e.g. Symbol, Wingdings) do not match the character to the ASCII code
• Most keyboards still access only 128 characters at a time (Mac can do 256)
• Therefore, older non-English script fonts (e.g. Symbol) do not always match script encoding

Where to get good fonts
• Microsoft provides free, properly encoded fonts with Windows NT and Windows 2000
• Apple provides free, properly encoded fonts via its Language Kits (free with System 9)
• Third party fonts are available (but can be glitchy)
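
The template described in the slides (characters #0-127 = ASCII, characters #128-255 = whatever the chosen encoding says) can be tried out in a short Python 3 sketch. It assumes only the codecs bundled with the Python standard library; the codec labels (latin-1, cp1252, mac_roman, iso8859_7, iso8859_8) are Python's names for the encodings mentioned above, and the helpers decode_byte and describe are hypothetical names chosen for this illustration.

```python
import unicodedata

# Python codec labels for encodings named in the slides (assumed mapping):
# ISO-8859-1, Windows-1252, MacRoman, ISO-8859-7 (Greek), ISO-8859-8 (Hebrew).
ENCODINGS = ["latin-1", "cp1252", "mac_roman", "iso8859_7", "iso8859_8"]


def decode_byte(value, encoding):
    """Interpret one stored number (0-255) as a character in the given encoding."""
    return bytes([value]).decode(encoding)


def describe(value, encoding):
    """Return the Unicode name of the character that the number maps to."""
    ch = decode_byte(value, encoding)
    # Control characters have no Unicode name, so supply a fallback label.
    return unicodedata.name(ch, "<control>")


if __name__ == "__main__":
    # Letters are numbers; capital and lower case get different numbers.
    print("A =", ord("A"), " a =", ord("a"))

    for enc in ENCODINGS:
        # Number 65 falls in the ASCII range, so every encoding agrees on it.
        # Number 233 falls in the upper range, so the letter depends on the encoding.
        print(enc, "| 65:", describe(65, enc), "| 233:", describe(233, enc))

    # Latin-1 and Windows-1252 are almost identical, but not at number 128.
    print("cp1252 128:", describe(128, "cp1252"))
    print("latin-1 128:", describe(128, "latin-1"))
```

Run as a script, this prints character names rather than the characters themselves, so the output does not depend on which fonts or terminal encoding are available. The same number 233 comes out as an accented Latin letter, a Greek letter or a Hebrew letter depending on the encoding, which is the mismatch a browser produces when a site does not specify its encoding.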