Chapter 8 Characters and Strings

Principle of enumeration
• Computers tend to be good at working with numeric data.
• The ability to represent integer values also makes it easy to work with other data types, as long as those types can be represented using integers. For a type consisting of a finite set of values, the easiest approach is simply to number the elements of the set.
• Types whose values are identified by counting off the elements are called enumerated types.

Characters
• Computers use the principle of enumeration to represent character data in memory. If you assign an integer to each character, you can use that integer as a code for the character it represents.
• Character codes, however, are not particularly useful unless they are standardized.
• The first widely adopted character coding was ASCII: American Standard Code for Information Interchange.
• With only 128 characters (256 in its common 8-bit extensions), the ASCII system proved inadequate to represent the many alphabets in use throughout the world.
• ASCII has been superseded by Unicode, whose first 128 codes match ASCII.
• Figure 8-1, p. 256, table.

Some notes
• The first thing to remember about the Unicode table is that you don't actually have to learn the numeric codes for the characters. What matters is that each character has a numeric representation, not what that representation happens to be.
• A character constant consists of the desired character enclosed in single quotation marks. Thus, the constant 'A' in a program denotes the Unicode representation of an uppercase A. That it has the value 101₈ = 65₁₀ is an irrelevant detail.

Important properties
• The codes for the digits 0 through 9 are consecutive: '0' + 9 is the code for '9'.
• The codes for the uppercase letters A through Z are consecutive, and the codes for the lowercase letters a through z are consecutive: 'a' + 2 is the code for 'c'.
• The arithmetic operations can be used with character values just as with integers.
• Avoid using integer constants to refer to Unicode characters: write 'A' rather than 65.
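
The properties above can be illustrated with a minimal sketch. The class name CharArithmetic and its helper methods are hypothetical names chosen for this example and do not come from the textbook. Note that in Java the result of char arithmetic is an int, so a cast is needed to get a char back.

```java
/* Hypothetical helper class illustrating arithmetic on character codes. */
class CharArithmetic {

    /** Converts a digit character such as '7' to its numeric value 7. */
    static int digitValue(char ch) {
        return ch - '0';              // digits have consecutive codes
    }

    /** Converts a value in the range 0..9 to the corresponding digit character. */
    static char digitChar(int n) {
        return (char) ('0' + n);      // char + int yields an int, so cast back
    }

    /** Returns the lowercase letter that lies offset positions after 'a'. */
    static char letterAt(int offset) {
        return (char) ('a' + offset); // lowercase letters are consecutive too
    }

    public static void main(String[] args) {
        System.out.println(digitValue('7'));   // prints 7
        System.out.println(digitChar(9));      // prints 9
        System.out.println(letterAt(2));       // prints c
        System.out.println((int) 'A');         // prints 65
    }
}
```

None of the methods mentions a numeric code directly: every constant is written as a character such as '0' or 'a', in keeping with the advice to avoid integer constants.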
Special characters
• Most of the characters in the Unicode table appear on the keyboard. They are called printing characters.
• The table also includes special characters. They are indicated in the Unicode table by an escape sequence, which consists of a backslash followed by a character or a sequence of digits.

    \b      Backspace
    \f      Form feed (starts a new page)
    \n      Newline (moves to the next line)
    \r      Return (moves to the beginning of the current line)
    \t      Tab (moves to the next tab stop)
    \\      The backslash character itself
    \'      The character ' (single quote)
    \"      The character " (double quote)
    \ddd    The character whose Unicode value is the octal number ddd

Conversion
• It is better to make conversions between int (Unicode value) and char (character) explicit by introducing type casts.

Example
Randomly generate an uppercase letter.

    private char randomLetter() {
        return (char) rgen.nextInt((int) 'A', (int) 'Z');
    }

The operations that generally make sense:
• Adding an integer to a character (usually a digit).
• Subtracting one character from another. 'a' - 'A' gives the distance between a lowercase letter and its corresponding uppercase letter, so 'M' + ('a' - 'A') gives 'm'. This can be used to convert uppercase letters into lowercase letters.
• Comparing two characters. (ch >= 'a') && (ch <= 'z') is true if ch is a lowercase letter.

Useful methods in the Character class

    static boolean isDigit(char ch)
    static boolean isLetter(char ch)
    static boolean isLetterOrDigit(char ch)
    static boolean isLowerCase(char ch)
    static boolean isUpperCase(char ch)
    static boolean isWhitespace(char ch)
    static char toLowerCase(char ch)
    static char toUpperCase(char ch)

Strings
• Java defines many useful methods that operate on the String class.
• The String class uses the receiver syntax when you call a method on a string.
• The String class is immutable: none of its methods ever changes the internal state of a string. Classes that prohibit clients from changing an object's state are said to be immutable.
• Instead of modifying the receiver, these methods return a new string on which the desired changes have been performed.
• To change the value of a string variable, assign the result back to it:
    str = str.toLowerCase();

Strings vs. characters
• Both the String and the Character classes export a toUpperCase method.
• In the Character class, you call toUpperCase as a static method:
    ch = Character.toUpperCase(ch);
• In the String class, you apply toUpperCase to an existing string:
    str = str.toUpperCase();

Selecting characters from a string
• In Java, positions within a string are numbered starting from 0, so str.charAt(1) gives the second character in str.
• A substring can be extracted from a larger string with str.substring(p1, p2), which returns the characters from position p1 up to but not including position p2. If the string variable str contains "hello, world", then str.substring(1, 4) returns "ell".

Comparing strings
• Equality: Use s1.equals(s2) instead of s1 == s2. The == operator compares the references s1 and s2 (whether they are the same object), not the contents of the strings.
• Order: Use s1.compareTo(s2), which returns a negative number, zero, or a positive number depending on whether s1 comes before, equals, or comes after s2. The comparison uses the numeric ordering of the underlying character codes (lexicographic order), which differs from conventional dictionary ordering; for example, "Z" comes before "a".
• For characters, c1 < c2 compares the codes of c1 and c2.
• Other methods in the String class: Figure 8-4, p. 266.

Searching within a string

    /**
     * Given a string composed of separate words, this method returns its
     * acronym. Assumes that str is nonempty and that words are separated
     * by single spaces, with no trailing space.
     * @param str Given string composed of separate words.
     * @return The acronym of the given string.
     */
    private String acronym(String str) {
        String result = str.substring(0, 1);              /* get the first character */
        int pos = str.indexOf(' ');                       /* position of the first space */
        while (pos != -1) {                               /* while another space was found */
            result += str.substring(pos + 1, pos + 2);    /* concatenate the next initial */
            pos = str.indexOf(' ', pos + 1);              /* position of the next space */
        }
        return result;
    }

Simple string idioms
• Iterating through the characters in a string
    for (int i = 0; i < str.length(); i++) {
        char ch = str.charAt(i);
        code to process each character in turn . . .
    }
• Growing a new string character by character
    String result = "";
    for (whatever limits) {
        code to determine next ch to be added . . .
        result += ch;
    }

A case study

    /*
     * File: PigLatin.java
     * -------------------
     * This file takes a line of text and converts each word into Pig Latin while
     * keeping punctuation marks.
     * The rules for forming Pig Latin words are as follows:
     * - If the word begins with a vowel, add "way" to the end of the word.
     * - If the word begins with a consonant, extract the set of consonants up
     *   to the first vowel, move that set of consonants to the end of the word,
     *   and add "ay".
     * - If the word contains no vowel, the word is unchanged.
     */

• Top-level English pseudocode
    public void run() {
        Tell the user what the program does.
        Ask the user for a line of text.
        Translate the line into Pig Latin and print it on the console.
    }

• Implementation at the current level
    public void run() {
        println("This program translates a line into Pig Latin.");
        String line = readLine("Enter a line: ");
        Translate the line into Pig Latin and print it on the console.
    }

• Define a method to replace the remaining English (interface design)
    public void run() {
        println("This program translates a line into Pig Latin.");
        String line = readLine("Enter a line: ");
        println(translateLine(line));
    }

    /**
     * Translates a line into Pig Latin.
     * @param line An English line
     * @return The Pig Latin translation of the line
     */
    private String translateLine(String line)

• Next-level English pseudocode. Apply a pattern, recalling the acronym pattern.
    private String translateLine(String line) {
        String result = "";
        while not end {
            Get the next word;
            Translate that word into Pig Latin;
            Append the translated word to result;
        }
        return result;
    }

• As a programmer, you will often trip over some detail that the framers of the problem either overlooked or considered too obvious to mention. In some cases, the omission is serious enough that you have to discuss it with the person who assigned you the programming task.
  In many cases, however, you will have to choose for yourself a policy that seems reasonable.
  – In this case, the specification is unclear about spaces and punctuation marks. A reasonable decision is: keep spaces and punctuation marks, and translate words only.

Implementation guideline
• Identify reusable code.
• Use library classes whenever possible.
  StringTokenizer class
      import java.util.*;
  A token is a sequence of characters that acts as a coherent unit.
  – In this case, take a word as a token and punctuation marks as delimiters.
  Define DELIMITERS: consult Wikipedia or the keyboard for the punctuation characters to include.

Implementation guideline (cont.)
• Use the character methods (Figure 8-3) and the string methods (Figure 8-4).
• Use for instead of while whenever possible.
  – Use for in findFirstVowel, since we can get word.length().
  – Use for in isWord, since we can get token.length().
• Use a table to enumerate all cases.
  – findFirstVowel, which is called by translateWord, returns -1, 0, or a positive integer. Thus translateWord must handle all three cases.

Summary
For each level:
• Write English pseudocode.
• Implement directly whatever can be implemented at the current level.
• Design methods to replace the remaining English pseudocode.
• Go on to the next level of methods.
Apply the implementation guidelines. The English pseudocode can be reused as comments.

Testing
• Bottom-up testing (start by testing the methods at the lowest level and move up; test callees before their callers)
• Test normal cases
• Test special or extreme cases (boundaries of the input variables)
• Black-box testing (verify the input/output specifications)
• White-box testing (execute every part of the code, including each condition in if and switch statements)
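
The case study can be carried through to the lowest level as in the following sketch. It is one possible result of the refinement, not necessarily the textbook's code. Several details are assumptions made for this example: the class name PigLatinSketch, the contents of DELIMITERS, the helper isEnglishVowel, the treatment of only a, e, i, o, u as vowels, and the use of a plain main method with fixed test inputs in place of the console calls println and readLine.

```java
import java.util.StringTokenizer;

/* A sketch of the Pig Latin case study; names and delimiters are assumptions. */
class PigLatinSketch {

    /* Assumed delimiter set: spaces plus common punctuation marks. */
    private static final String DELIMITERS = " \t\n.,;:!?()\"";

    /** Translates a line into Pig Latin, keeping spaces and punctuation. */
    static String translateLine(String line) {
        String result = "";
        /* The third argument makes the tokenizer return delimiters as tokens. */
        StringTokenizer tokenizer = new StringTokenizer(line, DELIMITERS, true);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (isWord(token)) {
                token = translateWord(token);
            }
            result += token;
        }
        return result;
    }

    /** Translates one word, covering all three results of findFirstVowel. */
    static String translateWord(String word) {
        int vp = findFirstVowel(word);
        if (vp == -1) {
            return word;                                  // no vowel: unchanged
        } else if (vp == 0) {
            return word + "way";                          // begins with a vowel
        } else {
            return word.substring(vp) + word.substring(0, vp) + "ay";
        }
    }

    /** Returns the index of the first vowel in word, or -1 if there is none. */
    static int findFirstVowel(String word) {
        for (int i = 0; i < word.length(); i++) {
            if (isEnglishVowel(word.charAt(i))) return i;
        }
        return -1;
    }

    /** Returns true if ch is one of a, e, i, o, u in either case. */
    static boolean isEnglishVowel(char ch) {
        return "aeiou".indexOf(Character.toLowerCase(ch)) != -1;
    }

    /** Returns true if every character in token is a letter. */
    static boolean isWord(String token) {
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetter(token.charAt(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(translateLine("this is pig latin."));  // normal case
        System.out.println(translateLine("hmm"));                 // no vowel
        System.out.println(translateLine(""));                    // empty line
    }
}
```

The main method follows the testing advice above on a small scale: one normal case and two special cases. For the first input, the three-way case table in translateWord yields "isthay isway igpay atinlay." with the spaces and the final period preserved.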