CS203 LECTURE 6
John Hurley
Cal State LA

Hashing

An object may contain an arbitrary amount of data, and searching a data structure that contains many large objects is expensive.
• Suppose your collection of Strings stores the text of various books, you are adding a book, and you need to preserve the Set definition, i.e. make sure that no book already in the Set has the same text as the one you are adding.

A hash function maps each datum to a value of a fixed and manageable size. This reduces the search space and makes searching less expensive.

Hash functions must be deterministic, since when we search for an item we search for its hashed value. If an identical item is in the collection, it must have received the same hash value.

Any function that maps larger data to smaller values must map more than one possible original datum to the same mapped value.

[Diagram from Wikipedia]

When more than one item in a collection receives the same hash value, a collision is said to occur. There are various ways to deal with this. The simplest is to keep a list of all items with the same hash code and, when necessary, search that list sequentially (or with a binary search, if the list is kept sorted).

Hash functions often use the data to calculate some numeric value, then perform a modulo operation.
• This results in a bounded hash value. If the calculation ends in % n, the maximum hash value is n-1 (for a non-negative numeric value), no matter how large the original data is.
• It also tends to produce hash values that are relatively uniformly distributed, minimizing collisions.
• Modulo may be used again within a hash-based data structure in order to scale the hash values to the number of slots (buckets) available in the structure.

The more sparse the data, the more useful hashing is.
Sparse: A AA AB AC AD AE AF AG AH AI AJ AK AL AM AN AO AP AQ AR AS AT AU AV AW AX AY AZ
Not sparse: Juror 1, Juror 2, Juror 3, Juror 4

Hashing is used for many purposes in programming.
The one we are interested in right now is that it makes it easier to look up data or memory addresses in a table.

Sets

A Set differs from a List in these ways:
• A Set does not have any inherent order.
• A Set may not contain any duplicate elements.
• This means that a Set may not contain any two elements e1 and e2 such that e1.equals(e2).

There are several types of Set in the Java Collections Framework. The most important ones are HashSet and TreeSet.

The Set Interface Hierarchy

The Collection interface is the root interface for manipulating a collection of objects.

The AbstractSet Class

The AbstractSet class is a convenience class that extends AbstractCollection and implements Set. It provides concrete implementations of the equals method and the hashCode method. The hash code of a set is the sum of the hash codes of all the elements in the set. Since the size method and the iterator method are not implemented in AbstractSet, AbstractSet is an abstract class.

The HashSet Class

The HashSet class is a concrete class that implements Set.
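Before looking at HashSet in use, it may help to see the hashing ideas described earlier (a deterministic hash value, a modulo operation to bound it, and a list per slot to absorb collisions) in a few lines of code. This is only a teaching sketch and is not how java.util.HashSet is actually implemented; the class name ChainedStringSet, its method names, and the bucket count of 4 are all assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical teaching sketch: a tiny set of Strings that uses hashing with chaining.
public class ChainedStringSet {
    private final List<List<String>> buckets = new ArrayList<>();

    public ChainedStringSet(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    // Deterministic: the same string always maps to the same bucket.
    // Math.floorMod keeps the result in 0..bucketCount-1 even if hashCode() is negative.
    public int bucketIndex(String s) {
        return Math.floorMod(s.hashCode(), buckets.size());
    }

    // Returns false (and changes nothing) if an equal string is already present.
    public boolean add(String s) {
        List<String> chain = buckets.get(bucketIndex(s));
        if (chain.contains(s)) { // sequential search of one chain only
            return false;
        }
        chain.add(s);
        return true;
    }

    public boolean contains(String s) {
        return buckets.get(bucketIndex(s)).contains(s);
    }

    public static void main(String[] args) {
        ChainedStringSet set = new ChainedStringSet(4);
        System.out.println(set.add("Brutus"));      // true
        System.out.println(set.add("Cicero"));      // true
        System.out.println(set.add("Brutus"));      // false: duplicate ignored
        System.out.println(set.contains("Cicero")); // true
        System.out.println(set.contains("Caesar")); // false
    }
}
```

Only the chain in one bucket is searched, which is why a good spread of hash values keeps lookups cheap.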
Objects added to a hash set need to implement the hashCode() method consistently with equals(): objects that are equal must return the same hash code.

Duplicates

Attempts to add duplicate records to a set are ignored:

    package demos;

    import java.util.HashSet;
    import java.util.Set;

    public class Demo {
        public static void main(String[] args) {
            Set<String> nameSet = new HashSet<String>();
            nameSet.add("Brutus");
            nameSet.add("Cicero");
            nameSet.add("Spartacus");
            printAll(nameSet);
            nameSet.add("Spartacus"); // duplicate: the set is unchanged
            printAll(nameSet);
            nameSet.add("Spartacus");
            printAll(nameSet);
        }

        public static <T> void printAll(Set<T> set) {
            System.out.println(" set contains these records: ");
            for (T t : set) {
                System.out.println(t);
            }
        }
    }

Hash Set Demo

    package demos;

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;

    public class HashSetDemo {
        // heavily adapted from http://www-inst.eecs.berkeley.edu/~cs61c/sp13/labs/06/
        public static void main(String args[]) {
            String input = "The right of the people to be secure in their persons, houses, papers, and "
                    + "effects, against unreasonable searches and seizures, shall not be violated, and no "
                    + "Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and "
                    + "particularly describing the place to be searched, and the persons or things to be seized.";
            Set<String> stringSet = new HashSet<String>();
            // \\W+ means "one or more characters that are not alphanumeric or underscores"
            String[] words = input.split("\\W+");
            System.out.println("Number of words in the input is " + words.length);
            for (String s : words) {
                stringSet.add(s.toLowerCase());
            }
            System.out.println("The number of unique words in the input is " + stringSet.size());
            Iterator<String> myIterator = stringSet.iterator();
            while (myIterator.hasNext()) {
                System.out.println(myIterator.next());
            }
        }
    }

LinkedHashSet

Note that the words in the output from the last example are not in the same order as they appeared in the original input.
LinkedHashSet preserves the input order by maintaining a linked list through the entries of a hash set. Change the HashSet in the example to LinkedHashSet and compare the output.

SortedSet Interface and TreeSet Class

• SortedSet is a subinterface of Set, which guarantees that the elements in the set are sorted.
• TreeSet is a concrete class that implements the SortedSet interface.
• You can use an iterator to traverse the elements in sorted order.
• The elements can be sorted in two ways.
• One way is to use the Comparable interface.
• The other is to specify a comparator for the elements in the set. This approach is referred to as order by comparator.
• To see a TreeSet in action, return to the last demo and replace the HashSet with a TreeSet.
• We will discuss the implementation of trees next week.
• To order by Comparator, write a class that implements Comparator and pass it to the TreeSet constructor.
• Be careful: ordering by comparator results in only one entry per group of inputs that are equal according to the compare() method in the Comparator.

    package demos;

    import java.util.Comparator;

    public class StringSortByLength implements Comparator<String> {
        @Override
        public int compare(String s1, String s2) {
            // Strings of the same length compare as equal
            return s1.length() - s2.length();
        }
    }

    package demos;

    import java.util.Iterator;
    import java.util.Set;
    import java.util.TreeSet;

    public class TreeSetDemo {
        // heavily adapted from http://www-inst.eecs.berkeley.edu/~cs61c/sp13/labs/06/
        public static void main(String args[]) {
            String input = "The right of the people to be secure in their persons, houses, papers, "
                    + "and effects, against unreasonable searches and seizures, shall not be "
                    + "violated, and no Warrants shall issue, but upon probable cause, supported "
                    + "by Oath or affirmation, and particularly describing the place to be "
                    + "searched, and the persons or things to be seized.";
            Set<String> stringSet = new TreeSet<String>(new StringSortByLength());
            // \\W+ means "one or more characters that are not alphanumeric or underscores"
            String[] words = input.split("\\W+");
            System.out.println("Number of words in the input is " + words.length);
            for (String s : words) {
                stringSet.add(s.toLowerCase());
            }
            // Only the first word of each length is kept, because the comparator
            // treats words of equal length as duplicates
            System.out.println("The number of distinct word lengths in the input is " + stringSet.size());
            Iterator<String> myIterator = stringSet.iterator();
            while (myIterator.hasNext()) {
                String nextString = myIterator.next();
                System.out.println(nextString.length() + ": First example added: " + nextString);
            }
        }
    }

Comparator

    package booksdemo;

    public class Book implements Comparable<Book> {
        String author;
        String title;
        String isbn;

        public Book(String author, String title, String isbn) {
            super();
            this.author = author;
            this.title = title;
            this.isbn = isbn;
        }

        public String getAuthor() {
            return author;
        }

        public String getTitle() {
            return title;
        }

        public String getIsbn() {
            return isbn;
        }

        // Natural ordering: by author, then by title
        @Override
        public int compareTo(Book otherBook) {
            int authDiff = author.compareTo(otherBook.getAuthor());
            if (authDiff != 0)
                return authDiff;
            else
                return title.compareTo(otherBook.getTitle());
        }

        @Override
        public String toString() {
            return "Author: " + author + " Title: " + title + " ISBN: " + isbn;
        }
    }

    package booksdemo;

    import java.util.Collection;
    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.TreeSet;

    public class BookTreeSetDemo {
        TreeSet<Book> theSet;

        public BookTreeSetDemo() {
            theSet = new TreeSet<Book>();
        }

        public BookTreeSetDemo(Comparator<Book> comp) {
            theSet = new TreeSet<Book>(comp);
        }

        public void addAll(Collection<Book> c) {
            theSet.addAll(c);
        }

        public void printAll() {
            System.out.println("The number of items in the set is " + theSet.size());
            Iterator<Book> myIterator = theSet.iterator();
            while (myIterator.hasNext()) {
                System.out.println(myIterator.next().toString());
            }
        }

        public static void main(String[] args) {
            Book b1 = new Book("Smith", "Basketweaving 101", "1234-5678-9012");
            Book b2 = new Book("Smith", "Basketweaving 101", "2345-6789-0123");
            Book b3 = new Book("Smith", "Basketweaving 101", "2345-6789-0123");
            Book b4 = new Book("Jones", "Basketweaving 102", "3456-7890-1234");
            List<Book> l = new LinkedList<Book>();
            l.add(b1);
            l.add(b2);
            l.add(b3);
            l.add(b4);

            // Natural ordering (author, then title): b1, b2, and b3 compare as equal,
            // so the set holds 2 books
            BookTreeSetDemo c1 = new BookTreeSetDemo();
            c1.addAll(l);
            c1.printAll();

            // Ordering by ISBN: only b2 and b3 compare as equal, so the set holds 3 books
            Comparator<Book> comp = new BookISBNComparator();
            BookTreeSetDemo c2 = new BookTreeSetDemo(comp);
            c2.addAll(l);
            c2.printAll();
        }
    }

    package booksdemo;

    import java.util.Comparator;

    public class BookISBNComparator implements Comparator<Book> {
        @Override
        public int compare(Book b1, Book b2) {
            return b1.getIsbn().compareTo(b2.getIsbn());
        }
    }

Maps

A List or array can be thought of as a set of key-value pairs in which the keys are integers (the indexes) and the values are the data being stored.

Suppose we want to be able to look up values using a key other than an integer index. For example, we need to look up friends' addresses based on their names. We could write a class with instance variables for name and address and then construct a List or Set. When we need to look up an address, we iterate through the list looking for a match for the name of the person whose address we want to look up.

Maps provide a simpler alternative by mapping keys of any type to values of any other type.

The Map Interface

The Map interface maps keys to elements. The keys are like indexes. In a List, the indexes are integers. In a Map, the keys can be any objects.

Map Interface and Class Hierarchy

An instance of Map represents a group of objects, each of which is associated with a key. You can get the object from a map using a key, and you have to use a key to put the object into the map.

[Figures: The Map Interface UML Diagram; Concrete Map Classes; Entry]

HashMap and TreeMap

The HashMap and TreeMap classes are two concrete implementations of the Map interface.
• HashMap is efficient for locating a value, inserting a mapping, and deleting a mapping.
• TreeMap, which implements SortedMap, is efficient for traversing the keys in sorted order.

HashMap

    // http://www-inst.eecs.berkeley.edu/~cs61c/sp13/labs/06/
    Map<String, String> myDict = new HashMap<String, String>();
    myDict.put("evacuate", "remove to a safe place");
    myDict.put("descend", "move or fall downwards");
    myDict.put("hypochondriac", "a person who is abnormally anxious about their health");
    myDict.put("injunction", "an authoritative warning or order");
    myDict.put("creek", "a stream, brook, or a minor tributary of a river");
    myDict.put("googol", "10^100"); // a googol is 1 followed by 100 zeros
    String defString = "The definition of ";
    System.out.println(defString + "descend : " + myDict.get("descend"));
    System.out.println(defString + "injunction : " + myDict.get("injunction"));
    System.out.println(defString + "googol : " + myDict.get("googol"));

LinkedHashMap

• The entries in a HashMap are not ordered.
• LinkedHashMap extends HashMap with a linked list implementation that supports an ordering of the entries in the map. Entries in a LinkedHashMap can be retrieved in the order in which they were inserted into the map (known as the insertion order), or in the order in which they were last accessed, from least recently accessed to most recently accessed (access order). The no-arg constructor constructs a LinkedHashMap with the insertion order.
• To construct a LinkedHashMap with the access order, use LinkedHashMap(initialCapacity, loadFactor, true).
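The difference between insertion order and access order is easier to see in a short run. The following is a minimal sketch rather than part of the lecture's demos; the class name AccessOrderDemo, the method name keysAfterAccess, the sample keys, and the capacity and load factor passed to the constructor (16 and 0.75f) are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AccessOrderDemo {
    // Builds a map with three keys, reads "a", and returns the resulting key order.
    static List<String> keysAfterAccess(boolean accessOrder) {
        Map<String, Integer> map = new LinkedHashMap<>(16, 0.75f, accessOrder);
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        map.get("a"); // in access order, this moves "a" to the end
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        System.out.println("insertion order: " + keysAfterAccess(false)); // [a, b, c]
        System.out.println("access order:    " + keysAfterAccess(true));  // [b, c, a]
    }
}
```

Access order is the basis for simple least-recently-used caches, since the entry at the front of the iteration is always the one that has gone longest without being read or written.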
Example: LinkedHashMap

    // adapted from http://www.tutorialspoint.com/java/java_linkedhashmap_class.htm
    // requires java.util.Iterator, java.util.LinkedHashMap, java.util.Map.Entry, java.util.Set
    public static void main(String args[]) {
        // Create a linked hash map
        LinkedHashMap<String, Double> lhm = new LinkedHashMap<String, Double>();
        // Put elements to the map (double literals are autoboxed to Double)
        lhm.put("Zara", 3434.34);
        lhm.put("Mahnaz", 123.22);
        lhm.put("Ayan", 1378.00);
        lhm.put("Daisy", 99.22);
        lhm.put("Qadir", -19.08);
        // Get a set of the entries
        Set<Entry<String, Double>> set = lhm.entrySet();
        // Get an iterator
        Iterator<Entry<String, Double>> i = set.iterator();
        // Display elements
        while (i.hasNext()) {
            Entry<String, Double> me = i.next();
            System.out.print(me.getKey() + ": ");
            System.out.println(me.getValue());
        }
        System.out.println();
        // Deposit 1000 into Zara's account
        double balance = lhm.get("Zara");
        lhm.put("Zara", balance + 1000);
        System.out.println("Zara's new balance: " + lhm.get("Zara"));
    }

TreeMap

    package demos;

    import java.util.TreeMap;

    public class Demo {
        // adapted from http://www.roseindia.net/java/jdk6/TreeMapExample.shtml
        public static void main(String[] args) {
            TreeMap<Integer, String> tMap = new TreeMap<Integer, String>();
            // inserting data in alphabetical order by entry value
            tMap.put(6, "Friday");
            tMap.put(2, "Monday");
            tMap.put(7, "Saturday");
            tMap.put(1, "Sunday");
            tMap.put(5, "Thursday");
            tMap.put(3, "Tuesday");
            tMap.put(4, "Wednesday");
            // data ends up sorted by key value
            System.out.println("Keys of tree map: " + tMap.keySet());
            System.out.println("Values of tree map: " + tMap.values());
            System.out.println("Key: 5 value: " + tMap.get(5) + "\n");
            System.out.println("First key: " + tMap.firstKey() + " Value: " + tMap.get(tMap.firstKey()) + "\n");
            System.out.println("Last key: " + tMap.lastKey() + " Value: " + tMap.get(tMap.lastKey()) + "\n");

            System.out.println("All values with an enhanced for loop: ");
            for (String s : tMap.values())
                System.out.println(s);

            // headMap(4) holds the entries whose keys are strictly less than 4
            System.out.println("First three values: ");
            for (String s : tMap.headMap(4).values())
                System.out.println(s);

            // tailMap(4) holds the entries whose keys are greater than or equal to 4
            System.out.println("Values starting with fourth value: ");
            for (String s : tMap.tailMap(4).values())
                System.out.println(s);

            System.out.println("Removing first datum: " + tMap.remove(tMap.firstKey()));
            System.out.println("Now the tree map Keys: " + tMap.keySet());
            System.out.println("Now the tree map contains: " + tMap.values() + "\n");
            System.out.println("Removing last entry: " + tMap.remove(tMap.lastKey()));
            System.out.println("Now the tree map Keys: " + tMap.keySet());
            System.out.println("Now the tree map contains: " + tMap.values());
        }
    }

Case Study: Counting the Occurrences of Words in a Text

This program counts the occurrences of words in a text and displays the words and their occurrences in ascending alphabetical order. The program uses a map to store pairs consisting of a word and its count.

Algorithm for handling input:
    For each word in the input file
        Remove non-letters (such as punctuation marks) from the word.
        If the word is already present in the frequencies map
            Increment the frequency.
        Else
            Set the frequency to 1.

If the counts are kept in a hash map, convert it to a tree map to sort it. The program below uses a TreeMap from the start, so no conversion is needed.
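The two-step approach described in the case study (count in a hash map, then convert to a tree map for sorted output) can be sketched as follows. This is a minimal sketch and not the lecture's program: the class name WordCountSketch, the method countWords, and the in-memory sample sentence (used here instead of an input file) are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // Counts words in a HashMap, then copies the result into a TreeMap to sort by key.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String token : text.split("\\s+")) {
            // Remove non-letters (such as punctuation marks) from the word
            String word = token.replaceAll("[^\\p{L}]", "").toLowerCase();
            if (!word.isEmpty()) {
                // merge() stores 1 for a new word, otherwise adds 1 to the old count
                frequencies.merge(word, 1, Integer::sum);
            }
        }
        return new TreeMap<>(frequencies);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("The cat saw the other cat. The end!");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Counting in a HashMap keeps each update cheap, and the single conversion at the end pays the sorting cost once, when the sorted view is actually needed.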
    package frequency;

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.Map;
    import java.util.Scanner;
    import java.util.TreeMap;

    public class WordFrequency {
        public static void main(String[] args) throws FileNotFoundException {
            Map<String, Integer> frequencies = new TreeMap<String, Integer>();
            Scanner in = new Scanner(new File("romeojuliet.txt"));
            while (in.hasNext()) {
                String word = clean(in.next());
                // Test string contents with isEmpty(); != would compare references
                if (!word.isEmpty()) {
                    // Get the old frequency count
                    Integer count = frequencies.get(word);
                    // If there was none, put 1; otherwise, increment the count
                    if (count == null) {
                        count = 1;
                    } else {
                        count = count + 1;
                    }
                    frequencies.put(word, count);
                }
            }
            in.close();
            // Print all words and counts
            for (String key : frequencies.keySet()) {
                System.out.println(key + ": " + frequencies.get(key));
            }
        }

        /**
         * Removes characters from a string that are not letters.
         *
         * @param s a string
         * @return a string with all the letters from s, in lower case
         */
        public static String clean(String s) {
            StringBuilder r = new StringBuilder();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (Character.isLetter(c)) {
                    r.append(c);
                }
            }
            return r.toString().toLowerCase();
        }
    }

Pattern Matching

Programming often requires testing strings for particular patterns:
• Search text for instances of a particular word, like using Ctrl+F in a word processor.
• Search for instances of a word that might have variant spellings, like color and colour.
• Search for instances of a range of different substrings, like any integer.
• Test a string to see whether it matches a complex pattern, like a URL or a Social Security Number (xxx-xx-xxxx, where each x stands for a single digit).

Regular Expressions

• In many cases we can specify exactly the characters we want to match, like 'c'.
• In others, a pattern includes a single character which can match any other character.
• Sometimes a substring can match any string, as when searching a Windows directory for all files with a particular filename extension.
• In yet other cases, a substring must match any of a range of possible substrings, e.g. when validating input for an email address.

Regular Expressions provide a way to specify characteristics of strings that is flexible enough to use in all these cases.

• You will use RegExs in many contexts, but a very common one is the matches() method of the String class.
• Consider two variants of the same Irish name, Hurley and O'Hurley. "John Hurley".equals("John O'Hurley") is false. However, we can use matches() with a RegEx if we are searching for either one.

    String h1 = "John Hurley";
    String h2 = "John O'Hurley";
    String either = "John (O')?Hurley";      // the group (O') is optional
    System.out.println(h1.equals(h1));       // true
    System.out.println(h1.equals(h2));       // false
    System.out.println(h1.matches(h2));      // false: h2, used as a pattern, matches only itself
    System.out.println(h1.matches(either));  // true
    System.out.println(h2.matches(either));  // true

RegExs

    Regular expression   Matches                             Example
    x                    a specified character x             Java matches Java
    .                    any single character                Java matches J..a
    (ab|cd)              ab or cd                            ten matches t(en|im)
    [abc]                a, b, or c                          Java matches Ja[uvwx]a
    [^abc]               any character except a, b, or c     Java matches Ja[^ars]a
    [a-z]                a through z                         Java matches [A-M]av[a-d]
    [^a-z]               any character except a through z    Java matches Jav[^b-d]
    [a-e[m-p]]           a through e or m through p          Java matches [A-G[I-M]]av[a-d]
    [a-e&&[c-p]]         intersection of a-e with c-p        Java matches [A-P&&[I-M]]av[a-d]

    \d                   a digit, same as [0-9]              Java2 matches "Java[\\d]"
    \D                   a non-digit                         $Java matches "[\\D][\\D]ava"
    \w                   a word character                    Java1 matches "[\\w]ava[\\w]"
    \W                   a non-word character                $Java matches "[\\W][\\w]ava"
    \s                   a whitespace character              "Java 2" matches "Java\\s2"
    \S                   a non-whitespace character          Java matches "[\\S]ava"

    p*                   zero or more occurrences of p       Java and av match "[A-z]*"
                                                             aaa and the empty string match "a*"
                                                             bbb does not match "a*"
    p+                   one or more occurrences of p        b, aa, and ZZZ match "[A-z]+"
    p?                   zero or one occurrence of p         Java and ava match "J?ava"
    p{n}                 exactly n occurrences of p          Java matches "Ja{1}va"
                                                             Java does not match "Ja{2}va"
    p{n,}                at least n occurrences of p         Java and Jaaava match "Ja{1,}va"
                                                             Java does not match "Ja{2,}va"
    p{n,m}               between n and m occurrences of p    a matches "a{1,9}"
                         (inclusive)                         aaaaaaaaaa does not match "a{1,9}"
                                                             Java does not match "Ja{2,9}va"

• matches() succeeds only if the whole string fits the pattern, which is why bbb does not match "a*" even though "a*" can match an empty substring.
• The range [A-z] also admits the six punctuation characters that fall between Z and a in the character table. [A-Za-z] is usually what is meant.
• Backslash is a special character that starts an escape sequence in a string. So you need to use "\\d" in Java to represent \d.
• A whitespace (or a whitespace character) is any character which does not display itself but does take up space. The characters ' ', '\t', '\n', '\r', '\f', and the vertical tab (\x0B) are whitespace characters. So \s is the same as [ \t\n\x0B\f\r], and \S is the same as [^ \t\n\x0B\f\r].
• *, +, ?, {n}, {n,}, and {n,m} in the table above are quantifiers that specify how many times the pattern before a quantifier may repeat. For example, A* matches zero or more A's, A+ matches one or more A's, A? matches zero or one A's, A{3} matches exactly AAA, A{3,} matches at least three A's, and A{3,6} matches between 3 and 6 A's. * is the same as {0,}, + is the same as {1,}, and ? is the same as {0,1}.
• Do not use spaces in the repeat quantifiers. For example, A{3,6} cannot be written as A{3, 6} with a space after the comma.
• You may use parentheses to group patterns. For example, (ab){3} matches ababab, but ab{3} matches abbb.

    String ssNum = "123-45-6789";
    String notSsNum = "123456789";
    String ssPat = "[\\d]{3}-[\\d]{2}-[\\d]{4}"; // Social Security Number
    System.out.println(ssNum.matches(ssPat));    // true
    System.out.println(notSsNum.matches(ssPat)); // false: the hyphens are missing

• Always provide users with guidance when they must match a regex!
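The quantifier and grouping rules above can be checked directly with String.matches(), which succeeds only when the entire string fits the pattern. The following is a minimal sketch; the class name QuantifierDemo, the helper method show, and the sample strings are chosen for illustration only.

```java
public class QuantifierDemo {
    // Prints and returns whether the whole string s matches the pattern regex.
    static boolean show(String s, String regex) {
        boolean result = s.matches(regex);
        System.out.println("\"" + s + "\" matches " + regex + " : " + result);
        return result;
    }

    public static void main(String[] args) {
        show("ababab", "(ab){3}"); // true: the group ab repeats three times
        show("abbb", "ab{3}");     // true: only the b repeats
        show("ababab", "ab{3}");   // false
        show("AAAA", "A{3,6}");    // true: between 3 and 6 A's
        show("AA", "A{3,}");       // false: fewer than three A's
        show("bbb", "a*");         // false: the whole string must match
        show("", "a*");            // true: zero occurrences is allowed
    }
}
```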
RegExs

    // requires import javax.swing.JOptionPane;
    public static void main(String[] args) {
        String ssNum = null;
        String ssPatt = "[\\d]{3}-[\\d]{2}-[\\d]{4}";
        String input;
        String basePrompt = "Please enter your Social Security Number in the format XXX-XX-XXXX";
        String prompt = basePrompt;
        do {
            input = JOptionPane.showInputDialog(null, prompt);
            if (input == null)
                return; // the user cancelled the dialog
            if (input.matches(ssPatt))
                ssNum = input;
            else
                prompt = "Invalid input. " + basePrompt;
        } while (ssNum == null);
        JOptionPane.showMessageDialog(null, "Thanks, taxpayer # " + ssNum);
    }

    String tmi = "My Social Security number is 123-45-6789 and "
            + "my telephone number is (123)456-7890";
    // The parentheses in the phone pattern are escaped so that they match literally
    String sanitized = tmi.replaceAll("[\\d]{3}-[\\d]{2}-[\\d]{4}", "XXX-XX-XXXX")
            .replaceAll("\\([\\d]{3}\\)[\\d]{3}-[\\d]{4}", "(XXX)XXX-XXXX");
    // prints: My Social Security number is XXX-XX-XXXX and my telephone number is (XXX)XXX-XXXX
    System.out.println(sanitized);
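replaceAll() rewrites every match. When the matches themselves are needed, for example to count or log them, the java.util.regex classes Pattern and Matcher can be used with the same kind of pattern. The following is a minimal sketch under that assumption; the class name SsnFinder and the method findSsns are hypothetical names introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SsnFinder {
    // Compiled once and reused; the same shape of pattern as in the examples above
    private static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    // Returns every substring of text that looks like xxx-xx-xxxx.
    static List<String> findSsns(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = SSN.matcher(text);
        while (m.find()) { // find() scans for the next match anywhere in the text
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String tmi = "My Social Security number is 123-45-6789 and "
                + "my telephone number is (123)456-7890";
        System.out.println(findSsns(tmi)); // [123-45-6789]
    }
}
```

Unlike String.matches(), Matcher.find() does not require the whole string to fit the pattern, so it suits searching inside longer text.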