Text & Patterns
Stat 579
Heike Hofmann

Outline
• Character Variables
• Control Codes
• Patterns & Matching

Baby Names Data
• The social security agency keeps track of all baby names used in applications for social security numbers (effectively all births), see http://ftp.ssa.gov/OACT/babynames/
• There is a really cool visualization by Martin Wattenberg: see http://www.babynamewizard.com/voyager
• In R there is a package called babynames

Your turn
• Load the babynames data from package babynames
• Warm-up:
  • What were the top 10 names for boys and girls in 2013?
  • Which names have been in use every single year?
  • Extend the data set to include ranking by year and gender (use dplyr and the function mutate)

Character Variables
• Cast between character and factor: as.character(x), as.factor(x)

    > x <- bnames$name
    > is.character(x)
    [1] FALSE

    > head(sort(table(bn$name), decreasing=TRUE), 20)

       Jessie    Leslie Guadalupe      Jean       Lee     James      John   William
          260       248       245       245       241       240       239       238
       Robert   Francis      Dana   Charles    Marion      Mary    Willie   Johnnie
          236       230       224       223       223       223       223       216
       Joseph    Billie    Carmen       Joe
          216       213       212       211

    > x <- as.character(bnames$name)
    > summary(x)
       Length     Class      Mode
       260000 character character

Brainstorm
Thinking about the data, what are some of the trends that you might want to explore? What additional variables would you need to create? What other data sources might you want to use?
Pair up and brainstorm for 2 minutes.

Some Ideas
• length
• first/last letter
• rank
• percent vowels/consonants
• influential people/events (brad, angelina, barack, george, ... )

Some useful commands
• ... back to the reference card
• nchar
• substring
• paste
• tolower, toupper
• print, cat

Your turn
• Find the length of each baby name
• Get the first and last letter of each baby name (make sure to convert all names to lowercase first)
• Think about how to determine the number of vowels in a name

Advanced
• Find graphics to answer the following questions:
  • Does the first/last letter change over time?
    Does it depend on gender?
  • Which names are used both for girls and boys?

Patterns & Matches
• grep, regexpr, gregexpr, strsplit
• gsub(pattern, replacement, x)

Patterns & Matches

    > x <- tolower(sample(unique(bn$name), size=3))
    > x
    [1] "evette"  "tabatha" "bluford"

grep

    > grep('a', x)
    [1] 2

regexpr

    > regexpr('a', x)
    [1] -1  2 -1
    attr(,"match.length")
    [1] -1  1 -1

gregexpr

    > gregexpr('a', x)
    [[1]]
    [1] -1
    attr(,"match.length")
    [1] -1

    [[2]]
    [1] 2 4 7
    attr(,"match.length")
    [1] 1 1 1

    [[3]]
    [1] -1
    attr(,"match.length")
    [1] -1

strsplit

    > strsplit(x, 'a')
    [[1]]
    [1] "evette"

    [[2]]
    [1] "t"  "b"  "th"

    [[3]]
    [1] "bluford"

gsub

    > gsub('a','', x)
    [1] "evette"  "tbth"    "bluford"

Your turn
• Find the number of ‘a’, ‘e’, ‘i’, ‘o’ and ‘u’s in each name
• Find the percentage of vowels in each name
• Can you spot a difference in vowels between boys and girls?

Regular Expressions
• a|e  a or e
• [aei]  a or e or i
• [^aei]  neither a nor e nor i
• ^[aei]  a, e, or i at the beginning
• [aei]$  a, e, or i at the end
• .  any character
• (pattern)  defines a substring for re-use; refer to it by \1, \2, \3, ... (written "\\1" inside an R string)

Repetition Quantifiers
• ?  preceding pattern is optional (matched 0 or 1 time)
• *  preceding pattern is matched zero or more times
• +  preceding pattern is matched at least once
• {n}  preceding pattern is matched exactly n times
• {n,}  preceding pattern is matched at least n times
• {n,m}  preceding pattern is matched at least n times and up to m times

Your turn
• Find one pattern that helps you to
  • find all names that start with ‘Joh’
  • find all names of length 2 (without using nchar())
  • find all names that have a pattern of consonant-vowel-consonant-vowel-consonant-vowel-consonant ...
  • find all names that are palindromes (e.g. Anna, Hannah, Ava, ...) - is it possible to find one pattern that recognizes all palindromes?
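Before tackling the exercises, it can help to try the pieces from the Regular Expressions and Repetition Quantifiers slides on a tiny vector. The following is a minimal sketch, not part of the original slides: the names in x are made up for illustration (they are not drawn from the babynames data), and it uses grepl and sub, the base R relatives of grep and gsub that return a logical vector and replace only the first match.

```r
# Made-up names for illustration only (not taken from the babynames data)
x <- c("anna", "otto", "eli", "john", "aaron", "lee")

# Alternation and character classes
has_a_or_e <- grepl("a|e", x)      # contains an a or an e
starts_aei <- grepl("^[aei]", x)   # a, e or i at the beginning
ends_aei   <- grepl("[aei]$", x)   # a, e or i at the end
other_end  <- grepl("[^aei]$", x)  # last character is neither a, e nor i

# Repetition quantifiers
two_vowels  <- grepl("[aeiou]{2}", x)  # two vowels in a row
three_chars <- grepl("^.{3}$", x)      # exactly three characters in total

# Back-references: \1 is written "\\1" inside an R string
doubled <- grepl("(.)\\1", x)               # same character twice in a row
rotated <- sub("^(.)(.*)$", "\\2\\1", x)    # move the first letter to the end

x[starts_aei]   # "anna" "eli" "aaron"
x[three_chars]  # "eli" "lee"
rotated         # "nnaa" "ttoo" "lie" "ohnj" "arona" "eel"
```

Indexing with the logical vector, as in x[starts_aei], does the same job as grep(pattern, x, value = TRUE).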
Advanced Patterns
see ?regex (these classes are used inside a bracket expression, e.g. [[:alpha:]])
• [:alpha:]  Any alphabetic character
• [:lower:]  Any lowercase character
• [:upper:]  Any uppercase character
• [:digit:]  Any digit
• [:alnum:]  Any alphanumeric character (alphabetic or digit)
• [:space:]  Any white space character (space, tab, newline, vertical tab, ...)
• [:graph:]  Any printable character, except space
• [:print:]  Any printable character, including the space
• [:punct:]  Any punctuation (i.e., a printable character that is not white space or alphanumeric)
• [:cntrl:]  Any nonprintable (control) character

Special Characters
Control Codes, Escape Sequences
• "\n"  newline
• "\r"  carriage return
• "\t"  tabulator
• "\b"  backspace
• "\\"  backslash
• "\a"  alert
• see ?Quotes

More of a challenge
• The IMDb database publishes data in a slightly odd format - file movies.list is an example (the first 100 movie names published).
• Read the data into R using readLines first (every line is one element of the resulting vector)
• Use the text commands we discussed today to extract the following pieces of information: Type, Year, Title
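The character classes and escape sequences listed above can be checked on a few strings before applying them to real data. This is a minimal sketch, not part of the original slides: the strings in s are made up for illustration and have nothing to do with the layout of movies.list. It assumes the default regular expression engine, where a class such as [:digit:] has to sit inside a bracket expression.

```r
# Made-up strings for illustration; note the escape sequences inside the quotes
s <- c("Anna 1990", "tab\there", "two\nlines", "O'Neil-Smith")

# Character classes go inside a bracket expression: [[:digit:]], not [:digit:]
has_digit <- grepl("[[:digit:]]", s)  # TRUE only for "Anna 1990"
has_space <- grepl("[[:space:]]", s)  # space, tab and newline all count
has_upper <- grepl("[[:upper:]]", s)  # TRUE for the 1st and 4th string

# Strip punctuation (apostrophe and hyphen) from the last string
no_punct <- gsub("[[:punct:]]", "", s[4])  # "ONeilSmith"

# An escape sequence is a single character
n_tab       <- nchar("tab\there")  # 8, not 9
n_backslash <- nchar("\\")         # 1

# print shows the escape sequence, cat interprets it
print(s[2])
cat(s[2], "\n")
```

The difference between print and cat matters when inspecting lines read with readLines: print shows which control codes are present, cat shows how the line would be displayed.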