Text & Patterns stat 579
 Heike Hofmann

Outline
• Character Variables
• Control Codes
• Patterns & Matching
Baby Names Data
• The Social Security Administration keeps track of all
baby names used in applications for Social
Security numbers (effectively all births),
see http://ftp.ssa.gov/OACT/babynames/
• There is a really cool visualization by Martin
Wattenberg: see http://www.babynamewizard.com/voyager
• In R there is a package called babynames
Your turn
• Load the babynames data from package babynames
• Warm-up:
• what were the top 10 names for boys and girls in
2013?
• which names have been in use every single year?
• extend the data set to include ranking by year and
gender (use dplyr and the function mutate)
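A minimal sketch of the ranking step, assuming the data has columns year, sex, name and n as in the babynames package. The small data frame `toy` and its counts are made up so the example runs on its own; with the real data you would start from `babynames` instead.

```r
library(dplyr)

# Made-up stand-in for the babynames data (same column names, invented counts)
toy <- data.frame(
  year = c(2013, 2013, 2013, 2013, 2012, 2012),
  sex  = c("F", "F", "M", "M", "F", "F"),
  name = c("Sophia", "Emma", "Noah", "Liam", "Sophia", "Emma"),
  n    = c(21000, 20000, 18000, 17000, 22000, 20500),
  stringsAsFactors = FALSE
)

# Rank names within each year and gender; rank 1 is the most frequent name
ranked <- toy %>%
  group_by(year, sex) %>%
  mutate(rank = min_rank(desc(n))) %>%
  ungroup()

# Top names for one year, by gender
ranked %>% filter(year == 2013, rank <= 10) %>% arrange(sex, rank)
```

`min_rank()` gives tied names the same rank; `row_number()` would be an alternative if ties should be broken arbitrarily.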
Character Variables
• cast between character and factor
as.character(x)
as.factor(x)
> x <- bnames$name
> is.character(x)
[1] FALSE
> head(sort(table(bn$name), decreasing=T), 20)
   Jessie    Leslie Guadalupe      Jean       Lee     James      John   William
      260       248       245       245       241       240       239       238
   Robert   Francis      Dana   Charles    Marion      Mary    Willie   Johnnie
      236       230       224       223       223       223       223       216
   Joseph    Billie    Carmen       Joe
      216       213       212       211
> x <- as.character(bnames$name)
> summary(x)
   Length     Class      Mode
   260000 character character
Brainstorm
Thinking about the data, what are some of the
trends that you might want to explore? What
additional variables would you need to create?
What other data sources might you want to
use?
Pair up and brainstorm for 2 minutes.
Some Ideas
• length
• first/last letter
• rank
• percent vowels/consonants
• influential people/events (brad, angelina,
barack, george, ... )
Some useful commands
• ... back to the reference card
• nchar
• substring
• paste
• tolower, toupper
• print, cat
Your turn
• find length of each baby name
• get first and last letter for each baby name
(make sure to convert all names to lower
case first)
• think about how to determine number of
vowels in a name
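One possible way to approach the first two tasks with nchar, substring and tolower. The vector `nms` holds a few made-up example names; with the real data you would use the name column instead.

```r
# A few made-up example names (not taken from the data)
nms <- c("Anna", "Robert", "Evette")

lower <- tolower(nms)                 # convert to lower case first
len   <- nchar(lower)                 # length of each name
first <- substring(lower, 1, 1)       # first letter
last  <- substring(lower, len, len)   # last letter: start and stop at position len

data.frame(name = nms, length = len, first = first, last = last)
```

substring is vectorised over its start and stop arguments, which is why a different last position per name works without a loop.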
Advanced
• Find graphics to answer the following
questions:
• Does the first/last letter change over time? Does it depend on gender?
• Which names are used both for girls and
boys?
Patterns & Matches
• gsub(pattern, replacement, x)
• grep, regexpr, gregexpr,
strsplit
Patterns & Matches
> x <- tolower(sample(unique(bn$name), size=3))
> x
[1] "evette" "tabatha" "bluford"
grep
> grep('a', x)
[1] 2
regexpr
> regexpr('a', x)
[1] -1 2 -1
attr(,"match.length")
[1] -1 1 -1
gregexpr
> gregexpr('a', x)
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 2 4 7
attr(,"match.length")
[1] 1 1 1
[[3]]
[1] -1
attr(,"match.length")
[1] -1
strsplit
> strsplit(x, 'a')
[[1]]
[1] "evette"
[[2]]
[1] "t" "b" "th"
[[3]]
[1] "bluford"
gsub
> gsub('a','', x)
[1] "evette" "tbth" "bluford"
Your turn
• Find number of ‘a’, ‘e’, ‘i’, ‘o’ and ‘u’s in each
name
• Find percentage of vowels in name
• Can you spot a difference in vowels between
boys and girls?
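One possible approach to the vowel counts, using only gsub and nchar: removing a letter and comparing lengths tells you how often it occurred. The three names are the ones from the example above; the object names `counts`, `n_vowels` and `pct_vowels` are just illustrative.

```r
x <- c("evette", "tabatha", "bluford")
vowels <- c("a", "e", "i", "o", "u")

# For each vowel: length before minus length after deleting that vowel
counts <- sapply(vowels, function(v) nchar(x) - nchar(gsub(v, "", x)))
rownames(counts) <- x

n_vowels   <- rowSums(counts)            # total vowels per name
pct_vowels <- 100 * n_vowels / nchar(x)  # percentage of vowels per name

counts
pct_vowels
```

Counting positive entries in the result of gregexpr would be an alternative; remember that a name without a match yields -1, not an empty vector.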
Regular Expressions
• ‘a|e’ a or e
• [aei] a or e or i
• [^aei] neither a nor e nor i
• ^[aei] a, e, or i at the beginning
• [aei]$ a, e , or i at the end
• ‘.’ any character
• (pattern) defines substring for re-use - call by
\1 \2 \3 ....
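The patterns above can be tried out with grepl and sub. This is a small sketch on made-up names; note that inside an R string the backslash of a back-reference has to be doubled ("\\1").

```r
nms <- c("anna", "otto", "eli", "bob", "jane")   # made-up examples

has_a_or_e  <- grepl("a|e", nms)      # contains a or e
starts_aei  <- grepl("^[aei]", nms)   # a, e or i at the beginning
ends_aei    <- grepl("[aei]$", nms)   # a, e or i at the end
starts_other <- grepl("^[^aei]", nms) # first letter is neither a, e nor i
three_chars <- grepl("^...$", nms)    # exactly three arbitrary characters
doubled     <- grepl("(.)\\1", nms)   # some character immediately repeated

# Back-references also work in the replacement: swap the first two letters
swapped <- sub("^(.)(.)", "\\2\\1", nms)

data.frame(nms, has_a_or_e, starts_aei, ends_aei, doubled, swapped)
```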
Repetition Quantifiers
• ? preceding pattern is optional (matched 0 or 1
time)
• * preceding pattern is matched zero or more times
• + preceding pattern is matched at least once
• {n} preceding pattern is matched exactly n times
• {n,} preceding pattern is matched at least n times
• {n,m} preceding pattern is matched at least n
times and up to m times
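A short sketch of how the quantifiers behave, applied to the letter h in a few made-up strings. The anchors ^ and $ force the whole string to match, so the effect of each quantifier is easy to see.

```r
x <- c("jon", "john", "johhn", "jn")   # made-up test strings

q_optional <- grepl("^joh?n$", x)      # zero or one h
q_star     <- grepl("^joh*n$", x)      # zero or more h
q_plus     <- grepl("^joh+n$", x)      # at least one h
q_exact    <- grepl("^joh{2}n$", x)    # exactly two h
q_atleast  <- grepl("^joh{1,}n$", x)   # at least one h (same as +)
q_range    <- grepl("^joh{0,1}n$", x)  # between zero and one h (same as ?)

data.frame(x, q_optional, q_star, q_plus, q_exact, q_atleast, q_range)
```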
Your turn
• Find one pattern that helps you to
• find all names that start with ‘Joh’
• find all names of length 2 (without using nchar())
• find all names that have a pattern of consonant-vowel-consonant-vowel-consonant-vowel-consonant ...
• Find all names that are palindromes (e.g. Anna, Hannah,
Ava, ...) - is it possible to find one pattern that
recognizes all palindromes?
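One possible set of patterns for these tasks, sketched on made-up names. Two assumptions: "consonant" is treated as "any character that is not a, e, i, o or u", and the back-reference pattern shown covers four-letter palindromes only. The general palindrome check at the end reverses the letters instead of using a pattern.

```r
nms   <- c("John", "Johanna", "Al", "Anna", "Hannah", "Ava", "Lola", "Ramon")
lower <- tolower(nms)

starts_joh  <- grepl("^Joh", nms)      # names starting with 'Joh'
length_two  <- grepl("^..$", nms)      # exactly two characters
# consonant-vowel pairs repeated, optionally ending in one more consonant
alternating <- grepl("^([^aeiou][aeiou])+[^aeiou]?$", lower)
# four-letter palindromes: two letters, then the same two in reverse order
pal4        <- grepl("^(.)(.)\\2\\1$", lower)

# Not a pattern: split into letters and compare with the reversed letters
is_pal <- sapply(strsplit(lower, ""), function(ch) identical(ch, rev(ch)))

nms[is_pal]
```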
Advanced Patterns
see ?regex
• [:alpha:] Any alphabetic character
• [:lower:] Any lowercase character
• [:upper:] Any uppercase character
• [:digit:] Any digit
• [:alnum:] Any alphanumeric character (alphabetic or digit)
• [:space:] Any white space character (space, tab, vertical tab)
• [:graph:] Any printable character, except space
• [:print:] Any printable character, including the space
• [:punct:] Any punctuation (i.e., a printable character that is not white space or alphanumeric)
• [:cntrl:] Any nonprintable character
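A small sketch of how these classes are used. As ?regex explains, the class names are only valid inside a bracket expression, so the pattern is written "[[:digit:]]" rather than "[:digit:]". The test strings are made up.

```r
x <- c("Anna", "anna 2", "R2-D2", "tab\there")   # made-up strings

has_digit    <- grepl("[[:digit:]]", x)    # contains a digit
has_space    <- grepl("[[:space:]]", x)    # contains white space (space or tab here)
starts_upper <- grepl("^[[:upper:]]", x)   # starts with an uppercase letter
has_punct    <- grepl("[[:punct:]]", x)    # contains punctuation

# Classes can be negated like any other bracket expression:
# drop everything that is not a letter
letters_only <- gsub("[^[:alpha:]]", "", x)

data.frame(has_digit, has_space, starts_upper, has_punct, letters_only)
```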
Special Characters, Control Codes, Escape Sequences
• "\n" newline
• "\r" carriage return
• "\t" tabulator
• "\b" backspace
• "\\" backslash
• "\a" alert
• see ?Quotes
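A brief sketch of how escape sequences behave in practice: each one is a single character in the string, print shows the escaped form, and cat writes the actual characters. The string `s` is a made-up example.

```r
s <- "name\tcount\nAnna\t3"   # made-up string with two tabs and one newline

n_chars <- nchar(s)      # every escape sequence counts as one character
n_back  <- nchar("\\")   # "\\" is a single backslash

print(s)      # shows the escapes: "name\tcount\nAnna\t3"
cat(s, "\n")  # interprets them: two lines with tab-separated fields

# Escape sequences can be used as split patterns, too
fields <- strsplit(strsplit(s, "\n")[[1]], "\t")
fields
```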
More of a challenge
• The IMDb database publishes data in a slightly
odd format - file movies.list is an example of (the
first 100) movie names published.
• Read the data into R using readLines first (every
line is one element of the resulting vector)
• Use the text commands we discussed today to
extract the following pieces of information:
Type, Year, Title
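A rough sketch of how the extraction could start. Everything about the file format here is an assumption, since the file itself is not shown: the three lines in `lines` are made up, each is assumed to look like a title, a year in parentheses, optional markers, then tabs and a year field. It is also assumed that titles in double quotes denote series and that a (V) marker denotes a video. Check these assumptions against the real movies.list and replace `lines` by the result of readLines.

```r
# Hypothetical lines standing in for readLines("movies.list")
lines <- c('"Some Show" (2006)\t\t\t2006-????',
           'Some Film (1999)\t\t\t1999',
           'Some Video (2001) (V)\t\t\t2001')

# Everything before the first tab is the entry itself
entry <- sub("\t.*$", "", lines)

# Year: the four digits following an opening parenthesis
year <- as.numeric(sub("^.*\\(([0-9]{4})[^)]*\\).*$", "\\1", entry))

# Title: everything before the year parenthesis, with quotes removed
title <- gsub('"', "", sub(" *\\([0-9]{4}.*$", "", entry))

# Type: guessed from the markers (assumed conventions, see above)
type <- ifelse(grepl('^"', entry), "series",
        ifelse(grepl("\\(V\\)", entry), "video", "movie"))

data.frame(type, year, title)
```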