Language and Geography

Brendan O’Connor
Social Media Analysis, 3/18/2010
http://anyall.org/blog/2009/05/where-tweets-get-sent-from/
Analyze Geography and Language
Using Twitter data:
(1) Identify author & message locations
(2) Side note: opinions about self’s location
Applications:
(3) Analyze language use by geography
– Example: find regional dialects
(4) Predict geographically embedded real-world phenomena
– Example: per-state retail sales
Application: Retail Forecasting
Identify author locations
U.S. State Identification
• String-matching approach
• Match on
  1. Full names (“Pennsylvania”)
     • Case-insensitive
  2. Abbreviations (“PA”)
     • Case-sensitive
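The matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the talk: the `STATES` table is a small hypothetical subset, `identify_state` is a name chosen for this example, and trying full names before abbreviations is an assumption.

```python
import re

# Illustrative subset only; a real table would list every state and territory.
STATES = {
    "PA": "Pennsylvania",
    "CA": "California",
    "NY": "New York",
    "IN": "Indiana",
    "ME": "Maine",
}

# Rule 1: full names, matched case-insensitively as whole words.
FULL_NAME_RE = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in STATES.values()) + r")\b",
    re.IGNORECASE,
)
# Rule 2: abbreviations, matched case-sensitively as whole words.
ABBREV_RE = re.compile(r"\b(" + "|".join(STATES) + r")\b")

NAME_TO_ABBREV = {name.lower(): abbrev for abbrev, name in STATES.items()}


def identify_state(location):
    """Return a two-letter state code for a free-text location, or None."""
    m = FULL_NAME_RE.search(location)
    if m:
        return NAME_TO_ABBREV[m.group(1).lower()]
    m = ABBREV_RE.search(location)
    if m:
        return m.group(1)
    return None
```

With these rules, "new york" resolves to NY and "Scranton, pa" is left unmatched because abbreviations are case-sensitive. A string such as "U GOTTA CATCH ME!" is still labeled ME, which is the kind of false positive a plain string match produces.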
Examples
Detected state  Location string
AZ              Scottsdale, AZ
MO              St. Louis, MO
MI              Michigan
CA              Sacramento, CA
FL              Jacksonville, FL
CA              Santa Cruz, CA
IN              Indianapolis, Indiana
CA              2OH!9, California
TX              Dallas, TX
NY              new york
IL              Chicago, IL
CT              Hartford, CT
GA              Georgia
HI              Hawaii
WA              Seattle, WA, USA
CT              Watertown, CT
CA              Bay Area, California
DC              DC Metro Area
IA              Iowa
NC              Raleigh, NC
CA              California
CA              southern california
GA              Atlanta, GA
CA              Porn Valley, CA
TN              Newbern, TN
CA              Westlake Village, CA, USA
MS              Dourados, MS
ME              U GOTTA CATCH ME!
CA              Malibu, California
NC              North Carolina
NY              Windsor, NY
Problems?
AL AK AS AZ AR CA CO CT DE DC
FM FL GA GU HI ID IL IN IA KS KY
LA ME MH MD MA MI MN MS
MO MT NE NV NH NJ NM NY NC
ND MP OH OK OR PW PA PR RI
SC SD TN TX UT VT VI VA WA WV
WI WY
Brazil
• Brazilian states use two-letter abbreviations, as U.S. states do, and many of them overlap
– Belém,PA
– São Luís, MA
– Maceió AL
• “SC”
– Myrtle Beach, SC
– Charleston, SC, U.S.A.
– Joinville - SC
– Mafra - SC
– Palmtios – SC
– FLORIANÓPOLIS, SC, BRASIL
U.S. State Identification
• String-matching approach
• Match on
  1. Full names (“Pennsylvania”)
     • Case-insensitive
  2. Abbreviations (“PA”)
     • Case-sensitive
     • Brazil check
     • Common words check
       – DE, ME
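The slides do not say how the Brazil check or the common words check were implemented, so the following is only one possible sketch. Everything in it is an assumption made for illustration: the `BRAZIL_HINTS` gazetteer, the rule that DE and ME count only directly after a comma, the reduced `US_ABBREVS` set, and all function names.

```python
import re
import unicodedata

# Codes shared by U.S. states/territories and Brazilian states.
BRAZIL_OVERLAP = {"AL", "MA", "MS", "MT", "PA", "PR", "SC"}
# Abbreviations that are also everyday words.
COMMON_WORDS = {"DE", "ME"}
# Hypothetical, tiny gazetteer of strings that suggest a Brazilian location.
BRAZIL_HINTS = {"brasil", "brazil", "belem", "sao luis", "maceio",
                "joinville", "mafra", "florianopolis", "dourados"}
# Reduced abbreviation set, for illustration only.
US_ABBREVS = {"AL", "CA", "DE", "MA", "ME", "MS", "NY", "PA", "SC", "TX"}

ABBREV_RE = re.compile(r"\b([A-Z]{2})\b")  # case-sensitive, whole word


def strip_accents(text):
    """Fold accented letters to ASCII, e.g. 'Belém' becomes 'Belem'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def looks_brazilian(location):
    folded = strip_accents(location).lower()
    return any(hint in folded for hint in BRAZIL_HINTS)


def match_abbreviation(location):
    """Return a U.S. state abbreviation that survives both checks, or None."""
    for m in ABBREV_RE.finditer(location):
        abbrev = m.group(1)
        if abbrev not in US_ABBREVS:
            continue
        # Brazil check: drop overlapping codes when the string looks Brazilian.
        if abbrev in BRAZIL_OVERLAP and looks_brazilian(location):
            continue
        # Common words check: accept DE/ME only in "City, ME" position.
        if abbrev in COMMON_WORDS:
            if not location[:m.start()].rstrip().endswith(","):
                continue
        return abbrev
    return None
```

A gazetteer this small misses misspelled or unlisted towns (for example “Palmtios – SC”), so a real filter would need a much larger place list or an additional signal.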
Experiment
• 4,793,729 messages – stream sample
• 2,309,284 unique users
• 1,624,983 unique users with non-blank location
• Detections
– 838,012 U.S. State
– 346,553 Latitude, Longitude
– 3,163 Five-digit ZIP Code
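The other two detection types could be found with simple patterns. The sketch below is an assumption about how such detectors might look, not the method behind the counts above; the regular expressions, the range check, and the function names are hypothetical.

```python
import re

# Two signed decimal numbers separated by a comma, e.g. "40.4406,-79.9959".
LATLON_RE = re.compile(r"(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")
# Exactly five digits, not part of a longer digit run.
ZIP_RE = re.compile(r"(?<!\d)(\d{5})(?!\d)")


def detect_latlon(location):
    """Return (lat, lon) if the string holds a plausible coordinate pair."""
    m = LATLON_RE.search(location)
    if not m:
        return None
    lat, lon = float(m.group(1)), float(m.group(2))
    # Reject number pairs that cannot be geographic coordinates.
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return lat, lon
    return None


def detect_zip(location):
    """Return a five-digit ZIP code found in the string, or None."""
    m = ZIP_RE.search(location)
    return m.group(1) if m else None
```

For example, `detect_latlon("iPhone: 40.4406,-79.9959")` yields a coordinate pair, while `detect_zip("Pittsburgh 15213")` yields the string "15213".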
OH  cleveland ohio sadly :(
NY  Syracuse, NY :)
IL  Close to ur heart =],Illinois
TX  S.A TX :D
MN  Minnesota :)
CA  California, Newport Beach :)
SC  JERSEY but in Cola SC 4 now:-)
NC  Charlotte,NC =(
CA  Playboy Mansion California. :)
NY  Bronx,NY :)
State  happy:sad  %happy
ND     2:3        0.400
NV     2:3        0.400
MO     6:7        0.462
ID     2:2        0.500
WY     2:2        0.500
RI     5:3        0.625
UT     5:3        0.625
KY     12:6       0.667
MT     2:1        0.667
NE     6:3        0.667
NH     2:1        0.667
SD     4:2        0.667
MA     13:6       0.684
WI     11:5       0.688
WV     7:3        0.700
NM     5:2        0.714
AR     6:2        0.750
CT     12:4       0.750
DE     6:2        0.750
SC     10:3       0.769
MS     11:3       0.786
OR     11:3       0.786
IN     19:5       0.792
AK     4:1        0.800
GU     4:1        0.800
KS     12:3       0.800
MN     17:4       0.810
IA     9:2        0.818
OH     41:9       0.820
HI     15:3       0.833
MD     22:4       0.846
MI     29:5       0.853
AL     18:3       0.857
VA     25:4       0.862
PA     19:3       0.864
NC     15:2       0.882
CO     8:1        0.889
TN     16:2       0.889
WA     18:2       0.900
NJ     40:4       0.909
PR     10:1       0.909
OK     11:1       0.917
FL     90:8       0.918
GA     24:2       0.923
ME     12:1       0.923
LA     55:4       0.932
AZ     31:2       0.939
DC     16:1       0.941
NY     146:9      0.942
IL     19:1       0.950
TX     151:5      0.968
CA     211:6      0.972
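The %happy column is consistent with happy / (happy + sad), expressed as a fraction. A short sketch of that arithmetic, with helper names chosen for this example:

```python
def parse_ratio(text):
    """Split a 'happy:sad' string such as '211:6' into two integers."""
    happy, sad = text.split(":")
    return int(happy), int(sad)


def percent_happy(happy, sad):
    """Fraction of emoticon-bearing locations that are happy, or None."""
    total = happy + sad
    if total == 0:
        return None
    return happy / total
```

For CA, `round(percent_happy(*parse_ratio("211:6")), 3)` gives 0.972, matching the table. Many states have fewer than ten observations, so their fractions are very noisy.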
Emoticon parsing
5658 :)     2032 :D     1391 ;)
 845 =)      701 :]      583 :/
 554 =]      461 ;D      437 :P
 338 =D      278 :(      245 ;]
 197 :-)     138 ;-)     128 =P
 122 :p       93 :O       67 ;P
  51 :o       44 =/       42 ;p
  33 =p       31 =(       26 :\
  25 ;o       22 =[       20 :-D
  20 :[       15 :-P      15 ;O
  14 =O       11 :-p       9 :-/
   9 ;(        8 :-(       8 ;/
   7 :d        7 ;d        7 :-]
   5 =o        3 ;-P       3 :-O
   3 ;-D       3 ;[        2 ;-p
   2 =d        2 ;-(       2 =\
   1 :-d       1 :-[       1 ;\
   1 ;-]       1 =-]       1 =-)
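# Emoticon components, written as Python regular-expression fragments
NormalEyes = r'[:=]'
Wink = r'[;]'
NoseArea = r'(|o|O|-)'  # optional nose: nothing, "o", "O", or "-"
HappyMouths = r'[D\)\]]'
SadMouths = r'[\(\[]'
Tongue = r'[pP]'
OtherMouths = r'[doO/\\]'

# Happy and sad emoticons: normal eyes, optional nose, then a mouth
Happy = NormalEyes + NoseArea + HappyMouths
Sad = NormalEyes + NoseArea + SadMouths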
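One way these patterns might be applied to a location string is sketched below. The `classify` function and its rule of returning no label when both kinds (or neither) appear are assumptions for illustration; the slides do not show how the patterns were applied.

```python
import re

NormalEyes = r'[:=]'
NoseArea = r'(|o|O|-)'
HappyMouths = r'[D\)\]]'
SadMouths = r'[\(\[]'

HAPPY_RE = re.compile(NormalEyes + NoseArea + HappyMouths)
SAD_RE = re.compile(NormalEyes + NoseArea + SadMouths)


def classify(text):
    """Return 'happy', 'sad', or None when the signal is absent or mixed."""
    happy = bool(HAPPY_RE.search(text))
    sad = bool(SAD_RE.search(text))
    if happy and not sad:
        return "happy"
    if sad and not happy:
        return "sad"
    return None
```

Under this sketch, "Syracuse, NY :)" is labeled happy and "cleveland ohio sadly :(" is labeled sad. Winks and tongues are ignored because the Happy and Sad patterns use only NormalEyes.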