Language and Geography
Brendan O’Connor
Social Media Analysis, 3/18/2010
http://anyall.org/blog/2009/05/where-tweets-get-sent-from/

Analyze Geography and Language
Using Twitter data:
(1) Identify author & message locations
(2) Side note: opinions about self’s location
Applications:
(3) Analyze language use by geography
    – Example: find regional dialects
(4) Predict geographically embedded real-world phenomena
    – Example: per-state retail sales

Application: Retail Forecasting

Identify author locations

U.S. State Identification
• String-matching approach
• Match on
  1. Full names (“Pennsylvania”)
     • Case-insensitive
  2. Abbreviations (“PA”)
     • Case-sensitive

Examples
State  Location string
AZ     Scottsdale, AZ
MO     St. Louis, MO
MI     Michigan
CA     Sacramento, CA
FL     Jacksonville, FL
CA     Santa Cruz, CA
IN     Indianapolis, Indiana
CA     2OH!9, California
TX     Dallas, TX
NY     new york
IL     Chicago, IL
CT     Hartford, CT
GA     Georgia
HI     Hawaii
WA     Seattle, WA, USA
CT     Watertown, CT
CA     Bay Area, California
DC     DC Metro Area
IA     Iowa
NC     Raleigh, NC
CA     California
CA     southern california
GA     Atlanta, GA
CA     Porn Valley, CA
TN     Newbern, TN
CA     Westlake Village, CA, USA
MS     Dourados, MS
ME     U GOTTA CATCH ME!
CA     Malibu, California
NC     North Carolina
NY     Windsor, NY

Problems?
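One way to see the problems is to run a naive version of the matcher described above on a few of the example strings. The following is a minimal Python sketch, not the implementation behind the slides: the function name naive_state and the seven-state lookup table are hypothetical and chosen only for illustration. It applies the two stated rules (full names case-insensitively, two-letter abbreviations case-sensitively) and nothing else.

```python
import re

# Hypothetical, deliberately tiny subset of the state table (illustration only).
STATE_NAMES = {
    "arizona": "AZ",
    "california": "CA",
    "indiana": "IN",
    "maine": "ME",
    "michigan": "MI",
    "mississippi": "MS",
    "new york": "NY",
}
ABBREVS = set(STATE_NAMES.values())


def naive_state(location):
    """Return a state code for a free-text location string, or None."""
    # Rule 1: full names, case-insensitive, matched on word boundaries.
    lowered = location.lower()
    for name, abbr in STATE_NAMES.items():
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            return abbr
    # Rule 2: abbreviations, case-sensitive, as standalone two-letter tokens.
    for token in re.findall(r"\b[A-Z]{2}\b", location):
        if token in ABBREVS:
            return token
    return None


for loc in ["Scottsdale, AZ", "new york", "Dourados, MS", "U GOTTA CATCH ME!"]:
    print(loc, "->", naive_state(loc))
```

With only these two rules, "Dourados, MS" is labeled MS and "U GOTTA CATCH ME!" is labeled ME, which are the kinds of false positives the examples above contain.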
AL AK AS AZ AR CA CO CT DE DC FM FL GA GU HI
ID IL IN IA KS KY LA ME MH MD MA MI MN MS MO
MT NE NV NH NJ NM NY NC ND MP OH OK OR PW PA
PR RI SC SD TN TX UT VT VI VA WA WV WI WY

Brazil
• Brazilian states have two-letter abbreviation conventions like the U.S., with many overlaps
  – Belém,PA
  – São Luís, MA
  – Maceió AL
• “SC”
  – Myrtle Beach, SC
  – Charleston, SC, U.S.A.
  – Joinville - SC
  – Mafra - SC
  – Palmtios – SC
  – FLORIANÓPOLIS, SC, BRASIL

U.S. State Identification
• String-matching approach
• Match on
  1. Full names (“Pennsylvania”)
     • Case-insensitive
  2. Abbreviations (“PA”)
     • Case-sensitive
     • Brazil check
     • Common words check
       – DE, ME

Experiment
• 4,793,729 messages – stream sample
• 2,309,284 unique users
• 1,624,983 unique users with non-blank location
• Detections
  – 838,012 U.S. State
  – 346,553 Latitude, Longitude
  – 3,163 Five-digit Zip Code

State  Location string
OH     cleveland ohio sadly :(
NY     Syracuse, NY :)
IL     Close to ur heart =],Illinois
TX     S.A TX :D
MN     Minnesota :)
CA     California, Newport Beach :)
SC     JERSEY but in Cola SC 4 now:-)
NC     Charlotte,NC =(
CA     Playboy Mansion California. :)
NY     Bronx,NY :)

States, happy:sad, fraction happy
State  happy:sad  Fraction happy
ND     2:3        0.400
NV     2:3        0.400
MO     6:7        0.462
ID     2:2        0.500
WY     2:2        0.500
RI     5:3        0.625
UT     5:3        0.625
KY     12:6       0.667
MT     2:1        0.667
NE     6:3        0.667
NH     2:1        0.667
SD     4:2        0.667
MA     13:6       0.684
WI     11:5       0.688
WV     7:3        0.700
NM     5:2        0.714
AR     6:2        0.750
CT     12:4       0.750
DE     6:2        0.750
SC     10:3       0.769
MS     11:3       0.786
OR     11:3       0.786
IN     19:5       0.792
AK     4:1        0.800
GU     4:1        0.800
KS     12:3       0.800
MN     17:4       0.810
IA     9:2        0.818
OH     41:9       0.820
HI     15:3       0.833
MD     22:4       0.846
MI     29:5       0.853
AL     18:3       0.857
VA     25:4       0.862
PA     19:3       0.864
NC     15:2       0.882
CO     8:1        0.889
TN     16:2       0.889
WA     18:2       0.900
NJ     40:4       0.909
PR     10:1       0.909
OK     11:1       0.917
FL     90:8       0.918
GA     24:2       0.923
ME     12:1       0.923
LA     55:4       0.932
AZ     31:2       0.939
DC     16:1       0.941
NY     146:9      0.942
IL     19:1       0.950
TX     151:5      0.968
CA     211:6      0.972

Emoticon parsing
Emoticon counts (count, emoticon), in descending order reading across each row:
5658 :)     2032 :D     1391 ;)
 845 =)      701 :]      583 :/
 554 =]      461 ;D      437 :P
 338 =D      278 :(      245 ;]
 197 :-)     138 ;-)     128 =P
 122 :p       93 :O       67 ;P
  51 :o       44 =/       42 ;p
  33 =p       31 =(       26 :\
  25 ;o       22 =[       20 :-D
  20 :[       15 :-P      15 ;O
  14 =O       11 :-p       9 :-/
   9 ;(        8 :-(       8 ;/
   7 :d        7 ;d        7 :-]
   5 =o        3 ;-P       3 :-O
   3 ;-D       3 ;[        2 ;-p
   2 =d        2 ;-(       2 =\
   1 :-d       1 :-[       1 ;\
   1 ;-]       1 =-]       1 =-)

    # Regex fragments (Python raw strings) for the parts of an emoticon
    NormalEyes = r'[:=]'
    Wink = r'[;]'
    NoseArea = r'(|o|O|-)'      # optional nose: empty, o, O, or -
    HappyMouths = r'[D\)\]]'
    SadMouths = r'[\(\[]'
    Tongue = r'[pP]'
    OtherMouths = r'[doO/\\]'

    # Only normal-eyed emoticons are classified; winks, tongues, and other mouths are not
    Happy = NormalEyes + NoseArea + HappyMouths
    Sad = NormalEyes + NoseArea + SadMouths
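The Happy and Sad patterns above can be applied to location strings to produce per-state counts like those in the happy:sad table. The following is a minimal sketch under stated assumptions, not the code behind the slides: the helpers mood and happy_fraction are hypothetical names, and skipping strings that contain both a happy and a sad emoticon is an assumption made here for simplicity.

```python
import re
from collections import Counter

# Pattern fragments as given on the slide.
NormalEyes = r'[:=]'
NoseArea = r'(|o|O|-)'
HappyMouths = r'[D\)\]]'
SadMouths = r'[\(\[]'

HAPPY_RE = re.compile(NormalEyes + NoseArea + HappyMouths)
SAD_RE = re.compile(NormalEyes + NoseArea + SadMouths)


def mood(location):
    """Return 'happy', 'sad', or None for one location string (hypothetical helper)."""
    has_happy = HAPPY_RE.search(location) is not None
    has_sad = SAD_RE.search(location) is not None
    if has_happy and not has_sad:
        return "happy"
    if has_sad and not has_happy:
        return "sad"
    return None  # no emoticon, or mixed signals (assumption: skip these)


def happy_fraction(rows):
    """Map each state to happy / (happy + sad) over (state, location) pairs."""
    counts = {}
    for state, location in rows:
        label = mood(location)
        if label is not None:
            counts.setdefault(state, Counter())[label] += 1
    return {s: c["happy"] / (c["happy"] + c["sad"]) for s, c in counts.items()}


rows = [
    ("OH", "cleveland ohio sadly :("),
    ("NY", "Syracuse, NY :)"),
    ("NY", "Bronx,NY :)"),
    ("NC", "Charlotte,NC =("),
]
print(happy_fraction(rows))
```

The fraction is happy / (happy + sad); for example, the KY row with 12 happy and 6 sad gives 12 / 18 = 0.667.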