Lynne Grewe and Sushmita Pandey
California State University East Bay
lynne.grewe@csueastbay.edu

The Goal
Using social data to make social advertisement recommendations.
Example social message: "Your friends Nathan and Marty will like this"
[Diagram labels: User and Friends, Social Network Application, Social Network, PPARS, Advertisements]

The Problems
What is the social data?
Which social data is usable/best?
How do we capture and analyze it?
How do we relate social data to advertisements?
How do we deliver a social advertisement?

The Environment
Social networks: MySpace, Facebook, Hi5, Orkut, LinkedIn, Netlog, and more

Overview of Talk
PPARS overview
Data – problem of multiple networks
Example of Data
Parsing
Quantization
Results
Advertisement Recommendation
Results
Future Work

Our System Overview
PPARS = Peer Pressure Advertisement Recommendation System
[System diagram labels: DATA INPUT, User-origin, FRONT END, Model Ads Quantized, Get user-friends quantized, Process groups, Group / Ad matches & socialize, Peer-Pressure Ad Selection, User Ad choice, Ad]

Social Data
Every network can provide different social data.
Two main splits: Facebook and OpenSocial (the majority of the others).
OpenSocial is an open standard adopted by over 30 containers and growing, with an international audience.
It allows for "standardized" access.
Popular containers include MySpace, LinkedIn, Google, Yahoo!, etc.
Corporate support comes from Google, Yahoo!, IBM, Microsoft, and more.

Data Fields
About Me, Activities, Addresses, Age, Body_type, Books, Cars, Children, Current_Location, Date_Of_Birth, Drinker, Emails, Ethnicity, Fashion, Food, Gender, Happiest_when, Has_app, Heroes, Humor, ID, Interests, Job_interests, Jobs, Languages_Spoken, Living_Arrangements, Looking_for, Movies, Music, Name, Network Presence, Nick Name, Pets, Phone, Political Views, Profile song, Profile url, Profile video, Quotes, Relationship status, Religion, Romance, Scared Of, Schools, Sexual Orientation, Sports, Status, Tags, Thumbnail Url, Time Zone, Turn Ons, Turn Offs, TV Shows, URLs

Some Example Data
AboutMe: Ok, so I am a graduate of with degrees in Philosophy, and Religion. I currently live in with my wife and daughter. I enjoy Snowboarding/skiing, Motorcycles, computers, sports cars, and hanging out with friends.
Age: 33
Books: The Professor and the Madman, Plato, Aristotle, Locke, Hume, Kant, luscombe
Movies: Things to do in Denver when yer dead, The Departed, Encino Man, Real Genius
Music: Very Eclectic, including Pennywise, Disturbed, System of a Down, Linkin Park, Senses Fail, Mudvayne, Goldfinger, and a bunch of others I am sure I cannot remember at this time
Music: allen to // chimaira // sw1tched // bleed the sky // destiny // 40 below summer // endo // nothingface // enhancer // watcha // lamb of god // soilwork // skrape // flaw // unearth // slodust // deftones // raunchy // devildriver // reveille // american head charge // nonpoint // stutterfly // factory 81 // in flames // (hed) p.e. // dry kill logic // primer 55 // 36 crazyfists // sevendust // taproot // candiria // bionic jive // funeral for a friend // .....
Television: Smallville, heros
Interests: Snowboarding/skiing, Motorcycles, computers, sports cars, and hanging out with friends.
Status: Married
Status: In a Relationship
Smoker: No
Drinker: Yes
Heroes: Father
Heroes: Freie Stelle als Held zu vergeben, Bewerbungen bitte an mich... [German: "Hero position open, applications to me please..."]
Looking_for: Networking, Friends
Ethnicity: White / Caucasian
Children: Proud parent
Sexual_Orientation: Straight

Some Example Data: Schools
University Of Nevada-Reno, Reno, NV. Graduated: N/A. Degree: Master's Degree. Major: Hydrogeology. 2007 to Present.
Purdue University-Main Campus, West Lafayette, Indiana. Graduated: 2003. Student status: Alumni. Degree: Bachelor's Degree. Major: Philosophy. Minor: CPT. Clubs: Purdue Student Government, Liberal Arts Student Council. Greek: Delta Chi. 2001 to 2003.
Reed Hs, Sparks, NV. Graduated: N/A. Student status: Alumni. Degree: High School Diploma.

Social Data – which?
Not all networks provide access to the same data.
Users can keep information private.
Not all data is "social".
Not all data is directly useful for advertisers.

Data not typically available / private
Current_Location, Date_Of_Birth, Addresses, Phone

Not all data is "social"
ID, Name, Has_app, Nick_Name, Network Presence, Profile url, Profile song, Profile video, Thumbnail URL, URLs
Drinker, Emails, Ethnicity, Fashion, Food

Not all data is directly useful for advertisers

Infrequent data
Our scheme needs data that users have in common, so that we can reason over a common feature space.
Data that is NOT frequent: Cars, Fashion, Food, Political Views, Pets, Heroes, Humor

Social Data - which
First pass: selection based on network availability and commonality, user prevalence, and estimated advertisement usefulness.
Balance between a small sample space and feature dimensionality.
Selected fields: About Me, Activities, Age, Gender, Books, TV, Music, Looking For, Drinker, Relationship, Ethnicity, Religion, Language, Interests, Date_Of_Birth, Smoker

PPARS – Front End
[Diagram: user and friend data (for example "I like cars, have 2 kids, …..", "Movies: Star Wars", "Age = 30") goes through PARSING to produce individual social data tokens; QUANTIZATION (web services, user-origin ontology, codebooks) turns these into a set of user and friend quantized data vectors]

Parsing
Raw social data:
I like lots of movies. Like: Star Wars, Star Wars II, Jaws. And I love Harrison Fords acting.
Create small social data tokens to pass to Quantization.
Hierarchical segmentation, applied in order:
Null data test
Split by . / ! / ?
Split by :
Split by -
Split by ;
Split by ,
Resulting individual social data tokens:
• I like lots of movies
• Like
• Star Wars
• Star Wars II
• Jaws
• And I love Harrison Fords acting.

Parsing Example: About Me
Input = "I work as an engineer at Motorola. I work in the peripherals department and do chip design. I am doing some management."
Resulting social data tokens:
I work as an engineer at Motorola | I work in the peripherals department and do chip design | I am doing some management

Parsing Example: Interests
Input = "Internet, Movies, Reading, Karaoke,Building alternate communities"
Resulting social data tokens:
Internet | Movies | Reading | Karaoke | Language | Building alternative communities

Parsing Example: Music
Input = "Bands: Superdrag, Weezer, The Doors, The Beach Boys, Journey Solo Artists: Billy Joel, Albums: Appetite for Destruction - Guns & Roses; Blue - Weezer"
Resulting social data tokens:
Bands | Superdrag | The Doors | Cheap Trick | The Beach Boys | Journey Solo Artists | Billy Joel | Albums | Appetite for Destruction | Guns & Roses | Blue | Weezer
Note: the line return between "Journey" and "Solo Artists" was lost in formatting, so the two run together.

Parsing
A simple segmentation technique.
Future work: include semantics of phrases to detect potential "headings", and syntax rules around delimiters like : and –

Quantization
Take a social data token and translate it into a numerical feature vector.
Example: "I like cars" gives Cars = 0.2
For each social data field we need to create meaningful feature vector elements.
For each social data field we need techniques/algorithms that translate the raw social data token into support for its different feature vector elements.

Quantization- feature vector
Pattern recognition and matching are later parts of PPARS.
They need numerical representations of our user and friend social data, and also of the ads.
"I like cars": which ad?
Cars = 0.2: an ad with Cars around 0.2

Quantization – feature vector
Each social data element, such as "About Me", "Gender", or "Movies", has its own feature vector that we designed.
The vector is a result of the technique used to quantize the input social data tokens.
It is also a result of studying keywords and trends in a user database of sample social tokens.
To understand this, let's first discuss the techniques used to quantize social data tokens as they relate to the "type" of data element.

Quantization and Social Data Type
Numerical Data
Data that is naturally numerical, e.g. age, date of birth.
Data that can be quickly and effectively translated into a number in some defined range:
Address: can be translated into latitude and longitude
Phone: limited in digits
Time zone: predefined ranges

Categorizable Data
Data where there is a predefined, accepted taxonomy, e.g. movies and their genres.
Data where categories can be derived through sample analysis and advertisement goals.
Examples: interests, about me, food, fashion

Indexed Data
Data that has defined sets of values specific to either the container or OpenSocial.
Example: smoker = yes, no, occasionally, quit, never
Other examples: gender, relationship, drinker, sexual orientation

Other
Data for which we cannot easily derive a categorization algorithm.
Examples: Profile Image, Profile Song URL, etc.

Collapsing of Data
Some data fields have almost the same meaning, or their content typically overlaps greatly:
About Me and Interests (and even Status)
Age and Date of Birth

Categorizable Data
This is the bulk of the data fields: About Me, Interests, Music, Movies, TV, Books, Looking For, Religion, Ethnicity, Language
Determine feature elements from:
Accepted "standard" taxonomies
Web service taxonomies
Advertisement-driven taxonomies

PPARS – Front End
[Front-end diagram, as shown earlier]
Categorization: Web Service
For some of our social data fields we can use popular web services to convert social data tokens into search hits that carry categorized information.
Example: Internet Video Archive (IVA) and IMDB. We use the movie genre.

IVA – movie search by actor "Robert Redford"
http://api.internetvideoarchive.com/Video/MoviesByActorName.aspx?DeveloperId=f377f57f-3bad-4704-8e80-1b643b206abd&SearchTerm=Robert+Redford
Some of the results:
  <item>
    <Description>
      <![CDATA[ The Unforeseen movie trailer - starring Robert Redford, Willie Nelson, Ann Richards, Gary Bradley, Judah Folkman, William Greider. Directed by Laura Dunn. Theatrical Release Date: 2/29/2008 Genre: Documentary Rating: Not Rated ]]>
    </Description>
    <Title>THE UNFORESEEN</Title>
    <Language>English</Language>
    <Country>United States</Country>
    <SiteUrl />
    <Studio>Two Birds Films</Studio>
    <StudioID>3018</StudioID>
    <Rating>Not Rated</Rating>
    <Genre>Documentary</Genre>
    <GenreID>13</GenreID>
    <HomeVideoReleaseDate>9/16/2008</HomeVideoReleaseDate>
    <TheatricalReleaseDate>2/29/2008</TheatricalReleaseDate>
    <Director>Laura Dunn</Director>
    <DirectorID>36635</DirectorID>
    <Actor1>Robert Redford</Actor1>
    <ActorId1>7105</ActorId1>
    <Actor2>Willie Nelson</Actor2>
    <ActorId2>8591</ActorId2>
    <Actor3>Ann Richards</Actor3>
    <ActorId3>36642</ActorId3>
    <Actor4>Gary Bradley</Actor4>
    <ActorId4>36637</ActorId4>
    <Link>http://videodetective.com/titledetails.aspx?publishedid=947964</Link>
    <BoxOfficeInMillions>-1</BoxOfficeInMillions>
    <!-- Television Content -->
    <AirDayOfWeek>-1</AirDayOfWeek>
    <AirStartTime />
    <ShowLengthInMinutes>-1</ShowLengthInMinutes>
    <IsTelevisionContent>false</IsTelevisionContent>
    <FirstReleasedYear>2008</FirstReleasedYear>
    <Image>http://content.internetvideoarchive.com/content/photos/1250/05253626_.jpg</Image>
    <Duration>164</Duration>
    <DateCreated>3/20/2008 8:00:00 AM</DateCreated>
    <Media>Movie</Media>
    <PublishedId>947964</PublishedId>
    <DateModified>4/22/2011 1:57:00 PM</DateModified>
...and more. From these fields we selected GENRE.

IVA genres: our movie feature elements
VideoCategory: Not Assigned, Western, Action-Adventure, Children's, Comedy, Drama, Family, Horror, Musical, Mystery-Suspense, Non-Fiction, Sci-Fi, War, Health/Workout, Documentary, Thriller, Biography, Romance

Movie Quantization
For each social data token, such as "Adam Sandler" or "Star Wars", we can get multiple hits.
Example: "Robert Redford", first 8 hits:
Drama = 5
Western = 1
Documentary = 2
These genres become our movie feature elements.
Issues:
How do we know if a token is an actor name, movie title, director, or something else?
Multiple hits for an actor or director: what do we do? (Use all of them as evidence.)
Multiple hits for a movie title: what do we do? (Take the first hit.)

Order of Movie Quantization
Given any social data element parsed from the user's MOVIE data, we cannot know a priori whether it is a title, an actor's name, or a director's name. It may even be a genre of movies the user likes.
1. Title search (take first hit)
2. Actor search (evidence all)
3. Director search (evidence all)
4. Keyword matching (see next)

Quantization Result 1
Input: Up, Forrest Gump, Rear Window, District 9, PacMan, WALL·E, My Flesh and Blood, MacMusical
Yields: MOVIE_FAMILY=0.6, MOVIE_SCIFI=0.2, MOVIE_DOCUMENTARY=0.4, MOVIE_THRILLER=0.2

Quantization using other services
TV - IMDB, http://www.imdb.com/search/title?title_type=tv_series&title=
Books - Google Books Search, http://books.google.com/books/feeds/volumes?
Music - IVA's music API, http://api.internetvideoarchive.com/Music/**

Quantization via Keyword Matching
What do we do when there is no predetermined taxonomy and no service that provides database hits?
Natural Language Processing techniques.
We currently employ a simple (but effective and efficient) technique of keyword matching/lookup:
Create a database of predetermined phrases/keywords.
Use a lookup scheme to quantize social data token(s).
"I work as an engineer": About Me lookup?
"Watch a lot of drama": Movies lookup?

Keyword Database
Used on: About Me / Interests, Religion, Ethnicity, Looking For, Language, Relationship
Secondary use: Books, TV, Music, Movies, when the service fails to provide any hits

Keyword Database Creation
Manual scanning of hundreds (at the starting level) of user profiles
Domain-specific expert (human) knowledge
Dictionaries and taxonomies, when they exist
Issue: how do we determine the weight for every entry? Expert-determined (consistency) or all equal-valued (no sense of importance)?
Issue: at the very beginning level, can we create a dictionary for everything? No. Are there more advanced NLP techniques?

Some arbitrary Keyword DB entries
Field      Element         Keyword      Weight
ABOUT_ME   HOME            Cats         0.2
ABOUT_ME   HOME            Children     0.2
ABOUT_ME   HOME            Daughter     0.2
ABOUT_ME   HOME            Dog          0.2
ABOUT_ME   HOME            home         0.5
ABOUT_ME   ENTERTAINMENT   Shopping     0.2
ABOUT_ME   ENTERTAINMENT   Shows        0.2
ABOUT_ME   ENTERTAINMENT   Sing         0.2
ABOUT_ME   ENTERTAINMENT   Ski          0.2
ABOUT_ME   ENTERTAINMENT   Songwriter   0.2

Keyword DB- evidence weight
Issue: how do we determine the weight for every entry? Expert-determined (consistency) or all equal-valued (no sense of importance)?
System options: DB weights can take on different values, with an option to run with all weights equal.
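
As a rough illustration of the keyword database and the equal-weights option described above, the following Python sketch stores each entry as a (field, feature element, keyword, weight) row. The rows are the sample entries as read from the slides; the names `build_lookup` and `lookup_token`, the `equal_weights` flag, and the equal value of 0.2 are assumptions made for this example and are not taken from the PPARS implementation.

```python
# (field, feature element, keyword, expert weight), as read from the sample slides.
KEYWORD_DB = [
    ("ABOUT_ME", "HOME", "Cats", 0.2),
    ("ABOUT_ME", "HOME", "Children", 0.2),
    ("ABOUT_ME", "HOME", "Daughter", 0.2),
    ("ABOUT_ME", "HOME", "Dog", 0.2),
    ("ABOUT_ME", "HOME", "home", 0.5),
    ("ABOUT_ME", "ENTERTAINMENT", "Shopping", 0.2),
    ("ABOUT_ME", "ENTERTAINMENT", "Shows", 0.2),
    ("ABOUT_ME", "ENTERTAINMENT", "Sing", 0.2),
    ("ABOUT_ME", "ENTERTAINMENT", "Ski", 0.2),
    ("ABOUT_ME", "ENTERTAINMENT", "Songwriter", 0.2),
]


def build_lookup(db, field, equal_weights=False, equal_value=0.2):
    """Build {keyword: (feature element name, weight)} for one social data field.

    With equal_weights=True every entry gets equal_value (a hypothetical
    default), which mirrors the "all weights equal" system option.
    """
    lookup = {}
    for row_field, element, keyword, weight in db:
        if row_field != field:
            continue
        if equal_weights:
            weight = equal_value
        lookup[keyword.lower()] = (f"{row_field}_{element}", weight)
    return lookup


def lookup_token(lookup, token):
    """Exact, case-insensitive lookup of a token; None when there is no hit."""
    return lookup.get(token.strip().lower())


expert = build_lookup(KEYWORD_DB, "ABOUT_ME")
equal = build_lookup(KEYWORD_DB, "ABOUT_ME", equal_weights=True)
print(lookup_token(expert, "Home"))
print(lookup_token(equal, "Home"))
```

Keeping the weight in the row, and overriding it only when the lookup is built, lets the same database serve both the expert-weighted and the equal-weighted runs.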
Keyword DB- ??
Issue: at the very beginning level, can we create a dictionary for everything? No. Are there more advanced NLP techniques to explore for inferences?
While users can write anything (and do), remember that we are focused on advertisement recommendation, so the scope of our language is limited to hits related to our feature vector elements. This is a constrained problem.
Home, Entertainment, Smoking, Work, Social, Movies, TV, Shopping, Books, etc.: these are the kinds of areas we are concerned with.

Types of Keyword Matching
STRICT: the social data token must exactly match a DB entry.
  "Drama" against DB entry "Drama": match (√)
  "I like Drama" against DB entry "Drama": no match (X)
DB_ENTRY_CONTAINS_DATA_ELEMENT: the data token must exist inside the DB entry.
  "Drama" against DB entry "Drama and Comedy": match (√)
DB_ENTRY_PARTOF_DATA_ELEMENT: part of the data token matches the DB entry (this further segments the data token).
  "I like Drama" against DB entry "Drama": match (√)

Quantization Results: different kinds of Keyword Matching
Input: 'I am a student and I work and love cars'

Output with STRICT: no hits
ABOUT_ME_ENTERTAINMENT = -1
ABOUT_ME_WORK = -1
ABOUT_ME_HOME = -1
ABOUT_ME_SOCIAL = -1
ABOUT_ME_FOOD = -1

Output with DB_ENTRY_CONTAINS_DATA_ELEMENT: no hits
ABOUT_ME_ENTERTAINMENT = -1
ABOUT_ME_WORK = -1
ABOUT_ME_HOME = -1
ABOUT_ME_SOCIAL = -1
ABOUT_ME_FOOD = -1

Output with DB_ENTRY_PARTOF_DATA_ELEMENT:
keyword = student: ABOUT_ME_WORK = 0.2
keyword = work: ABOUT_ME_WORK = 0.5
keyword = cars: ABOUT_ME_ENTERTAINMENT = 0.2
keyword = love: ABOUT_ME_HOME = 0.2, ABOUT_ME_SOCIAL = 0.2
Totals:
ABOUT_ME_ENTERTAINMENT = 0.2
ABOUT_ME_WORK = 0.7
ABOUT_ME_HOME = 0.2
ABOUT_ME_SOCIAL = 0.2
ABOUT_ME_FOOD = -1

Quantization Results 2 – using DB_ENTRY_PARTOF_DATA_ELEMENT
Input: “Fell in love with computers at 11, never got over it... Nonetheless, I have always understood that human problems are solved by people, not technology.
My lifes work has been to empower communities to design and build their own solutions.”
6 data tokens from parsing.
Results:
ABOUT_ME_ENTERTAINMENT = 0.2
ABOUT_ME_WORK = 0.5
ABOUT_ME_HOME = 0.2
ABOUT_ME_SOCIAL = 0.2
ABOUT_ME_FOOD = -1

Quantization Result 3 – good null results
Input: i am xing ju. test ABOUT ME for opensocial.
Parsed results:
i am xing ju | test ABOUT ME for opensocial
No keyword DB hits:
ABOUT_ME_ENTERTAINMENT = -1
ABOUT_ME_WORK = -1
ABOUT_ME_HOME = -1
ABOUT_ME_SOCIAL = -1
ABOUT_ME_FOOD = -1

Quantization Results: garbage in, garbage out
Input: LoL really dude that is the way to be
Result: no hits
Is this garbage? "LoL" = lots of love. Could you interpret this as someone interested in social / friends?
Future: deeper interpretation / semantic analysis?

Indexed
Smoker, Drinker, Gender, Relationship (some networks), Looking for (some networks), etc.
Example for Drinker:
opensocial.Enum.Drinker.HEAVILY
opensocial.Enum.Drinker.NO
opensocial.Enum.Drinker.OCCASIONALLY
opensocial.Enum.Drinker.QUIT
opensocial.Enum.Drinker.QUITTING
opensocial.Enum.Drinker.REGULARLY
opensocial.Enum.Drinker.SOCIALLY
opensocial.Enum.Drinker.YES

Quantized Feature Vector
107 elements
Normalized to approximately the range 0 to 1.0

Advertisement Description
Experts manually determine the feature vector weighting for each ad.
Future: automate this from a survey or from input supplied directly by the advertiser.
Is there a way to analyze the ad message or image (image understanding)?
Will the results even match the advertiser's goals?

PPARS --- Advertisement Matching
Not the focus of this talk.
Currently doing variations on KNN with different forms of clustering.
Early results with a small advertising database and a beginning keyword database look good.
What kinds of groups: groups with the user in them or not? Based only on in-common feature elements or not?

PPARS- Advertisement Delivery
An area of future work could be the effective delivery of the "social message" related to the selected ad.
For now we use a simple form of direct delivery.
Example: based on a grouping of same gender and age and strong likes in interests on home.

PPARS- Advertisement Delivery
Example: "Your friends Nathan and Marty will like this"
Based on a grouping of same gender and age and drinking. This is a grouping the user is not part of; it contains only friends.

PPARS- Advertisement Delivery
Here the grouping is "loose", related only by gender and very loosely by age, so the advertisement match is not great.
Question: should we only serve to "strong" groups?

Analysis of Advertisement Results
Groupings are tight when the data allows.
Matches to advertisements in levels (best, top 10, etc.) are correct.

Future Work
Parsing – more syntax and semantics (NLP)
Parsing – differences between languages
Quantization – extend to Natural Language Understanding in addition to, or as a replacement for, keyword matching; study the effects of different evidence accumulation
Data Extrapolation – using inference to create hits in more feature elements
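
Since the future work on quantization concerns keyword matching and evidence accumulation, it may help to see the current scheme in code. The Python sketch below implements the three matching modes from the "Types of Keyword Matching" slide and sums the weights of all hits per feature element. The keyword rows are the ones implied by the slide output for 'I am a student and I work and love cars'. Additive accumulation is an assumption that happens to be consistent with the slide's 0.2 + 0.5 = 0.7 for ABOUT_ME_WORK; the function names and the word-boundary matching rule are also assumptions, not the PPARS implementation.

```python
import re

ELEMENTS = ["ENTERTAINMENT", "WORK", "HOME", "SOCIAL", "FOOD"]

# (keyword, feature element, weight), as implied by the slide's sample output.
ABOUT_ME_DB = [
    ("student", "WORK", 0.2),
    ("work", "WORK", 0.5),
    ("cars", "ENTERTAINMENT", 0.2),
    ("love", "HOME", 0.2),
    ("love", "SOCIAL", 0.2),
]


def contains_phrase(text, phrase):
    """True if phrase occurs in text on word boundaries, ignoring case."""
    if not phrase.strip():
        return False
    pattern = r"\b" + re.escape(phrase.strip().lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def matches(token, db_entry, mode):
    """Apply one of the three matching modes named on the slide."""
    if mode == "STRICT":
        return token.strip().lower() == db_entry.strip().lower()
    if mode == "DB_ENTRY_CONTAINS_DATA_ELEMENT":
        return contains_phrase(db_entry, token)   # token inside DB entry
    if mode == "DB_ENTRY_PARTOF_DATA_ELEMENT":
        return contains_phrase(token, db_entry)   # DB entry inside token
    raise ValueError(f"unknown mode: {mode}")


def quantize(token, mode, db=ABOUT_ME_DB):
    """Sum the weights of all hits per element; -1 marks 'no evidence'."""
    totals = {}
    for keyword, element, weight in db:
        if matches(token, keyword, mode):
            totals[element] = totals.get(element, 0.0) + weight
    return {
        f"ABOUT_ME_{e}": round(totals[e], 6) if e in totals else -1
        for e in ELEMENTS
    }


token = "I am a student and I work and love cars"
for mode in ("STRICT", "DB_ENTRY_CONTAINS_DATA_ELEMENT",
             "DB_ENTRY_PARTOF_DATA_ELEMENT"):
    print(mode, quantize(token, mode))
```

Swapping the sum in `quantize` for another rule (for example, taking the maximum weight per element) would be one way to experiment with different evidence accumulation.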