Lecture 21: Privacy and Online Advertising

References
• Challenges in Measuring Online Advertising Systems by Saikat Guha, Bin Cheng, and Paul Francis
• Serving Ads from localhost for Performance, Privacy, and Profit by Saikat Guha, Alexey Reznichenko, Kevin Tang, Hamed Haddadi, and Paul Francis

Problem
• Online advertising funds many web services
  – E.g., all the free stuff we get from Google
• Ad networks gather a lot of user information
• How do they use the user information?

Goals
• Determining how well ad networks target users

Methodology
• Creating two clients representing two different user types
• Measuring the different ads each client sees

Challenges
• How to compare ads
• How to collect a representative snapshot of ads
• Quantifying the differences
• Avoiding measurement artifacts

Comparing Ads is challenging
• Ads don’t have unique IDs
• A & B are semantically the same, but with different text
• A & C are different, but with the same display URLs

How to define when two ads are the same?
• Easy but illegal approach: comparing destination URLs
  – FP (false positive): flagged as equal but actually different
  – FN (false negative): actually equal but not flagged
• Display URL has the lowest FNs

Use display URL to define ad equality

Taking a Snapshot
• More ads are available than can be displayed on any single page
• How to determine all ads that may be fed to a user?
  – Reload the page multiple times
  – But too many reloads may lead to ad churn: old ads expire, new ads show up

Determining the # of reloads
• Reload every 5 seconds
• Repeated for 200 queries
• Curve becomes linear beyond 10 reloads
  – Ad churn
• Use 10 reloads as the threshold

Quantifying Change
• Metrics
  – Jaccard index: |A ∩ B| / |A ∪ B|
  – Extended Jaccard index (cosine similarity)

Comparing Effectiveness
• Views: # of page reloads containing the ad
• Value: # of page reloads scaled by the position of the ad
• Overlap: Jaccard index

Comparing Effectiveness
• The winner: weighting by log(views) or log(value)

Avoiding artifacts
• Different system parameters may lead to different ads being shown
  – Browsers use different DNS servers
  – Browsers receive different cookies
  – HTTP proxy

Analysis
• Configure two or more instances to differ by one parameter
• Compare results for
  – Search Ads
  – Website Ads
  – Online Social Network Ads

Search Ads
• A, B: control w/o cookies
• C, D: w/ cookies enabled, seeded w/ different personae
• Google 730 random product-related queries for 5 days
• No obvious behavioral targeting in search ads. Why?
  – Keyword-based ad bidding
• Location targeting not studied

Website Ads
• Measure 15 websites that show Google ads
• A, B: control in NY
• C: SF; D: Germany
• Location affects web ads

Website Ads
• A, B: control
• C: browse 3 out of 15 websites
• D and E: browse random websites and Google search random websites
• Google does not use browsing behavior to pick ads

Online social network ads
• Set up three or more Facebook profiles
• A, B: control and identical
• C: differs from A by one profile parameter

Online social network ads
• Use all profile parameters to customize ads
• Age and gender are two primary factors
• Diurnal patterns due to ad churn
  – Should it increase or decrease?
• Education and relationship matter less, except for engaged vs. non-engaged women

Checking Impact of Sexual Preference
• Six profiles with different sexual preferences
  – Two males interested in females (male control)
  – Two females interested in males (female control)
  – One male interested in males
  – One female interested in females

Ads differ by sexual preference

Other results
• Found neutral ads targeted exclusively to gay men
• Clicking would reveal a user’s sexual preference to the advertiser
• 66 ads shown exclusively to gay men more than 50 times during experiments

Summary
• Search ads are largely keyword-based so far
• Website ads use location but probably not behavior
• Social network ads use all profile attributes to target users

Question: how can we design a privacy-preserving online advertising system?

Goals
• Support online advertising
  – A good revenue source to fund online services
• Preserve user privacy

PrivAd
• Serving ads from a localhost client
• Actors: user, publisher, advertiser, broker, and dealer

How it works
• Advertisers upload ads to the broker
• The user client sends the broker a subscription for a set of ads matching the user’s profile
  – Message is encrypted with the broker’s public key and contains a symmetric key
• The broker sends filtered ads to the user client
  – Ads are encrypted with the symmetric key
• The dealer anonymizes the client’s messages to the broker

Ad View/Click Reporting
• When a user views or clicks an ad, the user client sends a view/click report containing the ad ID and publisher ID to the broker via the dealer
• The dealer attaches a unique report ID, removes client identity information, and maps the report ID to the user identity information

Click-fraud defense
• The broker gives the dealer the report IDs if it suspects click fraud
• The dealer finds the user
• The dealer stops relaying ads to the user if convinced
• Questions not answered: how the broker detects fraud, and what the punishment is

Defining User Privacy
• Unlinkability
  – No single player can link the
identity of the user with any piece of the user’s profile
  – No single player can link together more than some limited number of pieces of personalization information of a given user
• The dealer learns that User A clicked on some ad
• The broker learns that someone clicked on ad X
• Not robust to dealer/broker collusion

Scaling PrivAd
• Ad churn is significant
• 2GB/month of compressed ad data

Discussion
• What challenges might PrivAd face in a practical deployment?
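
As one starting point for the discussion, the reporting and click-fraud slides can be sketched as a toy Python model. This is only an illustration under stated assumptions, not PrivAd's implementation: the class and method names (Broker, Dealer, relay, handle_fraud_claim, min_reports) are hypothetical, encryption is omitted (the payload is assumed to be readable only by the broker), and "convinced" is modeled as a simple per-user report count because the slides leave the detection policy open.

```python
import secrets
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Learns that *someone* viewed or clicked ad X, never who."""
    reports: dict = field(default_factory=dict)  # report_id -> payload

    def receive(self, report_id, payload):
        self.reports[report_id] = payload

    def report_ids_for_ad(self, ad_id):
        # Stand-in for fraud suspicion: hand over every report ID for one ad.
        return [rid for rid, p in self.reports.items() if p["ad_id"] == ad_id]


@dataclass
class Dealer:
    """Learns that user A reported *something*; assumed unable to read payloads."""
    broker: Broker
    report_owner: dict = field(default_factory=dict)  # report_id -> user_id
    blocked: set = field(default_factory=set)

    def relay(self, user_id, sealed_payload):
        if user_id in self.blocked:
            return None  # stop relaying for users judged fraudulent
        report_id = secrets.token_hex(8)  # unique report ID
        self.report_owner[report_id] = user_id  # mapping stays at the dealer
        self.broker.receive(report_id, sealed_payload)  # identity is dropped
        return report_id

    def handle_fraud_claim(self, report_ids, min_reports=3):
        # Hypothetical policy: block users behind at least min_reports
        # of the report IDs the broker flagged.
        counts = Counter(self.report_owner[r] for r in report_ids
                         if r in self.report_owner)
        offenders = {u for u, n in counts.items() if n >= min_reports}
        self.blocked |= offenders
        return offenders


broker = Broker()
dealer = Dealer(broker)
for _ in range(3):
    dealer.relay("user-A", {"ad_id": "ad-X", "publisher_id": "pub-1"})
dealer.relay("user-B", {"ad_id": "ad-X", "publisher_id": "pub-2"})
print(dealer.handle_fraud_claim(broker.report_ids_for_ad("ad-X")))
```

In this sketch the broker's table holds ad and publisher IDs but no user IDs, and the dealer's table holds user IDs but no ad IDs, which is the unlinkability split described above. Merging the two tables on report ID recovers who clicked what, which is why the design is not robust to dealer/broker collusion.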