AJ Guardado Comp. Sci. 49S

In order to correctly manage its programs (AdSense, AdWords), properly charge under the PPC revenue model, and detect invalid clicks, Google must collect a great deal of data about querying and clicking activities. This data describes a visitor’s activities on the Google Network. A large piece of it is the “post-clicking” data about conversion actions on the advertiser’s website. If the advertiser formally agrees to provide this information, Google collects data on which pages marked as “conversion” pages (checkout page, form-filling pages, etc.) the user visited on the advertised site. This data is limited to what the advertiser decides to provide to Google, and some advertisers opt out of providing it. The “raw” data is cleaned, preprocessed, and stored in various internal logs by Google for different types of analysis.

A weakness of Google’s data collection effort is its inability to get full access to all clicking activities of visitors. The conversion data it collects is only part of a visitor’s total activity on the advertised site. The full activity data is important for detecting invalid clicks, but Google and many other search engines don’t have access to it. This isn’t Google’s fault; it is a limitation of the types of data available to Google.

Advertisers get reports from Google describing clicking and billing activities. These reports have shortcomings: the smallest unit of analysis is one day, so advertisers can’t tell whether a particular click was marked as valid or invalid by Google, and Google won’t give them this information. Advertisers feel they have the right to know it, but if Google released it, Google would open itself up to more click fraud, because it would be giving hints about how its click detection works.
One definition of invalid clicks: “When a person, automated script or computer program imitates a legitimate user of a web browser clicking on an ad, for the purpose of generating an improper charge per click”. Invalid clicks can be made by humans or by computer programs. To evaluate how valid a click is, you have to understand the intent behind it. Google needs to determine whether the click was generated “artificially” or not, by way of a list of “prohibited means” that it follows: (https://www.google.com/adsense/policies?sourceid=asos&subid=ww-ww-etHC_entry&medium=link). Many invalid clicks can be detected, but some cases elude Google, like a person looking at an ad a second time to make sure he understood what the ad entailed. Doubleclicks are also sometimes disputed as valid or invalid: if p is the time difference between the two clicks and p is relatively large, the second click is valid.

Fraudulent clicks are invalid clicks made with malicious intent, that is, the intent to make an advertiser pay for unnecessary clicks. Not every invalid click is fraudulent: an example of an invalid but non-fraudulent click is a person doubleclicking an ad out of habit. Invalid clicks may come from software or “bots” designed to click on ads, people manipulating pages, advertisers clicking on the ads of their competitors, or multiple accounts held by AdSense publishers. The goal of the Click Quality team is to identify all invalid clicks regardless of their nature, but they’re not there yet.

There are three approaches to detecting invalid clicks. Anomaly-based: too many clicks in a given amount of time (e.g., 100 times a day). Rule-based: IF-THEN rules are established. Classifier-based: a classifier learns to recognize invalid clicks from past experience with invalid clicks. Google uses the first two often and rarely uses the third.

There is no precise definition of invalid clicks, and a definition can’t be given to the public because unethical users would take advantage of it. Search engines must therefore either assure advertisers that they are doing everything possible, or use independent third-party vendors to solve the problem.
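To make the anomaly-based and rule-based ideas above concrete, here is a minimal sketch in Python. It is not Google's method, which is not public. The function name `flag_invalid`, the click record layout, and the doubleclick threshold `MIN_SECONDS_BETWEEN_CLICKS` are assumptions made for illustration; only the "100 times a day" figure comes from the notes.

```python
from collections import Counter

# Hypothetical thresholds for illustration; the real values are not public.
CLICKS_PER_DAY_LIMIT = 100          # anomaly: this many clicks in a day is "too many"
MIN_SECONDS_BETWEEN_CLICKS = 2.0    # rule: a smaller gap p counts as a doubleclick


def flag_invalid(clicks):
    """Return the set of indices of clicks flagged as invalid.

    clicks: list of (visitor_id, ad_id, day, timestamp_seconds) tuples,
    assumed to be sorted by timestamp.
    """
    flagged = set()

    # Anomaly-based: count clicks per (visitor, ad, day).
    per_day = Counter((v, a, d) for v, a, d, _ in clicks)

    last_seen = {}  # (visitor, ad) -> timestamp of that visitor's previous click
    for i, (v, a, d, t) in enumerate(clicks):
        if per_day[(v, a, d)] >= CLICKS_PER_DAY_LIMIT:
            flagged.add(i)

        # Rule-based: IF the time difference p to the previous click on the
        # same ad by the same visitor is small THEN the second click is invalid.
        prev = last_seen.get((v, a))
        if prev is not None and (t - prev) < MIN_SECONDS_BETWEEN_CLICKS:
            flagged.add(i)
        last_seen[(v, a)] = t

    return flagged


sample = [
    ("u1", "ad1", 1, 0.0),
    ("u1", "ad1", 1, 0.5),   # doubleclick: p = 0.5 s, flagged
    ("u1", "ad1", 1, 60.0),  # p is relatively large, treated as valid
    ("u2", "ad1", 1, 61.0),  # different visitor, treated as valid
]
print(sorted(flag_invalid(sample)))
```

A classifier-based approach would replace these hand-written thresholds with a model trained on past examples of invalid clicks, which the notes say Google rarely uses.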
The Click Quality team tries to protect Google’s advertising and provide customer service. It does this through prevention and detection. Filtering and detection on several levels help solve the problem: pre-filtering, online filtering, post-filtering, automated monitoring, and manual reviews (proactive and reactive). Google started with only 3 filters, and the number grew steadily over the years. Filters are prioritized by the order in which they are used to check for invalid clicks. Google tests filters before actually using them, and those that pass require constant tuning and maintenance to keep performing. When Google sees that the filters missed invalid clicks, it gives credits to the advertisers and tries to fix the filters.

A filter’s decision on a click falls into one of 4 types. True Positive (TP): invalid, correctly identified as invalid. True Negative (TN): valid, correctly identified as valid. False Positive (FP): valid, incorrectly identified as invalid. False Negative (FN): invalid, incorrectly identified as valid. TP+TN+FP+FN=N (total number of clicks). The accuracy rate of a filter is equal to (TP+TN)/N, and the error rate to (FP+FN)/N. These rates are hard for Google to compute, because it doesn’t know the actual validity of clicks.

Each filter catches only an additional 2-3% that the other filters have not already detected. Offline invalid click detection methods detect few invalid clicks in comparison to the filters.
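As a worked example of the accuracy and error formulas above, the following Python sketch computes both rates from the four counts. The function name `filter_rates` and the counts in the usage example are made up for illustration; as the notes point out, the true counts are hard to obtain in practice.

```python
def filter_rates(tp, tn, fp, fn):
    """Return (accuracy, error) for a filter, given its four outcome counts.

    accuracy = (TP + TN) / N and error = (FP + FN) / N,
    where N = TP + TN + FP + FN is the total number of clicks.
    """
    n = tp + tn + fp + fn
    if n == 0:
        raise ValueError("at least one click is required")
    return (tp + tn) / n, (fp + fn) / n


# Made-up counts: 1000 clicks in total.
accuracy, error = filter_rates(tp=50, tn=900, fp=10, fn=40)
print(f"accuracy={accuracy:.2f} error={error:.2f}")
```

The two rates always sum to 1, since every click falls into exactly one of the four types.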