Course Name: Business Intelligence
Year: 2009

Using Publicly Available Data
20th Meeting

Source of this Material
(2). Loshin, David (2003). Business Intelligence: The Savvy Manager’s Guide. Chapter 15.

Bina Nusantara University

The Business Case
It is very simple to make the case for using public data. Data that has been collected and made available by government resources is available at a low cost, and the only costs involve storage management and integration with other BI data.
In any company that has set up a BI environment, the processes associated with importing, managing, and integrating data have already been streamlined for internal data set aggregation, so the only increase is in the variable costs associated with executing those processes.
On the other hand, in the right circumstances there can be significant value through data enhancement using publicly available data.

Management Issues
There are three major management issues associated with the use of publicly available data: integration, privacy, and its lack of structure.
The first major issue is integration. In fact, there are a number of companies whose business is to enhance and improve public data sets and then resell them based on their added value.
The second major issue revolves around personal privacy. There is a perception that any organization that collects data about individuals and then tries to exploit that information is invading a person’s privacy.
The third major issue is that much publicly available data is not in a nicely structured form that is easily adaptable. Frequently, this data is semistructured, which means that it requires some manipulation before it can be successfully and properly integrated.

Public Data
There is a large amount of public data that is easily accessible, and how to explore all of that data could fill an entire book.
What is important is to explore the process of locating the data resources that are available and how to determine the usage possibilities for that data. There are many ways that data sets can be categorized, but we will break the realm of public data into these areas:
• Personal Information
  Any data that attributes information to a person could be called personal information.
• Business Information
  Aside from personal information, there is a lot of data that can be used to attribute business entities. The public records are frequently related to rules and regulations imposed on business operations by federal or state government jurisdictions. This kind of data includes the following.
    • Incorporations
    • Uniform Commercial Code (UCC)
    • Bankruptcy Filings
    • Professional Licensing
    • Securities Filings
    • Regulatory Licensing
    • Patents and Trademarks
• Legal Information
  A large number of legal cases are accessible online, providing the names of the parties involved in the cases as well as free text describing the case. These documents, many of which have been indexed and made available for search, contain embedded psychographic and geographic enhancement potential, along with opportunities for entity extraction and entity linkage. Those linkages may represent either personal or business relationships.
• Factual Information
  There is an abundance of factual information embedded in available data sets. Although there may be some restrictions on specific uses of some of this data, there is still much business value that can be derived from data sets such as the following.
    • Census Summary
    • Topologically Integrated Geographic Encoding and Referencing (TIGER) database
    • Federal Election Commission
    • Bureau of Labor Statistics (BLS)
    • Pharmaceutical Data

Data Resources
There are basically two approaches: gather data from the original source, or pay a data aggregator for a value-added data set.
• Original Source
  As mentioned in the previous sections, the government is a very good source of publicly available data. Another source of publicly available information may be provided by third parties in a form that is not meant for exploitation. A good example is a Web site, which may have some data but not in a directly usable form. Another interesting source of publicly available data is the subject of that data itself.
• Data Aggregators
  The term data aggregator refers to any organization that collects data from one or more sources, provides some value-added processing, and repackages the result in a usable form. Another method for providing aggregated data is through a query-and-delivery process.

Semistructured Data
When the content is limited to a vocabulary or a format that can be reasonably modeled, it is possible, with some degree of certainty, to extract bits and pieces of information from semistructured data.
The point is that although the data has not been broken down into a distinct set of attributes and their assigned values, there is some predictable context that appears frequently enough to allow an application to extract information.

The Myth of Privacy
• Fear of Invasion
  The truth is, as BI professionals, we are somewhat responsible for collecting customer information and manipulating that information for marketing purposes, but are we really guilty of invasion of privacy?
• The Value and Cost of Privacy
  Consumers are often compensated in some way in return for providing information, which demonstrates an interesting model of information valuation.
• The “Privacy” Statement
  The issuing of a privacy statement does not imply that your data is being treated as private data. These statements are actually the opposite: they tell the consumer how the information is not being kept private.
• The Good News for Business Intelligence
  There are a lot of benefits to society in the dissemination of personal information, such as the ability to track down criminals, detect fraud, provide channels for improved customer relationship management, and even track down terrorists.
  As BI professionals, we have a twofold opportunity with respect to the privacy issue. The first is to raise awareness regarding the consumer’s value proposition with respect to data provision, leading to raised awareness about both the legality and the propriety of BI analysis and information use. The second is to build better BI applications.
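
Example: Extracting from Semistructured Data
As a rough illustration of the Semistructured Data point above, the following Python sketch pulls a few fields out of free text by relying on labels that appear predictably before each value. The notice text, the field names, and the patterns are all hypothetical and are not taken from the source material; real public records would need patterns fitted to their own wording.

```python
import re

# Hypothetical semistructured notice; the wording and fields are invented
# for illustration and do not come from any real public record.
NOTICE = (
    "NOTICE OF INCORPORATION. Filed: 2003-04-17. "
    "Entity name: Acme Widgets LLC. Registered agent: Jane Doe. "
    "Principal office: 12 Main St, Springfield."
)

# Each pattern depends on a label that appears predictably before the value.
# These labels are assumptions about the text layout, not a standard format.
FIELD_PATTERNS = {
    "filed": re.compile(r"Filed:\s*(\d{4}-\d{2}-\d{2})"),
    "entity_name": re.compile(r"Entity name:\s*([^.]+)\."),
    "registered_agent": re.compile(r"Registered agent:\s*([^.]+)\."),
}


def extract_fields(text):
    """Return the fields whose labeled context was found; skip the rest."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            record[name] = match.group(1).strip()
    return record


if __name__ == "__main__":
    print(extract_fields(NOTICE))
```

Fields whose label is missing are simply left out of the result, which reflects the "some degree of certainty" caveat: extraction works only where the predictable context is actually present.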