Course Name: Business Intelligence
Year: 2009

Data Enhancement
18th Meeting

Source of this Material
(2). Loshin, David (2003). Business Intelligence: The Savvy Manager’s Guide. Chapter 13

Bina Nusantara University

The Business Case
There are two aspects to the business value of data enhancement. The first is that as organizational data environments mature and data managers want to exploit the corporate data asset, there is an increased necessity for sharing data from different groups. The second aspect emerges from the actionable knowledge that can be discovered only by analyzing the result of composing multiple data sets. Data enhancement is a critical component of the BI program, especially as a value-adding process for the following:
• Competition in knowledge industries
• Customer relationship management
• Micromarketing and personalization
• Cooperative marketing
• Industry deregulation

Types of Data Enhancement
There are two approaches to data enhancement. One focuses on incrementally improving or adding information as data is viewed or processed. Incremental enhancements are useful as a component of a later analysis stage, such as sequence pattern analysis and behavior modeling. The other approach is batch enhancement, where data collections are aggregated and methods are applied to the collection to create value-added information. Here are some examples.
• Auditing Enhancement
In business processes that require some degree of tracing capability, a frequent data enhancement is the addition of auditing data. Creating a tracking system associated with a sequence of related events provides a framework for evaluating efficiency within a business process.
• Temporal Enhancement
Historical data provides critical insight to a BI program. Whereas in some cases the history is embedded in the collected data, other instances require that activity be enhanced by incrementally adding timestamps noting the time at which some event occurred.
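The auditing and temporal enhancements described above can be illustrated with a minimal sketch. It assumes a record is a plain dictionary and that each event is logged with an activity name, a location, and a timestamp. The names `AuditEntry`, `TrackedRecord`, `touch`, and `elapsed_between_events` are hypothetical and are not taken from the source material.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEntry:
    """One tracked event: what happened, where, and when (hypothetical layout)."""
    activity: str
    location: str
    timestamp: datetime


@dataclass
class TrackedRecord:
    """A data record enhanced with an incrementally growing audit trail."""
    data: dict
    audit_trail: list = field(default_factory=list)

    def touch(self, activity, location, when=None):
        # Temporal enhancement: stamp the event with the time it occurred.
        stamp = when if when is not None else datetime.now(timezone.utc)
        # Auditing enhancement: append the event to the record's trail.
        self.audit_trail.append(AuditEntry(activity, location, stamp))


def elapsed_between_events(record):
    """Return the time gaps between consecutive events in the audit trail."""
    stamps = [entry.timestamp for entry in record.audit_trail]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# Illustrative usage with made-up order data.
order = TrackedRecord(data={"order_id": 1001})
order.touch("received", "web", datetime(2009, 1, 5, 9, 0, tzinfo=timezone.utc))
order.touch("shipped", "warehouse", datetime(2009, 1, 5, 11, 0, tzinfo=timezone.utc))
gaps = elapsed_between_events(order)
```

The gaps between consecutive events are one simple way to evaluate the efficiency of the business process that the tracked events belong to.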
Types of Data Enhancement (cont…)
• Contextual Enhancement
The place, or context, of data manipulation is an enhancement as well. A physical location, a path of access, and the login account through which a series of transactions was performed are examples of context that can augment data. Contextual enhancement also includes tagging data records in a way that lets them be correlated with other pieces of data.
• Geographic Enhancement
Data enhanced with geographic information allows for analysis based on regional clustering and data inference based on predefined geodemographics. The first kind of geographic enhancement is the process of address standardization, where addresses are cleansed and then modified to fit a predefined postal standard.
• Demographic Enhancement
Demographics describe the similarities that exist within an entity cluster, such as customer age, marital status, gender, income, and ethnic coding. Demographic enhancements can be added through direct information merging, among other routes.

Types of Data Enhancement (cont…)
• Psychographic Enhancement
Psychographics describe what distinguishes individual entities within a cluster. Psychographic information is frequently collected via surveys, contest forms, customer service activity, registration cards, as well as specialized lists. The trick to using psychographic data is in being able to make the linkage between the entity within the organization database and the supplied psychographic data set.
• Inference Enhancement
Information inference is a BI technique that allows the user to draw conclusions about the examined entity based on supporting evidence and business rules. Inferred knowledge can be used to augment data to reflect what we have learned, and this in turn provides greater insight into solving the business problem at hand.

Incremental Enhancement
Incremental enhancements are those that can be attached to data in process.
• Provenance
The provenance of an item is its source. This idea generalizes the temporal and auditing enhancements described earlier. A provenance can be as simple as a single string data field describing the source or as complex as a separate table containing a time stamp and a location code each time the record is updated, related through a foreign key.
• Audit Trails
The combination of location, time, and activity information associated with a series of manipulations of a data record allows us to trace back all occasions at which that information was touched. This gives us the audit data needed to see how activities cause data to flow through a system.
• Context
This kind of enhanced data provides significant marketing benefit, because this context information can be fed into a statistical framework for reporting on the behavior of users based on their locations or times of activity.

Batch Enhancements
Batch enhancements are applied to a large set of data instances as an offline process. They typically involve the merging of data from multiple instances within a single data set or multiple data instances drawn from multiple data sets.
• Householding
Householding is a process that attempts to reduce a set of individuals to a single grouped housing unit based on the database record attribution. A household consists of all people living as an entity within the same residence.
• Organizational Merging
When organizations merge, they will eventually want to merge their vendor, customer, and employee databases as well as their base reference data.
• Other Batch Enhancements
Other batch enhancements include data scrubbing, data cleansing, and health care diagnosis assistance, as well as building affinity programs and constructing relational associations, among others.

Standardization
Standardization refers to ensuring that a data instance conforms to a predefined expected format.
A data standard is a format representation for data values that can be described using a series of rules. Because a standard is a distinct model to which all items in a set must conform, we can try to automate two components of any standardization process:
• Determination of conformance to the standard
• Bringing a nonstandard data instance into conformance with the standard
There is usually a well-defined rule set describing both how to determine if an item conforms to the standard and what actions need to be taken to bring the offending item into conformance.
• Data Standard and Standardization
The value of data standardization lies in the notion that given the right base of reference information and a well-defined rule set, additional data can be added to a record in a purely automated way. Probably the most important benefit of standardization is that through the process of defining standards, organizations create a streamlined means for the transference and sharing of information.

Standardization (cont…)
• Kinds of Standards
Most standards either are dictated by some authority (such as the government), are developed through cooperation (such as an industry-defined standard), or are derived from common use (such as geographical biases toward representing dates).

Example: Address Standardization
In this section, we look at the different components of an address.
• The Address Standard
  - Recipient line: The recipient line indicates the person or entity to which the mail is to be delivered.
  - Delivery address line: The delivery address line is the line that contains the specific location associated with the recipient.
  - Last line: The last line of the address includes the city name, state, and ZIP code.
• Standard Abbreviations
The postal service provides a set of enumerations of standard abbreviations, including U.S. State and Possession abbreviations, street abbreviations, as well as common business word abbreviations.

Example: Address Standardization (cont…)
• ZIP + 4
ZIP codes are postal codes assigned to delivery areas to improve the precision of sorting and delivering mail. ZIP + 4 codes are a further refinement, narrowing down a delivery location within a subsection of a building or a street.
• Address Standardization Software
Because the USPS addressing standard is so well documented, it is relatively straightforward to build automated address standardization software, which eases the way in which this enhancement can be performed.

Enhancement Methodologies
There are many issues involved in data enhancement, but because a large number of them revolve around information record linkage, it is worthwhile to explore this in greater detail.
• Record Linkage
Any two records that can be connected based on a set of chosen attributes are candidates to be linked together. Usually record linkage is performed only when the chosen attributes match exactly, but simple record linkage is limited, for the following reasons:
  - Information is missing
  - Information sources are in different formats
  - Record linkage is imprecise
  - Information is out of synchronization
  - Information is lost
• Semistructured Data
Semistructured data refers to information that is partially formatted, such as data elements on a web page or the comments field in a customer service database.

Enhancement Methodologies (cont…)
Semistructured data may be a good source for both association and relation information, but the problem of extracting information out of the data is particularly difficult.
• Inference
An inference is an application of a heuristic rule that essentially creates a piece of information where it didn’t exist before. Even though inferencing represents the application of intuition, it is done in a way that can be automated.
Inference rules usually reflect some understood business analysis that can be boiled down to a set of business rules.
• Types of Inference
Enhancements based on inferencing are usually very focused bits of information relevant within a particular analytical context. Inferences are likely to center on demographic or psychographic details that can be derived as a direct result of data merging and analysis.

Management Issues
• Buy versus Build
In the software and services market, the term data enhancement is overloaded and can be used to refer to anything from data cleansing and address standardization all the way to services-based record linkage as a means to add data fields to submitted data, such as credit ratings.
• Performance Issues
Some data enhancement applications are likely to be of high computational complexity, and therefore members of the team should be familiar with high-performance computing as well as database manipulation, ETL, and pattern matching.
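The idea of using record linkage to add data fields to submitted data, mentioned under Buy versus Build, can be sketched as follows. This is only an illustration under stated assumptions: records are plain dictionaries, linkage is an exact match on the chosen attributes after a simple normalization step, and the field name `credit_rating` and the functions `normalize`, `link_key`, and `enhance` are hypothetical. It does not address the harder cases listed earlier, such as information that is out of synchronization or lost.

```python
def normalize(value):
    """Lowercase the value and replace punctuation and extra spaces,
    so that simple format differences do not block a match."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in value)
    return " ".join(cleaned.split())


def link_key(record, attributes):
    """Build the linkage key from the chosen attributes.
    Returns None when any attribute is missing or empty, because a
    record with missing information cannot be linked reliably."""
    parts = [normalize(str(record.get(attr, ""))) for attr in attributes]
    if any(part == "" for part in parts):
        return None
    return tuple(parts)


def enhance(records, reference, attributes, new_field):
    """Copy new_field from the matching reference record into each record.
    Records with no match receive None for the new field."""
    index = {}
    for ref in reference:
        key = link_key(ref, attributes)
        if key is not None:
            index[key] = ref

    enhanced = []
    for rec in records:
        key = link_key(rec, attributes)
        match = index.get(key) if key is not None else None
        out = dict(rec)  # leave the submitted record untouched
        out[new_field] = match[new_field] if match else None
        enhanced.append(out)
    return enhanced


# Illustrative usage with made-up data.
customers = [
    {"name": "ACME  Corp.", "zip": "10001"},
    {"name": "Widget Co", "zip": "94105"},
    {"name": "No Zip Ltd"},
]
reference = [{"name": "Acme Corp", "zip": "10001", "credit_rating": "A"}]
result = enhance(customers, reference, ["name", "zip"], "credit_rating")
```

In this sketch the first customer links to the reference record despite differences in case, spacing, and punctuation, while the other two remain unenhanced. A production approach would likely need approximate matching rather than normalized exact matching.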