Business Analysis & Data Mining - Computer Science and Computer

Business Analytics & Data Mining
By Matthew Rothmeyer
Overview of BA Discussion
 Business Analytics (BA)
 Overview
 History
 Types of Business Analytics
 Real world examples
 Challenges
 Relations to Data Mining
Business Analytics (BA) : an overview
 BA can be considered a subset of Business intelligence
 A set of skills, technologies, applications and practices for the
exploration and investigation of past business performance to gain
insight and drive business planning.
 Like Business Intelligence, BA can focus either on the business as a
whole or only on segments of it
 Focuses on developing new insights and understanding of performance
based on data and statistical methods
BA : Short History
 Analytics in business dates far before computing
 Frederick Taylor, father of scientific management, 19th century
 time management exercises used in industrial settings
 Henry Ford : assembly line pacing used to improve output and business
profitability
 BA became widespread when computers were used in decision support
systems (DSS) in the 1960s
 Evolved into ERP, data warehouses, etc.
Types of Business Analytics
 Reporting or Descriptive Analytics
 Affinity grouping
 Clustering
 Modeling or Predictive analytics
BA: Reporting
 Based on the need to locate and distribute business insights and
experiences
 Often involves ETL procedures used alongside a data warehousing
scheme
 The data is then collected, quantified, and organized using reporting
tools
 Reporting allows information describing different views of an
enterprise to come together in one place
 A user could query a production and marketing database to determine if
production of a product could be moved closer to where a product is sold
BA: Affinity grouping
 A tool used by businesses and
organizations to take ideas and data
and organize them.
 Often takes the form of an affinity diagram
 Enables data and ideas stemming from brainstorming to be
sorted into groups
 Sorting is based on their natural relationships
BA: Clustering
 Placing a set of objects into groups (called clusters) so that the
objects in the same cluster are more similar (in some sense or another)
to each other than to those in other clusters (Wikipedia)
 Is a main task of exploratory data mining and statistical data analysis
 Clustering is a general task that does not have one set solution
 Clustering can be hard or fuzzy
 Can be done by people or machines
 The latter is preferred
BA: how do we model clusters?
 Connectivity models – how data can be connected to other points
 Density models – defining a cluster by determining where sets of data
points are densest
 Distribution model – clusters are modeled using statistical
distributions
 Expectation maximization
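A minimal sketch of a connectivity model with hard assignment: two points share a cluster if a chain of nearby points links them. The function name, the max_gap threshold, and the sample points are illustrative assumptions, not part of the slides.

```python
from math import dist

def connectivity_clusters(points, max_gap):
    """Group points so that points within max_gap of each other share a cluster."""
    clusters = []
    for p in points:
        linked, unlinked = [], []
        for c in clusters:
            # A cluster is linked to p if any of its members is close enough.
            if any(dist(p, q) <= max_gap for q in c):
                linked.append(c)
            else:
                unlinked.append(c)
        # Merge p with every cluster it connects to.
        merged = [p] + [q for c in linked for q in c]
        clusters = unlinked + [merged]
    return clusters

# Hypothetical 2-D observations.
points = [(1.0, 1.0), (1.5, 1.2), (8.0, 8.0), (8.3, 7.9), (1.2, 0.8)]
clusters = connectivity_clusters(points, max_gap=1.0)
print(len(clusters))  # number of clusters found
```

Changing max_gap changes the grouping, which illustrates why clustering does not have one set solution.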
BA: Predictive Analysis
 Stems from the desire to predict future events through analyzing data
an enterprise has collected
 Pattern exploitation results in the identification of opportunities and
also risks
 Allows relationships in disparate data to be identified
 Helps guide decision making in a business
 Is often implemented in the form of data mining
BA : Examples
 Credit company – uses business analytics to track the credit risk of
customers and to match customers to offerings
 Sales and offers – companies can track customer interaction, and use
that information to determine appropriate product offerings.
 Sales groups can use BA to optimize inventory and analyze past sales
 Could measure peak purchasing times for products
 Could decide whether or not to stock poorly selling items
 Give examples of business cases where data mining might be useful,
and describe how data mining would be used
 Preventing credit card fraud through detecting spending patterns
 Inventory management by tracking sales
BA : Challenges
 Acquiring sufficient volumes of high quality data
 Most data acquired in the field is unsorted and appears in many different
formats
 When dealing with high volume data, deciding what is important and
what is noise
 Rapidly reacting storage structures
 BA can influence customer interactions, so that information must
be available quickly
 Ex: a customized sales pitch
Business Analytics & Data Mining
 Data Mining is an important sub task of Business Analytics
 Both Predictive analysis and clustering tasks utilize
information retrieved from data mining
 Data mining helps handle some of the specific problems faced
when conducting Business Analytics
 Dealing with and sorting through large data sets
Data Mining : An Overview
 What is Data Mining ?
 History
 Applications of Data Mining
 Detecting data discrepancies or outliers
 Relationship identification
 Data-Function mapping for modeling/prediction
 Categorizing and Summarizing Data
 Standards
 Challenges
Data Mining : What is it?
 Applying statistical analysis techniques to data
 the goal often being to determine unnoticed patterns or to collect
categorized information
 turns collected data into understandable structures
 Data Mining is often used as a buzz word to describe processing large
amounts of data
 In essence, its correct use relates to discovery of new things through
observation
 Synonymous with knowledge discovery
Data Mining : History
 Though HNC trademarked the term in 1990, hands-on pattern
extraction is centuries old
 It has existed as long as statistical analysis has
 Discoveries in computer science have increasingly shifted the field
from hands-on to machine-dependent work, which allows for :
 The use of data indexing and DB systems to handle data efficiently
 The application of statistical algorithms on a large scale, possibly in a
distributed manner, with less error
Data Mining : Use : Application
 Data Mining is often broken into several different categories of
tasks
 Detecting data discrepancies or outliers
 Relationship identification
 Data-Function mapping for modeling/prediction
 Categorizing and Summarizing Data
Data Mining : Finding outliers
 The process of analyzing large, mostly homogeneous, sets of
data and determining which sets or points
 “go with the flow” and conform with patterns the rest of the
data seem to follow
 do not follow expected results when viewed against the entire
set of data
 An outlier can be a point or set of points, but can also be
defined through other means
 A period of time could yield unexpected results
 Ex. Network Intrusion
Data Mining : Techniques in finding outliers
 Rule Based – deciding a set of rules that determine an outlier
(or what isn’t one)
 Can be fuzzy or hard rules
 Cluster Analysis – As mentioned earlier
 Distance or Standard Deviation – Determining an average
over a data set and marking points that fall outside a chosen
distance or number of standard deviations from it
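The standard deviation technique can be sketched as follows. The function name, the cutoff of two standard deviations, and the sample login counts are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def find_outliers(values, max_deviations=2.0):
    """Return the values lying more than max_deviations standard deviations from the mean."""
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - center) > max_deviations * spread]

# Hypothetical daily login counts with one unusual burst of activity.
daily_logins = [102, 98, 101, 97, 103, 99, 100, 250]
outliers = find_outliers(daily_logins)
print(outliers)  # [250]
```

Note that a large outlier also inflates the average and the deviation it is measured against, so the cutoff usually needs tuning per data set.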
Applications of Outlier Detection
 Network Intrusion Detection
 Unusual bursts of network activity
 Identity Theft Detection
 Unusual spending or customer activity
 Detecting Software bugs
 Software does not deliver expected outputs
 Sensor event detection
 Monitoring patient health fluctuations in a medical setting
 Preprocessing
 Removing data skews based on extenuating circumstances
Relationship Discovery: Basics
 Understanding how data is related is a key factor in trend and
knowledge discovery
 This is at the core of data mining
 Ex: Which products are often bought before a major forecasted
storm
 {hamburger buns} => {???}
 With small sets of data, or with correlations that aren’t subtle
(like the one above), identifying relationships is not as difficult
 With large data sets or subtle relations a combination of rule
generation and data analysis can be used to expedite the process
Relationship Discovery: How it's done
 Since the number of relationships between points of data
could be boundless, two important concepts are often
introduced in relationship discovery:
 The fraction of the data in which all items of a rule appear
together, called the support of a rule.
 The probability that data matching the left-hand side of a rule
also matches its right-hand side, called the confidence of a rule.
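A small sketch of how support and confidence could be computed, assuming the common definitions: support is the fraction of orders containing every item of the rule, and confidence is the fraction of orders containing the left-hand side that also contain the right-hand side. The orders below are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies in this data
    return support(transactions, set(antecedent) | set(consequent)) / base

# Hypothetical orders.
orders = [
    {"hamburgers", "hamburger buns", "ketchup"},
    {"hamburgers", "hamburger buns"},
    {"hamburgers", "soda"},
    {"hamburger buns", "hot dogs"},
    {"milk", "bread"},
]

rule_support = support(orders, {"hamburgers", "hamburger buns"})          # 2 of 5 orders
rule_confidence = confidence(orders, {"hamburgers"}, {"hamburger buns"})  # 2 of the 3 hamburger orders
print(rule_support, rule_confidence)
```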
Relationship Discovery: How it's done
 Generally we apply minimum bounds to both the support of a rule
and its confidence to determine relationships
 First : determine possible relationships
 Set a minimum support
 Orders with hamburgers, Orders with hamburger buns
 Other, user specific rules can be used here
 Second : take the remaining sets, look for patterns in the items
sets such that occurrence rate is above the minimum confidence
 How many people bought hamburgers and buns together
 Ex: we might find that male customers who buy diapers are
also likely to buy beer
 {male, diapers} => {beer}
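The two-step procedure above can be sketched for rules between pairs of items. This is a simplified illustration, not a full rule-mining algorithm; the function name, thresholds, and baskets are assumptions.

```python
from itertools import combinations

def mine_pair_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for item pairs meeting both minimum bounds."""
    n = len(transactions)

    def frac(items):
        return sum(items <= t for t in transactions) / n

    # First: keep only the items that meet the minimum support on their own.
    all_items = sorted(set().union(*transactions))
    frequent = [item for item in all_items if frac({item}) >= min_support]

    rules = []
    # Second: among the remaining items, look for pairs above both bounds.
    for a, b in combinations(frequent, 2):
        pair_support = frac({a, b})
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            conf = pair_support / frac({lhs})
            if conf >= min_confidence:
                rules.append((lhs, rhs, pair_support, conf))
    return rules

# Hypothetical shopping baskets.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

rules = mine_pair_rules(baskets, min_support=0.4, min_confidence=0.7)
for lhs, rhs, s, c in rules:
    print(f"{{{lhs}}} => {{{rhs}}}  support={s:.2f} confidence={c:.2f}")
```

Applying the support bound first discards rare items early, which keeps the number of candidate pairs small.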
Matching data to functions
 Often, it is desirable to match data sets and the factors that
determine them to functions
 Allows for the possibility of predicting future results
 Involves learning how dependent and independent variables
in our data interact
 Dependent : the result, i.e. the value observed at a data point
 Independent : a cause or circumstance that determines the
dependent variable
 If we know how dependent and independent variables
interact, we can create a function and run simulations to see
results
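As one possible illustration, a straight line can be matched to data with ordinary least squares and then used to predict new results. The slides do not name a specific fitting method, so least squares is an assumption here, as are the variable names and the sample figures.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: independent variable (ad spend) and dependent variable (units sold).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(ad_spend, units_sold)

def predict(x):
    """Run the fitted function on a new value of the independent variable."""
    return slope * x + intercept

print(predict(6.0))
```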
Uses of Function-Data Mapping
 Weather Forecasting
 Determining what conditions lead to what kinds of weather
 Stock market analysis
 When to buy and when to sell
 Crime Prevention
 What conditions cause or prevent crime
Categorizing
 Categorizing – Often we want to separate data based on a
set of predefined attributes
 Very helpful in pattern recognition
 Ex: a person's political preference
 The process :
 we synthetically generate or measure a set of observations (data
points) with known categories
 we extract properties from said observations which we believe
contribute to the category
 These are called explanatory variables
 Finally we examine new data for these properties
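The process above can be sketched with a nearest-centroid classifier, one simple technique among many and not one the slides prescribe. The explanatory variables (store visits per month, average basket value) and the category labels are hypothetical.

```python
from collections import defaultdict
from math import dist

def train_centroids(observations):
    """Average the explanatory variables of the known observations per category."""
    grouped = defaultdict(list)
    for features, category in observations:
        grouped[category].append(features)
    return {
        category: tuple(sum(column) / len(rows) for column in zip(*rows))
        for category, rows in grouped.items()
    }

def categorize(centroids, features):
    """Assign new data to the category whose centroid is closest."""
    return min(centroids, key=lambda category: dist(centroids[category], features))

# Observations with known categories: ((visits per month, average basket value), category).
labeled = [
    ((8, 60), "frequent"),
    ((10, 75), "frequent"),
    ((1, 20), "occasional"),
    ((2, 15), "occasional"),
]

centroids = train_centroids(labeled)
print(categorize(centroids, (7, 55)))
print(categorize(centroids, (1, 25)))
```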
Summarizing
 Summarizing – we almost never want to look at all of the
data individually
 Having too much data can actually hinder the decision making
process
 Known as information overload
 Summarizing takes the results from data mining and
transforms them into formats that can be easily read without
omitting important information
 Summarizing might :
 Extract and display only important data
 correlate and abstract data to display trends
 Formats Include : Reports, Graphs, Dashboards, etc.
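A minimal sketch of summarizing: individual sales records are abstracted into per-region totals, ranked so the most important figures come first. The record layout and region names are assumptions for illustration.

```python
from collections import defaultdict

def summarize_sales(records):
    """Collapse (region, amount) records into per-region totals, largest first."""
    totals = defaultdict(float)
    for region, amount in records:
        totals[region] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical individual sales records.
records = [("north", 120.0), ("south", 80.0), ("north", 30.0), ("east", 200.0)]

summary = summarize_sales(records)
for region, total in summary:
    print(f"{region}: {total:.2f}")
```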
Standards : CRISP-DM
 Cross Industry Standard Process for Data Mining
 describes common practice for conducting data mining in an enterprise
setting
 KDnuggets, a community resource for DM and analytics, took polls
and found CRISP-DM was the top methodology in 2002, 2004, and 2007
 Six step methodology
 Business Understanding
 Data Understanding
 Data Preparation
 Modeling
 Evaluation
 Deployment
CRISP-DM : Explained
 Business Understanding
 Determining the business purpose
 Define success conditions – how do we know we succeeded
 Ex : improved prediction accuracy
 Map purpose/success conditions to data mining results
 Ex: fraud prevention => detect deviations
 Data Understanding
 Collecting and exploring data – defining its attributes
 Data quality verification
CRISP-DM : Explained
 Data Preparation
 Data Cleaning
 Normalization – fitting data within ranges
 Outlier removal – removing cases that could skew the model
 Handling missing attributes – cases where the data was not obtained
 Formatting – changing data so that it fits with our tools
 Modeling – fitting the data to a model following the methods previously
described and then interpreting that model
 Assess the accuracy of the collected data
 General purpose divided into prediction or description
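Two of the data preparation steps listed above, handling missing attributes and normalization, can be sketched as follows. Filling gaps with the mean and scaling to the range 0 to 1 are common choices assumed here, not steps the slides prescribe; the function names and values are illustrative.

```python
from statistics import mean

def fill_missing(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values so they fit within the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attribute with one value that was not obtained.
raw = [10.0, None, 30.0, 20.0]

cleaned = fill_missing(raw)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```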
CRISP-DM : Explained
 Evaluation – look at results and measure them against the
success conditions defined earlier
 Determine if one has succeeded
 Determine next steps, how do we apply the results
 Deployment – The execution of a strategy for using the results of our
data mining
 Includes preparing ways to monitor and maintain the application of data
mining results in day-to-day operations
 Includes some sort of final summary
SEMMA
 Sample, Explore, Modify, Model and Assess
 Sample – selecting the data set
 Explore – understanding data through discovering relationships, both
expected and otherwise
 Modify – transforming and cleaning the data in order to prepare it for
the modeling process
 Model – applying models to the data in order to discover trends and
make predictions
 Assess – evaluating the results of the modeling process to determine
the reliability of the mined data
 Proposed by SAS Institute : A producer of BI and BA software suites.
 Though this model is often considered general, SAS prefers to apply it
directly to its products
 Focuses mainly on data mining and not on applying results to business
(unlike CRISP-DM)
Challenges in data mining
 Not enough or too much data
 It is often difficult for small enterprises to access sufficient
quantities of data
 If the enterprise is large, however, there is sometimes too much data
and deciding what to keep is difficult
 Acquiring clean data
 Multiple formats or no format at all
 Privacy and ethical concerns
 Data aggregation : data compiled from multiple sources can lead to
revelations that violate privacy concerns
 Ex: anonymous data is collected and aggregated, leading to identification
Questions For Exam
 What are some of the challenges in business analytics and
data mining?
 Gathering enough data
 Protecting privacy
 Verifying data integrity
 How can finding outliers be useful in data mining?
 We can use them to clean up results
 Sometimes we wish to isolate outliers so that we can find their
cause