Lect3

advertisement
LECTURE 3-DATA MINING
WHY DATA MINING AND THE KDD PROCESS?
2
WHY DATA WAREHOUSING AND DATA MINING?
• Huge volumes of Data available: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized
society
• Cameras, publication tools, scanned text and images
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• Medical data, demographic data, financial data and marketing data
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—
Automated analysis of massive data sets
3
WHY DATA WAREHOUSING AND DATA MINING?—POTENTIAL APPLICATIONS
• Data analysis and decision support
• Market analysis and management
• Target marketing, customer relationship management (CRM), market
basket analysis, cross selling, market segmentation
• Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality control,
competitive analysis
• Fraud detection and detection of unusual patterns (outliers)
• Other Applications
• Text mining (news group, email, documents) and Web mining
• Stream data mining
• Bioinformatics and bio-data analysis
4
EX. 1: MARKET ANALYSIS AND MANAGEMENT
• Where does the data come from?—Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
• Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits, etc.
• Determine customer purchasing patterns over time
• Cross-market analysis—Find associations/co-relations between product
sales & predict based on such association
• Customer profiling—What types of customers buy what products (clustering
or classification)
• Customer requirement analysis
• Identify the best products for different groups of customers
• Predict what factors will attract new customers
• Provision of summary information
• Multidimensional summary reports
• Statistical summary information (data central tendency and variation)
5
EX. 2: FRAUD DETECTION & MINING UNUSUAL PATTERNS
• Approaches: Clustering & model construction for frauds, outlier analysis
• Applications: Health care, retail, credit card service, telecomm.
• Auto insurance: ring of collisions
• Money laundering: suspicious monetary transactions
• Medical insurance
• Professional patients, ring of doctors, and ring of references
• Unnecessary or correlated screening tests
• Telecommunications: phone-call fraud
• Phone call model: destination of the call, duration, time of day or week.
Analyze patterns that deviate from an expected norm
• Retail industry
• Analysts estimate that 38% of retail shrink is due to dishonest employees
• Anti-terrorism
6
CLASS QUIZ
1.
Which of the following is the reason why Data Mining is required today?
a)
b)
c)
d)
2.
Which of the following is not a source of data:
a)
b)
c)
d)
3.
Youtube
Web
Online medical data repositories
Cars
Which is the most important application of Data Mining?
a)
b)
c)
d)
4.
Data has not been mined before
Mining can lead to good quality data
Data is hard to find
Huge amounts of Data is available today hence we need advanced techniques to make
sense of it
Decision support
Data analysis
Searching
Providing data
Match the techniques on the left to their description on the right
a)
b)
c)
d)
Cross market analysis
Target Marketing
Customer requirement analysis
Customer profiling
i. what customer buy what kind of product
ii. Predict what will attract a customer
iii. Determine customer purchasing patterns
7
iv. Finding association between products
DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES
Database
Technology
Machine
Learning
Pattern
Recognition
Statistics
Data Mining
Algorithm
Visualization
Artificial
Intelligence
8
KNOWLEDGE DISCOVERY (KDD) PROCESS
Data mining—core of knowledge
discovery process
Pattern Evaluation
Data Mining
Task-relevant Data
Data Warehouse
Selection
Data Cleaning
Data Integration
Databases
9
KDD PROCESS: SEVERAL KEY STEPS
• Learning the application domain
• relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
• Find useful features, dimensionality/variable reduction, invariant
representation
• Choosing functions of data mining
• summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
• visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
10
Suppose you are asked to do some Data Mining for Chelsea.
1.
What possible solutions using Data Mining would you propose
to them?‫ما هي الحلول باستخدام بيانات التعدين ممكن هل تقترح لهم؟‬
Ans. Which player will score the most goals?
Which club will win the cup?
2. What is the first thing you would do?
Ans. Learn more about the domain.
3. What would you consider of utmost importance in data
selection?‫ماذا كنت تنظر من أهمية قصوى في اختيار البيانات ؟‬
Ans. Select good quality data important for further analysis.
4. How would you acquire data reduction? ‫كيف يمكنك الحصول على‬
‫تخفيض البيانات ؟‬
Ans. Remove data that is redundant.
5. What data transformation would be required? ‫ما تحويل البيانات‬
‫سيكون مطلوبا ؟‬
Ans. Convert dates from Gregorian to Hijri
11
CLASS WORK
Suppose you are asked to do some Data Mining for Hyper
Panda.
1.
What possible solutions using Data Mining would you propose
to them?‫ما هي الحلول باستخدام بيانات التعدين ممكن هل تقترح لهم؟‬
Ans. What will the customers buy next year?
Will they be interested in a new kind of milk?
2. What is the first thing you would do?
Ans. Learn more about the domain.
3. What would you consider of utmost importance in data
selection?‫ماذا كنت تنظر من أهمية قصوى في اختيار البيانات ؟‬
Ans. Select good quality data important for further analysis.
4. How would you acquire data reduction? ‫كيف يمكنك الحصول على‬
‫تخفيض البيانات ؟‬
Ans. Remove data that is redundant.
5. What data transformation would be required? ‫ما تحويل البيانات‬
‫سيكون مطلوبا ؟‬
Ans. Convert dates from Gregorian to Hijri
12
LET US DISCUSS AN EXAMPLE
Suppose you are asked to do some Data Mining for Abdul
Lateef Jamil.
1.
What possible solutions using Data Mining would you propose
to them?
Ans.Which car will sell more in the next year?
Which car will be more popular in Province Riyadh?
2. What is the first thing you would do?
Ans. Learn more about the domain.
3. What would you consider of utmost importance in data selection?
Ans Select good quality data important for further analysis.
4. How would you acquire reduction?
Ans. Remove data that is redundant.
5. What transformation would be required?
Ans. Convert dates from Gregorian to Hijri
13
CLASS WORK
Suppose you are asked to do some data mining for ELM who
would like to know the opinion of the users of there system
using a database of online opinion blogs and tweet messages.
1.
What possible solutions using Data Mining would you propose
to them or what knowledge can be mined using Data Mining?
2. What is the first thing you would do?
3. What would you consider of utmost importance in data selection?
4. How would you acquire reduction?
5. What transformation would be required?
14
Download