
Introduction & Overview: Data Mining

PETE 2060, Computing and Data Mining
PETE 4990, Data Mining (TPS)
Introduction
What is Data Analytics?
Business analytics (BA) is the practice and art of bringing quantitative data to bear on
decision‐making.
The term means different things to different organizations.
Data analytics is the process of extracting meaningful insights from data; it enables data-driven
decisions based on factual data.
Data -> Data Analytics -> Data-Driven Decisions
Artificial Intelligence vs Machine Learning vs Deep Learning
o Artificial Intelligence: The science behind programming computers to
simulate human intelligence by thinking and acting like humans.
• Automates repetitive learning and discovery through data
• Analyzes more data, faster and more accurately.
o Machine learning: A specific subset of AI applications that learn on their
own using patterns in the data without explicit programming.
o Deep Learning: Uses complex neural network algorithms with many
processing layers to recognize patterns in very large datasets.
[Figure: Artificial Intelligence diagram. Labels: Symbolic Learning, Statistical Learning, Machine (Data) Learning, Robotics, Computer Vision, Image Processing, Speech Recognition, Natural Language Processing, Deep Learning (ANN), Computer Vision, Object Recognition]
What is Data Mining?
Definition: Non-trivial extraction of implicit, previously unknown and potentially useful information from data.
Non-trivial: obvious knowledge is not useful
Implicit: hidden and difficult to observe knowledge
Previously unknown
Potentially useful: actionable; easy to understand
A pattern is interesting if:
o Easily understood by humans,
o Valid on new or test data with some degree of certainty
o Potentially useful, novel
o Validates some hypothesis that a user seeks to confirm
Interestingness measures:
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
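The two objective measures named above, support and confidence, can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the transactions, the item names, and the helper functions `support` and `confidence` are all hypothetical.

```python
# Toy market-basket data (hypothetical): each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) = support(A and C) / support(A)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    base = support(antecedent, transactions)
    return joint / base if base > 0 else 0.0

# Rule under study: bread -> milk
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(f"support(bread, milk) = {rule_support:.2f}")
print(f"confidence(bread -> milk) = {rule_confidence:.2f}")
```

Here 3 of the 5 transactions contain both bread and milk, and 4 contain bread, so the rule has support 3/5 and confidence 3/4. Subjective measures such as unexpectedness depend on the user's prior beliefs and are not captured by counts like these.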
CRISP-DM Pipeline
Cross Industry Standard Process for Data Mining
Knowledge Discovery
• Learning the application domain
   o Relevant prior knowledge and goals of application
• Creating a target data set
   o Data selection
• Data cleaning and preprocessing (may take 60% of the effort!)
• Data reduction and transformation
   o Find useful features, dimensionality/variable reduction, invariant representation
• Choosing functions of data mining
   o Summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining
   o Search for patterns of interest
• Pattern evaluation and knowledge presentation
   o Visualization, transformation, removing redundant patterns, etc.
• Use of the discovered knowledge
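A few of these steps (cleaning, transformation, mining, and evaluation) can be sketched end to end in plain Python. This is only an illustrative sketch under stated assumptions: the well records, the feature names, and the choice of a 1-nearest-neighbour classifier are hypothetical and are not prescribed by the course material.

```python
# Hypothetical raw well records: (porosity, permeability_mD, quality label).
# None marks a missing measurement.
raw = [
    (0.12, 15.0, "poor"),
    (0.25, 180.0, "good"),
    (None, 90.0, "good"),   # missing porosity
    (0.10, 12.0, "poor"),
    (0.27, 210.0, "good"),
    (0.14, None, "poor"),   # missing permeability
]

# Step: data cleaning. Drop records with any missing value.
clean = [r for r in raw if None not in r]

# Step: data transformation. Min-max scale each feature to [0, 1].
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

porosity = min_max([r[0] for r in clean])
permeability = min_max([r[1] for r in clean])
labels = [r[2] for r in clean]
features = list(zip(porosity, permeability))

# Step: data mining (classification) with a 1-nearest-neighbour rule.
def predict(x, train_features, train_labels):
    dists = [sum((a - b) ** 2 for a, b in zip(x, f)) for f in train_features]
    return train_labels[dists.index(min(dists))]

# Step: pattern evaluation with leave-one-out accuracy.
correct = 0
for i, (x, y) in enumerate(zip(features, labels)):
    rest_features = features[:i] + features[i + 1:]
    rest_labels = labels[:i] + labels[i + 1:]
    correct += predict(x, rest_features, rest_labels) == y
accuracy = correct / len(features)
print(f"kept {len(clean)} of {len(raw)} records, accuracy = {accuracy:.2f}")
```

For brevity the scaling is computed on all cleaned rows before the leave-one-out loop; a more careful evaluation would fit the scaling on the training rows only.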
What Types of Data? What Types of Analytics?
Data Types
▪ Time-series data, temporal data, sequence data
▪ Structured data, graphs, information networks
▪ Data streams and sensor data
▪ Spatial data and spatiotemporal data
▪ Relational database, heterogeneous databases
▪ Text databases
[Figure: types of analytics by value and complexity. Descriptive (hindsight, low); Diagnostic and Predictive (insight, medium); Prescriptive (foresight, high)]
What Are We Trying to Do?
ML Models: Learning Tasks
▪ Supervised: makes inferences using labeled data (Classification, Regression)
▪ Unsupervised: makes inferences without labeled data (Clustering)
▪ Reinforcement: learns based on consequences
• Find patterns in a large quantity of data (e.g., quantify “low risk” borrowers for loans)
• Match related data (e.g., facial recognition)
• Make recommendations (create predictive models from large data sets and apply them in real time to specific cases)
• Decision-making (fully or semi-autonomous – self-driving cars, probation determinations)
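To make the difference between learning with and without labels concrete, the sketch below applies both to the "low risk" borrower example. Everything in it is an assumption for illustration: the debt-to-income values, the labels, and the helper functions `fit_centroids`, `classify`, and `kmeans_1d` are hypothetical, and a nearest-centroid rule and 1-D k-means stand in for whatever algorithms a real application would use.

```python
# Hypothetical borrower data: debt-to-income ratios with known risk labels.
ratios = [0.10, 0.15, 0.20, 0.55, 0.60, 0.70]
labels = ["low", "low", "low", "high", "high", "high"]

# Supervised (classification): learn one centroid per labeled class.
def fit_centroids(xs, ys):
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in groups.items()}

def classify(x, centroids):
    # Assign the class whose centroid is closest to x.
    return min(centroids, key=lambda y: abs(x - centroids[y]))

centroids = fit_centroids(ratios, labels)
prediction = classify(0.25, centroids)

# Unsupervised (clustering): 1-D k-means with k = 2; labels are never used.
def kmeans_1d(xs, iterations=10):
    centers = [min(xs), max(xs)]  # simple deterministic initialisation
    for _ in range(iterations):
        clusters = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            clusters[nearest].append(x)
        # Recompute each center as the mean of its cluster (keep it if empty).
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d(ratios)
print("supervised prediction for 0.25:", prediction)
print("unsupervised clusters:", clusters)
```

The supervised model can name its output ("low" or "high") because it was trained on labels; the clustering step only recovers two groups, and interpreting them as risk levels is left to the analyst.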
Data Situation in the Industry
Exploration and Field Development
[Figure labels: Resource Constraints, Remote, Time Constraints, More Complex]
• Verify assumptions using historical data
• Acreage assessment and prospect generation
• Increase the success rate of identifying potentially productive seismic trace signatures
Operational Efficiency
Subsurface Characterization
• Pressure support
• Reservoir sweep
• Water/gas/steam flooding
• Enhanced oil recovery (EOR)
Drilling & Completions
• Real-time drilling optimization
• Precision drilling
• Understand operational constraints
Production Engineering
• Well Performance
• Field Operations
• Lifting and pumping
• Flow assurance
Operational Excellence
• Reduced uncertainty
• Automation
• Best practices and knowledge capture
• Remote collaborative teams
Predictive Maintenance
[Figure: Digital Twin for predictive maintenance. Labels: Legacy Data, Real-time Data, Expert Knowledge, Monitoring, Diagnosis, Adjusting, Situational Awareness, Real-time Decision Making, Risk Sensing Capability, Proactive Prevention]
Course Objectives
▪ Students will have a comprehensive knowledge of the principles of data mining techniques.
▪ Students will be familiar with various soft computing algorithms, such as neural networks and
support vector machines (PETE 4990).
▪ Students will have an appreciation of the necessity and benefits of applying modern data analytics
and knowledge discovery techniques in petroleum engineering.
▪ Students will be able to manipulate different data mining techniques to build analytical applications
within petroleum engineering.
Tentative Topics
• Introduction to data pre-processing
• Data cleaning and preparation
• Data wrangling
• Modelling with machine learning algorithms
• Frequent patterns and association rules
• Linear regression
• Decision trees for Classification
• Regression trees
• Ensemble Methods-Classification
• Ensemble Methods-Regression
• Support Vector Machines
• Artificial Neural Networks
• Clustering Techniques
Data Mining Tools
▪ Orange Data Mining, based on visual programming (link: Orange Data Mining - Getting started)
▪ Python for data mining, via Anaconda (link: Free Download | Anaconda)