PPT

advertisement
CSC 177
Data warehouse and
Mining project
Pooja Vora
Vishma Shah
Guided by – Prof. Meiliu lu
Agenda
Data Warehouse Project






Introduction
Background
Scope of study
Implementation
Data Cleaning and Preprocessing
Data Mart
Data Mining Project





Introduction
Background
Scope of study
Implementation
Data mining
 Learning experience
 Future Scope
 References
Data Warehouse Introduction
• The objective of our project is
to create a data mart with star
schema
• Data mart will be used to find
answers related to various
company key factors and
statistics.
Background
• Source website : Navathe company schema
• Dataset :
• Company dataset
• Company dataset :
Fact table - 7 attribtues,1000 entries
Scope Of Study
• Data Preprocessing
• Microsoft Office Excel
• Microsoft SQL Server
• Data Mart
• Microsoft SQL server , Visio, convertCSVtoSQL
• Olap Operations
• SQL server queries
Implementation
• Data Cleaning & Preprocessing
• Data Mart
• Olap Operations
Data Cleaning & Preprocessing
The company schema had different tables as per navathe ,
we also added few dimension for analytical processing and
created a fact table with star schema.
Data Mart
• We have 5 dimension tables in our data mart and one fact table
which forms star schema.
• The Fact table tables consists of around 1000 rows having various
details about ssn, project, work_id etc
Star Schema
Data Mart Question-Answers
•
•
•
•
How many products were produced over the months?
•
Rollup
How to find employee current working project?
•
Slicing on employee dimension
How to find the statistics of days where more than 5 products were produced
•
Dicing on product and work dimension
How to find which days and how many products of particular product were
produced?
•
Scoping
Olap Operations Example
•
Roll Up
select t.date_year, t.date_month, sum(w.NumberOfProduct) as 'No. Of Products'
from EmpFactTable f, DimTime t, DimEmp_work_record w
where f.date_key= t.date_key and f.work_id = w.work_id
group by date_year, date_month with rollup
date_year date_month No. Of Products
2014
1
980
2014
2
761
2014
3
1274
2014
4
240
2014
NULL
3255
NULL
NULL
3255
winning month
Quiz
Which dimension was used for slicing cube?
• Employee
• Time
• Work
• Product
Answer - Employee
Data Mining Project
Introduction
• Perform Data mining on data set to
discover knowledge
• Apply data mining algorithms using
tools
• compare the performance of
algorithms using these tools.
• Compare the tools performance
Background
• Source Website – www.data.gov
• Dataset :
• Consumer complaints
• Data:
- 14 attribtues, 55000 entries
(Data from 2012 to 2014)
Scope Of Study
• Data Preprocessing
•
•
Microsoft Office Excel
Tools (Weka, Rapidminer)
• Data Mining
•
•
Tools : Weka, Rapidminer
Algorithms : K-Means, Naïve Bayes
Implementation
• Data Cleaning & Preprocessing
• Data Mining
• Tools Comparision
Data Cleaning & Preprocessing
• Data Cleaning - Replaced missing values with “unknown”
• Data selection – Selected Consumer complaints data of two
months (Sept , Oct) for mining
• Sample Data selected as 3000 rows
Data Mining
We have used One Classification & One Clustering Algorithm
Classification – Naïve Bayes
Clustering – K-means
Data Mining Demo
Tools Comparision : K-Means
Rapid Miner
Weka
Tools Comparision : Naïve Bayes
Rapidminer
Weka
Quiz
Which Clustering Algorithm was used for data
mining?
• K-Means
• EM
Answer – K-means
Learning Experience
•
•
•
•
•
Learned the analytical processing through data mart project.
Helped to improve knowledge for Database statistics
Learned to gain information out of the querying results.
Learned different data mining tools like weka and rapid miner
Improved understanding of various algorithms and their practical
implementation through tools
• Learned to make sense out of the results obtained from the tools
Future Scope
• Data Warehouse
• Create a snowflake schema by introducing dimension like
employee types contractors/Fulltime and then take it
further for analytical processing for different statistics
• Data Mining
• Can implement other algorithms and tools like orange etc
References
• Elmasri and Navathe, Fundamentals of
Addison-Wesley Publishing
Database System, 6th Edition,
• OLAP Courseware
http://athena.ecs.csus.edu/~olap/olap/introduction.php
• DM dataset http://www.data.gov/consumer/
• Data Mining Courseware http://athena.ecs.csus.edu/~datamini
• https://rapidminer.com/wpcontent/uploads/2013/10/RapidMiner_Ra
pidMinerInAcademicUse_en.pdf
Questions….
Download