Uploaded by guruthemr.singh

Foundations of Data Science Presentation

FOUNDATIONS OF DATA SCIENCE
DIT 54102
Muhammad Shahin
Data Understanding
• Collect Initial Data: acquire the data listed in the project's resources, or obtain access to it. Keep a
checklist of the datasets you have acquired, where each one is located, and the methods used to acquire
it. Record any problems encountered and how they were solved, so that other users and project members
are aware of them.
• Describe Data by examining the properties of the acquired data. Provide a description report covering
the format of the data, the quantity of data, and the records and fields in each table or dataset.
• Explore Data by posing data science questions that can be answered quickly through querying,
visualization, and reporting or a summary report. In this stage you will form your initial hypotheses
and assess their impact on the project.
• Verify Data Quality by examining whether the data is complete and whether it contains errors or missing
values. If values are missing, determine what percentage of the overall data they represent.
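The data quality check described above can be sketched in pandas. This is a minimal illustration, not part of the original notebook: the sample table and its column names (inspection_type, defects) are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "inspection_type": ["Plumbing", "Electrical", None, "Plumbing"],
        "defects": [3, None, None, 1],
    }
)

# Percentage of missing values in each column (missing rows / all rows).
missing_pct_by_column = df.isna().mean() * 100

# Percentage of missing cells across the whole table.
overall_missing_pct = df.isna().sum().sum() / df.size * 100

print(missing_pct_by_column)
print(overall_missing_pct)
```

Reporting both figures helps separate one badly populated column from a dataset that is sparse throughout.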
How do we collect data?
Data Acquisition
• Data acquisition techniques
• Load from the local filing system
• Download from the web
• API calls
• Web scraping
• IoT devices
• Data types
• Structured data
• Unstructured data / Semi-structured
• Data comes in various file formats
• SQL
• CSV
• JSON
• XML
• HTML
• …
How to obtain different format datasets for data science?
• Data extraction involves pulling data from different sources and converting it into
a useful format for further processing or analysis. It is the first step of the
Extract-Transform-Load (ETL) pipeline.
• As a data scientist, you might need to combine data that is available in multiple
file formats such as JSON, XML, CSV, SQL, and many more.
• We will use Python libraries such as pandas, json, and requests (for API calls) to
read data from different sources and load it into a Jupyter notebook as a
pandas DataFrame for further analysis.
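As a rough sketch of combining data from two file formats, the example below reads a small CSV and a small JSON payload into pandas and joins them. The inline data and the column names (year, inspections, defects) are made up for illustration; in practice the text would come from files or web responses.

```python
import io
import json

import pandas as pd

# Hypothetical inline data standing in for a CSV file and a JSON file.
csv_text = "year,inspections\n2023,10\n2024,12\n"
json_text = '[{"year": 2023, "defects": 4}, {"year": 2024, "defects": 3}]'

# read_csv accepts any file-like object, so StringIO stands in for a file.
csv_df = pd.read_csv(io.StringIO(csv_text))

# Parse the JSON into a list of dicts, then build a DataFrame from it.
json_df = pd.DataFrame(json.loads(json_text))

# Combine both sources on their shared key column.
combined_df = csv_df.merge(json_df, on="year", how="inner")
print(combined_df)
```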
Data Understanding
Use Case: Buying a House in Winnipeg – Inspections and Defects
• When buyers put in an offer, there is the option to insert a home inspection clause.
• Typically, the buyer is responsible for bearing the cost of a home inspection, unless other
arrangements are made with the seller. You can expect to pay anywhere from $500 to $800 for a
home inspection, depending on the size, location and age of the home.
• Should you get a home inspection? An objective mind always answers “yes” but under certain
market conditions, some homebuyers may be tempted to bypass this critical step in the
purchasing process.
• Can we find a data-driven answer to this question?
Data Sources
https://data.Winnipeg.ca
Data Export Option 1 - CSV
Data Export Option 2 - API
Getting App Token
PIP Install New Package
Import Data to Pandas DataFrame
"<your api-token>"
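The import step can be sketched as follows. This is not the notebook's own code. It assumes the portal exposes a Socrata-style SODA endpoint (https://<domain>/resource/<dataset-id>.json) that accepts the app token in an X-App-Token header. The helper names and the dataset id are hypothetical placeholders, and the standard-library urllib is used here in place of whichever package the slides install.

```python
import json
import urllib.parse
import urllib.request

import pandas as pd


def build_soda_url(domain, dataset_id, limit=1000):
    """Build a SODA-style JSON endpoint URL (assumed URL layout)."""
    query = urllib.parse.urlencode({"$limit": limit}, safe="$")
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


def fetch_records(url, app_token):
    """Download the JSON records, sending the app token as a header."""
    request = urllib.request.Request(url, headers={"X-App-Token": app_token})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def records_to_dataframe(records):
    """Turn a list of JSON records (dicts) into a pandas DataFrame."""
    return pd.DataFrame.from_records(records)


# Usage (placeholders must be replaced with your own values):
# url = build_soda_url("data.winnipeg.ca", "<dataset-id>")
# wpg_inspect_defects_df = records_to_dataframe(
#     fetch_records(url, "<your api-token>")
# )
```

JSON endpoints of this kind commonly return values as strings, so numeric columns usually need converting after loading.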
Inspect Data
• wpg_inspect_defects_df.head()
• wpg_inspect_defects_df.tail()
• wpg_inspect_defects_df.info()
Convert Necessary Columns to Numeric
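A minimal sketch of the conversion, assuming the loaded columns arrive as strings. The sample values and column names are hypothetical. errors="coerce" is one possible choice: it turns unparseable entries into NaN instead of raising.

```python
import pandas as pd

# Hypothetical frame where numeric fields were loaded as strings.
df = pd.DataFrame(
    {
        "year": ["2015", "2024"],
        "defects": ["12", "n/a"],
        "inspection_type": ["Plumbing", "Electrical"],
    }
)

# Columns that should hold numbers (an assumed list for illustration).
numeric_columns = ["year", "defects"]

for column in numeric_columns:
    # Unparseable values such as "n/a" become NaN instead of raising.
    df[column] = pd.to_numeric(df[column], errors="coerce")

df.info()
```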
• 2015 - 2024
Drop Unnecessary Column(s)
• wpg_inspect_defects_df.drop('location', axis=1, inplace=True)
Pivot Data by Inspection Type
• wpg_inspect_defects_df_pv.shape: (164, 6)
• wpg_inspect_defects_df.shape: (525, 6)
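One way to pivot long-format records so that each inspection type becomes its own column is pandas' pivot_table. The sketch below uses made-up data. The index column (year), the value column (defects), and the sum aggregation are assumptions for illustration and may differ from the notebook.

```python
import pandas as pd

# Hypothetical long-format data: one row per (year, inspection type) record.
long_df = pd.DataFrame(
    {
        "year": [2023, 2023, 2024, 2024, 2024],
        "inspection_type": [
            "Plumbing", "Electrical", "Plumbing", "Electrical", "Electrical",
        ],
        "defects": [2, 1, 4, 3, 2],
    }
)

# One row per year, one column per inspection type, summing the defects.
pivot_df = long_df.pivot_table(
    index="year",
    columns="inspection_type",
    values="defects",
    aggfunc="sum",
    fill_value=0,
)

print(long_df.shape)   # rows before pivoting
print(pivot_df.shape)  # fewer rows after pivoting
```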
Reset Index
Feature Engineering
• Total Number of Inspections
• Total Number of Defects
• Defect Ratio
• Drop Unnecessary Features
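The reset-index and feature-engineering steps might look like the sketch below. All column names and values are hypothetical, and the defect ratio is assumed to mean total defects divided by total inspections.

```python
import pandas as pd

# Hypothetical pivoted frame indexed by a grouping key ("area").
pivot_df = pd.DataFrame(
    {
        "inspections_initial": [10, 20],
        "inspections_followup": [5, 0],
        "defects_initial": [3, 5],
        "defects_followup": [0, 0],
    },
    index=pd.Index(["Area A", "Area B"], name="area"),
)

# Reset Index: move the grouping key back into an ordinary column.
features_df = pivot_df.reset_index()

inspection_cols = ["inspections_initial", "inspections_followup"]
defect_cols = ["defects_initial", "defects_followup"]

# Total number of inspections and total number of defects per row.
features_df["total_inspections"] = features_df[inspection_cols].sum(axis=1)
features_df["total_defects"] = features_df[defect_cols].sum(axis=1)

# Defect ratio (assumed definition): defects per inspection.
features_df["defect_ratio"] = (
    features_df["total_defects"] / features_df["total_inspections"]
)

# Drop the intermediate columns that are no longer needed.
features_df = features_df.drop(columns=inspection_cols + defect_cols)
print(features_df)
```

If a row can have zero inspections, the division produces NaN or inf, so such rows need handling before the ratio is used.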
Visual Data Inspection and Observations
164x3
Data Exploration
• What sort of hypotheses have you formed about the data?
• Which attributes seem promising for further analysis?
• Have your explorations revealed new characteristics about the data?
• How have these explorations changed your initial hypothesis?
• Can you identify particular subsets of data for later use?
• Take another look at your data mining goals. Has this exploration altered the goals?
Question?
Run and explore the notebook
Next topic: Collecting data from social media