FOUNDATIONS OF DATA SCIENCE
DIT 54102
Muhammad Shahin

Data Understanding
• Collect Initial Data: acquire the data, and access to the data, listed in the project's resources. Keep a checklist of the datasets you have acquired, their locations, and the methods used to acquire them, and record any problems encountered and their solutions so that other users and project members are aware of them.
• Describe Data: examine the properties of the acquired data and write a description report covering the format of the data, the quantity of data, and the records and fields in each table or dataset.
• Explore Data: ask data science questions that can be answered quickly through querying, visualization, and summary reporting. At this stage you will form your initial hypotheses and assess their impact on the project.
• Verify Data Quality: check whether the data is complete. Does it contain errors or missing values? If so, what percentage of the overall data obtained is missing?

How do we collect data?

Data Acquisition
• Data acquisition techniques
  • Load from the local file system
  • Download from the web
  • API calls
  • Web scraping
  • IoT devices
• Data types
  • Structured data
  • Unstructured / semi-structured data
• Data comes in various file formats
  • SQL
  • CSV
  • JSON
  • XML
  • HTML
  • …

How to obtain datasets in different formats for data science?
• Data extraction involves pulling data from different sources and converting it into a useful format for further processing or analysis. It is the first step of the Extract-Transform-Load (ETL) pipeline.
• As a data scientist, you might need to combine data that is available in multiple file formats such as JSON, XML, CSV, SQL, and many more.
• We will use Python libraries such as pandas, json, and requests, along with API calls, to read data from different sources and load it into a Jupyter notebook as a pandas DataFrame for further analysis.

Data Understanding
Use Case: Buying a House in Winnipeg – Inspections and Defects
• When buyers put in an offer, they have the option to insert a home inspection clause.
• Typically, the buyer bears the cost of a home inspection, unless other arrangements are made with the seller. You can expect to pay anywhere from $500 to $800 for a home inspection, depending on the size, location, and age of the home.
• Should you get a home inspection? An objective mind always answers “yes”, but under certain market conditions some homebuyers may be tempted to bypass this critical step in the purchasing process.
• Can we find a data-driven answer to this question?

Data Sources
https://data.Winnipeg.ca

Data Export Option 1 - CSV

Data Export Option 2 - API

Getting App Token
<your email>

PIP Install New Package

Import Data to Pandas DataFrame
“<your api-token>”

Inspect Data
• wpg_inspect_defects_df.head()
• wpg_inspect_defects_df.tail()
• wpg_inspect_defects_df.info()
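As a minimal sketch of the load-and-inspect steps above, the cell below builds DataFrames from a few in-memory records instead of the live portal. The sample rows, the column names `month`, `inspection_type`, `inspections`, and `defects`, and the helper names `load_json_records` and `load_csv_text` are assumptions for illustration, not the real schema of the Winnipeg dataset. Only `wpg_inspect_defects_df`, the `location` column, and the `head()`, `tail()`, and `info()` calls come from the slides.

```python
import io
import json

import pandas as pd

# Hypothetical records shaped like rows a JSON API might return.
# Counts are deliberately stored as strings (an assumption for illustration).
SAMPLE_JSON = """
[
  {"month": "2023-01", "inspection_type": "Building", "inspections": "100", "defects": "10", "location": "n/a"},
  {"month": "2023-01", "inspection_type": "Plumbing", "inspections": "50", "defects": "5", "location": "n/a"},
  {"month": "2023-02", "inspection_type": "Building", "inspections": "80", "defects": "4", "location": "n/a"},
  {"month": "2023-02", "inspection_type": "Plumbing", "inspections": "20", "defects": "1", "location": "n/a"}
]
"""

# Hypothetical CSV export of the same kind of data.
SAMPLE_CSV = (
    "month,inspection_type,inspections,defects\n"
    "2023-01,Building,100,10\n"
    "2023-01,Plumbing,50,5\n"
)


def load_json_records(text: str) -> pd.DataFrame:
    """Parse a JSON array of records and load it into a DataFrame."""
    return pd.DataFrame(json.loads(text))


def load_csv_text(text: str) -> pd.DataFrame:
    """Parse CSV text and load it into a DataFrame."""
    return pd.read_csv(io.StringIO(text))


wpg_inspect_defects_df = load_json_records(SAMPLE_JSON)
csv_df = load_csv_text(SAMPLE_CSV)

# First look at the data: top rows, bottom rows, then column types and non-null counts.
print(wpg_inspect_defects_df.head())
print(wpg_inspect_defects_df.tail())
wpg_inspect_defects_df.info()
```

Because the counts in the JSON sample are strings, `info()` reports those columns as non-numeric, while `pd.read_csv` infers integers for the CSV sample. Spotting that kind of difference is what the inspection step is for.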
Inspect Data
• wpg_inspect_defects_df.info()

Convert Necessary Columns to Numeric

Data Sources
• 2015 - 2024

Drop Unnecessary Column(s)
• wpg_inspect_defects_df.drop('location', axis=1, inplace=True)

Pivot Data by Inspection Type
• wpg_inspect_defects_df_pv.shape: (164, 6)
• wpg_inspect_defects_df.shape: (525, 6)

Reset Index

Feature Engineering
• Total Number of Inspections
• Total Number of Defects
• Defect Ratio
• Drop Unnecessary Features

Visual Data Inspection and Observations
164x3

Data Exploration
• What sort of hypotheses have you formed about the data?
• Which attributes seem promising for further analysis?
• Have your explorations revealed new characteristics about the data?
• How have these explorations changed your initial hypothesis?
• Can you identify particular subsets of data for later use?
• Take another look at your data mining goals. Has this exploration altered the goals?

Question?
Run and explore the notebook
Next topic: Collecting data from social media
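While exploring the notebook, the preparation steps listed in this deck (convert to numeric, drop a column, pivot by inspection type, reset the index, engineer features) can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the sample rows and the column names `month`, `inspection_type`, `inspections`, and `defects` are hypothetical, and the real pivot may be keyed differently. Only the DataFrame names, the `location` column, and the feature names come from the slides.

```python
import pandas as pd

# Hypothetical long-format data; counts arrive as strings (an assumption).
wpg_inspect_defects_df = pd.DataFrame({
    "month": ["2023-01", "2023-01", "2023-02", "2023-02"],
    "inspection_type": ["Building", "Plumbing", "Building", "Plumbing"],
    "inspections": ["100", "50", "80", "20"],
    "defects": ["10", "5", "4", "1"],
    "location": ["n/a"] * 4,
})

# Convert necessary columns to numeric; unparseable values become NaN.
for col in ["inspections", "defects"]:
    wpg_inspect_defects_df[col] = pd.to_numeric(wpg_inspect_defects_df[col], errors="coerce")

# Drop the unnecessary column.
wpg_inspect_defects_df = wpg_inspect_defects_df.drop(columns=["location"])

# Pivot by inspection type: one row per month, one column per (measure, type) pair.
wpg_inspect_defects_df_pv = wpg_inspect_defects_df.pivot_table(
    index="month",
    columns="inspection_type",
    values=["inspections", "defects"],
    aggfunc="sum",
)

# Flatten the two-level column names, then reset the index so month is a column.
wpg_inspect_defects_df_pv.columns = [
    f"{measure}_{kind}" for measure, kind in wpg_inspect_defects_df_pv.columns
]
wpg_inspect_defects_df_pv = wpg_inspect_defects_df_pv.reset_index()

# Feature engineering: totals across inspection types and the defect ratio.
inspection_cols = [c for c in wpg_inspect_defects_df_pv.columns if c.startswith("inspections_")]
defect_cols = [c for c in wpg_inspect_defects_df_pv.columns if c.startswith("defects_")]
wpg_inspect_defects_df_pv["total_inspections"] = wpg_inspect_defects_df_pv[inspection_cols].sum(axis=1)
wpg_inspect_defects_df_pv["total_defects"] = wpg_inspect_defects_df_pv[defect_cols].sum(axis=1)
wpg_inspect_defects_df_pv["defect_ratio"] = (
    wpg_inspect_defects_df_pv["total_defects"] / wpg_inspect_defects_df_pv["total_inspections"]
)

# Drop the per-type columns that are no longer needed.
features_df = wpg_inspect_defects_df_pv.drop(columns=inspection_cols + defect_cols)
print(features_df)
```

The defect ratio here is total defects divided by total inspections per row, which is one plausible reading of "Defect Ratio" on the Feature Engineering slide.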