Data Analysis: Outliers, Pandas Joining, & Statistics

MATH
● import math
● math.ceil(7.4)   # rounds up to 8
● math.floor(7.4)  # rounds down to 7
● math.pi          # 3.141592653589793

BOX PLOT

PERCENTAGE VS PERCENTILE
● Percentage means "out of 100". For example, if you scored 80% (80 percent) in an exam, you earned 80 out of every 100 available marks.
● Percentile tells you where you rank: if all students are sorted by score and divided into 100 equal groups, it indicates which group you fall in.
● If your result says you got 90 percent, you earned 90% of the total marks, that is, 90 out of 100.
● If the result says 90th percentile, your score is not necessarily 90. It means you scored better than 90% of the applicants. You may have scored 70, but 90% of the applicants scored below that.

OUTLIERS
● Outliers are values that are very large or very small relative to the distribution of the rest of the data, that is, extreme values outside the expected range and unlike the other data.
● Outliers are sometimes erroneous values, but they can also be genuine extreme observations.
● The methods used here find outliers in numerical data only.
● Box plots are one good way to find the outliers in a dataset.

FINDING AND REMOVING OUTLIERS IN DATA
● Load the german_credit_data.csv dataset.
● The dataset contains 1,000 entries with 20 attributes (described as categorical/symbolic, although columns such as Age are numerical), prepared by Prof. Hofmann. Each entry represents a person who takes credit from a bank, and each person is classified as a good or bad credit risk according to the set of attributes.
● The german_credit_data.csv dataset can be found here: https://raw.githubusercontent.com/datawizardsai/Data-Science/master/german_credit_data.csv
● import pandas as pd
● dataset = 'https://raw.githubusercontent.com/datawizardsai/Data-Science/master/german_credit_data.csv'
● # read the data into the DataFrame df
● df = pd.read_csv(dataset)
● This dataset contains an Age column.
● Plot a boxplot of the Age column.
● import seaborn as sbn
● sbn.boxplot(x=df['Age'])
● The preceding code generates a box plot of the Age column.
● We can see that some data points are outliers in the boxplot.
● The boxplot uses the IQR (interquartile range) method to display the data and the outliers (the shape of the data).
● To list the outliers themselves, we compute the same IQR-based limits in code.
● Find the IQR of the Age column:
● Q1 = df["Age"].quantile(0.25)
● Q3 = df["Age"].quantile(0.75)
● IQR = Q3 - Q1
● print(IQR)
● >>> 15.0
● In the preceding code, Q1 is the first quartile and Q3 is the third quartile.
● Find the lower fence and the upper fence:
● Lower_Fence = Q1 - (1.5 * IQR)
● Upper_Fence = Q3 + (1.5 * IQR)
● print(Lower_Fence)
● print(Upper_Fence)
● >>> 4.5
● >>> 64.5
● Print all the data above the upper fence and below the lower fence:
● df[(df["Age"] < Lower_Fence) | (df["Age"] > Upper_Fence)]
● The preceding code returns the outlier rows based on the Age column.
● To remove the outliers, negate the preceding condition with the ~ operator and keep the remaining rows:
● df = df[~((df["Age"] < Lower_Fence) | (df["Age"] > Upper_Fence))]
● df

● Dataset: https://raw.githubusercontent.com/datawizardsai/Data-Science/master/Mall_Customers.csv
● Detect and remove outliers from the columns Annual Income and Spending Score.

JOINING IN PANDAS
● To keep it simple, let's assume we have data in CSV format in different places, all describing the same scenario. Say we have some data about an employee in a database. We can't expect all the data about the employee to reside in the same table. It's possible that the employee's personal data will be located in one table, the employee's project history in a second table, the employee's time-in and time-out details in another table, and so on. So, if we want to analyze the employee, we need to bring all the employee data into one common place.
● This process of bringing data together in one place is called data integration.
● To do data integration, we can merge multiple pandas DataFrames using the merge function.
● Merge the details of students from two datasets, namely student.csv and mark.csv. The student.csv dataset contains columns such as Age, Gender, Grade, and Employed. The mark.csv dataset contains columns such as Mark and City. The Student_id column is common to the two datasets.
● The student.csv dataset can be found at this location: https://raw.githubusercontent.com/datawizardsai/Data-Science/master/student.csv
● The mark.csv dataset can be found at this location: https://raw.githubusercontent.com/datawizardsai/Data-Science/master/mark.csv
● import pandas as pd
● dataset1 = "https://raw.githubusercontent.com/datawizardsai/Data-Science/master/student.csv"
● dataset2 = "https://raw.githubusercontent.com/datawizardsai/Data-Science/master/mark.csv"
● df1 = pd.read_csv(dataset1)
● df2 = pd.read_csv(dataset2)
● df = pd.merge(df1, df2, on='Student_id')
● df.head(10)
● df = pd.merge(df1, df2, on='Student_id', how='inner')
● how='inner' is the default, so this call returns the same result as the first one.

WHAT IS DATA?
• Data are characteristics or information, usually numerical, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects.
• Data are facts and statistics collected together for reference or analysis.
• Data is a collection of facts, such as numbers, words, measurements, observations, or just descriptions of things.
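JOINING IN PANDAS: JOIN TYPES
The merge calls in the JOINING IN PANDAS slides pass how='inner' without showing what the other options do. The following is a minimal sketch of the difference, assuming two small made-up DataFrames: students and marks are hypothetical stand-ins with invented values, not the contents of the real student and marks files. Only the Student_id, Age, and Mark column names are taken from the slides.

```python
import pandas as pd

# Hypothetical stand-ins for the student and marks datasets (values are made up).
students = pd.DataFrame({
    "Student_id": [1, 2, 3],
    "Age": [19, 20, 21],
})
marks = pd.DataFrame({
    "Student_id": [2, 3, 4],
    "Mark": [85, 72, 90],
})

# how="inner" (the default) keeps only Student_id values present in both frames.
inner = pd.merge(students, marks, on="Student_id", how="inner")

# how="left" keeps every row of the left frame; a student without a mark gets NaN.
left = pd.merge(students, marks, on="Student_id", how="left")

# how="outer" keeps ids from either frame; indicator=True adds a _merge column
# recording whether each row came from the left frame, the right frame, or both.
outer = pd.merge(students, marks, on="Student_id", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```

An inner join silently drops rows that have no match, so comparing row counts before and after a merge, or inspecting the _merge column of an outer join, is one way to check that nothing was lost unintentionally.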
MAKING SENSE OF DATA
● Different disciplines store different kinds of data for different purposes. For example, medical researchers store patients' data, universities store students' and teachers' data, and real estate industries store house and building datasets.
● For example, patient datasets are stored in hospitals and are made available for analysis. Most of this data is stored in some sort of database management system in tables/schemas. An example of a table for storing patient information is shown.
● A dataset contains many observations about a particular object.
● For instance, a dataset about patients in a hospital can contain many observations. A patient can be described by a patient identifier (ID), name, address, weight, date of birth, email, and gender.
● Each of these features that describes a patient is a variable. Each observation can have a specific value for each of these variables.

TYPES OF DATA THAT EXIST IN REAL WORLD
● Numerical
● Categorical

INDEPENDENT AND DEPENDENT VARIABLES
● There are independent variables and dependent variables. In machine learning, independent variables are used to predict the dependent variable.
● An independent variable is the variable you think is the cause, while a dependent variable is the effect.

Honda Civic    1800    10000
BMW X5         3000    30000

VARIABLE TYPES
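As a rough illustration of the numerical/categorical split, the following sketch separates the columns of a small patient table by dtype. The DataFrame, its column names, and its values are made up for illustration and do not come from any dataset in these slides; select_dtypes is a standard pandas method.

```python
import pandas as pd

# Hypothetical patient records; column names and values are illustrative only.
patients = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "gender": ["F", "M", "F"],
    "weight_kg": [61.5, 82.0, 70.3],
})

# An identifier is stored as a number but carries no numeric meaning,
# so this sketch converts it to text and treats it as a label.
patients["patient_id"] = patients["patient_id"].astype(str)

# Split the column names by dtype: numeric columns vs. everything else.
numerical = patients.select_dtypes(include="number").columns.tolist()
categorical = patients.select_dtypes(exclude="number").columns.tolist()

print(numerical)
print(categorical)
```

The dtype is only a first guess at the variable type: a column of numeric codes may really be categorical, which is why the identifier is converted by hand here.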
