Uploaded by gurhangnl

Python - Misc Topics

MATH
● import math
● math.ceil(7.4)
● math.floor(7.4)
● math.pi
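As a quick illustration of the calls above, here is a minimal sketch. The variable names are only for illustration and are not from the slides.

```python
import math

value = 7.4

rounded_up = math.ceil(value)     # smallest integer >= 7.4
rounded_down = math.floor(value)  # largest integer <= 7.4

print(rounded_up)    # 8
print(rounded_down)  # 7
print(math.pi)       # the constant pi as a float

# math.pi can be used like any other number, e.g. the area of a circle of radius 2
circle_area = math.pi * 2 ** 2
print(circle_area)
```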
BOX PLOT
PERCENTAGE VS PERCENTILE
● Percentage means out of 100. For example, if you scored 80% (80 percent) in
an exam, you earned 80 out of every 100 available marks.
● Percentile describes rank: if all students are ordered by score and divided into
100 equal groups, it tells you which group you belong to.
● If your result says you have got 90 percent, then it means 90% of the total
marks. So 90 out of 100.
● But if the result says 90 percentile, then that doesn't mean your score is 90. It
means your score is better than 90% of the applicants. You may have scored
70, but 90% of the applicants are below that score.
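The difference can be sketched with a small made-up list of scores. The scores and the simple "share of applicants strictly below" definition of percentile rank are assumptions for illustration; other percentile definitions exist.

```python
# Hypothetical exam scores of 10 applicants, out of 100 marks (illustrative data)
scores = [35, 42, 50, 55, 58, 61, 64, 66, 68, 70]
max_marks = 100
my_score = 70

# Percentage: my marks relative to the total marks available
percentage = my_score * 100 / max_marks

# Percentile rank (one simple definition): share of applicants scoring below me
below = sum(1 for s in scores if s < my_score)
percentile_rank = below * 100 / len(scores)

print(percentage)       # 70.0
print(percentile_rank)  # 90.0
```

Here a score of 70 is only 70 percent, yet it sits at the 90th percentile because 9 of the 10 applicants scored lower.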
OUTLIERS
● Outliers are values that are very large or very small with respect to the
distribution of the other data, so they fall outside the range of what is expected.
An outlier may be an erroneous value (for example, a data-entry mistake), but it can
also be a genuine extreme observation. The method shown here applies to numerical
data. Box plots are one good way to find the outliers in a dataset.
FINDING AND REMOVING OUTLIERS IN DATA
● Load the german_credit_data.csv dataset.
● The dataset contains 1,000 entries with 20 categorical/symbolic attributes
prepared by Prof. Hofmann. In this dataset, each entry represents a person
who takes credit from a bank. Each person is classified as a good or
bad credit risk according to the set of attributes.
● The link to the german_credit_data.csv dataset can be found here:
https://raw.githubusercontent.com/datawizardsai/Data-Science/master/german_credit_data.csv
● import pandas as pd
● dataset = 'https://raw.githubusercontent.com/datawizardsai/Data-Science/master/german_credit_data.csv'
● # read the CSV file into the DataFrame df
● df = pd.read_csv(dataset)
● This dataset contains an Age column.
● Plot a boxplot of the Age column using seaborn (imported here as sbn).
● import seaborn as sbn
● sbn.boxplot(x=df['Age'])
● The preceding code generates the following output: A box plot of the
Age column
● We can see that some data points are outliers in the boxplot.
● The boxplot draws its whiskers and outlier points using the IQR (interquartile range) method, so it shows
the shape of the data and the outliers visually.
● To list the outlier rows themselves, we apply the same IQR formula in code.
● Find the outliers of the Age column using the IQR method:
● Q1 = df["Age"].quantile(0.25)
● Q3 = df["Age"].quantile(0.75)
● IQR = Q3 - Q1
● print(IQR)
● >>> 15.0
● In the preceding code, Q1 is the first quartile, Q3 is the third quartile, and IQR is the
interquartile range (the spread of the middle 50% of the data).
● Compute the lower fence and upper fence by adding the following code. Values below
the lower fence or above the upper fence are treated as outliers.
● Lower_Fence = Q1 - (1.5 * IQR)
● Upper_Fence = Q3 + (1.5 * IQR)
● print(Lower_Fence)
● print(Upper_Fence)
● >>> 4.5
● >>> 64.5
● Print all the data above the upper fence and below the lower fence.
● df[(df["Age"] < Lower_Fence) | (df["Age"] > Upper_Fence)]
● The preceding code generates the following output: Outlier data based on the
Age column
● To remove the outliers, negate the preceding condition with the ~ operator and keep
only the rows that remain.
● df = df[~((df["Age"] < Lower_Fence) | (df["Age"] > Upper_Fence))]
● df
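The steps above can be collected into one small, self-contained sketch. It uses a tiny made-up Age column instead of the real dataset so it runs without a download. The helper name iqr_fences and the sample values are assumptions for illustration.

```python
import pandas as pd

# Small illustrative dataset; 90 is deliberately far from the other ages
ages = pd.DataFrame({"Age": [22, 25, 27, 30, 33, 35, 38, 42, 45, 90]})


def iqr_fences(series, k=1.5):
    """Return (lower_fence, upper_fence) for a numeric Series using the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower_fence, upper_fence = iqr_fences(ages["Age"])

# Boolean mask: True where the value lies outside the fences
is_outlier = (ages["Age"] < lower_fence) | (ages["Age"] > upper_fence)

outliers = ages[is_outlier]   # rows flagged as outliers
cleaned = ages[~is_outlier]   # same data with the outliers removed

print(outliers)
print(len(cleaned))
```

The multiplier 1.5 is the conventional choice for box-plot fences; a larger k flags fewer points.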
FINDING AND REMOVING OUTLIERS IN DATA
● Dataset: https://raw.githubusercontent.com/datawizardsai/Data-Science/master/Mall_Customers.csv
● Detect and remove outliers from the columns Annual Income and Spending
Score
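One possible approach to this exercise is to apply the same IQR rule to each column and keep only the rows that pass in every column. The sketch below uses made-up values, and the column names "Annual Income" and "Spending Score" are assumptions: check df.columns on the real file, since the actual headers may differ.

```python
import pandas as pd

# Illustrative stand-in for the real file; column names are assumed
customers = pd.DataFrame({
    "Annual Income": [15, 16, 17, 18, 19, 20, 21, 22, 23, 137],
    "Spending Score": [40, 45, 50, 55, 60, 42, 48, 52, 58, 50],
})

columns_to_check = ["Annual Income", "Spending Score"]

# Start by keeping every row, then drop rows that are outliers in any column
keep = pd.Series(True, index=customers.index)
for column in columns_to_check:
    q1 = customers[column].quantile(0.25)
    q3 = customers[column].quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # between() is inclusive, so values exactly on a fence are kept
    keep &= customers[column].between(lower_fence, upper_fence)

cleaned_customers = customers[keep]
print(cleaned_customers)
```

Computing all fences on the original data before filtering avoids the fences shifting as rows are removed one column at a time.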
JOINING IN PANDAS
● To make it simple, let's assume we have data in CSV format in different places, all talking about the
same scenario. Say we have some data about an employee in a database. We can't expect all the
data about the employee to reside in the same table. It's possible that the employee's personal data
will be located in one table, the employee's project history will be in a second table, the employee's
time-in and time-out details will be in another table, and so on. So, if we want to do some analysis
about the employee, we need to get all the employee data in one common place.
● This process of bringing data together in one place is called data integration.
● To do data integration, we can merge multiple pandas DataFrames using the merge function.
● Merge the details of students from two datasets, namely student.csv and marks.csv.
The student dataset contains columns such as Age, Gender, Grade, and Employed.
The marks.csv dataset contains columns such as Mark and City. The Student_id column is
common between the two datasets.
● The student.csv dataset can be found at this location:
https://raw.githubusercontent.com/datawizardsai/Data-Science/master/student.csv
● The marks.csv dataset can be found at this
location: https://raw.githubusercontent.com/datawizardsai/Data-Science/master/mark.csv
● import pandas as pd
● dataset1 = "https://raw.githubusercontent.com/datawizardsai/Data-Science/master/student.csv"
● dataset2 = "https://raw.githubusercontent.com/datawizardsai/Data-Science/master/mark.csv"
● df1 = pd.read_csv(dataset1)
● df2 = pd.read_csv(dataset2)
● df = pd.merge(df1, df2, on = 'Student_id')
● df.head(10)
● # how='inner' is the default, so this is equivalent to the merge above
● df = pd.merge(df1, df2, on='Student_id', how='inner')
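To see what the how parameter changes, here is a minimal sketch with two tiny made-up DataFrames in place of the CSV files. The values are illustrative; only the Student_id key and the Age and Mark column names come from the description above.

```python
import pandas as pd

# Student 1 has no mark; student 4 has a mark but no student record
students = pd.DataFrame({
    "Student_id": [1, 2, 3],
    "Age": [19, 20, 18],
})
marks = pd.DataFrame({
    "Student_id": [2, 3, 4],
    "Mark": [85, 72, 90],
})

# inner: only ids present in both tables
inner = pd.merge(students, marks, on="Student_id", how="inner")

# left: every row of students; Mark is NaN where no match exists
left = pd.merge(students, marks, on="Student_id", how="left")

# outer: every id from either table
outer = pd.merge(students, marks, on="Student_id", how="outer")

print(inner)
print(left)
print(outer)
```

An inner join silently drops unmatched rows, so comparing row counts before and after the merge is a cheap sanity check.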
WHAT IS DATA?
• Data are characteristics or information, usually numerical, that are collected through
observation. In a more technical sense, data are a set of values of qualitative or
quantitative variables about one or more persons or objects.
• Data are facts and statistics collected together for reference or analysis.
• Data is a collection of facts, such as numbers, words, measurements, observations or
just descriptions of things.
MAKING SENSE OF DATA
● Different disciplines store different kinds of data for different purposes. For example, medical researchers
store patients' data, universities store students' and teachers' data, and real estate industries store house
and building datasets.
● Such datasets are held by the organizations that collect them (a hospital, for example) and are made
available for analysis. Most of this data is stored in some sort of database management system as
tables/schemas. An example of a table for storing patient information is shown.
● A dataset contains many observations about a particular object.
● For instance, a dataset about patients in a hospital can contain many observations. A patient can be
described by a patient identifier (ID), name, address, weight, date of birth, email, and gender.
● Each of these features that describes a patient is a variable. Each observation can have a specific value for
each of these variables.
TYPES OF DATA THAT EXIST IN REAL WORLD
● Numerical
● Categorical
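As a rough sketch of the two types, pandas can separate columns by dtype. The patient table below is made up for illustration. Note that dtype is only a proxy: a numeric code such as an ID is stored as a number but is not a numerical variable in the statistical sense.

```python
import pandas as pd

# Illustrative patient records (hypothetical values)
patients = pd.DataFrame({
    "weight_kg": [70.5, 82.0, 64.3],
    "age": [34, 51, 28],
    "gender": ["F", "M", "F"],
    "blood_group": ["A", "O", "B"],
})

# Columns holding numbers (int/float) are treated as numerical
numerical_columns = patients.select_dtypes(include="number").columns.tolist()

# Everything else (here: text labels) is treated as categorical
categorical_columns = patients.select_dtypes(exclude="number").columns.tolist()

print(numerical_columns)    # ['weight_kg', 'age']
print(categorical_columns)  # ['gender', 'blood_group']
```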
INDEPENDENT AND DEPENDENT VARIABLES
● There are independent variables and dependent variables. In machine learning,
independent variables are used to predict the dependent variable.
● An independent variable is the variable you think is the cause, while
a dependent variable is the effect.
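A minimal sketch of this idea, using made-up data in which hours studied (independent) is assumed to drive the exam score (dependent). The column names, the values, and the choice of a straight-line fit with NumPy are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the score rises by 2 for every extra hour studied
data = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 54, 56, 58, 60],
})

X = data["hours_studied"]  # independent variable (the assumed cause)
y = data["exam_score"]     # dependent variable (the effect to predict)

# Fit a straight line y = slope * X + intercept
slope, intercept = np.polyfit(X, y, 1)

# Use the independent variable to predict the dependent one
predicted_score = slope * 6 + intercept
print(round(predicted_score, 2))
```

By convention the independent variables are often called X (features) and the dependent variable y (target).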
Civic    1800    Honda    10000
BMW      3000    X5       30000
VARIABLE TYPES