Uploaded by jo man

PythonForData

In Python, the main built-in data structures are lists, tuples, sets, and dictionaries.
Lists
Lists are ordered collections of items. The order in which elements are added is represented by indices, similar to arrays in C++. Unlike fixed-size C++ arrays, lists can grow and shrink: they are mutable, meaning we can add, remove, or modify elements. Lists are defined using square brackets:
my_list = [1, 2, 3, 4]
To access a certain element, we type the name of the list followed by its index in square brackets (indices start at 0):
my_list[0], my_list[1], my_list[2], etc.
Common operations on lists:
Accessing elements: my_list[index]
Modifying elements: my_list[index] = new_value
Adding elements: my_list.append(item) or my_list.insert(index, item)
Removing elements: my_list.remove(item) or del my_list[index]
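The common list operations can be tried together in one short session. This is a minimal sketch; the values are arbitrary and chosen only for illustration.

```python
# A small list to experiment with (values are arbitrary)
my_list = [1, 2, 3, 4]

first = my_list[0]        # accessing: index 0 is the first element
my_list[1] = 20           # modifying: replace the element at index 1
my_list.append(5)         # adding: put 5 at the end
my_list.insert(0, 0)      # adding: put 0 at index 0, shifting the rest right
my_list.remove(3)         # removing: delete the first element equal to 3
del my_list[-1]           # removing: delete the last element by index

print(first)    # 1
print(my_list)  # [0, 1, 20, 4]
```

Note that remove() takes a value while del takes an index, so the two are not interchangeable.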
Tuples
Tuples are similar to lists, except they are immutable (they cannot be changed once defined) and are defined with parentheses:
my_tuple = (1, 2, 3, 4)
The way to access the elements of a tuple is the same as for lists:
my_tuple[index]
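A quick way to see what immutability means in practice is to try to change a tuple. The sketch below reads an element and then attempts an assignment, which Python rejects with a TypeError; the variable names second and changed are made up for this example.

```python
my_tuple = (1, 2, 3, 4)

second = my_tuple[1]      # reading by index works exactly as with lists

# Assigning to an index raises TypeError because tuples are immutable
try:
    my_tuple[0] = 99
    changed = True
except TypeError:
    changed = False

print(second)   # 2
print(changed)  # False
```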
Sets
Sets are unordered collections of objects. Since a set is unordered, indices do not apply. All elements of a set are unique: duplicates written in a set definition are dropped, and adding an element that is already present has no effect. Sets are defined using curly braces:
my_set={1,2,3,4}
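The uniqueness rule is easiest to see by trying to add a value twice. This is a small sketch with arbitrary values; from_duplicates is a name invented for the example.

```python
my_set = {1, 2, 3, 4}

my_set.add(2)             # 2 is already present, so nothing changes
my_set.add(5)             # 5 is new, so the set grows
from_duplicates = {1, 1, 2, 2, 3}   # duplicates collapse when the set is built

print(len(my_set))                    # 5
print(3 in my_set)                    # True (membership test)
print(from_duplicates == {1, 2, 3})   # True
```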
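Dictionaries
Dictionaries store key-value pairs. Like sets, they are defined with curly braces and each key is unique. The difference is that the values of a dictionary are accessed using their keys rather than positions. (Since Python 3.7, dictionaries also remember the order in which keys were inserted.)
my_dict = {"name": "John", "age": 30, "city": "New York"}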
Common operations on dictionaries:
Accessing values: my_dict[key] (in our case, my_dict["name"] returns "John")
Modifying values: my_dict[key] = new_value
Adding new key-value pairs: my_dict[new_key] = new_value
Removing key-value pairs: del my_dict[key]
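The dictionary operations above can be run in sequence as follows. This is a sketch only: the "country" entry is a made-up value, and the get() call at the end is an extra shown because it avoids an error when a key is missing.

```python
my_dict = {"name": "John", "age": 30, "city": "New York"}

name = my_dict["name"]            # accessing a value by its key
my_dict["age"] = 31               # modifying an existing value
my_dict["country"] = "USA"        # adding a new key-value pair (made-up value)
del my_dict["city"]               # removing a key-value pair

# get() returns a default instead of raising KeyError for a missing key
city = my_dict.get("city", "unknown")

print(name)      # John
print(my_dict)   # {'name': 'John', 'age': 31, 'country': 'USA'}
print(city)      # unknown
```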
Modules
Modules play a role similar to header files and libraries in C++: they make code written elsewhere available to our program. The keyword here is import instead of #include.
import pandas as pd
We could have just typed import pandas, but giving it the alias pd allows for less typing when using the module, e.g. pd.function() instead of pandas.function(). The following are common data analysis and visualization modules, with their usual aliases in parentheses:
numpy (np) - provides arrays and mathematical functions and operations on them for scientific computing
scipy.stats (st) - provides statistical functions
pandas (pd) - provides data frames (tabular data structures) and statistical functions
matplotlib.pyplot (plt) - data visualization (plots data)
scikit-learn (imported as sklearn) - machine learning
seaborn (sns) - data visualization
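An alias is only a second name for the same module. The sketch below demonstrates this with statistics, a module from Python's standard library, so that it runs without installing anything; the same pattern applies to the third-party modules listed above. The numbers are made up.

```python
# 'statistics' ships with Python, so this runs without installing anything
import statistics
import statistics as st

values = [2, 4, 4, 5, 10]    # made-up numbers for the demonstration

# Both names refer to the same module object; the alias only shortens the name
same_module = st is statistics
average = st.mean(values)    # identical to statistics.mean(values)

print(same_module)  # True
print(average)
```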
Using modules to plot data
plt gives us functions to plot data. It is usually used in conjunction with pandas and numpy. Here is an example of using these modules to read, plot, and show data.
# imports the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
# loads the unemployment dataset, creates "DataFrame"
unemployment = pd.read_csv('http://data-analytics.zybooks.com/unemployment.csv')
# title
plt.title('U.S. unemployment rate', fontsize = 20)
# x and y axis labels
plt.xlabel('Year')
plt.ylabel('% of total labor force')
# plot data values read from data set
plt.plot(unemployment["Year"], unemployment["Value"])
# saves the image
plt.savefig("unemployment.png")
# shows the image
plt.show()
Data Frames
A data frame is a two-dimensional tabular data structure provided by pandas. The main components of a data frame are the index, columns, and values. The index is the set of row labels, the columns are the column labels, and the values are the data contained in these rows and columns. These values can be of any data type. A common way to get data into a data frame is to load a file using pandas.
# imports the pandas library
import pandas as pd
# loads a file containing comma-separated values and assigns
# the data frame to the variable data_frame1
data_frame1 = pd.read_csv('file.csv')
# loads a text file where the values are separated by a space, with no column labels
data_frame2 = pd.read_csv('file.txt', sep=' ', header=None)
# loads an Excel file (the keyword is sheet_name, not sheetname)
data_frame3 = pd.read_excel('file.xlsx', sheet_name='Sheet1')
The main attributes of a data frame are:
axes - index and column labels
columns - column labels
dtypes - data types of values in each column
index - index labels
shape - ordered pair that gives the number of rows and columns
size - total number of values in the data frame
values - values in the data frame
The main methods for a data frame are:
describe() - returns summary statistics for numerical columns
head(), tail() - returns the first/last 5 rows of the data frame (by default)
min(), max() - returns the min/max values of a numerical column
mean(), median() - returns the mean/median of values in a numerical column
sample() - returns a random row (one by default)
std() - returns standard deviation of values in a numerical column
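The attributes and methods listed above can be tried on a tiny data frame built by hand. This is a minimal sketch: the column names and values are hypothetical and do not come from any data set used in these notes.

```python
import pandas as pd

# A tiny hand-made data frame (hypothetical values, not from any data set)
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara", "Dev"],
    "age": [22, 35, 28, 41],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

print(df.shape)            # (4, 3): 4 rows, 3 columns
print(df.size)             # 12 values in total
print(list(df.columns))    # ['name', 'age', 'fare']
print(df.dtypes)           # data type of each column

print(df.head(2))          # first 2 rows instead of the default 5
print(df["age"].mean())    # 31.5
print(df["age"].median())  # 31.5
print(df["fare"].max())    # 71.28
print(df.describe())       # summary statistics for the numerical columns
```

Attributes such as shape are written without parentheses, while methods such as head() are called with them.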
We will now build a data frame from a data set containing information about the passengers of the Titanic. This data set is available through the seaborn library. We import seaborn, store the data set in the variable "titanic", and then use titanic to manipulate the data and return the desired information.
import seaborn as sns
#loads the titanic data set, creates "DataFrame"
titanic=sns.load_dataset("titanic")
To find the number of rows and columns in the data set, we use titanic.shape, which returns the ordered pair (891, 15): 891 rows and 15 columns. To find the column names, we use the attribute titanic.columns, which outputs
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')
The attribute titanic.dtypes returns the data type of each column.
Subsetting data
We may retrieve parts of a data frame. Commonly, we do this by selecting column labels or selecting a range of rows and columns. To select one or more columns, we pass a list of column labels inside the indexing brackets:
titanic[["sex", "age"]]
This returns all values for these columns.
-If we want the first five rows of these selected columns
titanic[["sex", "age"]].head()
-or a custom range of rows, where a is the starting position and b is the position to stop before (row b itself is not included)
titanic[["sex", "age"]][a:b]
-we may return the rows where one column has a particular value
titanic[titanic.age==26]
-we may use comparison operators to return certain rows
titanic[titanic.age<=26]
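The same subsetting patterns can be checked on a small data frame that needs no download. The sketch below uses a hypothetical five-row stand-in with the column names sex, age, and fare; the values are made up. The last selection, which combines two conditions, goes one step beyond the examples above.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of passenger data
people = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female"],
    "age": [22, 38, 26, 35, 26],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

two_columns = people[["sex", "age"]]          # a list of labels selects columns
rows_1_to_2 = people[["sex", "age"]][1:3]     # rows at positions 1 and 2 (3 is excluded)
age_is_26 = people[people.age == 26]          # rows where age equals 26
age_le_26 = people[people.age <= 26]          # rows where age is at most 26

# Conditions can be combined with & (and) or | (or); each needs parentheses
young_women = people[(people.age <= 26) & (people.sex == "female")]

print(list(two_columns.columns))   # ['sex', 'age']
print(len(rows_1_to_2))            # 2
print(list(age_is_26.index))       # [2, 4]
print(list(age_le_26.index))       # [0, 2, 4]
print(list(young_women.index))     # [2, 4]
```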
Reshaping data
In "long form", each data value is presented in a separate row, and the variables that identify each value are presented as columns:
   StudentID  Subject  Score
0          1     Math     85
1          1  Science     78
2          2     Math     92
3          2  Science     89
"Wide form" presents each subject of the data (here, each student) in a single row, with each measured variable in its own column:
   StudentID  Math_Score  Science_Score
0          1          85             78
1          2          92             89
To reshape data from long to wide we use the pivot() function; the melt() function goes the other way, from wide to long. Here is an example of pivot():
import pandas as pd
#create data frame
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
   foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t
#pivot the data frame
df.pivot(index='foo', columns='bar', values='baz')
bar  A  B  C
foo
one  1  2  3
two  4  5  6
#reshape it again adding more values
df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
    baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   x  y  z
two   4  5  6   q  w  t
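The student table from the start of this section can be reshaped in both directions. The following is a sketch, not the only way to do it: pivot() produces the wide form and melt() restores the long form. The wide columns come out named Math and Science rather than Math_Score and Science_Score, because pivot() takes the names from the Subject values.

```python
import pandas as pd

# The long-form student table from the text
long_df = pd.DataFrame({
    "StudentID": [1, 1, 2, 2],
    "Subject": ["Math", "Science", "Math", "Science"],
    "Score": [85, 78, 92, 89],
})

# Long to wide: one row per student, one column per subject
wide_df = long_df.pivot(index="StudentID", columns="Subject", values="Score")
print(wide_df.loc[1, "Math"])      # 85
print(wide_df.loc[2, "Science"])   # 89

# Wide to long: melt() turns the subject columns back into rows
back_to_long = wide_df.reset_index().melt(
    id_vars="StudentID", var_name="Subject", value_name="Score"
)
print(back_to_long.shape)          # (4, 3)
```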
Scatter plots
We may graph scatter plots using seaborn. Here is an example
# loads the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# creates the data points from (numpy)
x = np.array([0, 5, 3, 4, 7, 8, 10])
y = np.array([5, 2, 5, 15, 27, 15, 31])
# plot
sns.regplot(x=x, y=y, ci=None)
# saves the image
plt.savefig("scatterplot.png")
# shows the image
plt.show()
We can also use regplot with data frames.
# loads the necessary modules
import matplotlib.pyplot as plt
import seaborn as sns
# loads the iris data set
df = sns.load_dataset("iris")
# title
plt.title('Petal length and width of iris', fontsize=20)
# plot
sns.regplot(x=df["petal_length"], y=df["petal_width"], ci=None, fit_reg=False)
# saves the image
plt.savefig("irisscatterplot.png")
# shows the image
plt.show()
The first parameter in regplot is the data for the x axis and the next is the data for the y axis. ci sets the size of the confidence interval drawn around the trend line (None draws no interval), and fit_reg controls whether the trend line is shown: True (the default) shows it, False hides it.
Strip/swarm Plots
A strip plot is a scatter plot where a categorical variable is placed on one axis and a numerical variable on the other. Points in the same category are drawn along a single line, forming a "strip". Strip plots are useful for plotting categorical data, but points with the same or similar values can overlap and hide one another. Swarm plots shift overlapping points sideways so that every point is visible, which gives a better sense of how the data is distributed. Here is an example of strip and swarm plots using the titanic data set.
# loads the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# loads the titanic dataset
titanic = sns.load_dataset("titanic")
# title
plt.title('Fares paid by passengers of the Titanic by deck', fontsize=20)
# plot
sns.stripplot(x="deck", y="fare", data=titanic)
# saves the image
plt.savefig("titanicstripplot.png")
# shows the image
plt.show()
The x and y parameters name the columns to plot: x is the categorical variable and y is the numerical variable. The data parameter is the data set we wish to plot. To create a swarm plot instead, we replace sns.stripplot with sns.swarmplot:
# loads the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# loads the titanic dataset
titanic = sns.load_dataset("titanic")
# title
plt.title('Fares paid by passengers of the Titanic by deck', fontsize=20)
# plot
sns.swarmplot(x="deck", y="fare", hue="sex", data=titanic)
# saves the image
plt.savefig("titanicswarmplot.png")
# shows the image
plt.show()
In this case, the swarm plot will show the relationship between the "deck" and "fare" variables, and the data points will be colored differently based on the "sex" of the individuals in the Titanic dataset.
Line Charts
The main benefit of a line graph is to quickly convey whether values are increasing, decreasing, or remaining constant between data points. A linear trend line is a straight line that depicts the general direction data changes from the first to last data point, often added to summarize the entire chart or the general trend of the dataset values.
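One common way to compute a linear trend line is a least-squares fit of a straight line, which numpy provides through polyfit. The sketch below uses made-up values that rise by exactly 2 per year, so the fitted slope is easy to check; x, values, and trend are names chosen for this example.

```python
import numpy as np

# Made-up data: x is years since the first observation, values rise by 2 per year
x = np.array([0, 1, 2, 3, 4])
values = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# Fit a degree-1 polynomial (a straight line): values = slope * x + intercept
slope, intercept = np.polyfit(x, values, 1)
trend = slope * x + intercept   # points on the trend line

print(round(float(slope), 6))       # 2.0
print(round(float(intercept), 6))   # 10.0
```

The trend array can then be drawn over the original data with a second plt.plot(x, trend) call.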
We may have multiple lines in a line graph that show multiple data sets for a certain category, e.g. the average high temperatures for different regions of the world. We may load data and represent it as a line graph:
# imports the necessary modules
import pandas as pd
import matplotlib.pyplot as plt
# loads the unemployment dataset
unemployment = pd.read_csv('http://data-analytics.zybooks.com/unemployment.csv')
# title
plt.title('U.S. unemployment rate', fontsize = 20)
# x and y axis labels
plt.xlabel('Year')
plt.ylabel('% of total labor force')
# plot
plt.plot(unemployment["Year"], unemployment["Value"])
# saves the image
plt.savefig("linechart.png")
# shows the image
plt.show()
Percentiles
The nth percentile of a dataset is the data value for which n percent of the data falls at or below that value. It is a way to describe the distribution of the data. If a value is at the 95th percentile, 95% of the values in the dataset are at or below it; in other words, only about 5% of the values are greater. Conversely, if a value is at the 5th percentile, only 5% of the values are at or below it and about 95% are greater. Percentiles indicate the relative standing of a value within the dataset.
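The "at or below" definition can be checked directly by counting. This is a minimal sketch with made-up exam scores; percent_at_or_below is a helper written for this example, not a library function.

```python
# Made-up exam scores, already sorted for readability
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

def percent_at_or_below(data, value):
    """Return the percentage of entries in data that are <= value."""
    count = sum(1 for x in data if x <= value)
    return 100 * count / len(data)

print(percent_at_or_below(scores, 95))   # 90.0
print(percent_at_or_below(scores, 55))   # 10.0
```

Here 95 sits at the 90th percentile of these ten scores, because nine of the ten values are at or below it.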
The first quartile (Q1) of a data set is the value where 25% of the data falls at or below. It is the median of the lower half of the data. The third quartile (Q3) of a data set is the value where 75% of the data falls at or below. It is the median of the upper half of the data. The median of a data set is the value where 50% of the data falls at or below. To calculate these values, we first calculate the 50th percentile, which is the median of the data set. Then we calculate the median of the values below the 50th percentile, which gives the first quartile. Likewise, calculating the median of the values above the 50th percentile yields the third quartile. The minimum, Q1, median, Q3, and maximum are a set of descriptive statistics called the five number summary. The describe() function returns these values, along with the count, mean, and standard deviation.
import pandas as pd
scores = pd.read_csv('http://data-analytics.zybooks.com/ExamScores.csv')
# Prints the summary for each exam
print(scores.describe())
# Prints the summary for Exam1 only
print(scores[['Exam1']].describe())
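The median-of-halves procedure described above can be carried out by hand with Python's standard library. This sketch uses made-up numbers. Note that describe() interpolates between data points when it computes the 25% and 75% values, so its results can differ slightly from this hand method.

```python
from statistics import median

data = sorted([7, 1, 3, 9, 5, 11, 13, 15])   # made-up values

half = len(data) // 2
lower_half = data[:half]     # values below the median position
upper_half = data[-half:]    # values above the median position

q1 = median(lower_half)      # first quartile
q2 = median(data)            # median (50th percentile)
q3 = median(upper_half)      # third quartile

# Five number summary: minimum, Q1, median, Q3, maximum
print(min(data), q1, q2, q3, max(data))   # 1 4.0 8.0 12.0 15
```

With an odd number of values, the two slices leave out the middle value, which matches taking the values strictly below and above the median.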
Box plots
A box plot is a visualization of the five number summary. The "box" in the plot represents the middle 50% of the data, with Q1 as the lower boundary of the box and Q3 as the upper boundary, and the median drawn as a line inside the box. Two lines (whiskers) extend from the lower and upper boundaries toward the minimum and maximum values, and each covers roughly 25% of the data. By default, seaborn stops the whiskers at the most extreme values that are not outliers and draws outliers as individual points. We can create such a plot using the boxplot function from seaborn.
# loads the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# loads the ExamScores dataset
scores = pd.read_csv('http://data-analytics.zybooks.com/ExamScores.csv')
# transforms the data
df = pd.melt(scores, value_name = "Score", var_name = "Exam")
# plot
sns.boxplot(x="Exam", y="Score", data=df)
# saves the image
plt.savefig("Examsboxplots.png")
# shows the image
plt.show()
Outliers in data
We can determine which data values are outliers by calculating how far each value is from Q1 and Q3. The interquartile range (IQR) is the difference between Q3 and Q1, or the length of the box in a box plot. A data value greater than Q3 + 1.5*(IQR) or less than Q1 - 1.5*(IQR) is considered an outlier. A dataset should first be imported as a pandas data frame using read_csv(). Then we inspect the structure of the data using the head() function, which displays the first few rows. If we want the five number summary of a specific attribute, we use dataframeName.attributeName.describe()
For example, we have data on different cars and their attributes:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# reads the mtcars.csv file into a dataframe called mtcars
mtcars = pd.read_csv("https://data-analytics.zybooks.com/mtcars.csv")
# prints the first few lines of mtcars
print(mtcars.head())
Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#five number summary for the data values of an attribute
mtcars.wt.describe()
count 32.000000
mean 3.217250
std 0.978457
min 1.513000
25% 2.581250
50% 3.325000
75% 3.610000
max 5.424000
Name: wt, dtype: float64
#command to display the box plot for attribute "wt"
sns.boxplot(x=mtcars.wt, width=0.35)
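The outlier rule can be applied to the wt summary printed above. The sketch below copies the quartiles, minimum, and maximum from that describe() output and computes the two cutoffs; the names lower_fence and upper_fence are chosen for this example.

```python
# Quartiles and extremes copied from the describe() output for wt
q1, q3 = 2.58125, 3.61
wt_min, wt_max = 1.513, 5.424

iqr = q3 - q1                    # interquartile range
lower_fence = q1 - 1.5 * iqr     # values below this are outliers
upper_fence = q3 + 1.5 * iqr     # values above this are outliers

print(round(iqr, 5))             # 1.02875
print(round(lower_fence, 6))     # 1.038125
print(round(upper_fence, 6))     # 5.153125

print(wt_min < lower_fence)      # False: no outliers on the low side
print(wt_max > upper_fence)      # True: at least one car is an outlier on the high side
```

With the data frame loaded, the outlying rows themselves could be selected with a condition such as mtcars[(mtcars.wt < lower_fence) | (mtcars.wt > upper_fence)].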
Histograms
A frequency distribution is a summary of the frequencies (or counts) of individual values or ranges of values in a dataset. It shows how often each different value or range of values occurs. Frequency distributions are commonly represented by histograms, which depict data by displaying the frequencies (or counts) of values within specific intervals, or "bins." Each bin represents a specific range of values along the horizontal axis of the histogram. The vertical axis represents the frequency of the values in each bin. Histograms provide a good view of data density: a higher bar represents a higher density of data in that range. They are also convenient for understanding the distribution of the data.
In general, the number of bins should be close to the square root of the total number of values (square root rule).
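The square root rule is simple to compute. The helper below is a sketch written for this example, not a library function, and rounding up to the next whole number is an assumption, since the rule only asks for a value close to the square root.

```python
import math

def sqrt_rule_bins(n_values):
    """Suggest a bin count: the square root of n_values, rounded up."""
    return math.ceil(math.sqrt(n_values))

print(sqrt_rule_bins(36))    # 6
print(sqrt_rule_bins(100))   # 10
print(sqrt_rule_bins(891))   # 30
```

The result can be passed as the bins argument when plotting a histogram.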
Histograms can be plotted using the hist() function from matplotlib.pyplot, where the first parameter takes the data and the bins parameter takes an integer value for the number of bins
import pandas as pd
import matplotlib.pyplot as plt
heights = pd.read_csv('http://data-analytics.zybooks.com/height.csv')
fig, ax = plt.subplots()
plt.hist(heights['Height'], bins=6)
ax.set_xlabel('Height')
ax.set_ylabel('Frequency')
plt.savefig('histogram6.png')
plt.show()