In Python, the main built-in data structures are lists, tuples, sets, and dictionaries.

Lists

Lists are ordered collections of items. The order in which elements are added is represented by indices, similar to arrays in C++. Unlike fixed-size C++ arrays, lists are mutable in both content and length, meaning we can add, remove, or modify elements. Lists are defined using square brackets:

    my_list = [1, 2, 3, 4]

To access a certain element, we type the name of the list followed by the element's index in square brackets: my_list[0], my_list[1], my_list[2], etc. Indices start at 0.

Common operations on lists:

Accessing elements: my_list[index]
Modifying elements: my_list[index] = new_value
Adding elements: my_list.append(item) or my_list.insert(index, item)
Removing elements: my_list.remove(item) or del my_list[index]

Tuples

Tuples are similar to lists, except that they are immutable: once a tuple is defined, its elements cannot be changed. Tuples are defined with parentheses:

    my_tuple = (1, 2, 3, 4)

The elements of a tuple are accessed the same way as the elements of a list: my_tuple[index]

Sets

Sets are unordered collections of objects. Since a set is unordered, indices do not apply. All elements of a set are unique: if we define a set with duplicate elements, or add an element that is already present, the duplicate is silently dropped. Sets are defined using curly braces:

    my_set = {1, 2, 3, 4}

Dictionaries

Dictionaries are similar to sets in that their elements are not accessed by position. (Since Python 3.7 a dictionary does remember the order in which its keys were inserted.) The difference is that each element of a dictionary is a key-value pair, and a value is accessed using its key. Keys must be unique.

    my_dict = {"name": "John", "age": 30, "city": "New York"}

Common operations on dictionaries:

Accessing values: my_dict[key] (in our case my_dict["name"] returns "John")
Modifying values: my_dict[key] = new_value
Adding new key-value pairs: my_dict[new_key] = new_value
Removing key-value pairs: del my_dict[key]

Modules

Modules play a role similar to header files in C++. However, the keyword here is import instead of #include.

    import pandas as pd

We could have just typed import pandas, but giving it the alias pd allows for less typing when using the module, e.g. pd.function() instead of pandas.function().
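The operations described so far can be tried in a short script. The sketch below is illustrative only: the values are made up, and the standard-library statistics module (with the arbitrarily chosen alias st) stands in for pandas so that the aliased import runs without any third-party install.

```python
# Aliased import: "st" is just a shorter name for the statistics module
import statistics as st

# List: ordered and mutable
my_list = [1, 2, 3, 4]
my_list[0] = 10          # modify an element by index
my_list.append(5)        # add an element at the end
my_list.insert(1, 99)    # add an element at index 1
my_list.remove(99)       # remove the first occurrence of a value
del my_list[0]           # remove an element by index
print(my_list)           # [2, 3, 4, 5]

# Tuple: ordered but immutable, so assignment raises TypeError
my_tuple = (1, 2, 3, 4)
tuple_is_immutable = False
try:
    my_tuple[0] = 10
except TypeError:
    tuple_is_immutable = True

# Set: duplicates are silently dropped
my_set = {1, 2, 2, 3}
my_set.add(3)            # already present, so the set is unchanged
print(len(my_set))       # 3

# Dictionary: values are accessed by key
my_dict = {"name": "John", "age": 30, "city": "New York"}
my_dict["age"] = 31          # modify an existing value
my_dict["country"] = "USA"   # add a new key-value pair
del my_dict["city"]          # remove a key-value pair
print(my_dict["name"])       # John

# Using the aliased module: st.mean instead of statistics.mean
list_mean = st.mean(my_list)
print(list_mean)             # 3.5
```

The try/except around the tuple assignment is only there to show that the assignment fails; in normal code one would simply not attempt it.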
The following are common data analysis and visualization modules, with their conventional aliases in parentheses:

numpy (np) - provides arrays and mathematical functions for scientific computing
scipy.stats (st) - provides statistical functions
pandas (pd) - provides data frames (tabular data handling) and statistical functions
matplotlib.pyplot (plt) - data visualization (plots data)
scikit-learn (imported as sklearn) - machine learning
seaborn (sns) - statistical data visualization

Using modules to plot data

matplotlib.pyplot (plt) gives us functions to plot data. It is usually used in conjunction with pandas and numpy. Here is an example of using these modules to read, plot, and show data.

    # imports the necessary libraries
    import pandas as pd
    import matplotlib.pyplot as plt

    # loads the unemployment dataset, creates a DataFrame
    unemployment = pd.read_csv('http://data-analytics.zybooks.com/unemployment.csv')

    # title
    plt.title('U.S. unemployment rate', fontsize=20)

    # x and y axis labels
    plt.xlabel('Year')
    plt.ylabel('% of total labor force')

    # plots the data values read from the data set
    plt.plot(unemployment["Year"], unemployment["Value"])

    # saves the image
    plt.savefig("unemployment.png")

    # shows the image
    plt.show()

Data Frames

A data frame is a two-dimensional tabular data structure provided by pandas. The main components of a data frame are the index, the columns, and the values. The index is the set of row labels, the columns are the column labels, and the values are the data contained in these rows and columns. These values can be of any data type. To work with data in a data frame, we typically load it from a file using pandas:
    # imports the pandas library
    import pandas as pd

    # loads a file containing comma-separated values and assigns
    # the data frame to the variable data_frame1
    data_frame1 = pd.read_csv('file.csv')

    # loads a text file where the values are separated by a space, with no column labels
    data_frame2 = pd.read_csv('file.txt', sep=' ', header=None)

    # loads an Excel file (the parameter is sheet_name; the older spelling sheetname was removed)
    data_frame3 = pd.read_excel('file.xlsx', sheet_name='Sheet1')

The main attributes of a data frame are:

axes - index and column labels
columns - column labels
dtypes - data types of the values in each column
index - index (row) labels
shape - ordered pair that gives the number of rows and columns
size - total number of values in the data frame
values - values in the data frame

The main methods of a data frame are:

describe() - returns summary statistics for numerical columns
head(), tail() - return the first/last rows of the data frame (5 by default)
min(), max() - return the min/max values of a numerical column
mean(), median() - return the mean/median of the values in a numerical column
sample() - returns a random row
std() - returns the standard deviation of the values in a numerical column

We will now build a data frame from a data set containing various information about the passengers of the Titanic. This data set is available through the seaborn library (load_dataset downloads it on first use). We therefore import seaborn, store the data set in the variable titanic, and then use titanic to manipulate the data and return the desired information.

    import seaborn as sns

    # loads the titanic data set, creates a DataFrame
    titanic = sns.load_dataset("titanic")

To find the number of rows and columns in the data set, we use titanic.shape, which returns the ordered pair (891, 15): 891 rows and 15 columns. To find the column names, we use the attribute titanic.columns, which outputs

    Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
           'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
           'alive', 'alone'],
          dtype='object')

The attribute titanic.dtypes returns the data type used in each column.
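The attributes and methods listed above can be tried without downloading anything. The sketch below builds a tiny data frame by hand; the variable name people, the column names, and all values are invented for illustration and are not taken from the Titanic data set.

```python
import pandas as pd

# Hand-built data frame (made-up values, for illustration only)
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "age": [22, 35, 58, 41],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# Attributes: accessed without parentheses
n_rows, n_cols = people.shape         # number of rows and columns
column_names = list(people.columns)   # column labels as a plain list
total_values = people.size            # rows * columns

# Methods: called with parentheses
mean_age = people["age"].mean()
oldest = people["age"].max()

print(n_rows, n_cols)      # 4 3
print(column_names)        # ['name', 'age', 'fare']
print(total_values)        # 12
print(mean_age, oldest)    # 39.0 58

print(people.dtypes)       # data type of each column
print(people.head(2))      # first two rows
print(people.describe())   # summary statistics for the numerical columns only
```

Note that describe() skips the name column here because it is not numerical.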
Subsetting data

We may retrieve parts of a data frame. Commonly, we do this by selecting column labels or by selecting a range of rows and columns. To select one or more columns, we pass a list of column labels inside the indexing brackets:

    titanic[["sex", "age"]]

This returns all values for these attributes.

- If we want the first five rows of these selected columns:

    titanic[["sex", "age"]].head()

- Or a custom range of rows, where a is the starting index and b is the index at which to stop (row b itself is not included):

    titanic[["sex", "age"]][a:b]

- We may return the rows where one column has a particular value:

    titanic[titanic.age == 26]

- We may use comparison operators to return certain rows:

    titanic[titanic.age <= 26]

Reshaping data

In "long form", each row holds a single data value, and other columns identify what that value belongs to (here, which student and which subject):

       StudentID  Subject  Score
    0          1     Math     85
    1          1  Science     78
    2          2     Math     92
    3          2  Science     89

In "wide form", each row holds all the data values for one subject of study (here, one student), and each variable gets its own column:

       StudentID  Math_Score  Science_Score
    0          1          85             78
    1          2          92             89

To reshape data from long to wide we use the pivot() function; the reverse direction, wide to long, is done with melt(). Here is an example of pivot():

    # create data frame
    df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                       'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'baz': [1, 2, 3, 4, 5, 6],
                       'zoo': ['x', 'y', 'z', 'q', 'w', 't']})

       foo bar  baz zoo
    0  one   A    1   x
    1  one   B    2   y
    2  one   C    3   z
    3  two   A    4   q
    4  two   B    5   w
    5  two   C    6   t

    # pivot the data frame
    df.pivot(index='foo', columns='bar', values='baz')

    bar  A  B  C
    foo
    one  1  2  3
    two  4  5  6

    # reshape it again, this time with more than one value column
    df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])

        baz       zoo
    bar   A  B  C   A  B  C
    foo
    one   1  2  3   x  y  z
    two   4  5  6   q  w  t

Scatter plots

We may graph scatter plots using seaborn.
Here is an example:

    # loads the necessary libraries
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np

    # creates the data points (numpy arrays)
    x = np.array([0, 5, 3, 4, 7, 8, 10])
    y = np.array([5, 2, 5, 15, 27, 15, 31])

    # plot (recent seaborn versions require x and y to be passed by keyword)
    sns.regplot(x=x, y=y, ci=None)

    # saves the image
    plt.savefig("scatterplot.png")

    # shows the image
    plt.show()

We can also use regplot with data frames.

    # loads the necessary modules
    import matplotlib.pyplot as plt
    import seaborn as sns

    # loads the iris data set
    df = sns.load_dataset("iris")

    # title
    plt.title('Petal length and width of iris', fontsize=20)

    # plot
    sns.regplot(x=df["petal_length"], y=df["petal_width"], ci=None, fit_reg=False)

    # saves the image
    plt.savefig("irisscatterplot.png")

    # shows the image
    plt.show()

The x parameter of regplot is the data for the x axis and the y parameter is the data for the y axis. ci is the confidence interval drawn around the trend line (None turns it off). fit_reg controls the trend line itself: True, the default, shows a trend line in the plot, while False, as in the iris example, leaves a plain scatter plot.

Strip/swarm plots

A strip plot is a scatter plot where a categorical variable represents one axis and a numerical or ordinal variable represents the other. The points of each category line up to form a "strip". Strip plots are useful for plotting categorical data. However, points with identical or nearly identical values can be drawn on top of one another, so repeated points are hard to see. Swarm plots shift such points sideways so that every point stays visible, which gives a better sense of the data set. Here is an example of a strip plot using the titanic data set.

    # loads the necessary libraries
    import matplotlib.pyplot as plt
    import seaborn as sns

    # loads the titanic dataset
    titanic = sns.load_dataset("titanic")

    # title
    plt.title('Fares paid by passengers of the Titanic by deck', fontsize=20)

    # plot
    sns.stripplot(x="deck", y="fare", data=titanic)

    # saves the image
    plt.savefig("titanicstripplot.png")

    # shows the image
    plt.show()

The x and y parameters name the columns to plot on each axis (the category and the value), and the data parameter is the data set we wish to plot.
If we wish to create a swarm plot instead, we replace sns.stripplot with sns.swarmplot:

    # loads the necessary libraries
    import matplotlib.pyplot as plt
    import seaborn as sns

    # loads the titanic dataset
    titanic = sns.load_dataset("titanic")

    # title
    plt.title('Fares paid by passengers of the Titanic by deck', fontsize=20)

    # plot
    sns.swarmplot(x="deck", y="fare", hue="sex", data=titanic)

    # saves the image
    plt.savefig("titanicswarmplot.png")

    # shows the image
    plt.show()

In this case, the swarm plot shows the relationship between the "deck" and "fare" variables, and the hue parameter colors the data points according to the "sex" of each passenger in the Titanic dataset.

Line charts

The main benefit of a line chart is that it quickly conveys whether values are increasing, decreasing, or remaining constant between data points. A linear trend line is a straight line that depicts the general direction in which the data changes from the first to the last data point. It is often added to summarize the entire chart or the general trend of the dataset values. We may have multiple lines in a line chart that show multiple data sets for a certain category, e.g. the average high temperatures for different regions of the world. We may load data and represent it as a line chart:

    # imports the necessary modules
    import pandas as pd
    import matplotlib.pyplot as plt

    # loads the unemployment dataset
    unemployment = pd.read_csv('http://data-analytics.zybooks.com/unemployment.csv')

    # title
    plt.title('U.S. unemployment rate', fontsize=20)

    # x and y axis labels
    plt.xlabel('Year')
    plt.ylabel('% of total labor force')

    # plot
    plt.plot(unemployment["Year"], unemployment["Value"])

    # saves the image
    plt.savefig("linechart.png")

    # shows the image
    plt.show()

Percentiles

The nth percentile of a dataset is the data value for which n percent of the data falls at or below that value. It is a way to describe the distribution of the data.
If a value is at the 95th percentile, it is greater than or equal to 95% of the values in the dataset. In other words, only about 5% of the values in the dataset are greater than this particular value. Conversely, if a value is at the 5th percentile, it is greater than or equal to only 5% of the values in the data set, and about 95% of the values are greater than it. Percentiles indicate the relative standing of a value within the dataset.

The first quartile (Q1) of a data set is the value where 25% of the data falls at or below. It is the median of the lower half of the data.
The median (Q2) of a data set is the value where 50% of the data falls at or below.
The third quartile (Q3) of a data set is the value where 75% of the data falls at or below. It is the median of the upper half of the data.

To calculate these values, we first calculate the 50th percentile, which is the median of the data set. Then we calculate the median of the values below the 50th percentile, which gives us the first quartile. Likewise, calculating the median of the values above the 50th percentile yields the third quartile. The minimum, Q1, median, Q3, and maximum are a set of descriptive statistics called the five number summary. We may return these values, along with the count, mean, and standard deviation, using the function describe().

    import pandas as pd

    scores = pd.read_csv('http://data-analytics.zybooks.com/ExamScores.csv')

    # prints the summary for each exam
    print(scores.describe())

    # prints the summary for Exam1 only
    print(scores[['Exam1']].describe())

Box plots

A box plot is a visualization of the five number summary. The "box" in the plot represents the middle 50% of the data, with Q1 as the lower boundary of the box and Q3 as the upper boundary, and with the median drawn as a line inside the box. Two lines (whiskers) extend away from the lower and upper boundaries to the minimum and maximum values. Each whisker represents roughly 25% of the data.
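The five number summary that a box plot draws can be computed by hand with the median-of-halves procedure described above. The sketch below uses only the standard library; the function name five_number_summary and the sample scores are invented for illustration, and dropping the middle value from both halves when the count is odd is one common convention that is assumed here. describe() in pandas interpolates between data points when computing quartiles, so its 25% and 75% values can differ slightly from this method.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]             # values below the median position
    upper = data[half + (n % 2):]   # skip the middle value when n is odd
    return data[0], median(lower), median(data), median(upper), data[-1]


# Made-up exam scores, for illustration only
scores = [55, 62, 70, 71, 78, 84, 88, 93, 97]
minimum, q1, q2, q3, maximum = five_number_summary(scores)
print(minimum, q1, q2, q3, maximum)   # 55 66.0 78 90.5 97
```

In a box plot of these scores, the box would run from 66.0 to 90.5 with the median line at 78.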
We can create such a plot using the boxplot function from seaborn.

    # loads the necessary libraries
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd

    # loads the ExamScores dataset
    scores = pd.read_csv('http://data-analytics.zybooks.com/ExamScores.csv')

    # transforms the data from wide form to long form
    df = pd.melt(scores, value_name="Score", var_name="Exam")

    # plot
    sns.boxplot(x="Exam", y="Score", data=df)

    # saves the image
    plt.savefig("Examsboxplots.png")

    # shows the image
    plt.show()

Outliers in data

We can determine which data values are outliers by calculating how far each value is from Q1 and Q3. The interquartile range (IQR) is the difference between Q3 and Q1, i.e. the length of the box in a box plot. A data value greater than Q3 + 1.5*(IQR) or less than Q1 - 1.5*(IQR) is considered an outlier.

A dataset should first be imported as a pandas data frame using read_csv(). Then we inspect the structure of the data using the head() function, which displays the first few lines of data. If we want the five number summary of a specific attribute, we use dataframeName.attributeName.describe(). For example, we have data on different cars and their attributes:

    import seaborn as sns
    import pandas as pd
    import matplotlib.pyplot as plt

    # reads the mtcars.csv file into a dataframe called mtcars
    mtcars = pd.read_csv("https://data-analytics.zybooks.com/mtcars.csv")

    # prints the first few lines of mtcars
    print(mtcars.head())

              Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
    0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
    1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
    2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
    3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
    4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2

    # five number summary for the data values of an attribute
    mtcars.wt.describe()

    count    32.000000
    mean      3.217250
    std       0.978457
    min       1.513000
    25%       2.581250
    50%       3.325000
    75%       3.610000
    max       5.424000
    Name: wt, dtype: float64

    # displays the box plot for attribute "wt"
    sns.boxplot(x=mtcars.wt, width=0.35)

Histograms

A frequency distribution is a summary of the frequencies (or counts) of individual values or ranges of values in a dataset. It shows how often each different value or range of values occurs. Frequency distributions are commonly represented by histograms, which depict data by displaying the frequencies (or counts) of values within specific intervals, or "bins". Each bin represents a specific range of values along the horizontal axis of the histogram. The vertical axis represents the frequency of the values in each bin. Histograms provide a good view of data density: a higher bar represents a higher density of data in that range. They are also convenient for understanding how the data is distributed.

In general, the number of bins should be close to the square root of the total number of values (the square root rule). Histograms can be plotted using the hist() function from matplotlib, where the first parameter takes the data and the bins parameter takes an integer value for the number of bins.

    import pandas as pd
    import matplotlib.pyplot as plt

    heights = pd.read_csv('http://data-analytics.zybooks.com/height.csv')

    fig, ax = plt.subplots()
    ax.hist(heights['Height'], bins=6)
    ax.set_xlabel('Height')
    ax.set_ylabel('Frequency')
    plt.savefig('histogram6.png')
    plt.show()
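The square root rule and the idea of counting values per bin can be sketched without matplotlib. The code below is a minimal illustration, not how hist() is implemented: the helper name histogram_counts and the sample heights are invented, and it assumes equal-width bins between the minimum and maximum, with the last bin including its right edge, which mirrors the default behavior of hist().

```python
import math


def histogram_counts(values, bins=None):
    """Count how many values fall into each of `bins` equal-width intervals."""
    if bins is None:
        # square root rule: number of bins close to sqrt(number of values)
        bins = math.ceil(math.sqrt(len(values)))
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if width == 0:
            idx = 0  # all values are identical, so everything lands in the first bin
        else:
            # the last bin includes its right edge, so clamp the index
            idx = min(int((v - low) / width), bins - 1)
        counts[idx] += 1
    return counts


# Made-up heights in inches, for illustration only (16 values, so 4 bins)
heights = [60, 61, 63, 64, 65, 66, 66, 67, 68, 68, 69, 70, 71, 72, 74, 76]
print(histogram_counts(heights))   # [3, 5, 5, 3]
```

Each count corresponds to the height of one bar in the histogram; passing bins=6 instead would reproduce the bin count used in the plotting example.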