DMCT Lab Report: Python, Data Analysis, and Visualization

Department of Computer Science and Engineering
Submitted By:
Name: Sameer Damkondwar
Roll No.: BTECH/10239/20
Branch: Computer Science & Engineering
Session: MO-2023
Assignment 1
1. Perform the following operations using Python:
a) Updating an existing String
b) Using built-in String Methods to Manipulate Strings, e.g.
len(), find(), decode(), isalpha(), isdigit(), count(), etc.
2. Perform the following on ‘Lists’ using Python
a) Updating, Slicing, Indexing
b) Using the following built-in Methods to manipulate lists: append(),
extend(), insert(), pop(), remove(), reverse(), sort().
3. Perform the following on ‘Tuples’ using Python
a) Updating, Indexing, deleting, slicing.
b) Using following built-in Methods to manipulate Tuples: max(), min(),
len(), tuple().
4. Perform the following on ‘Dictionary’ using Python
a) Updating, deleting
b) Using following built-in Methods to manipulate Dictionary: update(),
values(),get(), clear(), copy(), type(), len().
5. Create Separate Sets for Indian Cricket Players playing in T20, ODI and
Test Match for current West Indies tour. Also perform the union(),
intersection(), difference() operations on the above sets.
6. Find the Largest of the 4 Strings using Conditional Statements (if-elif-else) in Python.
7. Write a function to generate the even numbers between 1 and 30. Create a
list of the squares of these numbers and display the list as well. Create another
list by filtering the squared list using an ‘anonymous’ function to get those
numbers which are even numbers (Hint: Use the filter() method and the
‘lambda’ keyword).
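The seven tasks above can be sketched with the standard library alone. This is a minimal illustration, not a reference solution: the sample values, the player labels (`player_a` and so on) and the helper names `largest_string` and `even_numbers` are hypothetical, and "largest" is assumed to mean "longest". In Python 3, decode() belongs to bytes objects rather than to str, so the sketch decodes a bytes literal.

```python
# Q1: strings are immutable, so "updating" builds a new string.
text = "Hello World"
updated = text[:6] + "Python"            # 'Hello Python'
position = updated.find("Python")        # index of the first match, -1 if absent
letter_count = updated.count("o")        # occurrences of 'o'
only_letters = "abc".isalpha()           # True
only_digits = "2023".isdigit()           # True
decoded = b"caf\xc3\xa9".decode("utf-8")  # bytes -> str (Python 3)

# Q2: lists are mutable and change in place.
nums = [5, 3, 8]
nums[0] = 4            # update by index
nums.append(1)         # add one element at the end
nums.extend([7, 2])    # add several elements
nums.insert(1, 9)      # insert 9 at index 1
last = nums.pop()      # remove and return the last element
nums.remove(9)         # remove the first occurrence of 9
nums.reverse()
nums.sort()
middle = nums[1:4]     # slicing returns a new list

# Q3: tuples are immutable; an "update" creates a new tuple, and only the
# whole tuple (not a single element) can be deleted with del.
point = (3, 1, 2)
point = point + (9,)
largest, smallest = max(point), min(point)
as_tuple = tuple([1, 2])

# Q4: dictionaries.
marks = {"math": 80}
marks.update({"physics": 75})
marks["math"] = 85
backup = marks.copy()              # shallow copy taken before deleting
del marks["physics"]
physics = marks.get("physics", 0)  # default value when the key is missing

# Q5: sets with hypothetical player labels (not a real squad list).
t20 = {"player_a", "player_b", "player_c"}
odi = {"player_b", "player_c", "player_d"}
test_match = {"player_c", "player_e"}
all_players = t20.union(odi, test_match)
all_formats = t20.intersection(odi, test_match)
t20_only = t20.difference(odi, test_match)


# Q6: largest of four strings, assuming "largest" means "longest".
def largest_string(a, b, c, d):
    if len(a) >= len(b) and len(a) >= len(c) and len(a) >= len(d):
        return a
    elif len(b) >= len(c) and len(b) >= len(d):
        return b
    elif len(c) >= len(d):
        return c
    else:
        return d


# Q7: even numbers, their squares, and a lambda-based filter.
def even_numbers(start=1, stop=30):
    return [n for n in range(start, stop + 1) if n % 2 == 0]


evens = even_numbers()
squares = [n * n for n in evens]
even_squares = list(filter(lambda n: n % 2 == 0, squares))
```

Because the square of an even number is itself even, the filter in Q7 keeps every element of the squared list. Dropping the len() calls in Q6 would compare the strings lexicographically instead.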
Assignment 2
1. Create a random List of 100 numbers between 0 and 100 by importing the
‘random’ module. Write a function to return the Mean, Mode, Median,
Standard Deviation and Variance of the above List. Also find the
Quartiles (Q1, Q3) and percentiles using ‘numpy’ functions, and calculate
the IQR (Inter Quartile Range). [Hint: use the ‘numpy’ functions quantile()
and percentile() for Quartiles and percentiles]
2. Use the ‘pandas’ library to create a Data frame from the provided .csv file
(IPL Cricket data) and perform the following:
a) Create the Lists to display the Rows and Column Names of this Data
b) Display it in Transposed form and perform slicing and indexing of this
c) Use the shape attribute and the info() method for the complete information about
this DF.
d) Use value_counts() to count the number of Players TEAM-wise and
display the percentage of players TEAM-wise.
e) Use crosstab() function of ‘Pandas’ to display the
a. TEAM-wise COUNTRY-wise count of Players.
b. COUNTRY-wise, PLAYING ROLE-wise count of Players.
f) Display the Top 10 Players in their sorted order of SOLD PRICE.
g) Create and Display DFs storing COUNTRY-wise average of SOLD PRICE,
COUNTRY-wise Maximum SOLD PRICE, COUNTRY-wise Minimum SOLD
PRICE (use groupby).
h) Create and Display DFs storing PLAYING ROLE-wise Average of SOLD
Minimum SOLD PRICE (use groupby)
i) Create a Data frame that has PLAYING ROLE-wise Average SOLD PRICE
and Maximum SOLD PRICE (Hint: Create Separate DFs, one for
PLAYING ROLE and Maximum SOLD PRICE, and the other DF for PLAYING
ROLE and average SOLD PRICE. Then merge them using merge() on the
PLAYING ROLE attribute).
j) Display the names of the players who have hit more than 70 SIXERS.
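A minimal sketch of question 1 and several parts of question 2 follows. The provided IPL .csv is not reproduced here, so a tiny made-up frame stands in for it; the column names follow the attribute names used in the questions, the rows (`P1`, `T1`, ...) are invented, and the helper `summarize` is a hypothetical name. The standard deviation and variance are computed as population values, which is one possible reading of the task.

```python
import random
import statistics

import numpy as np
import pandas as pd

# Question 1: 100 random integers between 0 and 100 (seeded for repeatability).
random.seed(0)
values = [random.randint(0, 100) for _ in range(100)]


def summarize(data):
    """Return mean, median, mode, population std and population variance."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "std": statistics.pstdev(data),
        "variance": statistics.pvariance(data),
    }


stats = summarize(values)
q1, q3 = np.percentile(values, [25, 75])   # quartiles via numpy
iqr = q3 - q1

# Question 2: a small invented frame standing in for the IPL .csv file.
ipl = pd.DataFrame({
    "PLAYER NAME": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "TEAM": ["T1", "T1", "T2", "T2", "T2", "T3"],
    "COUNTRY": ["IND", "AUS", "IND", "SA", "IND", "AUS"],
    "PLAYING ROLE": ["Batsman", "Bowler", "Batsman",
                     "Allrounder", "Bowler", "Batsman"],
    "SOLD PRICE": [900, 400, 700, 500, 300, 800],
    "SIXERS": [85, 5, 72, 40, 2, 60],
})

row_labels = list(ipl.index)                  # (a) row labels
column_names = list(ipl.columns)              # (a) column names
transposed = ipl.T                            # (b) transposed form
team_counts = ipl["TEAM"].value_counts()      # (d) players per team
team_percent = ipl["TEAM"].value_counts(normalize=True) * 100
team_by_country = pd.crosstab(ipl["TEAM"], ipl["COUNTRY"])          # (e)
top_players = ipl.sort_values("SOLD PRICE", ascending=False).head(10)  # (f)

# (i) one frame per statistic, then merge on the shared key.
avg_by_role = (ipl.groupby("PLAYING ROLE")["SOLD PRICE"].mean()
               .reset_index(name="AVG SOLD PRICE"))
max_by_role = (ipl.groupby("PLAYING ROLE")["SOLD PRICE"].max()
               .reset_index(name="MAX SOLD PRICE"))
role_summary = avg_by_role.merge(max_by_role, on="PLAYING ROLE")

# (j) boolean mask on SIXERS, then pick the name column.
big_hitters = ipl.loc[ipl["SIXERS"] > 70, "PLAYER NAME"].tolist()
```

With the real file, the frame would be loaded with pd.read_csv() instead of being typed in, and the rest of the calls would stay the same as long as the column names match.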
Assignment 3
Draw a bar chart for the COUNTRY attribute with the SOLD PRICE.
[HINT: use barplot() of the seaborn library. The Data Frame should be passed
in the parameter data=DF]
Draw a bar chart for COUNTRY attribute with the attribute SIXERS.
Draw a histogram for SOLD PRICE and understand its distribution.
Draw a Density Plot for SOLD PRICE
Draw a boxplot for the SOLD PRICE and also identify the outliers.
Draw a comparing density plot for the SOLD PRICE of players having
CAPTAINCY EXPERIENCE and having no Captaincy experience.
Compare the distribution of SOLD PRICE according to PLAYING
ROLES using boxplots in a single graph.
Draw a scatter plot of the number of SIXERS vs. SOLD PRICE of players.
Create a List of features as follows: ['SR-B', 'AVE', 'SIXERS', 'SOLD PRICE']
and draw a pairwise scatter plot for each of the above
features. [Hint: use pairplot() of the Seaborn library]
Also draw a Correlation Matrix for the above List of features.
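Two of the plotting tasks above have a numeric counterpart that is easy to check by hand: the points a boxplot draws as outliers, and the values behind a correlation matrix. The sketch below computes both with pandas only. It assumes the common convention of whiskers at 1.5 times the IQR, and the eight rows of data are invented for illustration, not taken from the IPL file.

```python
import pandas as pd

# Invented sample; the column names follow the feature list in the task.
df = pd.DataFrame({
    "SR-B": [120.0, 135.0, 110.0, 150.0, 125.0, 140.0, 115.0, 160.0],
    "AVE": [25.0, 30.0, 20.0, 38.0, 27.0, 33.0, 22.0, 41.0],
    "SIXERS": [20, 35, 10, 60, 25, 45, 15, 90],
    "SOLD PRICE": [300, 450, 250, 600, 350, 500, 280, 1800],
})

# Outliers by the 1.5 * IQR rule (assumed whisker convention).
price = df["SOLD PRICE"]
q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(price < lower) | (price > upper)]

# Pairwise Pearson correlations for the listed features.
features = ["SR-B", "AVE", "SIXERS", "SOLD PRICE"]
corr = df[features].corr()

# The matching plots would be drawn from the same objects, for example
# sns.boxplot(x=df["SOLD PRICE"]) and sns.heatmap(corr, annot=True).
```

For this sample the quartiles are 295 and 525, so the upper fence is 870 and only the 1800 entry is flagged.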
Assignment 4
1. Read the ‘Glucose’ attribute of the Diabetes data(Diabetes_zero.csv) and
Apply the following Binning techniques for smoothing the data.[Bin Size
a) Binning by Mean.
b) Binning by Median.
c) Binning by bin boundaries.
2. Perform the same binning by Equal Width method on same data for
Mean, Median and Bin Boundaries.[on above mentioned Data]
3. Perform PCA on the Diabetes data (Diabetes_zero.csv) to reduce the
dimensionality to 4. Also draw a heat map to find the correlation among
these PCs.
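The binning tasks in questions 1 and 2 can be sketched as follows. Equal-frequency bins are taken as the partitioning for question 1 and equal-width bins for question 2. The nine values are a made-up stand-in for the Glucose column, the bin size of 3 and the bin count of 3 are assumptions chosen for the example, and the function names are hypothetical.

```python
import numpy as np


def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = np.sort(np.asarray(values, dtype=float))
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of equal width."""
    ordered = np.sort(np.asarray(values, dtype=float))
    edges = np.linspace(ordered.min(), ordered.max(), n_bins + 1)
    # Interior edges only, so the maximum lands in the last bin.
    index = np.digitize(ordered, edges[1:-1])
    return [ordered[index == k] for k in range(n_bins)]


def smooth(bins, method):
    """Replace each value by its bin mean, bin median, or nearest boundary."""
    result = []
    for b in bins:
        if b.size == 0:                 # an equal-width bin can be empty
            result.append(b)
        elif method == "mean":
            result.append(np.full(b.shape, b.mean()))
        elif method == "median":
            result.append(np.full(b.shape, np.median(b)))
        elif method == "boundaries":
            low, high = b.min(), b.max()
            # Ties go to the lower boundary.
            result.append(np.where(b - low <= high - b, low, high))
        else:
            raise ValueError(f"unknown method: {method}")
    return result


# Made-up values standing in for the Glucose attribute.
glucose = [4, 8, 15, 21, 21, 24, 25, 28, 34]

freq_bins = equal_frequency_bins(glucose, bin_size=3)
by_mean = smooth(freq_bins, "mean")
by_median = smooth(freq_bins, "median")
by_boundaries = smooth(freq_bins, "boundaries")

width_bins = equal_width_bins(glucose, n_bins=3)
width_by_mean = smooth(width_bins, "mean")
```

With bin size 3 the equal-frequency bins are [4, 8, 15], [21, 21, 24] and [25, 28, 34], which smooth to means 9, 22 and 29. The equal-width bins cover 4 to 14, 14 to 24 and 24 to 34, so they hold 2, 3 and 4 values respectively.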
Assignment 5
Q1. Load the sns dataset.
import pandas as pd
import numpy as np
import seaborn as sns
Q2. Store the data in tips variable
tips_df = sns.load_dataset("tips")
Q3. Drop the tip attribute
Q4. Print the head after dropping the tip attribute.
Q5. Drop the categorical columns from the dataset.
Q6. Create a dataframe that contains only the categorical columns.
Q7. Convert the categorical columns into one-hot encoded columns using the
pd.get_dummies() method.
Q8. Join the numerical columns with the one-hot encoded columns (use the
concat() function).
Q9. Divide the dataset into training and testing sets (with test_size=0.2).
Q10. Use the StandardScaler() function from the sklearn.preprocessing module
to standardize the training and testing data.
Q11. Apply LinearRegression.
Q12. Find the Mean Absolute Error, Mean Squared Error, Root Mean Squared
Error and R2-score.
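The steps Q3 to Q12 above can be sketched end to end as below. sns.load_dataset("tips") fetches the data over the network, so this sketch builds a synthetic frame with the same column names instead; the generated values, the linear relation between total_bill and tip, and the random_state are all assumptions made so the example runs offline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the tips dataset (same column names, invented values).
rng = np.random.default_rng(0)
n = 200
tips_df = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "sex": rng.choice(["Male", "Female"], n),
    "smoker": rng.choice(["Yes", "No"], n),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
    "time": rng.choice(["Lunch", "Dinner"], n),
    "size": rng.integers(1, 7, n),
})
tips_df["tip"] = 0.15 * tips_df["total_bill"] + rng.normal(0, 0.5, n)

# Q3: the tip column is the target, so it is dropped from the features.
y = tips_df["tip"]
X = tips_df.drop(columns=["tip"])

# Q5-Q8: split numerical and categorical columns, encode, and rejoin.
categorical_cols = ["sex", "smoker", "day", "time"]
numerical = X.drop(columns=categorical_cols)
categorical = X[categorical_cols]
one_hot = pd.get_dummies(categorical, dtype=float)
X_final = pd.concat([numerical, one_hot], axis=1)

# Q9: 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X_final, y, test_size=0.2, random_state=0)

# Q10: fit the scaler on the training data only, then apply it to both.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Q11-Q12: linear regression and the four error measures.
model = LinearRegression().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
```

Fitting the scaler on the training split alone keeps information from the test split out of the model, which is the usual reason for calling fit_transform() on one and transform() on the other.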
Assignment 6
Q1. Load the kc_housedata in pandas.
Q2. Print head
Q3. Display the data types of each column using the attribute dtypes
Q4. Drop the columns "id" and "Unnamed: 0" from axis 1 using the method
drop(), then use the method describe() to obtain a statistical summary of the
Q5. Use the method value_counts() to count the number of houses with unique
floor values, and use the method .to_frame() to convert it to a dataframe.
Q6. Use the function boxplot in the seaborn library to produce a plot that can
be used to determine whether houses with a waterfront view or without a
waterfront view have more price outliers.
Q7. Use the function regplot in the seaborn library to determine if the feature
sqft_above is negatively or positively correlated with price.
Q8. Fit a linear regression model to predict the price using the feature
'sqft_living' then calculate the R^2.
Q9. Find the intercept and the coefficients (slope).
Q10. Calculate the y_predicted values.
Q11. Fit a linear regression model to predict the 'price' using the following list of
features: "floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view", "bathrooms",
"sqft_living15", "sqft_above", "grade", "sqft_living". Then calculate the R^2.
Q12. Plot y_pred vs. y_true (line plot).
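A minimal sketch of Q4, Q5 and Q8 to Q10 follows. The house data file is not reproduced here, so six invented rows stand in for it, and the price is generated as an exact linear function of sqft_living so that the fitted slope, intercept and R^2 can be verified by eye. The column names come from the questions; everything else is an assumption.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented rows; price is exactly 50000 + 200 * sqft_living by construction.
houses = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "floors": [1.0, 2.0, 1.0, 1.5, 2.0, 1.0],
    "sqft_living": [1000, 1500, 1200, 2000, 2500, 1800],
})
houses["price"] = 50_000 + 200 * houses["sqft_living"]

# Q4: drop an identifier column along axis 1 (columns).
houses = houses.drop(["id"], axis=1)
summary = houses.describe()

# Q5: houses per unique floor value, as a one-column dataframe.
floor_counts = houses["floors"].value_counts().to_frame()

# Q8-Q10: simple linear regression on a single feature.
X = houses[["sqft_living"]]      # 2-D input: one column, many rows
y = houses["price"]
lm = LinearRegression().fit(X, y)
r_squared = lm.score(X, y)       # coefficient of determination R^2
intercept = lm.intercept_
slope = lm.coef_[0]
y_predicted = lm.predict(X)
```

For Q11 the same calls apply with X set to the list of feature columns; lm.coef_ then holds one coefficient per feature. On real data R^2 will be below 1, unlike in this noiseless example.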
Assignment 7
Q1. Upload the order_item_dataset.csv file and show 30000 rows.
Q2. The Order_item_id column is a counter for the number of items in an order. For
Apriori we need a Quantity column that gives the quantity of a product purchased in a
given order. Append this column at the end.
Q3. Append the Quantity column. Since there is 1 product on every row, we can simply set
Quantity to 1; the quantities will be summed subsequently.
Q4. Find the shape of the data.
Data Preprocessing
Things to do:
Total no. of unique orders and products
Whether to drop products that are purchased within a given threshold
Average no. of products purchased in a given order
Q6. Total no. of unique orders and products
Q7. Describe the data with include='all'.
Q8. Find the product_id value counts.
Q9. Show only the products that were purchased a minimum of 10 times.
Q10. Find the average number of products purchased per transaction.
Things to do:
Create a basket of all products, orders and quantity
Perform One hot encoding so as to convert all quantities into format suitable for
apriori algorithm
Build list of frequent itemsets
Generate rules based on your requirements
Q11. Create a basket of all products, orders and quantity with groupby on order id and
product id and a sum of quantity, reset the index with fillna(0) and set the index to order id.
Q12. Convert 0.0 to 0, convert the units to One hot encoded values
Q13. Build frequent itemsets with min_support = 0.0001 and compute their length.
Q14. Create Rules with metric = 'support', min_threshold = 0.0001
Q15. Find rules for products that have at least 50% confidence of being purchased together.
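The basket construction, one-hot encoding and rule filtering described above can be illustrated on a toy order table. The questions do not name a library for Apriori, so this sketch does not run Apriori itself: it enumerates item pairs by brute force with pandas, which is only practical for tiny data. The four orders and three products are invented, and the column names order_id and product_id are assumed from the wording of the questions.

```python
from itertools import permutations

import pandas as pd

# Invented order lines: one product per row, as described in Q3.
orders = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o2", "o3", "o4", "o4"],
    "product_id": ["A", "B", "A", "B", "A", "B", "C"],
})
orders["Quantity"] = 1

# Q10: average number of products per order.
avg_products = orders.groupby("order_id")["product_id"].count().mean()

# Q11: one row per order, one column per product, summed quantities.
basket = (orders.groupby(["order_id", "product_id"])["Quantity"].sum()
          .unstack().reset_index().fillna(0).set_index("order_id"))

# Q12: one-hot encode (True when the product appears in the order).
encoded = basket > 0

# Support of a single item = share of orders that contain it.
support = encoded.mean()

# Brute-force pair rules (a stand-in for Apriori, not Apriori itself).
rows = []
for a, b in permutations(encoded.columns, 2):
    pair_support = (encoded[a] & encoded[b]).mean()
    if pair_support > 0:
        rows.append({
            "antecedent": a,
            "consequent": b,
            "support": pair_support,
            "confidence": pair_support / support[a],
        })
rules = pd.DataFrame(rows)

# Q15: keep rules with at least 50% confidence.
strong_rules = rules[rules["confidence"] >= 0.5]
```

In this toy table every order containing C also contains B, so the rule C -> B has confidence 1.0, while B -> C reaches only one third and is filtered out.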
Assignment 8
Q1. Upload the Social_Network_Ads.csv.
Q2. Splitting the dataset into the Training set and Test set
Q3. Do the normalization on train and test data.
Q4. Train the Decision Tree Classification model on the Training set.
Q5. Calculate the Confusion Matrix.
Q6. Plot the decision tree.
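A minimal sketch of Q2 to Q6 follows. Social_Network_Ads.csv is not reproduced here, so a synthetic frame stands in for it; the column names Age, EstimatedSalary and Purchased are an assumption about that file, and the labelling rule is invented. StandardScaler is used as one possible reading of "normalization", and export_text() prints the tree as text in place of a graphical plot.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in; column names and the labelling rule are assumptions.
rng = np.random.default_rng(0)
n = 200
ads = pd.DataFrame({
    "Age": rng.integers(18, 60, n),
    "EstimatedSalary": rng.integers(15_000, 150_000, n),
})
ads["Purchased"] = ((ads["Age"] > 40)
                    | (ads["EstimatedSalary"] > 100_000)).astype(int)

feature_names = ["Age", "EstimatedSalary"]
X = ads[feature_names]
y = ads["Purchased"]

# Q2: hold out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Q3: scale with statistics learned from the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Q4: fit the decision tree.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train_scaled, y_train)

# Q5: rows are true classes, columns are predicted classes.
y_pred = clf.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

# Q6: text rendering of the fitted tree (thresholds are in scaled units).
tree_text = export_text(clf, feature_names=feature_names)
```

sklearn.tree.plot_tree(clf) would draw the same tree with matplotlib when a graphical plot is required.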
Q1. Load the Live.csv
Q2. Check for missing values in dataset
Q3. Drop redundant columns
Q4. View the statistical summary of numerical variables
Q5. Drop status_id and status_published variable from the dataset
Q6. Declare feature vector and target variable
y = df['status_type']
Q7. Convert categorical variable into integers
Q8. Do the normalization.
Q8. Apply K-Means model with two clusters
Q9. Find kmeans centers.
Q10. Find the accuracy.
Q11. Use elbow method to find clusters.
Q12. Check your results with k=3 and k=4.
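The clustering steps above can be sketched on synthetic data. Live.csv is not reproduced here, so two well-separated groups of points stand in for the prepared feature matrix, and the array y plays the role of the integer-encoded status_type; all values and parameters (n_init, random_state, the range of k) are assumptions. Because K-Means numbers its clusters arbitrarily, the accuracy is taken as the better of the two possible label assignments.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Two invented groups of 50 points each, far apart from one another.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
group_b = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
X = np.vstack([group_a, group_b])
y = np.array([0] * 50 + [1] * 50)    # stand-in for the encoded target

# Normalization to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X)

# K-Means with two clusters, and its centers.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
centers = kmeans.cluster_centers_

# Accuracy: cluster ids are arbitrary, so try both label assignments.
matches = int((kmeans.labels_ == y).sum())
accuracy = max(matches, len(y) - matches) / len(y)

# Elbow method: inertia (within-cluster sum of squares) for k = 1..5.
ks = list(range(1, 6))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled).inertia_
    for k in ks
]
```

Plotting inertias against ks shows the "elbow": here the drop from k=1 to k=2 is large and later drops are small, which suggests two clusters. With more than two classes, matching cluster ids to labels needs a more general mapping than the two-way check used here.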