BIRLA INSTITUTE OF TECHNOLOGY, MESRA
Department of Computer Science and Engineering
DMCT LAB (IT 427) REPORT FILE

Submitted By:
Name: Sameer Damkondwar
Roll No.: BTECH/10239/20
Branch: Computer Science & Engineering
Session: MO-2023

Assignment 1

1. Perform the following operations on strings using Python:
a) Updating an existing string.
b) Using built-in functions and string methods to manipulate strings, e.g. len(), find(), isalpha(), isdigit(), count(), etc. (In Python 3, decode() is a bytes method, not a str method.)
2. Perform the following on lists using Python:
a) Updating, slicing, indexing.
b) Using the following built-in methods to manipulate lists: append(), extend(), insert(), pop(), remove(), reverse(), sort().
3. Perform the following on tuples using Python:
a) Updating, indexing, deleting, slicing.
b) Using the following built-in functions to manipulate tuples: max(), min(), len(), tuple().
4. Perform the following on dictionaries using Python:
a) Updating, deleting.
b) Using the following built-in methods and functions to manipulate dictionaries: update(), values(), get(), clear(), copy(), type(), len().
5. Create separate sets of the Indian cricket players playing in the T20, ODI and Test matches of the current West Indies tour. Also perform the union(), intersection() and difference() operations on these sets.
6. Find the largest of four strings using conditional statements (if-elif-else) in Python.
7. Write a function to generate the even numbers between 1 and 30. Create a list by squaring these numbers and display it. Then create another list by filtering the squared list with an anonymous function to keep only the even numbers. (Hint: use the filter() function and the lambda keyword.)

Assignment 2

1. Create a random list of 100 numbers between 0 and 100 by importing the random module. Write a function to return the mean, mode, median, standard deviation and variance of this list. Also find the quartiles (Q1, Q3) and percentiles, and calculate the IQR (interquartile range). (Hint: NumPy has no quartile() function; use numpy.percentile() or numpy.quantile() for both the quartiles and the percentiles.)
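The statistics function above can be sketched as follows; this is a minimal version using the standard random and statistics modules plus NumPy, with an illustrative seed and function name (describe):

```python
import random
import statistics

import numpy as np

def describe(nums):
    """Return basic descriptive statistics for a list of numbers."""
    # NumPy has no quartile() function: np.percentile (or np.quantile)
    # gives Q1 and Q3 directly.
    q1, q3 = np.percentile(nums, [25, 75])
    return {
        "mean": statistics.mean(nums),
        "mode": statistics.mode(nums),
        "median": statistics.median(nums),
        "stdev": statistics.stdev(nums),
        "variance": statistics.variance(nums),
        "Q1": float(q1),
        "Q3": float(q3),
        "IQR": float(q3 - q1),
    }

random.seed(0)  # fixed seed so the run is repeatable
data = [random.randint(0, 100) for _ in range(100)]
stats = describe(data)
print(stats)
```

Note that statistics.stdev()/variance() compute the sample versions; numpy's std() defaults to the population version, so the two libraries can disagree slightly.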
2. Use the pandas library to create a data frame (DF) from the provided .csv file of IPL cricket data and perform the following:
a) Create lists of the row and column names of this data frame.
b) Display the DF in transposed form and perform slicing and indexing on it.
c) Use the shape attribute and the info() method to get complete information about the DF.
d) Use value_counts() to count the number of players TEAM-wise and display the TEAM-wise percentage of players.
e) Use the crosstab() function of pandas to display:
   a. the TEAM-wise, COUNTRY-wise count of players;
   b. the COUNTRY-wise, PLAYING ROLE-wise count of players.
f) Display the top 10 players sorted by SOLD PRICE.
g) Create and display DFs storing the COUNTRY-wise average, maximum and minimum SOLD PRICE (use groupby).
h) Create and display DFs storing the PLAYING ROLE-wise average, maximum and minimum SOLD PRICE (use groupby).
i) Create a data frame holding the PLAYING ROLE-wise average SOLD PRICE and maximum SOLD PRICE. (Hint: create two separate DFs, one for PLAYING ROLE and maximum SOLD PRICE and the other for PLAYING ROLE and average SOLD PRICE, then merge them with merge() on the PLAYING ROLE attribute.)
j) Display the names of the players who have hit more than 70 sixes.

Assignment 3
GRAPH PLOTTING

k. Draw a bar chart of the COUNTRY attribute against SOLD PRICE. (Hint: use barplot() from the seaborn library; pass the data frame as the parameter data=DF.)
l. Draw a bar chart of the COUNTRY attribute against the SIXERS attribute.
m. Draw a histogram of SOLD PRICE and interpret its distribution.
n. Draw a density plot of SOLD PRICE.
o. Draw a boxplot of SOLD PRICE and identify the outliers.
p. Draw comparative density plots of the SOLD PRICE of players with and without CAPTAINCY EXPERIENCE.
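Items g-i of Assignment 2 revolve around groupby() and merge(); a minimal sketch using a tiny hand-made frame in place of the IPL .csv (the column names SOLD PRICE and PLAYING ROLE are taken from the assignment; the values and the derived column names are illustrative):

```python
import pandas as pd

# Tiny stand-in for the IPL csv; the real file has many more rows and columns.
df = pd.DataFrame({
    "PLAYING ROLE": ["Batsman", "Bowler", "Batsman", "Allrounder", "Bowler"],
    "SOLD PRICE": [950000, 400000, 700000, 1200000, 550000],
})

# One DF for the ROLE-wise average and one for the ROLE-wise maximum ...
avg_df = df.groupby("PLAYING ROLE")["SOLD PRICE"].mean().reset_index(name="AVG SOLD PRICE")
max_df = df.groupby("PLAYING ROLE")["SOLD PRICE"].max().reset_index(name="MAX SOLD PRICE")

# ... then merge them on the common PLAYING ROLE attribute (item i).
merged = avg_df.merge(max_df, on="PLAYING ROLE")
print(merged)
```

The same pattern with a COUNTRY column instead of PLAYING ROLE covers item g.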
q. Compare the distribution of SOLD PRICE across PLAYING ROLEs using boxplots in a single graph.
r. Draw a scatter plot of the number of SIXERS vs. the SOLD PRICE of players.
s. Create the list of features ['SR-B', 'AVE', 'SIXERS', 'SOLD PRICE'] and draw a pairwise scatter plot of these features. (Hint: use pairplot() from the seaborn library.)
t. Also draw a correlation matrix for the above list of features (of Q. s).
(WRITE YOUR INTERPRETATIONS IN 1-2 LINES FOR EACH OF THE ABOVE PLOTTED GRAPHS.)

Assignment 4

1. Read the Glucose attribute of the diabetes data (Diabetes_zero.csv) and apply the following binning techniques to smooth the data [bin size = 5]:
a) Binning by mean.
b) Binning by median.
c) Binning by bin boundaries.
2. Perform the same binning (by mean, median and bin boundaries) using the equal-width method on the same data.
3. Perform PCA on the diabetes data (Diabetes_zero.csv) to reduce the dimensionality to 4. Also draw a heat map to examine the correlation among these principal components.

Assignment 5

Q1. Import the libraries and list the available seaborn datasets:
import pandas as pd
import numpy as np
import seaborn as sns
sns.get_dataset_names()
Q2. Store the data in a tips variable:
tips_df = sns.load_dataset("tips")
Q3. Drop the tip attribute.
Q4. Print the head after dropping tip.
Q5. Drop the categorical columns from the dataset.
Q6. Create a data frame that contains only the categorical columns.
Q7. Convert the categorical columns into one-hot encoded columns using the pd.get_dummies() method.
Q8. Join the numerical columns with the one-hot encoded columns (use the concat() function).
Q9. Divide the dataset into training and testing sets (with test_size=20%).
Q10. Use the StandardScaler class from the sklearn.preprocessing module to standardize the training and testing data.
Q11. Apply LinearRegression.
Q12. Find the Mean Absolute Error, Mean Squared Error, Root Mean Squared Error and R2 score.

Assignment 6

Q1. Load the kc_housedata file into pandas.
Q2. Print the head.
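The binning by mean of Assignment 4 (equal-depth bins of size 5) can be sketched as follows, using a short sample list in place of the Glucose column; binning by median or by boundaries only changes the value each bin member is replaced with:

```python
import numpy as np

def bin_by_mean(values, bin_size=5):
    """Equal-depth binning: sort the values, split them into bins of
    bin_size, and smooth each bin by replacing its members with the
    bin mean."""
    data = np.sort(np.asarray(values, dtype=float))
    smoothed = []
    for start in range(0, len(data), bin_size):
        chunk = data[start:start + bin_size]
        smoothed.extend([chunk.mean()] * len(chunk))
    return smoothed

# Small sample in place of the Glucose column of Diabetes_zero.csv.
sample = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28]
smoothed_sample = bin_by_mean(sample)
print(smoothed_sample)  # two bins of five values each
```

For the equal-width variant of question 2, the bins are instead formed by cutting the value range into intervals of equal width (e.g. with pd.cut) before smoothing.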
Q3. Display the data types of each column using the dtypes attribute.
Q4. Drop the columns "id" and "Unnamed: 0" along axis 1 using the drop() method, then use the describe() method to obtain a statistical summary of the data.
Q5. Use the value_counts() method to count the number of houses with each unique floor value, and use the .to_frame() method to convert the result to a data frame.
Q6. Use the boxplot() function of the seaborn library to produce a plot showing whether houses with or without a waterfront view have more price outliers.
Q7. Use the regplot() function of the seaborn library to determine whether the feature sqft_above is negatively or positively correlated with price.
Q8. Fit a linear regression model to predict price from the feature 'sqft_living', then calculate the R^2.
Q9. Find the intercept, coefficients and slope.
Q10. Calculate the y_predicted values.
Q11. Fit a linear regression model to predict 'price' using the list of features "floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view", "bathrooms", "sqft_living15", "sqft_above", "grade", "sqft_living", then calculate the R^2.
Q12. Plot y_pred vs. y_true (line plot).

Assignment 7

Q1. Upload the order_item_dataset.csv file and show 30000 rows of the data.
Q2. The order_item_id column is a counter that counts the number of items in an order. For Apriori we need a Quantity column that holds the quantity of a product purchased in a given order; append this column at the end.
Q3. Append the Quantity column. Since there is one product on every row, we can simply set Quantity to 1 for each row.
Q4. Find the shape of the data.

Data Preprocessing
Things to do:
• Total number of unique orders and products.
• Whether to drop products that are purchased fewer times than a given threshold.
• Average number of products purchased in a given order.

Q6. Find the total number of unique orders and products.
Q7. Describe the data with include='all'.
Q8. Find the product_id value counts.
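The single-feature regression of Assignment 6 (Q8-Q10) can be sketched with synthetic data standing in for the housing file; the slope 300, intercept 50000 and noise level below are arbitrary stand-ins, not values from kc_housedata:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the housing data: price roughly linear in sqft_living.
rng = np.random.default_rng(42)
sqft = rng.uniform(500, 4000, size=200).reshape(-1, 1)
price = 300 * sqft.ravel() + 50000 + rng.normal(0, 20000, size=200)

model = LinearRegression().fit(sqft, price)
r2 = model.score(sqft, price)   # R^2 on the data the model was fitted on (Q8)
y_pred = model.predict(sqft)    # the y_predicted values (Q10)

print("intercept:", model.intercept_)   # Q9
print("slope:", model.coef_[0])         # Q9: single feature, so coef == slope
print("R^2:", r2)
```

For Q11, sqft is simply replaced by a 2-D array with one column per feature; coef_ then holds one coefficient per feature.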
Q9. Keep only the products that were purchased a minimum of 10 times.
Q10. Find the average number of products purchased per transaction.

Things to do:
• Create a basket of all products, orders and quantities.
• Perform one-hot encoding to convert the quantities into a format suitable for the Apriori algorithm.
• Build the list of frequent itemsets.
• Generate rules based on your requirements.

Q11. Create a basket of all products, orders and quantities: group by order_id and product_id with the sum of Quantity, reset the index, fill missing values with fillna(0), and set order_id as the index.
Q12. Convert 0.0 to 0, i.e. convert the unit counts into one-hot encoded values.
Q13. Build the frequent itemsets with min_support = 0.0001 and add an itemset length column.
Q14. Create rules with metric = 'support' and min_threshold = 0.0001.
Q15. Find the rules for products that, with 50% confidence, are likely to be purchased together.

Assignment 8

Q1. Upload Social_Network_Ads.csv.
Q2. Split the dataset into the training set and the test set.
Q3. Normalize the training and test data.
Q4. Train the Decision Tree classification model on the training set.
Q5. Calculate the confusion matrix.
Q6. Plot the decision tree.

K-means:
Q1. Load Live.csv.
Q2. Check for missing values in the dataset.
Q3. Drop redundant columns.
Q4. View the statistical summary of the numerical variables.
Q5. Drop the status_id and status_published variables from the dataset.
Q6. Declare the feature vector and the target variable y = df['status_type'].
Q7. Convert the categorical variables into integers.
Q8. Normalize the data.
Q9. Apply the K-means model with two clusters.
Q10. Find the K-means cluster centers.
Q11. Find the accuracy.
Q12. Use the elbow method to find the number of clusters.
Q13. Check your results with k = 3 and k = 4.
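The K-means steps above (normalization, fitting with two clusters, reading the centres, and the inertia used by the elbow method) can be sketched on toy data standing in for the numeric features of Live.csv; the blob locations and seeds are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Two well-separated blobs stand in for the numeric features of Live.csv.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])

X_scaled = MinMaxScaler().fit_transform(X)   # the normalization step
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

print(kmeans.cluster_centers_)  # the K-means cluster centres
print(kmeans.inertia_)          # within-cluster SSE, the quantity the
                                # elbow method plots against k
```

For the elbow method, the same fit is repeated for k = 1, 2, 3, ... and inertia_ is plotted against k; the "elbow" where the curve flattens suggests the cluster count.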