Data Splitting and Exploration in Machine Learning

Q1

The function takes a loaded dataset as input and returns the dataset split into two subsets.

    ...
    # split into train test sets
    train, test = train_test_split(dataset, ...)

“Ideally, you can split your original dataset into input (X) and output (y) columns, then call the function passing both arrays and have them split correctly into train and test subsets.”

    ...
    # split into train test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, ...)

“The size of the split can be specified via the "test_size" argument that takes a number of rows (integer) or a percentage (float) of the size of the dataset between 0 and 1. The latter is the most common, with values used such as 0.33 where 33 percent of the dataset will be allocated to the test set and 67 percent will be allocated to the training set.”

    ...
    # split into train test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

We can demonstrate this using a synthetic classification dataset with 1,000 examples.

    # split a dataset into train and test sets
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split
    # create dataset
    X, y = make_blobs(n_samples=1000)
    # split into train test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
    # print the shape of each resulting array
    print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example splits the dataset into train and test sets, then prints the size of each new subset.

Q2

“First, I’ll explore the data to gain a preliminary understanding of which variables might be important. I’ll begin by taking a look at the correlations to see which variables correlate most with our target variable D.”

    setwd('C:/Users/danny/Downloads')
    bank.df <- read.csv('Bankruptcy.csv', header = TRUE)
    # begin with a corr plot of the target D and the predictors R1..R24
    cols <- c('D','R1','R2','R3','R4','R5','R6','R7','R8','R9','R10','R11','R12',
              'R13','R14','R15','R16','R17','R18','R19','R20','R21','R22','R23','R24')
    suppressWarnings(library(ggcorrplot))
    ggcorrplot(cor(bank.df[,cols]), ggtheme = theme_classic,
               outline.color = 'white',
               colors = c("#6D9EC1", "white", "#E46726"), lab = TRUE)

Q3

Q4

“In KNN, finding the value of k is not easy. A small value of k means that noise will have a higher influence on the result, and a large value makes it computationally expensive.
2. Another simple approach to select k is to set k = sqrt(n).”

Q5

Q6

Q7

“A confusion matrix in R is a table that will categorize the predictions against the actual values. It includes two dimensions; one indicates the predicted values and the other stands for the actual values.”

Q8
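As a small illustration of the k = sqrt(n) rule of thumb from Q4 and the confusion matrix described in Q7, here is a minimal sketch in plain Python. The function names sqrt_k and confusion_matrix are hypothetical, n is assumed to be the number of training rows, and rounding k up to an odd number (to reduce tied votes in two-class problems) is an added assumption that the quoted text does not state.

```python
import math
from collections import Counter


def sqrt_k(n_samples):
    """Rule-of-thumb k = sqrt(n), rounded to an integer.

    Assumption: bump even values up by one so that k is odd,
    which reduces tied votes in two-class problems.
    """
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1


def confusion_matrix(actual, predicted):
    """Tabulate predictions against actual values.

    Returns (labels, matrix), where matrix[i][j] counts the rows whose
    actual label is labels[i] and whose predicted label is labels[j].
    """
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


if __name__ == "__main__":
    # 670 training rows, as in a 67/33 split of 1,000 examples
    print("k =", sqrt_k(670))

    # toy labels, made up purely for illustration
    actual = [1, 0, 1, 1, 0, 0]
    predicted = [1, 0, 0, 1, 1, 0]
    labels, matrix = confusion_matrix(actual, predicted)
    print("labels:", labels)
    for label, row in zip(labels, matrix):
        print("actual", label, row)
```

Rows correspond to actual values and columns to predicted values here; the opposite orientation is equally common, so it is worth checking which convention a given tool uses before reading off the counts.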
