Uploaded by Rcj MusaRaza

KNN

Q1
The function takes a loaded dataset as input and returns the dataset split into two subsets.
...
# split into train test sets
train, test = train_test_split(dataset, ...)
“Ideally, you can split your original dataset into input (X) and output (y) columns, then call the
function passing both arrays and have them split correctly into train and test subsets.”
...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)
“The size of the split can be specified via the “test_size” argument that takes a number of rows
(integer) or a percentage (float) of the size of the dataset between 0 and 1. The latter is the most
common, with values used such as 0.33 where 33 percent of the dataset will be allocated to the
test set and 67 percent will be allocated to the training set.”
...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
We can demonstrate this using a synthetic classification dataset with 1,000 examples.
# split a dataset into train and test sets
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_blobs(n_samples=1000)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
Running the example splits the dataset into train and test sets, then prints the shapes of the
resulting arrays: 670 rows go to the training set and 330 rows to the test set.
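The quoted passage also says that test_size can be an integer number of rows rather than a fraction. A minimal sketch of that variant on the same kind of make_blobs dataset follows; the random_state value is an arbitrary assumption, added only so the split is repeatable.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# synthetic dataset: 1,000 rows, 2 features (the make_blobs default)
X, y = make_blobs(n_samples=1000, random_state=1)

# an integer test_size is read as an absolute number of test rows,
# so 200 rows go to the test set and the remaining 800 to training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, random_state=1
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Passing a float such as 0.2 would give the same 800/200 split here, since 20 percent of 1,000 rows is 200.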
Q2
“First, I’ll explore the data to gain a preliminary understanding of which variables might be
important. I’ll begin by taking a look at the correlation plot to see which variables correlate most with
our target variable D.”
setwd('C:/Users/danny/Downloads')
bank.df <- read.csv('Bankruptcy.csv', header = TRUE)
# begin with a corr plot
# target D plus the 24 ratio columns R1 to R24
cols <- c('D', paste0('R', 1:24))
suppressWarnings(library(ggcorrplot))
ggcorrplot(cor(bank.df[, cols]), ggtheme = theme_classic,
           outline.color = 'white',
           colors = c("#6D9EC1", "white", "#E46726"), lab = TRUE)
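The idea behind reading the correlation plot, ranking variables by how strongly they correlate with the target D, can be sketched in plain Python. This is only an illustration under stated assumptions: the data below is randomly generated toy data, not the Bankruptcy.csv file, and the helper name pearson and the three columns R1 to R3 are made up for the example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# toy data (assumption): a binary target D and three candidate predictors
random.seed(0)
n = 200
D = [random.randint(0, 1) for _ in range(n)]
columns = {
    "R1": [d + random.gauss(0, 0.5) for d in D],   # closely tied to D
    "R2": [random.gauss(0, 1) for _ in range(n)],  # unrelated to D
    "R3": [-d + random.gauss(0, 2.0) for d in D],  # weakly, negatively tied
}

# rank predictors by absolute correlation with the target, strongest first
ranked = sorted(
    ((name, pearson(values, D)) for name, values in columns.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: {r:+.2f}")
```

Sorting by absolute value matters because a strongly negative correlation is as informative as a strongly positive one.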
Q3
Q4
“In KNN, finding the value of k is not easy. A small value of k means that noise will have a
higher influence on the result, and a large value makes it computationally expensive. Another
simple approach to selecting k is to set k = sqrt(n).” Here n is the number of training samples.
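The k = sqrt(n) rule of thumb can be sketched as a small helper. The function name sqrt_k is hypothetical, and nudging an even k up to the next odd number is an added assumption (a common way to avoid tied votes in two-class problems), not something the quoted text states.

```python
import math

def sqrt_k(n_train):
    """Rule-of-thumb k for KNN: the rounded square root of the training size."""
    k = max(1, round(math.sqrt(n_train)))
    # assumption: prefer an odd k so a two-class majority vote cannot tie
    if k % 2 == 0:
        k += 1
    return k

# 670 is the training-set size from the 0.33 split of 1,000 rows
print(sqrt_k(670))
```

This gives only a starting point; comparing a few values of k around it on held-out data is still the safer way to choose.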
Q5
Q6
Q7
“A confusion matrix in R is a table that categorizes the predictions against the actual values.
It has two dimensions: one indicates the predicted values and the other indicates the actual
values.”
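The two-dimensional table described above is the same in any language. A minimal sketch in plain Python follows, with rows for actual values and columns for predicted values; the helper name build_confusion_matrix and the label vectors are made up for illustration.

```python
from collections import Counter

def build_confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# toy labels (assumption): 1 = positive class, 0 = negative class
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
labels = [0, 1]

matrix = build_confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"actual {label}: {row}")

# accuracy is the diagonal (correct predictions) over the total count
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(actual)
print("accuracy:", accuracy)
```

The off-diagonal cells are the errors: the top-right cell counts false positives and the bottom-left cell counts false negatives.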
Q8