Data Analysis with Random Forest

RANDOM FOREST

Importing Libraries

In [ ]:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

Importing Dataset

In [ ]:
    df = pd.read_csv('drug200.csv')
    df.head()

Out[ ]:
       Age Sex      BP Cholesterol  Na_to_K   Drug
    0   23   F    HIGH        HIGH   25.355  drugY
    1   47   M     LOW        HIGH   13.093  drugC
    2   47   M     LOW        HIGH   10.114  drugC
    3   28   F  NORMAL        HIGH    7.798  drugX
    4   61   F     LOW        HIGH   18.043  drugY

In [ ]:
    # Features: every column except the last; target: the last column (Drug).
    x = df.iloc[:, :-1]
    y = df.iloc[:, -1]

In [ ]:
    x.head()

Out[ ]:
       Age Sex      BP Cholesterol  Na_to_K
    0   23   F    HIGH        HIGH   25.355
    1   47   M     LOW        HIGH   13.093
    2   47   M     LOW        HIGH   10.114
    3   28   F  NORMAL        HIGH    7.798
    4   61   F     LOW        HIGH   18.043

In [ ]:
    y.head()

Out[ ]:
    0    drugY
    1    drugC
    2    drugC
    3    drugX
    4    drugY
    Name: Drug, dtype: object

In [ ]:
    y.value_counts()

Out[ ]:
    drugY    91
    drugX    54
    drugA    23
    drugC    16
    drugB    16
    Name: Drug, dtype: int64

In [ ]:
    print("Shape of x:", x.shape)
    print("Shape of y:", y.shape)

    Shape of x: (200, 5)
    Shape of y: (200,)

Data Preprocessing

Handling the Null Values

In [ ]:
    x.isnull().sum()

Out[ ]:
    Age            0
    Sex            0
    BP             0
    Cholesterol    0
    Na_to_K        0
    dtype: int64

In [ ]:
    y.isnull().sum()

Out[ ]:
    0

In [ ]:
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    # Encode the categorical columns Sex, BP and Cholesterol as integers.
    x.iloc[:, 1] = le.fit_transform(x.iloc[:, 1])
    x.iloc[:, 2] = le.fit_transform(x.iloc[:, 2])
    x.iloc[:, 3] = le.fit_transform(x.iloc[:, 3])

    <ipython-input-14-6bc19a4866e3>:3: DeprecationWarning: In a future version,
    `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of
    always setting a new array. To retain the old behavior, use either
    `df[df.columns[i]] = newvals` or, if columns are non-unique,
    `df.isetitem(i, newvals)`
      x.iloc[:,1]=le.fit_transform(x.iloc[:,1])
    <ipython-input-14-6bc19a4866e3>:4: DeprecationWarning: In a future version,
    `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of
    always setting a new array. To retain the old behavior, use either
    `df[df.columns[i]] = newvals` or, if columns are non-unique,
    `df.isetitem(i, newvals)`
      x.iloc[:,2]=le.fit_transform(x.iloc[:,2])
    <ipython-input-14-6bc19a4866e3>:5: DeprecationWarning: In a future version,
    `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of
    always setting a new array. To retain the old behavior, use either
    `df[df.columns[i]] = newvals` or, if columns are non-unique,
    `df.isetitem(i, newvals)`
      x.iloc[:,3]=le.fit_transform(x.iloc[:,3])

In [ ]:
    x.describe()

Out[ ]:
                  Age         Sex          BP  Cholesterol     Na_to_K
    count  200.000000  200.000000  200.000000   200.000000  200.000000
    mean    44.315000    0.520000    0.910000     0.485000   16.084485
    std     16.544315    0.500854    0.821752     0.501029    7.223956
    min     15.000000    0.000000    0.000000     0.000000    6.269000
    25%     31.000000    0.000000    0.000000     0.000000   10.445500
    50%     45.000000    1.000000    1.000000     0.000000   13.936500
    75%     58.000000    1.000000    2.000000     1.000000   19.380000
    max     74.000000    1.000000    2.000000     1.000000   38.247000

Splitting the dataset into the Training set and Test set

In [ ]:
    from sklearn.model_selection import train_test_split
    # NOTE: the next line is cut off at the page margin in the source.
    x_train,x_test,y_train,y_test=train_test_split(x,y, test_size=0.2, random_state=29
    print('The Shape of X_train:', x_train.shape)
    print('The Shape of X_test:', x_test.shape)
    print('The Shape of y_train:', y_train.shape)
    print('The Shape of y_test:', y_test.shape)

    The Shape of X_train: (160, 5)
    The Shape of X_test: (40, 5)
    The Shape of y_train: (160,)
    The Shape of y_test: (40,)

Training the Random Forest model on the Training set

In [ ]:
    from sklearn.ensemble import RandomForestClassifier
    classification = RandomForestClassifier()
    classification.fit(x_train, y_train)

Out[ ]:
    RandomForestClassifier()

Predicting the Test set results

In [ ]:
    y_pred = classification.predict(x_test)
    # Put actual and predicted labels side by side for inspection.
    output = pd.DataFrame()
    output['Actual Value'] = y_test
    output['predicted Value'] = y_pred
    output.head()

Out[ ]:
        Actual Value predicted Value
    145        drugX           drugX
    197        drugX           drugX
    75         drugY           drugY
    119        drugY           drugY
    141        drugY           drugY

Confusion Matrix And Accuracy

In [ ]:
    from sklearn.metrics import accuracy_score,confusion_matrix,ConfusionMatrixDisplay
    import matplotlib.pyplot as plt
    print("The accuracy of the above classification model is:")
    print(accuracy_score(y_test,y_pred)*100," %")
    print("\nConfusion Matrix\n")
    # NOTE: the next line is cut off at the page margin in the source.
    matrix = ConfusionMatrixDisplay(confusion_matrix(y_test,y_pred),display_labels=clas
    matrix.plot()
    plt.show()

    The accuracy of the above classification model is:
    97.5  %

    Confusion Matrix

    [Figure: confusion matrix plot; image not preserved]

In [ ]:
    from sklearn.tree import export_graphviz
    from IPython.display import Image
    import graphviz

    # Draw the top levels of the first three trees in the forest.
    for i in range(3):
        tree = classification.estimators_[i]
        dot_data = export_graphviz(tree,
                                   feature_names=x_train.columns,
                                   filled=True,
                                   max_depth=1,
                                   impurity=False,
                                   proportion=True)
        graph = graphviz.Source(dot_data)
        display(graph)

    Tree 1
    Na_to_K <= 14.615, samples = 100.0%, value = [0.131, 0.062, 0.094, 0.262, 0.45]
        True:  Age <= 51.5, samples = 56.0%, value = [0.239, 0.114, 0.17, 0.477, 0.0] (...)
        False: samples = 44.0%, value = [0.0, 0.0, 0.0, 0.0, 1.0]

    Tree 2
    Age <= 31.5, samples = 100.0%, value = [0.094, 0.075, 0.088, 0.338, 0.406]
        True:  BP <= 0.5, samples = 25.7%, value = [0.256, 0.0, 0.116, 0.233, 0.395] (...)
        False: Na_to_K <= 14.829, samples = 74.3%, value = [0.034, 0.103, 0.077, 0.376, 0.41] (...)

    Tree 3
    Na_to_K <= 14.829, samples = 100.0%, value = [0.1, 0.088, 0.088, 0.3, 0.425]
        True:  BP <= 0.5, samples = 55.7%, value = [0.174, 0.152, 0.152, 0.522, 0.0] (...)
        False: samples = 44.3%, value = [0.0, 0.0, 0.0, 0.0, 1.0]
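The workflow above (encode, split, train, predict, score) can be condensed into a few helper functions. The following is only a minimal sketch under stated assumptions, not the notebook's own code: the function names `make_synthetic_drug_frame`, `encode_categoricals` and `train_and_evaluate` are hypothetical, the seed is arbitrary, and the data is a synthetic stand-in that merely reuses the column names of drug200.csv with an invented labelling rule, so its scores say nothing about the real dataset. Columns are assigned by name, which is one way to avoid the positional `.iloc` assignment that triggers the DeprecationWarning. Because the `display_labels=` argument is cut off in the notebook, the sketch assumes the model's `classes_` attribute as the class order for the confusion matrix.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def make_synthetic_drug_frame(n_rows=200, seed=0):
    """Build a SYNTHETIC stand-in that reuses the column names of drug200.csv.

    All values and the labelling rule are invented for illustration only.
    """
    rng = np.random.default_rng(seed)
    frame = pd.DataFrame({
        "Age": rng.integers(15, 75, size=n_rows),
        "Sex": rng.choice(["F", "M"], size=n_rows),
        "BP": rng.choice(["HIGH", "LOW", "NORMAL"], size=n_rows),
        "Cholesterol": rng.choice(["HIGH", "NORMAL"], size=n_rows),
        "Na_to_K": rng.uniform(6.0, 38.0, size=n_rows).round(3),
    })
    # Invented rule so the labels are learnable:
    # Na_to_K above 15 -> drugY, otherwise the label depends on BP.
    by_bp = frame["BP"].map({"HIGH": "drugA", "LOW": "drugC", "NORMAL": "drugX"})
    frame["Drug"] = by_bp.where(frame["Na_to_K"] <= 15.0, "drugY")
    return frame


def encode_categoricals(features, columns=("Sex", "BP", "Cholesterol")):
    """Return an integer-encoded copy plus one fitted LabelEncoder per column."""
    encoded = features.copy()  # leave the caller's frame untouched
    encoders = {}
    for name in columns:
        encoder = LabelEncoder()
        # Assign by column name rather than by position with .iloc.
        encoded[name] = encoder.fit_transform(encoded[name])
        encoders[name] = encoder
    return encoded, encoders


def train_and_evaluate(frame, target="Drug", test_size=0.2, seed=0):
    """Encode, split, fit a random forest, and score it on the held-out rows."""
    x, _ = encode_categoricals(frame.drop(columns=[target]))
    y = frame[target]
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=seed)
    model = RandomForestClassifier(random_state=seed)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    # Passing `labels` ties the matrix rows/columns to the model's class order.
    matrix = confusion_matrix(y_test, y_pred, labels=model.classes_)
    return model, accuracy, matrix


if __name__ == "__main__":
    demo = make_synthetic_drug_frame()
    model, accuracy, matrix = train_and_evaluate(demo)
    print("Classes:", list(model.classes_))
    print("Accuracy: %.1f %%" % (accuracy * 100))
    print(matrix)
```

Keeping one fitted encoder per column, instead of refitting a single shared `LabelEncoder`, makes it possible to map the integer codes back to the original categories later with `inverse_transform`. To try the sketch on the real file, `pd.read_csv('drug200.csv')` could be passed to `train_and_evaluate` in place of the synthetic frame.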
