CODING USED FOR FRAYM DATA CASE

1.) Comparing lat/long of customer locations with census-data locations to extract a third variable

a. Case: % Access to Electricity

>> %Load energy table
>> energy = readtable('Energy_Sources_LATLONG.csv');
>> %Convert the table to a numeric matrix
>> energyMat = table2array(energy);
>>
>> %Perturb duplicate values in energyMat so every sample point is unique
>> %(scatteredInterpolant requires unique sample points)
>> [uvals, ~, uidx] = unique(energyMat, 'stable');
energyMat2 = energyMat; %copy, mostly to keep the same class and size
for K = 1 : length(uvals)
    mask = uidx == K;
    %offset each repeat of a value by 0.01 so no two entries collide
    energyMat2(mask) = uvals(K) + (0 : nnz(mask) - 1) * 0.01;
end
>>
>> %Nearest-neighbor interpolant that returns the row index of the closest
>> %census location for any queried (lat, long) pair
>> E = scatteredInterpolant(energyMat2(:,1), energyMat2(:,2), (1:size(energyMat2,1)).', 'nearest');
>> %custProfMat is the customer-profile matrix built earlier; lat/long are in columns 1:2
>> custEnerNear = E( custProfMat(:,1:2) )
>> %Pull the matched census rows (including % access to electricity)
>> energyNear = energyMat2(custEnerNear, :)

MACHINE LEARNING STEPS

>> C = readtable('customer_profiles_F.csv');
>> %Create categorical array for the HighRepayors variable
>> %(>=70% = High, >=30% and <70% = Med, <30% = Low)
>> C.HighRepayors = categorical(C.HighRepayors)
%Remove unnecessary variables
>> C(:,1:4) = []
%Split into training and test datasets
>> cvpt = cvpartition(C.HighRepayors,'Holdout',0.3);
%Create logical vectors: 'true' marks rows used to train the classifier, 'false' marks test rows
>> trainingIdx = training(cvpt);
>> testIdx = test(cvpt);
%Create tables containing the training and test data
>> trainingData = C(trainingIdx,:);
>> testData = C(testIdx,:);
%Fit a k-NN classification model to the training data
>> knnMdl = fitcknn(trainingData,'HighRepayors');
%Use the model to predict which group each test observation belongs to
>> predictedGroups = predict(knnMdl,testData)

%%Evaluate the classification
%Calculate training and test error
>> trainErr = resubLoss(knnMdl)
trainErr = 0
%trainErr = 0 is expected here: fitcknn defaults to a single neighbor, so each
%training point is its own nearest neighbor
>> testErr = loss(knnMdl,testData)
testErr = 0.1992
%To see how the data is misclassified, use a confusion matrix
>> [cm,grp] = confusionmat(testData.HighRepayors,predictedGroups)
cm =
     2     3     1
     2    48     1
     2     3     3
grp =
     High
     Low
     Med
%Calculate the rate at which each category was misclassified:
%High as Low
>> misClass = cm(grp=='High',grp=='Low');
>> falseNeg = 100*misClass/height(testData);
>> disp(['Percentage of False Negatives: ',num2str(falseNeg),'%'])
Percentage of False Negatives: 4.6154%
%Low as High
>> misClass = cm(grp=='Low',grp=='High');
>> falseNeg = 100*misClass/height(testData);
>> disp(['Percentage of False Negatives: ',num2str(falseNeg),'%'])
Percentage of False Negatives: 3.0769%
%High as Med
>> misClass = cm(grp=='High',grp=='Med');
>> falseNeg = 100*misClass/height(testData);
>> disp(['Percentage of False Negatives: ',num2str(falseNeg),'%'])
Percentage of False Negatives: 1.5385%
%Med as High
>> misClass = cm(grp=='Med',grp=='High');
>> falseNeg = 100*misClass/height(testData);
>> disp(['Percentage of False Negatives: ',num2str(falseNeg),'%'])
Percentage of False Negatives: 3.0769%

%Cross-validation techniques
%K-fold cross-validation (leave-one-out method)
>> knnMdl2 = fitcknn(C,'HighRepayors','Leaveout','on');
>> mdl2Loss = kfoldLoss(knnMdl2)
mdl2Loss = 0.1689

%%Principal Component Analysis (Feature Transformation)
>> [pcs,scrs,~,~,pexp] = pca(C{:,1:end-1});
>> %Pareto chart of the % variance explained by each component
>> pareto(pexp)
%Visualize the contributions of each variable to the first five principal components as an image
>> imagesc(abs(pcs(:,1:5)))
>> xlabel('Principal Component')
>> colorbar
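A natural follow-on to the Pareto chart is choosing how many components to keep. This is a minimal sketch, not part of the original run; the 90% variance threshold is an arbitrary illustrative choice:

%Sketch: keep enough components to explain ~90% of the variance
%(the 90% cutoff is an illustrative assumption, not from the case)
cumExp = cumsum(pexp);            %cumulative % variance explained
nKeep = find(cumExp >= 90, 1);    %smallest component count past the cutoff
reducedData = scrs(:, 1:nKeep);   %transformed predictors for later modeling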
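Returning to the confusion matrix above: the four per-pair rate calculations repeat the same three lines. A compact sketch that reports every off-diagonal rate in one pass, using the same cm and grp variables:

%Sketch: report every off-diagonal misclassification rate from cm
nTest = sum(cm(:));   %total test observations (65 here)
for i = 1:numel(grp)
    for j = 1:numel(grp)
        if i ~= j && cm(i,j) > 0
            fprintf('%s classified as %s: %.4f%%\n', ...
                char(grp(i)), char(grp(j)), 100*cm(i,j)/nTest);
        end
    end
end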
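Looking back at case 1.a: the scatteredInterpolant call is only used to recover the row index of the nearest census point, which is why the duplicate-perturbation loop is needed first. Since the Statistics and Machine Learning Toolbox is already in use (fitcknn above), knnsearch could do the same join directly and tolerates duplicate coordinates. A sketch under the same assumption that lat/long sit in columns 1:2:

%Sketch: nearest census location for each customer via knnsearch
nearIdx = knnsearch(energyMat(:,1:2), custProfMat(:,1:2));
energyNearAlt = energyMat(nearIdx, :);   %matched rows, incl. % access to electricity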
%%Decide on predictor importance for an accurate model
%Some classifiers, such as decision trees, have their own built-in methods of feature selection.
%Check the methods associated with the decision tree model.
One of these methods, predictorImportance, can be used to identify the predictor variables that are important for creating an accurate model.
>> treeMdl = fitctree(trainingData,'HighRepayors');
>> methods(treeMdl)

Methods for class ClassificationTree:

compact            margin                resubMargin
compareHoldout     predict               resubPredict
crossval           predictorImportance   surrogateAssociation
cvloss             prune                 view
edge               resubEdge
loss               resubLoss

>> %Importance score per predictor, shown as a bar chart
>> p = predictorImportance(treeMdl);
>> bar(p)
>> %Training and test error for the tree
>> trainTreeErr = resubLoss(treeMdl)
trainTreeErr = 0.1039
>> testTreeErr = loss(treeMdl,testData)
testTreeErr = 0.1989
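A possible next step, not in the original run: use the importance scores in p to drop weak predictors before refitting. The 10%-of-max cutoff below is an arbitrary illustration:

%Sketch: refit the tree on only the stronger predictors
%(the 10%-of-max cutoff is an illustrative assumption)
predNames = treeMdl.PredictorNames;    %names aligned with the entries of p
keep = predNames(p >= 0.1*max(p));     %predictors above the cutoff
treeMdl2 = fitctree(trainingData(:, [keep, {'HighRepayors'}]), 'HighRepayors');
testTreeErr2 = loss(treeMdl2, testData)   %compare against testTreeErr above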