Neuro-Fuzzy Approaches to Decision Making: An Application to Check Authorization from Incomplete Information

V. K. Ramani¹ (ramani@crd.ge.com), J. R. Echauz² (echauz@amadeus.upr.clu.edu), G. J. Vachtsevanos³ (gjv@ee.gatech.edu), and S. S. Kim³ (skim@ee.gatech.edu)

¹ Information Technology Laboratory, Corporate Research & Development, General Electric, Niskayuna, NY 12309
² Dept. of Electrical & Computer Engineering, University of Puerto Rico, Mayagüez Campus, Mayagüez, PR 00681
³ School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332

Abstract: This paper describes the application of several neuro-fuzzy paradigms, such as a multilayer perceptron, a polynomial neural network, and a fuzzy decision model, to the problem of check approval from incomplete data. A simple benchmark case is established as a performance metric to compare the various nonlinear strategies. An overall improvement of at least 10% was obtained in each of these cases.

Keywords: Check authorization, neuro-fuzzy techniques.

I. INTRODUCTION

By the turn of the century, almost half of all consumer payments are expected to be made by checks [1]. This trend is forcing more and more merchants to rely on check-guarantee and check-authorization services provided by specialized companies that assist them in managing the associated increased risk. These companies maintain large databases with information such as customers' names, fraudulent driver's licenses, and Social Security numbers. Stores with connections to an on-line system can have checks approved or rejected in a matter of seconds.

Despite the steady increase in the availability of these services, the number of check-writing individuals represented in current databases is still a very small fraction of the whole population. In many cases, a new customer shows up for whom there is no information available. Is there any decision scheme, other than pure random guessing, that can reduce the odds for the merchant of accepting a bad check? This problem calls for systems that can generalize from incomplete information, and thus predict the creditworthiness of the new customer. Nonlinear learning networks are prime candidates for this task. We describe the application of a multilayer perceptron, a polynomial neural network, and a fuzzy decision model to address the problem of check approval from incomplete data.

II. PROBLEM DESCRIPTION

Consider two sets of data, for training and testing purposes, respectively, typical of those available from a check authorization company. Each data point consists of four input variables and an output variable:

  x1 : Day of the week (1 = Monday, ..., 7 = Sunday)
  x2 : Age of the person
  x3 : Check number
  x4 : Amount of the check
  y  : 1 = accept check, 0 = reject check

Each set consists of 1950 data points. Out of the 3900 data points, only 74 have the complete vector of information. Any missing information is coded as zeros. The output of the network is denoted B for a bounced check and G for a good check, while the predicted outputs are represented by B̂ and Ĝ, respectively.

III. METHODS AND RESULTS

A. Minimum Distance Classifier

With two equiprobable classes, the classification accuracy of the nonlinear learning networks outlined in this paper can be neither worse than 50% nor better than 100%. (A classifier that consistently scores less than 50% is actually "better," since contradicting its output decision yields the complementary accuracy.) But how can we tell if the added complexity of these networks is justified? How "good" is their performance relative to simpler, well-established classification methods?
We will establish a benchmark measure of performance defined by a minimum distance classifier (MDC). The decision rule adopted by any such system is simply: "assign x to the class whose mean feature vector is closest (in the case of a Euclidean distance) to x." Thus, a dichotomous decision is given by

$$ \|x - \mu_1\| < \|x - \mu_2\| \;\Rightarrow\; x \in C_1, \quad \text{else } x \in C_2. \quad (1) $$

The rule reduces to a bank of linear discriminants followed by a maximum selector [2]. In fact, for a two-class problem in particular, the decision boundary needs only one hyperplanar separatrix. Thus, the classifier further reduces to a single thresholded linear discriminant, as shown in Fig. 1. The weight vector $[w_0 \; w_1 \cdots w_4]^T$ follows directly from (1):

$$ [w_1 \cdots w_4]^T = \mu_1 - \mu_2, \qquad w_0 = \tfrac{1}{2}\left(\|\mu_2\|^2 - \|\mu_1\|^2\right), \quad (2) $$

where $\mu_1$ and $\mu_2$ are the estimated class mean feature vectors.

Fig. 1. Thresholded linear discriminant.

This decision model trains virtually instantaneously, as its parameters are determined from estimates of the mean feature vectors. It is a reasonable starting point in most cases, and under independent, equal-variance Gaussianity of the features, it is the optimal (maximum-likelihood) model.

The MDC was trained on the raw training set, and the overall classification accuracy was 56.4% on the test set. This performance is only slightly better than that of the "constant-output classifier" or the "coin-flip classifier," whose expected accuracy is 50% assuming equal a priori probabilities for each class. An obvious factor that cannot be directly handled by this classifier is the use of zeros in place of missing data. The linear decision boundary implemented by the dichotomous MDC cannot "interpret" zeros as special cases, but rather takes them as customers of zero age writing checks numbered 000, clearly an artifact.

A simple preprocessing of the raw database can help alleviate the anomaly caused by missing data. One approach is to replace each zero with the mode or the mean of the corresponding input variable. The latter can easily be estimated from the presumably representative nonzero data in an unbiased and consistent manner using the arithmetic average. In this case, the unknown values are replaced with guesses that produce near-minimal mean squared errors. Once these guesses are computed from the training set, they become part of the classifier and are applied, without any change, to the test set.

The above scheme was employed on the MDC, and the results are summarized in the confusion matrix in Table 1(a). The columns represent the true output classes, whereas the rows represent the model assessments based on input information. Each entry contains the number of cases and the percentage relative to its column.

Table 1. (a) Performance of MDC; (b) performance of MDC in dollars.

  (a)           B (true)      G (true)
  B̂ (pred.)    532 (53.8%)   245 (25.5%)
  Ĝ (pred.)    456 (46.2%)   717 (74.5%)
  Overall correct = 64.1%

  (b)           B (true)           G (true)
  B̂ (pred.)    --       (29.5%)   $31,410 (14.0%)
  Ĝ (pred.)    $147,195 (70.5%)   $192,849 (86.0%)
  Overall correct = 58.7%

The overall classifier accuracy (the trace of the matrix over the sum of its entries) increased by almost 8%, and we take this new figure as a worst-case bound for the more involved methods described next. It is interesting to note that the exact same performance was obtained upon deletion of the first input variable, indicating that the day of the week is irrelevant, at least in an MDC sense.

In the context of check authorization systems, it is also significant to monitor the confusion matrix expressed in dollar terms, as shown in Table 1(b). This performance measure gives more weight to larger check amounts, recognizing the proportionately larger impact of fraud in such transactions.
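As a minimal sketch of this benchmark (the NumPy implementation and function names are our own illustration, not the authors' code), mean imputation followed by minimum distance classification per Eq. (1) could look like:

    import numpy as np

    def fit_mdc(X, y):
        """Fit a minimum distance classifier with mean imputation.

        Zeros in X are treated as missing and replaced by the mean of the
        nonzero training values of each column, as in the scheme above.
        Returns the imputation means and the two class mean vectors.
        """
        Xf = X.astype(float)
        col_means = np.array([Xf[Xf[:, j] != 0, j].mean() for j in range(Xf.shape[1])])
        Xf = np.where(Xf == 0, col_means, Xf)               # fill missing entries
        mu = {c: Xf[y == c].mean(axis=0) for c in (0, 1)}   # class centroids
        return col_means, mu

    def predict_mdc(X, col_means, mu):
        """Assign each row to the class whose mean vector is closest (Eq. 1)."""
        Xf = X.astype(float)
        Xf = np.where(Xf == 0, col_means, Xf)
        d0 = np.linalg.norm(Xf - mu[0], axis=1)
        d1 = np.linalg.norm(Xf - mu[1], axis=1)
        return (d1 < d0).astype(int)                        # 1 = accept, 0 = reject

The imputation means are estimated once from the training set and then reused verbatim on the test set, mirroring the procedure described above.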
B. Multilayer Perceptron

The results obtained using the MDC served to guide the design of a (3,100,1) two-layer perceptron trained with backpropagation. The choice of two layers, with at least the hidden layer being nonlinear, is adequate in many problems and gives the network the capability of generating arbitrary decision boundaries [3]. The three inputs to the network are the ones previously found most essential. The number of neurons follows the rough rule of thumb of being one tenth the number of training exemplars, which here is roughly 1000 because half of the "training" set was actually used for cross-validation.

The single output is a "soft" decision in the interval [0,1] during training, but is hard-limited in its final implementation, returning only values in {0,1}. The network architecture is shown in Fig. 2 in vector/matrix notation. The activation functions are logistic sigmoids, and the hard-limiter has a threshold of 0.5.

Fig. 2. Two-layer perceptron.

The input variables have different meanings and incommensurate ranges of values. A standardization of the inputs in terms of z-scores does not change the final performance of a trained network, but does tend to make the learning procedure faster. The means and standard deviations are estimated from the nonzero values in the training set.

The network was trained until the mean squared error on the validation set started its upward trend, the point at which generalization performance starts to degrade. The optimal stopping point occurred after 67 passes of the training data, as shown in Fig. 3.

Fig. 3. Training with cross-validation.

The performance of this neural network on the test set is summarized in Tables 2(a) and (b) in terms of number of cases and dollar amount.

Table 2. (a) Performance of MLP; (b) performance of MLP in dollars.

  (a)           B (true)      G (true)
  B̂ (pred.)    704 (71.3%)   274 (28.5%)
  Ĝ (pred.)    284 (28.7%)   688 (71.5%)
  Overall correct = 71.4%

  (b)           B (true)          G (true)
  B̂ (pred.)    --      (65.3%)   --       (20.1%)
  Ĝ (pred.)    $72,367 (34.7%)   $179,080 (79.9%)
  Overall correct = 72.8%
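The training procedure can be sketched as follows (our own minimal NumPy illustration, not the authors' code; the one-hidden-layer sigmoid architecture, nonzero-based z-scoring, and validation-based stopping follow the description above, while the learning rate, initialization, and epoch limit are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    def zscore_params(X):
        """Per-column mean/std estimated from the nonzero (non-missing) values."""
        mean = np.array([X[X[:, j] != 0, j].mean() for j in range(X.shape[1])])
        std = np.array([X[X[:, j] != 0, j].std() for j in range(X.shape[1])])
        return mean, std

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_mlp(X, y, hidden=100, lr=0.1, max_epochs=200):
        """One-hidden-layer sigmoid network, batch gradient descent on the MSE.
        Half of the data is held out; training stops when validation MSE rises."""
        n, d = X.shape
        half = n // 2
        Xt, yt, Xv, yv = X[:half], y[:half], X[half:], y[half:]
        W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
        W2 = rng.normal(0.0, 0.1, hidden); b2 = 0.0
        best_mse, best = np.inf, None
        for epoch in range(max_epochs):
            h = sigmoid(Xt @ W1 + b1)                  # hidden activations
            out = sigmoid(h @ W2 + b2)                 # soft decision in [0, 1]
            g_out = (out - yt) * out * (1.0 - out)     # MSE gradient at output
            g_h = np.outer(g_out, W2) * h * (1.0 - h)  # backpropagated gradient
            W2 -= lr * h.T @ g_out / half; b2 -= lr * g_out.mean()
            W1 -= lr * Xt.T @ g_h / half; b1 -= lr * g_h.mean(axis=0)
            val = np.mean((sigmoid(sigmoid(Xv @ W1 + b1) @ W2 + b2) - yv) ** 2)
            if val < best_mse:
                best_mse, best = val, (W1.copy(), b1.copy(), W2.copy(), b2)
            else:
                break                                  # validation MSE turned upward
        return best

    # Usage sketch: standardize with the nonzero-based z-scores, then train.
    # mean, std = zscore_params(X_raw)
    # weights = train_mlp((X_raw - mean) / std, y)

The final hard decision is obtained by thresholding the soft output at 0.5, as in the network of Fig. 2.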
C. PNN Algorithm

The inherent property of the Polynomial Neural Network (PNN), or the Group Method of Data Handling (GMDH), is to model complex systems using simple building blocks [4]. This led us to implement the PNN strategy on the check validation problem. As there was a great deal of uncertainty associated with the available data, both for training and validation purposes, two different strategies were implemented. In the first case, to establish a benchmark measure of performance, the PNN methodology was implemented directly on the training and validation data sets [5]. In the second case, the data was preprocessed. Preprocessing of the data was considered a necessary step because of the uncertainty (lack of information) associated with the data. Preprocessing was further necessitated by the fact that more than 98% of the training and validation data had at least one input variable missing. It must be noted that the computational complexity of the simulation is the same in both cases, as the PNN methodology was implemented on the same number of data points.

In the PNN technique, a simple function is combined at each node of a neural network to obtain a more complex function; this function represents the model for the given set of input-output data. A simple function of the following form is used to combine two inputs at each node of the network:

$$ y = A + Bx_i + Cx_j + Dx_i^2 + Ex_j^2 + Fx_ix_j, \quad (3) $$

where y is the output and $x_i$ and $x_j$ are the two inputs. As shown in Fig. 4, the outputs obtained from each of these nodes are then combined to obtain a higher-degree polynomial, so that the best model representing the input-output data may be achieved. The degree of the polynomial increases by two at each layer of the network. The network was restricted to two layers, as it had to be trained for a dichotomous decision only. The best models at each layer were selected using a threshold ($T_h$) on the error value, and the outputs of these best models were then combined at the next layer to obtain a higher-degree polynomial. The best output model, in terms of the minimum predicted squared error, was used as the model for the system.

Fig. 4. PNN model.
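The following sketch illustrates one GMDH-style layer under our own simplifications (the paper selects node models by a predicted squared error against a threshold T_h; here, for brevity, the threshold is applied to the training MSE, and all function names are ours):

    import numpy as np
    from itertools import combinations

    def fit_node(xi, xj, y):
        """Least-squares fit of the quadratic node polynomial of Eq. (3)."""
        Phi = np.column_stack([np.ones(len(xi)), xi, xj, xi**2, xj**2, xi * xj])
        coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return coef, Phi @ coef                       # coefficients and node output

    def pnn_layer(X, y, threshold):
        """Fit Eq. (3) to every input pair; keep outputs whose error beats T_h."""
        kept = []
        for i, j in combinations(range(X.shape[1]), 2):
            coef, out = fit_node(X[:, i], X[:, j], y)
            if np.mean((out - y) ** 2) < threshold:   # T_h selection rule
                kept.append(out)
        # surviving node outputs feed the next layer, raising the degree by two
        return np.column_stack(kept) if kept else None

Stacking two such layers yields the degree-four dichotomous model described above; each node requires only a small linear regression, which is what keeps the method fast.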
Case I: PNN Implementation

A straightforward implementation of the PNN on the training data was carried out to obtain a model for the data. Zeros were used in place of missing information. This was done to obtain a worst-case measure for the neural net model. Other methods, such as averaging, could have been adopted to fill in the missing information, but that would have skewed the results and would not have allowed us to use this case as a benchmark. A model is constructed based on the training data; this model is then tested using the validation data. The results obtained by the implementation of the PNN strategy are summarized in the confusion matrix in Table 3(a). The overall predictability measure was 59.7%, and this is taken to be the worst-case bound for the PNN methodology. In dollar terms, the confusion matrix obtained was as shown in Table 3(b); thus, 59.4% signifies the dollar amount that was correctly predicted in the first case.

Table 3. (a) Performance of PNN; (b) performance of PNN in dollars.

  (a)  Overall correct = 59.7%

  (b)           B (true)           G (true)
  B̂ (pred.)    --       (46.4%)   --       (28.5%)
  Ĝ (pred.)    $111,770 (53.6%)   $160,343 (71.5%)
  Overall correct = 59.4%

Case II: PNN Implementation after Preprocessing

In this case, the data was preprocessed. The entire training and validation set was divided into clusters based on the information available: if all the input variable values were available, all such data was put in one cluster; if variable 2 was not available, all such data was put in a separate cluster; and so on. In the check validation example, four models were obtained. The breakup for the training data was as shown in Table 4.

Table 4. Model clusters.

  Model    Input missing      No. of data points
  MI       None               37
  MII      Age                736
  MIII     Check #            918
  MIV      Age & Check #      278

The PNN methodology was then implemented on each cluster separately, and the results were tested on the preprocessed test data. If the test data contains elements of a cluster which was not present in the training set, then no prediction can be made; the implication is that there is a lack of training data from which to predict the output for such a set. The confusion matrices, both for the data points and for the dollar amounts, were obtained for all four clusters. The overall correct percentages for the data and dollar amount for each model cluster are given in Table 5.

Table 5. Performance of PNN with preprocessing (by cluster).

  Model    Correct, data (%)    Correct, amount (%)
  MI       75.68                82.48
  MII      72.87                75.79
  MIII     73.37                73.29
  MIV      70.50                65.10

The overall correct percentage for the entire data set was also obtained by combining the values calculated from each model cluster; it is shown in Table 6(a), and the dollar values are shown in Table 6(b).

Table 6. (a) Performance of PNN with preprocessing; (b) performance of PNN with preprocessing in dollars.

  (a)           B (true)      G (true)
  B̂ (pred.)    757 (78.7%)   325 (32.9%)
  Ĝ (pred.)    205 (21.3%)   663 (67.1%)
  Overall correct = 72.8%

  (b)           B (true)           G (true)
  B̂ (pred.)    $129,533 (62.1%)   $34,617 (15.4%)
  Ĝ (pred.)    $79,130 (37.9%)    $189,648 (84.6%)
  Overall correct = 73.7%

A marked improvement in the performance of the polynomial neural network strategy was observed when it was combined with a preprocessing stage. The overall correct percentage for the prediction of the output was increased by 13.1%, while the overall correct percentage in dollar terms was increased by 14.3%. Thus, the preprocessing improves the performance significantly without increasing the computational effort. The only drawback of the second case is that it will not be able to predict an output if that particular model is not available in the training data set.

D. Fuzzy Decision Model

An important concern in decision-making problems is how to devise a method that would aid judgments from a strategic point of view by means of summarized information. In this section, a fuzzy model is used to assist the human decision process. Fuzzy model identification based on fuzzy implications and fuzzy reasoning [6] is one of the most important aspects of fuzzy system theory [7]. In this paper, the membership function of a fuzzy set A is represented by $\mu_A(x)$, $x \in X$, and the fuzzy sets are associated with triangular membership functions.

The structure of a fuzzy decision model based upon input-output information is defined as a finite set of linguistic relations or rules, $\{R_i;\ i = 1, \ldots, m\}$, which together form an algorithm:

  $R_i$: If $x_1(k)$ is $A_1^i$ and ... and $x_n(k)$ is $A_n^i$, then $y(k)$ is $B_i$,   (4)

where $x_1, \ldots, x_n$ are the inputs, $A_1^i, \ldots, A_n^i$ are the fuzzy sets in $X_1, \ldots, X_n$, and $B_i$ is the fuzzy set in $Y$, with appropriate membership functions. $X_j$ ($j = 1, \ldots, n$) and $Y$ are the universes of discourse of $x_j$ and $y$, respectively. After constructing the fuzzy model from linguistic rules, the compositional rule of inference [8] is called upon to infer the output fuzzy variable from given input information. The output fuzzy set is calculated from the expression

$$ \mu_{B'}(y) = \max_i \min\left[ \min\left\{ \mu_{A_1^i}(x_1^0), \ldots, \mu_{A_n^i}(x_n^0) \right\},\; \mu_{B_i}(y) \right], \quad y \in Y, \quad (5) $$

where $x_j^0$, $j = 1, \ldots, n$, is a given input singleton. The centroid defuzzification method is used to arrive at a crisp output value $y^0$ [9]:

$$ y^0 = \frac{\int_Y y\, \mu_{B'}(y)\, dy}{\int_Y \mu_{B'}(y)\, dy}. \quad (6) $$

We construct the membership functions from the collected data set using a fuzzy c-means (FCM) clustering method [10]. A fuzzy clustering of X (a crisp data set) into c clusters is a process of assigning a grade of membership of each element to every cluster. The fuzzy clustering problem is formulated as

$$ \min_{U,V}\ J_m(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m (d_{ik})^2 \quad (7) $$

subject to

$$ \sum_{i=1}^{c} u_{ik} = 1, \quad u_{ik} \ge 0, \quad 1 \le i \le c, \; 1 \le k \le n, \quad (8) $$

where n is the number of data points to be clustered, c is the number of clusters, m is a scalar (m > 1), $d_{ik} = \|x_k - v_i\|$ is the Euclidean distance between each data point $x_k$ in $X = \{x_1, \ldots, x_n\}$ and the center $v_i$ of cluster i, and $u_{ik}$ is the membership grade of the kth data point in cluster i. If m = 1, the problem becomes one of hard clustering, whereas if m > 1, the clusters become fuzzy; more fuzziness can be obtained by increasing the value of m. The center $v_i$ of the fuzzy cluster is calculated by

$$ v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m x_k}{\sum_{k=1}^{n} (u_{ik})^m}, \quad 1 \le i \le c. \quad (9) $$
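A compact FCM sketch following (7)-(9) is given below. The alternating membership update used here is the standard necessary condition for minimizing (7) under (8), which the paper does not spell out; the function name, seed, and tolerance are our own choices:

    import numpy as np

    def fcm(X, c=10, m=2.0, iters=100, tol=1e-5, seed=0):
        """Fuzzy c-means on data X of shape (n, d).

        Alternates the membership update with the center update of Eq. (9)
        until the centers stop moving. Returns centers V (c, d) and the
        membership grades U (c, n).
        """
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        V = X[rng.choice(len(X), size=c, replace=False)]       # initial centers
        for _ in range(iters):
            # d_ik: distance from every center i to every point k
            D = np.maximum(np.linalg.norm(X[None] - V[:, None], axis=2), 1e-12)
            # u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)), the minimizer of (7)
            U = 1.0 / ((D[:, None] / D[None]) ** (2.0 / (m - 1.0))).sum(axis=1)
            Um = U ** m
            V_new = (Um @ X) / Um.sum(axis=1, keepdims=True)   # Eq. (9)
            if np.linalg.norm(V_new - V) < tol:
                break
            V = V_new
        return V, U

For a single variable such as the check amount, one would call fcm(x4.reshape(-1, 1), c=10, m=2.0), matching the c = 10, m = 2 setting used below.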
Four input variables (x1, ..., x4: day of week, age, check number, and check amount) and one output variable (y: decision) are defined in the data set. The day of the week (x1) is a crisp value that cannot be fuzzified. The FCM clustering algorithm is applied to the training data, except x1, resulting in ten clusters. Fig. 5(a) shows the convergence of the centers $v_i$, $i = 1, \ldots, 10$, in the check amount (x4) variable, and Fig. 5(b) shows the membership grades obtained from the training data x4. Thus, ten linguistic terms are derived for each of the input variables (x2, x3, and x4), as shown in Fig. 6. The possible values for the output are either 1 (the check is approved) or 0 (the check is turned down), and we assign two linguistic terms for the decision (Fig. 6). If the output of the fuzzy decision is greater than 0.5, the final decision is "approved"; if less than 0.5, the check is "turned down" (an unbiased threshold).

Fig. 5. (a) Convergence of the centers of the 10 clusters in the check amount variable. (b) The final result of the FCM (c = 10, m = 2).

Fig. 6. Membership functions of the decision (two fuzzy sets) and check amount (ten fuzzy sets).

The rule base for the fuzzy decision model contains rules of the form of (4). Table 7 shows 100 such rules, relating check amount and check number for a fixed day (crisp variable) and age (fuzzy variable); each cell holds the binary decision. The training data set includes a great deal of missing information, represented as 0, but this can easily be handled after assigning membership functions to it. In the case of an incomplete rule base, the empty rule cells are filled by observing the underlying pattern [11]. The complete rule set consists of 7,000 rules (7 × 10³) and includes the day of the week.

Table 7. Rule base between check amount and check number with fixed day and age (N_i: check number, A_i: check amount).

Fig. 7 shows the final result on the testing data set (1950 points). The shaded region represents incorrect decisions. The final analysis based upon the testing data set is shown in Table 8. The overall correct decision in terms of the number of cases is 74.7%, and in dollar amount, 75.1%.

Fig. 7. Correct and incorrect decision results from the fuzzy decision model (FDM).

Table 8. (a) Performance of FDM; (b) performance of FDM in dollars.

  (a)           B (true)      G (true)
  B̂ (pred.)    741 (77.0%)   273 (27.6%)
  Ĝ (pred.)    221 (23.0%)   715 (72.4%)
  Overall correct = 74.7%

  (b)           B (true)           G (true)
  B̂ (pred.)    $150,731 (72.2%)   $49,689 (22.2%)
  Ĝ (pred.)    --       (27.8%)   --      (77.8%)
  Overall correct = 75.1%
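To make the inference path concrete, here is a small sketch of the max-min composition (5) with centroid defuzzification (6) over triangular membership functions; the single rule and all numbers in the usage example are invented for illustration and are not taken from the paper's rule base:

    import numpy as np

    def tri(x, a, b, c):
        """Triangular membership function with peak at b and feet at a and c."""
        return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

    def infer(rules, inputs, y_grid):
        """Compositional rule of inference (5) with centroid defuzzification (6).

        rules : list of (antecedent_mfs, consequent_mf) pairs, one antecedent
                membership function per input variable.
        inputs: crisp input singletons x_j^0.
        """
        agg = np.zeros_like(y_grid)
        for antecedents, consequent in rules:
            w = min(mf(x) for mf, x in zip(antecedents, inputs))       # firing strength
            agg = np.maximum(agg, np.minimum(w, consequent(y_grid)))   # max-min
        return (y_grid * agg).sum() / max(agg.sum(), 1e-12)            # centroid, Eq. (6)

    # Hypothetical rule: "if amount is LOW then decision is ACCEPT".
    y = np.linspace(0.0, 1.0, 101)
    rules = [([lambda x: tri(x, 0.0, 100.0, 200.0)],   # LOW amount
              lambda g: tri(g, 0.5, 1.0, 1.5))]        # ACCEPT decision
    print(infer(rules, [120.0], y))   # above 0.5, so the check is approved

Missing inputs, coded as 0, can be accommodated simply by assigning them their own membership functions, as noted above.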
IV. DISCUSSION AND CONCLUSIONS

Based on the overall classification accuracy, we may conclude that linear minimum distance classifiers are not capable of directly handling missing data, yielding only slightly better performance than pure random guessing. However, preprocessing the database by replacing the zeros with their estimated mean values improves the overall performance by almost 8%.

The two-layer perceptron is able to bring the overall classification accuracy to a little more than 70%, and has much better balance (the diagonal elements of the confusion tables are closer to each other) than the benchmark method.

The polynomial neural network with preprocessing of the data was able to improve the model accuracy by nearly 14% over the benchmark. This was obtained without any additional computational effort; it is a reasonably fast process, as it only requires the regression of a quadratic equation at each node.

In the fuzzy decision model case, the result, in terms of both the number of cases and the dollar amount, is slightly better than the other methods presented in this paper. We can conclude from these results that a fuzzy system, as a universal approximator, is efficient in handling imprecise information.

Considering the importance of achieving maximum recognition rates in the check approval application, the marked improvement in performance from all three nonlinear decision makers in this study justifies the significant additional effort involved in defining and training such systems, and warrants further investigation.

REFERENCES

[1] L. Sloane, "Checking out checks: The methods stores use," The New York Times, vol. 140, p. 44(L), col. 1, Nov. 24, 1990.
[2] N. Ahmed and K. R. Rao, Orthogonal Transforms for Digital Signal Processing. New York: Springer-Verlag, 1975.
[3] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[4] S. J. Farlow, ed., Self-Organizing Methods in Modeling: GMDH Type Algorithms. New York: Marcel Dekker, 1984.
[5] R. L. Barron et al., "Applications of polynomial neural networks to FDIE and reconfigurable flight control," in Proc. IEEE National Aerospace and Electronics Conf., vol. 2, pp. 507-519, 1990.
[6] L. A. Zadeh, "Outline of a new approach to the analysis of complex systems and decision processes," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, no. 1, pp. 28-44, Jan. 1973.
[7] G. J. Klir and T. A. Folger, Fuzzy Sets, Uncertainty, and Information. New Jersey: Prentice Hall, 1988.
[8] M. Mizumoto, "Fuzzy controls under various fuzzy reasoning methods," Information Sciences, vol. 45, pp. 129-151, 1988.
[9] C. C. Lee, "Fuzzy logic control systems: Fuzzy logic controller, Parts I and II," IEEE Trans. Syst., Man, Cybern., vol. 20, no. 2, pp. 404-435, 1990.
[10] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum Press, 1981.
[11] R. M. Tong, "Synthesis of fuzzy models for industrial processes: Some recent results," Int. J. General Systems, vol. 4, pp. 143-162, 1978.