Consumers’ Preferences Modeling With Multiclass Fuzzy Support Vector Machines

Chih-Chieh Yang
Department of Multimedia and Entertainment Science, Southern Taiwan University, No. 1, Nantai Street, Yongkang City, Tainan County 71005, Taiwan

Meng-Dar Shieh
Department of Industrial Design, National Cheng Kung University, Tainan 70101, Taiwan

Abstract

Consumers’ preferences toward product design are often affected by a large number of form features, so it is very important for product designers to understand the relationship between consumers’ preferences and product form features. In this paper, an approach based on multiclass fuzzy support vector machines (multiclass fuzzy SVM) is proposed to construct a prediction model of consumers’ preferences. Product samples are collected and their form features are systematically examined. Each product sample is assigned a class label and a fuzzy membership that expresses the degree of agreement with this label, so that the task is formulated as a multiclass classification problem. A one-versus-one multiclass fuzzy SVM model is constructed from the collected product samples, and the optimal training parameter set of the model is determined by a two-step cross-validation. A case study of mobile phone design is given to demonstrate the effectiveness of the proposed methodology. Two standard kernel functions, the polynomial kernel and the Gaussian kernel, are used and their performance is compared. The experimental results show that the Gaussian kernel model outperforms the polynomial kernel model: it performed very well and is capable of preventing the overfitting problem.

Keywords: Consumers’ preferences; Multiclass fuzzy support vector machines; Mobile phone design

1. Introduction

The appearance of a product is one of the most important factors affecting consumers’ purchase decisions. Traditionally, the quality of product form design has depended heavily on designers’ intuition, which does not guarantee success in the marketplace. In order to understand consumers’ preferences and develop appealing products in a more effective manner, much research has studied product form design with systematic approaches. The most noticeable line of research is Kansei Engineering (Jindo, Hirasago et al. 1995). The main issue is how to deal with the correlations between attributes and the nonlinear properties of attributes (Shimizu and Jindo 1995; Park and Han 2004). The techniques most commonly adopted in the product design field, such as multiple regression analysis (Park and Han 2004) or multivariate analysis (Shimizu and Jindo 1995), depend heavily on assumptions of independence and linearity, and hence cannot deal effectively with the nonlinearity of the relationship. In addition, prior to establishing the mathematical model, data simplification and variable screening are often needed to obtain better results (Hsu, Chuang et al. 2000). Fuzzy regression analysis (Shimizu and Jindo 1995) and other methods suffer from the same shortcomings (Park and Han 2004).

Vapnik (1995) developed a new kind of learning algorithm called the support vector machine (SVM). SVM has been shown to provide higher performance than traditional learning techniques (Burges 1998). Its remarkable and robust performance on sparse and noisy data makes it a first choice in a number of applications such as pattern recognition (Burges 1998) and bioinformatics (Scholkopf, Guyon et al. 2001). SVM is also known for its elegance in solving nonlinear problems with the technique of "kernels," which implicitly performs a nonlinear mapping to a feature space.
As a consequence, the nonlinear relationship between product form features and consumers’ preferences can be processed effectively by introducing a suitable kernel function. This study proposes an approach based on multiclass fuzzy SVM for modeling consumers’ preferences. The approach begins by processing product forms with discrete and continuous attributes, and it can also deal with sparse feature vectors. Each product sample is assigned a class label and a fuzzy membership that describes the semantic differential score expressing agreement with this label. A one-versus-one multiclass fuzzy SVM model is constructed from the collected product samples, and the optimal training parameter set of the model is determined by a two-step cross-validation.

The remainder of the paper is organized as follows. Section 2 gives an introduction to multiclass fuzzy SVM. Section 3 presents the proposed prediction model of consumers’ preferences. Section 4 demonstrates the experimental results of the proposed model using mobile phone design as an example. Finally, Section 5 presents some brief conclusions and suggestions for future work.

2. Multiclass fuzzy support vector machines

2.1. Fuzzy support vector machines for binary classification

An SVM maps the input points into a high-dimensional feature space and finds a separating hyperplane that maximizes the margin between the two classes in that space. Maximizing the margin is a quadratic programming (QP) problem, which can be solved through its dual by introducing Lagrange multipliers. Without any explicit knowledge of the mapping, the SVM finds the optimal hyperplane by using dot product functions in the feature space with the aid of kernels. The solution of the optimal hyperplane can be written as a combination of a few input points, called support vectors.

In many real-world applications, input samples may not be exactly assigned to one class, and the effects of the training samples might differ. Some samples are more important and should be fully assigned to one class so that the SVM can separate them more correctly; others might be noisy, less meaningful, and better discarded. Treating every data sample equally may cause overfitting, and the original SVM lacks the ability to make this distinction. Huang and Liu (2002) and Lin and Wang (2002) proposed the concept of fuzzy SVM, which combines fuzzy logic and SVM to make different training samples contribute differently to their own class. The core of their concept is to fuzzify the training set and assign each data sample a membership value according to its relative importance in the class. A description of fuzzy SVM is given in the Appendix.

Figure 1 illustrates a simplified binary classification problem with only two attributes, trained by fuzzy SVM using a linear kernel. Since all data samples have only two attributes, the data points can be plotted on a 2D plane and the training results explained in a more intuitive manner. Red and blue disks are the two classes of training samples. Grey values indicate the value of the argument $\sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b$ of the decision function in Eq. (14) in the Appendix. A new data sample without a given class label can then be discriminated according to Eq. (3) in the Appendix. In Figure 1 the middle solid line is the decision surface, and data points lying on this surface satisfy $D(x) = 0$. The outer dashed lines exactly meet the constraint in Eq. (6) in the Appendix, and data points lying on this margin satisfy $D(x) = 1$ or $D(x) = -1$. In addition, support vectors are very useful in data analysis and interpretation.
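As a concrete illustration of the weighted-error idea described above, the short Python sketch below (our illustration, not the authors' implementation) trains a binary SVM in which each sample's contribution is scaled by its fuzzy membership. It relies on the fact that scikit-learn's SVC accepts per-sample weights that rescale C for each sample, which matches the fuzzy SVM objective in which the error on sample i is penalized by $s_i C$; the toy data mirror the two-attribute setting of Figure 1.

```python
import numpy as np
from sklearn.svm import SVC

# Two small two-attribute classes, mirroring the toy setting of Figure 1.
X = np.array([[0.2, 0.3], [0.1, 0.4], [0.3, 0.2],   # class -1
              [0.8, 0.7], [0.9, 0.6], [0.7, 0.8]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# Fuzzy memberships in (0, 1]; lowering a membership makes that sample easier
# to ignore, which widens the margin on its side (cf. Figure 1(b) and 1(c)).
membership = np.array([1.0, 1.0, 0.1, 1.0, 1.0, 1.0])

clf = SVC(kernel="linear", C=10.0)
clf.fit(X, y, sample_weight=membership)   # sample_weight rescales C per sample

print(clf.support_)               # indices of the support vectors
print(clf.decision_function(X))   # values of D(x) for the training points
```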
In the original definition of SVM, the data points satisfying the condition $\alpha_i > 0$ are called support vectors. In fuzzy SVM, the same value of $\alpha_i$ may indicate a different type of support vector because of the factor $s_i$ (Lin and Wang 2002): a point with $\alpha_i = s_i C$ is misclassified, while a point with $0 < \alpha_i < s_i C$ lies on the margin of the hyperplane (marked by extra circles in Figure 1). Different fuzzy memberships for the two classes were applied to demonstrate the training effect of $s_i$. In Figure 1(a) both classes are assigned memberships equal to 1, and the training result is exactly the same as that of a standard SVM. In Figure 1(b) and Figure 1(c), the blue and the red data points, respectively, are assigned memberships equal to 0.1. Since $s_i$ expresses the attitude of the corresponding data sample toward a certain class label, the training results are sensitive to the memberships of the training samples.

Tuning the parameter C balances the minimization of the error function against the maximization of the margin of the optimal hyperplane. Taking the same two-dimensional dataset of Figure 1(a) for example, Figure 2 shows the influence of the parameter C, ranging from 10000 down to 0.1, for a polynomial kernel with degree 2. Increasing C makes the trained SVM commit fewer misclassifications with a narrower margin, while decreasing C makes the SVM ignore more training points and obtain a wider margin.

Figure 1. The relationship between membership value and training margin of fuzzy SVM (linear kernel): (a) all data memberships = 1.0; (b) all blue data memberships = 0.1; (c) all red data memberships = 0.1.

Figure 2. The relationship between the parameter C and the training margin (polynomial kernel with degree 2), for C = 10000, 1000, 100, 10, 1, and 0.1.

2.2. One-versus-one multiclass support vector machines

The previous section described the concept of fuzzy SVM. However, fuzzy SVM is still limited to binary classification, and how to effectively extend SVM to multiclass classification is still an ongoing research issue. Hsu and Lin (2001) and Duan and Keerthi (2005) compared the performance of several multiclass SVM methods built on the binary SVM, including "one-versus-rest" (OVR) (Bottou, Cortes et al. 1994), "one-versus-one" (OVO) (Krebel 1999), and the directed acyclic graph SVM (DAGSVM) (Platt, Cristianini et al. 2000). The binary SVM is used as a component in these multiclass classification algorithms. Since the OVO method and DAGSVM were shown to have higher accuracy in practical use (Hsu and Lin 2001), the OVO method was adopted in this study. The OVO method constructs $n(n-1)/2$ binary SVMs for an $n$-class problem, where each of the $n(n-1)/2$ SVMs is trained on the data samples from two classes. The data samples are partitioned by a series of optimal hyperplanes; an optimal hyperplane means that the training data are maximally distant from the hyperplane itself and the lowest classification error rate is achieved when this hyperplane is used to classify the current training set. These hyperplanes can be obtained by modifying Eq. (2) in the Appendix as

$w_{st} \cdot z + b_{st} = 0$    (1)

and the decision functions are defined as $D_{st}(x_i) = w_{st} \cdot z_i + b_{st}$, where s and t denote two arbitrary classes separated by an optimal hyperplane among the n classes, $w_{st}$ is the weight vector, and $b_{st}$ is the bias term. After all $n(n-1)/2$ classifiers are constructed, a max-win voting strategy is used to examine all data samples (Krebel 1999). Each of the $n(n-1)/2$ OVO SVMs casts one vote: if $D_{st}(x_i)$ says $x_i$ is in the s-th class, then the vote of $x_i$ for the s-th class is increased by one; otherwise, the vote for the t-th class is increased by one. Then $x_i$ is predicted to be in the class with the largest number of votes. Since fuzzy SVM is a natural extension of the traditional SVM, the same OVO scheme can be used to deal with the multiclass problem without any difficulty.
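The voting scheme just described can be sketched in a few lines of Python (a hedged illustration with assumed helper names, not the authors' code): one weighted binary SVM is trained for every pair of classes, and each test sample is assigned to the class that wins the most pairwise votes.

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def ovo_fuzzy_svm_predict(X, labels, membership, test_X, C=10.0, kernel="rbf", gamma=0.25):
    """Train n(n-1)/2 pairwise SVMs with membership weights and apply max-win voting."""
    X, labels, membership, test_X = map(np.asarray, (X, labels, membership, test_X))
    classes = np.unique(labels)
    votes = np.zeros((len(test_X), len(classes)), dtype=int)
    for a, b in combinations(range(len(classes)), 2):
        mask = np.isin(labels, [classes[a], classes[b]])
        y_pair = np.where(labels[mask] == classes[a], 1, -1)
        clf = SVC(C=C, kernel=kernel, gamma=gamma)
        clf.fit(X[mask], y_pair, sample_weight=membership[mask])
        pred = clf.predict(test_X)
        votes[pred == 1, a] += 1    # D_st(x) says x belongs to class s
        votes[pred == -1, b] += 1   # otherwise the vote for class t grows
    return classes[np.argmax(votes, axis=1)]
```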
3. Prediction model of consumers’ preferences

This study aims to develop a prediction model based on consumers’ preferences. The product samples are analyzed by considering the sparse and mixed properties of form features, and class labels are used to describe consumers’ preferences toward the product samples. The product samples are collected and their form features are systematically examined. Consumers are asked to assign the single most suitable class label to each product sample. After analyzing the form features of the product samples and collecting the consumers’ evaluation data, an OVO multiclass fuzzy SVM model is constructed. In order to obtain the optimal training model, a two-step cross-validation is used to search for the best combination of parameters.

3.1. Processing sparse and mixed product form features

Two characteristics of product form features are considered in this study. First, the form feature vector is often sparse. This is mainly because a large number of features is usually needed to represent product form design, and each product sample does not necessarily possess all form features; the number of active or non-zero features in a feature vector is therefore lower than the total number of features. This situation is very common in product form feature representation (Kwahk and Han 2002). Second, product form features are often a mixture of two kinds of attributes, denoted as "discrete" and "continuous" types. Discrete attributes denote categorical choices among a fixed number of variables, such as types of texture or the material used in parts. Continuous attributes, such as length and proportion, have some kind of scale or can be measured, and the domain of the variable is continuous without interruption.

SVM can deal with mixed discrete and continuous attributes at the same time. Since SVM requires that each data sample be represented as a vector of real numbers, discrete attributes can be represented as integers. Taking a three-category attribute "circle, rectangle, triangle" for example, it can be coded as {1, 2, 3}. As for continuous attributes, because kernel values usually depend on the inner products of feature vectors (e.g., the linear kernel and the polynomial kernel), large attribute values might cause numerical problems (Hsu, Chang et al. 2003). Continuous attributes are therefore linearly scaled to the range [0, 1] to avoid numerical difficulties during calculation.
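For illustration, the sketch below (attribute values are hypothetical, not taken from the study) prepares such a mixed feature vector in Python: discrete attributes are coded as integers, following the paper's own "circle, rectangle, triangle" example and the body-type categories of Table 2, and continuous attributes are linearly rescaled to [0, 1].

```python
import numpy as np

SHAPE = {"circle": 1, "rectangle": 2, "triangle": 3}     # discrete coding as in Section 3.1
BODY_TYPE = {"block": 1, "flip": 2, "slide": 3}           # body type categories (Table 2)

def scale_to_unit(values):
    """Linear min-max scaling of a continuous attribute to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo) if hi > lo else np.zeros_like(values)

# Hypothetical raw measurements for three samples: length and width in mm.
lengths = scale_to_unit([95.0, 88.0, 102.0])
widths = scale_to_unit([44.0, 46.0, 42.0])
body = [BODY_TYPE["flip"], BODY_TYPE["block"], BODY_TYPE["slide"]]
shape = [SHAPE["circle"], SHAPE["rectangle"], SHAPE["circle"]]

feature_matrix = np.column_stack([lengths, widths, body, shape])
print(feature_matrix)
```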
3.2. Describing consumers’ preferences using class labels

The concept of product positioning is borrowed to describe consumers’ preferences toward product form design. Kaul and Rao (1995) suggested that a company should provide an array of products in the marketplace in order to meet the needs of each homogeneous consumer segment. Conversely, consumers often make choices in the marketplace according to the perceived product attributes. Based on this idea, product samples are assumed to be distinguished by consumers and classified into different groups, and managerial decisions can be made more effectively by identifying the relative importance attached to various product attributes. Taking mobile phone design for example, class labels such as sports, simplicity, female, plain, and business are used to describe the different product divisions provided in the marketplace. Although other product characteristics (brand, price, etc.) may also affect consumers’ subjective perceptions, the authors focus only on the factors of product form design; other marketing strategies that may influence consumers’ decisions are beyond the scope of this study.

3.3. Collecting product samples

A total of 69 mobile phones were collected from the Taiwan market in 2006. Three product designers, each with at least 5 years of experience, conducted the product form feature analysis. They first examined the main component structure using the method proposed by Kwahk and Han (2002) and then used this structure to analyze all product samples. The form features of each product sample were discussed by all designers to determine one unified representation. Continuous attributes were recorded directly, while discrete attributes were processed by the method described in Section 3.1. The color and texture information of the product samples was ignored; only the form features were considered. All entries in the feature matrix were prepared for training the multiclass fuzzy SVM.

Five class labels, namely sports, simplicity, female, plain, and business, were chosen for the semantic evaluations. In order to collect consumers’ perception data for mobile phone design, 30 subjects, including 15 males and 15 females, were asked to evaluate all product samples using the selected five class labels. Each subject was asked to choose the most suitable class label for representing each product sample and to evaluate the sample on a semantic differential scale from 0 (very low) to 1 (very high). Since each product sample had only a single instance when training the multiclass fuzzy SVM model, the most frequently assigned label was used to represent each product sample; training with multiple instances of samples is another interesting issue worthy of further research. The selected class label is assigned as +1, and the rest of the labels are assigned as -1. The semantic differential score is directly stored as the membership value for fuzzy SVM training.
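A minimal sketch of this aggregation step is given below (our illustration with made-up evaluations; in particular, averaging the winning label's scores across subjects is an assumption, since the paper does not state how the 30 individual scores are reduced to a single membership value).

```python
from collections import Counter

LABELS = ["sports", "simplicity", "female", "plain", "business"]

def aggregate(evaluations):
    """evaluations: list of (chosen_label, score in [0, 1]) pairs, one per subject."""
    counts = Counter(label for label, _ in evaluations)
    winner, _ = counts.most_common(1)[0]                      # most frequently assigned label
    scores = [score for label, score in evaluations if label == winner]
    membership = sum(scores) / len(scores)                    # assumption: average score of the winning label
    targets = {label: (+1 if label == winner else -1) for label in LABELS}
    return winner, membership, targets

example = [("plain", 0.5), ("plain", 0.6), ("sports", 0.4)]
print(aggregate(example))    # ('plain', 0.55, {...})
```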
3.4. Constructing the multiclass fuzzy SVM model

In this study, each product sample is assigned a class label to formulate a multiclass classification problem, which is then divided into a series of OVO SVM sub-problems. The objective of multiclass classification is to correctly discriminate these classes from each other, and each OVO problem addresses two different class labels (e.g., sports versus simplicity). Each classifier uses the fuzzy SVM to define a hyperplane that best separates the product samples into two classes. Each test sample is sequentially presented to each of the $5(5-1)/2 = 10$ OVO classifiers, and its class label is predicted as the one receiving the largest number of votes.

3.5. Choosing optimal parameters using cross-validation

Since the number of product samples is limited, it is important to obtain the best generalization performance and to reduce the overfitting problem. A practical implementation is to partition the data samples into training data and testing data. Various partitioning strategies have been proposed, including leave-one-out cross-validation (LOOCV), k-fold cross-validation, repeated random subsampling, and bootstrapping (Berrar, Bradbury et al. 2006). In this study, 5-fold cross-validation is used to choose the optimal parameters. The whole set of training samples is randomly divided into five subsets of approximately equal size. Each multiclass model is trained using $5 - 1 = 4$ subsets and tested using the remaining subset. Training is repeated five times, and the average testing error rate over the five held-out subsets is calculated.

The performance of an SVM model depends heavily on the regularization parameter C and the parameter of the chosen kernel function. Taking the Gaussian kernel for example, each binary classifier requires the selection of two parameters, the regularization parameter C and the kernel parameter $\sigma^2$. For computational efficiency, C and $\sigma^2$ are set to the same values for every classifier within the multiclass model. Since cross-validation may be very time-consuming, a two-step grid search is conducted to find the optimal hyperparameter pair (Hsu, Chang et al. 2003). In the first step, a coarse grid search is performed using the following sets of values: $C \in \{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$ and $\sigma^2 \in \{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$; thus 49 combinations of C and $\sigma^2$ are tried in this step, and an optimal pair $(C_0, \sigma_0^2)$ is selected from the coarse grid search. In the second step, a fine grid search is conducted around $(C_0, \sigma_0^2)$, where $C \in \{0.2C_0, 0.4C_0, \ldots, 0.8C_0, C_0, 2C_0, 4C_0, \ldots, 8C_0\}$ and $\sigma^2 \in \{0.2\sigma_0^2, 0.4\sigma_0^2, \ldots, 0.8\sigma_0^2, \sigma_0^2, 2\sigma_0^2, 4\sigma_0^2, \ldots, 8\sigma_0^2\}$. Altogether, 81 combinations of C and $\sigma^2$ are tried in this step, and the optimal hyperparameter pair is selected from this fine search. The same two-step grid search is likewise applied to the polynomial kernel, with a coarse grid of $C \in \{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$ and $p \in \{1, 2, \ldots, 5\}$. Once $(C_0, p_0)$ is determined, the range of the fine grid search is $C \in \{0.2C_0, 0.4C_0, \ldots, 0.8C_0, C_0, 2C_0, 4C_0, \ldots, 8C_0\}$ and $p \in \{0.2p_0, 0.4p_0, \ldots, 0.8p_0, p_0, 1.2p_0, 1.4p_0, \ldots, 1.8p_0\}$. After comparing the performance of all training models with different kernel functions and parameters, the best combination of parameters obtained by cross-validation is used to build the multiclass fuzzy SVM model.
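The two-step search can be sketched as follows (a minimal illustration, not the authors' code: a plain one-versus-one SVC stands in for the fuzzy multiclass model, and the error estimate is one minus the 5-fold cross-validation accuracy). The grid values follow the text above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def cv_error(X, y, C, sigma2):
    # scikit-learn parameterizes the Gaussian kernel with gamma = 1 / (2 * sigma^2);
    # a single SVC uses the same (C, sigma^2) for all pairwise classifiers.
    clf = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma2), decision_function_shape="ovo")
    return 1.0 - cross_val_score(clf, X, y, cv=5).mean()

def two_step_search(X, y):
    coarse = [10.0 ** k for k in range(-3, 4)]                        # 10^-3 ... 10^3
    C0, s0 = min(((C, s) for C in coarse for s in coarse),
                 key=lambda pair: cv_error(X, y, *pair))              # 49 coarse trials
    factors = (0.2, 0.4, 0.6, 0.8, 1, 2, 4, 6, 8)
    fine_C = [f * C0 for f in factors]
    fine_s = [f * s0 for f in factors]
    return min(((C, s) for C in fine_C for s in fine_s),
               key=lambda pair: cv_error(X, y, *pair))                # 81 fine trials
```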
4. Experimental results

4.1. Data set

Mobile phone design was selected to demonstrate the proposed methodology. Table 1 shows part of the product samples used in this study. The set $S_i$ ($i = 1, 2, \ldots, 10$) represents part of the product samples to be analyzed; the set $X_i$ ($i = 1, 2, \ldots, 6$) denotes the product form feature attributes; the set $Y_i$ ($i = 1, 2, \ldots, 5$) represents the class labels; and $s_i$ is the membership value of the +1 class label of product sample $S_i$. For the sake of simplicity, only six product form features are listed in the example of Table 1:

X = {X1, X2, X3, X4, X5, X6} = {body length, body width, body thickness, body volume, body type, function button type}

Five class labels are used to describe consumers’ subjective perceptions of mobile phone design:

Y = {Y1, Y2, Y3, Y4, Y5} = {sports, simplicity, female, plain, business}

Take product sample S1 as an example: the consumers chose the label Y4 "plain", and the attitude toward Y4 is 0.5. A complete list of all product form features is shown in Table 2.

Sample  X1    X2    X3    X4    X5  X6  |  Y1  Y2  Y3  Y4  Y5  |  Membership (s)
S1      0.75  0.45  0.72  0.62  2   3   |  -1  -1  -1  +1  -1  |  0.5
S2      0.67  0.43  0.64  0.47  3   3   |  -1  -1  +1  -1  -1  |  0.8
S3      0.79  0.42  0.57  0.48  1   3   |  +1  -1  -1  -1  -1  |  0.5
S4      0.75  0.44  0.60  0.50  3   3   |  -1  -1  -1  +1  -1  |  0.6
S5      0.67  0.42  0.77  0.54  2   2   |  -1  -1  +1  -1  -1  |  0.6
S6      0.72  0.48  0.53  0.47  2   3   |  -1  -1  -1  +1  -1  |  0.9
S7      1.00  0.44  0.56  0.63  1   1   |  -1  -1  +1  -1  -1  |  0.7
S8      0.77  0.45  0.81  0.71  2   1   |  -1  -1  +1  -1  -1  |  1.0
S9      0.75  0.45  0.72  0.62  2   3   |  -1  -1  +1  -1  -1  |  0.6
S10     0.67  0.43  0.64  0.47  3   3   |  -1  -1  +1  -1  -1  |  0.8

Table 1. Part of the training product samples for mobile phone design.

Component        Form feature             Type        Attributes
Body             Length (X1)              Continuous  None
                 Width (X2)               Continuous  None
                 Thickness (X3)           Continuous  None
                 Volume (X4)              Continuous  None
                 Type (X5)                Discrete    Block body (X51), Flip body (X52), Slide body (X53)
Function button  Type (X6)                Discrete    (X61), (X62), (X63)
                 Style (X7)               Discrete    Round (X71), Square (X72), (X73)
Number button    Shape (X8)               Discrete    Circular (X81), Regular (X82), Asymmetric (X83)
                 Arrangement (X9)         Discrete    Square (X91), Vertical (X92), Horizontal (X93)
                 Detail treatment (X10)   Discrete    (X101), (X102), (X103), (X104)
Panel            Position (X11)           Discrete    Middle (X111), Upper (X112), Lower (X113), Full (X114)
                 Shape (X12)              Discrete    Square (X121), Fillet (X122), Shield (X123), Round (X124)

Table 2. Complete list of product form features used in this study.

4.2. Training effect of different kernel functions

The training effects of the polynomial kernel and the Gaussian kernel were investigated with the whole set of product samples. Average training accuracies of the kernel functions and the corresponding parameters are shown in Figure 3. For the polynomial kernel in Figure 3(a), the average error rates of the linear kernel were larger than 40% for all values of the parameter C. When p = 2, the regulating effect of the parameter C was most obvious: as C decreased from 1000 to 0.001, the average error rate increased from 0% to 34.8%. This is because the parameter C adjusts the margin of the optimal hyperplane; training with a smaller C results in a larger margin, so the training error can also increase. The parameter C had a similar regulating effect when p = 3, but the training error rate increased more drastically than for p = 2. Although the training accuracies of the polynomial kernel with p > 1 were all superior to those of the linear kernel, these models might suffer from overfitting and have poor generalization ability.

For the Gaussian kernel in Figure 3(b), the regulating effect of the parameter C was less pronounced than for the polynomial kernel, for all values of the kernel parameter. It has been reported (Wang, Xu et al. 2003) that both too large and too small values of σ lead to poor generalization performance, and our results exhibited similar effects of σ. For large values of σ, such as $\sigma^2 \ge 10$, all training data were effectively regarded as a single point; as a consequence, the training model cannot recognize new data and the training error rate is very high. On the other hand, for sufficiently small values of σ, all training data were regarded as support vectors and could be separated correctly, so the training error rate declined sharply; for untrained data, however, such a model may not give good results because of overfitting. In general, the linear kernel performed worse than the nonlinear kernels. Unlike the linear kernel, the polynomial kernel and the Gaussian kernel are able to nonlinearly map the training samples into a higher-dimensional space, so they can handle the case in which the relation between product form features and class labels is nonlinear. Since each kernel function has different properties and generalization performance, the advantages of different kernel functions can be combined by using mixtures of kernels (Smits and Jordaan 2002). In addition, there exist theorems which can help to build kernel functions that take domain knowledge into consideration (Barzilay and Brailovsky 1999); these issues are beyond the scope of this paper.

Figure 3. Average training accuracies using (a) the polynomial kernel and (b) the Gaussian kernel.
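To make the kernel parameterization concrete, the following lines show one way (our mapping, not stated in the paper) to instantiate the two kernels discussed above with scikit-learn: the polynomial kernel $(1 + x_i \cdot x_j)^p$ corresponds to kernel="poly" with degree=p, coef0=1, and gamma=1, and the Gaussian kernel $\exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ corresponds to kernel="rbf" with gamma = 1/(2σ²).

```python
from sklearn.svm import SVC

# Polynomial kernel (1 + x.x')^p with p = 2, as discussed for Figure 3(a).
poly_svm = SVC(C=1000.0, kernel="poly", degree=2, gamma=1.0, coef0=1.0)

# Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2)) with sigma^2 = 10 (Figure 3(b)).
sigma2 = 10.0
gauss_svm = SVC(C=1000.0, kernel="rbf", gamma=1.0 / (2.0 * sigma2))
```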
4.3. Analysis of the cross-validation process

In order to obtain the best performance and reduce overfitting of the training model, a two-step cross-validation process was used to determine the optimal parameters. Figure 4 shows the results of cross-validation for the polynomial kernel. The best parameter set $(C_0, p_0)$ obtained from the first step of coarse grid search was (100, 1), with the lowest error rate of 71%. The optimal pair of parameters $(C, p)$ obtained in the second step of fine grid search was (800, 1); the average error rate of the second step improved slightly to 68.1%. The results of cross-validation for the Gaussian kernel are shown in Figure 5. The best parameter set $(C_0, \sigma_0^2)$ obtained from the coarse grid search was (10, 10), and the optimal parameter set $(C, \sigma^2)$ obtained in the fine grid search was (40, 4). The error rate also improved slightly, from 73.9% in the first step to 72.4% in the second step.

As shown in the previous section, if the training model is built with the whole set of data samples and a parameter set is selected from the region with very low average error rates (below 10%) in Figure 3, the training model can hardly avoid the overfitting problem. An interesting result is that the best parameter sets obtained by cross-validation for both kernel functions seem to lie on the boundary of the region with very low average error rates. This indicates that the cross-validation process is capable of balancing the trade-off between improving training accuracy and preventing overfitting. Since the purpose of cross-validation is to search for the best combination of parameters, the accuracy of the individual training models in this process is not our main concern, despite their high training error rates (all larger than 65%). The optimal parameters of the polynomial kernel and the Gaussian kernel obtained from cross-validation were then used to build the final training models.

Figure 4. Average training accuracy of cross-validation in (a) the coarse grid and (b) the fine grid using the polynomial kernel.

Figure 5. Average training accuracy of cross-validation in (a) the coarse grid and (b) the fine grid using the Gaussian kernel.

4.4. Performance of the optimal training model

The best parameter sets of the polynomial kernel and the Gaussian kernel obtained from the cross-validation process were both used to build multiclass fuzzy SVM training models. The average accuracy rate of the polynomial kernel model with $(C, p) = (800, 1)$ was 66.3%, while the average accuracy rate of the Gaussian kernel model with $(C, \sigma^2) = (40, 4)$ was 98.6%. Confusion matrices are used for further analysis, as shown in Table 3. Diagonal elements are the numbers of correctly classified samples, while off-diagonal elements indicate the numbers of misclassified samples.
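The confusion matrices and accuracy rates of Table 3 can be reproduced from a model's predictions with a few lines of Python (a hedged sketch; the average accuracy is taken here as the mean of the per-class rates, which is how the reported 66.3% and 98.6% appear to be computed).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["plain", "sports", "female", "simplicity", "business"]

def accuracy_report(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)
    per_class = cm.diagonal() / cm.sum(axis=1)    # correct predictions / actual samples, per class
    return cm, per_class, per_class.mean()        # matrix, class accuracy rates, average accuracy rate
```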
For the polynomial kernel model in Table 3(a), the most confusing class was "female": more than half of the "female" samples were misclassified into the "plain", "sports", or "simplicity" classes, and the accuracy dropped to 20%. According to our observation, two characteristics of "female" product samples are the area of decoration and the color of the body. Since the color and texture of the product samples were ignored, these samples may not provide enough information for the polynomial kernel model to classify them correctly. For the Gaussian kernel model in Table 3(b), the model performed very well and had only one misclassified sample. The performance of the Gaussian kernel model with parameter set $(C, \sigma^2) = (40, 4)$ was therefore better than that of the polynomial kernel model.

(a) Polynomial kernel model
                           Predicted class
Actual class   plain  sports  female  simplicity  business   Accuracy rate (%)
plain            10      3       0        1           0            71.4
sports            1     15       0        1           0            88.2
female            1      5       2        2           0            20.0
simplicity        1      0       0       15           1            88.2
business          1      1       0        2           7            63.6
Average accuracy rate                                               66.3

(b) Gaussian kernel model
                           Predicted class
Actual class   plain  sports  female  simplicity  business   Accuracy rate (%)
plain            13      0       1        0           0            92.9
sports            0     17       0        0           0           100.0
female            0      0      10        0           0           100.0
simplicity        0      0       0       17           0           100.0
business          0      0       0        0          11           100.0
Average accuracy rate                                               98.6

Table 3. Confusion matrices and accuracy rates of the optimal training models obtained from (a) the polynomial kernel and (b) the Gaussian kernel.

5. Conclusions and future work

In this paper, an approach based on multiclass fuzzy SVM is proposed to develop a prediction model of consumers’ preferences. The OVO multiclass fuzzy SVM model can deal with the nonlinear relationship between product form features and consumers’ preferences by introducing a kernel function, and the optimal training parameters are determined by a two-step cross-validation process. According to the experimental results for mobile phone design, the optimal training model was obtained by choosing the Gaussian kernel model, whose lowest average cross-validation error rate was 72.4%; the corresponding parameter set $(C, \sigma^2)$ was (40, 4). The optimal Gaussian kernel model trained with all product samples also achieved a very high accuracy of 98.6%. As a consequence, the Gaussian kernel model is superior to the polynomial model. This result is consistent with the fact that the Gaussian kernel is popular and commonly used in many applications because of its good properties; further discussion of the properties of the Gaussian kernel can be found in (Sathiya and Lin 2003; Wang, Xu et al. 2003).

Since our case study was based on mobile phone design and used a relatively small number of product form features, the form features of other product categories, such as consumer electronics, furniture, or car design, may have different characteristics to consider. A more comprehensive collection of different product samples is needed to study the effectiveness of the proposed multiclass fuzzy SVM model. Extending standard kernel functions such as the polynomial kernel and the Gaussian kernel by considering the characteristics of product form features is also a very interesting issue that requires further study.

Appendix: Fuzzy support vector machines

For a binary classification problem, a set S of l training samples is given, each represented as $(x_i, y_i, s_i)$, where $x_i$ is the feature vector, $y_i$ is the class label, and $s_i$ is the fuzzy membership. Each training sample belongs to one of two classes: the samples are given labels $y_i \in \{-1, +1\}$ and fuzzy memberships $\sigma \le s_i \le 1$, $i = 1, \ldots, l$, for a sufficiently small $\sigma > 0$.
Data samples with $s_i = 0$ contribute nothing and can be removed from the training set without affecting the result. These training samples can be used to build a decision function (or discriminant function) $D(x)$, which is a scalar function of an input sample. Decision functions that are simple weighted sums of the input sample plus a bias are called linear discriminant functions (Duda and Hart 1973), denoted as

$D(x) = w \cdot x + b$    (1)

where w is the weight vector and b is a bias value. $D(x) = 0$ can be seen as a hyperplane: w is the normal vector of the separating plane, and the bias term b determines the offset of the hyperplane along its normal vector. A data set is said to be linearly separable if a linear discriminant function can separate it without error. In most cases, finding a suitable linear discriminant function is too restrictive to be of practical use. A solution to this situation is to map the original input space into a higher-dimensional feature space and search for the optimal hyperplane in that feature space. Let $z_i = \phi(x_i)$ denote the corresponding feature space vector, with a mapping function $\phi$ from $\mathbb{R}^N$ to a feature space Z. The hyperplane can then be defined as

$w \cdot z + b = 0$    (2)

The set S is said to be linearly separable if there exists $(w, b)$ such that the inequalities

$w \cdot z_i + b \ge 1$ for $y_i = 1$,  $w \cdot z_i + b \le -1$ for $y_i = -1$    (3)

are valid for all data samples of the set S. For a linearly separable set S, a unique optimal hyperplane can be found for which the margin between the projections of the training points of the two classes is maximized. To deal with data that are not linearly separable, the previous analysis can be generalized by introducing non-negative slack variables $\xi_i \ge 0$ such that Eq. (3) is modified to

$y_i (w \cdot z_i + b) \ge 1 - \xi_i$,  $i = 1, \ldots, l$    (4)

The non-zero $\xi_i$ in Eq. (4) are those for which the data sample $x_i$ does not satisfy Eq. (3). Thus the term $\sum_{i=1}^{l} \xi_i$ can be thought of as a measure of the amount of misclassification. Since the fuzzy membership $s_i$ is the attitude of the corresponding sample $x_i$ toward one class and the variable $\xi_i$ is a measure of error in the SVM, the term $s_i \xi_i$ is a measure of error with different weighting. The optimal hyperplane problem is then regarded as the solution of

minimize  $\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} s_i \xi_i$
subject to  $y_i (w \cdot z_i + b) \ge 1 - \xi_i$,  $\xi_i \ge 0$,  $i = 1, \ldots, l$    (5)

where C is a constant that can be regarded as a regularization parameter. Tuning this parameter balances the minimization of the error function against the maximization of the margin of the optimal hyperplane: a larger C makes the trained SVM commit fewer misclassifications with a narrower margin, while decreasing C makes the SVM ignore more training points and obtain a wider margin. Note that a smaller $s_i$ reduces the effect of the slack variable $\xi_i$, so that the corresponding point $x_i$ is treated as less important.

The optimization problem (5) can be solved by introducing Lagrange multipliers and transforming it into

minimize  $W(\alpha) = \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j (z_i \cdot z_j) - \sum_{i=1}^{l} \alpha_i$
subject to  $\sum_{i=1}^{l} y_i \alpha_i = 0$,  $0 \le \alpha_i \le s_i C$,  $i = 1, \ldots, l$    (6)

and the Kuhn-Tucker conditions are defined as

$\alpha_i \left( y_i (w \cdot z_i + b) - 1 + \xi_i \right) = 0$,  $i = 1, \ldots, l$    (7)

$(s_i C - \alpha_i) \xi_i = 0$,  $i = 1, \ldots, l$    (8)

The data sample $x_i$ with corresponding $\alpha_i > 0$ is called a support vector. There are two types of support vectors: the one with $0 < \alpha_i < s_i C$ lies on the margin of the hyperplane, and the one with $\alpha_i = s_i C$ is misclassified.
An important difference between SVM and fuzzy SVM is that points with the same value of $\alpha_i$ may represent different types of support vectors in fuzzy SVM because of the factor $s_i$ (Lin and Wang 2002).

The mapping $\phi$ is usually nonlinear and unknown. Instead of calculating $\phi$ explicitly, a kernel function K is used to compute the inner product of two vectors in the feature space Z, and thus implicitly defines the mapping function:

$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) = z_i \cdot z_j$    (9)

The kernel is one of the core concepts in SVMs and plays a very important role. The following are three commonly used kernel functions:

linear kernel:  $K(x_i, x_j) = x_i \cdot x_j$    (10)

polynomial kernel:  $K(x_i, x_j) = (1 + x_i \cdot x_j)^p$    (11)

Gaussian kernel:  $K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)$    (12)

where the order p of the polynomial kernel in Eq. (11) and the spread width σ of the Gaussian kernel in Eq. (12) are adjustable kernel parameters. The weight vector w and the decision function can be expressed in terms of the Lagrange multipliers $\alpha_i$:

$w = \sum_{i=1}^{l} \alpha_i y_i z_i$    (13)

$D(x) = \mathrm{sign}(w \cdot z + b) = \mathrm{sign}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right)$    (14)
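As a small numerical illustration of Eqs. (13)-(14) (not part of the original paper; all values are arbitrary), the decision function can be evaluated directly from a dual solution in NumPy:

```python
import numpy as np

def gaussian_kernel(a, b, sigma2):
    # Eq. (12): K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / (2.0 * sigma2))

def decision(x, support_vectors, alphas, labels, b, sigma2):
    # Eq. (14): D(x) = sign( sum_i alpha_i y_i K(x_i, x) + b )
    total = sum(a * y * gaussian_kernel(sv, x, sigma2)
                for sv, a, y in zip(support_vectors, alphas, labels))
    return np.sign(total + b)

# Arbitrary example with two support vectors.
print(decision([0.5, 0.2], [[0.4, 0.1], [0.9, 0.8]], [0.7, 0.3], [+1, -1], b=-0.05, sigma2=4.0))
```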
References

Abe, S. and T. Inoue (2002). Fuzzy support vector machines for multiclass problems. European Symposium on Artificial Neural Networks, Bruges, Belgium.
Barzilay, O. and V. L. Brailovsky (1999). "On domain knowledge and feature selection using a support vector machine." Pattern Recognition Letters 20: 475-484.
Berrar, D., I. Bradbury, et al. (2006). "Avoiding model selection bias in small-sample genomic datasets." Bioinformatics 22(10): 1245-1250.
Bottou, L., C. Cortes, et al. (1994). Comparison of classifier methods: a case study in handwritten digit recognition. International Conference on Pattern Recognition, IEEE Computer Society Press.
Burges, C. (1998). "A tutorial on support vector machines for pattern recognition." Data Mining and Knowledge Discovery 2(2).
Duan, K.-B. and S. S. Keerthi (2005). "Which is the best multiclass SVM method? An empirical study." Multiple Classifier Systems: 278-285.
Duda, R. O. and P. E. Hart (1973). Pattern Classification and Scene Analysis. Wiley.
Hsu, C.-W., C.-C. Chang, et al. (2003). A practical guide to support vector classification.
Hsu, C.-W. and C.-J. Lin (2001). "A comparison of methods for multi-class support vector machines." IEEE Transactions on Neural Networks 13: 415-425.
Hsu, S. H., M. C. Chuang, et al. (2000). "A semantic differential study of designers' and users' product form perception." International Journal of Industrial Ergonomics 25: 375-391.
Huang, H.-P. and Y.-H. Liu (2002). "Fuzzy support vector machines for pattern recognition and data mining." International Journal of Fuzzy Systems 4(3): 826-835.
Inoue, T. and S. Abe (2001). Fuzzy support vector machines for pattern classification.
Jindo, T., K. Hirasago, et al. (1995). "Development of a design support system for office chairs using 3-D graphics." International Journal of Industrial Ergonomics 15: 49-62.
Kaul, A. and V. R. Rao (1995). "Research for product positioning and design decisions." International Journal of Research in Marketing 12: 293-320.
Krebel, U. (1999). Pairwise classification and support vector machines. In: B. Scholkopf, C. J. C. Burges and A. J. Smola (eds.), Advances in Kernel Methods - Support Vector Learning. Cambridge, MA, MIT Press: 255-268.
Kwahk, J. and S. H. Han (2002). "A methodology for evaluating the usability of audiovisual consumer electronic products." Applied Ergonomics 33: 419-431.
Lin, C.-F. and S.-D. Wang (2002). "Fuzzy support vector machines." IEEE Transactions on Neural Networks 13(2): 464-471.
Park, J. and S. H. Han (2004). "A fuzzy rule-based approach to modeling affective user satisfaction towards office chair design." International Journal of Industrial Ergonomics 34: 31-47.
Platt, J. C., N. Cristianini, et al. (2000). Large margin DAGs for multiclass classification. Advances in Neural Information Processing Systems, MIT Press.
Sathiya, S. and C.-J. Lin (2003). "Asymptotic behaviors of support vector machines with Gaussian kernel." Neural Computation 15(7): 1667-1689.
Scholkopf, B., I. Guyon, et al. (2001). Statistical learning and kernel methods in bioinformatics. San Miniato.
Shimizu, Y. and T. Jindo (1995). "A fuzzy logic analysis method for evaluating human sensitivities." International Journal of Industrial Ergonomics 15: 39-47.
Smits, G. F. and E. M. Jordaan (2002). Improved SVM regression using mixtures of kernels. Proceedings of IJCNN'02 on Neural Networks.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. New York, Springer.
Wang, W., Z. Xu, et al. (2003). "Determination of the spread parameter in the Gaussian kernel for classification and regression." Neurocomputing 55: 643-663.