International Journal of Mechanical Engineering and Technology (IJMET)
Volume 10, Issue 3, March 2019, pp. 603–613, Article ID: IJMET_10_03_062
Available online at http://www.iaeme.com/ijmet/issues.asp?JType=IJMET&VType=10&IType=3
ISSN Print: 0976-6340 and ISSN Online: 0976-6359
© IAEME Publication
Scopus Indexed
PROFIT AGENT CLASSIFICATION USING
FEATURE SELECTION EIGENVECTOR
CENTRALITY
Zidni Nurrobi Agam
Computer Science Department, Bina Nusantara University,
Jakarta, Indonesia 11480
Sani M. Isa
Computer Science Department, Bina Nusantara University,
Jakarta, Indonesia 11480
Abstract
Classification is a method that groups data into related categories according to their
similarities. High-dimensional data can make the classification process suboptimal because
it contains large amounts of otherwise meaningless data. In this paper, we classify profit
agents from PT. XYZ and look for the features that have a major impact on agent profit.
Feature selection is one of the methods that can optimize a dataset for the classification
process. We apply a graph-based feature selection method, which identifies the most
important nodes in relation to their neighboring nodes. Eigenvector centrality estimates
the importance of a feature relative to its neighbors; it ranks central nodes as candidate
features for the classification method and helps find the best features for classifying
the agent data. Support Vector Machines (SVM) are used to test whether feature selection
with Eigenvector Centrality further improves the accuracy of the classification.
Keywords: Classification, Support Vector Machines, Feature Selection, Eigenvector
Centrality, Graph-based.
Cite this Article: Zidni Nurrobi Agam and Sani M. Isa, Profit Agent Classification
Using Feature Selection Eigenvector Centrality, International Journal of Mechanical
Engineering and Technology (IJMET)10(3), 2019, pp. 603–613.
http://www.iaeme.com/ijmet/issues.asp?JType=IJMET&VType=10&IType=3
http://www.iaeme.com/IJMET/index.asp
603
editor@iaeme.com
Zidni Nurrobi Agam and Sani M. Isa
1. INTRODUCTION
In this era, data is a very important commodity used in almost all existing technologies,
and researchers examine ever more data in order to find hidden patterns that can be used as
information. However, as the amount of data grows, so does the amount of irrelevant and
redundant data, which lowers the quality of the data.
Feature selection is a method that selects a subset of variables from the input which can
efficiently describe the input data while reducing effects from noise or irrelevant variables
and still provide good prediction results [1]. Usually, feature selection involves both
ranking and subset selection [2][3] to obtain the most relevant or most important features
of a dataset. If n is the total number of features, the goal of feature selection is to
select an optimal subset of I features, with I < n. Processing the data with feature
selection improves overall prediction because the classifier works on an optimized dataset.
We apply graph-based feature selection to optimize classification; this method ranks
features by eigenvector centrality (Eigenvector Centrality Feature Selection, ECFS). In
graph theory, eigenvector centrality measures how much impact a node has on the other nodes
in the network. All nodes in the network are assigned relative scores based on the concept
that connections to high-scoring nodes contribute more to the score of the node in question
than equal connections to low-scoring nodes [4]. A high eigenvector score means that a node
is connected to many nodes that themselves have high scores. The relationship between
features (nodes) is therefore measured by the weights of the connections between nodes. The
problem of feature subset selection refers to the task of identifying and selecting a useful
subset of attributes to be used to represent patterns from a larger set of often mutually
redundant, possibly irrelevant, attributes with different associated measurement costs
and/or risks [5]. We use ECFS to find the most influential features for predicting agent
profit.
Several studies have investigated Eigenvector Centrality. Nicholas J. Bryan and Ge Wang [6]
studied how music with many features can form a network of patterns between songs, and used
it to describe patterns of musical influence in sample-based music suitable for
musicological analysis; Eigenvector Centrality was used to analyze and rank the influence of
features across music genres. In 2016, Giorgio Roffo and Simone Melzi [7] studied feature
ranking via Eigenvector Centrality. They identify the most important attributes in an
arbitrary set of cues by mapping the problem onto a graph in which the features are the
nodes, and then assess the importance of the nodes through an indicator of centrality. To
build the graph and weight the connections between nodes, Roffo and Melzi use the Fisher
criterion.
The goal of this paper is to apply Chi-Square and ECFS feature selection and compare the two
methods on different datasets. Both feature selection methods are tested on the HCC and
Profit Agent datasets; the tests are validated with k-fold cross-validation to assess the
model's ability and then evaluated with a confusion matrix to measure misclassification.
Based on the ECFS results, we try to determine which attributes have a major impact on the
other attributes in the Profit Agent dataset.
2. RESEARCH METHOD
This section discusses feature selection and ranking based on a graph network [8]. To build
the graph, we first have to define how the graph is designed.
2.1. Graph Design
Define the graph G = (V, E), where V is the set of vertices corresponding one-to-one to the
variables x in the feature set X = {x(1), x(2), ..., x(n)}, and E is the set of (weighted)
edges between nodes (features). The relationships between nodes (features) are represented
by an adjacency matrix with entries:

$$\alpha_{ij} = \ell\left(x^{(i)}, x^{(j)}\right) \qquad (1)$$

When the graph is simple (there are no weights on the edges and no multiple edges), the
adjacency matrix contains only 0s and 1s and its diagonal entries are all 0 [9]. If the
edges are weighted, the entries of the adjacency matrix are the edge weights instead of 0
and 1, as in Figure 1.
Figure 1 Matrix A with Weight and Matrix A no Weight
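The difference between the two matrices in Figure 1 can be sketched in a few lines of
Python. This is only an illustration: the feature names, the edge weights and the helper
`build_adjacency` are hypothetical and are not taken from the paper's data.

```python
# Hypothetical features and edge weights, chosen only for illustration.
features = ["Age", "City", "Balance"]
edges = {("Age", "City"): 3.4, ("City", "Balance"): 2.6}


def build_adjacency(nodes, weighted_edges, use_weights=True):
    """Build a symmetric adjacency matrix (list of lists) for an undirected graph."""
    position = {name: i for i, name in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0.0] * size for _ in range(size)]
    for (u, v), weight in weighted_edges.items():
        value = weight if use_weights else 1.0
        matrix[position[u]][position[v]] = value
        matrix[position[v]][position[u]] = value  # mirror the entry: undirected graph
    return matrix


A_weighted = build_adjacency(features, edges, use_weights=True)
A_binary = build_adjacency(features, edges, use_weights=False)

print("weighted:")
for row in A_weighted:
    print(row)
print("unweighted:")
for row in A_binary:
    print(row)
```

In both cases the diagonal stays at 0 because no feature is connected to itself; only the
off-diagonal entries change from 1 to the edge weight.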
Designing the graph means weighting it according to how good the relationship between two
features in the dataset is. We apply the Fisher linear discriminant [10], which uses the
class means and standard deviations; this method finds a linear combination of features
which characterizes or separates two or more classes of objects or events.
$$\frac{|m_1 - m_2|^2}{s_1^2 + s_2^2} \qquad (2)$$

Where:
m = the mean.
s = the standard deviation (s^2 is the variance).
Subscripts = the subscripts denote the two classes.
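As a rough illustration of how such a criterion scores a single feature, the sketch below
assumes the common two-class form |m1 - m2|^2 / (s1^2 + s2^2), with population variances.
The function name `fisher_score` and the toy values are hypothetical, and the paper's exact
variant may differ.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Two-class Fisher criterion |m1 - m2|^2 / (s1^2 + s2^2) for one feature."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two classes are expected")
    first = [v for v, y in zip(values, labels) if y == classes[0]]
    second = [v for v, y in zip(values, labels) if y == classes[1]]
    spread = pvariance(first) + pvariance(second)
    gap = (mean(first) - mean(second)) ** 2
    if spread == 0:
        # Degenerate case: no within-class variation at all.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


# Toy data (assumed): labels follow the paper's coding, profit = -1, non-profit = 1.
labels = [-1, -1, -1, 1, 1, 1]
separating = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]   # class means far apart
overlapping = [1.0, 5.0, 9.0, 2.0, 5.0, 8.0]  # identical class means

print(fisher_score(separating, labels))
print(fisher_score(overlapping, labels))
```

A feature whose class means are far apart relative to the within-class spread receives a
high score, while a feature with identical class means scores 0.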
After measuring the relationship between the two classes and obtaining the weights from the
Fisher linear discriminant, we apply Eigenvector Centrality to rank and filter the data,
using the Spearman correlation weight generated from the relationship between two nodes
(features). For G := (V, E) with |V| vertices, let A = (a_{v,t}) be the adjacency matrix.
The relative centrality score x_v of vertex v can be defined as:

$$x_v = \frac{1}{\lambda} \sum_{t \in M(v)} x_t = \frac{1}{\lambda} \sum_{t \in G} a_{v,t}\, x_t \qquad (3)$$

where M(v) is the set of neighbors of v and λ is a constant. With a small rearrangement
this can be rewritten in vector notation as the eigenvector equation:

$$A x = \lambda x \qquad (4)$$

As longer and longer paths are counted, this measure of accessibility converges to an index
known as the eigenvector centrality (EC) measure. An example of nodes and an adjacency
matrix [9] is given in Figure 2 and Table 1.
Figure 2 Node Data Agent Company XYZ
Table 1 Example adjacency matrix

Example       Age    Gender   City   Balance   Total Trans
Age           0      0        0      0         0
Gender        8.56   0        8.56   7.4       0
City          0      3.4      0      2.6       8.7
Balance       0      7.4      2.6    0         0
Total Trans   0      0        0      0         0
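To make the eigenvector equation A x = λ x concrete, the following sketch approximates the
dominant eigenvector by power iteration. It uses a small symmetric matrix with hypothetical
weights rather than the values of Table 1, and iterates on A + I, which has the same
eigenvectors as A, a common trick to keep the iteration from oscillating. The function name
`eigenvector_centrality` is illustrative only.

```python
def eigenvector_centrality(adjacency, max_iter=500, tol=1e-12):
    """Approximate the dominant eigenvector of a non-negative symmetric matrix.

    Power iteration is applied to A + I (same eigenvectors as A); the scores
    are normalised so that they sum to 1.
    """
    n = len(adjacency)
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        updated = [
            scores[v] + sum(adjacency[v][t] * scores[t] for t in range(n))
            for v in range(n)
        ]
        total = sum(updated)
        updated = [value / total for value in updated]
        if max(abs(a - b) for a, b in zip(updated, scores)) < tol:
            return updated
        scores = updated
    return scores


# Hypothetical symmetric weights between four features (not the paper's data).
names = ["City", "Balance", "Gender", "Age"]
A = [
    [0.0, 2.6, 3.4, 1.0],
    [2.6, 0.0, 7.4, 0.0],
    [3.4, 7.4, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

scores = eigenvector_centrality(A)
ranking = [name for _, name in sorted(zip(scores, names), reverse=True)]
print(ranking)
```

Features that are strongly connected to other well-connected features end up at the top of
the ranking, which is the ordering ECFS uses to pick candidate features.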
2.2. SVM Classification Method
SVM classification is a classification analysis that falls into the category of supervised
learning algorithms. Given a set of training examples, each marked as belonging to one or
the other of two categories, a support vector machine builds a model that assigns new
examples to one category or the other, making it a non-probabilistic binary linear
classifier. Its basic idea is to map data into a high-dimensional space and find a
separating hyperplane with the maximal margin. Given a training dataset of n points of the
form:

$$(\vec{x}_1, y_1), \ldots, (\vec{x}_n, y_n) \qquad (5)$$

where the $y_i$ are either 1 or −1 [11], each indicating the class to which the point
$\vec{x}_i$ belongs. Each $\vec{x}_i$ is a p-dimensional real vector. We want to find the
"maximum-margin hyperplane" that divides the group of points $\vec{x}_i$ for which
$y_i = 1$ from the group of points for which $y_i = -1$, which is defined so that the
distance between the hyperplane and the nearest point from either group is maximized.
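The notion of margin can be illustrated without training a model. The sketch below, with
assumed toy points and two hand-picked candidate hyperplanes, computes the geometric margin
min_i y_i (w · x_i + b) / ||w|| of each candidate; an SVM searches for the hyperplane that
maximizes this quantity. The helper `geometric_margin` is hypothetical and this is not the
training procedure used in the paper.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance y_i * (w . x_i + b) / ||w|| over all points.

    A positive value means the hyperplane separates the two classes.
    """
    norm = math.sqrt(sum(c * c for c in w))
    return min(
        y * (sum(wc * xc for wc, xc in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Toy two-dimensional points (assumed), labelled -1 and +1.
points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
labels = [-1, -1, 1, 1]

margin_a = geometric_margin((1.0, 1.0), -5.5, points, labels)  # hyperplane x1 + x2 = 5.5
margin_b = geometric_margin((1.0, 0.0), -3.0, points, labels)  # hyperplane x1 = 3

print(margin_a, margin_b)
```

Both candidates separate the classes, but the first leaves more room between the hyperplane
and the nearest point, so it is the better of the two under the maximum-margin criterion.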
Figure 3 SVM tries to find the best hyperplane to split classes -1 and +1
2.3. Confusion Matrix
To measure how much Eigenvector Centrality improves SVM classification, we use a confusion
matrix. A confusion matrix is a table that is often used to describe the performance of a
classification model (or "classifier") on a set of test data for which the true values are
known. The confusion matrix is shown in Figure 4; it contains TP (True Positives), TN (True
Negatives), FP (False Positives) and FN (False Negatives).
Figure 4 Confusion Matrix
True Positives: We predicted yes (they have the failure), and they do have the failure.
False Positives: We predicted yes, but they don't actually have the failure.
True Negatives: We predicted no, and they don't have the failure.
False Negatives: We predicted no, but they actually do have the failure.
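The four counts above, and the accuracy derived from them, can be computed as in the
following sketch. The label vectors are invented for illustration and the function names
are hypothetical; accuracy is taken as (TP + TN) / (TP + TN + FP + FN).

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


def accuracy(counts):
    """Share of correct predictions: (TP + TN) / all predictions."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total


# Invented labels: 1 is the positive class, -1 the negative class.
actual = [1, 1, 1, -1, -1, -1, -1, 1]
predicted = [1, 1, -1, -1, -1, 1, -1, 1]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(counts))
```

The misclassification rate is simply one minus the accuracy, that is (FP + FN) divided by
the total number of predictions.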
3. RESULT & ANALYSIS
In this section we discuss the results of comparing Eigenvector Centrality FS and Chi-Square
FS. We then use Support Vector Machine classification to calculate and compare the accuracy
of the two feature selection methods.
3.1. Dataset
The PT. XYZ dataset was chosen to analyze Eigenvector Centrality feature selection under the
following scenario. First, we describe the dataset used for feature selection and the
features used for prediction. The agent data is categorical data collected from transaction
data; the attributes used for analysis are listed below (19 variables in total, see
Table 2):
1. Type Application
Description: the device used by the agent for transactions.
Categorical: 1. EDC 2. Android
2. Age
Description: age of the PT. XYZ agent.
Categorical:
1. <= 23 years
2. > 23 years and <= 29 years
3. > 29 years
3. City
Description: the city where the agent lives.
Categorical: every city is converted to a number.
4. Balance Agent
Description: the agent's wallet balance at PT. XYZ.
Categorical: 1. <= Rp.500.000 2. > Rp.500.000 and <= Rp.2.000.000 3. > Rp.2.000.000
5. Transaction
Description: the agent's daily transactions.
Categorical: number of agent transactions per day.
6. Joined
Description: how long the agent has been with PT. XYZ.
Categorical: number of days since the agent joined.
7. Gender
Description: agent gender.
Categorical: 1 = L (male), 0 = P (female)
8. PulsaPrabayar, PLNPrabayar, TVBerbayar, PDAM, PLNPasca, Telpon, Speedy,
BPJS, Cashin, Asuransi, Gopay, & TiketKereta.
Description: the transaction types available in the PT. XYZ application.
Categorical: daily count of the agent's transactions of each type.
Number of data profit: 558
Number of profit: 441
3.2. FS Comparison Approach
In this step, we compare Eigenvector Centrality Feature Selection with chi-square feature
selection. Chi-square is a numerical test that measures deviation from the distribution
expected if the feature event were independent of the class value [12]. We compare
eigenvector centrality and chi-square feature selection using SVM classification on the
Agent dataset and on the public HCC (hepatocellular carcinoma) survival dataset, which has
49 attributes and 2 classes (Survive, Not Survive). HCC is the most common type of primary
liver cancer. Hepatocellular carcinoma occurs most often in people with chronic liver
diseases, such as cirrhosis caused by hepatitis B or hepatitis C infection. The entire
dataset, containing 165 records with numerical category characteristics, is used for our
analysis.
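For a categorical feature, the chi-square score can be sketched as the Pearson statistic of
the feature-by-class contingency table: the sum over all cells of
(observed - expected)^2 / expected, where the expected count assumes independence. The
function `chi_square` and the toy data below are assumptions for illustration, not the
implementation used in the experiments.

```python
from collections import Counter


def chi_square(feature, labels):
    """Pearson chi-square statistic between a categorical feature and the class."""
    n = len(feature)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    statistic = 0.0
    for value, value_count in feature_totals.items():
        for label, label_count in label_totals.items():
            expected = value_count * label_count / n  # count expected under independence
            statistic += (observed[(value, label)] - expected) ** 2 / expected
    return statistic


# Toy data (assumed): the first feature follows the class, the second ignores it.
labels = [1, 1, 1, 1, -1, -1, -1, -1]
informative = ["EDC"] * 4 + ["Android"] * 4
uninformative = ["EDC", "Android"] * 4

print(chi_square(informative, labels))
print(chi_square(uninformative, labels))
```

Features are then ranked by this statistic: the larger the deviation from independence, the
more the feature tells us about the class.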
The results of ECFS and Chi-Square on the agent data and the HCC data are analyzed to
determine whether there are significant differences between the two methods. The correlation
between the attributes can influence the classification result [13], and accidentally
eliminating crucial features can degrade the classification result. Table 2 gives further
details of the considered datasets, such as the number of samples and variables, the number
of classes and the accuracy result from classification [14].
Table 2 Datasets used in the comparison of feature selection. For each dataset, the
following details are reported.

No   Dataset        Samples   Variables   Classes
1    Agent Profit   1000      19          2
2    HCC            165       49          2
3.3. Test Models & Performance Analysis
In this step we test the accuracy of both feature selection methods using SVM classification
and measure accuracy with a confusion matrix. For Eigenvector Centrality, the dataset is
split into feature data and label data: the feature data are used by the classification
model, and the label data hold the class label, with profit coded as -1 and non-profit as 1.
The data are split into training and test sets, and the samples are then separated into the
two classes (profit as -1, non-profit as 1). The class means are used to find the mutual
information; the linear discriminant is used to measure the mutual information between the
two classes and to find the best values for building the graph for Eigenvector Centrality.
Eigenvector Centrality then ranks the features by how strongly each node is connected to the
other nodes. The 10 most central attributes (nodes) are selected to train the SVM model; the
SVM model used for this classification is FITCSVM. The models are validated with 10-fold
cross-validation, and a random seed of 1 is used to control random number generation for
every feature selection method.
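The selection and validation steps described above can be outlined as follows. This is a
plain-Python sketch under stated assumptions: `top_k_features` and `k_fold_indices` are
hypothetical helpers, the scores are invented, and the seeded shuffle only mimics the idea
of a fixed random seed; it does not reproduce the partitions produced by the toolbox used in
the paper.

```python
import random


def top_k_features(scores, k=10):
    """Return the names of the k highest-scoring features."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def k_fold_indices(n_samples, k=10, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation.

    A fixed seed makes the shuffle, and therefore the folds, reproducible.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Invented centrality scores for three features.
selected = top_k_features({"City": 0.9, "Age": 0.1, "Balance": 0.5}, k=2)

# 165 is the number of records reported for the HCC dataset.
splits = list(k_fold_indices(165, k=10, seed=1))
print(selected, len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each sample appears in exactly one test fold, so the accuracy averaged over the ten folds
uses every record once for testing.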
Table 3 shows the results obtained by comparing the accuracy on the datasets used for
classification. For the HCC dataset we report attribute 9 to attribute 45: from iteration 1
to iteration 8 the accuracy of chi-square decreases to 60%. For the Agent Profit dataset we
report attribute 10 to attribute 19: from iteration 1 to iteration 9 the accuracy of
chi-square decreases to 60%. We therefore do not display the accuracy for iterations 1-9 for
HCC and 1-10 for Profit Agent. The best attributes were found by manual reduction, keeping
the iteration with the best accuracy.
Table 3 Performance Analysis dataset HCC with Chi-Square and ECFS (accuracy in %)

Iteration   Chi-Square   ECFS
45          70.4412      71.4706
44          69.8529      69.8529
43          70.4779      69.8529
42          69.2279      69.8162
41          69.2279      68.6397
40          67.3897      66.8015
39          65.5515      64.9265
38          66.8015      66.7279
37          65           64.9265
36          65.5515      64.8897
35          66.8015      63.0882
34          68.0515      67.9779
33          71.6176      67.9779
32          71.6176      67.9779
31          69.8897      69.1912
30          69.8529      69.1544
29          68.0515      67.9779
28          66.2132      67.2794
27          67.4632      69.7794
26          67.4265      69.7426
25          66.1765      68.5662
24          67.8676      69.7426
23          65.4412      69.8162
22          67.2426      69.1912
21          64.2279      69.8162
20          66.0294      71.0294
19          67.9044      71.6176
18          67.9412      68.6029
17          68.5294      69.1912
16          67.3529      66.7647
15          67.9779      66.2132
14          66.8382      66.8382
13          66.8015      67.3897
12          65.625       69.7794
11          65.5147      71.0294
10          63.1618      68.6397
9           59.4853      67.3529
Figure 5 Performance Analysis dataset HCC with Chi-Square and ECFS
With chi-square, the highest accuracy, 71.6176%, was reached at iterations 32 and 33, while
ECFS reached the same highest accuracy of 71.6176% at iteration 19. Chi-Square thus reached
its maximum accuracy earlier, at iteration 33, while ECFS achieved its maximum accuracy at
the 19th iteration. Both feature selection methods were tested here on a dataset with 45
attributes; the next test is carried out on the Agent Profit dataset, which has 19
attributes, and is shown in Table 4 and Figure 6.
Table 4 Performance Analysis dataset Agent Profit with Chi-Square and ECFS (accuracy in %)

Iteration   ECFS      Chi-Square
19          84.5838   84.7838
18          84.9838   84.8828
17          85.0848   85.7838
16          85.3859   86.3889
15          85.0848   86.7889
14          85.3859   86.8889
13          87.7889   87.5899
12          90.3838   87.5909
11          90.4848   87.8899
10          83.7869   88.6899
9           67.6869   89.1899
Figure 6 Performance Analysis dataset Agent Profit with Chi-Square and ECFS
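The best iteration of each method can be read off Table 4 with a few lines of Python. The
values below are transcribed from Table 4, reading each row as an (ECFS, Chi-Square) pair in
the order of the column headers; that reading of the layout is an assumption, and the
variable names are illustrative.

```python
# Accuracy (%) per iteration, transcribed from Table 4 as (ECFS, Chi-Square) pairs.
table4 = {
    19: (84.5838, 84.7838),
    18: (84.9838, 84.8828),
    17: (85.0848, 85.7838),
    16: (85.3859, 86.3889),
    15: (85.0848, 86.7889),
    14: (85.3859, 86.8889),
    13: (87.7889, 87.5899),
    12: (90.3838, 87.5909),
    11: (90.4848, 87.8899),
    10: (83.7869, 88.6899),
    9: (67.6869, 89.1899),
}

# Iteration with the highest accuracy for each method.
best_ecfs = max(table4, key=lambda iteration: table4[iteration][0])
best_chi = max(table4, key=lambda iteration: table4[iteration][1])

print("ECFS:", best_ecfs, table4[best_ecfs][0])
print("Chi-Square:", best_chi, table4[best_chi][1])
```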
The maximum accuracy produced by ECFS on the Agent Profit dataset is 90.48%, better than
chi-square, which produces a maximum accuracy of 89.18%. ECFS obtains its maximum accuracy
at iteration 11, while chi-square obtains it at iteration 9. The overall results obtained on
the HCC and Agent Profit datasets indicate that ECFS is more robust than chi-square, but
Chi-Square is better when there are more than 20 attributes, reaching its maximum accuracy
at iterations 32 and 33. When fewer than 20 attributes of the HCC dataset are used, ECFS
reaches its maximum accuracy at iteration 19. When the number of attributes becomes small,
such as 9 attributes, the accuracy of Chi-Square decreases significantly, as in the HCC and
Agent Profit datasets. ECFS is more robust when attributes are removed: its accuracy also
decreases, but not significantly. ECFS shows the largest increase in performance when the
attributes are reduced [15] on the Agent Profit dataset, as many attributes have strong
relationships with others. ECFS ranks every attribute according to how well it
discriminates between the two classes and removes the attributes that do not have a major
impact on other attributes, which improves the result obtained from the reduced attribute
set.
3.4. Evaluating & Analysis Attribute
In this comparison each feature selection method has its own best accuracy, which is
evaluated with the confusion matrices shown in Table 5. The confusion matrices show only the
single best result of ECFS and of Chi-Square. For ECFS, the best accuracy obtained with
10-fold cross-validation is 71.61% on the HCC dataset and 90.48% on the Profit Agent
dataset. For Chi-Square, the best accuracy obtained with 10-fold cross-validation is 71.61%
on the HCC dataset and 89.18% on the Profit Agent dataset.
Table 5 Confusion Matrix Best Accuracy

ECFS (Dataset Agent Profit)
            Profit   NonProfit
Profit      40       5
NonProfit   4        51

Chi-Square (Dataset Agent Profit)
            Profit   NonProfit
Profit      34       1
NonProfit   10       55

ECFS (Dataset HCC)
            Profit   NonProfit
Profit      2        1
NonProfit   4        9

Chi-Square (Dataset HCC)
            Profit   NonProfit
Profit      2        4
NonProfit   1        9
After testing with ECFS, 12 attributes of the Agent dataset were found to have a major
impact on other attributes; they may be relevant for analyzing agent profit, as shown in
Figure 7.
Figure 7 Attributes Analysis Dataset Profit Agent with ECFS
From Figure 7, the attribute with the largest impact on other attributes is City: the profit
of many agents is determined by where the agent makes transactions. This outcome of ECFS is
plausible because city also has a major impact in real conditions: people in big cities and
small cities have different habits, such as knowledge about technology and culture.
4. CONCLUSION & FUTURE WORK
In this paper we built a feature selection model based on Eigenvector Centrality. This
method uses the eigenvector to weight the connections between nodes, find the best nodes,
and rank the features by importance. We compared ECFS with Chi-Square and found that ECFS is
more robust than Chi-Square once the attribute set has been reduced, whereas Chi-Square
reaches its maximum accuracy with more attributes than ECFS. When Chi-Square is tested with
a reduced attribute set, such as fewer than 9 attributes, its accuracy decreases
significantly more than that of ECFS. ECFS also shows which attributes have a major impact
on Agent Profit. Future work includes improving the performance of the feature selection
itself when the dataset has many attributes, testing ECFS on other datasets with more
attributes, and making ECFS available in other tools or frameworks so that it is easily
accessible to many researchers.
REFERENCES
[1] I. Guyon and A. Elisseeff, “An Introduction to Variable and Feature Selection,” J. Mach.
Learn. Res., Volume 3, Number 3, pp. 1157–1182, 2003.
[2] R. Wan, L. Vegas, M. Carlo, and P. Ii, “Computational Methods of Feature Selection.”
[3] P. S. Bradley, “Feature Selection via Concave Minimization and Support Vector Machines,”
Number 6.
[4] L. Solá, M. Romance, R. Criado, J. Flores, A. García del Amo, and S. Boccaletti,
“Eigenvector centrality of nodes in multiplex networks,” Chaos, Volume 23, Number 3, pp.
1–11, 2013.
[5] J. Yang and V. Honavar, “Feature Subset Selection Using a Genetic Algorithm,” IEEE
Intell. Syst., vol. 13, pp. 44–49, 1998.
[6] N. Bryan and G. Wang, “Musical Influence Network Analysis and Rank of Sample-Based
Music,” Proc. 12th Int. Soc. Music Inf. Retr. Conf., no. ISMIR, pp. 329–334, 2011.
[7] G. Roffo and S. Melzi, “Ranking to learn: Feature ranking and selection via eigenvector
centrality,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect.
Notes Bioinformatics), vol. 10312 LNCS, pp. 19–35, 2017.
[8] F. R. Pitts, “A graph theoretic approach to historical geography,” pp. 15–20.
[9] F. Harary, S. Review, and N. Jul, “No Title,” vol. 4, no. 3, pp. 202–210, 2008.
[10] B. Scholkopft and K. Mullert, “Fisher Discriminant Analysis With,” pp. 41–48, 1999.
[11] Y. Bazi, S. Member, F. Melgani, and S. Member, “Toward an Optimal SVM
Classification System for Hyperspectral Remote Sensing Images,” no. December 2013,
2006.
[12] S. Thaseen and C. A. Kumar, “Intrusion Detection Model Using fusion of Chi-square
feature selection and multi class,” J. KING SAUD Univ. - Comput. Inf. Sci., no. 2016,
2015.
[13] I. Sumaiya Thaseen and C. Aswani Kumar, “Intrusion detection model using fusion of
chi-square feature selection and multi class SVM,” J. King Saud Univ. - Comput. Inf. Sci.,
vol. 29, no. 4, pp. 462–472, 2017.
[14] D. Ballabio, F. Grisoni, and R. Todeschini, “Multivariate comparison of classification
performance measures,” Chemom. Intell. Lab. Syst., vol. 174, no. March, pp. 33–44,
2018.
[15] E. M. Hand and R. Chellappa, “Attributes for Improved Attributes: A Multi-Task
Network for Attribute Classification,” pp. 4068–4074, 2016.