CHAPTER 4
STOCK PRICE PREDICTION USING MODIFIED K-NEAREST
NEIGHBOR (MKNN) ALGORITHM
4.1 Introduction
Nowadays, investment in the stock market attracts major attention because of its dynamic nature. A significant issue in market finance is therefore discovering well-organized approaches to summarize and visualize stock market information, so that individuals or organizations obtain helpful insight into the behaviour of the market for making investment decisions. The huge amount of valuable information produced by the stock market has attracted researchers to investigate this problem using various approaches. Since stock markets produce huge datasets, data mining techniques are found to be particularly efficient. Data mining is used to extract data from databases and discover meaningful patterns in them. The usefulness of this information makes data mining imperative and necessary. The essentials of data mining in finance originate from the need to adopt specific, well-organized criteria to improve prediction accuracy and facilitate multi-resolution calculation.
4.2 k-Nearest Neighbor (k-NN) Algorithm
In pattern recognition, the KNN algorithm is a technique for classifying objects according to the nearest training samples. KNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.
4.2.1 Assumptions in KNN
KNN assumes that the data lie in a feature space. More precisely, the data points are in a metric space. The data are usually either multidimensional vectors or scalars. Since the points are in a feature space, they have a notion of distance. This distance need not be the Euclidean distance, although it is the one most commonly used.
Every training sample consists of a set of vectors and a class label associated with each vector. The classes may simply be positive and negative, but KNN can also handle tasks with an arbitrary number of classes.
Additionally, a single number "k" is given. This number decides how many neighbors (where neighbors are defined by the distance metric) influence the classification. It is typically an odd number when the number of classes is 2. If k = 1, the algorithm is simply called the nearest neighbor algorithm.
4.2.2 Basics of KNN
KNN is the simplest and most straightforward classification technique when little is known about the distribution of the data. The method simply stores the whole training set during learning and assigns to every query the class determined by the majority label of its k nearest neighbors in the training set. The Nearest Neighbor (NN) rule is the simplest form of KNN, obtained when K = 1.
In this algorithm, every training sample should be classified similarly to the samples surrounding it. Consequently, if the class of a sample is unknown, it can be predicted by considering the classes of its nearest neighbor samples. Given an unknown sample and a training set of samples, the distances between the unknown sample and all samples in the training set can be calculated using the following equation:

D(x, u) = \sqrt{\sum_{i=1}^{p} (x_i - u_i)^2}    (4.1)

where x1, x2, ..., xp are the predictors of the first sample and u1, u2, ..., up are the predictors of the second sample. The training sample with the smallest distance is the closest to the unknown sample. Hence, the unknown sample may be categorized based on this nearest neighbor classification.
[Figure: known samples of two classes and an unknown sample, for (a) K = 1 and (b) K = 3]
Fig 4.1 KNN decision rule
Fig 4.1 illustrates the KNN decision rule for K = 1 and K = 3 for a set of samples divided into two classes. In Fig 4.1(a), an unknown sample is categorized using only one known sample; in Fig 4.1(b), more than one known sample is used. In the latter case, the parameter K is set to 3, so the closest three samples are considered when classifying the unknown one. Two of them belong to the same class, whereas only one belongs to the other class. In both cases, the unknown sample is classified as belonging to the class on the left. Fig 4.2 shows the pseudo code of the KNN algorithm.
Input: finite set A (queries), finite set B (training samples), k, distance function d, labeling function c: B -> {1, 2, ..., n}
Output: classification r: A -> {1, 2, ..., n}
Begin
  For each x in A do
    Let L <- {}
    For each b in B, add (d(x, b), c(b)) to L
    Sort the elements of L by their first component (the distance)
    Collect the class labels of the first k elements of L
    Let r(x) be the class with the highest number of occurrences
  End for
  Return r
End
Fig 4.2 Pseudo code for KNN algorithm
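As an illustration, the following is a minimal Python sketch of the pseudo code above, using the Euclidean distance of Eq. (4.1); the function names euclidean and knn_classify and the toy data are illustrative only, not part of the original procedure.

import numpy as np
from collections import Counter

def euclidean(x, u):
    # Eq. (4.1): square root of the sum of squared differences over all predictors
    return np.sqrt(np.sum((x - u) ** 2))

def knn_classify(queries, train_X, train_y, k=3):
    """Classify each query by the majority label of its k nearest training samples."""
    results = []
    for x in queries:
        # build the list L of (distance, label) pairs and sort by distance
        L = sorted((euclidean(x, b), c) for b, c in zip(train_X, train_y))
        # majority vote over the labels of the first k elements
        labels = [c for _, c in L[:k]]
        results.append(Counter(labels).most_common(1)[0][0])
    return results

# example usage with two classes
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
train_y = ["negative", "negative", "positive", "positive"]
print(knn_classify(np.array([[1.1, 0.9], [4.8, 5.1]]), train_X, train_y, k=3))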
The classifier performance is principally controlled by the choice of K and by the distance metric applied [20-25]. The evaluation is sensitive to the choice of the neighborhood size K, since the radius of the local region is determined by the distance of the Kth nearest neighbor to the query, and different values of K yield different conditional class probabilities.
4.2.2.1 Distance Metric
KNN makes its estimate from the K neighbors closest to the query point. Accordingly, to make an estimate with KNN, a metric must be defined for measuring the distance between the query point and the samples. The most familiar choice for this distance is the Euclidean distance. Other measures include Euclidean squared, City-block and Chebychev distance. Table 4.1 presents these distance metrics and their formulas.
Table 4.1 Distance metrics employed in KNN (x = query point, p = data point from the sample)

Distance Metric    | Formula
Euclidean Distance | D(x, p) = \sqrt{\sum_i (x_i - p_i)^2}
Euclidean Squared  | D(x, p) = \sum_i (x_i - p_i)^2
City-block         | D(x, p) = \sum_i |x_i - p_i|
Chebychev          | D(x, p) = \max_i |x_i - p_i|
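As a quick illustration, a minimal NumPy sketch of the four metrics in Table 4.1 (function names are illustrative):

import numpy as np

def euclidean(x, p):
    return np.sqrt(np.sum((x - p) ** 2))

def euclidean_squared(x, p):
    return np.sum((x - p) ** 2)

def city_block(x, p):
    return np.sum(np.abs(x - p))

def chebychev(x, p):
    return np.max(np.abs(x - p))

x = np.array([2.0, 3.0])
p = np.array([5.0, 7.0])
print(euclidean(x, p), euclidean_squared(x, p), city_block(x, p), chebychev(x, p))
# 5.0 25.0 7.0 4.0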
4.2.2.2 K-Nearest Neighbor Predictions
After choosing the value of K, predictions are made from the K nearest neighbor samples. For regression, the KNN prediction is the average of the outcomes of the K nearest neighbors:

\hat{y} = \frac{1}{K} \sum_{i=1}^{K} y_i    (4.2)

where y_i is the outcome of the ith nearest case of the sample and \hat{y} is the prediction (result) for the query point. In a classification problem, the KNN prediction is made by a voting scheme in which the winning class is used to label the query. Generally the K neighbors have an equal influence on the prediction regardless of their relative distance from the query point. An alternative methodology is to use arbitrarily large values of K, with more weight given to the cases nearest to the query point. This is accomplished by using 'distance weighting'.
4.2.2.3 Distance Weighting
Since KNN forecasts are based on the belief that items close in distance are potentially similar, it is reasonable to discriminate between the K nearest neighbors when making a prediction, i.e., to let the closest points among the K nearest neighbors have a greater say in the outcome for the query point. This can be achieved by introducing a set of weights W, one for each nearest neighbor, defined by the relative closeness of that neighbor to the query point. Thus

W(x, p_i) = \frac{\exp(-D(x, p_i))}{\sum_{j=1}^{K} \exp(-D(x, p_j))}    (4.3)

where D(x, p_i) is the distance between the query point x and the ith case p_i of the sample. It is clear that the weights defined in this manner satisfy:

\sum_{i=1}^{K} W(x, p_i) = 1    (4.4)

Thus, for regression problems, we have:

\hat{y} = \sum_{i=1}^{K} W(x, p_i) \, y_i    (4.5)

For classification problems, the class with the highest value of the above weighted sum is taken among the class variables. It is also evident that when K > 1, the standard deviation of the prediction in regression tasks can be characterized as

s = \sqrt{\sum_{i=1}^{K} W(x, p_i) (y_i - \hat{y})^2}    (4.6)
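A minimal sketch of this distance-weighted prediction, assuming the exponential weighting of Eq. (4.3) and NumPy arrays of neighbor distances and outcomes (the function name weighted_knn_predict and the toy values are illustrative only):

import numpy as np

def weighted_knn_predict(distances, outcomes):
    """Distance-weighted KNN regression over the K nearest neighbors.

    distances: array of D(x, p_i) for the K nearest neighbors of the query x
    outcomes:  array of the corresponding y_i values
    """
    w = np.exp(-np.asarray(distances, dtype=float))
    w = w / w.sum()                                     # Eq. (4.3); weights sum to 1 (Eq. 4.4)
    y_hat = np.sum(w * outcomes)                        # Eq. (4.5)
    s = np.sqrt(np.sum(w * (outcomes - y_hat) ** 2))    # Eq. (4.6)
    return y_hat, s

# three nearest neighbors at distances 0.2, 0.5, 1.0 with outcomes 10, 12, 20
print(weighted_knn_predict([0.2, 0.5, 1.0], np.array([10.0, 12.0, 20.0])))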
Some merits of KNN are as follows: it is easy to use; it is resilient to noisy training samples, particularly if the inverse square of the weighted distance is used as the distance measure; and it is effective if the training data is large. In spite of these advantages, it has a few demerits: a) it is computationally expensive, since the distance from each query sample to all training samples must be computed; b) it requires memory in proportion to the size of the training set; c) it has a low precision rate on multidimensional datasets; d) the value of the parameter K, the number of nearest neighbors, must be determined; e) in distance-based learning it is not clear which kind of distance to use; and f) it must be decided which labels are best suited to produce the best results.
Therefore, to overcome the low precision rate of KNN, a Modified KNN (MKNN) has been proposed in this research work. MKNN preprocesses the training set before using it and computes the validity of every training sample. The final classification is then made by applying a weighted KNN that uses this validity as a multiplicative factor.
4.3 Modified K-Nearest Neighbor (MKNN)
In this research, the Modified K-Nearest Neighbor algorithm is used to predict the movement of the stock index. The fundamental idea of the presented technique is to assign the class label of the queried instance using K validated training data points: first the validity of every sample in the training set is calculated, and then a weighted KNN is performed on the validated training samples. Fig 4.3 shows the pseudo code of the MKNN algorithm.
Pseudo-code of the MKNN algorithm
Output_label := MKNN(train_set, test_sample)
Begin
  For i := 1 to train_size
    Validity(i) := compute validity of the i-th training sample
  End for
  Output_label := Weighted_KNN(Validity, test_sample)
  Return Output_label
End
Fig 4.3 Pseudo code of the MKNN algorithm
4.3.1 Data and Sources of Data
This work examines the daily movement of the closing values of the NSE-NIFTY and BSE stock indices using the following predictors: open price, high price, low price and close price. The NSE-NIFTY and BSE stock index values are obtained from the NSE and BSE websites, respectively, for the period from January 2010 to December 2013. The data is split into two subsets in roughly an 80:20 ratio, where the in-sample or training data spans from January 2010 to December 2012 and the data for the remaining period, from January 2013 to December 2013, is used as the out-of-sample or test data.
4.3.2 Preprocessing the data
When the data was first gathered, all the values of the chosen attributes were continuous numeric values. A data conversion was applied that generalizes the data to a higher-level concept so that all the values become discrete. The rule used to convert the numeric values of each attribute to discrete values depends on the previous trading day's closing price of the stock. If the values of the attributes open, high, low and close were greater than the previous day's closing value, the numeric values of the attribute were replaced by the value positive. If the values of the attributes mentioned above were less than the previous value, they were replaced by negative. If the values of those attributes were equal to the previous value, they were replaced by the value equal.
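A minimal pandas sketch of this discretization, assuming a DataFrame with 'Open', 'High', 'Low' and 'Close' columns indexed by trading day (the column names and toy values are assumptions for illustration):

import pandas as pd

def discretize(df):
    """Replace numeric OHLC values by 'positive', 'negative' or 'equal'
    relative to the previous trading day's closing price."""
    prev_close = df["Close"].shift(1)
    out = pd.DataFrame(index=df.index)
    for col in ["Open", "High", "Low", "Close"]:
        out[col] = "equal"
        out.loc[df[col] > prev_close, col] = "positive"
        out.loc[df[col] < prev_close, col] = "negative"
    return out.iloc[1:]   # the first day has no previous close

df = pd.DataFrame({"Open": [100, 102, 101], "High": [103, 104, 102],
                   "Low": [99, 100, 98], "Close": [102, 101, 100]})
print(discretize(df))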
4.3.3 Building the Model
After the data has been prepared and converted, the next step is to build the forecasting model using the MKNN algorithm. MKNN was chosen because the construction of MKNN classifiers does not require any domain knowledge, which makes it suitable for exploratory knowledge discovery, and because it can deal with high dimensional data. In the MKNN algorithm, every sample in the training set must first be validated. The validity of each point is computed from its neighbors, and the validation procedure is performed for all training samples. To validate a sample point in the training set, the H nearest neighbors of the point are considered. Among the H nearest neighbors of a training sample x, validity(x) counts the number of points whose label is the same as the label of x. The formula proposed to calculate the validity of every point in the training set is

Validity(x) = \frac{1}{H} \sum_{i=1}^{H} S\big(lbl(x), lbl(N_i(x))\big)    (4.7)

where H is the number of considered neighbors, lbl(x) returns the true class label of sample x, and N_i(x) stands for the ith nearest neighbor of the point x. The function S takes into account the similarity between the label of point x and that of its ith nearest neighbor:

S(a, b) = \begin{cases} 1 & a = b \\ 0 & a \neq b \end{cases}    (4.8)
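A minimal sketch of the validity computation of Eqs. (4.7) and (4.8), assuming NumPy arrays of training features and labels and Euclidean distances (the names validities, X and y are illustrative):

import numpy as np

def validities(X, y, H=5):
    """Validity of each training sample: the fraction of its H nearest
    neighbors (excluding itself) that share its class label, Eq. (4.7)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(X)
    val = np.zeros(n)
    for i in range(n):
        d = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        d[i] = np.inf                       # do not count the point itself
        neighbors = np.argsort(d)[:H]
        # S(lbl(x), lbl(N_i(x))) is 1 when the labels agree, else 0, Eq. (4.8)
        val[i] = np.mean(y[neighbors] == y[i])
    return val

X = [[1, 1], [1.1, 0.9], [0.9, 1.2], [5, 5], [5.1, 4.9], [4.8, 5.2]]
y = ["bull", "bull", "bear", "bear", "bear", "bear"]
print(validities(X, y, H=3))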
4.3.3.1 Prediction Model
The prediction model considers the opening value, high value, low value and closing value of the market index as independent variables and the next day's closing value as the dependent variable. The MKNN algorithm identifies the 'k' nearest neighbors in the training data set in terms of the Euclidean distance with respect to the day for which the prediction is to be made. Once the k nearest neighbors are identified, the prediction for that day is computed as the average of the next day's closing prices of those neighbors. MKNN then employs weighted KNN on the test data set for predicting the next day's closing value. The output of the predictive model is compared with the actual values of the test dataset for validation.
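A minimal sketch of this prediction step, assuming NumPy arrays where train_X holds the OHLC values of each training day and train_next_close the corresponding next day's closing values (all names and toy values are illustrative):

import numpy as np

def predict_next_close(query_ohlc, train_X, train_next_close, k=5):
    """Predict the next day's close as the average of the next-day closes
    of the k training days nearest (in Euclidean distance) to the query day."""
    d = np.sqrt(((train_X - query_ohlc) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    return train_next_close[nearest].mean()

train_X = np.array([[100, 103, 99, 102], [102, 104, 100, 101], [101, 102, 98, 100]])
train_next_close = np.array([101.0, 100.0, 99.5])
print(predict_next_close(np.array([101, 103, 99, 101]), train_X, train_next_close, k=2))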
Applying weighted KNN
Each of the K samples is given a weighted vote that is usually some decreasing function of its distance from the unknown sample. For example, the vote might be set equal to 1/(d_e + 1), where d_e is the Euclidean distance. These weighted votes are then summed for each class, and the class with the largest total vote is chosen. This distance-weighted KNN technique is very similar to the window technique for estimating density functions. For example, using a weight of 1/(d_e + 1) is equivalent to the window technique with a window function of 1/(d_e + 1) when K is chosen equal to the total number of training samples.
In the MKNN method, the raw weight of each neighbor is first computed using

w(i) = \frac{1}{d_e + 1}    (4.9)

Then the validity of that training sample is multiplied by this raw weight, which is based on the Euclidean distance. In the MKNN method, the weight of each neighbor sample is therefore derived according to

W(i) = Val(i) \times \frac{1}{d_e + 1}    (4.10)

Here W(i) and Val(i) stand for the weight and the validity of the ith nearest sample in the training set.
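A minimal sketch of the validity-weighted vote of Eqs. (4.9) and (4.10), assuming the validity values have already been computed (for instance with the validities() sketch above); the name mknn_classify is illustrative only:

import numpy as np

def mknn_classify(query, train_X, train_y, validity, k=7):
    """Label the query by the class with the largest sum of validity-weighted
    votes W(i) = Val(i) * 1/(d_e + 1) over its k nearest neighbors."""
    train_X = np.asarray(train_X, dtype=float)
    d = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    votes = {}
    for i in nearest:
        w = validity[i] * 1.0 / (d[i] + 1.0)          # Eq. (4.10)
        votes[train_y[i]] = votes.get(train_y[i], 0.0) + w
    return max(votes, key=votes.get)

# usage with the toy data from the validity sketch:
# val = validities(X, y, H=3)
# print(mknn_classify(np.array([1.0, 1.0]), X, y, val, k=3))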
4.3.3.2 Classifier Model
The classifier model considers the opening value, high value, low value, closing value and return of the market index as independent variables and the next day's class as the dependent variable. The return for a day is calculated as

R_t = \frac{v_t - v_{t-1}}{v_{t-1}}    (4.11)

where v_t is the closing value of the index on the current day and v_{t-1} is the closing value of the index on the previous day. If the next day's return is positive, the next day's class is labeled "bull", otherwise "bear". The output of the classifier is compared with the actual classes of the test data set to evaluate the effectiveness of the approach.
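A minimal pandas sketch of the return calculation and bull/bear labelling of Eq. (4.11), assuming a series of daily closing values (the function name label_days and toy values are illustrative):

import pandas as pd

def label_days(close):
    """Compute daily returns (v_t - v_{t-1}) / v_{t-1} and the next day's class."""
    close = pd.Series(close, dtype=float)
    returns = close.pct_change()                       # Eq. (4.11)
    next_class = returns.shift(-1).map(
        lambda r: None if pd.isna(r) else ("bull" if r > 0 else "bear"))
    return pd.DataFrame({"Close": close, "Return": returns, "NextClass": next_class})

print(label_days([100, 102, 101, 103]))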
4.4 Empirical Results
The examined data sample consists of daily returns from January 2010 to December 2013 for three stock market indices: BSE Oil and Gas, CNX-100 and CNX-NIFTY. The data samples are collected from the historical values of the NSE-NIFTY and BSE (Bombay Stock Exchange) data. The total data set is split into two parts, one for training the model and the remainder for testing its performance. In this experiment, the stock index data from January 2010 to December 2012 is used to train the model and the data from January 2013 to December 2013 is used to test the performance of the proposed approach.
4.4.1 Performance Measures
The following performance measures are used to gauge the performance of the trained forecasting model on the test data: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared (R^2), adjusted R-squared (R_A^2) and the Hannan-Quinn Information Criterion (HQ). Table 4.2 lists these performance measures and the related formulas.
Table 4.2: Performance criteria and the related formulas (y_i = real value, \hat{y}_i = estimated value, \bar{y} = mean of the real values, n = number of observations, k = number of predictors)

Performance Criteria                    | Formula
Mean Squared Error (MSE)                | MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
Root Mean Squared Error (RMSE)          | RMSE = \sqrt{MSE}
R-Squared (R^2)                         | R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
Adjusted R-Squared (R_A^2)              | R_A^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}
Hannan-Quinn Information Criterion (HQ) | HQ = n \ln\left(\frac{SSR}{n}\right) + 2k \ln(\ln n), where SSR = \sum_i (y_i - \hat{y}_i)^2
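A minimal NumPy sketch of the criteria in Table 4.2, assuming the standard definitions of the adjusted R-squared and Hannan-Quinn criterion (the helper name performance, the toy values and the choice of k = 4 predictors are illustrative):

import numpy as np

def performance(y_true, y_pred, k=4):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    ssr = np.sum((y_true - y_pred) ** 2)               # sum of squared residuals
    mse = ssr / n
    rmse = np.sqrt(mse)
    r2 = 1 - ssr / np.sum((y_true - y_true.mean()) ** 2)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    hq = n * np.log(ssr / n) + 2 * k * np.log(np.log(n))
    return {"MSE": mse, "RMSE": rmse, "R2": r2, "Adj R2": adj_r2, "HQ": hq}

print(performance([10, 12, 11, 13, 12, 14], [10.5, 11.5, 11.2, 12.7, 12.1, 13.6]))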
4.4.2 Results
4.4.2.1 Prediction Model
Fig 4.4 presents the results for the returns (close price) for the year 2013 of the BSE Oil and Gas
index obtained using Modified KNN (MKNN) and table 4.3 shows the error rate of the proposed
approach using various performance measures.
[Figure: actual and predicted monthly close price of the BSE Oil and Gas index, January to December 2013]
Fig 4.4 BSE Predicted Close Price Value
Table 4.3 Error Rate of BSE

Test Criteria                           | Error Rate (%)
Mean Squared Error (MSE)                | 3.87
Root Mean Squared Error (RMSE)          | 5.98
R-Squared (R^2)                         | 0.35
Adjusted R-Squared (R_A^2)              | 1.67
Hannan-Quinn Information Criterion (HQ) | -5.03
Fig 4.5 presents the results for the returns (close price) for the year 2013 of the NSE CNX 100
index and table 4.4 shows the error rate of the MKNN approach using various performance
measures.
[Figure: actual and predicted monthly close price of the CNX-100 index, January to December 2013]
Fig 4.5 Predicted Close Price of CNX-100 Stock Index
Table 4.4 Error Rate of CNX-100 Stock Index

Test Criteria                           | Error Rate (%)
Mean Squared Error (MSE)                | 3.67
Root Mean Squared Error (RMSE)          | 4.98
R-Squared (R^2)                         | 0.38
Adjusted R-Squared (R_A^2)              | 1.98
Hannan-Quinn Information Criterion (HQ) | -5.03
Fig 4.6 presents the results for the returns (close price) for the year 2013 of the NSE CNX
NIFTY index and table 4.5 shows the error rate of the proposed approach using various
performance measures.
[Figure: actual and predicted monthly close price of the CNX-NIFTY index, January to December 2013]
Fig 4.6 Predicted Close Price of CNX-NIFTY Stock Index
Table 4.5 Error Rate of CNX-NIFTY Stock Index

Test Criteria                           | Error Rate (%)
Mean Squared Error (MSE)                | 3.89
Root Mean Squared Error (RMSE)          | 4.43
R-Squared (R^2)                         | 0.45
Adjusted R-Squared (R_A^2)              | 1.78
Hannan-Quinn Information Criterion (HQ) | -5.03
4.4.2.2 Classification Model
The results obtained from the two classifier models for BSE Oil and Gas, CNX-100 and CNX-NIFTY are given below.
Table 4.6 Comparison of Classifier Models on the Test Dataset for BSE Oil and Gas

                       | k-NN algorithm       | MKNN algorithm
                       | Instances | Accuracy | Instances | Accuracy
Correctly classified   | 258       | 77.9%    | 294       | 88.8%
Incorrectly classified | 73        | 22.1%    | 37        | 12.2%
Table 4.6 shows that the MKNN algorithm correctly classifies the next day's index movement of the BSE Oil and Gas index for 294 instances out of a total of 331 instances, an accuracy rate of 88.8%, and misclassifies 37 instances, an error rate of 12.2%. In contrast, KNN correctly classifies the next day's index movement for only 258 of the 331 instances, an accuracy rate of 77.9%, and misclassifies 73 instances, an error rate of 22.1%.
Table 4.7 Comparison of Classifier Models on the Test Dataset for CNX-100

                       | k-NN algorithm       | MKNN algorithm
                       | Instances | Accuracy | Instances | Accuracy
Correctly classified   | 254       | 76.89%   | 290       | 88.01%
Incorrectly classified | 77        | 22.87%   | 41        | 12.79%
Table 4.7 shows that the MKNN algorithm correctly classifies the next day's index movement of the CNX-100 index for 290 instances out of a total of 331 instances, an accuracy rate of 88.01%, and misclassifies 41 instances, an error rate of 12.79%. In contrast, KNN correctly classifies the next day's index movement for only 254 of the 331 instances, an accuracy rate of 76.89%, and misclassifies 77 instances, an error rate of 22.87%.
Table 4.8 Comparison of Classifier Models on the Test Dataset for CNX-NIFTY

                       | k-NN algorithm       | MKNN algorithm
                       | Instances | Accuracy | Instances | Accuracy
Correctly classified   | 256       | 77.01%   | 295       | 88.57%
Incorrectly classified | 75        | 22.45%   | 36        | 11.98%
Table 4.8 shows that the MKNN algorithm correctly classifies the next day's index movement of the CNX-NIFTY index for 295 instances out of a total of 331 instances, an accuracy rate of 88.57%, and misclassifies 36 instances, an error rate of 11.98%. In contrast, KNN correctly classifies the next day's index movement for only 256 of the 331 instances, an accuracy rate of 77.01%, and misclassifies 75 instances, an error rate of 22.45%.
Table 4.9 Confusion Matrices for BSE Oil and Gas

             | k-NN algorithm (Predicted) | MKNN algorithm (Predicted)
Actual Class | Bull | Bear                | Bull | Bear
Bull         | 104  | 60                  | 150  | 14
Bear         | 19   | 148                 | 9    | 158
It is seen from Table 4.9 that the MKNN algorithm correctly classifies 150 bull class instances out of a total of 164 bull class instances and correctly classifies 158 bear class instances out of a total of 167 bear class instances, whereas KNN shows a lower performance compared with the MKNN model.
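A small NumPy sketch showing how the per-class correct counts and accuracies follow from a 2x2 confusion matrix such as those in Tables 4.9 to 4.11; the MKNN matrix for BSE Oil and Gas is used as the example:

import numpy as np

# rows = actual class (bull, bear), columns = predicted class (bull, bear);
# values taken from the MKNN matrix for BSE Oil and Gas in Table 4.9
mknn_bse = np.array([[150, 14],
                     [9, 158]])

correct_per_class = np.diag(mknn_bse)           # [150, 158]
totals_per_class = mknn_bse.sum(axis=1)         # [164, 167]
print(correct_per_class, totals_per_class, correct_per_class / totals_per_class)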
Table 4.10 Confusion Matrices for CNX-100

             | k-NN algorithm (Predicted) | MKNN algorithm (Predicted)
Actual Class | Bull | Bear                | Bull | Bear
Bull         | 100  | 64                  | 149  | 15
Bear         | 22   | 145                 | 12   | 155
Table 4.10 shows that the MKNN algorithm correctly classifies 149 bull class instances out of a total of 164 bull class instances, while KNN classifies only about 100, and that MKNN correctly classifies 155 bear class instances out of a total of 167 bear class instances, compared with about 145 for KNN.
Table 4.11 Confusion Matrices for CNX-NIFTY

             | k-NN algorithm (Predicted) | MKNN algorithm (Predicted)
Actual Class | Bull | Bear                | Bull | Bear
Bull         | 101  | 63                  | 151  | 13
Bear         | 21   | 146                 | 11   | 156
Table 4.11 shows that the MKNN algorithm correctly classifies 151 bull class instances out of a total of 164 bull class instances, while KNN classifies only about 101, and that MKNN correctly classifies 156 bear class instances out of a total of 167 bear class instances, compared with about 146 for KNN.