International Journal of Science, Engineering and Technology

International Journal of Science, Engineering and Technology Research (IJSETR)
Volume 1, Issue 1, July 2012
Selection of Appropriate Candidates for
Scholarship Application Form Using KNN
Algorithm
Khin Thuzar Tun1, Aung Myint Aye2
Department of Information Technology, Mandalay Technological University
khinthuzar71183@gmail.com
Abstract—The proposed system is purposed to make decision
for Universities’ scholarship programs. This system defines
required facts for specified application forms and rules for
these facts. KNN (K-Nearest Neighbor) provides nearest result
for scholarship program based on some suitable similarity
function or distance metric. Euclidean distance is used to
calculate the distance between training and test data. In this
system, the required data of administrator and participants are
stored in SQL database. C#.Net programming language is used
to implement this system.
Keywords— Decision-making, universities’ scholarship, KNN,
Euclidean distance, SQL
I. INTRODUCTION
Since technological developments has been increased, web
based applications have been popular in various fields such as
business, education, medical and so on. Today, different
systems are very popular in web applications. Similarly,
candidates who need to attend in scholar programs at foreign
universities can study and apply using online system. So,
decision-making system is used in various web application
as well as in other fields.
Decision-making can be regarded as the cognitive process
resulting in the selection of a belief and/or a course of action
among
several
alternative
possibilities.
Every
decision-making process produces a final choice [1] that may
or may not prompt action. Decision-making can also be
known as a problem-solving activity terminated by a solution
deemed to be satisfactory.
Three major types of pattern recognition terminology:
unsupervised, semi-supervised and supervised learning.
Supervised learning is based on training a data sample
from data source with correct classification already assigned.
Self-Organizing neural networks learn using unsupervised
learning algorithm to identify hidden patterns in unlabelled
input data. This unsupervised refers to the ability to learn and
organize information without providing an error signal to
evaluate the potential solution [2].
The supervised category is also called classification or
regression, each object of the data comes with a
pre-assigned class label. The task is to train a classifier to
perform the labeling, using the teacher. A procedure which
tries to leverage the teacher‘s answer to generalize the
problem and obtain his knowledge is learning algorithm. The
data and the teacher‘s labeling are supplied to the machine to
run the procedure of learning over the data. Although the
classification knowledge learned by the machine in this
process might be obscure, the recognition accuracy of the
classifier will be the judge of its quality of learning or its
performance [3].
There are many classification and clustering methods as
well as the combinational approaches [4-5]. While the
supervised learning tries to learn from the true labels or
answers of the teacher, in semi-supervised the learner
conversely uses teacher just to approve or not to approve the
data in total. It means that in semi-supervised learning there is
not really available teacher or supervisor. The procedure first
starts with fully random manner, and when it reaches the state
of final, it looks to the condition whether he win or lose. K
Nearest Neighbor (KNN) classification is one of the most
fundamental and simple classification methods.
A technique that classifies each record in a dataset based
on a combination of the classes of the k record(s) most similar
to it in a historical dataset (where k= 1). Sometimes it is
called the k-nearest neighbor technique. K-Nearest Neighbor
is a supervised learning algorithm where the result of new
instance query is classified based on majority of K-Nearest
Neighbor category. . Many researchers have found that the
K-NN algorithm accomplishes very good performance in
their experiments on different data sets.
In this system, KNN algorithm with Euclidean distance is
used to make suitable decision for online scholarship
programs in order to choose the suitable candidates.
III. CLASSIFICATION AND CLUSTERING
Data mining has recently emerged as a growing field of
multidisciplinary research. It combines disciplines such as
databases, machine learning, artificial intelligence, statistics,
automated scientific discovery, data visualization, decision
science, and high performance computing.
Data mining technique is used often with large database,
data warehouse etc. It is mainly applied in decision support
systems for modeling and prediction. There are several kinds
of data mining: classification, clustering, association,
sequencing etc.
Two common data mining techniques are clustering and
classification.
For classification, the classifier model is needed. Data are
divided into training set and test set. The training data is used
to create the model. Then the test set applied for checking the
model correctness. Until satisfied, the model is trained and
adjusted by training data. Common techniques used in
1
All Rights Reserved © 2012 IJSETR
International Journal of Science, Engineering and Technology Research (IJSETR)
Volume 1, Issue 1, July 2012
classification are decision tree, neural network, naïve bayes,
Euclidean distance, etc.
For clustering, a loose definition of clustering could be
“the process of organizing objects into groups whose
members are similar in some way”. A cluster is therefore a
collection of objects which are “similar to each other and are
“dissimilar” to the objects belonging to other clusters. Cluster
analysis is also used to form descriptive statistics to ascertain
whether or not the data consists of a set distinct subgroups,
each group representing objects with substantially different
properties.
Imaging a database of customer records, where each record
represents a customer's attributes. These can include
identifiers such as name and address, demographic
information such as gender and age, and financial attributes
such as income and revenue spent.
Clustering is an automated process to group related
records together. Related records are grouped together on the
basis of having similar values for attributes. In fact, the
objective of the analysis is often to discover segments or
clusters, and then examine the attributes and values that
define the clusters or segments. As such, interesting and
surprising ways of grouping customers together can become
apparent.
Classification is a different technique than clustering.
Classification is an important part of machine learning that
has attracted much of the research endeavors. Various
classification approaches, such as, k-means, neural networks,
decision trees, and nearest neighborhood have been
developed and applied in many areas.
A classification problem occurs when an object needs to be
assigned into a predefined group or class based on a number
of observed attributes related to that object. There are many
industrial problems identified as classification problems. For
examples, Stock market prediction, Weather forecasting,
Bankruptcy prediction, Medical diagnosis, Speech
recognition, Character recognitions to name a few [6-7].
Classification technique is capable of processing a wider
variety of data and is growing in popularity. The various
classification techniques are Bayesian network, tree
classifiers, rule based classifiers, lazy classifiers, Fuzzy set
approaches, rough set approach etc.
IV. K -NEAREST NEIGHBORS ( KNN )
In 1968, Cover and Hart proposed an algorithm the
K-Nearest Neighbor, which was finalized after some time.
K-Nearest Neighbor can be calculated by calculating
Euclidean distance, although other measures are also
available but through Euclidean distance we have splendid
intermingle of ease, efficiency and productivity [8].
Nearest Neighbor Classification is quite simple, examples
are classified based on the class of their nearest neighbors.
For example ,if it walks like a duck, quacks like a duck, and
looks like a duck, then it's probably a duck. The k - nearest
neighbor classifier is a conventional nonparametric classifier
that provides good performance for optimal values of k. In the
k – nearest neighbor rule, a test sample is assigned the class
most frequently represented among the k nearest training
samples. If two or more such classes exist, then the test
sample is assigned the class with minimum average distance
to it. It can be shown that the k – nearest neighbor rule
becomes the Bayes optimal decision rule as k goes to infinity
[9]. The K-NN classifier (also known as instance based
classifier) perform on the premises in such a way that
classification of unknown instances can be done by relating
the unknown to the known based on some distance/similarity
function. The main objective is that two instances far apart in
the instance space those are defined by the appropriate
distance function are less similar than two nearly situated
instances to belong to the same class [10].
The k-nearest neighbour (k-NN) technique, due to its
interpretable nature, is a simple and very intuitively
appealing method to address classification problems.
However, choosing an appropriate distance function for
k-NN can be challenging and an inferior choice can make the
classifier highly vulnerable to noise in the data. The best
choice of k depends upon the data; generally, larger values of
k reduce the effect of noise on the classification, but make
boundaries between classes less distinct. A good k can be
selected by various heuristic techniques.
In binary (two class) classification problems, it is helpful to
choose k to be an odd number as this avoids tied votes. The
K-Nearest Neighbor algorithm is amongst the simplest of all
machine learning algorithms: an object is classified by a
majority vote of its neighbors, with the object being assigned
to the class most common amongst its k nearest neighbors (k
is a positive integer, typically small). Usually Euclidean
distance is used as the distance metric; however this is only
applicable to continuous variables. In cases such as text
classification, another metric such as the overlap metric or
Hamming distance, for example, can be used.
K nearest neighbors is a simple algorithm that stores all
available cases and classifies new cases based on a similarity
measure (e.g., distance functions). KNN has been used in
statistical estimation and pattern recognition already in the
beginning of 1970’s as a non-parametric technique.
K nearest neighbor algorithm is very simple. It works based
on minimum distance from the query instance to the training
samples to determine the K-nearest neighbors. The data for
KNN algorithm consist of several attribute names that will be
used to classify. The data of KNN can be any measurement
scale from nominal, to quantitative scale.
The KNN algorithm is shown in the following form:
Input: D, the set of k training objects, and test object z= (x',
y').
Process: Compute d(x', x), the distance between z and every
object, (x, y) ∈ D. Select Dz ⊆ D, the set of k closet training
objects to z.
Output: y'= argmaxv ∑(xi, yi) ⊆ Dz I(v= yi)

v is a class label

yi is the class label for the ith nearest neighbors
 I (.) is an indicator function that returns the value 1 if
its argument is true and 0 otherwise.
In this system, KNN algorithm is used the suitable result by
mixing the Euclidean distance among the various kinds of
distance metric. The Euclidean distance is as shown in below
:
d ij 
x
i1
 x j1   xi 2  x j 2     xip  x jp 
2
2
2
Where
d ij = the distance between the training objects and test
object
xi = input data for test object
xj = data for training objects stored in the database
In KNN algorithm, there are several advantages and
disadvantages:
2
All Rights Reserved © 2012 IJSETR
International Journal of Science, Engineering and Technology Research (IJSETR)
Volume 1, Issue 1, July 2012


Advantages
 Robust to noisy training data
 KNN is particularly well suited for
multi-modal classes as well as applications
in which an object can have many class
labels.
 KNN is simple but effective method for
classification.
 KNN is an easy to understand and easy to
implement classification technique.
 Effective if the training data is large
Disadvantages




KNN is low efficiency for dynamic web
mining with a large repository.
Distance based learning system is not clear
which type of distance to use and which
attribute to use to produce the best results.
Need to determine value of parameter K
(number of nearest neighbors)
Computation cost is quite high because it
needs to compute the distance of each
query instance to all training samples.
V. PROPOSED SYSTEM DESIGN
The proposed system design is illustrated in Figure 1.
In this system, the user firstly registered to enter and apply
for scholarship program in the system. If the user is new, he
must fill in the register completely for a new account and
then must login. If he doesn’t fill in fully, the system will
prompt the error message. After the user login, he can select
the educational location and university he need to apply. He
can enter consequently home page of the desired university
and download the scholarship application form by clicking
the download link. To submit scholar, firstly the user fill the
information required for scholarship program and click the
submit button. After submitting, the system calculates
distance using the KNN classifier and Euclidean distance
with training data in database. Finally the system decides the
appropriate result for scholarship according to distance.
Three scholarship universities is included to implement this
system.
Decision-making system for online scholarship application
form is implemented by C#.Net programming language on
Microsoft .NET framework 3.0 and above and Microsoft
Internet Information Services (IIS) is intended to use in this
system. SQL database is used to store the applicants’ data and
training data for scholarship universities. This is also used for
various kinds of other online applications.
VI. SYSTEM IMPLEMENTATION RESULTS
This section describes the implementation results. When
the system starts, “Home Page” appears as shown in Figure 2.
In this page, there are three menus: Register, User login,
Admin login.
Figure 2. Home page
When the user is a new one, the user creates a user account
at the registration page as shown in Figure 3. If the user
doesn’t input completely, the error message will be
prompted.
Figure. 1 Proposed system flow chart
Figure 3. User registration page
3
All Rights Reserved © 2012 IJSETR
International Journal of Science, Engineering and Technology Research (IJSETR)
Volume 1, Issue 1, July 2012
If user registration is complete successfully, the user login
with the correct login name and password to enter the system
as shown in Figure 4.
Figure 7. Scholarship submittion page
Figure 4. User Login page
Figure 5 shows the scholarship’s Home page including the
login user name who logged into this system.
For administrator page, the administrator must enter the
login by entering the valid administrator name and password
in the Figure 8. The administrator is one who authorizes to
manage the whole system.
Figure 8. Administrator Login page
Figure 5. Scholarship home page
Figure 6 and 7 illustrate the home page and the scholarship
submission page for Tokyo University. In the submission
page, user can view the application form for this university
from download link. The user must fill in the required data in
the submission form fully and will the user get the result
whether he is appropriate to attend at that university. If he is
incomplete, the error message will appear in the submission
form.
The administrator can add the rules for each university in
the system corresponding to the university’s requirements.
The rules adding page corresponding to the university is as
shown in Figure 9. In Figure 10, the rules table for Tokyo
university is illustrated as example .
Figure 6. Tokyo University’s home page
Figure 9. Rules adding page
4
All Rights Reserved © 2012 IJSETR
International Journal of Science, Engineering and Technology Research (IJSETR)
Volume 1, Issue 1, July 2012
[7] U. Khan, T. K. Bandopadhyaya, and S. Sharma,
“Classification of Stocks Using Self Organizing Map”,
International Journal of Soft Computing Applications, Issue
4, 2009, pp.19-24.
[8] Dasarathy, B. V., “Nearest Neighbor (NN) Norms,NN
Pattern Classification Techniques”. IEEE Computer Society
Press, 1990.
[9] ^ James Reason (1990). Human Error. Ashgate.
ISBN 1-84014-104-2.
[10] Man Lan, Chew Lim Tan, Jian Su, and Yue Lu,
“Supervised and Traditional Term Weighting Methods for
Automatic Text Categorization”, IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 31, No. 4,
April 2009.
Figure 10. Rules table
VII. CONCLUSION
This paper proposed an online decision making system for
scholarship. KNN classification algorithm is used to select
suitable candidate for scholarship program. This algorithm
classifies the instances based on the similarity function to the
instance in the training data (rules data). The system decides
the selection of appropriate candidates by using C#.NET
programming language. Internet Information Services (IIS)
as web server and SQL database to store the data are used.
This system is suitable for many online decision making
systems in various fields.
VIII. SYSTEM LIMITATIONS
In the proposed system, as KNN is a “lazy” learning
algorithm and results a high computational cost at the
classification time, the administrator of the proposed system
want to prepare and store the conforming rules of the each
corresponding universities’ standards clearly and correctly in
the database. The drawback of K-NN is its inefficiency for
large scale and high dimensional data sets. Thus, better
algorithms are more appropriate than KNN if the system’s
data sets is very large scale.
REFERENCES
[1] ^ James Reason (1990). Human Error. Ashgate. ISBN 184014-104-2.
[2]Annamma Abraham , Dept. of Mathematics
B.M.S.Institute of Technology, Bangalore, India ,
“ Comparison of Supervised and Unsupervised Learning
Algorithms for Pattern Classification”, (IJARAI) .
[3] Cover, T.M., Hart, P.E., “Nearest neighbor pattern
classification”,
IEEE
Trans.
Inform.
Theory,
IT-13(1):21–27, 1967
[4] K. ITQON, Shunichi and I. Satoru, “Improving
Performance of k-Nearest Neighbor Classifier by Test
Features”, Springer Transactions of the Institute of
Electronics, Information and Communication Engineers
2001.
[5] Michael Steinbanch and Pang-Nang Tan “kNN:
k-Nearest Neighbors”
[6] Moghadassi, F. Parvizian, and S. Hosseini, “A New
Approach Based on Artificial Neural Networks for Prediction
of High Pressure Vapor-liquid Equilibrium”, Australian
Journal of Basic and Applied Sciences, Vol. 3, No. 3, pp.
1851-1862, 2009.
5
All Rights Reserved © 2012 IJSETR