Keystroke Biometric System Test Taker Setup and Data Collection

Hassan Poorshatery, Geoffrey Garcia, Elizabeth Teracino, Xiaolu Zhao, Vinnie Monaco, John Stewart, and Charles Tappert
{hp47222n, gg30810n, et75813p, zx84933n, jm51645n, ctappert}@pace.edu
Seidenberg School of CSIS, Pace University, White Plains, NY, 10606, USA

Abstract

Over the past seven years, Pace University’s Seidenberg School of Computer Science and Information Systems (CSIS) has developed the Pace Keystroke Biometric System (PKBS), a robust system that identifies and authenticates users by their typing rhythms and patterns, which can be used to differentiate between users. The PKBS consists of three components: the Keystroke Entry System (KES), which collects raw keystroke data over the Internet; the Keystroke Feature Extractor (KFE), which extracts feature vectors from the raw data; and the Keystroke Pattern Classifier (KPC), which is used in the authentication process. The work described in this paper focuses on enhancements to the Keystroke Entry System to support a real-world application: authenticating students taking online tests.

1. Introduction

Keystroke biometrics is one of the least studied biometrics; however, this has been changing in recent years due to the increase in online testing and email monitoring in corporate settings. It is a behavioural biometric that can be used for identification and authentication purposes. User authentication is the process of confirming or denying a user’s claimed identity. [5] Passwords are a common form of authentication; however, user authentication can also be accomplished with biometric systems based on what you are (e.g. fingerprints, iris) or how you behave (e.g. handwriting, signature, typing rhythm). An advantage of a keystroke behavioural biometric system over the alternatives is that the only piece of hardware needed is a keyboard, making it an inexpensive tool.

Keystroke dynamics are the patterns of rhythm and timing created when a person types. They include overall speed, variations in speed when moving between specific keys, common errors, and the length of time that keys are held down. The system records when each key is pressed and when it is released, which gives the duration of each key press and the latency between consecutive keystrokes. This rhythm is believed to be unique to an individual and is captured to build a biometric template for the future authentication of that same individual.

Pace University has been researching this method of identification and authentication through experimentation and the implementation of the PKBS software for more than seven years. With the increase in enrolment in online education, there is concern about evaluation security and academic integrity.[4] Ensuring that students are who they say they are during online examinations is extremely important. Similarly, this sort of authentication could be used when training and orientation examinations are administered in a business setting.

The Keystroke Biometric System process begins with an initial training period in which users register, log in to the system, and answer a set of practice questions to provide initial sample data. Later, a second set of tests is used to authenticate the users against the initial samples. The KES was revised to use JavaScript instead of the original Java applet, which removes the need for users to have Java installed and simplifies data entry on the user’s end. Feature measurements are extracted from the raw data by the Keystroke Feature Extractor (KFE) and are then processed by the Keystroke Pattern Classifier (KPC), which uses the k-Nearest Neighbour classifier. [1] The collected raw keystroke data samples (test) are processed and compared against the archived enrolment samples (train) to make an authentication decision. The test taker is either matched (accepted) or not matched (rejected).

The instructor first sets up the system by entering the course name and questions into a text file called “Prompts.txt”. Next, the students log in to the KES website, which is the test-taking environment, and register as new users. During registration, the student is asked to enter their first name, last name, whether they are right- or left-handed, and whether they are using a laptop or a desktop. Registered students can then log in to the site to take the test, answering the questions from the questions file. The KES displays each question while JavaScript event handlers monitor the text input area and capture the keystrokes the student enters. After completing each question, the student submits the answer, which the system saves, along with the keystroke dynamics information, in text files on the hosting server.

The purpose of this paper is to explain how to turn the KES into a real-world application that can authenticate students taking an online test, with an additional focus on improving the accuracy of the system.

2. Revised KES Interface

The Keystroke Entry System (KES) collects raw keystroke data over the Internet and has been customized for use by students taking online tests. Unlike the previous KES, the updated KES saves raw data text files that contain both the keystroke codes and the answer to each question from each student. After a data converter program is run, the keystroke codes in these raw data files are ready to be analyzed and compared by the Biometric Authentication Feature Extractor and Feature (data) Classifier, which identify the keystroke patterns of each test taker (student) and, finally, authenticate them.
The actual answers as typed by the students in the raw data file are used by the instructor for grading purposes.

2.1. Changes to the KES

Figure 1 shows the components of the new KES along with the other components of the PKBS.

Figure 1: PKBS revised keystroke entry system

In the previous version of the KES, the four experimental categories were copy task on a desktop, copy task on a laptop, free-text entry on a desktop, and free-text entry on a laptop. On completing registration or on returning to the site, a user was redirected to the activity selection page. The six pieces of information sent to, and required by, the Java applet included data such as experiment style, sequence number, keyboard style, and awareness. Lastly, the user was required to use Internet Explorer. [5]

The new Keystroke Entry System is a PHP-based web application that uses JavaScript compatible with the Mozilla Firefox browser (Figure 2). Unlike the previous version, there is no need to have Java installed on the user’s end.

Figure 2: KES home page

Figure 3: The data input screen

In line with the application requirements and the recommendations from previous team research, unnecessary information has been removed. For example, students are not informed that their keystrokes are captured during test taking for the purpose of biometric authentication; the data is simply acquired in background (stealth) mode. Test answers from the KES (Figure 3) are valid only if at least 200 keystrokes are collected (reduced from the 300 keystrokes used in previous research).

Upon completion of the entry, the user is thanked and the data output is stored as a .txt file within the application structure, named in the format <First>_<Last>_<Course Name>_PROMPT<Question #>. In addition to the raw data file, a file named in the format <First>_<Last>_PROMPTS_COMPLETED is created, which identifies all of the questions a user has already answered. For returning users, this file is appended to after each data entry to record which additional questions have been answered.

2.2. Data Format

As can be seen in Figure 4, apart from the file name, which shows the course/test name and question number, the raw data file generated from a user’s input within the Keystroke Entry System includes three sections of information:

1. Header area: Student or user’s name
2. Data Entry area:
   a. #: Sequence number for each keystroke
   b. Key: Character displayed onscreen: a, b, c…
   c. Keycode: ASCII character code corresponding to the Key field
   d. Press: Time of key press (ms)
   e. Release: Time of key release (ms)
   f. Duration (ms): Difference between the Press and Release times for a particular Sequence
   g. Latency (ms): Difference between the Press time for a Sequence and the Release time of the previous Sequence
3. Answer area:
   a. Unformatted copy of the user’s input

Notes:
- Key combinations, such as SHIFT + i, are recorded as individual sequences.
- Holding down a key results in a new sequence every 31 milliseconds following an initial 500 ms delay (depending on computer configuration); until the key is released, the release time is recorded as 0.

Figure 4: A raw data file of a student

2.3. Alpha and Beta Version

In an effort to gather user feedback and test the system, both alpha (initial) and beta (real) environments were configured. The difference between the two is that the alpha version is intended to test the new KES and PKBS as a general-purpose authentication system, using randomly selected questions from a list of generic questions, while the beta version is intended to work as an online test-taking system, a real-world application of the PKBS. The beta version presents test questions in an order determined by the instructor, and students must answer all of the questions.

For the alpha testing, four training data samples (free text) were collected from a group of fourteen users. The same group was asked for a second set of four testing data samples a week later. Although the questions are generic, the session simulates taking an online test and the data is marked as test data.

For the beta testing, a group of fourteen students from Lake Erie College, under the supervision of Professor John Stewart, were to be asked to submit sample training data and then complete an actual test, their final exam, using the system during the same sitting. However, because of time restrictions, both the training and the testing data were obtained from the single exam session, with each student’s answered questions later split in half into training and testing sets.

3. Authentication Experimental Method

As Figure 5 shows, to determine authentication accuracy the KBS uses a manually operated, multi-step approach, which begins with the capture of training and test data through the Alpha and Beta Keystroke Entry Systems. Following the collection of data, the raw files are input into the Keystroke Feature Extractor, which extracts the key features from the training and testing data together.

Figure 5: PKBS Experimental Procedure

A number of measurements, or features, are used to characterize a user’s typing pattern. These features are designed to describe an individual’s keystroke dynamics over writing samples of at least 200 characters. The features characterize a user’s key-press duration times, the transition times in going from one key to the next, the percentages of usage of the non-letter keys and mouse clicks, and the typing speed. The feature extractor program was designed and implemented to extract 239 features. The resulting output is a feature vector file, which is then manually split in half into two files corresponding to training and testing. Finally, the split files are input into the Authentication Classifier to determine authentication accuracy.
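
As a rough illustration of the duration and latency measurements defined in Section 2.2, the following Python sketch derives a handful of summary features from one keystroke sample. It is not the KFE itself, which extracts 239 features; the `Keystroke` record, the function names, and the particular summary statistics are assumptions made for this example, as is the choice to skip auto-repeat rows whose release time is recorded as 0.

```python
from dataclasses import dataclass
from statistics import mean, pstdev

MIN_KEYSTROKES = 200  # minimum sample size for a valid answer, as stated in the paper


@dataclass
class Keystroke:
    """One row of the Data Entry area (hypothetical in-memory form)."""
    key: str
    press_ms: int
    release_ms: int  # recorded as 0 while a key is still held down


def durations(sample):
    """Duration: release time minus press time of the same keystroke.

    Rows with a release time of 0 (auto-repeat while a key is held) are skipped.
    """
    return [k.release_ms - k.press_ms for k in sample if k.release_ms > 0]


def latencies(sample):
    """Latency: press time minus the release time of the previous keystroke."""
    return [
        cur.press_ms - prev.release_ms
        for prev, cur in zip(sample, sample[1:])
        if prev.release_ms > 0
    ]


def summarize(sample):
    """Reduce one sample to a few illustrative features (not the KFE's 239)."""
    d = durations(sample)
    lat = latencies(sample)
    elapsed_s = (sample[-1].release_ms - sample[0].press_ms) / 1000
    return {
        "mean_duration_ms": mean(d),
        "std_duration_ms": pstdev(d),
        "mean_latency_ms": mean(lat),
        "std_latency_ms": pstdev(lat),
        "keys_per_second": len(sample) / elapsed_s,
        "valid_sample": len(sample) >= MIN_KEYSTROKES,
    }


# Tiny made-up sample; a real answer would need at least 200 keystrokes.
sample = [
    Keystroke("h", 0, 80),
    Keystroke("i", 130, 200),
    Keystroke("!", 300, 390),
]
features = summarize(sample)
print(features)
```

Latency can be negative when a typist presses the next key before releasing the previous one, so the sketch leaves negative values in place rather than clipping them.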

The BAS uses the k-Nearest Neighbour classifier. As part of the processing, the multi-class input data is dichotomized into two classes: the classifier decides whether each test-taker sample is within-class or between-class. A within-class decision means “you are authenticated”; a between-class decision means “you are not authenticated”. [3]

4. Efficient Authentication Process

The Keystroke Feature Extractor program was modified by the previous team to offer different processes depending on whether it is working on training or testing data. A switch in the interface selects the training process, which outputs a file containing the standardization x-min and x-max values for each feature. For the testing process, the x-min and x-max values recorded by the training process are imported and used to perform the standardization. [5] Figure 6 shows the efficient experimental procedure, which is as follows:

1. Extract a feature file from the training raw data and output a file of x-min/x-max values.
2. Extract a feature file from the testing raw data by reading in the x-min/x-max value file from step 1.
3. Run the authentication classifier on the training and testing feature files.

Figure 6: KBS efficient Authentication Process

Figure 7: KFE interface

Figure 8: BAS or KPC component interface

5. Test Results Analysis

After a feature vector file has been generated by the KFE (Figure 7), the authentication component of the system uses the Biometric Authentication System (BAS), which compares test and train data to determine whether they match (Figure 8). The system asks for the maximum amount of dichotomy data to use, which equates to the maximum number of inter- or intra-class samples to create for experimentation, as well as the lowest N choices (which is used to optimize testing performance and stipulates the maximum nearest-neighbour test). [2]

After the dichotomy model is applied, the data is saved and is ready to be processed by the BAS Accuracy Calculator (Figure 9). The calculator takes the output file from the Biometric Authentication System and applies nearest-neighbour calculations to determine the false acceptance rate and the false rejection rate, as well as the overall performance of the test, for each of the nearest-neighbour settings.

Figure 9: Accuracy Calculator

In the user authentication system, one of the following four cases can occur for a given attempt by a user, where u1 is a registered user and u2 is an unregistered user unknown to the system:

True Positive: u1 claims to be u1 and is accepted
False Reject: u1 claims to be u1 but is rejected
False Accept: u2 claims to be u1 but is accepted
True Negative: u2 claims to be u1 and is rejected

False Acceptance Rate (FAR) and False Rejection Rate (FRR) are the error rates used to evaluate the performance of biometric classifiers. [3] FAR is the proportion of impostor attempts that are accepted, and FRR is the proportion of genuine attempts that are rejected. Because the raw keystroke data samples were gathered in two different scenarios, as previously explained, four different experiments were run to test the performance of the new system.

Table 1 shows the results for the generic (Alpha) test, using 56 training samples and 56 testing samples with an average of 279 keystrokes per sample. As the table shows, the best performance is 96.04%, at kNN = 9, which also gives the lowest FAR (0.89%); the FRR at that setting is 57.14%. The average performance is 94.61%.

Table 1: BAS Results for Alpha version (generic)

kNN          1       3       5       7       9       Avg
FRR          36.90%  60.71%  57.14%  57.14%  57.14%  53.78%
FAR          7.69%   1.85%   1.30%   1.17%   0.89%   2.58%
Performance  90.71%  94.94%  95.65%  95.71%  96.04%  94.61%

Table 2 shows the results for the students’ real exam (Beta test), using 56 samples for training and 56 samples for testing, with an average of 435 keystrokes per sample.

Table 2: BAS Results for Beta version (real-exam)

kNN          1       3       5       7       9       Avg
FRR          16.67%  39.29%  44.05%  44.05%  42.86%  37.38%
FAR          16.83%  7.14%   7.42%   7.42%   6.87%   9.14%
Performance  83.18%  91.10%  90.58%  90.58%  91.17%  89.32%

Comparing Tables 1 and 2, we can see that the performance for the real-world application is lower than in the generic test.

Table 3 shows the results of merging the generic and the students’ real-exam data (Alpha and Beta), using 112 samples for training and 112 samples for testing, with an average of 385 keystrokes per sample. At kNN = 9 the performance is 97.41%, higher than in both the Alpha test (Table 1) and the Beta test (Table 2). The FAR at kNN = 9 is 0.94%, which is lower than in the Beta test and only slightly higher than in the Alpha test (a difference of 0.05 percentage points). The trade-off is that the FRR is considerably higher than in both the Alpha and Beta tests: 61.90%, as opposed to 57.14% for the Alpha test and 42.86% for the Beta test.

Table 3: BAS Results for mixing Alpha & Beta raw data

kNN          1       3       5       7       9       Avg
FRR          42.26%  67.26%  63.69%  66.07%  61.90%  60.24%
FAR          5.90%   1.57%   1.37%   1.16%   0.94%   2.19%
Performance  93.11%  96.65%  96.94%  97.09%  97.41%  96.24%

Table 4 shows the results of merging both exams (Alpha and Beta), using 168 samples for training and 56 samples for testing, with the same average of 385 keystrokes per sample. This mix increased the performance by almost a full percentage point over the previous mix (Table 3): at kNN = 9 the performance is 98.38%, the highest of the four experiments. The FAR, at 0.33%, is also the lowest of the four experiments. However, the FRR at kNN = 9 is the highest as well, at 71.43%.

Table 4: BAS Results for mixing Alpha & Beta with more training samples

kNN          1       3       5       7       9       Avg
FRR          60.71%  78.57%  78.57%  71.43%  71.43%  72.14%
FAR          1.85%   0.46%   0.46%   0.60%   0.33%   0.74%
Performance  97.08%  98.12%  98.12%  98.12%  98.38%  97.96%

Both Tables 3 and 4 indicate that the more the system is trained, the higher the performance. In the last case, illustrated by Table 4, we tested the new KBS with around 86,240 test patterns (merging the generic and student real-exam data with more training samples) and obtained the best FAR of 0.33%, with an FRR of 71.43% and a performance of 98.38%. By contrast, the best performance in Table 2 (training and testing features from students only) is 91.17%.

As a result, to increase the accuracy of authenticating students taking an online exam with the revised KES, it is necessary to merge more prior samples from the data bank into the training data, so that the system is trained more thoroughly. Furthermore, improving the feature extractor to include more features and decrease error rates will help to increase BAS performance.

6. Future Works

The current Pace Keystroke Biometric System (PKBS) is the result of several evolutionary prototyping projects, each focusing on different components of the system. Because of this approach, the system lacks certain elements of cohesion, automation, and process that would be necessary before release in a production environment. It is recommended that the following steps be taken to further the current work:

- Key hold times for the space character should not be considered, because users usually pause after pressing the space key to recollect what has to be typed next
- Adapt the system to the changing typing patterns of the users
- Implement a spell-check tool for users in the internet browser, which will discourage them from copying and pasting from other programs that do offer spell checking
- Utilize a database for all keystroke biometric system data
- Utilize web services and stored procedures for all components of the keystroke biometric system rather than executable files
- Develop an administrator interface allowing instructors to create sample and actual tests, as well as set timeframes for when they can be taken
- Following each course, merge individual keystroke biometric information into a master user table which tracks an individual’s academic career
- Develop the Keystroke Entry System to be cross-browser compatible
- Develop the system to either process results immediately or run the processing as a nightly job
- Develop a system (email and through the admin interface) to alert an instructor of suspicious results
- Revise the Keystroke Feature Extractor (KFE) program to increase PKBS performance for real-world application with a small number of samples
- Integrate the Keystroke Entry System with Blackboard for security and a seamless user experience

7. Conclusion

Keystroke biometrics is an inexpensive yet effective method of user identification and authentication. The Pace Keystroke Biometric System (PKBS), if developed further, could be particularly well suited to online student testing when embedded within a browser and customized for the institution using the system.

While the main concern currently is academic integrity during online testing, this sort of authentication could also be used when training and orientation examinations are administered in a business setting.

The results of our four experiments demonstrate that, to increase the authentication accuracy for students taking online exams with the revised Keystroke Entry System (KES), we need to gather more training samples, which can be collected before the exams so that the system is better trained. Moreover, revising the Keystroke Feature Extractor (KFE) program to generate more features and decrease the error rate would further prepare the system for real-world applications with a smaller number of samples, thus increasing the BAS performance rate.

8. Resources

[1] S. Janapala, S. Roy, J. John, L. Columbu, J. Carrozza, R. Zack, and C. Tappert, “Refactoring a Keystroke Biometric System”, paper b1, Proc. Student-Faculty Research Day, Seidenberg School of CSIS, Pace University, New York.

[2] S. Bharati, R. Haseem, R. Khan, M. Ritzmann, and A. Wong, “Biometric Authentication System using the Dichotomy Model”, paper c3, Proc. Student-Faculty Research Day, Seidenberg School of CSIS, Pace University, New York, May 2008.

[3] C. C. Tappert, S.-H. Cha, M. Villani, and R. S. Zack, “A Keystroke Biometric System for Long-Text Input”, Int. J. Info. Security and Privacy (IJISP), Vol 4, No 1, 2010, pp 32-

[4] R. S. Zack, C. C. Tappert, and S.-H. Cha, “Performance of a Long-Text-Input Keystroke Biometric Authentication System Using an Improved k-Nearest-Neighbor Classification Method”, Proc. IEEE 4th Int Conf Biometrics: Theory, Apps, and Systems (BTAS 2010), Washington, D.C., Sep 2010.

[5] A. C. Caicedo, K. Chan, D. A. Germosen, S. Indukuri, M. N. Malik, D. Tulasi, M. C. Wagner, R. S. Zack, and C. C. Tappert, “Keystroke Biometric: Data/Feature Experiments”, paper b5, Proc. Student-Faculty Research Day, Seidenberg School of CSIS, Pace University, New York, May 2010.