2011 International Conference on Information and Electronics Engineering
IPCSIT vol.6 (2011) © (2011) IACSIT Press, Singapore

Psychological Elements for Data Collection: Responsibility in Data Accuracy

Ning Fu
Assumption University of Thailand, Ramkhamhaeng 24 Road, Bangkok 10240, Thailand
funing65510@163.com

Abstract. In the information age, data quality issues occur in many areas, such as economic data, scanning data, financial services, billing, the medical profession, and even the publication of lottery numbers. The belief that “if it’s in the computer, it must be right” is widely accepted, yet it is plainly wrong. Many articles describe methodologies for improving data quality, including inspecting or revising data after it has been entered [2, 4, 7, 10, 11, 17]. However, the stage at which data is obtained has so far received little attention. This paper assesses how psychological selective attention, color perception, and cue action assist in acquiring data. It also reports the discrepancies in mistakes between cultures in particular fields. Finally, experimental results of the proposed approach show the effectiveness of psychological elements in helping data collectors acquire data accurately.

Keywords: data quality, data collectors, selective attention, color perception, cue action.

1. Introduction

Lou Gerstner, former CEO of IBM, once said, “inside IBM, we talk about 10 times more connected people, 100 times more network speed, 1,000 times more devices and a million times more data”. “GIGO (Garbage In, Garbage Out)” is a technical term in computer science: if you input garbage into a computer, its output will also be garbage. Strong, Lee, and Wang indicated that 50% to 80% of computerized criminal records in the USA were found to be inaccurate, incomplete, or ambiguous [3]. Redman emphasized that poor data quality costs the typical company at least 10% of revenue.
He added, however, that 20% was probably a better estimate. He also cited a survey conducted by The Data Warehousing Institute, which estimated that poor customer data costs $611 billion a year in the USA [16]. It was reported that wrong price data in retail databases cost American consumers $2.5 billion annually [8]. About 75% of organizations have identified costs originating from defective data [12].

The life cycle of data quality can be divided into three phases: past, present, and future. The past covers tables, data domains, and so on in the database. The present covers obtaining the data. The future covers data cleansing, data auditing, and so on. The further data moves along the processing chain, the higher the cost of detecting and correcting errors.

Currently, a great deal of scientific work on improving data quality falls into two main categories. One group of scientists has focused on mathematical and statistical models, working at the database layer. Another group has focused on managing the process of data generation. Most data comes into an organization from a “supplier”, and it is much easier to develop good data collection practices than to correct errors downstream [1].

Not long ago, I worked on a project in which one of the forms had a field called Lionstic Year and another field called Start Date. Logically, the two fields are related: if the value of Lionstic Year is “2009-2010”, the value of Start Date must fall between “MM/DD/2009” and “MM/DD/2010”. Programming can help to avoid such errors. Technically speaking, however, this kind of problem cannot be resolved by 10 or 20 lines of code; the understanding of the data collectors may be the key. In another case, I received an unofficial transcript of my Master’s Degree from the office of the registrar. When I checked the document, I found that the graduation year of my Bachelor’s Degree was incorrect. What could have caused the error?
The most likely cause is the data entry clerk who entered the data from my application form.

In 2010, I studied the recruiting information for data collectors in the USA, using systematic analysis of around 500 examples. The analysis indicates that the responsibility for acquiring data accurately rests with the data collector at the point where the data is obtained. It is the collectors’ responsibility to make sure that:
• The data is recorded and entered correctly.
• They understand the relationship between fields.
• They have knowledge or experience in the related field.
• They are able to identify and correct errors.

Hence, this paper aims to verify the suitability of psychological elements for assisting data collectors in acquiring data accurately. In fact, one of the goals of psychology is to control an organism’s behavior [14]. The more precisely we specify the stimulus that evokes a certain reaction, the better we understand the behavior itself.

2. Responsibility in Data Collection

2.1. Data Quality

Juran defines data quality as “data to be of high quality if they are fit for their intended uses in operations, decision making and planning” [6]. A few decades later, Redman reinforced this definition as “getting the right and correct data in the right place at the right time to complete the task at hand” [16]. Data quality dimensions must be specified and monitored in accordance with user specifications. The users define what is high or low quality. Characteristics of “ideal” quality dimensions, such as accuracy, appreciation, comprehensiveness, consistent representation, and interpretability, are based on the users’ perspective [9, 15].

2.2. Selective Attention

Redman indicates that “good data quality satisfies the criteria that they should focus on the most important data” [15]. He defines the “most important” as “data are most critical to the enterprise’s business strategies”.
Selective attention allows us to direct our attention to the most important aspect of the environment at any one time [13]. Moreover, when we encounter information that supports our views, we tend to give it our full attention. It is the process whereby the brain sorts the messages from the senses and then attends only to the important ones. In practice, only a few particularly important fields need close focus while data is collected, so directing selective attention to those fields is one way to decrease errors.

2.3. Color Perception

Incoming light waves end up on the retina. The retina contains special light-sensitive cells called rods and cones. Rods enable us to see in dim light. Cones enable us not only to see in color but also to see things in fine detail. However, cones function better in bright light, and their function diminishes as the light dims. Human behavior is related to particular colors because visual inputs stimulate people to act and respond, and people pay attention to the stimulus. Most information systems adopt black and white as the central colors of their forms. One way to decrease errors is therefore to use a color other than black and white in the fields that are more likely to produce errors.

2.4. Cue Action

To cue is to give somebody a signal so that they know when to start doing something. In a database, each attribute in a table holds a specific data type or data domain [5]. Date formats, for example, differ between cultures. Take the date 12/11/2010. A reader in Thailand understands it as Year: 2010, Month: 11, Day: 12. A reader in a native English speaking country, conversely, understands it as Year: 2010, Month: 12, Day: 11.

Country                    Date Format
China                      YYYY/MM/DD
Thailand                   DD/MM/YYYY
Native English speaking    MM/DD/YYYY

Table 1: Different date formats in different countries

Therefore, adopting cues in fields that are ambiguous or prone to errors is one way to avoid errors.
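
As a hedged illustration of why a format cue matters, the short Python sketch below parses the example string 12/11/2010 under each date format in Table 1. The `FORMATS` mapping and the `interpret` helper are names assumed for this illustration only; they are not part of the forms used in the experiment.

```python
from datetime import date, datetime
from typing import Dict, Optional

# Date formats from Table 1, written as strptime patterns.
# The dictionary name and its keys are assumptions made for this sketch.
FORMATS = {
    "China": "%Y/%m/%d",                    # YYYY/MM/DD
    "Thailand": "%d/%m/%Y",                 # DD/MM/YYYY
    "Native English speaking": "%m/%d/%Y",  # MM/DD/YYYY
}


def interpret(text: str) -> Dict[str, Optional[date]]:
    """Parse `text` under each format; None means the format rejects it."""
    readings: Dict[str, Optional[date]] = {}
    for country, pattern in FORMATS.items():
        try:
            readings[country] = datetime.strptime(text, pattern).date()
        except ValueError:
            readings[country] = None
    return readings


if __name__ == "__main__":
    for country, reading in interpret("12/11/2010").items():
        print(f"{country}: {reading}")
```

Two of the three formats accept the same keystrokes and yield different valid dates, so a validity check on its own cannot catch the mistake. A visible cue that states the expected format is what tells the data collector which reading is intended.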
A cue reminds the user to pay attention to the data type or data domain. Several rules apply when setting cues [18]:
• Avoid complexity. Use simple, conversational language. Complex language must not be used in cues. Words used in cues should be easily understood by the data collector.
• Avoid ambiguity. Be as specific as possible.
• Avoid burdensome questions. Burdensome questions may tax the data collector’s memory. A simple fact of human life is that people forget.

3. Experimental Implementation

In this paper, psychological elements are combined rationally in order to help data collectors obtain data accurately. The participants in the experiment are 6 Chinese, 6 Thai, and 6 native English speakers, separated into three groups. Every participant inputs 200 records each day, and the task lasts 10 days. Group 1 participants use the form without psychological elements. Group 2 participants use the form with psychological elements. Group 3 participants use the form with psychological elements except color perception. Errors in member name and registration date are counted. The total time that each participant spends inputting data each day is recorded.

Figure 1: Forms used for collecting data

4. Results and Analysis

The following figures show the error ratio in member name and registration date, respectively, computed by equation 1. C-1 is one of the Chinese participants, T-1 is one of the Thai participants, and W-1 is one of the native English speaking participants.
$$\text{Error Ratio} = \frac{\text{quantity of errors in field}}{200} \times 100 \qquad (1)$$

where 200 is the number of records that each participant inputs each day.

Figure 2: Error ratio in Member Name, Group 2 compared with Group 1

Figure 3: Error ratio in Member Name, Group 2 compared with Group 3

Figure 4: Error ratio in Registration Date, Group 2 compared with Group 1

Figure 5: Error ratio in Registration Date, Group 2 compared with Group 3

The following figure shows the total time used each day.

Figure 6: The total time used in Group 1, Group 2 and Group 3

5. Conclusions

In member name, most errors are letters that were not typed in block capitals. With the assistance of psychological elements, the error ratio in member name decreases. In registration date, errors arise from discrepancies between cultures that use different date formats. After psychological elements are applied to the registration date field, data collectors make fewer errors. Moreover, the form with color perception produces fewer errors than the form without it. In addition, several less important fields are blank in most documents. The total time used decreases after selective attention is applied to the important fields. Likewise, the total time used with the form with color perception is less than with the form without it.

Overall, the psychological elements approach helps data collectors decrease errors in fields effectively and reduces the total time spent inputting data. Furthermore, the color perception element is the key factor in decreasing errors and reducing the total time used.

6. Further Research

A drawback of this paper is that psychological color perception has not been studied exhaustively. In fact, different colors carry different meanings in different countries. For example, perfect yellow is used to represent joy and honor, and green can represent happiness in Britain.
In the USA, black is a serious color and communicates sophistication. It also has high-tech associations. Red is a provocative color that stands for warning. Blue can suggest superiority. In China, red represents joy and happiness. Yellow is emblematic: tradition says that if clouds are yellow, prosperity will follow. In the future, I plan to research the influence of different colors on typing data, and then verify the feasibility of adopting specific colors to decrease errors for data collectors of different nationalities.

7. Acknowledgements

I would like to express my most sincere appreciation and thanks to my advisor, Asst. Prof. Dr. Chanintorn Jittawiriyanukoon, for his invaluable advice. I am also grateful to Dr. Rapeepat Techakittiroj and Dr. Chittipa Ngamkroeckjoti of Assumption University of Thailand, and Assoc. Prof. Li Wei of Linyi University of China. Special thanks to my parents for their encouragement and support.

8. References

[1] A. D. Chapman. Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen, 2005.
[2] B. Carlo, B. Daniele, C. Federico, G. Simone. A Data Quality Methodology for Heterogeneous Data. International Journal of Database Management Systems. February 2011 Vol. 3, No. 1, pp. 60-79.
[3] D. M. Strong, Y. W. Lee, R. Y. Wang. Data Quality in Context. Communications of the ACM. May 1997 Vol. 40, No. 5, pp. 103-110.
[4] F. G. Alizamini, M. M. Pedram, M. Alishahi, K. Badie. Data Quality Improvement using Fuzzy Association Rules. 2010 International Conference on Electronics and Information Engineering. 2010 Vol. 1, pp. 468-472.
[5] G. V. Post. Database Management Systems: Designing and Building Business Applications (2nd ed.). USA: McGraw-Hill, 2002.
[6] J. M. Juran. Managerial Breakthrough. USA: McGraw-Hill, 1964.
[7] J. Zhang, Q. Y. Wen, H. Zhang. The Research in Improving the Quality of DW Data: The Job-Scheduling and Checking Based Program in Upgrading DW Performance.
Proceedings of the 5th International Conference on Wireless Communications, Networking and Mobile Computing. Beijing, China, September 2009.
[8] L. English. Information Quality Management: The Next Frontier. Information Management Magazine, April 2000. Retrieved from http://www.information-management.com/issues/20000401/2073-1.html
[9] L. L. Pipino, Y. W. Lee, R. Y. Wang. Data Quality Assessment. Communications of the ACM. April 2002 Vol. 45, No. 4, pp. 211-218.
[10] M. Arlitt, K. Farkas, S. Lyer, P. Kumaresan, S. Rafaeli. Systematically Improving the Quality of IT Utilization Data. ACM SIGMETRICS Performance Evaluation Review. Communications of the ACM. March 2010 Vol. 37, No. 4, pp. 42-49.
[11] M. Yakout. Guided Data Quality Improvement Through Direct/Indirect Interactions. VLDB 2010 PhD Workshop, Singapore. September 13, 2010, pp. 24-29.
[12] R. Marsh. Drowning in Dirty Data? It’s Time to Sink or Swim: A Four-Stage Methodology for Total Data Quality Management. Database Marketing & Customer Strategy Management. 2005, 12 (2): 105-112.
[13] R. P. Philipchalk, J. V. McConnell. Understanding Human Behavior (8th ed.). USA: Holt, Rinehart and Winston Inc., 1994.
[14] R. Plotnik, H. Kouyoumdjian. Introduction to Psychology (8th ed.). Canada: Thomson Learning Inc., 2008.
[15] T. C. Redman. Data Quality for the Information Age. USA: Artech House Inc., 1996.
[16] T. C. Redman. Data: An Unfolding Quality Disaster. Information Management Magazine, August 2004. Retrieved from http://www.information-management.com/issues/20040801/1007211-1.html
[17] T. N. Manjunath, S. H. Ravindra, G. K. Ravikumar. Analysis of Data Quality Aspects in Data Warehouse Systems. In: T. N. Manjunath et al. International Journal of Computer Science and Information Technologies. 2010 Vol. 2 (1), pp. 477-485.
[18] W. G. Zikmund. Business Research Methods (6th ed.). USA: South-Western, Thomson Learning Inc., 2000.