Psychological Elements for Data Collection:Responsibility in Data Accuracy Ning Fu

2011 International Conference on Information and Electronics Engineering
IPCSIT vol.6 (2011) © (2011) IACSIT Press, Singapore
Psychological Elements for Data Collection: Responsibility in Data Accuracy
Ning Fu
Assumption University of Thailand, Ramkhamhaeng 24 Road, Bangkok 10240, Thailand
funing65510@163.com
Abstract. In the information age, data quality issues occur in many areas, such as economic data, scanning
data, financial services, billing, the medical profession, and even the publication of lottery numbers. The belief
that “if it’s in the computer, it must be right” is widely accepted, yet it is plainly wrong. Many articles describe
methodologies for improving data quality, including inspecting or revising data after it has been entered
[2, 4, 7, 10, 11, 17]. However, the stage at which data is obtained has received little attention. This paper
assesses how psychological selective attention, color perception, and cue action assist in acquiring data. It also
reports how mistakes in particular fields differ between cultures. Finally, experimental results show the
effectiveness of psychological elements in helping data collectors to acquire data accurately.
Keywords: data quality, data collectors, selective attention, color perception, cue action.
1. Introduction
Lou Gerstner, former CEO of IBM, once said, “inside IBM, we talk about 10 times more connected
people, 100 times more network speed, 1,000 times more devices and a million times more data”. “GIGO
(Garbage In, Garbage Out)” is a technical term in computer science. It means that if you input garbage into a
computer, its output will also be garbage. Strong, Lee, and Wang indicated that 50% to 80% of computerized
criminal records in the USA were found to be inaccurate, incomplete, or ambiguous [3]. Redman emphasized
that poor data quality costs the typical company at least 10% of revenue, and that 20% is probably a better
estimate. He also cited a survey conducted by The Data Warehousing Institute, which estimated that in the USA
$611 billion a year was lost as a result of poor customer data [16]. It was reported that wrong price data in
retail databases cost American consumers $2.5 billion annually [8]. About 75% of organizations have
identified costs originating from defective data [12].
The life cycle of data quality can be divided into three phases: past, present, and future. The past covers
the database design, such as tables and data domains. The present covers obtaining the data. The future covers
data cleansing, data audits, and so on. The further data travels along the processing chain, the higher the
cost of detecting and correcting errors. To date, researchers have done a great deal of work on improving data
quality in two main categories. One group has focused on mathematical and statistical models and works at the
database layer. Another group has focused on managing the process of data generation. Most data comes into
an organization from a “supplier”, and it is much easier to develop good data collection practices than to
correct errors downstream [1].
Not long ago, I worked on a project in which one form had a field called Lionstic Year and another field
called Start Date. Logically, the two fields are related: if the value of Lionstic Year is “2009-2010”, the value
of Start Date must fall between “MM/DD/2009” and “MM/DD/2010”. Programming can help to avoid such
errors. However, technically speaking, this kind of problem cannot be resolved by 10 or 20 lines of code, and
the understanding of the data collectors could be the key. In addition, I received an unofficial transcript of my
Master’s Degree from the office of the registrar. When I checked the document, I found that the graduation
year of my Bachelor’s Degree was incorrect. What could have caused the error? The most likely cause is the
data entry clerk who entered the data from my application form. In 2010, I studied recruiting information for
data collectors in the USA, applying systematic analysis to around 500 examples. The analysis indicates that
the responsibility for accurately acquiring data rests with the collector at the point where the data is obtained.
It is the collector’s responsibility to make sure that:
• The data is recorded and entered correctly.
• They understand the relationship between fields.
• They have knowledge or experience in the related field.
• They are able to identify and correct errors.
Hence, this paper aims to verify the suitability of psychological elements for assisting data collectors in
acquiring data accurately. In fact, one of the goals of psychology is to control an organism’s behavior [14].
The more precisely we can specify the stimulus that evokes a certain reaction, the better we understand the
behavior itself.
2. Responsibility in Data Collection
2.1. Data Quality
Juran considers data “to be of high quality if they are fit for their intended uses in
operations, decision making and planning” [6]. A few decades later, Redman refined this definition as
“getting the right and correct data in the right place at the right time to complete the task at hand” [16].
Data quality dimensions must be specified and monitored in accordance with user specifications. The users
define what is high or low quality. Characteristics of “ideal” quality dimensions are based on the users’
perspective [9, 15], such as accuracy, appreciation, comprehensiveness, consistent representation, and
interpretability.
2.2. Selective Attention
Redman indicates that “good data quality satisfies the criteria that they should focus on the most important
data” [15]. He defines the “most important” data as those that “are most critical to the enterprise’s business
strategies”. Selective attention allows us to direct our attention to the most important aspect of the
environment at any one time [13]. Moreover, when we encounter information that supports our views, we
tend to give it our full attention. Selective attention is the process whereby the brain sorts the messages from
the senses and attends only to the important ones. In practice, data collectors can focus on only a few
particularly important fields when collecting data, so directing selective attention to those fields is one way to
decrease errors.
2.3. Color Perception
Incoming light waves end up on the retina. The retina contains special light-sensitive cells called rods
and cones. Rods enable us to see in dim light. Cones enable us not only to see in color but also to see things
in fine detail. However, cones function better in bright light and diminish in function as the light dims.
Human behavior is related to particular colors because visual inputs stimulate people to act and respond.
People also pay attention to the stimulus. Most information systems adopt black and white as the central
colors of forms. Adopting a color other than black and white for fields that are more likely to produce errors
is therefore one way to decrease errors.
2.4. Cue Action
To cue is to give somebody a signal so that they know when to start doing something. In a database, each
attribute in a table holds a specific data type or data domain [5]. Date formats, for example, differ
between cultures. Consider the date 12/11/2010. A reader in Thailand understands it as Year: 2010,
Month: 11, Day: 12. A reader in a native English speaking country understands it differently, as
Year: 2010, Month: 12, Day: 11.
Country                    Date Format
China                      YYYY/MM/DD
Thailand                   DD/MM/YYYY
Native English speaking    MM/DD/YYYY
Table 1: Different date formats in different countries
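The ambiguity described above can be reproduced with a short sketch. It assumes Python's standard `datetime.strptime` and maps each convention in Table 1 to a format string; the dictionary labels and the function name `interpret` are chosen for this illustration only and do not come from the experiment.

```python
from datetime import datetime

# Format strings for the three conventions in Table 1.
# The labels are illustrative names, not part of the original forms.
FORMATS = {
    "China (YYYY/MM/DD)": "%Y/%m/%d",
    "Thailand (DD/MM/YYYY)": "%d/%m/%Y",
    "Native English speaking (MM/DD/YYYY)": "%m/%d/%Y",
}


def interpret(raw):
    """Return the date read under each convention, or None if it does not fit."""
    results = {}
    for label, fmt in FORMATS.items():
        try:
            results[label] = datetime.strptime(raw, fmt).date()
        except ValueError:
            results[label] = None  # the string does not fit this convention
    return results


readings = interpret("12/11/2010")
for label, value in readings.items():
    print(label, "->", value)
```

The same string is accepted by two of the three conventions and yields two different calendar dates, so a format check alone cannot detect that the collector meant the other reading.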
Therefore, adopting cues in fields that are ambiguous or prone to errors is one way to avoid errors. A cue
reminds the user to pay attention to the data type or data domain. Several rules for setting cues are listed
below [18]:
• Avoid complexity. Use simple, conversational language. Complex language must not be used as cues.
Words used in cues should be easily understood by the data collector.
• Avoid ambiguity. Be as specific as possible.
• Avoid burdensome questions. Burdensome questions may tax the data collector’s memory. A simple
fact of human life is that people forget.
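As a rough sketch of how these rules might translate into a form, the example below attaches a short, specific cue to a date field and checks an entry against the format the cue states. The cue wording, the `CUES` table, and the function names are hypothetical and are not taken from the forms used in this paper.

```python
from datetime import datetime

# Hypothetical cue definitions: each pairs a short, specific hint with the
# strptime pattern the field expects. Day 31 is used in the examples so the
# hint itself cannot be misread as another format.
CUES = {
    "DD/MM/YYYY": ("Day/Month/Year, e.g. 31/12/2010", "%d/%m/%Y"),
    "MM/DD/YYYY": ("Month/Day/Year, e.g. 12/31/2010", "%m/%d/%Y"),
    "YYYY/MM/DD": ("Year/Month/Day, e.g. 2010/12/31", "%Y/%m/%d"),
}


def label_with_cue(field_name, date_format):
    """Build the field label shown to the data collector, including the cue."""
    hint, _ = CUES[date_format]
    return f"{field_name} ({hint})"


def matches_cue(value, date_format):
    """Return True if the entered value fits the format stated by the cue."""
    _, pattern = CUES[date_format]
    try:
        datetime.strptime(value, pattern)
        return True
    except ValueError:
        return False


print(label_with_cue("Registration Date", "DD/MM/YYYY"))
print(matches_cue("31/12/2010", "DD/MM/YYYY"))
print(matches_cue("12/31/2010", "DD/MM/YYYY"))
```

The cue keeps the wording simple and specific and gives the collector nothing to memorize, in line with the three rules. The check catches only entries that cannot fit the stated format, so the cue still carries most of the burden for ambiguous dates.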
3. Experimental Implementation
In this paper, psychological elements are combined in order to assist data collectors in obtaining data
accurately. The participants in the experiment are 6 Chinese, 6 Thai, and 6 native English speakers, separated
into three groups. Every participant inputs 200 records each day, and the task lasts 10 days. Group 1 uses the
form without psychological elements. Group 2 uses the form with psychological elements, and group 3 uses the
form with psychological elements except color perception. Errors in member name and registration date are
counted. The total time that each participant spends inputting data each day is recorded.
Figure 1: Forms used for collecting data
Figure 1: Forms are used for collecting data
4. Results and Analysis
The following figures show the error ratio, computed by equation 1, in member name and
registration date, respectively. C-1 is one of the Chinese participants, T-1 is one of the Thai participants,
and W-1 is one of the native English speaking participants.
\[
\text{Error Ratio} = \frac{\text{the quantity of errors in field}}{200} \times 100 \qquad (1)
\]
where 200 is the number of records that each participant inputs each day.
Figure 2: Error ratio in Member Name, Group 2 compared with Group 1
Figure 3: Error ratio in Member Name, Group 2 compared with Group 3
Figure 4: Error ratio in Registration Date, Group 2 compared with Group 1
Figure 5: Error ratio in Registration Date, Group 2 compared with Group 3
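For readers who want to recompute the error ratio of equation 1, a minimal sketch follows. The function name `error_ratio` and the sample error counts are hypothetical; only the constant 200 and the formula come from this paper.

```python
RECORDS_PER_DAY = 200  # records each participant inputs per day (from the experiment)


def error_ratio(errors_in_field, records=RECORDS_PER_DAY):
    """Error ratio in percent: errors in one field per `records` entries."""
    if records <= 0:
        raise ValueError("records must be positive")
    if errors_in_field < 0:
        raise ValueError("errors_in_field must not be negative")
    # Multiply before dividing to keep the arithmetic exact for whole counts.
    return errors_in_field * 100 / records


# Hypothetical daily error counts, for illustration only.
for errors in (0, 5, 12):
    print(errors, "errors ->", error_ratio(errors), "%")
```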
The following figure shows the total time used each day.
Figure 6: The total time used in Group 1, Group 2 and Group 3
5. Conclusions
Most errors in member name are letters that are not typed in block capitals. With the assistance
of psychological elements, the error ratio in member name decreases. In registration date, errors arise
because cultures use different date formats. After psychological elements are applied to the registration date
field, data collectors make fewer errors. Moreover, the form with color perception produces fewer errors than
the form without color perception. In addition, several less important fields are blank in most documents.
The total time used decreases after selective attention is applied to the important fields. Likewise, the total
time used with the form with color perception is less than with the form without color perception.
Overall, psychological elements help data collectors to decrease errors in fields effectively and to reduce
the total time used for inputting data. Furthermore, color perception is the key factor in decreasing errors
and reducing the total time used.
6. Further Research
A drawback of this paper is that psychological color perception has not been studied exhaustively. In
fact, colors carry different meanings in different countries. For example, in Britain, perfect yellow is used
to represent joy and honor, and green can represent happiness. In the USA, black is a serious color
that communicates sophistication and has high-tech connections. Red is a provocative color that stands for
warning. Blue can suggest superiority. In China, red represents joy and happiness. Yellow is emblematic:
tradition says that if clouds are yellow, prosperity will follow. In the future, I plan to research the
influence of different colors on typing data, and then to verify the feasibility of adopting specific colors to
decrease errors for data collectors of different nationalities.
7. Acknowledgements
I would like to express my most sincere appreciation and thanks to my advisor, Asst. Prof. Dr.
Chanintorn Jittawiriyanukoon, for his invaluable advice. I am also grateful to Dr. Rapeepat Techakittiroj and
Dr. Chittipa Ngamkroeckjoti of Assumption University of Thailand, and Assoc. Prof. Li Wei of Linyi
University of China. Special thanks to my parents for their encouragement and support.
8. References
[1] A. D. Chapman. 2005. Principle of Data Quality, version 1.0. Report for the Global Biodiversity Information
Facility, Copenhagen.
[2] B. Carlo, B. Daniele, C. Federico, G. Simone. A Data Quality Methodology for Heterogeneous Data. International
Journal of Database Management Systems. February 2011 Vol. 3, No. 1, pp. 60-79.
[3] D. M. Strong, Y. W. Lee, R. Y. Wang. Data Quality in Context. Communications of the ACM. May 1997 Vol. 40,
No. 5, pp. 103-110.
[4] F. G. Alizamini, M. M. Pedram, M. Alishahi, K. Badie. Data Quality Improvement using Fuzzy Association Rules.
2010 International Conference on Electronics and Information Engineering. 2010 Vol. 1, pp. 468-472.
[5] G. V. Post. Database Management Systems: Designing and Building Business Applications (2nd ed.). USA:
McGraw-Hill, 2002.
[6] J. M. Juran. Managerial Breakthrough. USA: McGraw-Hill, 1964.
[7] J. Zhang, Q. Y. Wen, H. Zhang. The Research in Improving the Quality of DW Data: The Job-Scheduling and
Checking Based Program in Upgrading DW Performance. Proceedings of the 5th International Conference on
Wireless Communications, Networking and Mobile Computing. Beijing, China, September 2009.
[8] L. English. Information Quality Management: The Next Frontier. Information Management Magazine, April 2000.
Retrieved from http://www.information-management.com/issues/20000401/2073-1.html
[9] L. L. Pipino, Y. W. Lee, R. Y. Wang. Data Quality Assessment. Communications of the ACM. April 2002 Vol. 45,
No. 4, pp. 211-218.
[10] M. Arlitt, K. Farkas, S. Lyer, P. Kumaresan, S. Rafaeli. Systematically Improving the Quality of IT Utilization
Data. ACM SIGMETRICS Performance Evaluation Review. Communications of the ACM. March 2010 Vol. 37,
No. 4, pp. 42-49.
[11] M. Yakout. Guided Data Quality Improvement Through Direct/Indirect Interactions. VLDB 2010 PhD Workshop,
Singapore. September 13, 2010, pp. 24-29.
[12] R. Marsh. Drowning in Dirty Data? It’s Time to Sink or Swim: A Four-Stage Methodology for Total Data Quality
Management. Database Marketing & Customer Strategy Management. 2005, 12 (2): 105-112.
[13] R. P. Philipchalk, and J. V. McConnell. Understanding Human Behavior (8th ed.). USA: Holt, Rinehart and
Winston Inc., 1994.
[14] R. Plotnik, and H. Kouyoumdjian. Introduction to Psychology (8th ed.). Canada: Thomson Learning Inc., 2008.
[15] T. C. Redman. Data Quality for the Information Age. USA: Artech House Inc., 1996.
[16] T. C. Redman. Data: An Unfolding Quality Disaster. Information Management Magazine, August 2004. Retrieved
from http://www.information-management.com/issues/20040801/1007211-1.html
[17] T. N. Manjunath, S. H. Ravindra, G. K. Ravikumar. Analysis of Data Quality Aspects in Data Warehouse Systems.
In: T. N. Manjunath et al. International Journal of Computer Science and Information Technologies. 2010 Vol. 2
(1), pp. 477-485.
[18] W. G. Zikmund. Business Research Methods (6th ed.). USA: South-Western, Thomson Learning Inc., 2000.