Reliability and Validity of Research Instruments

Edwin Kubai
UNICAF University, Zambia
Correspondence to kubaiedwin@yahoo.com
September 15, 2019

Introduction

This paper focuses on two terms as used in the field of educational research: reliability and validity. When conducting any educational study, the design and measurement of the research instruments is essential, especially for novice researchers. The data collection tools (research instruments) should be designed in such a way that they accurately measure the intended construct under investigation and ensure the meaningfulness of the study findings. This greatly enhances the credibility and trustworthiness of the research findings, especially if the study is repeated by different investigators under the same conditions or with different research instruments measuring the same construct.

Reliability and validity are two terms used in any investigation that novice researchers often find difficult to differentiate. They struggle to explain to their audience whether their research instruments meet the minimum thresholds for reliability and validity. It has been noted with concern that most novice researchers fail to clarify how reliability and validity were achieved in their studies, either because they lack sufficient knowledge of the concepts or because they fail to mention them at all in their research methodology. This paper attempts to clarify issues related to the reliability and validity of research instruments so as to strengthen the credibility and transferability of research findings. The next section defines the concepts of validity and reliability as used in designing research instruments.

Definition of Reliability and Validity

According to Drost (2011), reliability is "the extent to which measurements are repeatable when different people perform the measurement on different occasions, under different conditions, supposedly with alternative instruments which measure the construct or skill". It can also be defined as the degree to which the measure of a construct is consistent or dependable. For instance, when several people guess your weight, the values will be inconsistent and not necessarily close to the true value, so the measurement is unreliable. If instead a weighing scale is used by different people to measure your weight, the same value is likely to be obtained every time a measurement is taken, so the measurement is reliable.

Validity is "the extent to which a measure adequately represents the underlying construct that it is supposed to measure" (Drost, 2011). The term construct refers to the skill, knowledge, attribute or attitude that the researcher is investigating. For instance, if a researcher wanted to measure compassion, it would be vital to know whether the measure actually captures compassion rather than empathy, because the two terms are closely related. Some constructs under investigation are abstract and cannot be observed directly, so it is important to develop a scale that consistently and precisely measures the intended unobservable construct. Reliability and validity are psychometric properties of measurement scales that are very important in judging the adequacy and accuracy of scientific research procedures (Bajpai & Bajpai, 2014).
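To make the distinction concrete, the following is a minimal Python sketch of the weighing-scale example above. The true weight, the bias of the faulty scale, and the noise levels are invented purely for illustration: a reliable instrument shows a small spread across repeated readings, while a valid one centres on the true value.

```python
import numpy as np

rng = np.random.default_rng(42)
true_weight = 70.0  # the "true score" of the construct, in kilograms

# A reliable but invalid scale: readings are consistent (small spread)
# yet systematically biased 5 kg above the true weight.
biased_readings = true_weight + 5.0 + rng.normal(0, 0.1, size=10)

# An unreliable scale: readings scatter widely around the true weight,
# like several people guessing it.
noisy_readings = true_weight + rng.normal(0, 5.0, size=10)

print(f"biased scale: mean = {biased_readings.mean():.2f}, sd = {biased_readings.std():.2f}")
print(f"noisy scale:  mean = {noisy_readings.mean():.2f}, sd = {noisy_readings.std():.2f}")
```

The small standard deviation of the biased scale marks it as reliable, but its mean sits far from 70 kg, so it is not valid; the noisy scale is neither reliable nor valid.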
The next section discusses the types of reliability and how to use them in designing instruments for educational research.

Reliability

As defined in the previous section, reliability is the stability of measurement over the variety of conditions in which the results should be obtained (Nunnally, 1978). It is basically the repeatability or replicability of research findings: when a study is conducted by a researcher under certain conditions and the same study, done a second time, yields the same results, the data are said to be reliable. According to Drost (2011), the reliability of data from research instruments is affected by two kinds of error, namely random error and systematic error. Random error is attributed to a set of unknown and uncontrollable external factors that randomly influence some observations but not others. For example, respondents who happen to be in a good mood might respond more positively to constructs like self-esteem, happiness and satisfaction than respondents in a bad mood. Random error is seen as noise in measurement and is usually ignored. Systematic error is introduced by factors that systematically affect all observations of a construct across the entire sample; it is considered a bias in measurement and should be corrected to yield better estimates from the sample. The best way to estimate reliability is to measure the associations between tests, items and raters by calculating a reliability coefficient (Rosenthal & Rosnow, 1991). The following are the types of reliability.

Test-retest reliability

This is a measure of consistency between measurements of the same construct administered to the same sample at two different points in time (Drost, 2011). If the correlation between the two sets of scores is high, the observations have not changed substantially; the interval between administrations is therefore critical for this type of reliability.

Split-half reliability

Heale and Twycross (2015) defined split-half reliability as a measure of consistency between two halves of a construct measure. For example, if a researcher uses a ten-item measure of a construct, the items are divided into two halves, typically the set of odd-numbered items and the set of even-numbered items. Because all items measuring the construct are administered within the same time period, random error is minimized. The correlation between the two halves is then obtained to determine the reliability coefficient. A practical advantage of this method is that it is cheaper and more easily obtained than test-retest reliability, where the researcher has to administer the instrument again at a later time.
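As a sketch of how test-retest reliability can be estimated in practice, the following Python snippet correlates two administrations of the same instrument; the ten pairs of scores are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores of ten students on the same instrument,
# administered at time 1 and again at time 2.
time1 = np.array([12, 15, 11, 18, 14, 16, 10, 17, 13, 15])
time2 = np.array([13, 14, 12, 17, 15, 16, 11, 18, 12, 14])

r, p = pearsonr(time1, time2)  # test-retest reliability coefficient
print(f"test-retest reliability: r = {r:.2f} (p = {p:.3f})")
```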
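Split-half reliability can be computed along the same lines. In the sketch below the item responses are hypothetical, and the Spearman-Brown correction, which this paper does not discuss but which is standard practice, is applied because the half-test correlation understates the reliability of the full-length instrument.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical responses: six students x ten Likert items (1-5).
items = np.array([
    [4, 5, 4, 4, 5, 4, 5, 4, 4, 5],
    [2, 1, 2, 2, 1, 2, 1, 2, 2, 1],
    [3, 3, 4, 3, 3, 4, 3, 3, 3, 4],
    [5, 5, 5, 4, 5, 5, 5, 4, 5, 5],
    [1, 2, 1, 1, 2, 1, 2, 1, 1, 2],
    [4, 4, 3, 4, 4, 3, 4, 4, 4, 3],
])

# Split into odd- and even-numbered items and total each half per student.
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

r_half, _ = pearsonr(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction
print(f"half-test r = {r_half:.2f}, corrected reliability = {r_full:.2f}")
```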
Inter-rater reliability

This is also called inter-observer reliability or inter-rater agreement. It involves the rating of observations on a specific measure by different judges. The ratings are independent but happen at the same time. Reliability is obtained by correlating the scores of two or more raters on the same construct, or by measuring the degree of agreement between the raters' judgments. It is typically used when judges are rating or scoring a piece of artistic work or a music performance on stage. Their scores are correlated, or, if the variables are categorical, Cohen's kappa coefficient of inter-rater reliability is computed.

Internal consistency reliability

This is a measure of consistency between the different items of the same construct. It measures consistency within the instrument: how well a set of items measures a particular characteristic within the test. Individual items within a test are correlated with one another to estimate the reliability coefficient, and Cronbach's alpha coefficient is used to determine the internal consistency between items (Cronbach, 1951). An individual item of a test might have a small correlation with true scores, whereas a test with more items might have a higher correlation; for instance, a 5-item test might have a correlation of 0.40 while a 12-item test might have a correlation of 0.80. According to Cortina (1993), coefficient alpha is used to estimate reliability for item-specific variance in a one-dimensional test. If coefficient alpha is low, the test is too short or the items have little in common.
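For the inter-rater case described above, Cohen's kappa can be computed with scikit-learn's cohen_kappa_score; the two judges' categorical ratings below are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical categorical ratings of ten performances by two judges.
judge_a = ["good", "fair", "good", "poor", "fair", "good", "good", "poor", "fair", "good"]
judge_b = ["good", "fair", "fair", "poor", "fair", "good", "good", "poor", "good", "good"]

kappa = cohen_kappa_score(judge_a, judge_b)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```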
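Cronbach's alpha follows directly from the item and total-score variances: alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), for k items. The sketch below implements this standard formula on a hypothetical five-item scale.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical five-item scale answered by six respondents.
scores = np.array([
    [4, 5, 4, 4, 5],
    [2, 1, 2, 2, 1],
    [3, 3, 4, 3, 3],
    [5, 5, 5, 4, 5],
    [1, 2, 1, 1, 2],
    [4, 4, 3, 4, 4],
])
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```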
Validity

As defined earlier, validity is the extent to which an instrument measures what it purports to measure; it concerns the truthfulness of research findings (Zohrabi, 2013). For example, does an IQ test measure intelligence? Validity is assessed using both theoretical and empirical evidence. In theoretical assessment, the idea of a construct is translated into an operational measure, and a panel of experts, such as judges or university lecturers, rates the suitability of each item and evaluates its fit with the definition of the construct. In empirical assessment, validity is based on quantitative analysis involving statistical techniques. The following are the types of validity in educational research.

Construct validity

This refers to how well a concept, idea or behaviour, that is, a construct, is translated or transformed into a functioning, operating reality (Trochim, 2006). It is especially important where a cause-and-effect relationship is claimed, since construct validity justifies that the measured variables truly represent the constructs in that relationship. Construct validity is substantiated through the following types of validity: face validity, content validity, concurrent and predictive validity, and convergent and discriminant validity.

Face validity

This is where an indicator seems, "on its face", to be a reasonable measure of its underlying construct. It ascertains that the measure appears to assess the intended construct under investigation. For example, observing that an individual goes to church every Sunday may lead someone to conclude that the person is religious, which might not really be true. Face validity is often used by university lecturers when assessing research instruments designed by their students.

Content validity

This is an assessment of how well a set of scale items matches the relevant content domain of the construct it is trying to measure. According to Bollen (1989), as cited in Drost (2011), content validity is a qualitative type of validity in which the domain of the concept is made clear and the analyst judges whether the measures fully represent the domain (p. 185). The researcher should design a research instrument that adequately addresses the construct or area under investigation. For instance, if a researcher wants to investigate the implementation of a new curriculum, the research instrument or test items must adequately cover that domain for the findings to be valid. A group of judges or experts with content knowledge in the area under investigation can be used to assess this type of validity.

Convergent and Discriminant validity

These are assessed together, or jointly, for a set of measures. Convergent validity refers to the closeness with which a measure relates to the construct it is purported to measure; the measure converges with the construct. Discriminant validity refers to the degree to which a measure does not measure, and thus discriminates from, constructs it is not supposed to measure. Convergent validity is established by comparing the observed values of one indicator of a construct with other indicators of the same construct. Discriminant validity is established by demonstrating that indicators of one construct are dissimilar from indicators of other constructs. Statistical procedures such as bivariate correlation and exploratory factor analysis are used to analyse the items for convergent and discriminant validity.

Criterion-related validity

This is the degree of correspondence, by correlation, between a test measure and one or more external referents (criteria) (Mohajan, 2017). For instance, suppose students sat an examination and were then asked to report their scores; a correlation can be computed between their self-reported scores and the true scores in the teachers' records. Criterion-related validity takes the form of concurrent or predictive validity. Concurrent validity is where one measure relates to another concrete criterion that is presumed to occur simultaneously; it applies when the criterion exists at the same time as the measure. An example is students' performance scores on calculus and linear algebra, since both are mathematics tests. Predictive validity is where a measure successfully predicts a future outcome that it is theoretically expected to predict. A good example of predictive validity is the use of students' Continuous Assessment Test (CAT) scores to predict their performance in the final examination: the CAT scores can be correlated with the scores obtained in the final examination.
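Returning to convergent and discriminant validity, the bivariate-correlation check described above can be sketched as follows. The latent construct and its indicators are simulated, so the exact figures are illustrative and will vary with the data.

```python
import numpy as np

rng = np.random.default_rng(1)
latent_a = rng.normal(size=100)               # unobserved construct A
a1 = latent_a + rng.normal(0, 0.5, size=100)  # indicator 1 of A
a2 = latent_a + rng.normal(0, 0.5, size=100)  # indicator 2 of A
b1 = rng.normal(size=100)                     # indicator of an unrelated construct B

corr = np.corrcoef([a1, a2, b1])
print(f"a1-a2 (convergent, expect high):       {corr[0, 1]:.2f}")
print(f"a1-b1 (discriminant, expect near zero): {corr[0, 2]:.2f}")
```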
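The CAT example lends itself to the same correlation machinery; the marks below are hypothetical.

```python
from scipy.stats import pearsonr

# Hypothetical marks for eight students: Continuous Assessment Test (CAT)
# and the final examination it is expected to predict.
cat_scores = [22, 18, 25, 15, 20, 28, 12, 24]
final_scores = [65, 58, 72, 50, 60, 80, 45, 70]

r, p = pearsonr(cat_scores, final_scores)
print(f"predictive validity: r = {r:.2f} (p = {p:.4f})")
```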
Conclusion

This paper has examined the definitions of the terms reliability and validity as used in educational research. It is important for novice researchers to have sufficient knowledge of the concepts of reliability and validity when designing research instruments, in order to enhance the trustworthiness and generalizability of research findings. The types of reliability identified are test-retest reliability, split-half reliability, inter-rater reliability and internal consistency reliability. The function of reliability in research is to ensure that the observed score is close to the true score by minimizing errors of measurement. The following types of validity have been discussed: face validity, content validity, convergent and discriminant validity, and criterion-related validity. Validity requires that the research instrument be reliable, but an instrument can be reliable without being valid. The interpretation of the results of a test depends entirely on the underlying construct and the validity of the research findings.

References

Bajpai, S. R., & Bajpai, R. C. (2014). Goodness of measurement: Reliability and validity. International Journal of Medical Science and Public Health, 3(2), 112-115.

Bollen, K. A. (1989). Structural equations with latent variables (pp. 179-225). John Wiley & Sons.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.

Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78(1), 98-104.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.

Drost, E. A. (2011). Validity and reliability in social science research. Education Research and Perspectives, 38(1), 105-124.

Fiske, D. W. (1982). Convergent-discriminant validation in measurements and research strategies. In D. Brinberg & L. H. Kidder (Eds.), Forms of validity in research (pp. 77-93).

Heale, R., & Twycross, A. (2015). Validity and reliability in quantitative studies. Evidence-Based Nursing, 18(4), 66-67.

Mohajan, H. (2017). Two criteria for good measurements in research: Validity and reliability. Annals of Spiru Haret University, Economic Series, 17(4), 59-82.

Nunnally, J. C. (1978). Psychometric theory (pp. 86-113, 190-255). McGraw-Hill.

Rosenthal, R., & Rosnow, R. L. (1991). Essentials of behavioral research: Methods and data analysis (2nd ed., pp. 46-65). McGraw-Hill.

Trochim, W. M. K. (2006). Introduction to validity. Social Research Methods. Retrieved September 9, 2010, from www.socialresearchmethods.net/kb/introval.php

Zohrabi, M. (2013). Mixed method research: Instruments, validity, reliability and reporting findings. Theory and Practice in Language Studies, 3(2), 254-262.