PSM106
CHAPTER 3: A STATISTICS REFRESHER

Scales of Measurement
Measurement – the act of assigning numbers or symbols to characteristics of things (people, events, whatever) according to rules. The rules used in assigning numbers are guidelines for representing the magnitude (or some other characteristic) of the object being measured.
Scale – a set of numbers (or other symbols) whose properties model empirical properties of the objects to which the numbers are assigned. There are various ways in which a scale can be categorized.
Continuous scale – a scale used to measure a continuous variable.
Discrete scale – a scale used to measure a discrete variable.
Error – refers to the collective influence of all of the factors on a test score or measurement beyond those specifically measured by the test or measurement.
Nominal scales – the simplest form of measurement. These scales involve classification or categorization based on one or more distinguishing characteristics, where all things measured must be placed into mutually exclusive and exhaustive categories.
Ordinal scales – permit classification; in addition, rank ordering on some characteristic is also permissible with ordinal scales.
Interval scales – contain equal intervals between numbers; each unit on the scale is exactly equal to any other unit on the scale. Like ordinal scales, interval scales contain no absolute zero point. With interval scales, we have reached a level of measurement at which it is possible to average a set of measurements and obtain a meaningful result.
Ratio scale – has a true zero point. All mathematical operations can meaningfully be performed because there exist equal intervals between the numbers on the scale as well as a true or absolute zero point.

Describing Data
Distribution – a set of test scores arrayed for recording or study.
Raw score – a straightforward, unmodified accounting of performance that is usually numerical.
Frequency distribution – all scores are listed alongside the number of times each score occurred. A frequency distribution is referred to as a simple frequency distribution to indicate that individual scores have been used and the data have not been grouped.
Grouped frequency distribution – a frequency distribution used to summarize data, in which class intervals replace individual scores.
Range – equal to the difference between the highest and the lowest scores.
Bar graph – numbers indicative of frequency appear on the Y-axis, and reference to some categorization appears on the X-axis.
Frequency polygon – data are expressed by a continuous line connecting the points where test scores or class intervals (as indicated on the X-axis) meet frequencies (as indicated on the Y-axis).
Quartiles – the dividing points between the four quarters in the distribution.
Measure of central tendency – a statistic that indicates the average or midmost score between the extreme scores in a distribution.
Arithmetic mean (mean) – referred to in everyday language as the “average.”
Median – the middle score in a distribution; another commonly used measure of central tendency.
Mode – the most frequently occurring score in a distribution of scores.
Bimodal distribution – a distribution in which two scores (e.g., 51 and 66) occur with the highest frequency.
Measures of variability – statistics that describe the amount of variation in a distribution.
Variability – an indication of how scores in a distribution are scattered or dispersed.
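A quick worked example may help tie the central tendency and variability terms together. The Python sketch below uses a purely hypothetical set of ten raw scores; the particular numbers are an assumption for illustration only.

```python
# A minimal sketch with invented raw scores: measures of central tendency
# (mean, median, mode) and the simplest measure of variability (range).
from statistics import mean, median, mode

scores = [42, 47, 51, 51, 55, 60, 66, 66, 66, 72]  # hypothetical raw scores

print("Mean:", mean(scores))                 # arithmetic average: 57.6
print("Median:", median(scores))             # middle score: 57.5
print("Mode:", mode(scores))                 # most frequent score: 66
print("Range:", max(scores) - min(scores))   # highest minus lowest: 30
```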
Interquartile range – a measure of variability equal to the difference between Q3 and Q1.
Semi-interquartile range – equal to the interquartile range divided by 2.
Average deviation – another tool that can be used to describe the amount of variability in a distribution.
Standard deviation – a measure of variability equal to the square root of the average squared deviations about the mean.
Variance – equal to the arithmetic mean of the squares of the differences between the scores in a distribution and their mean.
Skewness – the nature and extent to which symmetry is absent.
Positive skew – relatively few of the scores fall at the high end of the distribution.
Negative skew – relatively few of the scores fall at the low end of the distribution.
Kurtosis – refers to the steepness of a distribution in its center.
Platykurtic – relatively flat.
Leptokurtic – relatively peaked.
Mesokurtic – somewhere in the middle.
Normal curve – a bell-shaped, smooth, mathematically defined curve that is highest at its center.
Correlation – an expression of the degree and direction of correspondence between two things.
Coefficient of correlation (or correlation coefficient) – a number that provides us with an index of the strength of the relationship between two things.
Standard score – a raw score that has been converted from one scale to another scale, where the latter scale has some arbitrarily set mean and standard deviation.
z score – results from the conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution.
T scores – can be called a fifty plus or minus ten scale.
Stanine – a term that was a contraction of the words standard and nine.
Linear transformation – one that retains a direct numerical relationship to the original raw score.
Nonlinear transformation – may be required when the data under consideration are not normally distributed yet comparisons with normal distributions need to be made.
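Because z scores and T scores are linear transformations, they can be computed directly from the mean and standard deviation, as in the sketch below (same hypothetical scores as before). The stanine line uses a common approximation, round(2z + 5) clipped to the range 1 through 9, which is an added assumption rather than something stated in these notes.

```python
# A minimal sketch: converting a raw score to z, T, and (approximately) stanine.
from statistics import mean, pstdev

scores = [42, 47, 51, 51, 55, 60, 66, 66, 66, 72]   # hypothetical raw scores
m, sd = mean(scores), pstdev(scores)                # mean and population SD

raw = 66
z = (raw - m) / sd                          # SD units above/below the mean
t = 50 + 10 * z                             # the "fifty plus or minus ten" scale
stanine = min(9, max(1, round(2 * z + 5)))  # common approximation (assumption)
print(f"z = {z:.2f}, T = {t:.1f}, stanine = {stanine}")
```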
CHAPTER 4: OF TESTS AND TESTING

Some Assumptions About Psychological Testing and Assessment

Assumption 1. Psychological Traits and States Exist
Trait – any distinguishable, relatively enduring way in which one individual varies from another.
States – also distinguish one person from another but are relatively less enduring.
A psychological trait exists only as a construct.
Construct – an informed, scientific concept developed or constructed to describe or explain behavior. We cannot see, hear, or touch constructs, but we can infer their existence from overt behavior.
Overt behavior – refers to an observable action or product of an observable action.

Assumption 2. Psychological Traits and States Can Be Quantified and Measured
Measuring traits and states by means of a test entails developing not only appropriate test items but also appropriate ways to score the test and interpret the results. The test score is presumed to represent the strength of the targeted ability, trait, or state and is frequently based on cumulative scoring.

Assumption 3. Test-Related Behavior Predicts Non-Test-Related Behavior
Patterns of answers to true-false questions on one widely used test of personality are used in decision making regarding mental disorders. The tasks in some tests mimic the actual behaviors that the test user is attempting to understand.

Assumption 4. Tests and Other Measurement Techniques Have Strengths and Weaknesses
Competent test users understand a great deal about the tests they use. They understand, among other things, how a test was developed, the circumstances under which it is appropriate to administer the test, how the test should be administered and to whom, and how the test results should be interpreted. Competent test users also understand and appreciate the limitations of the tests they use, as well as how those limitations might be compensated for by data from other sources.

Assumption 5. Various Sources of Error Are Part of the Assessment Process
In everyday language, error refers to mistakes, miscalculations, and the like. In the context of assessment, error traditionally refers to something that is more than expected; it is a component of the measurement process. It reflects the long-standing assumption that factors other than what a test attempts to measure will influence performance on the test. Test scores are always subject to questions about the degree to which the measurement process includes error.
Error variance – the component of a test score attributable to sources other than the trait or ability measured.
Sources of error variance:
1. Assessee
2. Assessor
3. Measuring instruments
Classical test theory (true score theory) – the assumption is made that each test taker has a true score on a test that would be obtained but for the action of measurement error.

Assumption 6. Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
Today all major test publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual. One source of fairness-related problems is the test user who attempts to use a particular test with people whose background and experience differ from the background and experience of the people for whom the test was intended.

Assumption 7. Testing and Assessment Benefit Society
In a world without tests, teachers and school administrators could arbitrarily place children in different types of special classes simply because that is where they believed the children belonged. In a world without tests, there would be a great need for instruments to diagnose educational difficulties in reading and math and point the way to remediation.
The criteria for a good test would include clear instructions for administering, scoring, and interpretation. It would also be a plus if a test offered economy in the time and money it took to administer, score, and interpret it.

CHAPTER 5: RELIABILITY

A good test, or more generally a good measuring tool or procedure, is reliable. The criterion of reliability involves the consistency of the measuring tool: the precision with which the test measures and the extent to which error is present in measurements. In theory, the perfectly reliable measuring tool consistently measures in the same way.
Classical test theory states that a score on an ability test is presumed to reflect not only the test taker's true score on the ability being measured but also error.
Error – refers to the component of the observed test score that does not have to do with the test taker's ability.
A statistic useful in describing sources of test-score variability is the variance – the standard deviation squared.
True variance – variance from true differences.
Error variance – variance from irrelevant, random sources.
Reliability refers to the proportion of the total variance attributed to true variance. The greater the proportion of the total variance attributed to true variance, the more reliable the test.
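A small simulation makes this variance decomposition concrete. Everything in the sketch below is invented for illustration (true-score mean 100, SD 15, error SD 5): observed scores are generated as true score plus random error, per classical test theory, and reliability is estimated as the proportion of total variance that is true variance.

```python
# A minimal sketch of classical test theory: X = T + E, and
# reliability = true variance / total variance.
import random
from statistics import pvariance

random.seed(0)
true_scores = [random.gauss(100, 15) for _ in range(10_000)]  # T (hypothetical)
observed = [t + random.gauss(0, 5) for t in true_scores]      # X = T + E

reliability = pvariance(true_scores) / pvariance(observed)
print(f"Estimated reliability: {reliability:.3f}")   # about 225/250 = 0.9
```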
Measurement error – refers, collectively, to all of the factors associated with the process of measuring some variable, other than the variable being measured.

Categories of Measurement Error
Random error – a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process.
Systematic error – a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.

Sources of Error Variance
1. Test construction. Item sampling or content sampling is one source of variance during test construction; it refers to the variation among items within a test as well as to the variation among items between tests. The extent to which a test taker's score is affected by the content sampled on a test, and by the way the content is sampled, is a source of error variance.
2. Test administration. Sources of error variance during test administration include the test environment (room temperature, level of lighting, and amount of ventilation and noise); test-taker variables (pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication, as well as formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state; even body weight can be a source of error variance); and examiner-related variables.
3. Test scoring and interpretation. Scorers and scoring systems are potential sources of error variance.

Test-retest reliability – an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. It is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait.
Coefficient of stability – when the interval between testings is greater than six months, the estimate of test-retest reliability is often referred to as the coefficient of stability.
The degree of relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, often termed the coefficient of equivalence.
Parallel forms of a test exist when, for each form of the test, the means and the variances of observed test scores are equal.
Parallel-forms reliability – an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
Alternate forms – simply different versions of a test that have been constructed so as to be parallel.
Alternate-forms reliability – an estimate of the extent to which these different forms of the same test have been affected by item sampling error or other error.
An estimate of the reliability of a test can also be obtained without developing an alternate form of the test and without having to administer the test twice to the same people. The computation of a coefficient of split-half reliability generally entails three steps (sketched in code below):
1. Divide the test into equivalent halves.
2. Calculate a Pearson r between scores on the two halves of the test.
3. Adjust the half-test reliability using the Spearman-Brown formula.
Split-half reliability is also referred to as odd-even reliability.
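The three split-half steps can be carried out in a few lines. In the sketch below, the 0/1 item matrix is entirely hypothetical, the Pearson r is computed by hand so the example is self-contained, and the Spearman-Brown adjustment for two halves is r_sb = 2r / (1 + r).

```python
# A minimal sketch of odd-even split-half reliability with the
# Spearman-Brown correction (item scores invented).
from statistics import mean, pstdev

# rows = test takers, columns = 0/1 scores on a six-item test
items = [
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
]

odd = [sum(row[0::2]) for row in items]    # half-test scores: items 1, 3, 5
even = [sum(row[1::2]) for row in items]   # half-test scores: items 2, 4, 6

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

r_half = pearson_r(odd, even)              # step 2: correlate the halves
r_full = 2 * r_half / (1 + r_half)         # step 3: Spearman-Brown adjustment
print(f"half-test r = {r_half:.3f}, adjusted r = {r_full:.3f}")
```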
The Spearman-Brown formula – allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.
Inter-item consistency – refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test.
Tests are said to be homogeneous if they contain items that measure a single trait.
Heterogeneity – describes the degree to which a test measures different factors. A heterogeneous test is composed of items that measure more than one trait.
Internal consistency estimates of reliability – estimates of the reliability of a test obtained from a measure of inter-item consistency.
Methods of obtaining internal consistency estimates of reliability:
1. Split-half estimate – obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
2. Coefficient alpha – developed by Cronbach and elaborated on by others; the mean of all possible split-half correlations (see the sketch at the end of this section).
3. Average proportional distance (APD) – a relatively new measure for evaluating the internal consistency of a test; it focuses on the degree of difference that exists between item scores.
Inter-scorer reliability – the degree of agreement or consistency between two or more scorers with regard to a particular measure. The simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation, the coefficient of inter-scorer reliability.
Three basic approaches to the estimation of reliability:
1. Test-retest
2. Alternate or parallel forms
3. Internal or inter-item consistency
Dynamic characteristic – a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.
Static characteristic – a trait, state, or ability presumed to be relatively unchanging over time; contrast with dynamic.
Inflation of range (also referred to as inflation of variance) – a phenomenon associated with reliability estimates wherein the variance of either variable in a correlational analysis is inflated by the sampling procedure used, so the resulting correlation coefficient tends to be higher; contrast with restriction of range, in which the variance is restricted and the resulting coefficient tends to be lower.
Speed tests – generally contain items of uniform level of difficulty so that, when given generous time limits, all test takers should be able to complete all the test items correctly.
Power tests – when the time limit is long enough to allow test takers to attempt all items, and some items are so difficult that no test taker is able to obtain a perfect score, the test is a power test.
Criterion-referenced tests – designed to provide an indication of where a test taker stands with respect to some variable or criterion, such as an educational or a vocational objective.
Classical test theory – also referred to as the true score model of measurement; the most widely used and accepted model in the psychometric literature today.
True score – a value that, according to classical test theory, genuinely reflects an individual's ability level as measured by a particular test.
Domain sampling theory – seeks to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score. In this theory, a test's reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample.
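As flagged in the methods list above, coefficient alpha can be computed from a single administration. The sketch below uses the standard formula alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), with the same kind of invented item matrix as before.

```python
# A minimal sketch of Cronbach's coefficient alpha (item scores invented).
from statistics import pvariance

items = [
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
]

k = len(items[0])                                    # number of items
item_vars = [pvariance(col) for col in zip(*items)]  # variance of each item
total_var = pvariance([sum(row) for row in items])   # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"alpha = {alpha:.3f}")   # 0.600 for this toy matrix
```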
Generalizability study – examines how generalizable scores from a particular test are if the test is administered in different situations. Stated in the language of generalizability theory, a generalizability study examines how much of an impact different facets of the universe have on the test score. The influence of particular facets on the test score is represented by coefficients of generalizability.
Decision study – developers examine the usefulness of test scores in helping the test user make decisions.
Item response theory (IRT) – another alternative to the true score model. It is also referred to as latent-trait theory or the latent-trait model: a system of assumptions about measurement and the extent to which each test item measures the trait. In the context of IRT, discrimination signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured (see the sketch below). There are IRT models designed to handle data resulting from the administration of tests with dichotomous or polytomous test items.
Dichotomous test item – a test item or question that can be answered with only one of two response options, such as true-false or yes-no.
Polytomous test item – a test item or question with three or more alternative responses, where only one alternative is scored correct or scored as being consistent with a targeted trait or other construct.

CHAPTER 6: VALIDITY

A test is considered valid for a particular purpose if it measures what it purports to measure. A test of reaction time is a valid test if it accurately measures reaction time. A test of intelligence is a valid test if it truly measures intelligence.

Other considerations
A good test is one that trained examiners can administer, score, and interpret with a minimum of difficulty. A good test is a useful test, one that yields actionable results that will ultimately benefit individual test takers or society at large. If the purpose of a test is to compare the performance of a test taker with the performance of other test takers, then a “good test” is one that contains adequate norms, also referred to as normative data.
Norms – provide a standard with which the results of measurement can be compared.
Norm-referenced testing and assessment – a method of evaluation and a way of deriving meaning from test scores by evaluating an individual test taker's score and comparing it to the scores of a group of test takers. A common goal of norm-referenced tests is to yield information on a test taker's standing or ranking relative to some comparison group of test takers.
Norm – in everyday usage, refers to behavior that is usual, average, normal, standard, expected, or typical. Norms in the psychometric context are the test performance data of a particular group of test takers that are designed for use as a reference when evaluating or interpreting individual test scores.
Standardization (test standardization) – the process of administering a test to a representative sample of test takers for the purpose of establishing norms. The test developer has in mind some defined group as the population for which the test is designed.
Sample – a portion of the universe of people deemed to be representative of the whole population.
Sampling – the process of selecting the portion of the universe deemed to be representative of the whole population.
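Returning to the IRT discrimination idea flagged above: one common way to express it is the two-parameter logistic (2PL) model, sketched below with invented parameter values. The 2PL form is a standard IRT model, not something spelled out in these notes.

```python
# A minimal sketch of a 2PL item response function, where a is the item's
# discrimination and b its difficulty (all values hypothetical).
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """P(correct) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A highly discriminating item (a = 2.0) separates trait levels near its
# difficulty (b = 0.0) far more sharply than a weakly discriminating one.
for theta in (-1.0, 0.0, 1.0):
    print(theta,
          round(p_correct(theta, a=2.0, b=0.0), 2),   # steep curve
          round(p_correct(theta, a=0.5, b=0.0), 2))   # shallow curve
```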
Stratified sampling – the process of developing a sample based on specific subgroups of a population.
Stratified-random sampling – stratified sampling in which every member of the population has the same chance of being included in the sample.
Two other types of sampling procedure:
1. Purposive sampling – the arbitrary selection of people to be part of a sample because they are thought to be representative of the population being studied.
2. Incidental sampling – also referred to as convenience sampling; the process of arbitrarily selecting some people to be part of a sample because they are readily available, not because they are most representative of the population being studied.

Types of Norms
1. Age norms – also known as age-equivalent scores; indicate the average performance of different samples of test takers who were at various ages at the time the test was administered.
2. Grade norms – designed to indicate the average test performance of test takers in a given school grade; developed by administering the test to representative samples of children over a range of consecutive grade levels.
3. National norms – derived from a normative sample that was nationally representative of the population at the time the norming study was conducted.
4. National anchor norms – an equivalency table for scores on two nationally standardized tests designed to measure the same thing. With the equipercentile method, the equivalency of scores on different tests is calculated with reference to corresponding percentile scores.
5. Local norms – provide normative information with respect to the local population's performance on some test.
6. Norms from a fixed reference group – under a fixed reference group scoring system, the distribution of scores obtained on the test from one group of test takers (the fixed reference group) is used as the basis for the calculation of test scores for future administrations of the test.
7. Subgroup norms – norms for any defined group within a larger group.
8. Percentile norms – the raw data from a test's standardization sample converted to percentile form.
Percentile – an expression of the percentage of people whose score on a test or measure falls below a particular raw score; a converted score that refers to a percentage of test takers (computed in the sketch below).
Percentage correct – refers to the distribution of raw scores, more specifically to the number of items that were answered correctly, multiplied by 100 and divided by the total number of items.
Norm-referenced – a way to derive meaning from test scores by evaluating a test taker's score in relation to other scores on the same test.
Criterion – a standard on which a judgment or decision may be based.
Criterion-referenced testing and assessment – a method of evaluation and a way of deriving meaning from test scores by evaluating an individual's score with reference to a set standard.
Content-referenced testing and assessment – also referred to as criterion-referenced or domain-referenced testing and assessment; a method of evaluation and a way of deriving meaning from test scores by evaluating an individual's score with reference to a set standard; contrast with norm-referenced testing and assessment.
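Finally, the percentile and percentage-correct definitions above reduce to two one-line computations, sketched here with an invented norm sample.

```python
# A minimal sketch: percentile rank within a norm group, and percentage
# correct (all numbers hypothetical).
norm_group = [38, 41, 44, 47, 50, 52, 55, 58, 61, 66]  # hypothetical norm sample

def percentile_rank(raw, norms):
    """Percentage of norm-group scores falling below the given raw score."""
    below = sum(1 for s in norms if s < raw)
    return 100.0 * below / len(norms)

print(percentile_rank(55, norm_group))  # 60.0 -> 60th percentile
print(round(100 * 55 / 70, 1))          # 55 of 70 items correct -> 78.6%
```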