PSYCH ASSESSMENT

TOPIC 1: OVERVIEW OF PSYCHOLOGICAL TESTING AND ASSESSMENT

=PSYCHOLOGICAL TESTS=
• Objective and standardized measures of a sample of human behavior (Anastasi & Urbina, 1997).
• These are instruments with three defining characteristics:
o It is a sample of human behavior.
o The sample is obtained under standardized conditions.
o There are established rules for scoring or for obtaining quantitative information from the behavior sample.

*PSYCHOLOGICAL MEASUREMENT- The process of assigning numbers (i.e., test scores) to persons in such a way that some attributes of the persons being measured are reflected by properties of the numbers.

=GENERAL TYPES OF PSYCHOLOGICAL TESTS=
These tests are categorized according to the manner of administration, purpose, and nature.
*Administration- Individual; Group
*Item Format- Objective; Projective
*Response Format- Verbal; Performance
*Domain Measured- Cognitive; Affective

=TYPES OF TESTS=
*STANDARDIZED TESTS- Instruments that have prescribed directions for administration, scoring, and interpretation.
• Examples: MBTI, MMPI, SB-5, WAIS
*GROUP TESTS- Tests that can be administered to a group, usually by the paper-and-pencil method; they can also be administered individually.
• Examples: achievement tests, RPM, MBTI
*SPEED TESTS- Administered under a prescribed time limit, usually a period too short for an individual to finish answering the entire test.
• The level of difficulty is the same for all items.
• Example: the SRA Verbal Test
*POWER TESTS- Measure competencies and abilities.
• The prescribed time limit is usually enough for one to accomplish the entire test.
• Example: the Differential Aptitude Test
*VERBAL TESTS- Instruments that use words to measure a particular domain.
• Example: admission tests for many educational institutions.
*NONVERBAL TESTS- Instruments that do not use words; instead, they use geometrical drawings or patterns.
• Example: RPM
*COGNITIVE TESTS- Measure thinking skills.
• Examples: the broad range of intelligence and achievement tests.
*AFFECTIVE TESTS- Measure personality, interests, values, etc.
• Examples: Life Satisfaction Scale, 16PF, MBTI

=TESTING OF HUMAN ABILITY=
*NON-STANDARDIZED TESTS (Informal Tests)- Exemplified by teacher-made tests used for either formative or summative evaluation of student performance.
• Examples: prelim exams, quizzes
*NORM-REFERENCED TESTS- Instruments whose score interpretation is based on the performance of a particular group.
• For example, the Raven's Progressive Matrices (RPM) has several norm groups that serve as comparison groups for the interpretation of scores.
*CRITERION-REFERENCED TESTS- Measures whose criteria for passing or failing have been decided beforehand.
• For example, a passing score of 75%.
*INDIVIDUAL TESTS- Instruments administered one-on-one, face-to-face.
• Examples: WAIS, SB-5, Bender-Gestalt
*TESTS FOR SPECIAL POPULATIONS- Developed for use with persons who cannot be properly or adequately examined with traditional instruments such as the individual scales.
• Follow performance or nonverbal tasks.

=PERSONALITY TESTS=
These are instruments used for the measurement of emotional, motivational, interpersonal, and attitudinal characteristics.
*Approaches to the Development of Personality Assessment
• Empirical Criterion Keying
• Factor Analysis
• Personality Theories

*PROJECTIVE TECHNIQUES- Relatively unstructured tasks; tasks that permit an almost unlimited variety of possible responses; disguised procedures.
Examples:
• Rorschach Inkblot Test
• Thematic Apperception Test
• Sentence Completion Test
• Drawing Test

=GENERAL PROCESS OF ASSESSMENT=

=REASONS FOR USING TESTS=

*PSYCHOLOGICAL TESTING- The process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior.
*PSYCHOLOGICAL ASSESSMENT- The gathering and integration of psychology-related data for the purpose of making a psychological evaluation, accomplished through the use of tools such as tests, interviews, case studies, behavioral observation, and specially designed apparatuses and measurement procedures.

=THE TOOLS OF PSYCHOLOGICAL ASSESSMENT=
*THE TEST- A test is defined simply as a measuring device or procedure.
*PSYCHOLOGICAL TEST- A device or procedure designed to measure variables related to psychology:
• Intelligence
• Personality
• Aptitude
• Interests
• Attitudes
• Values

=DIFFERENCES BETWEEN PSYCHOLOGICAL TESTS AND OTHER TOOLS=

=DIFFERENT APPROACHES TO ASSESSMENT=
*COLLABORATIVE PSYCHOLOGICAL ASSESSMENT- The assessor and the assessee work as partners from initial contact through final feedback.
*THERAPEUTIC PSYCHOLOGICAL ASSESSMENT- Therapeutic self-discovery and new understandings are encouraged throughout the assessment process.
*DYNAMIC ASSESSMENT- An interactive approach to psychological assessment that usually follows the model: evaluation > intervention > evaluation. Captures the interactive, changing, and varying nature of assessment.
*THE INTERVIEW- A method of gathering information through direct communication involving reciprocal exchange. Interviews differ in purpose, length, and nature. Uses: diagnosis, treatment, selection decisions.
*THE PORTFOLIO- Contains samples of one's ability and accomplishment which can be used for evaluation.
*CASE HISTORY DATA- Records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to the assessee.
*Case Study or Case History- A report or illustrative account concerning a person or an event, compiled on the basis of case history data.
*BEHAVIORAL OBSERVATION- Monitoring the actions of others or oneself by visual or electronic means while recording quantitative or qualitative information regarding those actions. Aids the development of therapeutic interventions and is extremely useful in institutional settings such as schools, hospitals, prisons, and group homes.
*ROLE-PLAY TESTS- Acting an improvised or partially improvised part in a simulated situation. Assessees are directed to act as if they were in a particular situation. Evaluation covers expressed thoughts, behaviors, abilities, and other related variables. Can be used as both a tool of assessment and a measure of outcome.
*COMPUTERS AS TOOLS- Can serve as test administrators (online or offline) and as highly efficient test scorers.
*Interpretive Reports- Distinguished by the inclusion of numerical or narrative interpretive statements in the report.
*Consultative Reports- Written in language appropriate for communication between assessment professionals; may provide expert opinion concerning the data analysis.
*Integrative Reports- Employ previously collected data in the test report.

=PARTICIPANTS IN THE TESTING PROCESS AND THEIR ROLES=
*Test authors and developers- Conceive, prepare, and develop tests; also find ways to disseminate their tests.
*Test publishers- Publish, market, and sell tests, thus controlling their distribution.
*Test reviewers- Prepare evaluative critiques of tests based on their technical and practical merits.
*Test users- Select or decide which specific test(s) will be used for some purpose; may also act as examiners or scorers.
*Test sponsors- Institutional boards or agencies that contract test developers or publishers for various testing services.
*Test administrators or examiners- Administer the test either to one individual at a time or to groups.
*Test takers- Take the test by choice or necessity.
*Test scorers- Tally the raw scores and transform them into test scores through objective or mechanical scoring or through the application of evaluative judgment.
*Test score interpreters- Interpret test results to consumers such as individual test takers or their relatives, other professionals, or organizations of various kinds.

=SETTINGS WHERE ASSESSMENTS ARE CONDUCTED=
*EDUCATIONAL SETTINGS- Help identify children who may have special needs.
• Diagnostic tests and/or achievement tests
*CLINICAL SETTINGS- For screening and/or diagnosing behavioral problems.
• May involve intelligence, personality, neuropsychological, or other specialized instruments, depending on the presenting or suspected problem area.
*COUNSELING SETTINGS- Aim to improve the assessee's adjustment, productivity, or some related variable.
• May involve personality, interest, attitude, and values tests.
*GERIATRIC SETTINGS- Quality-of-life assessment measuring variables related to perceived stress, loneliness, sources of satisfaction, personal values, quality of living conditions, and quality of friendships and social support.
*BUSINESS & MILITARY SETTINGS- Decision making about the careers of personnel.
*GOVERNMENTAL & ORGANIZATIONAL CREDENTIALING- Licensing or certifying exams.
*ACADEMIC RESEARCH SETTINGS- Sound knowledge of measurement principles and assessment tools is required prior to research publication.

=SOME TERMS TO REMEMBER=
*PROTOCOL- Typically refers to the form, sheet, or booklet on which a test taker's responses are entered.
• May also refer to a description of a set of test- or assessment-related procedures.
*RAPPORT- The working relationship between the examiner and the examinee.
*ACCOMMODATION- The adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with an exceptional need.
*ALTERNATE ASSESSMENT- An evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived.
• Alternative methods designed to measure the same variables.

=TEST USER QUALIFICATION LEVELS=
• ONLINE DATABASES- Maintained by the APA: PsycINFO, ClinPSYC, PsycARTICLES, etc.

=A BRIEF HISTORY OF PSYCHOLOGICAL TESTING=
*Early 20th-century France- The roots of contemporary psychological testing and assessment.
*1905- Alfred Binet and a colleague published a test designed to help place Paris schoolchildren in appropriate classes.
*1917, World War I- The military needed a way to screen large numbers of recruits quickly for intellectual and emotional problems.
*World War II- The military depended even more on psychological tests to screen recruits for service.
*Post-war- More and more tests purporting to measure an ever-widening array of psychological variables were developed and used.

=PROMINENT FIGURES IN THE HISTORY OF PSYCHOMETRICS=

=INDIVIDUAL DIFFERENCES=
In spite of our similarities, no two humans are exactly the same.
*CHARLES DARWIN- Believed that some individual differences are more adaptive than others.
• Individual differences, over time, lead to more complex, intelligent organisms.

=SOURCES OF INFORMATION ABOUT TESTS=
• TEST CATALOGUES- Usually contain only a brief description of the test and seldom contain detailed technical information.
• TEST MANUALS- Detailed information concerning the development of a particular test and technical information relating to it.
• REFERENCE VOLUMES- Periodically updated volumes that provide detailed information for each test listed, e.g., the Mental Measurements Yearbook.
• JOURNAL ARTICLES- Contain reviews of a test, updates, or independent studies of its psychometric soundness, or examples of how the instrument was used in either research or applied contexts.

*FRANCIS GALTON- Cousin of Charles Darwin.
• An applied Darwinist: he claimed that some people possessed characteristics that made them more fit than others.
• Wrote Hereditary Genius (1869).
• Set up an anthropometric laboratory at the International Exposition of 1884.
• Noted that persons with mental retardation also tend to have diminished ability to discriminate among heat, cold, and pain.

*CHARLES SPEARMAN- Tried to prove Galton's hypothesis concerning the link between intelligence and visual acuity.
• Expanded the use of correlational methods pioneered by Galton and Karl Pearson, and provided the conceptual foundation for factor analysis, a technique for reducing a large number of variables to a smaller set of factors, which would become central to the advancement of testing and trait theory.
• Devised a theory of intelligence that emphasized a general intelligence factor (g) present in all intellectual activities.
*KARL PEARSON- Famous student of Galton.
• Continued Galton's early work on statistical regression.
• Invented the formula for the coefficient of correlation: Pearson's r.
*JAMES MCKEEN CATTELL- The first person to use the term "mental test."
• Wrote a dissertation on reaction time based upon Galton's work.
• Tried to link various measures of simple discriminative, perceptive, and associative power to independent estimates of intellectual level, such as school grades.

=EARLY EXPERIMENTAL PSYCHOLOGISTS=
• In the early 19th century, scientists were generally interested in identifying common aspects of human behavior rather than individual differences.
• Differences between individuals were considered a source of error that rendered human measurement inexact.
*JOHANN FRIEDRICH HERBART- Proposed mathematical models of the mind.
• The founder of pedagogy as an academic discipline.
*ERNST HEINRICH WEBER- Proposed the concepts of sensory thresholds and the Just Noticeable Difference (JND).
*GUSTAV THEODOR FECHNER- Worked on the mathematics of sensory thresholds of experience.
• Founder of psychophysics and one of the founders of experimental psychology.
• The Weber-Fechner law was the first to relate sensation and stimulus: it states that the strength of a sensation grows as the logarithm of the stimulus intensity.
• Considered by some as the founder of psychometrics.
*GUY MONTROSE WHIPPLE- Influenced by Fechner and a student of Titchener.
• Pioneered human ability testing.
• Conducted a seminar that changed the field of psychological testing (Carnegie Institute, 1918).
• Because of his criticisms, the APA issued its first standards for professional psychological testing.
• The seminar led to the construction of the Carnegie Interest Inventory, which became the Strong Vocational Interest Blank.
*LOUIS LEON THURSTONE- A large contributor to factor analysis; attended Whipple's seminars.
• His approach to measurement was called the Law of Comparative Judgment.

=INTEREST IN MENTAL DEFICIENCY=
*JEAN ETIENNE ESQUIROL- A French physician and the favorite student of Philippe Pinel, the founder of psychiatry.
• Responsible for a manuscript on mental retardation which differentiated between insanity and mental retardation.
*EDOUARD SEGUIN- A French physician who pioneered the training of persons with mental retardation.
• Rejected the notion of incurable mental retardation (MR).
• 1837: opened the first school devoted to teaching children with MR.
• 1866: conducted experiments on the physiological training of MR involving sense/muscle training still used today; this work led to nonverbal tests of intelligence (Seguin Form Board Test).
*EMIL KRAEPELIN- Devised a series of examinations for evaluating emotionally impaired individuals.

=INTELLIGENCE TESTING=
*ALFRED BINET- Appointed by the French government to develop a test that would place in special classes those Paris schoolchildren who failed to respond to normal schooling.
• Devised the first intelligence test: the Binet-Simon Scale of 1905.
• The scale had standardized administration and used a standardization sample.
*LEWIS MADISON TERMAN- Translated the Binet-Simon Scales into English for use in the US; in 1916 the translation was published as the Stanford-Binet Intelligence Scale.
• The SB scale became more psychometrically sound, and the term IQ was introduced.
• IQ = (Mental Age / Chronological Age) x 100 (see the worked sketch below)
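A quick worked illustration of the ratio IQ formula above; this is a minimal sketch, and the ages used are hypothetical:

```python
# Illustrative only: ratio IQ as defined above, with hypothetical ages.
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """IQ = (Mental Age / Chronological Age) x 100."""
    return mental_age / chronological_age * 100

print(ratio_iq(12, 10))  # a 10-year-old performing at a 12-year-old level -> 120.0
print(ratio_iq(8, 10))   # performing below age level -> 80.0
```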
*ROBERT YERKES- President of the APA; commissioned by the US Army to develop structured tests of human abilities.
• WWI created the need for large-scale, group-administered ability tests for the army.
• Army Alpha: verbal; administered to literate soldiers.
• Army Beta: nonverbal; administered to illiterate soldiers.
*DAVID WECHSLER- The subscales of his tests were adapted from the army scales.
• Produced several scores of intellectual ability rather than Binet's single score.
• His work evolved into the Wechsler series of intelligence tests (WAIS, WISC, etc.).

=PERSONALITY TESTING=
These tests are intended to measure personality traits.
*TRAITS- Relatively enduring dispositions (tendencies to act, think, or feel in a certain manner in any given circumstance).
*1920s- The rise of personality testing
*1930s- The fall of personality testing
*1940s- The slow rise of personality testing

=PERSONALITY TESTING: First Structured Test=
*WOODWORTH PERSONAL DATA SHEET- The first objective personality test, meant to assist in psychiatric interviews.
• Developed during WWI.
• Designed to screen out soldiers unfit for duty.
• Mistakenly assumed that a subject's responses could be taken at face value.

=PERSONALITY TESTING: Second Structured Test=
*MINNESOTA MULTIPHASIC PERSONALITY INVENTORY (MMPI, 1943)- Tests like the Woodworth made too many assumptions; the MMPI's premise was that the meaning of a test response could only be determined by empirical research.
• The MMPI-2 and MMPI-A are the most widely used versions.
*RAYMOND B. CATTELL: The 16PF- Based on factor analysis, a method for finding the minimum number of dimensions or factors that explain the largest number of variables.
• J. P. Guilford was the first to apply the factor-analytic approach to test construction.

=PERSONALITY TESTING: Slow Rise – Projective Techniques=
*HERMANN RORSCHACH: The Inkblot Test- Pioneered projective assessment with his inkblot test in 1921.
• Symmetric colored and black-and-white inkblots.
• Met with great suspicion at first; the first serious study was made in 1932.
• Introduced to the US by David Levy.
*THEMATIC APPERCEPTION TEST (TAT)- Developed in 1935; composed of ambiguous pictures that are considerably more structured than the Rorschach.
• Subjects are shown pictures and asked to tell a story including:
o What has led up to the event shown;
o What is happening at the moment;
o What the characters are feeling and thinking; and
o What the outcome of the story was.

=THE RISE OF MODERN PSYCHOLOGICAL TESTING=
*1900s- Everything necessary for the rise of the first truly modern and successful psychological test was in place.
*1904- Alfred Binet was appointed to devise a method of evaluating children who could not profit from regular classes and would require special education.
*1905- Binet and Theodore Simon published the first useful instrument for the measurement of general cognitive abilities, or global intelligence.
*1908 and 1911- Binet revised, expanded, and refined his first scale.
*1911- The birth of the IQ: William Stern proposed the computation of IQ based on the Binet-Simon Scale (IQ = Mental Age / Chronological Age x 100).
*1916- Lewis Terman translated the Binet-Simon Scales into English and published the result as the Stanford-Binet Intelligence Scale.
*1917, World War I- Robert Yerkes, APA president, developed group tests of intelligence for the US Army, pioneering group testing: Army Alpha and Army Beta.
*1918- Arthur Otis devised multiple-choice items that could be scored objectively and rapidly; published the Group Intelligence Scale, which served as the model for the Army Alpha.
*1919- E. L. Thorndike produced an intelligence test for high school graduates.

=PROMINENT FIGURES IN MODERN PSYCHOLOGICAL TESTING=
• Alfred Binet
• Theodore Simon
• Lewis Terman
• Robert Yerkes
• Arthur Otis

TOPIC 2: STATISTICS REVIEW

*MEASUREMENT- The act of assigning numbers or symbols to characteristics of things (people, events, etc.) according to rules.
*SCALE- A set of numbers (or other symbols) whose properties model empirical properties of the objects to which the numbers are assigned.

=CATEGORIES OF SCALES=
*DISCRETE- Values that are distinct and separate; they can be counted.
• For example, if subjects were categorized as either female or male, the categorization scale would be discrete because it would not be meaningful to categorize a subject as anything other than female or male.
• Examples: gender, type of house, color
*CONTINUOUS- Exists when it is theoretically possible to divide any of the values of the scale.
• The values may take any value within a finite or infinite interval.
• Examples: temperature, height, weight
*ERROR- The collective influence of all factors on a test score or measurement beyond those specifically measured by the test or measurement.
• It is very much an element of all measurement, and it is an element for which any theory of measurement must account.
• Error is always present in measurement that follows a continuous scale.

=SCALES OF MEASUREMENT=
*NOMINAL SCALES- The simplest form of measurement.
• Involve classification or categorization based on one or more distinguishing characteristics, where all things measured must be placed into mutually exclusive and exhaustive categories.
• Examples: DSM-5 diagnoses, gender of patients, colors
*ORDINAL SCALES- Also permit classification; in addition, rank ordering on some characteristic is permissible.
• They imply nothing about how much greater one ranking is than another, and the numbers do not indicate units of measurement.
• No absolute zero point.
• Examples: fastest reader, size of waistline, job positions
*INTERVAL SCALES- Permit both categorization and ranking; in addition, they contain equal intervals between numbers, so each unit on the scale is exactly equal to any other unit on the scale.
• No absolute zero point; however, it is possible to average a set of measurements and obtain a meaningful result.
• For example, the difference between IQs of 80 and 100 is thought to be similar to that between IQs of 100 and 120. If an individual achieved an IQ of 0, it would not indicate zero intelligence or the total absence of it.
• Examples: temperature, time, IQ scales, psychological scales
*RATIO SCALES- Contain all the properties of nominal, ordinal, and interval scales, plus a true zero point; negative values are not possible.
• A score of zero means the complete absence of the attribute being measured.
• Examples: exam score, neurological exam (e.g., hand grip), heart rate

*DESCRIPTIVE STATISTICS- Used to say something about a set of information that has been collected.

=DESCRIBING DATA=
*DISTRIBUTION- A set of test scores arrayed for recording or study.
*RAW SCORE- A straightforward, unmodified accounting of performance that is usually numerical.
• May reflect a simple tally, such as the number of items responded to correctly on an achievement test.
*FREQUENCY DISTRIBUTION- All scores are listed alongside the number of times each score occurred.
• Scores might be listed in tabular or graphical form.
*MEASURES OF CENTRAL TENDENCY- Statistics that indicate the average or midmost score between the extreme scores in a distribution.
*MEAN- The most common measure of central tendency. It takes into account the numerical value of every score.
*MEDIAN- The middlemost score in the distribution, determined by arranging the scores in either ascending or descending order.
*MODE- The most frequently occurring score in a distribution of scores. (A worked sketch of these three measures follows.)
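A minimal sketch of the three measures of central tendency above, using Python's standard library on a hypothetical set of test scores:

```python
# Illustrative only: central tendency on hypothetical test scores.
import statistics

scores = [85, 90, 90, 78, 92, 88, 90, 75]

print(statistics.mean(scores))    # mean: sum of scores / number of scores
print(statistics.median(scores))  # median: middlemost score once sorted
print(statistics.mode(scores))    # mode: most frequently occurring score (90)
```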
=MEASURES OF VARIABILITY=
*VARIABILITY- An indication of how scores in a distribution are scattered or dispersed.
*RANGE- The simplest measure of variability.
• The difference between the highest and the lowest score.
*INTERQUARTILE RANGE- A measure of variability equal to the difference between Q3 and Q1.
*SEMI-INTERQUARTILE RANGE- Equal to the interquartile range divided by two.
*AVERAGE DEVIATION- Another tool that can be used to describe the amount of variability in a distribution.
• Rarely used, perhaps because the deletion of algebraic signs renders it useless for any further operations.
*STANDARD DEVIATION (SD)- A measure of variability equal to the square root of the average squared deviations about the mean.
• The square root of the variance.
• A low SD indicates that the values are close to the mean, while a high SD indicates that the values are dispersed over a wider range.
*SKEWNESS- Refers to the absence of symmetry.
• An indication of how the measurements in a distribution are distributed.
*KURTOSIS- The steepness of a distribution at its center.
• Describes how heavy or light the tails are.
• PLATYKURTIC- relatively flat, gently curved
• MESOKURTIC- moderately curved, somewhere in the middle
• LEPTOKURTIC- relatively peaked

*THE NORMAL CURVE- A bell-shaped, smooth, mathematically defined curve that is highest at its center.
• Perfectly symmetrical, with no skewness.
• The majority of test takers bulk at the middle of the distribution; very few test takers are at the extremes.
• Mean = Median = Mode
• Q1 and Q3 are equidistant from Q2 (the median).

=AREAS UNDER THE NORMAL CURVE=
• 50% of all scores occur above the mean and 50% occur below the mean.
• Approximately 34% of all scores occur between the mean and 1 SD above the mean.
• Approximately 34% of all scores occur between the mean and 1 SD below the mean.
• Approximately 68% of all scores occur between the mean and +/- 1 SD.
• Approximately 95% of all scores occur between the mean and +/- 2 SD.

*STANDARD SCORES- Raw scores that have been converted from one scale to another scale, where the latter has an arbitrarily set mean and standard deviation.
• Provide a context for comparing scores on two different tests by converting the scores from both tests into z scores.

=TYPES OF STANDARD SCORES=
*z Scores- Known as the golden scores.
• Result from the conversion of a raw score into a number indicating how many SD units the raw score is below or above the mean of the distribution.
• Mean = 0; SD = 1
• Zero plus or minus one scale (0 +/- 1)
• Scores can be positive or negative.
*T Scores- Fifty plus or minus ten scale (50 +/- 10)
• Mean = 50; SD = 10
• Devised by W. A. McCall (1922, 1939) and named the T score in honor of his professor E. L. Thorndike.
• Composed of a scale that ranges from 5 SD below the mean to 5 SD above the mean.
• None of the scores is negative.
*Stanine- Takes whole numbers from 1 to 9, without decimals, each representing a range of performance half an SD in width.
• Mean = 5; SD = 2
• Used by the US Air Force for assessment.
*Deviation IQ- Used for interpreting IQ scores.
• Mean = 100; SD = 15
*STEN- Standard ten.
• Mean = 5.5; SD = 2
*Graduate Record Exam (GRE) or Scholastic Aptitude Test (SAT)- Used for admission to graduate school and college.
• Mean = 500; SD = 100
(A conversion sketch follows.)
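A minimal sketch of converting one raw score into several of the standard scores listed above; the raw score, mean, and SD are hypothetical:

```python
# Illustrative only: standard-score conversions from a hypothetical raw score.
def z_score(raw: float, mean: float, sd: float) -> float:
    return (raw - mean) / sd           # 0 +/- 1 scale

def t_score(z: float) -> float:
    return 50 + 10 * z                 # 50 +/- 10 scale

def deviation_iq(z: float) -> float:
    return 100 + 15 * z                # 100 +/- 15 scale

z = z_score(raw=65, mean=50, sd=10)    # raw score of 65 on a test with M=50, SD=10
print(z, t_score(z), deviation_iq(z))  # 1.5, 65.0, 122.5
```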
=RELATIONSHIP BETWEEN STANDARD SCORES=

*CORRELATION AND INFERENCE- A correlation coefficient is a number that provides an index of the relationship between two things.
*CORRELATIONAL STATISTICS- Statistical tools for testing the relationships or associations between variables.
• The statistical tool of choice when the relationship between variables is linear and the variables being correlated are continuous.
• COVARIANCE- how much two scores vary together.
• CORRELATION COEFFICIENT- a mathematical index that describes the direction and magnitude of a relationship; it always ranges from -1.00 to +1.00. (A computational sketch of several coefficients appears after this section.)

=TYPES OF CORRELATION=
*PEARSON PRODUCT-MOMENT CORRELATION- Determines the degree of variation in one variable that can be estimated from knowledge about variation in the other variable.
• Correlates two variables in interval or ratio scale format.
• Devised by Karl Pearson.
*SPEARMAN RHO CORRELATION- A method of correlation for finding the association between two sets of ranks; thus, the two variables must be in ordinal scale.
• Frequently used when the sample size is small (fewer than 30 pairs of measurements).
• Also called the rank-order or rank-difference correlation coefficient.
• Devised by Charles Spearman.
*BISERIAL CORRELATION- Expresses the relationship between a continuous variable and an artificially dichotomous variable.
• For example, the relationship between passing or failing the bar exam (artificial dichotomy) and general weighted average (GPA) in law school (continuous variable).
*POINT-BISERIAL CORRELATION- Correlates one continuous and one truly dichotomous variable.
• For example, score on a test (continuous or interval) and correctness on an item within the test (true dichotomy).
*TRUE DICHOTOMY- There are only two possible categories, and they are formed naturally.
• For example: gender (M/F)
*ARTIFICIAL DICHOTOMY- Reflects an underlying continuous scale forced into a dichotomy; there are other possibilities within the category.
• For example: exam score (pass or fail)
*PHI COEFFICIENT- Correlates two dichotomous variables; at least one should be a true dichotomy.
• For example, gender and passing or failing the 2018 Physician Licensure Exam.
*TETRACHORIC COEFFICIENT- Correlates two dichotomous variables; both are artificial dichotomies.
• For example, passing or failing a test and being highly anxious or not.

=ISSUES IN THE USE OF CORRELATION=
*RESIDUAL- The difference between the predicted and the observed values.
*STANDARD ERROR OF ESTIMATE- The standard deviation of the residuals; a measure of accuracy of prediction.
*SHRINKAGE- The amount of decrease observed when a regression equation is created for one population and then applied to another.
*COEFFICIENT OF DETERMINATION (r²)- Tells the proportion of the total variation in scores on Y that we know as a function of information about X. It also gives the percentage of variance shared by two variables; the effect of one variable on the other.
*COEFFICIENT OF ALIENATION- Measures the non-association between two variables.
*RESTRICTED RANGE- Significant relationships are difficult to find if the variability is restricted.

=Essential Facts About Correlation=
• The degree of relationship between two variables is indicated by the number in the coefficient, whereas the direction of the relationship is indicated by the sign.
• Correlation, even if high, does not imply causation.
• High correlations allow us to make predictions.
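A minimal sketch of three of the coefficients above using scipy.stats; the data sets are small and hypothetical:

```python
# Illustrative only: Pearson, Spearman, and point-biserial correlations
# on hypothetical data.
from scipy import stats

study_hours = [1, 2, 3, 4, 5, 6, 7, 8]         # continuous (interval/ratio)
exam_scores = [52, 55, 61, 60, 70, 72, 75, 80]  # continuous

r, p_r = stats.pearsonr(study_hours, exam_scores)      # Pearson r: two continuous variables
rho, p_rho = stats.spearmanr(study_hours, exam_scores)  # Spearman rho: rank-order correlation

item_correct = [0, 0, 1, 0, 1, 1, 1, 1]  # true dichotomy (wrong/right on one item)
rpb, p_rpb = stats.pointbiserialr(item_correct, exam_scores)  # dichotomous vs. continuous

print(r, rho, rpb)
```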
*REGRESSION- Defined broadly as the analysis of relationships among variables for the purpose of understanding how one variable may predict another, through the use of linear regression.
• Predictor (X)- serves as the IV; causes changes in the other variable.
• Predicted (Y)- serves as the DV; the result of the change as the value of the predictor changes.
• Represented by the formula: Y = a + bX
• INTERCEPT (a)- the point at which the regression line crosses the Y axis.
• REGRESSION COEFFICIENT (b)- the slope of the regression line.
• REGRESSION LINE- the best-fitting straight line through a set of points in a scatterplot.
• STANDARD ERROR OF ESTIMATE- measures the accuracy of prediction.
(A regression sketch follows.)
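A minimal sketch of fitting the regression line Y = a + bX by least squares with scipy.stats.linregress; the predictor and criterion values are hypothetical:

```python
# Illustrative only: least-squares fit of Y = a + bX on hypothetical data.
from scipy import stats

X = [2, 4, 6, 8, 10]      # predictor (e.g., hours of review)
Y = [50, 58, 62, 71, 79]  # predicted/criterion (e.g., exam score)

fit = stats.linregress(X, Y)
a, b = fit.intercept, fit.slope  # intercept (a) and regression coefficient (b)

predicted_y = a + b * 12         # predict Y for a new X value
print(a, b, predicted_y)
print(fit.rvalue ** 2)           # r**2 = coefficient of determination
```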
*MULTIPLE REGRESSION ANALYSIS- A type of multivariate (three or more variables) analysis which finds the linear combination of variables that provides the best prediction.
• A statistical technique for predicting one variable from a series of predictors.
• Considers the intercorrelations among all the variables involved.
• Applicable only when all data are continuous.
*STANDARDIZED REGRESSION COEFFICIENTS- Also known as beta weights.
• Tell how much each variable in a given list of variables predicts a single variable.
*FACTOR ANALYSIS- Used to study the interrelationships among a set of variables without reference to a criterion.
• Factors- the underlying variables; also called principal components.
• Factor Loading- the correlation between the original items and the factors.
*META-ANALYSIS- The family of techniques used to statistically combine information across studies to produce single estimates of the data under study.
ADVANTAGES:
• Can be replicated.
• Conclusions tend to be more reliable and precise than conclusions from single studies.
• More focus on effect size than on statistical significance alone.
• Promotes evidence-based practice: professional practice based on clinical research findings.
• Effect Size- the estimate of the strength of a relationship or the size of differences. Typically expressed as a correlation coefficient.

=PARAMETRIC VS NONPARAMETRIC TESTS=
*PARAMETRIC- Assumptions are made about the population.
• Homogeneous data; normally distributed samples
• Mean and SD
• Randomly selected samples
*NONPARAMETRIC- Assumptions are made about the samples only.
• Heterogeneous data; skewed distributions
• Ordinal and categorical data
• Highly purposive sampling

=NONPARAMETRIC TESTS=
*MANN-WHITNEY U TEST
o Counterpart of the t-test for independent samples
o Ordinal data
o Assumption of a heterogeneous group
*WILCOXON SIGNED-RANK TEST
o Counterpart of the t-test for dependent samples
o Ordinal data
o Assumption of heterogeneous data
*KRUSKAL-WALLIS H TEST
o Counterpart of one-way ANOVA
o Ordinal data
o Assumption of a heterogeneous group
*FRIEDMAN TEST
o Counterpart of one-way repeated-measures ANOVA (dependent samples)
o Ordinal data
o Assumption of heterogeneous data

TOPIC 3: ESSENTIALS OF TEST SCORE INTERPRETATION (Of Tests and Testing)

=ASSUMPTIONS ABOUT PSYCHOLOGICAL TESTING AND MEASUREMENT=
*Assumption 1- Psychological Traits and States Exist
*TRAIT- Any distinguishable, relatively enduring way in which one individual varies from another.
• Psychological traits exist only as constructs: informed, scientific concepts developed or constructed to describe or explain behavior.
*STATES- Also distinguish one person from another but are relatively less enduring.
*Assumption 2- Psychological Traits and States Can Be Quantified and Measured
• Traits and states must be clearly defined to be measured accurately.
• Test developers and researchers, like other people in general, have many different ways of looking at and defining the same phenomenon.
• Once a trait is defined, the test developer considers the types of item content that would provide insight into it.
*Cumulative scoring- The assumption that the higher the test taker's score, the higher the test taker's standing on the targeted ability or trait.
*Assumption 3- Test-Related Behavior Predicts Non-Test-Related Behavior
• The tasks in some tests mimic the actual behaviors that the test user is attempting to understand.
• The obtained sample of behavior is typically used to make predictions about future behavior.
*Assumption 4- Tests and Other Measurement Techniques Have Strengths and Weaknesses
• Competent test users understand how a test was developed, the circumstances under which it is appropriate to administer it, how to administer it and to whom, and how the test results should be interpreted.
• Competent test users understand and appreciate the limitations of the tests they use.
*Assumption 5- Various Sources of Error Are Part of the Assessment Process
*Error- Traditionally refers to a component of the measurement process other than what the test attempts to measure.
• There is a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test.
*Error Variance- The component of a test score attributable to sources other than the trait or ability being measured.
*Classical Test Theory- The assumption is made that each test taker has a true score on a test that would be obtained but for the action of measurement error.
*Assumption 6- Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
• One source of fairness-related problems is the test user who attempts to use a particular test with people whose background and experience differ from those of the people for whom the test was intended.
• The fairness issue is often more political than psychometric.
*Assumption 7- Testing and Assessment Benefit Society
Without tests, there would be...
• Subjective personnel hiring processes
• Children with special needs assigned to certain classes by the gut feel of teachers and school administrators
• No way to meet the great need to diagnose educational difficulties
• No instruments to diagnose neuropsychological impairments
• No practical way for the military to screen thousands of recruits

=WHAT IS A GOOD TEST?=
Psychometric soundness:
*Reliability- Consistency in measurement.
• The precision with which the test measures, and the extent to which error is present in measurement.
• A perfectly reliable measuring tool consistently measures in the same way.
*Validity- A test is valid when it measures what it purports to measure.
• An intelligence test is a valid test if it measures intelligence; the same holds for personality tests and other psychological tests.
• Questions about a test's validity may focus on the items that collectively make up the test.
*Norms- The test performance data of a particular group of test takers that are designed for use as a reference when evaluating or interpreting individual test scores.
• Obtained by administering the test to a sample of people and obtaining the distribution of scores for that group.
*Normative Sample- The group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual test takers.
*Norming- The process of deriving norms.
• The term may be modified to describe a particular type of norm derivation.

=SAMPLING TO DEVELOP NORMS=
*Sample- A portion representative of the whole population.
• It could be as small as one person, though samples that approach the size of the population reduce the possible sources of error due to insufficient sample size.
*Sampling- The process of selecting the portion of the universe deemed to be representative of the whole population.

=Developing Norms for a Standardized Test=
• The test developer administers the test according to the standard set of instructions that will be used with the test.
• The test developer describes the recommended setting for giving the test.
• The test developer summarizes the data using descriptive statistics, including measures of central tendency and variability.
• The test developer provides a precise description of the standardization sample itself.
*Standardization- The process of administering a test to a representative sample of test takers for the purpose of establishing norms.
• A test is said to be standardized when it has clearly specified procedures for administration and scoring, typically including normative data.

=TYPES OF STANDARD ERROR=
*Standard Error of Measurement (SEM)- A statistic used to estimate the extent to which an observed score deviates from a true score.
*Standard Error of Estimate (SEE)- In regression, an estimate of the degree of error involved in predicting the value of one variable from another.
*Standard Error of the Mean (SEM)- A measure of sampling error.
*Standard Error of the Difference (SED)- A statistic used to estimate how large a difference between two scores should be before the difference is considered statistically significant.

=EVALUATION OF NORMS=
(The test manual should report exact figures: the exact number of people sampled and the specified demographics.)
*Norm-referenced Test- A score is interpreted by comparing it with the scores obtained by others on the same test.
• A method of evaluation and a way of deriving meaning from test scores by evaluating an individual's score with reference to the performance of a standardization group.
*Criterion-referenced Test- Uses a specified content domain rather than a specified population of people; the score is interpreted with reference to a set standard.
• Also known as content-referenced or domain-referenced.
• Criterion: the test developer predetermines the cut score.
=TYPES OF NORMS=
*Developmental Norms- Norms developed on the basis of any trait, ability, skill, or other characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life.
*Age Norms- The average performance of different samples of test takers who were at various ages at the time the test was administered.
*Grade Norms- Designed to indicate the average test performance of test takers in a given school grade.
*Ordinal Scales- Designed to identify the stage reached by the child in the development of specific behavior functions.
*Within-Group Norms- The individual's performance is evaluated in terms of the performance of the most nearly comparable standardization group.
*Percentiles- An expression of the percentage of people whose score on a test or measure falls below a particular score.
• Indicates the individual's relative position in the standardization sample.
(Percentile rank: your position in the entire rank. Example: Kyla placed in the 95th percentile, which means she scored higher than 95% of the people who also took the test.)
*Standard Scores- Derived scores which use as their unit the SD of the population upon which the test was standardized.
*Deviation IQ- A standard score on an intelligence test with a mean of 100 and an SD that approximates the SD of the Stanford-Binet IQ distribution.
*National Norms- Norms from a large-scale, nationally representative sample.
*Subgroup Norms- Norms segmented by any of the criteria initially used in selecting subjects for the sample.
*Local Norms- Provide normative information with respect to the local population's performance on some test.
*TRACKING- The tendency to stay at about the same level relative to one's peers.
• Staying at the same level of a characteristic as compared to the norms.
• This concept is applied to children when parents want to know whether the child is growing normally.

TOPIC 4: RELIABILITY

*Reliability- Refers to dependability or consistency in measurement.
• The proportion of the total variance attributed to true variance.
*Reliability Coefficient- An index of reliability; a proportion that indicates the ratio between the true score variance on a test and the total variance.

If we use X to represent an observed score, T to represent a true score, and E to represent error, then the fact that an observed score equals the true score plus error may be expressed as follows:
X = T + E

=Concepts in Reliability=
*Variance (σ²)- Useful in describing sources of test score variability; the standard deviation squared. This statistic is useful because it can be broken into components.
*True Variance- Variance from true differences.
*Error Variance- Variance from irrelevant, random sources.
*Measurement Error- Refers, collectively, to all of the factors associated with the process of measuring some variable other than the variable being measured.
• If σ² represents the total variance, σ²_T the true variance, and σ²_E the error variance, then the relationship of the variances can be expressed as:
σ² = σ²_T + σ²_E
• In this equation, the total variance in an observed distribution of test scores (σ²) equals the sum of the true variance (σ²_T) and the error variance (σ²_E). (A simulation sketch follows.)
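A minimal simulation sketch of the equations above: observed scores are built as X = T + E, and the ratio of true variance to total variance recovers the reliability. All distribution parameters are assumed for illustration:

```python
# Illustrative only: X = T + E with hypothetical true scores and random error.
import random

random.seed(1)
true_scores = [random.gauss(100, 15) for _ in range(10_000)]  # T
errors      = [random.gauss(0, 5) for _ in range(10_000)]     # E, mean 0
observed    = [t + e for t, e in zip(true_scores, errors)]    # X = T + E

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# reliability = true variance / total variance
# here approximately 15**2 / (15**2 + 5**2) = 0.90
print(variance(true_scores) / variance(observed))
```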
=Types of Measurement Error=
*Random Error- A source of error in measuring a targeted variable, caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process.
*Systematic Error- A source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.

=Theory of Reliability=
Puts sampling error and correlation together in the context of measurement.
*Test Score Theory: Classical Test Theory (CTT)- Known as the true score model of measurement.
• X = T + E, where X is the observed score, T is the true score, and E is the error.
• Standard Error of Measurement: the standard deviation of errors, based on the assumption that the distribution of random error is the same for all people.
• Widely used in the assessment of reliability; it has been around for over 100 years.
• Disadvantage: requires that exactly the same test items be administered to each person.
• Errors of measurement are random.
• Each person has a true score that would be obtained if there were no error in measurement.
• The true score for an individual will not change with repeated applications of the same test.
*Item Response Theory (IRT)- Provides a way to model the probability that a person with ability X will be able to perform at level Y. (A minimal sketch appears after this section.)
• Latent-Trait Theory: a synonym for IRT in the academic literature, because the psychological or educational construct being measured is often physically unobservable and may be a trait.
• The computer is used to focus on the range of item difficulty that helps assess an individual's ability level.
• For example, if the test taker gets several easy items correct, the computer might quickly move to difficult items. If the person gets several difficult items wrong, the computer moves back to the area of item difficulty where the person gets some items right and some wrong.
• Requires a bank of items that have been systematically evaluated for level of difficulty.
• (The more test takers who answer an item correctly, the easier the item; the fewer who answer it correctly, the more difficult the item.)
*Domain Sampling Model- The greater the number of items, the higher the reliability.
• Considers the problems created by using a limited number of items to represent a larger and more complicated construct.
• Conceptualizes reliability as the ratio of the variance of the observed score on the test to the variance of the long-run true score.
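The notes describe IRT only in general terms and do not name a specific model; the sketch below assumes the simplest case, the one-parameter (Rasch) logistic model, to show how the probability of a correct response depends on the gap between a person's ability and an item's difficulty. The theta and difficulty values are hypothetical:

```python
# Illustrative only: a one-parameter (Rasch) IRT model, assumed for this sketch.
import math

def p_correct(theta: float, difficulty: float) -> float:
    """Probability of a correct response given ability (theta) and item difficulty (b)."""
    return 1 / (1 + math.exp(-(theta - difficulty)))

print(p_correct(theta=1.0, difficulty=-1.0))  # able person, easy item   -> ~0.88
print(p_correct(theta=1.0, difficulty=1.0))   # ability matches difficulty -> 0.50
print(p_correct(theta=-1.0, difficulty=1.0))  # hard item for this person  -> ~0.12
```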
=Sources of Error Variance=
*Test Construction- Item sampling or content sampling is a source of variance during test construction, owing to variation among items within a test as well as variation among items between tests.
*Test Administration- Sources of error variance that occur during test administration may influence the test taker's attention or motivation.
• The test taker's reactions to those influences are the source of one kind of error variance (the test environment, for instance).
• Test taker variables: emotional problems, physical discomfort, lack of sleep, effects of drugs or medication, etc.
• Examiner-related variables: physical appearance and demeanor, or the mere presence or absence of an examiner.
*Test Scoring and Interpretation- Not all tests can be scored by computer, such as tests administered individually.
• A test may employ objective-type items amenable to computer scoring, yet a technical glitch may contaminate the data.
• If subjectivity is involved in scoring, the scorer or rater can be a source of error variance.

=Models/Estimates of Reliability=
1. TEST-RETEST RELIABILITY- Used to evaluate the error associated with administering a test at two different times; the scores from the two administrations are correlated.
• Should be used when measuring traits or characteristics that do not change over time (static attributes).
• Also known as the coefficient of stability.
*Carryover Effect- Occurs when the first testing session influences scores from the second session.
*Practice Effect- A type of carryover effect wherein skills improve with practice.
• Scores on the second administration are usually higher than they were on the first; the changes are not constant across the group.
• If a test manual reports a test-retest correlation, always pay attention to the interval between the two testing sessions.
• Poor test-retest correlations do not always mean that a test is unreliable; they could mean that the characteristic under study has changed over time.

2. PARALLEL- AND ALTERNATE-FORMS METHOD- Yields the coefficient of equivalence.
• Addresses item-sampling error.
• Also uses correlation.
• Disadvantages: it is burdensome to develop two forms of the same test, and practical constraints include retesting the same group of individuals.
*Parallel-Forms Reliability- Compares two equivalent forms of a test that measure the same attribute.
• The two forms use different items; however, the rules used to select items of a particular difficulty level are the same.
• Theoretically, scores obtained on parallel forms correlate equally with the true score or with other measures.
• An estimate of the extent to which item sampling and other errors have affected scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
*Alternate-Forms Reliability- Different versions of a test that have been constructed to be parallel.
• Designed to be equivalent with respect to variables such as content and level of difficulty.
• Refers to the extent to which these different forms of the same test have been affected by item sampling error or other error.
• Like parallel forms, but simply a different version: a secondary form with the same questions and same difficulty but a different presentation (as for a board exam).

3. INTERNAL CONSISTENCY- Estimates how consistent the test items are with one another.
• The degree of correlation among all the items on a scale.

=Ways to Measure Internal Consistency=
*Split-Half Reliability- A test is given and divided into halves that are scored separately; the results of one half are then compared with the results of the other. Only one administration is required.
• The split should equalize the difficulty of the two halves.
• A useful measure of reliability when it is impractical or undesirable to use two tests or to administer a test twice; the odd-even system is commonly used for the split.
• Three steps:
o Divide the test into equivalent halves;
o Calculate the Pearson r between scores on the two halves of the test;
o Adjust the half-test reliability using the Spearman-Brown formula: corrected r = 2r / (1 + r).
*Spearman-Brown Formula- Estimates what the correlation would have been if each half had been the length of the whole test.
• It increases the estimate of reliability.
• Can be used to estimate the effect of shortening a test (reducing the number of items) on the test's reliability.
• Can also be used to determine the number of items needed to attain a desired level of reliability; the new items must be equivalent in content and difficulty so that the lengthened test still measures what the original test measured.
(It estimates the correlation as if each half-test were as long as the whole test, and it can show how many items may be removed before the reliability coefficient falls out of the acceptable range for a newly developed test. It answers the weakness of the split-half method.)
*Kuder-Richardson Formulas:
*KR-20- Used for a test in which items are dichotomous, i.e., can be scored right or wrong (there is a correct answer); homogeneous items with unequal difficulty.
• The 20th formula developed in the series, born of G. Frederic Kuder and M. W. Richardson's dissatisfaction with the split-half methods.
*KR-21- For items that have equal difficulty, or an average difficulty level of 50%.
• Cannot be applied to personality tests.
*Coefficient Alpha (α)- Can be thought of as the mean of all possible split-half correlations, corrected by the Spearman-Brown formula; thus providing the lowest estimate of reliability that one can expect.
• Can be used when the two halves of a test have unequal variances.
• Appropriate for tests that have non-dichotomous items (no correct and incorrect answers), such as personality tests.
• The most preferred statistic for obtaining an estimate of internal consistency reliability.
• Best used with personality tests.
• A widely used measure of reliability, in part because it requires only one administration of the test.
• Values typically range from 0 to +1 only.
*COEFFICIENT OMEGA- The extent to which all items measure the same underlying trait.
• Overcomes a weakness of coefficient alpha: alpha can be greater than zero even when the items are not assessing the same trait.
*Average Proportional Distance (APD)- Evaluates the internal consistency of a test by focusing on the degree of difference that exists between item scores.
• General rule: an obtained value of .2 or lower indicates excellent internal consistency; values from .25 to .2 are in the acceptable range.
• A relatively new measure developed by Sturman et al.
(A computational sketch of split-half reliability, the Spearman-Brown correction, and coefficient alpha follows.)
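A minimal computational sketch of split-half reliability with the Spearman-Brown correction and coefficient alpha, on a hypothetical 4-item test scored for six test takers (rows = people, columns = items):

```python
# Illustrative only: split-half + Spearman-Brown and coefficient alpha
# on hypothetical item scores.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (variance(x) ** 0.5 * variance(y) ** 0.5)

data = [[3, 4, 3, 5], [2, 2, 3, 2], [4, 5, 4, 4],
        [1, 2, 1, 2], [5, 4, 5, 5], [3, 3, 2, 3]]

odd  = [row[0] + row[2] for row in data]      # odd-item half scores
even = [row[1] + row[3] for row in data]      # even-item half scores
r_half = pearson_r(odd, even)
spearman_brown = (2 * r_half) / (1 + r_half)  # corrected full-test estimate

k = 4                                         # number of items
item_vars = [variance([row[i] for row in data]) for i in range(k)]
total_var = variance([sum(row) for row in data])
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)  # coefficient alpha

print(r_half, spearman_brown, alpha)
```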
4. INTER-SCORER (RATER) RELIABILITY- The degree of agreement or consistency between two or more scorers (raters or judges) with regard to a particular measure.
• Often used to code nonverbal behavior.
• Coefficient of Inter-Scorer Reliability: the degree of consistency among scorers.
*Kappa Statistics- The best method to assess the level of agreement among several observers.
*Cohen's kappa- For a small number of raters or scorers (two raters).
*Fleiss' kappa- Measures the agreement among a fixed number of raters (more than two).
(It is the reliability of the ratings — the performance and scores of a group — that we are measuring, not the test itself: how far the raters are in agreement about the performance, etc.)
*Kappa- Indicates the actual agreement as a proportion of the potential agreement following a correction for chance agreement.

POINTS TO REMEMBER:
• Indices of reliability provide an index that is characteristic of a particular group of test scores, not of the test itself (Caruso, 2000; Yin & Fan, 2000).
• Measures of reliability are estimates, and estimates are subject to error.
• The precise amount of error inherent in a reliability estimate will vary with factors such as the sample of test takers from which the data were drawn.
• A reliability index published in a test manual might be impressive; remember that the reported reliability was achieved with a particular group of test takers.

HOW RELIABLE IS RELIABLE?
• Usually depends on the use of the test.
• Reliability estimates in the range of .70 to .80 are good enough for most purposes of basic research.
• Estimates of .95 are not very useful because they may suggest that all of the items are essentially measuring the same thing and that the measure could easily be shortened.

=SOLVING LOW RELIABILITY=
*Increase the Number of Test Items
• The larger the sample of items, the more likely the test will represent the true characteristic.
• In the domain-sampling model, the reliability of a test increases as the number of items increases.
• The general Spearman-Brown formula can be used to find the length needed to achieve a desired level of reliability.
*Factor Analysis
• Unidimensionality makes a test more reliable; thus, one factor should account for considerably more of the variance than any other factor.
• Remove items that do not load on the one factor being measured.
*Item Analysis
• The correlation between each item and the total score for the test; often called discriminability analysis.
• When the correlation between performance on a single item and the total test score is low, the item is probably measuring something different from the other items on the test.
*Correction for Attenuation
• Used to estimate the exact correlation between variables when the observed correlation is deemed affected by measurement error.
• The estimation of what the correlation between tests would have been had there been no error in measurement. (A sketch follows.)
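A minimal sketch of the correction for attenuation; the observed correlation and the two reliabilities below are hypothetical:

```python
# Illustrative only: correction for attenuation with hypothetical values.
def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """r_corrected = r_xy / sqrt(r_xx * r_yy), where r_xx and r_yy are the
    reliabilities of the two measures."""
    return r_xy / (r_xx * r_yy) ** 0.5

# observed r = .40 between tests with reliabilities .70 and .80
print(correct_for_attenuation(0.40, 0.70, 0.80))  # ~0.53
```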
=NATURE OF THE TEST=
*Homogeneous or Heterogeneous Test Items
• HOMOGENEOUS: items that are functionally uniform throughout. The test is designed to measure one factor, such as one ability or one trait, which yields a high degree of internal consistency.
• HETEROGENEOUS: more than one factor is being measured; thus, internal consistency might be low relative to a more appropriate estimate of test-retest reliability.
*Dynamic or Static Characteristic, Ability, or Trait
• DYNAMIC: a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.
• STATIC: a characteristic, ability, or trait that is stable; hence, test-retest and alternate-forms methods would be appropriate.
*Range of Test Scores Is or Is Not Restricted
• Important to consider when interpreting a coefficient of reliability.
• If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower.
• If the variance is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.
*Speed Test or Power Test
• SPEED TESTS: generally contain items of a uniform (typically low) level of difficulty, so that when given generous time limits, test takers should be able to complete all the items correctly (administered with a time limit).
• POWER TESTS: the time limit is long enough to attempt all items, but some items are so difficult that no test taker is able to attain a perfect score (effectively no time pressure).
*Test Is or Is Not Criterion-Referenced
• Designed to provide an indication of where a test taker stands with respect to some variable or criterion.
• Contains material that has to be mastered in hierarchical fashion; for example, tracing a letter pattern before attempting to master writing.
• Scores on these measures tend to be interpreted in pass-fail terms, and any scrutiny of performance on individual items tends to be for diagnostic and remedial purposes.

=OTHER CONCEPTS TO REMEMBER=
*STANDARD ERROR OF MEASUREMENT (SEM)
• A tool used to estimate or infer the extent to which a test score deviates from a true score.
• Provides an estimate of the amount of error inherent in an observed score or measurement.
• An index of the extent to which one individual's scores vary over tests presumed to be parallel.
• Confidence Interval: the range or band of test scores that is likely to contain the true score.
*STANDARD ERROR OF THE DIFFERENCE (SED)
• Used to determine how large the difference between two scores should be before it is considered statistically significant.
• Larger than the SEM for either score alone, because it is affected by the measurement error in both scores.
• If two scores each contain error such that in each case the true score could be higher or lower, then we would want the two scores to be further apart before we conclude that there is a significant difference between them.
*Don't forget:
• The relationship between the SEM and the reliability of a test is inverse: the higher the reliability of a test (or individual subtest within a test), the lower the SEM. (See the sketch below.)
• In accordance with the true score model, an obtained test score represents one point in the theoretical distribution of scores the test taker could have obtained.
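A minimal sketch tying the SEM to reliability, assuming the standard formula SEM = SD x sqrt(1 - reliability), and building a 95% confidence band around an observed score; all values are hypothetical:

```python
# Illustrative only: SEM and a 95% confidence interval for a hypothetical score.
def sem(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability); note the inverse relationship
    with reliability described above."""
    return sd * (1 - reliability) ** 0.5

observed, sd, rxx = 110, 15, 0.89
e = sem(sd, rxx)                                      # ~4.97
print(e, (observed - 1.96 * e, observed + 1.96 * e))  # 95% band around 110
```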
TOPIC 5: VALIDITY
*VALIDITY- A judgment or estimate of how well a test measures what it purports to measure in a particular context.
*Validation- the process of gathering and evaluating evidence about validity.
*LOCAL VALIDATION STUDIES- necessary when the test user plans to alter in some way the format, instructions, language, or content of the test. •Also necessary if a test user seeks to use a test with a population of test takers that differs in some significant way from the population on which the test was standardized. (*Local Validation- no one has the right to translate a psychological test into a language a group of people understands and administer it to that group without the translated version first undergoing local validation.)
=ASPECTS/ TRINITARIAN MODELS OF VALIDITY=
*FACE VALIDITY- A judgment concerning how relevant the test items appear to be. • Makes the test taker take the test seriously. • Even if a test lacks face validity, it may still be relevant and useful. (A mathematics test that contains subject-verb agreement items would lack face validity.)
*CONTENT-RELATED VALIDITY- Adequacy of representation of the conceptual domain the test is designed to cover. • It describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample. • Requires subject matter experts to ensure all items are valid. (How sufficiently the test represents the domain that should be measured; everything covered in instruction is covered in the test.)
*CONSTRUCT UNDERREPRESENTATION- Describes the failure to capture important components of a construct. • For example, an English proficiency exam that did not cover the test takers' knowledge of the parts of speech.
*CONSTRUCT-IRRELEVANT VARIANCE- Occurs when scores are influenced by factors irrelevant to the construct. • For example, test anxiety, physical condition, etc.
*Quantification of Content Validity (Lawshe, 1975) • Essential (Accepted) • Useful but not essential (Revise) • Not necessary (Rejected)
*CRITERION-RELATED VALIDITY- A judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest, the measure of interest being the criterion.
*Criterion- the standard against which the test or test score is compared or evaluated. It can be a test score, behavior, amount of time, rating, psychiatric diagnosis, training cost, index of absenteeism, etc. • For example, a test might be used to predict which engaged couples will have successful marriages and which ones will get annulled. Marital success is the criterion, but it cannot be known at the time the couples take the premarital test.
=Characteristics of a Criterion=
*RELEVANT- must be pertinent or applicable to the matter at hand. (The criterion you have is the criterion needed.)
*VALID- should be adequate for the purpose for which it is being used. If one test is being used as the criterion to validate a second test, then evidence should exist that the first test is valid.
*UNCONTAMINATED- the criterion is not affected by other criterion measures; otherwise, criterion contamination occurs.
=Considerations in Predictive Validity=
*BASE RATE- the extent to which a particular trait, behavior, characteristic, or attribute exists in the population (existence of the trait).
*HIT RATE- the proportion of people a test accurately identifies as possessing or exhibiting a particular trait, behavior, characteristic, or attribute (presence of the trait or behavior).
*MISS RATE- the proportion of people the test fails to identify as having, or not having, a particular trait, characteristic, or attribute; an inaccurate prediction.
*False Positive- a miss in which a test taker was predicted to have the attribute being measured when in fact he or she did not (predicted present, actually absent).
*False Negative- a miss in which a test taker was not predicted to have the attribute being measured when in fact he or she did (predicted absent, actually present).
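The base rate, hit rate, and miss rate can all be computed from a 2x2 table of predictions against actual standing on the criterion. A minimal sketch with hypothetical counts:

# Hypothetical validation-study counts (not from any real test)
true_pos, false_pos = 40, 10     # predicted to have the trait
false_neg, true_neg = 15, 135    # predicted not to have the trait

n = true_pos + false_pos + false_neg + true_neg
base_rate = (true_pos + false_neg) / n   # trait actually present
hit_rate = (true_pos + true_neg) / n     # accurate identifications
miss_rate = (false_pos + false_neg) / n  # false positives + false negatives
print(base_rate, hit_rate, miss_rate)    # 0.275 0.875 0.125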
=Types of Criterion Validity=
*CONCURRENT VALIDITY- Comes from assessment of the simultaneous relationship between the test and the criterion. Indicates the extent to which test scores may be used to estimate an individual's present standing on a criterion. • For example, work samples, or scores on the Beck Depression Inventory (BDI) and the BDI-II.
*PREDICTIVE VALIDITY- How accurately scores on the test predict some criterion measure. • For example, the relationship between scores on a College Admission Test (CAT) and the GPA of freshmen provides evidence of the predictive validity of the admission test. • Similarly, the NMAT is used to predict GPA in med school, and the PhiLSAT to predict GPA in law school.
*Incremental Validity- the added value of the test or of the criterion. • The degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use. • The predictive power of the test to discover something beyond what it is intended to measure (the extra mile the test can offer over what is already available).
*Validity Coefficient- The relationship between a test score and a score on the criterion measure; it is the computed correlation coefficient. • There are no hard-and-fast rules about how large the validity coefficient must be. • .60 or larger is rare; .30 to .40 is commonly considered high. (The validity coefficient belongs to criterion-related validity.)
=EVALUATING VALIDITY COEFFICIENT=
• Look for Changes in the Cause of Relationships
• Identify if the Criterion is Valid and Reliable o Criterion validity studies would mean nothing if the criterion itself is not valid or reliable.
• Review the Subject Population in the Validity Study o The validity study might have been done on a population that does not represent the group to which inferences will be made.
• Be Sure the Sample Size Was Adequate o A good validity study will present evidence for cross-validation; hence, the sample size should be large enough. o A cross-validation study assesses how well the test actually forecasts performance for an independent group of subjects.
• Never Confuse the Criterion with the Predictor o The criterion is the standard or desired outcome being measured, while the predictor is a variable used to forecast the criterion.
• Check for Restricted Range on Both Predictor and Criterion o A restricted range occurs when all scores fall very close together.
• Review Evidence of Validity Generalization o The findings obtained in one situation may be applied to other situations.
• Consider Differential Prediction o Predictive relationships may not be the same for all demographic groups.
=Other Concepts=
*EXPECTANCY DATA- provide information that can be used in evaluating the criterion-related validity of a test.
*EXPECTANCY TABLE- shows the percentage of people within specified test-score intervals who were subsequently placed in various categories of the criterion (e.g., passed or failed); see the sketch below.
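A minimal sketch of building an expectancy table, assuming hypothetical admission-test scores and a pass/fail criterion; the data and score bands are invented for illustration:

import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(500, 100, 200)               # test scores
passed = scores + rng.normal(0, 80, 200) > 480   # criterion outcome

for lo, hi in [(200, 400), (400, 500), (500, 600), (600, 800)]:
    band = (scores >= lo) & (scores < hi)
    if band.any():
        # percentage within this score interval who passed the criterion
        print(f"{lo}-{hi}: {100 * passed[band].mean():.0f}% passed")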
*CONSTRUCT VALIDITY • Established through a series of activities in which a researcher simultaneously defines some construct and develops the instrumentation to measure it. • A judgment about the appropriateness of inferences drawn from test scores regarding individual standing on a variable. • Viewed as the unifying concept for all validity evidence.
*CONSTRUCT- a scientific idea developed or hypothesized to describe or explain behavior; something built by mental synthesis. Examples are intelligence, motivation, job satisfaction, self-esteem, etc.
*CONSTRUCT VALIDATION- assembling evidence about what the test means; done by showing the relationship between a test and other tests and measures.
=Main Types of Construct Validity Evidence=
*CONVERGENT VALIDITY- When a measure correlates well with other tests (standardized, published, etc.) that are designed to measure a similar construct. • Yields a moderate to high correlation coefficient when the scores on the constructed test are correlated with the scores on an established test that measures the same construct. • Can be obtained by: showing that a test measures the same thing as other tests used for the same purpose; and demonstrating specific relationships that can be expected if the test is really doing its job.
*DISCRIMINANT (Divergent) VALIDITY- Proof that the test measures something unique. • Indicates that the measure does not represent a construct other than the one for which it was devised. • Obtained by correlating a test with another measure that should have little to no relationship with it; hence, the correlation coefficient should be low to zero.
=Other Evidences of Construct Validity=
*Evidence of Homogeneity- how uniform a test is in measuring a single concept. (Homogeneity- the normative sample shares the same characteristics.)
*Evidence of Changes with Age- some constructs are expected to change over time.
*Evidence of Pretest-Posttest Changes- changes in scores as a result of intervening experiences.
*Evidence from Distinct Groups (or contrasted groups)- scores on a test vary in a predictable way as a function of membership in some group.
*FACTOR ANALYSIS- A method of finding the minimum number of dimensions or factors that explain the largest number of variables.
*EXPLORATORY FACTOR ANALYSIS- entails estimating, or extracting, factors; deciding how many factors to retain; and rotating factors to an interpretable orientation.
*CONFIRMATORY FACTOR ANALYSIS- tests the degree to which a hypothetical model, which includes factors, fits the actual data.
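A minimal sketch of exploratory factor analysis on simulated item scores, using scikit-learn's FactorAnalysis purely for illustration; the data and the single-factor structure are assumptions, not a recipe for any particular test:

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
factor = rng.normal(size=(300, 1))                  # one latent trait
items = factor @ np.ones((1, 5)) + rng.normal(0, 0.5, (300, 5))
odd_item = rng.normal(size=(300, 1))                # shares no common variance
data = np.hstack([items, odd_item])

fa = FactorAnalysis(n_components=1).fit(data)
# The sixth loading should be near zero: an item that does not load on
# the one factor being measured is a candidate for removal.
print(fa.components_.round(2))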
=Other Issues in Validity=
*RATING ERROR- a judgment resulting from the intentional or unintentional misuse of a rating scale.
*Leniency or Generosity Error- the rater's tendency to be lenient in scoring, marking, or grading (the scores given are inflated).
*Central Tendency Error- the rater's reluctance to give scores at either the positive or negative extremes (the scorer has difficulty committing to high or low scores).
*Severity Error- the rater gives low scores regardless of the performance (for example, a rater who applies the rubric very strictly).
*Halo Effect- the rater gives high scores due to his or her failure to discriminate among conceptually distinct and independent aspects of behavior (the rater is swayed by positive attributes: the work is not good, but a high score is given because the ratee is familiar or likable).
=Relationship Between Reliability and Validity=
A test can be reliable even if it is not valid, but a test cannot be valid if it is not reliable.
TOPIC 6: TEST DEVELOPMENT
*Test development- an umbrella term for all that goes into the process of creating a test. The process of developing a test occurs in five stages: 1. test conceptualization; 2. test construction; 3. test tryout; 4. item analysis; 5. test revision.
Once the idea for a test is conceived (test conceptualization), test construction begins.
*Test construction- a stage in the process of test development that entails writing test items (or rewriting or revising existing items), as well as formatting items, setting scoring rules, and otherwise designing and building a test.
Once a preliminary form of the test has been developed, it is administered to a representative sample of test takers under conditions that simulate the conditions under which the final version of the test will be administered (test tryout).
*Item analysis- statistical procedures employed to assist in making judgments about which items are good as they are, which items need to be revised, and which items should be discarded.
*Test revision- action taken to modify a test's content or format for the purpose of improving the test's effectiveness as a tool of measurement.
=Test Conceptualization=
*Pilot work, pilot study, and pilot research- refer, in general, to the preliminary research surrounding the creation of a prototype of the test. Test items may be pilot studied (or piloted) to evaluate whether they should be included in the final form of the instrument.
• Norm-referenced tests compare individual performance with the performance of a group. Criterion-referenced assessments measure how well a student has mastered a specific learning goal (or objective).
*Asexuality- may be defined as a sexual orientation characterized by a long-term lack of interest in a sexual relationship with anyone or anything (an example of an emerging construct for which a new test might be conceptualized).
=Test construction=
*Scaling- the process of setting rules for assigning numbers in measurement. Stated another way, scaling is the process by which a measuring device is designed and calibrated and by which numbers (or other indices), the scale values, are assigned to different amounts of the trait, attribute, or characteristic being measured.
*Age-based scale- used if the testtaker's test performance as a function of age is of critical interest.
*Grade-based scale- used if the testtaker's test performance as a function of grade is of critical interest.
*Stanine scale- used if all raw scores on the test are to be transformed into scores that can range from 1 to 9.
• A scale might be described in still other ways: for example, it may be categorized as unidimensional as opposed to multidimensional, or as comparative as opposed to categorical.
=Scaling methods=
1. Rating scale- a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker. Rating scales can be used to record judgments of oneself, others, experiences, or objects, and they can take several forms.
*Summative scale- the final test score is obtained by summing the ratings across all the items.
*Likert scale- one type of summative rating scale, used extensively in psychology, usually to scale attitudes. Likert scales are relatively easy to construct. Each item presents the testtaker with five alternative responses (sometimes seven), usually on an agree-disagree or approve-disapprove continuum. (See the scoring sketch below.)
• Rating scales differ in the number of dimensions underlying the ratings being made: Unidimensional- only one dimension is presumed to underlie the ratings; Multidimensional- more than one dimension is thought to guide the testtaker's responses.
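A minimal sketch of summative (Likert) scoring with one hypothetical reverse-keyed item; responses run from 1 (strongly disagree) to 5 (strongly agree):

import numpy as np

responses = np.array([[4, 5, 2, 4],    # one row per respondent
                      [2, 1, 5, 2]])
reverse_keyed = [2]                    # negatively worded item (column 2)
responses[:, reverse_keyed] = 6 - responses[:, reverse_keyed]
print(responses.sum(axis=1))           # summative scale scores: [17 6]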
*Method of paired comparisons- another scaling method that produces ordinal data. Testtakers are presented with pairs of stimuli (two photographs, two objects, two statements), which they are asked to compare.
2. Sorting Scale- stimuli such as printed cards, drawings, photographs, or other objects are typically presented to testtakers for evaluation.
*Comparative scaling- one method of sorting; entails judgments of a stimulus in comparison with every other stimulus on the scale. Testtakers would be asked to sort the cards from most justifiable to least justifiable. Comparative scaling could also be accomplished by providing testtakers with a list of 30 items on a sheet of paper and asking them to rank the justifiability of the items from 1 to 30.
*Categorical scaling- stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum. Testtakers might be given 30 index cards, on each of which is printed one of the 30 items, and asked to sort the cards into three piles: behaviors that are never justified, those that are sometimes justified, and those that are always justified.
3. Guttman scale- items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured. A feature of Guttman scales is that all respondents who agree with the stronger statements of the attitude will also agree with the milder statements. • If this were a perfect Guttman scale, then all respondents who agree with "a" (the most extreme position) should also agree with "b," "c," and "d." All respondents who disagree with "a" but agree with "b" should also agree with "c" and "d," and so forth.
*Scalogram analysis- an item-analysis procedure and approach to test development that involves a graphic mapping of a testtaker's responses.
*Thurstone's equal-appearing intervals method- one scaling method used to obtain data that are presumed to be interval in nature. • It is an example of a scaling method of the direct estimation variety: in contrast to methods that involve indirect estimation, there is no need to transform the testtaker's responses into some other scale.
=Writing Items=
*Item pool- the reservoir or well from which items will or will not be drawn for the final version of the test. Multiply the number of items required in the pool for one form of the test by the number of forms planned, and you have the total number of items needed for the initial item pool.
*Item format- variables such as the form, plan, structure, arrangement, and layout of individual test items.
*Selected-response format- requires testtakers to select a response from a set of alternative responses.
*Constructed-response format- requires testtakers to supply or create the correct answer, not merely to select it.
=Types of selected-response item formats=
*Multiple-choice format- has three elements: (1) a stem, (2) a correct alternative or option, and (3) several incorrect alternatives or options variously referred to as distractors or foils.
*Matching item- the testtaker is presented with two columns: premises on the left and responses on the right.
*Binary-choice item- a multiple-choice item that contains only two possible responses. Perhaps the most familiar binary-choice item is the true-false item.
=Three types of constructed-response items=
*Completion item- requires the examinee to provide a word or phrase that completes a sentence, as in the following example: "The standard deviation is generally considered the most useful measure of __________."
(A completion item is also referred to as a short-answer item.)
*Short-answer item- it is desirable for completion or short-answer items to be written clearly enough that the testtaker can respond succinctly, that is, with a short answer.
*Essay item- a test item that requires the testtaker to respond to a question by writing a composition, typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation. An essay item is useful when the test developer wants the examinee to demonstrate a depth of knowledge about a single topic.
=Writing items for computer administration=
*Item bank- a relatively large and easily accessible collection of test questions. Instructors who regularly teach a particular course sometimes create their own item bank of questions that they have found useful on examinations.
*Computerized adaptive testing (CAT)- an interactive, computer-administered test-taking process wherein the items presented to the testtaker are based in part on the testtaker's performance on previous items. CAT tends to reduce floor effects and ceiling effects.
*Floor effect- the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured.
*Ceiling effect- the diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, or other attribute being measured.
*Item branching- the ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items.
=Scoring Items=
*Class scoring (also referred to as category scoring)- testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way. This approach is used by some diagnostic systems wherein individuals must exhibit a certain number of symptoms to qualify for a specific diagnosis.
*Ipsative scoring- comparing a testtaker's score on one scale within a test to another scale within that same test.
Test Tryout
• The test should be tried out on people who are similar in critical respects to the people for whom the test was designed. For example, if a test is designed to aid in decisions regarding the selection of corporate employees with management potential at a certain level, it would be appropriate to try out the test on corporate employees at the targeted level.
• An informal rule of thumb is that there should be no fewer than 5 subjects, and preferably as many as 10, for each item on the test. In general, the more subjects in the tryout the better.
(*Pseudobulbar affect (PBA)- a neurological disorder characterized by frequent and involuntary outbursts of laughing or crying that may or may not be appropriate to the situation.)
What Is a Good Item?
=Item Analysis=
1. The Item-Difficulty Index- calculated as the proportion of the total number of testtakers who answered the item correctly (see the sketch below). • A lowercase italic "p" is used to denote item difficulty, and a subscript refers to the item number (so p1 is read "item-difficulty index for item 1"). • The larger the item-difficulty index, the easier the item: because p refers to the percentage of people passing an item, the higher the p for an item, the easier the item. • The statistic referred to as an item-difficulty index in the context of achievement testing may be called an item-endorsement index in other contexts, such as personality testing.
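A minimal sketch of the item-difficulty index, assuming a hypothetical 0/1-scored response matrix (rows are testtakers, columns are items):

import numpy as np

responses = np.array([[1, 1, 0],
                      [1, 0, 0],
                      [1, 1, 1],
                      [0, 1, 0]])
p = responses.mean(axis=0)     # proportion answering each item correctly
print(p)                       # [0.75 0.75 0.25]; higher p = easier item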
2. The Item-Reliability Index- provides an indication of the internal consistency of a test. • The higher this index, the greater the test's internal consistency. • This index is equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score.
*Factor analysis and inter-item consistency- a statistical tool useful in determining whether items on a test appear to be measuring the same thing(s).
3. The Item-Validity Index- a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure. • The higher the item-validity index, the greater the test's criterion-related validity. • The item-validity index can be calculated once two statistics are known: the item-score standard deviation, and the correlation between the item score and the criterion score. The item-score standard deviation of item 1 (denoted by the symbol s1) can be calculated from the item's difficulty (p1) using the following formula: s1 = √[p1(1 − p1)]
4. The Item-Discrimination Index- a measure of item discrimination, symbolized by a lowercase italic "d." This estimate of item discrimination, in essence, compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores. • The item-discrimination index is a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly. • The higher the value of d, the greater the number of high scorers answering the item correctly.
*Item-characteristic curve- a graphic representation of item difficulty and discrimination.
=Other Considerations in Item Analysis=
*Guessing- a problem in item analysis that has eluded any universally acceptable solution. Three points must be kept in mind:
1. A correction for guessing must recognize that, when a respondent guesses at an answer on an achievement test, the guess is not typically made on a totally random basis. It is more reasonable to assume that the testtaker's guess is based on some knowledge of the subject matter and the ability to rule out one or more of the distractor alternatives.
2. A correction for guessing must also deal with the problem of omitted items. Sometimes, instead of guessing, the testtaker will simply omit a response to an item.
3. Just as some people may be luckier than others in front of a Las Vegas slot machine, some testtakers may be luckier than others in guessing the choices that are keyed correct. Any correction for guessing may seriously underestimate or overestimate the effects of guessing for lucky and unlucky testtakers.
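A minimal sketch pulling together the indices defined above (item difficulty p, item-score standard deviation s, and the discrimination index d based on upper and lower 27% groups); the response matrix is simulated for illustration:

import numpy as np

rng = np.random.default_rng(2)
responses = (rng.random((100, 10)) < 0.6).astype(int)   # 0/1 item scores
total = responses.sum(axis=1)

p = responses.mean(axis=0)              # item-difficulty index
s = np.sqrt(p * (1 - p))                # item-score standard deviation

cut = int(0.27 * len(total))            # upper vs. lower 27% on total score
order = np.argsort(total)
low, high = responses[order[:cut]], responses[order[-cut:]]
d = high.mean(axis=0) - low.mean(axis=0)
print(p.round(2), s.round(2), d.round(2))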
*Item fairness- refers to the degree, if any, to which a test item is biased.
*Biased test item- an item that favors one particular group of examinees in relation to another when differences in group ability are controlled.
*Speed Tests- the closer an item is to the end of the test, the more difficult it may appear to be, because testtakers simply may not get to items near the end of the test before time runs out. • How can items on a speed test be analyzed? Perhaps the most obvious solution is to restrict the item analysis to the items completed by the testtaker. However, this solution is not recommended, for at least three reasons: (1) item analyses of the later items would be based on a progressively smaller number of testtakers, yielding progressively less reliable results; (2) if the more knowledgeable examinees reach the later items, then part of the analysis is based on all testtakers and part is based on a selected sample; and (3) because the more knowledgeable testtakers are more likely to score correctly, their performance will make items occurring toward the end of the test appear to be easier than they are.
=Qualitative Item Analysis=
*Qualitative methods- techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures, such as interviews and group discussions conducted with testtakers and other relevant parties. Encouraging testtakers, on a group or individual basis, to discuss aspects of their test-taking experience is, in essence, eliciting or generating "data" (words).
*Qualitative item analysis- a general term for various nonstatistical procedures designed to explore how individual test items work. The analysis compares individual test items to each other and to the test as a whole.
*"Think aloud" test administration- a qualitative research tool designed to shed light on the testtaker's thought processes during the administration of a test. On a one-to-one basis with an examiner, examinees are asked to take a test, thinking aloud as they respond to each item.
*Expert panels- may also provide qualitative analyses of test items.
*Sensitivity review- a study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations.
*Cross-validation- the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion; a key step in the development of all tests, brand-new or revised editions (see the sketch below).
*Validity shrinkage- the decrease in item validities that inevitably occurs after cross-validation of findings.
*Co-validation- a test validation process conducted on two or more tests using the same sample of testtakers.
*Co-norming- co-validation used in conjunction with the creation of norms or the revision of existing norms.
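A minimal sketch of cross-validation, assuming hypothetical test and criterion scores split into a derivation sample and an independent validation sample:

import numpy as np

rng = np.random.default_rng(3)
test = rng.normal(size=400)
criterion = 0.5 * test + rng.normal(size=400)

derive, validate = slice(0, 200), slice(200, 400)
r_derive = np.corrcoef(test[derive], criterion[derive])[0, 1]
r_cross = np.corrcoef(test[validate], criterion[validate])[0, 1]
print(round(r_derive, 2), round(r_cross, 2))
# When items or weights have been tuned on the derivation sample, r_cross
# typically comes out lower: the validity shrinkage described above.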
• Another mechanism for ensuring consistency in scoring is the anchor protocol.
*Anchor protocol- a test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies.
*Scoring drift- a discrepancy between the scoring in an anchor protocol and the scoring of another protocol.
=The Use of IRT in Building and Revising Tests=
Three of the many possible applications of IRT in building and revising tests include (1) evaluating existing tests for the purpose of mapping test revisions, (2) determining measurement equivalence across testtaker populations, and (3) developing item banks.
*Item-characteristic curves (ICCs)- provide information about the relationship between the performance of individual items and the presumed underlying ability (or trait) level of the testtaker. • Using IRT, test developers evaluate individual item performance with reference to item-characteristic curves (see the sketch after this section).
*Differential item functioning (DIF)- a phenomenon wherein an item functions differently in one group of testtakers as compared to another group of testtakers known to have the same (or similar) level of the underlying trait.
*DIF analysis- test developers scrutinize group-by-group item response curves, looking for what are termed DIF items. • It has been used to evaluate measurement equivalence in item content across groups that vary by culture, gender, and age, and even to explore differential item functioning as a function of different patterns of guessing on the part of members of different groups.
*DIF items- items that respondents from different groups at the same level of the underlying trait have different probabilities of endorsing as a function of their group membership.
=Developing item banks=
• The final item bank will consist of a large set of items all measuring a single domain (or a single trait or ability). A test developer may then use the banked items to create one or more tests with a fixed number of items. For example, a teacher may create two different versions of a math test in order to minimize efforts by testtakers to cheat.
• When used within a CAT environment, a testtaker's response to an item may automatically trigger which item is presented next. The software is programmed to present next the item that will be most informative with regard to the testtaker's standing on the construct being measured. This programming is based on near-instantaneous construction and analysis of IRT information curves. The process continues until the testing is terminated.
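A minimal sketch of an item-characteristic curve under a two-parameter logistic IRT model; the discrimination (a) and difficulty (b) values here are hypothetical:

import math

def icc(theta, a=1.5, b=0.0):
    # Probability of a correct response at ability level theta
    return 1 / (1 + math.exp(-a * (theta - b)))

for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc(theta), 2))   # rises from ~.05 to ~.95
# Steeper curves (larger a) discriminate better near b; DIF analysis
# compares such curves across groups matched on the underlying trait.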
WRITING AND EVALUATING TEST ITEMS
Item Writing
1. Define clearly what you want to measure. To do this, use substantive theory as a guide and try to make items as specific as possible.
2. Generate an item pool. Theoretically, all items are randomly chosen from a universe of item content. In practice, however, care in selecting and developing items is valuable. Avoid redundant items. In the initial phases, you may want to write three or four items for each one that will eventually be used on the test or scale.
3. Avoid exceptionally long items. Long items are often confusing or misleading.
4. Keep the level of reading difficulty appropriate for those who will complete the scale.
5. Avoid "double-barreled" items that convey two or more ideas at the same time. For example, consider an item that asks the respondent to agree or disagree with the statement, "I vote Democratic because I support social programs." There are two different statements with which the person could agree: "I vote Democratic" and "I support social programs."
6. Consider mixing positively and negatively worded items. Sometimes respondents develop an "acquiescence response set," meaning that they tend to agree with most items. To avoid this bias, you can include items that are worded in the opposite direction. For example, in asking about depression, the CES-D (see Chapter 2) uses mostly negatively worded items (such as "I felt depressed") but also includes items worded in the opposite direction ("I felt hopeful about the future").
=Item Formats=
*Likert format- so named because it was used as part of Likert's (1932) method of attitude scale construction. A scale using the Likert format consists of items such as "I am afraid of heights." Instead of asking for a yes-no reply, five alternatives are offered: strongly disagree, disagree, neutral, agree, and strongly agree. In some applications, six options are used to avoid allowing the respondent to be neutral: strongly disagree, moderately disagree, mildly disagree, mildly agree, moderately agree, and strongly agree.
*Dichotomous format- offers two alternatives for each item. Usually a point is given for the selection of one of the alternatives. The most common example of this format is the true-false examination.
*Polytomous format (sometimes called polychotomous)- resembles the dichotomous format except that each item has more than two alternatives; the multiple-choice format is the common example. Distractors- the incorrect choices.
*Category format- a technique similar to the Likert format but using an even greater number of choices, for example a scale from 1 to 10, with 1 as the lowest and 10 as the highest.
*Visual analogue scale- popular for measuring self-rated health. The respondent is given a 100-millimeter line and asked to place a mark between two well-defined endpoints.
*Adjective checklist- one format common in personality measurement. The subject receives a long list of adjectives and indicates whether each one is characteristic of himself or herself. It can be used for describing either oneself or someone else, and it requires subjects either to endorse each adjective or not, thus allowing only two choices per item.
*Q-sort- a similar technique that increases the number of categories; it can be used to describe oneself or to provide ratings of others. A subject is given statements and asked to sort them into nine piles: a statement that really hits home would be placed in pile 9, while statements that are not at all descriptive would be placed in pile 1.
=ITEM ANALYSIS=
*Item analysis- a general term for a set of methods used to evaluate test items; one of the most important aspects of test construction.
*Item difficulty- the proportion of people who get a particular item correct. (For example, if 84% of the people taking a particular test get item 24 correct, then the difficulty level for that item is .84.)
*Item discriminability- determines whether the people who have done well on particular items have also done well on the whole test; it examines the relationship between performance on particular items and performance on the whole test.
=WAYS TO EVALUATE DISCRIMINABILITY=
1. The Extreme Group Method- compares people who have done well with those who have done poorly on a test. For example, you might find the students with test scores in the top third and those in the bottom third of the class, and then find the proportions of people in each group who got each item correct. • Discrimination index (di)- the difference between the proportions of people in each group who got the item correct: for each item, subtract the proportion of correct responses in the low group from the proportion of correct responses in the high group. • The closer the value of the index is to 1.0, the better the item.
2. The Point Biserial Method- finds the correlation between performance on the item and performance on the total test; that is, the correlation between a dichotomous (two-category) variable and a continuous variable. • The point biserial correlation (rpbis) between an item and the total test score is evaluated in much the same way as the extreme group discriminability index; if this value is negative or low, then the item should be eliminated from the test. • If 90% of test takers get an item correct, then there is too little variability in performance for there to be a substantial correlation with the total test score; similarly, if items are so hard that they are answered correctly by 10% or fewer of the test takers, then there is too little room to show a correlation between the items and the total test score.
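A minimal sketch of the point biserial method, correlating one 0/1 item with the total test score; the response matrix is simulated for illustration:

import numpy as np

rng = np.random.default_rng(4)
ability = rng.normal(size=200)
# Probability of a correct answer rises with ability
prob = 1 / (1 + np.exp(-ability[:, None]))
responses = (rng.random((200, 8)) < prob).astype(int)
total = responses.sum(axis=1)

r_pbis = np.corrcoef(responses[:, 0], total)[0, 1]
print(round(r_pbis, 2))   # low or negative values flag items to eliminate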
=Pictures of Item Characteristics=
*Item characteristic curve- a valuable way to learn about items is to graph their characteristics. • The total test score is plotted on the horizontal (X) axis, and the proportion of examinees who get the item correct is plotted on the vertical (Y) axis. • The total test score is used as an estimate of the amount of a "trait" possessed by individuals. • The relationship between performance on the item and performance on the test gives some information about how well the item is tapping the information we want.
*Drawing the Item Characteristic Curve- using fewer class intervals allows the curve to take on a smoother appearance.
*Item Response Theory- each item on a test has its own item characteristic curve that describes the probability of getting the particular item right or wrong given the ability level of each test taker. • It builds on traditional models of item analysis and can provide information on item functioning, the value of specific items, and the reliability of a scale. • The computer can rapidly identify the specific items that are required to assess a particular ability level.
=Items for Criterion-Referenced Tests=
• To evaluate the items in a criterion-referenced test, one should give the test to two groups of students: one that has been exposed to the learning unit and one that has not.
• The resulting frequency polygon looks like a V. The scores on the left side of the V are probably those from students who have not experienced the unit; scores on the right represent those who have been exposed to the unit.
• The bottom of the V is the antimode, or the least frequent score. This point divides those who have been exposed to the unit from those who have not, and it is usually taken as the cutting score or point, or what marks the point of decision. • When people get scores higher than the antimode, we assume that they have met the objective of the test; when they get lower scores, we assume they have not.
TOPIC 7: TEST ADMINISTRATION
=The Relationship Between Examiner and Test Taker=
• Both the behavior of the examiner and his or her relationship to the test taker can affect test scores.
• Subtle cues given by the test administrator can affect the level of performance expected by the examiner.
=The Race of the Tester=
• The examiner's race has generally not been found to affect the results of IQ tests, because the procedures for properly administering an IQ test are so specific: anyone who gives the test should do so according to a strict procedure.
In other words, well-trained African American and white test administrators should act almost identically.
=Stereotype Threat=
• For people who come from groups haunted by negative stereotypes, there may be a second level of threat in the testing situation.
• Stereotype threat depletes working memory. Another explanation for its effects is "self-handicapping": test takers, when faced with the expectation that they may not perform well, might reduce their level of effort.
• This threat could be reduced by simply moving the questions about age, race, and sex from the beginning of the test to the end.
=Subject Variables=
• A final variable that may be a serious source of error is the state of the subject: motivation and anxiety can greatly affect test scores. For example, some students suffer from debilitating test anxiety, which seriously interferes with performance.
=Computer-Assisted Test Administration=
Some of the advantages that computers offer: • excellence of standardization, • individually tailored sequential administration, • precision of timing responses, • release of human testers for other duties, • patience (the test taker is not rushed), and • control of bias.
• Interest has increased in computer-assisted test administration because it may reduce examiner bias. Computers can administer and score most tests with great precision and with minimum bias. This mode of test administration has become the norm for many types of tests.
=Language of Test Taker=
• The amount of linguistic demand can put non-English speakers at a disadvantage. Even for tests that do not require verbal responses, it is important to consider the extent to which the test instructions assume that the test taker understands English.
=Training of Test Administrators=
• Different assessment procedures require different levels of training.
=Expectancy Effects=
• Often called Rosenthal effects.
• Standardized test administration procedures are necessary for valid results. Extensive research in social psychology has clearly demonstrated that situational factors can affect scores on mental and behavioral tasks. These effects, however, can be subtle and may not be observed in all studies. For example, stereotype threat can have significantly detrimental effects on test takers. Similarly, the examiner's rapport and expectancies may influence scores on some but not all occasions. Direct reinforcement of specific responses does have an acknowledged impact and therefore should not be given in most testing situations. In response to these problems, several remedies have been suggested, including standardization of the test instructions. The threat of the testing situation might be reduced by maneuvers as simple as asking for demographic information at the end rather than at the beginning of the test.
TOPIC 8: INTERVIEWING TECHNIQUES
(A search for interview techniques will reveal a plethora of sites offering advice on how to answer job interview questions.)
*Interview- involves the interaction of two or more people. It may be the only, or the most important, source of data. The interview remains one of the most prevalent selection devices for employment.
The interview is also the chief method of collecting data in clinical psychiatry.
• All interviews involve mutual interaction whereby the participants are interdependent; that is, they influence each other.
• A good interview is thematic; it does not jump from one unrelated topic to another as it might if the interviewer asked a series of set questions.
*Structured interview- the interviewer asks a specific set of questions.
*Structured standardized interview- these questions are printed.
*Unstructured interview- there are no specific questions or guidelines for the interviewer to follow.
*Directive interview- the interviewer (e.g., a personnel officer) directs, guides, and controls the course of the interview.
*Nondirective interview- the interviewer (e.g., a clinical psychologist) lets the client determine the direction of the interview.
*Employment interview or selection interview- designed to elicit information pertaining to a candidate's qualifications and capabilities for particular employment duties.
=Similarities Between an Interview and a Test=
• Method for gathering data • Used to make predictions • Evaluated in terms of reliability • Evaluated in terms of validity • Group or individual • Structured or unstructured
*Social facilitation- we tend to act like the models around us (e.g., when professional actors responded with anger to highly trained, experienced interviewers, the interviewers became angry themselves and showed anger toward the actors).
=Principles of Effective Interviewing=
*The Proper Attitudes- good interviewing is actually more a matter of attitude than skill. Attitudes related to good interviewing skills include warmth, genuineness, acceptance, understanding, openness, honesty, and fairness. • To appear effective and establish rapport, the interviewer must display the proper attitudes.
*Responses to Avoid- if the goal is to elicit as much information as possible or to receive a good rating from the interviewee, then interviewers should avoid certain responses: judgmental or evaluative statements, probing statements, hostile responses, and false reassurance.
*Effective Responses- keep the interaction flowing. The interview is a two-way process: one person speaks first, then the other, and so on. The interviewer listens with interest by maintaining face-to-face contact.
*Interpersonal attraction- the degree to which people share a feeling of understanding, mutual respect, similarity, and the like.
*Open-ended question- requires the interviewee to produce something spontaneously. Examples include "Tell me a little bit about yourself," "Tell me about what interests you," and "What is it that brings you here to see me?"
*Closed-ended question- requires the interviewee to recall something or to give a brief, specific reply; typically used in structured interviews or for a particular purpose. It brings the interview to a dead halt, thus violating the principle of keeping the interaction flowing.
(Examples: "Do you like sports?", "Are you married?", and "How old are you?")
*Judgmental or evaluative statements- evaluating the thoughts, feelings, or actions of another.
*Responses to Keep the Interaction Flowing- using a transitional phrase such as "Yes," "And," or "I see." These phrases imply that the interviewee should continue on the same topic.
*Probing statements- demand more information than the interviewee wishes to provide voluntarily. The most common way to phrase a probing statement is to ask a question that begins with "Why?", which tends to place others on the defensive. • In probing we may induce the interviewee to reveal something that he or she is not yet ready to reveal; the interviewee will probably feel anxious and thus not well disposed to revealing additional information. • BUT some probes are appropriate and necessary: with children or individuals with mental retardation, for instance, one often needs to ask questions to elicit meaningful information and get beyond a superficial interchange. Avoid "Why?" statements and replace them with "Tell me" or "How?" statements.
*Interpersonal influence- the degree to which one person can influence another; related to interpersonal attraction.
*Verbatim playback- the interviewer simply repeats the interviewee's last response. (E.g., in her interview with the clinical psychologist, Maria stated, "I majored in history and social studies." A verbatim playback would be, "You majored in history and social studies.")
*Paraphrasing and restatement responses- repeat the interviewee's response using different words; interchangeable with the interviewee's response.
*Summarizing and clarification statements- pull together the meaning of several responses; they go just beyond the interviewee's response. *Clarification response- clarifies the interviewee's response.
*Empathy and understanding- communicate understanding. • One good way to accomplish this involves what we call understanding statements. To establish a positive atmosphere, interviewers begin with an open-ended question followed by understanding statements that capture the meaning and feeling of the interviewee's communication.
=Measuring Understanding=
*Level-one responses- bear little or no relationship to the interviewee's response; the two people are really talking only to themselves.
*Level-two responses- communicate a superficial awareness of the meaning of a statement; the person who makes a level-two response never quite goes beyond his or her own limited perspective. Level-two responses impede the flow of communication.
*Level-three responses- interchangeable with the interviewee's statement; the minimum level of responding that can help the interviewee. Paraphrasing, verbatim playback, clarification statements, and restatements are all examples of level-three responses.
*Level-four responses- the interviewer adds "noticeably" to the interviewee's response.
*Level-five responses- the interviewer adds "significantly" to it. • Level-four and level-five responses not only provide accurate empathy but also go beyond the statement given.
*Active listening- the foundation of good interviewing skills for many different types of interviews.
*Mental status examination- an important tool in psychiatric and neurological examinations, used primarily to diagnose psychosis, brain damage, and other major mental health problems.
Its purpose is to evaluate a person suspected of having neurological or emotional problems in terms of variables known to be related to those problems.
=Developing Interviewing Skills=
*First step- become familiar with research and theory on the interview in order to understand the principles and underlying variables involved.
*Second step- supervised practice. Experience truly is the best teacher, and no amount of book learning can compare with having one's taped interview analyzed by an expert.
*Third step- make a conscious effort to apply the principles involved in good interviewing, such as the guidelines for keeping the interaction flowing. • The initial phase of learning any new skill seems to involve attending to a hundred things at once, an impossible task.
=Sources of Error in the Interview=
*Interview validity- limited by the extreme difficulty we have in making accurate, logical observations and judgments.
*Halo effect- the tendency to judge specific traits on the basis of a general impression. Interviewers form an impression of the interviewee within the first minute or so and spend the rest of the interview trying to confirm that impression.
*General standoutishness- the tendency to judge on the basis of one outstanding characteristic, and the related tendency of interviewers to make unwarranted inferences from personal appearance.
*Interview reliability- the critical questions about reliability have centered on inter-interviewer agreement, that is, agreement between two or more interviewers. • The research suggests that a highly structured interview in which specific questions are asked in a specific order can produce highly stable results. • The internal consistency reliability for scores on highly structured interviews was .79 where the interviewer was gathering information about the interviewee's experience, .90 where interviewees responded to hypothetical dilemmas they might experience on the job, and .86 where the interviewer was gathering information about the interviewees' past behavior.
TOPIC 9: THEORIES OF INTELLIGENCE AND ITS MEASUREMENTS (SB-5, WAIS-IV)
*Intelligence- a multifaceted capacity that manifests itself in different ways across the life span. In general, intelligence includes the abilities to: •acquire and apply knowledge •reason logically •plan effectively •infer perceptively •make sound judgments and solve problems •grasp and visualize concepts •pay attention •be intuitive •find the right words and thoughts with facility •cope with, adjust to, and make the most of new situations
=Perspectives on Intelligence=
*Interactionism- refers to the complex concept by which heredity and environment are presumed to interact and influence the development of one's intelligence.
*Factor-analytic theories- the focus is squarely on identifying the ability or groups of abilities deemed to constitute intelligence.
*Information-processing theories- the focus is on identifying the specific mental processes that constitute intelligence.
*Louis L. Thurstone- conceived of intelligence as composed of what he termed primary mental abilities (PMAs). • Thurstone developed and published the Primary Mental Abilities test, which consisted of separate tests, each designed to measure one PMA: verbal meaning, perceptual speed, reasoning, number facility, rote memory, word fluency, and spatial relations.
Prior to reading about factor-analytic theories of intelligence, some extended discussion of factor analysis may be helpful.
*Factor-analytic theories of intelligence- theorists have used factor analysis to study correlations between tests measuring varied abilities presumed to reflect the underlying attribute of intelligence.
*Two-factor theory of intelligence- Spearman formalized his observations into an influential theory of general intelligence that postulated the existence of a general intellectual ability factor (denoted by an italic lowercase g) that is partially tapped by all other mental abilities. • g represents the portion of the variance that all intelligence tests have in common; the remaining portions of the variance are accounted for either by specific components (s) or by error components (e) of this general factor. • Tests that exhibited high positive correlations with other intelligence tests were thought to be highly saturated with g, whereas tests with low or moderate correlations with other intelligence tests were viewed as possible measures of specific factors (such as visual or motor ability). • The greater the magnitude of g in a test of intelligence, the better the test was thought to predict overall intelligence.
*Group factors- an intermediate class of factors common to a group of activities but not to all; neither as general as g nor as specific as s. Examples of these broad group factors include linguistic, mechanical, and arithmetical abilities.
=Multiple-factor models of intelligence=
*Guilford- sought to explain mental activities by deemphasizing, if not eliminating, any reference to g. (From Guilford's perspective, there is no single underlying intelligence for the different test items to reflect, so there would be no basis for a large common factor.)
*Thurstone- initially conceived of intelligence as being composed of seven "primary abilities."
*Gardner- developed a theory of multiple (seven, actually) intelligences: (1) logical-mathematical, (2) bodily-kinesthetic, (3) linguistic, (4) musical, (5) spatial, (6) interpersonal, and (7) intrapersonal. • Interpersonal intelligence- the ability to understand other people: what motivates them, how they work, and how to work cooperatively with them. Successful salespeople, politicians, teachers, clinicians, and religious leaders are all likely to be individuals with high degrees of interpersonal intelligence. • Intrapersonal intelligence- a correlative ability turned inward: the capacity to form an accurate, veridical model of oneself and to use that model to operate effectively in life.
*Cattell- postulated the existence of two major types of cognitive abilities: crystallized intelligence and fluid intelligence.
*Crystallized intelligence (symbolized Gc)- includes acquired skills and knowledge that are dependent on exposure to a particular culture as well as on formal and informal education (vocabulary, for example). Retrieval of information and application of general knowledge are conceived of as elements of crystallized intelligence.
*Fluid intelligence (symbolized Gf)- abilities that are nonverbal, relatively culture-free, and independent of specific instruction (such as memory for digits).
*Horn- proposed the addition of several factors: visual processing (Gv), auditory processing (Ga), quantitative processing (Gq), speed of processing (Gs), facility with reading and writing (Grw), short-term memory (Gsm), and long-term storage and retrieval (Glr).
*Vulnerable abilities- such as Gv; they decline with age and tend not to return to preinjury levels following brain damage.
*Maintained abilities- such as Gq; they tend not to decline with age and may return to preinjury levels following brain damage.
*Three-stratum theory of cognitive abilities- at the top of Carroll's model is g, or general intelligence. • The second stratum is composed of eight abilities and processes: fluid intelligence (Gf), crystallized intelligence (Gc), general memory and learning (Y), broad visual perception (V), broad auditory perception (U), broad retrieval capacity (R), broad cognitive speediness (S), and processing/decision speed (T).
*Hierarchical model- all of the abilities listed in a stratum are subsumed by or incorporated in the strata above (the three-stratum theory is of this kind).
*McGrew-Flanagan CHC model- features ten "broad-stratum" abilities and over seventy "narrow-stratum" abilities, with each broad-stratum ability subsuming two or more narrow-stratum abilities. • g was not employed in the CHC model because it lacked utility in psychoeducational evaluations. • The ten broad-stratum abilities, with their "code names" in parentheses, are: fluid intelligence (Gf), crystallized intelligence (Gc), quantitative knowledge (Gq), reading/writing ability (Grw), short-term memory (Gsm), visual processing (Gv), auditory processing (Ga), long-term storage and retrieval (Glr), processing speed (Gs), and decision/reaction time or speed (Gt).
*Cross-battery assessment- assessment that employs tests from different test batteries and entails interpretation of data from specified subtests to provide a comprehensive assessment.
*Thorndike- intelligence can be conceived in terms of three clusters of ability: social intelligence (dealing with people), concrete intelligence (dealing with objects), and abstract intelligence (dealing with verbal and mathematical symbols). • Thorndike also incorporated a general mental ability factor (g) into the theory, defining it as the total number of modifiable neural connections or "bonds" available in the brain. (The model looks for one central factor reflecting g along with three additional factors representing social, concrete, and abstract intelligences; testtakers' responses to specific items reflect in part a general intelligence but also the different types of intelligence.)
*Information-processing view- associated with the Russian neuropsychologist Aleksandr Luria. It focuses on the mechanisms by which information is processed: how information is processed, rather than what is processed. • Two basic types of information-processing styles, simultaneous and successive, have been distinguished.
*Simultaneous (or parallel) processing- information is integrated all at one time; it may be described as "synthesized," with information integrated and synthesized at once and as a whole.
*Successive (or sequential) processing- each bit of information is individually processed in sequence; logical and analytic in nature. Piece by piece and one piece after the other, information is arranged and rearranged so that it makes sense.
*PASS model of intellectual functioning- PASS is an acronym for planning, attention, simultaneous, and successive.
In this model,
*planning- refers to strategy development for problem solving;
*attention (also referred to as arousal)- refers to receptivity to information; and
*simultaneous and successive- refer to the type of information processing employed.
=Measuring Intelligence=
*Some Tasks Used to Measure Intelligence- in infancy, intellectual assessment consists primarily of measuring sensorimotor development (e.g., measurement of nonverbal motor responses such as turning over, lifting the head, sitting up, following a moving object with the eyes, imitating gestures, and reaching for a group of objects).
*Some Tests Used to Measure Intelligence- as reference volumes such as Tests in Print make clear, many different intelligence tests exist.
*Stanford-Binet Intelligence Scales: Fifth Edition (SB5)- designed for administration to assessees as young as 2 and as old as 85 (or older). The test yields a number of composite scores, including a Full Scale IQ derived from the administration of ten subtests. (The SB5 is based on the Cattell-Horn-Carroll (CHC) theory of intellectual abilities.)
•Subtest scores all have a mean of 10 and a standard deviation of 3.
•All composite scores have a mean set at 100 and a standard deviation of 15.
•For the reliability of the SB5 Full Scale IQ with the norming sample, an internal-consistency reliability formula designed for the sum of multiple tests was employed. The calculated coefficients for the SB5 Full Scale IQ were consistently high (.97 to .98) across age groups, as was the reliability for the Abbreviated Battery IQ (average of .91).
•Test-retest reliability coefficients reported in the manual were also high. The test-retest interval was only 5 to 8 days—shorter by some 20 to 25 days than the interval employed on other, comparable tests.
•Inter-scorer reliability coefficients reported in the SB5 Technical Manual ranged from .74 to .97, with an overall median of .90. Items showing especially poor inter-scorer agreement had been deleted during the test development process.
*alternate item- an item to be substituted for a regular item under specified conditions (such as the situation in which the examiner failed to properly administer the regular item).
*test composite- formerly described as a deviation IQ score; may be defined as a test score or index derived from the combination of, and/or a mathematical transformation of, one or more subtest scores.
*routing test- a task used to direct or route the examinee to a particular level of questions. A purpose of the routing test is to direct an examinee to test items that have a high probability of being at an optimal level of difficulty. Routing tests contain
*teaching items- designed to illustrate the task required and assure the examiner that the examinee understands.
*floor- refers to the lowest-level items on a subtest.
*ceiling- the highest-level item of the subtest.
*basal level- used to describe a subtest with reference to a specific testtaker's performance.
*ceiling level- said to have been reached, and testing discontinued, if and when the examinee fails a certain number of items in a row.
*adaptive testing- testing individually tailored to the testtaker; it might entail beginning a subtest with a question in the middle range of difficulty. If the testtaker responds correctly to the item, an item of greater difficulty is posed next; if the testtaker responds incorrectly, an item of lesser difficulty is posed. Computerized adaptive testing is in essence designed "to mimic automatically what a wise examiner would do." (Other terms used to refer to adaptive testing include tailored testing, sequential testing, branched testing, and response-contingent testing.)
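The up-and-down logic just described is easy to express in code. The following is a toy staircase sketch, not any published test's algorithm: the item bank, the fixed eight-item length, and the simulated examinee are all invented for illustration.

```python
# Toy version of adaptive testing: begin in the middle of the difficulty
# range, pose a harder item after a pass and an easier item after a miss.

def adaptive_test(difficulties, passes_item, n_items=8):
    """difficulties: item difficulty levels, sorted ascending.
    passes_item(d) -> bool simulates the examinee's response."""
    level = len(difficulties) // 2                         # start mid-range
    record = []
    for _ in range(n_items):
        level = max(0, min(level, len(difficulties) - 1))  # stay inside the bank
        correct = passes_item(difficulties[level])
        record.append((difficulties[level], correct))
        level += 1 if correct else -1                      # harder after a pass
    return record

# A simulated examinee who passes items up to difficulty 6:
record = adaptive_test(list(range(1, 11)), lambda d: d <= 6)
print(record)                                    # items hover around level 6-7
print(sum(d for d, _ in record) / len(record))   # rough ability estimate: 6.5
```

The point of the sketch is that items far from the examinee's level are never administered, which is exactly the efficiency argument the next bullets make.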
•Three other advantages of beginning an intelligence test or subtest at an optimal level of difficulty: (1) it allows the test user to collect the maximum amount of information in the minimum amount of time, (2) it facilitates rapport, and (3) it minimizes the potential for examinee fatigue from being administered too many items.
*cutoff boundaries with their corresponding nominal categories- (table not reproduced in these notes).
*Wechsler tests- the Wechsler-Bellevue (W-B) measured something comparable to what other intelligence tests measured. Still, the test suffered from some problems: (1) the standardization sample was rather restricted; (2) some subtests lacked sufficient inter-item reliability; (3) some of the subtests were made up of items that were too easy; and (4) the scoring criteria for certain items were too ambiguous.
*Wechsler Adult Intelligence Scale (WAIS)- the WAIS was organized into Verbal and Performance scales. Scoring yielded a Verbal IQ, a Performance IQ, and a Full Scale IQ.
*WAIS-IV- made up of subtests that are designated either as core or supplemental.
*core subtest- one that is administered to obtain a composite score.
*supplemental subtest (sometimes referred to as an optional subtest)- used for purposes such as providing additional clinical information or extending the number of abilities or processes sampled.
•A supplemental subtest might be substituted for a core subtest if:
■ the examiner incorrectly administered a core subtest
■ the assessee had been inappropriately exposed to the subtest items prior to their administration
■ the assessee evidenced a physical limitation that affected the assessee's ability to respond effectively to the items of a particular subtest
•Changes in the WAIS-IV as compared to the previous edition:
■ enlargement of the images in the Picture Completion, Symbol Search, and Coding subtests
■ the recommended nonadministration of certain supplemental subtests that tap short-term memory, hand-eye coordination, and/or motor speed for testtakers above the age of 69 (this reduces testing time and minimizes testtaker frustration)
■ an average reduction in overall test administration time from 80 to 67 minutes (accomplished primarily by shortening the number of items the testtaker must fail before a subtest is discontinued)
=Group tests of intelligence or school ability tests=
•Group intelligence tests in the schools are used in special forms as early as the kindergarten level. The tests are administered to groups of 10 to 15 children, each of whom receives a test booklet that includes printed pictures and diagrams.
*Short forms of intelligence tests- a short form is a test that has been abbreviated in length, typically to reduce the time needed for test administration, scoring, and interpretation.
*Watkins- concluded that short forms may be used for screening purposes only, not to make placement or educational decisions.
*Silverstein- provided an incisive review of the history of short forms, focusing on four issues: (1) how to abbreviate the original test; (2) how to select subjects; (3) how to estimate scores on the original test; and (4) the criteria to apply when comparing the short form with the original.
*Ryan and Ward- advised that anytime a short form is used, the score should be reported on the official record with the abbreviation "Est" next to it, indicating that the reported value is only an estimate.
*Wechsler Abbreviated Scale of Intelligence (WASI)- designed to answer the need for a short instrument to screen intellectual ability in testtakers from 6 to 89 years of age. The test comes in a two-subtest form (consisting of Vocabulary and Block Design) that takes about 15 minutes to administer and a four-subtest form that takes about 30 minutes to administer.
•The four subtests (Vocabulary, Block Design, Similarities, and Matrix Reasoning) are WISC- and WAIS-type subtests that had high correlations with Full Scale IQ on those tests and are thought to tap a wide range of cognitive abilities.
•The WASI yields measures of Verbal IQ, Performance IQ, and Full Scale IQ. Consistent with many other intelligence tests, the Full Scale IQ was set at 100 with a standard deviation of 15.
*WASI-2- a revision of the WASI. (The California Test of Mental Maturity, the Kuhlmann-Anderson Intelligence Tests, the Henmon-Nelson Tests of Mental Ability, and the Cognitive Abilities Test are group intelligence tests available for use in school settings.)
*Otis-Lennon School Ability Test (formerly the Otis Mental Ability Test)- designed to measure abstract thinking and reasoning ability and to assist in school evaluation and placement decision making.
•This nationally standardized test yields Verbal and Nonverbal score indexes as well as an overall School Ability Index (SAI).
*Army Alpha test- administered to Army recruits who could read. It contained tasks such as general information questions, analogies, and scrambled sentences to reassemble.
*Army Beta test- designed for administration to foreign-born recruits with poor knowledge of English or to illiterate recruits (defined as "someone who could not read a newspaper or write a letter home"). It contained tasks such as mazes, coding, and picture completion (wherein the examinee's task was to draw in the missing element of the picture).
*screening tool- an instrument or procedure used to identify a particular trait or constellation of traits at a gross or imprecise level.
*Officer Qualifying Test- a 115-item multiple-choice test used by the U.S. Navy as an admissions test to Officer Candidate School.
*Airman Qualifying Exam- a 200-item multiple-choice test given to all U.S. Air Force volunteers.
*Armed Services Vocational Aptitude Battery (ASVAB)- a multiple aptitude test administered to prospective new recruits in all the armed services. It is also made available to high-school students and other young adults.
*Armed Forces Qualification Test (AFQT)- a measure of general ability, derived from the ASVAB, used in the selection of recruits.
•In addition to the AFQT score, ten aptitude areas are also tapped by the ASVAB, including general technical, general mechanics, electrical, motor-mechanics, science, combat operations, and skill-technical. (The AFQT itself is a set of 100 selected items included in the subtests of Arithmetic Reasoning, Numerical Operations, Word Knowledge, and Paragraph Comprehension.)
•Group tests are useful screening tools when large numbers of examinees must be evaluated either simultaneously or within a limited time frame.
=Other measures of intellectual abilities=
*cognitive style- a psychological dimension that characterizes the consistency with which one acquires and processes information.
Four terms common to many measures of creativity:
*Originality- refers to the ability to produce something that is innovative or nonobvious. It may be something abstract like an idea or something tangible and visible like artwork or a poem.
*Fluency- refers to the ease with which responses are reproduced; it is usually measured by the total number of responses produced. For example, an item in a test of word fluency might be: "In the next thirty seconds, name as many words as you can that begin with the letter w."
*Flexibility- refers to the variety of ideas presented and the ability to shift from one approach to another.
*Elaboration- refers to the richness of detail in a verbal explanation or pictorial display.
*Guilford- drew a distinction between the intellectual processes of convergent and divergent thinking.
*Convergent thinking- a deductive reasoning process that entails recall and consideration of facts as well as a series of logical judgments to narrow down solutions and eventually arrive at one solution (the thought process required by most achievement tests).
*Divergent thinking- a reasoning process in which thought is free to move in many different directions, making several solutions possible. Divergent thinking requires flexibility of thought, originality, and imagination.
•Guilford described several tasks designed to measure creativity, such as Consequences ("Imagine what would happen if . . .") and Unusual Uses (e.g., "Name as many uses as you can think of for a rubber band").
*Structure-of-Intellect Abilities- verbally oriented tasks (such as Word Fluency) and nonverbally oriented tasks (such as Sketches).
*Remote Associates Test (RAT)- presents the testtaker with three words; the task is to find a fourth word associated with the other three (Mednick).
*Torrance Tests of Creative Thinking- consist of word-based, picture-based, and sound-based test materials.
*psychoeducational batteries- test packages that measure not only intelligence but also related abilities in educational settings.
=Issues in the Assessment of Intelligence=
•Items on a test of intelligence tend to reflect the culture of the society where the test is employed.
*culture-free intelligence test- designed to separate "natural intelligence from instruction" by "disregarding, insofar as possible, the degree of instruction which the subject possesses."
*Culture loading- may be defined as the extent to which a test incorporates the vocabulary, concepts, traditions, knowledge, and feelings associated with a particular culture. (Gauging the culture loading of a test tends to involve a subjective, qualitative, nonnumerical judgment.)
*culture-fair intelligence test- a test or assessment process designed to minimize the influence of culture with regard to various aspects of the evaluation procedures, such as administration instructions, item content, responses required of testtakers, and interpretations made from the resulting data. (Culture-fair tests have been found to lack the hallmark of traditional tests of intelligence: predictive validity.)
*culture-specific intelligence test- a traditional intelligence test developed for members of a particular cultural group or subculture; such tests were thought to be able to yield a more valid measure of mental development.
*Black Intelligence Test of Cultural Homogeneity (BITCH)- a culture-specific intelligence test developed expressly for use with African-Americans; a 100-item multiple-choice test. (The test was measuring a variable that could be characterized as streetwiseness, "street smarts," or "street efficacy." The BITCH lacked predictive validity and provided little useful, practical information.)
*Flynn effect- a shorthand reference to the progressive rise in intelligence test scores that is expected to occur on a normed test of intelligence from the date when the test was first normed (one of the less obvious sources of systematic bias in scores on intelligence tests).
*intelligence inflation- measured intelligence seems to rise on average, year by year, starting with the year for which the test is normed. The rise in measured IQ is not accompanied by any academic dividend and so is not thought to be due to any actual rise in "true intelligence." (A rough worked example follows.)
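To put rough numbers on intelligence inflation: Flynn's research is commonly summarized as a gain of about three IQ points per decade. That rate is an outside estimate, not a figure stated in these notes, but it makes the bias concrete:

```latex
% Expected inflation on an aging norm (0.3 points/year is an assumed rate):
\text{inflation} \approx 0.3 \times (\text{year tested} - \text{year normed})
% A test normed in 1990 but administered in 2020:
0.3 \times (2020 - 1990) = 9 \ \text{points}
```

On these assumptions, an examinee scoring 100 against 30-year-old norms might score roughly 91 if the test were re-normed today, which is one reason publishers re-norm tests periodically.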
INTELLIGENCE AND BINET SCALES
*Alfred Binet- defined intelligence as "the tendency to take and maintain a definite direction; the capacity to make adaptations for the purpose of attaining a desired end, and the power of autocriticism."
*Spearman- defined intelligence as the ability to educe either relations or correlates.
*Freeman- intelligence is "adjustment or adaptation of the individual to his total environment," "the ability to learn," and "the ability to carry on abstract thinking."
*Das- defined intelligence as "the ability to plan and structure one's behavior with an end in view."
*H. Gardner (1983)- defined intelligence in terms of the ability "to resolve genuine problems or difficulties as they are encountered."
*Sternberg- defined intelligence in terms of "mental activities involved in purposive adaptation to, shaping of, and selection of real-world environments relevant to one's life."
*Anderson- intelligence is two-dimensional and based on individual differences in information-processing speed and executive functioning, influenced largely by inhibitory processes.
*T. R. Taylor- identified three independent research traditions that have been employed to study the nature of human intelligence: the psychometric, the information-processing, and the cognitive approaches.
*psychometric approach- examines the elemental structure of a test; we examine the properties of a test through an evaluation of its correlates and underlying dimensions (the oldest approach).
*information-processing approach- examines the processes that underlie how we learn and solve problems.
*cognitive tradition- focuses on how humans adapt to real-world demands.
*Binet test- a test that examines one's ability to define words and identify numerical sequences certainly does not meet the standards of all or even most definitions of intelligence.
•Properly used, intelligence tests provide an objective standard of competence and potential (Greisinger, 2003).
•Critics charge that intelligence tests are biased (McDermott, Watkins, & Rhoad, 2014), not only against certain racial and economic groups (Jones, 2003) but also that they are used by those in power to maintain the status quo (Gould, 1981).
=Binet's Principles of Test Construction=
*Binet- defined intelligence as the capacity (1) to find and maintain a definite direction or purpose, (2) to make necessary adaptations—that is, strategy adjustments—to achieve that purpose, and (3) to engage in self-criticism so that necessary adjustments in strategy can be made.
*Principle 1: Age Differentiation- refers to the simple fact that one can differentiate older children from younger children by the former's greater capabilities. For example, whereas most 9-year-olds can tell that a quarter is worth more than a dime and a dime more than a nickel, most 4-year-olds cannot.
*mental age- equivalent age capability.
•One could determine the equivalent age capabilities of a child independent of his or her chronological age. If a 6-year-old completed tasks that were appropriate for the average 9-year-old, then the 6-year-old had demonstrated capabilities equivalent to those of the average 9-year-old, or a mental age of 9.
•A particular 5-year-old child might be able to complete tasks that the average 8-year-old could complete. On the other hand, another 5-year-old might not be capable of completing even those tasks that the average 3-year-old could complete.
*Principle 2: General Mental Ability- Binet sought to measure only the total product of the various separate and distinct elements of intelligence.
•This freed him from having to identify each element or independent aspect of intelligence, and from finding the relation of each element to the whole.
•He could judge the value of any particular task in terms of its correlation with the combined result (total score) of all other tasks. Tasks with low correlations could be eliminated, and tasks with high correlations retained. (A computational sketch of this item-selection idea appears at the end of this section.)
=Spearman's Model of General Mental Ability=
*Spearman's theory- intelligence consists of one general factor (g) plus a large number of specific factors.
•Half of the variance in a set of diverse mental ability tests is represented in the g factor.
*Spearman's model of intelligence- according to the model, intelligence can be viewed in terms of one general underlying factor (g) and a large number of specific factors (S1, S2, …, Sn); that is, in terms of g (general mental ability) and S (specific factors).
•Spearman's theory was consistent with Binet's approach to constructing the first intelligence test.
•Spearman referred to the general mental ability as psychometric g (or simply g).
*positive manifold- when a set of diverse ability tests is administered to large unbiased samples of the population, almost all of the correlations are positive; all tests, no matter how diverse, are influenced by g.
*factor analysis- a statistical technique for reducing a set of variables or scores to a smaller number of hypothetical variables called factors.
•Through factor analysis, one can determine how much variance a set of tests or scores has in common.
•The g in a factor analysis of any set of mental ability tasks can be represented by the first unrotated factor in a principal components analysis (see the sketch below).
=Implications of General Mental Intelligence (g)=
•Implies that a person's intelligence can best be represented by a single score, g, that presumably reflects the shared variance underlying performance on a diverse set of tests.
•True performance on any given individual task can be attributed to g as well as to some specific or unique variance.
•However, if the set of tasks is large and broad enough, the role of any given task can be reduced to a minimum.
*The gf-gc Theory of Intelligence- human intelligence can best be conceptualized in terms of multiple intelligences rather than a single score; there are two basic types of intelligence:
*Fluid intelligence (f)- can best be thought of as those abilities that allow us to reason, think, and acquire new knowledge.
*Crystallized intelligence (c)- by contrast, represents the knowledge and understanding that we have acquired. (In short: the abilities that allow us to learn and acquire information are fluid, and the actual learning that has occurred is crystallized.)
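The two computational ideas flagged above, Binet's item-total correlation criterion and g as the first unrotated factor, can be sketched together in a few lines. The scores are simulated and the loadings invented; this illustrates the computations, not any real test battery:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated scores: 200 examinees x 5 tasks that share a common factor
# plus task-specific noise (this mimics a "positive manifold").
g = rng.normal(size=(200, 1))
scores = 0.7 * g + 0.7 * rng.normal(size=(200, 5))

# Binet's Principle 2 in practice: correlate each task with the total
# of the remaining tasks; low-correlation tasks would be dropped.
for j in range(scores.shape[1]):
    rest = scores.sum(axis=1) - scores[:, j]
    r = np.corrcoef(scores[:, j], rest)[0, 1]
    print(f"task {j}: item-total r = {r:.2f}")

# Spearman's g as the first unrotated factor: the leading eigenvalue of
# the correlation matrix gives the share of variance g accounts for.
R = np.corrcoef(scores, rowvar=False)
eigenvalues = np.linalg.eigvalsh(R)[::-1]       # sort descending
print(f"variance attributable to g: {eigenvalues[0] / R.shape[0]:.0%}")
```

With the loadings chosen here the leading factor carries roughly half the variance, which matches the "half of the variance" claim in the notes above.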
Spearman’s theory was consistent with Binet’s approach to constructing the first intelligence test •The general mental ability, which he referred to as psychometric g (or simply g) *positive manifold- when a set of diverse ability tests are administered to large unbiased samples of the population, almost all of the correlations are positive. All tests, no matter how diverse, are influenced by g. *factor analysis- a statistical technique, is a method for reducing a set of variables or scores to a smaller number of hypothetical variables called factors •one can determine how much variance a set of tests or scores has in common •The g in a factor analysis of any set of mental ability tasks can be represented in the first unrotated factor in a principal components analysis *Crystallized intelligence (c) - by contrast, represents the knowledge and understanding that we have acquired •the abilities that allow us to learn and acquire information (fluid) and the actual learning that has occurred (crystallized) =The Early Binet Scales= *1905 Binet-Simon scale- the first major measure of human intelligence. It was an individual intelligence test consisting of 30 items presented in an increasing order of difficulty. •The collection of 30 tasks of increasing difficulty in the Binet-Simon scale provided the first major measure of human intelligence 43 three levels of intellectual deficiency =Terman’s Stanford-Binet Intelligence Scale= *Idiot- described the most severe form of intellectual impairment *The 1916 Stanford-Binet Intelligence Scale- Terman relied heavily on Binet’s earlier work. The principles of age differentiation, general mental ability, and the age scale were retained. The mental age concept also was retained *imbecile- moderate levels of impairment *moron- the mildest level of impairment. •Binet believed that the ability to follow simple directions and imitate simple gestures (item 6 on the 1905 scale) was the upper limit of adult idiots. •The ability to identify parts of the body or simple objects (item 8) would rule out the most severe intellectual impairment in an adult •The upper limit for adult imbeciles was item 16, which required the subject to state the differences between two common objects such as wood and glass. (The 1905 Binet-Simon scale lacked an adequate measuring unit to express results; it also lacked adequate normative data and evidence to support its validity) *The 1908 Scale- Binet and Simon retained the principle of age differentiation *intelligence quotient (IQ)- used a subject’s mental age in conjunction with his or her chronological age to obtain a ratio score. This ratio score presumably reflected the subject’s rate of mental development. •The first step- is to determine the subject’s actual or chronological age. To obtain this, we need only know his or her birthday. •The second step- the subject’s mental age is determined by his or her score on the scale. Finally, to obtain the IQ, the chronological age (CA) is divided into the mental age (MA) and the result multiplied by 100 to eliminate fractions: IQ 5 MA/CA 3 100. •Introduced two major concepts: the age scale format and the concept of mental age. *age scale- which means items were grouped according to age level rather than simply one set of items of increasing difficulty, as in the 1905 scale •The age scale provided a model for innumerable tests still used in educational and clinical settings •Binet attempted to solve the problem of expressing the results in adequate units. 
A subject’s mental age was based on his or her performance compared with the average performance of individuals in a specific chronological age group. (The scale produced only one score, almost exclusively related to verbal, language, and reading ability) •When MA is less than CA, the IQ is below 100, the subject was said to have slower-than-average mental development •When MA exceeded CA, the subject was said to have faster-than-average mental development. 44 (The 1916 scale had a maximum possible mental age of 19.5 years; that is, if every group of items was passed, this score would result. Given this limitation, anyone older than 19.5 would have an IQ of less than 100 even if all items were passed. Because back in 1916 people believed that mental age ceased to improve after 16 years of age, 16 was used as the maximum chronological age.) *The 1937 Scale- extended the age range down to the 2year-old level. Adding new tasks, developers increased the maximum possible mental age to 22 years, 10 months. •instructions for scoring and test administration were improved, and IQ tables were extended from age 16 to 18. Perhaps most important, the problem of differential variation in IQs was solved by the deviation IQ concept. *deviation IQ- was simply a standard score with a mean of 100 and a standard deviation of 16 (today the standard deviation is set at 15) =The Modern Binet Scale= *Model for the Fourth and Fifth Editions of the Binet Scale-These versions incorporate the gf-gc theory of intelligence. They are based on a hierarchical model. •At the top of the hierarchy is g (general intelligence), which reflects the common variability of all tasks. At the next level are three group factors. *Crystallized abilities reflect learning—the realization of original potential through experience. *crystallized ability has two subcategories: verbal reasoning and nonverbal reasoning *Fluid-analytic abilities- represent original potential, or the basic capabilities that a person uses to acquire crystallized abilities •several performance items, which required the subject to do things such as copy designs, were added to decrease the scale’s emphasis on verbal skills. *Short-term memory- refers to one’s memory during short -intervals—the amount of information one can retain briefly after a single, short presentation •the most important improvement in the 1937 version was the inclusion of an alternate equivalent form. Forms L and M were designed to be equivalent in terms of both difficulty and content. *Thurstone’s Multidimensional Model- intelligence could best be conceptualized as comprising independent factors, or “primary mental abilities.” =Problems With the 1937 Scale= •reliability coefficients were higher for older subjects than for younger ones. Thus, results for the latter were not as stable as those for the former. *The 1960 Stanford-Binet Revision and Deviation IQ (SBLM)- tried to create a single instrument by selecting the best from the two forms of the 1937 scale •instead of viewing all specific abilities as being powered by a g factor, some groups of abilities were seen as independent. =Characteristics of the 1986 Revision= *Modern 2003 fifth edition- provided a standardized hierarchical model with five factors *The age range touted by the fifth edition spans from 2 to 851 years of age. 
(The purpose of the routing tests is to estimate the examinee's level of ability in order to guide the examination process—that is, to estimate the level of ability at which to begin testing for any given subject.)
*start point- the estimated level of ability at which testing begins.
*basal- the level at which a minimum criterion number of correct responses is obtained.
*ceiling- a certain number of incorrect responses that indicate the items are too difficult.
*Scaled scores- have a mean of 10 and a standard deviation of 3.
*standard score- a score with a mean of 100 and a standard deviation of 15, computed for nonverbal IQ, verbal IQ, full-scale IQ, and each of the five factors: fluid reasoning, knowledge, quantitative reasoning, visual-spatial processing, and working memory. (Nonverbal and verbal IQ scores are based on summing the five nonverbal and five verbal subtests; the full-scale IQ is based on all 10. The standard scores for each of the five factors are based on summing the nonverbal and corresponding verbal subtest for each respective factor.)
•The nonverbal and verbal scales are equally weighted. The examination process begins with one of two "routing measures" (subtests): one nonverbal, one verbal.
*Routing tests- organized as a point scale, which means that each contains items of similar content and of increasing difficulty. For example, the verbal routing test consists of a set of vocabulary items of increasing difficulty.
•After the routing test, the examiner goes to an age-scale-based subtest at the appropriate level for the examinee. In that way, items that are too easy are skipped to save time and provide a more efficient examination.
=The Wechsler Intelligence Scales=
Two features distinguish the Wechsler scales: (1) Wechsler's use of the point scale concept rather than the age scale used in the early Binet tests, and (2) Wechsler's inclusion of a nonverbal performance scale.
*point scale- credits or points are assigned to each item; an individual receives a specific amount of credit for each item passed.
•The point scale allowed Wechsler to devise a test that permitted an analysis of the individual's ability in a variety of content areas (e.g., judgment, vocabulary, and range of general knowledge).
=The Performance Scale Concept=
•Wechsler included an entire scale that provided a measure of nonverbal intelligence: a performance scale.
*performance scale- consisted of tasks that require a subject to do something (e.g., copy symbols or point to a missing detail) rather than merely answer questions (a measure of nonverbal intelligence).
*verbal scale- provided a measure of verbal intelligence.
*Similarities Subtest- consists of paired items of increasing difficulty; the subject must identify the similarity between the items in each pair.
*Arithmetic Subtest- contains approximately 15 relatively simple problems in increasing order of difficulty. The ninth item, for example, is as easy as this: "A person with $28.00 spends $.50. How much does he have left?"
*Digit Span Subtest- requires the subject to repeat digits, given at the rate of one per second, forward and backward.
=Scales, Subtests, and Indexes=
*Wechsler- defined intelligence as the capacity to act purposefully and to adapt to the environment.
In his words, intelligence is "the aggregate or global capacity of the individual to act purposefully, to think rationally and to deal effectively with his environment."
•Wechsler believed that intelligence comprised specific elements that one could individually define and measure; however, these elements were interrelated—that is, not entirely independent.
•This view implies that intelligence comprises several specific interrelated functions or elements and that general intelligence results from the interplay of these elements.
*index- an index is created where two or more subtests are related to a basic underlying skill.
•On the WAIS-IV, the subtests are sorted into four indexes: (1) verbal comprehension, (2) perceptual reasoning, (3) working memory, and (4) processing speed.
*Information Subtest- items appear in order of increasing difficulty. Item 6 asks something like, "Name two people who have been generals in the U.S. Army" or "How many members are there in the U.S. Congress?" Like all Wechsler subtests, the Information Subtest involves both intellective and nonintellective components, including the abilities to comprehend instructions, follow directions, and provide a response.
*Comprehension Subtest- measures judgment in everyday practical situations, or common sense. It has three types of questions:
•The first asks the subject what should be done in a given situation, as in, "What should you do if you find an injured person lying in the street?"
•The second asks the subject to provide a logical explanation for some rule or phenomenon, as in, "Why do we bury the dead?"
•The third asks the subject to define proverbs such as, "A journey of 1000 miles begins with the first step."
*Letter–Number Sequencing Subtest- used as a supplement for additional information about the person's intellectual functioning. It is made up of items in which the individual is asked to reorder lists of numbers and letters. For example, Z, 3, B, 1, 2, A would be reordered as 1, 2, 3, A, B, Z. This subtest is related to working memory and attention.
*Digit Symbol–Coding Subtest- requires the subject to copy symbols. It measures such factors as ability to learn an unfamiliar task, visual-motor dexterity, degree of persistence, and speed of performance.
*Vocabulary Subtest- the ability to define words is not only one of the best single measures of intelligence but also the most stable. Vocabulary tests appear on nearly every individual test that involves verbal intelligence.
*Block Design Subtest- provides an excellent measure of nonverbal concept formation, abstract thinking, and neurocognitive impairment. The subject must arrange blocks to reproduce increasingly difficult designs. This subtest requires the subject to reason, analyze spatial relationships, and integrate visual and motor functions; the input information (i.e., pictures of designs) is visual, but the response (output) is motor.
*Matrix Reasoning Subtest- a good measure of information-processing and abstract-reasoning skills; a core subtest in the perceptual reasoning index, added in an effort to enhance the assessment of fluid intelligence, which involves our ability to reason.
*WAIS-IV- follows a hierarchical model with general intelligence (FSIQ) at the top. The index scores form the next level, with the subtests providing the base.
•In the Matrix Reasoning Subtest, the subject is presented with nonverbal, figural stimuli, and
the task is to identify a pattern or relationship between the stimuli.
*Symbol Search Subtest- added in recognition of the role of speed of information processing in intelligence.
•The subject is shown two target geometric figures. The task is then to search among a set of five additional figures and determine whether the target appears in the search group.
•The faster a subject performs this task, the faster his or her information-processing speed is taken to be.
=From Raw Scores to Scaled and Index Scale Scores=
•Each subtest produces a raw score—that is, a total number of points—and has a different maximum total.
•To compare scores on individual subtests, raw scores can be converted to standard or scaled scores with a mean of 10 and a standard deviation of 3.
•Each of the four index scores is normalized to have a mean of 100 and a standard deviation of 15.
•The four composite index scales are then derived by summing the core subtest scores. (A computational sketch appears at the end of this section.)
*inferential norming- used in deriving subtest scaled scores for the WAIS-IV. A variety of statistical indexes or "moments," such as means and standard deviations, were calculated for each of the 13 age groups of the stratified normative sample.
*Working memory- refers to the information that we actively hold in our minds, in contrast to our stored knowledge, or long-term memory (one of the most important innovations on the modern WAIS).
*FSIQ- represents a measure of general intelligence and follows the same principles as the index scores. It is obtained by summing the age-corrected scaled scores of all four index composites. Again, a deviation IQ with a mean of 100 and a standard deviation of 15 is obtained.
=Psychometric Properties of the Wechsler Adult Scale=
*Standardization- the sample consisted of a stratified sample of 2,200 adults divided into 13 age groups from 16:0 through 90:11, as well as 13 specialty groups.
*Reliability- when the split-half method is used for all subtests except the speeded tests (Digit Symbol–Coding and Symbol Search), the typical average coefficients across age levels are .98 for the FSIQ, .96 for the verbal comprehension index, .95 for the perceptual reasoning index, .94 for the working memory index, and .90 for the processing speed index.
*Validity- the validity of the WAIS-IV rests heavily on its correlation with earlier versions of the test. The Wechsler tests are nonetheless considered among the most valid in the world today for measuring IQ.
*Wechsler adult scale- extensively used as a measure of adult intelligence. The scale is well constructed, and its primary measures—the four index components and the full-scale IQ (FSIQ)—are highly reliable.
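The conversions described in this section are simple standardizations, and split-half coefficients like those above are conventionally stepped up to full test length with the Spearman-Brown formula. A minimal sketch: the norm means and SDs below are invented for illustration, and real Wechsler scoring uses age-group norm tables rather than a formula like this.

```python
# Raw score -> scaled score (mean 10, SD 3), then a sum of scaled scores
# -> index/IQ metric (mean 100, SD 15). Normative values are invented.

def scaled_score(raw, norm_mean, norm_sd):
    z = (raw - norm_mean) / norm_sd        # standardize against norms
    return round(10 + 3 * z)               # rescale to mean 10, SD 3

def index_score(sum_scaled, sum_mean, sum_sd):
    z = (sum_scaled - sum_mean) / sum_sd
    return round(100 + 15 * z)             # deviation-IQ metric

def spearman_brown(r_half):
    """Step a split-half correlation up to full test length."""
    return 2 * r_half / (1 + r_half)

subtests = [scaled_score(34, 30, 5), scaled_score(22, 25, 4)]
print(subtests)                              # e.g., [12, 8]
print(index_score(sum(subtests), 20, 4.8))   # e.g., 100
print(round(spearman_brown(0.90), 2))        # 0.95
```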
=Downward Extensions of the WAIS-IV: The WISC-V and the WPPSI-IV=
*WISC-V- measures intelligence from ages 6 through 16 years, 11 months; a 21st-century test.
•The WISC-V can be administered and scored with two coordinated iPads, one for the examiner and one for the subject being tested. According to the test authors, administration is faster and more efficient. Scores can then be forwarded for interpretation, and even report generation, to a Web-based platform called "Q-global scoring and reporting."
•The test relies heavily on speed of response, based on findings that faster responding is associated with higher ability for most tasks.
•At the top of the hierarchy is the FSIQ; next are the five indexes, or primary scores.
*The five indexes- (1) verbal comprehension, (2) visual spatial, (3) fluid reasoning (ability to abstract), (4) working memory, and (5) processing speed.
•Each index is associated with at least two subtest scores. To enhance assessment, there are five ancillary scales, each based on two or more subtests: quantitative reasoning, auditory working memory, nonverbal, general ability, and cognitive processing.
•Finally, there are three "complementary" scales: naming speed, symbol translation, and storage and retrieval.
•(To show how the WISC-V provides insight into academic performance, various groups were targeted and carefully studied, including groups with specific learning disabilities, attention-deficit/hyperactivity disorder (ADHD), traumatic brain injury, and autism spectrum disorders.)
*WPPSI- a scale for children 4 to 6 years of age, based on the familiar hierarchical model: general mental ability, or g, is at the top and is reflected in the full-scale IQ.
•There are three group factors, represented by index or primary scores: (1) verbal comprehension, (2) visual spatial, and (3) working memory.
•Each of the indexes is composed of two or more subtest scores.
*WPPSI-IV- measures intelligence in children from 2.5 to 7 years, 7 months. It is more flexible than its predecessors, giving the test user the option of using more or fewer subtests depending on how complete an evaluation is needed and how young the child is. (It is compatible with measures of adaptive functioning and achievement.)