EPPP Testing
Terms
copy deck
- Def: reliability
-
Repeatable and consistent
Free from error
Reflects 'true score'
- Def: validity
- Measures what it says it does
- Def: power test
-
Assesses the attainable level of difficulty
No time limit
Graduated difficulty
Qs that everyone can do
Qs that no one can do
Eg: WAIS information subtest
- Def: ipsative measures
-
Scores reported in terms of relative strength within the individual
Preference is expressed for one item over another
- Def: mastery test
- Cutoff for predetermined level of performance
- Def: normative measures
-
Absolute strength measured
All items answered
Comparison among people possible
- Range and interpretation of a reliability coefficient
-
0 (unreliable)
to
1 (perfectly reliable)
.9 means 90% of the variance accounted for
You do NOT square a reliability coefficient
- Factors affecting reliability coefficient
-
Anything reducing the range of obtained scores (eg a homogeneous population)
Anything increasing measurement error
Short (vs long) tests
Presence of floor or ceiling effects
High probability of guessing a correct answer
- Factors affecting test-retest reliability
-
Maturation
Difference in conditions
Practice effects
- Measures of internal consistency
-
Split-half: divide test in 2 and correlate scores on the subtests; sensitive to selection strategy
Coefficient alpha: used with items scored on more than two values (eg Likert-type items)
Kuder-Richardson Formula 20 (KR-20): used for questions with dichotomous answers
Reliability increases with item homogeneity
- Utility of internal consistency measures
-
Appropriate for unstable traits (where test-retest is not)
Not good for speed tests
Sensitive to item content / sampling
- Appropriate measure of speed test reliability
-
Test-retest
Alternate forms
- Measure of inter-rater reliability
- Kappa coefficient
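The kappa coefficient corrects observed rater agreement for the agreement expected by chance. A minimal sketch, using made-up ratings from two hypothetical raters:

```python
# Cohen's kappa = (p_observed - p_expected) / (1 - p_expected).
# Ratings below are illustrative, not from any real study.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Observed agreement: proportion of cases where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]
print(round(cohens_kappa(a, b), 3))  # → 0.5
```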
- Factors improving inter-rater reliability
-
Well trained raters
Explicit observation of the raters
Mutually exclusive and exhaustive scoring categories
- Def: interval recording
- Recording whether the behavior occurs within each of a series of specified time intervals
- Def: standard error of measurement
- How much error is expected from an individual test score
- Formula: standard error of measurement *
-
SE = SD * sqrt(1 - r)
where r = the reliability coefficient, which ranges from 0 to 1
- Use: standard error of measurement
- Construction of a confidence interval
- Probability of scores falling within a specified confidence interval
-
68% +/- 1 SE
95% +/- 1.96 SE
99% +/- 2.58 SE
- Use: eta *
- Correlation of continuous non-linear variables
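The SEM formula and the confidence-interval multipliers above combine into a short sketch; the SD, reliability, and observed score below are illustrative (a WAIS-like SD of 15 with reliability .91 gives a SEM of 4.5):

```python
# SEM = SD * sqrt(1 - r); CI = observed score +/- z * SEM.
# Values are illustrative, not tied to any specific test manual.
import math

def sem(sd, r):
    return sd * math.sqrt(1 - r)

def confidence_interval(score, sd, r, z=1.96):  # z=1.96 -> 95% CI
    se = sem(sd, r)
    return (score - z * se, score + z * se)

se = sem(15, 0.91)                               # 15 * sqrt(.09) = 4.5
low, high = confidence_interval(100, 15, 0.91)   # 95% CI around 100
print(round(se, 2), round(low, 2), round(high, 2))
```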
- Def: types of criterion related validity
-
Concurrent validity
Predictor and criterion scores collected at the same time
Useful for diagnostic tests
Predictive validity
Predictor scores collected before the criterion
Useful for eg job selection tests
- Factors affecting criterion related validity
-
Restricted range of scores
Unreliability of predictor or criterion
Regression
Criterion contamination
- Def: criterion contamination
- Occurs when person assessing criterion knows predictor for an individual
- Def: convergent/divergent analysis
-
Convergent validity is high correlation between different measures of the same construct
Divergent validity is low correlation between measures measuring different constructs
- Relationship between reliability and validity
-
The criterion-related validity coefficient cannot exceed the square root of the predictor's reliability coefficient
Reliability coefficient sets a ceiling on the validity coefficient
- Def: face validity
- Appearance of validity to test takers, administrators and other untrained people
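The ceiling that reliability places on validity (the validity coefficient cannot exceed the square root of the predictor's reliability) works out numerically as follows; the reliability value is illustrative:

```python
# Maximum attainable validity = sqrt(predictor reliability).
# The .81 reliability is an illustrative value.
import math

def max_validity(predictor_reliability):
    return math.sqrt(predictor_reliability)

# A predictor with reliability .81 can have a validity coefficient
# of at most .90, no matter how good the criterion is.
print(round(max_validity(0.81), 2))  # → 0.9
```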
- Def: criterion related validity coefficient
-
Pearson r correlation between predictor and criterion
acceptable range is +/- .3 to .6
- Differences between standard error of measurement and standard error of estimate
-
Standard error of measurement
related to the reliability coefficient
used to estimate true score on a given test
Standard error of estimate
determines where a criterion score will fall given a predictor score
- Def: shrinkage
-
Reduction in validity coefficient on cross-validation (revalidation with a second sample)
A result of noise in original sample
- Factors affecting shrinkage
-
Small original validation sample
Large original item pool
Relative number of items retained is small
Items not sensibly chosen
- Def: construct validity
- Extent to which a test successfully measures an unobservable, abstract concept such as IQ
- Techniques for assessing construct validity
-
Convergent validity techniques
High correlation on a trait even with different methods
Divergent / discriminant validity techniques
Low correlation on different traits even with the same method
Factor analysis
- Def: factor loading
-
Correlation between a given test and a factor derived from a factor analysis
Can be squared to give the % of the test's variance accounted for by the factor
- Def: communality (factor analysis)
-
The proportion of variance of a test accounted for by the factors
Sum of the squared factor loadings
Interpreted directly, ie .4 = 40%
Only valid when factors are orthogonal
- Def: unique variance (factor analysis)
-
Variance not accounted for by the factors
u2 = 1 - h2, where h2 is the communality
- Def: eigenvalue
-
Explained variance of a factor
= sum of the squared loadings on that factor, across all tests
Sum of the eigenvalues <= number of tests
Applied to unrotated factors only
- Formula to convert eigenvalue to %
- = eigenvalue * 100 / number of tests
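The factor-analysis quantities above (communality, unique variance, eigenvalue, and the eigenvalue-to-% conversion) can all be computed from a loading matrix. A sketch with a made-up two-factor, three-test matrix, assuming orthogonal factors:

```python
# Illustrative loading matrix: rows = tests, columns = orthogonal factors.
loadings = [
    [0.8, 0.3],   # test 1
    [0.6, 0.5],   # test 2
    [0.2, 0.7],   # test 3
]
n_tests = len(loadings)
n_factors = len(loadings[0])

# Communality h2: sum of squared loadings across a test's row.
communalities = [sum(l * l for l in row) for row in loadings]
# Unique variance: u2 = 1 - h2.
unique = [1 - h2 for h2 in communalities]
# Eigenvalue of a factor: sum of squared loadings down its column.
eigenvalues = [sum(row[j] ** 2 for row in loadings) for j in range(n_factors)]
# Percent of total variance: eigenvalue * 100 / number of tests.
pct = [ev * 100 / n_tests for ev in eigenvalues]

print([round(h, 2) for h in communalities])   # per-test h2
print([round(p, 1) for p in pct])             # per-factor % variance
```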
- Types of rotation (factor analysis) *
-
Orthogonal - uncorrelated
Oblique - correlated
Choice depends on what you believe the relationship is among the factors
- Differences between principal components analysis and factor analysis
-
In principal components analysis:
Factors are always uncorrelated
Variance = explained + error
In factor analysis:
variance = common + specific + error
- Use: cluster analysis
- Categorize or taxonomize a set of objects
- Differences between cluster analysis and factor analysis
-
Cluster analysis
all types of data
clusters interpreted as categories
Factor analysis
interval or ratio data only
factors interpreted as underlying constructs
- Def: correction for attenuation
- Estimate of how much more valid a predictor would be if it and the criterion were perfectly reliable
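The correction for attenuation divides the observed validity coefficient by the square root of the product of the predictor's and criterion's reliabilities. A sketch with illustrative values:

```python
# Corrected validity = r_xy / sqrt(r_xx * r_yy).
# r_xy = observed validity, r_xx / r_yy = reliabilities (illustrative).
import math

def corrected_validity(r_xy, r_xx, r_yy):
    return r_xy / math.sqrt(r_xx * r_yy)

# Observed validity of .40 with both reliabilities at .80
# corrects upward to .50 under perfect reliability.
print(round(corrected_validity(0.4, 0.8, 0.8), 2))  # → 0.5
```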
- Def: content validity
- Adequate sampling of the relevant content domain
- To reduce the number of false positives...
-
Raise the predictor cutoff
and / or
Lower the criterion cutoff
- Def: false negative
- Predicted not to meet a criterion but in reality does
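The cutoff logic above can be checked by counting decision outcomes; the predictor/criterion score pairs and cutoffs below are made up for illustration. Raising the predictor cutoff removes false positives (at the cost of selecting fewer people):

```python
# Count decision outcomes given predictor and criterion cutoffs.
# All scores and cutoffs are illustrative.
def outcomes(pairs, predictor_cutoff, criterion_cutoff):
    tp = fp = fn = tn = 0
    for pred, crit in pairs:
        selected = pred >= predictor_cutoff
        succeeded = crit >= criterion_cutoff
        if selected and succeeded:
            tp += 1   # true positive
        elif selected and not succeeded:
            fp += 1   # false positive: selected but fails criterion
        elif not selected and succeeded:
            fn += 1   # false negative: rejected but would have succeeded
        else:
            tn += 1   # true negative
    return tp, fp, fn, tn

data = [(55, 70), (80, 85), (65, 50), (90, 95), (40, 60), (75, 40)]
print(outcomes(data, 60, 65))  # baseline predictor cutoff
print(outcomes(data, 80, 65))  # raised predictor cutoff: fewer FPs
```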
- Def: item difficulty or difficulty index *
-
% of examinees answering correctly
an ordinal value: an item with an index of .2 is not necessarily twice as difficult as one with an index of .4
- Def: item discriminability
-
Degree to which an item differentiates between low and high scorers
D = difference between high and low % correctly answered
ranges from -100 to +100
moderate difficulty optimal
- Target values for item difficulty by objective
-
.5 for most tests
.25 for high cutoff (matching selection %)
.8 or .9 for mastery
halfway between chance and 1.0, eg a true/false exam would be .75
- Relationship between item difficulty and discriminability
-
Difficulty creates a ceiling for discriminability
Difficulty of .5 creates maximum discriminability
The greater the mean discriminability, the greater the reliability
- What can you determine from an item response (aka item characteristic) curve?
-
Difficulty
point where p(correct response) = .5
Discriminability
slope of the curve; the steeper the slope, the more discriminable the item
Probability of a correct guess
intersection with the y axis
- Def: computer adaptive assessment
- Computerized selection of test items based on periodic estimates of ability
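The difficulty and discriminability indices above can be sketched from a 0/1 response vector; the responses and the high/low split below are illustrative:

```python
# Item difficulty p = proportion correct; discrimination
# D = p(high group correct) - p(low group correct), range -1 to +1.
# The response vector and group split are illustrative.
def difficulty(item_responses):
    return sum(item_responses) / len(item_responses)

def discrimination(high_group, low_group):
    return difficulty(high_group) - difficulty(low_group)

item = [1, 1, 1, 0, 1, 0, 0, 0]  # one item, examinees sorted by total score
high = item[:4]                  # top half of examinees
low = item[4:]                   # bottom half of examinees
print(difficulty(item), discrimination(high, low))  # → 0.5 0.5
```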
- What are the advantages of a test item of moderate difficulty (p = .5)
-
Increases variability which increases reliability and validity
Maximally differentiates between low and high scorers
- Techniques for assessing an item's discriminability
-
Correlation with
total score
an external criterion
- What are the mean and std deviation for the following standard scores: z, t, stanine and deviation IQ?
-
              mean   SD
z                0    1
t               50   10
stanine          5   ~2
deviation IQ   100   15
- The difference between norm-referenced and criterion-referenced scores
-
Norm referenced is a comparison to others in a sample
Criterion referenced measures against an external criterion
- Characteristics of alternate forms reliability coefficient
-
Considered the best estimate: to be high, scores must be consistent across both time and content
Likely to have a lower magnitude than other coefficients
- Def: moderator variable
-
Variables affecting validity of a test
A moderator variable confers differential validity on the test
- Def: 'testing the limits' in dynamic assessment
- Following a standardized test, using hints to elicit correct performance. The more hints necessary, the more severe the learning disability
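The standard-score table above implies simple linear conversions from z; the raw score, mean, and SD below are illustrative:

```python
# Convert a raw score to z, then to t, deviation IQ, and stanine.
# Raw mean/SD are illustrative; the metric constants come from the table.
def z_score(raw, mean, sd):
    return (raw - mean) / sd

def to_t(z):
    return 50 + 10 * z            # t: mean 50, SD 10

def to_dev_iq(z):
    return 100 + 15 * z           # deviation IQ: mean 100, SD 15

def to_stanine(z):
    return max(1, min(9, round(5 + 2 * z)))  # mean 5, SD ~2, clamped 1-9

z = z_score(65, 50, 10)           # raw 65 on a test with mean 50, SD 10
print(z, to_t(z), to_dev_iq(z), to_stanine(z))  # → 1.5 65.0 122.5 8
```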
- Contents of the Mental Measurements Yearbook
-
Author
Publisher
Target population
Administrative time
Critical reviews
- Effect on the floor of adding easy questions to a test *
- Will lower (extend) the floor, improving discrimination among low scorers
- Def: dynamic assessment
- Variety of procedures following standardized testing to obtain further information, usually used with learning disability or retardation
- test theory