Glossary of EPPP Testing

Def: reliability
Repeatable and consistent

Free from error
Reflects 'true score'
Def: validity
Measures what it says it does
Def: power test
Assesses the highest level of difficulty the examinee can attain

No time limit
Graduated difficulty
Qs that everyone can do
Qs that no one can do

Eg: WAIS information subtest
Def: ipsative measures
Scores reported in terms of relative strength within the individual

Preference is expressed for one item over another
Def: mastery test
Cutoff for predetermined level of performance
Def: normative measures
Absolute strength measured
All items answered
Comparison among people possible
Range and interpretation of a reliability coefficient
0 (unreliable)
1 (perfectly reliable)

.9 means 90% of the variance in obtained scores is true score variance

You do NOT square a reliability coefficient
Factors affecting reliability coefficient
Anything reducing the range of obtained scores (eg a homogeneous population)

Anything increasing measurement error

Short (vs long) tests
Presence of floor or ceiling effects
High probability of guessing a correct answer
Factors affecting test-retest reliability
Difference in conditions
Practice effects
Measures of internal consistency
Split-half: divide test in 2 and correlate scores on the subtests; sensitive to selection strategy

Coefficient alpha: used when items have more than two scored response options (eg Likert-type ratings)

Kuder-Richardson Formula 20 (KR-20) used for questions with dichotomous answers

Reliability increases with item homogeneity
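A minimal sketch of the two internal consistency coefficients named above. The function names and the response data are hypothetical, and the sketch assumes population (not sample) variances, under which coefficient alpha on 0/1 items gives the same value as KR-20.

```python
from statistics import pvariance

def coefficient_alpha(scores):
    """scores: one list of item scores per examinee (rows = examinees)."""
    k = len(scores[0])  # number of items
    item_vars = [pvariance([person[i] for person in scores]) for i in range(k)]
    total_var = pvariance([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def kr20(scores):
    """KR-20 for dichotomous (0/1) items: item variance is p * (1 - p)."""
    n = len(scores)
    k = len(scores[0])
    p = [sum(person[i] for person in scores) / n for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    total_var = pvariance([sum(person) for person in scores])
    return (k / (k - 1)) * (1 - sum_pq / total_var)

# Hypothetical data: 5 examinees, 4 right/wrong items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
kr20_value = kr20(responses)
alpha_value = coefficient_alpha(responses)
```

Both functions divide by the variance of total scores, so they fail on data where every examinee has the same total.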
Utility of internal consistency measures
Measurement of unstable traits

Not good for speed tests
Sensitive to item content / sampling
Appropriate measure of speed test reliability
Alternate forms
Measure of inter-rater reliability
Kappa coefficient
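The glossary names kappa without giving a formula. A minimal sketch of Cohen's kappa for two raters, kappa = (observed agreement - chance agreement) / (1 - chance agreement), follows. The function name and the ratings are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    # Proportion of cases where the raters chose the same category
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category proportions
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of 10 observations into two categories
rater_a = ["Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N", "N"]
rater_b = ["Y", "Y", "Y", "Y", "N", "N", "N", "N", "N", "Y"]
kappa = cohens_kappa(rater_a, rater_b)  # observed .8, chance .5
```

The sketch is undefined when chance agreement equals 1 (both raters use a single category throughout).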
Factors improving inter-rater reliability
Well trained raters
Explicit observation of the raters
Mutually exclusive and exhaustive scoring categories
Def: interval recording
The observation period is divided into equal intervals and the observer records whether the behavior occurred in each interval
Def: standard error of measurement
How much error is expected from an individual test score
Formula: standard error of measurement *
SE = SD * square root of (1-r)

where r = the reliability coefficient which ranges from 0 to 1
Use: standard error of measurement
Construction of a confidence interval
Probability of scores falling within a specified confidence interval
68% +/- 1 SE
95% +/- 1.96 SE
99% +/- 2.58 SE
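A minimal sketch of the formula SE = SD * square root of (1-r) and the confidence bands listed above. The function names and the example values (SD 15, reliability .91, obtained score 110) are hypothetical.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SE = SD * sqrt(1 - r), where r is the reliability coefficient."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)

def confidence_interval(observed, sd, reliability, z=1.96):
    """Band of +/- z standard errors around an obtained score."""
    se = standard_error_of_measurement(sd, reliability)
    return observed - z * se, observed + z * se

# Hypothetical example: SD = 15, r = .91, so SE = 15 * sqrt(.09) = 4.5
se = standard_error_of_measurement(15, 0.91)
low_95, high_95 = confidence_interval(110, 15, 0.91)          # +/- 1.96 SE
low_68, high_68 = confidence_interval(110, 15, 0.91, z=1.0)   # +/- 1 SE
```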
Use: eta *
Correlation between continuous variables that have a non-linear relationship
Def: types of criterion related validity
Concurrent validity
Predictor and criterion scores collected at the same time
Useful for diagnostic tests

Predictive validity
Predictor scores collected first, criterion scores collected later
Useful for eg job selection tests
Factors affecting criterion related validity
Restricted range of scores
Unreliability of predictor or criterion
Criterion contamination
Def: criterion contamination
Occurs when the person assessing the criterion knows the individual's predictor score
Def: convergent/divergent analysis
Convergent validity is high correlation between different measures of same construct

Divergent validity is low correlation between measures of different constructs
Relationship between reliability and validity
The criterion-related validity coefficient cannot exceed the square root of the predictor's reliability coefficient

Reliability coefficient sets a ceiling on the validity coefficient
Def: face validity
Appearance of validity to test takers, administrators and other untrained people
Def: criterion related validity coefficient
Pearson r correlation between predictor and criterion

acceptable range is +/- .3 to .6
Differences between standard error of measurement and standard error of estimate
Standard error of measurement
related to reliability coefficient
used to estimate true score on a given test

Standard error of estimate
Determines where a criterion will fall given a predictor
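The glossary gives no formula for the standard error of estimate. The usual one is SE(est) = SD of the criterion * square root of (1 - validity coefficient squared). A minimal sketch follows, with a hypothetical function name and hypothetical values. Note that here the coefficient is squared, unlike the reliability coefficient in the standard error of measurement.

```python
import math

def standard_error_of_estimate(sd_criterion, validity):
    """SE(est) = SD_y * sqrt(1 - r_xy ** 2), r_xy = validity coefficient."""
    if not -1.0 <= validity <= 1.0:
        raise ValueError("validity coefficient must be between -1 and 1")
    return sd_criterion * math.sqrt(1.0 - validity ** 2)

# Hypothetical example: criterion SD = 10, validity = .6
# 10 * sqrt(1 - .36) = 10 * .8 = 8
se_est = standard_error_of_estimate(10, 0.6)
```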
Def: shrinkage
Reduction in validity coefficient on cross-validation (revalidation with a second sample)

A result of noise in original sample
Factors affecting shrinkage
Small original validation sample
Large original item pool
Relative number of items retained is small
Items not sensibly chosen
Def: construct validity
Extent to which a test successfully measures an unobservable, abstract concept such as intelligence
Techniques for assessing construct validity
Convergent validity techniques
High correlation on a trait even with different methods

Divergent / discriminant validity techniques
Low correlation on different traits even with the same method

Factor analysis
Def: factor loading
Correlation between a given test and a factor derived from a factor analysis

Can be squared to give the % of variance shared by the test and the factor
Def: communality (factor analysis)
The proportion of variance of a test accounted for by the factors

Sum of the squared factor loadings
Interpreted directly, ie .4 = 40%

Only valid when factors are orthogonal
Def: unique variance (factor analysis)
Variance not accounted for by the factors

u² = 1 - h², where h² is the communality
Def: eigenvalue
explained variance
= Sum of the squares of the loadings

sum of the eigenvalues <= number of tests

Applied to unrotated factors only
Formula to convert eigenvalue to %
= eigenvalue * 100 / number of tests
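A minimal sketch tying together communality, unique variance, eigenvalues and the % conversion. The loading matrix (4 tests, 2 orthogonal factors) and all names are hypothetical. Squared loadings are summed across a row for a test's communality and down a column for a factor's eigenvalue.

```python
# Hypothetical loadings: each test maps to [loading on factor 1, factor 2]
loadings = {
    "Test A": [0.8, 0.2],
    "Test B": [0.6, 0.3],
    "Test C": [0.5, 0.7],
    "Test D": [0.5, 0.1],
}

def communality(test_loadings):
    """h2: sum of a test's squared loadings (orthogonal factors only)."""
    return sum(value ** 2 for value in test_loadings)

def unique_variance(test_loadings):
    """u2 = 1 - h2."""
    return 1 - communality(test_loadings)

def eigenvalue(all_loadings, factor_index):
    """Sum of the squared loadings of every test on one factor."""
    return sum(row[factor_index] ** 2 for row in all_loadings.values())

def percent_of_variance(eigen, number_of_tests):
    """eigenvalue * 100 / number of tests."""
    return eigen * 100 / number_of_tests

h2_a = communality(loadings["Test A"])        # .64 + .04
u2_a = unique_variance(loadings["Test A"])
eigen_1 = eigenvalue(loadings, 0)             # .64 + .36 + .25 + .25
pct_1 = percent_of_variance(eigen_1, len(loadings))
```

With these numbers the eigenvalues sum to 2.13, below the number of tests (4), and that sum matches the sum of the four communalities.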
Types of rotation (factor analysis) *
Orthogonal - uncorrelated
Oblique - correlated

Choice depends on what you believe the relationship is among the factors
Differences between principal components analysis and factor analysis
In principal components analysis:

Factors are always uncorrelated
Variance = explained + error

In factor analysis:
variance = common + specific + error
Use: cluster analysis
Categorize or taxonomize a set of objects
Differences between cluster analysis and factor analysis
Cluster analysis
all types of data
clusters interpreted as categories

Factor analysis
interval or ratio data only
factors interpreted as underlying constructs
Def: correction for attenuation
Estimate of how much more valid a predictor would be if it and the criterion were perfectly reliable
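A minimal sketch of the standard correction for attenuation, corrected r = observed validity / square root of (predictor reliability * criterion reliability), together with the ceiling that the predictor's reliability places on validity. The function names and the values are hypothetical.

```python
import math

def max_validity(predictor_reliability):
    """Ceiling on the validity coefficient: sqrt of predictor reliability."""
    return math.sqrt(predictor_reliability)

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated validity if predictor and criterion were perfectly reliable.

    r_xy: observed validity coefficient
    r_xx: predictor reliability, r_yy: criterion reliability
    """
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical example: observed validity .36, reliabilities .81 and .64
ceiling = max_validity(0.81)                            # sqrt(.81) = .9
corrected = correct_for_attenuation(0.36, 0.81, 0.64)   # .36 / .72 = .5
```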
Def: content validity
Adequate sampling of relevant content domain
To reduce the number of false positives...
Raise the predictor cutoff
and / or
Lower the criterion cutoff
Def: false negative
Predicted not to meet a criterion but in reality does
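A minimal sketch of how the two cutoffs change the counts of false positives and false negatives. The scores, cutoffs and names are hypothetical. A case counts as predicted positive when the predictor score is at or above its cutoff, and as an actual positive when the criterion score is at or above its cutoff.

```python
def classify(predictor, criterion, predictor_cutoff, criterion_cutoff):
    """Count selection outcomes for paired predictor and criterion scores."""
    counts = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for x, y in zip(predictor, criterion):
        predicted = x >= predictor_cutoff   # predicted to meet the criterion
        actual = y >= criterion_cutoff      # actually met the criterion
        if predicted and actual:
            counts["true_pos"] += 1
        elif predicted:
            counts["false_pos"] += 1
        elif actual:
            counts["false_neg"] += 1
        else:
            counts["true_neg"] += 1
    return counts

# Hypothetical paired scores for 8 people
predictor = [40, 50, 55, 60, 65, 70, 80, 90]
criterion = [30, 60, 45, 55, 40, 70, 75, 85]

baseline = classify(predictor, criterion, 50, 50)
raised_predictor = classify(predictor, criterion, 70, 50)
lowered_criterion = classify(predictor, criterion, 50, 40)
```

In this data, raising the predictor cutoff removes both false positives but turns two correctly selected people into false negatives, which is the usual trade-off.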
Def: item difficulty or difficulty index *
% of examinees answering correctly

an ordinal value, because an item with an index of .2 is not necessarily half the difficulty of an item with an index of .4
Def: item discriminability
Degree to which an item differentiates between low and high scorers

D = % of high scorers answering correctly minus % of low scorers answering correctly

range from 100 to -100
moderate difficulty optimal
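A minimal sketch of the difficulty index and of D. The response data and names are hypothetical, and how the high and low scoring groups are formed (for example, by splitting on total score) is left as an assumption.

```python
def item_difficulty(item_responses):
    """Proportion of examinees answering the item correctly (1 = correct)."""
    return sum(item_responses) / len(item_responses)

def item_discrimination(high_group, low_group):
    """D: % correct in the high group minus % correct in the low group."""
    return 100 * (item_difficulty(high_group) - item_difficulty(low_group))

# Hypothetical responses to one item, already split by total test score
high_scorers = [1, 1, 1, 1, 0]   # 80% correct
low_scorers = [1, 0, 0, 0, 0]    # 20% correct

p = item_difficulty(high_scorers + low_scorers)       # 5 of 10 correct
d = item_discrimination(high_scorers, low_scorers)    # 80 - 20
```

A negative D would mean low scorers answered the item correctly more often than high scorers.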
Target values for item difficulty by objective
.5 for most tests
.25 for high cutoff (matching selection %)
.8 or .9 for mastery
half way between chance and 1, eg t/f exams would be .75
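The "half way between chance and 1" rule can be sketched as below. The function name is hypothetical, and chance is assumed to be 1 divided by the number of answer options.

```python
def optimal_difficulty(number_of_options):
    """Target difficulty half way between the chance level and 1."""
    chance = 1 / number_of_options
    return (chance + 1) / 2

true_false_target = optimal_difficulty(2)    # (.5 + 1) / 2
four_option_target = optimal_difficulty(4)   # (.25 + 1) / 2
```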
Relationship between item difficulty and discriminability
Difficulty creates a ceiling for discriminability

Difficulty of .5 creates maximum discriminability

The greater the mean discriminability the greater the reliability
What can you determine from an item response (aka item characteristic) curve?
Difficulty: the ability level at which p(correct response) = .5

Discriminability: slope of the curve; the steeper the slope, the more discriminating the item

Probability of a correct guess: the curve's intersection with the y axis
Def: computer adaptive assessment
Computerized selection of test items based on periodic estimates of ability
What are the advantages of a test item of moderate difficulty (p = .5)
Increases variability which increases reliability and validity

Maximally differentiates between low and high scorers
Techniques for assessing an item's discriminability
Correlation with
total score
an external criterion
What are the mean and std deviation for the following standard scores: z, t, stanine and deviation IQ?
score          mean   SD
z              0      1
t              50     10
stanine        5      ~2
deviation IQ   100    15
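A minimal sketch converting a z score to the other standard scores using each scale's mean and SD (stanines are centered on 5 with an SD of about 2). The function names are hypothetical, and the stanine rule is a common approximation that rounds 5 + 2z and clips the result to the 1 to 9 range.

```python
import math

def t_score(z):
    """T score: mean 50, SD 10."""
    return 50 + 10 * z

def deviation_iq(z):
    """Deviation IQ: mean 100, SD 15."""
    return 100 + 15 * z

def stanine(z):
    """Approximate stanine: round 5 + 2z half up, then clip to 1..9."""
    return min(9, max(1, math.floor(2 * z + 5.5)))

# One SD above the mean on each scale
t_plus_one = t_score(1.0)
iq_plus_one = deviation_iq(1.0)
stanine_plus_one = stanine(1.0)
```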
The difference between norm-referenced and criterion referenced scores
Norm referenced is a comparison to others in a sample

Criterion referenced is a comparison against an external criterion (a predetermined standard)
Characteristics of alternate forms reliability coefficient
Considered the best, because a high coefficient requires consistency across both time and content

Likely to have a lower magnitude than other coefficients
Def: moderator variable
Variables affecting validity of a test

A moderator variable confers differential validity on the test
Def: 'testing the limits' in dynamic assessment
Following a standardized test, using hints to elicit correct performance. The more hints necessary, the more severe the learning disability
Contents of the Mental Measurements Yearbook
Target population
Administrative time
Critical reviews
Effect on the floor of adding easy questions to a test *
Will lower the floor (the test can then distinguish among examinees of lower ability)
Def: dynamic assessment
Variety of procedures following on standardized testing to get further information, usually used with learning disability or intellectual disability