EPPP Testing
Terms
copy deck
- Def: reliability
-
Repeatable and consistent
Free from error
Reflects 'true score'
- Def: validity
- Measures what it says it does
- Def: power test
-
Assesses the attainable level of difficulty
No time limit
Graduated difficulty
Qs that everyone can do
Qs that no one can do
Eg: WAIS information subtest
- Def: ipsative measures
-
Scores reported in terms of relative strength within the individual
Preference is expressed for one item over another
- Def: mastery test
- Cutoff for predetermined level of performance
- Def: normative measures
-
Absolute strength measured
All items answered
Comparison among people possible
- Range and interpretation of a reliability coefficient
-
0 (unreliable)
to
1 (perfectly reliable)
.9 means 90% of the variance accounted for
You do NOT square a reliability coefficient
- Factors affecting reliability coefficient
-
Anything reducing the range of obtained scores (eg a homogeneous population)
Anything increasing measurement error
Short (vs long) tests
Presence of floor or ceiling effects
High probability of guessing a correct answer
- Factors affecting test-retest reliability
-
Maturation
Difference in conditions
Practice effects
- Measures of internal consistency
-
Split-half: divide test in 2 and correlate scores on the subtests; sensitive to selection strategy
Coefficient alpha: used with items scored on more than two values (eg Likert-type items)
Kuder-Richardson Formula 20 (KR-20): used for questions with dichotomous answers
Reliability increases with item homogeneity
- Utility of internal consistency measures
-
Appropriate for unstable traits (where test-retest is not)
Not good for speed tests
Sensitive to item content / sampling
- Appropriate measure of speed test reliability
-
Test-retest
Alternate forms
- Measure of inter-rater reliability
- Kappa coefficient
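The kappa coefficient corrects observed rater agreement for the agreement expected by chance. A minimal sketch, using made-up ratings from two hypothetical raters:

```python
# Cohen's kappa = (p_observed - p_expected) / (1 - p_expected).
# Ratings below are illustrative, not from any real study.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Observed agreement: proportion of cases where the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes"]
print(round(cohens_kappa(a, b), 3))  # → 0.5
```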
- Factors improving inter-rater reliability
-
Well trained raters
Explicit observation of the raters
Mutually exclusive and exhaustive scoring categories
- Def: interval recording
- Recording whether the behavior occurs within each of a series of specified time intervals
- Def: standard error of measurement
- How much error is expected from an individual test score
- Formula: standard error of measurement *
-
SE = SD * sqrt(1 - r)
where r = the reliability coefficient, which ranges from 0 to 1
- Use: standard error of measurement
- Construction of a confidence interval
- Probability of scores falling within a specified confidence interval
-
68% +/- 1 SE
95% +/- 1.96 SE
99% +/- 2.58 SE
- Use: eta *
- Correlation of continuous non-linear variables
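The SEM formula and the confidence-interval multipliers above combine into a short sketch; the SD, reliability, and observed score below are illustrative (a WAIS-like SD of 15 with reliability .91 gives a SEM of 4.5):

```python
# SEM = SD * sqrt(1 - r); CI = observed score +/- z * SEM.
# Values are illustrative, not tied to any specific test manual.
import math

def sem(sd, r):
    return sd * math.sqrt(1 - r)

def confidence_interval(score, sd, r, z=1.96):  # z=1.96 -> 95% CI
    se = sem(sd, r)
    return (score - z * se, score + z * se)

se = sem(15, 0.91)                               # 15 * sqrt(.09) = 4.5
low, high = confidence_interval(100, 15, 0.91)   # 95% CI around 100
print(round(se, 2), round(low, 2), round(high, 2))
```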
- Def: types of criterion related validity
-
Concurrent validity
Predictor and criterion scores collected at the same time
Useful for diagnostic tests
Predictive validity
Predictor scores collected before the criterion
Useful for eg job selection tests
- Factors affecting criterion related validity
-
Restricted range of scores
Unreliability of predictor or criterion
Regression
Criterion contamination
- Def: criterion contamination
- Occurs when person assessing criterion knows predictor for an individual
- Def: convergent/divergent analysis
-
Convergent validity is high correlation between different measures of the same construct
Divergent validity is low correlation between measures measuring different constructs
- Relationship between reliability and validity
-
The criterion-related validity coefficient cannot exceed the square root of the predictor's reliability coefficient
Reliability coefficient sets a ceiling on the validity coefficient
- Def: face validity
- Appearance of validity to test takers, administrators and other untrained people
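The ceiling that reliability places on validity (the validity coefficient cannot exceed the square root of the predictor's reliability) works out numerically as follows; the reliability value is illustrative:

```python
# Maximum attainable validity = sqrt(predictor reliability).
# The .81 reliability is an illustrative value.
import math

def max_validity(predictor_reliability):
    return math.sqrt(predictor_reliability)

# A predictor with reliability .81 can have a validity coefficient
# of at most .90, no matter how good the criterion is.
print(round(max_validity(0.81), 2))  # → 0.9
```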
- Def: criterion related validity coefficient
-
Pearson r correlation between predictor and criterion
acceptable range is +/- .3 to .6
- Differences between standard error of measurement and standard error of estimate
-
Standard error of measurement
related to the reliability coefficient
used to estimate true score on a given test
Standard error of estimate
determines where a criterion score will fall given a predictor score
- Def: shrinkage
-
Reduction in validity coefficient on cross-validation (revalidation with a second sample)
A result of noise in original sample
- Factors affecting shrinkage
-
Small original validation sample
Large original item pool
Relative number of items retained is small
Items not sensibly chosen
- Def: construct validity
- Extent to which a test successfully measures an unobservable, abstract concept such as IQ
- Techniques for assessing construct validity
-
Convergent validity techniques
High correlation on a trait even with different methods
Divergent / discriminant validity techniques
Low correlation on different traits even with the same method
Factor analysis
- Def: factor loading
-
Correlation between a given test and a factor derived from a factor analysis
Can be squared to give the % of the test's variance accounted for by the factor
- Def: communality (factor analysis)
-
The proportion of variance of a test accounted for by the factors
Sum of the squared factor loadings
Interpreted directly, ie .4 = 40%
Only valid when factors are orthogonal
- Def: unique variance (factor analysis)
-
Variance not accounted for by the factors
u2 = 1 - h2, where h2 is the communality
- Def: eigenvalue
-
Explained variance of a factor
= sum of the squared loadings on that factor, across all tests
Sum of the eigenvalues <= number of tests
Applied to unrotated factors only
- Formula to convert eigenvalue to %
- = eigenvalue * 100 / number of tests
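The factor-analysis quantities above (communality, unique variance, eigenvalue, and the eigenvalue-to-% conversion) can all be computed from a loading matrix. A sketch with a made-up two-factor, three-test matrix, assuming orthogonal factors:

```python
# Illustrative loading matrix: rows = tests, columns = orthogonal factors.
loadings = [
    [0.8, 0.3],   # test 1
    [0.6, 0.5],   # test 2
    [0.2, 0.7],   # test 3
]
n_tests = len(loadings)
n_factors = len(loadings[0])

# Communality h2: sum of squared loadings across a test's row.
communalities = [sum(l * l for l in row) for row in loadings]
# Unique variance: u2 = 1 - h2.
unique = [1 - h2 for h2 in communalities]
# Eigenvalue of a factor: sum of squared loadings down its column.
eigenvalues = [sum(row[j] ** 2 for row in loadings) for j in range(n_factors)]
# Percent of total variance: eigenvalue * 100 / number of tests.
pct = [ev * 100 / n_tests for ev in eigenvalues]

print([round(h, 2) for h in communalities])   # per-test h2
print([round(p, 1) for p in pct])             # per-factor % variance
```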
- Types of rotation (factor analysis) *
-
Orthogonal - uncorrelated
Oblique - correlated
Choice depends on what you believe the relationship is among the factors
- Differences between principal components analysis and factor analysis
-
In principal components analysis:
Factors are always uncorrelated
Variance = explained + error
In factor analysis:
variance = common + specific + error
- Use: cluster analysis
- Categorize or taxonomize a set of objects
- Differences between cluster analysis and factor analysis
-
Cluster analysis
all types of data
clusters interpreted as categories
Factor analysis
interval or ratio data only
factors interpreted as underlying constructs
- Def: correction for attenuation
- Estimate of how much more valid a predictor would be if it and the criterion were perfectly reliable
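The correction for attenuation divides the observed validity coefficient by the square root of the product of the predictor's and criterion's reliabilities. A sketch with illustrative values:

```python
# Corrected validity = r_xy / sqrt(r_xx * r_yy).
# r_xy = observed validity, r_xx / r_yy = reliabilities (illustrative).
import math

def corrected_validity(r_xy, r_xx, r_yy):
    return r_xy / math.sqrt(r_xx * r_yy)

# Observed validity of .40 with both reliabilities at .80
# corrects upward to .50 under perfect reliability.
print(round(corrected_validity(0.4, 0.8, 0.8), 2))  # → 0.5
```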
- Def: content validity
- Adequate sampling of the relevant content domain
- To reduce the number of false positives...
-
Raise the predictor cutoff
and / or
Lower the criterion cutoff
- Def: false negative
- Predicted not to meet a criterion but in reality does
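The cutoff logic above can be checked by counting decision outcomes; the predictor/criterion score pairs and cutoffs below are made up for illustration. Raising the predictor cutoff removes false positives (at the cost of selecting fewer people):

```python
# Count decision outcomes given predictor and criterion cutoffs.
# All scores and cutoffs are illustrative.
def outcomes(pairs, predictor_cutoff, criterion_cutoff):
    tp = fp = fn = tn = 0
    for pred, crit in pairs:
        selected = pred >= predictor_cutoff
        succeeded = crit >= criterion_cutoff
        if selected and succeeded:
            tp += 1   # true positive
        elif selected and not succeeded:
            fp += 1   # false positive: selected but fails criterion
        elif not selected and succeeded:
            fn += 1   # false negative: rejected but would have succeeded
        else:
            tn += 1   # true negative
    return tp, fp, fn, tn

data = [(55, 70), (80, 85), (65, 50), (90, 95), (40, 60), (75, 40)]
print(outcomes(data, 60, 65))  # baseline predictor cutoff
print(outcomes(data, 80, 65))  # raised predictor cutoff: fewer FPs
```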
- Def: item difficulty or difficulty index *
-
% of examinees answering correctly
an ordinal value: an item with an index of .2 is not necessarily twice as difficult as one with an index of .4
- Def: item discriminability
-
Degree to which an item differentiates between low and high scorers
D = difference between high and low % correctly answered
ranges from -100 to +100
moderate difficulty optimal
- Target values for item difficulty by objective
-
.5 for most tests
.25 for high cutoff (matching selection %)
.8 or .9 for mastery
halfway between chance and 1.0, eg a true/false exam would be .75
- Relationship between item difficulty and discriminability
-
Difficulty creates a ceiling for discriminability
Difficulty of .5 creates maximum discriminability
The greater the mean discriminability, the greater the reliability
- What can you determine from an item response (aka item characteristic) curve?
-
Difficulty
point where p(correct response) = .5
Discriminability
slope of the curve; the steeper the slope, the more discriminable the item
Probability of a correct guess
intersection with the y axis
- Def: computer adaptive assessment
- Computerized selection of test items based on periodic estimates of ability
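The difficulty and discriminability indices above can be sketched from a 0/1 response vector; the responses and the high/low split below are illustrative:

```python
# Item difficulty p = proportion correct; discrimination
# D = p(high group correct) - p(low group correct), range -1 to +1.
# The response vector and group split are illustrative.
def difficulty(item_responses):
    return sum(item_responses) / len(item_responses)

def discrimination(high_group, low_group):
    return difficulty(high_group) - difficulty(low_group)

item = [1, 1, 1, 0, 1, 0, 0, 0]  # one item, examinees sorted by total score
high = item[:4]                  # top half of examinees
low = item[4:]                   # bottom half of examinees
print(difficulty(item), discrimination(high, low))  # → 0.5 0.5
```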
- What are the advantages of a test item of moderate difficulty (p = .5)
-
Increases variability which increases reliability and validity
Maximally differentiates between low and high scorers
- Techniques for assessing an item's discriminability
-
Correlation with
total score
an external criterion
- What are the mean and std deviation for the following standard scores: z, t, stanine and deviation IQ?
-
              mean   SD
z                0    1
t               50   10
stanine          5   ~2
deviation IQ   100   15
- The difference between norm-referenced and criterion-referenced scores
-
Norm referenced is a comparison to others in a sample
Criterion referenced measures against an external criterion
- Characteristics of alternate forms reliability coefficient
-
Considered the best estimate: to be high, scores must be consistent across both time and content
Likely to have a lower magnitude than other coefficients
- Def: moderator variable
-
Variables affecting validity of a test
A moderator variable confers differential validity on the test
- Def: 'testing the limits' in dynamic assessment
- Following a standardized test, using hints to elicit correct performance. The more hints necessary, the more severe the learning disability
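The standard-score table above implies simple linear conversions from z; the raw score, mean, and SD below are illustrative:

```python
# Convert a raw score to z, then to t, deviation IQ, and stanine.
# Raw mean/SD are illustrative; the metric constants come from the table.
def z_score(raw, mean, sd):
    return (raw - mean) / sd

def to_t(z):
    return 50 + 10 * z            # t: mean 50, SD 10

def to_dev_iq(z):
    return 100 + 15 * z           # deviation IQ: mean 100, SD 15

def to_stanine(z):
    return max(1, min(9, round(5 + 2 * z)))  # mean 5, SD ~2, clamped 1-9

z = z_score(65, 50, 10)           # raw 65 on a test with mean 50, SD 10
print(z, to_t(z), to_dev_iq(z), to_stanine(z))  # → 1.5 65.0 122.5 8
```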
- Contents of the Mental Measurements Yearbook
-
Author
Publisher
Target population
Administrative time
Critical reviews
- Effect on the floor of adding easy questions to a test *
- Will lower (extend) the floor, improving discrimination among low scorers
- Def: dynamic assessment
- Variety of procedures following standardized testing to obtain further information, usually used with learning disability or retardation
- test theory