- Def: reliability
- Repeatable and consistent

Free from error

Reflects 'true score'

- Def: validity
- Measures what it says it does

- Def: power test
- Assesses the attainable level of difficulty

No time limit

Graduated difficulty

Qs that everyone can do

Qs that no one can do

Eg: WAIS information subtest

- Def: ipsative measures
- Scores reported in terms of relative strength within the individual

Preference is expressed for one item over another

- Def: mastery test
- Cutoff for predetermined level of performance

- Def: normative measures
- Absolute strength measured

All items answered

Comparison among people possible

- Range and interpretation of a reliability coefficient
- 0 (unreliable)

to

1 (perfectly reliable)

.9 means 90% of the observed score variance is true score variance

You do NOT square a reliability coefficient

- Factors affecting reliability coefficient
- Anything reducing the range of obtained scores (eg a homogeneous population)

Anything increasing measurement error

Short (vs long) tests

Presence of floor or ceiling effects

High probability of guessing a correct answer

- Factors affecting test-retest reliability
- Maturation

Difference in conditions

Practice effects

- Measures of internal consistency
- Split-half: divide test in 2 and correlate scores on the subtests; sensitive to selection strategy

Coefficient alpha: used when items have more than two scored responses (eg rating scales)

Kuder-Richardson Formula 20 (KR-20) used for questions with dichotomous answers

Reliability increases with item homogeneity
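
As a rough illustration of the two coefficients named above, the sketch below computes coefficient alpha and KR-20 from a small matrix of item scores. The function names and the toy data are assumptions made for illustration, and population variances are used throughout so that the two formulas agree on dichotomous items.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """scores: one row per examinee, one column per item."""
    k = len(scores[0])
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def kr20(scores):
    """KR-20 for items scored 0 (wrong) or 1 (right)."""
    k = len(scores[0])
    n = len(scores)
    # p = proportion passing each item; p * (1 - p) is that item's variance
    p = [sum(row[i] for row in scores) / n for i in range(k)]
    pq_sum = sum(pi * (1 - pi) for pi in p)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - pq_sum / total_var)

# Hypothetical data: 5 examinees, 4 dichotomous items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha_value = cronbach_alpha(responses)
kr20_value = kr20(responses)
```

For 0/1 items the item variance equals p(1 - p), which is why both functions return the same value on this data.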

- Utility of internal consistency measures
- Measurement of unstable traits

Not good for speed tests

Sensitive to item content / sampling

- Appropriate measure of speed test reliability
- Test-retest

Alternate forms

- Measure of inter-rater reliability
- Kappa coefficient
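
A minimal sketch of how a kappa coefficient can be computed for two raters, assuming Cohen's kappa (observed agreement corrected for chance agreement). The function name and the ratings are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    # Proportion of cases where the two raters chose the same category
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected by chance from each rater's category frequencies
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of four observations
rater_1 = ["yes", "yes", "no", "no"]
rater_2 = ["yes", "no", "no", "no"]
kappa = cohens_kappa(rater_1, rater_2)  # observed .75, chance .5
```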

- Factors improving inter-rater reliability
- Well trained raters

Explicit observation of the raters

Mutually exclusive and exhaustive scoring categories

- Def: interval recording
- Recording whether a behavior occurs during each of a series of specified time intervals

- Def: standard error of measurement
- How much error is expected from an individual test score

- Formula: standard error of measurement *
- SE = SD * square root of (1-r)

where r = the reliability coefficient which ranges from 0 to 1
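
The formula can be sketched in a few lines. The function name and the example values (SD of 15, reliability of .91) are assumptions chosen for illustration.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SE = SD * sqrt(1 - r), with r between 0 and 1."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1 - reliability)

# Hypothetical test: SD = 15, reliability = .91, so SE is about 4.5
sem = standard_error_of_measurement(sd=15, reliability=0.91)
```

Note that a perfectly reliable test (r = 1) gives SE = 0, and a completely unreliable one (r = 0) gives SE = SD.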

- Use: standard error of measurement
- Construction of a confidence interval

- Probability of scores falling within a specified confidence interval
- 68% +/- 1 SE

95% +/- 1.96 SE

99% +/- 2.58 SE
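
A short sketch of building a confidence interval around an obtained score with the multipliers listed above. The function name, the lookup table name and the example values are assumptions.

```python
import math

# Multipliers from the card above, keyed by confidence level in percent
SE_MULTIPLIERS = {68: 1.0, 95: 1.96, 99: 2.58}

def confidence_interval(score, sd, reliability, level=95):
    """Return (lower, upper) bounds around an obtained score."""
    se = sd * math.sqrt(1 - reliability)  # standard error of measurement
    margin = SE_MULTIPLIERS[level] * se
    return score - margin, score + margin

# Hypothetical case: obtained score 100, SD 15, reliability .91
lower, upper = confidence_interval(100, 15, 0.91, level=95)
```

With these values the 95% interval runs from roughly 91.2 to 108.8.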

- Use: eta *
- Correlation between two continuous variables whose relationship is non-linear (curvilinear)
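
One way to see what eta captures is to compute it as a correlation ratio: the square root of between-group variation over total variation. The sketch below assumes the x variable has been binned into groups, which is a simplification for illustration; the function name and data are hypothetical.

```python
def correlation_ratio(groups):
    """groups: one list of y values per bin of x. Returns eta."""
    all_y = [y for group in groups for y in group]
    grand_mean = sum(all_y) / len(all_y)
    ss_total = sum((y - grand_mean) ** 2 for y in all_y)
    # Variation of the bin means around the grand mean, weighted by bin size
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return (ss_between / ss_total) ** 0.5

# Hypothetical inverted-U relationship: y is high only for mid-range x
low_x, mid_x, high_x = [1, 2], [5, 6], [1, 2]
eta = correlation_ratio([low_x, mid_x, high_x])
```

In this data a straight line would fit poorly because y rises and then falls, yet eta is high because the bin means differ strongly.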

- Def: types of criterion related validity
- Concurrent

Scores collected at the same time

Useful for diagnostic tests

Predictive validity

Predictor scores collected first, criterion scores collected later

Useful for eg job selection tests

- Factors affecting criterion related validity
- Restricted range of scores

Unreliability of predictor or criterion

Regression

Criterion contamination

- Def: criterion contamination
- Occurs when person assessing criterion knows predictor for an individual

- Def: convergent/divergent analysis
- Convergent validity is high correlation between different measures of same construct

Divergent validity is low correlation between measures of different constructs

- Relationship between reliability and validity
- The criterion-related validity coefficient cannot exceed the square root of the predictor's reliability coefficient

Reliability coefficient sets a ceiling on the validity coefficient

- Def: face validity
- Appearance of validity to test takers, administrators and other untrained people

- Def: criterion related validity coefficient
- Pearson r correlation between predictor and criterion

acceptable range is +/- .3 to .6

- Differences between standard error of measurement and standard error of estimate
- Standard error of measurement

related to reliability coefficient

used to estimate true score on a given test

Standard error of estimate

Determines where a criterion will fall given a predictor
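
The contrast can be made concrete with the two formulas side by side. The standard error of estimate formula, SD of the criterion times the square root of (1 - validity squared), is the usual textbook form and is not stated in these notes; the function names and values are assumptions.

```python
import math

def standard_error_of_measurement(sd_test, reliability):
    """Error around an obtained score; depends on the reliability coefficient."""
    return sd_test * math.sqrt(1 - reliability)

def standard_error_of_estimate(sd_criterion, validity):
    """Error around a predicted criterion score; depends on the validity coefficient."""
    return sd_criterion * math.sqrt(1 - validity ** 2)

# Hypothetical values
sem = standard_error_of_measurement(sd_test=15, reliability=0.91)
see = standard_error_of_estimate(sd_criterion=10, validity=0.6)
```

Here the validity coefficient is squared inside the formula, while the reliability coefficient is not.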

- Def: shrinkage
- Reduction in validity coefficient on cross-validation (revalidation with a second sample)

A result of noise in original sample

- Factors affecting shrinkage
- Small original validation sample

Large original item pool

Relative number of items retained is small

Items not sensibly chosen

- Def: construct validity
- Extent to which a test successfully measures an unobservable, abstract concept such as intelligence

- Techniques for assessing construct validity
- Convergent validity techniques

High correlation on a trait even with different methods

Divergent / discriminant validity techniques

Low correlation on different traits even with the same method

Factor analysis

- Def: factor loading
- Correlation between a given test and a factor derived from a factor analysis

Can be squared to give the % of variance in the test that is accounted for by the factor

- Def: communality (factor analysis)
- The proportion of variance of a test accounted for by the factors

Sum of the squared factor loadings

Interpreted directly, ie .4 = 40%

Only valid when factors are orthogonal

- Def: unique variance (factor analysis)
- Variance not accounted for by the factors

u2 = 1 - h2, where h2 is the communality
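
A small sketch of the communality and unique variance calculations, assuming orthogonal factors as noted above. The function names and loadings are hypothetical.

```python
def communality(loadings):
    """h2: sum of a test's squared loadings across orthogonal factors."""
    return sum(loading ** 2 for loading in loadings)

def unique_variance(loadings):
    """u2 = 1 - h2."""
    return 1 - communality(loadings)

# Hypothetical test loading .6 on factor I and .5 on factor II
test_loadings = [0.6, 0.5]
h2 = communality(test_loadings)      # .36 + .25
u2 = unique_variance(test_loadings)  # 1 - h2
```

So about 61% of this test's variance is shared with the factors and about 39% is unique to the test.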

- Def: eigenvalue
- Variance explained by a factor

= Sum of the squared loadings of all the tests on that factor

sum of the eigenvalues <= number of tests

Applied to unrotated factors only

- Formula to convert eigenvalue to %
- = eigenvalue * 100 / number of tests
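
The conversion can be sketched as follows, computing the eigenvalue from one factor's loadings and then expressing it as a percentage. The function names and the loadings are assumptions for illustration.

```python
def eigenvalue(loadings_on_factor):
    """Sum of the squared loadings of every test on one factor."""
    return sum(loading ** 2 for loading in loadings_on_factor)

def percent_of_variance(eigen, number_of_tests):
    """eigenvalue * 100 / number of tests."""
    return eigen * 100 / number_of_tests

# Hypothetical loadings of four tests on a single unrotated factor
factor_loadings = [0.8, 0.7, 0.6, 0.5]
eigen = eigenvalue(factor_loadings)
percent = percent_of_variance(eigen, number_of_tests=len(factor_loadings))
```

With these loadings the eigenvalue is about 1.74, or roughly 43.5% of the total variance in the four tests.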

- Types of rotation (factor analysis) *
- Orthogonal - uncorrelated

Oblique - correlated

Choice depends on what you believe the relationship is among the factors

- Differences between principal components analysis and factor analysis
- In principal components analysis:

Factors are always uncorrelated

Variance = explained + error

In factor analysis:

variance = common + specific + error

- Use: cluster analysis
- Categorize or taxonomize a set of objects

- Differences between cluster analysis and factor analysis
- Cluster analysis

all types of data

clusters interpreted as categories

Factor analysis

interval or ratio data only

factors interpreted as underlying constructs

- Def: correction for attenuation
- Estimate of how much more valid a predictor would be if it and the criterion were perfectly reliable
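
The usual form of this correction divides the obtained validity coefficient by the square root of the product of the two reliabilities. That formula is the standard textbook version rather than something given in these notes, and the function name and values are assumptions.

```python
import math

def correct_for_attenuation(validity, predictor_reliability, criterion_reliability):
    """Estimated validity if predictor and criterion were perfectly reliable."""
    return validity / math.sqrt(predictor_reliability * criterion_reliability)

# Hypothetical values: obtained validity .36, reliabilities .81 and .64
corrected = correct_for_attenuation(0.36, 0.81, 0.64)
```

In this example the obtained coefficient of .36 corresponds to an estimated .50 once measurement error is removed from both measures.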

- Def: content validity
- Adequate sampling of relevant content domain

- To reduce the number of false positives...
- Raise the predictor cutoff

and / or

Lower the criterion cutoff

- Def: false negative
- Predicted not to meet a criterion but in reality does
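
To see how the cutoffs drive the number of false positives and false negatives, the sketch below tallies outcomes for a handful of made-up (predictor, criterion) score pairs. The function names, the data and the cutoff values are all hypothetical.

```python
from collections import Counter

def outcome(predictor, criterion, predictor_cutoff, criterion_cutoff):
    """Classify one person given both cutoffs."""
    predicted_to_succeed = predictor >= predictor_cutoff
    actually_succeeds = criterion >= criterion_cutoff
    if predicted_to_succeed and actually_succeeds:
        return "true positive"
    if predicted_to_succeed:
        return "false positive"
    if actually_succeeds:
        return "false negative"
    return "true negative"

def tally(pairs, predictor_cutoff, criterion_cutoff):
    """Count each outcome type across all (predictor, criterion) pairs."""
    return Counter(
        outcome(p, c, predictor_cutoff, criterion_cutoff) for p, c in pairs
    )

# Hypothetical (predictor, criterion) scores for six people
pairs = [(40, 30), (55, 45), (60, 70), (70, 40), (80, 85), (90, 60)]

baseline = tally(pairs, predictor_cutoff=50, criterion_cutoff=50)
raised_predictor = tally(pairs, predictor_cutoff=75, criterion_cutoff=50)
lowered_criterion = tally(pairs, predictor_cutoff=50, criterion_cutoff=40)
```

In this data both adjustments remove the two baseline false positives, but raising the predictor cutoff also turns one person into a false negative.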

- Def: item difficulty or difficulty index *
- % of examinees answering correctly

an ordinal value, because an item with an index of .2 is not necessarily half the difficulty of an item with an index of .4

- Def: item discriminability
- Degree to which an item differentiates between low and high scorers

D = difference between high and low % correctly answered

range from 100 to -100

moderate difficulty optimal
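
A minimal sketch of the difficulty index and the discrimination index D for a single item, assuming examinees have already been split into a high-scoring and a low-scoring group. The function names and the 0/1 responses are hypothetical.

```python
def item_difficulty(responses):
    """Proportion of examinees answering the item correctly (1 = correct)."""
    return sum(responses) / len(responses)

def discrimination_index(high_group, low_group):
    """D = % correct in the high group minus % correct in the low group."""
    pct_high = 100 * sum(high_group) / len(high_group)
    pct_low = 100 * sum(low_group) / len(low_group)
    return pct_high - pct_low

# Hypothetical responses to one item, grouped by total test score
high_scorers = [1, 1, 1, 0]
low_scorers = [1, 0, 0, 0]

difficulty = item_difficulty(high_scorers + low_scorers)
d_index = discrimination_index(high_scorers, low_scorers)
```

A negative D would mean that low scorers pass the item more often than high scorers.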

- Target values for item difficulty by objective
- .5 for most tests

.25 for high cutoff (matching selection %)

.8 or .9 for mastery

half way between chance and 1, eg t/f exams would be .75
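
The "half way between chance and 1" rule is simple arithmetic, sketched here with a hypothetical function name.

```python
def target_difficulty(chance_level):
    """Midpoint between the probability of guessing correctly and 1."""
    return (chance_level + 1) / 2

true_false = target_difficulty(0.5)    # two options
four_choice = target_difficulty(0.25)  # four options
```

So a true/false item targets .75, and a four-option multiple choice item targets .625.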

- Relationship between item difficulty and discriminability
- Difficulty creates a ceiling for discriminability

Difficulty of .5 creates maximum discriminability

The greater the mean discriminability the greater the reliability

- What can you determine from an item response (aka item characteristic) curve?
- Difficulty

ability level at which p(correct response) = .5

Discriminability

slope of the curve; steeper is more discriminating

Probability of a correct guess

intersection with y axis
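
The three readings from the curve can be illustrated with a logistic item characteristic curve. The notes do not name a specific model, so the three-parameter logistic form and the parameter names a, b and c below are assumptions.

```python
import math

def icc(theta, a, b, c=0.0):
    """Probability of a correct response at ability level theta.

    a: slope (discriminability)
    b: location on the ability axis (difficulty)
    c: lower asymptote (probability of a correct guess)
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# With no guessing, the curve passes through .5 at theta == b
p_at_difficulty = icc(theta=0.0, a=1.5, b=0.0)

# A steeper item separates examinees at theta = -1 and theta = +1 more sharply
gap_steep = icc(1.0, a=2.0, b=0.0) - icc(-1.0, a=2.0, b=0.0)
gap_flat = icc(1.0, a=1.0, b=0.0) - icc(-1.0, a=1.0, b=0.0)

# At very low ability the curve flattens out at the guessing level
p_floor = icc(theta=-50.0, a=1.0, b=0.0, c=0.25)
```

When c is greater than 0 in this model, the probability at theta = b is (1 + c) / 2 rather than .5.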

- Def: computer adaptive assessment
- Computerized selection of test items based on periodic estimates of ability

- What are the advantages of a test item of moderate difficulty (p = .5)
- Increases variability which increases reliability and validity

Maximally differentiates between low and high scorers

- Techniques for assessing an item's discriminability
- Correlation with

total score

an external criterion

- What are the mean and std deviation for the following standard scores: z, t, stanine and deviation IQ?
- Score: mean, SD

z: 0, 1

T: 50, 10

stanine: 5, ~2

deviation IQ: 100, 15
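
Because each of these scales is a linear rescaling of z, converting between them is a one-line calculation. The sketch below uses hypothetical function names; the stanine conversion rounds to the nearest whole number and clips to the 1 to 9 range, which is an approximation of the usual banding.

```python
def to_t_score(z):
    """T score: mean 50, SD 10."""
    return 50 + 10 * z

def to_deviation_iq(z):
    """Deviation IQ: mean 100, SD 15."""
    return 100 + 15 * z

def to_stanine(z):
    """Stanine: mean 5, SD about 2, whole numbers from 1 to 9."""
    return max(1, min(9, round(5 + 2 * z)))

# An examinee one standard deviation above the mean
z = 1.0
t_score = to_t_score(z)
iq = to_deviation_iq(z)
stanine = to_stanine(z)
```

For z = 1 this gives a T score of 60, a deviation IQ of 115 and a stanine of 7.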

- The difference between norm-referenced and criterion referenced scores
- Norm referenced is a comparison to others in a sample

Criterion referenced is a comparison against an external criterion (a predetermined standard of performance)

- Characteristics of alternate forms reliability coefficient
- Considered the best estimate, because a high coefficient requires consistency across both time and content

Likely to have a lower magnitude than other coefficients

- Def: moderator variable
- Variables affecting validity of a test

A moderator variable confers differential validity on the test

- Def: 'testing the limits' in dynamic assessment
- Following a standardized test, using hints to elicit correct performance. The more hints necessary, the more severe the learning disability

- Contents of the Mental Measurements Yearbook
- Author

Publisher

Target population

Administrative time

Critical reviews

- Effect on the floor of adding easy questions to a test *
- Will lower the floor (the test can then distinguish among lower-scoring examinees)

- Def: dynamic assessment
- Variety of procedures following on standardized testing to get further information, usually used with learning disability or retardation
