
Spring 2006 comps, tesl 560


absolute decision
the interpretation of CRT scores where each examinee's score is meaningful without reference to the scores of the other examinees.
criterion level
the level or standard of performance required to pass a CRT.
criterion-referenced test
measures which assess student achievement in terms of a CERTAIN CRITERION STANDARD and thus provide information as to the degree of competence attained by a particular student which is INDEPENDENT OF REFERENCE TO THE PERFORMANCE OF OTHERS.
norm-referenced test
any test that is primarily designed to disperse the performances of students in a normal distribution based on their general abilities, or proficiencies, for purposes of categorizing the students into levels or comparing students' performances to the performances of others who formed the normative group.
relative decision
the interpretation of an examinee's score on an NRT, where the examinee's position relative to the scores of all of the other examinees who took the test is understood.
norming group
the first set of examinees, to whose scores all other examinees' scores will be compared.
achievement testing
This type of testing is usually done at the end of a course of study. The purpose of criterion-referenced achievement testing is to determine the degree to which the students successfully learned the language material or skills covered in the course
diagnostic testing
most often done at the beginning of a term of instruction, the purpose of this test is to determine the strengths and weaknesses of each student so they can focus their energies on their weaknesses where those energies will be most effective and efficient.
educational objectives
is defined as any set of statements that describe what teachers expect students (a) to be able to do at the end of a particular language course or program, or (b) be able to do in a specific domain of knowledge, skills, or abilities.
experiential objectives
objectives which describe exactly what the teacher wants the students to experience and how that experience will be verified.
gain scores
defined as the posttest average percent minus the pretest average percent for each objective. In other words, it is the difference between the pretest and the posttest, where any positive number indicates that the students have learned something, and to what degree.
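The gain score described above can be sketched in a few lines of Python. The function names and the sample percent scores are hypothetical and only illustrate the subtraction for a single objective.

```python
def average(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def gain_score(pretest_percents, posttest_percents):
    """Posttest average percent minus pretest average percent."""
    return average(posttest_percents) - average(pretest_percents)


# Hypothetical percent-correct scores of five students on one objective.
pretest = [40, 50, 30, 60, 20]
posttest = [80, 90, 70, 85, 75]

gain = gain_score(pretest, posttest)
print(f"gain = {gain:.1f} percentage points")
```

A positive result suggests learning on that objective; a value near zero or below suggests none was measured.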
instructional objectives
narrowly defined objectives which include performance objectives (what the learner is expected to be able to do by the end of the term), conditions (important conditions under which the performance will occur), and criterion level (the level at which learners are expected to perform in order to pass).
needs analysis
the systematic collection and analysis of all subjective and objective information necessary to define and validate defensible curriculum purposes that satisfy the language learning requirements of students within the context of particular institutions that influence the learning and teaching situation
to define a practical or theoretical concept in observable terms.
It refers to what a language learner is expected to be able to do by the end of a course
It usually refers to an achievement test because it is administered at the end of a course, meaning the feedback will be achievement related
It usually refers to a diagnostic test because it is administered at the beginning of instruction. The students haven’t learned knowledge or skills from the course yet.
progress testing
This kind of testing is usually administered as a language course progresses rather than at the beginning of instruction (as with diagnostic testing). The purpose is to determine the strengths and weaknesses of each student as the learning continues.
alternative assessments
see personal-response items
analytic scoring
a scoring method in which separate scores are assigned to different aspects of the performance (for instance, separate scores on a composition for grammar, mechanics, vocabulary, content, and organization).
conference format
typically involves the student visiting the teacher's office, usually by appointment, to discuss a particular piece of work or learning process, or both. It differs from other forms of assessment because it focuses directly on the learning process.
constructed-response items
those in which a student is required to actually produce language by writing, speaking, or acting in some way, rather than simply selecting answers.
discrete-point items
These tests clearly assess distinct and identifiable parts of the phonetic, syntactic, or lexical systems, as in multiple-choice grammar or vocabulary tests. They stand in contrast to integrative tests, which link two or more language skills
These are incorrect alternatives in selection-response items
fill-in format
This format provides a language context of some sort and then removes part of that context and replaces it with a blank. The student’s job is to fill in that blank. Fill-in items take many forms ranging from single word fill-in items and cloze passages to phrase-length responses
format confoundings
This is inadvertent confusion due to item appearance or content which may needlessly disorient students. This is opposed to linguistic confoundings, which are confusions based on unintentional language problems.
holistic scoring
a type of writing scoring in which a single overall score is given. This stands in contrast to analytic scoring, with separate scores for different aspects of performance (e.g., grammar, mechanics, vocabulary, content, and organization).
integrative tests
tests which link two or more language skills, for example a dictation test which would assess listening and writing.
item content analysis
making judgments about the degree to which the form of the item allows for accurately assessing the desired content
item quality analysis
(for CRTs) ultimately involves making judgments about the degree to which the items are valid for the purposes of the test and the appropriacy of the content of the items within the specific language program or research area.
item specifications
clear item descriptions that include a general description (a simple one or two sentence description in general terms of the behavior being examined and how that behavior will be assessed), a sample item (an example test item derived from the test specifications), stimulus/prompt attributes (a series of statements that attempt to delimit the general class of material that the examinee will be responding to when answering the type of item involved), response attributes (which either define the characteristics of the options from which the students select their responses or present the standards by which students' responses will be evaluated), and specification supplements (limits for the content of a particular item).
item stem
the part of the item that forms the basis for the choices in a multiple-choice item.
linguistic confoundings
unintentional language problems or processing errors test writers should avoid. 1. Avoid writing items whose language is at levels of complexity above the examinee’s level of language proficiency. 2. Avoid negative and double negative statements. 3. Avoid ambiguity.
matching format
Examinees select words or phrases in one list that match words or phrases in another list; match prompts to options. They are limited to measuring students’ abilities to associate one set of facts with another or one set of definitions with the words they define. Typically used for testing vocabulary.
multiple-choice format
See selected-response items. Multiple-choice items reduce the guessing factor found in binary-choice tests. The goal is to develop MC items which do not 1. provide clues to the answer or 2. require types of cognitive processing that may confound the results.
For selected response items the possible answers or alternatives from which the examinee selects. None of the above, A and B but not C, or all of the above should be avoided.
performance format
a format which generally requires Ss to perform some more or less real-life, authentic task using the language, most often either productive types of spoken or written language but sometimes combining two or more skills like reading and writing or listening and speaking.
personal assessment
individualized testing; the student’s communication is their own. It is what they want to communicate. Disadvantage is that it involves subjective scoring if they are scored at all. Advantage is that it is more personalized assessment.
personal-response items
require students to produce language; allow for the responses and even the ways the tasks are accomplished to be different for each student; students produce different responses according to their own personalities or preferences.
portfolio format
a collection of any aspects of student’s work that tell the story of the efforts, skills, abilities, achievements, and contributions to a given class; have been used in order to encourage students to collect and display their work.
the portion of the test item to which the students must respond. It can take many forms.
selected-response items
in selected response items, examinees choose the correct response from a set of supplied options. The most common form of this in language testing is the multiple choice item. The selected response category also includes binary-choice (that is, true/false) and matching items. For selected response items, the possible answers from which the examinee selects are called the options, or alternatives. The correct answer is also sometimes called the key, and the incorrect alternatives are called distractors.
self-assessment format
any assessment that requires students to rate their own language, whether through performance ability self-assessments, comprehension self-assessments, or observation self-assessments.
• Performance ability self-assessments require learners to read a situation and then judge how well they think they would respond in that particular situation (on perhaps a scale of 1 to 5).
• Comprehension self-assessments also require the students to read a situation and then judge how well they comprehend that particular situation.
• Observation self-assessments involve learners listening to audio cassette recordings or watching video tapes of their own language behavior (usually in natural situations or in role-play activities) and judging how well they performed.
short-answer format
this format generally requires the students to examine a statement or question and then respond to it with a phrase or two, or a sentence or two, in the space provided. Unlike fill-in format (which generally has a narrow focus on very specific language components), short-answer items can be somewhat more general in nature; nonetheless, limitations of response length must be considered. As with the fill-in format, short-answer format may have many possible answers for each question. Each student may produce a completely unique answer. This characteristic can cause enormous problems in terms of fair scoring.
test specifications
the first stage in writing a criterion-referenced test. These provide a set of guidelines as to what the test is designed to measure and what language content or skills will be covered in the test. Additionally, test specifications can be used later for communicating the test writers’ intentions to the users of the test. One useful format for test specifications might include an overall test descriptor and specific test descriptors
true-false format
or binary-choice format. This format requires students to respond to the language by selecting one of two choices, for instance, between true and false or between correct and incorrect. The most commonly used binary-choice format is the true/false format.
a statistic that contrasts the item facility for the examinees who passed the test with the item facility for those who failed it, that is, the performance of masters and non-masters on the item. An established cut score is required.
B-index = IFpass-IFfail
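As a minimal sketch of the B-index subtraction, the Python below splits examinees at a cut score and compares item facility in the two groups. The function names, the sample data, and the convention that a total score at or above the cut score counts as passing are assumptions for illustration.

```python
def item_facility(responses):
    """Proportion of correct (1) answers among 0/1 item responses."""
    return sum(responses) / len(responses)


def b_index(item_responses, total_scores, cut_score):
    """IF for examinees who passed minus IF for examinees who failed.

    Assumption: a total score at or above cut_score counts as a pass.
    """
    passed = [r for r, t in zip(item_responses, total_scores) if t >= cut_score]
    failed = [r for r, t in zip(item_responses, total_scores) if t < cut_score]
    if not passed or not failed:
        raise ValueError("B-index needs at least one passer and one failer")
    return item_facility(passed) - item_facility(failed)


# Hypothetical data: one item's 0/1 responses and each examinee's total score.
item = [1, 1, 1, 0, 0, 1]
totals = [9, 8, 7, 4, 3, 5]

b = b_index(item, totals, cut_score=7)
print(f"B-index = {b:.2f}")
```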
the mode of a set of numbers is the number that appears most frequently. Bimodal means that there are two numbers that appear most frequently, e.g., 10 test takers take a test scored on a 10-point scale, and the scores 4 and 8 each appear 5 times.
Item Facility (IF)

A statistic that expresses the proportion of examinees who correctly answer a given item, ranging from 0 to 1 (often read as a percentage).
0 - hard item, because no examinee got the item correct.
1 - easy item, because every examinee got the item correct.
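A short Python sketch of item facility, assuming responses are stored as one row of 0/1 values per examinee and one column per item. The layout and names are illustrative choices.

```python
def item_facilities(response_matrix):
    """Return the IF (proportion correct) for each item.

    response_matrix: one row per examinee, one 0/1 entry per item.
    """
    n_examinees = len(response_matrix)
    n_items = len(response_matrix[0])
    return [
        sum(row[i] for row in response_matrix) / n_examinees
        for i in range(n_items)
    ]


# Hypothetical responses of four examinees to three items.
responses = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
]

facilities = item_facilities(responses)
print(facilities)
```

Here the first item is answered correctly by everyone (IF of 1, an easy item) and the second by one examinee in four.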
central tendency
Center of a set of values: mode, mean, median
Any distribution of scores can be described numerically by calculating statistics that represent the central tendency of the scores
difference index (DI)
a value comparing item performance of masters and non-masters. It indicates the degree to which an item is reflecting gain in knowledge or skill. In CRT, it indicates the degree to which a CRT item is distinguishing between the students, called masters, who know the material (or have the skill being taught) and students, labeled non-
item discrimination (ID)
ID shows how well an item separates high-ability and low-ability students. (NRT)
Item discrimination: ID = IFupper – IFlower
- a negative number means low-ability Ss got the item right and high-ability Ss got it wrong.
0.00 (useless) to 1.00 (perfect)
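The ID subtraction can be sketched as follows. The source does not say how the upper and lower groups are formed; taking the top and bottom thirds by total score is an assumption here, as are the function name and sample data.

```python
def item_discrimination(item_responses, total_scores, groups=3):
    """ID = IF of the upper group minus IF of the lower group.

    Assumption: upper and lower groups are the top and bottom 1/groups
    of examinees ranked by total test score.
    """
    ranked = sorted(
        zip(total_scores, item_responses), key=lambda pair: pair[0], reverse=True
    )
    n = max(1, len(ranked) // groups)
    upper = [response for _, response in ranked[:n]]
    lower = [response for _, response in ranked[-n:]]
    return sum(upper) / n - sum(lower) / n


# Hypothetical data: total scores and 0/1 responses to one item.
totals = [10, 9, 8, 5, 4, 3]
item = [1, 1, 1, 0, 1, 0]

item_id = item_discrimination(item, totals)
print(f"ID = {item_id:.2f}")
```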
item phi
The correlation between item score and pass/fail.
Item phi is essentially a Pearson correlation between examinees' performance on an item and their test outcome (pass/fail).
students who know the material (or have the skill) being taught
Measure of central tendency
Calculated by adding up the scores and dividing by the number of scores.

M = ΣX / N

M = mean
Σ = sum (or add up)
X = scores
N = number of scores

This is essentially synonymous with average.
Another measure of central tendency
The point in the distribution above which half of the scores are found and below which the other half are found
The point in the distribution that has the most scores
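The three measures of central tendency described in this part of the deck (adding and dividing, the midpoint, and the most frequent score) can be checked with Python's standard `statistics` module. The score lists are hypothetical.

```python
import statistics

# Hypothetical test scores.
scores = [4, 6, 8, 8, 10]

mean_score = statistics.mean(scores)      # sum of scores / number of scores
median_score = statistics.median(scores)  # half the scores above, half below
modes = statistics.multimode(scores)      # most frequent score(s)

# A bimodal set: 4 and 8 each appear five times.
bimodal_scores = [4] * 5 + [8] * 5
bimodal_modes = statistics.multimode(bimodal_scores)

print(mean_score, median_score, modes, bimodal_modes)
```

`multimode` (available from Python 3.8) returns every value tied for most frequent, which makes a bimodal distribution visible.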
students who do not know the material (or have the skill) being taught
number of items (k)
the number of items on a test
In a CRT, the information of interest is the amount of knowledge or skill that the examinees possess.
The focus is on the percentage of items answered correctly, which hopefully reflects the percentage of material known.
In an NRT, the focus is on how the examinee’s performance relates to the scores of all of the other examinees.
Indicates the proportion of examinees who scored above and below the examinee
Represents the distance from the lowest score in the distribution to the highest score.
Range = High - Low + 1
High = highest score on the test
Low = lowest score on the test
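A one-function Python sketch of the range as defined here (highest score minus lowest score, plus one). The name `score_range` and the scores are illustrative.

```python
def score_range(scores):
    """Range = High - Low + 1, counting both end points."""
    return max(scores) - min(scores) + 1


# Hypothetical scores on a test.
scores = [3, 7, 10, 5]
print(score_range(scores))
```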
Negatively skewed

CRT post-test
Most test-takers score high
Distribution in which the skewing (or tail) is in the direction of the low scores (that is, in the negative direction)
Positively skewed

CRT pre-test
When most examinees score low, indicating that they did not know the material being tested
The skewing (or tail) is in the direction of the high scores (positive direction)
Normative distribution

distribution example for an NRT
standard deviation

the average distance from the mean of a set of scores.
standard scores/z scores

Raw scores (or the actual number of questions answered correctly) converted into scores on a normative distribution. Z scores report the distance a given raw score is from the mean in standard deviation units.

z = (X - X̄) / SD

X = student's score
X̄ = mean
SD = standard deviation
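A z score is the student's score minus the mean, divided by the standard deviation. The sketch below assumes the standard deviation is computed over the whole group (dividing by N, `statistics.pstdev`); a sample standard deviation would give slightly different values. The scores are hypothetical.

```python
import statistics


def z_score(x, scores):
    """Distance of raw score x from the mean, in standard deviation units.

    Assumption: SD is the population standard deviation (divides by N).
    """
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return (x - mean) / sd


# Hypothetical raw scores: mean is 5 and SD is 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

print(z_score(9, scores))  # two SDs above the mean
print(z_score(2, scores))  # one and a half SDs below the mean
```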
Pearson r

the Pearson product-moment correlation, or Pearson r, assumes that both variables are interval, for example polytomous data from the scores of two tests.
r = Σ(X - Mx)(Y - My) / (N × Sx × Sy)
X = each score on variable X
Mx = mean score for variable X
Y = each score on variable Y
My = mean score for variable Y
Sx = standard deviation for variable X
Sy = standard deviation for variable Y
N = number of examinees who took both tests
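Using the symbols listed above, Pearson r can be sketched as the sum of the cross-products of deviations divided by N times the two standard deviations. This sketch assumes the standard deviations divide by N, which matches a formula with N in the denominator; the function name and data are hypothetical.

```python
import math


def pearson_r(x_scores, y_scores):
    """r = sum((X - Mx)(Y - My)) / (N * Sx * Sy), with SDs dividing by N."""
    n = len(x_scores)
    mx = sum(x_scores) / n
    my = sum(y_scores) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in x_scores) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in y_scores) / n)
    cross_products = sum((x - mx) * (y - my) for x, y in zip(x_scores, y_scores))
    return cross_products / (n * sx * sy)


# Hypothetical scores of five examinees on two tests.
test_1 = [10, 20, 30, 40, 50]
test_2 = [12, 24, 33, 46, 55]

print(f"r = {pearson_r(test_1, test_2):.3f}")
```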
spearman rho

Spearman rho is used to correlate two ordinal sets of data. See correlation.
rho = 1 - (6 × ΣD²) / (N × (N² - 1))
ΣD² = sum of the squared differences between the ranks on test 1 and the ranks on test 2
N = the number of test takers
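Spearman rho is commonly computed as 1 minus six times the sum of squared rank differences, divided by N(N² - 1). The sketch below takes ranks as input and assumes there are no tied ranks; the names and ranks are hypothetical.

```python
def spearman_rho(ranks_1, ranks_2):
    """rho = 1 - 6 * sum(D^2) / (N * (N^2 - 1)); assumes no tied ranks."""
    n = len(ranks_1)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))


# Hypothetical ranks of five test takers on two tests.
ranks_test_1 = [1, 2, 3, 4, 5]
ranks_test_2 = [2, 1, 3, 5, 4]

rho = spearman_rho(ranks_test_1, ranks_test_2)
print(f"rho = {rho:.2f}")
```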
Cronbach's alpha

Used to calculate the reliability of norm-referenced tests. See reliability.
alpha = (k / (k - 1)) × (1 - Σsi² / Sx²)
k = number of items
Σsi² = the sum of the item variances
Sx² = the variance of the total score
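With k, the sum of the item variances, and the total-score variance as defined above, Cronbach's alpha can be sketched as below. Population variances (dividing by N) are assumed for both the items and the totals, and the response matrix is hypothetical.

```python
import statistics


def cronbach_alpha(response_matrix):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / total variance).

    response_matrix: one row per examinee, one score per item.
    Assumption: variances are population variances (divide by N).
    """
    k = len(response_matrix[0])
    items = list(zip(*response_matrix))  # one tuple of scores per item
    sum_item_variances = sum(statistics.pvariance(item) for item in items)
    totals = [sum(row) for row in response_matrix]
    total_variance = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical 0/1 responses of four examinees to three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```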

another version of Cronbach's alpha; used to calculate the reliability of NRTs. See reliability.
reliability = (k / (k - 1)) × (1 - Σpq / Sx²)
k = number of items
Σpq = the sum, across items, of p (the proportion of students who got the item correct) times q (the proportion of students who got the item incorrect, or 1 - p)
Sx² = the variance of the total score

another way to estimate Cronbach's alpha; used to calculate the reliability of NRTs.
reliability = (k / (k - 1)) × (1 - M × (k - M) / (k × Sx²))
k = number of items
Sx² = variance of the total score
M = the mean of the total score
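The two alternative estimates above use the symbol sets (k, Σpq, Sx²) and (k, Sx², M). These match what are commonly called the Kuder-Richardson formulas K-R20 and K-R21; those labels do not appear on the cards, so treat them as an assumption. The sketch applies to dichotomously scored (0/1) items, uses population variance, and runs on hypothetical data.

```python
import statistics


def kr20(response_matrix):
    """(k / (k - 1)) * (1 - sum(p * q) / Sx^2) for 0/1 item scores."""
    k = len(response_matrix[0])
    n = len(response_matrix)
    totals = [sum(row) for row in response_matrix]
    total_variance = statistics.pvariance(totals)
    sum_pq = 0.0
    for item in zip(*response_matrix):
        p = sum(item) / n  # proportion who got the item correct
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / total_variance)


def kr21(response_matrix):
    """(k / (k - 1)) * (1 - M * (k - M) / (k * Sx^2)); needs only M and Sx^2."""
    k = len(response_matrix[0])
    totals = [sum(row) for row in response_matrix]
    mean_total = statistics.mean(totals)
    total_variance = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - mean_total * (k - mean_total) / (k * total_variance))


# Hypothetical 0/1 responses of four examinees to three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

print(f"K-R20 = {kr20(responses):.2f}, K-R21 = {kr21(responses):.2f}")
```

The second formula needs only the number of items, the mean, and the variance of the total scores, so it can be computed without item-level data.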
standard error of measurement

a statistical estimate of the consistency of measurement, and therefore a measure of the degree to which an observed score is representative of a student's true score (the true ability, or the average score if the test were administered an infinite number of times, each independent of the others). In other words, the SEM represents a confidence interval around the examinee's score within which we feel confident the examinee's score would fall if we administered the test many times.
SEM = Sx × √(1 - Rxx’)
Sx = the standard deviation of the total score
Rxx’ = reliability coefficient, or the reliability of the test
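With Sx and the reliability coefficient as defined above, the SEM is commonly computed as Sx times the square root of one minus the reliability. The sketch below also builds a band of one SEM on either side of an observed score; the band width, function names, and numbers are illustrative assumptions.

```python
import math


def standard_error_of_measurement(sd_total, reliability):
    """SEM = Sx * sqrt(1 - reliability)."""
    return sd_total * math.sqrt(1 - reliability)


def score_band(observed_score, sem, n_sems=1):
    """Band of n_sems SEMs on either side of an observed score."""
    return observed_score - n_sems * sem, observed_score + n_sems * sem


# Hypothetical test: SD of total scores is 5, reliability is 0.84.
sem = standard_error_of_measurement(sd_total=5, reliability=0.84)
low, high = score_band(observed_score=70, sem=sem)
print(f"SEM = {sem:.2f}, band = {low:.1f} to {high:.1f}")
```

The higher the reliability, the smaller the SEM and the narrower the band around an observed score.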
nominal data/scale
used for naming and categorizing data in a variable. The categories are not in any particular order. In testing this is often dichotomous (pass/fail or master/non-master).
ordinal data
Unlike nominal data, ordinal data are ranked. For example, if 30 students took a test and you ranked the students from best to worst (as you would when calculating Spearman rho), the resulting ranks would be ordinal data.
interval data
interval data also reflect order, but in addition they reflect the distance between points in the rankings. Example: scores on a test.
Reliability is often defined as consistency of measurement. A reliable test score will be consistent across different characteristics of the testing situation. Thus, reliability can be considered to be a function of the consistency of scores from one set of tests and test tasks to another.
a measure of consistency of measurement in CRTs in terms of consistency in the classifications of master/non-master and pass/fail, and importance of misclassifications.
the strength of the relationship between two variables. The correlation coefficient ranges from -1 to 1. When two variables are negatively correlated, one goes up on a graph as the other goes down. When they are positively correlated, both go up together. The closer the correlation coefficient gets to 0, the weaker the relationship between the two variables.
point-biserial pb(r)
a correlation coefficient for one nominal variable (dichotomously scored test items) and one interval variable (test scores). Used in testing to determine the strength of the relationship between a dichotomously scored item and test scores, where the correlation coefficient indicates the degree to which an item can predict the outcome of the test in terms of test score.
a correlation coefficient used to correlate two nominal variables such as dichotomously scored test items and test outcome in terms of pass/fail.
test-retest reliability
the assumption that, under appropriate conditions of administrative control, observed scores on 2 administrations of the same NRT test will be "parallel". For example the same test is given to the same students twice (assuming that the students have forgotten everything on the test the second time).
equivalent forms/parallel forms reliability
similar to test-retest reliability, except that different but very similar versions of the test are constructed and administered.
internal consistency reliability
examines how different parts of the same NRT relate to each other.
split-half reliability
a means of measuring the internal consistency reliability of an NRT by dividing the test into two halves, usually by scoring the even-numbered and odd-numbered items separately, and then examining the correlation between the two halves of the test.
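The split-half procedure can be sketched as: score the odd-numbered and even-numbered items separately for each examinee, then correlate the two sets of half-test scores. The correlation is a plain Pearson r written out by hand; the function names and the response matrix are hypothetical.

```python
import math


def pearson_r(x_scores, y_scores):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x_scores)
    mx = sum(x_scores) / n
    my = sum(y_scores) / n
    cross = sum((x - mx) * (y - my) for x, y in zip(x_scores, y_scores))
    ss_x = sum((x - mx) ** 2 for x in x_scores)
    ss_y = sum((y - my) ** 2 for y in y_scores)
    return cross / math.sqrt(ss_x * ss_y)


def split_half_correlation(response_matrix):
    """Correlate odd-numbered-item scores with even-numbered-item scores."""
    odd_half = [sum(row[0::2]) for row in response_matrix]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in response_matrix]  # items 2, 4, 6, ...
    return pearson_r(odd_half, even_half)


# Hypothetical 0/1 responses of four examinees to four items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

half_r = split_half_correlation(responses)
print(f"correlation between halves = {half_r:.3f}")
```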
Bachman and Palmer's two fundamental principles
1. The need for a correspondence between language test performance and language use.
2. a clear and explicit definition of the qualities of usefulness.
Bachman and Palmer's central concern
Language test performance should reflect non-test language use.
construct validity
Construct validity pertains to the meaningfulness and appropriateness of the interpretations that we make on the basis of test scores… We must be able to provide adequate justification for any interpretation we make of a given test score. That is, we need to demonstrate, or justify, the validity of the interpretations we make of test scores, and not simply assert or argue that they are valid.
Authenticity is the degree of correspondence of the characteristics of a given language test task to the features of a TLU task. It is a qualitative consideration, not an either/or characteristic: a test task is more authentic or less authentic, not authentic or inauthentic. It also relates the test task to the domain of generalization to which we want our score interpretations to extend.
Interactiveness is the extent and type of involvement of the test taker’s individual characteristics in accomplishing a test task… a test task that requires a test taker to relate the topical content of the test input to her own topical knowledge is likely to be relatively more interactive than one that does not.
the impact of test use and score interpretations on society and educational systems, and upon the people in those educational systems. Can also be defined under the term washback, which is the effect of testing on teaching and learning.
We can define practicality as the relationship between the resources that will be required in the design, development, and use of the test and the resources that will be available for these activities. It goes without saying that if required resources exceed available resources, then the test is not practical.
Bachman and Palmer's 3 things a test characteristics framework can be used for.
1. Describing TLU tasks as a basis for designing language test tasks.
2. Describing different test tasks in order to ensure their comparability.
3. Comparing the characteristics of TLU and test tasks to assess authenticity
Bachman and Palmer's model of language ability
I. Organizational knowledge (how utterances or sentences and texts are organized)
A. Grammatical knowledge (how individual utterances or sentences and text are organized)
1. knowledge of vocabulary
2. knowledge of syntax
3. knowledge of phonology/graphology.
B. Textual knowledge (how utterances or sentences are organized to form texts)
1. knowledge of cohesion
2. knowledge of rhetorical or conversational organization
II. Pragmatic knowledge (how utterances or sentences and texts are related to the communicative goals of the language user and to the features of the language use settings).
A. Functional knowledge (how utterances or sentences and texts are related to the communicative goals of language users).
1. knowledge of ideational functions.
2. knowledge of manipulative functions.
3. knowledge of heuristic functions.
4. knowledge of imaginative functions.
B. sociolinguistic knowledge (how utterances or sentences and texts are related to features of the language use setting).
1. knowledge of dialects/varieties.
2. knowledge of registers.
3. knowledge of natural or idiomatic expressions.
4. knowledge of cultural references and figures of speech.
ideational functions
enables us to express or interpret meaning in terms of our experience of the real world.
manipulative functions
enables us to use language to affect the world around us. Includes:
1. instrumental functions: performed to get other people to do things for us.
2. regulatory functions: used to control what other people do (i.e., rules).
3. interpersonal functions: used to establish, maintain, and change interpersonal relationships.
heuristic functions
enables us to use language to extend our knowledge of the world around us.
imaginative functions
enables us to use language to create an imaginary world or extend the world around us for humorous or esthetic purposes.
Bachman and Palmer's noncompensatory view of strategic competence.
strategic competence = a set of metacognitive components (i.e., strategies) which can be thought of as higher-order executive processes that provide a cognitive management function in language use, as well as in other cognitive activities. This conceptualization provides an essential basis both for designing and developing potentially interactive test tasks and for evaluating the interactiveness of the test tasks we use. The 3 general components are: goal setting, assessment, and planning.
goal setting (strategic competence)
identifying/choosing the language use tasks or test tasks, then deciding whether or not to attempt to complete the task(s).
since the purpose of a language test is to elicit a specific sample of language use, the test taker’s flexibility in setting goals for performance on test tasks is generally not as great as that enjoyed by language users in non-test language use.
assessment (strategic competence)
a. the characteristics of the language use or test task in order to determine the desirability and feasibility of successfully completing the task,
b. the individual’s own topical and language knowledge to determine which ones (including affective schemata) might be utilized for successfully completing the task.
c. the correctness/appropriateness (including affective schemata) of the response to the test task.
planning (strategic competence)
deciding how to utilize language/topical knowledge and affective schemata to complete the test task successfully.
in other words, the plan specifies how the various elements will be combined and ordered when realized as a response.
Design statement components
1. Purpose of test
2. Description of the TLU domain and task types.
3. Characteristics of test takers.
4. Definition of constructs.
5. Plan for evaluating the qualities of usefulness
6. Inventory of available resources and plan for their allocation and management.
Blueprint components
I. Test Structure
a. number of parts/tasks
b. salience of parts
c. sequence of parts
d. relative importance of parts/tasks
e. number of tasks per part

II. Test Task Specifications
a. Purpose
b. Definition of constructs
c. Setting
d. Time allotment
e. Instructions
f. Characteristics of input and expected response
g. Scoring method.
Two types of TLU domains
1) real-life domains: the identification and description of real-life characteristics that the test tasks are for. This is an appropriate base for designing test tasks when there is a clear basis for knowing under what real-life conditions the test taker will be using the language being tested. An additional consideration (of this as a basis) is that the test takers’ language ability is at a high enough level and broad enough for it to be reasonable to ask them to perform test tasks based on real-life tasks.

2) language instructional domains: when the test takers are students in a language course, the test often provides feedback on how well students have learned the content of the course (e.g., classroom quiz, achievement test). When the characteristics of language instructional tasks closely match the characteristics of real-life tasks, then the test developer can use tasks in either domain or both as a basis. If not, then the authors suggest that the test tasks be described in the real-life domain to have a positive impact on instruction (i.e., if test tasks can be made more authentic, then instructional tasks could also be made more authentic).
High-stakes vs. low-stakes decisions
high stakes tests have major impact on the lives of a large number of individuals or on large programs.

low stakes tests have little impact on the lives of a relatively small number of people or on small programs.
selection decision
determining which individuals should be admitted into a particular program or job
placement decision
determining which of several different levels would be appropriate for the test taker.
diagnostic decision
determining specific areas of strength or weakness so as to assign students to a specific course or learning activity.
formative evaluation
information about how to help students guide their own subsequent learning, or for helping teachers modify their teaching methods and materials so as to make them more appropriate for their students' needs, interests, and capabilities.
summative evaluation
information about the students' achievement at the end of a course of study.
McNamara's strong vs. weak sense of language performance assessment
“McNamara . . . divides views of performance into two basic implicit assumptions about what test results represent. The 1st approach takes what he terms a work sample approach, with a view of task success being of paramount importance. His 2nd approach is a more cognitive approach that takes a more explicitly linguistic basis and focuses less on the particular task and more on the qualities of the linguistic execution of the task. These views he designates as ‘a strong and a weak sense of the term second language performance test’ (p. 43). In the strong sense of language performance assessment, the tasks represent real-world tasks and the criteria used to evaluate the performance are primarily those real-world criteria used to assess the fulfillment of the task. Language production features will at most be partial criteria in assessing performance fulfillment. In short, performance of the target task is of primary importance and language is viewed simply as the medium through which the task is carried out. In McNamara’s weak sense, language performance tests are primarily concerned with the language performance evinced through the particular tasks. While test tasks may resemble real-life tasks, in that examinees engage in tasks that may exist outside a language testing situation, the capacity to perform the task per se is not the primary focus of assessment. One primary purpose of the task is as a mechanism to elicit a language sample--hopefully a language sample demonstrating some fidelity to the language of the real-life event--which can then be evaluated for its linguistic effectiveness and appropriateness” (pp. 23-24).
Differences between NRT and CRT
Differences in interpretations of results
NRT: score is interpreted relative to other students’ scores (i.e., percentile)
CRT: score is absolute; a student’s performance is a percent of material known.
Differences in goals of measurement
NRT: goal is to measure general language ability/proficiency (e.g., TOEFL)
CRT: goal is to measure knowledge of a specific domain or objectives-based language points (e.g., unit test in a classroom)
Differences in purposes for testing
NRT: purpose is to spread students along a continuum of general proficiency for comparison
CRT: purpose is to assess how much material is known or has been learned
Differences in distribution of expected scores
NRT: normal distribution of scores around a mean (i.e., bell curve)
CRT: varies, often not “normal,” as Ss who know the material should get 100%
similarities between NRT and CRT
1. Both require specification of the achievement domain to be measured.
2. Require a relevant and representative sample of items.
3. Use the same types of test items (but not same type of measurements)
4. Use the same rules for item writing (not on item difficulty)
5. Are judged by the same qualities of goodness (validity and reliability).
6. Are useful in educational measurement.
Advantages and disadvantages of holistic scoring
advantages:
1. students do not risk being assessed solely on the basis of one lesser aspect.
2. the approach puts the emphasis on what is done well, not on deficiencies.
disadvantages:
1. scores don’t provide diagnostic information.
2. it is difficult to interpret the score.
3. it lumps together in one score uneven abilities.
4. differential weighting of aspects of language may produce unfair results.
5. longer essays may receive higher grades.
6. the approach penalizes efforts
7. reducing the score to one figure reduces reliability.
8. increasing reliability decreases validity.
9. it may confound writing ability with language proficiency.
Advantages and disadvantages of analytic scoring
advantages:
1. there is no collapsing of categories.
2. training raters is easier.
disadvantages:
1. there is no assurance that analytic scales will be assured.
2. writing is more than the sum of its parts.
3. prefer essays..
4. the scales may not be informative, especially if scales of concern are somewhat neglected by the raters.
5. individual scales may call for qualitative judgments.
3 types of selected response items
1. Binary choice
2. Matching
3. Multiple choice
3 types of constructed response items
1. Fill-in the blank
2. Short answer
3. Performance (essays, role-playing, communicative tasks, etc...)
3 types of personal response items
1. Conferences
2. Portfolios
3. Self-assessments
Advantages and disadvantages of selected response
advantages: requires a short time to administer; easy to score; scoring is objective.

disadvantages: relatively difficult to create; requires no language production on the part of the students.
Advantages and disadvantages of constructed response
advantages: virtually no guessing factor; allows for testing productive language use; allows for testing interaction of receptive and productive skills.

disadvantages: difficult and time consuming to score; scoring is subjective; bluffing is possible.
advantages and disadvantages of personal response items
advantages: personal assessment; directly related to and integrated into curriculum; appropriate for assessing learning processes.

disadvantages: difficult to create and structure; scoring is subjective.
Item Response Theory (IRT)
A family of statistical approaches that provide probabilistic models linking item difficulty with an examinee's ability. Read Brown and Hudson, pp. 128-148.
Item discrimination limits
Item Facility: between .4 and .7
Item Discrimination: above .4
Point-Biserial: above .3

B-index: above .4
Item phi: above .3
