
Using Tests for Assessment
This chapter provides some guidelines for using locally or externally developed tests for assessment. Tips for planning and developing a test, as well as for analyzing the quality of a test, are included.
Questions & Answers About Using Tests for Assessment
Q) How does a test fit into an assessment plan?
A) In most cases, a test will be one part of a fully developed assessment plan. Most programs have cognitive, attitudinal, and performance goals. Tests are commonly used in association with cognitive goals to review student achievement with respect to a common body of knowledge associated with a discipline or practice. Tools such as interviews and surveys are used more commonly in conjunction with other non-cognitive assessment techniques.
Q) When is a test a good choice for an assessment program?
A) Use of a test is a good choice when:
Q) What are the major advantages of testing as an assessment technique?
A) In comparison to other assessment procedures, tests can sample student knowledge efficiently and reliably: you can find out what a large number of students know in a brief period of time. A second advantage is that repeated use of a test provides a means of comparison between different student groups or within the same group over time. This type of testing practice provides reviewers with a rich context for evaluation, decision-making, and recommendations.
Q) What are the distinguishing features of a good test?
A) At a minimum, a good test will have:
Q) What are some strategies for creating a departmental test for assessment?
A) One common practice is to develop and adopt common pretests and posttests in courses with multiple sections. Items for common tests could be culled from existing exams. Another practice is to designate a portion of each unit exam, a specific set of items, that is scored both for program assessment and for individual evaluation. This practice is sometimes referred to as course-embedded testing. (When using course-embedded testing, it is important to notify students how the practice affects the assignment of grades.)
Q) How is using a test for assessment purposes different from using a test in the classroom?
A) Generally, instructors develop their own classroom tests, making all decisions about when and how to construct, administer, score, and report results of tests. Construction is often done without formality or documentation. With assessment, planning, implementing, and using results become a group effort: a shared set of decisions and responsibilities. Consensus is emphasized. Some additional planning time, communication, and record keeping will be needed. The most frequent use of tests by instructors is to assign grades related to individual student learning. When used for program assessment, test performance is generally used along with other information to describe group achievement and is independent of grading.
Q) How can we prepare students for tests as assessment?
A) Because tests for assessment are different from tests for grading, it is recommended that you provide a brief orientation that covers the following topics:
Q) What is a standardized test?
A) A standardized test is one in which the initial construction, as well as the conditions for administration and scoring, follows a uniform procedure so that the scores may be interpreted in a consistent manner from one administration to the next. Such tests are designed by test development specialists, either internally or externally.
Q) What are the basic steps in developing a test?
A) Seven basic and sequential steps are recommended:
Q) What are the differences between a locally-developed and an externally-developed test?
A) Differences between locally- and externally-developed tests are described below.
| Characteristics | Externally-developed tests | Locally-developed tests |
| Development time | None | Varies; depends on local testing practices, test development resources and expertise |
| Relationship between test and program objectives | Varies; test is tailored to broad-based models | Close; tailored to local needs; adjusted as needed |
| Comparison groups | May include national, regional; may include norms by gender, class level, college/major, institution type; infrequently updated | Created and maintained locally; generally no external norms; can be modified as needed |
| Costs | Usually high; materials & scoring costs may be recurring | Usually low; can be managed with limited recurring costs |
| Results | May be long delays; little choice in type of analyses | Can be immediate; local needs/decisions drive analyses |
Q) Are there any important decisions that precede test development?
A) Yes. Before a test is developed, a few decisions about use of results need to be clarified because they will have an influence on test construction. Three critical decisions are raised here.
Criterion-referenced tests:
Norm-referenced tests:
Good questions to ask are:
Table 1: Features of Various Test Forms
| | Objective | Essay | Oral | Performance |
| Conditions for Use | Assess knowledge with maximum efficiency and reliability | Assess thinking skills and/or mastery of a structure of knowledge | Assess knowledge during instruction process | Assess ability to use knowledge to create a new product |
| Stimulus Material | Multiple-choice, True/false, Fill-in, Matching | Writing task | Open-ended prompts | Event or directions that provide a frame for a performance or product |
| Student Response | (Recognition) Select from options provided | (Production) Organize, construct, & deliver | (Production) Interpret, construct, & deliver | (Production) Plan, construct, & deliver original response |
| Scoring | Count correct answers | Judge understanding | Determine correctness of answer | Apply attributes checklists or rating scales to describe performance or proficiency |
| Major Strengths | Efficiency: can sample broad content range | Gets at complex thinking | Immediately links assessment and instruction | Provides rich evidence of performance skills |
| Potential Weaknesses | Poorly written items, over-emphasis on factual recall, poor test-taking skills, failure to obtain a representative sample of content | Poorly written exercises, confounded with knowledge of content, poor scoring procedures | Poor questions, students' lack of willingness to respond, too few questions | Poor exercises, too few samples of performance, vague criteria, poor rating procedures, poor test conditions |
| Influence of Format on Learning | Can require complex thinking skills, failure to obtain a representative sample of content | Encourages thinking and development of writing skills | Stimulates participation in instruction, provides immediate feedback | Emphasizes use of skill & knowledge application in problem context |
| Keys to Successful Use of Format | Clear test blueprint, item writing skill, ample construction time | Carefully prepared writing prompts & model answers, ample scoring time | Clear questions, systematic sampling procedures, ample time for response | Clear performance criteria, clear rating scales, ample rating time |
Adapted from: Stiggins, R. J. (1987). p. 35.
Test Blueprint
Tests that are soundly constructed are built to meet specifications just as a house is built according to a plan. The plan for a test is frequently called a test blueprint (also called a table of specifications or a test matrix). Two dimensions are laid out as the columns and rows of a two-way table. At the intersections between the rows and columns, the number of items needed to test the area (or percentage of importance) is recorded. A simple example is provided in Figure 1.
Figure 1. Example of a Test Blueprint for an Early American History Test
Discipline/Topic: Early American History
| Content: Historical Periods | Process: Recall Facts | Process: Comprehend Concepts | Process: Apply Facts & Concepts | Total |
| Exploration | 10 (20%) | 5 (10%) | 1 (2%) | 16 (32%) |
| Colonization | 10 (20%) | 5 (10%) | 1 (2%) | 16 (32%) |
| Revolution | 12 (24%) | 5 (10%) | 1 (2%) | 18 (36%) |
| Total | 32 (64%) | 15 (30%) | 3 (6%) | 50 (100%) |
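The row and column totals in a blueprint like Figure 1 can be checked with a few lines of code. The following is a minimal sketch in Python, not part of the original handbook; the names `BLUEPRINT`, `LEVELS`, `blueprint_totals`, and `as_percent` are hypothetical and chosen only for illustration. The item counts are copied from Figure 1.

```python
# Item counts per cell, copied from Figure 1.
# Rows are content areas; columns are levels of thought (see LEVELS).
LEVELS = ["Recall Facts", "Comprehend Concepts", "Apply Facts & Concepts"]
BLUEPRINT = {
    "Exploration": [10, 5, 1],
    "Colonization": [10, 5, 1],
    "Revolution": [12, 5, 1],
}


def blueprint_totals(blueprint):
    """Return (row_totals, column_totals, grand_total) for a blueprint."""
    row_totals = {area: sum(counts) for area, counts in blueprint.items()}
    column_totals = [sum(column) for column in zip(*blueprint.values())]
    grand_total = sum(row_totals.values())
    return row_totals, column_totals, grand_total


def as_percent(count, grand_total):
    """Express an item count as a whole-number percentage of the test."""
    return round(100 * count / grand_total)


rows, cols, total = blueprint_totals(BLUEPRINT)
for area, count in rows.items():
    print(f"{area}: {count} items ({as_percent(count, total)}%)")
for level, count in zip(LEVELS, cols):
    print(f"{level}: {count} items ({as_percent(count, total)}%)")
print(f"Total: {total} items")
```

Keeping the blueprint as data makes it easy to confirm that the weights still sum to 100 percent after a cell is adjusted.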
Things to notice in the blueprint shown in Figure 1:
Components of a Test Blueprint
Content areas, thinking processes, and importance specifications are given in test blueprints. Using the steps below and the blank test blueprint (Exercise 1), a practice test blueprint can be drafted.
Steps for Completing a Test Blueprint
Suggestions for Developing Test Blueprints
Functions of a Test Blueprint: A Summary
Potential Resources for Locally Constructed Tests
Before you begin writing items for a test, use the following checklists to consider possible resources.
Internal:
External:
Suggestions for Reviewing an Externally-Constructed Test
Item Writing Suggestions
This section contains general item writing tips as well as specific tips for various item types.
Suggestions for All Item Types
Item Stems (definition: the part of a test item that poses the question or sets up the problem situation; the stimulus)
Item Responses
Suggestions/Checks for Multiple-Choice Items
Suggestions/Checks for Matching Items
Suggestions/Checks for True-False (Alternative Response) Items
Suggestions/Checks for Writing Completion or Short Answer Items
Suggestions/Checks for Writing Essay Items
Analyzing the Quality of Test Items
Analyzing the quality of test items can be accomplished in phases. Some general suggestions follow:
If using an objective test format, check for:
If using other formats, check for:
Calculating Item Difficulty and Discrimination
Before test scores are interpreted or used for decisions, it is essential to review how well the items functioned. Calculating item difficulty and discrimination values is recommended practice.
Difficulty Index (P value):
An item's difficulty index is expressed as the proportion of students who responded correctly (successfully) to an item. If scores from all students in a group are included, the difficulty index is simply the total percent correct. When a sufficient number of scores is available (i.e., 100 or more), difficulty indexes are calculated using scores from the top and bottom 27 percent of the group.
The value is interpreted in an inverse way, that is, a high value is interpreted as less difficult. These values can range from 0 to 1.00 and are usually expressed as a percentage. Item difficulty is calculated in the following way:
$$P = \frac{\text{Successes in the HSG} + \text{Successes in the LSG}}{N \text{ in HSG} + N \text{ in LSG}}$$
(HSG = high-scoring group; LSG = low-scoring group; N = number of students)
Item difficulty is generally interpreted in the following way:
| P-Value | Percent Range | Interpretation |
| > or = .75 | 75-100 | Easy |
| < or = .25 | 0-25 | Hard |
| between .25 & .75 | 26-74 | Average |
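The grouping step described above (taking the top and bottom 27 percent of scorers) can be sketched as follows. This is an illustrative sketch only: the function name `split_high_low`, the `percent` parameter, and the dictionary layout of the scores are assumptions, and the handling of tied scores (left in sorted order) is a simplification.

```python
def split_high_low(total_scores, percent=27):
    """Return (high_group, low_group) as lists of student ids.

    total_scores maps a student id to that student's total test score.
    percent is the share of students placed in each group (assumed 27).
    """
    # Rank students from highest to lowest total score.
    ranked = sorted(total_scores, key=total_scores.get, reverse=True)
    # Whole-number arithmetic avoids floating-point rounding surprises.
    group_size = max(1, len(ranked) * percent // 100)
    return ranked[:group_size], ranked[-group_size:]


# Hypothetical data: 100 students whose scores equal their index.
scores = {f"s{i}": i for i in range(100)}
high, low = split_high_low(scores)
print(len(high), len(low))
```

With smaller groups, such as the 20 students in the example that follows, the text notes that all scores can be used instead.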
Example:
Item: Who is the current host of the Tonight Show?
Response Options
| Groups | A | B | C | D* | Total |
| High Scorers | 0 | 1 | 1 | 8 | 10 |
| Low Scorers | 1 | 1 | 5 | 3 | 10 |
| Total | 1 | 2 | 6 | 11 | 20 |
P = (8 + 3) / (10 + 10) = 11/20 = .55
Interpretation: This item is average in difficulty. Slightly more than half of the students got the item correct.
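The same calculation can be expressed as a short function. This is a minimal sketch, assuming the counts of correct answers in each group are already known; the function names are hypothetical, and the cutoffs are taken from the interpretation table above.

```python
def difficulty_index(high_correct, low_correct, n_high, n_low):
    """Proportion of students in both groups who answered correctly."""
    return (high_correct + low_correct) / (n_high + n_low)


def interpret_difficulty(p):
    """Map a P value onto the Easy / Average / Hard bands."""
    if p >= 0.75:
        return "Easy"
    if p <= 0.25:
        return "Hard"
    return "Average"


# Counts from the example item: 8 of 10 high scorers and
# 3 of 10 low scorers chose the keyed answer (D).
p = difficulty_index(high_correct=8, low_correct=3, n_high=10, n_low=10)
print(f"P = {p:.2f} ({interpret_difficulty(p)})")  # P = 0.55 (Average)
```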
Discrimination Index (D value):
An item's discrimination index is expressed as the difference between the proportions of high and low scorers who answered the item correctly. The value is interpreted in terms of both direction (positive or negative) and strength (non-discriminating to strongly discriminating). These values can range from -1.00 to +1.00. Item discrimination is calculated in the following way:
$$D = \frac{\text{Successes in the HSG}}{N \text{ in HSG}} - \frac{\text{Successes in the LSG}}{N \text{ in LSG}}$$
(HSG = high-scoring group; LSG = low-scoring group; N = number of students)
Item discrimination is generally interpreted in the following way:
| D-Value | Direction | Strength |
| > +.40 | positive | strong |
| +.20 to +.40 | positive | moderate |
| -.20 to +.20 | none | -- |
| < -.20 | negative | moderate to strong |
Refer to Previous Example
D = 8/10 - 3/10 = 5/10 = .50
Interpretation: This item is positive and strongly discriminating. More high-scoring than low-scoring students answered the item correctly.
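A comparable sketch for the discrimination index follows. The function names are hypothetical. The interpretation table lists +.20 under both "moderate" and "none", so treating a value of exactly +.20 as moderate is an assumption made here.

```python
def discrimination_index(high_correct, low_correct, n_high, n_low):
    """Share of high scorers correct minus share of low scorers correct."""
    return high_correct / n_high - low_correct / n_low


def interpret_discrimination(d):
    """Return (direction, strength) using the bands in the table above."""
    if d > 0.40:
        return ("positive", "strong")
    if d >= 0.20:  # assumption: exactly +.20 counts as moderate
        return ("positive", "moderate")
    if d >= -0.20:
        return ("none", "--")
    return ("negative", "moderate to strong")


# Counts from the example item: 8 of 10 high scorers and
# 3 of 10 low scorers chose the keyed answer (D).
d = discrimination_index(high_correct=8, low_correct=3, n_high=10, n_low=10)
direction, strength = interpret_discrimination(d)
print(f"D = {d:.2f} ({direction}, {strength})")  # D = 0.50 (positive, strong)
```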
Other Features to Consider
The pattern of incorrect responses is another feature of item analysis. Ideally, each incorrect response should be selected by a higher percentage of low- than high-scoring students. In the previous example, responses A and B are selected by few students; these options do not provide much information about student knowledge and should be revised.
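The check on incorrect responses can also be automated. The sketch below uses the response counts from the example item and flags any incorrect option that is not chosen by a higher share of low scorers than high scorers, or that is chosen by very few students overall. The function name and the 10 percent cutoff for "rarely chosen" are assumptions for illustration; the text does not specify a threshold.

```python
# Response counts copied from the example item; "D" is the keyed answer.
HIGH = {"A": 0, "B": 1, "C": 1, "D": 8}
LOW = {"A": 1, "B": 1, "C": 5, "D": 3}


def review_distractors(high, low, key, min_percent=10):
    """Return {option: [flags]} for every incorrect option.

    min_percent is an assumed cutoff for calling an option rarely chosen.
    """
    n_high = sum(high.values())
    n_low = sum(low.values())
    notes = {}
    for option in high:
        if option == key:
            continue
        flags = []
        # Compare group shares by cross-multiplying (whole numbers only).
        if low[option] * n_high <= high[option] * n_low:
            flags.append("not favored by low scorers")
        if (high[option] + low[option]) * 100 <= min_percent * (n_high + n_low):
            flags.append("rarely chosen")
        notes[option] = flags
    return notes


notes = review_distractors(HIGH, LOW, key="D")
for option, flags in notes.items():
    print(option, flags if flags else "functions as intended")
```

Under these assumptions, options A and B are flagged and option C is not, which matches the reading of the example given in the text.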
The relationship of difficulty and discrimination patterns to the intended test purpose is a critical aspect of item analysis. The table below outlines how the majority of items should function for norm- and criterion-referenced tests.
Table 3: Relationships Between Item Analysis Indices and Test Purposes
Test Purpose
| | Norm-referenced | Criterion-referenced |
| Item Difficulty | depends on the number of response choices; generally around .50 | depends on level of mastery required; generally < or = .75 |
| Item Discrimination | should be > or = .20; should be answered correctly by students with high total scores and incorrectly by students with low total scores | should be non-discriminating (between -.20 and +.20) |
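The guidelines in Table 3 can be turned into a simple screening function. This is a rough sketch under stated assumptions: the function name is hypothetical, "around .50" is approximated by the Average difficulty band (between .25 and .75), and the criterion-referenced difficulty limit is coded exactly as printed in the table. Local mastery requirements may call for different cutoffs.

```python
def purpose_concerns(p, d, purpose):
    """List the ways an item departs from the Table 3 guidelines.

    p is the difficulty index, d is the discrimination index, and
    purpose is either "norm" or "criterion".
    """
    concerns = []
    if purpose == "norm":
        # Assumption: "around .50" is read as the Average band.
        if not 0.25 < p < 0.75:
            concerns.append("difficulty is not near .50")
        if d < 0.20:
            concerns.append("discrimination is below .20")
    elif purpose == "criterion":
        if p > 0.75:
            concerns.append("difficulty is above .75")
        if not -0.20 <= d <= 0.20:
            concerns.append("item discriminates between high and low scorers")
    else:
        raise ValueError("purpose must be 'norm' or 'criterion'")
    return concerns


# The example item has P = .55 and D = .50.
print(purpose_concerns(0.55, 0.50, "norm"))
print(purpose_concerns(0.55, 0.50, "criterion"))
```

An empty list means the item raised no concerns for that purpose. The example item fits a norm-referenced test but would be questioned on a criterion-referenced one.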
Analyzing the Quality of a Test
Analyzing the quality of a test involves an examination of validity and reliability characteristics of the test. Some basic definitions and critical concerns are reviewed below.
Validity:
Table 4: Two Important Types of Validity
| Type | Definition | Validity Threatened by | Prevention & Checks |
| Content | extent to which a test adequately samples program objectives | poor relationship between test items and program objectives | use & review test blueprint |
| Construct | extent to which a test measures the amount learned and not some other extraneous variable | for objective tests: poorly constructed items that measure test-taking skill rather than mastery of material; for other test formats: problems with raters and scoring procedures | use good item writing practices; develop good rating criteria; train for & check on raters' consistency; check correlations with other related & unrelated information (for example, SATs & GPAs) |
Reliability: