[Ball State University]
Chapter IV

Using Tests for Assessment

This chapter provides guidelines for using locally- or externally-developed tests for assessment. It includes tips for planning and developing a test and for analyzing a test's quality.

Questions & Answers About Using Tests for Assessment

Q) How does a test fit into an assessment plan?

A) In most cases, a test will be one part of a fully developed assessment plan. Most programs have cognitive, attitudinal, and performance goals. Tests are commonly used in association with cognitive goals – to review student achievement with respect to a common body of knowledge associated with a discipline or practice. Tools such as interviews and surveys are more commonly used to assess non-cognitive goals.

Q) When is a test a good choice for an assessment program?

A) Use of a test is a good choice when:

Q) What are the major advantages of testing as an assessment technique?

A) In comparison to other assessment procedures, tests can sample student knowledge with efficiency and reliability – you can find out what a lot of students know in a brief period of time. A second advantage is that repeated use of a test will provide a means of comparison between different student groups or the same group over time. This type of testing practice provides reviewers with a rich context for evaluation, decision-making, and recommendations.

Q) What are the distinguishing features of a good test?

A) At a minimum, a good test will have:

Q) What are some strategies for creating a departmental test for assessment?

A) One common practice is to develop and adopt common pretests and posttests in courses with multiple sections. Items for common tests could be culled from existing exams. Another practice is to designate a specific set of items on each unit exam that is scored both for program assessment and for individual evaluation. This practice is sometimes referred to as course-embedded testing. (When using course-embedded testing, it is important to tell students how the practice affects the assignment of grades.)

Q) How is using a test for assessment purposes different from using a test in the classroom?

A) Generally, instructors develop their own classroom tests, making all decisions about when and how to construct, administer, score, and report results of tests. Construction is often done without formality or documentation. With assessment, planning, implementing, and using results become a group effort – a shared set of decisions and responsibilities. Consensus is emphasized. Some additional planning time, communication, and record keeping will be needed. The most frequent use of tests by instructors is to assign grades related to individual student learning. When used for program assessment, test performance is generally used along with other information to describe group achievement and is independent of grading.

Q) How can we prepare students for tests as assessment?

A) Because tests for assessment differ from tests for grading, it is recommended that you provide a brief orientation which covers the following topics:

Q) What is a standardized test?

A) A standardized test is one in which the initial construction, as well as the conditions for administration and scoring, follows a uniform procedure so that the scores may be interpreted in a consistent manner from one administration to the next. Standardized tests are designed by test development specialists, who may be internal or external to the institution.

Q) What are the basic steps in developing a test?

A) Seven basic and sequential steps are recommended:

  1. Determine the outcomes to be measured.
  2. Develop a test blueprint.
  3. Write the test items.
  4. Review, critique, and edit the items.
  5. Pilot the items.
  6. Obtain reliability and validity data.
  7. Revise, reuse, and report.

Q) What are the differences between a locally-developed and an externally-developed test?

A) Differences between locally- and externally-developed tests are described below.

Development time
  Externally-developed tests: None
  Locally-developed tests: Varies; depends on local testing practices, test development resources and expertise

Relationship between test and program objectives
  Externally-developed tests: Varies; test is tailored to broad-based models
  Locally-developed tests: Close; tailored to local needs; adjusted as needed

Comparison groups
  Externally-developed tests: May include national, regional; may include norms by gender, class level, college/major, institution type; infrequently updated
  Locally-developed tests: Created and maintained locally; generally no external norms; can be modified as needed

Costs
  Externally-developed tests: Usually high; materials & scoring costs may be reoccurring
  Locally-developed tests: Usually low; can be managed with limited reoccurring costs

Results
  Externally-developed tests: May be long delays; little choice in type of analyses
  Locally-developed tests: Can be immediate; local needs/decisions drive analyses

Q) Are there any important decisions that precede test development?

A) Yes. Before a test is developed, a few decisions about use of results need to be clarified because they will have an influence on test construction. Three critical decisions are raised here.


Table 1: Features of Various Test Forms

Test forms compared: Objective, Essay, Oral, Performance

Conditions for Use
  Objective: Assess knowledge with maximum efficiency and reliability
  Essay: Assess thinking skills and/or mastery of a structure of knowledge
  Oral: Assess knowledge during instruction process
  Performance: Assess ability to use knowledge to create a new product

Stimulus Material
  Objective: Multiple-choice, True/false, Fill-in, Matching
  Essay: Writing task
  Oral: Open-ended prompts
  Performance: Event or directions that provide a frame for a performance or product

Student's Response
  Objective: (Recognition) Select from options provided
  Essay: (Production) Organize, construct, & deliver
  Oral: (Production) Interpret, construct, & deliver
  Performance: (Production) Plan, construct, & deliver original response

Scoring
  Objective: Count correct answers
  Essay: Judge understanding
  Oral: Determine correctness of answer
  Performance: Apply attributes checklists or rating scales to describe performance or proficiency

Major Strengths
  Objective: Efficiency; can sample broad content range
  Essay: Gets at complex thinking
  Oral: Immediately links assessment and instruction
  Performance: Provides rich evidence of performance skills

Potential Weaknesses
  Objective: Poorly written items, over-emphasis on factual recall, poor test-taking skills, failure to obtain a representative sample of content
  Essay: Poorly written exercises, confounded with knowledge of content, poor scoring procedures
  Oral: Poor questions, students' lack of willingness to respond, too few questions
  Performance: Poor exercises, too few samples of performance, vague criteria, poor rating procedures, poor test conditions

Influence of Format on Learning
  Objective: Can require complex thinking skills, failure to obtain a representative sample of content
  Essay: Encourages thinking and development of writing skills
  Oral: Stimulates participation in instruction, provides immediate feedback
  Performance: Emphasizes use of skill & knowledge – application in problem context

Keys to Successful Use of Format
  Objective: Clear test blueprint, item writing skill, ample construction time
  Essay: Carefully prepared writing prompts & model answers, ample scoring time
  Oral: Clear questions, systematic sampling procedures, ample time for response
  Performance: Clear performance criteria, clear rating scales, ample rating time

Adapted from: Stiggins, R. J. (1987). p. 35.

Test Blueprint

Tests that are soundly constructed are built to meet specifications just as a house is built according to a plan. The plan for a test is frequently called a test blueprint (also called a table of specifications or a test matrix). Two dimensions are laid out as the columns and rows of a two-way table. At the intersections between the rows and columns, the number of items needed to test the area (or percentage of importance) is recorded. A simple example is provided in Figure 1.

Figure 1. Example of a Test Blueprint for an Early American History Test

Discipline/Topic: Early American History

Content:              Process: Levels of Thought
Historical Periods    Recall Facts    Comprehend Concepts    Apply Facts & Concepts    Total
Exploration           10 (20%)        5 (10%)                1 (2%)                    16 (32%)
Colonization          10 (20%)        5 (10%)                1 (2%)                    16 (32%)
Revolution            12 (24%)        5 (10%)                1 (2%)                    18 (36%)
Total                 32 (64%)        15 (30%)               3 (6%)                    50 (100%)

Things to notice in the blueprint shown in Figure 1:

Components of a Test Blueprint

Test blueprints specify content areas, thinking processes, and the importance of each. Using the steps below and the blank test blueprint (Exercise 1), you can draft a practice test blueprint.

Steps for Completing a Test Blueprint

  1. First, identify the discipline/topic area to be addressed by the test in the blank provided.
  2. Next, define the test's content areas. Along the left column, list at least 3 general content areas from your discipline.
  3. Now, define the intellectual processes to be assessed. Along the top row, list at least three intellectual processes.
  4. Finally, define the importance of the test content areas and processes to the overall test. Enter the number and/or percentage of items in the blanks where content areas and processes overlap. The values should reflect the importance of content/process relationships as conveyed in the curriculum and instruction.
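
The arithmetic behind a blueprint is easy to check by machine. The following is a minimal sketch in Python, using the item counts from Figure 1; the names `blueprint`, `summarize`, and `percent` are illustrative only and are not part of any testing package.

```python
# Item counts from Figure 1: content areas (rows) by thinking process (columns).
blueprint = {
    "Exploration": {"Recall Facts": 10, "Comprehend Concepts": 5, "Apply Facts & Concepts": 1},
    "Colonization": {"Recall Facts": 10, "Comprehend Concepts": 5, "Apply Facts & Concepts": 1},
    "Revolution": {"Recall Facts": 12, "Comprehend Concepts": 5, "Apply Facts & Concepts": 1},
}


def summarize(table):
    """Return (row totals, column totals, grand total) for a blueprint."""
    row_totals = {area: sum(cells.values()) for area, cells in table.items()}
    col_totals = {}
    for cells in table.values():
        for process, count in cells.items():
            col_totals[process] = col_totals.get(process, 0) + count
    return row_totals, col_totals, sum(row_totals.values())


def percent(count, total):
    """Share of the whole test, rounded to a whole percent."""
    return round(100 * count / total)


row_totals, col_totals, grand_total = summarize(blueprint)
for area, total in row_totals.items():
    print(f"{area}: {total} items ({percent(total, grand_total)}%)")
for process, total in col_totals.items():
    print(f"{process}: {total} items ({percent(total, grand_total)}%)")
```

Comparing the printed percentages with the emphasis each area receives in the curriculum is one quick way to see whether the draft blueprint needs rebalancing.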

Suggestions for Developing Test Blueprints

Functions of a Test Blueprint – A Summary:

Potential Resources for Locally Constructed Tests

Before you begin writing items for a test, use the following checklists to consider possible resources.

Internal:

External:

Suggestions for Reviewing an Externally-Constructed Test

Item Writing Suggestions

This section contains general item writing tips as well as specific tips for various item types.

Suggestions for All Item Types

Item Stems (definition: the part of a test item that poses the question or sets up the problem situation; the stimulus)

Item Responses

Suggestions/Checks for Multiple-Choice Items

Suggestions/Checks for Matching Items

Suggestions/Checks for True-False (Alternative Response) Items

Suggestions/Checks for Writing Completion or Short Answer Items

Suggestions/Checks for Writing Essay Items

Analyzing the Quality of Test Items

Analyzing the quality of test items can be accomplished in phases. Some general suggestions follow:

Calculating Item Difficulty and Discrimination

Before test scores are interpreted or used for decisions, it is essential to review how well the items functioned. Calculating item difficulty and discrimination values is recommended practice.

Difficulty Index (P value):

An item's difficulty index is expressed as the proportion of students who responded correctly (successfully) to an item. If scores from all students in a group are included, the difficulty index is simply the total percent correct. When a sufficient number of scores is available (i.e., 100 or more), difficulty indexes are calculated using scores from the top and bottom 27 percent of the group.

The value is interpreted in an inverse way, that is, a high value is interpreted as less difficult. These values can range from 0 to 1.00 and are usually expressed as a percentage. Item difficulty is calculated in the following way:

P = (Successes in the HSG + Successes in the LSG) / (N in HSG + N in LSG)

(HSG = high scoring group; LSG = low scoring group; N = number)
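
When 100 or more scores are available, the high and low scoring groups can be formed by ranking students on total test score and taking the top and bottom 27 percent. The sketch below shows one way to do this in Python. The function name `split_groups`, the `fraction` parameter, and the dictionary layout are assumptions made for illustration; ties at the cut-off are broken arbitrarily by sort order, which a real analysis may want to handle more carefully.

```python
def split_groups(total_scores, fraction=0.27):
    """Split students into high and low scoring groups.

    total_scores maps a student id to that student's total test score.
    Returns (high_group, low_group), each a list of student ids.
    """
    # Rank student ids from highest to lowest total score.
    ranked = sorted(total_scores, key=total_scores.get, reverse=True)
    # Group size: the chosen fraction of all students, at least one student.
    n = max(1, round(len(ranked) * fraction))
    return ranked[:n], ranked[-n:]


# Hypothetical data: 100 students whose scores happen to equal their index.
scores = {f"s{i}": i for i in range(100)}
high_group, low_group = split_groups(scores)
print(len(high_group), len(low_group))
```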

Item difficulty is generally interpreted in the following way:

P-Value              Percent Range    Interpretation
> or = .75           75-100           Easy
< or = .25           0-25             Hard
between .25 & .75    26-74            Average

Example:

Item: Who is the current host of the Tonight Show?

  A. Johnny Carson
  B. Arsenio Hall
  C. David Letterman
  D. Jay Leno

                Response Options
Groups          A     B     C     D*    Total
High Scorers    0     1     1     8     10
Low Scorers     1     1     5     3     10
Total           1     2     6     11    20

(* marks the correct response)

P = (8 + 3) / (10 + 10) = 11/20 = .55

Interpretation: This item is average in difficulty. Slightly more than half of the students got the item correct.
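
The same calculation can be scripted so that it is applied uniformly to every item. This is a minimal sketch in Python, assuming the counts of correct responses in each group are already known; `difficulty_index` and `interpret_difficulty` are illustrative names, and the cut-offs are the ones from the interpretation table above.

```python
def difficulty_index(high_correct, low_correct, n_high, n_low):
    """Proportion of students in both groups who answered the item correctly."""
    return (high_correct + low_correct) / (n_high + n_low)


def interpret_difficulty(p):
    """Label a P value using the cut-offs in the interpretation table."""
    if p >= 0.75:
        return "easy"
    if p <= 0.25:
        return "hard"
    return "average"


# Tonight Show example: 8 of 10 high scorers and 3 of 10 low scorers chose D.
p = difficulty_index(8, 3, 10, 10)
print(f"P = {p:.2f} ({interpret_difficulty(p)})")  # P = 0.55 (average)
```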

Discrimination Index (D value):

An item's discrimination index is expressed as the difference between the proportions of high and low scorers who answered the item correctly. The value is interpreted in terms of both direction (positive or negative) and strength (non-discriminating to strongly-discriminating). These values can range from -1.00 to +1.00. Item discrimination is calculated in the following way:

D = (Successes in the HSG / N in HSG) - (Successes in the LSG / N in LSG)

(HSG = high scoring group; LSG = low scoring group; N = number)

Item discrimination is generally interpreted in the following way:

D-Value         Direction    Strength
> +.40          positive     strong
+.20 to +.40    positive     moderate
-.20 to +.20    none         --
< -.20          negative     moderate to strong

Refer to Previous Example

D = 8/10 - 3/10 = 5/10 = .50

Interpretation: This item is positive and strongly discriminating. More high-scoring than low-scoring students answered the item correctly.
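
A short Python sketch of the discrimination calculation follows. The function names are illustrative. The interpretation table lists +.20 in two adjacent rows, so the sketch assumes that a value of exactly +.20 counts as moderate and a value of exactly -.20 counts as non-discriminating; adjust these boundaries if local practice differs.

```python
def discrimination_index(high_correct, low_correct, n_high, n_low):
    """Proportion correct in the high group minus proportion correct in the low group."""
    return high_correct / n_high - low_correct / n_low


def interpret_discrimination(d):
    """Return (direction, strength) using the cut-offs in the interpretation table.

    Assumption: exactly +.20 is treated as moderate, exactly -.20 as none.
    """
    if d > 0.40:
        return ("positive", "strong")
    if d >= 0.20:
        return ("positive", "moderate")
    if d >= -0.20:
        return ("none", None)
    return ("negative", "moderate to strong")


# Tonight Show example: 8 of 10 high scorers and 3 of 10 low scorers chose D.
d = discrimination_index(8, 3, 10, 10)
print(f"D = {d:.2f}", interpret_discrimination(d))
```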

Other Features to Consider

The pattern of incorrect responses is another feature of item analysis. Ideally, each incorrect response should be selected by a higher percentage of low- than high-scoring students. In the previous example, responses A and B are selected by few students; these options do not provide much information about student knowledge and should be revised.
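
The check on incorrect responses can also be sketched in Python. The function `distractor_report` and the `min_share` threshold of 15 percent are assumptions chosen for illustration; the text above does not define how few selections count as "few", so treat the threshold as a local decision.

```python
def distractor_report(high_counts, low_counts, correct, min_share=0.15):
    """Describe how each incorrect option performed.

    high_counts and low_counts map option labels to the number of students
    in each group who chose that option. min_share is a hypothetical cut-off
    below which an option is flagged as rarely chosen.
    """
    n_high = sum(high_counts.values())
    n_low = sum(low_counts.values())
    report = {}
    for option in high_counts:
        if option == correct:
            continue
        high_share = high_counts[option] / n_high
        low_share = low_counts[option] / n_low
        total_share = (high_counts[option] + low_counts[option]) / (n_high + n_low)
        report[option] = {
            # Ideal pattern: more low scorers than high scorers pick the option.
            "favors_low_scorers": low_share > high_share,
            # Options almost nobody picks tell little about student knowledge.
            "rarely_chosen": total_share < min_share,
        }
    return report


# Counts from the Tonight Show example; D is the correct response.
high = {"A": 0, "B": 1, "C": 1, "D": 8}
low = {"A": 1, "B": 1, "C": 5, "D": 3}
for option, flags in distractor_report(high, low, correct="D").items():
    print(option, flags)
```

With the example counts, options A and B are flagged as rarely chosen, which matches the observation above that they should be revised.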

The relationship of difficulty and discrimination patterns to the intended test purpose is a critical aspect of item analysis. The table below outlines how the majority of items should function for norm- and criterion-referenced tests.

Table 3: Relationships Between Item Analysis Indices and Test Purposes

Item difficulty, by test purpose
  Norm-referenced: depends on the number of response choices; generally around .50
  Criterion-referenced: depends on level of mastery required; generally < or = .75

Item discrimination, by test purpose
  Norm-referenced: should be > or = .20; should be answered correctly by students with high total scores and incorrectly by students with low total scores
  Criterion-referenced: should be non-discriminating (between -.20 and +.20)

Analyzing the Quality of a Test

Analyzing the quality of a test involves an examination of validity and reliability characteristics of the test. Some basic definitions and critical concerns are reviewed below.

Validity:

Table 4: Two Important Types of Validity

 

Content validity
  Definition: extent to which a test adequately samples program objectives
  Validity threatened by: poor relationship between test items and program objectives
  Prevention & checks: use & review test blueprint

Construct validity
  Definition: extent to which a test measures the amount learned and not some other extraneous variable
  Validity threatened by:
    for objective tests: poorly constructed items that measure test-taking skill rather than mastery of material
    for other test formats: problems with raters and scoring procedures
  Prevention & checks:
    use good item writing practices
    development of good rating criteria; train for & check on raters' consistency
    check correlations with other related & unrelated information (for example, SATs & GPAs)

Reliability:

 



Last Modification: November 22, 1999