Archived Information

Assessment of Student Performance

April 1997

CHAPTER 4

Part 4

Technical Features

A third major characteristic of performance assessments and performance assessment systems is their technical features. Experts in the field of testing are actively debating the different technical features performance assessments must possess to fulfill their purposes. The technical features enumerated for performance assessments include not only the traditional technical criteria of content validity and reliability (interrater, internal, and test-retest), but also consequential validity, equity, and generalizability (from scores on individual performance tasks to student achievement and capabilities) (Linn, 1993). Furthermore, the notion of content validity criteria has been expanded to include content quality and meaningfulness of performance assessment systems (Herman, August, 1992; Khattri & Sweet, in press). We briefly discuss these criteria as they apply to the performance assessment systems in our sample.

For the performance assessment systems included in this study, how valid and reliable the assessments are in assessing student learning, and how well they are accomplishing their intended purposes, depends largely upon both the status (i.e., developmental, pilot, implementation) of the performance assessment system and the purposes for which it is implemented. Procedures to determine and ensure the technical robustness of the assessment systems have been established in some cases and not in others, depending upon the stage of implementation and the purposes of the performance assessment system. However, it is beyond the scope of this study to evaluate the validity and reliability of the assessment systems in the sample for the following two reasons:

Therefore, we limit our discussion below to: (1) the issues pertaining to establishing the technical robustness of performance assessment systems; and (2) the types of procedures states, districts, and schools in our sample have instituted to establish and monitor the technical quality of their systems. Exhibit 4-12 summarizes the formal evaluation procedures sites have established for use with their performance assessment systems. In addition, the table provides information about the development of assessment tasks and scoring methods and about who is involved in scoring the assessments.

Validity

Messick (1994) suggests that the major validity question that must be answered for each performance assessment system pertains to the purposes of the assessment and to the substantive domains of interest that the assessment purports to capture. He further suggests that the major threats to validity are construct underrepresentation and construct-irrelevant variance in any given assessment or assessment system. Because most of the validity criteria are linked and difficult to disentangle, he argues that a unified view of assessment validity is necessary. In agreement with this view, we discuss the validation procedures established for the sampled performance assessments under distinct aspects of the validity criteria, with the understanding that the discussed criteria are not stand-alone aspects of validity. The criteria are linked with each other in predictable ways.

Content validity, assessment meaningfulness, consequential validity, generalizability, and fairness all can be subsumed under traditional criteria of validity. These validity criteria are discussed below.

Content Validity

The sites in our sample have attempted to ensure the content validity and quality of their performance assessment systems through a variety of procedures. Such methods include, but are not limited to:

In arriving at a consensus regarding which domains of competencies and knowledge are worth teaching and assessing, states, districts, and schools in our sample have utilized various sources of information and expert help. At the state level, for example, Arizona, Maryland, Vermont, and Oregon have based their mathematics assessment tasks and scoring rubrics on the National Council of Teachers of Mathematics (NCTM) standards for curriculum and evaluation. (New York State's extant mathematics curricular frameworks also are based upon the NCTM standards.) Kentucky, Maryland, and Oregon have utilized the American Association for the Advancement of Science guidelines for formulating their curricular frameworks and associated assessment tasks. States also have involved expert teacher committees in reviewing the assessments. In addition, sites at all three levels have sought help from experts in education and testing to formulate their assessment systems.

Exhibit 4-13 summarizes the outside sources of information the sites have tapped to guide the development of their performance assessment systems. Information and expert help used to conceptualize and to develop the assessment tasks and scoring rubrics range from local-level teacher involvement to contracts with an external agency. At the state level, testing companies have supplied the expertise to conceptualize, develop, and pilot-test the performance assessment systems. At the district level, Prince William County's assessments are developed, pilot-tested, and reviewed by a testing company. South Brunswick involves the Educational Testing Service only in evaluating the assessment system, and Harrison School District 2's assessments are locally developed and scored. At the school level, professional groups in education, such as the Coalition of Essential Schools and the New Standards Project, provide information to schools in developing and using their school-based assessment systems.

The national organization, the New Standards Project, is unique in many respects in the developmental work it has undertaken to ensure the technical robustness of its assessment system. It is developing content and performance standards for various subject areas and infusing these standards into its assessment framework. The Coalition of Essential Schools provides the vision and examples of the assessments schools could use but expects the member schools to develop their own systems based on what they consider to be educationally important for their students. Pacesetter has relied upon the NCTM and the Mathematical Association of America standards to develop its mathematics curriculum and assessments.

Arizona, Kentucky, Maryland, and Vermont have established systematic procedures to determine the content validity of their assessments (see footnote 9). Oregon and New York are still in the very early phases of development, and, therefore, have not yet established rigorous validation procedures. Harrison School District 2 and Prince William County also have instituted pilot-testing and review processes to determine the content validity of their assessment systems, while South Brunswick has instituted an annual review of the scoring rubrics and procedures established for the Sixth Grade Research Performance Assessment. However, it is not possible for us to comment upon the extent to which these procedures are effective in boosting construct-relevant variance and minimizing construct-irrelevant variance.

For the assessment systems developed at the school level, validation procedures are not necessarily undertaken in a systematic fashion. Most validation procedures at the school level consist of reviews of face validity of the assessment tasks and scoring rubrics and methods. In the case of Niños Bonitos Elementary, such reviews have been systematized through regular staff meetings and the establishment of clear, agreed-upon scoring criteria. In other cases, teachers use "what works" (i.e., a trial and error method), taking that criterion as evidence of the assessments' content validity. In any case, school-level assessments are likely to be closely tied to the curriculum teachers use in their classrooms, since they are integrated with instruction on a more regular basis.

Meaningfulness

The meaningfulness criterion of assessments extends beyond the content validity of the assessment. Content validity studies are insufficient for determining the meaningfulness of tasks — it is not enough to observe that an assessment item taps into a particular subject area. The meaningfulness criterion refers to the properties an assessment task must possess in order to motivate students to truly engage in completing the task.

One of the principles of assessment reform is that assessment tasks contextualized in applied or real-world problems will be more meaningful to the student. Thus, several performance assessment systems in our sample contain assessments that are contextualized mathematics and science problems.

Whether or not such contextualization works to enhance the meaningfulness of assessment tasks and, therefore, student learning is at present poorly understood. To a large extent, the educational context must be considered to understand a student's approach to assessment items (Burstein, 1989). A student who has not been exposed to contextualized problem-solving in his or her classroom may not find contextualized problems in an assessment situation to be relevant or "meaningful." Furthermore, according to Messick (1994), individual students may be motivated by different types of tasks. "There are few one-edged swords in the measurement enterprise, and contextualization is unlikely to be one of them. Indeed, contextual features that engage and motivate one student and facilitate his or her effective task performance may alienate and confuse another student and bias or distort task performance" (Messick, 1994, p.19). The issue, then, pertains to what types of tasks can be considered to be "meaningful" in a given school environment for a given child. This issue is currently being debated within schools and, indeed, between parents and teachers.

The issue of meaningfulness has not been studied systematically in any of the systems in our sample. Meaningfulness of assessment tasks has been presumed, but not directly evaluated. Thus, Messick's remark (1994) that the terms authentic or direct assessments ". . . have all the earmarks of a validity claim but with little or no evidential grounding" (p.14) is applicable to our sample. At least one study (of the California Learning Assessment System mathematics performance assessments) shows that students find open-ended assessment tasks more interesting and challenging than multiple-choice tasks, but they do not necessarily like such challenges (Herman et al., 1994).

Messick (1994) also incorporates the idea of "transparency" in meaningfulness. In other words, are the assessment criteria clear to the student? Does the student understand what type and quality of performance are expected of him or her to earn the best scores? Some assessment systems have formally incorporated the notion of "transparency" of assessment purposes in their assessment systems. In the case of Harrison School District 2, for example, student rubrics have been devised specifically to help the student complete the assessment task and to make the scoring criteria clear to the student. The other state- and district-level assessment systems have not been specifically designed to make the assessment criteria transparent to the student; no formal scoring rubrics have been devised for the students. Nonetheless, teachers in Kentucky, New York, Oregon, Vermont, and South Brunswick do share the scoring rubrics with their students in order to engage students in the assessment process and to guide student work. At our school-level sites, teachers more often than not devise rubrics to share with their students.

Consequential Validity

Consequential validity pertains to whether or not an intervention such as an assessment system achieves its intended purposes. The consequential validity of the sample performance assessment systems is not clear. Most state-, district-, and school-initiated performance assessment systems purport to influence and improve pedagogical practices, and that is the only aspect of consequential validity some sites have attempted to establish.

Studies of Vermont's (Koretz et al., 1993) and Kentucky's (The Kentucky Institute for Education Research, 1995) systems show that teachers' instructional practices in those states have been influenced by the assessment systems. Evaluation of data gathered from teachers in those states indicates that teachers are asking their students to write more and to do more collaborative work. Other state-, district-, and national-level systems have not yet systematically evaluated this consequential validity criterion.

At the school level, assessment tasks are used for their pedagogical value. That is, they are used primarily as a teaching tool and as a tool for assessing student performance. In this respect, their pedagogical validity is supposedly quite high. For example, the assessments used at Cooper, Ann Chester, Niños Bonitos, and Park Elementary are embedded within daily classroom practices. Teachers design and use assessments to monitor student progress and to modify their curriculum and instruction.

Several issues relating to pedagogical validity are important to note. (Pedagogical validity of assessment systems pertains to whether or not assessment systems provide useful information to the teacher for instructional purposes.) Multiple systematic factors affect student performance (e.g., ability, prior exposure to assessment topic, forms of instructional exposure) (Burstein, 1989). Prior to making remediation decisions concerning individual students, it is important to understand the contribution each of these factors makes to the obtained score. None of the sites in our sample has attempted to research and evaluate the relationships between these factors and assessment results. Nor have the systems been designed to take into consideration the opportunity-to-learn variables identified by Burstein. The focus is still a traditional one: an accurate and efficient diagnosis of the individual's level of functioning on given achievement and ability constructs, rather than on the determinants of individual differences.

Generalizability

Generalizability refers to the inferences one can draw from task-specific performance to the universe of tasks that are associated with the knowledge or skill domain of interest. That is, to what degree is a student's performance on one or a few assessments representative of his or her performance on other, similar assessments, or more importantly, on similar real-world tasks? In our sample, only Maryland has systematically evaluated the generalizability of some MSPAP tasks. Thus, the inferences that can appropriately be drawn from assessment scores about student achievement and about school and district performance are quite limited.
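To make the idea of generalizing from a few tasks to a wider universe of tasks concrete, the following is a minimal sketch of a one-facet generalizability analysis in which every student completes every task. It is an illustration only, not the procedure Maryland or any other site used; the function names and the score matrix are hypothetical, and the design (persons crossed with tasks, one score per cell) is an assumption.

```python
def variance_components(scores):
    """Estimate person and residual variance for a persons x tasks design.

    `scores` is a list of rows, one row per student and one column per task.
    Assumes a fully crossed design with a single score in every cell.
    """
    n_p, n_t = len(scores), len(scores[0])
    if n_p < 2 or n_t < 2 or any(len(row) != n_t for row in scores):
        raise ValueError("need at least 2 students and 2 tasks, no missing cells")
    grand = sum(sum(row) for row in scores) / (n_p * n_t)
    person_means = [sum(row) / n_t for row in scores]
    task_means = [sum(row[j] for row in scores) / n_p for j in range(n_t)]
    # Mean squares from a two-way analysis of variance without replication.
    ss_person = n_t * sum((m - grand) ** 2 for m in person_means)
    ss_resid = sum(
        (scores[i][j] - person_means[i] - task_means[j] + grand) ** 2
        for i in range(n_p)
        for j in range(n_t)
    )
    ms_person = ss_person / (n_p - 1)
    ms_resid = ss_resid / ((n_p - 1) * (n_t - 1))
    # Negative estimates are conventionally set to zero.
    var_person = max(0.0, (ms_person - ms_resid) / n_t)
    return var_person, ms_resid


def g_coefficient(scores, n_tasks):
    """Relative generalizability coefficient for a test built from `n_tasks` tasks."""
    var_person, var_resid = variance_components(scores)
    denominator = var_person + var_resid / n_tasks
    return var_person / denominator if denominator > 0 else 0.0


# Hypothetical scores: 4 students (rows) on 3 tasks (columns).
example = [[3, 1, 2], [2, 3, 1], [4, 4, 3], [1, 2, 2]]
one_task = g_coefficient(example, 1)
three_tasks = g_coefficient(example, 3)
```

In this sketch the coefficient rises as more tasks are averaged, which is one way to express how far a score on a small number of tasks can be generalized.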

Fairness

In our sample, to date, Maryland is the only state that has systematically evaluated the fairness of each task in its performance assessment system. Tasks that function differently for different student groups are flagged to inform the development of subsequent tasks. District-level and school-level assessments in our sample have not systematically generated and evaluated such data. In part, they have not done so because of the limited resources available at these levels to conduct such analyses.
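As an illustration of what it means for a task to "function differently" for different student groups, the sketch below compares the mean task scores of two groups after matching students on an overall score. It is a simplified, hypothetical example and is not a description of Maryland's procedure; the function name, the record layout, the group labels, and the data are all assumptions made for illustration.

```python
from collections import defaultdict


def standardized_mean_difference(records, focal, reference):
    """Average focal-minus-reference gap on one task within matched strata.

    `records` holds (group, matching_score, task_score) tuples. Students are
    matched on `matching_score`; each stratum's gap is weighted by the number
    of focal-group students it contains. Strata missing either group are skipped.
    """
    strata = defaultdict(lambda: {focal: [], reference: []})
    for group, matching_score, task_score in records:
        if group in (focal, reference):
            strata[matching_score][group].append(task_score)

    weighted_gap = 0.0
    focal_total = 0
    for groups in strata.values():
        f_scores, r_scores = groups[focal], groups[reference]
        if not f_scores or not r_scores:
            continue
        gap = sum(f_scores) / len(f_scores) - sum(r_scores) / len(r_scores)
        weighted_gap += len(f_scores) * gap
        focal_total += len(f_scores)
    if focal_total == 0:
        raise ValueError("no stratum contains students from both groups")
    return weighted_gap / focal_total


# Hypothetical records: (group, overall score band, score on the task).
sample = [
    ("F", 1, 2), ("F", 1, 2), ("R", 1, 3), ("R", 1, 3),
    ("F", 2, 4), ("R", 2, 4), ("R", 2, 4),
]
gap = standardized_mean_difference(sample, focal="F", reference="R")
```

A value far from zero suggests that students with similar overall performance score differently on the task depending on group membership; where to set a flagging threshold is a judgment call that this sketch does not settle.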

Reliability

Interrater reliability procedures are well established for the state- and district-level systems, but not for the school-level systems — where they are not deemed to be very important. The outstanding issue is still that of inter-task reliability.

At this phase of assessment development, because of the legacy of the multiple-choice tests, the emphasis on obtaining reliable scoring is quite strong. Among states, the Arizona, Kentucky, Maryland, Oregon, and Vermont performance assessment systems require some form of calibration or moderation activities to ensure reliable scoring of the assessments. Interrater reliability estimates for Vermont portfolios increased between 1991-92 and 1993-94. For Kentucky, interrater agreement increased between 1992-93 and 1993-94.

Among districts, Prince William County employs a testing company to score its assessments, and South Brunswick uses a moderation procedure with its scorers. At the school-level, scoring reliability is not generally determined through any systematic procedure (or is expressly ignored in favor of individualizing standards).
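For readers unfamiliar with how scoring reliability is typically summarized, the following minimal sketch computes two common interrater statistics for a pair of scorers: the proportion of exact agreement and Cohen's kappa, which corrects agreement for chance. The rubric scores are made up, and nothing here should be read as the statistic any particular site reported.

```python
from collections import Counter


def exact_agreement(rater_a, rater_b):
    """Proportion of papers that received the same score from both raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    observed = exact_agreement(rater_a, rater_b)
    n = len(rater_a)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same category
    # if each assigned scores independently at their own observed rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scores on a 4-point rubric for eight student papers.
scores_a = [1, 2, 3, 4, 2, 3, 1, 4]
scores_b = [1, 2, 3, 3, 2, 3, 2, 4]
agreement = exact_agreement(scores_a, scores_b)
kappa = cohens_kappa(scores_a, scores_b)
```

Calibration or moderation sessions of the kind described above are generally intended to raise statistics such as these before operational scoring begins.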

The issues related to inter-task reliability have not been fully addressed. Studies indicate that performance on one task is often weakly related to performance on another, seemingly related, task (Linn, 1993). Experience with licensure examinations in law and medicine shows that inter-task reliability can be increased only by increasing considerably the number of tasks administered (Linn, 1993). However, among sites included in our study, no inter-task reliability studies were available.
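The relationship between the number of tasks and score reliability can be illustrated with the Spearman-Brown formula, a standard psychometric result that is not drawn from the studies cited above. The sketch assumes that the added tasks are parallel, that is, as reliable and as strongly related to one another as the existing ones; the reliability values used are hypothetical.

```python
import math


def spearman_brown(reliability, factor):
    """Projected reliability when the number of tasks is multiplied by `factor`."""
    return factor * reliability / (1 + (factor - 1) * reliability)


def tasks_needed(single_task_reliability, target):
    """Smallest whole number of parallel tasks projected to reach `target`."""
    r = single_task_reliability
    if not (0 < r < 1 and 0 < target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    factor = target * (1 - r) / (r * (1 - target))
    # The small tolerance guards against floating-point overshoot.
    return max(1, math.ceil(factor - 1e-9))


# Hypothetical case: one task has reliability 0.20 and the goal is 0.80.
needed = tasks_needed(0.20, 0.80)
projected = spearman_brown(0.20, needed)
```

Under these assumptions, a single-task reliability of 0.20 would call for about 16 tasks to reach 0.80, which shows why weak inter-task relationships imply a considerable increase in the number of tasks administered.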

Summary

The validity and reliability aspects discussed above are interrelated. If scoring is inaccurate and the content of the assessments does not adequately and accurately assess what is intended to be assessed, then the ability to draw meaningful inferences from the data these assessments generate is jeopardized. Furthermore, given the emphasis on problem-solving and applications in assessment tasks, construct-irrelevant variance is especially important to identify in order to draw valid inferences about student performance. Changes from year to year (not necessarily in the overall format of performance assessment systems, but in the actual tasks that form the assessment instrument) preclude evaluating whether or not these performance assessment systems are reliable and valid. What is important to note, at this stage, are the types of procedures that have been instituted to ensure the reliability and validity of the assessments that make up the performance assessment system. Performance assessment systems' validity and reliability will be better measured in the future as more data become available.

Conclusion

The performance assessment systems sampled for this study show that the term "performance assessment" is used to describe a very wide range of student testing instruments and systems. The characteristics of performance assessments show variations in terms of their purposes, formats, and the procedures used to ensure their technical robustness.

Performance assessments are being used for a variety of purposes, ranging from monitoring student progress to holding schools accountable for student outcomes, but the implicit purpose of each of the performance assessment systems sampled for this study is to leverage pedagogical changes at the local level. The stated purposes of performance assessment systems differ, however, according to their level of initiation: school-initiated performance assessments are intended primarily for pedagogical purposes, while districts and states also tend to include accountability functions in their stated purposes. The format of the performance assessments in practice also varies greatly, depending upon how the assessment tasks and scoring methods are specified. In addition, performance assessment systems show a great variety in how they are to be implemented and used by teachers at the school level. Some are tightly prescribed, leaving little room for teacher adaptation, while others are very loosely prescribed, giving teachers much leeway in how to adapt the system to their classrooms. Performance assessment systems can also be characterized by the scope of the pedagogical net they cast. Some cast a wide pedagogical net for instructional purposes, enabling teachers and students to be involved with the system on a regular basis and in a variety of assessed areas. Others cast narrower nets, not requiring extended student and teacher involvement or a wide array of assessment tasks or scoring procedures.

The types of technical characteristics performance assessment systems must possess are still under debate. Nonetheless, the states and districts included in this study have instituted procedures to ensure their performance assessment systems' reliability and validity. Although in some cases the results with regard to scoring reliability have been encouraging, evidence regarding the systems' validity is rather slim.

In sum, the samples from our study suggest that performance assessments cannot be classified in one mammoth category; the only commonality among them is the fact that they are non-multiple-choice and are based on the assumption that they are pedagogically useful. Hence, from policy and research perspectives, a number of factors require consideration before an assessment system is implemented or evaluated. Such considerations include a clear statement of purposes, coordination between performance assessment systems and purposes, and the establishment of procedures to continually evaluate the technical robustness and meaningfulness of assessment systems. Understanding the issues involved in assessment design and implementation also should enable educators to understand the issues involved in interpreting assessment scores and drawing inferences about student performance.


9. Please see the full case studies for details regarding these procedures. The case studies appear in Studies of Education Reform: Assessment of Student Performance, Volume II, Case Studies.

