Assessment of Student Performance April 1997
Proponents of assessment reform frequently view performance assessments as a lever for educational reform. In Smith and O'Day's (1990) words: ". . . a major reform in the assessment system . . . is critical to education. Assessment instruments are not just passive components of the educational system; substantial experience indicates that, under the right conditions, they can influence as well as assess teaching." Proponents also assert that if performance assessment is effectively implemented at the school, district, or state level, it can change curriculum as well as teachers', students', and the community's attitudes toward education. Innovative and far-sighted leadership and a friendly political climate appear to play pivotal roles in the effective implementation of new forms of assessment. Below, we discuss the relationship between assessment reform and other critical components of education reform.
That assessments must be integrated with curriculum and instruction is one of the basic premises of assessment reform. What is new in current assessment endeavors is the focus on equally new curriculum and broadly defined valued student outcomes. The stress on cross-disciplinary knowledge, conceptually sophisticated thinking, good writing abilities, application of mathematical and scientific concepts, and social competencies has necessitated overhauling curricular and instructional frameworks as well. To an extent, overhauling writing and math curriculum and assessments has proven to be the easiest reform to undertake at all levels of educational organization. Performance-based writing assessments, especially, represent assessment reform at its most basic.
Reform of math assessments has been aided by the guidelines provided by the National Council of Teachers of Mathematics in the Curriculum and Evaluation Standards for School Mathematics (1989). Reforms in other areas will become easier to implement and more widespread as curriculum guidelines become available. The National Science Foundation and American Association for the Advancement of Science (AAAS), for example, have sponsored projects that address issues of science curriculum.
Work with support from the Federal government on content standards, although controversial (especially with regard to content standards in history), is likely to influence assessment reform. Content standards in science, economics, foreign language, and other areas will likely be released by the end of 1995. Such standards portend revisions of the current assessment systems used in these content areas.
In tying assessment reform to both curriculum and instruction, individual states have followed divergent courses of action. In Oregon, for example, assessment reform and curriculum reform were undertaken simultaneously. In Vermont, on the other hand, an assessment system was instituted before the new curriculum standards were articulated.
In the long run, the core educational changes are likely to be the result of a dialectic process between curriculum and assessment reforms.
Professional development is crucial to reform, as teachers are the deciding factor in the success of performance assessments in effecting desired student outcomes. In fact, the importance of professional development in assessment reform or any other reform effort cannot be overemphasized (Little, 1993). In order for performance assessments to be effective, especially assessments based upon the premises of current assessment reform, teachers' expectations of their students and of their own teaching methods must change. Teachers must be able to develop their students' ability to construct answers, think critically, and move beyond focusing only upon factual knowledge. Currently, for some summative purposes, many teachers are asked to prepare their students for the administration of norm-referenced multiple-choice tests bought from test publishers. They also are being asked to develop their students' abilities to perform well on performance-based assessments. The conflict between these two systems is probably reflected in teachers' pedagogical practices.
Performance assessment, thus, demands teachers' participation in assessment development, implementation, and scoring. Teachers must become knowledgeable about assessment design, scoring, and new pedagogical techniques. The benefits of teacher involvement in developing performance assessments are illustrated by the New Standards Project (NSP). The relative success of NSP in developing interesting assessment tasks and associated scoring rubrics can be attributed to its endeavors to build professional capacity at the local level. Teachers themselves develop assessment tasks and scoring rubrics and conduct pilot tests in their classrooms. Teachers then send their tasks to an NSP committee (and receive payment if their tasks are adopted for use by the NSP).
Teams of teachers from participating states and districts attend NSP assessment task and scoring rubric development conferences, as well as sessions in curriculum development and portfolio design. After a prescribed number of training sessions, these teachers are designated as Senior Leaders, and they, in turn, offer professional development in the same areas to other teachers in their districts and states. Vermont, Oregon, Kentucky, and Maryland, among others, also are paying increasing attention to issues of professional development, largely through train-the-trainer professional development models similar to that of the NSP.
Professional development activities are not cheap. All such activities are resource intensive when compared to those associated with traditional systems of testing. Therefore, commitment on the part of leadership to provide money, teacher release time, and materials is essential to successful implementation of performance assessments.
In addition to teacher and leadership support, community support is critical throughout the entire reform process, whether or not the assessment system is the chosen mode of change. Without a sense of ownership on the part of teachers, administrators, students, parents, the community, and other stakeholders, the system-wide changes required to effectively implement performance assessments will not occur. Vermont, for example, engaged in a large-scale consensus process before beginning its statewide portfolio assessments. As a result, its initiative has been supported by most stakeholders. On the other hand, Littleton, Colorado, had to rescind its reforms due to community opposition. The community was not kept well informed, and the reforms were enacted too swiftly. In the end, community members felt that vague, nonacademic outcomes were replacing content, and that technically unsound assessments would be used to determine something as important as high school graduation (Bradley, 1994).
Because assessment reform can no longer be considered a passing fad, performance assessments must pass technical scrutiny if they are to become an accepted means of judging student performance. In fact, most major objections to performance assessments are based on a lack of faith in new methods and on continuing confidence in the technical quality of norm-referenced multiple-choice tests, which have about an 80-year theoretical, research, and development base. Nonetheless, some reformers argue that the shift from multiple-choice to performance-based assessment systems represents a shift in the educational paradigm and, as such, must be evaluated within the framework of the new, rather than any existing, paradigm. The ways in which performance assessments must be technically robust are only touched upon in this chapter.
While educators traditionally have viewed assessment as a separate, completely external event that should not influence teaching, modern-day reformers view performance assessments as an integral part of teaching and learning, modeling desirable instructional techniques. This contrast between the traditional and the new illuminates the two different conceptions of the educational process held by the two camps (Mitchell, May 1992). Thus far, this difference has not been articulated clearly in the literature, but it is likely to confuse communication unless it is identified. Each camp considers its favored assessment instruments and processes to be more valuable, valid, and informative than those of the other. It is the distinction between the concept of standardized measurement and the concept of individualized learning that underlies the present dilemma with respect to the technical qualities of performance assessments.
In part, this problematic situation may have arisen because the psychometric community continued to operate within an old model of test theory, even as a change in cognitive psychology was permeating educational thinking. Mislevy (1989) wrote:
Mislevy's point is that the insights of cognitive psychology have altered the conceptions of competence and learning, and that these insights make a new test theory possible.
The new model of cognition and the integration of assessment into teaching and learning processes have provoked some discussion of the technical problems presented by performance assessments. It has been difficult for psychometricians and researchers focused on terminal testing to switch to a model wherein assessment itself is viewed as an aid to learning and may even take place simultaneously with learning. Nonetheless, as others point out, validity, reliability, and generalizability have been the perennial issues with all measurement instruments and remain so with performance assessments. The three major issues are discussed briefly below.
A central question regarding performance assessments concerns what we term pedagogical validity. If the primary goals of performance-based assessments are to be more closely connected to the curriculum and to provide information to the teacher for instructional purposes, then how satisfactorily they are able to fulfill these goals is a central validity concern. A one-to-one mapping of assessment tasks to curricular areas is perhaps the most important piece in the assessment validity puzzle.
Wiley and Haertel (1992) assert that the connection between the goals of measurement, embodied in curriculum frameworks, and tasks meant to assess progress toward those goals, must be quite close: If no valid system exists for mapping tasks into the frameworks, the curricular coverage of the assessment cannot be evaluated. . . . The link between task selection, task analysis, task scoring, and curricular goals has to be well understood and relatively tight in order for the system to work (p. 15). They stress that analyses must be performed to ensure the match between curricular goals and assessment tasks.
Wiley and Haertel also sketch the types of analyses that must be carried out and conclude by underlining the importance of achieving what they term evidential validity. They contend that the basic reason for rejecting machine-scorable multiple-choice tests is their lack of validity, given that they elicit memorized facts and algorithms, while society demands increasingly complex thinking skills. Their concept of evidential validity can be extended to include the idea of assessments as diagnostic tools for students' educational needs.
Systematic evidence that performance assessments provide the means for obtaining diagnostic information to improve instruction, and the process of teaching and learning in general, is just beginning to accrue. Studies such as Whose work is it? A question for the validity of large scale portfolio assessment (Gearhart, Herman, Baker & Whittaker, July 1993) indicate that, in fact, the use of portfolios has pedagogical value in terms of the level of instructional support teachers provide to their students. Similarly, other, smaller scale studies indicate essentially salutary effects of performance assessments on instructional practices (e.g., Borko, Flory & Kumbo, 1993; Falk & Darling-Hammond, 1993; Smith, et al., 1994).
Consequential validity is another issue within the larger performance assessment validity issue. Linn and Dunbar (1991) include the concept under expanded validity, which they see as a major adjustment needed in technical theory to accommodate performance assessments:
The fairness issue is of particular concern if assessments are used for student certification and for sorting. There must be some assurance that minority populations (who traditionally have been screened out of institutions or programs that would provide them with social and economic opportunities) not be inadvertently negatively affected by assessment reform. CRESST is conducting research on the responses of minority students to performance assessments in San Diego City Schools. We suspect that results will not generically apply to any and all performance assessments; much will depend on how assessments are constructed, the types of items they comprise, and the type of curriculum they support.
The approaches to validity discussed above are complementary. In fact, they have merged with respect to performance assessments because the theoretical and ideological bases for these assessments call for a concurrently authentic and fair psychometric system.
Generalizability, including reliability, has surfaced as a major issue which must be resolved if performance assessments are to be used for individual student assessments. In addition to redefining validity, Linn and Dunbar (1991) elaborate on the concept of reliability; they argue for subsuming the traditional criterion of reliability under the transfer and generalizability criterion. Whether performance assessments sample sufficiently from the knowledge domain in question to enable fair and accurate judgments about students' achievement in that domain is a question central to assessment reform. After all, if one of the promises of assessment reform is to enable an understanding of students' educational needs, exactly what an assessment product indicates about a student's achievement status must be reliably understood. In this context, then, multiple examples of student work on multiple performance tasks may be the answer to the problem of generalizability. Inter-task reliability, however, has been difficult to attain; studies indicate that performance on one open-ended task is often only weakly related to performance on a related task (Linn, 1993).
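The weak inter-task relationship described above is usually expressed as a correlation between students' scores on two tasks. The following is a minimal sketch, assuming a plain Pearson correlation and a handful of invented scores on a hypothetical 1-4 rubric; none of the numbers or names come from the studies cited, and a full generalizability analysis would go well beyond this.

```python
from math import sqrt


def pearson_r(scores_a, scores_b):
    """Pearson correlation between two equal-length lists of task scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(scores_a)
    mean_a = sum(scores_a) / n
    mean_b = sum(scores_b) / n
    # Sum of cross-products and sums of squares around each mean.
    cross = sum((a - mean_a) * (b - mean_b) for a, b in zip(scores_a, scores_b))
    ss_a = sum((a - mean_a) ** 2 for a in scores_a)
    ss_b = sum((b - mean_b) ** 2 for b in scores_b)
    if ss_a == 0 or ss_b == 0:
        raise ValueError("correlation is undefined when one task has no variance")
    return cross / sqrt(ss_a * ss_b)


# Hypothetical data: the same four students scored on two related open-ended tasks.
task_one = [1, 2, 3, 4]
task_two = [2, 1, 4, 3]
print(round(pearson_r(task_one, task_two), 2))
```

A value near 1 would mean that students who do well on one task also do well on the other; the lower the value, the less a single task can be taken to represent achievement in the whole domain, which is why multiple tasks are suggested above.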
Interrater reliability is another important issue facing assessment reform. The complexity of the assessment tasks, the myriad answers they can elicit, and the number of scorers involved, each with a possibly different frame of reference, create a very high potential for low interrater reliability. Although interrater reliability is attainable through standardization of task administration, the establishment of explicit scoring criteria, and scorer training, such procedures impose certain practical constraints on the use of these assessments.
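Interrater reliability can be quantified in several ways. The sketch below shows two common ones, simple percent agreement and Cohen's kappa (agreement corrected for chance), for two raters scoring the same set of student responses. The choice of these statistics, the function names, and the rubric scores are assumptions made for illustration; the report does not say which indices the studies it cites used.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of responses on which the two raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length lists of scores")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same category
    # if each scored independently at their own observed rates.
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use one category")
    return (observed - expected) / (1 - expected)


# Hypothetical scores from two raters on six responses (1-4 rubric).
first_rater = [4, 3, 2, 4, 1, 3]
second_rater = [4, 3, 3, 4, 1, 2]
print(round(percent_agreement(first_rater, second_rater), 2))
print(round(cohens_kappa(first_rater, second_rater), 2))
```

Kappa is lower than raw agreement because some matches would occur by chance alone; explicit scoring criteria and scorer training, as noted above, are the usual means of raising both figures.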
The questions with respect to technical issues, then, are:
Indeed, for high stakes decisions, the assessments must be technically impeccable and pass the test of fairness and equity. This conclusion implies that there must be serious investment in research and development to ensure assessments of high quality. On the other hand, to ensure the viability of performance assessments as pedagogical tools, investment in teachers is essential, with less attention to interrater reliability and standardization.
The crux of these issues and questions is whether one type of assessment system can serve many purposes simultaneously, or whether multiplicity of purposes might subvert the goals of the performance assessment system.
The performance assessment movement has developed so rapidly that knowledge in some other significant areas is simply lagging behind. Costs of developing and implementing performance assessments and the use of technology in conjunction with these assessments represent two of those major areas.
Issues related to financing the development, implementation, and evaluation of performance assessments are not well understood. The Office of Technology Assessment's discussion of costs in Testing in American Schools: Asking the Right Questions (1993) is inconclusive. A study by Pechman (1992) suggests that the costs associated with ". . . every phase of alternative assessment are alarming," (p. 24) but that these figures may be misleading as the benefits of teacher involvement in the implementation and scoring procedures are generally not figured into the calculations.
Getting a handle on assessment costs is difficult for two reasons: (1) schools, districts, and state education departments do not necessarily record costs for testing and assessment as separate items, but as portions of categories such as personnel, material, and vendor costs, so that disentangling the costs of assessment is extremely difficult; and (2) the costs of machine-scorable tests and performance assessments are not comparable if professional development is taken into consideration. For example, how can the cost of developing a portfolio, which takes place throughout a year of the teacher's and student's time, be compared to the costs of a machine-scorable test, which also takes part of their time but for different purposes? The results of each process are essentially noncomparable, especially if portfolio grading is done within the context of professional development.
Charting the Course Toward Instructionally Sound Assessment, a report produced by the California Assessment Collaborative (1993), details useful budget and personnel categories for accounting for the costs of developing and implementing performance assessments. The document provides no overall dollar figures but concludes that:
Hardy's (1993) and Monk's (1993) articles contain edifying discussions on how to conceptualize costs and benefits associated with developing and implementing performance assessment systems.
The potential for applying new information and communications technology to performance assessment remains unrealized at all levels of education. At the local level, the problem presents itself as the school's general lack of technology experience and equipment, coupled with a lack of knowledge about how to develop and implement performance assessments.
Technology, however, offers numerous possibilities for integrating assessment into the daily life of the classroom. For example, technology applications (e.g., word processing, databases) can offer teachers a view into students' problem-solving and thinking processes (Means, et al., 1993). Electronic portfolios on a disk for each child can provide a means for on-going assessment. This vision is appealing, but remains a dream for most school districts. (In some cases, where technology is in use, "electronic portfolios" consist of work that has been scanned into the computer.)
Some organizations, however, have been instrumental in helping schools integrate technology into daily teaching and learning activities. For example, the Coalition of Essential Schools (CES) and IBM collaborated to develop an electronic exhibitions resource center. This center is intended as a resource for the CES member schools to exchange ideas about exhibitions (student demonstrations of their work) and about the CES curriculum. Such partnerships between businesses and schools are likely to be helpful in bringing technological innovations to schools and, thus, catapulting them into the 21st century.
In recent years, advocates of performance assessment have linked reformed assessment strategies with needed reforms in curriculum and instruction. Because assessment reform calls for a deviation from traditional assessment strategies in more ways than one, it presents several challenges to the established organizational structure of education.
First, the challenge is to simultaneously engineer other reforms that support and enhance the use of performance assessments. Second, the challenge is to develop assessment systems that are technically sound and pedagogically useful. Third, the challenge is to involve all stakeholders so that their informed consent provides the momentum for assessment (and associated) reforms. Judgments regarding the efficacy of performance assessment in fulfilling its promises must be based on data from the many educational systems now in the process of reform. Only when these reforms result in enhanced student outcomes will the challenge of assessment reform be met.
In the following chapters, we discuss the findings from our study of 16 schools and school systems involved in reforming student assessment systems. First, we present our study objectives and the case summaries of each of the schools we visited. Next, we discuss the characteristics of performance assessments in practice, the facilitators and barriers in assessment reform, and the impact of performance assessments on teaching and learning. We conclude by summarizing our findings with regard to the status of assessment reform, and outline the research and policy implications emanating from those findings.