Teacher and school evaluation


Teacher Evaluation, Teacher Effectiveness and School Effectiveness: Perspectives from the USA

Chad D. Ellett, CDE Research Associates, Inc.

and

Charles Teddlie, Louisiana State University



Abstract

This article provides historical overviews of the conceptual and research and development focus of teacher evaluation, teacher effectiveness and school effectiveness research in the USA. Pertinent literature is cited and arguments are made that these lines of inquiry have coexisted for nearly four decades without adequate integration. With the fourth stage of school effectiveness research in process, there is a recognition that within-school context variables, particularly teacher effectiveness, have important effects on school improvement and school outcomes. Similarly, there is the recognition that findings from school effectiveness research have relevance for studies of teacher effectiveness and ongoing developments in teacher evaluation. Examples of (a) new generation, learner-centered teacher evaluation systems in the USA that are informed by teacher and school effectiveness research and (b) the fourth stage of school effectiveness research are described. It is proposed that these lines of research should be merged as completely as possible.



Perhaps the most basic question that frames education policy making for K-12 schools in the USA is: What are the appropriate means and ends of education? This is an old but important question that essentially attempts to link proverbial independent and dependent variables in human logic. Debates about appropriate means of education in the USA include innovations such as: (a) new curricula (e.g., the new math, the role of computer technology); (b) structural/organizational mandates (e.g., extended school year, grade retention, reduced class size); (c) non-traditional choices for schooling (e.g., charter, accelerated, alternative schools); (d) teaching and learning programs derived from popular learning theories and instructional models (e.g., social constructivism, cooperative learning); (e) standards-based preparation programs for teachers and administrators (e.g., college/university programs grounded in National Council for Accreditation of Teacher Education [NCATE] standards); and (f) new approaches to certification and licensing of teachers (e.g., the National Board for Professional Teaching Standards [NBPTS] and state models of performance-based assessments for licensing). All those concerned with education in the USA want to know: What works best?

Debates about appropriate ends for education in the USA focus on education outcomes such as: (a) student achievement levels on standardized tests (and the content and format of such tests); (b) the percentage of high school graduates; (c) the percentage of students admitted to
higher education institutions; (d) levels of students’ social adjustment and citizenship; (e) moral and ethical values; and (f) employability. All those concerned with education in the USA want a voice in what educational outcomes should be. These debates are not new in the USA. They appear to be rather cyclical, influenced by political structures and processes, and they lie at the center of citizens’ concerns about the quality of American education, education policy making and educational productivity (Cuban, 1990). Thus, these debates seem rather endemic in American culture and in the history of education in the USA.

During the past three to four decades, the question about appropriate means and ends for education in the USA has been strongly reflected in concerns about (a) producing, selecting and assessing effective teachers and (b) understanding linkages between effective teaching, teacher evaluation, school effectiveness and ultimately effective schools. The current political climate of enhanced school accountability in the USA, grounded in the national education slogan of “No Child Left Behind,” is a case in point. This article addresses these concerns in four sections: (1) a review of teacher effectiveness research (TER) and of systems for teacher evaluation in the USA and a discussion of how the two (TER and teacher evaluation) have been linked; (2) a review of school effectiveness research (SER) in the USA, with an emphasis on the emergence of effective schools characteristics, which may be linked to teacher/school improvement and teacher evaluation; (3) a discussion of the emerging research that has linked the SER, TER and teacher evaluation literatures in the USA; and
(4) a discussion of how TER/SER research can influence teacher evaluation in the USA and lead to further teacher/school improvement.

As noted in the overview article for this special issue (Teddlie, Stringfield, & Burdett), teacher evaluation is used for three major purposes internationally (i.e., accountability, promotion, and staff development), but it is rarely used explicitly for teacher or school improvement. We will argue in this article that both teacher evaluation and school improvement have been logically linked to the TER and SER literatures, but that these links have been “loosely coupled” historically. We further argue that now is the time in the USA for these links to be tightened. We begin the groundwork for our arguments by first reviewing historical conceptions of, and research on, teacher effectiveness and teacher evaluation in the USA. Following this review is an overview of SER, linkages between the two lines of inquiry (TER and SER), and some subsequent discussion of the potential influence of these two lines of inquiry on teacher evaluation in the USA. The primary focus of our discussions about teacher evaluation is on classroom-based observation, assessment and evaluation systems rather than on paper and pencil measures.

Teaching, Teacher Effectiveness and Teacher Evaluation in the USA: Some Historical Perspectives

Evaluating teachers in the USA is certainly not a new activity. It is as old as the education system in the USA and it has been through many trends and cycles as roles of teachers have changed, as values and beliefs about effective teaching and teacher responsibilities have changed, as perceptions of how students best learn have changed, and as societal demographics and teaching contexts have changed. Cuban (1990) provides some interesting and useful insights
about how views of teacher-centered versus student-centered approaches to educational reform and improvement in the USA have changed during the past century and how these views have been associated with various educational reforms. He reminds us that the focus of teacher evaluation may depend upon the views we hold at any given point in time about effective pedagogy and student learning. Thus, teacher effectiveness, teacher evaluation and school effectiveness seem inextricably interrelated over time.

Teacher Evaluation in the USA from 1900-1950

At the turn of the 20th century in the USA, teacher evaluation was essentially defined from a moralistic and ethical perspective.1 Thus, good teachers were outstanding members of the community who were viewed as possessing high moral and ethical standards, who had basic reading skills (preferably at the high school level) and who were good role models for students. The vast majority were single women (schoolmarms) with only minimal education (completing grade 9 could suffice). One requirement, and typically at the center of evaluation concerns, was good moral standing in the community. Thus, teachers were largely evaluated on their personal characteristics rather than through evaluation procedures informed by a knowledge base about effective teaching and learning.

In thinking about teacher evaluation at the turn of the century, we are reminded of a story about the grandfather of one of the authors (Ellett’s Grandpa Price), who began his teaching career in rural Kentucky in 1910. He was selected as the teacher by members of his community

1 Interestingly, this emphasis on morality in the evaluation of American teachers in the first part of the 20th century has been mirrored in the traditional evaluation system of the People’s Republic of China, where morality was a subscale of the teacher rating scale accounting for 10% of the total score (Lee, Lam and Li, this issue; Ying & Fan, 2001).



because he was the most educated member of the community. He taught in a one-room schoolhouse with about 20 students, and he liked to tell modern educators about how he was evaluated for K-12 state certification in Kentucky by the “traveling superintendent” who visited schools by horse and buggy. Grandpa Price had been teaching for about six months. He was not a graduate of a teacher education program, though he had taken some courses at the state Normal School. He also had no knowledge of state teacher certification requirements or procedures. The purpose of the superintendent’s visit was to conduct his initial evaluation of Grandpa Price as a teacher. The superintendent remained in the classroom the entire day, assisted with some teaching tasks, ate lunch under the trees with Grandpa Price and the students, informally chatted with Grandpa Price a good bit, and made inquiries and shared stories about various members of the community. After the students were dismissed for the day and as dusk approached, the superintendent hitched his horse to his buggy, climbed aboard, grabbed the reins and prepared to depart. Grandpa Price (like most teachers who are evaluated today) was a bit anxious about the superintendent’s lengthy visit and how he had performed that day. He desperately needed some feedback. 
He asked the superintendent if he had “enjoyed the day” and the superintendent replied that he had “very much enjoyed the visit, the lesson, the students and the lunch.” Grandpa Price then asked, “How do I get certified to teach in Kentucky?” The superintendent smiled, waved his hand through the air like a wand and said, “You’re certified, young man...I just certified you!” Grandpa Price then asked why he had been certified and the superintendent said, “You were prepared for the lessons, you had different things for young and older students to do, you didn’t yell or have to spank anyone for being bad, you knew your subjects, the children seemed to get along quite well with you and with each other, you had lots
of energy, you didn’t waste any time telling stories or jokes and I like you!” No formal evaluation procedures or criteria were used by the superintendent to make the certification decision. However, the categories of personal observations he used sound very similar to some of the evaluation criteria comprising many teacher evaluation systems today.

During the 1920s-1940s, the study of teachers and teacher evaluation, influenced by emerging theories of personality and personality characteristics in psychology, primarily focused on personal characteristics of good teachers. There was also an increasing concern for identifying and better understanding factors contributing to the education and training of prospective teachers, and several national studies of teacher characteristics and teacher education programs were completed (e.g., Charters & Waples, 1929). In the early 1940s, several conceptual frameworks for evaluating teaching also began to appear in the literature (e.g., the Ohio Teaching Record). By the end of the 1940s, the knowledge base pertaining to teacher evaluation was beginning to appear in popular texts (e.g., Beecher, 1949).



TER and Teacher Evaluation in the 1950s-1980s: Building a TER Database and Professional Teacher Evaluation Systems

With the popular, emerging philosophies of scientific management (e.g., Taylor, 1947) and behaviorism in psychology and education, educational researchers began to narrow their focus to linkages between teacher behavior and student outcomes as a paradigm for classroom research. Thus, during the 1950s and 1960s, there were increased efforts among educational researchers to identify effective teaching methods. A new era of classroom-based teaching methods research was spawned, largely grounded in the popular behaviorism movement pervading thinking in psychology and education. Researchers began to turn their attention to linkages between observable teaching practices (behaviors) and a variety of student outcomes. In the USA, the cold war and international political events (e.g., the launching of the Sputnik satellite by the Soviet Union and the subsequent space race) also contributed to a strengthened focus on classroom-based research to identify effective teaching methods, particularly in math and science classrooms.

This was also the era of federally funded models of competency-based teacher education (CBTE) in selected higher education institutions in the USA. These CBTE programs were charged with developing preservice curricula around core sets of behaviors and skills deemed necessary for effective teaching in classrooms in the USA. These decades also witnessed the increased use of paper and pencil tests as a means of state licensing of teachers (e.g., the National Teachers Exam).

The academic and political press to implement research studies to identify effective teaching methods resulted in the development of a plethora of classroom-based observation checklist systems, the majority of which were grounded in the existing and pervasive philosophy of behaviorism in psychology and education (e.g., OSCAR, CASES, STARS, FLANDERS,
PORS). Subsequently, large collections of these measures began to appear in the educational research literature (e.g., Simon & Boyer, 1967), and important methodological discussions of issues surrounding classroom observations and evaluations of teaching began to appear in the literature as well (see, for example, Medley & Mitzel, 1963).

The yield of two decades of research on the effectiveness of various classroom teaching methods, practices and teaching behaviors resulted in discussions about how effective teaching was to be conceptualized (i.e., as an art or a science). Leading researchers and theorists in the field of teacher education and teaching effectiveness began to debate the scientific basis of effective teaching and the implications of the extant research findings for the preparation of effective teachers (e.g., Gage, 1972). There was also an increasing emphasis in the extant TER and teacher evaluation literatures on the use of direct observations of teaching as a preferred methodology (Rosenshine & Furst, 1973) and on linking observations and evaluations of teaching to proxy measures for student learning and achievement, such as student on-task behavior (e.g., Stallings, 1977). Direct instruction models, such as those posited by Madeline Hunter, also began to appear in the teacher effectiveness and teacher evaluation literatures.

During the 1970s the number of classroom-based studies seeking to demonstrate linkages between various teaching practices and student outcomes continued to proliferate, and summaries of the yield of research on teaching were produced (e.g., Flanders, 1970; Dunkin & Biddle, 1974). The predominant paradigm for research on teaching became known as process-product research, and elements of teaching documented as important in the literature began to frame criteria appearing on many teacher evaluation systems (Medley, 1977).



Reform Movements and TER/Teacher Evaluation from the 1980s into the 21st Century

The 1980s ushered in a renewed call for educational reforms in the USA. Such reforms touched almost every aspect of education including, for example: making significant changes in teacher education programs (using the yield of research on teaching); designing and implementing school-based management and decision making models; changing procedures to license teachers and administrators; developing newer, more innovative approaches to student assessment and testing; restructuring schools from a variety of perspectives; increasing teacher empowerment and parental involvement in schools; establishing various incentives-based programs for schools, teachers, administrators and students; piloting school voucher plans; and so on. Perhaps the most popular buzzwords threaded through educational reforms in the 1980s were evaluation and accountability, particularly as these applied to teachers.

Having emerged from a decade of increased educational accountability through the use of criterion-referenced, state-mandated minimum competency testing programs for students (Berk, 1984; Jaeger & Tittle, 1980; Pipho, 1978), and given the historical harmonics of student-centered versus teacher-centered models of reform (Cuban, 1990), it is no surprise that teacher evaluation became a centerpiece of educational accountability and reform in the USA in the 1980s. The knowledge base pertaining to teacher evaluation also grew considerably during the 1980s (e.g., Darling-Hammond, Wise, & Pease, 1983; Ellett, 1985, 1987; Ellett & Capie, 1982, 1985; Iwanicki, 1986; Joint Committee on Standards for Educational Evaluation, 1988; McLaughlin, 1990; McLaughlin & Pfeifer, 1988; Medley, Coker, & Soar, 1984; Millman, 1981; Millman & Darling-Hammond, 1990; Scriven, 1988). This knowledge base has contributed to a broader understanding of the educational, social and political contexts in which teacher evaluation is
situated. In the 1980s and on into the 1990s, renewed and more sophisticated efforts to evaluate teachers were viewed by many politicians and education policy makers as the bottom line in efforts to improve education in the USA. The logic of enhanced accountability through teacher evaluation seemed reasonable, since teachers had more direct contact with students than any other element in the educational system. Additionally, over three decades of research on teaching had established a knowledge base for designing new teacher evaluation systems around sets of criteria that had been reasonably well linked to student outcomes (Brophy, 1986, 1988; Gage & Needels, 1989).

One fundamental shift in teacher evaluation policy beginning in the early 1980s was a movement away from local, district policies to evaluate teachers as employees, toward state-mandated, on-the-job assessments and evaluations of teaching for the purpose of licensure. Part of the impetus for this shift was mistrust of the content and job-related validity of paper and pencil tests, and a lack of evidence of how such measures were linked to student outcomes. Additionally, district level evaluations of teaching were known, and subsequently have been shown, to be rather pro forma (e.g., Ellett & Garland, 1987; Loup, Garland, Ellett, & Rugutt, 1996). Thus, educational reformers in the 1980s and into the 1990s believed they had the advantage of utilizing a significant body of literature on teacher effectiveness to frame their conceptions of what should be at the forefront in thinking about educationally meaningful and useful teacher evaluation systems.

The state of Georgia implemented the first systematic, statewide effort to evaluate on-the-job performance of teachers in 1980 through use of the Teacher Performance Assessment
Instruments (TPAI) (Capie, Anderson, Johnson, & Ellett, 1980). This statewide program targeted the initial licensing of beginning teachers. While a large number of classroom observation instruments had been developed during the 1960s and 1970s for undertaking research on teaching, as noted above, the TPAI represents the initial example of the first generation of state-mandated, classroom-based teacher evaluation systems targeting teacher licensure (Ellett, 1990). A variety of other states quickly followed suit (e.g., South Carolina, Florida, North Carolina, Virginia, Tennessee, Mississippi, Kentucky, Texas, Missouri, Kansas, Arizona, California, Connecticut). On-the-job, classroom-based teacher evaluation procedures subsequently were extended to other decision-making contexts such as career ladders (e.g., Texas, Tennessee, Utah), merit pay (Florida) and the professional, renewable certification of teachers (Louisiana). Literally millions of dollars and thousands upon thousands of hours of educator input and effort were involved in the development and implementation of these large-scale, politically motivated, state-mandated programs targeting teacher accountability and school improvement. Interestingly, but not surprisingly given the harmonics of reform and change noted above (Cuban, 1990), only remnants of most of these programs survive today, and most have been politically overhauled, minimized or disbanded.

During the 1990s and into the 21st century in the USA, teacher evaluation for the purposes of accountability, professional development and school improvement continued to be at the forefront of school reform. Teacher evaluation and TER remain highly relevant to educational improvement in the USA because of: (a) the pervasiveness of concerns about teacher recruitment, evaluation and retention in the profession; (b) the knowledge base that has continued to accumulate about teacher effectiveness and teacher evaluation; (c) a
renewed focus on school site professional development of teachers grounded in classroom-based assessments; (d) changing roles of school administrators as leaders of teaching and learning; and (e) the national push for teacher professionalization.

Toward New Generation Perspectives on Teacher Evaluation and Teacher Effectiveness

Beginning in the late 1980s and early 1990s and continuing into this century, a variety of new conceptual and methodological developments in teacher evaluation, teacher effectiveness, school improvement and school effectiveness have emerged. Two significant developments in teacher evaluation have been (a) changing the focus of classroom-based evaluation systems from teaching to learning and (b) the work of the NBPTS to develop assessments for national certification of teachers. Each of these developments is briefly described in the sections that follow.

Developing a Learner-Centered Focus in Classroom-Based Evaluation

Over a decade ago and more recently, Ellett (1990, 1997) made the argument that a fundamental flaw in classroom-based teacher evaluation processes was the predominant focus on teacher behavior and teacher performance. This focus is historically well documented in many of the statewide teacher evaluation systems developed during the 1980s and 1990s.2 The names of each of these comprehensive, statewide systems, and the attendant evaluation criteria and

2 Consider, for example, the names of some of these systems: The Teacher Performance Assessment Instruments (TPAI) (GA); Assessments of Performance in Teaching (APT) (SC); Florida (Teacher) Performance Measurement System (FPMS) (FL); Virginia Teaching Practices Record (VPTR) (VA); Tennessee Career Ladder Teaching Evaluation System (TCLTES) (TN); Georgia Teacher Evaluation Process (GTEP) (GA); Louisiana Components of Effective Teaching (LCET) (LA); Texas Teacher Appraisal System (TTAS) (TX); and the Connecticut Teacher Competency Instrument (CTCI) (CT).



procedures, focus on the teacher and evaluating the teacher’s performance. Interestingly, there is little focus on the connection between teaching and learning. Moreover, almost without exception, local school districts in the USA continue to develop and implement teacher evaluation policies and programs designed to measure minimally essential teaching skills, with little concern for student learning.

A New Generation of Learner-Centered, Classroom-Based Evaluation Systems

The development and implementation in Louisiana of the System for Teaching and Learning Assessment and Review (STAR) placed a predominant focus on classroom-based assessment and evaluation of teaching and learning (Ellett, Loup, & Chauvin, 1991). The STAR is the first example of a new generation of classroom-based assessments of teaching and learning that has greater potential for improving teaching, enhancing student learning and improving schools than traditional systems focused solely on teacher behavior and the evaluation of teacher performance in the classroom (Ellett, 1990).

The Professional Assessment and Comprehensive Evaluation System (PACES) is a more recent assessment system that makes the linkage between teaching and learning even stronger than the STAR did (Ellett, 2001). It is currently being implemented in the Miami-Dade County Public Schools (M-DCPS) system in Florida, a system with 23,000 teachers. The PACES was designed and is being implemented as a new generation system to replace a teacher-centered instrument, the Teacher Assessment and Development System (TADS), that had been used for the annual evaluation of all M-DCPS teachers since 1985. The PACES has been designed to capture findings from the process-product literature as well as the developing knowledge bases about educational improvement, including:
(1) newer conceptions of learning and how these are connected to teaching practices (Taylor, Dawson, & Fraser, 1995; Tobin, 1993);
(2) the uniqueness of each teaching and learning context;
(3) the importance of reflective practice and collegial sharing to teacher understanding, evaluation and growth (Sparks-Langer, Simmons, Pasch, Colton & Starko, 1990; Roth & Tobin, 2001);
(4) the call for greater command of knowledge of the subject matter taught (Shulman, 1987);
(5) the potential of teacher self-assessment for growth in teaching (http://paces.dadeschools.net);
(6) large-scale research studies clearly documenting important linkages between student achievement, school effectiveness and quality teaching (Sanders & Horn, 1998);
(7) professional standards for developing and implementing new teacher evaluation systems (Joint Committee on Standards for Educational Evaluation, 1988);
(8) the importance in teaching and learning of the development of higher order thinking skills (Marzano et al., 1988);
(9) social-cognitive theory and self-efficacy beliefs research (Bandura, 1997);
(10) standards-based approaches to evaluation for national certification reflected in the work of the NBPTS; and
(11) a variety of other knowledge bases.

The PACES has been defined as a “comprehensive, learner-centered, classroom-based assessment system that is designed to provide teachers, administrators and other educators with
information useful for improving teaching and learning in classrooms and schools” (Ellett, 2001).3 The PACES was designed to assess far more than the fundamental teaching skills reflected in the vast majority of instruments available to evaluate teaching. Instead, it represents a set of integrated teaching and learning concepts, extending from comprehensive planning and reflective practice to the active engagement and involvement of learners in the development of higher order thinking skills. In addition to identifying seven major domains of teaching and learning,4 the PACES includes a variety of subsumed Teaching and Learning Components, each of which is further operationalized by a set of assessment indicators. Extensive explanations and examples are included in the PACES assessment manual to clarify the meaning of the various assessment indicators. These indicators are the fundamental units for making assessment decisions about the quality of teaching and learning.

A core element of the total PACES process is the encouragement and enhancement of self-reflection and professional growth. At the school-wide level, the total PACES effort is designed to facilitate the development of a culture that supports a community of learners among a variety of constituents including teachers, administrators, learners and parents. An example from the PACES Teaching and Learning Component, Domain II.A: Time Management, can be used to illustrate the organization and content depth of a typical domain and the conceptual connectedness in the system of teaching and learning. The Component

3 The total PACES document includes a large number of assessment indicators (n=107), all of which are considered important elements of teacher self-assessment, reflection and continuous professional growth. Currently, a subset of only 44 assessment indicators is used to make annual evaluation decisions in M-DCPS.

4 These seven domains are: Planning for Teaching and Learning, Managing the Learning Environment, Teacher/Learner Relationships, Enhancing and Enabling Learning, Enabling Thinking, Classroom-based Assessment of Learning and Professional Responsibilities.


Description of Domain II.A is the following:

Component Description: Teaching and learning activities reasonably reflect allocated time, begin promptly, proceed efficiently with smooth transitions and no undesirable digressions, and allow for maximum opportunities for learner engagement in learning. Activity refers to all things teachers and learners do in the classroom.

[Insert Box One About Here]

Box One contains the Assessment Indicator and Decision Making Rule that accompanies Domain II.A in the PACES manual (Ellett, 2001). This illustrates the content depth and conceptual connectedness of the system.

Teacher evaluation systems have historically been designed around the language of, and have had an evaluation focus on, teacher behavior or teacher performance, without concern for student outcomes. The learner-centered focus of the PACES is reflected in the language used to focus PACES evaluations, and this serves as a further example of a new generation of perspectives on teacher evaluation and teacher effectiveness. In evaluating the teaching of concepts, for example, a typical district level teacher evaluation system might include a statement like “The teacher teaches concepts” or “Concepts are taught.” In the PACES, this important element of higher order cognition is refocused on learners as follows: “Learners are actively engaged and/or involved in the development of concepts.” Thus, the language of the PACES requires a shift in focus for the evaluator from teacher performance to the active
involvement of learners in the development of concepts. One advantage of the language and learner-centered focus used throughout the PACES is its concern for learning, not simply the quality of the teacher’s performance or the demonstration of selected teacher behaviors that might be counted and recorded using a traditional teacher evaluation checklist. Another advantage of the language in the PACES is that it does not prescribe any particular teaching strategy, method or behavior. Nothing in the PACES teaching and learning professional growth manual tells a teacher how to teach. Thus, teachers are challenged within the uniqueness of their particular teaching and learning environments to find the best ways to actively engage their students in learning tasks. The PACES focuses on students’ active engagement in learning tasks at the classroom level and is logically consistent with the concern for school learning outcomes that has guided SER and school improvement research at the school level.

The National Board for Professional Teaching Standards

A second new generation effort in teacher evaluation in the USA is reflected in the work of the NBPTS. Since 1987, the NBPTS has developed and implemented a variety of content-specific, standards-based tasks for identifying and nationally certifying accomplished teachers relative to a set of beliefs and values about exemplary teaching (http://www.nbpts.org). This set of beliefs indicates that accomplished teachers have the following characteristics:

• They are committed to students and their learning.
• They know the subjects they teach and how to teach those subjects to students.
• They are responsible for managing and monitoring student learning.
• They think systematically about their practice and learn from experience.
• They are members of learning communities.

The NBPTS certification assessment process is voluntary and is designed to capture the complexities of accomplished teaching grounded in concerns for student learning by focusing on how teachers conceptualize, make decisions about and carry out courses of action in their classroom(s). In the assessment process, the teacher candidate constructs a comprehensive portfolio that documents and analyzes the teaching knowledge, skills, dispositions and professional judgements that distinguish their practice. The NBPTS certificate was not meant to replace state licensing of teachers since the certificate is a credential offered by the profession, not a license to teach. However, many states have opted to allow the NBPTS certificate to substitute for the state license, and some 47 states offer regulatory or legislative support for National Board Certification. In 1995, the NBPTS awarded its first 85 certificates. In the fall of 2000, 7,886 teachers were awarded National Board Certification, bringing the total number of teachers certified by the NBPTS to 23,937 at that point in time (http://www.nbpts.org). Clearly, interest in NBPTS certification is growing among teachers, school administrators, school board members, education policy makers and others. The professional, symbolic and monetary value of NBPTS certification seems clear. However, there has been considerable discussion in the recent literature about the extent to which NBPTS certified teachers have a greater impact on student learning and achievement than non-NBPTS certified teachers. Initial studies addressing this issue provide some evidence that supports the value-
addedness of NBPTS certification relative to the quality of teaching and student outcomes (Bond, Smith, Baker, & Hattie, 2000). Other recent studies, however, have raised a considerable number of issues about the potential disparate impact of NBPTS certification (Bond, 1988); the extent to which the NBPTS process is heavily influenced by candidates' verbal facility and their ability to negotiate and to present themselves in the best light for NBPTS assessors; and the validity of NBPTS certification for predicting the quality of everyday classroom practices (Pool, Ellett, Schiavone, & Carey-Lewis, 2001). Due to these issues, and because there is a need to document the meaning and validity of the NBPTS certification process, the NBPTS has recently commissioned a variety of research studies designed to better understand the value-addedness of the national certification process relative to teacher change and student learning. These studies include the application of sophisticated statistical models to student achievement on standardized tests, following NBPTS and non-NBPTS teachers over multiple years with multiple classes of students. Other studies will focus on the relationship between NBPTS certification and changes in teachers' assessment practices and the depth and quality of student work. Because these studies seek to link the NBPTS assessment process, the quality of teaching, changes in classroom practices, and student learning and achievement over time, they represent an opportunity to better understand linkages between teacher effectiveness, teacher evaluation and school effectiveness.
It has been argued here that: (1) traditional approaches to teacher evaluation have done little to improve schools in the USA; and (2) a new generation of learner-centered assessment and evaluation procedures is needed that embraces the larger literatures related to (a) teacher learning and professional development; (b) student learning; (c) school improvement;


and (d) school effectiveness. While there is considerable evidence that the quality of teaching does influence school effectiveness (discussed later in this article), we believe that a new generation of teacher evaluation systems that focus on the connectedness between teaching and learning is one important key to linking classroom-based assessments to student learning, school effectiveness and school improvement. The next section of this article reviews the history of SER in the USA.

A Brief History of School Effectiveness Research in the USA

SER began in the USA in the mid-1960s, and the research generated in the USA is arguably the most comprehensive worldwide, although productivity has declined recently. There are three strands of SER (Teddlie & Reynolds, 2000), all of which have lengthy traditions in the USA: (1) school effects research, which studies the scientific properties of school effects; (2) effective schools research, which is concerned with the processes of effective schooling; and (3) school improvement research, which examines the processes whereby schools can be changed for the better. SER in the USA has passed through four distinct, but somewhat overlapping, stages (Teddlie & Reynolds, 2000): (1) Stage One, from the mid-1960s up until the early 1970s, involved the initial economically driven input-output model; (2) Stage Two, from the early to the late 1970s, saw the beginning of the effective schools studies, which included a wide range of school process variables for study and examined a wider range of school outcomes than Stage One studies;
(3) Stage Three, from the late 1970s through the mid-1980s, saw the focus of SER shift towards the incorporation of the effective schools correlates into schools through the generation of various school improvement programs; and (4) Stage Four, from the late 1980s to the present day, has involved the introduction of school context factors and of more sophisticated methodologies.

Stages One and Two of SER in the USA

Stage One was the period in which economically driven input/output studies predominated. These studies focused on school resource inputs (e.g., per pupil expenditure) and student background characteristics (variants of socio-economic status or SES) to predict student achievement on standardized tests. This research (e.g., Coleman et al, 1966) concluded that differences in achievement were more strongly associated with societally determined family SES than with potentially malleable school-based resource variables. For example, the Coleman et al (1966) study concluded that "schools bring little influence to bear on a child's achievement that is independent of his background and general social context." Most of Coleman's school factors involved resources (e.g., per pupil expenditure, number of library books), which were not strongly related to student achievement. Nevertheless, 5-9% of the total variance in individual student achievement in the Coleman study was uniquely accounted for by school factors. As noted by many reviewers (e.g., Averch et al, 1971; Brookover et al, 1979), these early economic studies of school effects did not include adequate measures of school social psychological climate and other classroom/school process variables. Their exclusion contributed
to the underestimation of the magnitude of school effects. Stage Two of the development of SER in the USA involved studies that were conducted to dispute the results of Coleman and others. Researchers studied schools that were doing exceptional jobs of educating students from very poor SES backgrounds and sought to describe the processes ongoing in those schools. These studies also expanded the definition of the outputs of schools to include attitudinal and behavioral indicators. The inclusion of more sensitive measures of classroom input in these studies involved the association of student-level data with the specific teachers who taught the students. This methodological advance was important for two reasons: (1) it emphasized input from the classroom (teacher) level, as well as the school level; and (2) it associated student-level output variables with student-level input variables, rather than school-level input variables. Murnane (1975) and Summers and Wolfe (1977) assembled datasets in which specific teacher inputs were associated with the particular students whom they had taught. Their research demonstrated that certain characteristics of classroom teachers were significantly related to student achievement. Murnane's (1975) research indicated that information on classroom and school assignments increased the amount of predicted variance in student achievement by 15% in regression models in which student background and prior achievement had been entered first. Principals' evaluations of teachers were also significant predictors in this and other studies. Later reviews by Hanushek (e.g., 1986, 1996) indicated that teacher variables that are tied to school expenditures (e.g., teacher/student ratio, teacher salary) demonstrated no consistent effects on student achievement. On the other hand, qualities associated with human resources (e.g., student sense of control of their environment, principals' evaluations of teachers, quality of
teachers' education, teachers' expectations for students) demonstrated significantly positive relationships to achievement in several studies (e.g., Murnane, 1975; Link & Ratledge, 1979; Summers & Wolfe, 1977; Winkler, 1975). Other studies conducted in the USA during this period indicated the importance of peer groups on student achievement above and beyond the students' own SES background (e.g., Brookover et al, 1979; Hanushek, 1972; Henderson et al, 1978; Winkler, 1975). The measures of teacher behaviors and attitudes utilized in school effects studies have evolved considerably from the archived data that Summers and Wolfe (1977) and Murnane (1975) utilized. Current measures of teacher inputs include direct observations of effective classroom teaching behaviors, which were identified through the teacher effectiveness research literature (e.g., Brophy & Good, 1986; Gage & Needels, 1989; Rosenshine, 1983), as described in a previous section of this article. A major criticism of the early SER was that school/classroom processes were not adequately measured, and that this contributed to school level variance being attributed to family background variables rather than educational processes. Brookover et al (1979) addressed this criticism by using surveys designed to measure student, teacher and principal perceptions of school climate. These measures included items from four sources: (1) student sense of academic futility, (2) academic self-concept, (3) teacher expectations and (4) academic or school climate. This study, while beset with multicollinearity problems, demonstrated that in regression models with school climate variables entered first, student sense of academic futility explained about half of the variance in school level reading and mathematics achievement. Another methodological advance from the 1970s concerned the utilization of more
sensitive outcome measures (i.e., outcomes more directly linked to the courses or curriculum taught at the schools under study). Two studies conducted by a group of American, English and Irish researchers (Brimer et al, 1978; Madaus et al, 1979) demonstrated that the choice of test can have a dramatic effect on results concerning the extent to which school characteristics affect achievement. In the Madaus study, between class variance in student level performance on curriculum specific tests (e.g., history, geography) was estimated at 40% (averaged across a variety of tests).
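The order-of-entry logic behind the regression results described above (e.g., Murnane entering student background first; Brookover entering school climate first) can be sketched in a few lines. The data below are synthetic and the variable names are illustrative assumptions, not the original datasets; the point is that when predictor blocks are collinear, whichever block is entered first is credited with the variance they share.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hypothetical data: family SES and a school "climate" factor that are
# correlated with each other, both influencing achievement.
ses = rng.normal(0, 1, n)
climate = 0.6 * ses + 0.8 * rng.normal(0, 1, n)   # collinear with SES
achievement = 0.5 * ses + 0.4 * climate + rng.normal(0, 1, n)

def r_squared(y, predictors):
    """R^2 from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - resid.var() / y.var()

# Enter SES first, then climate: climate's *unique* increment is small.
r2_ses = r_squared(achievement, [ses])
r2_both = r_squared(achievement, [ses, climate])
print(f"SES alone: {r2_ses:.3f}; SES + climate: {r2_both:.3f}; "
      f"increment for climate: {r2_both - r2_ses:.3f}")

# Reverse the order: climate first claims the shared variance instead.
r2_climate = r_squared(achievement, [climate])
print(f"Climate alone: {r2_climate:.3f}; "
      f"increment for SES: {r2_both - r2_climate:.3f}")
```

This order dependence is exactly why the early input/output studies, which entered family background first, tended to understate the contribution of school and classroom process variables.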

Stages Three and Four of SER in the USA

The foremost proponent of the equity ideal during Stage Three of SER in the USA (the late 1970s and the mid-1980s) was Ron Edmonds, who took the results of his own research (e.g., Edmonds, 1979) and that of others (e.g., Lezotte & Bancroft, 1985) to make a case for the creation of effective schools for the urban poor. Edmonds and his colleagues were no longer interested in just describing effective schools: they also wished to create them. The five correlate model generated through the effective schools research consisted of the following factors: strong instructional leadership from the principal; a pervasive and broadly understood instructional focus; a safe and orderly school learning climate; high expectations for achievement from all students; and the use of student achievement test data for evaluating school success. The first school improvement studies now appeared, and these studies dominated SER in the USA for several years. These early studies (e.g., Clark & McCarthy, 1983; McCormack-Larkin, 1985; Taylor, 1990) were based for the most part on models that utilized the effective schools correlates generated from the studies described above.
The equity orientation, with its emphasis on school reform and its sampling biases, led to predictable responses in the early to mid-1980s. The criticisms (e.g., Cuban, 1983; Purkey & Smith, 1982; Rowan, 1984; Rowan, Bossert & Dwyer, 1983; Ralph & Fennessey, 1983) had the positive effect of paving the way for more sophisticated SER studies, which used more defensible sampling and data analysis strategies. School context factors were, in general, ignored during the effective schools research era in the USA, partially due to the equity orientation of the researchers. This equity orientation generated samples of schools that only came from low SES areas, not from across SES contexts, a bias that attracted much of the criticism noted above. In Stage Four, a more methodologically sophisticated era of SER began with the first contextually sensitive SER studies (e.g., Hallinger & Murphy, 1986; Teddlie & Stringfield, 1993), which explored factors that were producing greater effectiveness in middle-class schools, suburban schools and secondary schools. These studies explored differences in school effects that occur across different school contexts, instead of focusing upon one particular context. These context variables included: SES of student body (low, middle, high), grade level configuration (elementary, middle, secondary) and community type (rural, suburban, urban). The results from some of these contextually sensitive studies of school effectiveness, and their potential links to teacher evaluation, are discussed in the next section of this article. Several methodological advances have occurred in the past 20 years in the USA, leading to more sophisticated research across all three SER strands. The foremost methodological advance in SER in the USA (and internationally) during this period was the development of multilevel mathematical models to more accurately assess the effects of all the units of analysis
associated with schooling. Scholars from the USA (e.g., Burstein, 1980) were among the first to identify the levels of aggregation issue as an important one for SER. Researchers from the USA have continued to contribute to the refinement of multilevel modeling (e.g., Bryk & Raudenbush, 1988, 1992; Raudenbush, 1986, 1989; Raudenbush & Bryk, 1986, 1988). Teddlie and Stringfield (1993) conducted a large-scale study similar to that of Brookover et al (1979) during the second phase of the Louisiana School Effectiveness Study. These researchers utilized the Brookover school climate scales and found results similar to those reported by Brookover and his colleagues. The researchers utilized second order factor analysis and multilevel modeling in an effort to deal with the problems of multicollinearity among the school climate and family background variables. Since the mid-1990s there has been much less activity in SER in the USA. In a meta-analysis of multilevel SER studies, Bosker and Witziers (1996) noted how few recently published studies there were from the USA compared to those from the UK and the Netherlands. There are a number of reasons for this decline in the production of SER in the USA: (1) the scathing criticisms of effective schools research, which led many educational researchers to steer away from SER and fewer students to choose the area for dissertation studies after the mid-1980s (e.g., Cuban, 1993); (2) some researchers interested in the field moved away from it in the direction of new topics such as school restructuring (as described by Bickel, 1999) and school indicator systems; (3) SER using the input/output models failed to find significant relationships among financially driven inputs and student achievement (e.g., Geske & Teddlie, 1990; Hanushek, 1986), and research in that area subsided; and
(4) federal funding for educational research plummeted during the Reagan/Bush administrations (Good, 1989), and subsequently state departments of education became more involved in monitoring accountability rather than in basic research, such as SER.
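The aggregation issue that the multilevel models mentioned above were built to handle can be sketched by decomposing achievement variance into between-school and within-school components; the intraclass correlation is the school-level share of total variance. This is a simplified one-way random-effects calculation on simulated, balanced data (all numbers hypothetical), not a reconstruction of any study cited here; real analyses would use dedicated mixed-model software.

```python
import numpy as np

rng = np.random.default_rng(7)
n_schools, n_students = 40, 30

# Simulate a two-level structure: each school has its own mean effect,
# and students vary around their school's mean.
school_effects = rng.normal(0, 0.4, n_schools)        # between-school sd
scores = np.array([
    rng.normal(50 + effect, 1.0, n_students)          # within-school sd
    for effect in school_effects
])                                                    # shape (schools, students)

# One-way random-effects ANOVA estimates of the variance components.
school_means = scores.mean(axis=1)
msb = n_students * school_means.var(ddof=1)           # between-school mean square
msw = scores.var(axis=1, ddof=1).mean()               # within-school mean square
var_between = max((msb - msw) / n_students, 0.0)
var_within = msw

icc = var_between / (var_between + var_within)
print(f"Intraclass correlation (school-level share of variance): {icc:.2f}")
```

Ignoring this structure, as single-level analyses of the Stage One era did, misattributes school-level variance and understates the precision problems of school effect estimates, which is why multilevel models became the standard.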

Links Among SER, TER and Teacher Evaluation in the USA

Teddlie, Stringfield, and Burdett (this issue) presented a conceptual model that linked several constructs and research literatures. There has been work done in the USA specifically with regard to two of those links: SER and teacher evaluation, and SER and TER.

The Link Between SER and Teacher Evaluation

Levine and Lezotte (1990) summarized the characteristics of effective schooling that had been determined through studies of effective schools in the USA. These processes are listed in Table One. This list greatly expanded the original five correlates of effective schooling espoused by Edmonds in the 1970s.

[Insert Table One About Here]

Of course, many of these effective schools characteristics have direct implications for the evaluation of teachers, if teacher and school improvement is the goal of the teacher evaluation process (Teddlie, Stringfield, & Burdett, this issue). Many of the processes associated with the following effective schools characteristics (taken from Table One) could be part of the teacher evaluation process: (1) Effective Instructional Arrangements and Implementation, (2) Focus on
Student Acquisition of Central Learning Skills, (3) Productive School Climate and Culture, (4) High Operationalized Expectations and Requirements for Students and (5) Appropriate Monitoring of Student Progress.5 For instance, if teacher and school improvement were the goals of the teacher evaluation system, then faculty members could be assessed on Student Acquisition of Central Learning Skills by utilizing an observational system that measured (a) how well the teachers maximized the availability and use of time for learning and (b) how much the teachers emphasized mastery of central learning skills. Staff development could then be organized to address deficiencies in these areas. Stage Four SER increasingly involved the use of context variables. As noted above, these context variables include SES of student body, grade level configuration and community type. Research indicates that the characteristics of effective schools differ depending on these school context variables. For instance, Teddlie and Stringfield (1993) reported that effective middle-SES elementary schools promoted both high present and future (e.g., the students will do well in college) expectations, while effective low-SES elementary schools emphasized present educational expectations (e.g., the students can learn third grade mathematics). These differential results across different types of school contexts have ramifications for teacher evaluation systems, since these systems should also be sensitive to the context within which the teachers work. For instance, several studies of grade level configuration (elementary, middle, secondary) emphasize differences in the effective schools characteristics of these schools that could have implications for teacher evaluation (e.g., Heck, 1992; Virgilio, Teddlie, & Oescher, 1991). While emphasis on central learning skills is of great importance (especially in elementary schools), Levine, Levine and Eubanks (1984) reported that effective secondary schools stress students' personal as well as educational goals. Since secondary students will soon be entering the work force or going to post-secondary institutions, there is a greater need for secondary schools to be sensitive to individual students' goals and provide opportunities for their realization. Therefore, teachers at elementary and secondary schools should be evaluated somewhat differently with regard to the emphasis that they place on central learning skills to the exclusion of other individual students' goals. Another example concerns the use of low-inference measures of teacher effectiveness (i.e., time-on-task) at the elementary and secondary levels. Virgilio et al (1991) found that measures of time-on-task could successfully differentiate between more effective and less effective schools at the elementary level, but not at the secondary level. Such measures may be inappropriate for evaluation purposes at the secondary level, where co-operative learning arrangements and classroom situations emphasizing higher order thinking skills may yield lower time-on-task scores. Teacher evaluation systems must be sensitive to the contexts within which teaching occurs if they are to guide teacher and school improvement. This focus on contextually-based evaluations of teaching is consistent with the argument made in an earlier section of this article (and also presented in Ellett, 1990, 1997) about the importance of developing and implementing new generation assessments of teaching and learning.

5 Of course, Effective Teaching under Effective Instructional Arrangements and Implementation includes all the effective teaching characteristics described earlier in this article. The SER and TER literatures merge with regard to this characteristic, since they have produced similar findings concerning the components of the characteristic over time.

The Link Between TER and SER in the USA


Creemers and his colleagues (e.g., Creemers & Reezigt, 1996; Creemers & Scheerens, 1994; Reezigt, Creemers, & deJong, this issue) have persistently called for the study of educational effectiveness to replace the separate fields of school effectiveness and teacher effectiveness. They have proposed an integrated model that combines variables that have been traditionally associated with either SER or TER. Their call echoes an earlier sentiment expressed by Good (1989), who contended that it is important to explain more completely how processes at both levels (classroom, school) operate and how they can be combined to create more effective educational environments. The first article in this special issue (Teddlie, Stringfield, & Burdett) briefly noted the reasons that the fields of SER and TER developed separately in the USA and elsewhere. While there continues to be a lack of integration of the literatures in most countries, this is not the case in the USA. This section summarizes the results from a series of studies conducted in the USA over the past 20 years that explicitly examined educational effectiveness at both the classroom and school levels simultaneously. The first studies of the joint processes of school and teacher effectiveness began in the late 1970s and 1980s (e.g., Brookover et al, 1979; Mortimore et al, 1988; Teddlie et al, 1984). These studies used survey data as proxies for classroom observations (e.g., survey instruments measuring the social psychological climates of classes and schools). These survey instruments became a standard part of SER methodology, as did informal observations of classroom and school wide behavior. Researchers using these survey instruments and informal classroom observations were rewarded as they were able to explain aspects of the schooling process that had not been
explored in SER heretofore. For example, the four case studies presented by Brookover et al (1979) included extensive information on the proportion of class time spent on instruction, the use of competitive groups as opposed to individual learning in classrooms and the use of positive reinforcement in classrooms based on survey data and informal classroom observations. Starting in the mid-1980s, researchers working within the SER paradigm began explicitly including formal classroom observations (and consequently TER variables) in their research (e.g., Crone & Teddlie, 1995; Stringfield, Teddlie & Suarez, 1985; Teddlie, 1994; Teddlie, Kirby & Stringfield, 1989; Teddlie & Stringfield, 1993; Virgilio, Teddlie & Oescher, 1991). In these studies, schools were classified on two (more or less effective) or three (more effective, typical, less effective) levels based on their students' achievement after the effect of their families' SES had been taken into consideration. Then observers went into samples of classrooms in the schools, measuring teacher behavior using standardized instruments such as Stallings' (1980, 1991) Classroom Snapshot (as a measure of interactive and total time on task, or TOT) and the Virgilio Teacher Behavior Inventory (Teddlie, Virgilio and Oescher, 1990), or the VTBI, which was used to measure classroom management, quality of instruction, and social psychological climate. These studies have revealed consistent mean and standard deviation differences in teaching behavior between schools classified as more effective, typical or less effective, as indicated in Table Two. Cumulative data from these studies indicate that classrooms in more effective schools average 51% interactive TOT and 76% total TOT. On the other hand, classrooms in less effective schools average 37% interactive TOT and 52% total TOT.


[Insert Table Two About Here]

These studies also indicate that there were consistent differences in VTBI ratings collected from classrooms in differentially effective schools. Item scores on the VTBI range from one (poor) to five (excellent). Scores for more effective schools averaged around 3.85 across the three general areas of teaching (management, instruction, classroom climate), while those for less effective schools averaged around 3.00, which is the mid-point of the scale. (See Table Two).
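As a concrete, hypothetical illustration of how such time-on-task percentages are tallied: in a Stallings-style snapshot, each observation records one student's momentary state during a sweep of the classroom, and TOT is the share of observations coded on-task, with interactive TOT restricted to teacher-student interaction. The three-category coding below is a simplification assumed for this sketch, not the actual instrument.

```python
from collections import Counter

# Hypothetical sweep records: each entry is one student observed once.
# Simplified codes: "interactive" (on-task with the teacher),
# "independent" (on-task alone), "off_task".
sweeps = (["interactive"] * 51 + ["independent"] * 25 + ["off_task"] * 24)

counts = Counter(sweeps)
total = sum(counts.values())
interactive_tot = 100 * counts["interactive"] / total
total_tot = 100 * (counts["interactive"] + counts["independent"]) / total

# These illustrative tallies are constructed to match the profile reported
# above for classrooms in more effective schools (51% interactive, 76% total).
print(f"Interactive TOT: {interactive_tot:.0f}%  Total TOT: {total_tot:.0f}%")
```

The same arithmetic applied to the less effective school profile (37% interactive, 52% total) shows how a simple low-inference count can separate the two groups of schools.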

Another important result from these studies concerns the variance of ratings across classrooms in differentially effective schools. Teachers in more effective schools demonstrated less variance, while those in less effective schools demonstrated more variance. It appears that the trailing edge of teaching was eliminated in more effective schools. You will find effective teachers in less effective schools, but you will not find ineffective teachers in more effective schools, because more effective schools have developed processes whereby they eliminate poorer teaching. Researchers and reviewers (e.g., Bickel, 1999; Teddlie, 1994; Teddlie & Stringfield, 1993; Kyriakides & Campbell, this issue) have noted the need for additional research on the interaction of school and teacher effectiveness processes. For example, there are several ways in which school level behaviors can affect teacher behavior in the classroom: (1) the method of selecting teachers; (2) the type of classroom monitoring and feedback; (3) the type of support for individual teacher improvement provided by the administration; (4) the instructional leadership provided by the administration, including allocating and protecting academic time; and (5) the promotion of a positive academic climate at the school level, which translates to higher expectations and standards at the classroom level. Bickel (1999) concluded that further research on how school and classroom variables interact to influence educational effectiveness is needed. Despite this, Bickel also concluded that this type of research was unlikely in the USA now, since the national attention on school change has shifted from school effectiveness and improvement processes to systemic change models. The recent decline in SER described in an earlier section of this article indicates that Bickel was essentially correct. While SER has declined in the USA over the past decade, a healthy sign has been the emergence of SER in new countries (e.g., Cyprus, Hong Kong, Ireland, Norway, Spain, Taiwan). For example, the International School Effectiveness Research Program (ISERP) was an eight country study conducted during the 1990s and recently published (Reynolds, Creemers, Stringfield, Teddlie, & Schaffer, 2002). The USA was involved in this study, and the design of ISERP explicitly involved the simultaneous study of school and teacher effectiveness processes. Results from the USA component of ISERP confirmed and extended the research described in this section. For example, the three factors (classroom management, quality of instruction, social psychological climate) from the VTBI were positively associated with increases in student outcomes over the two year longitudinal study. Case studies from the USA sample revealed the following: (1) in low SES more effective schools, the principals knew the strengths and weaknesses of all the teaching staff, while at low SES less effective schools, the
principal was seldom seen in the classrooms and could not differentiate accurately among the teachers; and (2) in middle SES more effective schools, total TOT was over 90% and the teachers’ ratings on the VTBI indices averaged 4.5 (on five point scales), while the ratings of classroom teaching in the middle SES less effective schools were much lower.

Synthesis and Future Directions

Two important lines of inquiry in educational research in the USA have been described and integrated in this article: (a) TER and teacher evaluation research; and (b) SER, including school improvement research. Although these two lines of inquiry have developed somewhat in parallel with one another for nearly four decades, we view them as inextricably interrelated. Several authors have called for the merger of these two lines of inquiry (e.g., Creemers & Scheerens, 1994; Teddlie, 1994), and recently there has been an increased recognition that these two research traditions need to inform one another in order to maximize school improvement and reform in the USA. Over the years, the primary goal of TER has been to identify characteristics of exemplary teaching and learning environments, which should then enhance student learning and subsequent achievement. Though most would agree that there is not one best way to teach to achieve this end, most would probably also agree that there are core elements of teaching and learning environments that are logically and empirically linked to student outcomes (e.g., time management, student engagement, teachers' knowledge of subject matter). Historically, TER has greatly influenced (and will continue to influence) teacher evaluation practices in the USA. Research presented in this article has also directly linked SER
and teacher evaluation in this country. We posit that new teacher evaluation systems developed in the USA should effectively meld both TER and SER (and their combined study) in framing teacher evaluation standards and the criteria for judging them. The PACES evaluation system in Miami represents a new generation of evaluation systems that is explicitly doing that, especially with regard to its emphasis on student learning as a criterion variable. Student learning has always been the major criterion used in measuring effectiveness in SER, and the new generation teacher evaluation systems in the USA are adopting that criterion also, even though their etiology is firmly rooted in TER. Large-scale reviews and syntheses of research on schools and schooling (e.g., Wang, Haertel, & Walberg, 1993) have classified variables related to school effects and effectiveness into distal and proximal categories. Distal variables are those that are rather removed from the daily learning experiences of students and that have little effect on student learning, achievement, and school improvement (e.g., policy-related initiatives such as class size, teacher pay, school choice, textbook adoption). Additional examples of distal variables were provided in the multitude of popular, politically driven efforts (means) to reform education in the USA described at the beginning of this article. Proximal variables are those variables that are closest to the daily lives and experiences of students (e.g., student abilities, preferences, and prior achievement; teacher characteristics and classroom behaviors; instructional materials and practices; amount of time devoted to learning; curriculum content; and classroom climate). Fullan (1993) has placed these variables at the learning core of schools, and Teddlie, Stringfield and Burdett (this issue) have argued that school-based self-evaluation systems are also focused on these proximal variables. Therefore,


TER and SER have both recently focused on these proximal variables in a highly complementary manner. To summarize, it appears that now is an opportune time in the USA for linking the various effectiveness, evaluation and improvement literatures and practices. Several trends in research and practice have led to this watershed. These include the following.

(1) There is a call (both internationally and in the USA) for researchers, policy makers and practitioners from the various fields to break out of their "separated circles" (Reezigt, Creemers, & de Jong, this issue) and integrate research and policy more closely. For instance, Viadero (2003), in a recent edition of Education Week, wrote of the need to bridge the disconnect between educational research and practice. Viadero noted that researchers often believe that the results of their studies will pass directly into the hands of practitioners, and that this seldom happens. There is a growing recognition that the disconnect between researchers and practitioners must be repaired. This recognition is directly reflected in publications that call for the merger of the various fields discussed in this article, such as school effectiveness and school improvement (e.g., Reynolds, Hopkins, & Stoll, 1993).

(2) The concept of an educational effectiveness literature that merges TER and SER is gaining popularity both internationally and in the USA.

(3) There is now a research tradition in the USA in which teacher effectiveness variables have been included in SER, and these studies have yielded results with interesting implications for policy and practice, as noted in an earlier section of this article. Quantitatively oriented Stage Four SER in the USA has also recognized the importance of teacher effectiveness as a within



school variable that must be taken into account to explain variation in school outcomes across schools. Recent longitudinal studies of student learning and achievement convincingly document teacher effects as the most powerful predictor of student achievement (e.g., Sanders & Horn, 1998).6

(4) Stage Four SER in the USA has also increasingly focused on context variables, and results from this research have been linked to the practice of teacher evaluation in this article. The developers and implementers of teacher evaluation systems should be sensitive to the specific contexts within which teachers work, and such systems should be informed by the results of contextually sensitive SER.

(5) New generation teacher evaluation systems in the USA focus more on student learning and subsequent achievement than previous systems did. Therefore, teacher evaluation and SER now share a common criterion variable: student achievement.

We expect to see an increased number of mixed methods studies (e.g., Tashakkori & Teddlie, 2003) of educational effectiveness that will: (1) identify demonstrably effective schools based on their productivity; (2) subsequently conduct rich, on-site investigations of class and school level variables such as the quality of teaching and learning, learning environment characteristics, leadership, engaged time, and the educational quality of the home environment; (3) yield and disseminate results that inform the development and implementation of new

6. The development of sophisticated data analysis techniques such as statistical mixed-model methods (Sanders & Horn, 1998) and hierarchical linear modeling (e.g., Bryk & Raudenbush, 1988) now makes it possible not only to conduct meaningful, longitudinal investigations of school and classroom level effects on school outcomes, but also to study the effects of within school variation on school effectiveness. We project that these sophisticated methods will be used to a greater extent in future studies linking teacher effectiveness (and other proximal variables) to school effectiveness and improvement in a manner that informs both lines of inquiry.
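For readers unfamiliar with these multilevel techniques, the core idea of a two-level school effects model is to partition outcome variance into within-school and between-school components. A minimal sketch with simulated data (all numbers and variable names below are hypothetical illustrations, not TVAAS or Louisiana School Effectiveness Study results) uses the one-way random-effects ANOVA decomposition, the simplest special case of a hierarchical linear model:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 40 schools x 50 students: a school-level effect plus student-level noise
n_schools, n_students = 40, 50
school_effect = rng.normal(0.0, 3.0, n_schools)      # between-school SD = 3
scores = 50.0 + school_effect[:, None] + rng.normal(0.0, 8.0, (n_schools, n_students))

# One-way random-effects ANOVA decomposition of the achievement variance
grand_mean = scores.mean()
school_means = scores.mean(axis=1)
ms_between = n_students * np.sum((school_means - grand_mean) ** 2) / (n_schools - 1)
ms_within = np.sum((scores - school_means[:, None]) ** 2) / (n_schools * (n_students - 1))

# Intraclass correlation: the proportion of score variance lying between schools
icc = (ms_between - ms_within) / (ms_between + (n_students - 1) * ms_within)
print(f"Estimated ICC: {icc:.2f}")   # population value in this simulation: 9 / (9 + 64), about 0.12
```

A full hierarchical linear model generalizes this decomposition by adding student- and school-level predictors at each level, which is what allows within-school variation (e.g., teacher effects) to be modeled alongside between-school variation.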


generation teacher evaluation practices (both summative and formative); and (4) yield and disseminate results that lead to differentiated recommendations for school and teacher improvement across different contexts.
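One common operationalization of step (1), used in SER outlier studies since the 1970s, is to regress school mean achievement on student intake characteristics and treat schools with large positive (or negative) residuals as candidates for on-site case study. The sketch below uses simulated data; the variable names, effect sizes, and one-standard-deviation threshold are illustrative assumptions, not those of any study cited here:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical school-level data: mean achievement driven by mean SES plus an
# "effectiveness" component that intake characteristics cannot explain
n = 60
ses = rng.normal(0.0, 1.0, n)
effectiveness = rng.normal(0.0, 4.0, n)
achievement = 50.0 + 6.0 * ses + effectiveness + rng.normal(0.0, 2.0, n)

# Regress achievement on SES; the residual is a crude value-added index
slope, intercept = np.polyfit(ses, achievement, 1)
residuals = achievement - (intercept + slope * ses)

# Flag schools scoring well above (or below) prediction as candidate
# positive/negative outliers for follow-up qualitative investigation
threshold = residuals.std()
positive = np.flatnonzero(residuals > threshold)
negative = np.flatnonzero(residuals < -threshold)
print(f"{len(positive)} candidate effective schools, {len(negative)} candidate ineffective schools")
```

In a mixed methods design of the kind projected above, the quantitative screen would be followed by the rich, on-site qualitative work described in step (2).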



References

Averch, H. A., Carroll, S. J., Donaldson, T. S., Kiesling, H. J., & Pincus, J. (1971). How effective is schooling? A critical review and synthesis of research findings. Santa Monica, CA: Rand Corporation.
Bandura, A. (1997). Self-efficacy: The exercise of control. New York: W. H. Freeman and Company.
Beecher, D. E. (1949). The evaluation of teaching: Background and concepts. New York: Syracuse University Press.
Berk, R. A. (1984). A guide to criterion-referenced test construction. Baltimore: The Johns Hopkins University Press.
Bickel, W. E. (1999). The implications of the effective schools literature for school restructuring. In C. R. Reynolds & T. Gutkin (Eds.), The handbook of school psychology (3rd ed., pp. 959-983). New York: John Wiley and Sons.
Bond, L. (1998). Disparate impact and teacher certification. Journal of Personnel Evaluation in Education, 12(2), 211-220.
Bond, L., Smith, T., Baker, W., & Hattie, J. (2000). The certification system of the National Board of Professional Teaching Standards: A construct and consequential validity study. Greensboro, NC: Center for Educational Research and Evaluation, University of North Carolina at Greensboro.
Bosker, R. J., & Witziers, B. (1996). The magnitude of school effects, or: Does it really matter which school a student attends? Paper presented at the annual meeting of the American Educational Research Association, New York, NY.
Brimer, A., Madaus, G. F., Chapman, B., Kellaghan, T., & Woodrof, R. (1978). Differences in school achievement. Slough: NFER-Nelson.
Brookover, W. B., Beady, C., Flood, P., Schweitzer, J., & Wisenbaker, J. (1979). Schools, social systems and student achievement: Schools can make a difference. New York: Praeger.
Brophy, J. (1986). Teacher influences on student achievement. American Psychologist, 41, 1069-1077.
Brophy, J. (1988). Research on teacher effects: Uses and abuses. Elementary School Journal, 3-22.
Brophy, J. E., & Good, T. L. (1986). Teacher behavior and student achievement. In M. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 328-375). New York: Macmillan.
Bryk, A. S., & Raudenbush, S. (1988). Toward a more appropriate conceptualisation of research on school effects: A three-level hierarchical linear model. In R. D. Bock (Ed.), Multilevel analysis of educational data (pp. 159-204).
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage.
Burstein, L. (1980). The analysis of multi-level data in educational research and evaluation. In D. C. Berliner (Ed.), Review of Research in Education, 8, 158-233.
Capie, W., Anderson, S. J., Johnson, C. E., & Ellett, C. D. (1980). Teacher Performance Assessment Instruments: Teaching plans and materials, interpersonal skills, professional standards, student perceptions. Athens, GA: Georgia Department of Education, Teacher Assessment Project, University of Georgia.
Charters, W. W., & Waples, D. (1929). The commonwealth teacher training study. Chicago: University of Chicago Press.
Clark, T. A., & McCarthy, D. P. (1983). School improvement in New York City: The evolution of a project. Educational Researcher, 12(4), 17-24.
Coleman, J. S., Campbell, E., Hobson, C., McPartland, J., Mood, A., Weinfeld, R., & York, R. (1966). Equality of educational opportunity. Washington, DC: Government Printing Office.
Creemers, B. P. M., & Reezigt, G. J. (1996). School level conditions affecting the effectiveness of instruction. School Effectiveness and School Improvement, 7(3), 197-228.
Creemers, B. P. M., & Scheerens, J. (1994). Developments in the educational effectiveness research programme. In R. J. Bosker, B. P. M. Creemers, & J. Scheerens (Eds.), Conceptual and methodological advances in educational effectiveness research. Special issue of International Journal of Educational Research, 21(2), 125-140.
Crone, L., & Teddlie, C. (1995). Further examination of teacher behavior in differentially effective schools: Selection and socialization processes. Journal of Classroom Interaction, 30(1), 1-9.
Cuban, L. (1983). Effective schools: A friendly but cautionary note. Phi Delta Kappan, 64, 695-696.
Cuban, L. (1990). Reforming again, and again, and again. Educational Researcher, 19(1), 3-13.
Cuban, L. (1993). Preface. In C. Teddlie & S. Stringfield (Eds.), Schools make a difference: Lessons learned from a 10-year study of school effects. New York: Teachers College Press.
Darling-Hammond, L., Wise, A., & Pease, J. R. (1983). Teacher evaluation in the organizational context: A review of the literature. Review of Educational Research, 53, 285-328.
Dunkin, M. J., & Biddle, B. J. (1974). The study of teaching. New York: Holt, Rinehart and Winston.
Edmonds, R. R. (1979). Effective schools for the urban poor. Educational Leadership, 37(10), 15-24.
Ellett, C. D. (1985). Assessing minimum competencies of beginning teachers: Instrumentation, measurement issues and legal concerns. In Evaluation of teaching: The formative process (Hot Topics Series). Bloomington, IN: Phi Delta Kappa.
Ellett, C. D. (1987). Emerging teacher performance assessment practices: Implications for the instructional supervision role of school principals. In W. Greenfield (Ed.), Instructional leadership: Concepts and controversies (pp. 302-327). Boston: Allyn & Bacon.
Ellett, C. D. (1990). A new generation of classroom-based assessments of teaching and learning: Concepts, issues and controversies from pilots of the Louisiana STAR. Baton Rouge, LA: Statewide Teacher Evaluation Project, College of Education, Louisiana State University.
Ellett, C. D. (1997). Classroom-based assessments of teaching and learning. In J. Stronge (Ed.), Evaluating teaching: A guide to current thinking and best practice (pp. 107-128). Newbury Park, CA: Corwin Press.
Ellett, C. D. (2001). Professional Assessment and Comprehensive Evaluation System (PACES): Teaching and learning professional growth manual. Watkinsville, GA: CDE Research Associates, Inc.
Ellett, C. D. (2003). Teacher self-assessment tasks with the PACES. Miami, FL: Miami-Dade County Public Schools. http://www.paces.dadeschools.net
Ellett, C. D., & Capie, W. (1982). Measurement issues and procedures for establishing performance-based certification standards for teachers. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
Ellett, C. D., & Capie, W. (1985). Assessing meritorious teacher performance: A differential validity study. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.
Ellett, C. D., & Garland, J. S. (1987). Teacher evaluation practices in our largest school districts: Are they measuring up to "state-of-the-art" systems? Journal of Personnel Evaluation in Education, 1(1), 69-92.
Ellett, C. D., Loup, K. S., & Chauvin, S. W. (1991). System for Teaching and learning Assessment and Review (STAR). Baton Rouge, LA: College of Education, Louisiana State University.
Flanders, N. A. (1970). Analyzing teacher behavior. Reading, MA: Addison-Wesley.
Gage, N. (1972). Can science contribute to the art of teaching? In Teacher effectiveness and teacher education: The search for a scientific basis (pp. 27-39). Palo Alto, CA: Pacific Books.
Gage, N. L., & Needels, M. C. (1989). Process-product research on teaching: A review of criticisms. The Elementary School Journal, 89, 253-300.
Geske, T., & Teddlie, C. (1990). Organizational productivity of schools. In P. Reyes (Ed.), Teachers and their workplace: Commitment, performance and productivity (pp. 191-221). Newbury Park, CA: Sage.
Good, T. L. (1989). Classroom and school research: Investments in enhancing schools. Columbia, MO: Center for Research in Social Behavior.
Hallinger, P., & Murphy, J. (1986). The social context of effective schools. American Journal of Education, 94, 328-355.
Hanushek, E. A. (1986). The economics of schooling: Production and efficiency in public schools. Journal of Economic Literature, 24, 1141-1177.
Hanushek, E. A. (1996). A more complete picture of school resource policies. Review of Educational Research, 66(3), 397-409.
Heck, R. H. (1992). Principals' instructional leadership and school performance: Implications for policy development. Educational Evaluation and Policy Analysis, 14(1), 21-34.
Henderson, V., Mieszkowski, P., & Sauvageau, Y. (1978). Peer group effects and educational production functions. Journal of Public Economics, 10, 97-106.
Iwanicki, E. (Ed.). (1986). Journal of Personnel Evaluation in Education, 1(1). Boston, MA: Kluwer Academic Publishers.
Jaeger, R. M., & Tittle, C. K. (1980). Minimum competency testing: Motives, models, measures, and consequences. Berkeley, CA: McCutchan.
Joint Committee on Standards for Educational Evaluation (1988). The personnel evaluation standards: How to assess systems for evaluating educators (D. Stufflebeam, Ed.). Newbury Park, CA: Corwin Press.
Levine, D. U., Levine, R., & Eubanks, E. E. (1984). Characteristics of effective inner-city intermediate schools. Phi Delta Kappan, 65, 707-711.
Levine, D. U., & Lezotte, L. W. (1990). Unusually effective schools: A review and analysis of research and practice. Madison, WI: The National Center for Effective Schools Research and Development.
Lezotte, L. W., & Bancroft, B. (1985). Growing use of effective schools model for school improvement. Educational Leadership, 42(3), 23-27.
Link, C. R., & Ratledge, E. C. (1979). Student perceptions, IQ, and achievement. The Journal of Human Resources, 14.
Loup, K. S., Garland, J. S., Ellett, C. D., & Rugutt, J. K. (1997). Ten years later: Findings from a replication of a study of teacher evaluation practices in our 100 largest school districts. Journal of Personnel Evaluation in Education, 10(3), 203-226.
Madaus, G. F., Kellaghan, T., Rakow, E. A., & King, D. J. (1979). The sensitivity of measures of school effectiveness. Harvard Educational Review, 49, 207-230.
Marzano, R. J., Brandt, R. S., Hughes, C. S., Jones, B. F., Presseisen, B. Z., Rankin, S. C., & Suhor, C. (1988). Dimensions of thinking: A framework for curriculum and instruction. Alexandria, VA: Association for Supervision and Curriculum Development.
McCormack-Larkin, M. (1985). Ingredients of a successful school effectiveness project in Milwaukee. Educational Leadership, 42(6), 31-37.
McLaughlin, M. (1990). Embracing contraries: Implementing and sustaining teacher evaluation. In J. Millman & L. Darling-Hammond (Eds.), The new handbook of teacher evaluation: Assessing elementary and secondary school teachers (pp. 403-415). Newbury Park, CA: Sage Publications, Inc.
McLaughlin, M., & Pfeifer, R. S. (1988). Teacher evaluation: Improvement, accountability and effective learning. New York: Teachers College Press.
Medley, D. M. (1977). Teacher competence and teacher effectiveness: A review of process-product research. Washington, DC: American Association of Colleges for Teacher Education.
Medley, D. M., Coker, H., & Soar, R. S. (1984). Measurement-based evaluation of teacher performance: An empirical approach. New York: Longman.
Medley, D., & Mitzel, H. (1963). Measuring classroom behavior by systematic observation. In N. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally.
Millman, J. (Ed.). (1981). Handbook of teacher evaluation. Beverly Hills, CA: Sage Publications, Inc.
Millman, J., & Darling-Hammond, L. (Eds.). (1990). The new handbook of teacher evaluation: Assessing elementary and secondary school teachers. Beverly Hills, CA: Sage Publications, Inc.
Mortimore, P., Sammons, P., Stoll, L., Lewis, D., & Ecob, R. (1988). School matters: The junior years. Somerset, England: Open Books.
Murnane, R. J. (1975). The impact of school resources on the learning of inner city children. Cambridge, MA: Ballinger Publishing Co.
Pipho, C. (Ed.). (1978). Minimum competency testing. Phi Delta Kappan, 59(9).
Pool, J. E., Ellett, C. D., Schiavone, S., & Carey-Lewis, C. (2001). How valid are the National Board of Professional Teaching Standards assessments for predicting the quality of actual classroom teaching and learning? Results of six mini case studies. Journal of Personnel Evaluation in Education, 15(1), 31-48.
Purkey, S. C., & Smith, M. S. (1982). Too soon to cheer? Synthesis of research on effective schools. Educational Leadership, 40(12), 64-69.
Ralph, J. H., & Fennessey, J. (1983). Science or reform: Some questions about the effective schools model. Phi Delta Kappan, 64(10), 689-694.
Raudenbush, S. W. (1986). Educational applications of hierarchical linear models: A review. Journal of Educational Statistics, 13, 85-116.
Raudenbush, S. W. (1989). The analysis of longitudinal, multilevel data. International Journal of Educational Research, 13, 685-825.
Raudenbush, S. W., & Bryk, A. S. (1986). A hierarchical model for studying school effects. Sociology of Education, 59, 1-17.
Raudenbush, S. W., & Bryk, A. S. (1988). Methodological advances in analysing the effects of schools and classrooms on student learning. In E. Z. Rothkopf (Ed.), Review of Research in Education, 15, 423-475. Washington, DC: American Educational Research Association.
Reynolds, D., Creemers, B., Stringfield, S., Teddlie, C., & Schaffer, E. (2002). World class schools: International perspectives on school effectiveness. London: Routledge/Falmer.
Reynolds, D., Hopkins, D., & Stoll, L. (1993). Linking school effectiveness knowledge and school improvement practice: Towards a synergy. School Effectiveness and School Improvement, 4(1), 37-58.
Rosenshine, B. (1983). Teaching functions in instructional programs. Elementary School Journal, 83, 335-351.
Rosenshine, B., & Furst, N. (1973). The use of direct observation to study teaching. In R. M. W. Travers (Ed.), Second handbook of research on teaching. Skokie, IL: Rand McNally.
Roth, W. M., & Tobin, K. (2001). The implications of coteaching/cogenerative dialogue for teacher evaluation: Learning from multiple perspectives of everyday practice. Journal of Personnel Evaluation in Education, 15(1), 1-29.
Rowan, B. (1984). Shamanistic rituals in effective schools. Issues in Education, 2, 76-87.
Rowan, B., Bossert, S. T., & Dwyer, D. C. (1983). Research on effective schools: A cautionary note. Educational Researcher, 12(4), 24-31.
Sanders, W. L., & Horn, S. P. (1998). Research findings from the Tennessee Value-Added Assessment System (TVAAS) database: Implications for educational evaluation and research. Journal of Personnel Evaluation in Education, 12(3), 247-257.
Scriven, M. (1988). Duty-based teacher evaluation. Journal of Personnel Evaluation in Education, 1(4), 319-334.
Shulman, L. S. (1987). Knowledge and teaching: Foundations of the new reform. Harvard Educational Review, 57, 1-22.
Simon, A., & Boyer, E. C. (1967). Mirrors for behavior: An anthology of classroom observation instruments (6 vols.). Philadelphia, PA: Research for Better Schools.
Sparks-Langer, G., Simmons, J., Pasch, M., Colton, A., & Starko, A. (1990). Reflective pedagogical thinking: How can we promote it and measure it? Journal of Teacher Education, 41(4), 23-32.
Stallings, J. (1977). Learning to look: A handbook for classroom observation and teaching models. Belmont, CA: Wadsworth.
Stallings, J. A. (1980). Allocated academic learning time revisited, or beyond time on task. Educational Researcher, 9(11), 11-16.
Stallings, J. A., & Freiberg, H. J. (1991). Observation for improvement of teaching. In H. C. Waxman & H. J. Walberg (Eds.), Effective teaching: Current research (pp. 107-134). Berkeley, CA: McCutchan Publishing.
Stringfield, S., Teddlie, C., & Suarez, S. (1985). Classroom interaction in effective and ineffective schools: Preliminary results from phase III of the Louisiana School Effectiveness Study. Journal of Classroom Interaction, 20(2), 31-37.
Summers, A. A., & Wolfe, B. L. (1977). Do schools make a difference? American Economic Review, 67, 639-652.
Tashakkori, A., & Teddlie, C. (Eds.). (2003). Handbook of mixed methods in social and behavioral research. Thousand Oaks, CA: Sage Publications, Inc.
Taylor, F. (1947). Scientific management. New York: Harper.
Taylor, O. (Ed.). (1990). Case studies in effective schools research. Madison, WI: National Center for Effective Schools Research and Development.
Taylor, P., Dawson, V., & Fraser, B. (1995). Classroom learning environments under transformation: A constructivist perspective. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Teddlie, C. (1994). Integrating classroom and school data in school effectiveness research. In D. Reynolds et al., Advances in school effectiveness research and practice (pp. 111-132). Oxford: Pergamon.
Teddlie, C., Falkowski, C., Stringfield, S., Desselle, S., & Garvue, R. (1984). The Louisiana School Effectiveness Study: Phase two, 1982-84. Baton Rouge: Louisiana Department of Education. (ERIC Document Reproduction Service No. ED 250 362)
Teddlie, C., Kirby, P., & Stringfield, S. (1989). Effective versus ineffective schools: Observable differences in the classroom. American Journal of Education, 97(3), 221-236.
Teddlie, C., & Reynolds, D. (2000). The international handbook of school effectiveness research. London: Falmer Press.
Teddlie, C., & Stringfield, S. (1993). Schools make a difference: Lessons learned from a 10-year study of school effects. New York: Teachers College Press.
Teddlie, C., Virgilio, I., & Oescher, J. (1990). Development and validation of the Virgilio Teacher Behavior Inventory. Educational and Psychological Measurement, 50(2), 421-430.
Tobin, K. (Ed.). (1993). The practice of constructivism in science education. Washington, DC: American Association for the Advancement of Science.
Viadero, D. (2003). Scholars aim to connect studies to schools' needs. Education Week, 22(27), 1, 12-14.
Virgilio, I., Teddlie, C., & Oescher, J. (1991). Variance and context differences in teaching at differentially effective schools. School Effectiveness and School Improvement, 2(2), 152-168.
Wang, M. C., Haertel, G. D., & Walberg, H. J. (1993). Toward a knowledge base for school learning. Review of Educational Research, 63(3), 249-294.
Winkler, D. R. (1975). Educational achievement and school peer group composition. Journal of Human Resources, 10, 189-205.
Ying, P. C., & Fan, G. R. (2001). Research on traditional cases of teachers' evaluating patterns: On the disadvantage of traditional pattern of teachers' evaluation and study of a new pattern. Theory and Practice of Education, 2(3), 22-35.



Table One. Effective Schools' Characteristics Identified in the USA and Summarized by Levine and Lezotte (1990)

1. Outstanding Leadership
   a. Superior Instructional Leadership
   b. Support for Teachers
   c. High Expenditure of Time and Energy for School Improvement
   d. Vigorous Selection and Replacement of Teachers
   e. Maverick Orientation and Buffering
   f. Frequent, Personal Monitoring of School Activities and Sense-making
   g. Acquisition of Resources
   h. Availability and Effective Utilization of Instructional Support Personnel

2. Effective Instructional Arrangements and Implementation
   a. Effective Teaching
   b. Successful Grouping and Related Organizational Arrangements
   c. Classroom Adaptation
   d. Active/enriched Learning
   e. Emphasis on Higher Order Thinking Skills in Assessing Instructional Outcomes
   f. Coordination in Curriculum and Instruction
   g. Easy Availability of Instructional Materials
   h. Stealing Time for Reading, Language, Mathematics

3. Focus on Student Acquisition of Central Learning Skills
   a. Maximum Availability and Use of Time for Learning
   b. Emphasis on Mastery of Central Learning Skills

4. Productive School Climate and Culture
   a. Orderly Environment
   b. Faculty Commitment to a Shared and Articulated Mission Focused on Achievement
   c. Faculty Cohesion and Collegiality
   d. Schoolwide Emphasis on Recognizing Positive Performance
   e. Problem Solving Orientation
   f. Faculty Input Into Decision Making

5. High Operationalized Expectations and Requirements for Students

6. Appropriate Monitoring of Student Progress

7. Practice Oriented Staff Development at the School Site

8. Salient Parental Involvement

9. Others
   a. Student Sense of Efficacy/Futility
   b. Multicultural Instruction and Sensitivity
   c. Personal Development of Students
   d. Rigorous and Equitable Student Promotion Policies and Practices

Note: Levine and Lezotte (1990) noted that the other characteristics (Category 9) were found in a smaller proportion of the studies that they reviewed.



Table Two. Results of Research from the USA Simultaneously Studying School and Teacher Effectiveness Processes

Dimension of Effective Schooling          More Effective   Typical   Less Effective
                                          Schools          Schools   Schools
Interactive Time on Task                  51%              43%       37%
Total Time on Task                        76%              64%       52%
Management                                4.05             3.15      3.07
Quality of Instruction                    3.73             3.39      2.89
Classroom Social Psychological Climate    3.75             3.61      3.48
Note: This table summarizes the results from several studies of differentially effective schools (e.g., Crone & Teddlie, 1995; Stringfield, Teddlie, & Suarez, 1985; Teddlie, Kirby, & Stringfield, 1989; Teddlie & Stringfield, 1993). Altogether some 1200 classroom observations were conducted in these studies in approximately 125 schools and 500 different classrooms. Stallings' Classroom Snapshot was used as the measure of interactive and total TOT. Scores on this instrument could range from 0% to 100% TOT. The Virgilio Teacher Behavior Inventory was used as the measure of classroom management, quality of instruction, and social psychological climate. Scores on items on the VTBI range from one (poor) to five (excellent).
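To make the time-on-task (TOT) columns concrete: in a snapshot-style observation, an observer repeatedly codes what students are doing at a given instant, and the reported percentages are simple proportions of those codes. The toy sketch below uses simplified coding labels that are illustrative assumptions, not the actual categories of Stallings' Classroom Snapshot:

```python
# Hypothetical snapshot codes for one observation sweep: each entry records a
# student as "interactive" (on task with the teacher), "independent" (on task
# alone), or "off" (off task)
observations = ["interactive", "independent", "off", "interactive", "independent",
                "interactive", "off", "independent", "interactive", "interactive"]

# Interactive TOT counts only teacher-interactive engagement; total TOT counts
# all on-task codes, mirroring the two TOT rows of Table Two
interactive_tot = 100.0 * observations.count("interactive") / len(observations)
total_tot = 100.0 * (len(observations) - observations.count("off")) / len(observations)
print(f"Interactive TOT: {interactive_tot:.0f}%  Total TOT: {total_tot:.0f}%")
```

Averaging such percentages over many sweeps, classrooms, and schools yields school-level figures comparable to the 37%-76% range shown in the table.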



BOX 1. Example from PACES. Assessment Indicator II.A.1: Learning begins promptly. The indicator focuses on the beginning of the lesson: teaching and learning should begin with little time spent on organizational routines such as taking roll and distributing materials. The issue is not simply whether the lesson begins on time, but whether it begins well; for example, students might begin a learning task while the teacher takes roll and attends to other organizational routines. Decision Making Rule: when time is not used efficiently at the beginning of a lesson, a review/discussion must occur between the teacher and the evaluator to determine a professional growth plan.


