A new report released at the end of July by the National Center for Education Evaluation, a division of the US Department of Education, has serious implications for Labor and Liberal plans to identify the best-performing teachers and schools in Australia and give them cash bonuses.
It shows that a large proportion of the payments could go to the wrong teachers and the wrong schools. High-performing teachers and schools could be overlooked, while average teachers and schools could be wrongly rated as highly effective.
Under both parties’ plans, the bonus payments will be partially based on the “value added” by individual teachers to student results, as measured by NAPLAN test scores and school-based information. Labor also proposes to pay bonuses to the schools which increase student achievement the most.
The NCEE report analyses the extent of statistical error in measuring teacher and school performance in primary school grades using student test score gain data and value-added models. It concludes:
Using rigorous statistical methods and realistic performance measurement schemes, the paper presents evidence that value-added estimates for teacher-level analyses are subject to a considerable degree of random error when based on the amount of data that are typically used in practice for estimation. [p.35]
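In essence, a value-added estimate is the part of students’ test score gains that a statistical model attributes to their teacher rather than to prior achievement or other measured factors. The sketch below (in Python) is illustrative only – the report’s actual models are richer, and every number in it is invented – but it shows the basic two-step logic:

    import numpy as np

    rng = np.random.default_rng(3)
    n_teachers, class_size = 200, 25
    teacher = np.repeat(np.arange(n_teachers), class_size)
    true_va = rng.normal(0, 1, n_teachers)      # true teacher effects (invented)
    prior = rng.normal(0, 10, teacher.size)     # last year's test scores
    score = 0.8 * prior + true_va[teacher] + rng.normal(0, 8, teacher.size)

    # Step 1: regress current scores on prior scores (pooled OLS)
    slope, intercept = np.polyfit(prior, score, 1)
    residual = score - (intercept + slope * prior)

    # Step 2: a teacher's value-added estimate is the mean residual
    # of the students in his or her class
    value_added = np.array([residual[teacher == t].mean()
                            for t in range(n_teachers)])

The misclassification problem arises because those class-mean residuals contain random noise as well as genuine teacher effects.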
It found error rates of about 26% when three years of data are used for estimation. That is, more than 1 in 4 teachers identified as ineffective or highly effective are actually of average performance, and more than 1 in 4 teachers who are genuinely highly effective or ineffective will be overlooked.
For a bonus scheme, this means that more than one quarter of all truly high-performing teachers would be passed over for recognition, while more than one quarter of persistently average teachers would wrongly receive bonuses.
The error rates increase substantially if fewer than three years of data are used, which will often be the case. If only one year’s data is used, there is a 35% chance that a teacher will be misidentified as significantly below or above average; that is, about 1 in 3 teachers will be misclassified.
Even with five years of data, the error rates remain substantial – about 20%, meaning about 1 in 5 teachers will be misclassified as ineffective or highly effective.
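How sampling error produces these misclassifications can be seen in a toy simulation. This is not the report’s model – the spread of true teacher effects and the size of the yearly noise are invented, so the percentages it prints will not reproduce the report’s figures – but the pattern of high error with one year of data and substantial residual error even with five is the same:

    import numpy as np

    rng = np.random.default_rng(0)
    n_teachers = 100_000
    true_effect = rng.normal(0, 1.0, n_teachers)  # true effects (invented scale)
    noise_sd = 1.4                                # single-year noise (invented)

    for years in (1, 3, 5):
        # Averaging k years of data shrinks the noise by sqrt(k)
        estimate = true_effect + rng.normal(0, noise_sd / np.sqrt(years), n_teachers)

        # Flag the top and bottom quintiles of estimates as
        # "highly effective" / "ineffective"
        lo, hi = np.quantile(estimate, [0.2, 0.8])
        flagged = (estimate < lo) | (estimate > hi)

        # Error: flagged teachers whose true effect is in the middle
        # three quintiles of the true distribution
        t_lo, t_hi = np.quantile(true_effect, [0.2, 0.8])
        truly_average = (true_effect >= t_lo) & (true_effect <= t_hi)
        error = (flagged & truly_average).sum() / flagged.sum()
        print(f"{years} year(s): {error:.0%} of flagged teachers are truly average")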
The error rates for school averages are only slightly lower than those for teachers. The report found that value-added measures of school performance are likely to yield error rates about 5 to 10 percentage points below those for teachers. These lower error rates are due to the larger sample of students in a school than in a single classroom.
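The reason is the familiar square-root law: random noise in a mean gain score falls with the square root of the number of students, and a school contributes far more students than a single classroom. A two-line illustration (the per-student noise figure is invented):

    import numpy as np

    noise_sd = 7.0  # per-student noise in gain scores (invented)
    for label, n in (("classroom", 25), ("school", 400)):
        print(f"{label} (n={n}): standard error of mean gain = {noise_sd / np.sqrt(n):.2f}")

That the improvement is only 5 to 10 percentage points rather than dramatic suggests that shocks common to a whole school, like the classroom-wide factors described below, do not average out as student numbers grow.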
The findings of the new study are supported by a large body of previous research. A number of studies have examined the extent to which differences in single-year performance estimates across teachers are due to long-run differences in performance rather than to transitory influences that induce random error, and thus imprecision, in the estimates.
The errors in test results stem from two broad sources. First, random differences occur across classrooms in unmeasured factors that affect test scores, such as student abilities, background factors and other student-level influences. Second, idiosyncratic unmeasured factors can affect all students in a particular classroom, such as a noisy heater, a barking dog on the test day or a particularly disruptive student in the class. As both sources of error are transitory, they are directly reflected in the amount of year-to-year volatility in a given teacher’s or school’s performance estimates.
The report cites existing research that has consistently found that teacher- and school-level averages of student test score gains can be quite unstable over time. Studies have found only moderate year-to-year correlations in the value-added estimates of individual teachers.
As a result, there are significant annual changes in teacher rankings based on value-added estimates. Studies from a wide set of school districts and states in the United States have found that one-half to two-thirds of teachers in the top quintile or quartile of performance from a particular year drop below that category in the subsequent year.
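That degree of churn is just what transitory noise predicts. In the same spirit as the earlier sketch, with an invented noise level chosen so that single-year estimates have only modest reliability:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    true = rng.normal(0, 1.0, n)          # persistent teacher effect (invented)
    year1 = true + rng.normal(0, 1.4, n)  # two single-year estimates with
    year2 = true + rng.normal(0, 1.4, n)  # independent transitory noise

    top1 = year1 >= np.quantile(year1, 0.8)  # top quintile in year 1
    top2 = year2 >= np.quantile(year2, 0.8)  # top quintile in year 2
    dropped = (top1 & ~top2).sum() / top1.sum()
    print(f"{dropped:.0%} of year-1 top-quintile teachers fall out of it in year 2")

With these invented numbers, well over half of the teachers in year one’s top quintile fall out of it in year two even though their true effectiveness has not changed at all – squarely within the range the studies report.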
Such error rates would play havoc with schemes that reward good teachers on value-added measures. Some successful teachers will receive low ratings and miss out on bonuses, while some unsuccessful teachers will receive high ratings and cash bonuses. A teacher can be rated highly effective one year and less effective the next simply because of statistical error. Teachers assigned to more difficult classrooms could be denied bonuses while teachers given easier classes are rewarded.
In effect, some teachers will be rewarded and others overlooked on the roll of the dice.
Who receives bonus payments will quickly come under scrutiny. The erroneous identification of some teachers as high-performing while others are overlooked will be apparent within schools and will be seen as unfair. Inevitably, such unfairness in ratings and rewards will undermine teacher morale, co-operation and collaboration in schools, and everyone will suffer the consequences.
The collateral damage could be even greater. Once bonus payments are made to teachers on the basis of gains in student achievement, the next step will be to use the same measures to determine tenure, and even dismissal. The error rates identified in the NCEE report mean that good teachers will inevitably be dismissed.
The teaching profession could become one in which teachers are fired for reasons that are, quite literally, random. Over time, students and teachers will suffer, as will the education system as a whole. After all, who would want to pursue a career in such a system?
“Value added” measurement is seen by many as a rigorous way to assess teacher and school performance. The new report demonstrates that identifying high-performing teachers and schools for bonus payments demands a level of precision beyond what value-added analysis can deliver. The error rates it documents make clear that value-added models are not ready to shoulder the identification burden that many people believe they can.
The report says that the error rates it examined are a key factor to be considered in designing and applying performance measures based on value-added models:
These results strongly support the notion that policymakers must carefully consider system error rates in designing and implementing teacher performance measurement systems based on value-added models, especially when using these estimates to make high-stakes decisions regarding teachers (such as tenure and firing decisions). [p.35]
It adds weight to a report by the US National Research Council, published last year, which cautioned against prematurely promoting the use of value-added approaches to reward or punish teachers. That report found that too little is known about the accuracy of these methods to base high-stakes decisions on them.