Sunday October 11, 2009
A testing expert has made some devastating criticisms of the reliability of school test results to be published later this year or early next year.
Professor Margaret Wu from the University of Melbourne says that linking school performance to student achievement on these tests is “pure conjecture”.
In a keynote paper delivered in Hong Kong in July, Professor Wu said that the NAPLAN tests have a high level of inaccuracy. She said that there are large measurement errors at the individual student and class levels.
She said that these errors meant that high stakes decisions such as judging school and teacher performance on student scores should not be made on the basis of these tests.
Professor Wu also said that the tests are not suitable for measuring achievement growth between two points in time for individual students or classes. She also made some technical criticisms which call into question the validity of the tests and the method used to place the scores of students in different year levels on the same scoring scale.
The errors are large for individual students and remain substantial at the class and school levels. Professor Wu found that measurement errors in annual 40-item tests, such as those being used in NAPLAN, would lead to about 16 per cent of students appearing to go backward when they had actually made a year’s progress. She said this is a conservative estimate as it does not take account of other sources of error, such as the assumption that the two tests are assessing the same content. The errors could well be larger.
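To see how a figure of this size can arise, here is a minimal sketch. It is not Professor Wu's own calculation. It assumes that each test score carries an independent, normally distributed measurement error with the same standard error, so that the error in a growth score (second test minus first) has a standard deviation of the square root of two times that of a single test. The function name and the example numbers are hypothetical and chosen only for illustration.

```python
import math


def share_appearing_to_regress(true_growth: float, sem: float) -> float:
    """Share of students whose observed growth is negative.

    Assumes (for illustration only) that every student truly grows by
    `true_growth` and that each of the two tests has an independent,
    normally distributed measurement error with standard deviation `sem`.
    """
    # The difference of two independent errors has SD sqrt(2) * sem.
    sd_of_difference = math.sqrt(2.0) * sem
    z = -true_growth / sd_of_difference
    # Standard normal cumulative probability at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical case: a year's true growth equal to sqrt(2) times the
# standard error of a single test, measured in the same score units.
print(round(share_appearing_to_regress(math.sqrt(2.0), 1.0), 3))  # 0.159
```

Under these assumptions, roughly 16 per cent of students would appear to go backward whenever a year's true growth is only about 1.4 times the measurement error of a single test. Smaller growth or larger error pushes the share higher.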
While the measurement error is smaller for classes and schools, it is still quite large. For example, Professor Wu found that the statistical uncertainty around the average results on these tests for classes of 30 students is equivalent to more than six months’ learning. Many schools around Australia have only this many students or fewer participating in the NAPLAN tests. For schools with two classes of 30 students tested, the error could amount to about four months of learning.
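The drop from about six months to about four months is what would be expected if the uncertainty of an average shrinks with the square root of the number of students tested. The sketch below illustrates that relationship only. It assumes that students' errors are independent, takes six months as a round starting figure, and uses a hypothetical function name.

```python
import math


def uncertainty_for_group(base_uncertainty: float, base_n: int, new_n: int) -> float:
    """Rescale the uncertainty of an average from base_n to new_n students.

    Assumes independent errors, so the uncertainty of the average is
    proportional to 1 / sqrt(number of students).
    """
    return base_uncertainty * math.sqrt(base_n / new_n)


# Illustrative: about six months of uncertainty for one class of 30,
# rescaled to a school with two classes (60 students tested).
print(round(uncertainty_for_group(6.0, 30, 60), 1))  # 4.2
```

Doubling the number of students tested cuts the uncertainty by only about 30 per cent, which is why it stays large even for schools with several classes.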
These results relate only to measurement error in the tests. There are also other sources of error, most notably sampling and equating errors, which add to the uncertainty and inaccuracy of the results.
Measurement error is inconsistency in test results. The same students may achieve different results on the same test on different days because of differences in their own well-being, such as lack of sleep or food, or because of external factors, such as how hot or cold the test room is. It also arises from differences in the items selected for testing and in the way answers are scored.
Sampling error arises from differences in the selection of students to participate in tests. A group of students selected for a test is likely to achieve different results from another group simply because of differences in their composition. The group selected for testing may not reflect the average level of ability of all students. The smaller the sample, the more likely there will be a significant difference between the average results of the sample tested and the results if all students were tested.
Sampling error occurs even when all students in a year cohort are tested. This is because inferences are made about school performance by testing selected cohorts, such as Years 3, 5, 7 and 9 in the national literacy and numeracy assessments. Each cohort of students tested is a sample of the students in the school for the purpose of measuring school performance.
Equating errors arise in comparing tests over time and in creating a common scale of scores for students across different Year levels. For example, building a common score scale across several year levels involves sophisticated statistical methodology to ensure that the results are reliable and valid. Different methodologies produce different results.
Professor Wu says that equating error is a major source of inaccuracy. This is because test items often work differently for different groups of students across states, there are curriculum differences across states and some content areas are not fully covered.
Professor Wu recently followed up her criticisms in a letter to The Age, saying that if student performance is not measured well by NAPLAN then the results cannot be used to assess school and teacher performance. She said that it could mean that schools and teachers are accused of not doing their job when they are.
Professor Wu says that the criticisms apply also to so-called ‘like school’ comparisons. The large error margins make these comparisons practically meaningless.
When schools are grouped into ‘like’ groups, we need even more precision in the measures to detect differences between schools. It will be easy to demonstrate the difference between a high-profile private school and a low socio-economic government school, but it will be more difficult to determine significant differences between two high-profile private schools.
These are devastating criticisms. Julia Gillard has given assurances that the new national school performance reporting system will provide accurate data on individual school performance. However, it appears that the national tests are not up to the mark.
The large statistical errors will wreak havoc when comparing school results.
It will not be possible to make reliable comparisons or rankings of schools because they may reflect chance differences in school performance rather than real differences. Such comparisons will mostly identify lucky and unlucky schools, not good and bad schools. It also means that current school performance is highly misleading as a guide to future school performance.
These statistical errors in school results also mean that school performance and school rankings are highly unstable from year to year. It is highly misleading to compare changes in school performance from one year to the next, especially in the case of smaller schools. Doing so leads to unwarranted conclusions about changes and often to unfair inferences about schools.
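The point about lucky and unlucky schools can be illustrated with a small simulation. This is a sketch under deliberately extreme assumptions and not a model of actual NAPLAN data. Every simulated school has exactly the same true performance, each tests 30 students, and student scores vary at random. All names, the seed and the numbers are hypothetical.

```python
import random


def simulate_school_means(n_schools, n_students, student_sd, rng):
    """Average score per school when every school's true mean is 0."""
    return [
        sum(rng.gauss(0.0, student_sd) for _ in range(n_students)) / n_students
        for _ in range(n_schools)
    ]


def ranks(values):
    """Rank 1 goes to the highest value (ties are not expected here)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, school in enumerate(order, start=1):
        result[school] = position
    return result


def rank_correlation(a, b):
    """Spearman rank correlation for two lists without ties."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


rng = random.Random(2009)  # fixed seed so the run is repeatable
year1 = simulate_school_means(50, 30, 1.0, rng)
year2 = simulate_school_means(50, 30, 1.0, rng)

print("Top school in year 1 is ranked", ranks(year2)[ranks(year1).index(1)], "in year 2")
print("Rank correlation between years:", round(rank_correlation(year1, year2), 2))
```

Because the simulated schools are identical by construction, any ranking the code produces reflects chance alone, and one year's ranking says almost nothing about the next. Real schools do differ, so actual rankings would be more stable than this, but the smaller the real differences relative to the error, the closer they come to this picture.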
Professor Wu’s criticisms show that Julia Gillard’s faith in the ability of NAPLAN to identify successful schools is misplaced. Rather than accurately measuring school performance as Gillard asserts, the new school performance reporting system is likely to mislead parents and policy makers.
Parents may be misled in choosing a school. Some schools may be recognised as outstanding while others are identified as unsuccessful simply as the result of chance and not because of actual programs and teaching practice.
The large error margins may also mislead policy makers because it will be difficult to identify effective school practices. They may mislead decision-makers and schools into recommending and adopting particular educational programs. Action taken to assist less successful schools may appear more effective than it is in practice.