We don’t grade lessons anymore, right?


That would be foolish, wouldn’t it? Back in 2013, Professor Rob Coe made a challenge to the teaching profession and OFSTED. He proved that judging lesson observations was not only ‘harder than we thought’, but that grading lesson observations and ‘seeing learning’ was a fool’s errand.

Collective scales fell from our eyes. I’d happily graded lessons before this furore, apparently confident in my judgements, only to feel abashed afterwards that I’d graded others and hoped to be graded myself in turn. Now, I could brush that reality under the carpet, but it also makes me ask questions about our continuing assumptions about judging teacher performance.

Now, in the mature glow of 2017, grading lessons, or seeking out progress in snapshots, is no longer the norm. And yet, in schools across the country, teachers are annually judged on their overall performance. Though we may have better, broader approaches to judge teacher performance than lesson observation snapshots, some of the issues with lesson observation gradings are imitated in our judging of overall teacher performance.

In recent research on classroom observations, ‘Classroom Composition and Measured Teacher Performance: What Do Teacher Observation Scores Really Measure?’, Steinberg and Garrett found that the composition of the student groups heavily influenced judgments:

“…we find that the incoming achievement of a teacher’s students significantly and substantively influences observation-based measures of teacher performance. Indeed, teachers working with higher achieving students tend to receive higher performance ratings, above and beyond that which might be attributable to aspects of teacher quality that are fixed over time.”

Their research goes on to reveal the paradox that the teachers judged to be more effective then get assigned the higher achieving students. It becomes a virtuous cycle for some teachers. Lesson observation scores rated approximately half of the teachers (48% in English; 54% in Maths) in the top two performance quintiles if assigned the highest performing students, whilst a lowly 37% of English and 18% of Maths teachers who were assigned the lowest performing students were highly rated in classroom observations.

The results of this evidence suggest that the best way to secure success in annual appraisals may well be to clamour to teach the ‘nice kids’ and the ‘top sets’. The rationale is obvious: ‘high performing’ students attend school, typically behave better, and get better test scores. In short, high performing students make teachers look better. The opposite could prove a significant issue: teachers who are ‘great with the tricky classes’ are exposing themselves to potentially career-damaging judgments.

Though we may brush off the notion that our performance management and appraisal systems are prone to such biases and limitations, we should consider the matter more closely. We should ask:

  • How far do the classes assigned to individual teachers bias any annual judgments of teacher effectiveness?
  • What role do ungraded lesson observations play in teacher appraisal and do implicit biases still play a role in annual judgments of teacher effectiveness? 
  • How valid and reliable are our judgments of teacher effectiveness? 

The fallibility of systems for judging teacher effectiveness is nothing new. In the US, back in 2009, large-scale evidence revealed what was described as “the widget effect”: the failure of school evaluation systems to distinguish between teachers. These systems showed the common tendency of “school districts to assume classroom effectiveness is the same from teacher to teacher”. Teachers were treated as ‘interchangeable parts’, or widgets.

Revisiting this research on the ‘widget effect’ in 2017, Kraft and Gilmour showed how school leaders cited not having time to do a good job of teacher judgements, so they avoided giving low ratings to their colleagues. Other reasons cited were the ‘potential’ of teachers to get better, which encouraged leaders to give them the ‘benefit of the doubt’, alongside the desire not to demotivate teachers. Perhaps obviously, another reason for avoiding an accurate scrutiny of underperformance was simply a “personal discomfort” on the part of school leaders. Of course, a familiar issue was also cited: we don’t have enough teachers to go around as it is, so we cannot be negatively judging too many of them.

Perhaps the best marker of teacher performance school leaders can rely upon is test scores, but, once more, we have an issue here too. New research from the US has shown that a teacher’s effect on test scores is a worse predictor of students’ longer-term outcomes than their effect on other factors, like absences and suspensions: ‘What Do Test Scores Miss? The Importance of Teacher Effects on Non-Test Score Outcomes’.

So we are left in a pickle. With weak proxies for teacher performance like lesson observations proving inadequate, we are left on shaky ground when it comes to fairly judging teacher effectiveness. The evidence from Kraft shows that we are getting better at focusing teacher observations on instruction, with a broader approach that judges teachers against a range of performance categories, but judging teachers is still unreliable.

Given how important it is to fairly and accurately judge teacher effectiveness, it is an issue we must face. At a time of a shrinking workforce, a deficit model is simply untenable. How can we then support teachers to get better and evaluate them fairly and accurately? It is a question we should all be asking.

We may pat ourselves on the back for no longer grading lessons in twenty-minute snapshots, but we are still grading teachers and making annual appraisal judgments with the same degree of confidence we had back in the day, when we were grading away twenty-minute gobbets in blissful ignorance.


Further Reading:

‘Building a More Complete Understanding of Teacher Evaluation Using Classroom Observations’, by Julie Cohen and Dan Goldhaber. This American research asks some crucial questions about lesson observations.