More than a quarter of teachers who were reviewed in a two-year pilot program of the state’s new evaluation system were rated only "partially effective" or worse in one part of the new system, according to a new report.
Early results of the controversial teacher evaluations will be discussed in Trenton today, when the Evaluation Pilot Advisory Committee’s report is presented at the monthly state Board of Education meeting.
The report chronicles the state’s testing of the new evaluation system, which is being used in every school district this year. The results of the pilot project did not count.
Dorothy Strickland, retired professor of Rutgers University and a member of the state board of education, said she is optimistic about the findings, which show the majority of teachers, three-quarters, were rated at the levels required to retain their tenure.
"The report, and the process, is largely positive and hopeful," Strickland, who also served on the advisory committee, said. "It’s a challenge for everyone, at a pedagogical level and at an emotional level." [emphasis mine]
Professor, I can't think of too many people who would say that it's "positive and hopeful" that one-quarter of New Jersey's teachers are lousy. But, given NJDOE's track record with data abuse, it's worth asking whether this report is, in fact, accurate: did one-quarter of the teachers evaluated actually get ratings at "partially effective" or below?
According to this article, the answer is "no"; you see, only one group of teachers in the pilot program sucked that badly:
The ratings categories are divided into four levels: ineffective, partially effective, effective and highly effective. The data show that 3 percent of the teachers who were evaluated in both years of the pilot program received a rating of ineffective, the lowest, while 25 percent were rated partially effective.
Oh - so it was only the group of teachers who were evaluated over two years that did so poorly. In fact, there was a group that was evaluated over only a single year; how did they do?
In the group that was evaluated for one year, 1 percent was deemed ineffective, 13 percent partially effective, 82 percent were effective and 4 percent were highly effective. Teachers who receive ineffective or partially effective ratings for two consecutive years are at risk of losing tenure.
So only 14 percent of the teachers were rated "partially effective" or lower after one year, but 28 percent were rated that low after two years. Let's stop for a second and think about the horrifying consequences of a system that unreliable: according to NJDOE's own report, the share of teachers rated as struggling doubled in one year. Does that make any sense whatsoever?
(And did you pick up the problem with the comparison? Hang on...)
Of course not; not even NJDOE would try to pretend that it did. But, true to form, they have a slick answer, primed and ready to go:
The difference between the two groups is thought to be the result of familiarity with the system, according to the report.
"With time, greater understanding of the observation framework and more practice, observers will increase their ability to identify nuances in teacher practice," the report states.
Let's translate: what NJDOE really means when they say "identify nuances" is "identify teachers whose tenure we want to revoke." But let's go right to the report itself and see how NJDOE came to its conclusion (pp. 28-29):
Differentiating Between Levels of Practice
As the pilot district experience shows, there will be a learning curve as educators employ AchieveNJ for the first time. However, pilot districts demonstrated that the process becomes easier with time and as districts begin to use the new evaluation tools to better support educator practice and student achievement. As discussed in section 3.3, districts were able to conduct many more observations as they became more familiar with the observation instrument. Additionally, their increased familiarity also helped administrators more effectively differentiate levels of teacher practice. This can be seen in Figure 3.5 [sic].
This chart compares the distribution of teacher practice ratings in SY12-13 between districts in their second pilot year to those in their first. The blue bars (the left bar in each rating group) show data from Cohort 2 districts – those who completed their first year of piloting during SY12-13. The red bars in the chart (the right bar in each rating group) represent data from Cohort 1 school districts – those districts who completed two years of piloting the new teacher practice model. The teacher practice ratings of Cohort 1 districts were more differentiated than those of Cohort 2. Eighty-six percent of Cohort 2 teachers were rated 3 or 4; less than 1 percent received a rating of 1. Cohort 1 districts had a broader distribution of teacher practice. Seventy-three percent of teachers were rated 3 or 4, and 3 percent were rated 1. In addition, a higher percentage of teachers were rated 4 than in Cohort 2. Since districts from the first cohort were in the second year of implementation, teachers and administrators had more experience with the observation tool, a greater understanding of what the competencies look like in practice, and more opportunities to calibrate across raters. [emphasis mine]
OK, wait...
If you had looked at the caption above the bar graph, you'd think that this chart was comparing Cohort 2 - the same cohort - in two different years. You'd be wrong, because the accompanying text makes it clear that two different groups of teachers are being compared. It's a typo - but it's a particularly bad typo.
Why does that matter? Well, suppose the teachers in Cohort 2 are "better" than the teachers in Cohort 1? We would, naturally, expect that the percentage of teachers who are "partially effective" or less would be lower in Cohort 2 - no matter the year in which they were measured. What we really want to know is how many teachers were rated at or below "partially effective" in the same cohort in different years. That should be easy to show, right? It must be in the report...
Try to find it. I dare you.
Unbelievably, NJDOE - on the basis of this utterly false comparison - then comes to the following conclusion:
These data indicate that with time, greater understanding of the observation framework, and more practice, observers will increase their ability to identify nuances in teacher practice, and as a result, differentiate ratings. This increased differentiation will allow districts to better identify teachers who need targeted support and at the same time, recognize those highly effective educators whose expertise can be shared to help all teachers improve their practice. [emphasis mine]
This is an outrageous conclusion to come to, even if you are looking at the same group of teachers over two different years. But NJDOE didn't even do that: unbelievable as it seems, they compared two different sets of teachers in two different years to jump to this wholly unwarranted conclusion.
Again, even if this were a fair comparison, the conclusion NJDOE comes to is completely unsupported: if the teacher evaluation tool is highly unreliable, or easily subject to manipulation, you could get the same results. But let's put this huge caveat aside and look at the two different groups of teachers. The only way that anyone in their right mind would make this comparison is if the two cohorts are similar. Are they? Look at page 47 [annotation mine]:
Look at Cohort 1: one district - Elizabeth - dominates over all others, comprising almost half of the teachers in the group. If Elizabeth had substantially different teacher evaluations than the other districts in its cohort, it would skew the entire comparison. Does it? If NJDOE knows, they sure aren't saying. At the very least, this report should have compared the same cohort in two different years; however - for reasons informed by mendacity, ignorance, or both - the report doesn't even attempt to make this simple comparison.
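To see how much one big district can move a cohort-wide number, here's a minimal sketch in Python. Every figure in it is made up for illustration: the district sizes, the counts of low-rated teachers, and the function name `pooled_low_share` are assumptions, not data from the EPAC report. The only point is the arithmetic of a pooled rate.

```python
def pooled_low_share(districts):
    """Return the cohort-wide share of low-rated teachers.

    districts: list of (teacher_count, low_rated_count) pairs,
    one pair per district. All values used below are hypothetical.
    """
    total_teachers = sum(teachers for teachers, _ in districts)
    total_low = sum(low for _, low in districts)
    return total_low / total_teachers


# Hypothetical cohort with one dominant district (half of all teachers)
# that rates 40% of its teachers "partially effective" or below,
# plus three smaller districts that each rate 14% that low.
cohort_1 = [(1000, 400), (300, 42), (300, 42), (400, 56)]

# Hypothetical cohort of four equal districts, each rating 14% that low.
cohort_2 = [(500, 70), (500, 70), (500, 70), (500, 70)]

# The same first cohort with the dominant district left out.
cohort_1_without_large = cohort_1[1:]

print(f"Cohort 1, all districts:      {pooled_low_share(cohort_1):.0%}")
print(f"Cohort 1, minus big district: {pooled_low_share(cohort_1_without_large):.0%}")
print(f"Cohort 2:                     {pooled_low_share(cohort_2):.0%}")
```

In this invented example the three smaller districts look exactly like Cohort 2, and the entire gap between the cohorts comes from the one large district. Whether anything like that happened in the actual pilot is precisely what the report doesn't tell us.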
The two cohorts of teachers NJDOE uses to draw its conclusions are quite clearly not equivalent, and any comparisons between the two groups are useless. But you know what? It really doesn't matter: even if the cohorts were equivalent, the conclusions NJDOE draws from comparing them are unsupported.
Can we talk? NJDOE's EPAC report is hack junk, straight up. If an undergraduate turned in work like this, he'd be skewered, and rightly so. The notion that you'd even try to compare two wildly different groups of teachers in two different years and draw any conclusion is bad enough. To say that the comparison is evidence that the teacher evaluation system gets better with age is not only wrong: it's embarrassingly inept.
AchieveNJ: Codename "Operation Hindenburg"