I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Sunday, October 21, 2012

Test Scores and the Newark Teachers Contract

It's a fact: test scores will be used to determine who gets merit pay bonuses in the proposed Newark teachers contract. From the new TEACHNJ Act, which sets out criteria for teacher evaluations:
Standardized assessments shall be used as a measure of student progress but shall not be the predominant factor in the overall evaluation of a teacher.
As I've said here many times, it doesn't matter if tests are the "predominant factor" or not when used to make a high-stakes decision like granting a bonus: some of the evaluation becomes all of the decision. If two teachers have the same observation scores, the one with a better test-based rating is going to get the money. Period.

Now, we all know this will change what is taught in the classroom: teachers teach the test when the test tests teachers. And don't let know-nothings like Michelle Rhee try to convince you that teachers won't drill-and-kill because the research shows that doesn't work: the truth is, it shows no such thing. We also know these tests are unvetted, often poorly constructed, often poorly graded, and quite often harmful to kids. In addition, recent cost-benefit analysis suggests the price for implementing these tests far exceeds any real or perceived benefit.

But let's put all that aside for the moment and ask the fundamental question: will test-based evaluations do a good job of identifying who is a "highly effective" teacher, and who is not? Luckily, Newark's teachers don't have to guess: there have been many experiments across the country with using student tests to evaluate teachers. So how have they worked out? (all emphases mine)

Geoffrey Robinson is a National Board certified teacher at Osceola High School in Pinellas County who says 60 percent of his upper-level calculus students last year tested so well they earned college credit.
But this week Robinson received his teacher evaluation, based on a controversial new formula being rolled out statewide.
He was shocked to see how poorly he scored in the "student achievement" portion: 10.63 out of 40. 
"It's hilarious," said Michael Rush, a math teacher at New Tampa's Wharton High School. "The first year my (student achievement) score was a 17-point-something and this year it was a 25-point-something."
That's a 50 percent increase, though his teaching methods are more or less the same. "You cannot deviate that much as a teacher," he said. "It's all voodoo math."
What aggravates teachers most is that 40 to 50 percent of their evaluation is based on "student achievement" — but it's not always their own students who are being measured.
For example, fourth- and fifth-grade teachers are rated partly on their students' FCAT scores. But the FCAT is not given until third grade. So if you teach a lower grade, then your "student achievement" score is based on the scores of older students at your school. Similarly, teachers of subjects that don't even appear on the state's standardized test are being evaluated, at least in part, on FCAT scores. Eventually, the newer end-of-course exams will be counted into the equation.
Kim Musselman teaches kindergarten at Clearwater's Eisenhower Elementary, where children's families tend to move a lot, she said. So she likely will be evaluated on the scores of older children whom she never taught.
When the evaluations are used to determine raises, "my pay is going to be based on kids that I've never had before," Musselman said.
The state of TN paid millions to the Milken family of Wall St fraud for their TAP/TEAM evaluation system that scores teachers on a 5-point rubric. The national trainers indoctrinated, I mean, trained all evaluators that a score of (3) is "rock solid". If there are "too many" high scores the trainer asserted that evaluators were gaming the evaluation tool.

TN Commissioner of Education, TfA grad Kevin Huffman bloviates to media outlets that previous teacher evaluations inflated the scores and that "too many teachers were overrated."

The Milkens have a fix for that problem. Since the scores of TEAM/TAP teachers follow a bell-shaped distribution (according to their non-peer-reviewed research), only 15% of teachers will achieve scores above (4) or significantly above (5) expectations and 85% will perform at or below expectations.
More Tennessee:
At the end of the school year, districts with observation scores that don’t closely align with value-added growth scores could penalize principals by taking 10 percent off their own evaluations, Barton said. For now, the state focus is that educators continue to get used to the new evaluations and that teachers get constructive feedback.
Even more Tennessee:
For 15 percent of their testing evaluation, teachers without scores are permitted to choose which subject test they want to be judged on. Few pick something related to their expertise; instead, they try to anticipate the subject that their school is likely to score well on in the state exams next spring.
Several teachers without scores at Oakland Middle School conferred. “The P. E. teacher got information that the writing score was the best to pick,” said Jeff Jennings, the art teacher. “He informed the home ec teacher, who passed it on to me, and I told the career development teacher.”
It’s a bit like Vegas, and if you pick the wrong academic subject, you lose and get a bad evaluation. While this may have nothing to do with academic performance, it does measure a teacher’s ability to play the odds. There’s also the question of how a principal can do a classroom observation of someone who doesn’t teach a classroom subject.
New York City:
But this new analysis shows that a teacher teaching the same grade, the same course, with the ‘same students’ does not get consistent results.  It is truly like weighing yourself on a scale, getting off the scale and then one second later getting on the same scale, and having your ‘weight’ change by twenty pounds.
More New York City:
In one extreme case, the formula assigned an eighth-grade math teacher at the prestigious Anderson School on the Upper West Side the lowest possible rating, a zero, even though her students posted test scores 1.22 standard deviations above the mean — normally good enough to rank in the 89th percentile. Her problem? The formula expected her high-achieving students to be 1.84 standard deviations higher than the average — roughly the 97th percentile.
Not all teachers who work with the city’s better-off students can be “above average,” said Sean Corcoran, an associate professor of educational economics at New York University who has written about the city’s scores, developed through a system called value-added modeling.
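The percentile figures in that Anderson School example can be checked with a few lines of arithmetic. This is only a sketch: it assumes student scores follow a standard normal (bell) curve, and the function name normal_percentile is mine, not anything from the city's formula.

```python
import math

def normal_percentile(z: float) -> float:
    """Percentile rank (0-100) of a z-score, assuming a standard normal curve."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Figures quoted above: how her students actually scored vs. what the formula expected.
actual = normal_percentile(1.22)
expected = normal_percentile(1.84)

print(f"actual:   {actual:.1f}th percentile")
print(f"expected: {expected:.1f}th percentile")
```

Under that assumption, the two z-scores round to the 89th and 97th percentiles, matching the article. In other words, students near the top of the scale still earned their teacher a zero because the formula predicted they would land even closer to the ceiling.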
Los Angeles:
Next, they developed an alternative, arguably stronger value-added model and compared the results to the L.A. Times model. In addition to the variables used in the Times’ approach, they controlled for (1) a longer history of a student’s test performance, (2) peer influence, and (3) school-level factors. If the L.A. Times model were perfectly accurate, there would be no difference in results between the two models. But this was not the case. For reading outcomes, the findings included the following: 
• More than half (53.6%) of the teachers had a different effectiveness rating under the alternative model.
• Among those who changed effectiveness ratings, some moved only moderately, but 8.1% of those teachers identified as effective under our alternative model are identified as ineffective in the L.A. Times model, and 12.6% of those identified as ineffective under the alternative model are identified as effective by the L.A. Times model. 

The math outcomes weren’t quite as troubling, but the findings included the following: 
• Only 60.8% of teachers would retain the same effectiveness rating under both models. 
• Among those who did change effectiveness ratings, some moved only moderately, but 1.4% of those teachers identified as effective under the alternative model are identified as ineffective in the L.A. Times model, and 2.7% would go from a rating of ineffective under the alternative model to effective under the L.A. Times model. 
[Adding] Louisiana:
An increasing number of educators say Louisiana’s evaluation system makes it more likely teachers at high-achieving public schools will get poor reviews, threatening their job security. 
“You are looking at trouble,” Norma Church, principal of top-rated Westdale Heights Academic Magnet School in Baton Rouge, said.
Superintendent of Education John White said the concerns are mostly misplaced, and teachers at the best schools are better positioned to get good reviews.

The issue surfaced earlier this month when a state lawmaker said South Highlands Elementary, Louisiana’s highest-rated elementary school, was seeing teachers rated as “ineffective” even though students had some of the state’s best test scores.

The problem is that when high-scoring students fall off a bit from the previous year’s performance teachers can be rated as ineffective.


Erin Darwin Pizarro, who taught at highly ranked Caddo Middle Magnet School in Shreveport last school year, said that while almost all of her gifted students scored in the highest levels on their sixth-grade iLEAP exam, one dropped a notch and some fell slightly.

The result, Pizarro said, was that she received a 40 percent rating on a 100 percent scale.

She said that while she loves teaching, evaluation worries have “made me consider quitting forever because of an assessment which could eventually label me ineffective and cost me my job.”

To me, the evidence is quite clear: test-based teacher evaluations are not accurate enough to make high-stakes decisions. There really is no debating this.

So Newark's teachers have a choice: they can accept a boatload of private money they wouldn't otherwise get, distributed based on teacher "effectiveness."

But - unless the negotiators of the contract can present evidence otherwise - they can be assured that the money will not be distributed with a reasonable level of accuracy if standardized test-based evaluation is involved.

Again: the argument that the tests only represent "some" of the evaluation is not compelling. The tests become what Bruce Baker calls the "tipping point," because it is the part of the evaluation that varies the most. Say it with me: some of the evaluation; all of the decision.
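Here is a toy illustration of the "tipping point" idea. Everything in it is hypothetical: the six teachers, their scores, and the 60/40 weighting are numbers I made up to show the mechanism, not figures from the Newark contract or from Baker's work. The only assumption that matters is the one stated above: observation scores cluster tightly while test-based ratings vary a lot.

```python
# Hypothetical scores on a 1-5 scale for six made-up teachers.
observation = [3.0, 3.0, 3.1, 3.0, 2.9, 3.0]   # clustered tightly, as observations often are
test_rating = [1.0, 4.5, 2.0, 3.5, 5.0, 2.5]   # spread widely

# Assumed weights: tests are NOT the "predominant factor" here.
W_OBS, W_TEST = 0.6, 0.4

composite = [W_OBS * o + W_TEST * t for o, t in zip(observation, test_rating)]

def ranking(scores):
    """Teacher indices ordered from highest score to lowest."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

rank_by_composite = ranking(composite)
rank_by_test = ranking(test_rating)

print("ranked by composite: ", rank_by_composite)
print("ranked by test alone:", rank_by_test)
print("identical order:", rank_by_composite == rank_by_test)
```

With these invented numbers, the order of teachers by composite score is exactly the order by test rating alone, even though tests carry only 40 percent of the weight. The component that varies the most ends up deciding who gets the bonus.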

Speaking of Baker: we haven't even addressed the subject of New Jersey's use of tests through Student Growth Percentiles (SGPs). As bad as my above examples were, at least they attempted to tease out teacher effect; we're not even going to try to do that in Jersey. More to come...

As a parent and a teacher, I feel strongly that teachers need guidance. We need to always be striving. We need to be energized by thoughtful analysis of our successes and failures. But we need to devise a way of measuring teacher-effectiveness that provides teachers with meaningful data — and yearly progress on state exams just isn’t it. Teachers’ unions are right to resist any increase in the weight of this unreliable measure. I’m not sure what the solution is, but if twin studies are the gold standard for sciences like behavioral genetics, perhaps they could help in education reform as well. This is our family’s small contribution.


He currently attends an elementary school in a very rural county that prides itself on its school rating. When he entered first grade, they were an “A School.” Honestly, rather than seeing this as a plus, it made me uneasy.
At the end-of-the-year student awards ceremony, the only subjects that the principal mentioned were math and reading. He announced that the school had the highest FCAT score for 3rd graders in our state. I noticed, however, that very few students made the all-subject honor roll (all As and Bs). After speaking with teachers, it became apparent that all emphasis was on math and reading and, as a result, enthusiasm and achievement in other subjects suffered. Luckily, his own teacher rebelled and actually read the class Isaac Asimov. My grandson now loves science fiction and believes science is important.
This school philosophy, however, is seriously hurting him now. His current teacher has announced that his entire class will not have recess until their AR (Accelerated Reading) scores improve. It turns out that their school declined to a B school, and current scores indicate that they are not improving. Mind you, they have only been in school for 6 weeks.
Imagine how an active, imaginative, and very verbal 7 year old boy will function during a school day that does not include an outlet for him to express himself or learn to socialize with others in an unstructured environment.
Campbell's Law in action. If the test scores determine a teacher's pay, the teacher will do whatever it takes to raise them.

Everyone OK with this?


jcg said...

Update on TN and TfA cultist Huffman's response to educators' complaints about the TEAM eval.

Recall, the legislature was flooded with complaints from educators that TEAM was a simple checklist of behaviors that could not capture the complexity of teaching and the 1-5 scoring metric of said checklist resulted in a forced bell curve. Michael Winerip wrote a NYTimes article featuring the TN fiasco and the problems with the TEAM metric. How did Huffman respond?

This year the TN state dept retrained evaluators with these words of wisdom:
1. The TEAM evaluation is NOT a checklist.
2. The TEAM metric is not a forced bell curve.
3. Don't use "rock solid" for a score of '3'.
4. Recommended evaluators read a book entitled "How to Have the Hard Conversations" (a book written by a business person to help employers give poor evals or fire employees)

ZERO evaluation of TEAM metrics and scoring criteria. ZERO credence given to the merits of arguments against its misuse.

How does one counter such stubborn ignorance when facts and evidence don’t matter?

Duke said...

Thx, jcg. It starts with you folks in TN, and spreads outward.