Jersey Jazzman: Fact: Some Newark Merit Pay Will Go To the Wrong Teachers

The teachers of Newark have a decision to make: they can accept a large sum of private money and settle their contract... but only if they adhere to a merit pay program.

I am not about to tell anyone what to do, but I do feel compelled to lay out the facts. And one fact is for certain:

If the merit pay bonuses are based on the state-mandated four-tier evaluation system, at least some of the merit pay will go to the wrong teachers.

How can I be so sure of this? Because the state law requires "student achievement" - in other words, standardized test scores - to be used in teacher evaluations. And in any system where student scores are used to determine teacher evaluations, there will be errors.

But don't take my word for it; talk to the people who actually advocate for using Value Added Modeling (VAM) in the use of teacher evaluations. Every one of them admits that there will be errors; they understand it would be ridiculous to imply that there won't be any. For example, here's Dan Goldhaber, an economist and big VAM cheerleader:

We do believe it makes sense for policymakers to be transparent with stakeholders about how teacher rankings can be affected by model choice. This is important, for our findings show that models that, broadly speaking, agree with one another (in terms of high correlations) can still generate, arguably, meaningful differences in teacher rankings that correlate with the type of students they are serving. Moreover, some models that could be seen by researchers as providing more accurate estimates of true teacher performance (i.e. school and student fixed effects specifications) generate effectiveness estimates that are far less stable over time than less saturated specifications. And, issues like intertemporal stability, or the transparency of a measure, may also influence teachers’ perceptions of the measure. [emphasis mine]

In other words: this stuff is subject to noise and bias - there will be errors. How about Raj Chetty and John Friedman, authors of the (in)famous study about the economic value of teacher effects?

More generally, we should recognize that value-added data can be a useful statistic even though it’s not perfect, just like performance measures in other occupations. The manager of a baseball team pays attention to a player’s batting average even though it too is an imperfect statistic that bounces around over time. If a new player gets no hits in his first month, one option is waiting to see whether he is just in a temporary slump. Another is more coaching. But on occasion, the best option may be to let that player go and call up a replacement. [emphasis mine]

That's what I love about these guys: they soar over schools at 30,000 feet, conjecturing about "options," seemingly oblivious to the idea that it may only take a few isolated cases of getting a high-stakes decision wrong before teachers turn against the system.

(The baseball analogy is also not great. It's not my batting average; it's my students'. And there's no error involved in measuring batting: either you hit the ball or you didn't. Standardized tests are poor proxies for judging the learning we should care about the most.)

Now, in New Jersey, the DOE has pretty much stated that we're going to be using Student Growth Percentiles (SGPs) to come up with teacher ratings based on test scores. How does the guy who developed SGPs, Damian Betebenner, feel about using them to identify teachers in a high-stakes decision like merit pay?

Professor Baker emphasizes this with SGPs:

Again, the whole point here is that it would be a leap, a massive freakin’ unwarranted leap to assume a causal relationship between SGP and school quality, if not building the SGP into a model that more precisely attempts to distill that causal relationship (if any). [Emphasis in original]

We would add that it is a similar “massive … leap” to assume a causal relationship between any VAM quantity and a causal effect for a teacher or school, not just SGPs. We concur with Rubin et al (2004) who assert that quantities derived from these models are descriptive, not causal, measures. However, just because measures are descriptive does NOT imply that the quantities cannot and should not be used aspart of a larger investigation of root causes. [emphasis mine]

By "not causal," Betebenner means that SGPs don't even try to tease out the cause of a student's test score. Well, if they don't, how can they possibly be used to determine merit pay, which will require very fine judgments at the cut level between those who get it and those who don't?

Betebenner's comment here is in response to Bruce Baker (here's Baker's reply to the reply). Baker has said that using SGPs to determine a teachers effect on test scores is like trying to fly a plane with a saxophone: it's the wrong "instrument" for the job.

I'll say this again: these are smart guys, way above my pay grade. But that may be the problem: they are so wrapped up in the technical side of what they're doing, they can't see the larger picture. As Matt DiCarlo has pointed out, saying you want to improve teacher quality is not an actual policy; you have to actually put a real plan on the table and evaluate its total effect.

So, yes, VAMs and SGPs may tell us something about teacher quality, but tying them to merit pay is a whole different ballgame. We really have no idea how many "wrong" merit pay decisions it will take before teachers change their behaviors - for better or worse. We do know, however, that merit pay has never worked every time it's been tried. And we know that there will most certainly be inaccurate teacher evaluations in Newark over the next three years. So, if merit pay is enacted, there is good reason to believe it won't help test scores, and there may be negative consequences arising from the "inaccurate" distribution of the bonuses.

Newark teachers, you have to ask yourselves: are you willing to take this risk? Are you willing to be subjects in an experiment that doesn't have much evidence to support its underlying theory? Is the money worth it?

I can't answer that for you. I know many of you are frustrated by working without a contract for so long. I know you need those step increases. I know many of you think believe you're not going to get a better deal than this one, and you may actually wind up doing worse if you fight it.

Those are all legitimate considerations. But understand the trade-off. Understand what you will signing on to if you agree to this package. And understand that the rest of us may see things differently - especially since we won't have private money attached to our deals.

ADDING: Let's please put aside this "multiple measures" argument once and for all, OK? The one where testing cheerleaders say: "Well, test scores are only part of the evaluation."

Say it with me: "Some of the evaluation; all of the decision."

ADDING MORE: Bill Ferriter has three things you must know about merit pay before you approve this deal.

EVEN MORE: The Carnegie Foundation has some very good briefs on VAM; I'd encourage you to read them (h/t Matt DiCarlo). From the first one, by Stephen W. Raudenbush:

Imprecision is present in any personnel evaluation system. Standard statistical procedures help us cope with it. One rule is that we should examine confidence intervals rather than point estimates of teacher-specific value-added scores. On purely statistical grounds, it’s hard to justify publishing point estimates of teachers’ value added in local newspapers, especially without including any information regarding the margins of error. Even if the estimates are free of bias, the current level of imprecision renders such point estimates, and distinctions among them, misleading. A second rule is that administrators who want to use value-added statistics to make decisions about teachers must anticipate errors: falsely identifying teachers as being below a threshold poses risk to teachers, but failing to identify teachers who are truly ineffective poses risks to students. Knowing the reliability of the value-added statistics and, in particular, estimating their correlation with future gains in student learning, allows us to quantify the likelihood of making those errors under varied scenarios. These well-known techniques work when we assume that the measures of teachers’ value added are unbiased, and managing the problem of bias is important. [emphasis mine]

Make no mistake: there will be errors.

4 comments:

YastreblyanskyOctober 23, 2012 at 4:24:00 PM PDT
You do such great work, but I have a quibble here, with

By "not causal," Betebenner means that SGPs don't even try to tease out the cause of a student's test score. Well, if they don't, how can they possibly be used to determine merit pay, which will require very fine judgments at the cut level between those who get it and those who don't?

I think it's that raising the SGP does not cause a rise in school quality, even if they often go together. It's important because it's a really huge insight into how it works, and why reformies (I love that word) feel so confident that they are right although they are wrong. Namely:

There is some factor or set of factors X that is hard to quantify or even define that makes a school good. If you add more X, the school will get better (as measured let's say by student happiness and by subject mastery), and the test scores will likely go up too (and the SGPs with them), but it's not certain.

There are also other approaches that can raise test scores and SGPs, and they are even less certain, but they are easily quantifiable. These are what reformies like. But Betebenner says they will not make the school a better place, and he's right.

You can make a game out of test taking, for instance, the way they do at Kaplan, and students will perform better without knowing the material.
Or you can drill them on every possible question, and they will perform better except when they can't extrapolate their knowledge to an unfamiliar question. Well, you know all that.
YastreblyanskyOctober 23, 2012 at 4:41:00 PM PDT
Not that they believe consciously that the number is the cause of the quality; but that they model it were, because they think it looks scientific.
YastreblyanskyOctober 23, 2012 at 4:43:00 PM PDT
...model it as if it were...
DukeOctober 23, 2012 at 6:10:00 PM PDT
Yast, you're making a point that can't be reiterated enough: this is all contingent on the tests being valid and reliable measures of student learning.

They aren't. That needs to be said over and over again.

Thanks for commenting.

Sorry, spammers have forced me to turn on comment moderation. I'll publish your comment as soon as I can. Thanks for leaving your thoughts.

Jersey Jazzman

Pages

Tuesday, October 23, 2012

Fact: Some Newark Merit Pay Will Go To the Wrong Teachers

4 comments: