I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Saturday, April 29, 2017

Desperately Searching For the Merit Pay Fairy

It's been a while since we've talked about the Merit Pay Fairy.

Yo, it's me -- da Merit Pay Fairy, makin' all your reformy dreams come true!

The Merit Pay Fairy lives in the dreams and desires of a great many reform-types, who desperately want to believe that "performance incentives" for teachers will somehow magically improve efforts and, consequently, results in America's classrooms. Because, as we all know, too many teachers are just phoning it in -- which explains why a system of schooling that ranks and orders students continually fails to make all kids perform above average...

One of the arguments you'll hear from believers in the Merit Pay Fairy is that teaching needs to be made more like other jobs in the "real world." But pay tied directly to performance measures is actually quite rare in the private sector (p. 6). It's even more rare in professions where you are judged by the performance of others -- in this case, students, whose test scores vary widely based on factors having nothing to do with their teachers.

But that doesn't matter if you believe in the Merit Pay Fairy; all that counts is that some quick, cheap fix be brought in to show that we're doing all we can to improve public education without actually spending more money. And, yes, merit pay as conceived by many (if not most) in the "reform" world, is cheap -- because it involves not raising the overall compensation of the teaching corps, but taking money away from some teachers and giving it to others, using a noisy evaluation system incapable of making fine distinctions in teacher effectiveness.

Which brings us to the latest merit pay study, which has been getting a lot of press:
Student test scores have a modest but statistically significant improvement when an incentive pay plan is in place for their teachers, say researchers who analyzed findings from 44 primary studies between 1997 and 2016.
“Approximately 74 percent of the effect sizes recorded in our review were positive. The influence was relatively similar across the two subject areas, mathematics and English language arts,” said Matthew Springer, assistant professor of public policy and education at Vanderbilt’s Peabody College of Education and Human Development.
The academic increase is roughly equivalent to adding three weeks of learning to the school year, based on studies conducted in U.S. schools, and four weeks based on studies across the globe.
Let's start with the last paragraph first: the notion that you can translate this study's effects into "weeks of learning" is completely without... well, merit. Like so much other research in this field, the authors make the translation based on a paper by Hill et al. (2008). I'll save getting into the weeds for later (and in a more formal setting than this blog), but for now:

Hill et al. make their translation of effect sizes into time periods based on what are called vertically-scaled tests. These are tests that let at least some students attempt to answer at least some common items between consecutive grade levels, allowing for a limited comparison between grades (see p.17 here).

There is no indication, however, that any of the tests used in any of the 44 studies are vertically scaled -- which makes a conversion into "x weeks of learning" an unvalidated use of test scores. In other words: the authors in no way show that their study can use the methods of Hill et al., because the tests are likely scaled differently.

Furthermore: do we have any idea if the tests used in international contexts are at all educationally equivalent to the tests here in the US? For that matter, what are the contexts for the teaching profession, and how it might be affected by merit pay, in other countries? So far as I'm concerned, the effect size we care about is the one found in studies conducted in this country.

That US effect size is reported in Table 3 (p. 44) as 0.035 standard deviations. How can we interpret this? Plugging into a standard deviation-to-percentiles calculator (here's one), we find this effect moves students at the 50th percentile to 51.4.* It's a very tough haul to argue that this is an educationally meaningful effect.
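If you'd rather check that arithmetic yourself than trust an online calculator, here is a minimal sketch in Python. It assumes test scores are normally distributed, which is the same assumption those calculators make; the function name `percentile_after_shift` is mine and purely illustrative, not something from the study.

```python
from statistics import NormalDist


def percentile_after_shift(effect_size_sd, start_percentile=50.0):
    """Percentile a student lands at after a shift of effect_size_sd.

    Assumption: scores follow a normal distribution, so percentiles
    and z-scores map to each other through the standard normal CDF.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # z-score of the starting percentile
    start_z = std_normal.inv_cdf(start_percentile / 100)
    # move up by the effect size, then convert back to a percentile
    return 100 * std_normal.cdf(start_z + effect_size_sd)


us_effect = 0.035  # US effect size reported in Table 3 of the study
print(round(percentile_after_shift(us_effect), 1))  # prints 51.4
```

The same function works for other starting points. Because the normal curve is densest at its center, a shift of 0.035 standard deviations moves a student by fewer percentile points the farther from the 50th percentile that student starts.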

Which brings us to the next limitation of this meta-analysis: the treatment is not well defined. To their credit, the authors attempt to divide up the different studies by their characteristics, but they only do so in the international aggregate. In other words: they report the differences between a merit pay plan that uses group incentives versus a "rank order tournament" (p. 45, Table 4), but they don't divide these studies up between the US and the rest of the world.

Interestingly, group incentives have a greater effect than individual competitions. But there is obviously huge variation within this category in how a merit pay plan will be implemented. For example: where did the funds for merit pay come from? 

In Newark, merit pay was implemented using funds dedicated by Mark Zuckerberg. Teachers were promised that up to $20 million would be available; of course, it turned out to be far less (and it's worth noting that there's scant evidence Newark's outcomes have improved). Would this program have different effects if the money had not come from an outside source?** What if the money came, instead, from other teachers' salaries (which may, in fact, be the case in Newark)?

Any large-scale merit pay plan will be subject to all sorts of variations that may (or may not) impact how teachers do their jobs. Look at the descriptions in Table 6 (p. 47), which recounts how various merit pay plans affect teacher recruitment and retention, to see just how diverse these schemes are.

I think it's safe to say that "merit pay" in the current conversation is not really about giving bonuses for working in hard-to-staff assignments, or for taking on extra responsibilities, or even for working in a group that meets a particular goal. I'm not suggesting we shouldn't be looking at the effects of programs like this, but I don't think it's helpful to put them into the same category as "merit pay."

I think, instead, that "merit pay" is commonly understood as being a system of compensation that differs from how we currently pay teachers: one where pay raises are based on individual performance instead of experience or credentials. The Chalkbeat article certainly implies this by making this comparison:
Teacher pay is significant because salaries account for nearly 60 percent of school expenses nationwide, and research is clear that teachers matter more to student achievement than any other aspect of schooling (although out-of-school factors matter more). About 95 percent of public school districts set teacher pay based on years of experience and highest degree earned, but merit pay advocates argue that the approach needs to change. [emphasis mine]
Take a look at a sample of articles on teacher merit pay -- here, here, here, here, and here for example -- and you'll see merit pay contrasted with step guides that increase pay for more years of experience or higher degrees. You'll also notice none of the proponents of merit pay are suggesting that the overall amount spent on our teaching corps should increase.

I can understand the point of writers like Matt Barnum who argue that merit pay can come in all sorts of flavors. But I contend we're not talking about things like hard-to-staff bonuses or group incentives: When America debates merit pay, it's really discussing whether we should take pay from some teachers and give it to others.

Unfortunately, by analyzing all of these different types of studies together, the Vanderbilt meta-analysis isn't answering the central question: should we ditch step guides and move to a performance-based system? That said, the study may still be giving us a clue: the payoff will likely be, at best, a meager increase in test scores.

Of course, we have to weigh that against the cost -- or, more precisely, the risk. Radically changing how teachers are paid would create huge upheavals throughout the profession. Would teachers who were in their current assignments stay on their guides, or would they potentially take huge hits in pay? If they were grandfathered out of a merit pay scheme, how would they work with new teachers who were being compensated differently?

Would merit pay be doled out on the basis of test scores? How much would VAMs or SGPs be weighted? How would teachers of non-tested subjects be eligible? Would the recipients of merit pay be publicly announced? In New Jersey and many other states, teacher salaries are public information. Would that continue? And how, then, would students be assigned to the teachers who receive merit pay? Will parents get to appeal if their child is assigned to a "merit-less" teacher?

The chaos that would result from implementing an actual merit pay plan is a very high cost for a potential 0.035 standard deviation improvement in test scores.

I know believers in the Merit Pay Fairy would like to think otherwise, but clapping harder just isn't going to make these very real issues go away.

Don't listen to dat Jazzman guy! Just clap harder, ya bums!

ADDING: More from Peter Greene:
Researchers' fondness for describing learning in units of years, weeks, or days is great example of how far removed this stuff is from the actual experience of actual live humans in actual classrooms, where learning is not a featureless tofu-like slab from which we slice an equal, qualitatively-identical serving every day. In short, measuring "learning" in days, weeks, or months is absurd. As absurd as applying the same measure to researchers and claiming, for instance, that I can see that Springer's paper represents three more weeks of research than less-accomplished research papers.

* Some folks don't much care for making this kind of conversion. In my view, it's much more defensible than converting to "x weeks of learning," which, even setting aside the problems of converting from vertically scaled tests, suffers from unjustified precision. In addition, the implications behind the translation are subject to wild misinterpretation.

Converting to percentiles might be a bit problematic. But it's not nearly as bad as using "x weeks of learning."

** We'll never know because no one has bothered to find out if the Newark merit pay program actually worked. Think about it: $100 million in Facebook money, and no one ever considered that maybe reserving a few thousand for a program evaluation was a good idea.

If I were cynical, I might even think folks didn't want to study the results, because they were afraid of what they might find. Good thing I'm not cynical...
