I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Friday, June 7, 2013

Jonah Rockoff Testifies at NJBOE: Part I

I've been given a remarkable piece of audio that I think has to be shared widely. Jonah Rockoff, a professor of finance and economics, testified at the New Jersey State Board of Education this past Wednesday on the topic of teacher evaluation.

There is no arguing with Dr. Rockoff's academic qualifications or intellect; still, it speaks volumes about where we are in our education debate that an economics professor is brought before the NJBOE as an expert in education policy. As Bruce Baker says:
Behavioral economics is an interesting and potentially useful field of academic inquiry. At its best, real behavioral economics attempts to address some of the concerns I raise here. But many if not most assumptions about human behavior and response to incentives are not representative of behavioral economics at its best. 
Specifically, I’m increasingly concerned with what I see as the simple-minded projection of economic thinking onto everyone and anyone else, leading to ridiculous policy recommendations – that amazingly – get taken seriously – at least by the media and punditocracy. [emphasis mine]
Each time I listen back to Dr. Rockoff's presentation, I find myself thinking about Dr. Baker's words. And that's why I'm going to take some time over the next several days to really pick apart Rockoff's arguments from a practical perspective.

It's all well and good to run statistical models and extrapolate theories based on tangentially related research. It's quite another to take these musings and apply them to everyday, real-life schools. I sometimes wonder if folks like Rockoff - brilliant as they are - are flying at 40,000 feet over those of us down in the trenches, dropping bombs and worrying little about the collateral damage.

But judge for yourself:

Before we dive into the details, let me make three points about a metaphor Rockoff uses several times in his speech:

Rockoff apparently likes baseball analogies. He spends a lot of time in the presentation talking about batting averages. Of course, he acknowledges, batting average is only one component of determining whether a player is valuable. The problem is that he then analogizes this to the "multiple measures" practice of teacher evaluation: combining test-based measures with observations and other qualitative measures.

Here's the first problem with Rockoff's analogy: in baseball, many quantitative measures besides BA are used (in addition to qualitative ones) to determine a player's "value." Check out this post about players who hit under .300 and yet were still great offensive forces for their teams. You couldn't quantitatively determine their worth without considering so many other stats: Plate Appearances, On Base Percentage, Runs, Hits, etc.

My wife's family is a bunch of baseball nerds, and they'll pore over this stuff for hours. It would be nice to think that we could do this for teachers, but, of course, we can't: it would be wildly expensive to create and collect so much data, it would take too much time from instruction, and we would be measuring the teacher on the performance of her students when only 10 to 15 percent of student outcomes are based on teacher inputs.

Further, the "plate appearances" of a teacher in a test-based evaluation model come down to the number of students they teach: each student is a "batter" who gets one turn at the plate on the yearly NJASK. Would anyone try to guess whether Derek Jeter is contributing to the Yankees' offense on the basis of 25 or 30 plate appearances a season?

Baseball has an easy way to incorporate a lot of statistical information into each player's evaluation; teaching does not. On that basis alone, Rockoff's metaphor is clearly wanting.
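To put a rough number on the Jeter question, here is a back-of-the-envelope sketch. It assumes every turn at the plate is an independent trial with the same chance of a hit, and it uses the normal approximation for a proportion. That is a simplification of baseball, and it is certainly not how mSGP is calculated; the function name and the .300 "true" average are just illustrative choices of mine.

```python
import math

def batting_average_margin(true_avg: float, trials: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed average.

    Assumes each trial is an independent hit-or-miss event with a fixed
    probability of success (a simplification made for illustration).
    """
    standard_error = math.sqrt(true_avg * (1 - true_avg) / trials)
    return z * standard_error

# Compare a classroom-sized sample with a full season's worth of turns.
for trials in (30, 600):
    margin = batting_average_margin(0.300, trials)
    print(f"{trials:>3} trials: .300 +/- {margin:.3f}")
```

Under those assumptions, a true .300 hitter observed over 30 turns could plausibly show up anywhere from the .130s to the .460s, while over 600 turns the range tightens to within about .037 either way. The smaller the sample, the less the number tells you.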

Second problem: Rockoff talks about a .300 batting average as the mark of a great hitter. This is a cut score: an arbitrary level that a player must cross to be categorized as "great." Would anyone really argue that a .299 hitter is practically much worse than a .300 hitter? Would we say that Mickey Mantle or Barry Bonds (putting the you-know-what problems aside) weren't great hitters?

The NJDOE is proposing to take mSGP scores for teachers and convert them into ranked categories - what statisticians call ordinal measures*. This necessitates the use of cut scores; it's equivalent to saying: "You're only getting into the Hall of Fame if you hit .300." In this situation, little differences count for a lot - especially when the decision is being forced onto an administrator (more on this later).

This is a fundamental problem when you combine a highly variable measure like mSGP with a four-rank measure like observation scores: some of the evaluation, all of the decision. It would be like determining a player's worth by combining all of the other aspects of his game - power hitting, running, fielding, clubhouse leadership - into one of four categories, and then combining that with his batting average. The BA becomes the deciding factor, just as the mSGP becomes the deciding factor for the teacher.
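Here is a minimal sketch of what a cut score does to nearby scores. The cut points below are placeholders I made up for illustration; they are not the NJDOE's actual values, and the four-rank scale is only a stand-in for whatever categories get adopted.

```python
from bisect import bisect_right

# Hypothetical cut points on a 1-99 percentile scale. These are placeholders
# for illustration only, not the NJDOE's actual values.
CUT_POINTS = [35, 50, 65]

def to_rank(score: float) -> int:
    """Convert a continuous score into an ordinal rank from 1 (lowest) to 4 (highest)."""
    return bisect_right(CUT_POINTS, score) + 1

for score in (49, 50, 64):
    print(f"score {score} -> rank {to_rank(score)}")
```

With these made-up cut points, a 49 and a 50 land in different ranks even though they are one point apart, while a 50 and a 64 land in the same rank despite a fourteen-point gap. That is the .299 versus .300 problem in miniature.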

Third problem: this post about the magical .300 BA is fascinating:
In a recent New York Times article Alan Schwarz talks about research by several University of Pennsylvania professors about the psychological importance of hitting .300.
This phenomenon can be demonstrated by looking at the season batting averages of all baseball hitters in the past 50 seasons. I collected the batting averages for all regular players since 1960 who had at least 300 opportunities to bat. For each player, I rounded the batting average to three decimal spaces as commonly done in professional baseball. The figure displays a line graph of the frequencies of the batting averages between .290 and .310.
Popularity of Different MLB Batting Averages, by Jim Albert
This graph shows the importance of the .300 mark in baseball. Batting averages slightly smaller than .300 are unlikely but a batting average of .300 is very likely. Athletes might like to say that the performance statistics are not important, but there are patterns in this data that suggest otherwise.
So, why would we have this big jump between .299 and .300? To me, it's as good a demonstration of Campbell's Law as you could imagine:

Everyone interested in understanding how the ceaseless pressure to raise test scores can corrupt the tests should be familiar with Campbell’s Law.

This is an adage written by social scientist Donald T. Campbell in a 1976 paper. It says:
“The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.” (You can google the paper, or find it linked on Wikipedia: Campbell, Donald T., Assessing the Impact of Planned Social Change. The Public Affairs Center, Dartmouth College, Hanover, New Hampshire, USA. December 1976.)
Campbell’s Law explains why high-stakes testing promotes cheating, narrowing the curriculum, teaching to the test, and other negative behaviors. [emphasis mine]
It's quite conceivable that when a hitter is hovering around the .300 level, he starts changing his behavior. Maybe he asks the manager to let him get more chances at the plate when he's a little below; maybe he asks for fewer chances when he's just above. The arbitrary value assigned to getting above .300 creates a corrupting pressure (in the case of Barry Bonds and Michelle Rhee, perhaps an extreme corrupting pressure).

Why wouldn't we believe the same thing about using test scores in teacher evaluations? Why wouldn't we think the use of what will inevitably be an arbitrary cut score will create behaviors designed to game the system? And is that really in the best interests of children?

More to come...

NJDOE: They may be lost but they're making great time!

ADDING: Rockoff says that if you are in the bottom quartile of hitters in the majors, you don't get to bat. But there are terrible batters in the National League who bat regularly throughout the season: they're called pitchers.

* Actually, the SGP is an ordinal measure as well. More to come.


alm said...

What do you think the long run consequences of the 'teachers are only 10%' argument are? If student achievement is just a function of income, what is the justification for equalization aid & supplementary funding? What's the compelling case for going above and beyond on the funding front if schools can't be expected to make an impact, anyways?

My take, FWIW, the "teachers can't impact student achievement, anyway" argument is both incorrect and tactically/politically unwise in tight state budget environments.

Duke said...

That's a legitimate question, but one I want to answer in a full post. Try to get to it this week...