I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ," October 2009

Sunday, March 10, 2013

NJ Teacher Evaluation: There's No Escaping The Problems With VAMs

Let's face it: it's been a rough couple of years for the supporters of test-based teacher evaluations.

Back in 2010, the Los Angeles Times published the "ratings" of the city's teachers based on test scores, using a method called Value-Added Modeling (VAM). Researchers at the National Education Policy Center took a look at the LA Times's ratings and found them to be highly unreliable: when NEPC used a different (and arguably stronger) statistical model, more than half of the teachers fell into different categories of effectiveness.
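
To get a feel for how much model choice alone can reshuffle ratings, here is a minimal Python sketch. Everything in it is invented: the "true" teacher effects, the noise, and the two "models," which merely stand in for the Times's specification and NEPC's alternative.

```python
import random

random.seed(1)

# Invented "true" effectiveness for 1,000 teachers, plus two noisy
# estimates of it -- stand-ins for two competing VAM specifications.
true_effect = [random.gauss(0, 1) for _ in range(1000)]
model_a = [t + random.gauss(0, 1) for t in true_effect]
model_b = [t + random.gauss(0, 1) for t in true_effect]

def quintiles(scores):
    """Assign each teacher an effectiveness quintile, 0 (low) to 4."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // len(scores)
    return labels

cat_a, cat_b = quintiles(model_a), quintiles(model_b)
moved = sum(a != b for a, b in zip(cat_a, cat_b)) / len(cat_a)
print(f"teachers changing quintile between models: {moved:.0%}")
```

Run it and well over half the teachers change quintiles, even though their underlying "true" effectiveness never moved - the same pattern NEPC found in the real data.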

The next year, the Times tried to account for the bias and noise in its ratings. Those efforts, however, only made it clearer that VAMs are far too imprecise to use for high-stakes decisions.

In New York City, Gary Rubinstein published a series of brilliant posts last year on the VAM-based ratings of teachers. Not only did Rubinstein find little correlation between a teacher's score in one year and the next; he found little correlation between the same teacher's ratings in two different subjects in the same year!
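
Rubinstein's check is easy to replicate on any published ratings file: correlate each teacher's rating in year one against year two. A bare-bones sketch, with invented percentile scores standing in for the NYC data:

```python
def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Invented percentile ratings for eight teachers in two consecutive
# years -- placeholders for the published NYC data. A stable measure
# should produce a strong positive correlation; these land near zero.
year1 = [10, 20, 30, 40, 60, 70, 80, 90]
year2 = [80, 20, 70, 30, 75, 25, 85, 15]
print(f"year-to-year correlation: {pearson(year1, year2):.2f}")
```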

The truth is that the "confidence intervals" in the NYC ratings are very large; the ratings present a range of scores for each teacher, making it impossible to rank-order teachers by their "effectiveness" with any reasonable certainty (unlike ranking them by, say, seniority, which we can do with absolute precision).
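
Here's a toy illustration of why wide intervals wreck rank-ordering. The three teachers, their point estimates, and their margins of error are all invented, though margins of dozens of percentile points were reported for the actual NYC ratings:

```python
# Invented (point estimate, margin of error) pairs, in percentiles.
teachers = {"A": (62, 35), "B": (48, 35), "C": (55, 35)}

def distinguishable(t1, t2):
    """True only if the two teachers' intervals do not overlap."""
    (est1, moe1), (est2, moe2) = teachers[t1], teachers[t2]
    return abs(est1 - est2) > moe1 + moe2

for pair in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(pair, "distinguishable:", distinguishable(*pair))
```

Every pair prints False: the point estimates differ, but the intervals swallow the differences, so no defensible ranking exists.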

All of this comports with the research consensus on VAM: Value-Added Modeling is simply too unreliable to use when making operational decisions. Which brings us to the new teacher evaluation system proposed by the New Jersey Department of Education. NJDOE still calls for using test scores in teacher evaluations - but there's a difference:
One prevailing question both during and after the meeting had to do with how the process would work, especially the test scores and a complex formula called “student growth percentiles” (SGP) that will measure students’ progress against that of comparable peers.
The state is expected to release its first SGP scores for overall school performance this month, based on 2011-2012 tests, but has only started to share them for individual teachers.
[...]
After the meeting, a spokesman for the NJEA leadership was less combative than he was the day before after first reviewing the regulations -- but not by much.
“They are just moving really fast with the SGP, when there are real questions as to how it will work,” said Steve Wollmer, the NJEA’s communications director. “None of it has been tested, we don’t know it will work, or what it will measure.”
“Why don’t we slow this down and test the SGP?” he said. [emphasis mine]
It's a good question: given all the error rates that have been demonstrated with VAMs, why aren't we just as leery of SGPs? Are they "better" than VAMs? Do they avoid all of the now well-documented problems that VAMs suffer from?

I want to be fair here: I have not yet seen the NJDOE explicitly say that SGPs are a superior method to VAMs. But the department's avoidance of the term "VAM" suggests that they are readying an argument in favor of SGPs, at least in part, on the basis that they are not VAMs.

Yes, this is a conjecture on my part - but I'm not the only one predicting the argument will unfold this way:
Arguably, one reason for the increasing popularity of the SGP approach across states is the extent of highly publicized scrutiny and large and growing body of empirical research over problems with using VAMs for determining teacher effectiveness (Baker, Darling-Hammond, Haertel, Ladd, Linn, Ravitch, Rothstein, Shavelson, & Shepard, 2010; Corcoran, 2010; Green, Baker, & Oluwole, 2012). Yet, there has been far less research on using student growth percentiles for determining teacher effectiveness. The reason for this vacuum is not that student growth percentiles are simply immune to problems of value-added models, but that researchers have until recently chosen not to evaluate their validity for this purpose – estimating teacher effectiveness – because they are not designed to infer teacher effectiveness. [emphasis mine]
Again, I think it's important that I be fair: the NJDOE has not explicitly said SGPs are superior to VAMs. But they have dropped some clues that, going forward, the argument in favor of their proposal will include the contention that SGPs are not VAMs, and should therefore not be subject to the same criticisms.

Look, for example, at this video, available through the department's website, which explains how SGPs are constructed (the video is quite good; I'd encourage every NJ teacher and parent to take a few minutes to watch it). In slide five, the video states that SGPs do not use student characteristics when determining a student's "academic peers" (annotation mine):
[Slide 5 of the NJDOE video, annotated: student characteristics are not used to determine a student's "academic peers."]

The "most thorough VAMs" will use student characteristics as part of their model to determine teacher effectiveness - but not SGPs. As the video explains, students are compared - and thus, teachers are rated - solely on the basis of their previous scores on state tests. In contrast to VAMs, SGPs do not attempt to account for bias in student test score differences due to demographic factors.

This may seem like a subtle point, but it's actually quite important. Because while VAMs fail to accurately quantify the effectiveness of individual teachers, they at least make the attempt. But as Damian Betebenner, the "inventor" of SGPs, has admitted, SGPs do not even try to determine which part of a student's "growth" can be attributed to a teacher:
We would add that it is a similar “massive … leap” to assume a causal relationship between any VAM quantity and a causal effect for a teacher or school, not just SGPs. We concur with Rubin et al (2004) who assert that quantities derived from these models are descriptive, not causal, measures. However, just because measures are descriptive does NOT imply that the quantities cannot and should not be used as part of a larger investigation of root causes. [emphasis mine]
We'll talk later about what a "larger investigation of root causes" should look like. For now, contrast the quote above with the following slide put out by the NJDOE this week in support of their proposals (annotation in yellow mine):
[NJDOE slide, annotated: the slide presents a teacher's SGP as a measure of the teacher's ability.]

Wrong! The SGP does not measure the teacher's ability; it is a descriptive measure of a student's growth! Even the "inventor" of SGPs says so!

NJDOE can't distance itself from the problems with VAMs simply by claiming to use SGPs instead: while VAMs attempt - and fail - to attribute student test score changes to teachers, SGPs don't even make the attempt.

Once again: we haven't yet reached the stage where NJDOE has directly answered their critics when it comes to the use of test scores; I am anticipating their argument. Maybe I'll be proven wrong; maybe NJDOE won't disavow the poor record of test-based evaluation by claiming their methods are different from the ones used in NYC, Los Angeles, and elsewhere.

But, were I a betting man, I'd stake a hefty wager that I'm going to be right. The dates for public presentations on NJDOE's proposals are:
  • March 13, 3:30 p.m. - 5:30 p.m., Toms River High School North
  • March 15, 1:00 p.m. - 4:00 p.m., Morris County Fire Fighters & Police Academy (Parsippany)
  • March 19, 9:00 a.m. - 3:00 p.m., Rutgers Camden (303 Cooper St.)
  • March 21 (2 sessions), 1:00 p.m. - 3:00 p.m. and 4:00 p.m. - 6:00 p.m., New Jersey Supervisors and Principals Association (Monroe)
  • April 10, 9:00 a.m. - 11:30 a.m., Ocean City High School
  • April 11 (2 sessions), 1:00 p.m. - 3:00 p.m. and 4:00 p.m. - 6:00 p.m., Teaneck High School
I would hope NJDOE has the courage of their convictions to take questions at these meetings. I would hope they'd be open to a frank discussion of the error rates in VAMs, and how using SGPs will not in any way address this problem.

I would hope.


ADDING: From Lisa in the comments:
Not only does the creator of the SGP state that the SGP does not identify the contribution to student growth made by a teacher, but so does the State of NJ itself on its NJDOE website. Yet, 45% of a teacher's evaluation of effectiveness is the SGP.

The below quote is copied directly from the last page of the _SGP-General Overview_ paper that’s located on the NJDOE NJSmart web page on Student Performance: 

http://www.nj.gov/education/njsmart/performance/SGP_Detailed_General_Overview.pdf

“The median student growth percentile is descriptive and makes no assertion about the cause of student growth. This differs from current value-added models where the purpose is to specify the contribution to student achievement provided by a given school or teacher.”

If the creators and users (NJDOE) of SGP state that the SGP “makes no assertion about the cause of student growth”, i.e. that SGP does not--and is not designed to--show who or what is responsible for the student growth or lack thereof, why is 35%-45% of a teacher’s effectiveness evaluation based on student growth as measured by SGP?

This glaring contradiction needs to be answered and addressed by the State before it's forced to defend it in court at taxpayer expense. [emphasis mine]
Yep.
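
To put Lisa's point in plain arithmetic, here is a hypothetical sketch of a composite evaluation under the proposed weighting. The 45% SGP weight comes from the proposal she cites; the component scores and the remaining weights are invented for illustration:

```python
# Hypothetical summative rating: 45% of it rides on median SGP,
# a statistic whose own documentation says it "makes no assertion
# about the cause of student growth."
weights = {"median_sgp": 0.45, "observations": 0.45, "other": 0.10}
scores  = {"median_sgp": 0.35, "observations": 0.80, "other": 0.75}

summative = sum(weights[k] * scores[k] for k in weights)
print(f"summative rating: {summative:.2f}")  # 0.59
```

Nearly half of that number is driven by a measure that is explicitly non-causal - which is exactly the contradiction Lisa identifies.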


4 comments:

Unknown said...

Not only does the creator of the SGP state that the SGP does not identify the contribution to student growth made by a teacher, but so does the State of NJ itself on its NJDOE website. Yet, 45% of a teacher's evaluation of effectiveness is the SGP.

The below quote is copied directly from the last page of the _SGP-General Overview_ paper that’s located on the NJDOE NJSmart web page on Student Performance:

http://www.nj.gov/education/njsmart/performance/SGP_Detailed_General_Overview.pdf

“The median student growth percentile is descriptive and makes no assertion about the cause of student growth. This differs from current value-added models where the purpose is to specify the contribution to student achievement provided by a given school or teacher.”

If the creators and users (NJDOE) of SGP state that the SGP “makes no assertion about the cause of student growth”, i.e. that SGP does not--and is not designed to--show who or what is responsible for the student growth or lack thereof, why is 35%-45% of a teacher’s effectiveness evaluation based on student growth as measured by SGP?

This glaring contradiction needs to be answered and addressed by the State before it's forced to defend it in court at taxpayer expense.

--Lisa

Duke said...

Brava, Lisa! Great catch.

Mrs. King's music students said...

OMG - Seniority?? Really? The peach that hung on the tree the longest is the one you want? Regardless of what happened while it hung there?

You mathy types can throw down all the stats you want. You aren't going to KNOW anything for sure about the peaches until you tend to the orchard.

And here's a thought, now that we've lived with the seniority model for a good long time, and seen the impact IT had on a host of issues in education, how about gathering up the extensive data on that and treating what's rotten.

Bill Michaelson said...

Extensive data? What data? Mathy? Rotten? Peaches? Orchard? Huh?