Bruce Baker has an important but somewhat technical post up about the methods that are used to analyze test scores so they can be used for either high-stakes or low-stakes decisions. The reason for the post is that Bruce has seen some corporate "reformers" make the case that Student Growth Percentiles (SGPs) are better at estimating teacher effectiveness than Value-Added Modeling (VAM).
A very short and incomplete summary of the issue is that VAM at least tries to account for student characteristics (and does a miserable job), while SGP doesn't even make the attempt. OK, pretty knotty stuff, but I encourage you to read the post.
I've not seen anyone make this argument myself, although you could say some of the wording in Chris Cerf's press release about the pilot program in New Jersey suggests it may become an issue here. I'm not going to weigh in myself on the technical aspects of SGPs vs VAMs; however...
One of the commenters on Bruce's post is Steven Glazerman of Mathematica Policy Research, who said that the issue is not really SGP vs VAM:
That having been said, the relevant question for both types of models for estimating teacher effects should be, “How useful are they for making policy decisions?” Such decisions can include low stakes (redirection of PD resources) or higher stakes (pay and promotion). Degree of influence can be higher or lower based on the reliability and the consequences of incorrect decisions. It doesn’t have to be a black and white “use it” or “don’t use it”. See here for more discussion.

What he provides is a link to a Brookings Institution paper that I have seen cited over and over again by corporate "reformers" as a justification for the use of test scores (VAM, SGP, whatever - doesn't matter what the analysis tool is) in high-stakes policy decision making.
Let me be clear: this paper is not agnostic about the policy implications of using test scores to hire and fire teachers, or to set their pay. What follows is my response to Glazerman. When I speak about the "cut point," I'm talking about the teacher evaluation score where you make your decision: above this score, you keep your job; below it, you get fired. The cut point is a serious issue because too high a cut point means good teachers are wrongly fired, and too low a cut point means bad teachers are retained.
I admit, I am out of Glazerman's league when it comes to the math - but maybe that's his problem, not mine...
**********
I have more than a few small problems with that link:
"Full-throated debate about policies such as merit pay and “last in-first out” should continue, but we should not let controversy over the uses of teacher evaluation information stand in the way of developing and improving measures of teacher performance."
This is, to be blunt, a cop out. You can't build bombs and then absolve yourself of the collateral damage they cause by claiming you had no say over where they were dropped.
There is no doubt that everyone reading this knows that good teachers will be fired if VAM, SGP, or other such systems are used in high-stakes personnel decisions. The insouciance of the authors toward this problem is deeply troubling. The problem is not simply mitigated by letting fewer low-scoring teachers sneak past the cut point; it is a matter of basic fairness, and the injustice stains the entire profession.
And I see no evidence that having a low-scoring teacher (not a bad teacher, a low-scoring one) for one year is such an enormous detriment to student learning anyway. Where is the empirical evidence that this is a greater danger than firing good teachers? In data-driven research? Seems circular to me.
"Our message is that the interests of students and the interests of teachers in classification errors are not always congruent, and that a system that generates a fairly high rate of false negatives could still produce better outcomes for students by raising the overall quality of the teacher workforce."
This is pure conjecture. "Could" produce better outcomes? Please. You're going to have to do a lot better than that if you want to advocate for a system that you freely admit is going to fire good teachers.
I also am bothered by the casual tone the authors take toward moving the cut point, as if that were a simple matter - it is not. The assumption of normal distributions in the graph is troubling enough; to say that it is merely a matter of moving the cut point along the x-axis to get the policy outcome you want dismisses the difficulty of determining what the test scores tell us about student learning in the first place.
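To put rough numbers on that trade-off, here is a toy simulation of my own - nothing in it comes from the Brookings paper. The normal distributions, the 0.6 correlation between the evaluation score and true effectiveness, and the cut points are all assumptions chosen purely for illustration:

```python
# Illustrative sketch only: the score-to-quality correlation and the cut
# points below are assumed for demonstration, not taken from any study.
import numpy as np

rng = np.random.default_rng(0)
n_teachers = 100_000

true_effect = rng.normal(0, 1, n_teachers)      # "real" teacher quality
noise = rng.normal(0, 1, n_teachers)            # measurement error
observed = 0.6 * true_effect + 0.8 * noise      # noisy evaluation score (corr ~ 0.6)

good = true_effect > 0                          # call the top half "good" teachers

for cut in (-1.0, -0.5, 0.0, 0.5):
    fired = observed < cut
    good_fired = np.mean(fired & good)          # good teachers wrongly dismissed
    weak_kept = np.mean(~fired & ~good)         # weaker teachers retained
    print(f"cut={cut:+.1f}  good teachers fired: {good_fired:.1%}  "
          f"weaker teachers kept: {weak_kept:.1%}")
```

Slide the cut point wherever you like: every reduction in weak teachers retained is paid for with more good teachers fired, and the size of that bill depends entirely on how noisy the score is - which is precisely the thing the tests don't settle.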
"It is instructive to look at other sectors of the economy as a gauge for judging the stability of value-added measures. The use of imprecise measures to make high stakes decisions that place societal or institutional interests above those of individuals is wide spread and accepted in fields outside of teaching."
Please name one field where the work of others determines your evaluation, and you have no say in choosing who does that work. And where that work is a four-day test you aren't allowed to see, one likely being graded by a low-paid, low-skill amateur.
Even doctors have latitude in the patients they accept, and their patients have far less control of their health outcomes than students do over their test scores. Teaching needs unique evaluations because it is a unique job.
The fact is that VAM or SGP or whatever the latest fad is does one thing: it judges whose classes get good test scores. This slavish adherence to the cult of the bubble test is enormously bothersome. It is not the entirety of teaching - it is not even a small part of good teaching. The assumption that young children who take these tests are beneficiaries of the standardized testing regime needs to be questioned, and not merely in the abstract, data-driven language I see here.
With all due respect to the many excellent scholars who collaborated on the Brookings paper, I think you are too far removed from those of us down in the trenches to see the implications of your ideas. You need to get out more.
I think one key point here is that I don't believe Steve G. thinks that this information should ever be used to make an absolute determination regarding either compensation or dismissal.
That said, the Brookings paper does lean toward supporting evaluations that might do just that. The Brookings paper is disturbingly agnostic on many really important technical issues with real, serious ethical implications. For example, it remains agnostic on how and to what extent weights are assigned to VAM estimates, or even how they are used. Brookings doesn't even give a crap as to whether the model meets certain technical criteria, but rather cares that the model is consistent (and it's easier to have a model that consistently rates teachers incorrectly than it is to have one that rates teachers more correctly - just leave in all of the bias due to student characteristics and make sure some teachers always get the toughest kids). The Brookings paper is god-awful and favors technocratic agnosticism over making any thoughtful, ethical judgment calls.
That aside, Steve has actually presented some interesting alternative perspectives on using VAM estimates. For example, we might assume that VAM provides a statistically noisy estimate of teacher effect - one that is somewhat more likely to be correct than wrong (accepting the narrowness of the measure) - but one that is too error prone for making any final high stakes decisions. It's like the rapid strep test: it's only used as pre-screening, with full acknowledgment of the likelihood of false positives or negatives. The information might be used as a basis for additional classroom observations (including peer observation), which might reveal that the initial estimate told us nothing, or might reveal that additional supports/interventions are needed. Still not high stakes. That seems far more reasonable than current policy proposals/mandates, but it still assumes that the information is correct at least marginally more often than it is wrong.
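To see why that kind of pre-screen can only ever be a first step, here is a back-of-the-envelope calculation of my own; the base rate, sensitivity, and specificity are made-up numbers for illustration, not estimates from Steve, Bruce, or any study:

```python
# Minimal sketch of the "noisy flag as pre-screen" idea, with assumed numbers.

def flag_predictive_value(base_rate, sensitivity, specificity):
    """P(teacher genuinely needs support | flagged by the noisy pre-screen)."""
    true_flags = base_rate * sensitivity
    false_flags = (1 - base_rate) * (1 - specificity)
    return true_flags / (true_flags + false_flags)

# Suppose 10% of teachers genuinely need intervention, and the flag catches
# 70% of them while incorrectly flagging 20% of everyone else.
ppv = flag_predictive_value(base_rate=0.10, sensitivity=0.70, specificity=0.80)
print(f"Chance a flagged teacher actually needs support: {ppv:.0%}")   # roughly 28%
```

Under those (hypothetical) error rates, most flagged teachers don't actually need intervention - which is exactly why the flag should trigger an observation, not a pink slip.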
That's really the fine line between me and Steve on this at this point, as I understand it. He and others still believe that the best-estimated VAMs can produce somewhat useful information that can at least be used as a noisy pre-screening tool. I'm increasingly pessimistic that VAMs can produce information that is useful even at this level. The year-to-year variation within teachers, the vast differences based on different tests, and the continued influence of non-random sorting have me stumped as to how to really make VAM useful at all as a decision-informing tool.
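Just to illustrate the year-to-year piece of that, one more toy simulation; the assumption that only 30% of the variance in a yearly estimate is stable teacher signal is mine alone, for illustration:

```python
# Rough sketch: if most of the variance in a yearly estimate is noise,
# two years of estimates for the same teacher barely agree.
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
signal_share = 0.3                                   # assumed stable share of variance

stable = rng.normal(0, np.sqrt(signal_share), n)     # persistent "teacher signal"
year1 = stable + rng.normal(0, np.sqrt(1 - signal_share), n)
year2 = stable + rng.normal(0, np.sqrt(1 - signal_share), n)

print(f"year-to-year correlation: {np.corrcoef(year1, year2)[0, 1]:.2f}")

# Of teachers rated in the bottom quintile in year 1, how many land there again?
bottom1 = year1 < np.quantile(year1, 0.2)
bottom2 = year2 < np.quantile(year2, 0.2)
print(f"bottom quintile both years: {np.mean(bottom2[bottom1]):.1%}")
```

If the real signal share is anywhere near that low, a teacher's rating this year tells you strikingly little about next year's - and that's before you layer on different tests and non-random sorting.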