I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Sunday, May 21, 2017

Random Thoughts On Using VAM for Teacher Evaluation

You may have read the piece in the New York Times today by Kevin Carey on the passing of William Sanders, the father of the idea of using value-added modeling (VAM) to evaluate teachers. Let me first offer my condolences to his family.

I'm going to skip a point-by-point critique of Carey's piece and, instead, offer a few random thoughts about the many problems with using VAMs in the classroom:

1) VAM models are highly complex and well beyond the understanding of almost all stakeholders, including teachers. Here's a typical VAM model:

Anyone who states with absolute certainty that VAM is a valid and reliable method of teacher evaluation, yet cannot tell you exactly what is happening in this model, is full of it.
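For a sense of the general shape such a model can take, here is a minimal sketch of a generic covariate-adjusted value-added specification. It is an illustration under assumed notation, not necessarily the model referred to above, and every symbol in it is my own labeling:

```latex
% Illustrative only: a generic covariate-adjusted VAM, with assumed notation.
% A_{it}      : test score of student i in year t
% A_{i,t-1}   : the same student's prior-year score
% X_{it}      : student covariates (e.g. economic disadvantage, LEP, special education)
% D_{ijt}     : 1 if student i is assigned to teacher j in year t, else 0
% \tau_j      : the "value-added" estimate for teacher j
% \varepsilon_{it} : everything the model does not capture
A_{it} = \lambda A_{i,t-1} + X_{it}'\beta + \sum_{j} \tau_j D_{ijt} + \varepsilon_{it}
```

Even in this stripped-down form, the teacher estimate depends on how prior scores are measured, which covariates are included, and how students are assigned to teachers. Operational models are typically more elaborate than this.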

There was a bit of a debate last year about whether it matters that student growth percentiles (SGPs) -- which are not the same as VAMs, but are close cousins -- are mathematically and conceptually complex. SGP proponents make the argument that understanding teacher evaluation models is like understanding pi: while the calculation may be complex, the underlying concept is simple. It is, therefore, fine to use SGPs/VAMs to evaluate teachers, even if they don't understand how they got their scores.

This argument strikes me as far too facile. Pi is a constant: it represents something (the circumference of a circle divided by its diameter) that is concrete and easy to understand. It isn’t expressed as a conditional distribution; it just is. It isn’t subject to variation depending on the method used to calculate it; it is always the same. An SGP or a VAM is, in contrast, an estimate, subject to error and varying degrees of bias depending on how it is calculated.

The plain fact is that most teachers, principals, school leaders, parents, and policy makers do not have the technical expertise to properly evaluate a probabilistic model like a VAM. And it is unethical, in my opinion, to impose a system of evaluation without properly training stakeholders in its construction and use.

2) VAM models are based on error-prone test scores, which introduces problems of reliability and validity. Standardized tests are subject to what the measurement community often calls "construct-irrelevant variance" -- which is just a fancy way of saying test scores vary for reasons other than knowing stuff. Plus there's the random error found in all test results, due to all kinds of things like testing conditions. 

This variance and noise cause all sorts of issues when put into a VAM. We know, for example, that the non-random sorting of students into teacher classrooms can create bias in the model. There is also a very complex issue known as attenuation bias when trying to deal with the error in test scores. There are ways to ameliorate it -- but there are tradeoffs.
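To make attenuation bias a little more concrete, here is a minimal simulation sketch. The numbers (a true slope of 0.8, a test reliability of 0.7) and the function name are assumptions chosen for illustration, not estimates from any real test or VAM:

```python
import numpy as np

def simulate_attenuation(n=200_000, true_slope=0.8, reliability=0.7, seed=0):
    """Regress current scores on a noisy prior score and return the OLS slope.

    All parameter values are hypothetical and exist only to illustrate the idea.
    """
    rng = np.random.default_rng(seed)
    true_prior = rng.normal(0.0, 1.0, n)  # true prior achievement, variance 1
    # Pick the error variance so that reliability = var(true) / var(observed).
    error_var = (1.0 - reliability) / reliability
    observed_prior = true_prior + rng.normal(0.0, np.sqrt(error_var), n)
    # Current score depends on TRUE prior achievement, not the noisy measurement.
    current = true_slope * true_prior + rng.normal(0.0, 0.5, n)
    # Simple-regression slope: cov(x, y) / var(x).
    cov = np.cov(observed_prior, current)
    return cov[0, 1] / cov[0, 0]

estimated = simulate_attenuation()
print(round(estimated, 2))  # should land near true_slope * reliability, not true_slope
```

In this toy setup the estimated slope shrinks toward zero by roughly the reliability of the prior test. When the prior-score adjustment is understated like this, whatever it fails to absorb can end up attributed to something else in the model, including the teacher.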

My point here is simply that these are very complicated issues and, again, well beyond the apprehension of most stakeholders. Which dictates caution in the use of VAM -- a caution that has been sorely lacking in actual policy.

3) VAM models are only as good as the data they use -- and the data's not so great. VAM models have to assign students to teachers. As an actual practitioner, I can tell you that's not as easy as it sounds. Who should be assigned a VAM score for language arts when a child is Limited English Proficient (LEP): the ELL teacher, or the classroom teacher? What about special education students who spend part of the school day "pulled out" of the homeroom? Teachers who team teach? Teachers who co-teach?

All this assumes we have data systems good enough to track kids, especially as they move from school to school and district to district. And if the models include covariates for student characteristics, we need to have good measures of students' socio-economic status, LEP status, or special education classification. Most of these measures, however, are quite crude.*

If we're going to make high-stakes decisions based on VAMs, we'd better be sure we have good data to do so. There's plenty of reason to believe the data we have isn't up to the job.

4) VAM models are relative; all students may be learning, but some teachers must be classified as "bad." Carey notes that VAMs produce "normal distributions" -- essentially, bell curves, where someone must be at the top, and someone must be at the bottom.

I've labeled this with student test scores, but you'd get the same thing with teacher VAM scores. Carey's piece might be read to imply that it was a revelation to Sanders that the scores came out this way. But VAMs yield normal distributions by design -- which means someone must be "bad."

General Electric's former CEO Jack Welch famously championed ranking his employees -- which is basically what a VAM does -- and then firing the bottom x percent. GE eventually moved away from the idea. I'm hardly a student of American business practices, but it always struck me that Welch's idea was hampered by a logical flaw: someone has to be ranked last, but that doesn't always mean he's "bad" at his job, or that his company is less efficient than it would be if he were fired.
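The logic of relative ranking can be sketched in a few lines. This is a toy illustration with made-up scores and a hypothetical 10 percent cutoff, not a reconstruction of any actual evaluation system:

```python
import numpy as np

def bottom_share(scores, share=0.10):
    """Return the indices of the teachers ranked in the bottom `share` of scores."""
    k = int(len(scores) * share)
    return set(np.argsort(scores)[:k].tolist())

rng = np.random.default_rng(1)
scores = rng.normal(0.0, 1.0, 1000)  # hypothetical teacher scores
improved = scores + 5.0              # every teacher improves by the same large amount

flagged_before = bottom_share(scores)
flagged_after = bottom_share(improved)
print(len(flagged_before), flagged_before == flagged_after)
```

Because the cutoff is defined by rank rather than by any absolute standard, the same 100 teachers are flagged before and after everyone improves. The label "bottom 10 percent" says nothing about whether those teachers are actually doing a poor job.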

I am certainly the last person to say our schools can't improve, nor would I ever say that we have the best possible teaching corps we could have. And I certainly believe there are teachers who should be counseled to improve; if they don't, they should be made to leave the profession. There are undoubtedly teachers who should be fired immediately.

But the use of VAM may be driving good candidates away from the profession, even as it is very likely misidentifying "bad" teachers. Again, the use of VAM to evaluate systemic changes in schooling is, in my view, valid. But the argument for using VAM to make high-stakes individual decisions is quite weak. Which leads me to...

5) VAM models may be helpful for evaluating policy in the aggregate, but they are extremely problematic when used in policies that force high-stakes decisions. When the use of test-based teacher evaluation first came to New Jersey, Bruce Baker pointed out that its finer scale, compared to teacher observation scores, would lead to making SGPs/VAMs some of the evaluation but all of the decision.

But then NJDOE leadership -- because, to be frank, they had no idea what they were doing -- assigned teacher observation scores a phony precision. That led to high-stakes decisions compelled by the state based on arbitrary cut points and arbitrary weighting of the test-based component. The whole system is now an invalidated dumpster fire.

I am extremely reluctant to endorse any use of VAMs in teacher evaluation, because I think the corrupting pressures will be bad for students; in particular (and as a music teacher), I worry about narrowing the curriculum even further, although there are many other reasons for concern. Nonetheless, I am willing to concede there is a good-faith argument to be made for training school leaders in how to use VAMs to inform, rather than compel, their personnel decisions.

But that's not what's going on in the real world. These measures are being used to force high-stakes decisions, even though the scores are very noisy and prone to bias. I think that's ultimately very bad for the profession, which means it will be very bad for students.

Carey mentions the American Statistical Association's statement on using VAMs for educational assessment. Here, for me, is the money quote:
Research on VAMs has been fairly consistent that aspects of educational effectiveness that are measurable and within teacher control represent a small part of the total variation in student test scores or growth; most estimates in the literature attribute between 1% and 14% of the total variability to teachers. This is not saying that teachers have little effect on students, but that variation among teachers accounts for a small part of the variation in scores. The majority of the variation in test scores is attributable to factors outside of the teacher’s control such as student and family background, poverty, curriculum, and unmeasured influences. 
The VAM scores themselves have large standard errors, even when calculated using several years of data. These large standard errors make rankings unstable, even under the best scenarios for modeling. Combining VAMs across multiple years decreases the standard error of VAM scores. Multiple years of data, however, do not help problems caused when a model systematically undervalues teachers who work in specific contexts or with specific types of students, since that systematic undervaluation would be present in every year of data. 
A VAM score may provide teachers and administrators with information on their students’ performance and identify areas where improvement is needed, but it does not provide information on how to improve the teaching. The models, however, may be used to evaluate effects of policies or teacher training programs by comparing the average VAM scores of teachers from different programs. In these uses, the VAM scores partially adjust for the differing backgrounds of the students, and averaging the results over different teachers improves the stability of the estimates [emphasis mine]
Wise words. 
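The ASA's point about multiple years of data can be illustrated with a small simulation. This is a sketch under assumed values: a teacher whose true effect is zero, a model that systematically undervalues her by 0.2, and yearly noise with a standard deviation of 0.5. None of these numbers comes from a real VAM:

```python
import numpy as np

def multi_year_estimates(true_effect=0.0, bias=-0.2, noise_sd=0.5,
                         years=1, n_sims=20_000, seed=2):
    """Average `years` noisy yearly scores and summarize across many simulations.

    Returns (mean of the averaged scores, standard deviation of the averaged scores).
    All parameter values are hypothetical.
    """
    rng = np.random.default_rng(seed)
    # Each yearly score = true effect + systematic bias + random noise.
    yearly = true_effect + bias + rng.normal(0.0, noise_sd, (n_sims, years))
    averaged = yearly.mean(axis=1)
    return averaged.mean(), averaged.std()

for years in (1, 3, 5):
    mean_est, spread = multi_year_estimates(years=years)
    print(years, round(mean_est, 2), round(spread, 2))
```

In this toy setup, the spread of the averaged score shrinks as years are added, but its center stays near the biased value rather than the true effect. More years buy stability, not fairness, for a teacher the model systematically undervalues.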

NJ's teacher evaluation system, aka "Operation Hindenburg."

* In districts where there is universal free-lunch enrollment, parents have no incentive to fill out paperwork designating their children as FL-eligible. So even that crude measure of student economic disadvantage is useless.


Unknown said...

...and yet NJEA supported AchieveNJ & tenure reform & evaluation models, like Danielson or Marzano. From my experience, it has destroyed teacher morale within entire districts. While I am a proud NJEA member, this is the one issue that seems to take a backseat to everything else & it infuriates me. It's great that they fight for our pension & our future but our present situation is awful. It is absurd to think that issuing a score 0-1-2-3-4 in an evaluation or observation will work to improve teacher performance... administrators observe teachers with numbers & ridiculous buzz words like 'developing', 'applying' or 'beginning' to somehow validate what is an extremely subjective measure of a teacher's performance. The new & current system is flawed & extremely broken and only makes teaching more difficult.

edlharris said...