I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Saturday, June 30, 2012

Getting Teacher Evaluation Basics Right - Or Not

Oh, Laura...
Here's four challenges that face New Jersey's public school system as it implements new procedures under a harsh spotlight. None of these challenges are insurmountable, but all will require careful oversight and strong leadership.
Refine Teacher Evaluation Rubric: Last year, the DOE rolled out a pilot program of value-added teacher evaluations under the heading of Excellent Educators for New Jersey. Participation was limited to 11 districts (including Newark), plus 20 low-performing schools that received federal grants. Original plans called for statewide roll-out in 2012-2013; that's been pushed back a year until the kinks are worked out, although all districts will tiptoe towards the new system in September. Districts are also allowed to use their own templates, as long as they conform to the minimum standards in the bill, which include evaluating teachers based on "multiple objective measures of student learning." [second emphasis mine]
Yeah, uh, no. The DOE rolled out a pilot program not of Value Added Modeling (VAM), but of Student Growth Percentiles (SGP):
Q: How does New Jersey measure student growth? 
A: New Jersey measures growth for an individual student by comparing the change in his or her NJ ASK achievement from one year to the next to that of all other students in the state who had similar historical results (the student’s "academic peers"). This change in achievement is reported as a student growth percentile (abbreviated SGP) and indicates how high or low that student’s growth was as compared to that of his/her academic peers. For a school or district, the growth percentiles for all students are aggregated to create a median SGP for the school or district. The median SGP is a representation of “typical” growth for students in the school or district. [emphasis mine]
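To make the DOE's description concrete, here is a rough sketch of the calculation it describes. It rests on loud simplifying assumptions: "academic peers" are taken to be students with exactly the same prior-year score, and the percentile is a simple mid-rank within that group. Actual SGP calculations are, as far as I understand them, based on quantile regression over a student's score history, so treat this as an illustration only. The function names, field names, and scores are all made up.

```python
from collections import defaultdict
from statistics import median

def growth_percentiles(students):
    """Percentile of each student's current score among 'academic peers'.

    ASSUMPTION: peers are students with an identical prior score.
    Real SGP models are more elaborate; this is only an illustration.
    """
    peers = defaultdict(list)
    for s in students:
        peers[s["prior"]].append(s["current"])

    result = {}
    for s in students:
        group = peers[s["prior"]]
        below = sum(1 for c in group if c < s["current"])
        tied = sum(1 for c in group if c == s["current"])  # includes the student
        result[s["id"]] = 100.0 * (below + 0.5 * tied) / len(group)
    return result

def median_sgp_by_school(students):
    """Aggregate student percentiles to a median per school."""
    sgp = growth_percentiles(students)
    by_school = defaultdict(list)
    for s in students:
        by_school[s["school"]].append(sgp[s["id"]])
    return {school: median(values) for school, values in by_school.items()}

# Hypothetical data: six students, two schools, two peer groups.
students = [
    {"id": 1, "school": "A", "prior": 200, "current": 210},
    {"id": 2, "school": "A", "prior": 200, "current": 220},
    {"id": 3, "school": "B", "prior": 200, "current": 205},
    {"id": 4, "school": "B", "prior": 200, "current": 230},
    {"id": 5, "school": "A", "prior": 250, "current": 250},
    {"id": 6, "school": "B", "prior": 250, "current": 260},
]

print(median_sgp_by_school(students))
```

Notice what is missing: nothing in the calculation refers to a teacher, a classroom, or any student characteristic other than prior scores. It describes where a student landed relative to peers and nothing more.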
Why does this matter? Bruce Baker explains:
But what about those STUDENT GROWTH PERCENTILES being pitched for similar use in states like New Jersey?  While on the one hand the arguments might take a similar approach of questioning the reliability or validity of the method for determining teacher effectiveness (the supposed basis for dismissal), the arguments regarding SGPs might take a much simpler approach. In really simple terms SGPs aren’t even designed to identify the teacher’s effect on student growth. VAMs are designed to do this, but fail.
When VAMs are challenged in court, one must show that they have failed in their intended objective. But it’s much, much easier to explain in court that SGPs make no attempt whatsoever to estimate that portion of student growth that is under the control of, therefore attributable to, the teacher (see here for more explanation of this).  As such, it is, on its face, inappropriate to dismiss the teacher on the basis of a low classroom (or teacher) aggregate student growth metric like SGP. Note also that even if integrated into a “multiple measures” evaluation model, if the SGP data becomes the tipping point or significant basis for such decisions, the entire system becomes vulnerable to challenge.* [emphasis mine]
Yes, that's right: the NJDOE is proposing to use a method of evaluating teachers - SGPs - that does not even attempt to estimate how much influence the teacher has on student growth!

This is critically important to understand in the days ahead: the NJDOE is not proposing to use an inaccurate method like VAM to evaluate teachers; they are proposing to use SGP, a method that is completely inappropriate to the task!

Let's hope Laura Waters figures out the difference before her next column.


Tony C said...

The state of Georgia is also planning to use SGP in its teacher effectiveness ratings.

Duke said...

Tony, that's interesting. Apparently this guy Betebenner has been selling them around the country, even as he disavows their use in teacher evaluation!


Now, here is a quote from Betebenner and colleagues’ response to my criticism of policymakers proposed uses of SGPs in teacher evaluation.

From Damian Betebenner & colleagues

A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.

(emphasis added)

But, you see, using these data to “evaluate teachers” necessarily infers “attribution of responsibility for that progress.” Attribution of responsibility to the teacher! If one cannot use these measures to attribute responsibility to the teacher, then how can one possibly use these measures to “evaluate” the teacher? One can’t. You can’t. No-one can. No-one should!

Read Bruce's entire post. It's ridiculous that any state would choose to use SGPs when even the creator says they aren't appropriate for use in teacher evaluation.

I'd like to know exactly how involved Betebenner is in selling GA on SGPs. This is a very underreported story.

I'm going to fix that. Stand by...

Deb said...


When you wish for something like Laura Waters figuring out the differences that you mention, please save some lives and tell people not to hold their breath!

If ever there was a less informed blogger... Anyway, thanks for calling her on this one more time. I know it can get dreary, but we have to keep pointing out when they get it wrong.


Galton said...

Good points!
One of the things people do not realize is that within EVERY "academic peer group," 33% of the students will show "growth" in the bottom 33%! These students will be labeled "low growth" regardless of any actual learning! This will happen even for the highest performing students (criterion referenced).
Seems like a strange twist on "Lake Wobegon." Here, though, because of the normative basis of the SGP, we can by definition now say that 1/3 of students in the state will have low growth years (every year, every academic peer group!).
Quite brilliant of them. Failure by design!
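
The point about the normative basis can be shown with a small sketch. It assumes a single peer group, a mid-rank percentile, and a hypothetical "low growth" cutoff at the 33rd percentile; the cutoff, function name, and gain values are illustrative, not taken from any state's rules.

```python
LOW_CUTOFF = 33.0  # hypothetical cutoff, for illustration only

def label_growth(gains, cutoff=LOW_CUTOFF):
    """Label each gain by its percentile rank within one peer group."""
    n = len(gains)
    labels = []
    for g in gains:
        below = sum(1 for x in gains if x < g)
        tied = sum(1 for x in gains if x == g)  # includes the student
        pct = 100.0 * (below + 0.5 * tied) / n
        labels.append("low" if pct < cutoff else "typical or high")
    return labels

# Every student in this made-up peer group gained points.
gains = [5, 10, 15, 20, 25, 30]
print(label_growth(gains))

# Add 100 points to every gain: the labels do not change.
print(label_growth([g + 100 for g in gains]))
```

Because the labels depend only on rank order, two of the six students are "low" in both runs, no matter how much everyone actually learned.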

Duke said...

The normative nature of SGP is very troubling, but it speaks to how shallow the debate about this stuff is. We worry about the students who are "below average" without thinking what the "average" actually represents.

I will be diving deep into SGP this summer, for sure. This stuff is widespread and barely reported on, but it's one of the most important issues in education at the moment. Bruce Baker is one of the only folks I know talking about this - that has to change.

Awo Okaikor Aryee-Price said...

Thanks for this post. I think people need to be more aware, and I think Bruce does an awesome job of this because he gets it! I even used his research to write a letter to our administrators to protest the use of student test scores in our evaluations. In their new article, Green, Baker, & Oluwole (2012) even talk about test scores being the tipping point. I can understand why they insist on using a tool that has proven to be unreliable and invalid!

** Correct me if I am wrong, but VAM is the larger model and SGP falls under this larger statistical model to measure teacher effectiveness. In other words, there are several different VAMs and SGP is just one of them.



Awo Okaikor Aryee-Price said...

** Sorry correction: I can't understand why they insist on using a tool that has proven to be unreliable and invalid!

Duke said...

Awo, we probably have a case here of a term like "VAM" being introduced into the language in a way that allows it to have multiple meanings. That's a problem, because the distinction between SGP and VAM is important.

I interpret the "Value" in VAM to refer directly to an attempt to isolate just how one factor - in this case, teachers - "adds value" to an outcome (student achievement). By attempting to control for other factors, VAM tries to isolate one variable to ascribe a "value" to it.

If so, SGPs couldn't be VAMs. SGP does not attempt to isolate the variables that lead to an outcome; it merely describes the outcome. Which is why it is totally inappropriate for use in teacher evaluation.

Betebenner is making the case that SGP is better than simply looking at a raw score, because it recognizes that a student could have made great gains yet still not be performing as well as others.

That's fine, but SGP does not attempt to tell us why that is. All it says is: "That kid had better than average growth. That kid had less than average growth."

VAM at least tries to say: "That kid had better than average growth for a girl living in poverty who doesn't speak English at home." Ultimately, it fails to do this, because the error rates are so high, but at least it makes an attempt, and it attempts to quantify the level of error.

SGP doesn't even TRY to do any of that. So I don't think it qualifies as a VAM, because it doesn't attempt to show where the "value" is.

Thx for commenting. More to come.

Awo Okaikor Aryee-Price said...

Hey Duke,

I agree. Confusing the terms is a problem, and I don't want to distract from your initial message with SGP vs VAM. But I quote William Sanders from the SAS Institute because he sort of explains why people use it more broadly. He refers to "Class Average Gain," and this is the SGP we speak of:

"During the past several years there has been a growing interest nationally in using standardized test data to provide a measure of the impact of various educational entities on the rate of student progress. Unlike other uses of standardized test data, the intent of various value-added models is to use student achievement test data longitudinally so that many of the influences on student achievement can be negated by following the progress of individual students. I was one of the first to invoke the label “value-added assessment” to the comprehensive analytical process which we developed for Tennessee in the early 90’s. However in recent years, many have begun to attach the “value-added assessment” label to a broad range of analytical procedures; these procedures range from being very analytically simplistic to very sophisticated. Often policy makers are being mislead into believing that these procedures give nearly identical results.
The purpose of this presentation is to characterize the differences among several different classroom-level “value-added” modeling efforts each having been applied to the same data structure from two different rather large school districts. An attempt will be made to show the advantages and disadvantages of each, with special attention given to the egregious risks of misclassification when some of these models are applied to provide classroom teaching effects estimates.


Class Average Score. This does not measure value added since it does not consider any of a student’s previous scores. It has been included for comparison with value-added models and because, unfortunately, it is still too commonly used to compare teachers (or schools or districts).

Class Average Gain. With this approach each student’s previous score for a subject is subtracted from the current score to obtain a gain, and then a simple average gain for the class is calculated. This is the simplest possible “value-added” model. The calculations are simple and easily understood. However, this is one of the least desirable of all of the “value-added” approaches for a number of reasons."
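
To make the two definitions quoted above concrete, here is a minimal sketch of both calculations. The function names and the scores are invented for illustration; only the arithmetic (average of current scores, and average of current minus previous) comes from the quoted descriptions.

```python
def class_average_score(current):
    """Mean of current-year scores; ignores any previous scores."""
    return sum(current) / len(current)

def class_average_gain(previous, current):
    """Mean of (current - previous), paired student by student."""
    gains = [c - p for p, c in zip(previous, current)]
    return sum(gains) / len(gains)

# Two made-up classes of three students each.
class_x_prev, class_x_curr = [300, 310, 320], [305, 315, 325]
class_y_prev, class_y_curr = [200, 210, 220], [220, 230, 240]

print(class_average_score(class_x_curr), class_average_gain(class_x_prev, class_x_curr))
print(class_average_score(class_y_curr), class_average_gain(class_y_prev, class_y_curr))
```

In this toy data, class X has the higher average score while class Y has the higher average gain, which is why the two measures can rank the same classrooms differently. Neither calculation controls for anything about the students.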

Duke said...

Now that's really interesting. I would think he'd want to clarify that these are NOT value-added procedures. Maybe that's why he keeps putting it in quotes.

In any case, it's clear he's trying to distance his methods from it.

Thx for sharing!

Galton said...

J, one more point here. Aren't we conceding too much when we accept year-over-year high-stakes assessment changes to define "student growth"? When I think back on some of the best teaching and "student growth" I have witnessed, I do not think of NJASK scores.
I think we have let the idiots corrupt our verbiage. Normative test score increases are not perfectly synonymous with "student growth." But perhaps you would need some experience with students to understand.
Thanks again for the work you do,

Duke said...

Galton, good point.

There are several layers to this argument:

- Student tests do not accurately reflect student growth! But...

- Even if you accept they do, VAM is a bad way to determine the results! But...

- As bad as VAM is, at least it makes an attempt to isolate teacher effects, unlike SGP! But...

And so on. We could do that all day.

My philosophy at this point is to just point out what's inconsistent and illogical, and proceed from there. The larger narrative will emerge on its own.

