
Sunday, March 24, 2013

How ONE Question on ONE Test Can Cost You Your Job

Last week, I pointed out a big math failure in the proposed New Jersey teacher evaluation system, AchieveNJ:
Slide 14 (annotations mine): [slide image]

Slide 20: [slide image]

OK, wait a minute...

Slide 14 says my students' Student Growth Percentiles, or SGPs, will be calculated on a scale of 1 to 99. We've already established that my evaluation will use the Median SGP for my class, even though that is potentially a hugely distorted metric (see NJDOE Math Fail #1 for more). But Slide 20 has an example where a teacher gets a "raw score" of 2.0 for their mSGP.

How in the hell did the NJDOE calculate this number?
The more I looked at this, the more absurd it seemed. If NJDOE is going to convert a number from 1-99 into a number between 1.0 and 4.0 with one decimal place, how will they do it? Will a "2.1" be allowed? If so, don't they understand that by making the SGP measure more variable than the other measures, they are giving it more weight in a high-stakes decision? Again: some of the evaluation, all of the decision.
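
Here's a quick simulation of that point (a sketch of my own, with made-up score distributions, not anything NJDOE has published): give the observation score and the converted mSGP equal weights, but let the mSGP vary far more, and watch which one actually drives the final rating:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Observation scores: 1-4 scale, tightly clustered (most teachers land near 3)
obs = np.clip(rng.normal(3.0, 0.3, n), 1.0, 4.0)

# Converted mSGP: also on a 1.0-4.0 scale, but spread across the whole range
msgp = rng.uniform(1.0, 4.0, n)

# Equal 50/50 weights, as in the AchieveNJ examples
final = 0.5 * obs + 0.5 * msgp

print(np.corrcoef(obs, final)[0, 1])   # ~0.3: observations barely move the rating
print(np.corrcoef(msgp, final)[0, 1])  # ~0.95: the mSGP drives it
```

Equal weights on paper; in practice, the noisier component owns the outcome.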

Well, it looks like someone at NJDOE is reading the Jazzman, because a new raft of promotional materials about AchieveNJ is now available from the department. And here's a slide from one of the presentations that addresses this very issue (my annotation):

[slide image]

"Guidance is forthcoming..." Nice use of the passive voice; avoids having to take ownership of the decision...

So NJDOE has at least acknowledged what was always a central problem, spelled out here by Bruce Baker:
First, the standard evaluation model proposed in legislation requires that objective measures of student achievement growth necessarily be considered in a weighting system of parallel components. Student achievement growth measures are assigned, for example, a 40 or 50% weight alongside observation and other evaluation measures. Placing the measures alongside one another in a weighting scheme assumes all measures in the scheme to be of equal validity and reliability but of varied importance (utility) – varied weight. Each measure must be included, and must be assigned the prescribed weight – with no opportunity to question the validity of any measure. [1] Such a system also assumes that the various measures included in the system are each scaled such that they can vary to similar degrees. That is, that the observational evaluations will be scaled to produce similar variation to the student growth measures, and that the variance in both measures is equally valid – not compromised by random error or bias. In fact, however, it remains highly likely that some components of the teacher evaluation model will vary far more than others if by no other reasons than that some measures contain more random noise than others or that some of the variation is attributable to factors beyond the teachers’ control. Regardless of the assigned weights and regardless of the cause of the variation (true or false measure) the measure that varies more will carry more weight in the final classification of the teacher as effective or not. In a system that places differential weight, but assumes equal validity across measures, even if the student achievement growth component is only a minority share of the weight, it may easily become the primary tipping point in most high stakes personnel decisions. [emphasis mine]
Basically, NJDOE reads this and says: "OK, Professor, we understand the SGPs are more variable than the observations, and that's going to be a problem. But we'll solve that by converting the SGPs into a measure that only varies as much as the other parts of the evaluation! See, problem solved!"

Except the problem isn't solved at all: it's made far, far worse, because the only way to make the conversion is to assign cut points in the SGPs!

Let's say, for example, that NJDOE comes up with a conversion table that looks like this:

"Raw" mSGP"Converted" mSGP
1-241
25-492
50-743
75-994
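
And here's just how knife-edged that table is (to be clear: these cut points are my invention for illustration; NJDOE hasn't published any):

```python
def convert_msgp(msgp):
    """Convert a 1-99 median SGP to a 1-4 rating using the
    hypothetical cut points from the table above."""
    if not 1 <= msgp <= 99:
        raise ValueError("mSGP must be between 1 and 99")
    if msgp <= 24:
        return 1
    if msgp <= 49:
        return 2
    if msgp <= 74:
        return 3
    return 4

print(convert_msgp(49))  # 2
print(convert_msgp(50))  # 3: one point of mSGP, a whole rating category higher
```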


What's that? Your mSGP is a "49"? Oh, so close, but too bad! Enjoy your next career...

It really doesn't matter where you put the cut points: a difference of one point in mSGP is enough to tip the measure. What's worse, the mSGP is the median of a class's SGP scores, not the average. Remember when I showed that two classes with very different average growth could still have the same median SGP?

[chart: each class's SGP scores, with medians marked]

Here are two classes with very different average SGPs, but the same median SGP (Ms. Jones's blue diamond is hidden by Ms. Smith's red box). But let's suppose that the student with the mSGP in Ms. Jones's class misses just one question on the NJASK. His SGP dips down ever so slightly, and so does Ms. Jones's mSGP. If that dip occurs right at the cut point, guess what?

[chart: the same classes after the dip, Ms. Jones's median now just below the cut point]

You can barely see it, but that middle student in Ms. Jones's class is right below the cut point; her entire evaluation changed because of a tiny adjustment in one student's test scores! Yes, Ms. Jones, we know you have higher average "growth" in your class than Ms. Smith, but you still got a "2" on your evaluation, and Ms. Smith got a "3." And budget cuts mean we have to RIF someone; start working on your resume...
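
Here's the same scenario in miniature (the SGP values are invented for illustration; the cut at 50 is the hypothetical one from the table above):

```python
from statistics import mean, median

# Invented SGPs: Ms. Jones's class shows far more growth on average
jones = [20, 35, 50, 88, 95]
smith = [44, 48, 50, 52, 55]

print(mean(jones), mean(smith))      # 57.6 vs. 49.8: very different averages
print(median(jones), median(smith))  # 50 vs. 50: identical mSGPs

# The middle student in Ms. Jones's class misses one more question on the
# NJASK, and his SGP slips from 50 to 49...
jones[2] = 49
print(median(jones))  # 49: below the hypothetical 50-74 cut, a "2" instead of a "3"
```

One question, one student, one point: a different rating for the whole class's teacher.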

There is an easy way to solve all of this: don't force a principal to act on the data using a top-down system dictated by the NJDOE. More in a bit...

7 comments:

  1. This is what happens when there is no peer review, no robust pilots, and a rush to DO ANYTHING, SOMETHING NOW, BECAUSE THE HOUSE IS ON FIRE!! These people really don't know what they are doing.

  2. Commuting Teacher,

On the contrary, these people know exactly what they're doing: they're the ones who've set the house on fire, hope to let it burn to the ground, and then re-develop it in their interests.

  3. SGP is not a valid measure of anything related to teacher contributions. http://schoolfinance101.wordpress.com/2011/09/02/take-your-sgp-and-vamit-damn-it/

  4. CA, if I had the power, I would force every parent and teacher to read that post. Thx! And thx CT and MF for the comments.

  5. Jazzman,
    Great.
Also, teachers who teach high-level classes (Algebra...in grade six) will have a class with students who have very small "academic peer group" sizes. Small numbers add variability. Ask Daniel Kahneman or Howard Wainer about Bill Gates's misguided efforts in advocating for small schools. Gates wasted a billion of his reform dollars and countless more in taxpayer money because he did not understand variance and "regression toward the mean".
Plus, those high-scoring students will be hitting ceiling effects, because the NJASK was NEVER designed to be sensitive to small score changes at its upper limits!
Pop off and Schulman are intelligent enough to know the problems: will they put courage and truth ahead of self-interest and speak up?

  6. Galton: exactly right. Of course, in this idiotic scheme, if you have a high-level class, are you going to push your students, or have them review over and over again what they've already done? When your livelihood is at stake?

    At this point, I don't count on anyone at NJDOE speaking truth to power.

Any calculation based on one year's worth of change in test scores is likely to be highly random. Could you write about that? Even if one believed in growth scores measuring teacher quality - which I don't - this is a fatal flaw.

