I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Tuesday, August 13, 2013

NY State Tests: Misguided Mixing of Norms and Criteria

ADDING: I continue my discussion here.

Before I go any further, I think I need to regroup. Most of you probably have a good familiarity with what I'm going to discuss; however, I think it may be helpful to step back for a moment and think about some of the basics before we proceed. I know it will be helpful to me.

When we last left our discussion of the crashing of New York State's test scores, I was explaining how NYSED and its commissioner, Reformy John King, were using the Triple Lindy to set the cut scores for the state:


  • Start with "college and career ready," an ill-defined phrase that could mean just about anything.
  • Leap to freshman year GPA in selected courses at a limited number of four-year colleges. Could be graded on or off a curve (normative or criteria-based - more on this later); varies widely between professors, schools, and courses; doesn't necessarily indicate whether the student's entire college experience was "successful."
  • Spring to SAT/PSAT scores, somewhat correlated to first year college GPA, but a normative assessment (meaning a set number of students must score at each percentile - someone's got to lose). This is a test, by the way, tightly correlated to family income.
  • Bounce to 8th grade NY State test scores, which are given three years before the SAT.
  • Carom (got a thesaurus?) to 3rd through 7th NY State test scores, which would assume all children follow the same learning trajectory.
  • Jounce (SAT word!) to teacher/principal evaluations and school evaluations and student retention decisions.
In that post, I glossed over what is really the most significant problem with all this: it's dangerous to use a norm-referenced test like the SAT to set the benchmarks for a criterion-referenced test, which is what the NY State test should be.

That is, admittedly, a jargon-filled sentence. Let's break it down:

There are basically two types of tests: norm-referenced and criterion-referenced.

A criterion-referenced test is a test of mastery: can the student do something we want her to do? Can she, for example:

  • Accurately multiply two three-digit numbers together?
  • Define a list of 20 vocabulary words?
  • Play back a rhythm with a steady tempo?
  • Step forward with the opposite foot when throwing a ball?
  • Calculate an object's speed at a specific time during its fall?

Almost every test a teacher gives a student at school is criterion-referenced: the teacher wants to know who can do what and with what degree of accuracy. For example, Mr. Crabtree might give his math students a "B" on a test if they can accurately calculate the area of different simple polygons 80% of the time. Here's the key: it is conceivable that every student in his class can meet this objective. The only thing that keeps a student from achieving a passing score on a well-written criterion-referenced test is her ability to do what the test is asking her to do.
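Here is a minimal sketch of that idea in Python. The student labels and scores are made up for illustration; only the 80% bar comes from the example above. The point to notice is that each student's result depends on her own score and nothing else.

```python
# Hypothetical data: percent of polygon-area problems each student solved correctly.
scores = {"Student A": 82, "Student B": 93, "Student C": 100, "Student D": 80}

MASTERY_CUTOFF = 80  # the 80% bar from Mr. Crabtree's example


def meets_criterion(score, cutoff=MASTERY_CUTOFF):
    """Pass/fail depends only on the student's own score, never on classmates."""
    return score >= cutoff


results = {name: meets_criterion(score) for name, score in scores.items()}
pass_rate = sum(results.values()) / len(results)

print(results)
print(f"Pass rate: {pass_rate:.0%}")
```

With these made-up numbers every student clears the bar, which is exactly what a criterion-referenced test allows.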

A norm-referenced test, however, is a test that judges all students against each other: how well did the student do in relation to other students? It's impossible to have only one student ever take a norm-referenced test, because she has to be judged in the context of other students. Is she in the 99th percentile, or the 10th?

The SAT is the most famous norm-referenced test: each section has a mean score of 500, the average score for a representative population of students who took it when it was developed (the SAT was last "re-centered" in 1995). Here's the key: on a norm-referenced test, there are always high scorers, and there are always low scorers. A norm-referenced test should yield a distribution of scores shaped roughly like a bell curve.

Let's go back to Mr. Crabtree's math class. It's possible that every one of his students can calculate the area of a simple polygon accurately at least 80% of the time. Within that class, however, he may have students who got an 82% on his criterion-referenced test, or a 93%, or a 100%. If he "grades on a curve," there must be students who, even though they demonstrated their mastery of math, are "below average."
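To make the contrast concrete, here is a small sketch with a hypothetical five-student class. The scores are invented (three of them echo the 82%, 93%, and 100% above), and "grading on a curve" is simplified here to comparing each score against the class mean, which is an assumption about how a curve might be applied, not a description of any real grading scheme.

```python
from statistics import mean

# Hypothetical class: every score clears the 80% mastery bar.
scores = {"A": 82, "B": 93, "C": 100, "D": 85, "E": 90}

MASTERY_CUTOFF = 80

class_mean = mean(scores.values())

for name, score in scores.items():
    mastered = score >= MASTERY_CUTOFF
    relative = "below average" if score < class_mean else "at or above average"
    print(f"{name}: {score}%  mastered={mastered}  ({relative})")

# Students who met the criterion but still land below the class average.
below_average = [name for name, score in scores.items() if score < class_mean]
```

Everyone in this made-up class has mastered the material, yet the relative comparison still labels some of them "below average."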

Because norm-referenced tests like the SAT are ostensibly useful for finding out where a student places among his peers, they are used for decisions like college admissions and scholarships, where the tester wants to find the relative positions of the test takers (whether the tests do a good job is a highly controversial matter that I'm not going to broach right now). But if the tests are going to be useful, they have to be constructed in a way that yields results where a few students are high, a few are low, and most are in the middle. It has to be that way; otherwise, the test is useless.

Now, the New York State tests, like all tests mandated by No Child Left Behind, are accountability measures. They are administered by the state to hold students to account for their learning, teachers to account for their teaching, and schools to account for their results. If the student or the teacher or the school isn't demonstrating learning, there will be consequences.

It makes no sense, therefore, to use a norm-referenced test as an accountability measure. After all, someone's got to be below average on a normalized test; someone has got to lose, even if everyone is learning. If all of Mr. Crabtree's students can calculate an area, and they all pass a test, but Jimmy's score on that test is lower than Billy's, it doesn't mean that Jimmy hasn't learned what he is supposed to learn.
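A rough sketch of why relative standing can hide learning: suppose every student in a hypothetical group improves by the same 10 points. The scores, the 80-point cutoff, and the simple "percent scoring strictly below" definition of percentile rank are all assumptions chosen for illustration.

```python
def percentile_rank(score, all_scores):
    """Percent of the group scoring strictly below `score`."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)


def pass_rate(all_scores, cutoff):
    """Fraction of the group at or above a fixed cutoff."""
    return sum(1 for s in all_scores if s >= cutoff) / len(all_scores)


CUTOFF = 80                        # fixed criterion (hypothetical)
before = [60, 70, 75, 85, 90]      # hypothetical scores
after = [s + 10 for s in before]   # everyone learns; every score rises 10 points

ranks_before = [percentile_rank(s, before) for s in before]
ranks_after = [percentile_rank(s, after) for s in after]

print("Percentile ranks before:", ranks_before)
print("Percentile ranks after: ", ranks_after)
print("Pass rate before:", pass_rate(before, CUTOFF))
print("Pass rate after: ", pass_rate(after, CUTOFF))
```

In this toy case the percentile ranks do not move at all, while the pass rate against the fixed cutoff rises. The student at the bottom is still at the bottom, even though he learned as much as everyone else.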

In the same way: if Mr. Fishbush's students score, on average, higher than Mr. Crabtree's class, it doesn't mean Mr. Crabtree didn't do his job. Aside from the fact that Mr. Fishbush's students may have scored higher for reasons having nothing to do with Mr. Fishbush, we also have to acknowledge that students learned what we wanted them to in Mr. Crabtree's class.

Let me try yet again: Scottie Pippen was not a bad basketball player just because Michael Jordan was better. Roger Moore was not a bad James Bond because some people liked Sean Connery more. Merlot isn't a bad wine just because you think Pinot Noir tastes better with lamb. And a child is not necessarily being "left behind" just because other kids are further ahead on standardized tests.

One of the biggest problems I see in the education policy world right now is that far too many people say they know the difference between norm- and criterion-referenced tests, but then proceed to act in ways that betray their ignorance. Which brings us back to New York...

If the NY State tests are accountability measures - and most assuredly, they are - then they must be criterion-referenced. It would make no sense whatsoever to make the test norm-referenced, because even though we want all children to succeed, someone would have to lose. If the goal is to hold everyone accountable, we must have a test where everyone can succeed.

That's the central problem with the Triple Lindy method NYSED is using above. The state is tying their benchmark for "proficiency" on what should be a criterion-referenced test to a norm-referenced test: the SAT (and PSAT). When you set the place for the cut score based on a relative measure like the SAT, someone has got to lose.

This is the point Carol Burris, a NY State Principal of the Year, made recently in the Washington Post: the reason NYSED Commissioner John King could predict the results of the state tests even before they were scored was that he set up the tests to mirror the results of norm-referenced exams. Burris reprints the chart NYSED used:


To be "proficient" in math, NY State students have to score at a level equivalent to a 540 on the math section of the SAT (adjusted for grade level, one of the last parts of the Triple Lindy). Remember, a 500 is the mean or average score; the test is designed to ensure that just about half of the test takers get below a 500, and half score above 500. NYSED seems to believe that everyone should be above average, a logical impossibility.
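For a back-of-the-envelope sense of what a cut at 540 implies, here is a sketch that treats section scores as normally distributed around the 500 mean. The standard deviation of 100 is an assumption made purely for illustration; it is not a figure from NYSED or from the chart, and real score distributions are only approximately normal.

```python
from statistics import NormalDist

MEAN = 500  # section mean, as described above
SD = 100    # ASSUMPTION for illustration only; not a figure from NYSED
CUT = 540   # the math benchmark cited above

sat_model = NormalDist(mu=MEAN, sigma=SD)

share_above_mean = 1 - sat_model.cdf(MEAN)
share_at_or_above_cut = 1 - sat_model.cdf(CUT)

print(f"Share above the mean (500): {share_above_mean:.1%}")
print(f"Share at or above the cut (540): {share_at_or_above_cut:.1%}")
```

Under these assumptions, only about a third of the reference group lands at or above the cut. The exact share depends on the assumed spread, but any cut placed above the mean of a normed scale leaves most of that group below it by construction.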

Let me add a caveat here: theoretically, it is possible to translate that 540 into a criterion-referenced benchmark. It's theoretically possible to say: "A 540 means the student will be able to do x, y, and z." But as a practical matter, this is not what's happening at all; in fact, it's exactly the opposite. The Triple Lindy is not setting the cut score to a criterion; it's setting the cut score to a norm. How do I know this?

Because the "higher-order thinking" questions on the exams aren't designed to test a skill: they're designed to create a normalized distribution of scores. More in a bit.


ADDING: Emailing back and forth today, I am reminded that, in a practical sense, all tests are more or less norm-referenced. Stand by...

ADDING MORE: And here's the next part of this discussion. And it's got Will Ferrell in it, so it can't be all bad...

4 comments:

jcg said...

Wonderful post on the idiocy of NYC test correlations. They seem to be making it up as they go along- crazy!

For a data-head, your explanation of testing stats is understandable for the average reader. Too bad our edu-aristocrats have no understanding of psychometrics and are too blind to their religion to understand the implications of their decisions.


Bonny Buffington said...

I have been beating this drum ever since I learned that the scores on Ohio's statewide tests are derived scores... in other words, our tests are norm-referenced. The tests are specifically designed to yield results on a normal curve. There is no conceivable way we can use these results to evaluate anyone. Thanks for explaining it so well. I have a notion that most legislators wouldn't place too high on the bell curve for math.

Fred Smith said...

Thank you for making these technical points so clearly. Newspapers don't touch this subject or anything they deem to be too "arcane."

I want to add to your observation that the NYS ELA and math exams were normative in nature rather than criterion-referenced or standards-based.

Apart from "benchmarking" the statewide exams to the NAEP or SAT, it also appears that the standard setting process (as described by Carol Burris, you, and an actual participant, Dr. Maria Hopkins Baldassarre), defining the cut scores that delineate "performance levels" depended in part on item statistics generated by students who took the test.

This ties the cutoff points to the way the other students performed-- a normative model rather than a mastery, minimum competency or criterion-referenced formulation. In so doing, this year's common core standard (a four-foot high bar) will vary from next year's, reflecting the achievement level of the next cohort.

In other words--it's a floating standard--the same problem the state had for a decade under NCLB and a contradiction of what is commonly thought of as a standard.

Fred Smith

navigio said...

Great post!

That said, I am really glad you added the caveat at the end. As soon as you said the central problem is tying the criterion-referenced cutoff to a norm-referenced result I was waiting for your caveat. :-)

From my perspective, it is not a problem to do this (as long as the test thereafter remains criterion-referenced) because norm-referenced and criterion-referenced imply very different things. A norm-referenced test will 'adjust' its results over time to the test-taking population (by maintaining the bell curve irrespective of any learning that might have happened). The key phrase is 'over time'.

In contrast, a criterion-referenced test is tied to a static measure, even from year to year. If anything changes in the test-taking population (learning, teaching, or whatever else one wants to attribute to such changes), it is reflected in histograms of the test results (e.g., proficiency rates).

Put another way, criterion-referenced tests are designed to measure CHANGE in 'achievement'. Norm-referenced are not. That is a critical distinction.

It is also important to understand that cutoff points for performance definitions are arbitrary. Because the goal is to measure the change relative to a static measure, it does not necessarily matter where the cutoff is set (i.e., how the performance is 'pegged' initially). (That is obviously not entirely true; a criterion-referenced test will not do a very good job measuring change if 100% of the students pass it. Alternatively, it can have a nefarious impact on education if its initial results are too low, but generally speaking, change can be measured irrespective of whether it's from 10% to 20% or 50% to 60%.) Personally, I think the 30% pass rate was intentional. This is eerily similar to the pass rates in our state's (California) standardized tests when they were first 'pegged' a bit over a decade ago. My guess is that test designers feel that is high enough to not be too negative, but still gives lots of room to measure change. It's even possible one of the reasons common core is coming now (and these kinds of changes come fairly regularly) is we need a continuous 're-pegging' in order to give us something to strive for. In CA we have some schools and even districts where the proficiency rates are nearing perfection. There is no longer any ability to measure those students' 'learning'.

Now, that all said, I am not defending anything. Personally, I think there might be evidence that these tests are not really as criterion-referenced as we think they are (there are a number of ways in which that can play out). That would be nothing short of a disaster if it can be proven. I also believe it is more difficult to create a criterion-referenced test when the criteria are measured in a more abstract manner. I think even if the CC standards are better, the test will likely be less accurately reflecting those standards (I read a post earlier today from an experienced teacher who got a 3rd grade problem wrong). Anyway, I await with bated breath the follow up from your email discussion. You did a great job of describing this. Kudos!