Before I go any further, I think I need to regroup. Most of you probably have a good familiarity with what I'm going to discuss; however, I think it may be helpful to step back for a moment and think about some of the basics before we proceed. I know it will be helpful to me.
When we last left our discussion of the crashing of New York State's test scores, I was explaining how NYSED and its commissioner, Reformy John King, were using the Triple Lindy to set the cut scores for the state:
- Start with "college and career ready," an ill-defined phrase that could mean just about anything.
- Leap to freshman year GPA in selected courses at a limited number of four-year colleges. Could be graded on or off a curve (normative or criteria-based - more on this later); varies widely between professors, schools, and courses; doesn't necessarily indicate whether the student's entire college experience was "successful."
- Spring to SAT/PSAT scores, somewhat correlated to first year college GPA, but a normative assessment (meaning a set number of students must score at each percentile - someone's got to lose). This is a test, by the way, tightly correlated to family income.
- Bounce to 8th grade NY State test scores, which are given three years before the SAT.
- Carom (got a thesaurus?) to 3rd through 7th NY State test scores, which would assume all children follow the same learning trajectory.
- Jounce (SAT word!) to teacher/principal evaluations and school evaluations and student retention decisions.
In that post, I glossed over what is really the most significant problem with all this: it's dangerous to use a norm-referenced test like the SAT to set the benchmarks for a criterion-referenced test, which is what the NY State test should be.
That is, admittedly, a jargon-filled sentence. Let's break it down:
There are basically two types of tests: norm-referenced and criterion-referenced. A criterion-referenced test is a test of mastery: can the student do something we want her to do? Can she, for example:
- Accurately multiply two three-digit numbers together?
- Define a list of 20 vocabulary words?
- Play back a rhythm with a steady tempo?
- Step forward with the opposite foot when throwing a ball?
- Calculate an object's speed at a specific time during its fall?
Almost every test a teacher gives a student at school is criterion-referenced: the teacher wants to know who can do what and with what degree of accuracy. For example, Mr. Crabtree might give his math students a "B" on a test if they can accurately calculate the area of different simple polygons 80% of the time. Here's the key: it is conceivable that every student in his class can meet this objective. The only thing that keeps a student from achieving a passing score on a well-written criterion-referenced test is her ability to do what the test is asking her to do.
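If it helps to see that in concrete terms, here is a minimal Python sketch of how a criterion-referenced cutoff behaves. The student names and scores are made up for illustration; the only number taken from the example above is the 80% bar. Notice that the check for each student never looks at anyone else's score.

```python
# Hypothetical fraction of area problems each student answered correctly.
# These names and numbers are invented for illustration only.
scores = {"Jimmy": 0.82, "Billy": 0.95, "Ana": 0.88}

MASTERY_THRESHOLD = 0.80  # the 80% bar from Mr. Crabtree's example


def meets_criterion(fraction_correct, threshold=MASTERY_THRESHOLD):
    """Return True if the student met the fixed mastery bar."""
    return fraction_correct >= threshold


# Each student is judged only against the bar, never against classmates.
results = {name: meets_criterion(s) for name, s in scores.items()}
all_passed = all(results.values())

print(results)
print("Everyone met the objective:", all_passed)
```

Nothing in that logic stops the whole class from passing, which is the point of a criterion-referenced test.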
A norm-referenced test, however, is a test that judges all students against each other: how well did the student do in relation to other students? It's impossible to have only one student ever take a norm-referenced test, because she has to be judged in the context of other students. Is she in the 99th percentile, or the 10th?
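Here is a rough Python sketch of that idea, using one common definition of percentile rank (the share of test takers who scored strictly below you). The two cohorts are invented numbers, and `percentile_rank` is my own helper, not anything a testing company publishes. The same raw score lands in a very different place depending on who else took the test.

```python
def percentile_rank(score, cohort):
    """Percentage of scores in the cohort that fall strictly below `score`."""
    below = sum(1 for s in cohort if s < score)
    return 100.0 * below / len(cohort)


# Two invented groups of test takers; the student's raw score is the same in both.
strong_cohort = [70, 75, 80, 85, 90, 92, 95, 97, 98, 99]
weak_cohort = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85]

same_raw_score = 85
rank_in_strong = percentile_rank(same_raw_score, strong_cohort)
rank_in_weak = percentile_rank(same_raw_score, weak_cohort)

print(f"Among strong scorers: {rank_in_strong:.0f}th percentile")
print(f"Among weak scorers:   {rank_in_weak:.0f}th percentile")
```

The student did exactly the same work both times; only the comparison group changed.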
The SAT is the most famous norm-referenced test: each section has a mean score of 500, the average score for a representative population of students who took it when it was developed (the SAT was last "re-centered" in 1995). Here's the key: on a norm-referenced test, there are always high scorers, and there are always low scorers. A norm-referenced test should yield a shape similar to a bell curve:
Because norm-referenced tests like the SAT are ostensibly useful for finding out where a student places among his peers, they are used for decisions like college admissions and scholarships, where the tester wants to find the relative positions of the test takers (whether the tests do a good job is a highly controversial matter that I'm not going to broach right now). But if the tests are going to be useful, they have to be constructed in a way that yields results where a few students are high, a few are low, and most are in the middle. It has to be that way; otherwise, the test is useless.
Now, the New York State tests, like all tests mandated by No Child Left Behind, are accountability measures. They are administered by the state to hold students to account for their learning, teachers to account for their teaching, and schools to account for their results. If the student or the teacher or the school isn't demonstrating learning, there will be consequences.
It makes no sense, therefore, to use a norm-referenced test as an accountability measure. After all, someone's got to be below average on a normalized test; someone has got to lose, even if everyone is learning. If all of Mr. Crabtree's students can calculate an area, and they all pass a test, but Jimmy's score on that test is lower than Billy's, it doesn't mean that Jimmy hasn't learned what he is supposed to learn.
In the same way: if Mr. Fishbush's students score, on average, higher than Mr. Crabtree's class, it doesn't mean Mr. Crabtree didn't do his job. Aside from the fact that Mr. Fishbush's students may have scored higher for reasons having nothing to do with Mr. Fishbush, we also have to acknowledge that students learned what we wanted them to in Mr. Crabtree's class.
Let me try yet again: Scottie Pippen was not a bad basketball player just because Michael Jordan was better. Roger Moore was not a bad James Bond because some people liked Sean Connery more. Merlot isn't a bad wine just because you think Pinot Noir tastes better with lamb. And a child is not necessarily being "left behind" just because other kids are further ahead on standardized tests.
One of the biggest problems I see in the education policy world right now is that far too many people say they know the difference between norm- and criterion-referenced tests, but then proceed to act in ways that betray their ignorance. Which brings us back to New York...
If the NY State tests are accountability measures - and most assuredly, they are - then they must be criterion-referenced. It would make no sense whatsoever to make the test norm-referenced, because even though we want all children to succeed, someone would have to lose. If the goal is to hold everyone accountable, we must have a test where everyone can succeed.
That's the central problem with the Triple Lindy method NYSED is using above. The state is tying its benchmark for "proficiency" on what should be a criterion-referenced test to a norm-referenced test: the SAT (and PSAT). When you set the cut score based on a relative measure like the SAT, someone has got to lose.
This is the point Carol Burris, a NY State Principal of the Year, made recently in the Washington Post: the reason NYSED Commissioner John King could predict the results of the state tests even before they were scored was that he set up the tests to mirror the results of norm-referenced exams. Burris reprints the chart NYSED used:
To be "proficient" in math, NY State students have to score at a level equivalent to a 540 on the math section of the SAT (adjusted for grade level, one of the last parts of the Triple Lindy). Remember, a 500 is the mean or average score; the test is designed to ensure that just about half of the test takers get below a 500, and half score above 500. NYSED seems to believe that everyone should be above average, a logical impossibility.
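To get a feel for what a 540 cut score means on a scale like that, here is a back-of-the-envelope Python sketch. It rests on two assumptions of mine that are not from NYSED or the College Board: that scores follow a perfect bell curve, and that the standard deviation is 100 points. Real score distributions will differ, so treat the output as an illustration of the logic, not a prediction of actual pass rates.

```python
from statistics import NormalDist

ASSUMED_MEAN = 500  # the mean described above
ASSUMED_SD = 100    # ASSUMPTION for illustration; not stated in this post
CUT_SCORE = 540     # the SAT math equivalent cited for "proficient"

# Idealized bell curve of section scores under the assumptions above.
sat_section = NormalDist(mu=ASSUMED_MEAN, sigma=ASSUMED_SD)

# Share of test takers above a score = 1 minus the share at or below it.
share_above_mean = 1 - sat_section.cdf(ASSUMED_MEAN)
share_above_cut = 1 - sat_section.cdf(CUT_SCORE)

print(f"Share above {ASSUMED_MEAN}: {share_above_mean:.1%}")
print(f"Share above {CUT_SCORE}: {share_above_cut:.1%}")
```

Under those assumptions, only about a third of test takers clear the cut, and that share is baked into the scale itself. It does not move no matter how much the students have learned.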
Let me add a caveat here: theoretically, it is possible to translate that 540 into a criterion-referenced benchmark. It's theoretically possible to say: "A 540 means the student will be able to do x, y, and z." But as a practical matter, this is not what's happening at all; in fact, it's exactly the opposite. The Triple Lindy is not setting the cut score to a criterion; it's setting the cut score to a norm. How do I know this?
Because the "higher-order thinking" questions on the exams aren't designed to test a skill: they're designed to create a normalized distribution of scores. More in a bit.
ADDING: Emailing back and forth today, I am reminded that, in a practical sense, all tests are more or less norm-referenced. Stand by...
ADDING MORE: And here's the next part of this discussion. And it's got Will Ferrell in it, so it can't be all bad...