King has tried to prepare drivers for the big change in test scoring that is coming their way. He and Regents Chancellor Merryl Tisch have been warning that the passing rate for the New York tests is going to plummet.
To help us meet our national driving goals, a new generation of tests has been developed (I will refrain from making a bad pun on PARCC here...). But we need to test these tests; we need to determine if they are really testing the driving skills our nation needs.
To do that, NYSED/NYDMV convened a panel of experts to "benchmark" the state's driving exams. By comparing the new driving exam to other tests of driving skill, the state can attempt to gauge the validity of its new tests. And since the state is setting the rigorous standard for all drivers of "race track and highway ready," it's going to use the NASCAR license* exam as its primary benchmark.
Naysayers have already voiced their criticisms. "Why should we set such a high standard?" they ask. "The majority of New York's drivers don't need to perform at such a high level, and it's unrealistic to think that everyone will have the skill to race in a NASCAR series!" They make a good point: traditionally, NASCAR has been reserved for those drivers who want to drive fast and who have the skill to do so. Does it make sense for everyone to become a NASCAR driver?
Which brings up another issue: the NASCAR test is designed to show us who the best of the best are*. It's a norm-referenced test: while it can help determine a driver's level of skill, its primary function is to pick out the best drivers from the pack for admission into NASCAR. A regular drivers license test, however, is criterion-referenced: its function is to show whether the test taker has the requisite skill to be a safe driver. The worst licensed driver on the road should still have the skills necessary to drive safely, even if every other driver is better.
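The distinction is easy to see in a toy sketch. This is illustrative only (the scores, bar, and cutoff fraction are all invented, not anything any state actually uses): a criterion-referenced test passes everyone who clears a fixed bar, while a norm-referenced test picks out the top slice of the field, wherever that slice happens to fall.

```python
def criterion_pass(scores, bar=65):
    """Criterion-referenced: pass anyone who clears a fixed skill threshold."""
    return [s >= bar for s in scores]

def norm_pass(scores, top_fraction=0.2):
    """Norm-referenced: pass only the top slice of the field,
    regardless of absolute skill level."""
    cutoff = sorted(scores, reverse=True)[int(len(scores) * top_fraction) - 1]
    return [s >= cutoff for s in scores]

scores = [60, 70, 75, 80, 95, 90, 85, 72, 68, 99]

print(sum(criterion_pass(scores)))  # nearly everyone clears the fixed bar
print(sum(norm_pass(scores)))       # only the best of the best make the cut
```

Note that under the norm-referenced rule, how many pass is baked into the rule itself; under the criterion-referenced rule, how many pass depends entirely on how many test takers actually have the skill.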
Now, a very smart person pointed out to me yesterday that while we all say an assessment like a drivers test is criterion-referenced, it is, in fact, norm-referenced. Because when we "test the test," we inevitably look at the range and distribution of the results, and make decisions about whether the test was actually valid. If we give a driving test and, for example, only 10% of drivers pass, we're probably correct in thinking that the test has some problems and isn't valid. And if everyone is passing the test but we still see a lot of accidents on the road, we're justified in believing our test isn't valid either.
But, in either case, we can't preclude the possibility that the test is valid, even when we wind up with results we didn't expect and that don't match our other observations. It may well be that a test with a very high pass rate is perfectly fine, and our test takers have simply learned what they were supposed to learn. We have to thread a bit of an epistemological needle here: while it's fine for the results of a test to inform our belief in its validity, we shouldn't confuse a test whose purpose is to assess knowledge with a test whose purpose is to rank the test takers.
And I'd go a step further: as a practical matter, we shouldn't let worries about using normative measures to show predictive validity on a test keep us from declaring that the test is mostly criterion-referenced.
(Yeah, OK, that's a weird sentence... but I'm standing by it. And I know this is wonky, but I promise - the next post is about Michelle Rhee (I've seen my hit counters - admit it, you people love it when I do that!). Just stick with me for a bit - I promise, this is going somewhere...)
Again: a regular driver's license test and a NASCAR test serve two different purposes. The regular test is there to make sure all drivers can meet a certain level of skill; the NASCAR test is there to pick out the very best drivers. The intent of each test is markedly different. There's no need for the state to test whether regular drivers can handle speeds of 180 MPH, as those drivers won't ever go that fast anyway. And the purpose of the state test isn't to find the best drivers; its purpose is to set a standard for all drivers. But the NASCAR test should push the limits of drivers to see who rises to the top.
Is it wrong for NYSED/NYDMV to use another driving test, like the NASCAR test, as a benchmark? In theory, no - in fact, it's a good idea for the state to look at some sort of external validity test, even if it's mostly normative. Even if the NASCAR test is much harder, informed test creators can use the results of the state test and the NASCAR test to make some well-reasoned decisions about whether the state test is assessing what it is supposed to assess.
But they must always remember two things:
- The NASCAR test must push the limits of the test takers if it's going to sort out who is the best; it's going to necessarily demand skills at a much higher level than we need for the state test.
- While it's fine to use a normative assessment to inform our belief in the state test's validity, we have to remember our intent on the state test is not to rank test takers, but to assess whether they can demonstrate particular skills and knowledge.
Does everyone see where I'm going with this?
Um... no, not really...
Well, let's try to bring this back to the real world and see if it makes any sense...
When the New York State scores crashed last week, Commissioner King and Chancellor Tisch and SecEd Duncan all told us not to panic. Everything was fine: NY was just using a new, more "realistic" standard for the state exams. But how did they determine the new cut scores? How did they know where to "set the bar" on these new tests?
According to NY State documents and the members of the committee that set the cut points, they largely relied on normative assessments, like the SAT, that are (supposedly) correlated to a student's freshman GPA at a four-year college or university. Like our drivers test above, NYSED used another test, the SAT, as a benchmark for determining cut points on the state test.
Is this unacceptable? In theory, no: it's fine - and probably necessary - to benchmark the state test to another exam. But the state must keep in mind that there are dangers:
1) NYSED created a convoluted pathway from the benchmarks to the decisions that are mandated by the state test, something I call the Triple Lindy.
I could actually live with the Triple Lindy if NYSED gave any indication that every leap from benchmark to benchmark introduces error. But there is no such indication: the state is mandating that actions be taken on clearly defined cut scores. That tells us NYSED does not understand - or, worse, does not want to acknowledge - that tying together benchmarks derived from assessments with widely varying ranges, scales, and reliability introduces error into the system.
If you're going to force high-stakes decisions based on tests that are benchmarked this way, the very least you can do is acknowledge that those decisions are prone to error.
2) A normative assessment like the SAT - OK, a mostly normative assessment - is designed to yield a range of results; its primary purpose is to rank the test takers, not to assess their skills. That is a different goal than mostly criterion-referenced assessments like the NY State tests.
If the bar starts getting higher or lower for college admissions, the state tests run the risk of moving their bars as well. The state test is then no longer trying to meet its goal of assessing to a criterion reference (a goal it may never fully meet, but a worthy goal nevertheless). We run the risk of moving the criterion itself, rather than simply letting the normative assessment inform our judgment of whether the criterion is being assessed. Which brings us to our third danger:
3) The benchmark for the state exams was not "college and career ready," which I claim is an utterly fraudulent phrase. No, the real benchmark was set at "four-year college ready." Even though this is transparently obvious, no one pushing for these new benchmarks seems to want to admit it. Why is that?
Because when you tie a mostly criterion-referenced test like the state exam to a mostly norm-referenced test like the SAT, you run a great risk of setting the benchmark for the criterion-based test too high. You come dangerously close to creating an assessment that leans far more toward the norm-referenced side than the criterion-referenced side; you start creating and grading your exam on the belief that a certain number of the test takers must fail. Because of that, you push the envelope on the difficulty of test items. You must have really hard questions to sort out the very top of the distribution, and that creates the danger of conflating items designed to rank test takers with items that actually assess what you're trying to assess.
In our NASCAR example, we're benchmarking a test to determine if drivers can handle a car at speeds up to 65 MPH with a test where the very best drivers go 180 MPH. I contend there's going to be a natural tendency for test makers and benchmarkers to say: "Well, our very best drivers can go 180 MPH; why are we setting the speed limit for the state test so low? Let's push it up to 100 MPH; that way, we can be extra sure everyone who gets a license can handle 65 MPH because they passed the driving test at a much higher speed!"
I am convinced this is what happened in New York. About 30 percent of New Yorkers hold a four-year degree; about 30 percent of the students passed the tests. Coincidence? I don't think so. The highest standard has become the norm: because some students score high enough to get into four-year colleges, NYSED believes all students should score that high.
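Here's why pegging a criterion-referenced test to a normative benchmark is so dangerous, in another toy sketch (again, the numbers are invented for illustration): if the "proficiency" cut is effectively set at a percentile of the distribution, the pass rate barely moves even when every single test taker genuinely improves. A fixed criterion, by contrast, registers the gain.

```python
def pass_rate_fixed_bar(scores, bar):
    """Criterion-referenced: pass rate against a fixed threshold."""
    return sum(s >= bar for s in scores) / len(scores)

def pass_rate_percentile(scores, percentile=70):
    """Norm-referenced: cut score pegged to a percentile of the
    distribution, so roughly the top (100 - percentile)% pass."""
    cut = sorted(scores)[int(len(scores) * percentile / 100)]
    return sum(s >= cut for s in scores) / len(scores)

before = list(range(50, 100, 5))   # ten students, scores 50..95
after = [s + 10 for s in before]   # everyone genuinely learns more

# The fixed bar shows the improvement...
print(pass_rate_fixed_bar(before, 80), pass_rate_fixed_bar(after, 80))
# ...but the percentile-pegged cut reports the same "proficiency" rate,
# because the cut score floats up along with the scores.
print(pass_rate_percentile(before), pass_rate_percentile(after))
```

In other words, once the bar is tied to the distribution, the pass rate stops telling you anything about learning; it only tells you where somebody decided to put the cut.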
But shifting the bar by itself isn't enough: you aren't addressing the root causes that create success. Simply yelling "Drive faster!" isn't going to make people better drivers. And you haven't justified your goals: should all drivers be able to go 100 MPH? Why, if the speed limit is 65? Should all students be "ready" to go to a four-year college? Why, when so many necessary jobs don't actually require a four-year degree?
There's a great deal to unpack when it comes to the ramifications of NYSED's decision to proceed this way. I'd make the case that this was a political maneuver: it allows King and Duncan and Tisch and others to claim that the issue behind our country's lack of social mobility remains readiness for college (blame the teachers unions!), rather than access to college and access to good jobs that don't require college. Big topic; more later.
Let's end for now on this: the state test cut scores were tied to mostly normative measures. Even if you believe their current benchmarks are valid, the question going forward is whether any changes in proficiency rates on the exams will indicate an increase in learning, or simply reflect monkeying around with where the bar for "proficiency" is set on the basis of normative assessments.
If 40% of New York's students are "proficient" next year, will that mean an additional 10% of the student population is "readier"? Or will it really mean that NYSED just moved the benchmark (and maybe even changed test items - more later) to allow more students into "proficiency"? And if you can't answer that question with any certainty: why in the world would you mandate high-stakes decisions based on these tests?
* Again, I'm making this up. I really don't know anything about getting a NASCAR license, or even if there is testing involved. Just play along, OK?