I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Wednesday, January 20, 2016

PARCC, Teacher Evaluations & "Junk Statistics": An Expert Speaks

A little background on what you're about to read: 

In the spring of last year, nj.com posted a story about the PARCC exam -- the new, semi-national standardized test that has been a large source of controversy -- and how it would affect teacher evaluations in the state.

I happened to notice a really great comment from "Rutgers Professor" just below the article. The scholar in question is Gerald A. Goldin. I don't know him personally, but I had certainly heard about him: he is the definition of a scholar, distinguished in both his field and the teaching of his field.

It bothered me, frankly, that someone as knowledgeable as Goldin, who had written a genuine essay within his comment, wasn't featured more prominently in this post. Since I'm at Rutgers myself, I contacted him to ask if I could publish what he wrote.

I didn't hear back from him until later in the fall; he was away, and then I was away, and you know how that goes. Dr. Goldin, however, was very gracious and agreed to let me reprint what he wrote. I only apologize that I haven't done so until now.

What you're about to read is important; Gerald Goldin's opinion on how PARCC will be used matters. I know the state has dropped the percentage of SGP used in a teacher's total evaluation to 10 percent, but even that's too much for a method that is fundamentally invalid. 

I'm honored to host this essay on my blog. Thanks, Dr. Goldin, for this contribution.

* * *

An 8th thing to know: Junk statistics

I read with interest the on-line article (March 16, 2015), “7 things to know about PARCC’s effect on teacher evaluations” at www.nj.com/education/.

As a mathematical scientist with knowledge of modeling, of statistics, and of mathematics education research, I am persuaded that what we see here could fairly be termed "junk statistics" -- numbers without meaning or significance, dressing up the evaluation process with the illusion of rigor in a way that can only serve to deceive the public.

Most New Jersey parents and other residents do not have the level of technical mathematical understanding that would enable them to see through such a pseudoscientific numbers game. It is not especially reassuring that only 10% of the evaluation of teachers will be based on such numbers this year, 20% next year, or that a teacher can only be fired based on two years’ data. Pseudoscience deserves no weight whatsoever in educational policy. It is immensely troubling that things have reached this point in New Jersey.

I have not examined the specific plans for using PARCC data directly, but am basing this note on the information in the article. Some of the more detailed reasons for my opinion are provided in a separate comment.

In short, I think the 8th thing to know about PARCC’s effect on teacher evaluation is that the public is being conned by junk statistics. The adverse effects on our children’s education are immediate. This planned misuse of test results influences both teachers and children.


Gerald A. Goldin, Ph.D.
Distinguished Professor
Mathematics, Physics, and Mathematics Education
Rutgers – The State University of New Jersey


Why the reportedly planned use of PARCC test statistics is “junk science”:

First, of course, we have the “scale error” of measurement in each of the two tests (PARCC and NJ-ASK). Second, we have random error of measurement in each of the two tests, including the effects of all the uncontrollable variables on each student’s performance on any given day, resulting in inattention, misreading of a question, “careless mistakes,” etc. Third, we have any systematic error of measurement – possibly understating or overstating student competency – that may be present in the test instruments, may be different in the two instruments, and may vary across the test scales.

The magnitude of each of these sources of error is about doubled when the difference of two independently-obtained scores is taken, as it is in calculating the gain score. In addition, since two different test instruments are being used in the calculation, taking the difference of the scores requires some derived scale not specified in the article, which can introduce additional error. These sources of error mean that each student's individual gain score has a wide "error bar" as a measure of whatever it is that each test is designed to measure.
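A minimal sketch of how error grows when two scores are subtracted, under assumptions that are not in the original comment: each test's random error is taken to be independent and normally distributed with the same hypothetical standard deviation (`sd_error`), and the function name is made up for illustration. Under those assumptions the variance of the gain score's error is about twice that of a single test, so its standard deviation is larger by a factor of about 1.41.

```python
import random
import statistics


def simulate_gain_error(sd_error=10.0, n_students=100_000, seed=0):
    """Return (SD of one test's error, SD of the gain score's error).

    Assumption (hypothetical): both tests have independent, normally
    distributed random error with the same standard deviation.
    """
    rng = random.Random(seed)
    err_year1 = [rng.gauss(0.0, sd_error) for _ in range(n_students)]
    err_year2 = [rng.gauss(0.0, sd_error) for _ in range(n_students)]
    # The gain score is a difference, so its error is the difference of errors.
    gain_err = [e2 - e1 for e1, e2 in zip(err_year1, err_year2)]
    return statistics.pstdev(err_year1), statistics.pstdev(gain_err)


single_sd, gain_sd = simulate_gain_error()
variance_ratio = (gain_sd / single_sd) ** 2
print(f"SD of one test's error:   {single_sd:.2f}")
print(f"SD of gain score's error: {gain_sd:.2f}")
print(f"Variance ratio (gain / single test): {variance_ratio:.2f}")
```

Systematic error and the error introduced by a derived common scale are not modeled here; they would only widen the error bar further.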

Fourth, we have “threshold effects” – some students are advanced well beyond the content intended to be measured by each test, while others are far behind in their knowledge of that content. The threshold effects contribute to contaminating the data with scores that are not applicable at all. Note that while the scores of such students may be extremely high or low, their difference from one year to the next may not be extreme at all. Thus they can contribute importantly in calculating a median (see below).

A fifth effect results from students who did not take one of the two tests. Their gain scores cannot be calculated, and consequently some fraction of each teacher’s class will be omitted from the data. This may or may not occur randomly, and in any case it contributes to the questionability of the results.

Sixth is the fact that many variables other than the teacher influence test performance – parents’ level of education, socioeconomic variables, effects of prior schooling, community of residence, and so forth. Sophisticated statistical methods sometimes used to “factor out” such effects (so-called “value added modeling”) introduce so much additional randomness that no teacher’s class comes close in size to being a statistically significant sample. But without the use of such methods, one cannot properly attribute “academic growth” or its absence to the teacher.

According to the description in the article, the student gain scores are then converted to a percentile scale ranging from 0 to 100, by comparison with other students having "similar academic histories." It is not clear to me whether this means simply comparison with all those having taken both tests at the same grade level, or also means possibly stratifying with respect to other, socioeconomic variables (such as district factor groupings) in calculating the percentiles. Then the median of these percentile scores is found across the teacher’s class. Finally the median percentile of gain scores is converted to a scale of 1-4; it is not specified whether one merely divides by 25, or some other method is used.

However, a seventh objection is that test scores, and consequently gain scores, are typically distributed according to a bell-shaped curve (that is, approximately a normal distribution). Percentile scores, on the other hand, form a level distribution (that is, they are uniformly distributed from 0 to 99). This artificially magnifies the scale toward the center of the bell-shaped distribution, and diminishes it at the tails. Small absolute differences in gain scores near the mean gain score result in important percentile differences, while large absolute differences in gain scores near the extremes result in small percentile differences.
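To make the magnification concrete, here is a small sketch under assumed numbers that do not come from the article: gain scores are taken to be exactly normal with mean 0 and standard deviation 10, and `percentile` is a hypothetical helper. The same 5-point difference in gain score is compared near the center of the distribution and out in the tail.

```python
from statistics import NormalDist

# Hypothetical distribution of gain scores (mean and SD are assumptions).
gains = NormalDist(mu=0.0, sigma=10.0)


def percentile(gain):
    """Percentile rank (0-100) of a gain score under the assumed distribution."""
    return 100.0 * gains.cdf(gain)


# The same 5-point gain difference, measured in two places.
center_jump = percentile(5.0) - percentile(0.0)   # near the mean
tail_jump = percentile(25.0) - percentile(20.0)   # two SDs above the mean

print(f"Percentile change near the mean: {center_jump:.1f}")
print(f"Percentile change in the tail:   {tail_jump:.1f}")
```

With these assumed parameters the 5-point difference near the mean moves a student by about 19 percentile points, while the same difference in the tail moves a student by fewer than 2.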

There are more complications. The distribution of performance on one or both tests may be skewed (that is, have nonzero skewness), so that it is not a symmetrical bell-shaped curve. How wide the distribution of scores is (the “sample standard deviation”) is very important, but does not seem to have been taken into account explicitly. Sometimes this is done in establishing the scales for reporting scores, in which case one thereby introduces an additional source of random error into the derived score, particularly when distributions are skewed.

Eighth, and perhaps most tellingly, the median score as a measure of central tendency is entirely insensitive to the distribution of scores above and below it. A teacher of 25 students with a median “academic growth” score of 40 might have as many as 12 students with academic growth scores over 90, or not a single student with an academic growth score above 45. To use the same statistic in both cases is patently absurd.
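The two classes described above can be written out directly. The individual scores below are invented for illustration; only the class size of 25 and the median of 40 come from the text.

```python
import statistics

# Class A: 12 students below the median, one at 40, 12 with growth scores of 95.
class_a = [20] * 12 + [40] + [95] * 12

# Class B: 12 students below the median, one at 40, 12 with growth scores of 44.
class_b = [20] * 12 + [40] + [44] * 12

for name, scores in (("A", class_a), ("B", class_b)):
    over_90 = sum(1 for s in scores if s > 90)
    print(
        f"Class {name}: median={statistics.median(scores)}, "
        f"students over 90={over_90}, highest score={max(scores)}"
    )
```

Both classes report the same median of 40, even though one has 12 students above 90 and the other has none above 45.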

These comments do not address the validity of the tests, which some others have criticized. They pertain to the statistics of interpreting the results.

The teacher evaluation scores that will be derived from the PARCC test will tell us nothing whatsoever about teaching quality. But their use tells us a lot about the quality of the educational policies being pursued in New Jersey and, more generally, the United States.

Gerald A. Goldin, Ph.D.
Distinguished Professor, Rutgers University
Mathematics, Physics, Mathematics Education
