In the spring of last year, nj.com posted a story about the PARCC exam -- the new, semi-national standardized test that has been a large source of controversy -- and how it would affect teacher evaluations in the state.
I happened to notice a really great comment from "Rutgers Professor" just below the article. The scholar in question is Gerald A. Goldin. I don't know him personally, but I had certainly heard about him: he is the definition of a scholar, distinguished in both his field and the teaching of his field.
It bothered me, frankly, that someone as knowledgeable as Goldin, who had written a genuine essay within his comment, wasn't featured more prominently in this post. Since I'm at Rutgers myself, I contacted him to ask if I could publish what he wrote.
I didn't hear back from him until later in the fall; he was away, and then I was away, and you know how that goes. Dr. Goldin, however, was very gracious and agreed to let me reprint what he wrote. I can only apologize that I haven't done so until now.
What you're about to read is important; Gerald Goldin's opinion on how PARCC will be used matters. I know the state has dropped the percentage of SGP used in a teacher's total evaluation to 10 percent, but even that's too much for a method that is fundamentally invalid.
I'm honored to host this essay on my blog. Thanks, Dr. Goldin, for this contribution.
* * *
An 8th thing to know: Junk statistics
I read with interest the on-line article (March 16, 2015),
“7 things to know about PARCC’s effect on teacher evaluations” at www.nj.com/education/.
As a mathematical scientist with knowledge of modeling, of
statistics, and of mathematics education research, I am persuaded that what we
see here could fairly be termed "junk statistics" -- numbers without
meaning or significance, dressing up the evaluation process with the illusion
of rigor in a way that can only serve to deceive the public.
Most New Jersey parents and other residents do not have the
level of technical mathematical understanding that would enable them to see
through such a pseudoscientific numbers game. It is not especially reassuring
that only 10% of the evaluation of teachers will be based on such numbers this
year, 20% next year, or that a teacher can only be fired based on two years’ data. Pseudoscience deserves no weight whatsoever in educational policy. It is
immensely troubling that things have reached this point in New Jersey.
I have not examined the specific plans for using PARCC data
directly, but am basing this note on the information in the article. Some of
the more detailed reasons for my opinion are provided in a separate comment.
In short, I think the 8th thing to know about PARCC’s effect
on teacher evaluation is that the public is being conned by junk statistics.
The adverse effects on our children’s education are immediate. This planned
misuse of test results influences both teachers and children.
Sincerely,
Gerald A. Goldin, Ph.D.
Distinguished Professor
Mathematics, Physics, and Mathematics Education
Rutgers – The State University of New Jersey
-------------------------------------------------------------------------------------------------------------------
Why the reportedly planned use of PARCC test statistics is
“junk science”:
First, of course, we have the “scale error” of measurement
in each of the two tests (PARCC and NJ-ASK). Second, we have random error of
measurement in each of the two tests, including the effects of all the
uncontrollable variables on each student’s performance on any given day,
resulting in inattention, misreading of a question, “careless mistakes,” etc. Third, we have any systematic error
of measurement – possibly understating or overstating student competency – that may be present in the test
instruments, may be different in the two instruments, and may vary across the
test scales.
The magnitude of each of these sources of error is about
doubled when the difference of two independently-obtained scores is taken, as
it is in calculating the gain score. In addition, since two different test
instruments are being used in the calculation, taking the difference of the
scores requires some derived scale not specified in the article, which can
introduce additional error. These sources of error mean that each student's
individual gain score has a wide "error bar" as a measure of whatever
it is that each test is designed to measure.
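As a rough numerical illustration of how these error sources combine (the numbers below are invented, and Python is used only for the arithmetic; this is a sketch, not a model of the actual tests):

    import numpy as np

    rng = np.random.default_rng(0)
    n_students = 100_000

    true_ability = rng.normal(0, 1, n_students)   # hypothetical "true" level, in SD units
    true_growth = 0.3                             # every simulated student grows by the same amount

    error_sd = 0.5                                # assumed measurement error of a single test
    score_year1 = true_ability + rng.normal(0, error_sd, n_students)
    score_year2 = true_ability + true_growth + rng.normal(0, error_sd, n_students)

    gain = score_year2 - score_year1              # the "gain score"

    print(f"error SD of one test score : {error_sd:.2f}")
    print(f"SD of the gain score       : {gain.std():.2f}")   # roughly error_sd * sqrt(2)

Even though every simulated student grows by exactly the same amount, the individual gain scores spread widely: the error variances of the two tests add in the difference, so each gain score carries a wide error bar.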
Fourth, we have “threshold effects” – some students are
advanced well beyond the content intended to be measured by each test, while
others are far behind in their knowledge of that content. The threshold effects
contribute to contaminating the data with scores that are not applicable at all.
Note that while the scores of such students may be extremely high or low, their difference from one year to the next may not be extreme at all. Thus they can figure importantly in the calculation of a median (see below).
A fifth effect
results from students who did not take one of the two tests. Their gain scores
cannot be calculated, and consequently some fraction of each teacher’s class
will be omitted from the data. This may or may not occur randomly, and in any case it further calls the results into question.
Sixth is the fact
that many variables other than the teacher influence test performance –
parents’ level of education, socioeconomic variables, effects of prior
schooling, community of residence, and so forth. Sophisticated statistical
methods sometimes used to “factor out” such effects (so-called “value added
modeling”) introduce so much additional randomness that no teacher’s class comes
close in size to being a statistically significant sample. But without the use
of such methods, one cannot properly attribute “academic growth” or its absence
to the teacher.
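A small simulation makes the class-size problem concrete; the gain scale below (mean 50, spread 20) and the classes of 25 are invented purely for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    class_size = 25
    n_teachers = 1_000

    # every "teacher" draws students from the identical gain distribution:
    # in this simulation there are no real differences in teaching quality
    gains = rng.normal(loc=50, scale=20, size=(n_teachers, class_size))
    class_medians = np.median(gains, axis=1)

    print(f"spread of class medians (SD): {class_medians.std():.1f} points")
    print(f"lowest and highest medians  : {class_medians.min():.0f}, {class_medians.max():.0f}")

Identical "teachers" end up with noticeably different class medians from sampling noise alone, before any of the uncontrolled background variables are even considered.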
According to the
description in the article, the student gain scores are then converted to a
percentile scale ranging from 0 to 100, by comparison with other students
having "similar academic histories." It is not clear to me whether
this means simply comparison with all those having taken both tests at the
same grade level, or also means possibly stratifying with respect to other,
socioeconomic variables (such as district factor groupings) in calculating the
percentiles. Then the median of these percentile scores is found across the
teacher’s class. Finally, the median percentile of gain scores is converted to a scale of 1-4; it is not specified whether one merely divides by 25, or some other method is used.
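Read literally, the described procedure might be sketched as follows, with the unspecified steps filled in by assumption (in particular, the divide-by-25 conversion is only one plausible reading, and the "similar academic histories" comparison is reduced here to a single statewide pool):

    import numpy as np

    def sgp_style_rating(class_gains, all_gains):
        # percentile rank of each student's gain against the comparison pool
        percentiles = [(all_gains < g).mean() * 100 for g in class_gains]
        median_percentile = np.median(percentiles)
        # the article does not say how the 0-100 median becomes a 1-4 score;
        # dividing by 25 (floored at 1) is assumed here
        return max(1.0, median_percentile / 25)

    rng = np.random.default_rng(2)
    all_gains = rng.normal(0, 1, 10_000)      # hypothetical statewide pool of gain scores
    class_gains = rng.normal(0.1, 1, 25)      # one hypothetical class of 25
    print(f"teacher rating (assumed scale): {sgp_style_rating(class_gains, all_gains):.2f}")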
However, a seventh objection is that test scores, and
consequently gain scores, are typically distributed according to a bell-shaped
curve (that is, approximately a normal distribution). Percentile scores, on the
other hand, form a level distribution (that is, they are uniformly distributed
from 0 to 99). This artificially magnifies the scale toward the center of the
bell-shaped distribution, and diminishes it at the tails. Small absolute
differences in gain scores near the mean gain score result in important
percentile differences, while large absolute differences in gain scores near
the extremes result in small percentile differences.
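A quick computation, with a standard normal curve standing in for the gain-score distribution (an assumption made only for illustration), shows the size of this magnification:

    from scipy.stats import norm

    def pct(z):
        # percentile of a gain score z standard deviations from the mean
        return 100 * norm.cdf(z)

    # the same absolute difference in gain score (0.2 SD), near the middle and in the tail
    print(f"near the mean: {pct(0.0):.1f} -> {pct(0.2):.1f}")
    print(f"in the tail  : {pct(2.0):.1f} -> {pct(2.2):.1f}")

Near the mean, a difference of 0.2 standard deviations moves a student about 8 percentile points; the identical raw difference in the tail moves the percentile by less than one point.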
There are more complications. The distribution of
performance on one or both tests may be skewed (this is called skewness), so that
it is not a symmetrical bell-shaped curve. How wide the distribution of scores
is (the “sample standard deviation”) is very important, but does not seem to
have been taken into account explicitly. Sometimes this is done in establishing
the scales for reporting scores, in which case one thereby introduces an
additional source of random error into the derived score, particularly when
distributions are skewed.
Eighth, and perhaps most tellingly, the median score as a
measure of central tendency is entirely insensitive to the distribution of
scores above and below it. A teacher of 25 students with a median “academic
growth” score of 40 might have as many as 12 students with academic growth
scores over 90, or not a single student with an academic growth score above 45.
To use the same statistic in both cases is patently absurd.
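The two hypothetical classes just described can be written out explicitly (the numbers are invented to match the example):

    import numpy as np

    class_a = np.array([10]*6 + [30]*6 + [40] + [92]*12)   # 12 of 25 students above 90
    class_b = np.array([35]*12 + [40] + [42]*12)           # not a single student above 45

    for name, scores in [("class A", class_a), ("class B", class_b)]:
        print(f"{name}: median = {np.median(scores):.0f}, "
              f"mean = {scores.mean():.1f}, students above 45 = {(scores > 45).sum()}")

Both classes produce the identical median of 40, and therefore the identical "growth" statistic for the teacher, despite entirely different distributions of student scores.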
These comments do not address the validity of the tests,
which some others have criticized. They pertain to the statistics of
interpreting the results.
The teacher evaluation scores that will be derived from the
PARCC test will tell us nothing whatsoever about teaching quality. But their
use tells us a lot about the quality of the educational policies being pursued
in New Jersey and, more generally, the United States.
Gerald A. Goldin, Ph.D.
Distinguished Professor, Rutgers University
Mathematics, Physics, Mathematics Education