teacherken's musings on everything

from a public HS teacher (Gov't, Religion, Soc. Issues), who is eclectic (Dem-leaning) politically and Quaker (& open) on everything else. Hope you enjoy what you find here.

Sunday, September 25, 2005

High stakes testing - what is it, and does it work?

For today’s educational diary, I thought I’d explore the subject of high stakes tests. In this posting I will explain the use of the terminology, and look at applications of such tests historically and more recently. Finally, after I discuss the implications of the most recent study of the current round of high stakes testing, I will post the press release about that study.

(cross-posted at dailykos, myleftwing and teacherken)

High stakes testing is a term normally applied to tests created externally from a school which have consequences for not achieving a certain level (a passing or cut score) such as not receiving credit for the course, not being promoted to the next grade or not graduating from high school. Please note the emphasis on the locus of the creation of the test. Such tests are not created by those delivering the instruction. When I create a test, it will be closely tied to the instruction that has gone on in my room, and will serve as a feedback mechanism not only for my students on how well they have learned, but also for me on how well I have prepared them via my instruction. Further - and this is of critical importance - the students and I receive the results in a timely manner. Except for final examinations, that affords me the opportunity to offer remedial instruction either to specific students or to an entire class should I detect any patterns of misunderstandings or gaps in the knowledge that I expect to be present.

Nowadays the term “high-stakes” has been applied to the entire regimen required under NCLB. Those annual tests in reading and math for students in grades 3-8 do not fit the normal definition of high stakes -- for the students. Yes, there are some states that will use those tests as a hurdle for promotion to the next grade, but there is no requirement under the Federal law to do so. The consequences are for the schools and school districts, with the possible punitive actions ranging from being marked as a school failing to make Adequate Yearly Progress to possible loss of part of Federal aid and/or having students being able to transfer out. Such punitive actions towards schools pre-date NCLB. In many states the testing regimen imposed upon schools after 1983 potentially could lead to schools being ‘restructured’ - administrators would be removed and all teachers would have to reapply for their jobs (even though stability of teaching staff correlates very strongly with academic success of students, but proponents of these punitive approaches rarely consulted the appropriate research). In some states, test scores would be only part of the criteria -- attendance, discipline issues and other factors would also be examined.

Clearly a heavily weighted final examination prepared by a teacher can have consequences such as not being promoted or even not graduating if that test results in an overall failing grade. Here I note that I try never to make any examination or project that is part of a course I teach worth more than about 20% of the grade for a marking period, simply because with adolescents the level of pressure that represents is unfair - what if circumstances surrounding that one test legitimately affect the ability of a student (a serious illness in the family, a pending divorce) or an entire school (the suicide or murder of a popular student or staff member) to perform successfully on that instrument on that day?

Externally composed tests having heavy consequences are not a new phenomenon. Here I will ignore the early entrance examinations, since while they represent a denial of an opportunity, they do not represent a refusal to grant credit for work already done. Even so, high stakes tests are not new. New York State has long had a series of examinations required in certain courses. Created under the supervision of the state’s board of education, known as the Regents, these Regents examinations were a common part of high school for those of us who graduated from high school in the Empire State in the period of the 1950’s and later. I don’t remember how many I took, but I do remember 2nd year Latin and French, a batch of math courses, World History, American History and English. I do not believe that we were required to pass the test to receive credit for the course, but that the score on the exam - which was marked locally at least for those of us in the high school class of 1963 - was included in averaging our final grade. Passing a certain number of Regents examinations resulted, if memory serves, in the receipt of a higher level of diploma, but lack of such examination success was not per se a bar to receiving any diploma.

After the publication of (the badly flawed) A Nation At Risk in 1983, many states began to move in the direction of creating a series of tests that had to be passed in order for students to graduate from high school, or in some cases for promotion to the next grade. This phenomenon was intensified after the release of the Goals 2000 program, which I remind readers was largely a production of a National Governors’ Association led by Bill Clinton. Most statewide high stakes tests were thus already in place before the passage of the so-called No Child Left Behind legislation. It is important to note that one of the arguments for the passage of NCLB was the claim by Bush and those working with him that Texas had, by using high stakes testing, improved its educational outcomes, something that was clearly disputable at the time, as the work of Walt Haney at the Lynch School of Education at Boston College and others demonstrated.

One should not make the mistake of assuming that imposition of such testing regimens was a partisan action. I have already referred to Bill Clinton’s role as head of NGA. There were other prominent Democratic governors supporting such an approach, people such as Jim Hunt of NC. And former TN Governor Lamar Alexander served as Sec Ed under the first President Bush at the time much of this expansion of high stakes testing took place.

Let us now proceed to the question of whether the imposition of testing regimens with serious consequences - whether for the students, the schools, the teachers, and/or any combination of the foregoing - leads to improved educational outcomes. If one looks only at the scores on the tests themselves, one might argue that it does. Many states and school systems will proudly point at their increasing performance on the state tests. But often this is illusory. Let me offer several explanations for this phenomenon.

First, when any testing regimen is introduced, those providing instruction for the courses to be tested have to adjust to how the test operates. Mere familiarity with the test allows for better test preparation, so higher scores do not necessarily represent better knowledge. Here I note that merely retaking the old (1600 point maximum) SAT resulted in an improved score in excess of 40 points, and thus it was pretty easy for a company like Princeton Review [disclosure - I taught and tutored for them for 3 years] to guarantee an improvement of 100 points.

Second, there is far too much evidence of setting the initial passing levels too high, then lowering them, so that comparing the pass rate from one year to the next is not a valid comparison. This has clearly happened, among other places, in Texas on a broad scale and in Virginia on the American History tests (and as a middle school teacher in Arlington, Virginia I was a beneficiary of the latter).
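To see why a lowered passing level makes year-to-year pass rates incomparable, here is a minimal sketch in Python. The ten scores and both cut scores are hypothetical, invented only for illustration; they are not data from Texas, Virginia, or any real test. The point is that the very same set of scores yields a very different pass rate once the cut score moves.

```python
# Hypothetical scores for the same ten students; not real test data.
scores = [48, 55, 58, 62, 64, 67, 71, 75, 82, 90]


def pass_rate(scores, cut_score):
    """Fraction of students scoring at or above the cut score."""
    passed = sum(1 for s in scores if s >= cut_score)
    return passed / len(scores)


rate_high_cut = pass_rate(scores, 70)  # hypothetical initial cut score
rate_low_cut = pass_rate(scores, 60)   # hypothetical cut score after lowering

print(f"Pass rate at cut score 70: {rate_high_cut:.0%}")
print(f"Pass rate at cut score 60: {rate_low_cut:.0%}")
```

Nothing about what these students know changed between the two lines of output; only the hurdle did.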

Third, comparing the scores of this year’s 3rd graders to last year’s is not a valid comparison of improvement in learning or teaching, because it is not the same children, and the cohorts can vary significantly in prior knowledge, demographics, etc. Comparisons of 4th grade scores to 3rd grade scores can also be fraught with problems, unless one can demonstrate a clear vertical relationship between the two tests and can isolate how much of the improvement is due to 4th grade instruction (the so-called value added component) and not to outside learning experiences that have occurred since the 3rd grade test.

All of these issues are independent of any structural, reliability or validity issues on the tests themselves. As an example of a structural issue, several years ago the Maryland High School Assessment in Government had forms (different versions) of the test that varied significantly in their composition. One form had 2 longer writing pieces and about 6 short ones (Extended Constructed Responses and Brief Constructed Responses), while another had no ECRs and something like 13 to 16 BCRs. The latter form required so much writing that students who had never done that much writing at one time had their hands cramping up, and it is not certain that one can draw valid inferences across the different forms of the examination.

A test is reliable if it consistently gives the same results. That does not mean it is accurate. I have two scales in my house. One appropriately measured my weight this morning at 179 pounds. The other measured me at 138 - it consistently measures weight 41 pounds low. It is reliable in doing so. But it is not an accurate measurement, and hence using it (without the known correction factor) would lead me to draw the invalid inference that I was far skinnier than I actually am. Without reliability there can be no validity - no ability to draw valid inferences. But reliability is in itself insufficient.
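The difference between reliability and accuracy can be put in numbers with a small Python sketch. The true weight of 179 pounds and the 41-pound offset come from the example above; the individual repeated readings, and the third "erratic" scale, are hypothetical and added only for illustration. Reliability shows up as a small spread between readings, accuracy as a small bias from the true value.

```python
from statistics import mean, pstdev

TRUE_WEIGHT = 179  # pounds, from the example in the post

# Hypothetical repeated readings, invented for illustration.
readings = {
    "accurate scale": [179, 179, 180, 178, 179],
    "consistent but biased scale": [138, 138, 139, 137, 138],
    "erratic scale": [150, 205, 171, 190, 164],
}


def summarize(values, true_value):
    """Return (bias, spread): average error and standard deviation."""
    bias = mean(values) - true_value
    spread = pstdev(values)
    return bias, spread


summary = {name: summarize(vals, TRUE_WEIGHT) for name, vals in readings.items()}
for name, (bias, spread) in summary.items():
    print(f"{name}: bias {bias:+.1f} lb, spread {spread:.1f} lb")
```

The biased scale has a tiny spread (it is reliable) but a bias of -41 pounds (it is not accurate). The erratic scale happens to average close to the true weight, yet any single reading from it is useless, which is the sense in which there can be no validity without reliability.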

Further, one has to know what the test is measuring. Is it in fact measuring underlying knowledge and/or ability, or is it measuring the ability to take the test? Requiring an answer to be constructed a certain way may accurately measure how well the student can construct in that format but may not measure the content that is in theory being assessed. Imagine, if you can, being given a test which is printed upside down or backwards, and you are not allowed to spin the paper around or use a mirror. The score you achieve on that test will largely be a function of how well you can translate the image into something you can comprehend, and far less a measure of your actual knowledge of the correct answers. Far too many tests -- and here I must acknowledge that these include teacher-created tests - place too much weight on the ability to interpret the test or to give an answer in a fixed format, and thus are not necessarily measuring the knowledge and learning that in theory the test is supposed to assess.

Okay, enough on all that. Let’s assume that we have accounted for all the issues I have described above. What valid information are we obtaining from high stakes testing? How can we be certain that improved results on statewide tests represent an improvement in learning and/or teaching? Perhaps the best way is to apply over time an independent measure of the same domains. One can, for example, look at SAT scores in a state or NAEP scores. NAEP is a national assessment in which state participation is voluntary, yet the sampling within a state is random: a state does not have to participate, but those that do test a randomly selected group of schools and students. Thus in a snapshot it gives a standard against which improvements on state tests can be compared. For this purpose it is probably the single most effective measure. SATs have two problems. First, outside of the two coasts, many colleges/universities do not require SATs (often accepting ACTs which are less expensive) so that only the elite students are likely to take the tests. Thus SAT scores might remain static even though learning had improved. Further, as more students are encouraged to consider colleges and take the SAT in communities on the coasts (such as the County in which I teach in Maryland), the overall SAT average could decrease even as the disaggregated scores of all groups increase - this is due to the phenomenon known as Simpson’s Paradox (google it if you want to know more).
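For readers who would rather see Simpson’s Paradox than google it, here is a minimal Python sketch. All of the numbers are hypothetical, chosen only to illustrate the effect; they are not real SAT data, and "group A" and "group B" are placeholder labels. Both groups raise their average from one year to the next, but because many more students from the lower-scoring group take the test in the second year, the overall average falls.

```python
# Hypothetical numbers, chosen only to illustrate the effect; not real SAT data.
# Each entry: group -> (number of test takers, average score)
year_1 = {"group A": (100, 1100), "group B": (20, 850)}
year_2 = {"group A": (100, 1110), "group B": (100, 870)}


def overall_average(groups):
    """Average across all test takers, weighting each group by its size."""
    total_points = sum(n * avg for n, avg in groups.values())
    total_takers = sum(n for n, _ in groups.values())
    return total_points / total_takers


avg_1 = overall_average(year_1)
avg_2 = overall_average(year_2)

for group in year_1:
    print(f"{group}: {year_1[group][1]} -> {year_2[group][1]}")
print(f"overall: {avg_1:.0f} -> {avg_2:.0f}")
```

Every group improved, yet the headline number dropped, simply because the mix of who took the test changed.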

This past week an important study was released jointly by the Education Policy Studies Laboratory at Arizona State University and the Great Lakes Center for Education Research and Practice. I have enclosed the entire press release below. To summarize, there is no convincing evidence to support the theory behind high stakes testing, that the increased pressure will lead to increased student achievement.

I suggest that you take the time to read the press release. You may even decide to use the embedded link to read the report. I look forward to any comments.

Note: I have removed from the press release the phone numbers that were listed, although I have kept in the names and emails that were provided.

ARIZONA STATE UNIVERSITY
EDUCATION POLICY STUDIES LABORATORY (EPSL)
Education Policy Research Unit (EPRU)

****NEWS RELEASE--FOR IMMEDIATE RELEASE****

NATIONAL STUDY FINDS NO CONVINCING EVIDENCE THAT HIGH-STAKES TESTING PRESSURE LEADS TO INCREASED STUDENT ACHIEVEMENT

Contact: Teri Moblo (email) tmoblo@mea.org or
Alex Molnar (email) epsl@asu.edu

TEMPE, Ariz. (Tuesday, September 20, 2005) - The pressure associated with high-stakes testing has no real impact on student achievement, according to "High-Stakes Testing and Student Achievement: Problems for the No Child Left Behind Act," a study released by the Education Policy Studies Laboratory at Arizona State University and the Great Lakes Center for Education Research and Practice.

Under the federal No Child Left Behind Act (NCLB), high-stakes test scores are the indicators used to measure school and student success on a statewide basis. Low test scores can result in severe consequences for schools under this law. The underlying theory behind this type of accountability program is that the pressure of high-stakes testing will increase student achievement. But according to this study, there is no convincing evidence that this kind of pressure leads to increased student achievement.

The authors, Sharon L. Nichols, University of Texas at San Antonio, and Gene V Glass and David C. Berliner, Arizona State University, studied the National Assessment of Educational Progress (NAEP) test data from 25 states. The results suggest that increases in testing pressure are related to increased retention in grade and drop-out rates. The authors found that states with the highest proportions of minority students implemented accountability systems that exerted the greatest pressure. Thus, the negative impacts of high-stakes testing will disproportionately affect America's minority students.

"This most recent research demonstrates that the pressure to produce high test scores as a result of No Child Left Behind hasn't helped students to achieve more, and has served to limit the depth and breadth of what students are being taught in schools around the country," said Teri Moblo, director of the Great Lakes Center.

Four key findings emerged from the study:

*States with greater proportions of minority students tend to implement accountability systems that exert greater pressure. An unintended consequence of this patterning is that problems associated with high-stakes testing risk disproportionately affecting America's minority students.

*Increased testing pressure is related to increased retention and drop-out rates. High-stakes testing pressure is negatively associated with the likelihood that eighth and 10th graders will move into 12th grade.

*NAEP reading scores at the fourth- and eighth-grade levels were not improved as a result of increased testing pressure. This finding was consistent across African American, Hispanic, and White student subgroups.

*Weak correlations between pressure and NAEP performance for fourth-grade mathematics and the unclear relationship for eighth-grade mathematics are unlikely linked to increased testing pressure. While a weak relationship emerged at the fourth-grade level, a systematic link between pressure and achievement was not established. For eighth-grade performance, the lack of clarity in the relationship may arise from the interplay of other indirect factors. Inconsistent performance gains in these cases are far more likely the result of indirect factors such as teaching to the test, drill and practice, or the exclusion of lower-achieving students than pressure.

What the researchers could not find is also of great importance. Many different analyses were unable to establish any consistent link between the pressure to score high in a particular state and that state's student performance on the NAEP. That means that claims of a clear-cut link between pressure and performance cannot be considered credible.

"A rapidly growing body of research evidence on the harmful effects of high-stakes testing, along with no reliable evidence of improved performance by students on NAEP tests of achievement, suggests that we need a moratorium in public education on the use of high-stakes testing," said Nichols, the study's lead author.

Find this document on the web at:
http://www.asu.edu/educ/epsl/EPRU/documents/EPSL-0509-105-EPRU.pdf

Posted by: teacherken / 8:20 AM

Comments:

One other possible cause of change in test scores not mentioned here has to do with how teachers actually administer the tests. In all of my education course observations that have taken place during testing weeks, I have seen teachers, usually with the best of intentions, cheat in administering the tests. The stakes are high for them, for their schools, and (in Philadelphia) for their students, and they are hardly a disinterested party administering this testing instrument. Teachers have asked me to read the test to certain children, repeated questions over and over when they realized their kids weren't getting it when they read it to themselves, and have emphasized certain choices or syllables in phonics questions. As the stakes rise and teachers gain a better knowledge of the test, it seems to me that this kind of thing would increase as well.

I came here from Daily Kos, by the way, and I love your blog and your links.

# posted by Anonymous : 12:00 PM
