Language Testing Review of the Year 2012

This site designed and maintained by
Prof. Glenn Fulcher



What A Year!
How time passes. It seems like only yesterday that I was watching the news reviews of the year and thought it would be a good idea to do something similar for language testing. And what a fantastic year it's been. Whenever I've switched the radio on this year, testing has been right up there along with war, famine, and Eurogeddon. Some of the stories have raised serious issues, even if they also raise eyebrows and a giggle. So let's get straight to it. You know what to expect. This isn't the kind of page you click away from in a few seconds. It's going to demand a little bit of concentration. I've read The Shallows and believe that the internet is capable of achieving so much more. So here's my critical news analysis, and I hope you enjoy my personal selection of the year's stories.

The Alpha and Omega
Unlike the worrying start to 2011, the first thing I spotted in 2012 was an unusually perceptive article in the Jakarta Post by Setiono Sugiharto of Atma Jaya Catholic University. It transpires that the TOEFL paper-based test is being used in Indonesia by the Trade Ministry, which requires its civil servants to get a minimum of 600. Apart from the fact that this version has been retired for many years, Sugiharto asks the quite reasonable question: what evidence exists that a test designed to measure academic English is valid for the assessment of the language needed to communicate in a Trade Ministry? This is an excellent question, asked in an interesting context. The generic question is one that I've been asking for some time, and articulated in this article. The APA Standards for Educational and Psychological Testing clearly state that "If a test is used in a way that has not been validated, it is incumbent on the user to justify the new use, collecting evidence if necessary" (Standard 1.4). This always requires explicit changes to the validation argument, which are made available for public and expert scrutiny. This clearly isn't the case for the Trade Ministry. The test is probably a poor predictor of performance, and the content is not relevant. But the likely misjudgements are not quite as serious as in the health sector. I've had cause to complain in previous years about the use of tests of academic English in screening doctors and nurses. And this is where the Omega part of the story comes in; because towards the end of the year the Australian press reported that the Australian Nursing and Midwifery Federation says the English test creates workplace risks. Once again, this is the use of IELTS to make judgments about fitness to practice without the provision of any validation evidence. To date only one research paper has been written on this topic, and that's an alignment study, which isn't validation. There are serious issues at stake here.
Not if, but when, there is another communication-related tragedy involving health practitioners who have been certified as proficient to practice using IELTS, litigation is likely to follow. I have recently been researching case law that may be called upon in this context, and this will be published in 2013 as: Fulcher, G. (2013). Language Testing in the Dock. In Kunnan, A. (Ed.) The Companion to Language Testing. London: Wiley-Blackwell. This is a topic I'm sure I'll be returning to in future years. In the meantime, see this scenario for further thoughts.

Cheating - Deja Vu all over again!
This is one item that recurs each and every year. So this year I'm not going to dwell on it. Suffice it to say that it's happening all over the world, and as the video on the left from Insider Health testifies, it's not just language tests. Professionals are also caught - it's quite chilling. So let's keep this one fairly short this year, by asking who's been cheating, and why they got into the news. In February the news broke in Korea that a test prep company was being prosecuted for sending staff to take the TOEIC and other tests, recording questions and listening texts, so they could be used in test preparation. The company in question is appropriately called "Hackers". In Hong Kong suspects were jailed when they were caught trying to impersonate the real candidates to get higher grades on the TOEFL. The IELTS/Curtin scandal hasn't gone away, as another man was jailed for taking bribes and acting as an intermediary in changing IELTS scores. And in Vietnam, schools were found guilty of creating "fake" TOEIC tests, undercutting the prices of ETS, and issuing certificates. Finally, take a look at this new take on cheating from the Huffington Post. I still prefer my own analysis. But it's up to you to compare and decide. Worth a seminar or a debate?

FCAT Power Bars
One of the more wonderful stories of the year is the attempt of one Florida primary school to improve student scores on standardized tests by handing out Power Bars, containing an apple-flavoured cereal brain snack guaranteed to raise test performance. Of course, the school knew that the bars had no traceable effect on test scores. But the placebo effect is well known in medicine and many other areas of human performance. An interesting idea indeed! I couldn't find any trace of a follow-up to find out if this worked. Perhaps it's an area worthy of some serious research? In the meantime, it's a great story.

Teaching to the Test
The last item raises the question of how we prepare learners to take tests. This wonderful article from the School Book in March is one that testers and teachers should read. It should remind us that tests are always abstractions that are essential because of logistics, but should never drive teaching. There is a strong argument for a test-driven curriculum which is winning through in many standards-based systems, but these are accountability-driven agendas that question the abilities of professionals to deliver without constant observation. It's not only in teaching, either. But we lose a great deal of the joy of learning in such environments. And testing is far too important to do badly, too often, or narrowly. That's precisely the point of all the newspaper articles about the race to the bottom, when it was discovered that UK examination boards were competing to make their examinations easier and more coachable. Why? Well, they'd sell more. It's the bottom line of the exam board's budget. Competition isn't always the solution when applied to education.

This is fascinating stuff, because we can see the tensions not only between educators and examination boards, but between politicians, and the rest of us as well. In this video clip President Obama addresses the vexed question of No Child Left Behind and what to do with it. I find this particular clip very instructive. There is definitely a recognition here that "teaching to the test" is in some way not really acceptable. But politicians always have, and always will, see the reform of educational systems as central to achieving utopia. If you don't believe me, you haven't read Plato's Republic. This is why the first work on polity was also about education. So the discussion is couched in the language of standards and accountability. Even the most socially liberal politicians of all Western nations seem unable to break away from this agenda. So another theme of 2012 has been the extension and deepening of the use of tests to hold educational systems, establishments, and teachers, to account. But as we saw above, there are unintended consequences. But enough of this. Let's move on to the bizarre, followed by the truly giggleworthy (I don't think there is such a word, but now I've used it on the internet, you never know when it might appear in a dictionary).

Leave it to the Unprofessionals
There are a couple of great stories here. A Greek woman resident in Ireland gets on a plane to Dublin in Barcelona. I know what you're thinking - this is the start of one of those rather poor pub jokes. Well, yes and no. Because it did happen. Aer Lingus staff at the check-in asked her to complete forms in both Greek and English to prove that her passport was not forged. This makeshift language test resulted in a complaint and a storm of protest. Rightly so. Within a very short period of time the language tests were withdrawn and apologies made. Language testing for identity is nothing new. It's done badly at best, and this is what happens when you let untrained personnel have a go using their own intuition. Which leads us to the second story, which is probably more serious, although equally bizarre. In August the UK government announced that border guards would test the English of incoming students to check "that their English meets the level of the test certificate they have submitted." Given that the students are probably jet-lagged, intimidated, and suffering from culture shock after visiting the toilets, facing a border guard with a grim face muttering something about why they want to enter the UK in a regional accent when they have just been presented with a valid visa, isn't going to encourage discussion of the great academic issues of the day. It's time the education and immigration secretaries understood assessment practice and theory before they introduce such daft innovations. And finally, when it comes to assessing the language of health practitioners, are reliable and valid tests of medical English going to be developed? No. The government is going to hand the responsibility over to "responsible officers" - another bureaucratic fudge.

Banned Words
Now this is a real hoot. The video on the left shows just what a gift these kinds of decisions are for the media. Basically, the New York Board of Education banned 50 words or phrases from being used in tests, and you can find the list by clicking on the previous link, or this one to the Washington Post with blog. The list included D.I.V.O.R.C.E, which was immediately jumped on by the Huffington Post, which enlisted the help of Twitter users to rubbish it. Really, even Billy Connolly couldn't have made it up. To cut to the chase, the press were up in arms, accusing the men at the Board of being language police and cowards. The ban was branded sensitivity nonsense. So what is going on here? Well, there is a serious point behind the ban, even though most of the words banned are foolish choices. It is common practice for test designers and examination boards to put test items through four separate reviews, one of which is called a "sensitivity review". The purpose is to identify items that may cause undue stress or harm to a particular subgroup of the test-taking population, particularly if that subgroup is defined by a protected characteristic, such as gender, religion, or disability. If there is any possibility that an individual from such a group may get a lower score not because of their ability on the construct of interest, but because of a negative reaction to the language or content of the item, there is construct-irrelevant variance in the test. This is definitely unfair, and can lead to litigation. So the question is really: can the use of any of these words reasonably be claimed to cause construct-irrelevant variance that would impact a protected subgroup of the population? Maybe a couple. But let's face it, the media were right in this case. It's purely pandering to minority groups, often with rather marginal spurious beliefs. If we do this for every pressure group in society we might as well just close the schools as well. But it all gets even stranger, when we turn to....

In 2012 it certainly looked like extreme silliness was focused on New York. What's it all about? Well, Pearson Assessment (yes, the Borg again) has got the contract for producing tests for New York, and on this year's 8th grade reading test there was a passage about a pineapple. As the New York Times tells the story, it's a take on Aesop's Fable, but in this version a Pineapple challenges the Hare to a race. The animals think that the Pineapple probably has a trick up its sleeve, so reckon it will win. The Pineapple doesn't move when the race starts, so they eat it. The moral of the story is that Pineapples don't have sleeves. The questions baffle the test takers and their teachers. The Huffington Post does a great job of rubbishing the questions and getting feedback on its blog. You can read the test questions by following this link as well. But just in case that disappears, you can also download a pdf of the passage and sample questions here.

This is one of the many news stories that hit the headlines in the US. I like the interviews with the kids! Anyway, Schoolbook reported that the tests had been scrapped, and also let us know that Pearson charges $32 million for testing services to New York. It was generally agreed that the test questions were ambiguous; but Pearson defended the test and the items. You can read the full text of their defence here. I merely quote this part:

"The Hare and the Pineapple" passage and associated items were chosen for the operational form. This was a sound decision in that "The Hare and the Pineapple" and associated items had been field tested in New York State, yielded appropriate statistics for inclusion, and it was aligned to the appropriate NYS Standard. "The Hare and the Pineapple" passage is intended to measure NYS Standard "interpretation of character traits, motivations, and behavior" and "eliciting supporting detail". The associated six multiple choice items are aligned to the NYS Reading Standards, specifically to Strand 2. The NYS performance indicator assigned to the items is "Interpret characters, plot, setting, theme, and dialogue, using evidence from the text".

This shows the deep weaknesses of many standards-based approaches to testing. Who made the decision that this particular passage and the questions were "aligned" to this standard? How was the decision made? How do we know the difficulty level is appropriate? Well, the clue is in the phrase "appropriate statistics". These statistics aren't difficult to generate, but they need interpretation, and they can't be used in isolation. Too many decisions are driven purely by statistical information interpreted in a mechanical fashion; and statistical determinism isn't good for us. The problem with psychometrics is that it is a discipline which believes it is a natural rather than a social science. Alas, it has become largely a collection of procedures without any substantive theory.
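To make the point about how easily such statistics are generated, here is a minimal Python sketch of two classical item statistics: facility (the proportion of candidates answering correctly) and discrimination (the correlation between an item and the total score on the remaining items). The response data and the function name are hypothetical, invented purely for illustration; nothing here is a claim about how Pearson or New York State actually computed their figures.

```python
from statistics import mean, pstdev

def item_statistics(responses):
    """Return (facility, discrimination) for each item.

    responses: one list of 0/1 item scores per candidate (hypothetical data).
    Facility is the proportion correct. Discrimination is the Pearson
    correlation between the item score and the total on the other items.
    """
    n_items = len(responses[0])
    results = []
    for i in range(n_items):
        item = [r[i] for r in responses]
        rest = [sum(r) - r[i] for r in responses]  # total excluding this item
        facility = mean(item)
        sd_item, sd_rest = pstdev(item), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            discrimination = None  # undefined when there is no variation
        else:
            cov = mean([x * y for x, y in zip(item, rest)]) - mean(item) * mean(rest)
            discrimination = cov / (sd_item * sd_rest)
        results.append((facility, discrimination))
    return results

# Six invented candidates, four invented items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]

for number, (fac, disc) in enumerate(item_statistics(responses), start=1):
    print(f"Item {number}: facility={fac:.2f}, discrimination={disc}")
```

In this invented data set the last item has a negative discrimination value: candidates who do well on the rest of the test tend to get it wrong. The number flags a problem, but it cannot say whether the item is ambiguous, miskeyed, or measuring something else entirely. That still takes human judgement.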

One final observation from The Washington Post. Pearson pre-tests some of its items, and includes others in operational forms to estimate item statistics. It turns out that the Pineapple passage was part of its pre-tested bank for inter-state comparisons. The article points out that the development of the passage and items was paid for by other states, and was then sold on to other states, resulting in multiple forms of revenue for the same item sets. They also report students recognizing sections of tests from pre-testing. This is clearly an ethical and practical minefield. Which is why some parents argue: "Our kids are being used as guinea pigs for the financial benefit of Pearson, to the detriment of their own educational experience". And so this is precisely the point at which we should consider other income streams....

Fruity Textbooks
As part of the row over the pineapple, it was pointed out on Alan Singer's blog in the Huffington Post that Pearson not only produces the tests, they publish and market the test preparation "common core" textbooks, they deliver staff development to teach the common core, and they employ some of the authors of the standards. Hmmmmm. Sound familiar? We've been here before in both the US and the UK, and I know it's also a problem in many other countries. In this article in the BBC in November all exam boards were criticized for providing the textbooks and preparation guides for their own tests. And they continue to train teachers to teach to the test, despite the scandals of recent years. Not only does this encourage the worst of teaching to the test, but it also produces what OFQUAL has described (correctly for once - maybe a lucky accident) as "an unacceptable degree of predictability of test content from preparation texts". This is a monopoly that should be broken up; but I suspect that it won't happen soon. Money, money, money.

Automated scoring is never far from the news. And once again this year it surfaced in connection with assessing writing. The argument this time is that if teachers don't have to read what their students write, they'll assign more writing tasks. And that would be good for their students! The researchers are always trying to get the computer to predict human scores - the "gold standard". But however high the correlations get (yep, statistical determinism again) the computers just aren't looking at the same things. The researcher in the article admits that his program counts the number of commas in a text, and they're a good predictor of human scores. Yes, and that's a feature of text length. All surrogates for what, I wonder?
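The comma point can be shown with a small Python sketch. The four "essays" below are invented for illustration, and the `pearson` helper is a hand-written correlation, not any scoring engine's actual method. The only claim is that a comma count tends to rise and fall with text length, so a model that "predicts" scores from commas may simply be counting words.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

# Invented texts of increasing length (hypothetical, not real student writing).
essays = [
    "Tests matter.",
    "Tests matter, but teaching matters more.",
    "Tests matter, teaching matters, and learning, in the end, matters most.",
    "Tests, textbooks, training, and standards all come from one company, "
    "which, frankly, is a worry.",
]

word_counts = [len(essay.split()) for essay in essays]
comma_counts = [essay.count(",") for essay in essays]

r = pearson(comma_counts, word_counts)
print(f"words={word_counts} commas={comma_counts} r={r:.2f}")
```

With texts like these the two counts move almost in lockstep, which is why a high correlation with human scores says little about whether the machine is attending to anything a reader would call writing quality.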

Grade Inflation Fiasco
Every year more students pass tests. And every year there's a great debate. Are the tests getting easier? Are the students getting brighter? Are teaching standards improving? It's so predictable, it's become boring. UK examination boards don't understand the use of specifications, parallel forms, or equating; they are populated by administrators, not testing experts. But this year it all suddenly became extremely interesting - and for many of us - even entertaining. Not for the students of course, as this video makes clear. In a nutshell, the examination boards suddenly decided to "raise the grade boundaries" (translation for professionals: change the cut score) to make it harder to pass. The press was full of articles on the unfairness of it all, especially when it came to English, where most of the reduction was observed. But was something else going on? The Secretary of State for Education had been suggesting that we should go back to a system of a fixed percentage of pupils getting specific grades in order to stop the observed "inflation". The Head of Ofqual had a statement to make: "Our job is to make sure exam results are right. What we have done this year, and last year, is to hold the line on standards steady." I know that this is the Head of the Regulatory Body appointed by the Secretary of State, and I'm just a Professor of Education and Language Assessment, but if I had an undergraduate write this in an essay, it would fail. What does "right" mean? How do you "hold the line steady"? Not a clue. So in the press, the language started to change, with the appearance of words like cockup. And it is all the more important, because the results are not only critical for pupils, but for the schools that are placed in league tables according to these results. Then there was a call for an enquiry into the fiasco, closely followed by the threat of legal action.
Then it transpired there were letters from the Regulatory Body to one of the largest examination boards putting pressure on it to lower grades. If you follow that last link and watch the video, you'll hear John Townley asking whether the position of the Head of the Regulatory body "is tenable or not". Strong words. So I'll let you make your mind up if you think he's right.

Here is the relevant extract from which I took the quote above. You will note at the end the news interviewer is so baffled by the response to one of his questions that all he can do is laugh.

Of course, this is my blog, so I guess I have to give my view. This is bureaucratic waffle at its worst. The non-understanding of variables and how we control them is breath-taking. This kind of stuff turns my brain to mush. She might just have got away with it if she hadn't given the example, which betrays complete ignorance. I have to sympathise with the interviewer. It made me laugh too. Then cry. Where does the government find these people? And is it any surprise the system is in a mess? And now, we can expect to hear the outcome of the High Court Challenge against the English test results before Christmas. What a treat!

Take a Chance on Me
The Head of the Regulatory Authority might do better at predicting what's going on if she had a crystal ball. And if the quality of some of the tests I've commented on this year is anything to go by, we might be better off using one to award grades. The empty-headed may also find the practice easier to defend when quizzed by journalists. And on a similar note, some students have been turning to fortune tellers to find out what grades they might get, what they should study, and which universities they should apply to. And for parents, there has been a growth in online praying platforms to seek divine help for their children in the exams. All testing and assessment is a probabilistic activity. The fundamental process is a classic case of inductive inference, but dependent entirely upon the care with which the tests are designed, and the validation evidence supplied to support their use. I've been highly amused by much of what has got into the media this year; but the downside is that it reminds me just how badly most testing and assessment is done. And of the superstition which surrounds it. Especially on the part of officialdom. For a social tool that has such impact, it's a terrible indictment of the institutions involved.

Or, as the cartoonist put it:

Glenn Fulcher
December 2012