Language Testing Review of the Year 2009

This site designed and maintained by
Prof. Glenn Fulcher



Feature for December 2009 - January 2010
It's the end of another year. And in this feature we look back at language testing in the news during 2009. The stories that I've chosen have all been taken from news items that have appeared in the daily news section of the site during the year, and which caught my attention for a variety of reasons. But each story shows in one way or another the scale of the impact that language testing has on individuals, society and educational institutions.
Before turning to the stories that I've chosen to review, I have to say that the titles of some stories from the Indian press during 2009 have been absolutely delightful. Take this one for example: Three dummy candidates held during IELTS exam. Hmmmm. Only three? It turned out to be a story about cheating and fraud. Investigations led to IELTS Scam Kingpin Arrested. Language testing in India sounds more like Chicago in the 1930s! But of course, it was a serious story. Test security is a great concern for all testing agencies, as we see in our very first story.
Review of the Year's Top Stories in the Media

The Gaokao strikes again!

The earlier part of 2009 seemed to contain endless news from China about preparation for the yearly event that is the Gaokao. One of the most touching stories was that of Dai Panpan who didn't do very well, and decided to go onto the streets of Xi'an with a placard to recruit a university for himself. I suspect we won't hear any more of this story, but I do hope that a university applies to have him as a student!

But of course, what really hit the headlines was the extent to which people will try to cheat on high-stakes tests. And then we hear about the lengths that the authorities go to, to stop them cheating. In 2009 the stories of high-tech cheating and countermeasures abounded. While the students were not guilty of any serious offences under Chinese law, the providers of the cheating equipment could be accused of trading in 'state secrets' and face imprisonment. Fascinatingly, this was in the news at the same time as the Telegraph reported the discovery of cheat sheets from the middle of the Qing Dynasty (1644 - 1912), which goes to show that although the technology might change, the news doesn't!

A cheating device cleverly concealed in a wallet

Holy Cash Cows

One of the longest running news stories of the year came from Australia. It dominated the news items during July, and the story was copied across the Indian press as well. To quote from the Australian Broadcasting Corporation's Four Corners web site: "If a student wants to apply for permanent residency they must pass an English language test. Four Corners has found clear evidence that unscrupulous immigration and education agents are offering English language tests for a price. In some cases the exam paper is worth up to $5,000." This programme raises issues surrounding the unintended consequences of using language tests for high-stakes immigration control, especially when it gets mixed up with testing for admission to educational institutions. The last story (below) tells of a related industry that has grown up as a result.

Watch this short 'taster' from Holy Cash Cows broadcast by ABC on 27th July. The full 40 minute programme and what happened next is available on the Four Corners web site.

More Litigation

In the US, where testing is almost never out of the news, we have continued to see frequent reports of litigation over whether tests should be offered in languages other than English. This ranges from job tests for firemen, through aircraft mechanics, to educational achievement in schools. When it comes to the latter the stakes are especially high, as the NCLB legislation ties funding to test scores. As always, the lawyers are the major beneficiaries.
However, in 2009 we started to see cases of litigation outside the United States. In this report from the International News of Pakistan, a man seeking an Australian immigration visa sat the IELTS and received a lower grade than expected. He challenged the score, and it was adjusted upwards; because of delays in getting his visa, additional financial costs, and 'mental torture', he sued the British Council for 500 million Pakistani Rupees (approximately US$ 600,000). This just goes to show that having a wide confidence interval around scores can be a pretty expensive luxury!

Whatever is happening to standards?

In the UK the late summer and autumn see the usual flurry of stories about 'maintaining standards', as more school pupils than ever pass their tests, with ever higher grades. This leads to the inevitable claim that the tests are getting easier. But this year was different in two respects. Firstly, SAT results actually got slightly worse. As we all know, there are bound to be random fluctuations up and down, and a fair amount of regression to the mean; but for politicians every little change is either a major success for the government, or a sign to the opposition that educational policy is a disaster.

Secondly, the opposition Conservative Party decided that the way to solve the standards issue once and for all is the creation of an online database of past test papers dating back to Victorian times. This way, the argument goes, we can all see clearly whether the tests are actually getting easier. While this may show that politicians still don't understand the issues, we can forgive them, because it is just so funny. And it provides wonderful material for the satirists, as the following extract from the Now Show illustrates.

Singing in the Test

It's not often that we see funny testing stories getting into the news, but the Jakarta Post for 8th August reported:

"Exam time: Four young people recently took a super-tough spoken language test. Their assignment: Pretend you are a group of hikers who have trekked out of range of cell phone reception. Your friend Kelvin has fallen off the edge of a cliff and hurt his leg. Discuss your response. The examiners started the clock, expecting the teenagers to talk about who should help Kelvin and who should get emergency help. But that's not what happened. Nobody moved. After a moment, the following conversation took place. Candidate A: "What should we do to help Kelvin?" Candidate B: "I think we should sing to him because he will be bored." Candidate C: "I think that's a great idea." Candidate B: "What shall we sing?" Candidate C: "I think we should sing Happy Birthday." (It was not Kelvin's birthday.)"

The journalist gives an hilarious account of how singing can help you overcome stress - including test stress. Somehow, I think it may have been a pre-prepared ploy to use language of their choice, rather than to overcome stress! But it is a wonderful story nevertheless.

A Test Prep Practice Too Far?

During 2009 there were so many stories about test preparation that I ran this feature. The longest running story was about training teachers to script speaking tests. Once the practice was in the public eye, a number of teachers were suspended for 'malpractice'. This is a link to the BBC news item, in which one of the interviewees links poor test preparation practice to the fact that test scores are not only used to assess students, but also to hold teachers and schools accountable through league tables.

Upsetting the Balance of Power?

The two largest providers of tests of English for academic purposes, ETS and Cambridge, have long been able to share much of the world's language testing market between themselves, with IELTS expanding quite rapidly because of its use as a screening device for immigration. But 2009 has seen Pearson Assessment enter the market with the Pearson Test of English Academic. In an article in the Guardian, Max de Lotbiniere asks whether this is going to change the balance of power. We won't know the answer to this question until 2010 or beyond. But Pearson's size and history of buying up companies that give it the technology and infrastructure for rapid expansion mean that the challenge is going to be strong. In an article in the New York Times, the Head of Language Testing at Pearson is quoted as saying "It's a fairly commercial, competitive market already. We're going to make it more so." It is significant that this appears in the Global Business section.

In 2009 Pearson increased its investment in educational institutions, particularly in India, giving it a basis for challenging IELTS on the sub-continent. India is going through a period of rapid expansion, during which the growth in educational participation and the need for language tests is likely to increase just as quickly as in China.

The pre-launch hype for the Pearson Test of English Academic had been extensive, and reports that Pearson is deliberately setting out to break the monopoly of IELTS for both education and immigration in countries like Australia appeared more frequently in the press as the year went on. In other stories in this review we see some of the problems that arise when educational language tests are used for immigration. It is perhaps rather worrying therefore that Pearson started to angle for this market as well, even before the test was launched.

Automated Scoring

However, as the launch date of 26th October drew closer, the real concern of the media was this: who needs teachers when you've got robots? Automated scoring is widely used, but only as a second 'quality control' marker in tests for language learners. If you haven't come across this before, you can download an excellent introduction to this topic by Monaghan and Bridgeman from ETS. Unlike ETS, Pearson is using the computer as the sole marker, and this caused concern in papers as different as The Guardian and The Telegraph. Both of these stories were generally negative about the Pearson initiative, and offered quotations from academics and the National Union of Teachers to question the ability of machines to mark essays. Pearson spokespeople maintain that machines are preferable to humans because they are 'more reliable'. I recommend that you listen to the argument, which is presented in this YouTube video by John de Jong, and evaluate it for yourself.

What is my take on this? Here is the quotation I gave to the Guardian:

    "A machine is only 'more reliable' in the trivial sense that it produces exactly the same score for the same writing sample because it is a machine," said Fulcher. "The real question is whether criteria used by the machine to award the score are similar to those used by humans, who are sensitive to the richness and nuances of academic writing in a range of genres, or are they surface-feature predictors that are easily coachable."

(you can read the entire article here)

In short, I think humans are used to support a validation claim, and then conveniently jettisoned for marketing purposes. So the Pearson literature sometimes claims that the reliability of human scoring is very high in order to make a validity claim, using correlations and visuals like this scatterplot from Pearson Technology. Then we are told that humans simply can't agree, which is why we need automated scoring. There is a fundamental contradiction in the argument here.

Churchill got a poor score!
So we need to know whether automated scoring systems are sensitive to the same qualities of academic writing as humans. If they aren't, the question is whether students can 'trick' the computer into awarding high marks. In the US a particular style of writing has emerged for automated tests known as 'schmoozing the computer'. Basically, teachers and learners have to work out what it is that the computer can recognize - things like sentence length, vocabulary, use of linking devices, paragraphing, and so on. When the research into 'predictive features' is published, this isn't too difficult. The claim is that learners then practice getting higher grades by manipulating these features, rather than learning how to write well for a variety of 'real' audiences, in a range of genres.
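To make the worry concrete, here is a minimal sketch of a scorer that looks only at surface features of the kind listed above. Everything in it is an assumption made for illustration: the feature set, the list of linking words, the weights and the function names are invented, and it is not the method used by Pearson, ETS or any other real scoring engine. It simply shows how a scorer can be perfectly consistent and still be easy to coach.

```python
import re

# Hypothetical list of linking devices, chosen only for this illustration.
LINKERS = {"however", "moreover", "furthermore", "therefore",
           "consequently", "nevertheless"}


def surface_features(text):
    """Count a few surface features: sentence length, vocabulary, linkers, paragraphs."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "mean_sentence_length": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        "linker_count": sum(1 for w in words if w in LINKERS),
        "paragraph_count": len(paragraphs),
    }


def toy_score(text):
    """Combine the features using invented weights (an assumption, not a real model)."""
    f = surface_features(text)
    score = (
        0.1 * min(f["mean_sentence_length"], 25)
        + 2.0 * f["type_token_ratio"]
        + 0.5 * min(f["linker_count"], 6)
        + 0.5 * min(f["paragraph_count"], 4)
    )
    return round(score, 2)


plain = "Tests matter. They shape what teachers do. They shape what learners do."

# The same ideas, padded with linking words and an extra paragraph break.
coached = ("However, tests matter. Moreover, they shape what teachers do.\n\n"
           "Furthermore, they shape what learners do. Therefore, tests matter.")

if __name__ == "__main__":
    print("plain:  ", toy_score(plain))
    print("coached:", toy_score(coached))
```

The toy scorer will always give the same sample the same score, which is reliability in the trivial sense. Yet the 'coached' text says nothing that the plain one does not, and it still scores higher because it has more linking words and one more paragraph.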

The issue is much wider than the Pearson tests. In the UK there is a plan to use automated scoring for A-level English tests, essentially the 'university entrance' examinations. And the media had a field day testing out the software.

The approach adopted has been to put great literature, including Churchill's speeches, into automated scoring systems. This is a link to a Channel 4 news item that reports what happened. As you will hear, the computer did not do too well at all. But this may not deter examination boards. Automated scoring is seen as the most effective way to reduce the time taken to issue scores, and to cut the costs associated with human marking. But given the current state of the technology, it may reduce the validity of score meaning and have a negative washback effect on the writing classroom.

And Channel 4 is not the only news programme to have been obsessed with automated scoring. Here is an extract from the radio programme Broadcasting House, first aired on 15th November, 2009.

As a final thought from a sceptic, I am reminded of Latham's discussion of the purpose of 'essays' in his 1877 book On the Action of Examinations. "If examiners wish to see the free play of thought in the candidates, they must set questions which admit of being treated discursively," he wrote (p. 190). We not only wish to see how particular features are realised, but how they are combined, and what their effects are. The point is that we are making inferences about ability from instances of performance. Latham again: "We must not always infer the absence of the powers wanted for good composition because they do not appear; and this mode of examining should only be used where great latitude can be left to the examiners, where, in fact, the whole examination is viewed simply as a means of arriving at an opinion on the merits of the candidate...." (p. 198). Note Latham's use of 'infer' here. Computer algorithms do not make inferences. Nor do they arrive at a balanced opinion regarding the overall quality of a piece of writing, or a verbal utterance, in relation to its communicative intent. That's what humans do.

An Untested Innovation?

Returning to the Pearson test, it looks as if they will also give the score users - normally university admissions tutors - a 30 second audio clip of the test taker speaking. At least this is what is being reported by the Independent. Why? Presumably so the tutors can make their own judgment about the speaking ability of the student? And even if this is not the intention, that is what is likely to happen. And these are not trained raters. So after the hype about 'reliable machines' vs. 'variable human judgment' the system may encourage untrained individuals to make assessments that will vary from university to university, department to department, and fluctuate over time.

The Growth of Language Testing

What is certain is that as we approach the end of the first decade of the 21st century, the use of language testing is expanding. Whether this is for tour guides in Korea, or taxi drivers in Australia, being able to communicate in another language (usually, we have to admit, English) is seen as an economic imperative. And so it is that language testing continues to be big business. Pearson entering the market is a clear sign of this, just as test preparation continues to generate huge revenues: just look at the profits of New Oriental for a quarter of the year. It is no wonder that sometimes, when the outcome of language testing is linked to economic migration, the testing industry generates dependent industries.

Like the arrangement of marriages so that spouses can enter countries with their student husbands or wives. This is perhaps the saddest story of the year, illustrating as it does the worst of the unintended consequences of using language testing as a surrogate for a more honest immigration policy.

Let us hope that there are no more stories like this in 2010.

Glenn Fulcher
December 2009