Rating Scales and the Halo Effect

Feature for May 2009

When we test speaking and writing, it is common practice to ask raters (expert, trained judges) to make a decision about the quality of performance or product. In order to help raters make a judgment they are provided with a rating scale, which usually consists of a number of levels or bands, perhaps from 1 to 6 or 1 to 9. Each band is described using a prose descriptor (UK) or rubric (US). The rating scales can be classified into a number of broad types, as follows:

  • Holistic Scoring
    A single score is awarded, which is designed to reflect the overall quality of the performance. The descriptors or rubrics are general in nature, drawing on theories of communicative language ability that affect all language use.
  • Primary Trait Scoring
    As with holistic scoring, a single score is awarded. However, the descriptors or rubrics are developed for each individual prompt (or question) that is used in the test. Each prompt is developed to elicit a certain type of language, for example an argumentative essay in an academic context. The rating scale would then reflect the specific qualities of such a writing sample at a number of levels, with samples of writing that exemplify each level.
  • Multiple Trait Scoring
    Unlike the two scale types already mentioned, multiple trait scoring requires raters to award two or more scores for different features or traits of the speech or writing sample. The traits are normally prompt or prompt-type specific. The argument in favour of this type of scoring is that it provides richer information about each performance. In the case of an essay this may include traits like organization, coherence, cohesion, content, and so on. Such detailed scoring should also provide diagnostic information that is useful for both learner and teacher.

But there's a problem with multiple trait scoring, referred to as the Halo Effect. This was defined by Thorndike in 1920 as "a problem that arises in data collection when there is carry-over from one judgment to another." In other words, when raters are asked to make multiple judgments they really make one, and this affects all the other judgments. If raters are given 5 scales, each with 9 points, and they award a score of 5 on the first scale for a piece of writing, it is highly likely that they will award a 5 on the second and subsequent scales, and be extremely reluctant to move far from this score on any of them. As a result, profiles tend to be 'flat', defeating the aim of providing rich, informative data on learner performance.
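
The flattening described above can be illustrated with a toy model. The sketch below is not taken from the literature: it assumes, purely for illustration, that a rater's score on each scale is a weighted blend of the learner's level on that trait and the rater's single overall impression (here, the mean of the trait levels). The names halo_profile, profile_spread and halo_weight are hypothetical, and the trait levels are invented.

```python
import statistics

def halo_profile(true_levels, halo_weight):
    """Return the scores a rater might award on each scale.

    Assumed toy model: each score blends the learner's level on that
    trait with the rater's overall impression (the mean level).
    halo_weight = 0.0 -> every trait is judged independently.
    halo_weight = 1.0 -> every scale receives the overall impression.
    """
    overall = statistics.mean(true_levels)
    scores = []
    for level in true_levels:
        blended = (1 - halo_weight) * level + halo_weight * overall
        # Round to a whole band and keep it on the 1 to 9 scale.
        scores.append(min(9, max(1, round(blended))))
    return scores

def profile_spread(scores):
    """Range of a score profile; 0 means the profile is completely flat."""
    return max(scores) - min(scores)

# Invented trait levels for one essay on five 9-point scales.
levels = [3, 7, 5, 8, 4]

for weight in (0.0, 0.8, 1.0):
    profile = halo_profile(levels, weight)
    print(weight, profile, "spread:", profile_spread(profile))
```

With no halo the awarded profile keeps the full spread of the underlying levels. As the assumed halo weight grows, the scores converge on the overall impression and the diagnostic detail disappears.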

Listen to an Interview

The speaker is from the field of management. He is discussing a context in which business analysts are asked to rate the performance of companies on 9 different rating scales (he calls them "categories") to decide which is the most successful company for a given year.

As you listen, ask yourself the following questions:

  1. What does the speaker mean when he claims that one perception is recorded in nine ways?
  2. In language testing, how many alternative explanations can you think of for highly correlated scores on different rating scales?
  3. What does it mean to 'allow an overall impression to shape a particular judgment'?
  4. If the halo effect is at work in language testing, what impact might this have on the quality of the scores we report?

Research in the Language Testing Literature

There has been very little empirical investigation of the relationship between scales where multiple-trait rating is employed in language testing, but the studies that do exist show very high correlations between traits (Sawaki, 2007). The issue at stake is our current failure to establish divergent validity, that is, to show that the traits are being assessed independently of each other. Of course, it could be that the traits really are very highly correlated in our sample of learners. But it could also be that the halo effect is a much more serious problem for multiple-trait scoring than we realize. If the halo effect is at work, it is difficult to claim that our multiple-trait ratings reflect the rich, diverse constructs that they are designed to capture.
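
As a first look at the relationship between scales, the correlation between every pair of traits can be computed from a set of ratings. The following is a minimal sketch, not the method used in the studies cited here: it uses a plain Pearson correlation, the function names are hypothetical, and the ratings are invented for illustration only.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def trait_correlations(ratings):
    """Correlation for every pair of traits in a {trait: scores} mapping."""
    return {
        (t1, t2): pearson(ratings[t1], ratings[t2])
        for t1, t2 in combinations(ratings, 2)
    }

# Invented ratings: one score per learner on each 9-point scale.
ratings = {
    "organization": [5, 6, 4, 7, 3, 8],
    "coherence":    [5, 6, 5, 7, 3, 8],
    "content":      [4, 6, 4, 7, 4, 7],
}

for pair, r in trait_correlations(ratings).items():
    print(pair, round(r, 2))
```

High values in such a table are consistent with a halo effect, but they do not demonstrate it. The same pattern would appear if the traits really were closely related in the sample of learners.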

A Final Question

If you managed to list a number of alternative hypotheses in response to the second question above, can you devise a study that might help you to decide which hypothesis is the most plausible in a given context?

Web Resources

The Virtual Assessment Center at the University of Minnesota has descriptions of different kinds of 'rubrics' or 'rating scales', with links to a variety of scales that are available on the internet.

Additional Reading

For a good introduction to the different types of rating scales used in performance testing, see:

Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In Hamp-Lyons, L. (Ed.) Assessing Second Language Writing in Academic Contexts. Norwood, NJ: Ablex, 241 - 276.

This text is reproduced with annotations and suggested activities in:

Fulcher, G. and Davidson, F. (2007). Testing and Assessment: An advanced resource book. London and New York: Routledge on pages 259 - 257.

The halo effect was first identified by Thorndike in the early 20th Century. See:

Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied Psychology, 4, 25 - 29.

This article is difficult to find, but you can download an original summary of the paper from my web site: Thorndike 1920. I created this pdf from a hard copy given to me some years ago. It is not clear where the summary was published, as the hard copy did not indicate a source.

A classic treatment of the topic in language testing is:

Yorozuya, R. and Oller, J. W. (1980). Oral proficiency scales: construct validity and the halo effect. Language Learning 30(1), 135 - 153. In this study the researchers found that "The contrast between agreement and indexes for scales rated on the same occasion versus scales rated on different occasions revealed a halo effect which tends to reduce the reliable variance in scales rated at a single hearing by about 3%."

For a recent treatment of this topic in language testing, see:

Sawaki, Y. (2007). Construct validation of analytic rating scales in a speaking assessment: reporting a score profile and a composite. Language Testing 24(3), 355 - 390.

For a general discussion of how to rate the quality of language performances, see:

Fulcher, G. (2008). Criteria for Evaluating Language Quality. In Shohamy, E. (Ed.) Encyclopedia of Language and Education: Volume 7, Language Testing and Assessment. New York: Springer, 157 - 176.

Glenn Fulcher
May 2009