Read the Introduction, then listen to the Podcast below
Pick up an examination paper from any country in the world, and you will find lots of multiple-choice questions (henceforth m/c). They are endemic to educational assessment. Many sources claim that the m/c question made its debut in Frederick J. Kelly's 1914 doctoral dissertation Teachers' Marks, Their Variability and Standardization. However, this is not correct. Kelly's dissertation does not mention the m/c item. The entire dissertation is an analysis of examination data from schools and colleges to demonstrate "...a very wide difference of rating upon the same paper among supposedly competent judges" (p. 51). On the same page he references the studies of F. Y. Edgeworth, from 1889 and 1890, in which the same phenomenon had been investigated in the Cambridge Tripos and the Indian Civil Service Examinations. This prepares the ground for a positive analysis of standardized scales, with particular emphasis on Thorndike's drawing and handwriting scales, and the Hillegas Composition scale. However, Kelly must have been working on the m/c item in 1914 or shortly afterwards, because it makes its first appearance on his Kansas Silent Reading Tests. His analysis of the tests appeared in the Journal of Educational Psychology in 1916. On p. 65 of this article, Kelly sets out the three m/c principles that have been repeated in every item writing guide since:
"First, the exercises must be subject to but one interpretation."
"Second, they must call for but one thing so that the answer given to them would be wholly right or wholly wrong, and not partly right and partly wrong."
"Third, they must test the ability to get meaning from the printed page and must not depend for their difficulty upon obscure words nor upon any particular fund of information."
Before the end of the First World War the m/c question had become the item of choice in all standardized psychological tests, and by the 1920s it was used in most educational tests.
First of all, we will take a look at the various forms an m/c item can take. These examples are drawn from the Army Beta test developed during World War I, published by Yoakum and Yerkes in 1920.
The item is constructed of a stem that provides all the necessary information for a test taker to select the correct response, but no superfluous information. Three or four options are normally provided. The correct option is called the "key", and the incorrect options are called "distractors". After all, they are meant to "distract" from the correct response. Selecting the key scores 1, while selecting a distractor scores 0. A multiple choice item is therefore "dichotomous" in that the response is either right or wrong.
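The dichotomous scoring rule just described can be sketched in a few lines of Python. The item below is invented purely for illustration (it is not drawn from the Army Beta or any real test):

```python
# A hypothetical m/c item: a stem, a set of options, and the index of the key.
item = {
    "stem": "Which word completes the sentence: 'The cat ___ on the mat.'",
    "options": ["sat", "sit", "sats", "sitted"],
    "key": 0,  # option 0 ("sat") is the key; the rest are distractors
}

def score_response(item, chosen_option):
    """Dichotomous scoring: the key scores 1, any distractor scores 0."""
    return 1 if chosen_option == item["key"] else 0

print(score_response(item, 0))  # key selected -> 1
print(score_response(item, 2))  # distractor selected -> 0
```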
Kelly sought to address two problems with the m/c item. The first was the use of the teacher's subjective judgment when assessing learners. While it is true that scoring a key as "correct" is purely objective, the belief that the m/c item is objective is an illusion. As example 2 shows, it is possible to embed cultural and social assumptions into items that can result in responses that do not reflect the "true ability" of learners on the construct of interest (score contamination). It is also possible to construct m/c items for which one can imagine a context in which more than one response is correct. Writing m/c items is therefore extremely difficult.
Example 3 is a variant on the true/false theme, which is also an m/c dichotomous item. While this - along with the other forms - has frequently been criticized on the grounds that there is a high chance of getting items correct just by guessing, it is in fact fairly rare for learners to guess unless they start to run out of time. But all variants address Kelly's second problem. He was writing at a time when the education system was expanding rapidly and teachers simply did not have enough time to mark examinations. The m/c item was intended to be quick to score for teachers, and cheap to score for education authorities. The early 20th century saw the development of the first modern accountability policies, and test scores were the means of implementation. So when Wood published his evaluation of what he called the "new type tests" in 1928, he produced a table comparing the costs of traditional written examinations with those of multiple-choice examinations, and simply concluded "these differences are too large to need comment." (You can download Wood's Table 45, originally on page 312, by clicking here). These were the heady days of efficiency drives. Industrial economies had realized that the Great War would be won or lost on well-organized munitions production as much as on military strategy. Taylor's time studies were the flavor of the month. What had been learned about efficiency and testing during the First World War shaped assessment practice throughout the inter-war years, and it was the m/c item that maximized industrial testing productivity in the army, and in the schools.
There is another reason for the longevity of the m/c item. The facility index of a dichotomous test item is simply the proportion of test takers who get the item correct. So if 50% score 1, the facility index p = .5; the remaining 50% get the item wrong, so q (the proportion incorrect) = .5 also. The item variance is therefore .5 x .5 = .25, which is the maximum variance any dichotomous item can have. This means that we maximize test variance by targeting most items around the mean of the score distribution on a normal curve (-1 to +1 standard deviations), and having fewer items that discriminate further away from the mean. By maximizing variance and discrimination - which is easier with m/c items than with any other item type - it is possible to achieve high levels of reliability with a moderate number of items in test forms that do not take long to administer.
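The arithmetic above is easy to check in Python. The 0/1 responses below are invented for illustration; the final line confirms that the item variance p × q peaks at p = .5:

```python
# Illustrative 0/1 scores for one dichotomous item across eight test takers.
responses = [1, 1, 0, 1, 0, 0, 1, 0]

p = sum(responses) / len(responses)  # facility index: proportion correct
q = 1 - p                            # proportion incorrect
variance = p * q                     # item variance for a dichotomous item

print(p, q, variance)  # 0.5 0.5 0.25

# p * q is largest when p = .5, so items targeted there maximize variance.
assert all(f * (1 - f) <= 0.25 for f in [i / 100 for i in range(101)])
```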
The astute reader of the last paragraph will realize that there is a downside to this. The assumption is that 50% of test takers (or as near as we can get) do not answer the item correctly. But this depends entirely on the purpose of the test. If you have just finished teaching a language course and give an achievement test, you're probably going to be fairly upset if a significant number of students can't answer the items correctly! In other words, you want a high facility index for each item. If this happens, discrimination and reliability decrease. Think carefully about your purposes and the paradigms within which you work! Now to the podcast for a few more insights.
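One way to see this trade-off concretely is with KR-20, a standard Classical Test Theory reliability estimate for dichotomous items (it is not discussed above, so treat this as an illustrative aside). The two small score matrices below are invented: in the first, items are targeted near p = .5 and spread test takers out; in the second, nearly everyone answers everything correctly, so little variance - and little reliability - remains:

```python
def kr20(matrix):
    """KR-20 reliability: (k/(k-1)) * (1 - sum(p*q) / total score variance).
    matrix: one row of 0/1 item scores per test taker."""
    n = len(matrix)        # number of test takers
    k = len(matrix[0])     # number of items
    totals = [sum(row) for row in matrix]
    mean_t = sum(totals) / n
    var_total = sum((t - mean_t) ** 2 for t in totals) / n
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in matrix) / n  # facility index of item i
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Items spread around p = .5: test takers are well separated.
spread = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
# Very easy items: almost all responses correct, high facility indices.
easy = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
]

print(round(kr20(spread), 2))  # 0.84
print(round(kr20(easy), 2))    # 0.0
```

The easy test still "works" as an achievement measure, but its reliability coefficient collapses because there is almost no score variance left to be consistent about.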
Listen to my Podcast on Multiple-Choice Items and How to Write Them Well
Whatever the criticisms of the m/c item type, it is a cost-effective technology that has been tried and tested for over a hundred years. We know how to build high-quality standardized tests with the m/c item type. Even modern "communicative" tests still contain m/c items to boost test reliability! So this item type is highly likely to be a common feature of our tests in another hundred years.
Other Useful Resources
Why not visit the Statistics Page where you can download Excel spreadsheets for the analysis of multiple choice items using Classical Test Theory?
References & Further Reading
Brown, J. D. (2012). Classical Test Theory. In Fulcher, G. and Davidson, F. (Eds.) The Routledge Handbook of Language Testing (pp. 323-335). London and New York: Routledge.
Lee, H. and Winke, P. (2013). The differences among three-, four-, and five-option-item formats in the context of a high-stakes English language listening test. Language Testing 30(1), 99-123.