Re-examining Language Testing
Philosophical Trails, Reading and Activities
This site designed and maintained by
Prof. Glenn Fulcher

@languagetesting.info
 
Winner of the 2016 SAGE/ILTA Book Award

"The epithets sociable, good-natured, humane, merciful, grateful, friendly, generous, beneficent, or their equivalents, are known in all languages, and universally express the highest merit, which human nature is capable of attaining."
(Hume, 1777, pp. 16 - 17)

This is a book about ideas. It argues for a Pragmatic view of language testing and assessment that draws heavily on an Enlightenment view of humankind and scientific endeavour. One of the principles of this Pragmatism is expressed by the epigram from David Hume. The text of the book reflects to a large extent my own struggle with ideas that lie beneath the practice of language testing and educational assessment more generally. Ideas are most often dormant. They sleep while we get on with the practical tasks of designing tests that help someone make decisions. But while ideas rest, so do our critical faculties. Ideas become inexplicit assumptions that cannot be interrogated. Testing may become an end in itself. We may even forget to recognize our common human nature, or overlook the fact that with access to educational opportunities anyone may succeed.

The Pragmatic worldview is fundamentally optimistic. The arguments in this book offer a vision of language testing that can contribute to progress for individuals and society. Communication is essential for advancement. Language knowledge and skills endow individuals with the freedom to participate in personal and collective growth.

"The creation of the field of psychometrics set out to reduce the uncertainty of measuring human abilities but not to deal with the underlying philosophical problem. Now, in this pioneering rethinking of the fundamental questions involved, Fulcher has finally tackled the basic issues. There can be no question as to the importance of this book."
Bernard Spolsky, Bar-Ilan University, Israel

"Fulcher's philosophical approach to language testing as a profession is both enlightening and thought provoking. It will present a major challenge for language testers in the years to come."
Yan Jin, Shanghai Jiao Tong University, China

On this page you'll find trails to additional web pages and external sites that further illustrate my reflections in the book. Each chapter and theme is represented by one of the figures from the cover of the book. You may wish to work out the link between each character and the theme to which I have allocated it. And...yes! I did design the book cover. This had to be a cover-to-cover message!



The practice of language testing is concerned with making and justifying inferences. In this respect it is like diagnosis in medicine, where inferences are drawn about conditions from symptoms. The process of testing is one of collecting evidence, summarizing the evidence in a number or letter (a score), and then making an inference about the meaning of the score in terms of what a test taker is able to do in situations beyond the test. All non-deductive inferences are risky, and there are many factors that might lead us to make false inferences. Inductive reasoning, as Hume called it, is essentially probabilistic. A large part of language testing is concerned with making probabilistic statements about how test takers are likely to perform in specified communicative domains, given the limited evidence we obtain from their response to a limited selection of tasks. One method that has been widely used in language testing to reflect probabilistic statements of score meaning is Stephen Toulmin's argument model. This is particularly useful because it provides a framework within which the strength of claims and the nature of alternative explanations can be set out to create research agendas.
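If you would like to see how the elements of such an argument fit together, here is a small Python sketch of a score-interpretation argument laid out in Toulmin's terms. The content of each element is invented for illustration and is not taken from any operational validity argument discussed in the book.

    # A hypothetical score-interpretation argument laid out in Toulmin's terms.
    # The claim is qualified ("probably") because inferences from scores are
    # inductive, and therefore risky, as argued above.
    argument = {
        "grounds":   "A candidate scores 72/100 on a test of academic listening.",
        "warrant":   "Tasks were sampled from an analysis of the lecture-listening domain.",
        "backing":   "The domain analysis, and studies relating scores to later academic performance.",
        "qualifier": "probably",
        "claim":     "The candidate can follow undergraduate lectures delivered in English.",
        "rebuttal":  "Unless the score was inflated by task familiarity or narrow domain sampling.",
    }

    for element, content in argument.items():
        print(f"{element:>9}: {content}")

Setting out the rebuttal explicitly is what turns the argument into a research agenda: each alternative explanation is something a validation study can investigate.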

The Epicurus Trail
   Fallacies
In the first chapter on inference we are fundamentally concerned with risky inferences. There are many ways in which we can arrive at an unsound conclusion because of faults in the way we think. Some of these are discussed in the chapter, such as begging the question, and the fallacy of affirming the consequent. These problems have been known since the time of Greek philosophy. Visit the Sceptics Guide to the Universe for a list of fallacies.

Look at the list provided. Which of these, in addition to those discussed in this chapter, are language testers most likely to commit?

   Association and Causation
Fallacies are particularly prevalent when interpreting statistical data. One of these is the fallacy of assuming causation when variables are merely associated. Language testers, like all other social scientists, have been led to commit this fallacy when using the statistical tool of correlation. Follow this path.

Here you will find an explanation of the early use of correlation by Quetelet (who you will read about in chapter 2), before it was called correlation. You will find links to correlation tools, and critical readings in the misuse of correlation to claim causation.
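As a quick illustration of the fallacy, the following Python sketch generates two invented test scores that are both driven by a third, shared cause. The variables and numbers are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    # A shared cause (say, years of prior schooling) drives both scores;
    # neither score causes the other.
    schooling = rng.normal(50, 10, size=500)
    vocabulary = schooling + rng.normal(0, 5, size=500)
    reading = schooling + rng.normal(0, 5, size=500)

    r = np.corrcoef(vocabulary, reading)[0, 1]
    print(f"correlation between vocabulary and reading: {r:.2f}")
    # The correlation is high (around 0.8), yet concluding that vocabulary
    # causes reading ability (or vice versa) would be exactly the fallacy
    # described above.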

   Chance and Control
One of the strange facts about sound inference is that it is improved by introducing more chance into the research design. In chapter one, I discuss a research paper published by Peirce and Jastrow in 1884, and I say this: "The surprising idea underlying our practice is that we gain more control over our research and improve the soundness of our conclusions by introducing more randomness. Peirce (Peirce and Jastrow, 1884) was the first to improve experiments in this way...." Download the paper here.

You can use this paper as the basis for a short seminar with colleagues. Address the questions: (a) what is the primary reason for the use of randomization in research? and (b) what are the key functional elements of the paper that make it such a good report of the research described?
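The sketch below is not a reconstruction of Peirce and Jastrow's own design, but it illustrates the general principle in a simple counterbalancing scenario; the participants and test forms are invented.

    import random

    random.seed(42)

    # Twenty hypothetical participants each take two test forms. Randomising
    # who takes which form first means that practice and fatigue effects are
    # spread by chance rather than systematically favouring one form.
    assignments = ["A then B"] * 10 + ["B then A"] * 10
    random.shuffle(assignments)

    for i, order in enumerate(assignments, start=1):
        print(f"P{i:02d}: {order}")

It is the deliberate use of chance in the design that licenses the probabilistic inference drawn at the end of the study.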

C. S. Peirce had a deep understanding of induction and probability. Here is a link to a humorous Radio 4 discussion of probability. Listen, and make notes of the role of probability and chance in inductive inference.

In the text I refer to Peirce frequently, and in the chapter on validity I draw on his interpretation of constructs to develop the notion of pragmatic validity. He is often referred to as the first American homegrown philosopher. If you have not come across Peirce before, you may like to learn something about his life and work.

   Abduction
Abduction is described by Peirce (1877) as a process that begins with the "irritation of doubt", through inquiry, to the settlement of opinion. Abduction begins with "...a conjecture that explains a puzzling or interesting phenomenon" (Hacking, 1990, p. 207). The classic example of abduction in literature is Sherlock Holmes, as this short clip from The Adventure of the Copper Beeches illustrates.

Fulcher and Davidson (2007) suggest four principles by which language testers might arrive at the most satisfying interpretation of data in the validation process. Is this list exhaustive?

  1. Simplicity: Choose the least complicated explanation (Occam's Razor)
  2. Coherence: Choose an explanation that is in keeping with what we already know
  3. Comprehensiveness: Choose an explanation that leaves as few facts as possible unexplained
  4. Testability: Choose an explanation that makes testable predictions about future actions, behaviour, or relations between variables



Strong claims are frequently made for measurement in the social sciences, primarily because the methods have been borrowed from the natural sciences, and astronomy in particular. Thus, many researchers in educational measurement claim that without invariant interval-level measurement it is impossible for the social sciences to uncover the constants of human psychology and behaviour. In order to explain the reductionist approach that attempts to identify causes, the chapter outlines the way in which measurement entered social science research and was adopted by language testers. Discussion focuses on figures such as Babbage, Galton, Quetelet, and Cattell. It is argued that the practice of language testing and educational assessment is modelled on experimental practices, such as weighing and measuring, which pay particular attention to controlling potentially confounding factors. Strong claims for absolute measurement today are made by Rasch proponents, among others. The problems we face are those of contingency and interactivity. Firstly, constructs in language testing are dependent upon social processes; secondly, they are inseparable from the humans who possess them, and they interact with contexts. The language in language testing therefore requires that content and context are not relinquished to psychometric interpretation alone, as that would be a reductive human science.

The Euclid Trail

   Experiment
The outcomes of experiments can change if the setup differs each time the experiment is repeated. This doesn't mean that the world is so complex that we can know nothing about it because of the sheer number of variables that come into play. Rather, it means that we need to control some variables while others change, so that we might discover something about the phenomenon under investigation. An example of this is the weighing scale of Santorio, which Quetelet used as an analogy in his A Treatise on Man. Visit the Santorio webpage to look at further images of his famous weighing machine.

He spent most of his life sitting in the machine and, if he hadn't, we would not have understood perspiration. After reading the chapter and studying the web page, consider each of the elements that is controlled in a language test through the standardization of administrative conditions. Cattell was an early test developer who argued that "The scientific and practical value of such tests would be much increased should a uniform system be adopted, so that determinations made at different times and places could be compared and combined." What benefits do we gain from treating tests like experiments? What would be lost if we did not?

   The Curve
It was Sir Francis Galton who wrote in 1869: "I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the 'law of error'. A savage, if he could understand it, would worship it as a god. It reigns with serenity and in complete self-effacement amidst the wildest confusion. The huger the mob and the greater the anarchy the more perfect is its sway. Let a large sample of chaotic elements be taken and marshalled in order of their magnitudes, and then, however wildly irregular they appeared, an unexpected and most beautiful form of regularity proves to have been present all along." The technologies of language testing, like most quantitative social sciences, assume that data are normally distributed. To demonstrate how such a distribution arises from the accumulation of many small chance events, Galton invented an instrument called the Quincunx. How it works is linked to Pascal's triangle, as the following trail explains.

Open this web page to experiment with the Quincunx for yourself.
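If the interactive page is unavailable, the small sketch below simulates a quincunx directly; the number of rows and balls are arbitrary choices.

    import random
    from collections import Counter

    random.seed(1)

    ROWS, BALLS = 12, 10_000

    # Each ball bounces left or right with equal probability at every pin;
    # its final bin is simply the number of rightward bounces.
    bins = Counter(sum(random.random() < 0.5 for _ in range(ROWS)) for _ in range(BALLS))

    for k in range(ROWS + 1):
        print(f"bin {k:2d} | {'#' * (bins[k] // 40)}")
    # The counts follow a binomial distribution - the rows of Pascal's
    # triangle count the paths to each bin - and with many rows the shape
    # approaches the normal curve that so impressed Galton.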

The fact that most observations follow a normal distribution remains one of the most useful findings in social measurement - although the interpretation of the curve has not always been a happy one, as you will discover in the Aristotle Trail.

   The Tale of the Scottish Chests
In this chapter we show how Galton referred to data on Scottish chest sizes to defend his use of the curve in social measurement. The original text of 1817 is only 5 pages long, and yet represents one of the most influential pieces of data collection in early social science measurement. I have tracked down a rare copy of this paper, scanned it, and made it available for download here.

The reason for its importance is that it led Quetelet to apply astronomical thinking to social science data. Hacking (1990, pp. 108 - 109) puts it like this: "Given a lot of measurements of heights, are these the measurements of the same individual? Or are they the measurements of different individuals? If and only if they are sufficiently like the distribution of figures derived from measurements on a single individual....at this exact point there occurred one of the fundamental transitions in thought, that was to determine the entire future of statistics....Here we pass from a real physical unknown, the height of one person, to a postulated reality, an objective property of a population....This postulated true unknown value of the mean was thought of not as an arithmetical abstract of real heights, but as itself a number that objectively describes the population." That's a problem in itself - but dangerous when the abstraction is related back to individuals in social policy. This is one of the hidden ideas that can make test-based policy making perilous.

So, consider the following:

  1. Is the mean (average) really a descriptive statistic or a theoretical claim?
  2. Are we being misled by the concept of an "average man"?
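Before answering, you might run a small sketch like the one below, which contrasts repeated measurements of one invented person with the heights of an invented population, echoing Hacking's distinction. All of the numbers are made up.

    import numpy as np

    rng = np.random.default_rng(7)

    # 1,000 repeated measurements of ONE person whose true height is 170 cm,
    # with instrument error, versus the heights of 1,000 DIFFERENT people.
    one_person = 170 + rng.normal(0, 1.5, size=1000)
    population = rng.normal(170, 8, size=1000)

    print(f"mean of repeated measurements of one person: {one_person.mean():.1f} cm")
    print(f"mean height of the population:               {population.mean():.1f} cm")
    # The two means look identical, but the first estimates a real property of
    # a single body, while the second is an abstraction: no particular person
    # need be 170 cm tall. Treating the 'average man' as if he existed is the
    # move that Quetelet made and that Hacking questions.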

   Sinking Shafts?
In this chapter we consider the early assumptions of normative testing, which is epitomised by the work of Galton and Cattell. Here is the famous quotation discussed in the book:
    One of the most important objects of measurement ... is to obtain a general knowledge of the capacities of a man by sinking shafts, as it were, at a few critical points. In order to ascertain the best points for the purpose, the sets of measures should be compared with an independent estimate of the man's powers. We thus may learn which of the measures are the most instructive.

Can we do this for individuals?
Can we do this for populations?

Here is some early cine film of Cattell, without sound of course. I have had a professional lip reader attempt to discover what Cattell is saying: 0:00-0:08: He describes his office, and mentions the journals and paper he is working on. 0:15-0:20: indecipherable. 0:21-0:31: He talks about his work for a psychological corporation and mentions York. 0:42-0:50: He is talking about the handles on the door. 0:51-1:06: "It's nice to be out in the sun, I like this" and mentions the city's grand terminal. His final words are "jolly good".



Although reductionism is an acceptable strategy in scientific investigation, content and context are important to understanding complex human communication. It is not accidental that the artificial intelligence debate has centred on language, from the Turing test to Searle's Chinese room. One of the battlegrounds is the resurgent interest in the construct of fluency. There are currently two broad approaches - one based upon cognitive science, the other upon education. In the former, observable variables such as pauses, hesitation phenomena, speed of delivery, and so on, are treated as indicators of the presence or absence of a specific L2 cognitive fluency. The focus is upon the individual's cognitive capacity to process language. Processing models are presented in flow charts, using a computing metaphor. Educational approaches, on the other hand, see the observable phenomena as indications of communicative intent and management within an interactive process between individuals. Automaticity of processing within an individual is clearly a factor - particularly for beginners - but the phenomena are in need of contextual interpretation. The approach selected not only affects how fluency and speech are assessed, but also betrays our understanding of human nature, and the role that language plays in defining what it is to be human.

The Socrates Trail

   Machine Communication
This chapter begins with a consideration of what language contributes to our humanity. The debate is often framed in terms of artificial intelligence. Could an algorithm ever understand or produce language in such a way that a human would not know that they are speaking to a machine? In this audio clip from the programme Analysis you will hear two very different views.

Outline the key arguments on each side of the debate. Which side do you support, and why?

Evaluate Searle's Chinese Room thought experiment. Do you think that Searle's argument is sound?

   Fluency
Applied Linguistics is a young discipline, and we know very little about how language is processed. However, we do know that humans are capable of understanding very complex linguistic utterances, the meaning of which resides not only in linguistic form, but also in context and interaction. A pause in speech has many potential causes and meanings, but we are able to interpret these accurately in almost every case. This raises the question of whether counting the number and length of pauses as a predictor variable makes any sense at all. Counting is a low-inference activity, whereas understanding meaning is high-inference. Download this article, which is referred to in the chapter. Does the description of the construct of fluency match your own?
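To see how low-inference the counting approach is, consider this sketch, which tallies pause annotations in a fragment of invented transcript. The annotation convention (timed pauses in parentheses) is a common one, but the data are hypothetical.

    import re

    # Invented fragment: (.) marks a short untimed pause, (0.8) a pause in seconds.
    transcript = "I think (.) the er (0.8) the main reason is (1.5) is that people (.) want to travel"

    words = len(re.sub(r"\([^)]*\)", " ", transcript).split())
    short_pauses = len(re.findall(r"\(\.\)", transcript))
    timed = [float(s) for s in re.findall(r"\((\d+(?:\.\d+)?)\)", transcript)]

    print(f"words: {words}, short pauses: {short_pauses}, "
          f"timed pauses: {len(timed)} totalling {sum(timed):.1f}s")
    # These counts are cheap to obtain, but they say nothing about WHY the
    # speaker paused - planning, word search, emphasis, turn management -
    # which is precisely the high-inference interpretation that cannot be
    # read off the numbers alone.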

   The Turing Test
In his 1950 paper Turing sets out his test, and considers a number of arguments that may be made against the possibility of a computer passing it. Evaluate these arguments, and Turing's view that his test may one day be passed by a machine. Which arguments are more, and which less, convincing?

   Doing Things with Words
It is not surprising that it was John Searle who constructed the Chinese Room thought experiment. Speech Act Theory demonstrates that meaning is not simply resident in the semantics of words and grammar of sentences. J. L. Austin's How to do things with words remains the classic text, which you can download here.

Is it possible for a computer to understand illocutionary acts?



A measurement model defines how performance on a language test is turned into a number. Presenting testing outcomes as numbers appears to be "scientific". However, these processes have been questioned on the grounds that psychological abstractions are not comparable to those of the natural world. Nevertheless, numbers are powerful, and cut scores can become iconic symbols of success. The most important question is how useful numbers are, rather than whether they reflect the "accurate" measurement of some property. They must of course be indexical of some construct that is relevant to the inferences a score user wishes to make, but the index need not be like a ruler. One such example is that of Fisher's scale book, which attached ascending numbers to benchmark essays that were thought to be similar with respect to what expert judges valued. New essay samples could then be marked in comparison with the "standard" example. This term was originally associated with criterion-referenced assessment, which provides a number with an external meaning. These are valuable when job decisions have to be made, but there is a tendency to reify the meaning of numbers, rather than treating them as indicative of likely performance. In language testing, numbers are imperfect.

The Pythagoras Trail

   Cut Scores
Glass claims that "To my knowledge, every attempt to derive a criterion score is either blatantly arbitrary or derives from a set of arbitrary premises." Yet, establishing cut scores on tests for decision making purposes has grown into a major industry. It is variously known as "standard setting" or "setting performance standards". Download the Cut Score Primer and consider the approaches listed. Do you think that these methods are arbitrary, or not?
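One commonly used procedure you are likely to find in such a primer is the Angoff method. The sketch below works through its arithmetic with invented judge ratings, so you can see where the judgement (and arguably the arbitrariness) enters.

    # Invented Angoff-style ratings: each judge estimates, for every item, the
    # probability that a 'minimally competent' candidate would answer correctly.
    # The cut score is the sum of those probabilities, averaged across judges.
    ratings = {
        "judge_1": [0.6, 0.8, 0.5, 0.9, 0.7],
        "judge_2": [0.5, 0.7, 0.6, 0.8, 0.6],
        "judge_3": [0.7, 0.9, 0.4, 0.9, 0.8],
    }

    per_judge = {judge: sum(r) for judge, r in ratings.items()}
    cut_score = sum(per_judge.values()) / len(per_judge)

    for judge, score in per_judge.items():
        print(f"{judge}: implied cut score {score:.1f} out of 5")
    print(f"panel cut score: {cut_score:.2f} out of 5")
    # Everything hangs on the judges' intuitions about the imaginary
    # 'minimally competent' candidate - the point at which Glass locates
    # the arbitrariness.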

   Not Science as We Know It
While many psychometricians believe that cut scores arrived at by an agreed process are inherently meaningful, educationalists are not easily convinced. Indeed, the link between a score that becomes iconic (like an IELTS 6.5 for University entrance) and what a learner can actually do is very difficult to establish. The Huffington Post has developed a reputation for being particularly antagonistic towards testers. Look at the news item below. Do you agree with the author?

   Criterion Referencing
Standard setting is not relating a score to a criterion, but to some arbitrary standard that is articulated in a standards document. These are not usually based upon any sound theoretical model of knowledge or performance in a domain, or any empirical evidence. Yet, many psychometricians and language testers refer to standard setting as relating scores to a criterion. This misuse of language dates back to the very early days of criterion-referenced testing theory. Download my paper on criterion-referenced testing to study its history and how the terms "criterion" and "standard" have changed over time.

   Context
I argue that context is important for the production and understanding of meaning. Implication can only be understood if both speaker and hearer can process linguistic data, context and reference. In traditional language tests the focus is purely on language, but when communication is important, we may have to broaden our assessment strategy. I argue that this requires the use of "high inference categories", which require human judgment rather than merely counting observable surface features of performance. Not that the latter is of no importance; it is just that they must be interpreted by humans. The number that we arrive at in performance tests is therefore a guide to the quality of performance, rather than a result of some fundamental measurement. Consider the approach to scoring known as Performance Decision Trees (PDT). How does a PDT guide human decision making?
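As a schematic answer, here is a sketch of how a binary decision tree might guide a rater to a band. The questions and bands are invented for illustration rather than taken from any published PDT.

    # Invented performance decision tree. A real PDT is derived from discourse
    # analysis of the target domain; this one only shows the mechanism.
    tree = {
        "question": "Did the speaker accomplish the communicative purpose of the task?",
        "no": {"score": 0},
        "yes": {
            "question": "Did the speaker manage turn-taking and repair without burdening the listener?",
            "no": {"score": 1},
            "yes": {
                "question": "Was the contribution appropriately adjusted to interlocutor and setting?",
                "no": {"score": 2},
                "yes": {"score": 3},
            },
        },
    }

    def rate(node, answers):
        """Walk the tree using the rater's yes/no judgements, taken in order."""
        while "score" not in node:
            node = node["yes"] if answers.pop(0) else node["no"]
        return node["score"]

    # A rater who answers yes, yes, no to the three questions awards band 2.
    print("awarded band:", rate(tree, [True, True, False]))

The number at the leaf is a record of a chain of human judgements about observable features of the performance, not a measurement of an underlying quantity.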

   Multiple Choice
It is difficult to find a large-scale test that does not use multiple-choice items. There is a myth that these are "objective". This is a mistake that fails to distinguish between objectivity and scorability. The item type is no better than the theory upon which it is based, the test specifications, and the skill of the item writer. Follow the path below to see why multiple choice items do not produce "objective" numbers.

   Item Analysis
The technology underpinning multiple-choice items is 100 years old. It depends on curves of normal distribution, and correlational techniques (see the Epicurus trail). Item discrimination, for example, relies on the principle of an individual item correlating with the test-total score. Many assumptions underlie both classical and modern test theory. You can download spreadsheets that show you how this works by following the path below.
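If you prefer code to spreadsheets, the sketch below computes the two classical statistics - facility and item-rest discrimination - from a simulated 0/1 response matrix. The data are generated, not real.

    import numpy as np

    rng = np.random.default_rng(3)

    # Simulate 200 test takers responding to 5 dichotomous items: abler
    # candidates (higher theta) are more likely to answer correctly.
    theta = rng.normal(0, 1, size=200)
    difficulty = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
    prob_correct = 1 / (1 + np.exp(-(theta[:, None] - difficulty)))
    responses = (rng.random((200, 5)) < prob_correct).astype(int)

    total = responses.sum(axis=1)
    for i in range(responses.shape[1]):
        facility = responses[:, i].mean()              # proportion answering correctly
        rest = total - responses[:, i]                 # score on the remaining items
        discrimination = np.corrcoef(responses[:, i], rest)[0, 1]
        print(f"item {i + 1}: facility {facility:.2f}, discrimination {discrimination:.2f}")
    # Facility and the item-rest correlation are the workhorses of classical
    # item analysis; note how many assumptions (a single underlying dimension,
    # normally distributed ability) are built into even this simple exercise.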



The consensus on validity that existed following Messick's influential work of the 1980s and early 1990s is beginning to evaporate. In this analysis, five positions on validity are identified and critically analysed. The first is instrumentalism, which treats validation as establishing the usefulness of an assessment process for decision making. The second is constructionism, which posits that all meaning is locally created and transient. Score meaning is treated as essentially political in nature. Technicalism is a checklist approach to validation. A list of required qualities is produced, and through a programme of research, each item is ticked off. Realism is a return to a definition of validity that focuses attention on whether the test genuinely measures what it is claimed to measure. The construct must really exist, and variation in the construct must cause variation in scores. Finally, I argue for a pragmatic realist stance, which accepts contingency and interactivity, while maintaining that the reality of constructs lies in what Peirce calls more "primary substances", which in language testing is the linguistic realization of communicative intent.

The Heraclitus Trail

   Messick and Constructs
In recent years Messick's work on validity has been critiqued for being "impractical" - not providing processes for going about validation. I think that this is unfair, but understandable, as we have moved away from a concern with meaning, and towards a reliance on legalistic process, as a justification for score use. But construct validity as understood by Messick can be difficult to grasp. Luckily, we have J. D. Brown on hand to demystify it.

Also read this short explanation of the evolution of validity from the 1950s to the work of Messick.

   Kane and Utility
I characterize Kane's approach to validation as instrumentalist. Thus, I state in the book: "Kane (2013b, p. 121) is clear that 'We have no direct access to Truth.' An interest in Truth with a capital T is not of great interest to the instrumentalist. The concern is rather with what works. Does a theory, or a particular interpretation, lead to successful decision making?" This is why there is a movement in terminology from "validity" to "validation". In 2010 I was lucky enough to interview Mike Kane for the podcast Language Testing Bytes. Download the podcast. Is the lack of interest in constructs problematic? Or simply a recognition that Galton and Cattell were wrong about "sinking shafts"? (See the Euclid Trail).

   Realism and Constructionism
If there is a cline of belief about the existence of traits/constructs, at the one end is the realist claim that validity is about showing that a test measures what it is claimed to measure, and at the other, the belief that the construct is "constructed" by social or political forces, or even generated during the act of testing itself. The strong realist claim continues to hark back to Spearman's "g" - or a general intelligence factor that explains the correlation between scores on all tests. Borsboom puts it like this:

The construct g has been proposed as an explanation for the empirical phenomenon that interindividual differences on distinct intelligence tests are positively correlated: On average, people who score highly on verbal tests also score highly on spatial and numerical tests. This empirical phenomenon is called the positive manifold. The idea of g is that the positive manifold exists because individual differences on verbal, spatial, and numerical tests all originate from individual differences on one single latent dimension, and this latent dimension is called g. (Borsboom & Dolan, 2006, Why g Is Not an Adaptation, Psychological Review, 113(2), 433-437).
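The following sketch generates invented verbal, spatial, and numerical scores from a single latent dimension plus noise, and shows that every pairwise correlation comes out positive - the "positive manifold" Borsboom describes.

    import numpy as np

    rng = np.random.default_rng(11)

    # One latent dimension drives all three invented test scores.
    g = rng.normal(0, 1, size=1000)
    scores = np.column_stack([
        0.8 * g + rng.normal(0, 0.6, size=1000),   # verbal
        0.7 * g + rng.normal(0, 0.7, size=1000),   # spatial
        0.6 * g + rng.normal(0, 0.8, size=1000),   # numerical
    ])

    print(np.round(np.corrcoef(scores, rowvar=False), 2))
    # All off-diagonal correlations are positive. The realist reads the single
    # generating dimension as evidence that g is real; the constructionist
    # replies that the simulation only shows such a factor CAN reproduce the
    # correlations, not that it must exist.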

Download examples of Borsboom's realist work.

Compare this with Spearman's original work on g from the early 20th Century.

Are you convinced by the "real realist" arguments? Or do you think that this is the reification of concepts through the use of labels and numbers?

   Army Tests
One of the architects of the first large-scale tests was Robert Mearns Yerkes. In the book I discuss his work in two places. In this chapter I consider his contribution to the practice of test development, which was considerable. In the final chapter, by contrast, I consider the values that drove the work, which are much more questionable. However, not to recognize his contribution to test design because of his beliefs and motivations would be to commit the intentional fallacy.

In 1920 Yerkes produced an edited book, in which he wrote a chapter entitled What psychology contributed to the war (pp. 364 - 389). This describes the process of test design in six steps that seem very modern even today. The claim to a valid interpretation of scores rests in accurate domain analysis, and the translation of that analysis into tasks. It can be claimed that this is an early form of criterion-referenced validation, even if its success can be questioned.

This very short video clip of Yerkes is from a silent cine film taken at the 24th meeting of experimental psychologists, Yale University, April 5 - 7, 1928.



Language testing, like all educational assessment, is a social tool. Since the time of Plato, its primary purpose has been to select individuals for particular roles or positions on the basis of merit (although Plato would not have been particularly keen on social mobility). There are many criteria that societies can use for this purpose, but the test has become the most important because of its perceived fairness, as opposed to methods such as nepotism. The history of testing is one of the expansion of egalitarian principles and social mobility. Testing also makes possible the division of labour in modern economies, and the allocation of resources. This makes test scores a valuable commodity, and it is not surprising that there has always been a black market in the trade of scores through a "cheating industry". Additionally, once scores have market value, they will be used in accountability practices and market position, so that stakeholders can see whether they are getting value for money from schools, colleges and universities. The practices associated with the use of test scores are fraught with dangers, many of which are outlined in this chapter; but their use to organize and structure society is not going to go away.

The Plato Trail

   High Stakes Testing
While classroom quizzes may not be particularly high stakes, most testing is. The primary purpose of testing is to discriminate between individuals under circumstances where only some may benefit from what society has to offer. The outcome of testing therefore has significant impact on the individuals tested, their families, institutions, and society at large. First read about distributive justice, and then follow the trail to my page on high stakes testing in China.

Can you conceive of ways to make decisions about who enters University other than through the use of tests?

   Social Selection
The history of testing and assessment is closely linked to the desire to select in ways that do not perpetuate the interests of particular groups in society. Sometimes this is very deliberate, as was the case with the introduction of the Civil Service examinations, and school and University entrance examinations, in Victorian England. This was the topic of a lecture that I gave recently. Follow the trail below to read more about social selection, and watch the lecture.

The principle being espoused was that "Promotion would be on the basis of merit not on the grounds of 'preferment, patronage or purchase'" - see the link below.

   Social Mobility
As time passed, the notion of social mobility became very important, particularly after World War II when a socialist government came to power in Britain for the first time. Social mobility is now a political ideal for all parties, and not to have policies to promote social mobility is not acceptable. Consider one of these - in which a politician advocated lowering the examination grades required to enter University for children from lower socio-economic backgrounds. The result was outrage, and shouts of "communist". Why do you think this happened?

But if there is social mobility, it implies that it is possible to move down, as well as up. Indeed, the Victorian reformers wished to remove patronage from the aristocracy and make upward mobility possible for the new middle classes. Below is a link to an Analysis podcast on social mobility. Do you think that social mobility is (a) desirable, and (b) achievable?

   Market Value
It is nothing new to observe that test scores have market value. Here is what Latham wrote in 1887.

Parents want something to shew for education; a place in an examination list seems to gauge the advantage which they have paid for, and besides it frequently has a positive market value as opening the door to some emolument or profession.

Often it is not the learning that is desired or valued, but the certificate that verifies that learning has taken place. When the test score and the certificate provide access to education and/or economic advantage, they can become more desirable than the effort required to obtain them. This is why cheating has been endemic for as long as there have been high stakes tests. Follow the path to my cheating feature. Do you believe that there is any solution to the phenomenon of cheating?



As a social activity, testing practice is affected by the values of all those involved. Messick argued that "values and validity go hand in hand", and yet there is still a "consequential controversy". Many would like to exclude values from the debate altogether. But the history of testing shows that score interpretation is dependent upon the contingent values of researchers working and living at particular periods of time. Many of these seem repugnant with hindsight, but they were part of the accepted intellectual climate of the day. Indeed, individuals who held these alien values explicitly argued that their purpose was to improve the human condition and the state of society. Even the commonly accepted values of the meritocracy that we uncritically accept today may be questioned - and frequently are. Yet, I argue that there are tell-tale signs of values that lead to unethical practices across generations. One of these is the unnecessary reification of constructs, and another is the view that the absence of these constructs is hereditary rather than the result of poverty or lack of opportunity. It is our understanding of humanity that must explicitly drive values and practices in any progressive testing theory.

The Aristotle Trail

   Ethics
The International Language Testing Association has articulated its values in a Code of Ethics, available from its website. The code provides principles with explanatory annotations. These are designed to guide practitioners to execute their craft in ways that do no harm to, and actively promote the wellbeing of, test takers. With reference to the historical examples provided in this chapter of the book, do you think that the principles are relevant to the past, as well as the present?

   Politics
Tests also fulfill political as well as social functions. They may be used for control and maintenance of the status quo in society, or to manipulate institutions or countries. In some cases, tests force compliance with political agendas, which may include the creation of super-states. This happened in the time of Charlemagne, and some argue that it is taking place again in modern Europe. Do you think that tests play such a role, or is this just one big conspiracy theory?

   Kantsaywhere
Sir Francis Galton certainly had a vision of society, and he believed that tests could help create it. That vision was expressed in a (very poorly written) dystopian novel that his estate tried to destroy after his death. Kantsaywhere is the story of a eugenics college that uses tests to select the fittest for education and leadership, and sends the weak to special work camps. Sound familiar? The novel is now available in pdf from the Galton archive at University College London.

This reminds me of the film Gattaca. Set in the not-too-distant future, it depicts a world in which all people are classified according to their genetic makeup and allocated to suitable roles in society. Plato would have found this an ideal solution for his own vision of the world. I have placed a link below to the trailer. This film deserved a much larger audience than it got on release in 1997, and the punch line "there is no gene for the human spirit" remains a powerful message today, as some genetic research threatens to revive the Galtonian ambition.

An html copy of Kantsaywhere is also available here:

   Eugenics
The book does not dwell for too long on eugenics, or the history of the use of testing in eugenics. The story has been told many times, particularly well by Gould. However, it is important to consider the values that provide the fertile ground for eugenics to flourish, and recognize that at the time these were common views. That is why they were not readily questioned. And it is only by remembering that we can avoid making the same mistakes in the future.

Below is a link to a short YouTube video that encapsulates the key facts about the Eugenics movement.

The use of flawed tests based on unsound constructs, correlated with genetic data, is simply bad science. And bad science can produce strange values. What are the implications of newspaper reports like that reproduced below?

Here is a thoughtful treatment. But they also need a testing expert to tell them that there's no point in correlating anything with a test that lacks validity, or in going on to extract a primary factor that is then interpreted as a major finding. We've been there, done that, and realised our error.


Testing Values

Numbers are only as useful as the reality they are designed to represent. In social science research, they are never "pure", which is why I believe that even the arithmetic mean is not really descriptive in the same way it is in astronomy. The same point was made by Lippmann almost 100 years ago. And for "intelligence" also read any psychological or social construct.

"Because the results are expressed in numbers, it is easy to make the mistake of thinking that the intelligence test is a measure like a foot rule or a pair of scales. It is, of course, a quite different sort of measure. For length and weight are qualities which men have learned how to isolate no matter whether they are found in an army of soldiers, a heap of bricks, or a collection of chlorine molecules. Provided the foot rule and the scales agree with the arbitrarily accepted standard foot and standard pound in the Bureau of Standards at Washington they can be used with confidence. But 'intelligence' is not an abstraction like length and weight; it is an exceedingly complicated notion which nobody has as yet succeeded in defining" (Lippmann, 1922)

I conclude with a quotation that summarises the two principles that guided the writing of a book. The first principle encompasses the author's view of what humanity is. The second accounts for intercultural differences that require language learning and cultural sensitivity to realise common human bonds:

  1. Unity of the human race. In the discussions of this chapter there is a basic assumption of and belief in the unity of all mankind. All races have the same origin and are capable of the same emotions and the same needs encompassing the whole range of human experience from hunger and the craving for food to theological inquiry and the seeking of God.
  2. Cultural diversity of man. Within this genetic unity of mankind, different groups or communities of people have evolved different ways of life which facilitate their own in-group interrelations but at the same time set them apart from other groups. When these ways of life are sufficiently different from others they constitute separate cultures.

The writer was not a philosopher, but a language tester. Or perhaps he was both.

Lado, R. (1961). Language Testing. London: Longman, p. 276.