Language Testing Bytes is a podcast to accompany the SAGE journal Language Testing. Three or four times per year, we will release a podcast in which we discuss topics related to a particular issue of the journal. This may be an interview with a contributor to the journal, or another expert in the field. You can download the podcast from this website, from ltj.sagepub.com, or you can subscribe to the podcast through iTunes.
Coming Soon: In the next podcast I'll be talking to Fred Davidson and Cary Lin about the use of statistics in language testing, and how we teach statistics as part of the emerging language testing literacy debate.
This special issue of Language Testing explores raters’ evaluations of L2 proficiency and possible causes of variability of rater judgments. In addition to the analysis of rater behavior and rater consistency, we investigate the relationship between general measures of oral and written L2 performance concerning complexity, accuracy and fluency of L2 production and overall judgments of oral and written L2 performance by raters, based on holistic rating scales. Finally, the use of rating scales in different contexts and for different types of learners is also examined. In this introduction the three central themes presented in the various contributions are briefly discussed: rater behavior and rater consistency, rater judgments and measures of language performance, and the use of global rating scales.
One core area of research in Second Language Acquisition is the identification and definition of developmental stages in different L2s. For L2 French, Bartning and Schlyter (2004) presented a model of six morphosyntactic stages of development in the shape of grammatical profiles. The model formed the basis for the computer program Direkt Profil (Granfeldt et al., 2006), which carries out an automated analysis of the developmental stage of a learner text. The aim of the present study was to explore the relevance of Direkt Profil as a diagnostic assessment tool by comparing Direkt Profil’s automated profile analysis with assessment by trained language teachers.
Data for the present study come from the CEFLE corpus of written L2 French (Ågren, 2008). The learner texts were first analysed for developmental stage by the computer program Direkt Profil. In a second step, seven experienced language teachers of French assessed the same texts. The results indicated relatively high degrees of correlation and showed that the analysis of developmental stage by Direkt Profil could explain 73% of the variance in the teachers’ mean assessments (r² = 0.735). In addition, we concluded that the teachers were in agreement with each other and with the computer program when assessing texts at low and high proficiency levels respectively. The most important variation in the teachers’ assessments was found in texts at intermediate levels, due to an inconsistent use of grammar and vocabulary. One of the advantages of using Direkt Profil as a diagnostic assessment tool is that it provides immediate and detailed feedback indicating how certain types of linguistic structures, correct or incorrect, are related to different stages of development.
There is still relatively little research on how well the CEFR and similar holistic scales work when they are used to rate L2 texts. Using both multifaceted Rasch analyses and qualitative data from rater comments and interviews, the ratings obtained by using a CEFR-based writing scale and the Finnish National Core Curriculum scale for L2 writing were examined to validate the rating process used in the study of the linguistic basis of the CEFR in L2 Finnish and English. More specifically, we explored the quality of the ratings and the rating scales across different tasks and across the two languages. As the task is an integral part of the data-gathering procedure, the relationship of task performance across the scales and languages was also examined. We believe the kinds of analyses reported here are also relevant to other SLA studies that use rating scales in their data-gathering process.
This study investigates the relationship in L2 writing between raters’ judgments of communicative adequacy and linguistic complexity by means of six-point Likert scales, and general measures of linguistic performance. The participants were 39 learners of Italian and 32 of Dutch, who wrote two short argumentative essays. The same writing tasks were administered to a control group of 18 native writers of Italian and 17 of Dutch. During a panel discussion raters were asked to verbalize for which reasons they assigned a text to a particular rating level. The results show that although raters’ judgements of communicative adequacy largely corresponded to their judgments of linguistic complexity, the findings for L2 and L1 turned out to be different. In L2 overall ratings of linguistic complexity were correlated with lexical diversity and accuracy, but not with syntactic complexity. In L1 hardly any correlations between raters’ judgements and general measures of syntactic complexity and lexical diversity were found. Furthermore, raters used different strategies when assessing high- and low-proficiency L2 writers or native writers, and seemed to attach more importance to textual features connected to communicative adequacy than to linguistic complexity and accuracy.
Oral fluency and foreign accent distinguish L2 from L1 speech production. In language testing practices, both fluency and accent are usually assessed by raters. This study investigates what exactly native raters of fluency and accent take into account when judging L2. Our aim is to explore the relationship between objectively measured temporal, segmental and suprasegmental properties of speech on the one hand, and fluency and accent as rated by native raters on the other hand. For 90 speech fragments from Turkish and English L2 learners of Dutch, several acoustic measures of fluency and accent were calculated. In Experiment 1, 20 native speakers of Dutch rated the L2 Dutch samples on fluency. In Experiment 2, 20 different untrained native speakers of Dutch judged the L2 Dutch samples on accentedness. Regression analyses revealed, first, that acoustic measures of fluency were good predictors of fluency ratings. Second, segmental and suprasegmental measures of accent could predict some variance of accent ratings. Third, perceived fluency and perceived accent were only weakly related. In conclusion, this study shows that fluency and perceived foreign accent can be judged as separate constructs.
This study examines the methodology of global foreign accent ratings in studies on L2 speech production. In three experiments, we test how variation in raters, range within speech samples, as well as instructions and procedures affects ratings of accent in predominantly monolingual speakers of German, non-native speakers of German, as well as long-term emigrants from Germany, that is, L1 attriters. The findings show that rater differences do not result in systematic changes in rating patterns. In contrast, range effects and effects of familiarity with accented speech lead to shifts in absolute and relative ratings. Including more strongly foreign-accented samples leads to lower judgements for the entire group of L2 speakers compared to natives. Similarly, lower familiarity with foreign accent results in more variable and more strongly foreign-accented judgements. We discuss the implications for research on L2 pronunciation as well as for the interpretation of nativeness in L2 studies and language testing more generally.
Language Testing is an international peer reviewed journal that publishes original research on language testing and assessment. Since 1984 it has featured high impact papers covering theoretical issues, empirical studies, and reviews. The journal's wide scope encompasses first and second language testing and assessment of English and other languages, and the use of tests and assessments as research and evaluation tools. Many articles also contribute to methodological innovation and the practical improvement of testing and assessment internationally. In addition, the journal publishes submissions that deal with policy issues, including the use of language tests and assessments for high stakes decision making in fields as diverse as education, employment and international mobility. The journal welcomes the submission of papers that deal with ethical and philosophical issues in language testing, as well as technical matters. Also of concern is research into the washback and impact of language test use, and ground-breaking uses of assessments for learning. Additionally, the journal wishes to publish replication studies that help to embed and extend our knowledge of generalisable findings in the field. Language Testing is committed to encouraging interdisciplinary research, and is keen to receive submissions which draw on theory and methodology from different fields of applied linguistics, as well as educational measurement, and other relevant disciplines.
How to put the podcast onto your iPod
Decide which of the podcasts below you would like to listen to. Right click on the link, and select 'save target as' to download it into a folder on your computer.
Open iTunes. Click on 'file' and then 'new playlist'. Name your playlist 'Language Testing Bytes'.
Click on the playlist from the iTunes menu.
Open the folder in which you saved the podcast, then drag the podcast from the folder and drop it into the playlist.
Synchronize your iPod.
When you next access your iPod go to the Language Testing Bytes playlist to play the podcast.
Alternatively, just pop it on whichever mp3 player you currently use, or subscribe to the SAGE Podcast on iTunes.
Issue 18: Folkert Kuiken and Ineke Vedder from the University of Amsterdam discuss rater variability in the assessment of speaking and writing in a second language.
The third issue of the journal this year is a special on the scoring of performance tests. In this podcast the guest editors talk about some of the issues surrounding the rating of speaking and writing samples.
Issue 17: Ryo Nitta and Fumiyo Nakatsuhara on pre-task planning in paired speaking tests
The authors of our first paper in 31(2) are concerned with a very practical question. What is the effect of giving test-takers planning time prior to a paired-format speaking task? Does it affect what they say? Does it change the scores they get? The answers will inform the design of speaking tests not only in high stakes assessment contexts, but probably in classrooms as well.
Issue 16: Jodi Tommerdahl and Cynthia Kilpatrick on the reliability of morphological analyses in language samples
How large a language sample do we need in order to draw reliable conclusions about what we wish to assess? In issue 31(1) of Language Testing we are delighted to publish a paper by Jodi Tommerdahl and Cynthia Kilpatrick that addresses this important issue.
Issue 30(4) of the journal contains the first paper on eye-tracking studies to investigate the cognitive processes of learners taking reading tests. Stephen Bax joins us to explain the methodology and what it can tell us about how successful readers go about processing items and texts in reading tests.
Issue 30(3) commemorates the 30th Anniversary of the founding of the journal. We mark this milestone in the journal's history with a special issue on the topic of Assessment Literacy, guest edited by Ofra Inbar. A concern for the literacy needs of a wide range of stakeholders who use tests and test scores beyond the experts is a sign of a maturing profession. This issue takes the debate forward in new and exciting ways, some of which Ofra Inbar discusses on this podcast.
Issue 13: Paula Winke and Susan Gass on Rater Bias
Rater bias is something that language testers have known about for a long time, and have tried to control through training and the use of rating scales. But investigation into the source and nature of bias is relatively recent. In issue 30(2) of the journal Paula Winke, Susan Gass, and Carol Myford share their research in this field, and the first two authors from Michigan State University join us on Language Testing Bytes to discuss rater bias.
Issue 12: Alan Davies on Assessing Academic English
In 2008 Alan Davies' book Assessing Academic English was published by Cambridge University Press. In issue 30(1) of Language Testing it is reviewed by Christine Coombe. With a strong historical narrative, the book raises many of the enduring issues in assessing English for study in English medium institutions. In this podcast we explore some of these with Professor Davies.
Issue 11: Ana Pellicer-Sanchez and Norbert Schmitt on Yes-No Vocabulary Tests
In this issue of the podcast we return to vocabulary testing, after the great introduction provided by John Read in Issue 5. This time, we welcome Ana Pellicer-Sanchez and Norbert Schmitt to talk about the popular Yes-No Vocabulary Test. Their recent research looks at scoring issues and potential solutions to problems that have plagued the test for years. Their paper in issue 29(4) of the journal contains the details, but in the podcast we discuss the key issues for vocabulary assessment.
Issue 10: Kathryn Hill on Classroom Based Assessment
Classroom Based Assessment is an increasingly important topic in language education, and in issue 29(3) of Language Testing we publish a paper by Kathryn Hill and Tim McNamara entitled "Developing a comprehensive, empirically based research framework for classroom-based assessment". The research in this paper is based on the first author's PhD dissertation, and so we asked Kathryn Hill to join us on Language Testing Bytes to talk about developments in the field.
Issue 9: Luke Harding on Accent in Listening Assessment
Issue 29(2) of the journal contains a paper entitled "Accent, listening assessment and the potential for a shared-L1 advantage: A DIF perspective", by Luke Harding. In this podcast we explore why it is that most listening tests use a very narrow range of standard accents, rather than the many varieties that we are likely to encounter in real-world communication.
Issue 8: Tan Jin and Barley Mak on Confidence Scoring
In Issue 29(1) of the journal three authors from the Chinese University of Hong Kong have a paper on the application of fuzzy logic to scoring speaking tests. This is termed 'confidence scoring', and the first two authors join us on Language Testing Bytes to explain a little more about their novel approach.
Mark Wilson delivered the Messick Memorial Lecture at the Language Testing Research Colloquium in Melbourne, 2006, on new developments in measurement models to take into account the complexity of language testing. In Language Testing 28(4) we publish the paper based on this lecture, and Mark joins us on Language Testing Bytes to talk about his work in this area.
Issue 6: Craig Deville and Micheline Chalhoub-Deville on Standards-Based Testing
Standards-Based Testing is highly controversial, both for its social and educational impact on schools and bilingual communities and for its technical aspects, which rely to a significant extent on expert judgment. In issue 28(3) we discuss the issues surrounding Standards-Based Testing in the United States with the guest editors of a special issue on this topic. The collection of papers that they have brought together, along with reviews of recent books on the topic and a test review, constitutes a state of the art volume for the field.
The journal has seen a flurry of articles on vocabulary testing in recent months, and issue 28(2) is no exception, with Marta Fairclough's paper on the lexical recognition task. It seemed like an appropriate moment to consider why vocabulary is receiving so much attention, and so we turned to Professor John Read of the University of Auckland, New Zealand, to give us an overview of current research and activity within the field.
Issue 4: Khaled Barkaoui and Melissa Bowles on Think Aloud Protocols
In Language Testing 28(1), 2011, Khaled Barkaoui has an article on the use of think-alouds to investigate rater processes and decisions as they rate essay samples. The focus is not on the raters, but on whether the research method is a useful tool for the purpose. In this podcast he explains his findings, and their importance. We are then joined by Melissa Bowles who has recently published The Think-Aloud Controversy in Second Language Research, to explain precisely what the problems and possibilities of think-alouds are in language testing research.
Language Testing 27(4), 2010, contains an article by Carol Chapelle and colleagues on testing productive grammatical ability. We thought this would be an excellent opportunity to look at what is going on in the field of assessing grammar, and what issues currently face the field. Jim Purpura agreed to talk to us on Language Testing Bytes.
Language Testing 27(3), 2010, is a special issue guest edited by Xiaoming Xi on the automated scoring of writing and speaking tests. In this podcast she talks about why the automated scoring of speaking and writing tests is such a hot topic, and explains the possibilities, limitations and current research issues in the field.
In Language Testing 27(2), 2010, Mike Kane contributed a response to an article on fairness in language testing. We thought this was an excellent opportunity to ask him about his approach to validation, and how he sees 'fairness' fitting into the picture.