Language Testing Bytes is a biannual podcast produced for SAGE Publications to accompany the journal Language Testing. In the podcast I discuss issues raised in the journal with authors and key figures in the field. You can download a podcast for your iPod or other device by right-clicking on the download icon, or you can play a podcast directly from this page. Also available on iTunes.
Brent Bridgeman and Yeonsuk Cho on Predictive Validity [Fall 2016]
Cathie Elder on LSP Testing and the OET [Spring 2016]
2010 - 2015
Issue 22: Eunice Jang on Diagnostic Language Testing.
Issue 32(2) of Language Testing is a special issue on the current state of Diagnostic Language Testing. While this has traditionally been a neglected use of language tests, there is currently a great surge of interest and research in the field. Eunice Jang from the University of Toronto joins me to discuss current thinking in testing for diagnostic purposes.
The assessment of aviation English has become something of an icon of high stakes assessment in recent years. In Language Testing 32(2), we publish a paper by Hyejeong Kim and Cathie Elder, both from the University of Melbourne, which examines the construct of aviation English from the perspective of airline professionals in Korea.
In this issue of the podcast Martin East describes an assessment reform project in New Zealand. We're reminded very forcefully that when assessment and testing procedures within educational systems are changed, there are many complex factors to take into account. All stakeholders are going to take a view on the proposed reforms, and they aren't necessarily going to agree.
Issue 19: Fred Davidson and Cary Lin of the University of Illinois at Urbana-Champaign discuss the role of statistics in language testing.
The last issue of volume 31 contains a review of Rita Green's new book on statistics in language testing. We take the opportunity to talk about how things have changed in teaching statistics for students of language testing since Fred Davidson's The language tester's statistical toolbox was published in 2000.
Issue 18: Folkert Kuiken and Ineke Vedder from the University of Amsterdam discuss rater variability in the assessment of speaking and writing in a second language.
The third issue of the journal this year is a special issue on the scoring of performance tests. In this podcast the guest editors talk about some of the issues surrounding the rating of speaking and writing samples.
Issue 17: Ryo Nitta and Fumiyo Nakatsuhara on pre-task planning in paired speaking tests
The authors of our first paper in 31(2) are concerned with a very practical question. What is the effect of giving test-takers planning time prior to a paired-format speaking task? Does it affect what they say? Does it change the scores they get? The answers will inform the design of speaking tests not only in high stakes assessment contexts, but probably in classrooms as well.
Issue 16: Jodi Tommerdahl and Cynthia Kilpatrick on the reliability of morphological analyses in language samples
How large a language sample do we need in order to draw reliable conclusions about what we wish to assess? In issue 31(1) of Language Testing we are delighted to publish a paper by Jodi Tommerdahl and Cynthia Kilpatrick that addresses this important issue.
Issue 30(4) of the journal contains the first paper on eye-tracking studies to investigate the cognitive processes of learners taking reading tests. Stephen Bax joins us to explain the methodology and what it can tell us about how successful readers go about processing items and texts in reading tests.
Issue 30(3) commemorates the 30th Anniversary of the founding of the journal. We mark this milestone in the journal's history with a special issue on the topic of Assessment Literacy, guest edited by Ofra Inbar. A concern for the literacy needs of a wide range of stakeholders who use tests and test scores beyond the experts is a sign of a maturing profession. This issue takes the debate forward in new and exciting ways, some of which Ofra Inbar discusses on this podcast.
Issue 13: Paula Winke and Susan Gass on Rater Bias
Rater bias is something that language testers have known about for a long time, and have tried to control through training and the use of rating scales. But investigations into the source and nature of bias are relatively recent. In issue 30(2) of the journal Paula Winke, Susan Gass, and Carol Myford share their research in this field, and the first two authors, from Michigan State University, join us on Language Testing Bytes to discuss rater bias.
Issue 12: Alan Davies on Assessing Academic English
In 2008 Alan Davies' book Assessing Academic English was published by Cambridge University Press. In issue 30(1) of Language Testing it is reviewed by Christine Coombe. With a strong historical narrative, the book raises many of the enduring issues in assessing English for study in English medium institutions. In this podcast we explore some of these with Professor Davies.
Issue 11: Ana Pellicer-Sanchez and Norbert Schmitt on Yes-No Vocabulary Tests
In this issue of the podcast we return to vocabulary testing, after the great introduction provided by John Read in Issue 5. This time, we welcome Ana Pellicer-Sanchez and Norbert Schmitt, to talk about the popular Yes/No Vocabulary Test. Their recent research looks at scoring issues and potential solutions to problems that have plagued the test for years. Their paper in issue 29(4) of the journal contains the details, but in the podcast we discuss the key issues for vocabulary assessment.
Issue 10: Kathryn Hill on Classroom Based Assessment
Classroom Based Assessment is an increasingly important topic in language education, and in issue 29(3) of Language Testing we publish a paper by Kathryn Hill and Tim McNamara entitled "Developing a comprehensive, empirically based research framework for classroom-based assessment". The research in this paper is based on the first author's PhD dissertation, and so we asked Kathryn Hill to join us on Language Testing Bytes to talk about developments in the field.
Issue 9: Luke Harding on Accent in Listening Assessment
Issue 29(2) of the journal contains a paper entitled "Accent, listening assessment and the potential for a shared-L1 advantage: A DIF perspective", by Luke Harding. In this podcast we explore why it is that most listening tests use a very narrow range of standard accents, rather than the many varieties that we are likely to encounter in real-world communication.
Issue 8: Tan Jin and Barley Mak on Confidence Scoring
In Issue 29(1) of the journal three authors from the Chinese University of Hong Kong have a paper on the application of fuzzy logic to scoring speaking tests. This is termed 'confidence scoring', and the first two authors join us on Language Testing Bytes to explain a little more about their novel approach.
Mark Wilson delivered the Messick Memorial Lecture at the Language Testing Research Colloquium in Melbourne, 2006, on new developments in measurement models to take into account the complexity of language testing. In Language Testing 28(4) we publish the paper based on this lecture, and Mark joins us on Language Testing Bytes to talk about his work in this area.
Issue 6: Craig Deville and Micheline Chalhoub-Deville on Standards-Based Testing
Standards-Based Testing is highly controversial, both for its social and educational impact on schools and bilingual communities and for its technical aspects, which rely to a significant extent on expert judgment. In issue 28(3) we discuss the issues surrounding Standards-Based Testing in the United States with the guest editors of a special issue on this topic. The collection of papers they have brought together, along with reviews of recent books on the topic and a test review, constitutes a state-of-the-art volume for the field.
The journal has seen a flurry of articles on vocabulary testing in recent months, and issue 28(2) is no exception, with Marta Fairclough's paper on the lexical recognition task. It seemed like an appropriate moment to consider why vocabulary is receiving so much attention, and so we turned to Professor John Read of the University of Auckland, New Zealand, to give us an overview of current research and activity within the field.
Issue 4: Khaled Barkaoui and Melissa Bowles on Think Aloud Protocols
In Language Testing 28(1), 2011, Khaled Barkaoui has an article on the use of think-alouds to investigate rater processes and decisions as they rate essay samples. The focus is not on the raters, but on whether the research method is a useful tool for the purpose. In this podcast he explains his findings, and their importance. We are then joined by Melissa Bowles who has recently published The Think-Aloud Controversy in Second Language Research, to explain precisely what the problems and possibilities of think-alouds are in language testing research.
Language Testing 27(4), 2010, contains an article by Carol Chapelle and colleagues on testing productive grammatical ability. We thought this would be an excellent opportunity to look at what is going on in the field of assessing grammar, and what issues currently face the field. Jim Purpura agreed to talk to us on Language Testing Bytes.
Language Testing 27(3), 2010, is a special issue guest edited by Xiaoming Xi on the automated scoring of writing and speaking tests. In this podcast she talks about why the automated scoring of speaking and writing tests is such a hot topic, and explains the possibilities, limitations and current research issues in the field.
In Language Testing 27(2), 2010, Mike Kane contributed a response to an article on fairness in language testing. We thought this was an excellent opportunity to ask him about his approach to validation, and how he sees 'fairness' fitting into the picture.
Educational policies such as Race to the Top in the USA affirm a central role for testing systems in government-driven reform efforts. Such reform policies are often referred to as the global education reform movement (GERM). Changes observed with the GERM style of testing demand socially engaged validity theories that include consequential research. The article revisits the Standards and Kane’s interpretive argument (IA) and argues that the role envisioned for consequences remains impoverished. Guided by theory of action, the article presents a validity framework, which targets policy-driven assessments and incorporates a social role for consequences. The framework proposes a coherent system that makes explicit the interconnections among policy ambitions, testing functions, and the levels/sectors that are affected. The article calls for integrating consequences into technical quality documentation, demands a more realistic delineation of stakeholders and their roles, and compels engagement in policy research.
American Sign Language (ASL) is one of the most commonly taught languages in North America. Yet, few assessment instruments for ASL proficiency have been developed, none of which have adequately demonstrated validity. We propose that the American Sign Language Discrimination Test (ASL-DT), a recently developed measure of learners’ ability to discriminate phonological and morphophonological contrasts in ASL, provides an objective overall measure of ASL proficiency. In this study, the ASL-DT was administered to 194 participants at beginning, intermediate, and high levels of ASL proficiency, a subset of whom (N = 57) were also administered the Sign Language Proficiency Interview (SLPI), a widely used subjective proficiency measure. Using Rasch analysis to model ASL-DT item difficulty and person ability, we tested the ability of the ASL-DT Rasch measure to detect participant proficiency group mean differences and compared its discriminant performance to the SLPI ratings for classifying individuals into their pre-assigned proficiency groups using receiver operating characteristic statistics. The ASL-DT Rasch measure outperformed the SLPI ratings, indicating that the ASL-DT may provide a valid objective measure of overall ASL proficiency. As such, the ASL-DT Rasch measure may provide a useful complement to measures such as the SLPI in comprehensive sign language assessment programs.
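The receiver operating characteristic comparison described above rests on the area under the ROC curve (AUC): the probability that a randomly chosen higher-proficiency examinee outscores a randomly chosen lower-proficiency one. A minimal sketch of that computation, using the rank-based (Mann–Whitney) formulation; the ability estimates and group assignments below are invented for illustration:

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a score from the positive (higher-proficiency)
    group exceeds a score from the negative group; ties count one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical Rasch ability estimates for two pre-assigned proficiency groups
high = [1.8, 1.2, 0.9, 0.4]
low = [0.5, -0.2, -0.8, -1.1]
print(roc_auc(high, low))  # → 0.9375
```

An AUC of 0.5 means the measure separates the groups no better than chance; comparing the AUCs of two measures (here, ASL-DT Rasch estimates versus SLPI ratings) on the same examinees is what "outperformed" refers to in the abstract.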
Elicited imitation (EI) has been widely used to examine second language (L2) proficiency and development and was an especially popular method in the 1970s and early 1980s. However, as the field embraced more communicative approaches to both instruction and assessment, the use of EI diminished, and the construct-related validity of EI scores as a representation of language proficiency was called into question. Current uses of EI, while not discounting the importance of communicative activities and assessments, tend to focus on the importance of processing and automaticity. This study presents a systematic review of EI in an effort to clarify the construct and usefulness of EI tasks in L2 research.
The review underwent two phases: a narrative review and a meta-analysis. We surveyed 76 theoretical and empirical studies from 1970 to 2014 to investigate the use of EI, in particular with respect to the research/assessment context and task features. The results of the narrative review provided a theoretical basis for the meta-analysis. The meta-analysis utilized 24 independent effect sizes based on 1089 participants obtained from 21 studies. To investigate evidence of construct-related validity for EI, we examined the following: (1) the ability of EI scores to distinguish speakers across proficiency levels; (2) correlations between scores on EI and other measures of language proficiency; and (3) key task features that moderate the sensitivity of EI.
Results of the review demonstrate that EI tasks vary greatly in terms of task features; however, EI tasks in general have a strong ability to discriminate between speakers across proficiency levels (Hedges’ g = 1.34). Additionally, construct, sentence length, and scoring method were identified as moderators for the sensitivity of EI. Findings of this study provide supportive construct-related validity evidence for EI as a measure of L2 proficiency and inform appropriate EI task development and administration in L2 research and assessment.
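The effect size reported above, Hedges' g, is the standardized mean difference between two groups (Cohen's d) multiplied by a small-sample bias-correction factor. A minimal sketch of the computation; the summary statistics below are invented for illustration, not taken from the meta-analysis:

```python
import math

def hedges_g(m1, m2, s1, s2, n1, n2):
    """Bias-corrected standardized mean difference between two groups."""
    # Pooled standard deviation across the two groups
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp                   # Cohen's d
    j = 1 - 3 / (4 * (n1 + n2) - 9)     # small-sample correction factor
    return j * d

# Hypothetical EI scores: higher- vs. lower-proficiency group
g = hedges_g(m1=10.0, m2=8.0, s1=2.0, s2=2.0, n1=30, n2=30)
print(round(g, 3))  # → 0.987
```

By conventional benchmarks a g of 1.34, as reported for EI's discrimination across proficiency levels, is a very large effect.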
The present study used the mixed Rasch model (MRM) to identify subgroups of readers within a sample of students taking an EFL reading comprehension test. Six hundred and two (602) Chinese college students took a reading test and a lexico-grammatical knowledge test and completed a Metacognitive and Cognitive Strategy Use Questionnaire (MCSUQ) (Zhang, Goh, & Kunnan, 2014). MRM analysis revealed two latent classes. Class 1 was more likely to score highly on reading in-depth (RID) items. Students in this class had significantly higher general English proficiency, better lexico-grammatical knowledge, and reported using reading strategies more frequently, especially planning, monitoring, and integrating strategies. In contrast, Class 2 was more likely to score highly on skimming and scanning (SKSN) items, but had relatively lower mean scores for lexico-grammatical knowledge and general English proficiency; they also reported using strategies less frequently than did Class 1. The implications of these findings and further research are discussed.
Placement and screening tests serve important functions, not only with regard to placing learners at appropriate levels of language courses but also with a view to maximizing the effectiveness of administering test batteries. We examined two widely reported formats suitable for these purposes, the discrete decontextualized Yes/No vocabulary test and the embedded contextualized C-test format, in order to determine which format can explain more variance in measures of listening and reading comprehension. Our data stem from a large-scale assessment with over 3000 students in the German secondary educational context; the four measures relevant to our study were administered to a subsample of 559 students. Using regression analysis on observed scores and SEM on a latent level, we found that the C-test outperforms the Yes/No format in both methodological approaches. The contextualized nature of the C-test seems to be able to explain large amounts of variance in measures of receptive language skills. The C-test, being a reliable, economical and robust measure, appears to be an ideal candidate for placement and screening purposes. In a side-line of our study, we also explored different scoring approaches for the Yes/No format. We found that using the hit rate and the false-alarm rate as two separate indicators yielded the most reliable results. These indicators can be interpreted as measures of vocabulary breadth and guessing respectively, and together they make it possible to control for guessing.
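The two-indicator scoring described above is straightforward: a Yes/No test mixes real words with pseudowords, the hit rate is the proportion of real words the test-taker claims to know, and the false-alarm rate is the proportion of pseudowords claimed as known. A minimal sketch (the item lists and responses here are invented for illustration):

```python
def yes_no_scores(claimed_known, real_words, pseudowords):
    """claimed_known: set of items the test-taker marked as 'yes'.
    Returns (hit_rate, false_alarm_rate)."""
    hits = sum(1 for w in real_words if w in claimed_known)
    false_alarms = sum(1 for w in pseudowords if w in claimed_known)
    return hits / len(real_words), false_alarms / len(pseudowords)

real = ["house", "window", "garden", "bread"]
fake = ["plonter", "wadge"]  # pseudowords: 'yes' responses indicate guessing
claimed = {"house", "window", "garden", "plonter"}
print(yes_no_scores(claimed, real, fake))  # → (0.75, 0.5)
```

Keeping the two rates separate, rather than collapsing them into a single corrected score, is what the authors found yielded the most reliable results: the hit rate indexes vocabulary breadth while the false-alarm rate indexes guessing.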
The literature provides consistent evidence of a strong relationship between language proficiency and math achievement. However, research results conflict, supporting either an increasing or a decreasing longitudinal relationship between the two. This study explored longitudinal data and adopted quantile regression analyses to overcome several limitations of past research. The goal of the study is to obtain more accurate and richer information on the long-term relationship between language and math, taking socioeconomic status, gender, and ethnicity into consideration at the same time. Results confirmed a persistent relationship between math achievement and all the factors explored. More importantly, they revealed that the strength of the relationship between language and math differed for students of various ability levels both within and across grades. Model comparison suggests that language demand contributes to the achievement gap between ELLs and non-ELLs in math. There also appears to be a disadvantage in academic achievement for the geographically isolated group. Interpretations and implications for teaching and assessment are discussed.