Language Testing Bytes is a podcast to accompany the SAGE journal Language Testing. Three or four times per year, we will release a podcast in which we discuss topics related to a particular issue of the journal. This may be an interview with a contributor to the journal, or another expert in the field. You can download the podcast from this website, from ltj.sagepub.com, or you can subscribe to the podcast through iTunes.
Coming Soon: The next podcast will accompany a special issue on Assessment Literacy. The guest editor, Ofra Inbar, will introduce the topic for us in issue 14 of LTB.
The oral fluency level of an L2 speaker is often used as a measure in assessing language proficiency. The present study reports on four experiments investigating the contributions of three fluency aspects (pauses, speed and repairs) to perceived fluency. In Experiment 1 untrained raters evaluated the oral fluency of L2 Dutch speakers. Using specific acoustic measures of pause, speed and repair phenomena, linear regression analyses revealed that pause and speed measures best predicted the subjective fluency ratings, and that repair measures contributed only very little. A second research question sought to account for these results by investigating perceptual sensitivity to acoustic pause, speed and repair phenomena, possibly accounting for the results from Experiment 1. In Experiments 2–4 three new groups of untrained raters rated the same L2 speech materials from Experiment 1 on the use of pauses, speed and repairs. A comparison of the results from perceptual sensitivity (Experiments 2–4) with fluency perception (Experiment 1) showed that perceptual sensitivity alone could not account for the contributions of the three aspects to perceived fluency. We conclude that listeners weigh the importance of the perceived aspects of fluency to come to an overall judgment.
Partial dictation is a measure of EFL listening proficiency that can be easily constructed, administered, and scored by EFL teachers. However, it is controversial whether this form of test measures lower-order abilities exclusively or involves both lower- and higher-order abilities. In order to answer this question, a study was designed to examine the difference between partial dictation and test forms believed to measure more higher-order abilities. In a series of confirmatory factor analyses, the simplex, second-order, and bi-factor models were fitted to the scores of 367 college-level EFL learners in China in a listening test composed of partial dictation, gap-filling and constructed response tasks. The bi-factor model was identified as the best-fitting and this supports the view that partial dictation measures the same construct as test forms believed to measure more higher-order abilities. Concomitant statistical analyses also showed that the partial dictation tasks were suited to the ability level of the test takers and had high internal consistency.
Although a key concept in various writing textbooks, learning standards, and writing rubrics, voice remains a construct that is only loosely defined in the literature and impressionistically assessed in practice. Few attempts have been made to formally investigate whether and how the strength of an author’s voice in written texts can be reliably measured. Using a mixed-method approach, this study develops and validates an analytic rubric that measures voice strength in second language (L2) argumentative writing. Factor analysis of ratings from six raters on voice strength in a total of 400 TOEFL® iBT writing samples, together with qualitative analysis of four raters’ in-depth think-aloud and interview data, points to an alternative conceptualization of voice that sees authorial voice in written discourse as being realized primarily through the following dimensions: (1) the presence and clarity of ideas in the content; (2) the manner of the presentation of ideas; and (3) the writer and reader presence. Implications of such results for L2 writing instruction and assessment are discussed.
Based on evidence that listeners may favor certain foreign accents over others (Gass & Varonis, 1984; Major, Fitzmaurice, Bunta, & Balasubramanian, 2002; Tauroza & Luk, 1997) and that language-test raters may better comprehend and/or rate the speech of test takers whose native languages (L1s) are more familiar on some level (Carey, Mannell, & Dunn, 2011; Fayer & Krasinski, 1987; Scales, Wennerstrom, Richard, & Wu, 2006), we investigated whether accent familiarity (defined as having learned the test takers’ L1) leads to rater bias. We examined 107 raters’ ratings on 432 TOEFL iBT™ speech samples from 72 test takers. The raters of interest were L2 speakers of Spanish, Chinese, or Korean, while the test takers comprised three native-speaker groups (24 each) of Spanish, Chinese, and Korean. We analyzed the ratings using a multifaceted Rasch measurement approach. Results indicated that L2 Spanish raters were significantly more lenient with L1 Spanish test takers, as were L2 Chinese raters with L1 Chinese test takers. We conclude by concurring with Xi and Mollaun (2009, 2011) and Carey et al. that rater training should address raters’ linguistic background as a potential rater effect. Furthermore, we discuss the importance of recognizing rater L2 as a possible source of bias.
This study examines the development and evaluation of a bilingual Vocabulary Size Test (VST, Nation, 2006). A bilingual (English–Russian) test was developed and administered to 121 intermediate proficiency EFL learners (native speakers of Russian), alongside the original monolingual (English-only) version of the test. A comparison of the bilingual and monolingual test scores showed that participants achieved significantly higher scores on the bilingual version of the test. Accuracy of responses to individual test items was reliably higher when the meanings of test items were presented in the L1 (Russian) and when these items were cognates. The findings also revealed that the bilingual version is likely to be a more sensitive measure of written receptive vocabulary knowledge. Finally, analyses showed that the effect of using L1 for multiple-choice options is likely to be larger for low-proficiency learners and that the difference in response accuracy to cognates and non-cognates decreases as item frequency increases. The paper concludes with recommendations on developing and using bilingual vocabulary size tests.
Differential skill functioning (DSF) exists when examinees from different groups have different probabilities of successful performance in a certain subskill underlying the measured construct, given that they have the same ability on the overall construct. Using a DSF approach, this study examined the differences between two native language groups – a group with an East Asian language background and one with a Romance language background – in regard to reading subskills as represented in the Michigan English Language Assessment Battery (MELAB) reading test. Based on a combination of literature review and think-aloud reports from a sample of ESL students, hypotheses on reading subskill differences between the two groups were generated. These hypotheses were tested by first identifying the subskill profile of each examinee in a large MELAB database via the application of a previously determined item-skill Q-matrix to a Fusion Model of cognitive diagnostic modeling. The subskill profiles of the East Asian examinees were then compared against those of examinees with a Romance language background through logistic regression techniques. Some important DSFs were found between the two groups. Based on results of this study, instructional strategies were suggested to address some specific weaknesses in ESL learners’ reading subskills.
Language Testing is an international peer reviewed journal that
publishes original research on language testing and assessment. Since
1984 it has featured high impact papers covering theoretical issues,
empirical studies, and reviews. The journal's wide scope encompasses
first and second language testing and assessment of English and other
languages, and the use of tests and assessments as research and
evaluation tools. Many articles also contribute to methodological
innovation and the practical improvement of testing and assessment
internationally. In addition, the journal publishes submissions that
deal with policy issues, including the use of language tests and
assessments for high stakes decision making in fields as diverse as
education, employment and international mobility. The journal welcomes
the submission of papers that deal with ethical and philosophical issues
in language testing, as well as technical matters. Also of concern is
research into the washback and impact of language test use, and
ground-breaking uses of assessments for learning. Additionally, the
journal wishes to publish replication studies that help to embed and
extend our knowledge of generalisable findings in the field. Language
Testing is committed to encouraging interdisciplinary research, and is
keen to receive submissions which draw on theory and methodology from
different fields of applied linguistics, as well as educational
measurement, and other relevant disciplines.
How to put the podcast onto your iPod
Decide which of the podcasts below you would like to listen to. Right click on the link, and select 'save target as' to download it into a folder on your computer.
Open iTunes. Click on 'file' and then 'new playlist'. Name your playlist 'Language Testing Bytes'.
Click on the playlist from the iTunes menu.
Open the folder in which you saved the podcast, then drag the podcast from the folder and drop it into the playlist.
Synchronize your iPod.
When you next access your iPod go to the Language Testing Bytes playlist to play the podcast.
Alternatively, just pop it on whichever mp3 player you currently
use, or subscribe to the SAGE Podcast on iTunes.
Current Issue
Issue 13: Paula Winke and Susan Gass on Rater Bias
Rater bias is something that language testers have known about for a long time, and have tried to control through training and the use of rating scales. But investigations into the source and nature of bias are relatively recent. In issue 30(2) of the journal Paula Winke, Susan Gass, and Carol Myford share their research in this field, and the first two authors from Michigan State University join us on Language Testing Bytes to discuss rater bias.
Issue 12: Alan Davies on Assessing Academic English
In 2008 Alan Davies' book Assessing Academic English was published by Cambridge University Press. In issue 30(1) of Language Testing it is reviewed by Christine Coombe. With a strong historical narrative, the book raises many of the enduring issues in assessing English for study in English medium institutions. In this podcast we explore some of these with Professor Davies.
Issue 11: Ana Pellicer-Sanchez and Norbert Schmitt on Yes-No Vocabulary Tests
In this issue of the podcast we return to vocabulary testing, after the great introduction provided by John Read in Issue 5. This time, we welcome Ana Pellicer-Sanchez and Norbert Schmitt to talk about the popular Yes-No Vocabulary Test. Their recent research looks at scoring issues and potential solutions to problems that have plagued the test for years. Their paper in issue 29(4) of the journal contains the details, but in the podcast we discuss the key issues for vocabulary assessment.
Issue 10: Kathryn Hill on Classroom Based Assessment
Classroom Based Assessment is an increasingly important topic in language education, and in issue 29(3) of Language Testing we publish a paper by Kathryn Hill and Tim McNamara entitled "Developing a comprehensive, empirically based research framework for classroom-based assessment". The research in this paper is based on the first author's PhD dissertation, and so we asked Kathryn Hill to join us on Language Testing Bytes to talk about developments in the field.
Issue 9: Luke Harding on Accent in Listening Assessment
Issue 29(2) of the journal contains a paper entitled "Accent, listening assessment and the potential for a shared-L1 advantage: A DIF perspective", by Luke Harding. In this podcast we explore why it is that most listening tests use a very narrow range of standard accents, rather than the many varieties that we are likely to encounter in real-world communication.
Issue 8: Tan Jin and Barley Mak on Confidence Scoring
In Issue 29(1) of the journal three authors from the Chinese University of Hong Kong have a paper on the application of fuzzy logic to scoring speaking tests. This is termed 'confidence scoring', and the first two authors join us on Language Testing Bytes to explain a little more about their novel approach.
Mark Wilson delivered the Messick Memorial Lecture at the Language Testing Research Colloquium in Melbourne, 2006, on new developments in measurement models to take into account the complexity of language testing. In Language Testing 28(4) we publish the paper based on this lecture, and Mark joins us on Language Testing Bytes to talk about his work in this area.
Issue 6: Craig Deville and Micheline Chalhoub-Deville on Standards-Based Testing
Standards-Based Testing is highly controversial for its social and educational impact on schools and bilingual communities, and for its technical aspects, which rely to a significant extent on expert judgment. In issue 28(3) we discuss the issues surrounding Standards-Based Testing in the United States with the guest editors of a special issue on this topic. The collection of papers that they have brought together, along with reviews of recent books on the topic and a test review, constitutes a state-of-the-art volume for the field.
The journal has seen a flurry of articles on vocabulary testing in recent months, and issue 28(2) is no exception, with Marta Fairclough's paper on the lexical recognition task. It seemed like an appropriate moment to consider why vocabulary is receiving so much attention, and so we turned to Professor John Read of the University of Auckland, New Zealand, to give us an overview of current research and activity within the field.
Issue 4: Khaled Barkaoui and Melissa Bowles on Think Aloud Protocols
In Language Testing 28(1), 2011, Khaled Barkaoui has an article on the use of think-alouds to investigate rater processes and decisions as they rate essay samples. The focus is not on the raters, but on whether the research method is a useful tool for the purpose. In this podcast he explains his findings, and their importance. We are then joined by Melissa Bowles who has recently published The Think-Aloud Controversy in Second Language Research, to explain precisely what the problems and possibilities of think-alouds are in language testing research.
Language Testing 27(4), 2010, contains an article by Carol Chapelle and colleagues on testing productive grammatical ability. We thought this would be an excellent opportunity to look at what is going on in the field of assessing grammar, and what issues currently face the field. Jim Purpura agreed to talk to us on Language Testing Bytes.
Language Testing 27(3), 2010, is a special issue guest edited by Xiaoming Xi on the automated scoring of writing and speaking tests. In this podcast she talks about why the automated scoring of speaking and writing tests is such a hot topic, and explains the possibilities, limitations and current research issues in the field.
In Language Testing 27(2), 2010, Mike Kane contributed a response to an article on fairness in language testing. We thought this was an excellent opportunity to ask him about his approach to validation, and how he sees 'fairness' fitting into the picture.