Language Testing Bytes is a podcast to accompany the SAGE journal Language Testing. Three or four times per year, we will release a podcast in which we discuss topics related to a particular issue of the journal. This may be an interview with a contributor to the journal or with another expert in the field. You can download the podcast from this website or from ltj.sagepub.com, or you can subscribe to the podcast through iTunes.
News: SAGE have decided to continue supporting Language Testing Bytes into 2016, but it will become a twice-yearly production rather than the four issues per year at present. Issue 22 is the final one for 2015, and the podcast will be relaunched in 2016. Stay tuned to this page for further details.
This paper describes the process of designing, administering, and assessing a language-sensitive and culture-specific lexical test of Labrador Inuttitut (a dialect of Inuktitut, an Eskimo-Aleut language). This process presented numerous challenges, from choosing citation forms in a polysynthetic language to dealing with a lack of word frequency data. Twenty heritage receptive bilinguals (RBs) with very limited production skills in Inuttitut (their first language) and a comparison group of eight fluent bilinguals (FBs) participated in our study. Since the RBs lacked production skills in Inuttitut, the lexical test required participants to translate a carefully compiled list of Inuttitut nouns and verbs into English. The results revealed that RBs had good comprehension of basic vocabulary (85% accuracy), but differed significantly from FBs, mostly because the RBs had a number of partially accurate translations. The three lowest scoring RBs had the highest number of such translations as well as inaccurate translations based on phonological associations, as is common in emergent lexicons. This lexical test correlates with grammatical proficiency measures, pointing to its potential value as a quick placement and diagnostic test in revitalization programs for Inuttitut as well as other languages in a language loss situation.
The paper examines the results of the CEFR alignment project for the Slovenian national examinations in English. The authors aim to validate externally the standard-setting procedures by adopting a socio-cognitive model of validation (Khalifa & Weir, 2009; Weir, 2005) to analyse the scoring, context and cognitive validity of three reading subtests: the Slovenian B2 national examination and the international examinations FCE and CAE, aligned with B2 and C1 respectively. The relative comparability between the three subtests is determined by analysing the results of tests that have been administered to a group of 80 test-takers (expected CEFR level: B2). The placement of the test-takers also reveals to what extent the judgements of the Slovenian panellists about CEFR levels coincide with those reported for FCE and CAE. The study thus also explores whether the high degree of agreement between the judges on the alignment panel can be solely attributed to their adequate and precise understanding of CEFR descriptors – which is directly mirrored in their setting of the cut scores and relating the examination to relevant CEFR levels – or whether it can also be ascribed to their shared educational, national and cultural background. The answers to these questions are paramount because they reveal the descriptive adequacy of CEFR descriptors and because different interpretations of CEFR levels can significantly affect national testing policies and, consequently, language teaching and testing.
Investigating how visuals affect test takers' performance on video-based L2 listening tests has been the focus of many recent studies. While most existing research has been based on test scores and self-reported verbal data, few studies have examined test takers' viewing behavior (Ockey, 2007; Wagner, 2007, 2010a). To address this gap, in the present study I employ eye-tracking technology to record the eye movements of 33 test takers during the Video-based Academic Listening Test (VALT). Specifically, I aim to explore test takers' oculomotor engagement with two types of videos from the VALT, context videos and content videos, and the relationship between the test takers' viewing behavior and test performance. Three eye-tracking measures (fixation rate, dwell rate, and total dwell time) were compared across the context and content videos using paired-samples t-tests. Additionally, each measure was correlated with test scores for items associated with each video type. Results revealed statistically significant differences between fixation rates and between total dwell time values, but no difference between the dwell rates for context and content videos. No statistically significant relationship was found between the three eye-tracking measures and the test scores. Directions for future research on video-based L2 listening assessment are discussed.
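For readers who want to see the general shape of the analysis this abstract describes, here is a minimal sketch in Python of a paired-samples t-test on one eye-tracking measure and a correlation with item scores. All data, values and variable names below are hypothetical stand-ins, not the study's actual materials or code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 33  # number of test takers, matching the study's sample size

# Hypothetical per-participant fixation rates for each video type.
context_fix = rng.normal(3.2, 0.5, n)   # fixations per second on context videos
content_fix = rng.normal(2.8, 0.5, n)   # fixations per second on content videos

# Paired-samples t-test: the same participants watched both video types.
t, p = stats.ttest_rel(context_fix, content_fix)
print(f"t({n - 1}) = {t:.2f}, p = {p:.3f}")

# Correlate a viewing measure with scores on items tied to that video type.
context_scores = rng.integers(0, 12, n)  # hypothetical item scores
r, p_r = stats.pearsonr(context_fix, context_scores)
print(f"r = {r:.2f}, p = {p_r:.3f}")
```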
This study examines elicited imitation (EI) both as a measure of implicit grammatical knowledge and as a measure of more global semantic and syntactic knowledge. It also examines whether item length affects the difficulty of EI tests when they contain both grammatical and ungrammatical items. Fifty language learners took an EI test and an oral narrative task. The data were analyzed once for the accuracy of a single target structure, third person '-s', and then for global semantic and syntactic accuracy. Moderate correlations were recorded between the learners' third person accuracy scores on the oral narrative task and the grammatical and ungrammatical items of the EI test (r = 0.62 and r = 0.66, respectively). A moderate correlation was also found between the learners' global semantic and syntactic scores on the EI test and the oral narrative task (r = 0.51). Furthermore, unlike for the grammatical items, no significant correlation was found between item length and the learners' performance on the ungrammatical items. The findings underscore the need to validate EI against other tests of productive skills, especially when EI is used as a measure of global language proficiency. We also suggest exploring structure-specific factors that contribute to the difficulty of EI tests.
This study compared three common vocabulary test formats, the Yes/No test, the Vocabulary Knowledge Scale (VKS), and the Vocabulary Levels Test (VLT), as measures of vocabulary difficulty. Vocabulary difficulty was defined as the item difficulty estimated through Item Response Theory (IRT) analysis. The three tests were given to 165 Japanese students, resulting in five measures of vocabulary knowledge and four measures of word difficulty. Analyses included item and score factor analysis, tests of unidimensionality and local independence, and correlations. Results indicate that these are reliable tests. Tests of unidimensionality suggest they essentially measure one major latent trait, which can be interpreted as a factor for word knowledge. Strong correlations of the scores with each other provide evidence of concurrent validity and support interpreting the scores as indicative of word knowledge. Correlations with other methods of estimating word difficulty, such as transformed frequency, word length, or number of syllables, suggest that of these methods, the log of frequencies from very large corpora gives the best estimate of word difficulty. However, direct testing of vocabulary difficulty appears to, in the words of Kreuz (1987), "provide a better account of recognition latencies than do counts based on printed word frequency" (p. 159).
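As a rough illustration of how IRT-style item difficulties can be derived from response data, here is a minimal sketch of the classical log-odds starting estimate used in Rasch (one-parameter) calibration. The response matrix and all values are hypothetical; the study's actual IRT analysis would use full maximum-likelihood estimation rather than this shortcut.

```python
import numpy as np

def item_difficulties(responses: np.ndarray) -> np.ndarray:
    """Approximate Rasch (1PL) item difficulties from a 0/1 response matrix.

    responses has shape (n_examinees, n_items), with 1 = correct.
    Returns one logit difficulty per item: the log-odds of an incorrect
    response, centred so that mean difficulty is zero (the usual Rasch
    identification constraint). This is the classical starting estimate
    for Rasch calibration, not a full maximum-likelihood fit.
    """
    p_correct = responses.mean(axis=0)           # proportion correct per item
    p_correct = np.clip(p_correct, 0.01, 0.99)   # keep logits finite
    difficulty = np.log((1 - p_correct) / p_correct)
    return difficulty - difficulty.mean()        # centre on zero logits

# Hypothetical data: 165 examinees by 30 vocabulary items, with items
# ranging from easy (p ~ .9) to hard (p ~ .3).
rng = np.random.default_rng(0)
responses = (rng.random((165, 30)) < np.linspace(0.9, 0.3, 30)).astype(int)
print(item_difficulties(responses).round(2))
```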
Considering scoring validity as encompassing both reliable rating scale use and valid descriptor interpretation, this study reports on the validation of a CEFR-based scale that was co-constructed and used by novice raters. The research questions this paper seeks to answer are (a) whether it is possible to construct, with novice raters, a CEFR-based rating scale that yields reliable ratings, and (b) whether such a scale allows for a uniform interpretation of the descriptors. Additionally, this study focuses on the question of whether co-constructing a rating scale with novice raters helps to stimulate a shared interpretation of the descriptors over time. For this study, six novice raters employed a CEFR-based scale that they had co-constructed with 14 peers to rate 200 spoken and written performances in a missing data design. The quantitative data were analysed using item response theory, classical test theory and principal component analysis. The focus group data, collected after the rating process, were transcribed and coded using both a priori and inductive coding. The results indicate that novice raters can reliably use the CEFR-based rating scale, but that the interpretations of the descriptors, in spite of training and co-construction, are not as homogeneous as the inter-rater reliability would suggest.
Language Testing is an international peer-reviewed journal that publishes original research on language testing and assessment. Since 1984 it has featured high-impact papers covering theoretical issues, empirical studies, and reviews. The journal's wide scope encompasses first and second language testing and assessment of English and other languages, and the use of tests and assessments as research and evaluation tools. Many articles also contribute to methodological innovation and the practical improvement of testing and assessment internationally. In addition, the journal publishes submissions that deal with policy issues, including the use of language tests and assessments for high-stakes decision making in fields as diverse as education, employment and international mobility. The journal welcomes the submission of papers that deal with ethical and philosophical issues in language testing, as well as technical matters. Also of concern is research into the washback and impact of language test use, and ground-breaking uses of assessments for learning. Additionally, the journal wishes to publish replication studies that help to embed and extend our knowledge of generalisable findings in the field. Language Testing is committed to encouraging interdisciplinary research, and is keen to receive submissions which draw on theory and methodology from different fields of applied linguistics, as well as educational measurement and other relevant disciplines.
How to put the podcast onto your iPod
Decide which of the podcasts below you would like to listen to. Right-click on the link and select 'Save target as' to download it into a folder on your computer.
Open iTunes. Click on 'File' and then 'New Playlist'. Name your playlist 'Language Testing Bytes'.
Click on the playlist in the iTunes menu.
Open the folder in which you saved the podcast, then drag the podcast from the folder and drop it into the playlist.
Synchronize your iPod.
When you next access your iPod, go to the Language Testing Bytes playlist to play the podcast.
Alternatively, just pop it on whichever mp3 player you currently use, or subscribe to the SAGE Podcast on iTunes.
Issue 22: Eunice Jang on Diagnostic Language Testing.
Issue 32(2) of Language Testing is a special issue on the current state of Diagnostic Language Testing. While this has traditionally been a neglected use of language tests, there is currently a great surge of interest and research in the field. Eunice Jang from the University of Toronto joins me to discuss current thinking in testing for diagnostic purposes.
Issue 21: Hyejeong Kim and Cathie Elder on Aviation English
The assessment of aviation English has become something of an icon of high-stakes assessment in recent years. In Language Testing 32(2), we publish a paper by Hyejeong Kim and Cathie Elder, both from the University of Melbourne, which examines the construct of aviation English from the perspective of airline professionals in Korea.
Issue 20: Martin East on Assessment Reform in New Zealand
In this issue of the podcast, Martin East describes an assessment reform project in New Zealand. We're reminded very forcefully that when assessment and testing procedures within educational systems are changed, there are many complex factors to take into account. All stakeholders are going to take a view on the proposed reforms, and they aren't necessarily going to agree.
Issue 19: Fred Davidson and Cary Lin of the University of Illinois at Urbana-Champaign discuss the role of statistics in language testing.
The last issue of volume 31 contains a review of Rita Green's new book on statistics in language testing. We take the opportunity to talk about how things have changed in teaching statistics for students of language testing since Fred Davidson's The language tester's statistical toolbox was published in 2000.
Issue 18: Folkert Kuiken and Ineke Vedder from the University of Amsterdam discuss rater variability in the assessment of speaking and writing in a second language.
The third issue of the journal this year is a special issue on the scoring of performance tests. In this podcast the guest editors talk about some of the issues surrounding the rating of speaking and writing samples.
Issue 17: Ryo Nitta and Fumiyo Nakatsuhara on pre-task planning in paired speaking tests
The authors of our first paper in 31(2) are concerned with a very practical question. What is the effect of giving test-takers planning time prior to a paired-format speaking task? Does it affect what they say? Does it change the scores they get? The answers will inform the design of speaking tests not only in high stakes assessment contexts, but probably in classrooms as well.
Issue 16: Jodi Tommerdahl and Cynthia Kilpatrick on the reliability of morphological analyses in language samples
How large a language sample do we need in order to draw reliable conclusions about what we wish to assess? In issue 31(1) of Language Testing we are delighted to publish a paper by Jodi Tommerdahl and Cynthia Kilpatrick that addresses this important issue.
Issue 15: Stephen Bax on Eye-Tracking in Reading Tests
Issue 30(4) of the journal contains the first paper to use eye-tracking to investigate the cognitive processes of learners taking reading tests. Stephen Bax joins us to explain the methodology and what it can tell us about how successful readers go about processing items and texts in reading tests.
Issue 14: Ofra Inbar on Assessment Literacy
Issue 30(3) commemorates the 30th anniversary of the founding of the journal. We mark this milestone in the journal's history with a special issue on the topic of Assessment Literacy, guest edited by Ofra Inbar. A concern for the literacy needs of the wide range of stakeholders beyond the experts who use tests and test scores is a sign of a maturing profession. This issue takes the debate forward in new and exciting ways, some of which Ofra Inbar discusses on this podcast.
Issue 13: Paula Winke and Susan Gass on Rater Bias
Rater bias is something that language testers have known about for a long time, and have tried to control through training and the use of rating scales. But investigations into the source and nature of bias are relatively recent. In issue 30(2) of the journal Paula Winke, Susan Gass, and Carol Myford share their research in this field, and the first two authors, from Michigan State University, join us on Language Testing Bytes to discuss rater bias.
Issue 12: Alan Davies on Assessing Academic English
In 2008 Alan Davies' book Assessing Academic English was published by Cambridge University Press. In issue 30(1) of Language Testing it is reviewed by Christine Coombe. With a strong historical narrative, the book raises many of the enduring issues in assessing English for study in English medium institutions. In this podcast we explore some of these with Professor Davies.
Issue 11: Ana Pellicer-Sanchez and Norbert Schmitt on Yes-No Vocabulary Tests
In this issue of the podcast we return to vocabulary testing, after the great introduction provided by John Read in Issue 5. This time, we welcome Ana Pellicer-Sanchez and Norbert Schmitt to talk about the popular Yes-No Vocabulary Test. Their recent research looks at scoring issues and potential solutions to problems that have plagued the test for years. Their paper in issue 29(4) of the journal contains the details, but in the podcast we discuss the key issues for vocabulary assessment.
Issue 10: Kathryn Hill on Classroom Based Assessment
Classroom Based Assessment is an increasingly important topic in language education, and in issue 29(3) of Language Testing we publish a paper by Kathryn Hill and Tim McNamara entitled "Developing a comprehensive, empirically based research framework for classroom-based assessment". The research in this paper is based on the first author's PhD dissertation, and so we asked Kathryn Hill to join us on Language Testing Bytes to talk about developments in the field.
Issue 9: Luke Harding on Accent in Listening Assessment
Issue 29(2) of the journal contains a paper entitled "Accent, listening assessment and the potential for a shared-L1 advantage: A DIF perspective", by Luke Harding. In this podcast we explore why it is that most listening tests use a very narrow range of standard accents, rather than the many varieties that we are likely to encounter in real-world communication.
Issue 8: Tan Jin and Barley Mak on Confidence Scoring
In Issue 29(1) of the journal three authors from the Chinese University of Hong Kong have a paper on the application of fuzzy logic to scoring speaking tests. This is termed 'confidence scoring', and the first two authors join us on Language Testing Bytes to explain a little more about their novel approach.
Issue 7: Mark Wilson on Measurement Models
Mark Wilson delivered the Messick Memorial Lecture at the Language Testing Research Colloquium in Melbourne, 2006, on new developments in measurement models that take into account the complexity of language testing. In Language Testing 28(4) we publish the paper based on this lecture, and Mark joins us on Language Testing Bytes to talk about his work in this area.
Issue 6: Craig Deville and Micheline Chalhoub-Deville on Standards-Based Testing
Standards-Based Testing is highly controversial, both for its social and educational impact on schools and bilingual communities and for its technical aspects, which rely to a significant extent on expert judgment. In this podcast we discuss the issues surrounding Standards-Based Testing in the United States with the guest editors of issue 28(3), a special issue on this topic. The collection of papers that they have brought together, along with reviews of recent books on the topic and a test review, constitutes a state-of-the-art volume for the field.
Issue 5: John Read on Vocabulary Testing
The journal has seen a flurry of articles on vocabulary testing in recent months, and issue 28(2) is no exception, with Marta Fairclough's paper on the lexical recognition task. It seemed like an appropriate moment to consider why vocabulary is receiving so much attention, and so we turned to Professor John Read of the University of Auckland, New Zealand, to give us an overview of current research and activity within the field.
Issue 4: Khaled Barkaoui and Melissa Bowles on Think Aloud Protocols
In Language Testing 28(1), 2011, Khaled Barkaoui has an article on the use of think-alouds to investigate rater processes and decisions as they rate essay samples. The focus is not on the raters, but on whether the research method is a useful tool for the purpose. In this podcast he explains his findings, and their importance. We are then joined by Melissa Bowles who has recently published The Think-Aloud Controversy in Second Language Research, to explain precisely what the problems and possibilities of think-alouds are in language testing research.
Issue 3: Jim Purpura on Assessing Grammar
Language Testing 27(4), 2010, contains an article by Carol Chapelle and colleagues on testing productive grammatical ability. We thought this would be an excellent opportunity to look at what is going on in the field of assessing grammar, and what issues currently face the field. Jim Purpura agreed to talk to us on Language Testing Bytes.
Issue 2: Xiaoming Xi on Automated Scoring
Language Testing 27(3), 2010, is a special issue guest edited by Xiaoming Xi on the automated scoring of writing and speaking tests. In this podcast she talks about why the automated scoring of speaking and writing tests is such a hot topic, and explains the possibilities, limitations and current research issues in the field.
Issue 1: Mike Kane on Validity and Fairness
In Language Testing 27(2), 2010, Mike Kane contributed a response to an article on fairness in language testing. We thought this was an excellent opportunity to ask him about his approach to validation, and how he sees 'fairness' fitting into the picture.