Language Testing Bytes is a podcast to accompany the SAGE journal Language Testing. Three or four times per year, we will release a podcast in which we discuss topics related to a particular issue of the journal. This may be an interview with a contributor to the journal or with another expert in the field. You can download the podcast from this website, from ltj.sagepub.com, or you can subscribe to the podcast through iTunes.
Coming Soon: The next podcast will be released in April 2014, and will feature Ryo Nitta and Fumiyo Nakatsuhara on the provision of pre-task planning in paired speaking tests.
It is currently unclear to what extent a spontaneous language sample of a given number of utterances is representative of a child’s ability in morphology and syntax. This lack of information about the regularity of children’s linguistic productions and the reliability of spontaneous language samples has serious implications for language testing based upon natural language. This study investigates the reliability of children’s spontaneous language samples by using a test-retest procedure to examine repeated samples of various lengths (50, 100, 150, and 200 utterances) with regard to morpheme production in 23 typically developing children aged 2;6 to 3;6. Analyses indicate that of the five morphosyntactic categories studied, only one (the contracted auxiliary) achieves an ICC for absolute agreement above .6 using 100 utterances, while most others (past tense, third-person singular, and the uncontracted ‘be’ in auxiliary form) fail to reach a correlation above .52 even when samples of 200 utterances are compared. The study indicates that (1) 200-utterance samples did not provide a significantly greater degree of reliability than 100-utterance samples; (2) several structures that children were able to produce did not show up in a 200-utterance sample; and (3) earlier acquired morphemes were not used more reliably than more recently acquired items. The notion of reliability and its importance in the area of spontaneous language samples and language testing are also discussed.
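For readers unfamiliar with the reliability statistic reported in this abstract, the sketch below shows one common way to compute an intraclass correlation for absolute agreement, ICC(2,1), for a test-retest design in which each child contributes a score on two sampling occasions. It is a minimal illustration with invented numbers, not the authors' data or code.

```python
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    scores: (n_children, k_occasions) array, e.g. a morpheme-production
    rate for each child at each of two sampling occasions.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand_mean = scores.mean()
    row_means = scores.mean(axis=1)   # per-child means
    col_means = scores.mean(axis=0)   # per-occasion means

    ss_rows = k * ((row_means - grand_mean) ** 2).sum()
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()
    ss_error = ((scores - grand_mean) ** 2).sum() - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical percent-correct scores for five children on two occasions.
example = [[80, 78], [55, 60], [92, 90], [40, 52], [70, 66]]
print(round(icc_2_1(example), 2))
```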
The Katzenberger Hebrew Language Assessment for Preschool Children (henceforth: the KHLA) is the first comprehensive, standardized language assessment tool developed in Hebrew specifically for older preschoolers (4;0–5;11 years). The KHLA is a norm-referenced, Hebrew-specific assessment based on well-established psycholinguistic principles, as well as on established knowledge in the field of normal language development in the preschool years. The main goal of the study is to evaluate the KHLA as a tool for identifying language-impaired Hebrew-speaking preschoolers and to find out whether the test distinguishes between typically developing (TDL) and language-impaired children. The aim of applying the KHLA is to characterize the language skills of Hebrew-speaking children with specific language impairment (SLI). The tasks included in the assessment cover areas of language skill that the literature considers sensitive and appropriate for assessing children with SLI. Participants included 454 (383 TDL and 71 SLI) mid-to-high SES, monolingual native speakers of Hebrew, aged 4;0–5;11 years. The assessment included six subtests (with a total of 171 items): Auditory Processing, Lexicon, Grammar, Phonological Awareness, Semantic Categorization, and Narration of Picture Series. The study focuses on the psychometric properties of the test. The KHLA was found useful for distinguishing between TDL and SLI children when identification is based on the total Z-score, or on at least two of the subtest-specific Z-scores, falling below a –1.25 SD cutoff point. The results provide a ranking order for assessment: Grammar, Auditory Processing, Semantic Categorization, Narration of Picture Series/Lexicon, and Phonological Awareness. The main clinical implications of this study are to consider the optimal cutoff point of –1.25 SD for the diagnosis of children with SLI and to administer the entire test for assessment. In cases where a clinician decides to administer only two or three subtests, it is recommended that the ranking order described in the study be applied.
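The decision rule described in this abstract can be written out compactly. The sketch below is a hypothetical illustration of the –1.25 SD cutoff logic only; the flag_sli helper and the example scores are invented for illustration and do not reproduce the KHLA norms or scoring procedure.

```python
CUTOFF = -1.25  # SD below the age-group mean

def flag_sli(total_z, subtest_z, cutoff=CUTOFF):
    """Flag a child for possible SLI if the total Z-score falls below the
    cutoff, or if at least two subtest Z-scores do."""
    low_subtests = [name for name, z in subtest_z.items() if z < cutoff]
    return total_z < cutoff or len(low_subtests) >= 2

# Invented example: total Z of -1.0, but Grammar and Lexicon both below -1.25.
child = {"Grammar": -1.6, "Lexicon": -1.4, "Auditory Processing": -0.8,
         "Phonological Awareness": -0.3, "Semantic Categorization": -0.9,
         "Narration of Picture Series": -1.0}
print(flag_sli(total_z=-1.0, subtest_z=child))  # True
```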
Testlets are subsets of test items that are based on the same stimulus and are administered together. Tests that contain testlets are in widespread use in language testing, but they also share a fundamental problem: items within a testlet are locally dependent, with possibly adverse consequences for test score interpretation and use. Building on testlet response theory (Wainer, Bradlow, & Wang, 2007), the listening section of the Test of German as a Foreign Language (TestDaF) was analyzed to determine whether, and to what extent, testlet effects were present. Three listening passages (i.e., three testlets) with 8, 10, and 7 items, respectively, were analyzed using a two-parameter logistic testlet response model. The data came from two live exams administered in April 2010 (N = 2859) and November 2010 (N = 2214). Results indicated moderate effects for one testlet and small effects for the other two. Compared to a standard IRT analysis, neglecting these testlet effects led to an overestimation of test reliability and an underestimation of the standard error of ability estimates. Item difficulty and item discrimination estimates remained largely unaffected. Implications for the analysis and evaluation of testlet-based tests are discussed.
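A two-parameter logistic testlet response model extends the usual 2PL by adding a person-specific effect for the testlet containing each item. The sketch below, with invented parameter values, is meant only to illustrate the form of the response function, not to reproduce the TestDaF analysis or its estimation procedure.

```python
import math

def p_correct(theta, a, b, gamma):
    """2PL testlet response model: probability of a correct response.

    theta: person ability
    a, b : item discrimination and difficulty
    gamma: person-specific effect for the testlet containing the item;
           its variance across persons indexes the size of the testlet effect.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b - gamma)))

# Invented values: same person and item, without and with a testlet effect.
print(round(p_correct(theta=0.5, a=1.2, b=0.0, gamma=0.0), 3))
print(round(p_correct(theta=0.5, a=1.2, b=0.0, gamma=0.4), 3))
```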
Newer statistical procedures are typically introduced to address the limitations of those already in use or to deal with emerging research needs. Quantile regression (QR) is introduced in this paper as a relatively new methodology intended to overcome some of the limitations of least squares mean regression (LMR). QR is more appropriate when assumptions of normality and homoscedasticity are violated. QR has also been recommended as a good alternative when the research literature suggests that explorations of the relationship between variables need to move from a focus on average performance, that is, the central tendency, to various locations along the entire distribution. Although QR has long been used in other fields, it has only recently gained popularity in educational statistics. For example, in the ongoing push for accountability and the need to document student improvement, the calculation of student growth percentiles (SGP) uses QR to document the amount of growth a student has made. Despite its proven advantages and utility, QR has not been taken up in areas such as language testing research. This paper seeks to introduce the field to basic QR concepts, procedures, and interpretations. Researchers familiar with LMR will find the comparisons made between the two methodologies helpful for anchoring the new information. Finally, an application with real data is used to demonstrate the various analyses (the code is also appended) and to explicate the interpretation of results.
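For readers who would like to see quantile regression in action before turning to the paper, the sketch below fits median and upper-quantile regressions with statsmodels on simulated heteroscedastic data. The variable names and data are invented; the code appended to the article itself should be consulted for the authors' actual analyses.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: the spread of 'score' grows with 'hours' (heteroscedasticity),
# the kind of situation in which QR is preferable to least squares regression.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 40, 500)
score = 50 + 0.8 * hours + rng.normal(0, 1 + 0.3 * hours)
df = pd.DataFrame({"score": score, "hours": hours})

# Fit the conditional median and the 90th percentile; the slopes differ when
# the effect of 'hours' is not uniform across the score distribution.
for q in (0.5, 0.9):
    fit = smf.quantreg("score ~ hours", df).fit(q=q)
    print(f"q = {q}: slope = {fit.params['hours']:.2f}")

# Ordinary least squares gives only the conditional-mean slope, for comparison.
print("OLS slope:", round(smf.ols("score ~ hours", df).fit().params["hours"], 2))
```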
In this study, differential item functioning (DIF) trends were examined for English language learners (ELLs) versus non-ELL students in third and tenth grades on a large-scale reading assessment. To facilitate the analyses, a meta-analytic DIF technique was employed. The results revealed that items requiring knowledge of words and phrases in context favored non-ELLs in grade 3, whereas items requiring evaluation skills favored ELLs in grade 10. However, inconsistent patterns were found across gender and ethnicity. Educational implications are discussed.
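The study above uses a meta-analytic DIF technique, which is not reproduced here. As a hedged illustration of what a DIF analysis quantifies, the sketch below computes a widely used index, the Mantel-Haenszel odds ratio with its ETS delta transformation, from invented counts; it is not the authors' method and the numbers are hypothetical.

```python
import numpy as np

def mantel_haenszel_dif(table):
    """Mantel-Haenszel DIF index for one item.

    table: array of shape (n_strata, 2, 2); for each ability stratum,
    rows are (reference group, focal group) and columns are
    (correct, incorrect) counts.
    Returns the MH odds ratio and the ETS delta value (MH D-DIF);
    negative delta values indicate the item favours the reference group.
    """
    table = np.asarray(table, dtype=float)
    n_k = table.sum(axis=(1, 2))                          # stratum sizes
    num = (table[:, 0, 0] * table[:, 1, 1] / n_k).sum()   # ref correct * focal incorrect
    den = (table[:, 0, 1] * table[:, 1, 0] / n_k).sum()   # ref incorrect * focal correct
    odds_ratio = num / den
    delta = -2.35 * np.log(odds_ratio)
    return odds_ratio, delta

# Invented counts for three ability strata (total-score bands):
# [[ref_correct, ref_incorrect], [focal_correct, focal_incorrect]]
example = [
    [[40, 20], [25, 35]],
    [[60, 15], [45, 30]],
    [[80,  5], [70, 15]],
]
print(mantel_haenszel_dif(example))
```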
This study investigated the relationship between latent components of academic English language ability and test takers’ study-abroad and classroom learning experiences through a structural equation modeling approach in the context of TOEFL iBT® testing. Data from the TOEFL iBT public dataset were used. The results showed that test takers’ performance on the test’s four skill sections, namely listening, reading, writing, and speaking, could be accounted for by two correlated latent components: the ability to listen, read, and write, and the ability to speak English. This two-factor model held equivalently across two groups of test takers, one that had been exposed to an English-speaking environment and one without such experience. Imposing a mean structure on the factor model showed that the groups did not differ on the factor means. The relationship between learning contexts and the latent ability components was further examined in structural regression models. The results of this study suggested an alternative characterization of the ability construct of the TOEFL test-taking population and supported the comparability of the language ability developed by the home-country and the study-abroad groups. The results also shed light on the impact of studying abroad and of home-country learning on language ability development.
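To make "two correlated latent components" concrete, the sketch below simulates section scores from a hypothetical two-factor model of the kind described above. The loadings, factor correlation, and sample size are invented purely for illustration and do not reproduce the TOEFL iBT analysis or its fitted estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Two correlated latent components: F1 = listen/read/write, F2 = speak.
factor_corr = 0.7
cov = np.array([[1.0, factor_corr], [factor_corr, 1.0]])
f1, f2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# Invented loadings: the first three sections load on F1, speaking on F2.
loadings = {"listening": (0.80, f1), "reading": (0.85, f1),
            "writing": (0.75, f1), "speaking": (0.90, f2)}
scores = {sec: lam * f + rng.normal(0, np.sqrt(1 - lam ** 2), n)
          for sec, (lam, f) in loadings.items()}

# The implied section-score correlations mirror the factor structure:
# listening-reading correlations exceed listening-speaking ones.
obs = np.corrcoef([scores[s] for s in ("listening", "reading", "speaking")])
print(np.round(obs, 2))
```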
Language Testing is an international peer reviewed journal that publishes original research on language testing and assessment. Since 1984 it has featured high impact papers covering theoretical issues, empirical studies, and reviews. The journal's wide scope encompasses first and second language testing and assessment of English and other languages, and the use of tests and assessments as research and evaluation tools. Many articles also contribute to methodological innovation and the practical improvement of testing and assessment internationally. In addition, the journal publishes submissions that deal with policy issues, including the use of language tests and assessments for high stakes decision making in fields as diverse as education, employment and international mobility. The journal welcomes the submission of papers that deal with ethical and philosophical issues in language testing, as well as technical matters. Also of concern is research into the washback and impact of language test use, and ground-breaking uses of assessments for learning. Additionally, the journal wishes to publish replication studies that help to embed and extend our knowledge of generalisable findings in the field. Language Testing is committed to encouraging interdisciplinary research, and is keen to receive submissions which draw on theory and methodology from different fields of applied linguistics, as well as educational measurement, and other relevant disciplines.
How to put the podcast onto your iPod
Decide which of the podcasts below you would like to listen to. Right click on the link, and select 'save target as' to download it into a folder on your computer.
Open iTunes. Click on 'file' and then 'new playlist'. Name your playlist 'Language Testing Bytes'.
Click on the playlist from the iTunes menu.
Open the folder in which you saved the podcast, then drag the podcast from the folder and drop it into the playlist.
Synchronize your iPod.
When you next access your iPod go to the Language Testing Bytes playlist to play the podcast.
Alternatively, just pop it on whichever mp3 player you currently use, or subscribe to the SAGE Podcast on iTunes.
Issue 16: Jodi Tommerdahl and Cynthia Kilpatrick on the reliability of morphological analyses in language samples
How large a language sample do we need in order to draw reliable conclusions about what we wish to assess? In issue 31(1) of Language Testing we are delighted to publish a paper by Jodi Tommerdahl and Cynthia Kilpatrick that addresses this important issue.
Issue 30(4) of the journal contains the first paper to use eye-tracking to investigate the cognitive processes of learners taking reading tests. Stephen Bax joins us to explain the methodology and what it can tell us about how successful readers go about processing items and texts in reading tests.
Issue 30(3) commemorates the 30th anniversary of the founding of the journal. We mark this milestone in the journal's history with a special issue on the topic of Assessment Literacy, guest edited by Ofra Inbar. A concern for the assessment literacy needs of the many stakeholders who use tests and test scores, not just the experts, is a sign of a maturing profession. This issue takes the debate forward in new and exciting ways, some of which Ofra Inbar discusses on this podcast.
Issue 13: Paula Winke and Susan Gass on Rater Bias
Rater bias is something that language testers have known about for a long time, and have tried to control through training and the use of rating scales. But investigations into the source and nature of bias are relatively recent. In issue 30(2) of the journal Paula Winke, Susan Gass, and Carol Myford share their research in this field, and the first two authors, from Michigan State University, join us on Language Testing Bytes to discuss rater bias.
Issue 12: Alan Davies on Assessing Academic English
In 2008 Alan Davies' book Assessing Academic English was published by Cambridge University Press. In issue 30(1) of Language Testing it is reviewed by Christine Coombe. With a strong historical narrative, the book raises many of the enduring issues in assessing English for study in English medium institutions. In this podcast we explore some of these with Professor Davies.
Issue 11: Ana Pellicer-Sanchez and Norbert Schmitt on Yes-No Vocabulary Tests
In this issue of the podcast we return to vocabulary testing, after the great introduction provided by John Read in Issue 5. This time, we welcome Ana Pellicer-Sanchez and Norbert Schmitt to talk about the popular Yes-No Vocabulary Test. Their recent research looks at scoring issues and potential solutions to problems that have plagued the test for years. Their paper in issue 29(4) of the journal contains the details, but in the podcast we discuss the key issues for vocabulary assessment.
Issue 10: Kathryn Hill on Classroom Based Assessment
Classroom Based Assessment is an increasingly important topic in language education, and in issue 29(3) of Language Testing we publish a paper by Kathryn Hill and Tim McNamara entitled "Developing a comprehensive, empirically based research framework for classroom-based assessment". The research in this paper is based on the first author's PhD dissertation, and so we asked Kathryn Hill to join us on Language Testing Bytes to talk about developments in the field.
Issue 9: Luke Harding on Accent in Listening Assessment
Issue 29(2) of the journal contains a paper entitled "Accent, listening assessment and the potential for a shared-L1 advantage: A DIF perspective", by Luke Harding. In this podcast we explore why it is that most listening tests use a very narrow range of standard accents, rather than the many varieties that we are likely to encounter in real-world communication.
Issue 8: Tan Jin and Barley Mak on Confidence Scoring
In Issue 29(1) of the journal three authors from the Chinese University of Hong Kong have a paper on the application of fuzzy logic to scoring speaking tests. This is termed 'confidence scoring', and the first two authors join us on Language Testing Bytes to explain a little more about their novel approach.
Mark Wilson delivered the Messick Memorial Lecture at the Language Testing Research Colloquium in Melbourne, 2006, on new developments in measurement models to take into account the complexity of language testing. In Language Testing 28(4) we publish the paper based on this lecture, and Mark joins us on Language Testing Bytes to talk about his work in this area.
Issue 6: Craig Deville and Micheline Chalhoub-Deville on Standards-Based Testing
Standards-Based Testing is highly controversial, both for its social and educational impact on schools and bilingual communities and for its technical aspects, which rely to a significant extent on expert judgment. In issue 28(3) we discuss the issues surrounding Standards-Based Testing in the United States with the guest editors of a special issue on this topic. The collection of papers that they have brought together, along with reviews of recent books on the topic and a test review, constitutes a state-of-the-art volume for the field.
The journal has seen a flurry of articles on vocabulary testing in recent months, and issue 28(2) is no exception, with Marta Fairclough's paper on the lexical recognition task. It seemed like an appropriate moment to consider why vocabulary is receiving so much attention, and so we turned to Professor John Read of the University of Auckland, New Zealand, to give us an overview of current research and activity within the field.
Issue 4: Khaled Barkaoui and Melissa Bowles on Think Aloud Protocols
In Language Testing 28(1), 2011, Khaled Barkaoui has an article on the use of think-alouds to investigate rater processes and decisions as they rate essay samples. The focus is not on the raters, but on whether the research method is a useful tool for the purpose. In this podcast he explains his findings, and their importance. We are then joined by Melissa Bowles who has recently published The Think-Aloud Controversy in Second Language Research, to explain precisely what the problems and possibilities of think-alouds are in language testing research.
Language Testing 27(4), 2010, contains an article by Carol Chapelle and colleagues on testing productive grammatical ability. We thought this would be an excellent opportunity to look at what is going on in the field of assessing grammar, and what issues currently face the field. Jim Purpura agreed to talk to us on Language Testing Bytes.
Language Testing 27(3), 2010, is a special issue guest edited by Xiaoming Xi on the automated scoring of writing and speaking tests. In this podcast she talks about why the automated scoring of speaking and writing tests is such a hot topic, and explains the possibilities, limitations and current research issues in the field.
In Language Testing 27(2), 2010, Mike Kane contributed a response to an article on fairness in language testing. We thought this was an excellent opportunity to ask him about his approach to validation, and how he sees 'fairness' fitting into the picture.