Recent Content of Language Testing
This site designed and maintained by Dr Glenn Fulcher, @languagetesting.info

Titles and Abstracts
     
 
Language Testing

  • How do we go about investigating test fairness?
    by Xiaoming Xi

    Previous test fairness frameworks have greatly expanded the scope of fairness, but do not provide a means to fully integrate fairness investigations and set priorities. This article proposes an approach to guide practitioners on fairness research and practices. This approach treats fairness as an aspect of validity and conceptualizes it as comparable validity for all relevant groups. Anything that weakens fairness compromises the validity of a test. This conceptualization expands the scope and enriches the interpretations of fairness by drawing on well-defined validity theories while enhancing the meaning of validity by integrating fairness in a principled way. TOEFL® iBT™ is then used to illustrate how a fairness argument may be established and supported in a validity argument. The fairness argument consists of a series of rebuttals to the validity argument that would compromise the comparability of score-based interpretations and uses for relevant groups, and it provides a logical mechanism for identifying critical research areas and setting research priorities. This approach will hopefully inspire more investigations motivated by and built on a central fairness argument. It may also foster a deeper understanding and expanded explorations of actions based on test results and social consequences, as impartiality and justice of actions and comparability of test consequences are at the core of fairness.



  • Test fairness: a response
    by Davies, A.

  • Validity and fairness
    by Kane, M.

  • Test fairness and Toulmin's argument structure
    by Kunnan, A. J.

  • Empiricism versus connoisseurship: Establishing the appropriacy of texts in t...
    by Green, A., Unaldi, A., Weir, C.

    Providers of tests of languages for academic purposes generally claim to provide evidence on the extent to which students are likely to be able to cope with the future demands of reading in specified real-life contexts. Such claims need to be supported by evidence that the texts employed in the test reflect salient features of the texts the test takers will encounter in the target situation as well as demonstrating the comparability of the cognitive processing demands of accompanying test tasks with target reading activities. This paper will focus on the issue of text comparability. For reasons of practicality, evidence relating to text characteristics is generally based on the expert judgement of individual test writers, arrived at through a holistic interpretation of test specifications. However, advances in automated textual analysis and a better understanding of the value of pooled qualitative judgement have now made it feasible to provide more quantitative approaches focusing analytically on a wide range of individual characteristics. This paper will employ these techniques to explore the comparability of texts used in a test of academic reading comprehension and key texts used by first-year undergraduates at a British university. It offers a principled means for test providers and test users to evaluate this aspect of test validity.



  • An investigation of four writing traits and two tasks across two languages
    by Bae, J., Bachman, L. F.

    This study investigated the validity of four theoretically motivated traits of writing ability across English and Korean, based on elementary school students’ responses to letter- and story-writing tasks. Their responses were scored analytically and analyzed using confirmatory factor analysis. The findings include the following. A model of writing ability that includes the influence of four primary trait factors (grammar, content, spelling, and text length), which are influenced by a higher-order trait factor, and of the effect of test methods (letter- and story-writing tasks) provides a reasonable explanation for differences in writing performance among students. The trait effects are central while the method effects are peripheral and inconsistent. A higher-order factor explains the correlations among the four primary factors, whose uniqueness is retained even while removing the effect of the general factor. This research expands our understanding of writing performance in the following unique aspects: models of writing with the four traits and the two tasks across two languages, largely unverified aspects of writing in prior factorial studies; a CFA-correlated-uniqueness approach for trait-method investigation.



  • A multi-method analysis of evaluation criteria used to assess the speaking ...
    by Plough, I. C., Briggs, S. L., Van Bonn, S.

    The study reported here examined the evaluation criteria used to assess the proficiency and effectiveness of the language produced in an oral performance test of English conducted in an American university context. Empirical methods were used to analyze qualitatively and quantitatively transcriptions of the Oral English Tests (OET) of 44 prospective Graduate Student Instructors (GSI). The language required to complete the tasks on the test was conceptualized from the functional perspective of transactional and interactional language use as defined by Brown and Yule (1989). Listening comprehension and pronunciation were also analyzed and scored. Stepwise logistic regression was used to determine the extent to which these linguistic features contributed to final ratings. These quantitative findings were then compared to ‘real-time’ written comments made by evaluators during the tests. Intuitive methods were then used to further explore those features of candidate performance attended to by evaluators: interviews were conducted with experienced evaluators to determine the features they judged necessary for communicating effectively in instructional settings. Results indicate that the three data sources converge on two main features — pronunciation and listening comprehension — that are important in describing and evaluating the proficiency of prospective GSIs.



  • Investigating the decision-making process of standard setting participants
    by Papageorgiou, S.

    Despite the growing interest of the language testing community in standard setting, primarily due to the use of the Common European Framework of Reference (CEFR-Council of Europe, 2001), the participants’ decision-making process in the CEFR standard setting context remains unexplored. This study attempts to fill in this gap by analyzing these participants’ group discussions during a CEFR standard setting research project. Using an inductively and deductively-built analytical framework, it was found that decision-making was affected by factors that were irrelevant to the judgment task and that setting CEFR cut scores was not without problems for the participants. Given that examination results are nowadays reported with specific reference to the language ability levels presented in the CEFR, these results have implications for examination providers and score users.



  • Book Review: McNamara, T. and Roever, C. Language testing: The social dimensi...
    by Douglas, D.

  • Book Review: Chapelle, C. A., Enright, M. K. and Jamieson, J. M. (Eds) Buildi...
    by Cumming, A.

  • The effects of self-assessment among young learners of English
    by Goto Butler, Y., Lee, J.

    This study examined the effectiveness of self-assessment among 254 young learners of English as a foreign language. This study looked at 6th grade students in South Korea, who were asked to perform self-assessments on a regular basis for a semester during their English classes. The students improved their ability to self-assess their performance over time. A series of quantitative analyses found some positive effects of self-assessment on the students’ English performance as well as their confidence in learning English, though the effect sizes were rather small. The study also found that teachers and students perceived the effectiveness of self-assessment differently depending on their teaching/learning contexts. Individual teachers’ views towards assessment also influenced their perceived effectiveness in implementing the new self-assessment practice. A number of interesting insights were discovered through interviews with teachers regarding how best to implement self-assessment as part of foreign language instruction in contexts where teacher-centered teaching and measurement-driven assessment have been traditionally valued.



  • Washback of an oral assessment system in the EFL classroom
    by Munoz, A. P., Alvarez, M. E.

    This article reports the results of a research study to determine the washback effect of an oral assessment system on some areas of the teaching and learning of English as a Foreign Language (EFL). The research combined quantitative and qualitative research methods within a comparative study between an experimental group and a comparison group. Fourteen EFL teachers and 110 college students participated in the study. Data were collected by means of teacher and student surveys, class observations, and external evaluations of students’ oral performance. The data were analyzed using descriptive statistics for qualitative information and inferential statistics to compare the mean scores of the two groups by one-way ANOVA. Results showed positive washback in some of the areas examined. The implications for the classroom are that constant guidance and support over time are essential in order to help teachers use the system appropriately and therefore create positive washback.



  • A survey of aviation English tests
    by Alderson, J. C.

    The Lancaster Language Testing Research Group was commissioned in 2006 by the European Organisation for the Safety of Air Navigation (Eurocontrol) to conduct a validation study of the development of a test called ELPAC (English Language Proficiency for Aeronautical Communication), intended to assess the language proficiency of air traffic controllers. As part of that study, Internet searches for other tests of air traffic control identified a number of tests but found very little evidence available to attest to the quality of these tests. Therefore, it was decided to conduct an independent survey of tests of aviation English, since the consequences of inadequate language tests being used in licensing pilots, air traffic controllers and other aviation personnel are potentially very serious. A questionnaire was developed, based on the Guidelines for Good Practice of the European Association for Language Testing and Assessment (EALTA, 2006), and sent to numerous organizations whose tests were thought to be used for licensure of pilots and air traffic controllers. Twenty-two responses were received, which varied considerably in quantity and quality. This probably reflects a variation in the quality of the tests, in the availability of evidence to support claims of quality, and in low awareness of appropriate procedures for test development, maintenance and validation.

    We conclude that we can have little confidence in the meaningfulness, reliability, and validity of several of the aviation language tests currently available for licensure. We therefore recommend that the quality of language tests used in aviation be monitored to ensure they follow accepted professional standards for language tests and assessment procedures.



  • Aspects of performance on line graph description tasks: influenced by graph f...
    by Xiaoming Xi

    Motivated by cognitive theories of graph comprehension, this study systematically manipulated characteristics of a line graph description task in a speaking test in ways to mitigate the influence of graph familiarity, a potential source of construct-irrelevant variance. It extends Xi (2005), which found that the differences in holistic scores on graph tasks with varying characteristics, although significant, were small. Using an analytic scoring approach, it re-examined how visual chunks in line graphs and planning time influenced some specific components of examinees’ performance on line graph description tasks. The analytic dimensions examined were determined based on results of previous studies and hypotheses about the relationships among visual chunks, planning, graph familiarity, and features of elicited discourse. It was found that participants less familiar with graphs described the line graphs in a less organized fashion and that their descriptions were weaker in content. Graph familiarity thus introduced construct-irrelevant variance. However, providing planning time and using less complex graphical displays improved the fluency, organization and content of the elicited oral discourse and helped lessen the influence of graph familiarity, thus enhancing the validity of this task. The theoretical and practical implications of the findings are discussed.



  • A Rasch-based validation of the Vocabulary Size Test
    by Beglar, D.

    The primary purpose of this study was to provide preliminary validity evidence for a 140-item form of the Vocabulary Size Test, which is designed to measure written receptive knowledge of the first 14,000 words of English. Nineteen native speakers of English and 178 native speakers of Japanese participated in the study. Analyses based on the Rasch model were focused on several aspects of Messick’s validation framework. The findings indicated that (1) the items and examinees generally performed as predicted by a priori hypotheses, (2) the overwhelming majority of the items displayed good fit to the Rasch model, (3) the items displayed a high degree of unidimensionality with the Rasch model accounting for 85.6% of the variance, (4) the items showed a strong degree of measurement invariance with disattenuated Pearson correlations for person measures estimated with different sets of items of 0.91 and 0.96, and (5) various combinations of items provided precise measurement for this sample of examinees as indicated by Rasch reliability indices >0.96. The Vocabulary Size Test provides teachers and researchers with a new instrument that greatly extends the range of measurement provided by other measures of written receptive vocabulary size.