Online First papers to appear in Language Testing
This site designed and maintained by
Dr Glenn Fulcher


Titles and Abstracts
     
 
Language Testing

  • The Word Associates Format: Validation Evidence
    by Schmitt, N., Ng, J. W. C., Garras, J.
    Posted on 2 Aug, 2010

    Although the Word Associates Format (WAF) is becoming more frequently used as a depth-of-knowledge measure, relatively little validation has been carried out on it. This report of two validation studies tackles various important WAF issues yet to be satisfactorily resolved.

    Study 1 conducted introspective interviews regarding students’ WAF test-taking behavior along with interviews on featured target words to determine how accurately the most common scoring system for the WAF reflects the examinees’ actual knowledge of the words. Analysis is provided concerning WAF accuracy and item answering strategies and patterns.

    Study 2 repeated the interview procedures from Study 1 with several modifications, including the addition of a receptive dimension in the word knowledge interview. The various WAF-scoring methods were compared, and the format types (6 and 8 option), distractor types, and distribution of answers were examined in depth.

    Both studies indicate that the WAF reflects true lexical knowledge fairly well at the extremes of the scoring scale while scores in the middle do not lead to any reliable interpretation. Furthermore, there is the likelihood that the WAF may both underestimate and overestimate vocabulary knowledge. Suggestions regarding item construction and use of the WAF are given to improve its accuracy and reliability.



  • Towards a computer-delivered test of productive grammatical ability
    by Chapelle, C. A., Chung, Y.-R., Hegelheimer, V., Pendar, N., Xu, J.
    Posted on 8 Jul, 2010

    This study piloted test items that will be used in a computer-delivered and scored test of productive grammatical ability in English as a second language (ESL). Findings from research on learners’ development of morphosyntactic, syntactic, and functional knowledge were synthesized to create a framework of grammatical features. We outline the interpretive argument and present results from four pilot test administrations in terms of (a) reliability, (b) relationships between item difficulties and developmental stages, (c) correlations with other English tests, and (d) predictability of test scores in relation to proficiency levels. The results support the potential of assessing productive ESL grammatical ability by targeting areas identified in SLA research, and the plausibility of moving forward with computer delivery and scoring.



  • Use of tree-based regression in the analyses of L2 reading test items
    by Gao, L., Rogers, W. T.
    Posted on 8 Jul, 2010

    The purpose of this study was to explore whether the results of Tree-Based Regression (TBR) analyses, informed by a validated cognitive model, would enhance the interpretation of item difficulties in terms of the cognitive processes involved in answering the reading items included in two forms of the Michigan English Language Assessment Battery (MELAB). A cognitive model was first generated to explain the performance of the MELAB reading items, and then validated by expert judgment and student verbal protocols. Next, the validated model was used in the TBR analyses to obtain the final trees for each form. Finally, the cognitive processes (i.e., reading processes and testing strategies) measured by each item were traced back for each item in the terminal nodes of each tree. The results revealed that TBR, informed by a supportable cognitive theory, appears to be a promising addition to statistical item analysis that can be effectively used to enhance the interpretation of item analysis results.



  • Explaining ESL essay holistic scores: A multilevel modeling approach
    by Barkaoui, K.
    Posted on 2 Jul, 2010

    This study adopted a multilevel modeling (MLM) approach to examine the contribution of rater and essay factors to variability in ESL essay holistic scores. Previous research aiming to explain variability in essay holistic scores has focused on either rater or essay factors. The few studies that have examined the contribution of more than one factor to variability in essay scores relied on analytic techniques that do not reflect the nested structure of essay ratings. One goal for this article is to illustrate the use and potential contributions of MLM to research on essay score variability. The study included 31 experienced and 29 novice raters who each rated a set of essays holistically and analytically. Scores were analyzed using MLM to examine the associations between essay features and holistic scores and the impact of rater experience on both essay holistic scores and these associations. The experienced raters assigned lower scores and gave more importance to linguistic accuracy than did the novices. Novices gave more importance to argumentation and their scores exhibited more variability. The article concludes by highlighting the value of MLM in identifying and estimating the contributions of various individual, textual and contextual factors in the rating context to variability in ESL essay scores.



  • Interaction in group oral assessment: A case study of higher- and lower-scoring students
    by Gan, Z.
    Posted on 2 Jul, 2010

    This article examines the interactional work in which two groups of secondary ESL students engaged to achieve and sustain participation in group oral assessment, which is designed to assess a student’s interactive communication skills in a school-based assessment context. The in-depth observation of the ways in which participants co-constructed talk-in-interaction led to the discovery of the particular pattern of speech exchange within each group. Within the higher-scoring group, the students engaged constructively and contingently with one another’s ideas, demonstrating a range of speech functions such as suggestions, agreement or disagreement, explanations, and challenges, which resulted in opportunities for substantive conversation and genuine communication to be engineered. Within the lower-scoring group, the resulting interactions appeared more structured, apparently as a result of the pre-set prompts that were originally set for the purpose of facilitating within-group discussion. However, a picture emerges of lower-scoring group members naturally engaging in negotiation of meaning over linguistic impasses, which turned out to serve as the stimulus to collaborative dialogue. There is also evidence of lower-scoring group members assisting each other through co-construction both to find the right linguistic forms and to express meaning. The nature of these interactions suggests that the group oral assessment format, as operationalized in this context, can authentically reflect students’ interactional skills and their moment-by-moment construction of social and linguistic identity. However, the lack of contingent development of topical talk within the lower-scoring group implies that the assessor’s good intentions in providing pre-set prompts may end up restricting students’ performance. The risk of such task/topic-related effects on the quality of student discourse and interaction warrants further research.



  • Judgments of oral proficiency by non-native and native English speaking teacher raters: competing or complementary constructs?
    by Zhang, Y., Elder, C.
    Posted on 15 Apr, 2010

    This paper reports the findings of an empirical study on ESL/EFL teachers’ evaluation and interpretation of oral English proficiency as elicited by the national College English Test-Spoken English Test (CET-SET) of China. Informed by debates on the issue of native speaker (NS) norms which have become the focus of attention in recent years, this study addresses the question of whether judgments of language proficiency by non-native English speaking (NNES) teachers, who are currently used to assess performance on the CET-SET, correspond to those of native English speaking (NES) teachers or whether the two groups draw on different constructs of oral proficiency. Data for the study were derived from two sources: unguided holistic ratings given by a group of 19 NES and 20 NNES teachers to CET-SET speech samples from 30 test-takers, and written comments to justify the ratings assigned. Results yielded by both quantitative (MFRM) and qualitative analyses of teacher data revealed no significant difference in raters’ holistic judgments of the speech samples and a broad level of agreement between groups on the construct components of oral English proficiency. However, the analysis of raters’ comments revealed both quantitative and qualitative differences in the way NES and NNES teachers weighed various features of the oral proficiency construct in justifying the decisions made. The paper concludes by considering the implications of the study’s findings for debates about the native speaker norm as the target for language learners and test-takers.



  • Effective rating scale development for speaking tests: Performance decision trees
    by Fulcher, G., Davidson, F., Kemp, J.
    Posted on 7 Apr, 2010

    Rating scale design and development for testing speaking is generally conducted using one of two approaches: the measurement-driven approach or the performance data-driven approach. The measurement-driven approach prioritizes the ordering of descriptors onto a single scale. Meaning is derived from the scaling methodology and the agreement of trained judges as to the place of any descriptor on the scale. The performance data-driven approach, on the other hand, places primary value upon observations of language performance, and attempts to describe performance in sufficient detail to generate descriptors that bear a direct relationship with the original observations of language use. Meaning is derived from the link between performance and description. We argue that measurement-driven approaches generate impoverished descriptions of communication, while performance data-driven approaches have the potential to provide richer descriptions that offer sounder inferences from score meaning to performance in specified domains. With reference to original data and the literature on travel service encounters, we devise a new scoring instrument, a Performance Decision Tree (PDT). This instrument prioritizes what we term ‘performance effect’ by explicitly valuing and incorporating performance data from a specific communicative context. We argue that this avoids the reification of ordered scale descriptors which we find in measurement-driven scale construction for speaking tests.



  • The place of language testing and assessment in the professional preparation of foreign language teachers in China
    by Jin, Y.
    Posted on 7 Apr, 2010

    Since the late 1970s, following the major economic reforms that opened the country to the outside world, China has witnessed a growing interest in foreign language education and, hence, a substantial surge in the number of foreign language learners and practitioners. However, the situation of the professional preparation of the practitioners has been unclear due to the paucity of relevant research and literature. This study, therefore, set out to investigate the training of tertiary level foreign language teachers in China with a focus on language testing and assessment courses. A nationwide survey was conducted among 86 instructors of such courses for an overview of the current situation in terms of the instructors, teaching content, teaching methodology, student perceptions of the courses, and teaching materials. The results revealed that the courses adequately covered essential aspects of theory and practice of language testing. However, educational and psychological measurement and student classroom practice received significantly less attention. Comparison of the teaching content of the different types of courses did not show major differences. Suggestions were provided to highlight some under-addressed aspects of the teaching content and to set up a network of teacher-testers to create opportunities for practitioners to exchange experiences, professional knowledge and skills.



  • The effect of the multiple-choice item format on the measurement of knowledge of language structure
    by Currie, M., Chiramanee, T.
    Posted on 10 Mar, 2010

    Noting the widespread use of multiple-choice items in tests in English language education in Thailand, this study compared their effect against that of constructed-response items. One hundred and fifty-two university undergraduates took a test of English structure first in constructed-response format, and later in three stem-equivalent multiple-choice formats, with the distractors based on incorrect answers from the constructed-response test. A significant and substantial increase in mean scores, and generally in individual scores, was found between the two tests, although the scores in the tests were quite closely correlated, which is often taken to indicate that a similar construct was measured by the two test formats. However, direct comparison of the responses to the items in the two tests showed that only 26% of the responses were the same, suggesting that most of what the multiple-choice items measured was directly dependent on the item format. The study found remarkable consistency in the response patterns between the tests among three experimental groups of participants, who sat different option number formats of the multiple-choice test, pointing to the possibility of a general effect of multiple-choice items in testing the learning of structure in second and foreign languages.



  • Sensitivity of narrative organization measures using narrative retells produced by young school-age children
    by Heilmann, J., Miller, J. F., Nockerts, A.
    Posted on 10 Mar, 2010

    Analysis of children’s productions of oral narratives provides a rich description of children’s oral language skills. However, measures of narrative organization can be directly affected by both developmental and task-based performance constraints, which can make a measure insensitive and inappropriate for a particular population and/or sampling method. This study critically reviewed four methods of evaluating children’s narrative organization skills and revealed that the Narrative Scoring Scheme (NSS) was the most developmentally sensitive measure for a group of 129 children aged 5–7 years who completed a narrative retell. Upon comparing the methods of assessing narrative organization skills, the NSS was unique in its incorporation of higher-level narrative features and its scoring rules, which required examiners to make subjective judgments across seven aspects of the narrative process. The discussion surrounded issues of measuring children’s narrative organization skills and, more broadly, issues surrounding sensitivity of criterion-referenced assessment measures.



  • The effect of the use of video texts on ESL listening test-taker performance
    by Wagner, E.
    Posted on 10 Mar, 2010

    Video is widely used in the teaching of L2 listening, and SLA researchers have argued that the visual components of spoken texts are useful for the listener in comprehending aural information. Yet video texts are rarely used on tests of L2 listening ability, perhaps in part due to the belief that including the visual channel involves assessing something beyond listening ability. In this study, a quasi-experimental design was used to compare the performance of two groups of learners on an ESL listening test. The control group took a listening test with audio-only texts. The experimental group took the same listening test, except that test-takers received the input through the use of video texts. Multi-variate Analysis of Covariance (MANCOVA) was used to compare the two groups’ performance, and it was found that the video (experimental) group scored 6.5% higher than the audio-only (control) group on the overall post-test, and this difference was statistically significant. The results of the study suggest that the non-verbal information in the video texts contributed to the video group’s superior performance.



  • The challenge of validation: Assessing the performance of a test of productive vocabulary
    by Fitzpatrick, T., Clenton, J.
    Posted on 10 Mar, 2010

    This paper assesses the performance of a vocabulary test designed to measure second language productive vocabulary knowledge. The test, Lex30, uses a word association task to elicit vocabulary, and uses word frequency data to measure the vocabulary produced. Here we report firstly on the reliability of the test as measured by a test-retest study, a parallel test forms experiment and an internal consistency measure. We then investigate the construct validity of the test by looking at changes in test performance over time, analyses of correlations with scores on similar tests, and comparison of spoken and written test performance. Last, we examine the theoretical bases of the two main test components: eliciting vocabulary and measuring vocabulary. Interpretations of our findings are discussed in the context of test validation research literature. We conclude that the findings reported here present a robust argument for the validity of the test as a research tool, and encourage further investigation of its validity in an instructional context.