|
|
|
|
|
| |
|
|
| |
- Distinguishing between Net and Global DIF in Polytomous Items
by Randall D. Penfield
In this article, I address two competing conceptions of differential item functioning (DIF) in polytomously scored items. The first conception, referred to as net DIF, concerns between-group differences in the conditional expected value of the polytomous response variable. The second conception, referred to as global DIF, concerns the conditional dependence of group membership and the polytomous response variable. The distinction between net and global DIF is important because different DIF evaluation methods are appropriate for net and global DIF; no currently available method is universally the best for detecting both net and global DIF. Net and global DIF definitions are presented under two different, yet compatible, modeling frameworks: a traditional item response theory (IRT) framework, and a differential step functioning (DSF) framework. The theoretical relationship between the IRT and DSF frameworks is presented. Available methods for evaluating net and global DIF are described, and an applied example of net and global DIF is presented.
- How Often Do Subscores Have Added Value? Results from Operational and Simulat...
by Sandip Sinharay
Recently, there has been an increasing level of interest in subscores for their potential diagnostic value. Haberman suggested a method based on classical test theory to determine whether subscores have added value over total scores. In this article I first provide a rich collection of results regarding when subscores were found to have added value for several operational data sets. Following that I provide results from a detailed simulation study that examines what properties subscores should possess in order to have added value. The results indicate that subscores have to satisfy strict standards of reliability and correlation to have added value. A weighted average of the subscore and the total score was found to have added value more often.
- Random-Groups Equating with Samples of 50 to 400 Test Takers
by Samuel A. Livingston, Sooyeon Kim
Five methods for equating in a random groups design were investigated in a series of resampling studies with samples of 400, 200, 100, and 50 test takers. Six operational test forms, each taken by 9,000 or more test takers, were used as item pools to construct pairs of forms to be equated. The criterion equating was the direct equipercentile equating in the group of all test takers. Equating accuracy was indicated by the root-mean-squared deviation, over 1,000 replications, of the sample equatings from the criterion equating. The methods investigated were equipercentile equating of smoothed distributions, linear equating, mean equating, symmetric circle-arc equating, and simplified circle-arc equating. The circle-arc methods produced the most accurate results for all sample sizes investigated, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.
- Investigating the Effectiveness of Equating Designs for Constructed-Response ...
by Sooyeon Kim, Michael E. Walker, Frederick McHale
Using data from a large-scale exam, in this study we compared various designs for equating constructed-response (CR) tests to determine which design was most effective in producing equivalent scores across the two tests to be equated. In the context of classical equating methods, four linking designs were examined: (a) an anchor set containing common CR items, (b) an anchor set incorporating common CR items rescored, (c) an external multiple-choice (MC) anchor test, and (d) an equivalent groups design incorporating rescored CR items (no anchor test). The use of CR items without rescoring resulted in much larger bias than the other designs. The use of an external MC anchor resulted in the next largest bias. The use of a rescored CR anchor and the equivalent groups design led to similar levels of equating error.
- Stratified and Maximum Information Item Selection Procedures in Computer Adap...
by Hui Deng, Timothy Ansley, Hua-Hua Chang
In this study we evaluated and compared three item selection procedures: the maximum Fisher information procedure (F), the a-stratified multistage computer adaptive testing (CAT) (STR), and a refined stratification procedure that allows more items to be selected from the high a strata and fewer items from the low a strata (USTR), along with completely random item selection (RAN). The comparisons were with respect to error variances, reliability of ability estimates and item usage through CATs simulated under nine test conditions of various practical constraints and item selection space. The results showed that F had an apparent precision advantage over STR and USTR under unconstrained item selection, but with very poor item usage. USTR reduced error variances for STR under various conditions, with small compromises in item usage. Compared to F, USTR enhanced item usage while achieving comparable precision in ability estimates; it achieved a precision level similar to F with improved item usage when items were selected under exposure control and with limited item selection space. The results provide implications for choosing an appropriate item selection procedure in applied settings.
- Factors Affecting the Item Parameter Estimation and Classification Accuracy o...
by Jimmy de la Torre, Yuan Hong, Weiling Deng
To better understand the statistical properties of the deterministic inputs, noisy "and" gate cognitive diagnosis (DINA) model, the impact of several factors on the quality of the item parameter estimates and classification accuracy was investigated. Results of the simulation study indicate that the fully Bayes approach is most accurate when the prior distribution matches the latent class structure. However, when the latent classes are of indefinite structure, the empirical Bayes method in conjunction with an unstructured prior distribution provides much better estimates and classification accuracy. Moreover, using empirical Bayes with an unstructured prior does not lead to extremely poor results as other prior-estimation method combinations do. The simulation results also show that increasing the sample size reduces the variability, and to some extent the bias, of item parameter estimates, whereas lower level of guessing and slip parameter is associated with higher quality item parameter estimation and classification accuracy.
- Introduction to Applied Bayesian Statistics and Estimation for Social Scienti...
by Lihua Yao
- Linking and Aligning Scores and Scales by Dorans, N.J., Pommerich, M., & Holl...
by Isaac I. Bejar
|
|
|
| |
|
|
|
|
| |
| |
|
| |
|