Open Conference Systems, ITC 2016 Conference

SYMPOSIUM: Continuous Testing: Psychometric Challenges and Opportunities
Alvaro J. Arce, Rense Lange, Marina Fronton, Okan Bulut, Avi Allalouf, Tony Gutentag, Michal Baumer

Building: Pinnacle
Room: 2F-Port of Vancouver
Date: 2016-07-02 03:30 PM – 05:00 PM
Last modified: 2016-08-29

Abstract

Testing organizations around the world are expanding their services to offer test administration in a continuous mode. Continuous test administration is already a prevalent practice in employment and credentialing testing, and it is gradually expanding into other areas, such as vocational training, college admission, and high school diploma examinations. Continuous testing brings important benefits to examinees and testing professionals. Offering only a limited number of test administration windows during the year can have unintended consequences for test-takers' lives: fewer opportunities to attempt the examinations can mean missing critical dates for employment, high school graduation, or college admission applications, for example. On the other hand, continuous testing presents important challenges to testing professionals and practices. For example, small sample sizes are pervasive in certification programs with low volumes of applicants, and the re-use of test forms raises exposure concerns that affect test score uses and interpretations.

Altogether, the session brings five presenters from four countries and provides valuable exposure to some of the most important issues and psychometric solutions for continuous testing programs. The first paper presents the results of an innovative approach to field testing multiple-choice and open-ended items within computer adaptive tests. The second paper provides results of a study to improve the performance of differential item functioning analyses under data collection designs prevalent in certification testing. The third paper brings findings from a research study on the effects of adaptive test administration on item exposure rates in continuous college admission testing. The fourth paper presents state-of-the-art research on the validity of subscores for predicting pass/fail rates in high school diploma examinations administered in continuous mode. The fifth paper presents the development of an automated quality control system to routinely monitor tests administered in continuous mode.

*****

Paper 1: Simulating the Calibration of Multiple-Choice and Open-Ended Items During CAT Using PROX-Based Targeting
Rense Lange

Introduction: Continuous testing requires that many new items be developed and piloted, especially when using computer adaptive testing (CAT). Traditionally, multiple-choice (MC) items were the major burden in this respect; MC here covers any item type that is immediately computer gradable. There now exist mixed CAT systems that combine MC items with open-ended (OE) questions, both of which are graded in real time and contribute to the person estimates. It is customary to include some pilot items in an operational CAT session, thereby avoiding the need for external pilot studies. Earlier computer simulations showed how the PROX method can be used to speed the calibration of MC items by targeting such items to test-takers' performance levels. Realistically, fewer OE than MC items can be included in a CAT session, and it thus appears that PROX targeting has greater potential benefits for OE items.
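As a rough illustration of the targeting idea (the standard Rasch PROX normal approximation, not necessarily the authors' exact variant), an item's difficulty can be estimated from its proportion correct, expanded for the spread of person abilities:

```python
import math

def prox_item_difficulty(n_correct, n_persons, person_mean=0.0, person_sd=1.0):
    """PROX (normal approximation) estimate of an item's Rasch difficulty
    from the proportion of examinees answering it correctly.
    The expansion factor sqrt(1 + sd^2/2.89) corrects for the spread of
    person abilities (2.89 = 1.7^2, the logistic/normal scaling constant)."""
    p = n_correct / n_persons
    logit_incorrect = math.log((1 - p) / p)  # harder items give larger values
    expansion = math.sqrt(1 + person_sd ** 2 / 2.89)
    return person_mean + expansion * logit_incorrect

# An item answered correctly by half the group sits at the group mean:
print(round(prox_item_difficulty(50, 100), 3))  # 0.0
```

Such preliminary estimates let pilot items be matched to test-takers' performance levels before full calibration data accumulate.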

Objective: This presentation reports on simulating various mixtures of MC and OE pilot items to be calibrated during CAT piloting, using PROX-based preliminary targeting. The purpose is to identify the conditions that achieve optimal results under realistic circumstances. Of particular relevance is the grading of OE pilot items (in real time, during CAT) in the absence of reliable mechanisms to do so.

Method: A program implemented in Python 2.7 was written to simulate standard uni-dimensional CAT based on the Rasch model, allowing a number of parameters to be set: (a) the distribution of student abilities, (b) the distribution of item difficulties, (c) the stopping criterion (including SE and maximum number of items), (d) the number of pilot items to be administered, (e) the numbers of OE and MC items to be included for piloting, (f) the difficulty levels of the pilot items, and (g) the extent to which OE items can be graded during piloting, given that grading rules are typically lacking.
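A minimal sketch of the kind of Rasch-based CAT loop such a simulator is built around (parameter values and the one-step ability update here are illustrative assumptions, not the actual program):

```python
import math
import random

def rasch_p(theta, b):
    """Rasch model probability of a correct response at ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def simulate_cat(true_theta, bank, se_target=0.35, max_items=30, rng=random):
    """Minimal CAT loop: administer the best-targeted remaining item,
    simulate a Rasch response, and update theta by one Newton-Raphson step."""
    theta, responses, remaining = 0.0, [], list(bank)
    while remaining and len(responses) < max_items:
        b = min(remaining, key=lambda d: abs(d - theta))  # info is maximal at b == theta
        remaining.remove(b)
        x = 1 if rng.random() < rasch_p(true_theta, b) else 0
        responses.append((b, x))
        resid = sum(x_i - rasch_p(theta, b_i) for b_i, x_i in responses)
        info = sum(rasch_p(theta, b_i) * (1 - rasch_p(theta, b_i)) for b_i, _ in responses)
        theta = max(-4.0, min(4.0, theta + resid / info))  # clamp extreme estimates
        if 1.0 / math.sqrt(info) < se_target:              # SE-based stopping rule
            break
    return theta, len(responses)

theta_hat, n_items = simulate_cat(0.5, [i / 10.0 - 2.0 for i in range(41)],
                                  rng=random.Random(7))
```

A full simulator would additionally draw abilities and difficulties from the configured distributions and interleave pilot items with the operational ones.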

Results: Although the complete simulation system is not yet fully implemented, preliminary runs confirm the basic hypothesis that PROX targeting benefits the calibration of OE as well as MC items. More detailed simulations are in progress, varying the reliability with which OE pilot items can be graded (by computer, in real time) during piloting.

 

Paper 2: Improving MH DIF for Use in Certification Testing
Alvaro J. Arce

Introduction: Researchers and practitioners have long been aware of the effects of sample size on the validity of MH DIF and, ultimately, on item audit outcomes (Camilli, 2006). In licensure and certification testing, multiple administrations of a low-incidence test must take place before DIF analysis becomes feasible. In the Single Group Nearly Equivalent Test (SiGNET) design, base forms of a test incorporate new test forms to allow greater numbers of examinees on the new forms while the base forms are operationally rotated (Puhan et al., 2009). MH DIF analyses then use data pooled at the test level.

Objective: This study examines the performance of MH DIF, Rasch MH DIF, and Bayesian MH DIF approaches with the SiGNET data pooling strategy.

Methodology: The study involved synthetic and real data scenarios. The Monte Carlo simulation used a 120-item multiple-choice base test and three 80-item forms developed from it. The base test comprises four operational parts (A, B, C, and D) and two pilot parts (E and F). Data sets were simulated under DIF and no-DIF conditions to evaluate hit and Type I error rates. The study also inspected the relative performance of the three approaches with real data sets from certification testing. The outcomes of interest in the comparisons were the number of flagged items and the type of DIF flag. The analysis used typical DIF flagging rules.
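For reference, the classical Mantel-Haenszel statistic underlying all three approaches can be sketched as follows (a generic illustration on made-up tables, not the SiGNET-specific pooling code):

```python
import math

def mh_ddif(strata):
    """Mantel-Haenszel delta-DIF (ETS metric) from a list of 2x2 tables,
    one per matching-score stratum: (ref_right, ref_wrong, foc_right, foc_wrong).
    Negative values indicate the item favors the reference group."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    alpha = num / den          # common odds ratio across strata
    return -2.35 * math.log(alpha)

# No-DIF example: identical odds of success in both groups at every stratum
tables = [(40, 10, 20, 5), (30, 30, 15, 15)]
print(round(mh_ddif(tables), 3))  # approximately 0: no DIF
```

With pooled SiGNET data the stratification would run over matching scores on the common base-form items; typical flagging rules then classify items by the size and significance of this delta metric.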

Results and Conclusions: Preliminary results showed the SiGNET Rasch MH DIF to be a better option than the SiGNET MH DIF. Its reliance on incomplete data structures contributed to greater efficacy and effectiveness in identifying true DIF. Ongoing analysis is scheduled for completion by early February, and final results will be presented and discussed at the conference.

 

Paper 3: Does Retaking an Adaptive Test with the Same Test Form Give Examinees an Advantage?
Marina Fronton, Tony Gutentag, Avi Allalouf

Introduction: This study sheds new light on retesting in the context of adaptive (as opposed to linear) tests – a subject that has not been extensively studied. Scores are known to improve with retesting in linear tests, and this effect intensifies when the same test form is used. In linear tests, receiving the same test form twice means that examinees are exposed to exactly the same test items. This, however, is not the case with an adaptive test.

Objectives: To determine whether retaking an adaptive test with the same test form gives examinees an advantage.

Design: The computerized AMIRAM English test, taken by applicants to Israeli academic institutions in a continuous mode, was used in this study. In this adaptive test (based on the 3-parameter IRT model), an examinee is given 21-28 items from a random test form. A test form is in fact a pool of approximately 200 items; therefore, receiving the same test form on two different administrations does not necessarily mean receiving the same items. The research was based on repeat examinees who took the AMIRAM test between 1992 and 2013. Eleven test forms were given to 199,674 examinees, 23.4% of whom were repeat examinees.
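To illustrate the general mechanism behind such a test (a generic 3-parameter IRT model with maximum-information item selection; the parameter values are hypothetical, not AMIRAM's), note that two examinees with different ability estimates are steered to different items even within the same form:

```python
import math

D = 1.7  # logistic scaling constant

def p3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta
    (a = discrimination, b = difficulty, c = pseudo-guessing)."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def info3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((p - c) / (1 - c)) ** 2 * ((1 - p) / p)

def pick_item(theta, bank):
    """Maximum-information selection from a bank of (a, b, c) triples."""
    return max(bank, key=lambda item: info3pl(theta, *item))

bank = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
best = pick_item(0.0, bank)  # the item best targeted to the ability wins
```

Because selection depends on the running ability estimate, a repeat examinee drawing the same 200-item pool need not see the same 21-28 items twice.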

Results: The findings showed that scores among repeat examinees who took the same test form on two consecutive administrations were significantly higher than the scores of repeat examinees who received a different form. The effect, however, was very small.

Conclusions and Future Research: Although the score increase is small, efforts should nonetheless be made to further reduce, or prevent altogether, cases of repeat examinees being given the same test form. An analysis of repeat exams at the item level (the proportion of identical items received by repeat examinees and the impact on scores) is planned.

 

Paper 4: Criterion-Related Validity of Subscores in High School Diploma Examinations
Okan Bulut

Introduction: Recently, there has been increasing interest in reporting subscores, because they can provide diagnostically useful information for students, teachers, parents, and other stakeholders. As reporting subscores from educational assessments has become common practice, criticism has emerged regarding the lack of added value of subscores over the total test score in predicting student outcomes (Haberman, 2008).

Objectives: This study aims to investigate the utility of subscores in predicting two criterion variables: pass/fail rates from an operational high school diploma examination and students' high school GPA.

Methodology: Davison et al. (2015) introduced a new methodology called criterion-related profile analysis (CPA) to assess whether the subscores of a test can account for more variance in a criterion variable than does the overall test score. CPA is a useful tool that encompasses subscore reliability, validity, distinctiveness, and fairness within the evaluation of subscores. In this study, Davison et al.'s (2015) CPA method is used to assess whether subscores add to the prediction of the pass/fail decision in a high school diploma examination and of high school GPA over and above the contribution of a single total score. Scores from the Science, Mathematics, and Social Studies tests are used to examine whether they uniquely contribute to predicting students' pass/fail rates in the diploma examination and their high school GPA.
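A simplified stand-in for the core comparison, the incremental R² of subscores over the total score on synthetic data (not Davison et al.'s full CPA decomposition into level and pattern components), can be sketched as:

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
sub = rng.normal(size=(500, 3))            # three synthetic subscores
gpa = sub @ np.array([0.6, 0.1, 0.1]) + rng.normal(scale=0.5, size=500)
total = sub.sum(axis=1, keepdims=True)     # single total score

r2_total = r_squared(total, gpa)           # prediction from the total alone
r2_sub = r_squared(sub, gpa)               # prediction from the subscore profile
added_value = r2_sub - r2_total            # nonnegative: total lies in subscore span
```

Because the total is a linear combination of the subscores, the subscore model can never explain less variance; the question is whether the increment is large enough to justify reporting subscores.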

Results and Conclusions: While most analyses of subscores are internal to the subscores themselves, the present analysis is external, addressing the question of whether subscores add over and above the total score in predicting an external criterion. Findings of this study indicate that the contributions of the subscores to predicting pass/fail rates from the diploma exam and high school GPA differ substantially.

 

Paper 5: An Automated Quality Control System to Routinely Monitor Continuous Mode Tests
Avi Allalouf, Michal Baumer, Tony Gutentag

Introduction: Quality control (QC) in the field of educational measurement is vital because: (1) the testing process is lengthy and consists of multiple stages, at any of which an error may occur; (2) statistical assumptions used in equating and scoring are only partially met; and (3) there are junctures at which judgment calls are needed. QC of continuous mode tests (CMT), usually computerized tests administered to small groups of examinees on many administration dates using a wide array of test forms, is especially vital as this mode becomes more prevalent throughout the world.

Objectives: A growing need for rapid and routine monitoring of test delivery and score reporting arises for CMT. An automated system has been developed to ensure that test delivery and scoring are acceptable, all in the short time available between test administration and the reporting of scores. By implementing statistical methods used in various fields, among them the manufacturing industry, the system assists in the real-time identification of irregularities in relatively small amounts of frequently streamed data.

The System: The system continuously monitors test delivery and scores; data is extracted on a weekly basis, statistical reports summarize the results and flag irregular events, and Shewhart control charts are displayed.  Data is reviewed by a trained expert prior to the scores being reported. As the system accumulates data, the next step in system development will be to further fine-tune quality standards to enable the system to distinguish between common- and special-cause variation in the monitored parameters. The system will then generate a periodic short report that flags results which require further assessment, while minimizing the number of false alarms.
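A minimal sketch of the Shewhart-style flagging the system performs (baseline values and the 3-sigma rule here are illustrative assumptions, not the operational quality standards):

```python
import statistics

def shewhart_limits(baseline, k=3.0):
    """Center line and +/- k-sigma control limits from baseline observations."""
    mean = statistics.fmean(baseline)
    sd = statistics.stdev(baseline)
    return mean - k * sd, mean, mean + k * sd

def flag_irregular(new_values, baseline, k=3.0):
    """Indices of new observations falling outside the Shewhart limits."""
    lo, _, hi = shewhart_limits(baseline, k)
    return [i for i, v in enumerate(new_values) if v < lo or v > hi]

# Hypothetical weekly values of a monitored parameter (e.g., mean scaled score)
baseline = [100.2, 99.8, 100.5, 99.9, 100.1, 100.4, 99.7, 100.0]
print(flag_irregular([100.3, 97.0, 100.1], baseline))  # [1]
```

Distinguishing common- from special-cause variation then amounts to tuning k (and adding run rules) so that routine fluctuation stays inside the limits while genuine irregularities are flagged for expert review.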

