Educational assessment and model building using process data: Issues of open science and replication

Johannes Naumann, Malte Elson, & Frank Goldhammer

Session 4A, 11:20 - 12:05, HAGEN 2

Data delivered by Large-Scale Assessments (LSAs) are not only used to describe student performance, to link performance to background variables at the student, school, and system level, and thus to inform educational policy. Increasingly, LSA data are also used for theory building in substantive educational and psychological research. One advantage of using LSA data for substantive research is that research grounded in LSAs already addresses many of the problems recently raised concerning the openness and replicability of educational and psychological research (the “replication debate”; e.g., Makel & Plucker, 2014): large samples and cyclic repetition can be used to disentangle exploratory from confirmatory research, or for direct replications. As LSAs are increasingly carried out as computer-based assessments (CBAs), this advantage extends to models requiring data on the task solution process, since log files of student behavior can be mined for psychologically meaningful behavioral indicators. To date, however, only few attempts have been made to replicate research using LSA process data.

In the present research, PISA 2012 CBA data were used in an attempt to replicate recently published results that had been obtained with data from the (optional) PISA 2009 Digital Reading Assessment (Naumann & Goldhammer, 2017). In that research, a dual-processing account of reading digital text (Shiffrin & Schneider, 1977; Walczyk, 2000) was tested by examining how items’ difficulties and persons’ skills moderate the effect of time on task on performance in digital reading, using a GLMM framework. Consistent with a dual-processing account, the authors found strongly positive time-on-task effects for weak digital readers and hard items, whereas time-on-task effects were negative for easy items and null for skilled digital readers. Accordingly, negative correlations emerged between the random item and person intercepts and the corresponding random item- and person-specific time-on-task slopes. Also in line with a dual-processing account, items’ navigational demands and persons’ comprehension skills, modeled as fixed effects, moderated time-on-task effects such that time-on-task effects were positive especially for weak comprehenders and for tasks with high navigational demands.
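To make the model structure concrete, the following is a minimal sketch of a GLMM of the kind described above; the symbols are illustrative and not taken from the original paper. Writing y_{pi} for the score of person p on item i, t_{pi} for (log) time on task, x_p for the person’s comprehension skill, and z_i for the item’s navigational demands, the model has the form

\[
\operatorname{logit} \Pr(y_{pi} = 1) = \beta_0 + \beta_1 t_{pi} + \beta_2 x_p + \beta_3 z_i + \beta_4\, t_{pi} x_p + \beta_5\, t_{pi} z_i + \theta_{0p} + \theta_{1p} t_{pi} + b_{0i} + b_{1i} t_{pi},
\]

where \((\theta_{0p}, \theta_{1p})\) and \((b_{0i}, b_{1i})\) are jointly normal random person and item effects. The intercept–slope correlations \(\operatorname{cor}(\theta_{0p}, \theta_{1p})\) and \(\operatorname{cor}(b_{0i}, b_{1i})\), together with the fixed interaction terms \(\beta_4\) and \(\beta_5\), are the quantities compared across the 2009 and 2012 cycles below.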

These results were only partly replicated in PISA 2012. While the median correlation between person intercepts and slopes across the 17 countries participating in both cycles was -.61 in 2009, pointing to much stronger time-on-task effects for weaker digital readers, the corresponding median correlation in 2012 was only -.30; this correlation was also weaker (less negative) in 2012 than in 2009 in every individual country. The median correlation between item intercepts and slopes was -.61 in 2009 and -.50 in 2012. Similarly, while the median interaction effect between time on task and comprehension skill was -0.07 in 2009, it was only -0.03 in 2012, and the median interaction between time on task and navigational demands was 0.26 in 2012, compared with 0.48 in 2009.

These results indicate that the replicability of substantive results obtained from LSA data cannot be taken for granted, despite large samples and standardized testing procedures. The present results are discussed in the context of changes in the test design that occurred between the 2009 and 2012 PISA CBAs.
