Differential item functioning (DIF) analysis examines whether test items function differently for two groups of test takers after controlling for the groups' overall ability. Many scholars and test standards have advocated this method for detecting construct-irrelevant and biased test items in order to improve test validity and fairness (e.g., Camilli & Shepard, 1994; ITC, 2000; Kunnan, 1997, 2000, 2004; TCTP, 1988, 2005; Xi, 2010). Existing studies mainly focus on DIF between test-taker groups defined by native language, gender, age, or academic major in tests for mature learners (e.g., Aryadoust, 2012; Aryadoust, Goh & Kim, 2011; Banerjee & Papageorgiou, 2016; Geranpayeh & Kunnan, 2007; Grover & Ercikan, 2017; Oliveri, Lawless, Robin & Bridgeman, 2018). Very few, if any, have examined grade DIF in tests for children. Since pupils in higher grades tend to be more cognitively developed, test items are more likely to favour them over pupils in lower grades, even after conditioning on overall ability. This hypothesis, however, has not been tested empirically.
To address this issue, the current study examined grade DIF in GEPT-Kids listening. Quantitative data were collected from 791 pupils (Grade 5: 398; Grade 6: 393) in eight cities across mainland China, Taiwan, and Hong Kong, and qualitative data were collected from two primary school teachers in mainland China. Two R packages, ‘difR’ and ‘difNLR’ (Magis, Beland, Tuerlinckx & De Boeck, 2010; Drabinova & Martinkova, 2017), were used to perform five types of DIF analysis on the pupils’ test results (Lord's and Raju’s methods based on the 2PL IRT model, Mantel-Haenszel, Breslow-Day, and non-linear regression), and expert judgement was used to explore the sources of DIF in the flagged items. Taken together, the DIF analyses flagged over half of the test items, and most of the DIF items tended to favour Grade 6 pupils, as hypothesized. Expert judgement revealed only one potential reason why certain items favoured Grade 6: the content tested had not yet been learned, or had only recently been learned, by the Grade 5 pupils. In light of these findings, future studies should conduct post-test interviews with test takers in order to identify sources of DIF and improve test validity and fairness.
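
To make the idea of conditioning on ability concrete, the Mantel-Haenszel procedure named above can be sketched as follows. This is an illustration only, not the analysis reported in the study, which was run with the R packages ‘difR’ and ‘difNLR’. The function names `build_strata` and `mh_dif`, the group labels, and the choice of the total test score as the matching variable are assumptions made for this sketch. Pupils are stratified by total score, a 2 × 2 table of group by item response is formed in each stratum, and the tables are pooled into a common odds ratio, the ETS delta value, and a continuity-corrected chi-square statistic.

```python
import math
from collections import defaultdict


def build_strata(responses, groups, item, reference, focal):
    """Build one 2x2 table per total-score stratum for a single item.

    responses: list of 0/1 lists, one per pupil (hypothetical data layout).
    groups:    list of group labels aligned with responses.
    Returns a list of (a, b, c, d) tuples, where
      a = reference correct, b = reference incorrect,
      c = focal correct,     d = focal incorrect.
    """
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for row, group in zip(responses, groups):
        cell = strata[sum(row)]  # matching variable: total score (assumption)
        if group == reference:
            cell[0 if row[item] == 1 else 1] += 1
        elif group == focal:
            cell[2 if row[item] == 1 else 3] += 1
    return [tuple(cell) for cell in strata.values()]


def mh_dif(tables):
    """Pool 2x2 tables into Mantel-Haenszel DIF statistics.

    Returns (alpha, delta, chi2):
      alpha = common odds ratio (values above 1 favour the reference group),
      delta = -2.35 * ln(alpha), the ETS delta scale,
      chi2  = MH chi-square with continuity correction (1 degree of freedom).
    """
    num = den = 0.0
    sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        # Skip strata that cannot contribute: too small or only one group.
        if t < 2 or (a + b) == 0 or (c + d) == 0:
            continue
        num += a * d / t
        den += b * c / t
        sum_a += a
        sum_expected += (a + b) * (a + c) / t
        sum_var += (a + b) * (c + d) * (a + c) * (b + d) / (t * t * (t - 1))
    if num == 0 or den == 0 or sum_var == 0:
        raise ValueError("Not enough variation across strata to estimate DIF.")
    alpha = num / den
    delta = -2.35 * math.log(alpha)
    chi2 = (abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var
    return alpha, delta, chi2


if __name__ == "__main__":
    # Toy data (invented for illustration): two items, every pupil scores 1
    # in total, so all pupils fall into a single ability stratum.
    rows = [[1, 0]] * 20 + [[0, 1]] * 10 + [[1, 0]] * 10 + [[0, 1]] * 20
    labels = ["G6"] * 30 + ["G5"] * 30
    tables = build_strata(rows, labels, item=0, reference="G6", focal="G5")
    print(mh_dif(tables))
```

In this sketch Grade 6 is treated as the reference group, so a common odds ratio above 1 (equivalently, a negative delta) would indicate an item that favours Grade 6 pupils among pupils of the same total score.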