Cloze Distillation Improves Psychometric Predictive Power

Abstract

Training models on next word prediction (NWP) has led to significant developments in general-purpose language models (LMs). However, this objective lacks cognitive motivation under theories of language processing that show human prediction to be graded, parallel, and scaffolded. Here, we present new evidence that challenges the ability of the NWP task to generate LMs with human-like predictive capacities. First, we compare state-of-the-art transformer LMs with a simple LSTM model, and show that while the transformer models achieve better performance in both NWP and prediction of human cloze completions for the Provo corpus, the LSTM model provides a better account of human reading times on the same corpus. This reveals a dissociation between human language processing and NWP. On that basis, we propose Cloze Distillation: a novel method for distilling the linguistic information that is implicit in human cloze completions into pre-trained LMs. We apply this method to the LSTM model and show substantial improvement in reading time prediction, word frequency estimation, and generalization to held-out human cloze data. Our results identify the direct modeling of human psychometric data as a potentially effective method for creating more psychometrically valid LMs.
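
The abstract describes distilling human cloze completions into a pre-trained LM but does not spell out the training objective. Below is a minimal sketch of one plausible formulation, assuming PyTorch and an interpolation between a cloze-distillation term and the standard NWP term; the function name, the alpha weight, and the tensor shapes are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def cloze_distillation_loss(lm_logits, cloze_counts, nwp_targets, alpha=0.5):
    """Hypothetical sketch of a cloze-distillation objective (not the paper's exact loss).

    lm_logits:    (batch, vocab) model scores for the next word
    cloze_counts: (batch, vocab) counts of human cloze completions per vocabulary word
    nwp_targets:  (batch,) corpus next-word indices for the standard NWP term
    alpha:        assumed interpolation weight between the two terms
    """
    # Empirical cloze distribution estimated from human completion counts.
    cloze_dist = cloze_counts / cloze_counts.sum(dim=-1, keepdim=True)

    log_probs = F.log_softmax(lm_logits, dim=-1)

    # Cross-entropy of the model against the human cloze distribution
    # (equivalent to KL divergence up to the entropy of the cloze distribution).
    distill_term = -(cloze_dist * log_probs).sum(dim=-1).mean()

    # Standard next-word-prediction term on the corpus continuation.
    nwp_term = F.nll_loss(log_probs, nwp_targets)

    return alpha * distill_term + (1.0 - alpha) * nwp_term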
