Applying the MIDAS Touch: How to Handle Missing Values in Large and Complex Data

Abstract

Principled methods for analyzing missing values, based chiefly on multiple imputation, have become increasingly popular among political scientists yet can struggle to handle the kinds of large and complex data that are also becoming common. We propose an accurate, fast, and scalable approach to multiple imputation, which we call MIDAS (Multiple Imputation with Denoising Autoencoders). MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are designed to reduce dimensionality by corrupting and attempting to reconstruct a subset of data. We repurpose denoising autoencoders for multiple imputation by treating missing values as an additional portion of corrupted data and drawing imputations from a model trained to minimize the reconstruction error on the originally observed portion. Systematic tests involving simulated as well as real political data illustrate MIDAS’s accuracy and efficiency across a range of settings. We provide open-source software for implementing MIDAS.

Publication
Under Review