Abstract
It is a daunting task to impute a large dataset that has majority of data missing from its tens of thousands of predictors, but still to preserve a known cluster structure in the imputed results. In this study, we propose a novel two-step approach for this task. First, we use simple linear classification models to derive the global and local features of cluster structures from a template ground truth (i.e. known gold data). Second, we integrate the cluster features extracted from each cluster into a neural network architecture and its training responses for guiding the imputation process. Since our neural network utilizes the global and local features of gold data in training the imputation network, we refer our neural network as GLIN (Global-Local Imputation Network). We test our imputation method on two high-dimensional datasets: a single cell dataset and a movie rating dataset, that have up to of 95% missing rates. Finally, we use four evaluation metrics: distance, correlation, data distribution, and predictability difference, to evaluate how well the cluster structures of the gold data are preserved in the imputed results.