
How to deal with missing data

C. Supakorn and V. Cave

 

Missing data is a common problem, even in well-designed and managed research studies, and can have a significant effect on the conclusions from the data. It results in a loss of precision and statistical power, has the potential to cause bias and often complicates the statistical analysis.
The following figure illustrates data from a study on the intelligence quotient (IQ) of students living in a city, town, or village. A random selection of students was recruited into the study. Unfortunately, IQ measurements could not be obtained for all students in the study, resulting in “missing data” (denoted by NA). To make valid inferences, we need to understand why this data is missing.

 

 

There are three main types of missing data:

  • Missing completely at random

Here missingness has nothing to do with the subject being studied, i.e. the missing data is a random subset of the data. For example, the student IQ measurements might be missing because the IQ test results were accidentally deleted by the researcher.

  • Missing at random

Here missingness can be predicted from other observed data on the subject. For example, if age was recorded, we might see that IQ measurements were missing for all students under 10 years of age.

  • Missing not at random

Here missingness is related to the unobserved data, so it cannot be predicted from anything that was recorded. For example, the student IQ measurements might be missing because of age, but age information was not recorded.
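To see why the reason for missingness matters, the following is a minimal simulation sketch rather than an analysis of the study above. Everything in it is a hypothetical assumption made for illustration: the age range, the link between age and score, the missingness rates, and the function names.

```python
import random
import statistics


def simulate_students(n=10000, seed=1):
    """Generate hypothetical (age, iq) records; all parameters are illustrative."""
    rng = random.Random(seed)
    students = []
    for _ in range(n):
        age = rng.randint(6, 16)
        # Assumption for illustration only: the measured score drifts upward with age.
        iq = rng.gauss(100 + 2 * (age - 11), 10)
        students.append((age, iq))
    return students


def delete_mcar(students, rate=0.3, seed=2):
    """MCAR: every score has the same chance of being lost, whatever the age or IQ."""
    rng = random.Random(seed)
    return [(age, None if rng.random() < rate else iq) for age, iq in students]


def delete_by_age(students, seed=3):
    """Age-driven missingness: students under 10 are far more likely to lack a score."""
    rng = random.Random(seed)
    records = []
    for age, iq in students:
        p_missing = 0.8 if age < 10 else 0.1
        records.append((age, None if rng.random() < p_missing else iq))
    return records


def observed_mean(records):
    """Mean of the scores that were actually observed (missing values are None)."""
    return statistics.fmean(iq for _, iq in records if iq is not None)


students = simulate_students()
true_mean = statistics.fmean(iq for _, iq in students)
mcar_bias = observed_mean(delete_mcar(students)) - true_mean
age_bias = observed_mean(delete_by_age(students)) - true_mean
print(f"bias of observed mean under MCAR: {mcar_bias:+.2f}")
print(f"bias of observed mean under age-driven missingness: {age_bias:+.2f}")
```

Under these assumptions the observed mean stays close to the true mean when data is missing completely at random, but drifts upward when younger students are more likely to be missing. If age was recorded (missing at random), the drift can be detected and modelled; if age was not recorded (missing not at random), the same bias is present but cannot be seen in the data.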

 

There are several ways to handle missing data in the analysis including: omitting variables which have many missing values, omitting individuals who do not have complete data, and imputing (i.e. estimating) values for the missing data[1]. Imputation methods include:

  1. Mean substitution: missing values are replaced with the mean value of the observed data.
  2. Regression imputation: missing values are replaced with values estimated from a regression analysis.
  3. Last observation carried forward: in longitudinal or time-series data, missing values are replaced with the last observed value from the same subject.
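The three imputation methods listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended implementation: the function names and the toy ages and scores are hypothetical, missing values are represented by None, and the regression step assumes a single predictor with a straight-line relationship.

```python
import statistics


def mean_substitution(values):
    """Replace each missing value with the mean of the observed values."""
    fill = statistics.fmean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def regression_imputation(x, y):
    """Replace missing y values with predictions from a least-squares line of y on x."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mean_x = statistics.fmean(xi for xi, _ in pairs)
    mean_y = statistics.fmean(yi for _, yi in pairs)
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]


def last_observation_carried_forward(series):
    """Replace each missing value with the most recent observed value, if any."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # stays None until a first value has been observed
    return filled


# Hypothetical toy data: ages and IQ scores, with two scores missing.
ages = [8, 9, 10, 11, 12]
iq = [96, None, 100, 102, None]

by_mean = mean_substitution(iq)
by_regression = regression_imputation(ages, iq)
by_locf = last_observation_carried_forward(iq)
print(by_mean)
print(by_regression)
print(by_locf)
```

Each method fills the same gaps with different values, so the choice affects the results. All three are single imputation methods: the filled-in values are then treated as if they had been observed, which tends to understate the uncertainty in the analysis.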

 

Many statistical methods, including maximum likelihood, expectation-maximization, and Bayesian models, can handle data with missing values. However, great care needs to be taken to ensure that biased inference does not result.

 

The best possible cure for missing data is to prevent the problem! Carefully planning and managing your study will help to minimize the amount of missing data. Statistical techniques for accommodating missing data should be the last resort.

 

[1] Altman, D.G. and Bland, J.M. 2007. Missing data. BMJ 334: 424.