TDSM 3.17

From The Data Science Design Manual Wikia

General Steps for treating Missing Data

  • Identify the patterns in, and the reasons for, the missing values.
  • Understand how the missing values are distributed: do they follow a particular pattern, or are they missing at random?
  • Decide on the most suitable method of analysis and treat the missing values accordingly.
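As a minimal sketch of the first two steps, the snippet below summarizes how much data is missing per column and which columns tend to be missing together. The records, column names, and helper functions are hypothetical, and None is assumed to mark a missing value.

```python
# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "city": "Pune"},
    {"age": None, "income": 61000, "city": "Delhi"},
    {"age": 29, "income": None, "city": None},
    {"age": 41, "income": None, "city": "Pune"},
]

def missing_fraction(rows):
    """Return the fraction of missing (None) entries for each column."""
    columns = list(rows[0].keys())
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}

def missing_patterns(rows):
    """Count how often each combination of missing columns occurs."""
    counts = {}
    for r in rows:
        pattern = tuple(sorted(c for c, v in r.items() if v is None))
        counts[pattern] = counts.get(pattern, 0) + 1
    return counts

fractions = missing_fraction(rows)   # share of missing entries per column
patterns = missing_patterns(rows)    # which columns go missing together
```

A column with a high missing fraction, or columns that are always missing together, hints that the values may not be missing at random, which affects the choice of treatment.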

Here are some techniques for treating missing data:

Deletion:

  • If the values are missing completely at random and enough data remains, we can simply delete the data points that have missing values.
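Deletion of incomplete records (often called listwise deletion) can be sketched as follows. The data and the helper name are hypothetical, and None is assumed to mark a missing value. Checking how much data survives is a simple way to judge whether "enough data" remains.

```python
# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 41, "income": 58000},
]

def drop_incomplete(rows):
    """Listwise deletion: keep only rows that have no missing (None) values."""
    return [r for r in rows if all(v is not None for v in r.values())]

complete = drop_incomplete(rows)
kept_fraction = len(complete) / len(rows)  # share of the data that survives
```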

Imputation:

  • Popular Averaging Techniques: The mean, median, and mode are the most popular summary statistics for inferring missing values and replacing them.
  • Predictive Techniques: Imputing missing values with a predictive model assumes that the values are not missing completely at random and that the variables chosen to predict them are related to the variable being imputed; otherwise the estimates can be imprecise. Many regression techniques can be used for this.
  • Random Value Imputation: One can also repeatedly impute randomly selected values to evaluate the impact of imputation.
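The three imputation approaches above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not a definitive implementation: the function names and data are hypothetical, None marks a missing value, the predictive example uses a simple one-variable least-squares line, and the random example draws replacements from the observed values.

```python
import random
import statistics

def impute_statistic(values, how="mean"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    stat = {"mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode}[how]
    fill = stat(observed)
    return [fill if v is None else v for v in values]

def impute_regression(x, y):
    """Fill missing y values from a least-squares line fitted on complete pairs."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    mean_x = statistics.mean([a for a, _ in pairs])
    mean_y = statistics.mean([b for _, b in pairs])
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in pairs)
             / sum((a - mean_x) ** 2 for a, _ in pairs))
    intercept = mean_y - slope * mean_x
    return [intercept + slope * a if b is None else b for a, b in zip(x, y)]

def impute_random(values, rng):
    """Replace each None with a value drawn at random from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

values = [2, 4, None, 4, 10]
by_mean = impute_statistic(values, "mean")
by_median = impute_statistic(values, "median")
by_mode = impute_statistic(values, "mode")

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, None, 8.0, 10.0]
by_regression = impute_regression(x, y)

# Repeat the random imputation and compare a summary statistic across draws;
# a wide spread suggests the analysis is sensitive to the imputed values.
rng = random.Random(0)
means_across_draws = [statistics.mean(impute_random(values, rng)) for _ in range(5)]
```

Note that filling with a single statistic shrinks the variance of the column, while the regression fill relies on the predictor actually being related to the missing variable.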

Below are some ways to handle missing values in a dataset:

  • We can guess what the missing values might be from other data points.
  • We can impute the missing values using outside knowledge; for example, if a day of the week is missing, it must be one of Monday through Sunday.
  • We can sometimes fill the missing values with the mean, median, or mode of the rest of the data.
  • We can interpolate or extrapolate from the other data points to estimate the missing values.
  • We can delete the records with missing values if the loss of data is acceptable.
  • We can fill the missing values with the value that occurs most often in the rest of the data (the mode).
  • We can fill them with random values so as to preserve the randomness of the data points.
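Of the options in this list, interpolation is the one that depends on the order of the data, so a short sketch may help. The function below is a hypothetical helper that assumes an evenly spaced, ordered series in which None marks a missing value. It fills interior gaps linearly and leaves leading or trailing gaps untouched, since those would require extrapolation.

```python
def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbours.

    Gaps before the first or after the last known value are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

series = [10.0, None, None, 16.0, None, 20.0, None]
result = interpolate_linear(series)
# result is [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, None]
```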