How do you deal with missing data when using linear regression modelling strategies? What is the best way to inform business users about how this might impact the validity and reliability of the regression model's findings?

Founder, CEO in Services (non-Government), 2 years ago

This can be tricky, but the best way to handle missing data depends on a few factors. For example:

1. How well do you know the space? Do you understand the general behaviour and shape of the context the data comes from, and is it consistent (always skewed one way, always normal, etc.)?

2. How many data points are missing? If only a few are missing, say less than 5% of the data set, I can pull a random sample large enough to meet a confidence level that is practical for the problem being solved (let us say 95%). I have done this once, where a randomized pull from the population gave me a sample with zero missing values. Then compare the sample against the whole data set: assess the shape and the key summary statistics. If they agree, this may be a viable option.

3. Depending on the shape of the data set, replacing missing values with the mean or median may not bias your outcomes enough to change the decision.

4. Decision sensitivity. Do you need directional or precise clarity?
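Points 2 and 3 above can be sketched in a few lines. This is only a minimal illustration, assuming NumPy is available and that missing values are stored as NaN; the `spend` column and the helper names `missing_fraction` and `impute` are made up for the example and are not part of any library.

```python
import numpy as np


def missing_fraction(values):
    """Return the share of entries that are NaN."""
    values = np.asarray(values, dtype=float)
    return float(np.isnan(values).mean())


def impute(values, strategy="median"):
    """Return a copy with NaNs replaced by the mean or median of the observed values."""
    values = np.asarray(values, dtype=float).copy()
    observed = values[~np.isnan(values)]
    if observed.size == 0:
        raise ValueError("no observed values to impute from")
    if strategy == "mean":
        fill = observed.mean()
    elif strategy == "median":
        fill = np.median(observed)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    values[np.isnan(values)] = fill
    return values


# Hypothetical right-skewed column with one missing entry (index 3).
spend = np.array([10.0, 12.0, 11.0, np.nan, 13.0, 200.0])

print("fraction missing:", missing_fraction(spend))
print("mean fill:", impute(spend, "mean")[3])
print("median fill:", impute(spend, "median")[3])
```

On this skewed toy column the mean fill is pulled up to 49.2 by the single large value, while the median fill is 12, which is why the shape of the data matters when choosing between them.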

President & Chief Data Officer in Services (non-Government), 2 years ago

I pretty much agree with Rajesh's comment.

It depends on how much data is missing, whether the data is missing at random or follows a systematic pattern (bias), the amount of variability in the data, and the presence of outliers. A systematic pattern in the missing data could be problematic. One thing that can help is to eliminate columns with missing data, especially if there is multicollinearity in your dataset (the column with missing data is strongly correlated with other columns in your dataset). I would recommend running the analysis with the missing data omitted (rows and/or columns) and again with imputation, then comparing the results.
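The "run it both ways and compare" suggestion could look roughly like the sketch below. It assumes NumPy, synthetic data, NaN-coded missing values, and simple column-mean imputation; the helper names `fit_ols` and `compare_missing_strategies` are hypothetical, and a real analysis might well use a different imputation method.

```python
import numpy as np


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [intercept, b1, b2, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def compare_missing_strategies(X, y):
    """Fit once on complete rows only and once after column-mean imputation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Strategy 1: drop every row that has at least one missing predictor.
    complete = ~np.isnan(X).any(axis=1)
    complete_case = fit_ols(X[complete], y[complete])

    # Strategy 2: replace each missing value with its column's observed mean.
    col_means = np.nanmean(X, axis=0)
    X_imputed = np.where(np.isnan(X), col_means, X)
    mean_imputed = fit_ols(X_imputed, y)

    return {
        "rows_dropped": int((~complete).sum()),
        "complete_case": complete_case,
        "mean_imputed": mean_imputed,
    }


# Synthetic example: y = 1 + 2*x1 - 3*x2 + noise, with about 10% of x2 removed at random.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=n)
mask = rng.random(n) < 0.1
X_missing = X.copy()
X_missing[mask, 1] = np.nan

result = compare_missing_strategies(X_missing, y)
print("rows dropped:", result["rows_dropped"])
print("complete-case coefficients:", result["complete_case"])
print("mean-imputed coefficients: ", result["mean_imputed"])
```

If the two sets of coefficients tell the same story, the missing data is probably not driving the conclusion; if they diverge, that gap is worth showing to business users directly.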

Associate Director, Data Science & Analytics in Travel and Hospitality, 2 years ago

There is no easy way out here, unfortunately. A standard linear regression fit cannot handle missing values, so you have to either impute them or drop every row that contains a missing value. Both approaches can bias any inference from the model.

You will have to make a judgment call after analyzing why the values are missing.

Are there any patterns in the missing values? Then it is better to impute.

Do only a few columns have missing values, and only a few values in each? Then you may just drop the rows with missing values.
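One rough way to look for a pattern is to check whether the rows with a missing value differ from the rest on some other, fully observed column. The sketch below assumes NumPy and NaN-coded missing values; the `age` and `income` columns and the helper `missingness_gap` are invented for illustration, and a gap near zero does not prove the values are missing at random.

```python
import numpy as np


def missingness_gap(column, other):
    """Standardized difference in `other` between rows where `column` is
    missing and rows where it is observed. Assumes `other` has no NaNs."""
    column = np.asarray(column, dtype=float)
    other = np.asarray(other, dtype=float)
    is_missing = np.isnan(column)
    if is_missing.all() or not is_missing.any():
        return 0.0  # nothing to compare
    diff = other[is_missing].mean() - other[~is_missing].mean()
    return float(diff / other.std(ddof=1))


# Hypothetical data: income is a simple function of age.
age = np.arange(20, 60, dtype=float)
income = 1000.0 + 50.0 * age

# Case A: income is missing only for the oldest respondents (a systematic pattern).
income_patterned = income.copy()
income_patterned[age >= 50] = np.nan

# Case B: every fourth income value is missing, unrelated to age.
income_scattered = income.copy()
income_scattered[::4] = np.nan

gap_patterned = missingness_gap(income_patterned, age)
gap_scattered = missingness_gap(income_scattered, age)
print("gap when missingness depends on age:", round(gap_patterned, 2))
print("gap when missingness is scattered:  ", round(gap_scattered, 2))
```

A large gap, as in case A, suggests that simply dropping rows would remove a particular kind of record and skew the fit.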

There are whole chapters written about handling missing values, but no conclusion that you can directly use.
