How do you deal with missing data when using linear regression modelling strategies? What is the best way to inform business users about how this might impact the validity and reliability of the regression model's findings?



Associate Director, Data Science & Analytics in Travel and Hospitality, 501 - 1,000 employees
There is no easy way out here, unfortunately. Standard linear regression cannot handle missing values, so you have to either impute them or drop every row that contains a missing value. Both approaches can bias any inference drawn from the model.

You will have to take a judgment call after analyzing why the values are missing.

Are there any patterns in the missing values? Then it is better to impute.

Do only a few columns have missing values, and only in a small share of rows? Then you may simply drop the rows with missing values.

Whole chapters have been written about handling missing values, but there is no single conclusion you can apply directly. A minimal sketch of both options follows.
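Here is one way the two options might look in Python, assuming a pandas DataFrame with a hypothetical target column price; the file name and columns are illustrative, and median imputation is just one reasonable choice:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Hypothetical data set: 'price' is the target, all other columns
# are numeric features that may contain missing values.
df = pd.read_csv("bookings.csv")
df = df.dropna(subset=["price"])          # the target must be observed
X, y = df.drop(columns="price"), df["price"]

# Option 1: drop every row with any missing feature (complete-case).
mask = X.notna().all(axis=1)
model_drop = LinearRegression().fit(X[mask], y[mask])

# Option 2: impute missing feature values with the column median.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
model_impute = LinearRegression().fit(X_imputed, y)
```

Comparing the coefficients of the two fitted models is a quick way to see how much the choice matters for your data.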
President, CEO, & CDAO in Services (non-Government), Self-employed
I pretty much agree with Rajesh's comment.

It depends on how much data is missing, whether the data is missing at random or follows a systematic pattern (bias), the amount of variability in the data, and the presence of outliers. A systematic pattern in the missing data is the problematic case. One thing that can be helpful is eliminating columns with missing data, especially if there is multicollinearity in your dataset (the column with missing data is strongly correlated with other columns). I would recommend running the analysis once with the missing data omitted (rows and/or columns) and again with imputation, then comparing the results.
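One way to run that side-by-side comparison, sketched under the assumption that X is a numeric feature DataFrame and y a fully observed target Series; mean imputation stands in for whatever method you actually use:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

def ols_coefs(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Fit OLS and return coefficients labelled by feature name."""
    model = LinearRegression().fit(X, y)
    return pd.Series(model.coef_, index=X.columns)

# Run 1: omit rows with missing values (complete-case analysis).
mask = X.notna().all(axis=1)
coefs_drop = ols_coefs(X[mask], y[mask])

# Run 2: impute missing features, here with the column mean.
X_imp = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X),
                     columns=X.columns, index=X.index)
coefs_imp = ols_coefs(X_imp, y)

# Coefficients that shift materially between runs are the ones whose
# conclusions depend on the missing-data treatment -- flag those to
# business users rather than reporting a single point estimate.
print(pd.DataFrame({"dropped": coefs_drop, "imputed": coefs_imp}))
```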
VP of Data in Banking, 10,001+ employees
This can be tricky; the optimal approach to handling missing data depends on a few factors. For example:

1. How much do you know about the space? Is the behaviour and shape of the data consistent, i.e., do you understand the general behaviour and distribution shape of the context the data comes from? Always skewed one way, always normal, etc.

2. How many data points are missing? If few are missing, say less than 5% of the data set, you can pull a random sample that meets a minimum confidence level practical for the problem being solved (say 95%). I have done this once, where a randomized pull gave me zero missing values from the population. Then run the sample against the whole data set and assess the shape and key summary statistics; if they agree, this may be a viable option (see the sketch after this list).

3. Depending on the shape of the data set, replacing missing values with the mean or median may not bias your outcomes enough to inform a different decision (also illustrated below).

4. Decision sensitivity: do you need directional or precise clarity?
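A rough illustration of points 2 and 3 in Python; the file and column names are hypothetical, and the sample fraction is only an example:

```python
import pandas as pd

df = pd.read_csv("transactions.csv")    # hypothetical data set
col = "amount"                          # hypothetical column with gaps

# Point 2: draw a random complete-case sample and check that its shape
# matches the full column before trusting it as a stand-in.
sample = df.dropna(subset=[col]).sample(frac=0.8, random_state=42)
print(df[col].describe())       # full column (describe() skips NaNs)
print(sample[col].describe())   # random complete-case sample

# Point 3: see how far mean vs. median replacement moves the summary
# statistics; small shifts mean the choice is unlikely to change the
# business decision.
for strategy in ("mean", "median"):
    filled = df[col].fillna(getattr(df[col], strategy)())
    print(strategy, round(filled.mean(), 2), round(filled.std(), 2))
```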
