I'm trying to have my data scientists focus on ways to spend less time cleaning data, but they always blame our business partners for poor data quality. Besides attacking data quality and blaming others, what are some good initiatives to evaluate that could empower my team to deliver faster / better insights?



Director of Operations in Services (non-Government), Self-employed
Cleaning data is part of a data scientist's role, and if a lot of time goes into it, the team should build a system that automates the cleanup process.

That applies if the messy data always arrives in the same or similar formats. If not, it might be valuable to get both sides into a meeting to air their grievances and hash out a solution that streamlines data cleanup.
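
To make the "automate the recurring cleanup" point concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the null placeholders, the date formats, and the `order_date` field are hypothetical and would need to match whatever your feeds actually contain.

```python
from datetime import datetime

# Hypothetical conventions for one recurring feed; adjust to the real data.
NULL_TOKENS = {"", "n/a", "na", "null", "-"}
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def clean_value(value):
    """Strip whitespace and map common null placeholders to None."""
    text = str(value).strip()
    return None if text.lower() in NULL_TOKENS else text


def parse_date(text):
    """Try each known format; return an ISO date string, or None if none match."""
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_row(row, date_fields=("order_date",)):
    """Normalize keys to snake_case and clean every value in one record."""
    cleaned = {}
    for key, value in row.items():
        name = key.strip().lower().replace(" ", "_")
        cleaned[name] = clean_value(value)
    for field in date_fields:
        if field in cleaned:
            cleaned[field] = parse_date(cleaned[field])
    return cleaned
```

Once the rules for a recurring format live in one function like `clean_row`, the same cleanup runs on every delivery instead of being redone by hand, and each new quirk from the source becomes a one-line rule change.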
Director of Data Architecture in Media, 5,001 - 10,000 employees
I would start with a culture change: from "data scientists vs. business partners" to "data scientists + business partners vs. data quality." Everyone is responsible for data quality.

This could be a lengthy post, so I will list the highlights as bullet points. These are technical initiatives that can help, depending on the organization's operating model (some orgs have MLE and DS as separate entities, some combine these roles, some have DS and Data Engineering, etc.):

-Make sure data is stored in the right storage objects (cost control, latency and discovery)
-Establish a Feature Store
-MLOps
-Terminate model pipeline jobs when unit tests fail, rather than handling DQ issues downstream
-CI/CD for model pipelines
-Data Contracts
-Model Observability
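
As a rough illustration of how a data contract and the "terminate on failure" idea fit together, here is a minimal sketch in plain Python. The contract fields (`customer_id`, `signup_date`, `churn_score`) and the `ContractViolation` exception are hypothetical names made up for this example, not part of any particular tool.

```python
# Hypothetical contract: field name -> (expected type, nullable).
CONTRACT = {
    "customer_id": (int, False),
    "signup_date": (str, False),
    "churn_score": (float, True),
}


class ContractViolation(Exception):
    """Raised to stop the pipeline before bad data reaches a model."""


def validate_record(record, contract=CONTRACT):
    """Return a list of human-readable violations for one record."""
    problems = []
    for field, (expected_type, nullable) in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if value is None:
            if not nullable:
                problems.append(f"null not allowed: {field}")
        elif not isinstance(value, expected_type):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    return problems


def enforce_contract(records, contract=CONTRACT):
    """Fail fast: raise if any record in the batch breaks the contract."""
    violations = []
    for index, record in enumerate(records):
        for problem in validate_record(record, contract):
            violations.append(f"row {index}: {problem}")
    if violations:
        raise ContractViolation("; ".join(violations))
    return records
```

Calling `enforce_contract` as the first step of a pipeline job means a broken batch stops the run with a message that names the offending rows, which is something both the data team and the business partner who owns the source can act on.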

1 Reply
CMO in Services (non-Government), 2 - 10 employees

Good suggestion. Thank you.

Head of Data Strategy in Software, 51 - 200 employees
If we agree that the most common dimensions of data quality are accuracy, timeliness, uniqueness, and completeness, then I think you will find that only a relatively small portion of your DS team's time goes to actual quality issues. What they are calling 'poor quality' is likely more a function of different structures, formats, standards, semantics, etc. These issues arise because the data in source systems is optimized for operational use cases, not analytical use cases, and this will never change.
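
One way to check how much of the pain is really a quality issue is to measure the dimensions directly. The sketch below is a minimal, assumption-laden example: records are plain dictionaries, the field names are whatever you pass in, and accuracy is left out because it needs a trusted reference to compare against.

```python
from datetime import datetime, timedelta


def completeness(records, field):
    """Share of records where the field is present and not None."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)


def uniqueness(records, key):
    """Ratio of distinct key values to total records (1.0 means no duplicates)."""
    if not records:
        return 0.0
    return len({r.get(key) for r in records}) / len(records)


def timeliness(records, field, now, max_age):
    """Share of records whose timestamp is within max_age of now."""
    if not records:
        return 0.0
    fresh = sum(1 for r in records if now - r[field] <= max_age)
    return fresh / len(records)


# Hypothetical sample batch, for illustration only.
sample = [
    {"id": 1, "email": "a@example.com", "updated": datetime(2024, 1, 9)},
    {"id": 2, "email": None, "updated": datetime(2024, 1, 9)},
    {"id": 2, "email": "b@example.com", "updated": datetime(2024, 1, 1)},
    {"id": 3, "email": "c@example.com", "updated": datetime(2023, 12, 1)},
]
report = {
    "completeness(email)": completeness(sample, "email"),
    "uniqueness(id)": uniqueness(sample, "id"),
    "timeliness(updated)": timeliness(
        sample, "updated", datetime(2024, 1, 10), timedelta(days=2)
    ),
}
```

If scores like these come back high while the team still spends most of its time wrangling, that supports the point that the effort is going into structure and semantics rather than quality in the strict sense.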

The best thing you can do is drive a culture change in your team to have them realize that business stakeholders are acting with positive intentions, and that source data exists as it does because conscious decisions were made to optimize that data for non-analytical uses.  

While wrangling data may be drudgery and the worst part of a data scientist's job, it will never be eliminated. Blaming the business, when the business is operating with the intention of maximizing profits, is counterproductive and disempowering.
Director of BI & Insights in Services (non-Government), 1,001 - 5,000 employees
Addressing data quality as soon as possible is crucial for any data science, analytics, or BI team, and doing so can drastically increase the team's value. They should spend their time creating and optimising models instead of cleaning.

There are several proactive initiatives that you can consider to empower your data science team to deliver faster and better insights:

-Automated Data Cleaning Tools: Invest in data cleaning and preprocessing tools that can automate routine tasks, such as missing value imputation, outlier detection, and standardization. 

-Data Quality Framework: Develop a framework that defines data quality metrics, processes, and, most importantly, responsibilities (who is responsible for what). This framework can help establish clear standards for data accuracy, completeness, and consistency, reducing the potential for poor data quality.

-Collaborative Data Governance: Set up cross-functional teams, including data engineers, data scientists, and business partners, that raise data issues as early as possible so that they can be resolved.

-Education and Training: Provide ongoing training to your team (Data + Business) in data quality best practices, advanced data cleaning techniques, and tools.

-Standardized Data Collection Processes: Work with business partners to establish standardized & automated data collection processes. This helps prevent inconsistent data entry and reduces the need for extensive data cleaning downstream.

-Feedback Loops: Establish regular feedback loops with business partners to collaboratively address data quality issues. Foster a culture of continuous improvement.

By combining these initiatives, you can create an environment where your data science team focuses on its core job, delivering faster and better insights, rather than repeating the same mundane data cleaning tasks.
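
For the routine tasks named in the first bullet (missing value imputation, outlier detection, standardization), here is a minimal sketch using only Python's `statistics` module. The function names and the z-score threshold are assumptions chosen for illustration; a dedicated tool or library would normally replace this in practice.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, threshold=3.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return [False] * len(values)
    return [abs(v - center) / spread > threshold for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return [0.0] * len(values)
    return [(v - center) / spread for v in values]
```

Note that a z-score rule is only one simple way to flag outliers and behaves poorly on very small samples, so the threshold should be treated as a tunable assumption rather than a standard.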
