I'm trying to have my data scientists focus on ways to spend less time cleaning data, but they always blame our business partners for poor data quality. Besides attacking data quality and blaming others, what are some good initiatives to evaluate that could empower my team to deliver faster / better insights?

351 views, 2 upvotes, 7 comments

Director of Data in Healthcare and Biotech, 2 years ago

Besides what has been mentioned already, I also believe it is key to build accountability for data quality across the organisation. Data quality is everyone’s problem, so everyone should be part of the solution. Some suggestions:
- Data literacy programs for newcomers and existing staff that cover not only analytics but also the data life cycle, the role of data processing in enabling analytics, and the principle of “rubbish in, rubbish out”.
- Build accountability among those who enter data into systems by making data quality a KPI/OKR/MBO across the business. This will require getting the executive team on board and keeping data quality a company-wide focus area for a few years until it becomes part of the culture.

Good luck!

Senior Data and Analytics Leader in Government, 2 years ago

In addition to what has been mentioned in other comments, establishing a centralized data catalog is key: it helps data scientists understand data sources, definitions, and transformations, which makes the data easier to work with. I would also explore opportunities to automate data integration. By implementing data integration tools and workflows, you can streamline the process of combining data from various sources. This reduces the manual effort required for data cleaning and lets your data scientists focus more on analysis and generating insights.

Director of BI & Insights in Services (non-Government), 2 years ago

Addressing data quality as soon as possible is crucial for any data science, analytics, or BI team, and doing so can drastically increase the team's value. The team should spend its time creating and optimising models instead of cleaning data.

There are several proactive initiatives that you can consider to empower your data science team to deliver faster and better insights:

- Automated Data Cleaning Tools: Invest in data cleaning and preprocessing tools that can automate routine tasks, such as missing value imputation, outlier detection, and standardization.

- Data Quality Framework: Develop a framework that defines data quality metrics, processes, and, most importantly, responsibilities (who is responsible for what). This framework can help establish clear standards for data accuracy, completeness, and consistency, reducing the potential for poor data quality.

- Collaborative Data Governance: Cross-functional teams, including data engineers, data scientists, and business partners, should raise data issues as early as possible so that they can be resolved quickly.

- Education and Training: Provide ongoing training to your team (Data + Business) in data quality best practices, advanced data cleaning techniques, and tools.

- Standardized Data Collection Processes: Work with business partners to establish standardized and automated data collection processes. This helps prevent inconsistent data entry and reduces the need for extensive data cleaning downstream.

- Feedback Loops: Establish regular feedback loops with business partners to collaboratively address data quality issues. Foster a culture of continuous improvement.

By combining these initiatives, you can create an environment where your data science team focuses on its core job, delivering faster and better insights, rather than repeating mundane data cleaning tasks.
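To make the "automated data cleaning" item above more concrete, here is a minimal sketch of the three routine tasks it names: missing value imputation, outlier detection, and standardization. It uses only the Python standard library. The function names, the median fill, and the Tukey IQR rule are illustrative assumptions, not a recommendation of a specific tool.

```python
from statistics import median, quantiles


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def standardize_label(text):
    """Collapse repeated whitespace, trim, and lowercase a free-text label."""
    return " ".join(text.split()).lower()


# Hypothetical example column with one gap and one suspicious value.
raw = [10, 12, None, 11, 13, 200]
clean = impute_missing(raw)
flags = flag_outliers(clean)
```

In practice a team would more likely wrap steps like these in whatever pipeline tooling it already uses; the point is that each step is small enough to automate once and reuse.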

Chief Data Officer in Software, 2 years ago

If we agree that the most common dimensions of data quality are accuracy, timeliness, uniqueness, and completeness, then I think you will find that a relatively small portion of your DS team's time goes to actual quality issues. What they are calling 'poor quality' is more likely a function of different structures, formats, standards, semantics, etc. These issues arise because data in source systems is optimized for operational use cases, not analytical use cases, and this will never change.

The best thing you can do is drive a culture change in your team to have them realize that business stakeholders are acting with positive intentions, and that source data exists as it does because conscious decisions were made to optimize that data for non-analytical uses.

While wrangling data may be drudgery and the worst part of a data scientist's job, it will never be eliminated. Blaming the business, when the business is operating with the intention of maximizing profits, is counterproductive and disempowering.
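One way to check how much of the team's pain is a true quality issue, in the sense of the dimensions listed above, is to measure those dimensions directly. The sketch below is a rough illustration using only the Python standard library. The function names, the record layout, and the 30-day freshness window are assumptions for the example. Accuracy is left out because it needs a trusted reference source to compare against.

```python
from datetime import datetime, timedelta


def completeness(rows, field):
    """Share of rows where `field` is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows, key):
    """Ratio of distinct `key` values to total rows (1.0 means no duplicates)."""
    keys = [r[key] for r in rows]
    return len(set(keys)) / len(keys)


def timeliness(rows, field, now, max_age):
    """Share of rows whose `field` timestamp is within `max_age` of `now`."""
    fresh = sum(1 for r in rows if now - r[field] <= max_age)
    return fresh / len(rows)


# Hypothetical customer records, for illustration only.
rows = [
    {"id": 1, "email": "a@example.com", "updated": datetime(2024, 1, 10)},
    {"id": 2, "email": "", "updated": datetime(2024, 1, 9)},
    {"id": 2, "email": "c@example.com", "updated": datetime(2023, 6, 1)},
    {"id": 4, "email": None, "updated": datetime(2024, 1, 8)},
]
now = datetime(2024, 1, 11)

report = {
    "email_completeness": completeness(rows, "email"),
    "id_uniqueness": uniqueness(rows, "id"),
    "timeliness_30d": timeliness(rows, "updated", now, timedelta(days=30)),
}
```

If scores like these come back high while the team still spends most of its time wrangling, that supports the view that the work is about structure and semantics rather than quality in the strict sense.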

Director of Data Architecture in Media, 2 years ago

I would start with a culture change, from "Data scientists vs Business partners" ==> "Data scientists + Business partners vs Data quality". Everyone is responsible for data quality.

This could be a lengthy post, so I will list only the highlights as bullet points: technical initiatives that can help. Which ones apply depends on the organization's operating model (some orgs have MLE and DS as separate entities, some combine these roles, some have DS and Data Engineering, etc.).

- Make sure data is stored in the right storage objects (cost control, latency, and discovery)
- Establishing a Feature Store
- MLOps
- Terminating model pipeline jobs when unit tests fail, rather than handling DQ issues downstream
- CI/CD for model pipelines
- Data Contracts
- Model Observability
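
Two of the items above, data contracts and terminating pipeline jobs on a failed check instead of handling DQ downstream, can be sketched together. The following is a minimal illustration, not a reference to any particular data contract tool. `ORDERS_CONTRACT`, `ContractViolation`, `validate_batch`, and `run_pipeline` are hypothetical names, and the contract format (field name mapped to expected type and nullability) is an assumption made for the example.

```python
class ContractViolation(Exception):
    """Raised when a batch does not satisfy the agreed data contract."""


# Hypothetical contract: field name -> (expected type, nullable?)
ORDERS_CONTRACT = {
    "order_id": (int, False),
    "amount": (float, False),
    "coupon": (str, True),
}


def validate_batch(rows, contract):
    """Check every row against the contract; raise on the first violation."""
    for i, row in enumerate(rows):
        missing = contract.keys() - row.keys()
        if missing:
            raise ContractViolation(f"row {i}: missing fields {sorted(missing)}")
        for field, (expected_type, nullable) in contract.items():
            value = row[field]
            if value is None:
                if not nullable:
                    raise ContractViolation(f"row {i}: {field} must not be null")
                continue
            if not isinstance(value, expected_type):
                raise ContractViolation(
                    f"row {i}: {field} should be {expected_type.__name__}"
                )


def run_pipeline(rows):
    """Fail fast: validate first, so bad data never reaches the model step."""
    validate_batch(rows, ORDERS_CONTRACT)
    # Stand-in for feature building or model scoring.
    return sum(r["amount"] for r in rows)
```

Raising an exception stops the job at the boundary where the producer's data enters the pipeline, which makes the failure visible to both teams instead of being patched silently downstream.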

1 Reply

no title, 2 years ago

Good suggestion, @Zaki Elt. Thank you.