Any advice on how to implement data lineage automation? Have you found areas where introducing automation is easiest or most useful?
I will answer the second question first. Data lineage automation is certainly useful: collecting lineage data manually is time-consuming and error-prone. Data lineage will fulfill several data governance requirements, such as documenting data security rules, who used the data, version tracking, etc. Another advantage is the ability to collect more lineage data, and more quickly, than with a manual process. This will enhance reporting, analysis, and planning.
I don't know what you mean by easiest, but I believe that depends on the automation tool used, and there is a variety of tools available for data lineage automation. I googled "Lineage automation tools review" and got several helpful articles. Avoid sponsored links :)
There are open source and commercial tools for lineage gathering and management. To implement one in your environment, you can either have the vendor help you or work with your technical team.
It will depend on the technology you are using. Collibra has automated lineage tools that integrate with our AWS data lake.
To begin the initiative, I focused on the most critical data elements for our enterprise first. They were financial and regulatory metrics. That kicked off the first initiative. Then I spoke with the analytics teams to understand their perspective on which data flows were most important for their executive dashboards and BI. That got us started, and then it was just working through the implementation list and adjusting priority as needed. As Philip noted, I have also had success with Collibra when working with AWS.
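If it helps to make that ordering concrete, here is a rough sketch of how such an implementation list could be ranked. The flow names, the `DataFlow` fields, and the sort rule are all made up for illustration; this is not tied to Collibra or any other tool, just one way to encode "regulatory first, then whatever analytics asks for most".

```python
from dataclasses import dataclass


@dataclass
class DataFlow:
    """A candidate data flow for lineage automation (hypothetical fields)."""
    name: str
    regulatory: bool = False     # feeds financial or regulatory metrics
    dashboard_requests: int = 0  # how many analytics teams flagged this flow


def prioritize(flows):
    """Order flows: regulatory first, then by analytics demand, then by name."""
    return sorted(
        flows,
        key=lambda f: (not f.regulatory, -f.dashboard_requests, f.name),
    )


# Illustrative backlog; the names are assumptions, not real systems.
backlog = prioritize([
    DataFlow("marketing_clickstream", dashboard_requests=1),
    DataFlow("general_ledger", regulatory=True),
    DataFlow("sales_pipeline", dashboard_requests=3),
    DataFlow("capital_ratio_report", regulatory=True, dashboard_requests=2),
])
ordered_names = [flow.name for flow in backlog]
```

Re-running `prioritize` whenever a flag or request count changes is a simple way to adjust priority as you work through the list.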