Any best practices this group can offer on data & analytics organization structures in 2024 and beyond? Is a more centralized model still popular? How can D&A orgs maintain control while supporting and growing all the data, especially when it sits behind GenAI use cases?
I believe the Data Catalog has to be central even in large organizations. It can and should be fed by multiple data sources, including sub-catalogs and decentralized data sources/lakes, i.e. there does not need to be one central lake. Ideally there are standards for the type of metadata captured from each of the downstream sources and datasets. Then, looking above the catalog in the hierarchical data architecture, the AI and other analytics platforms can be decentralized as long as their data access paths and pipelines are standardized to go through the central data catalog.
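To make that concrete, here is a minimal sketch of what a standardized metadata record and a sub-catalog feed could look like. All field and function names are illustrative, not from any particular catalog product:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CatalogEntry:
    """Standardized metadata captured for every dataset, regardless of which
    sub-catalog or decentralized lake it originally comes from."""
    dataset_id: str            # globally unique across all source systems
    source_system: str         # e.g. a regional lake or a domain sub-catalog
    owner: str                 # accountable data owner
    classification: str        # e.g. "public", "internal", "restricted"
    refresh_schedule: str      # how often the upstream feed updates
    quality_score: Optional[float] = None  # populated by profiling jobs
    tags: list[str] = field(default_factory=list)

def ingest_from_subcatalog(raw_records: list[dict]) -> list[CatalogEntry]:
    """Map whatever shape a downstream sub-catalog exposes onto the central
    standard, dropping records that miss the mandatory fields."""
    entries = []
    for rec in raw_records:
        if not all(k in rec for k in ("dataset_id", "source_system", "owner")):
            continue  # incomplete metadata never reaches the central catalog
        entries.append(CatalogEntry(
            dataset_id=rec["dataset_id"],
            source_system=rec["source_system"],
            owner=rec["owner"],
            classification=rec.get("classification", "internal"),
            refresh_schedule=rec.get("refresh_schedule", "unknown"),
            quality_score=rec.get("quality_score"),
            tags=rec.get("tags", []),
        ))
    return entries
```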
The best approach to building all of this is putting data discovery and mapping into the data catalog as a first priority. As you build connections to data sources, they can be APIs (routed through API management tools or called directly) or connections to data repositories with scheduled updates. The important thing is the metadata showing the quality of the data and the catalog informing how to access it. Then, at the analytics level, it is best to have a portal with links to defined aggregated or singular meaningful datasets.
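As a rough sketch of "the catalog informing how to access", the lookup below only hands out access instructions when a dataset clears a quality bar. The registry, endpoints and threshold are all assumed, not a real product API:

```python
# Illustrative access registry keyed by catalog dataset_id.
ACCESS_REGISTRY = {
    "sales_daily_agg": {"method": "api", "endpoint": "https://api.example.internal/sales"},
    "hr_headcount":    {"method": "direct", "table": "warehouse.hr.headcount"},
}

def resolve_access(dataset_id: str, quality_score: float | None,
                   min_quality: float = 0.8) -> dict:
    """Return access instructions only if the dataset meets the quality bar."""
    if quality_score is None or quality_score < min_quality:
        raise ValueError(f"{dataset_id} does not meet the quality threshold")
    if dataset_id not in ACCESS_REGISTRY:
        raise KeyError(f"{dataset_id} is not registered in the catalog")
    return ACCESS_REGISTRY[dataset_id]
```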
Multiple options for analytics can be presented to the users, including AI LLMs as alternatives for consuming the same data traditionally consumed by Power BI, Tableau or Spark. The user can see the data in visual blocks and be presented with tools as links. There should be a process for adding new meaningful data aggregations (which will all use the catalog) so users can consume them with analytics/AI tools. Again, it is fine to have multiple of these analytics platforms, but ideally they will all be set up to pull data from the same data catalog.
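One way to picture the portal layer: whichever front end the user picks, it is handed the same catalog-resolved dataset rather than an ad-hoc copy. The launcher URLs and tool names below are purely hypothetical:

```python
# Hypothetical launchers for the consumption options offered on the portal.
TOOL_LAUNCHERS = {
    "powerbi":  lambda ds: f"https://bi.example.internal/open?dataset={ds}",
    "tableau":  lambda ds: f"https://tableau.example.internal/views/{ds}",
    "llm_chat": lambda ds: f"https://ai.example.internal/chat?grounding={ds}",
}

def portal_links(dataset_id: str, approved_tools: list[str]) -> dict[str, str]:
    """Build the links shown on the analytics portal for one data block."""
    return {tool: TOOL_LAUNCHERS[tool](dataset_id)
            for tool in approved_tools if tool in TOOL_LAUNCHERS}

# Example: one aggregated dataset offered through two consumption paths.
print(portal_links("sales_daily_agg", ["powerbi", "llm_chat"]))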
If you limit users to the analytics platforms described above and their links to LLM models, the user has already chosen the dataset and your design has controlled the data. Potentially you could then remove the paperclip/attach feature in that link and restrict the user to training only on the data in that block, forcing them to make a request if they want a new block with additional or different data, at which point you can apply review and approval processes.
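A minimal sketch of that control point, assuming a simple list of approved blocks and a review queue (names and flow are illustrative only):

```python
# Illustrative: the LLM link only accepts datasets already approved as blocks;
# anything else becomes a review request instead of an ad-hoc upload.
APPROVED_BLOCKS = {"sales_daily_agg", "hr_headcount"}
PENDING_REQUESTS: list[dict] = []

def ground_llm_session(dataset_id: str, requested_by: str) -> str:
    if dataset_id in APPROVED_BLOCKS:
        return f"session grounded on '{dataset_id}' (attachments disabled)"
    # No paperclip: the only path to new data is a request someone reviews.
    PENDING_REQUESTS.append({"dataset": dataset_id, "requested_by": requested_by})
    return f"'{dataset_id}' is not an approved block; request queued for review"

print(ground_llm_session("sales_daily_agg", "a.user"))
print(ground_llm_session("marketing_raw_clicks", "a.user"))
```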
It *depends*: small to medium companies have no option but to centralise. How many data architects and engineers can you afford? Large organisations have no option but to decentralise and federate according to the principles of data mesh.
The important part in both cases is to democratise access to the data so you can scale analytics across the business through data products. Here at Lonza we centralise and then scale the analytics in the business to maximise the value...
It highly depends on the state of D&A adoption in the organization. If data literacy is high, and data is generally trustworthy, a decentralized (hub and spoke) organizational setup will work well. You’ll want to maintain a central team (we call it a platform team) to handle generic items like compliance frameworks, artifact repositories, technical standards, and platform lifecycle management. Data Governance will always be a centralized function, but the operating model could be federated if ownership is taken by the various clusters.
Both GenAI and traditional AI require a strong data foundation. Small use cases like copilots can be handled in isolation, but more complex scenarios will require central steering. The dynamics of AI Governance are different from Data Governance: merging them into one entity seems to be a step too far. AI needs to become a commodity first.
We are working on maturing our AI operating model towards the same state as our general Data operating model. This means replicating best practices like FinOps and asset tagging, as they help us understand and steer the cost of both the data and the AI algorithms running against them.
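As a rough illustration of the asset-tagging idea, data assets and the AI workloads that consume them can share the same cost-allocation tags so spend rolls up per use case. The asset names, tags and billing export below are hypothetical:

```python
from collections import defaultdict

# Hypothetical shared tags for a dataset and the GenAI workload built on it.
ASSET_TAGS = {
    "warehouse.sales_daily_agg":    {"cost_center": "commercial", "use_case": "demand_forecast"},
    "genai.demand_forecast_copilot": {"cost_center": "commercial", "use_case": "demand_forecast"},
}

def cost_by_use_case(cost_export: list[dict]) -> dict[str, float]:
    """Roll up a (hypothetical) billing export by the shared use_case tag."""
    totals: dict[str, float] = defaultdict(float)
    for line in cost_export:
        tags = ASSET_TAGS.get(line["asset"], {})
        totals[tags.get("use_case", "untagged")] += line["cost_usd"]
    return dict(totals)

# Example: storage/compute for the dataset plus inference for the copilot.
export = [
    {"asset": "warehouse.sales_daily_agg", "cost_usd": 1200.0},
    {"asset": "genai.demand_forecast_copilot", "cost_usd": 800.0},
]
print(cost_by_use_case(export))  # {'demand_forecast': 2000.0}
```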