The growing hype surrounding data lakes is causing substantial confusion in the information management space. Data lakes focus on storing data from disparate sources, but ignore how or why data is used.
In his session on navigating the data lake at the Gartner Business Intelligence & Analytics Summit, Nick Heudecker, research director at Gartner, said information leaders must understand the gaps in the data lake concept and take necessary precautions.
“The data lake concept promises a centralized pool of disparate data sources in one location, and treats alignment as a technical exercise,” said Mr. Heudecker. “The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it’s available for analysis by everyone in the organization.”
Despite the benefits that data lakes provide, the risks are substantial.
The most important risk is the inability to determine data quality, or to trace the lineage of findings made by other analysts or users who have previously found value in the same data in the lake. By its definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of the data means analysts start from scratch.
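To make the metadata point concrete, the following is a minimal sketch of what "descriptive metadata and a mechanism to maintain it" could look like. It is an illustration under stated assumptions, not a method described in the session: the DatasetEntry fields, the Catalog class and the example paths are all hypothetical names invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Descriptive metadata for one dataset in the lake (hypothetical schema)."""
    path: str                      # where the data sits in the lake
    source: str                    # originating system
    owner: str                     # who answers questions about it
    derived_from: list[str] = field(default_factory=list)  # parent dataset paths
    notes: list[str] = field(default_factory=list)         # findings left by earlier analysts


class Catalog:
    """In-memory registry; a real deployment would persist this somewhere."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        # Reject data whose declared parents are unknown, so lineage stays traceable.
        missing = [p for p in entry.derived_from if p not in self._entries]
        if missing:
            raise ValueError(f"unknown parent datasets: {missing}")
        self._entries[entry.path] = entry

    def annotate(self, path: str, note: str) -> None:
        # Record a finding so the next analyst does not start from scratch.
        self._entries[path].notes.append(note)

    def lineage(self, path: str) -> list[str]:
        """Return every upstream dataset path, nearest parents first."""
        seen: list[str] = []
        queue = list(self._entries[path].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent not in seen:
                seen.append(parent)
                queue.extend(self._entries[parent].derived_from)
        return seen


# Example with made-up dataset paths.
catalog = Catalog()
catalog.register(DatasetEntry("raw/orders.csv", source="erp", owner="sales-ops"))
catalog.register(DatasetEntry("clean/orders.parquet", source="etl", owner="data-eng",
                              derived_from=["raw/orders.csv"]))
catalog.register(DatasetEntry("reports/q1_revenue.parquet", source="analytics",
                              owner="finance", derived_from=["clean/orders.parquet"]))
catalog.annotate("raw/orders.csv", "order_date is local time, not UTC")
print(catalog.lineage("reports/q1_revenue.parquet"))
```

The design choice worth noting is that registration fails when a parent dataset is unknown. That is one simple way to add the oversight that an accept-anything lake lacks, at the cost of some friction at ingestion time.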
Another risk is security and access control. Data can be placed into the data lake with no oversight of the contents. Many data lakes hold data whose privacy and regulatory requirements are likely to create risk exposure. The security capabilities of central data lake technologies are still emerging. These issues will not be addressed if left to non-IT personnel.
Finally, performance aspects should not be overlooked. Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimized and purpose-built infrastructure.
“There is always value to be found in data, but the question that has to be addressed is this — do we allow or even encourage one-off, independent analysis of information in silos or a data lake, bringing said data together, or do we formalize to a degree that effort, and try to sustain the value-generating skills we develop?” said Mr. Heudecker. “If the option is the former, it is quite likely that a data lake will appeal. If the decision tends toward the latter, it is beneficial to move beyond a data lake concept quite quickly in order to develop a more robust logical data warehouse strategy.”
All recorded sessions from the Gartner Business Intelligence & Analytics Summit can be viewed at Gartner Events on Demand at www.gartnereventsondemand.com/event/bi13.