The growing hype surrounding data lakes is causing substantial confusion in the information management space. Data lakes focus on storing data from disparate sources, but ignore how or why data is used. In his session on navigating the data lake at the Gartner Business Intelligence & Analytics Summit, Nick Heudecker, research director at Gartner, said information leaders must understand the gaps in the data lake concept and take necessary precautions.
“The data lake concept promises a centralized pool of disparate data sources in one location, and treats alignment as a technical exercise,” said Mr. Heudecker. “The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it's available for analysis by everyone in the organization.”
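The ingestion model Mr. Heudecker describes is often called "schema-on-read": data lands in the lake byte-for-byte as it arrived, and structure is applied only when someone queries it. A minimal sketch of that idea, using a local directory to stand in for the lake and hypothetical helper names (`land_raw`, `read_as_records`) not drawn from any specific product:

```python
import csv
import io
import tempfile
from pathlib import Path

def land_raw(lake_dir: Path, source_name: str, payload: bytes) -> Path:
    """Copy the source bytes into the lake with no upfront transformation."""
    dest = lake_dir / "raw" / source_name
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)
    return dest

def read_as_records(path: Path) -> list:
    """Schema-on-read: parse the raw file only at analysis time."""
    return list(csv.DictReader(io.StringIO(path.read_text())))

lake = Path(tempfile.mkdtemp())
raw = land_raw(lake, "orders.csv", b"id,amount\n1,9.99\n2,4.50\n")
records = read_as_records(raw)
print(records[0]["amount"])  # → 9.99
```

The upfront saving is real: `land_raw` does no parsing, validation, or conversion. The cost, as the next paragraph notes, is deferred to every analyst who must rediscover the file's structure at read time.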
Even with the multiple benefits that data lakes provide, there are substantial risks. The most important is the inability to determine the quality of the data, or to trace the lineage of findings by other analysts who have previously found value in the same data in the lake. By definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of the data means analysts start from scratch.
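The metadata mechanism this paragraph calls for can be as simple as a catalog that records, for every dataset landed in the lake, where it came from and what it was derived from. The sketch below is an illustrative toy, not any vendor's catalog API; the entry fields and the `register`/`lineage` helpers are assumptions chosen to show how lineage questions become answerable once such a record exists:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def register(catalog: Path, dataset: str, source: str,
             description: str, derived_from=None) -> None:
    """Record descriptive metadata for a dataset landed in the lake."""
    entries = json.loads(catalog.read_text()) if catalog.exists() else {}
    entries[dataset] = {
        "source": source,
        "description": description,
        "derived_from": derived_from or [],  # simple lineage pointers
        "landed_at": datetime.now(timezone.utc).isoformat(),
    }
    catalog.write_text(json.dumps(entries, indent=2))

def lineage(catalog: Path, dataset: str) -> list:
    """Walk derived_from pointers back toward the original sources."""
    entries = json.loads(catalog.read_text())
    chain = []
    for parent in entries[dataset]["derived_from"]:
        chain.append(parent)
        chain.extend(lineage(catalog, parent))
    return chain

cat = Path(tempfile.mkdtemp()) / "catalog.json"
register(cat, "raw/orders", "crm_export", "Nightly CRM order dump")
register(cat, "curated/orders_clean", "pipeline", "Deduplicated orders",
         derived_from=["raw/orders"])
print(lineage(cat, "curated/orders_clean"))  # → ['raw/orders']
```

With entries like these, an analyst can check a dataset's source and derivation before trusting it; without them, each use of the lake starts from zero, which is precisely the data-swamp failure mode described above.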
Another risk is security and access control. Data can be placed into a data lake without any oversight of its contents, and many data lakes already hold data whose privacy and regulatory requirements represent likely risk exposure. The security capabilities of central data lake technologies are still emerging. These issues will not be addressed if governance is left to non-IT personnel.