'Big Data' and Content Will Challenge IT Across the Board
The impact of "big data" is extremely broad, for both the business and information management and utilization. We discuss a diverse set of analytic impacts which affect some of the most sensitive IT initiatives in your organization.
"Big data" forces organizations to address the variety of information assets and how fast these new asset types are changing information management demands. It is important to understand what the impacts of big data are on the existing information management approach, in order to develop a plan to respond to these issues.
- IT and business professionals integrating "big data" structured assets with content must increase their business requirement identification skills.
- IT support teams will be tasked with supporting end-user-deployed big data solutions, and allocation of funding for support will be contentious.
- Enterprise data warehouses will undergo major revisions to address big data, or face being decommissioned.
- Business analysts using context-aware algorithmic analysis of big data must address the fidelity and contract aspects of extreme information management, or false analysis output could actually drive customers away.
- CIOs and IT leaders should investigate the opportunities for training their existing staff regarding the challenges of big data tools and solution architectures.
- In your IT planning, allocate budget and staff for taking over end-user clusters deployed for big data analytics.
- Consider providing access to an existing MapReduce cluster, and do not focus only on consolidated repositories.
- Align technology which analyzes big data assets with at least one pilot business personalization initiative.
Enterprise architects, information managers, and data management and integration leaders often delve into the challenge of big data and find that the volume of data represents only one aspect of the problem. Clients and vendors increasingly encounter a phenomenon they call "big data," but the term is sometimes misleading because the challenge has many dimensions beyond the volume of data under management. Gartner has identified 12 dimensions in three categories: quantification, access enablement and control, and information qualification and assurance (see "'Big Data' Is Only the Beginning of Extreme Information Management"). These dimensions interact with each other to exacerbate the challenges of next-generation information management. IT leaders must recognize all these challenges, design information architectures and management strategies to address them, and then deploy new technologies and practices to manage data extremes — because traditional methods will fail. Failure to plan for all of the extreme dimensions in systems deployed over the next three years will force a massive redesign for more expansive capabilities within two or three years. However, processing matters, too: a complex statistical model can make a 300GB database "seem" bigger than a 110TB database, even if both are running on multicore, distributed parallel processing platforms.
In 2012, big data has reached an inflection point. Gartner inquiries reflect the increasing incidence of big data as part of the issue: in 2011, more than 2,000 end-user inquiries included some aspect of the topic. A variety of vendors now offer tools for implementing MapReduce as one solution:
- The Apache Hadoop open-source project offers a variety of tools, which can be self-deployed or implemented via a managed distribution.
- Major vendors such as IBM and Microsoft are developing, or offering their own products for, certain components of a MapReduce implementation.
- Some traditional data aggregators and analytics vendors also offer big data solutions, although not necessarily MapReduce; for example, LexisNexis.
- Smaller vendors such as Cloudera offer a combination of managed Hadoop distributions coupled with professional services for implementation.
- Some large vendors are partnering to support MapReduce technology (for example, Oracle's offering of Cloudera as part of its Oracle Big Data Appliance).
However, MapReduce is a technology approach, not a product, and it is not equal to big data, which is some combination of volume, variety and velocity issues. Nor is big data equal to the Hadoop solution approach. Graph analysis is one part of big data analytics, and big data issues also abound in text, document and media analysis. Certain infrastructure as a service (IaaS) vendors also offer big data processing and analysis solutions. Finally, NoSQL solutions, including key-value, graph, document and column-style data stores, are appearing in a growing number of analytic use cases.
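To make the MapReduce pattern concrete, the following is a minimal, illustrative word-count job written for Hadoop Streaming, which lets any executable serve as the map and reduce steps. The file names are ours for illustration, not from any particular vendor distribution.

```python
#!/usr/bin/env python
# mapper.py: emit one "word<TAB>1" record per token. Hadoop's shuffle phase
# groups and sorts these records so that all counts for a given word arrive
# at the same reducer.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word.lower(), 1))
```

```python
#!/usr/bin/env python
# reducer.py: input arrives sorted by key, so the counts for each word are
# contiguous and can be summed in a single streaming pass.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))
```

The pair can be smoke-tested locally with `cat input.txt | python mapper.py | sort | python reducer.py` before being submitted to a cluster through the hadoop-streaming JAR's -mapper and -reducer options.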
Impact: IT and business professionals integrating "big data" structured assets with content must increase their business requirement identification skills.
The broader context of big data challenges existing practices of selecting which data to integrate, with the proposition that all information can be integrated and that technology should be developed to support this. As a new issue driving requirements (one that demands a new approach), the breaching of traditional boundaries will occur extremely fast: the many sources of new information assets are increasing geometrically (for example, desktops became notebooks and now tablets; portable data is everywhere and in multiple context formats), and this is causing exponential increases in data volumes. Additionally, the information assets span the entire spectrum of information content, from fully undetermined structure ("content") to fully documented and traditionally accessed structures ("structured"). As a result, organizations will seek to address the full spectrum of extreme information management issues, and will use this as differentiation from their competitors to become leaders in their markets in the next two to five years. Big data is, therefore, a current issue (focused on combinations of volume, velocity, variety and complexity of data) that highlights a much larger extreme information management topic demanding almost immediate solutions. Gartner estimates that organizations that introduce the full spectrum of extreme information management issues into their information management strategies by 2015 will begin to outperform their unprepared competitors within their industry sectors by 20% in every available financial metric.
- Identify a large volume of datasets and content assets that can form a pilot implementation for distributed processing, such as MapReduce. Enterprises already using portals as a business delivery channel should leverage the opportunity to combine geospatial, demographic, economic and engagement preferences data in analyzing their operations, and/or to leverage this type of data in developing new evaluation models (a minimal sketch of such a combination follows this list).
- Supply chain situations, which include location tracking through route and time and can be combined with business process tracking, are a particularly good starting point. The life sciences industry will also be able to leverage big data: for example, large content volumes in clinical trials, or genomic research and environmental analysis as contributing factors to health conditions.
- CIOs and IT leaders should utilize opportunities to train their existing staff in the challenges of extreme information management. Staff will then be able to deliver big data solutions directly, or supervise their delivery.
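As a hedged illustration of the first recommendation above, the sketch below joins hypothetical demographic and engagement-preference records on a shared customer key and profiles engagement by region and channel. It uses pandas for brevity; every column name and value is invented.

```python
import pandas as pd

# Hypothetical demographic attributes keyed by customer.
demographics = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "region":      ["NE", "NE", "SW", "SW"],
    "income_band": ["high", "mid", "mid", "low"],
})

# Hypothetical engagement-preference records for the same customers.
engagement = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "channel":     ["portal", "branch", "portal", "mobile"],
    "visits":      [14, 2, 9, 21],
})

# Combine the two asset types, then profile engagement by region and channel.
combined = demographics.merge(engagement, on="customer_id")
print(combined.groupby(["region", "channel"])["visits"].mean())
```

At pilot scale an in-memory join is enough; the same combination logic is what a MapReduce or warehouse-based implementation would distribute.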
Impact: IT support teams will be tasked with supporting end-user-deployed big data solutions, and allocation of funding for support will be contentious.
End users have deployed MapReduce clusters using their departmental budgets and discretionary funds over the past two to four years. As the analytics output from these deployments demonstrates their valuable decision-support capabilities, business executives will want to leverage both the infrastructure and the analysis processes, and IT will be asked to develop a strategy for supporting these expanded use cases. However, these are custom deployments that are not tools-based. Additionally, business unit budgets are not accessible to IT (either generally or to allocate funds to maintain and support these deployments); these funds are allocated on a project basis and considered one-time investments. At the same time, attempts to leverage these personal and departmental clusters will be met with resistance, because the users who deployed these systems have neither the time to support leveraging them in an enterprise manner nor the budget to introduce enterprise-class infrastructure. Control of the systems will also be contentious, because the end users have grown accustomed to using these clusters as dedicated, personal systems. IT will once again have the task of identifying tools, versions and distribution standards, publishing those standards, and then encouraging business-managed deployments to follow the standards, even if they appear to be rogue projects.
- Allocate budget and staff in your IT planning for taking over end-user-deployed MapReduce clusters. Budgeted amounts should include travel and training for IT specialists to learn how the software tools and hardware infrastructure operate. If your organization requires funding reallocation, plan on transferring these funds out of the IT budget and into business unit accounting categories.
- IT should only plan to assume control and support of these deployments where three or more business units are already leveraging the cluster(s). IT should avoid taking over the management of any deployment used by only one or two business units; the "owning" units should remain responsible for both staffing and budgets. If the business units refuse to continue funding and staffing such clusters, the clusters should be decommissioned on the basis that "you have to pay to play."
- Identify which software is being used and develop a standardized approach for what IT will support. This can be a managed distribution, a vendor product, or even a contract with a professional services provider, in which case IT assumes only the management of the relationship.
Impact: Enterprise data warehouses will undergo major revisions to address big data, or face being decommissioned.
Over the years, the various options (centralized enterprise data warehouses [DWs], federated marts, the hub-and-spoke array of a central warehouse with dependent marts, and the virtual warehouse) have all served to emphasize certain aspects of the service expectations for a DW. The common thread running through all these styles is that they were repository-oriented. A repository-only style of warehouse will be completely overwhelmed by the simultaneous increases in volume, variety and velocity of data assets, and data integration toward a repository as the only strategy will fail. This, however, is changing: the DW is evolving from competing repository concepts into a fully enabled data management and information-processing platform. This new warehouse forces a complete rethink of how data is manipulated, and of where in the architecture each type of processing occurs to support transformation and integration. It also introduces a governance model that is only loosely coupled with data models and file structures, as opposed to the very tight, physical orientation previously used.
This new type of warehouse — the logical data warehouse (LDW) — is a series of information management and access engines that takes an architectural approach and de-emphasizes repositories in favor of new guidelines (a minimal sketch of the semantic-layer idea follows this list):
- The LDW follows a semantic directive to orchestrate the consolidation and sharing of information assets, as opposed to one that focuses exclusively on storing integrated datasets in dedicated repositories. The LDW is highly dependent upon the introduction of information semantic services.
- The semantics are described by governance rules from data creation and use-case business processes in a data management layer, instead of via a negotiated, static transformation process located within individual tools or platforms. These semantics include leveraging externally managed processing clusters for MapReduce, Graph and other big data processes.
- Integration leverages both steady-state data assets in repositories and services in a flexible, audited model via the best available optimization and comprehension solution.
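To illustrate the semantic directive in the first guideline, the following is a minimal, hypothetical sketch in which a semantic layer maps one logical entity to two physical resolvers: a repository table and an external service. Consumers query the logical name rather than any one store. All names, schemas and values here are invented.

```python
import sqlite3

# A consolidated repository: an in-memory SQLite table standing in for a
# warehouse store.
repo = sqlite3.connect(":memory:")
repo.execute("CREATE TABLE customers (id INTEGER, name TEXT, segment TEXT)")
repo.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "Acme", "enterprise"), (2, "Globex", "midmarket")])

def engagement_service(customer_id):
    """Stub for an externally managed source (e.g., a cluster's output)."""
    scores = {1: 0.92, 2: 0.41}
    return {"engagement": scores.get(customer_id)}

# The semantic layer: governance rules bind one logical entity to several
# physical resolvers, so consumers never address a repository or a service
# directly.
SEMANTIC_LAYER = {
    "customer_profile": lambda cid: dict(
        zip(("id", "name", "segment"),
            repo.execute("SELECT * FROM customers WHERE id = ?",
                         (cid,)).fetchone()),
        **engagement_service(cid),
    ),
}

def resolve(entity, **keys):
    """Answer a logical query without exposing where the data lives."""
    return SEMANTIC_LAYER[entity](keys["customer_id"])

print(resolve("customer_profile", customer_id=1))
# {'id': 1, 'name': 'Acme', 'segment': 'enterprise', 'engagement': 0.92}
```

The point of the pattern is that the repository and the service can be swapped, split or consolidated without changing the logical entity the consumer sees.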
- Start your evolution toward an LDW by identifying data assets that are not easily addressed by traditional data integration approaches and/or not easily supported by a "single version of the truth." Consider all technology options for data access and do not focus only on consolidated repositories. This is especially relevant to big data issues.
- Identify pilot projects in which to use LDW concepts, by focusing on highly volatile and significantly interdependent business processes.
- Use an LDW to create a single, logically consistent information resource independent of any semantic layer that is specific to an analytic platform. The LDW should manage reused semantics and reused data.
Impact: Business analysts using context-aware algorithmic analysis of big data must address the fidelity and contract aspects of extreme information management, or false analysis output could actually drive customers away.
In a brave new world of transparency and customer fairness legislation, financial institutions (and online services offered to consumers in other industries) could, quite conceivably, provide biased interfaces, including supposed transparency based on the provider's interpretation of what should be transparent (for example, the information that its computerized algorithms have generated), and the consumer would not know the difference between viewing all the information and viewing only selected information. This could affect product pricing, terms and conditions, trust, and service levels, and not necessarily for the true benefit of the customer. Inappropriate use of this information could also be extremely damaging (see "Consumer Trust in Leading Contactless Card Solution Damaged as Customer Data Fails Privacy Test"). Sometimes, just the perception that a bank has used data a customer doesn't want it to use can damage trust and the brand.
In the world of hyperdigitized and algorithmic decisions, the customer might think he or she has made the best online/mobile personalized choices based on the filtering and openness of the information available. However, this openness and these filtering techniques are defined and controlled by the marketer, via the particular search and decision algorithms that IT developed for the business to support a more personalized customer experience. In other words, the algorithms begin to tune the experience based on the individual's ability to use the interfaces and information, instead of optimizing the use of all the options available. Any online provider could, in theory, enable such algorithm-defined, context-aware personalization, and control that personalization via computerized algorithms, without the customers being aware. (Anecdotal evidence reveals that filtering differences exist for even simple financial services search terms.)
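As a purely hypothetical sketch of this mechanism (every offer name and weight below is invented), the ranker blends query relevance with a provider-side margin weight and keeps an audit trail of what the customer never saw. With the margin weight at zero the most relevant offer ranks first; raising it lets that offer silently disappear from view.

```python
from dataclasses import dataclass, field

@dataclass
class Offer:
    name: str
    base_relevance: float  # relevance to the customer's query alone
    margin: float          # provider's profit weight: the hidden bias

@dataclass
class AuditLog:
    shown: list = field(default_factory=list)
    suppressed: list = field(default_factory=list)

def personalize(offers, margin_weight, top_n, log):
    """Rank by a blend of relevance and provider margin; record the cut."""
    ranked = sorted(offers,
                    key=lambda o: o.base_relevance + margin_weight * o.margin,
                    reverse=True)
    log.shown = [o.name for o in ranked[:top_n]]
    log.suppressed = [o.name for o in ranked[top_n:]]
    return ranked[:top_n]

offers = [Offer("Low-fee account", 0.90, 0.10),
          Offer("Premium account", 0.70, 0.60),
          Offer("Standard account", 0.80, 0.30)]
log = AuditLog()
personalize(offers, margin_weight=0.6, top_n=2, log=log)
print(log.shown)       # ['Premium account', 'Standard account']
print(log.suppressed)  # ['Low-fee account'] (the most relevant offer was cut)
```

Retaining the suppressed list is the kind of auditable record that the policy reviews recommended below would need in order to show what a given customer was, and was not, shown.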
- Reconstitute your big data personalization tools. Organizations need to be aware of the potentially significant negative brand impact from incorrectly applied algorithm-oriented, context-aware personalization technologies.
- Align technology which analyzes big data assets with at least one pilot business personalization initiative. Organizations need to ensure that these personalization and context-aware capabilities align with customer expectations and legal/regulatory requirements.
- Review and revise the policies and codes of conduct around the use of nonmanually intermediated decision making and personalization. The use of computerized search, filtering and personalization needs to include more than just relevance as the core part of the information collection, analysis, decisioning and dissemination.