Cool Vendors in Information Infrastructure and Big Data, 2012
Emerging vendors in information infrastructure are tackling various aspects of "big data." Information managers and architects should consider the need for such vendors and technologies to complement their main information management technology investments.
- MapReduce is not the only viable approach to parallelizing the processing of high volumes of data in clustered computing environments; alternatives such as HPCC Systems are also available.
- NoSQL databases are proving valuable for scaling out cloud and on-premises uses of numerous content types, and document-oriented open-source solutions are emerging as one of the leading choices.
- The acquisition of new hardware/software clusters to process "big data" can be expensive and inefficient. New hybrid solutions are emerging to minimize data movement while supporting both batch and online use on the same platform.
- Gaps in the Apache Hadoop stack's Hadoop Distributed File System (HDFS) component that lead to availability and performance challenges are being addressed in newer Apache distribution versions as well as by competitors such as MapR.
- Organizations seeking scale-out solutions for processing large volumes of unstructured or multistructured data should not limit their search to MapReduce-based solutions. HPCC Systems is a viable alternative worthy of consideration.
- Development organizations looking for a NoSQL database solution for building Web-scale applications with familiar languages, with the flexible schema capability of a document-style database and full consistency, could consider MongoDB.
- Enterprise customers concerned about the availability and performance of storage in the Apache Hadoop stack should examine alternatives along with the newest versions of HDFS. MapR's distributions are candidates for workloads that require higher availability and performance.
- Very-high-volume event processing applications can be implemented with an Apache Hadoop stack. Companies experimenting in this environment could consider HStreaming as a potential tool.
- Combining structured and unstructured data on the same hardware stack is possible and can reduce expensive duplication. Organizations that expect such use cases to occur could consider installing Hadapt in their Apache Hadoop clusters.
This research does not constitute an exhaustive list of vendors in any given technology area, but rather is designed to highlight interesting, new and innovative vendors, products and services. Gartner disclaims all warranties, express or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
Data processing has always represented a spectrum of use cases, from transaction/interaction processing to analytics, and many combinations of the two. Big data processing is no different, and new use cases have emerged that leverage both new and existing data poorly served by legacy solutions built decades ago in different computing environments, before massive scale-out architectures were commonplace and before the variety of data types now being deployed existed. New product categories have sprung up, designed to fit new needs at lower cost and with better alignment to new development tools and practices. Products and vendors are continually emerging to offer more purpose-built alternatives; these are immature, but often worth assessing.
One of the drivers behind these offerings is the need to leverage parallel architectures constructed from less costly servers, using multiple threads of execution distributed across networks to perform analysis. The Apache Hadoop stack, a platform for the MapReduce Java programming framework, has garnered early support in the marketplace, and substitute components are already appearing at different layers. In this research we note the appearance of an alternative to MapReduce from HPCC Systems, a substitute for the HDFS file system from MapR, and a database from Hadapt designed to compete with Apache HBase by providing a side-by-side environment for both structured and unstructured data in the cluster. We also highlight HStreaming, which seeks to bring event processing to the clustered stack.
As always with Cool Vendor reports, this discussion highlights only a few of many, and all are in relatively early stages. Caution is advised, but organizations seeking to explore the new opportunities for innovation created by the big data phenomenon will find much here to drive experimentation and new application creation.
10gen
Palo Alto, California (www.10gen.com)
Analysis by Merv Adrian
Why Cool: 10gen is the commercial provider for MongoDB, an open-source, document-oriented, fully consistent NoSQL database for cloud and on-premises use, offering production support, training and consulting. Its automation, no-downtime design and broad language support have driven wide and growing usage. The company's Subscriber Edition adds Simple Network Management Protocol (SNMP) support as well as testing and certification to the free open-source Affero General Public License (AGPL) version. 10gen's rapidly growing revenue (Gartner estimates $10 million in 2011) comes primarily from its subscriptions, with the remainder from its training and short-engagement consulting business. It grew from roughly 25 employees at the beginning of 2011 to over 100 at year-end and expects to double in 2012. 10gen now has 15 direct salespeople and plans to double that number this year. MongoDB boasts a list of marquee customers, including mainstream names like Disney, Forbes, O2 (Telefonica), SAP and the U.K. government, along with Web 2.0-style firms like foursquare, Viber Media and craigslist, which use it for managing websites, product catalogs and other frequently changed data.
MongoDB provides no-downtime auto-sharding for distributed writes, replica sets, single-instance durability (in addition to replication-based high availability) and automated failover. The product also supports MapReduce on its own data, and 10gen is planning Hadoop adapters for batch processing and aggregation operations. MongoDB is used with Java, C#, Ruby, Python, PHP, Erlang and other languages. It is also available through the (independent) hosting players Engine Yard, MongoHQ and MongoLab, letting developers create databases across three Amazon EC2 instances in three different availability zones.
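To make the document model and flexible schema concrete, the following is a minimal sketch using MongoDB's Python driver (pymongo); the connection string, database, collection and field names are illustrative, and method names follow recent driver versions.

```python
# Minimal sketch: flexible-schema documents with pymongo (names illustrative).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
catalog = client.shop.products  # databases and collections are created lazily

# Documents in the same collection need not share a schema.
catalog.insert_one({"sku": "A-100", "name": "Lamp", "price": 24.50})
catalog.insert_one({"sku": "B-200", "name": "Desk", "price": 180.00,
                    "dimensions": {"w": 120, "d": 60, "h": 75}})

# Query on a nested field; a secondary index keeps lookups fast at scale.
catalog.create_index("sku")
desk = catalog.find_one({"dimensions.w": {"$gte": 100}})
print(desk["name"])  # -> Desk
```

There is no schema migration step when the second document adds a nested "dimensions" field, which is the agility that draws developers to the document model.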
In September 2011, 10gen raised a $20 million funding round led by Sequoia Capital, with participation from 10gen's other existing investors, Flybridge Capital Partners and Union Square Ventures. The company has partnerships with numerous technology firms, including Amazon, Fusion-io, Jaspersoft, Microsoft (Windows Azure), NetApp, Rackspace, Red Hat and VMware.
Challenges: MongoDB is still a relatively immature product with a growing list of customer needs that will require feature/function additions, such as a representational state transfer (REST) interface, improved concurrency, better operational monitoring and tighter integration with the growing Hadoop base. Some technical failures in 2011 drew attention in the very vocal open-source community. These issues will be helped by the recently introduced MongoDB Monitoring Service, a software-as-a-service offering that will help subscribers anticipate and respond to developing issues in their systems.
As it moves into the mainstream, 10gen is finding that growth will drive significant operational changes. Deals are growing larger, but such customers bring strong service-level agreement (SLA) requirements. The company has opened a support center in Dublin and plans to open one in Asia this year to ease 24-hour-a-day support delivery; learning to manage and operate these centers will be a new execution hurdle. Simply growing the organization at the current rate is an operational challenge.
Similarly, the sales model is evolving. Increasingly, 10gen finds itself competing with incumbent, major RDBMS vendors like Oracle instead of NoSQL players like DataStax (Cassandra) and Couchbase. Until now, the pipeline has been filled by prospects choosing and testing "free" products before engaging with 10gen, and the sale was little more than a formalization of a selection process that took place with no customer contact. As it seeks to engage mainstream prospects, the company will need to grow its sales force and build messaging that is appropriate for a different buyer.
Who Should Care: CIOs, IT architects, and developers looking for a persistence solution for large-scale Web applications designed for agile development and flexible schema design, where "eventual consistency" will not be sufficient, and for whom developer productivity is a key decision variable.
Hadapt
Cambridge, Massachusetts (www.hadapt.com)
Analysis by Merv Adrian
Why Cool: Hadapt provides a "cloud-ready" RDBMS designed to run directly on Apache Hadoop clusters, offering improved performance over alternatives such as Hive over HDFS for use cases where structured and unstructured data are used together. Based on research from a team including chief scientist Daniel Abadi of Yale University, one of the researchers whose work was commercialized as Vertica (see "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads," published in 2009), the company was formed to provide a single system for applications that rely on both structured and multistructured data. In December 2011, Hadapt 1.0 began its Early Access Program for beta customers, on the heels of a $9.5 million Series A financing round. The product is expected to enter general availability in early 3Q12. Company executives tell Gartner there are some 24 early access customers in insurance, retail, banking, consumer packaged goods and the Internet.
Hadapt believes that in clustered computing environments, HDFS is the file system and Hadoop is the OS; it seeks to make its product the analytical DBMS of choice. It is commercializing the proposition that building query plans entirely in advance cannot provide the best performance in a cloud environment, because cloud resources vary dynamically during execution. Hadapt places an RDBMS instance on every node in the cluster to improve the performance of queries over the structured part of the data, and uses data-partitioning techniques to eliminate unnecessary data movement.
Hadapt extends Hadoop's data replication design to insulate against node failure without data loss, using "adaptive query execution" to accommodate varying node workloads by shifting work during execution, sustaining high performance even when nodes fail. Keeping an RDBMS data store on each node of a cluster creates a hybrid structured/unstructured fabric of data. Early results show performance an order of magnitude faster than a Hive-based interface to HDFS. This can be supplemented with continued use of MapReduce on data left in its raw form via a patent-pending "split query" approach, which eliminates the need for connector-based data movement across machines into expensive silos and permits the use of SQL-based business intelligence (BI) tools without the "impedance mismatch" of connecting online-oriented tools to a batch-based system. Hadapt supports considerably more SQL syntax than Hive because it does not need to translate queries into MapReduce calls, and it delivers database-speed (not file-scan-speed) results on its structured data while using MapReduce on the data remaining in HDFS.
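Hadapt's interfaces are not documented in this research; the sketch below is a hypothetical Python illustration of the "split query" idea over a standard ODBC connection. The data source name, tables and columns are invented for illustration, not Hadapt APIs.

```python
# Hypothetical sketch of Hadapt's "split query" idea over a standard ODBC
# connection. The DSN, tables and columns are invented for illustration;
# this is not a documented Hadapt API.
import pyodbc

conn = pyodbc.connect("DSN=hadapt_cluster")  # hypothetical data source name
cur = conn.cursor()

# One SQL statement spans both sides of the cluster: the relational table
# (orders) is answered at database speed by the node-local RDBMS instances,
# while data left raw in HDFS (raw_clickstream) is scanned via MapReduce,
# and the results are combined by the split-query planner.
cur.execute("""
    SELECT o.customer_id, COUNT(*) AS page_views
    FROM orders o
    JOIN raw_clickstream c ON c.customer_id = o.customer_id
    GROUP BY o.customer_id
""")
for row in cur.fetchall():
    print(row.customer_id, row.page_views)
```

The appeal for BI tools is that the statement above is ordinary SQL; the structured/raw split is an execution-time decision, invisible to the client.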
Hadapt is making code contributions to Apache, but is not a major contributor. Its software is designed to coexist with Hadoop on the same nodes, and the company does not provide a Hadoop distribution of its own. Gartner believes Hadapt will add text search capabilities, since this is an obvious fit and the connection between Lucene and a stored structured index would be very powerful.
Challenges: Hadapt is a tiny vendor in a large and crowded space. At this stage, it has no sales force, minimal marketing staff and a product that has not entered general availability. Its funding is adequate for its current size (some 20 employees, almost entirely in engineering). The CEO feels there is no immediate need for next-round funding based on the current plans. However, Gartner believes that should the product be well-received, aggressive ramp-up will be required to keep pace and meet customer expectations, including support. Hadapt's value proposition is unique and well-differentiated, but communications and execution will be substantial challenges in the year ahead.
Who Should Care: CIOs, IT architects and data management professionals looking for scalable SQL-based access to structured and unstructured data on dedicated hardware or in the cloud, especially when the two are likely to be consistently used together in applications.
HPCC Systems
Alpharetta, Georgia (www.hpccsystems.com)
Analysis by Donald Feinberg
Why Cool: HPCC Systems was formed in 2011 by LexisNexis Risk Solutions as an open-source organization sponsoring the HPCC technology. LexisNexis developed the original technology a decade earlier to manage what its clients then considered big data problems. It is not often that a proven technology emerges as a product (open source or not) just as market awareness (here, of big data) is rising, backed by a company that has developed the technology for many years and has thousands of enterprise customers. HPCC Systems offers two versions: the Community Edition (fully open-source software) and the Enterprise Edition (offered as a licensed product with support). What is cool about this technology is that it offers an alternative model to MapReduce for storing and processing big data, with its own language: Enterprise Control Language (ECL).
HPCC is an alternative to Apache Hadoop at a time when awareness of MapReduce is growing, and HPCC Systems is attempting to take advantage of this situation. By offering this alternative, it expects to leverage the need to collect and process big data using a system that is less complex and easier to use. HPCC uses massively parallel processing (as does Hadoop) to process data; however, according to HPCC Systems, it uses fewer nodes with fewer cores to produce faster results. ECL is less complex than MapReduce, allowing programmers to write simple routines instead of the highly sophisticated ones MapReduce requires; code-base comparisons show ECL requires far less code than MapReduce (the sketch below illustrates the steps MapReduce imposes). There are three components to HPCC: (1) Thor, the extraction, transformation and loading engine for loading data into HPCC Systems, which can also perform tasks such as data hygiene, transformation, linking and indexing; (2) Roxie, the HPCC data delivery engine for running queries against the data; and (3) ECL, the language used to program the queries. HPCC also provides analytics modules that process data for specific purposes; however, some of these modules are not part of the open-source system and are licensed separately.
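To make the code-volume contrast concrete, here is a plain-Python sketch of the imperative map, shuffle and reduce steps a MapReduce word count requires (in Hadoop these would be Java Mapper and Reducer classes plus job configuration); ECL expresses the equivalent computation declaratively in a few lines. This is a conceptual illustration, not HPCC or Hadoop code.

```python
# Illustration of the imperative steps a MapReduce word count requires,
# sketched in plain Python with no Hadoop dependencies. ECL's declarative
# style is the basis of HPCC Systems' "far less code" claim.
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs; in Hadoop this is a Mapper class in Java.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Group values by key; Hadoop performs this between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts per word; in Hadoop this is a Reducer class.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
print(reduce_phase(shuffle(map_phase(lines))))  # {'the': 2, 'quick': 1, ...}
```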
Challenges: Unlike most startup organizations, HPCC Systems has a product with many customers and proven technology, used by LexisNexis to support its own clients, so it bypasses many early-stage product challenges. The primary challenge it faces is market awareness: it is emerging at a time when big data solutions are associated with Hadoop MapReduce. The purpose of offering both open-source and licensed alternatives is to develop market awareness of, and penetration for, the HPCC Systems platform; the open-source edition allows users to test the product before committing to it. However, HPCC Systems faces a large obstacle in the current market position of Hadoop MapReduce and of Hadoop-based alternatives such as MapR and Hadapt. Not only are these technologies familiar to the market, but many third-party software vendors have developed tools that integrate with and support Hadoop; for example, IBM has incorporated Apache Hadoop code in its InfoSphere BigInsights offering, and the Oracle Big Data Appliance uses Cloudera's distribution of Hadoop. With the market leaders integrating forms of Hadoop into their systems, entering the market with an alternative is even more difficult.
Who Should Care: Organizations looking for big data solutions to process and store data should consider HPCC Systems as a potential alternative solution to an Apache Hadoop MapReduce system. In addition to the CTO, database administrators (DBAs) and other technical management will be interested in HPCC Systems, along with business unit management, and analysts looking for a less complex approach to big data.
HStreaming
Chicago, Illinois (www.hstreaming.com)
Analysis by Roy Schulte
Why Cool: HStreaming applies the principles of MapReduce to the challenge of getting real-time intelligence from high-volume streams of event data. The resulting system has the potential to scale higher than conventional event-processing software platforms, while running much faster than common MapReduce systems.
Conventional event-processing platforms were designed as hubs and typically had a centralized topology. Newer event processing platforms have added parallelism and, separately, can also be distributed geographically to implement hierarchical event processing networks. HStreaming's version of parallelism is based on an advanced Hadoop cluster grid architecture that has the theoretical potential to scale to hundreds of nodes, each capable of up to a million events per second.
HStreaming applications have latencies measured in milliseconds or seconds, compared with common MapReduce systems that typically take 10 minutes or more. Common MapReduce systems store data on arrival, run the mapping steps and store the data again, then run the reduce steps and store the data a third time to make it available for analytics. By contrast, HStreaming, like other event-processing platforms, operates on data in motion. It executes all operations in memory, turning raw input event streams into output streams without putting the data into a database. (However, the developer can optionally store the output in HDFS, Apache HBase, Apache Cassandra or a standard SQL database at the end of the complex-event processing [CEP] step.) Input streams are continuously mapped on arrival, and the reduce step operates on moving time windows of data. HStreaming supports the CEP functions typically found in event-processing platforms, including filter, sort, correlation, aggregation, pattern matching, projection and user-defined functions.
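The continuous-map-plus-windowed-reduce pattern described above can be illustrated with a short, self-contained Python sketch; this is a conceptual model, not HStreaming's API, and the event fields and window length are invented for illustration.

```python
# Conceptual sketch (not HStreaming's API): events are mapped on arrival
# and a reduce runs over a moving time window, so raw input streams become
# output streams with no database write in the processing path.
import time
from collections import deque

WINDOW_SECONDS = 60
window = deque()  # (timestamp, key, value) tuples currently in the window

def on_event(event):
    # "Map" on arrival: extract a key/value pair from the raw event.
    ts, key, value = event["ts"], event["user"], 1
    window.append((ts, key, value))
    # Evict events that have slid out of the time window.
    while window and window[0][0] < ts - WINDOW_SECONDS:
        window.popleft()
    # "Reduce" over the window: aggregate per key (events per user here).
    counts = {}
    for _, k, v in window:
        counts[k] = counts.get(k, 0) + v
    return counts  # one element of the output stream, emitted per input event

print(on_event({"ts": time.time(), "user": "alice"}))  # {'alice': 1}
```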
The patented HStreaming communication fabric architecture provides high availability in addition to scalability. HStreaming Cloud runs on Amazon EC2, which allows resources to be added or removed dynamically as the workload swells or shrinks. HStreaming Enterprise is the software version available for running on-premises, and HStreaming Community Edition is an entry-level system with fewer reliability and security features. HStreaming leverages standard Apache Hadoop technologies such as Pig, HDFS and ZooKeeper, with some adjustments to accommodate the real-time nature of event processing. It also leverages Pig ecosystem functions, including PageRank, sessionization, Markov pairs, quantiles, Wilson binomial confidence intervals, set operations and linear regression.
Challenges: HStreaming is a young company (founded in 2010) with a largely unproven product. The HStreaming Cloud beta version reportedly has just over 100 customers, and the Community Edition slightly fewer than 100. The company has leading-edge core technology and impressive developers with strong experience in event processing, virtualization and parallel processing. However, it is missing some of the supporting tools that would make a polished product: it has limited analytic functions and prebuilt adapters to external data sources, and no native dashboard tools. The company is underfunded, so its marketing and consulting resources are minimal. At least two larger competitors, IBM (InfoSphere Streams) and Vitria (M3O), support MapReduce-like CEP, although neither integrates as closely with Apache Hadoop and associated tools.
Who Should Care: HStreaming is relevant for very-high-volume CEP applications, particularly those involving unstructured data or fluctuating workloads where the scalability of cloud computing helps. Companies are most likely to use it for customer relationship management, fraud detection, media analytics, service-level management, regulatory compliance, log file analysis, social computing activity streams, intrusion detection and similar applications. These applications are inherently well-suited to experimentation with new products, such as HStreaming's, because the initial usage can be limited to read-only analytics with human oversight, without the risk of corrupting production data or losing actual customer transactions.
MapR
San Jose, California (www.mapr.com)
Analysis by Donald Feinberg
Why Cool: MapR launched its proprietary version of the Apache Hadoop stack — MapR — in 1Q11, featuring a proprietary file system called Direct Access NFS, which replaces HDFS and facilitates the use of existing Network File System (NFS) data, enhancing the ease of use, stability and maturity of the platform. Beyond the proprietary file system, MapR has all the necessary products, functions and tools to compete with the other distributions of MapReduce. It is API-compatible with Apache Hadoop, and both of MapR's distributions — the open-source M3 Edition and the proprietary M5 Edition — contain standard Apache Hadoop projects such as Pig, Hive, Mahout and HBase (see "How to Choose the Right Apache Hadoop Distribution"). MapR also distributes a number of tools to enhance the environment and to create and manage a MapReduce environment easily. What sets it apart from other Hadoop and MapReduce distributions is Direct Access NFS. Traditionally, HDFS has had many issues, such as little high-availability (HA) capability, several single points of failure (especially the name node, which tracks where and how data is distributed across the data nodes and storage) and poor overall input/output performance. Direct Access NFS addresses these issues and adds functionality such as snapshots, compression and automatic mirroring for disaster recovery. It also replaces append-only HDFS semantics with random read/write capability.
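The practical consequence is that a MapR cluster can be mounted like any NFS file system and manipulated with ordinary file APIs, including in-place updates that append-only HDFS disallows. Below is a minimal Python sketch assuming MapR's /mapr/<cluster-name> mount convention; the path, file and offset are illustrative.

```python
# Sketch: a MapR cluster mounted over NFS behaves like ordinary storage,
# assuming MapR's /mapr/<cluster-name> mount convention. The path, file
# and offset are illustrative; the file is assumed to already exist.
path = "/mapr/my.cluster.com/data/events.log"

# Random read/write: seek to an arbitrary offset and update in place,
# something the append-only HDFS model does not allow.
with open(path, "r+b") as f:
    f.seek(1024)
    f.write(b"UPDATED")

# Because this is plain file I/O, existing tools (cp, grep, tail -f) and
# libraries work against cluster storage without HDFS-specific clients.
```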
Since becoming operational in early 2011, and without much self-promotion, MapR has enjoyed a position as one of the more stable and higher-performing distributions of Apache Hadoop. This led EMC to become an OEM partner for the MapR product, both as a stand-alone offering and integrated with Greenplum in its Data Computing Appliance (DCA). For much of 2011, EMC sold the MapR product to support MapReduce in big data environments, especially alongside the Greenplum DBMS for MapReduce functionality. Additionally, MapR and Informatica have announced a partnership in which Informatica will support MapR in the Informatica Platform, allowing integration of external data with relational data via MapReduce processing. The two vendors announced that Informatica's HParser Community Edition will be available with the MapR distribution, leveraging Informatica's tools to eliminate the need to develop and test data transformations in Java and Perl.
Challenges: There are several issues ahead for MapR. First, it is one of many distributions of Apache Hadoop. Cloudera, the premier Apache Hadoop distributor, has grown significantly over several years and boasts the largest customer base of the distributions (see the Cloudera entry in the Where Are They Now section). In 2011, IBM (InfoSphere BigInsights) also released a competing product; more recently, Greenplum (a division of EMC) released its own version of Hadoop, called Greenplum HD, and Hortonworks is about to release its distribution into production. Also, Apache Hadoop remains available as an open-source product, with many organizations downloading it directly from the Apache website. With all these available sources, it is increasingly difficult for customers to differentiate among the distributions.
Second, in late 2011 Apache released version 0.23 of Hadoop, whose HDFS changes fix many persistent issues, such as the name node's single point of failure, HA (although manual) and performance. Although MapR's performance remains higher than that of the new Apache 0.23 release (this has yet to be demonstrated), many users will find the performance increases of the Apache distribution sufficient for their needs.
Finally, although distribution by EMC helped MapR gain revenue and market position through 2011, EMC now has its own distribution. EMC will continue to field MapR as a stand-alone Apache Hadoop distribution, Greenplum MR. However, the launch of the Greenplum HD distribution — with its Isilon OneFS storage option for network-attached storage (NAS) — and of the Greenplum HD DCA Module, a coprocessing Hadoop component for side-by-side operation with the Greenplum database on EMC's Greenplum Data Computing Appliance, will certainly reduce MapR's visibility within the EMC sales force and customer base. MapR needs to find additional partners quickly to offset the loss of mind share within EMC.
Who Should Care: Any organization looking to support a MapReduce environment for big data should consider MapR for the ease of use, stability and maturity of the platform. Over time, we believe other distributions will achieve similar stability using the new 0.23 release of Apache Hadoop. However, MapR still maintains strong differentiation through the functionality it adds for installing, setting up, monitoring and managing the MapR environment. Finally, for organizations with existing MapReduce environments, MapR is an easy replacement, as it maintains complete compatibility at the API and program level.
Aster Data
Dayton, Ohio (www.asterdata.com)
Analysis by Donald Feinberg
Profiled in "Cool Vendors in Data Management and Integration, 2009"
Why Cool Then: Aster Data Systems was founded in 2005, before the big data movement. Its unique architecture incorporated SQL-MapReduce, a patented framework for analytics. SQL-MapReduce acts as a SQL language extension, creating a platform that leverages MapReduce processing to exploit structured and unstructured data for efficient time-series analysis, pattern matching and graph analytics through standard ANSI SQL and BI tools. Aster nCluster's in-DBMS analytic processing also incorporates more than 50 prepackaged SQL-MapReduce analytic modules.
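As an illustration of the SQL-MapReduce style, the fragment below sketches how an analytic function might be invoked from within ordinary SQL over a standard connection. The data source name, table, columns, the function name and its clause syntax are hypothetical approximations for illustration, not copied from Aster documentation.

```python
# Hedged illustration of invoking a SQL-MapReduce-style analytic function
# from Python over ODBC. All names and the clause syntax are hypothetical.
import pyodbc

cur = pyodbc.connect("DSN=aster_db").cursor()  # hypothetical data source
# The analytic function runs inside the SQL statement itself; the MapReduce
# parallelism is hidden behind the function invocation, so standard BI
# tools can consume the result like any other query.
cur.execute("""
    SELECT user_id, path
    FROM sessionize_and_match(
        ON clicks
        PARTITION BY user_id
        ORDER BY click_time
        PATTERN ('Search.View*.Buy')
    )
""")
for user_id, path in cur.fetchall():
    print(user_id, path)
```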
Where Are They Now: In March 2011, Aster Data Systems was acquired by Teradata (based in Dayton, Ohio) for an undisclosed amount (see "Aster Data Purchase Shows Teradata's Vision Is Deeper Than 'Big Data'"). Teradata continues to enhance and market Aster's nCluster DBMS as a separate product, now called Aster Database, and has invested in sales and marketing to expand its reach. The integration of its DBMS and tools with Teradata software is proceeding with a new Aster MapReduce Appliance, new SQL-MapReduce analytical functions, and a high-performance bidirectional bridge to move data between Teradata and Aster Database.
Who Should Care: Organizations seeking to leverage large amounts of structured and unstructured or multistructured data types for analytics and customer-facing interactive applications, and expecting to require support for production purposes should be interested in Aster Data's nCluster. In addition, Teradata customers will find Aster's offerings extend the reach of their technology into new areas.
Cloudera
Palo Alto, California (www.cloudera.com)
Analysis by Merv Adrian
Profiled in "Cool Vendors in Data Management and Integration, 2010"
Why Cool Then: Cloudera recognized the market opportunity in addressing the difficulty of installing the unsupported open-source Apache Hadoop MapReduce stack in a production environment. It did so by creating the Cloudera Distribution for Hadoop (CDH), adding Cloudera Manager to manage Hadoop clusters and including multiple Apache projects, such as HBase, Hive, Pig and Mahout, for use cases such as text analytics, SQL interfaces and machine learning. Cloudera championed Hadoop aggressively and created Amazon Machine Images (AMIs) for Amazon EC2 and VMware vCloud API support to help lower barriers to entry by leveraging the low-cost model of the cloud. The company built a suite of subscription support, consulting services and training offerings. It also took a leadership role in the community by driving the growing visibility of Hadoop and funding programmers ("committers") who made sizable contributions to the various Apache projects.
Where Are They Now: Usage of Apache Hadoop has exploded, creating opportunities and threats for Cloudera, which has grown from 30 to 215 employees since 2010. Major vendors such as IBM and EMC now offer their own distributions of the Apache Hadoop stack; others, such as Dell, NetApp, NTT Group, Oracle and SGI, have chosen to partner with and redistribute CDH, as have adjacent vendors in DBMS, BI, data integration, hardware and system integration. Some 200 vendors are part of the Cloudera Connect partner program. A pure-play competitor emerged when Yahoo spun out several dozen of its engineers into a new vendor, Hortonworks, which claims to have contributed more code to the stack than Cloudera's participants. The two firms are competing for customers, partners and mind share, but Hortonworks has not yet brought its distribution to general availability, while Cloudera has sprinted to an early lead with over 100 enterprise customers; Cloudera claims more than 80% of the top North American telecommunications carriers, equipment manufacturers and service providers as customers. CDH4 recently entered beta, continuing a steady cadence of releases that bring continuing development of the Apache projects to market, steadily plugging gaps and adding features demanded by enterprise adopters.
Who Should Care: Organizations seeking to leverage large amounts of structured and unstructured or multistructured data types for analytics and customer-facing interactive applications, and expecting to require support for production purposes should be interested in Cloudera. In addition, companies interested in working with a vendor that is partnering closely with Cloudera (for example, companies where Oracle or NetApp is a strategic vendor) should familiarize themselves with the offerings. CIOs, CTOs, BI competency centers and DBAs will find the Cloudera product line useful for design, training, deployment and support.