In our last edition of Logging Data, we introduced Cluster Map Reduce, or CMR. The new tool acts as an alternative to Hadoop and HDFS when paired with a POSIX-compliant clustered file system. Within our environment, CMR improved query completion speed, resilience, and ease of query construction, among other areas.

Today, we’re proud to provide this tool to the world as a free, Open Source release!

How CMR Works for You

CMR provides a new method of access that addresses a number of pain points we were experiencing with our earlier Hadoop deployment. Among its benefits:

  1. Independent scaling of both storage and compute components due to complete separation of those duties.

  2. A much lighter footprint than Hadoop, with normal daemon operations each consuming less than 50MB of resident memory.

  3. Elimination of node hotspots caused by replication healing or simultaneous data access, since all nodes have access to all data.

The crux of CMR is that it simplifies the movement of data through the analytical back end, minimizing the dependencies and potential points where the data pull process may slow or stop altogether. This, along with the more visible user accessibility advances, comprehensively improves day-to-day data analytics operations, especially within large data environments.
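To make the shared-storage idea concrete, here is a minimal Python sketch of a map and reduce pass over files in a directory that every worker can read. It is not CMR's code or interface: the function names, the `*.log` pattern, and the token-counting job are all hypothetical, and threads stand in for compute nodes. It only illustrates that when every worker sees the same files, any file can be handed to any idle worker.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def map_file(path):
    """Map step: count whitespace-separated tokens in one log file."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            counts.update(line.split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-file counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(shared_dir, pattern="*.log", workers=4):
    """Run the map step over every matching file, then reduce.

    Threads stand in for compute nodes (an assumption for this sketch).
    Every worker reads from the same directory tree, so no file is tied
    to a particular worker.
    """
    paths = sorted(Path(shared_dir).glob(pattern))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_file, paths))
    return reduce_counts(partials)
```

Calling `run_job("/mnt/shared/logs")` (a hypothetical mount point) would return a single `Counter` covering every matching file. Adding workers changes only how quickly the files are consumed, not where the data has to live.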

A couple of important notes:

Our CMR deployment uses InfiniBand, which allowed us to fully realize the additional speed benefits that CMR brought to the table. For those who would use CMR across a GigE connection, any speed increase may be less evident.

Additionally, one of CMR’s biggest benefits is that all available resources on compute nodes can be used for computing instead of being shared with storage duties. As such, environments with centralized and robust data storage solutions are the best fit for CMR, rather than, say, a two-disk home NAS appliance. While CMR will still work with such a NAS, compute nodes will quickly become data starved.

Getting Started

CMR is now available on GitHub, along with the applicable documentation, including setup information and general FAQs. Visit the CMR GitHub repository for information on how to get started: https://github.com/chitika/cmr

Take a look, tinker around, and let us know what you think! We’d love to hear the good, the not-so-good, and anything in between – drop us a line at cmr@chitika.com