Thus far, our Logging Data series has focused on the nuts and bolts of our network operations and data infrastructure. While we employ some terrific software and hardware, our proverbial secret sauce consists of the various customizations we employ using these tools. No place was this more evident than during the transition from HDFS to Gluster, and the subsequent porting of Hadoop resources.

As mentioned in our second Logging Data installment, Gluster was chosen as our successor to HDFS for a number of reasons, among them faster data streams, greater data availability, and less training needed for users to interact with the file system. Yet problems arose when we tried to take advantage of these Gluster-specific benefits while still using our existing Hadoop framework. Luckily, Red Hat gave us a good start by open sourcing its Hadoop shim, known on GitHub as glusterfs-hadoop.

We felt the shim was missing a few things, so we modified it to honor both the POSIX permissions specific to each user running a job and permissions inheritance. These modifications allowed Hadoop to operate properly on Gluster in a multiuser environment. Specifically, the changes were necessary because the user running the Hadoop components must be able to access all parts of the warehouse, yet the users of Hadoop, the ones submitting jobs, shouldn’t.
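To make the permission problem concrete, here is a minimal Python sketch of the standard POSIX read check (owner class first, then group, then other). It is not the shim's code; the function name `can_read` and the uid/gid numbers are hypothetical, and the sketch ignores the superuser, ACLs, and the inheritance behavior mentioned above.

```python
import stat

def can_read(mode, owner_uid, owner_gid, uid, gids):
    """Return True if a user (uid plus a set of gids) may read a file with this mode.

    POSIX picks exactly one permission class: owner, else group, else other.
    """
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)

# Hypothetical warehouse directory: mode 750, owned by a service account.
warehouse_mode = 0o750
service_uid, service_gid = 500, 600

# The service account can read everything it owns.
service_ok = can_read(warehouse_mode, service_uid, service_gid, 500, {600})
# A job submitter outside the owning group cannot.
analyst_ok = can_read(warehouse_mode, service_uid, service_gid, 1001, {700})
```

If a job ran only with the service account's identity, every submitter would effectively get the first answer, which is why a per-user check matters in a multiuser setup.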

Unfortunately, while digging through the Hadoop code, we found things that displeased us. In particular, the path that data takes through Hadoop is complex, going in and out of Java many times. Input data is read into Java and given to the mapper, whose output is consumed by Java and written to disk. Java on another node then reads it back in and gives it to the reducer, whose output is consumed by Java once again before it is written back to disk.

The team here is well versed in working around issues, so after some brainstorming, the solution pretty much morphed into “Let’s just build something internally that fulfills our needs better than Hadoop.” Not an easy task, but one that our Operations and DI teams took on readily.

The result became what we refer to internally as CMR, or Cluster Map Reduce. A diagram of how this system fits into our analytical data pipeline is below:

Once rolled out, the benefits to the company over our earlier Hadoop system were numerous and impactful. Specifically:

  1. Faster times to query completion due to complete separation of framework from data flow. In one of our use cases, this reduced the hardware we needed by 87.5%: one of our Hadoop clusters totaled 24 nodes, and replacing Hadoop with CMR allowed the same workload to be completed by only 3 of those nodes.
  2. More straightforward query construction and granularity thanks to wrappers that sit on top of CMR. We call them cget and cgrep. We’ll go into these a little later.
  3. Greater ability to customize future iterations as needs dictate. The majority of the codebase, from the mappers to the framework, is written in Perl, a language in which we have substantial in-house expertise (a few parts are written in C where CMR gets close to the metal).
  4. A much lighter footprint than Hadoop. Normal daemon operations each consume less than 50 MB of resident memory. At no point does data sit in memory used by the framework.
  5. CMR is relatively stable and its clients are resilient against failures in the server components. Because of this, any part of the CMR infrastructure may be restarted without causing currently running jobs to fail (though the jobs might take a little bit longer than usual).
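The hardware figure in the first item follows directly from the two node counts given there. A quick check of the arithmetic in Python (variable names are ours):

```python
nodes_before = 24  # size of the Hadoop cluster
nodes_after = 3    # nodes needed for the same workload on CMR

# Fraction of the original hardware that is no longer needed.
reduction = (nodes_before - nodes_after) / nodes_before
reduction_text = f"{reduction:.1%}"
print(reduction_text)  # 87.5%
```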

A given CMR job consists of three essential components:

  1. A language agnostic mapper
  2. A language agnostic reducer (recommended, but technically optional)
  3. Data in the form of input files

Naturally, most of the work of actually performing a map-reduce job falls to your mapper and reducer. Our mapper takes the fields we specify as parameters, and the reducer is told how many fields to expect and how many aggregates to make.
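To illustrate that division of labor, here is a rough Python sketch of a parameterized mapper and reducer. It is not our Perl implementation: the tab-separated record layout, the function signatures, and the choice of summing as the aggregate are all assumptions made for the example.

```python
from collections import defaultdict

def mapper(lines, key_fields, value_fields, sep="\t"):
    """Emit the selected key columns followed by the selected numeric columns."""
    for line in lines:
        cols = line.rstrip("\n").split(sep)
        yield sep.join(cols[i] for i in key_fields + value_fields)

def reducer(lines, num_keys, num_aggregates, sep="\t"):
    """Sum the num_aggregates columns that follow the first num_keys columns."""
    totals = defaultdict(lambda: [0] * num_aggregates)
    for line in lines:
        cols = line.rstrip("\n").split(sep)
        key = tuple(cols[:num_keys])
        for i, value in enumerate(cols[num_keys:num_keys + num_aggregates]):
            totals[key][i] += int(value)
    for key in sorted(totals):
        yield sep.join(list(key) + [str(total) for total in totals[key]])

# Hypothetical log records: country, path, hits, bytes.
log = [
    "us\t/home\t1\t512",
    "us\t/home\t1\t256",
    "de\t/home\t1\t100",
]

# Key on column 0; aggregate columns 2 and 3.
mapped = mapper(log, key_fields=[0], value_fields=[2, 3])
result = list(reducer(mapped, num_keys=1, num_aggregates=2))
```

Here `result` holds one line per country with summed hits and bytes. Because both stages only read and write delimited text, either one could be swapped for a program in another language, which is the point of keeping them language agnostic.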

Now, suppose you don’t care about mapping or reducing your data set; it’s a scary thought, but bear with us. You may be examining a small time span and want to extract all the lines that meet certain criteria (that is, you want to grep over your files). With the CMR framework, you can run a distributed grep over your data warehouse – a nice added benefit.
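The idea of a distributed grep can be sketched in a few lines of Python. This is only an analogy, not how CMR does it: a local thread pool stands in for cluster nodes, and the function names, file names, and log lines are invented for the example.

```python
import os
import re
import tempfile
from concurrent.futures import ThreadPoolExecutor

def grep_file(path, pattern):
    """Return the lines of one file that match the compiled pattern."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if pattern.search(line)]

def parallel_grep(paths, regex, workers=4):
    """Grep each file in its own worker and concatenate results in input order."""
    pattern = re.compile(regex)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = pool.map(lambda path: grep_file(path, pattern), paths)
        return [line for lines in per_file for line in lines]

# Build two small sample log files, then search both for error responses.
with tempfile.TemporaryDirectory() as tmp:
    samples = [
        ("a.log", "GET /home 200\nGET /missing 404\n"),
        ("b.log", "POST /login 500\nGET /home 200\n"),
    ]
    paths = []
    for name, body in samples:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(body)
        paths.append(path)
    matches = parallel_grep(paths, r" (404|500)$")
```

Each file is filtered independently and no reduce step is needed, which is why this kind of job spreads across a cluster so easily.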

Now that we’ve taken a kind of sky-high view of CMR, in our next installment we’ll be discussing how we’re opening it up to the world as an Open Source project. This project will include the CMR framework, both wrappers, and a working mapper/reducer pair.

Logging Data Part 5: Giving You the Keys

Stay tuned!