Logging Data Part 2: Taming the Storage Beast

  • 24 July 2014

The previous installment of our Logging Data series outlined how individual impressions move through our network. In this edition, we’ll discuss the storage considerations necessary to catalogue all of these impressions effectively 24 hours a day, focusing specifically on the challenges that stem from the requirements of ad network operations. Let’s first turn to the issue of testing.

Despite their seeming simplicity and small size, designing ad units that maximize the likelihood of a user’s click is an iterative process that can take anywhere from several days to several months. Finding the optimum colors, text size, font, layout, and orientation is the product of a battery of A/B tests ranging from the simple to the complex. For Chitika, we’re also looking to place these ads effectively on our network of sites, meaning that we need to comprehensively test and measure which ads resonate with which users.
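To illustrate the kind of measurement behind those tests, here is a minimal sketch of a two-proportion z-test for deciding whether one ad variant’s click-through rate beats another’s. The function name and inputs are ours for illustration; this is not Chitika’s actual testing tooling.

```python
import math

def ab_ctr_test(clicks_a, imps_a, clicks_b, imps_b):
    """Two-proportion z-test: is variant B's CTR significantly
    different from variant A's? Returns (z, two-sided p-value)."""
    p_a = clicks_a / imps_a
    p_b = clicks_b / imps_b
    # Pooled CTR under the null hypothesis that both variants are equal.
    pooled = (clicks_a + clicks_b) / (imps_a + imps_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / imps_a + 1 / imps_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value
```

At ad-network impression volumes even small CTR differences reach significance quickly, which is why so many variants can be tested in parallel.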

On top of all this ongoing testing, we’re running an extensive ad network serving hundreds of millions of ads per day, around the globe, 24 hours a day.

To give you an idea as to what we’re dealing with, let’s look at three of the topline daily figures in terms of what we are cataloguing:

  1. More than 100 million ads served within our network
  2. Several billion real-time bidding (RTB) events
  3. Millions of tracking events, measuring things such as the ad becoming visible, etc.

Needless to say, our storage requirements are wide-ranging and demand a high degree of redundancy to eliminate downtime, as data are generated constantly throughout each day. Because our users are spread across the globe, our data centers are geographically spread out across the country to handle the associated load. Each of these data centers has more than 20 ad servers and more than 25 RTB servers. Our analytical data center sits closer to our Westborough office for the purposes of aggregation and analytics.

As you can probably tell, we have some unique requirements, which spurred the team here to build a more customized solution to manage the flow of data to each of our clusters. We decided to use Gluster, which is currently maintained by Red Hat.

In short, Gluster is the crux of our storage architecture: a POSIX-compliant clustered filesystem that replaces HDFS in our Hadoop deployment and serves as our data warehouse. Its benefits over HDFS are tremendous, namely:

  • Data available to ALL nodes, not just nodes holding the replicas

  • Data streams over Infiniband at 40Gb/s instead of HTTP over 1Gb Ethernet

  • Loss of a single replica will not trigger a high-cost heal operation

  • POSIX compliance means no user training on interaction with the filesystem

  • Separation of storage and computation components

Gluster currently runs across 4 dedicated servers, each of which is figuratively armed to the teeth with disk space. The per-server specs shake out as:

  • 36 2TB drives (144 drives total)

    • These are split into six 5-drive RAID arrays; each array also has 1 hot spare

      • Each array has 7.3TB of usable space

    • This equates to roughly 59TB of usable space in the data warehouse cluster, with 40TB consumed
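The figures above hang together if the cluster replicates each file three ways; the post doesn’t state the replica count, so treat that as our inference from the arithmetic rather than a documented spec:

```python
# Reconstructing the cluster capacity from the per-server specs above.
ARRAYS_PER_SERVER = 6
USABLE_PER_ARRAY_TB = 7.3   # stated usable space per 5-drive array
SERVERS = 4

per_server_tb = ARRAYS_PER_SERVER * USABLE_PER_ARRAY_TB   # 43.8 TB/server
cluster_raw_tb = SERVERS * per_server_tb                  # 175.2 TB raw

REPLICA_COUNT = 3  # assumption: 3-way replication fits the stated ~59TB
cluster_usable_tb = cluster_raw_tb / REPLICA_COUNT        # ~58.4 TB usable
```

175.2TB of raw usable array space divided by three replicas lands at roughly the 59TB of warehouse space quoted above.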

Taking a step back, let’s summarize the overall impact of all this technical back end. There are a number of benefits to using Gluster, arguably the biggest of which is the system’s POSIX compliance. This allows us to very easily mount Gluster anywhere as if it were a disk local to the system, meaning that we can expose nearly 60TB of data to anything in our data center, across any number of servers, users, and applications.
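As a concrete illustration of that POSIX benefit, ordinary filesystem code works unchanged against a mounted Gluster volume. A minimal sketch, assuming a hypothetical mount point such as /mnt/warehouse (the path is ours, not Chitika’s):

```python
import os

def warehouse_usage(mount_point):
    """Total bytes under a path, using only standard POSIX filesystem
    calls. Because Gluster is POSIX-compliant, this behaves the same
    whether mount_point is a local disk or a mounted Gluster volume."""
    total = 0
    for root, _dirs, files in os.walk(mount_point):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# usage (hypothetical mount point): warehouse_usage("/mnt/warehouse")
```

No Gluster-specific client library or API is involved, which is exactly the “no user training” point above.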

Obviously, while we believe we’ve built a pretty great storage architecture, we’re still tackling some limitations of the current system.

The most apparent speed bump is slower file stat operations, a consequence of the system’s built-in redundancy. Each stat operation forces Gluster to check every replica of the file and heal them if they don’t match. This takes time, particularly because the process is linear at every level: the client-side system utilities stat files one at a time, and the replica check on each Gluster node is linear as well.
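One common client-side way to soften per-file stat cost is to lean on directory-entry metadata instead of calling stat for every file. A sketch of the idea; whether a given Gluster client fills in the d_type field (so the fallback stat is avoided) is an assumption worth verifying on your own mount:

```python
import os

def count_files_no_stat(path):
    """Count regular files in a directory using directory entries alone.

    os.scandir() reads the file type from the directory entry (d_type)
    where the filesystem supplies it, so entry.is_file() usually needs
    no per-file stat() call -- and therefore triggers no replica
    consistency check on the server side."""
    count = 0
    for entry in os.scandir(path):
        if entry.is_file(follow_symlinks=False):
            count += 1
    return count
```

Contrast this with a naive `os.listdir()` plus `os.stat()` loop, which pays the full replica-check cost once per file.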

Beyond that, we’ve observed slow context switching, a common problem when reading a large number of small files. Each read request is load balanced, but a small file is unlikely to fill the 64KB Maximum Transmission Unit (MTU) of TCP/IP over Infiniband (IPoIB).
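A standard mitigation for the small-file problem, sketched here as a generic technique rather than Chitika’s actual approach, is to bundle many small files into a single archive so each network read moves one large sequential payload instead of many sub-MTU requests:

```python
import os
import tarfile

def bundle_small_files(paths, archive_path):
    """Pack many small files into one uncompressed tar archive.

    Reading the single archive streams one large sequential payload,
    amortizing per-request overhead that many tiny reads would incur."""
    with tarfile.open(archive_path, "w") as tar:
        for p in paths:
            tar.add(p, arcname=os.path.basename(p))
    return os.path.getsize(archive_path)
```

Hadoop’s SequenceFile format solves the same problem in the HDFS world; with a POSIX filesystem, plain tar is the equivalent low-tech answer.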

All that being said, tackling these challenges is the name of the game for any DI team, and it represents an ongoing part of our work at Chitika. For some additional reference, we've pasted a diagram of our analytical data pipeline architecture below:

With a view into how we store our data, the next installment of this series will cover how we make these data accessible and usable for our staff of data solution engineers:

Logging Data Part 3: Getting to the Show

Stay tuned!