By Vaibhav Puranik, Prinicpal Engineer
Usually our CTO Ken Weiner never refuses to go to the usual GumGum afternoon coffee walk at 3:30. But on that day, something was fishy. He refused. When I came back I saw him salivating at the screen. I walked to his desk and he quickly pointed me to a website on his screen. Then he uttered the following words – ‘Realtime Data’!
Being an engineer at heart, I got overjoyed with the prospects of doing challenging work. The next task was to find the frameworks that would help us materialize the dream of Realtime data. We decided on Kafka, the distributed publish-subscribe messaging system, pretty quickly.
Developed by engineers at LinkedIn, the makers of Kafka believed that sequential disk access can be sometimes faster than RAM! And they designed this robust and distributed high performance publish-subscribe messaging system.
Kafka uses ZooKeeper for coordination. On the production side, one can avoid going through ZooKeeper by simply putting a load balancer in between the producer and the brokers.
That’s what we decided to do at GumGum.
At consumer side the coordination is important and hence the zookeeper consumer base is a better choice unless you have a single consumer or your own coordination system.
Kafka created multipe streams of events. Now we needed some framework to process this stream. Initially we contemplated using simple Groovy scripts to consume Kafka events. We already have a system that consumes events from Amazon SQS. A server starts 10 of these scripts (and sometimes more). But there were many issues with the system. We didn’t have a mechanism to divide code into logical blocks. Additionally, there was no container for the scripts and there was no coordination system.
Meanwhile, we had started hearing about Storm from friends and the big data community. The very first presentation I attended at 2012 Hadoop Summit was Nathan Marz’s presentation on Storm. By then, I had started playing with Storm. The presentation was jam packed. People were already talking about storm being the next big thing in the big data world at the conference. I was sold. We decided to go for Storm.
Storm is especially designed for processing unending streams. It offers parallalized processing framework and has concept of Spouts and Bolts. Spout connects with Kafka, SQS and JMS-like systems to emit ‘Tuples’. These tuples can then be processed in one or more bolts. Each bolt is a processing unit. Bolts can be chained. You can process the tuple in a bolt and then choose to forward it in a different or the same form for the next bolt. You can chain as many bolts as you want. Since the bolt is a generic construct, you are not required to break your problem in constructs such as Map and Reduce. The best part is that Bolts offer flexible constructs for processing data. You can also do some fancy routing of the tuples with grouping and so much more.
Recently, Nathan Marz has introduced Trident – a higher level of abstraction built on top of Storm. Instead of thinking about bolts an spouts, Trident offers a higher level of abstraction that any competitor. You can have all sorts of operations such as Suming, Counting, Reducers and Aggregations, all by using Trident. GumGum has tried one Trident Topology so far, but I would consider it as a work in progress.
GumGum is currenlty routing approximately 100 million events per day through Kafka and Storm. Our internal customers have started using real time data and the topologies, along with the Kafka cluster are pretty stable. We have a three node (m1.small) Kafka cluster and a three node (c1.xlarge) Storm cluster. Both clusters are underutilized and we have plans to use Kafka + Storm in the future.
It has become very easy to add a new topic or new consumers for the same topic. And compared to MapReduce, development with Storm is much easier than ever before. You can debug your topology by running it locally! With Map Reduce, you cannot easily debug your jobs in eclipse unless you are using Karmaspehere or a similar tool.
I gave a talk recently at Los Angeles Hadoop Users Group describing all of the above in much more detail. The audience was enthusiastic and asked lots of really interesting questions.
You can watch the video here and the slides can be found here.
You can read much more about our big data experience at my blog.
Let’s us know what you think about this big data experiment…will you give it a go? Sound off in the comments below!