My eye was caught the other day by a question posed to the “Big Data, Low Latency” group on LinkedIn. The question was as follows:
“I’ve customer looking for low latency data injection to hadoop . Customer wants to inject 1million records per/sec. Can someone guide me which tools or technology can be used for this kind of data injection to hadoop.”
The question itself is interesting, given its assumption that Hadoop is part of the answer – Hadoop really is the new black in data storage & management these days – but the answers were even more interesting. Among the eleven or so people who responded to the question, there was almost no consensus. No single product (or even shortlist of products) emerged, but more importantly, the actual interpretation of the question (or what the question was getting at) differed widely, spinning off a moderately impassioned debate about the true meaning of “latency”, the merits of solid-state storage vs HD storage, and whether to clean/dedupe the data at load-time,or once the data is in Hadoop.
I wouldn’t class myself as a Hadoop expert (I’m more of a Cosmos guy), much less a data storage architect, so I may be unfairly mischaracterizing the discussion, but the message that jumped out of the thread at me was this: This Big Data stuff really is not mature yet.
I was very much put in mind of the early days of the Web Analytics industry, where so many aspects of the industry and the way customers interacted with it had yet to mature. Not only was there still a plethora of widely differing solutions available, with heated debates about tags vs logs, hosted vs on-premise, and flexible-vs-affordable, but customers themselves didn’t even know how to articulate their needs. Much of the time I spent with customers at WebAbacus in those days was taken up by translating the customer’s requirements (which often had been ghost-written by another vendor who took a radically different approach to web analytics) into terms that we could respond to.
This question thread felt a lot like that – there didn’t seem to be a very mature common language or frame of reference which united the asker of the question and the various folk that answered it. As I read the answers, I found myself feeling mightily sorry for the question-poser, because she now has a list as long as her arm of vendors and technologies to investigate, each of which approaches the problem in a different way, so it’ll be hard going to choose a winner.
If this sounds like a grumble, it’s really not – the opposite, in fact. It’s very exciting to be involved in another industry that is forming before my very eyes. Buy most seasoned Web Analytics professionals enough drinks and they’ll admit to you that the industry was actually a bit more interesting before it was carved up between Omniture and Google (yes, I know there are other players still – as Craig Ferguson would say, I look forward to your letters). So I’m going to enjoy the childhood and adolescence of Big Data while I can.
I come from a more traditional branch of analytics (applied stats, now being applied to customer engagement and product strategies), and am facinated by some of this “big data” discussion. I am less interested in the technology aspect of it, and more in how I might use this data. From my little research, it looks like this is a “big on promise” area as of now, with the true value to business yet to be proven. Do you agree? Are there success stories where businesses have actually realized the value out of it?
Makul,
I agree that there is a lot of hype around Big Data at the moment. But I do think that there are some companies that are deriving value from it – Microsoft being just one example. As with web analytics about 10 years ago, though, I think that the resources required to leverage Big Data today are still quite considerable. I expect to see the emergence of lower-cost and lower-complexity solutions in the future, which will make it easier for a broader range of organizations to participate.
Cheers,
Ian