September 28, 2012

Hadoop! What is it good for? Absolutely ... everything!

In times of hysteria people tend to use their reptilian brain. This sub-brain, which has been with us since we were fish or tadpoles, is what kicks in when we face the unknown. In computer science and information technology, organizations tend to hold on to emotions and rely less and less on reasoning. Could it be because reasoning got them nowhere in the past?! This post is supposed to mock, and shed some light on, how we ended up with people saying that Hadoop not only counts words and indexes the web, but also does scientific research, washes the dishes and finds me my next hottest gadget by knowing I'm driving past a specific store.

Imagination is abundant

Medical doctors think IT is sci-fi. In my developer days, when I worked in the healthcare industry, I always had the feeling that these people saw computer systems as capable of doing the most amazing things. A nurse once asked me that, each time a specific document didn't get printed entirely, an email be sent to her with the pages that were left to print. Crazy stuff like this was abundant because somehow the expectations healthcare professionals had of information technology were huge. Today, under the umbrella of Big Data technologies, the most amazing use cases pop out of people's brains. It's not that my cellphone can't receive a text with a promotion code when I drive past a Starbucks joint; I get it, and it's not sci-fi to implement, but why would I care? Why, all of a sudden, is the big and flashy sign saying STARBUCKS not enough? Could it be because most people drive whilst looking at their smartphones instead of at the road ahead? It's possible, but still a long shot.

What can I achieve with Hadoop?

What can you achieve with a blank piece of paper? A blank canvas? The most amazing piece of literature, or the most amazing painting, or not. It's not about the canvas, but what we lay down on it. So when people ask me what can be done with Hadoop I struggle to answer with a straight face. It's not Hadoop and its ecosystem that I'm referring to; it's the core functionality of processing data using Map-Reduce (MR) programs. The MR processing algorithms are the ink with which you either paint or write the most amazing pieces of art, or just ruin the clothes you're wearing. So please let's concentrate on where to aim instead of the gun itself.
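To make "the ink" a bit more concrete, here is a minimal sketch of the famous word count written in the Map-Reduce style. It is plain Python running on one machine, not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration, and on a real cluster the grouping step is done by the framework, not by you.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (illustrative stand-in for the framework's shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce step: collapse all the values of one key into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop counts words", "Hadoop washes dishes"]))
```

The point of the shape is that map_phase and reduce_phase only ever see one line or one key at a time, which is what lets a cluster spread them across many machines. What you put inside those two functions is the painting.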

What are companies really doing with Hadoop?

Now that's more like it! Besides indexing large chunks of web logs, application logs and other logging-related data, Hadoop is being used as an inexpensive batch platform. Have a job that runs on top of hundreds of terabytes or even some petabytes? Throw all those bulky bytes into a Hadoop cluster and the "thing" will chew on them, doing whatever your batch is supposed to do. It's the same quest companies have been on for the last 15 years, trying to build an Enterprise Data Warehouse (EDW) by concentrating all the data in one place and doing whatever they wanted with it, be it aggregations or more complicated processing that generates analytic data to serve neat cool tools and boring reports. Reporting, ad hoc querying and data mining were done on this EDW, and this cost loads of money. Important information, powerful machinery, armies of consultants, and yet when the CEO asked a simple question, a new report had to be developed, taking weeks to incorporate into the process flow. The smart companies questioned all this and decided that, with a tenth of the money thrown at these infrastructures, they could now use Hadoop to offload most of the batch processing that ties up the EDW for hours, if not days. But they are starting with the simpler information.

How can we take Hadoop to the next level?

We are still far from that magic email that sends you the unprinted pages, but the Hadoop ecosystem has started to inspire the creation of technologies on other fronts:

  • Batch

  • Real-time event processing and Data Streaming

  • Machine Learning

The next level would be:

  • Batch process the volumes of data that EDWs are powerless against

  • Process Data Streams backed by low-latency stores (like HBase)

  • Ability to unveil new patterns in information in a recursive fashion (iterative learning)

Now we're talking! Now all those fantasy use cases don't seem so far away. The next level of Hadoop is to use its massive scalability to incorporate other types of processing that could benefit from it.


Why Batch?

Batch is back. The vision that "real-time" killed the "batch"-star can't handle the fact that massive amounts of data need to be constantly chewed in order to enrich decision engines, like recommendations in web marketplaces. The issue is that up until now most batch processing was done serially. The fact that Hadoop batching needs to abide by the Map Reduce model will make developers with this type of skill highly sought after, but the job market won't be able to deliver a fraction of what the demand will require. So if you have kids in school about to decide which master's degree to pursue, push them into math, natural language, artificial intelligence, or just cooking. Chefs are the new rock stars these days, alongside Data Scientists.

Data Streams or why real time is a fake challenge

Reacting to events in real time and processing streams of data are two different challenges, and yet people tend to throw them into the same bag. Event processing relies on very complex decision trees, rule tables, and other information that can be dynamic but takes some time to change. Adaptive event processing is only achieved in closed environments where the number of variables is not only small but controllable. Otherwise you're on your own, and that's a very lonely and powerless place to be when it comes to reacting to events. The layer that holds the rules engine of these event processing systems is where information is fed in, and the source is that shiny new batch world mentioned before.

Data streams, on the other hand, are a different playground. These are not events to which you have to react, but continuous flows of information that you need to either serve or redirect. A simple example would be the ability to record each and every push of the buttons on a TV remote control. The remote itself is the device that needs to handle the reaction, but this data stream coming from millions of sofas needs to be recorded for later processing in that shiny new batch world mentioned before. Where did I hear this?
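The remote control example can be sketched in a few lines. This is only an illustration under loud assumptions: the ButtonPressLog class and its method names are invented here, the "stream" is an in-memory list on a single machine, and a real deployment would land the events in some durable store before any batch job touches them.

```python
from collections import Counter


class ButtonPressLog:
    """Append-only record of button presses; nothing reacts at ingest time."""

    def __init__(self):
        self._events = []

    def record(self, device_id, button):
        """Streaming side: just store the event as it arrives."""
        self._events.append((device_id, button))

    def batch_presses_per_button(self):
        """Batch side: later, aggregate everything recorded so far."""
        return Counter(button for _, button in self._events)


if __name__ == "__main__":
    log = ButtonPressLog()
    for device_id, button in [("sofa-1", "vol_up"), ("sofa-2", "mute"), ("sofa-1", "vol_up")]:
        log.record(device_id, button)
    print(log.batch_presses_per_button())
```

Notice that record() makes no decisions at all. The stream is only captured, and whatever gets learned from it happens afterwards, in batch.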

Information Feedback

We all know the expression that "money generates money", and in the same way information generates more information. That is where the wealth of information lies; otherwise it becomes a pile of useless digital rubbish. This process can be done in three ways:

  • Create models to analyze data

  • Let the machine find those models

  • Trial and error and eventually stumble upon hidden information

The technical terms for these options are:

  • Data Mining or Data Science

  • Machine Learning, Natural Language and Artificial Intelligence

  • Data or Information Discovery

Now it's up to you how you will dig in. But remember that to do any of these you may or may not need a Hadoop cluster: it helps only if your data is already there, if the program is written in Map Reduce and if you have loads of data.

Most of the time, at this stage the data is not in Hadoop anymore because it needed to be mixed with other transactional data, and it's sitting in some relational or multi-dimensional engine somewhere. The probability of having the program written in Map Reduce is very low, because we are not counting words; we are messing with some complex analysis, and this is out of Hadoop's league. What? Mahout? Not there yet, though it's something to keep in the back of your mind.

Having loads of data at this stage is not rare, but having data that can easily be loaded into the Hadoop cluster is more unusual.


Wow! This post has a conclusion! Not that I'm going to tell you something new, but at least it is based on the common sense of 2012. The Sun circled the Earth for centuries, and all of a sudden guys like Copernicus and Galileo came along and things changed. So what is true of Hadoop usage might change with time, but up until now Hadoop is not supposed to be used for everything and everywhere. It has a place, and it's gaining important territory thanks to the economic outlook and the fact that x86 technology is increasingly powerful and cheap. The blind spots are still the same:

  • Lack of Map Reduce literate people

  • Lack of maturity of this technology

  • Lack of Data Scientists

I think that once some of these constraints start to unlock, Hadoop usage will be more widespread, but that does not mean that other technologies will perish. Or does it? Is there space for everyone?

Put down your comments below if you want to brainstorm these ideas.







  1. There's no data mining or data science without machine learning, so I'm not sure what you mean that you can go with one or the other.

  2. Claudio: it's an honor to have you here on my blog. The options mentioned above are not exclusive but cumulative. The way I mentioned Data Mining and Machine Learning is very much aligned with the work of Ian H. Witten, Eibe Frank, and Mark A. Hall: Data Mining helps you find patterns in historical data, whereas by using Machine Learning in that process (the next step) you will be able to predict possible outcomes in new situations. On the other hand, forecasting is also part of the Data Mining iterative learning process, and this field of data science has lots of different implications depending on each person's experience. I used the flattest vision of Data Mining, one of static models, with Machine Learning as the next level of Data Mining that incorporates dynamic features. It's probably an oversimplification, but it helps IT people who are not proficient in these matters to box up the building process and its increasing complexity as you move from low-density data to richer data.