January 30, 2013

Big Data: stop the panic, start the project. Do's and Don'ts.

The question is no longer whether you have a Big Data strategy; the challenge is how you are going to roll it out. What layers of your Information Systems will change? And most importantly: how will you align expectations at all levels of the enterprise?

Big Data in the enterprise is not about hype, and I will try to show you why. You can leave the panic state now. It's OK to be calm.


From 3-Tier to N-Tier Architecture

If you were not working in the IT industry when the death of client-server architecture was announced by a brand new 3-tier architecture, you can skip this part. If, on the other hand, you never completely recovered from that shock, you can skip the whole article and go back to your mainframe vision of life, because you apparently ended up on the wrong blog.

If you did live through that change in the industry, you might recall that the new layer arriving in the reference architecture for computing systems was the Logic Tier, sitting between the Presentation Tier and the Data Tier. Looking back now, this was the beginning of the end for data-centric systems, because they would never cope with the data deluge of our days. Pushing the Logic away from the Data created challenges that are still hard to deal with today. If you look closer, the Data Tier already had challenges of its own when the storage layer got pushed out to the network: massive SAN boxes, pushing data even further away from the logic. This architecture would grow up and mature into fantastic products across the board, but it would never solve the underlying problem: each time heavy processing needs to be done, data has to be moved around in massive amounts, using a plethora of protocols and caching layers that would enrich any performance consultancy out there. As this Wikipedia article states: "Data transfer between tiers involved protocols like SNMP, CORBA, Java RMI, .NET Remoting, Windows Communication Foundation, sockets, UDP, web services or other standard or proprietary protocols." A day in the life of a data packet in the network of any 3-tiered system is pretty busy and dizzy. So what could be done differently today for data-centric systems?



What about High Performance Computing (HPC)?

Supercomputers used to be sexy and were even talked about in the news. Now the general public just doesn't care and assumes that any computer with lots of blinking lights is a supercomputing system, and some just think that all their Facebook posts are handled as if by magic by "something" out there. In other words: supercomputing is not just taken for granted, it is assumed to be a commodity, like a utility. But the people working in the IT industry know that enterprise data centers are still far, far away from being supercomputing centers, although the challenges knocking on the door are high performance computing ones. And all of a sudden HPC, or supercomputing, becomes the talk of the town, because everyone wants to make sense of the data deluge. So we use the most basic terms the common man and woman would use to describe it: it's big and it's about data, so it should be called "Big Data". The good folks at any HPC manufacturer like Cray Research, or the nice people at CERN, would say they were born doing it. Every Big Data use case people talk about today, absolutely every one of them, could be addressed by HPC technology, and some of them already were. So why the need for other technologies like Hadoop?

The New Reference Architectures

The Use Case defines what type of data architecture you should use. These are the new reference architectures:

  • Low Latency Systems

  • Pattern Recognition

    • Real time event processing

    • Proactive behaviour prediction

  • Data Science as a Service

Low Latency Systems

If I press any action button in a mobile app and nothing happens, or if a page takes forever to load, that's it. I'm done with the app or the site. As extreme as this behavior might sound, the new touch interfaces that are here to stay make low latency interaction a mandatory requirement for any application. And I'm not talking about User Experience yet, just the responsiveness of the app or website. The cloud model clearly helps speed up the deployment of these types of systems, but it's the processing architecture underneath that should be highlighted as a reference. You might say that any user interface is a Low Latency System, and you are probably right, but please keep it to yourself because I'm not sure the world needs to know. Plus, the world is not expecting it yet at all levels of interaction. But it will, and it's going to hit hard.


In the example shown above, the browser interacts with an app/service infrastructure that reads and writes an in-memory NoSQL database and tracks all its events to a Hadoop cluster. This Hadoop cluster may or may not write to a relational repository; that part is not mandatory. The NoSQL layer also controls the metadata of the events being captured into the Hadoop cluster. This whole system, or part of it, can run on the cloud or not, but from a user experience point of view it's definitely "out there somewhere". So the experience should be great in terms of responsiveness in either case.
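For the code-minded, here is a rough sketch of that flow. It is only an illustration under loud assumptions: a plain dictionary stands in for the in-memory NoSQL database, a list of JSON lines stands in for the Hadoop cluster, and the class and method names (`LowLatencyBackend`, `read`, `write`) are made up for this post, not taken from any real product.

```python
import json
import time


class LowLatencyBackend:
    """Toy stand-in for the app/service tier (all names are hypothetical)."""

    def __init__(self):
        self.store = {}           # stands in for the in-memory NoSQL database
        self.event_log = []       # stands in for the Hadoop cluster (append-only)
        self.event_metadata = {}  # event type -> count, kept in the NoSQL layer

    def _track(self, event_type, key):
        # Every interaction is appended as one JSON line for later batch processing.
        record = {"ts": time.time(), "type": event_type, "key": key}
        self.event_log.append(json.dumps(record))
        # The fast layer also keeps the metadata about what has been captured.
        self.event_metadata[event_type] = self.event_metadata.get(event_type, 0) + 1

    def write(self, key, value):
        self.store[key] = value   # served from memory, no disk round trip
        self._track("write", key)

    def read(self, key):
        self._track("read", key)
        return self.store.get(key)


backend = LowLatencyBackend()
backend.write("cart:42", ["white socks"])
cart = backend.read("cart:42")
```

The point of the split is that the user only ever waits for the in-memory layer, while the event trail piles up on the side for whoever wants to crunch it later.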

Pattern Recognition


A Pattern Recognition System is one where you intend to identify a certain behavior, either in real time or later on, over a broader scope or universe of events. Location based applications are an example of the former, and online shopping recommendations are an example of the latter. These systems do two things: learn and act. They learn to know when to act and whom to act upon. A real time example in use today is the telco that predicts you're about to leave the country without a good roaming data plan to let you use the internet abroad on your smartphone or tablet. It's not real real time, but there's a pretty narrow window to act here. If the warning comes too soon the target customers won't care, but if you're warned when you have already landed on the way back, it's ridiculous and it can harm the brand.
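The "act" half of the roaming example can be sketched in a few lines. This is a minimal illustration, assuming the "learn" half has already produced a predicted departure time; the function name and the window limits (three days and one hour before departure) are hypothetical values picked for the example, not anything a real telco told me.

```python
from datetime import datetime, timedelta


def should_send_roaming_offer(now, predicted_departure, has_roaming_plan,
                              earliest=timedelta(days=3),
                              latest=timedelta(hours=1)):
    """Return True only inside the window where the offer is still useful."""
    if has_roaming_plan:
        return False  # nothing to sell, so do not bother the customer
    time_left = predicted_departure - now
    # Too early: the customer won't care. Too late (or already gone): it harms the brand.
    return latest <= time_left <= earliest


departure = datetime(2013, 2, 10, 9, 0)
in_window = should_send_roaming_offer(datetime(2013, 2, 8, 9, 0), departure, False)
too_soon = should_send_roaming_offer(datetime(2013, 1, 20, 9, 0), departure, False)
too_late = should_send_roaming_offer(datetime(2013, 2, 12, 9, 0), departure, False)
```

With these assumed limits only the first call falls inside the window; the other two are the "won't care" and "ridiculous" cases.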

The recommendation engine is actually another example, but not the cheesy "if you bought white socks you might want black socks". It would be more along the lines of: "if you bought white socks, you might want to buy these lightweight axes because you strike me as the serial killer type of guy". It's about knowing the thought process behind a purchase that makes other purchases more likely, instead of a boring same-category rules engine. You'll need some geeky people here, and visualization engines as well. An example shown in the doodle is the behavior that some cells demonstrate that can lead to the detection of early signs of cancer. This architecture can be scaled to be used in many other cases where data hides signatures. Although I think cases like social uprising forecasting are still in the works :-)
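Going back to the socks for a moment: the simplest way to get past a same-category rule is to count what actually gets bought together. The sketch below does only that. It is a crude stand-in for understanding the "thought process" behind a purchase, the baskets are invented, and the function names are mine.

```python
from collections import Counter
from itertools import permutations


def co_purchase_counts(baskets):
    """For each item, count how often every other item shares a basket with it."""
    counts = {}
    for basket in baskets:
        # set() so that a duplicated item in one basket is only counted once
        for item, other in permutations(set(basket), 2):
            counts.setdefault(item, Counter())[other] += 1
    return counts


def recommend(item, counts, top_n=1):
    """Return the items most often bought together with `item`."""
    return [other for other, _ in counts.get(item, Counter()).most_common(top_n)]


# Invented purchase history, for illustration only.
baskets = [
    ["white socks", "lightweight axe"],
    ["white socks", "lightweight axe", "rope"],
    ["white socks", "black socks"],
]
counts = co_purchase_counts(baskets)
suggestion = recommend("white socks", counts)
```

Here the axe wins over the black socks simply because it shows up in more baskets with the white socks, regardless of which category either item belongs to.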

Data Science as a Service


Most of the systems that companies use today were not designed to incorporate non-deterministic data (no, I'm not talking about unstructured data. That "thing" does not exist outside the digital world!). This means you can't ask them questions that sit outside the function they were designed for. Somehow this has become very frustrating today, especially for a generation that is not used to memorizing a simple URL like "youtube.com" or "facebook.com". Kids first go to Google and type the first letters of the website they want to go to, then they use the guided search and take it from there. Are kids dumb? You bet they're not, just lazy. They need the system to tell them what they want with minimal input from their side. Most enterprise systems are the absolute opposite of this. That's why they will need to power up and enable intelligence at all layers of discovery. This is a Data Science as a Service type of architecture, where each agent shares a service registry that is powered by an advanced analytic engine, allowing decision makers to see behind the cumbersome reports and stats. You'll get the picture once you see the doodle, and you won't need an intelligent system to tell you that I was too lazy to create a digital version with drawing software. It was either the doodle or this blog post would never have seen the light of day.

The sharp reader might argue that today's enterprise decision makers are not kids, so the systems can be a bit more complex and do without the intelligent assistance. I couldn't disagree more. Decision makers focus on their line of business and have neither the time nor the patience for cumbersome interfaces that don't learn and force them to do repetitive tasks.

The never ending search for Meaning

Just as beauty is in the eye of the beholder, meaning might be held differently depending on the type of user interacting with a system. If you go back to the iconic Star Trek TV series or the first film of the Alien saga, it was usual for the guy at the helm of the spaceship to look at a screen filled with numbers and say something like "Captain! We have 32 seconds until impact!". Jesus! How could he know that just from looking at a screen filled with green numbers on a black background? Meaning today needs to be more self-explanatory. A bit more :-) Almost "in your face" for some business analysts. The various vectors of data that allow for the search for meaning come from:

  • Streamed data (social, sensors, pouring data)

  • Public databases (maps, historical facts, etc)

  • Science data (medical, biology, SNA, etc)

  • Transactional Data (aggregated or in detail)

If you add metadata and version management the task is huge. It's in the realm of HPC but needs to obey the architectural principles of data accessibility, security and all the other usual ones, plus be financially sustainable.

Even today meaning is not the same for every LoB, so when you say "deliver meaning for business" it depends very much on which department you are trying to deliver value to. Operational and Manufacturing systems are all about process optimization and excellence. In Retail, the Supply Chain Management solutions have a bigger challenge since they depend on loads of external systems and entities, and so on and so forth.


So there is no meaning without a clear definition of what meaning is. And once you have that definition, the moment has passed. Gone. So you need to change your strategy to deliver meaning to the business through the mining of data. You might have the Big Data architecture in place, but without a strategy and a roadmap to implement it you could lose the moment. Or worse: lose the money you asked for this brand new Big Data project. You wouldn't be the first nor the last, but there is no need to join them if you change some dogmas:

  • Gather is not hunt

  • Don't get caught in the recoil

  • Data generated from business is not what business wants

Gather is not hunt

Einstein is probably one of the most quoted people today, and so I'll make use of his genius once more: "Everything should be made as simple as possible, but not simpler". In this case I mean that meaning will not come from the simple fact that you can acquire, filter and organize all those streams of data the business asked you to. To make any IT team's life simpler, manufacturers are following the trend of the appliance model, but tools, especially pre-integrated ones, will not make the task simpler. For some businesses this is a bit of a chicken and egg problem: "Should I invest in a data science practice before I have the real volumes of data available, or should I start gathering those volumes and then figure out what to do with them?". Volume versus deeper analysis. Can you have both at the same time? Yes, but I doubt that they would ever be aligned, so please keep it clear that gathering is not hunting. Gathering data is not the same as hunting for meaning.

Don't get caught in the recoil

The moment anyone starts messing with these new Big Data technologies and realizes that stuff that took ages to achieve in the past is now one click away, there's the danger of suffering from the recoil. Meaning that awe is not your final destination. I may sound like a broken record, but once that hose is pumping data all over the place, it will be easy to forget that in the end the whole thing needs to achieve its business goals. I could have named this point "Don't get caught in the same mistakes as all those failed Data Warehouse projects did", but then I might be the target of hate mail from all those hordes of people that managed to achieve complete business success with their DW projects. Hello? Anyone?

Data generated from business is not what business wants

It seems that I came up with a third title to explain the same thing, but rest assured that these are three different pieces of advice. In the first I advise against choosing Big Data as an IT project over a Data Science one, or the other way around; in the second I advise against staying in awe once any of these parts achieves its goal. Now it's time to pass the baton on to the business analysts so the Big Data project delivers on its promises. They are the ones the business turns to in search of answers, or even of new questions that have never been posed. You might have a very well oiled machine of data acquisition, cleansing and storing. You might even have the most advanced algorithms to mine your data, but if these good folks can't take advantage of all that and monitor the effects of feeding the business with the wrong data analysis, then the whole thing is not worth much.

Putting it all together - Final Thoughts

If you came this far, I'm worried. But if on the other hand you managed to skip the text that precedes these final thoughts, then I need to find something useful to deliver to you now. Let's see if I can impress you:

- Every system is a data gathering system, especially transaction based ones. Data is your new gold and it should be sent as fast as you can to the place where it will be processed. Data meets logic... again.

- High Performance Computing is expensive. Data crunching is the new commodity in IT, so HPC-like processing needs are popping up all over the place and you need to face this reality. Call it Big Data. I call it VLDBs for the rest of us. Kind of a Festivus. Going off the rails here.

- There are new data management reference architectures to address your Big Data challenges. It means you shouldn't make a wheel out of a square anymore. It also means you shouldn't go out there and start buying so-called Big Data packaged software for specific functions just because you can't make wheels out of squares. That bunch of in-house developers you have under the stairs should start shaping up some NoSQL and MapReduce skills. You fired them all? They quit? Bring the science back to your premises.

- Finding meaning in data has lots of pitfalls and caveats. Meaning is like knowledge: you don't know what you don't know, so how do you tackle those blind spots? Distinguish between the fight and the prize. Focus on the prize.

Still too vague or confused? Guess you will have to read the whole text after all... Bummer.

** Disclaimer: I'm an Oracle employee and these are my views. They do not necessarily reflect the views of my employer **
