Many IT departments are being pressured to deliver "real-time" access to the data inside their huge silos, simply because the data warehouse systems lag in delivering cleaned and aggregated data. Well, here is the first of many concept and expectation gaps I will try to address in the next paragraphs.
Reporting's not dead. Readers are.
Every company needs reports. Some reports circulate in closed groups, others are public, and others are internal but widespread in the organization. The sole purpose of the first waves of reporting automation was to get the calculations done right and to enable on-demand printing. Yes, even after digital reporting was invented, all that mattered was how it would look on paper. The reports that weren't destined to waste paper and precious ink were actually used for decision making. A Chief Marketing Officer (CMO) only wanted to know the results of a campaign so she could better allocate next year's budget for financial forecasting. The CFO, in turn, was an avid paper consumer, but critical information was manipulated neither on paper nor in reports: the CFO and his entourage used (and still use) personal-computing spreadsheets, something the enterprise computing world simply didn't have an alternative for.

So we could say that central systems, where data was kept clean and aggregated, became a growing need after many custom application developments went wrong in the reporting area. The ugly term "Data Warehouse" (DW) was born to tackle data manipulation at a low level, and after it the more stylish Business Intelligence (BI) wave came along to help create a common language for automating the business definitions of key performance indicators (KPIs).

Today the world has changed tremendously. Most of these challenges remain, but the decision-making process has very little time to analyse data in detail. No one reads reports at the beginning of the decision process, simply because by the time the DW/BI systems finish computing them, the decision is already half made. This is why the enterprise world is talking about real-time DW/BI. They think this might solve it. It does, but real-time is not right-time, and right-time is what they really mean.
New tools, old problems
If I had a euro for each time DW/BI personnel asked me: "So what changes in the DW/BI space with Big Data?". Big Data started as a name for "New Data", so it was not an issue for companies that weren't incorporating New Data. New Data has traditionally never been part of any DW, except at a few internet companies like Yahoo, Amazon, Google and Facebook. These companies opened their pockets, together with VC funding, and helped create new technology to tackle New Data challenges. Still, nothing was changing at the traditional DW/BI coastline. Like the internet technologies that were funded by public-sector research money, these new technologies could easily have spent 10 or 20 years incubating in universities before they reached enterprise computing. But 2006 is not 1976 or 1986, or even 1996 for that matter. In 2006 the model was different:
- Internet giants make money from data
- Data-driven technologies are not ready (i.e. not yet financially viable)
- New paradigms need to be developed by the scientific community
- New technology takes a long time to develop with public funding
- Internet giants make money from data... and can't wait
- Internet giants create funds, VCs help, and so Big Data is born
How will this affect DW/BI systems? In the following years, the lag of reporting calculations behind right-time started to be addressed by these new technologies, and new paradigms became a real possibility for developing solutions to the "old" DW/BI problems.
Bear in mind that companies like IBM, Oracle, and Teradata already had much of the installed base for DW, but the BI landscape was much more fragmented. One of the things companies started using to solve the operational inefficiencies of the DW was the Operational Data Store (ODS). This would be the future sweet spot for Hadoop. But Hadoop is not real-time. Wait in line. Right-time is solved for now.
What is this thing called "love for real-time"?
Real-time is business lingo for "responsive systems giving information that was true X seconds ago". In IT terms, real-time means the lights are on and all green. So when business people started funding their own apps, the kids had that definition engraved in their minds, and they used whatever tools could deliver the latency of a guided-missile control system. As usual, the devil is in the details, and in this case the detail was "X seconds ago". They started to compromise on the "X factor" in exchange for split-second response times. Database is slow? We'll use another database. Tune it? Nah. Don't bother. Data models for these apps are stupid simple, so why model? The combination of all these factors gave birth to a sprawl of new databases. Because they didn't speak SQL, all sorts of names were invented (NewSQL, NoSQL, IdontgivearatsassaboutSQL, etc.). This was the quest for real-time. I press a button, stuff happens. Now!
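The "X factor" trade-off is easy to make concrete: what matters is not how fast the system answers, but how stale the answer is allowed to be. Here is a minimal sketch (hypothetical code, not from any product) of a freshness check that separates the two:

```python
import time

def is_fresh(record_timestamp: float, max_staleness_s: float) -> bool:
    """True if the record was true no more than max_staleness_s seconds ago."""
    return (time.time() - record_timestamp) <= max_staleness_s

# A "real-time" dashboard might accept X = 5 seconds of staleness;
# a nightly-loaded DW effectively accepts X = 86400 seconds.
now = time.time()
assert is_fresh(now - 2, max_staleness_s=5)         # 2 s old: within a 5 s budget
assert not is_fresh(now - 3600, max_staleness_s=5)  # an hour old: fails the budget
```

A system can be blisteringly fast at returning hour-old data; right-time means picking the X your decision actually needs.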
Pick your hammer first, we'll give you the nails after
So we had Hadoop technologies and NoSQL databases growing like mushrooms. None of them mature, none of them adopted by experienced DW/BI personnel, all very strangely managed by a developer-centric culture, and the folks at the data center still cabling storage, servers and switches, not knowing what was about to hit them.
The ones that picked NoSQL for new apps had to deliver their data to DW systems as a new source, but the volumes clashed. A DW is typically a big system, but NoSQL databases can be much bigger, so some selection and filtering needed to be done. No worries. Throw it all into Hadoop (your brand new ODS) and then push it further into the DW. What a brave new world this was. No one really cared about the plumbing and piping. Soon connectors of all sorts were announced, and all of a sudden SAP, SAS, Oracle, HP, Informatica, EMC, NetApp and all the other big IT vendors had "connectivity" from their technologies into these strange new systems.
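That selection/filtering step between the NoSQL export landed in the ODS and the DW load can be sketched in a few lines. This is a hypothetical illustration, assuming JSON-lines event dumps and an invented event schema, not any vendor's connector:

```python
import json

def filter_for_dw(raw_lines, wanted_types):
    """Keep only the events the DW models, projecting just the columns it loads.

    raw_lines: iterable of JSON strings (e.g. a NoSQL export landed in the ODS).
    wanted_types: the event types the DW fact table actually cares about.
    """
    for line in raw_lines:
        event = json.loads(line)
        if event.get("type") in wanted_types:
            # Project only the fields the DW schema needs; drop the rest.
            yield {"id": event["id"], "type": event["type"],
                   "amount": event.get("amount", 0)}

raw = [
    '{"id": 1, "type": "purchase", "amount": 9.99}',
    '{"id": 2, "type": "page_view"}',
    '{"id": 3, "type": "purchase", "amount": 5.00}',
]
rows = list(filter_for_dw(raw, {"purchase"}))
print(rows)  # only the purchase events survive the trip to the DW
```

The point of the ODS is exactly this: absorb the full volume cheaply, and let only the modelled slice flow downstream.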
But adding more pipes makes this real-time "thang" harder. So what's next?
Store and Stream
Real-time is not streaming, as the data moves in different ways. Real-time needs some proactivity, while streaming is purely reactive. This holds in the context of enterprise computing; in the web world, streaming is just another technique for implementing real-time web services.
From the last paragraphs you can tell that the enterprise world is moving fast towards a wall with very few holes in it. Most of the "fast movers" will hit it hard, the market will "mature", and others will slip through the holes by deflating. The tech world is not a mess, but it is run by generational waves. They used to be rather harmonious in terms of continuity, but the speed at which things move today creates all sorts of overlaps. This means that for the same problem there will be several solutions, depending mainly on the maturity, experience and generation of the person answering.
In the enterprise world, streaming is associated with event processing, whilst real-time is associated with the response to an event.
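The reactive/proactive split can be shown with a toy sketch (hypothetical code, not any particular framework): the streaming side simply reacts to whatever events arrive, while the real-time side must answer a request within a latency budget.

```python
import time

# Streaming: purely reactive -- process events as they arrive, no deadline.
def process_stream(events, handler):
    for event in events:
        handler(event)  # react to each event; throughput matters more than latency

# Real-time: proactive response -- answer a query within a latency budget.
def respond(query, lookup, budget_s=0.1):
    start = time.monotonic()
    answer = lookup(query)
    elapsed = time.monotonic() - start
    return answer, elapsed <= budget_s  # did we meet the deadline?

seen = []
process_stream(["login", "click", "logout"], seen.append)
answer, on_time = respond("user_42", lambda q: {"user_42": "active"}.get(q))
print(seen, answer, on_time)
```

The stream handler never promises *when* an event gets processed; the real-time responder promises nothing about *which* events exist, only that it answers fast when asked. Mixing the two up is how "real-time" projects end up streaming.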
Putting it all together
By now the world can agree on some notions that help choose the right tool and technology:
- DW/BI systems gain a lot from incorporating Hadoop, moving from wrong-time to right-time
- New Data needs new technology not because of hype, but because it comes in new formats
- The new solutions are Hadoop, Spark and NoSQL databases
- A system based on the new solutions alone will have to walk alone in the zoo, or attach to the existing stuff
- For real-time you need new stuff
- For streaming you need a mix of new and existing stuff
PS: As always, these are not the views of my employer and I take full credit for them, having tried to be as unbiased as I can throughout the previous lines. The same unbiased mindset is needed to read them.