"It's the analytics, stupid!"
Obviously no offense is intended toward the dear reader. It's a wake-up call for all the people excited about Hadoop who lack BI vision. BI people who lack infrastructure vision are also to blame. Blame for what? We'll see later in this text.
Traditional BI is an established practice that never completely delivered on some of its promises. Real-time, pervasiveness, and 360-degree views were simply regarded as "nice to haves". Not anymore. Those are now mandatory requirements.
Technology advances like Hadoop are being used left and right, in all industries, to tackle these challenges. Hence the need for a vision of how the Hadoop ecosystem and related Apache projects can shape the next generation of BI systems.
New Data, More Data and Fast Data
In a nutshell, these are the three vectors. New Data is about incorporating data external to your organization, either through a broker or a public API. The More Data bit has been explained ad nauseam, and by now everybody is pissing their pants over the approaching data deluge. Humans are great at creating panic to sell stuff. Fast Data is the interesting part: not only do companies need to pull the needles from the haystacks, they now need to act on those results at the right time or lose the opportunity.
Admittedly, I'm writing this article inside a plane, surrounded by people who feel their lives are on hold when offline. The feeling is so ridiculously widespread across all levels of society that each citizen is now a source of data. Citizens make up organizations, buy products from companies, and use services from businesses. If citizens change the way they interact with the world, you shouldn't expect businesses to demand any less from their Business Intelligence systems, so get your act together and start acting on these three vectors.
This is where IT personnel are to blame for the lack of agility. Technology departments have a responsibility to adopt the latest technological advances for the benefit of the business.
A slow pace in adopting open source technologies is no longer tolerated by a market bleeding for new solutions.
Let's go deeper now.
Hadoop as a one-stop shop for low-density data
Obviously Hadoop handles all sorts of data, but it lacks sophisticated metadata management. That's not a limitation that will prevail, but it exists today and is unlikely to change inside Apache Hadoop in the near future. If you're dealing with low-density data, though, Hadoop becomes an almost mandatory element under the covers of any BI system. Examples:
- Machine-generated data
- Self-describing data in XML, JSON, or similar formats into which web servers and other applications dump their logs
- Auto-indexed information: consume-it-as-it-comes data streamed from sensors, or any source that generates data that must be acquired sequentially
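As a rough illustration of the schema-on-read style this enables, here is a minimal Python sketch that lands self-describing JSON log lines without forcing a schema up front. The field names are hypothetical, and a real pipeline would write to HDFS rather than a list:

```python
import json

def parse_log_lines(lines):
    """Parse self-describing JSON log lines, schema-on-read style.

    No schema is imposed up front: well-formed records are kept as-is,
    and malformed lines are retained for later inspection instead of
    being dropped, since in a Hadoop-style store raw data is cheap.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except ValueError:
            # Keep the raw text so nothing is silently lost
            records.append({"_unparsed": line})
    return records

# Illustrative input: one valid web-server log record, one garbage line
raw = [
    '{"ts": "2013-05-01T10:00:00Z", "host": "web01", "status": 200}',
    'not json at all',
]
parsed = parse_log_lines(raw)
```

The point of the sketch is the design choice: extraction and cleansing are deferred to query time, which is what makes Hadoop a natural landing zone for low-density data.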
The potential innovations this can bring fall within the remit of More Data and New Data.
Hadoop as a persistence store, extending your BI timeline
"Big" is always a relative adjective in information management. When systems administrators get caught out by high volumes, it's seldom due to a capacity problem. When you remove that limitation, or push it forward into the arena of gigantism, new possibilities arise. Persistence and timeline are the main advantages.
Examples of what persistence can bring to BI:
- Travel in time: not only to the past, by enabling a longer timeline, but also to the future, by using predictive technology
- Usefulness of the actual solution. Example: some video surveillance systems record only a very limited amount of information, and they do so to offline recording systems. If the speed at which this information can be accessed grows in proportion to the amount of data retained, the solution can finally serve its original purpose.
- Decrease costs dramatically in cases where the business's Information Lifecycle Management (ILM) requirements are not realistic in terms of return on investment. Example: Call Detail Records (CDRs) in the communications industry are an ever-growing source of headaches for Network Operations teams, but also at the Business Support Systems (BSS) level. Keeping more historical CDRs online, splitting them between HDFS and relational databases, may unlock a whole new set of possibilities and potentially transform a major problem into a new suite of business solutions.
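To make the HDFS/relational split concrete, here is a minimal Python sketch of an age-based routing rule for CDRs. The 90-day cutoff, the field names, and the partitioned path layout are illustrative assumptions, not a prescription:

```python
from datetime import datetime, timedelta

def route_cdr(cdr, now, hot_days=90):
    """Decide where a CDR lives in a tiered BI store.

    Recent records stay in the relational 'hot' tier, where operational
    BI queries them; older records move to date-partitioned HDFS paths,
    keeping the full history online at low cost.
    """
    age = now - cdr["call_time"]
    if age <= timedelta(days=hot_days):
        return "rdbms"  # hot tier: recent, frequently queried
    # Cold tier: e.g. /data/cdr/year=2013/month=04/ for batch analytics
    return "hdfs:/data/cdr/year={:04d}/month={:02d}".format(
        cdr["call_time"].year, cdr["call_time"].month)

now = datetime(2013, 6, 1)
recent = {"call_time": datetime(2013, 5, 20)}
old = {"call_time": datetime(2012, 11, 3)}
```

The design choice worth noting is that both tiers stay online: the relational database keeps serving its ILM-mandated window, while HDFS absorbs the long tail that would otherwise be archived offline or deleted.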
Hadoop, NoSQL and Open Source as the sole source for BI
This is a nightmare for some companies and the wet dream of others. Companies like Platfora, 10gen, Karmasphere, and Datameer are aligning VC dollars in that direction. On the other hand, the giants of the IT industry simply don't want to miss the chance to be represented in this market, carefully investing in ways that defend the installed base without cannibalising proprietary technology, while protecting customers' investment plans in their current BI systems.
This might be the right approach for low-budget greenfield projects, but bear in mind that some gold-rush addicts will bring it to the table at every opportunity, with the excuse that it's time to create the Microsofts, Oracles, and EMCs of the future.
The main driver for this approach will be cost. But we all know what happens when solutions are driven by cost alone.