April 21, 2011

It's All About Architecture: the Data Warehouse building blocks poweredby OLAP and Data Mining

It's All About Architecture...


First of all if you're working for years in the bi/dw arena or if you just got here you must read this paper: http://bit.ly/pervasive-bi. This paper will give you an idea of the "Reference Architecture" and how all the different concepts fit together, without even touching products. Each layer addresses a different phase in the development process of a DW solution and this is where this architecture adds value specially for those who tend to only care about the creation of complex snowflake structures.

So my first advice would be: decouple Technology from Information Management Architecture. This way you'll have a clean and clear vision of what you goals and challenges would be, that later on you can address using technology. But most of all, the architecture provides a robust logical view of an enterprise IM Solution, making your futures changes independent from the tools features.

The main lessons from the "Reference Architecture" from an infrastructure point of view would be:

- Every system that creates information you want to extract some reasoning out of, is a source. And sources change over time, they merge, there will be new sources, but still

- You should get in place a way to help out giving a single view of every common entities in your systems (for instance CUSTOMER, PRODUCT, etc, ...). Not necessarily using a Master Data Management (MDM) solution, but if you do, please treat it like a source too, and don't bring that burden into the DW.

- Information should be pumped straight into the staging layer using low impact mechanisms such as drip feeding, micro-batching and/or non-intrusive methods such as off-line logs reading. Quoting Doug Cackett here: "Big batch, often nightlyh batch runs are a thing of the past.  The obvious way to get around shrinking batch window problems is simply not to have a batch window at all – drip the data into staging and prepare the data even if you don’t want to refresh the DW more regularly.  Sooner or later someone will ask for “right-time” analytics so you might as well design for it today!"

- Once in the staging layer information will be in the same format as in the sources, but should be kept clean before feeding the Foundation Layer. This is a iterative task that would go for as many iterations as it takes in order to meet the quality standards imposed in the system, or just simply because information has dependencies and references.

- In the staging layer you will need a system that can enables you three tasks at the same time: reading, cleaning and pumping information. This can only be accomplished if you have Read consistency, a key feature that will enable reading and writing information at the same time.

- Pushing information down to the next layer should be as fast as it can be. And what faster mechanism is there than not even pushing it at all? Meta-data transporting is a key technique in this process, but not mandatory.

- Once in the Foundation Layer information will be detached from its original format in order to be in its most atomic form. In this layer you should have a Data Model in 3rd Normal Form (no duplicates) that would enable information to be kept regardless the format where it came from (business neutral).

- Foundation Layer Data Model has to be designed to keep information for a long time, neutral to any process.  Here is where information will sit for as long as the organization needs, and it can be for a long time! Sometimes in order to build a strategy for the company, business needs to understand what happened in the past and with a good level of detail.

- The needs of reporting and intelligence from the business will shape the way the Foundation Layer will be queried. For that you have to transform information into a significant view that will be useful for business, departments, etc. You'll do that in the Access and Performance Layer.

- Access and Performance Layer will serve BI and Performance Management Tools with aggregations of information in a useful manner, but it's transient aggregated information. New entities will be created here to ease the way business uses information, giving it a meaningful perspective.

It's here where you use OLAP option inside the database to speed up aggregations, Data Mining for pattern finding and statistical analysis. The use of OLAP will avoid moving data around, hence avoiding the need to write additional ETL. The OLAP cubes also come with lots of integrated and out-of-the box calculations making dispensable the maintenance of lots of Materialised Views (MVs), whereas you'll only need to build and maintain one OLAP cube. This is a powerful multidimensional engine that sits inside the database, with which you can interact either

Otherwise you would have to split these functions into two extra-layers: aggregations layer inside the database and aggregations layer inside the tools realm. Most of the BI projects fail here because architecture gets pushed aside and the BI Tools try to do all the work duplicating effort, duplicating information and most important, with less performance. Also if the aggregations fall in the tools realm, the day you change tools you're in a lot of trouble.

The amount of times that users change their minds about the relevance of some data extraction, you would have to rebuilt these structures, and most of the cases it's here where we see IT and Business divorcing gradually, with IT not being able to cope with the amount of effort needed to change access structures and Business creating local Data Marts. This is the beginning of the end for the the whole DW project.

The power of OLAP and Data Mining inside the database brings much more synergies than you might expect. For instance joining OLAP and data mining results with spatial analytics, analysis cross linked to unstructured data, brings you to a whole different level.

Next you'll have some resources about the OLAP and Data Mining options of the Oracle Database.

OLAP Resources


Nice video about OLAP 11g, that on minute 11 will start laying down the basics of the "Reference Architecture".

Very neat "Build your own OLAP 11g Cubes" Example: http://st-curriculum.oracle.com/obe/db/11g/r1/olap/cube/buildicubes.htm taken from the main OLAP PRODUCT PAGE.

OLAP Wiki: http://wiki.oracle.com/page/Oracle+OLAP+option

The real-deal-heavy-duty OLAP blog: http://oracleolap.blogspot.com/

Posts about OLAP in the best BI blog in the world!: http://oraclebi.blogspot.com/search/label/OLAP

A book about Essbase and OLAP: http://my.safaribooksonline.com/book/databases/oracle/9780071621823

Oracle OLAP 11g powering Excel Spreadsheets using Samba's MDX OLE-DB Provider

Data Mining Resources


Oracle Data Mining Option main page where you can find lots of interesting stuff, being my favourite the list of Techniques and Algorithms used by ODM and the R-Interface for ODM.

Data Mining on the Cloud it's also a must see and use for anyone interested in testing this technology.

Regarding blogs and wiki's you must follow and read what these folks are up to in: http://blogs.oracle.com/datamining/

and http://wiki.oracle.com/page/Oracle+Data+Mining

Hope it helped to clear the sometimes confusing messages around DW and BI and specially where does OLAP and Data Mining options fit.

LMC
This article wouldn't be possible without the help and inspiration from Johan Hendrickx, Andrew Bond, Doug Cackett, Kevin Lancaster, and Keith Laker.

2 comments:

  1. With havin so much written content do you ever run into any issues
    of plagorism or copyright infringement? My site has a lot of completely unique content I've either authored myself or outsourced but it looks like a lot of it is popping it up all over the web without my authorization. Do you know any methods to help prevent content from being ripped off? I'd certainly appreciate it.

    ReplyDelete
  2. If it's in the internet I just pray to God that people plagiarize me. You got to let go, once it's created.

    ReplyDelete