Your data lake: ‘Build it and they will come’

What is a data lake and why would your business need one?

Most IT leaders recognise that there are gaps in their current information and data platforms.  Significant portions of the enterprise data model have not been captured at all, or have been captured too many times, leaving duplication.

Frankly, it is quite rare to find an organisation with a mature information environment and the data governance to support it.  I’ve met the occasional CIO who understands data and has a data model on their office wall.  This CIO has convinced the organisation to take the approach of ‘build it and they will come’; while it is a leap of faith, you can understand the religion and the belief that heaven will be attained in the end.

Let’s first address the question, what is a data lake?

“A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.”

Value of a data lake

One of the best ways to consider this is simply to compare a data lake to your current data warehouse.  There is a great definition that goes along the lines that a ‘datamart is a store of bottled water – cleansed and packaged and structured for easy consumption’.

By contrast, ‘the data lake is a large body of water in a more natural state’.  I suspect that most data warehouses were not built to allow for rapid change or any degree of flexibility.  They have been engineered with complex data loading in order to report on specific business KPIs.

The real question will be around the value that your data lake can bring to the enterprise.  There is a perspective that a data lake that does not serve any purpose will become a ‘stagnant’ dumping ground.

It's not about Hadoop

It is a fact that most data lakes rely on the open source Apache Hadoop project.  But it is not necessary to build on Hadoop; the key point is that achieving the same goal on your traditional RDBMS would not be economically viable.

In principle a data lake depends on two steps: early ingestion of raw data, followed by late processing.  This is the exact opposite of the normal data warehousing approach.

It’s less structured

The big difference with the data lake (hub) approach is that while you still need some discipline, it is less structured.  Unlike the traditional approach, not all the information has to be modelled upfront and then populated.  In insurance and banking, for example, warehouse schemas are standard across the industry and are defined in the data store before the data can be loaded.

Unfortunately, defining the schema upfront creates a longer requirements and validation process.  The reality is that requirements change in the normal course of the year due to regulation and the evolution of the business.  Any such change can create havoc for a schema that has been defined in advance.

However, tools like Hadoop can deliver schema-on-read.  Raw data is loaded into Hadoop as it is, and the structure is imposed at processing time.  The structure can therefore change to suit the needs of each processing application.
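To make the idea concrete, here is a minimal sketch of schema-on-read in plain Python rather than Hadoop.  The sample records, the field names and the read_with_schema helper are all hypothetical and chosen only for illustration; the point is that the raw lines are stored untouched and each consumer applies its own structure when it reads them.

```python
import json

# Raw events kept exactly as they arrived; nothing was validated at load time.
# (Hypothetical sample records, for illustration only.)
RAW_LINES = [
    '{"customer_id": "C1", "amount": "120.50", "channel": "web"}',
    '{"customer_id": "C2", "amount": "75", "region": "NSW"}',
    '{"customer_id": "C1", "amount": "30.25", "channel": "branch", "region": "VIC"}',
]


def read_with_schema(lines, schema):
    """Impose a schema at read time.

    `schema` maps field name -> casting function. Fields missing from a
    raw record come back as None; fields not in the schema are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two applications impose two different structures on the same raw data.
billing_schema = {"customer_id": str, "amount": float}
marketing_schema = {"customer_id": str, "channel": str, "region": str}

billing_rows = list(read_with_schema(RAW_LINES, billing_schema))
marketing_rows = list(read_with_schema(RAW_LINES, marketing_schema))
```

Neither consumer needed the other's fields to be agreed in advance, and a third schema could be added later without reloading anything.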

Horses for courses

For some business use cases, a data lake may be an inappropriate fit.  Schema-on-write is definitely better for clean and consistent data sets, but those data sets may be more limited.
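As a rough illustration of that trade-off, the sketch below shows schema-on-write in plain Python: every record is checked against a schema fixed in advance, and anything that does not conform is rejected before it reaches the table.  The schema, the sample records and the function names are hypothetical, and a real warehouse would enforce this in the database rather than in application code.

```python
# Schema fixed before any data is loaded (hypothetical example).
POLICY_SCHEMA = {"policy_id": str, "premium": float}


class SchemaError(ValueError):
    """Raised when a record does not match the predefined schema."""


def validate(record, schema):
    """Return the record unchanged if it matches the schema exactly."""
    missing = set(schema) - set(record)
    extra = set(record) - set(schema)
    if missing or extra:
        raise SchemaError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for field, expected_type in schema.items():
        if not isinstance(record[field], expected_type):
            raise SchemaError(f"{field} is not {expected_type.__name__}")
    return record


def load(records, schema):
    """Load conforming records; collect the rest with the reason for rejection."""
    table, rejected = [], []
    for record in records:
        try:
            table.append(validate(record, schema))
        except SchemaError as exc:
            rejected.append((record, str(exc)))
    return table, rejected


incoming = [
    {"policy_id": "P1", "premium": 500.0},
    {"policy_id": "P2"},                      # premium missing
    {"policy_id": "P3", "premium": "650"},    # premium has the wrong type
]

table, rejected = load(incoming, POLICY_SCHEMA)
```

What lands in the table is clean and consistent, but two of the three incoming records never get there, which is the sense in which the resulting data sets are more limited.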

This is clearly a case of horses for courses.  Schema-on-read provides for a much more flexible organisation of data.  This ‘on the fly’ approach will cut against the grain for those who have been data modellers, and it can seem anarchic.

But in today’s business world, this agility is sometimes needed.  Thus both approaches can co-exist.

Your call to action

So what are your plans for 2018 on this?  I would say the best approach is to work with your business partners to explore potential use cases.

The best practice is to narrow down quickly to a small number of key themes; this will deliver value sooner than going too broad.  You should be able to identify some good examples around predictive analytics, fraud and customer engagement.

Most lakes were formed by glaciers that covered areas of land during the ice age.  We would expect that in the same fashion we have data and information that is frozen both inside and outside of the enterprise that can be made liquid and then reshaped into something useful.

It’s time for your data lake to start to take shape.
