By Infraweb | December 28, 2017
The move of IT services to the major cloud vendors seems to be happening at a blistering pace. Amazon, Microsoft, Google and IBM are all competing in a dynamic marketplace where it feels like there is something new to review every week. One of my recent tech exploration exercises has focused on "Data Lakes" and reporting tools.
Put simply, it started as curiosity. Data platforms, Hadoop, event streams and Data Lakes were just some of the key phrases that took me off on that uncontrollable spider web journey across the internet, the kind where a craving for more knowledge keeps you sidetracked right up to the point of understanding a concept and being comfortable talking about it and how it can be applied. Then comes the next part of the journey... how can I make use of these new concepts and their adopted approaches in a cloud-first world? Would they provide value for our customers? Do our customers even need them? How easy is it to set up? Cost? GDPR? A long list of questions developed.
What is a Data Lake?
As with most things, there are hundreds of definitions available on the internet, but this one from Amazon is my favourite: "A data lake allows you to store all your structured and unstructured data, in one centralized repository, and at any scale. With a data lake, you can store your data as-is, without having to first structure the data, based on potential questions you may have in the future." Essentially, you can start collecting all of your data feeds from your website, mobile applications, CRM, ERP, finance systems or any other data source into one "pot" without first needing to apply transformations to get the data into a reporting-friendly format.
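To make the "store as-is, structure later" idea concrete, here is a minimal sketch in Python. It uses a temporary local folder as a stand-in for the lake. The folder layout, source names (crm, finance), file names and field names are all hypothetical and not tied to any vendor's service.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake, source, name, payload):
    """Write a payload into the lake untouched, under a per-source folder."""
    target = Path(lake) / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(payload, encoding="utf-8")
    return target


def read_orders(lake):
    """Apply structure only at read time, once the question is known."""
    lake = Path(lake)
    rows = []

    # Hypothetical CRM export: one JSON object per line.
    crm_text = (lake / "crm" / "orders.jsonl").read_text(encoding="utf-8")
    for line in crm_text.splitlines():
        record = json.loads(line)
        rows.append({"customer": record["customer"],
                     "amount": float(record["total"])})

    # Hypothetical finance export: CSV with a header row.
    fin_text = (lake / "finance" / "invoices.csv").read_text(encoding="utf-8")
    for line in fin_text.splitlines()[1:]:
        customer, amount = line.split(",")
        rows.append({"customer": customer, "amount": float(amount)})

    return rows


def demo():
    """Land two differently shaped feeds, then read them into one shape."""
    with tempfile.TemporaryDirectory() as tmp:
        land_raw(tmp, "crm", "orders.jsonl",
                 '{"customer": "acme", "total": "12.50"}\n')
        land_raw(tmp, "finance", "invoices.csv",
                 "customer,amount\nacme,7.5\n")
        return read_orders(tmp)
```

The point of the sketch is that land_raw knows nothing about the data it stores. Only read_orders, written later when a question exists, decides which fields matter and how the two feeds line up.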
Is the Data Lake approach just for large enterprise data projects?
It very much depends on the use case. Many of the posts and blogs I read to formulate this blog and help steer some proof of concept work seemed to be focused on Big Data for large enterprises, normally using one of the big cloud vendors. The strategy being deployed by a number of large organisations seems to be to adopt the Data Lake concept and start storing all the data in one central location in preparation for large-scale processing at a later date, when the questions being asked are clearly understood.
This is probably not a good approach for a small to medium-sized business that does not have the finances available to start innovating in this area, doesn't have the skill set or hasn't really identified a strong use case. But we wouldn't dismiss it completely. Take a small start-up focused on the IoT space with a number of sensors sending millions of events per day, or a small mobile game developer sending millions of messages across a multiplayer platform. For these types of scenarios, Data Lakes could provide a way to get up and running quickly and start plugging in some reporting tools to analyse the data.
To get a feel for how these services worked, I decided to trial a Data Lake on the Microsoft Azure platform. I did look at AWS but, not wanting to dwell, decided on Azure, mainly based on a few posts I had read on the maturity of the Microsoft Data Platform. As a first step, I created a data lake via the Azure portal to store all of the log files from a variety of web sites we host with different providers. For the purpose of the trial, I manually copied a selection of the historic log files into the newly created lake. Great, but what does that give me? Not much in reality, but when I opened up Power BI I was able to create a new report that pointed directly at the data lake and start joining and querying the different data sets.
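The trial itself used Power BI for the joining and querying. As a rough illustration of the same idea outside Power BI, the sketch below combines log lines from two sites into one data set and runs a couple of simple queries over it. The simplified space-separated log layout, the site names and the sample lines are assumptions made for the example, not the real files from the trial.

```python
from collections import Counter

# Hypothetical sample data: site name -> raw log lines as they were landed.
RAW_LOGS = {
    "site-a": [
        "2017-12-01 10:00:01 GET /index.html 200",
        "2017-12-01 10:00:05 GET /missing 404",
    ],
    "site-b": [
        "2017-12-01 11:15:00 GET /home 200",
        "2017-12-01 11:15:09 POST /login 500",
    ],
}


def parse_line(site, line):
    """Split one log line (assumed layout) into named fields."""
    date, time, method, path, status = line.split()
    return {
        "site": site,
        "date": date,
        "time": time,
        "method": method,
        "path": path,
        "status": int(status),
    }


def combine(raw_logs):
    """Merge the logs from every site into a single list of rows."""
    return [
        parse_line(site, line)
        for site, lines in raw_logs.items()
        for line in lines
    ]


def status_counts(rows):
    """Count responses per HTTP status code across all sites."""
    return Counter(row["status"] for row in rows)


def errors_by_site(rows):
    """Count responses with status 400 or above, grouped by site."""
    return Counter(row["site"] for row in rows if row["status"] >= 400)
```

Calling combine(RAW_LOGS) gives one list of rows tagged with their site, which is roughly what a report needs before it can compare providers side by side.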
I will admit this is a very simplistic example, but it gave me a feel for setting up a data lake in Azure and for the simplicity of connecting Power BI to deliver some basic reporting. With a few enhancements and a bit more thought on the data landed in the lake, I can see some real value in this approach, especially as we as a team learn more about the capabilities of tools such as Power BI and also what's available in the open source community.
During my exploratory work into Data Lakes and reporting, it was difficult to ignore the fast-approaching GDPR legislation. The concept of storing data from a variety of sources all in one place sounds great, especially when you want to apply some complex data transformations, use the lake as a staging area prior to building a data warehouse or just run ad-hoc queries against your Data Lake. However, with the forthcoming GDPR legislation, you should work with your Data Protection Officer and information specialists to determine how GDPR might impact loading data from a variety of systems into a central store for further processing.
I have only scratched the surface of Data Lakes and their use cases, but from my initial development work and analysis they seem to offer an easy way to land raw data from a variety of sources into a central hub. Then, using tools such as Power BI and PolyBase, you can begin to slice the data and start building interesting reports and dashboards that provide valuable insights and allow your business to make evidence-based decisions. A data lake on its own will probably not be the answer to a long-term data analytics strategy, but I think it adds some value to the first part of this journey and could provide a staging area for a much more dynamic management information and business intelligence solution.