
According to Google, Big Data has been a well-known term for several years, and its popularity has grown especially in the last two years. However, many people still confuse the notions of data lake and data warehouse. Understanding the difference is important so that companies can make the right decisions in data management.
Data Warehouse
Wikipedia defines a data warehouse as:
“…a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.”
This is a complex definition that describes the purpose of the data warehouse but does not explain how that goal can be achieved.
In addition, a data warehouse has the following characteristics:
- Subject Oriented
Unlike operational systems, the data in the data warehouse revolves around the subjects of the enterprise, such as customers, products, or sales. Gathering the data needed for a particular subject in one place is what makes the warehouse subject oriented, and it is especially useful for decision making.
- Integrated
The data in the data warehouse is integrated. Because it comes from several operational systems, all inconsistencies must be removed. These inconsistencies can involve naming conventions, measurement of variables, encoding structures, physical attributes of data, and so forth.
- Time-variant
While operational systems reflect current values because they support day-to-day operations, data warehouse data covers a long time horizon (up to 10 years), which means it stores historical data. It is mainly meant for data mining and forecasting. For example, a user searching for the buying pattern of a specific customer needs to look at both current and past purchases.
- Nonvolatile
The data in the data warehouse is read-only: users query it, but they cannot update, create, or delete records.
- Summarized
In the data warehouse, data is summarized at different levels. The user may start by looking at the total units sold of a product in an entire region. Then the user looks at the states in that region. Finally, they may examine the individual stores in a certain state. Typically, therefore, the analysis starts at a higher level and moves down to lower levels of detail.
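The Integrated and Summarized characteristics above can be illustrated with a minimal Python sketch. The two source layouts, the field names, the region codes, and the dozens-to-units conversion are all hypothetical, chosen only to show inconsistencies being removed before the data is rolled up by region, then state, then store.

```python
from collections import defaultdict

# Hypothetical rows from two operational systems that disagree on
# field names, region encoding, and the unit of measurement.
pos_rows = [
    {"store": "S1", "state": "CA", "region": "W", "units": 120},
    {"store": "S2", "state": "CA", "region": "W", "units": 80},
]
web_rows = [
    {"store_id": "S3", "st": "WA", "region_name": "West", "dozens": 5},
    {"store_id": "S4", "st": "NY", "region_name": "East", "dozens": 10},
]

# Assumed mapping from the point-of-sale region codes to full names.
REGION_CODES = {"W": "West", "E": "East"}


def integrate_pos(row):
    """Map a point-of-sale row onto the shared warehouse layout."""
    return {
        "region": REGION_CODES[row["region"]],  # unify the encoding
        "state": row["state"],
        "store": row["store"],
        "units": row["units"],
    }


def integrate_web(row):
    """Map a web-shop row onto the shared warehouse layout."""
    return {
        "region": row["region_name"],
        "state": row["st"],            # unify the naming convention
        "store": row["store_id"],
        "units": row["dozens"] * 12,   # unify the measurement
    }


def summarize(rows, *levels):
    """Total the units sold, grouped by the given levels of detail."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[level] for level in levels)
        totals[key] += row["units"]
    return dict(totals)


warehouse = [integrate_pos(r) for r in pos_rows]
warehouse += [integrate_web(r) for r in web_rows]

# Start at the highest level, then drill down.
by_region = summarize(warehouse, "region")
by_state = summarize(warehouse, "region", "state")
by_store = summarize(warehouse, "region", "state", "store")
```

Each call to `summarize` adds one more level to the grouping key, which mirrors how an analyst moves from the regional total down to a single store.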
Data Lake
The term data lake is generally credited to James Dixon, CTO of Pentaho. He described the data mart (a subset of the data warehouse) as being like a bottle of water, “clean, packed, and structured for easy consumption”, while the data lake is more like water in its natural state. Data flows from the rivers (the source systems) into the lake. Users have access to the lake to examine it, take samples, or even dive in.
Although this metaphor is a useful starting point, it is not very precise. A data lake has some more specific properties:
- All data is loaded from the source systems. No data is rejected.
- Data is stored in its original or almost unchanged form.
- Data is transformed and a schema is applied only to meet the needs of an analysis.
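The three properties above can be sketched in a few lines of Python. This is only an illustration: the in-memory list stands in for real lake storage, and the source names, payloads, and the `ingest` and `read_orders` helpers are hypothetical.

```python
import json

# Stand-in for cheap file or object storage (an assumption for this sketch).
lake = []


def ingest(source, payload):
    """Store the payload exactly as received; nothing is validated or rejected."""
    lake.append({"source": source, "raw": payload})


# Structured, semi-structured, and free-form data all land in the lake.
ingest("web_log", '127.0.0.1 - - "GET /cart HTTP/1.1" 200')
ingest("orders", '{"order_id": 7, "amount": "19.90", "currency": "USD"}')
ingest("sensor", "temp=21.5;unit=C")


def read_orders():
    """Apply a schema at analysis time, only to the records this analysis needs."""
    for record in lake:
        if record["source"] == "orders":
            fields = json.loads(record["raw"])
            yield {
                "order_id": int(fields["order_id"]),
                "amount": float(fields["amount"]),
            }


orders = list(read_orders())
```

The raw payloads are never modified. Typing and field selection happen inside `read_orders`, so a different analysis could read the same records with a different schema.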
Next, there are a few key differences between the data lake and the data warehouse approach.
- Data Lake Retains All Data
During the development of a data warehouse, a considerable amount of time is spent analyzing data sources, understanding business processes, and profiling data. The result is a highly structured data model designed for reporting. A large part of this process consists of deciding which data to include in the warehouse and which to leave out. Generally, if data is not used to answer specific questions or does not appear in a defined report, it may be excluded from the warehouse. This is usually done to simplify the data model and to save space on the expensive disk storage that is used to make the data warehouse perform well.
In contrast, a data lake retains all data: not only the data in use today, but also data that might be used someday and even data that might never be used at all, simply because it could turn out to be needed in one particular situation. Data is also kept for all time, so that an analysis can go back to any point in time.
This approach is possible because the hardware for a data lake is usually very different from what is used for a data warehouse. Commodity, off-the-shelf servers combined with cheap storage make scaling a data lake to terabytes and petabytes relatively economical.
- Data Lake Supports All Types of Data
Data warehouses generally consist of data extracted from transactional systems, made up of quantitative metrics and the attributes that describe them. Non-traditional data sources such as web server logs, sensor data, social network activity, text, and images are usually ignored because this data is difficult and expensive to store and consume.
The data lake approach embraces these non-traditional data types. In a data lake, a company can store all data regardless of source and structure. The data is kept in raw form and is transformed only when it is ready to be used. This approach is known as “Schema on Read”, in contrast to the “Schema on Write” approach used in the data warehouse.
- Data Lake is Adaptable
One frequent complaint about the data warehouse is how long it takes to change it. A considerable amount of time is spent in the early stages of development just to get the warehouse structure right. A good warehouse design can adapt to change, but because of the complexity of the data loading process and the work done to make analysis and reporting easier, these changes consume developer resources and take a lot of time.
Many business questions cannot afford to wait for data warehouse teams to adjust their systems to answer them. The increasing need for faster answers is why the concept of self-service business intelligence emerged.
In the data lake, on the other hand, all data is stored in raw form and is always accessible to anyone who needs it, so users are empowered to go beyond the structure of the warehouse, explore the data, and answer their own questions.
If the results of an exploration prove useful and users want to repeat it, a more formal schema can be applied, and automation and reusability can be developed to extend the results to a wider audience. Results that prove useless can simply be discarded, without any change to the data structures already in place and without consuming development resources.
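The contrast between schema on write and schema on read, and the kind of ad hoc exploration described above, can be sketched as follows. The event records, the column names, and both helper functions are hypothetical and only illustrate the idea. This is not how any particular warehouse or lake product works.

```python
# Hypothetical events arriving from source systems, with varying shapes.
events = [
    {"user": "a", "amount": 10.0},
    {"user": "b", "amount": 5.0, "coupon": "SPRING"},
    {"user": "c", "clicked": "banner-3"},
]

# The warehouse model is fixed before any data is loaded.
WAREHOUSE_COLUMNS = ("user", "amount")


def load_schema_on_write(rows):
    """Keep only rows that fit the predefined model and drop extra fields."""
    table = []
    for row in rows:
        if all(col in row for col in WAREHOUSE_COLUMNS):
            table.append({col: row[col] for col in WAREHOUSE_COLUMNS})
    return table


def read_with_schema(rows, columns):
    """Schema on read: project raw rows onto the columns a question needs."""
    return [
        {col: row[col] for col in columns}
        for row in rows
        if all(col in row for col in columns)
    ]


warehouse_table = load_schema_on_write(events)  # shaped at load time
lake_rows = list(events)                        # everything kept as-is

# New questions that were not anticipated when the warehouse was designed.
clicks = read_with_schema(lake_rows, ("user", "clicked"))
coupons = read_with_schema(lake_rows, ("user", "coupon"))

# The warehouse cannot answer them without a redesign and a reload.
clicks_from_warehouse = read_with_schema(warehouse_table, ("user", "clicked"))
```

If the `clicks` exploration proves useful, the `("user", "clicked")` projection is a natural candidate to formalize into a repeatable, shared view. If it does not, it can be dropped without touching the stored data.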
Which Approach Should You Take?
Choosing between these two technologies can be confusing. If the company already has an established data warehouse, discarding everything that has been done and rebuilding from scratch is not recommended. However, as with any data warehouse, the problems described above can still arise. Therefore, it is better for companies to implement a data lake alongside the existing data warehouse. The data warehouse can continue to operate as before while the company begins to fill its data lake with new data sources. The data lake can also serve as an archive repository for warehouse data, made available to employees so they can access even more data. From there, the company can consider whether to move the data warehouse into the data lake or to keep a combination of the two.
Technology?
Nowadays, the term data lake is arguably synonymous with big data technology. One big data technology that adopts the data lake approach is Paques. Paques Smart Data Lake, one of the pioneers of big data in Indonesia, is built on the data lake concept. With this approach, Paques can process all types of data in their original form, which saves time and improves efficiency.
Technology will continue to develop, and big data is no exception. Companies everywhere will always have to choose between keeping up with technological trends and being left behind by competitors in increasingly dynamic markets. For more optimal and efficient data processing, Paques might be the solution.