Data Lake Implementation Guide: How to Build a Data Lake Correctly and Avoid Common Pitfalls
Data lakes have been one of the most discussed topics in enterprise data architecture in recent years. As a practitioner with many years in the data field, I have watched company after company fall into the same traps when building one. Many people treat a data lake as a bigger hard disk: throw all the data in and the job is done. The result is that the lake degenerates into a "data swamp". In this article I want to draw on my own hands-on experience to explain how to build and run a data lake correctly, and help you avoid the common pitfalls.
What exactly is a data lake?
A data lake is a centralized repository that stores data in its original format. It can hold structured data, such as database tables, as well as unstructured data, such as images, videos, and log files. Unlike a traditional data warehouse, a data lake retains data in its original form without a predefined schema. That sounds great, but it is exactly this "store everything" mindset that lets many projects spiral out of control. In my view, a data lake should not be a data dump; it should be a carefully managed warehouse of raw materials.
How to choose the right data lake storage solution
When choosing a storage solution, many teams agonize over technology selection. I think the first things to consider are cost and scalability. Object storage is generally the first choice, such as AWS S3 or Alibaba Cloud OSS; HDFS remains an option for self-hosted clusters. You need to evaluate how fast your data grows and how frequently it is accessed to decide which storage tier to use. Hot and cold data call for different strategies: keep hot data on high-performance media, and move cold data to lower-cost archive storage.
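The hot/cold split above usually comes down to a simple rule on last-access time. A minimal sketch, with hypothetical 30-day and 180-day thresholds (tune them to your own access patterns and pricing, not these example values):

```python
from datetime import datetime, timedelta

def choose_storage_tier(last_accessed: datetime, now: datetime) -> str:
    """Pick a storage tier from how long a dataset has sat idle.
    Thresholds are illustrative, not recommendations."""
    idle = now - last_accessed
    if idle > timedelta(days=180):
        return "archive"        # cheapest, slow retrieval
    if idle > timedelta(days=30):
        return "infrequent"     # cheaper, slightly higher latency
    return "hot"                # standard storage for active data

now = datetime(2024, 6, 1)
print(choose_storage_tier(datetime(2024, 5, 30), now))  # hot
print(choose_storage_tier(datetime(2023, 10, 1), now))  # archive
```

In practice you would not run this yourself: object stores such as S3 and OSS let you express the same rule declaratively as a bucket lifecycle policy, which is usually the better choice.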
Compatibility with your existing technology stack also matters when choosing storage. If your team is familiar with the open-source ecosystem, HDFS may be a good fit; if you are already in the cloud, the object storage offered by cloud vendors is more flexible. Don't be dazzled by flashy technology: stability and operating cost are what count. I once saw a team choose an extremely complex distributed file system in pursuit of ultimate performance; half a year later, nobody was left who could operate and maintain it.
How to implement data governance in data lake
Data governance determines whether a data lake succeeds or fails. Many people assume a data lake needs no governance, and that is a serious misconception. A metadata management system must be built from day one to record each dataset's source, format, update time, and lineage. Imagine facing thousands of files half a year later with no metadata to tell you what is in them: the lake becomes completely useless!
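To make the four fields above concrete, here is a minimal sketch of a metadata record and an in-memory catalog. Real catalogs (Hive Metastore, AWS Glue, DataHub) track far more; all names and fields here are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DatasetMetadata:
    name: str
    source: str                                   # where the data came from
    format: str                                   # e.g. "parquet", "json"
    updated_at: datetime                          # last update time
    upstream: list = field(default_factory=list)  # lineage: parent datasets

catalog: dict[str, DatasetMetadata] = {}

def register(meta: DatasetMetadata) -> None:
    catalog[meta.name] = meta

register(DatasetMetadata("raw.orders", "mysql://orders_db", "json",
                         datetime(2024, 6, 1)))
register(DatasetMetadata("clean.orders", "internal", "parquet",
                         datetime(2024, 6, 1), upstream=["raw.orders"]))

# Lineage query: what does clean.orders depend on?
print(catalog["clean.orders"].upstream)  # ['raw.orders']
```

The point is not the data structure itself but the discipline: nothing lands in the lake without a record like this being written at the same time.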
Governance is not just metadata; permission management and data lifecycle management are equally central. You have to define clearly who may read which data and who may write which data, and you need to set expiration times and cleanup policies. I once took over a project whose lake still held server logs from three years earlier, occupying space and serving no purpose, simply because no lifecycle had ever been set. Good governance makes a data lake clearer to use over time, not more confusing.
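A lifecycle policy can be as simple as a retention period per path prefix. A sketch under assumed retention windows (90 days for raw logs, a year for cleaned data; the numbers and paths are examples only):

```python
from datetime import datetime, timedelta

# Hypothetical retention policy, keyed by path prefix.
RETENTION = {
    "raw/logs": timedelta(days=90),
    "clean": timedelta(days=365),
}

def expired(path: str, written_at: datetime, now: datetime) -> bool:
    """Return True if the file at `path` has outlived its retention window."""
    for prefix, ttl in RETENTION.items():
        if path.startswith(prefix):
            return now - written_at > ttl
    return False  # no policy defined -> keep, but flag for review

now = datetime(2024, 6, 1)
print(expired("raw/logs/2021/app.log", datetime(2021, 3, 1), now))   # True
print(expired("clean/orders/part-0001", datetime(2024, 5, 1), now))  # False
```

A scheduled job that sweeps the lake with a check like this would have caught those three-year-old server logs long before they became dead weight.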
What are the key steps to enter data into the lake?
Getting data into the lake is not as simple as uploading files; it follows a standard process. The first step is data ingestion, which has to accommodate diverse sources such as business databases, application logs, and external APIs. For scenarios with strict real-time requirements, Kafka or Flume can provide streaming ingestion; for batch data, Sqoop or DataX can handle scheduled synchronization.
The next step is validation and cleaning. Although the data lake stores raw data, that does not mean quality can be ignored. As data enters the lake, apply basic format validation and integrity checks, and flag records that are obviously corrupted. At the same time, unify the storage format as much as possible; columnar formats such as Parquet or ORC are recommended, since they both save space and improve the performance of later queries. Only once this step is done can downstream analysis proceed smoothly.
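The basic checks described above, for a JSON line source, can be sketched like this. The required fields are a made-up example schema; the key idea is that bad records are flagged with a reason rather than silently dropped:

```python
import json

REQUIRED_FIELDS = {"order_id", "user_id", "amount"}  # hypothetical schema

def validate_line(line: str) -> tuple[bool, str]:
    """Basic format validation and integrity check for one JSON record.
    Returns (ok, reason) so corrupted data can be flagged, not lost."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    return True, "ok"

print(validate_line('{"order_id": 1, "user_id": 7, "amount": 9.5}'))  # (True, 'ok')
print(validate_line('{"order_id": 1}'))
print(validate_line('not json at all'))
```

Records that pass would then be rewritten into the unified columnar format (e.g. Parquet); records that fail go to a quarantine path with their reason attached.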
How to avoid data lakes turning into data swamps
The most common way a data lake dies is by turning into a data swamp, and the root cause is a lack of organization and maintenance. To prevent this, you have to manage the data in the lake in layers. Generally speaking, we divide the lake into several zones:

- Raw zone: original data, stored without any changes.
- Clean zone: data after cleaning, de-duplication, and standardization.
- Application zone: datasets carefully prepared for specific applications.
- Sandbox zone: a workspace for data scientists to do exploratory analysis.
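The simplest way to enforce this zoning is to bake it into the path layout and refuse anything that does not fit. A tiny sketch (bucket and dataset names are hypothetical):

```python
# The four zones above, encoded as mandatory path prefixes.
ZONES = ("raw", "clean", "app", "sandbox")

def zone_path(zone: str, dataset: str) -> str:
    """Build a lake path; reject writes that do not name a known zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"s3://my-lake/{zone}/{dataset}/"

print(zone_path("raw", "orders"))    # s3://my-lake/raw/orders/
print(zone_path("clean", "orders"))  # s3://my-lake/clean/orders/
```

When every writer goes through a helper like this (or the equivalent bucket policy), "mystery files" with no home zone simply cannot appear.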
In addition, you must develop the habit of documenting your data. Whenever you add a new data source or change processing logic, update the corresponding document; even a few lines of description are a great help to newcomers. I have seen too many teams where, for lack of documentation, new members had no idea how the data was produced, and the whole project ultimately had to be abandoned.
How data lake collaborates with data warehouse
Once they have a data lake, many companies plan to replace the data warehouse with it entirely. This is a common misconception. Data lakes and data warehouses should collaborate, not substitute for each other. The lake stores raw data and suits exploratory analysis and data mining; the warehouse stores deeply processed, modeled data and suits fixed reports and BI analysis.
In a typical architecture, data is synchronized from business systems into the lake in real time or near-real time, then cleaned and standardized inside the lake. Part of the processed data goes directly to data scientists for their use; another part is imported into the warehouse and organized by dimensional modeling for business departments to query. This "integrated lake and warehouse" architecture preserves the flexibility of the data lake while guaranteeing the performance and stability of the data warehouse.
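The flow above can be illustrated end to end with a toy in-memory pipeline: raw records land in the lake, get de-duplicated and standardized into the clean zone, and a modeled aggregate is loaded into the warehouse. All structures here are stand-ins for real storage, and the fields are invented:

```python
# Raw zone: events as they arrived, duplicates and inconsistencies included.
raw_zone = [
    {"order_id": 1, "amount": "19.90", "country": "us"},
    {"order_id": 1, "amount": "19.90", "country": "us"},   # duplicate
    {"order_id": 2, "amount": "5.00",  "country": "DE"},
]

# Clean zone: de-duplicate by key, fix types, standardize casing.
seen, clean_zone = set(), []
for r in raw_zone:
    if r["order_id"] in seen:
        continue
    seen.add(r["order_id"])
    clean_zone.append({"order_id": r["order_id"],
                       "amount": float(r["amount"]),
                       "country": r["country"].upper()})

# Warehouse: a small aggregate for BI reporting (revenue by country).
warehouse_fact: dict[str, float] = {}
for r in clean_zone:
    warehouse_fact[r["country"]] = warehouse_fact.get(r["country"], 0.0) + r["amount"]

print(warehouse_fact)  # {'US': 19.9, 'DE': 5.0}
```

Data scientists would query `clean_zone` directly for exploration, while the business side only ever sees the modeled `warehouse_fact` side, which is exactly the division of labor described above.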
What technology stacks are needed to build a data lake?
Which technology stack to choose depends on your team's background and your actual business needs. If you are starting from scratch, consider Spark or Flink as the compute engine; both can process batch as well as streaming data. For storage, you can build private object storage with MinIO or Ceph, or use a cloud service. For the table layer, Hudi or Delta Lake can give you ACID transactions and incremental processing.
If you prefer cloud-native solutions, the major cloud vendors all offer complete offerings, such as AWS Lake Formation, Azure Data Lake Storage, and Alibaba Cloud DLF. These managed services reduce the operations burden but introduce a degree of vendor lock-in. For start-up teams, I suggest using cloud services first to validate the business quickly; mature large companies can consider self-hosting for greater autonomy.
What is the thorniest problem you have hit while building or using a data lake? Leave a comment and share it, and let's discuss it together. If you found this article helpful, don't forget to like it and share it with friends who need it.