According to the research company IDC, 49% of the world's stored data will be in public cloud environments by 2025. By then there will be 175 zettabytes of global data. With this sheer quantity of data and the speed of its growth, data-driven processes will become much more important for companies. Businesses will face a whole host of new data challenges.
Creating a data lake is quickly becoming a necessity for companies. A data lake is a centralized repository for storing structured and unstructured data, and it supports different analytical processes. It can make it easier for organizations to use machine learning and data visualization techniques to make informed decisions. A data lake that is scalable, secure, and quick to build allows businesses to speed up informed decision making, which in turn creates a competitive advantage.
AWS Lake Formation is an Amazon Web Services (AWS) service that enables technology teams to quickly set up a data lake. A key benefit is that it allows the integration of multiple data sources. AWS launched it in August 2019 and has been improving its features since then, making it one of the main services in the AWS data and analytics portfolio. Lake Formation is valuable because of the granular way it lets you assign access permissions to data, which supports the principle of least privilege and provides column-level granularity in permissions.
Consider an architecture like the following, in which objects stored in Amazon Simple Storage Service (Amazon S3) are queried through Amazon Athena, which is connected to the Amazon QuickSight visualization tool. We need to implement permission control so that users who run SQL queries in Athena see only the specific columns allowed for their role.
The same architecture with the implementation of Lake Formation would be:
In the image we can see that before the query request reaches the Glue Data Catalog, the Lake Formation service checks the permissions associated with the user's role. This allows the user to execute queries only on the data they are authorized to access.
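To make the idea of a check that runs before the catalog is reached more concrete, here is a toy model in Python. It is only an illustration of the concept, not how Lake Formation is implemented: the role names, the column names, and the `check_query` helper are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical role-to-column grants. In a real setup these rules
# are stored and enforced by Lake Formation, not by application code.
GRANTS: Dict[str, Set[str]] = {
    "analyst": {"id", "zone"},
    "admin": {"id", "user", "zone", "gender"},
}


def check_query(role: str, requested_columns: List[str]) -> List[str]:
    """Return the requested columns if the role may read all of them."""
    allowed = GRANTS.get(role, set())
    denied = [c for c in requested_columns if c not in allowed]
    if denied:
        # The query is rejected before any data is touched.
        raise PermissionError(f"role {role!r} may not read columns: {denied}")
    return requested_columns
```

In this sketch, `check_query("analyst", ["id", "zone"])` passes, while asking for a column outside the role's grant raises `PermissionError`.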
Managing granular permissions
Within Lake Formation we can define granular permissions, as the following example shows.
1. In the image below you can see a table created in the Glue Data Catalog called “envigado”, which has multiple columns:
2. We need to grant access to the “czambrano” user only, so that they can query the id, user, zone and gender columns. For this, Lake Formation must be configured as follows:
3. With this configuration, the user “czambrano” can run select operations only on the allowed fields of the table and cannot see the remaining columns.
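The same grant can also be made programmatically. The sketch below builds the arguments for the `grant_permissions` call of the boto3 Lake Formation client. The table, user, and column names come from the example above; the account ID `111122223333`, the database name `demo_db`, and the helper function names are assumptions made for illustration.

```python
from typing import Any, Dict, List


def build_column_grant(
    principal_arn: str,
    database: str,
    table: str,
    columns: List[str],
) -> Dict[str, Any]:
    """Build keyword arguments for the Lake Formation GrantPermissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,  # explicit include list
            }
        },
        "Permissions": ["SELECT"],
    }


def grant_column_access(request: Dict[str, Any]) -> None:
    """Send the grant to AWS; needs boto3, credentials, and a region."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("lakeformation").grant_permissions(**request)


# Hypothetical account ID and database name; adjust to your environment.
request = build_column_grant(
    principal_arn="arn:aws:iam::111122223333:user/czambrano",
    database="demo_db",
    table="envigado",
    columns=["id", "user", "zone", "gender"],
)
```

Calling `grant_column_access(request)` would apply the grant in a real account. Keeping the request builder separate from the API call makes the permission definition easy to review and test without touching AWS.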
Lake Formation permission control
The following features are available for managing permissions on tables:
- Include or exclude columns.
- Specify one or multiple users.
- Integrate with Active Directory (just for EMR).
- Table permissions:
Additionally, we can:
- View permissions from different users.
- Verify permissions.
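Permissions can also be verified from code. The boto3 Lake Formation client has a `list_permissions` call whose response contains a `PrincipalResourcePermissions` list. The sketch below works on a hand-written response of that shape, so it runs without AWS access. The `selectable_columns` helper, the account ID, the database name, and the extra `email` column are hypothetical. It handles both explicit column lists (`ColumnNames`) and exclusions (`ColumnWildcard`), and for brevity it ignores whole-table grants and the database name.

```python
from typing import Any, Dict, List, Set


def selectable_columns(
    response: Dict[str, Any],
    principal_arn: str,
    table: str,
    all_columns: List[str],
) -> Set[str]:
    """Return the columns of `table` that `principal_arn` may SELECT."""
    allowed: Set[str] = set()
    for entry in response.get("PrincipalResourcePermissions", []):
        if entry["Principal"]["DataLakePrincipalIdentifier"] != principal_arn:
            continue
        if "SELECT" not in entry.get("Permissions", []):
            continue
        resource = entry["Resource"].get("TableWithColumns")
        if resource is None or resource["Name"] != table:
            continue
        if "ColumnNames" in resource:
            # Explicit include list.
            allowed.update(resource["ColumnNames"])
        elif "ColumnWildcard" in resource:
            # Every column except the excluded ones.
            excluded = set(resource["ColumnWildcard"].get("ExcludedColumnNames", []))
            allowed.update(c for c in all_columns if c not in excluded)
    return allowed


USER_ARN = "arn:aws:iam::111122223333:user/czambrano"  # hypothetical account ID
ALL_COLUMNS = ["id", "user", "zone", "gender", "email"]  # "email" is made up

# Hand-written stand-in for a list_permissions response.
sample_response = {
    "PrincipalResourcePermissions": [
        {
            "Principal": {"DataLakePrincipalIdentifier": USER_ARN},
            "Resource": {
                "TableWithColumns": {
                    "DatabaseName": "demo_db",  # hypothetical database name
                    "Name": "envigado",
                    "ColumnNames": ["id", "user", "zone", "gender"],
                }
            },
            "Permissions": ["SELECT"],
        }
    ]
}

print(sorted(selectable_columns(sample_response, USER_ARN, "envigado", ALL_COLUMNS)))
```

Against a real account, the response would come from `boto3.client("lakeformation").list_permissions(...)`, which may be paginated through `NextToken`.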
Lake Formation is a very powerful tool for managing data lakes, enabling you to implement very granular security controls. This ensures that data is only exposed based on specific requirements. You can read more about AWS Lake Formation here.