A developer’s guide to managing columnar permissions when building an AWS data lake

According to the research company IDC, 49% of the world's stored data will be in public cloud environments by 2025, by which time global data is expected to reach 175 zettabytes. Given this volume and pace of growth, data-driven processes will become much more important for companies, and businesses will face a whole host of new data challenges.

Creating a data lake is quickly becoming a necessity for companies. A data lake is a centralized repository for storing both structured and unstructured data, and it supports the execution of different analytical processes. It can make it easier for organizations to use machine learning and data visualization techniques to make informed decisions. A data lake that is scalable, secure, and quick to build lets businesses speed up informed decision making, which in turn creates a competitive advantage.

Amazon Web Services (AWS) Lake Formation is one service that enables technology teams to set up a data lake quickly. A key benefit is that it allows the integration of multiple data sources. AWS launched it in August 2019 and has been improving its features ever since, making it one of the main services in the AWS Data and Analytics set. Lake Formation is especially valuable because of the granular way it lets you assign access permissions to data: it supports the principle of least privilege and offers column-level granularity for permissions.

Consider an architecture like the following, in which objects stored in Amazon Simple Storage Service (Amazon S3) are queried through Amazon Athena, which is connected to the Amazon QuickSight visualization tool. We need to implement permission control so that users who run SQL queries in Athena see only the specific columns allowed for their role.

[Image: architecture with Amazon S3, Athena, and QuickSight]

The same architecture with the implementation of Lake Formation would be:

[Image: the same architecture with Lake Formation]

In the image we can see that, before the query request reaches the Glue Data Catalog, the Lake Formation service checks the permissions associated with the user's role. As a result, the user can only run queries on the data they are authorized to access.

Managing granular permissions

Lake Formation lets us define granular permissions, as the following example shows.

1. The image below shows a table in the Glue Data Catalog called “envigado”, which has the following columns:

  • Id
  • Date
  • User
  • Id_zone
  • Id_client
  • Zone
  • Name
  • Age
  • Gender
[Image: the “envigado” table in the Glue Data Catalog]

2. The “czambrano” user needs access to only the id, user, zone, and gender columns. To achieve this, Lake Formation must be configured as follows:

[Image: Lake Formation permission configuration for the “czambrano” user]

3. With this configuration, the user “czambrano” can run select operations only on the permitted columns of the table and cannot see the remaining columns.
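The same grant can also be scripted instead of configured in the console. The following is a minimal sketch using the boto3 `grant_permissions` call of the Lake Formation client. The database name, the account ID, and the assumption that “czambrano” is an IAM user are hypothetical, since the article does not state them; the column names are written in lowercase on the assumption that the catalog stores them that way.

```python
from typing import Any, Dict, List

# Hypothetical values: the article names neither the database nor the account.
DATABASE_NAME = "datalake_db"
PRINCIPAL_ARN = "arn:aws:iam::111122223333:user/czambrano"


def build_column_grant(principal_arn: str, database: str, table: str,
                       columns: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for the Lake Formation grant_permissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Only these columns become visible to the principal.
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }


def apply_grant(request: Dict[str, Any]) -> None:
    """Send the grant to Lake Formation (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without the SDK

    boto3.client("lakeformation").grant_permissions(**request)


grant_request = build_column_grant(
    PRINCIPAL_ARN, DATABASE_NAME, "envigado", ["id", "user", "zone", "gender"]
)
```

Calling `apply_grant(grant_request)` with credentials of a data lake administrator would submit the grant. Keeping the request builder separate from the API call makes it possible to review or log the request before anything changes in the account.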

Lake Formation permission control

Lake Formation offers the following features for managing permissions on tables:

  • Include or exclude columns.
  • Specify one or multiple users.
  • Integrate with Active Directory (just for EMR).
  • Table permissions:
    • Alter
    • Insert
    • Drop
    • Delete
    • Select
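To illustrate the "exclude columns" option from the list above, here is a sketch that grants select on every column except a named set, using the `ColumnWildcard` field of the boto3 `grant_permissions` request. The principal ARN and database name are hypothetical, and the lowercase column names are an assumption about how the “envigado” table is stored in the catalog.

```python
from typing import Any, Dict, List

# Columns of the "envigado" table as listed in the article (lowercased).
ENVIGADO_COLUMNS = ["id", "date", "user", "id_zone", "id_client",
                    "zone", "name", "age", "gender"]


def build_exclusion_grant(principal_arn: str, database: str, table: str,
                          excluded: List[str]) -> Dict[str, Any]:
    """Build a SELECT grant on all columns except the excluded ones."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # The wildcard means "all columns", minus the exclusions.
                "ColumnWildcard": {"ExcludedColumnNames": excluded},
            }
        },
        "Permissions": ["SELECT"],
    }


def visible_columns(all_columns: List[str], excluded: List[str]) -> List[str]:
    """Local helper: which columns remain visible after the exclusion."""
    hidden = set(excluded)
    return [column for column in all_columns if column not in hidden]


excluded_columns = ["date", "id_zone", "id_client", "name", "age"]
exclusion_request = build_exclusion_grant(
    "arn:aws:iam::111122223333:user/czambrano",  # hypothetical principal
    "datalake_db",                               # hypothetical database
    "envigado",
    excluded_columns,
)
remaining = visible_columns(ENVIGADO_COLUMNS, excluded_columns)
```

The request would be sent with `boto3.client("lakeformation").grant_permissions(**exclusion_request)`. Exclusion lists tend to suit tables where most columns are safe to expose and only a few are sensitive; inclusion lists are the more conservative choice when new columns may be added later.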

Additionally, we can:

  • Revoke permissions
  • View permissions from different users
  • Verify permissions
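These operations can be sketched with the boto3 Lake Formation client as well: `revoke_permissions` takes the same request shape as a grant, and `list_permissions` returns the permissions recorded for a resource. The helper names, the database name, and the sample response below are hypothetical; the sample only imitates the shape of a `list_permissions` response so that the summary function can be tried without an AWS account.

```python
from typing import Any, Dict, List


def build_revoke(principal_arn: str, database: str, table: str,
                 columns: List[str]) -> Dict[str, Any]:
    """Build keyword arguments for the revoke_permissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }


def fetch_table_permissions(database: str, table: str) -> Dict[str, Any]:
    """List permissions on a table (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the other helpers work without the SDK

    return boto3.client("lakeformation").list_permissions(
        ResourceType="TABLE",
        Resource={"Table": {"DatabaseName": database, "Name": table}},
    )


def summarize_permissions(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each principal to the sorted permissions found in a response."""
    summary: Dict[str, set] = {}
    for entry in response.get("PrincipalResourcePermissions", []):
        principal = entry["Principal"]["DataLakePrincipalIdentifier"]
        summary.setdefault(principal, set()).update(entry.get("Permissions", []))
    return {principal: sorted(perms) for principal, perms in summary.items()}


# Hand-written sample in the shape of a list_permissions response.
sample_response = {
    "PrincipalResourcePermissions": [
        {"Principal": {"DataLakePrincipalIdentifier": "user/czambrano"},
         "Permissions": ["SELECT"]},
        {"Principal": {"DataLakePrincipalIdentifier": "role/etl"},
         "Permissions": ["INSERT", "ALTER"]},
    ]
}
permission_summary = summarize_permissions(sample_response)
```

In a real account, `summarize_permissions(fetch_table_permissions("datalake_db", "envigado"))` would give a quick per-principal view for verification, and `boto3.client("lakeformation").revoke_permissions(**build_revoke(...))` would withdraw a column grant.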

Lake Formation is a very powerful tool for managing data lakes, enabling you to implement very granular security controls so that data is exposed only according to specific requirements. You can read more about AWS Lake Formation here.

If you want to find out more about Globant’s Cloud Ops studio, visit us here.
