[Machine Learning] Data Lake Configuration in S3


What is a data lake?

A data lake is a centralized repository that stores data of any kind: structured and unstructured data, binaries, and other files. Typically, a data lake consolidates in one place copies of the enterprise data used for reporting, visualization, analysis, and machine learning, as well as returned data.


AWS Lake Formation

AWS Lake Formation simplifies setting up a data lake on Amazon S3.

Lake Formation collects and catalogs data from databases and object storage, moves the data to a new Amazon S3 data lake, cleans and classifies the data using machine learning algorithms, and protects access to sensitive data.

Once these tasks are completed, users have access to a centralized data catalog. This catalog lists the available data sets and describes their appropriate use.
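
As a rough illustration of the first step, the sketch below registers an S3 location with Lake Formation and creates a catalog database that points at it. It assumes the boto3 SDK and configured AWS credentials. The bucket, prefix, and database names are hypothetical, and using the service-linked role is an assumption made for the example, not a requirement. The helper functions only build the request parameters, so they can be inspected without contacting AWS.

```python
def s3_arn(bucket: str, prefix: str = "") -> str:
    """Build the ARN of an S3 bucket, or of a prefix inside it."""
    prefix = prefix.strip("/")
    return f"arn:aws:s3:::{bucket}/{prefix}" if prefix else f"arn:aws:s3:::{bucket}"


def register_params(bucket: str, prefix: str = "") -> dict:
    """Parameters for the Lake Formation register_resource call."""
    # Assumption: the service-linked role is used instead of a custom RoleArn.
    return {"ResourceArn": s3_arn(bucket, prefix), "UseServiceLinkedRole": True}


def database_params(name: str, bucket: str, prefix: str = "") -> dict:
    """Parameters for the Glue create_database call (the data catalog entry)."""
    prefix = prefix.strip("/")
    location = f"s3://{bucket}/{prefix}" if prefix else f"s3://{bucket}"
    return {"DatabaseInput": {"Name": name, "LocationUri": location}}


def register_data_lake(bucket: str, prefix: str, database: str) -> None:
    """Register the S3 location and create the catalog database."""
    import boto3  # imported here so the helpers above work without the SDK

    boto3.client("lakeformation").register_resource(**register_params(bucket, prefix))
    boto3.client("glue").create_database(**database_params(database, bucket, prefix))


if __name__ == "__main__":
    # Hypothetical names; printing the parameters does not call AWS.
    print(register_params("example-data-lake", "raw/"))
    print(database_params("example_lake_db", "example-data-lake", "raw/"))
```

Calling register_data_lake("example-data-lake", "raw/", "example_lake_db") would send both requests. Collecting, cleaning, and classifying the data are separate steps that this sketch does not cover.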



With Amazon Redshift Spectrum, an S3 bucket can be configured as a data lake for Redshift analysis.

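This configuration supports heavy analytical workloads, such as big data analysis, because Redshift Spectrum queries the data in S3 directly instead of first loading it into Redshift tables.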
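
To make the Redshift Spectrum setup more concrete, here is a minimal sketch that generates the two SQL statements typically involved: an external schema backed by the data catalog, and an external table whose data lives in S3. The schema, database, table, and column names, the IAM role ARN, and the Parquet file format are all assumptions chosen for illustration. The functions only build SQL text; running it against a cluster is left to whichever SQL client is in use.

```python
def external_schema_sql(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """SQL that exposes a data catalog database to Redshift as an external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_sql(schema: str, table: str, columns: dict, s3_location: str) -> str:
    """SQL that defines an external table over files stored under an S3 prefix."""
    column_lines = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        f"{column_lines}\n"
        f")\n"
        f"STORED AS PARQUET\n"  # assumption: the files in S3 are Parquet
        f"LOCATION '{s3_location}';"
    )


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    print(external_schema_sql(
        "spectrum_lake",
        "example_lake_db",
        "arn:aws:iam::123456789012:role/ExampleSpectrumRole",
    ))
    print(external_table_sql(
        "spectrum_lake",
        "events",
        {"event_id": "BIGINT", "event_time": "TIMESTAMP", "payload": "VARCHAR(1024)"},
        "s3://example-data-lake/raw/events/",
    ))
```

Once both statements have run, a query such as SELECT COUNT(*) FROM spectrum_lake.events reads the files in S3 directly. The names are interpolated into the SQL without validation, so this sketch should only be fed trusted values.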