[AWS] What is Redshift?
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud.
Redshift is a relational database service that can serve as a data warehouse, or as the backend of a BI system, for business system data and other data.
You can start with a few hundred gigabytes of data and expand to petabytes or more. This lets you use the data to gain new insights for your business and your customers.
How to use Redshift (use cases)
With S3 set up as a data lake, Redshift can be used to query that big data in relational database form.
Differences from EMR
Unlike EMR, Redshift is a relational database dedicated to data analysis, not to data processing.
In other words, use EMR for big data processing and Redshift for BI analysis.
- Big Data Processing with EMR
- BI Analysis with Redshift
Redshift is a fast, simple, and cost-effective data warehousing service used for BI and business data analysis.
It is not suited for high-speed processing of user behavior data. ElastiCache is suitable for this purpose.
Redshift is a fully managed, petabyte-scale relational data warehouse service in the cloud.
Starting with a few hundred gigabytes of data, it can expand to petabytes or more.
It provides a data warehouse (DWH) for business analytics, so you can run analytics on large amounts of data for your business and customers.
Amazon Redshift distributes table rows across compute nodes so that the data can be processed in parallel.
By selecting an appropriate distribution key for each table, you can optimize how data is distributed, balancing the workload and minimizing data movement between nodes.
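To make the effect of the distribution key concrete, here is a toy Python simulation. It is only a sketch: Redshift's real hash function and placement logic are not reproduced, and the node count, table contents, and the helpers `node_for`, `distribute`, and `rows_to_move_for_join` are all hypothetical names invented for this illustration.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def node_for(key_value, num_nodes=NUM_NODES):
    """Map a distribution-key value to a node (toy stand-in for Redshift's hash)."""
    return zlib.crc32(str(key_value).encode("utf-8")) % num_nodes


def distribute(rows, dist_key, num_nodes=NUM_NODES):
    """Group rows by the node their distribution-key value hashes to."""
    placement = defaultdict(list)
    for row in rows:
        placement[node_for(row[dist_key], num_nodes)].append(row)
    return placement


def rows_to_move_for_join(left, right, join_key, left_dist_key, right_dist_key):
    """Count left rows whose matching right rows sit on a different node."""
    right_nodes = defaultdict(set)
    for row in right:
        right_nodes[row[join_key]].add(node_for(row[right_dist_key]))
    moved = 0
    for row in left:
        here = node_for(row[left_dist_key])
        if any(node != here for node in right_nodes.get(row[join_key], ())):
            moved += 1
    return moved


# Hypothetical sample tables.
customers = [{"customer_id": i, "region": i % 3} for i in range(100)]
orders = [{"order_id": i, "customer_id": i % 100} for i in range(1000)]

# How many order rows land on each node when distributed on customer_id.
sizes = {node: len(rows) for node, rows in distribute(orders, "customer_id").items()}

# Both tables distributed on the join column: matching rows share a node.
colocated = rows_to_move_for_join(orders, customers, "customer_id",
                                  "customer_id", "customer_id")
# Orders distributed on order_id instead: joins have to cross nodes.
scattered = rows_to_move_for_join(orders, customers, "customer_id",
                                  "order_id", "customer_id")
print(sizes, colocated, scattered)
```

In this model, distributing both tables on the join column leaves no rows to move, while distributing one table on an unrelated column forces cross-node traffic for the join.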
Using Redshift inexpensively
Redshift offers free storage for snapshots, but only up to the storage capacity of the cluster.
Once that free snapshot storage limit is reached, you are charged for additional storage at the normal rate.
For this reason, evaluate how many days of automated snapshots you actually need, set the retention period accordingly, and delete manual snapshots that are no longer needed.
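The snapshot housekeeping described above can be sketched in plain Python. This is a simplified model that follows the description in these notes, not a billing calculator; the record fields (`id`, `type`, `created`) and the sample values are hypothetical and do not mirror any AWS API response.

```python
from datetime import datetime, timedelta, timezone


def billable_snapshot_gb(snapshot_gb, cluster_storage_gb):
    """Snapshot storage beyond the cluster's capacity is the billed part."""
    return max(0.0, snapshot_gb - cluster_storage_gb)


def stale_manual_snapshots(snapshots, keep_days, now=None):
    """Return the ids of manual snapshots older than keep_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=keep_days)
    return [s["id"] for s in snapshots
            if s["type"] == "manual" and s["created"] < cutoff]


# Hypothetical snapshot inventory.
utc = timezone.utc
snapshots = [
    {"id": "manual-2024-01-15", "type": "manual",
     "created": datetime(2024, 1, 15, tzinfo=utc)},
    {"id": "manual-2024-05-25", "type": "manual",
     "created": datetime(2024, 5, 25, tzinfo=utc)},
    {"id": "auto-2024-01-01", "type": "automated",
     "created": datetime(2024, 1, 1, tzinfo=utc)},
]

now = datetime(2024, 6, 1, tzinfo=utc)
stale = stale_manual_snapshots(snapshots, keep_days=30, now=now)
extra_gb = billable_snapshot_gb(snapshot_gb=800, cluster_storage_gb=640)
print(stale, extra_gb)
```

Automated snapshots are left out of the cleanup list on purpose, since their lifetime is governed by the retention period rather than by manual deletion.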
Enable enhanced VPC routing on a Redshift cluster when you need either of the following:
- To monitor all COPY and UNLOAD traffic entering and leaving the VPC
- To prevent traffic between the Redshift cluster and its data repositories from passing through the internet
Amazon Redshift's enhanced VPC routing forces Amazon Redshift to route all COPY and UNLOAD traffic between the cluster and the data repository through Amazon VPC.
Enhanced VPC routing allows you to use VPC security groups, network access control lists (ACLs), VPC endpoints, VPC endpoint policies, internet gateways, Domain Name System (DNS) servers, and other standard VPC features.
Use these features to manage the detailed data flow between your Amazon Redshift cluster and other resources.
If you route traffic through your VPC using enhanced VPC routing, COPY and UNLOAD traffic can also be monitored using VPC Flow Logs.
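As one possible way to switch this setting on programmatically, the sketch below uses the AWS SDK for Python. It assumes boto3 is installed and AWS credentials are configured; the cluster identifier is a placeholder, and the helper function names are made up for this example. Only `enable_enhanced_vpc_routing` talks to AWS, so the request builder can be inspected without an account.

```python
def enhanced_vpc_routing_request(cluster_id, enabled=True):
    """Build the keyword arguments for Redshift's ModifyCluster call."""
    return {"ClusterIdentifier": cluster_id, "EnhancedVpcRouting": enabled}


def enable_enhanced_vpc_routing(cluster_id, region_name):
    """Turn on enhanced VPC routing; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without it

    client = boto3.client("redshift", region_name=region_name)
    return client.modify_cluster(**enhanced_vpc_routing_request(cluster_id))


# "example-cluster" is a hypothetical identifier.
request = enhanced_vpc_routing_request("example-cluster")
print(request)
```

Calling `enable_enhanced_vpc_routing("example-cluster", "us-east-1")` would send the request; the region here is likewise only an example.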
When you need to define how queries are routed to queues during query processing
Use workload management (WLM)
With Redshift's workload management (WLM), queries can be routed to specific queues when they are executed.
WLM lets you specify which Redshift resources are allocated to the queries submitted to the cluster.
By defining queues in WLM in advance and specifying, for each queue, the percentage of memory to allocate, the concurrency level, and the timeout period, you control how resources are allocated to queries and can stop long-running queries so that they do not waste cluster resources.
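A manual WLM setup of this kind can be sketched as the JSON value of the `wlm_json_configuration` parameter in the cluster's parameter group. The queue names, percentages, and timeouts below are hypothetical, and the `route` function is only a simplified model of query-group routing written for this illustration, not Redshift's actual assignment logic.

```python
import json

# Hypothetical queues: "etl", "dashboard", and the default queue (listed last).
wlm_queues = [
    {"query_group": ["etl"], "query_concurrency": 2,
     "memory_percent_to_use": 50, "max_execution_time": 3600000},  # ms
    {"query_group": ["dashboard"], "query_concurrency": 10,
     "memory_percent_to_use": 30, "max_execution_time": 60000},  # ms
    {"query_concurrency": 5, "memory_percent_to_use": 20},  # default queue
]


def total_memory_percent(queues):
    """Sum the memory shares and reject configurations above 100 percent."""
    total = sum(q.get("memory_percent_to_use", 0) for q in queues)
    if total > 100:
        raise ValueError(f"memory allocation adds up to {total}%")
    return total


def route(queues, query_group):
    """Simplified routing: first queue listing the group, else the default queue."""
    for index, queue in enumerate(queues):
        if query_group in queue.get("query_group", []):
            return index
    return len(queues) - 1


memory_total = total_memory_percent(wlm_queues)
wlm_json_configuration = json.dumps(wlm_queues)
print(memory_total, route(wlm_queues, "etl"), route(wlm_queues, "adhoc"))
print(wlm_json_configuration)
```

In a session, a query would typically opt into a queue by setting its query group first (for example `SET query_group TO 'etl';`); anything without a matching group falls through to the default queue.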
About Redshift Spectrum