AWS Redshift Customizing Storage Across Your Cluster / Becoming Performant
AWS Redshift is a highly tunable data warehousing environment that can also be quite cost effective compared to its older on-premises competitors.
Out of the box the product may be alright in terms of performance, but it shines when you (a Redshift developer / admin) know how your clients want to use the database and take steps to customize Redshift to take advantage of that knowledge.
By understanding the data in a Redshift database cluster (tables, indices, etc.) in as much detail as possible, including when and how it will be used, by whom, and by how many people and processes concurrently, a Redshift DW can be extensively architected to be extremely performant and very cost effective.
Amazon Web Services does a pretty good job documenting its features, including tuning, and a simple Google search will turn up an array of AWS Redshift documentation. My feeling, though, was that some of the concepts were not explained very well and the information is spread across various documents, so here's my take on the subject in one blog. It's not meant to be comprehensive, but hopefully it helps.
Redshift DW databases are set up as self-provisioning AWS cloud-based clusters that can store petabytes of data and still respond with shockingly good response times after tuning. At the time of this writing, a cluster can have up to 128 compute nodes plus a leader node (more on this soon).
The following diagram (courtesy of AWS) depicts the architecture in general.
Compute Nodes and Slices
Compute Nodes are where the end user data is stored.
As the diagram shows, Compute Nodes are also composed of “slices”. Slices are a Redshift-proprietary concept that facilitates the controlled spread of data across the cluster.
When you set up a cluster you specify the size of the cluster in Compute Nodes, and the size of each Compute Node (generally the number of CPUs, RAM, and storage limits) by picking a Compute Node type.
The following screen shot shows the Compute Node types available.
DSx compute node types generally use slower IO devices: magnetic spinning hard drives at the largest levels. DCx node types use faster SSDs and NVMe SSDs at the largest levels. Of course, DCx node types cost more.
The network between all nodes in the cluster is a 10Gb-E mesh.
Spinning up a Redshift cluster
Spinning up a Redshift cluster is easy. When you define a cluster from the Redshift Console (under Databases in the AWS Console), you get a wizard that takes you through the definition of the cluster. One of the most important pages in the wizard process is shown below. It allows you to size your cluster:
As a side note
1.) Data in Redshift is stored by default in a columnar fashion in 1 MB disk blocks, and the data can be compressed. As opposed to storing rows of data adjacent to each other (as traditional OLTP-based databases usually do), Redshift stores columns adjacent to each other. The columnar storage architecture is automatic and cannot be overridden. As a result, to optimize performance, select only the columns that you actually need in your SQL. Doing select * from table_name; is not at all advisable.
2.) Redshift is optimized for reading data and loading data in bulk. It is not optimized for OLTP / transactional operations.
3.) Redshift is architected to very efficiently support star and snowflake database designs (with fact and dimension tables) typically used in data warehouse environments.
4.) Redshift uses memory to cache recently accessed data for fast access if you ask for it again.
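To make point 1 concrete, here is a minimal sketch using a hypothetical sales table (the table and column names are illustrative, not from the original post). Because each column is stored in its own set of 1 MB blocks, naming only the columns you need limits how many blocks Redshift reads.

```sql
-- Hypothetical fact table: only the two named columns' blocks
-- are read from disk, thanks to columnar storage.
select order_date, amount
from sales
where order_date >= '2020-01-01';

-- Avoid this: it forces a read of every column's blocks.
-- select * from sales;
```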
What is a slice? How many slices per compute node? And why are slices important?
As mentioned, slices are where the tables / indices / data are stored. Generally the number of slices per compute node is determined by the number of CPUs in your compute nodes. It used to be 1 to 1; while that may not be the case with the latest releases, it is certainly true that the more CPUs you have in a compute node, the more slices you will have.
Since slices are tied to the number of CPUs, Redshift can use that fact to guarantee a CPU's availability to deal with data in a slice, which theoretically (and in most cases practically) improves performance.
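If you want to see the slice layout of your own cluster, the stv_slices system view lists one row per slice; a quick aggregate (a sketch, assuming you have privileges to read the system views) shows how many slices each compute node has:

```sql
-- One row per slice in stv_slices; grouping by node shows
-- the slice count on each compute node in the cluster.
select node, count(*) as slices
from stv_slices
group by node
order by node;
```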
Controlling where your data / tables are stored in a Redshift cluster
As mentioned, controlling how and where your tables etc. are stored is critical to performance. Where your tables are stored (that is, on which slices) is controlled by the Distribution Key.
To complicate matters, and to add flexibility and performance, there are three types (or styles) of distribution keys you can use.
- Even – let Redshift decide, in a round-robin manner, which slices to put your data on. Use this style for all tables that don’t fit the next two (or when you don’t know which style to choose). In other words, rows are distributed evenly across all compute nodes and slices.
- All – place a copy of the table on every compute node in the cluster. Use this style for small to medium dimension tables that don’t fit the next style.
- Key – you tell Redshift which column in the table to compute a hash value on. Rows whose key values produce matching hashes (across all tables distributed on that key) will be placed on the same slice.
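The three styles above map directly onto CREATE TABLE options. A minimal sketch, with hypothetical table and column names:

```sql
-- EVEN: rows are spread round-robin across all slices.
create table web_events (
    event_id   bigint,
    event_type varchar(32)
)
diststyle even;

-- ALL: a full copy of the table on every compute node;
-- a good fit for small dimension tables that are joined often.
create table dim_region (
    region_id   int,
    region_name varchar(64)
)
diststyle all;

-- KEY: rows are hashed on customer_id, so rows from this table
-- and any other table distributed on the same key values land
-- on the same slice, making those joins local to a slice.
create table orders (
    order_id    bigint,
    customer_id bigint distkey,
    order_date  date
);
```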
Controlling how your table data is ordered within a table
Table data is ordered by the Sort Key. When you create a table with DDL, you include the keyword sortkey on the column that you wish to use to control the ordering of the table’s data.
Create table syntax diagram – courtesy of AWS
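As a sketch of the sortkey keyword in use (the table is hypothetical): sorting a fact table on its date column lets range-restricted queries on that column skip whole 1 MB blocks, because Redshift keeps min/max values per block.

```sql
-- Rows are stored in order_date order on disk; a filter like
-- "where order_date >= ..." can skip blocks entirely.
create table orders_sorted (
    order_id    bigint,
    customer_id bigint,
    order_date  date sortkey
);
```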
Controlling actions against your Redshift database
In an effort to allow Redshift databases to scale in a controlled manner based on the size of the database(s) and cluster, AWS Redshift requires all access to the database to go through the Redshift Work Load Manager (WLM).
WLM manages activity through queues. A default cluster / Redshift database comes with two queues: one for the superuser defined when you provisioned the cluster, and another for queries and other types of jobs.
WLM is defined in a parameter group, where additional queues can be added. Queues have slots for jobs / queries and also control concurrency, the percentage of compute node memory to use, timeouts, and which users and programs can use a queue (by defining user groups and query groups).
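From a session, queries can be routed to a specific WLM queue by setting the query group. A sketch, assuming a queue was configured in the parameter group with a query group named 'reports' (the group name and table are hypothetical):

```sql
-- Route the following queries to the WLM queue whose
-- configuration lists the query group 'reports'.
set query_group to 'reports';

select count(*) from orders;

-- Return subsequent queries to normal queue routing.
reset query_group;
```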
Monitoring Redshift Performance
Other options related to Redshift
Using AWS EMR to convert S3 to columnar compressed Apache Parquet format
Apache Parquet format provides some of the same optimizations that Redshift applies to table data stored in Redshift, but on S3-based storage. In other words, converting text and CSV files to Parquet’s compressed columnar format provides the opportunity to query them with AWS Redshift Spectrum (which still uses Redshift) or even AWS Athena (without Redshift).
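Once the Parquet files are on S3, Redshift Spectrum can query them through an external table. A sketch, assuming an external schema (here called spectrum) has already been created against the AWS Glue / Athena data catalog; the bucket path and columns are hypothetical:

```sql
-- Expose Parquet files on S3 to Redshift Spectrum as an
-- external table; the data stays in S3 and is scanned there.
create external table spectrum.page_views (
    view_time timestamp,
    url       varchar(2048),
    user_id   bigint
)
stored as parquet
location 's3://my-bucket/page_views/';
```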