General description of the process
- Generate a fairly large volume of test data using the tpch-kit (setup described below)
- Move the data to AWS S3 (same region as your Redshift cluster)
- Set up an IAM role for the later S3-to-Redshift copy and create an AWS Redshift cluster (assigning the IAM role)
- Download, configure, and connect a desktop client (Aginity Workbench) to Redshift to issue the copy command
- Issue the S3-to-Redshift copy command
- View the results
Generating the test data
How to generate volume data for big-data testing using https://github.com/gregrahn/tpch-kit
Note: I generated this test data on a RHEL 6 VM on-premises – I call that machine/VM octobeast virtualbox o69122. But you could do exactly the same thing with an EC2 instance, and then moving the data to S3 is much quicker.
cd $HOME
yum install -y make git gcc bison flex byacc
git clone https://github.com/gregrahn/tpch-kit
# create a directory which will hold the generated data
mkdir redshiftdata
# set the environment variable that dbgen will write to
export DSS_PATH=$HOME/redshiftdata
# cd into the directory that builds the binaries that generate the data
cd tpch-kit/dbgen
make OS=LINUX
# -T o generates the orders and lineitem tables; -s 10 is scale factor 10 (roughly 10GB)
./dbgen -v -T o -s 10
cd $HOME/redshiftdata
ls -al
total 9228472
drwxr-xr-x  2 root root       4096 Jun 30 15:51 .
dr-xr-x---. 27 root root       4096 Jun 30 10:49 ..
-rw-r--r--  1 root root 7715741636 Jun 30 15:56 lineitem.tbl
-rw-r--r--  1 root root 1734195031 Jun 30 15:56 orders.tbl
Split the data files into smaller files – Redshift loads many smaller files faster
wc -l lineitem.tbl
48000000 lineitem.tbl
# divide the line count by 4 – in this case 12000000 – and use that for the split
split -d -l 12000000 -a 4 lineitem.tbl lineitem.tbl.
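The wc/split arithmetic above can be wrapped in a small helper. This is a sketch that is not in the original post (the function name `split_parts` is my own), assuming a POSIX shell with `wc` and `split` available:

```shell
# hypothetical helper: split a file into N pieces of roughly equal line counts,
# automating the manual "wc -l then divide by 4" step above
split_parts() {
  file="$1"
  parts="${2:-4}"
  lines=$(wc -l < "$file")
  # round up so the final chunk absorbs any remainder
  chunk=$(( (lines + parts - 1) / parts ))
  split -d -l "$chunk" -a 4 "$file" "$file."
}
```

For example, `split_parts lineitem.tbl 4` produces lineitem.tbl.0000 through lineitem.tbl.0003, matching the manual command above.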
Copy your test data into S3
Create your bucket (here lonzodbredshift, with the prefix redshiftdata) using the AWS Console or CLI before issuing the copy below
aws s3 cp /root/redshiftdata s3://lonzodbredshift/redshiftdata --recursive
Note: the command above uses the AWS CLI – if you need to know how to set it up, look here.
Load S3 Data Into Redshift
If you have a Redshift instance and an IAM role that grants Redshift access to S3, you can simply use the "copy" command below from a client connected to the Redshift cluster:
copy lineitem
from 's3://lonzodbredshift/redshiftdata/lineitem.tbl'
credentials 'aws_iam_role=arn:aws:iam::xxxxxxx:role/redshiftLoad';
Note: I used the free, Windows-based Aginity Workbench desktop client to connect to the cluster, as shown here.
Here is the DDL to create the Redshift table to be loaded (yes, do this before running the copy command above):
CREATE TABLE LINEITEMWPRIM (
  L_ORDERKEY      BIGINT NOT NULL,  -- references O_ORDERKEY
  L_PARTKEY       BIGINT NOT NULL,  -- references P_PARTKEY (compound fk to PARTSUPP)
  L_SUPPKEY      BIGINT NOT NULL,  -- references S_SUPPKEY (compound fk to PARTSUPP)
  L_LINENUMBER    INTEGER,
  L_QUANTITY      DECIMAL,
  L_EXTENDEDPRICE DECIMAL,
  L_DISCOUNT      DECIMAL,
  L_TAX           DECIMAL,
  L_RETURNFLAG    CHAR(1),
  L_LINESTATUS    CHAR(1),
  L_SHIPDATE      DATE,
  L_COMMITDATE    DATE,
  L_RECEIPTDATE   DATE,
  L_SHIPINSTRUCT  CHAR(25),
  L_SHIPMODE      CHAR(10),
  L_COMMENT       VARCHAR(44),
  PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)
);
By default, Redshift applies automatic compression when a copy command is run against an empty table: it samples the incoming data, alters the table to set a compression encoding on each column, and then loads the table with compressed data.
Subsequent display of DDL shows the automatic change as below.
CREATE TABLE public.lineitem (
  l_orderkey      BIGINT NOT NULL ENCODE delta,
  l_partkey       BIGINT NOT NULL ENCODE lzo,
  l_suppkey       BIGINT NOT NULL ENCODE lzo,
  l_linenumber    INTEGER ENCODE delta,
  l_quantity      NUMERIC(18, 0) ENCODE delta,
  l_extendedprice NUMERIC(18, 0) ENCODE lzo,
  l_discount      NUMERIC(18, 0) ENCODE lzo,
  l_tax           NUMERIC(18, 0) ENCODE lzo,
  l_returnflag    CHAR(1) ENCODE lzo,
  l_linestatus    CHAR(1) ENCODE lzo,
  l_shipdate      DATE ENCODE delta32k,
  l_commitdate    DATE ENCODE delta32k,
  l_receiptdate   DATE ENCODE delta32k,
  l_shipinstruct  CHAR(25) ENCODE bytedict,
  l_shipmode      CHAR(10) ENCODE bytedict,
  l_comment       VARCHAR(44) ENCODE text255
) DISTSTYLE EVEN;
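Instead of pulling the regenerated DDL, you can also query the pg_table_def catalog view to see which encodings the copy chose. This is a sketch only – pg_table_def is a real Redshift view, but the endpoint, user, and database names below are placeholders matching the psql setup described later in this post:

```shell
# list the per-column compression encodings Redshift assigned to lineitem
psql -h my-dw-instance.XXXXXX.redshift.amazonaws.com -p 5439 -U mf -d mydb \
  -c "select \"column\", type, encoding from pg_table_def where tablename = 'lineitem';"
```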
If you do not have an IAM role that grants the S3 access policies, these screenshots allude to how it is set up.
Generally the process is:
- Use AWS IAM to create a role that Redshift can assume, allowing Redshift to use S3
- Assign S3 access policies to the role
- Assign the role to the Redshift cluster (easiest when you create the cluster, but it can be done afterwards)
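The same three steps can be scripted with the AWS CLI. This is a sketch under assumptions: the role name redshiftLoad matches the copy command earlier in this post, the account ID and cluster identifier are placeholders, and I use the AWS-managed AmazonS3ReadOnlyAccess policy (read-only is enough for COPY):

```shell
# 1. create a role that the Redshift service is allowed to assume
aws iam create-role --role-name redshiftLoad \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "redshift.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'
# 2. attach an S3 access policy to the role
aws iam attach-role-policy --role-name redshiftLoad \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
# 3. associate the role with an existing cluster (placeholder identifiers)
aws redshift modify-cluster-iam-roles --cluster-identifier my-dw-instance \
  --add-iam-roles arn:aws:iam::xxxxxxx:role/redshiftLoad
```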
Generating an Alternative Test Data Set with tpcds-kit – a much larger dataset – requires a 200GB EBS volume
cd $HOME
yum install -y make git gcc flex bison byacc
git clone https://github.com/gregrahn/tpcds-kit
cd tpcds-kit/tools
# Note: the directory above contains scripts to generate large test data files –
# you must customize them to set the scale size, and the scale size cannot be too small
make OS=LINUX   # or just "make" likely works
cd $HOME
git clone https://github.com/sko71/hands-on-with-redshift.git
# the repo cloned above is from the A Cloud Guru big-data course; you will have to
# customize its scripts to fit your EC2 instance and directory paths
Installing PostgreSQL on EC2 so you can use the "psql" client to load data into Redshift
FYI – this worked on an Amazon Linux 2 AMI (which is cloned and then customized from RHEL 7); you can verify as follows:
cat /etc/system-release
Amazon Linux release
Now set up a repo and install postgresql10 in order to use the psql client. Unfortunately there is no client-only install, so the entire database software is installed, but no database is initialized. By the way, you obviously don't have to use "sudo" if you are doing this as root – I used root.
sudo yum install -y https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/pgdg-redhat10-10-2.noarch.rpm
sudo sed -i "s/rhel-\$releasever-\$basearch/rhel-latest-x86_64/g" "/etc/yum.repos.d/pgdg-10-redhat.repo"
sudo yum install -y postgresql10
psql --version
psql (PostgreSQL) 10.4
# connect to redshift
psql -h <redshift-end-point-no-port> -p <port> -U <username> -d <databaseName>
psql -h my-dw-instance.XXXXXX.redshift.amazonaws.com -p 5439 --username mf --dbname mydb
# set up / create the DDL / tables in redshift for the copy command
\i /root/hands-on-with-redshift/tpcdsddl.sql
# output would be CREATE TABLE... 8 times
# list the tables created
\d
# Ctrl-D (or \q) to exit
Password for user mfoster:
psql (10.4, server 8.0.2)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

mydb=# help
You are using psql, the command-line interface to PostgreSQL.
Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit
To pass the password into psql non-interactively, it may be possible to:
- use export PGPASSWORD=password # to pass a password into a bash script – this worked for me to connect to Redshift, though it is very insecure
- put the password in an encrypted file
- use a ~/.pgpass file
Note: I don’t have the details on the last two of these techniques – more research necessary
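For the ~/.pgpass option, the standard libpq format is one connection per line; the values below are placeholders matching the example connection earlier in this post, and the file must be chmod 600 or psql will ignore it:

```
# ~/.pgpass  – hostname:port:database:username:password
my-dw-instance.XXXXXX.redshift.amazonaws.com:5439:mydb:mf:yourPasswordHere
```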
GITHUB AWSLabs / amazon-redshift-utils
- scroll down to the admin utils
- select "collection of utilities"
- pick table_info and run the script
I believe I had to reboot the Redshift cluster after assigning the "redshiftLoad" role to the cluster before it would take effect.
Error Handling on the Copy
Error handling can be accomplished in several ways; one is to check the following system error tables:
STL_LOAD_ERRORS
STL_LOADERROR_DETAIL
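A quick way to check those tables after a failed copy is a query like the following. This is a sketch only – the connection details are placeholders from earlier in this post, while the column names come from the Redshift STL_LOAD_ERRORS documentation:

```shell
# show the ten most recent load errors, with the offending file and column
psql -h my-dw-instance.XXXXXX.redshift.amazonaws.com -p 5439 -U mf -d mydb -c "
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;"
```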