Here is some of the material I used to prepare for the Amazon AWS Big Data Specialty Certification test
Let's Start With A Hadoop 2.0 Very High Level Diagram
Note: Spark, Tez and other Big Data products can and do use some Hadoop 2.0 components such as YARN and HDFS2, while replacing others such as MapReduce. These days Hadoop MapReduce has largely been replaced by more efficient engines that minimize IO and use memory and CPU better, while keeping the same massively parallel, distributed architecture concepts in general.
Most excellent five part Redshift blogs plus
Redshift cost containment and troubleshooting
Note – Redshift clusters cannot be “started and shut down”; they can only be “created and deleted”. They can be rebooted for parameter group changes that require a restart, and there are “Modify Cluster” and “Resize Cluster” operations for network and size related changes, but there is no clean shutdown and start back up.
So, if you want to “shut” one down and keep the data, take a manual snapshot BEFORE you delete it. To get it back, you’ll need to create a new cluster from the snapshot and re-attach the assigned roles etc.
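The snapshot, delete, restore cycle above can be scripted. Here is a minimal sketch, assuming a boto3 Redshift client is passed in as `redshift`; the function names `park_cluster` and `revive_cluster` are made up for illustration, and the identifiers and role ARNs are whatever your own cluster uses. I have not run this against a real cluster, so treat it as a starting point only.

```python
def park_cluster(redshift, cluster_id, snapshot_id):
    """Take a manual snapshot, wait until it is usable, then delete the cluster.

    `redshift` is assumed to be a boto3 Redshift client,
    e.g. boto3.client("redshift").
    """
    redshift.create_cluster_snapshot(
        SnapshotIdentifier=snapshot_id,
        ClusterIdentifier=cluster_id,
    )
    # Do not delete anything until the manual snapshot is available.
    redshift.get_waiter("snapshot_available").wait(
        ClusterIdentifier=cluster_id,
        SnapshotIdentifier=snapshot_id,
    )
    redshift.delete_cluster(
        ClusterIdentifier=cluster_id,
        SkipFinalClusterSnapshot=True,  # the manual snapshot above is our copy
    )


def revive_cluster(redshift, cluster_id, snapshot_id, iam_role_arns):
    """Create a new cluster from the manual snapshot and re-attach IAM roles."""
    redshift.restore_from_cluster_snapshot(
        ClusterIdentifier=cluster_id,
        SnapshotIdentifier=snapshot_id,
        IamRoles=list(iam_role_arns),
    )
```

Passing the client in, instead of creating it inside the functions, keeps the sketch easy to try against a stub before pointing it at anything that costs money.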
Redshift Backups – Automated and Manual
By default, Redshift automatically backs up your database cluster on creation and then about every 8 hours or after every 5 GB per node of data changes, whichever comes first. Each automated backup is retained for 1 day.
What does “retained for 1 day” mean exactly? The good news is that you will always have a recent backup of your database if you use automated backups. The bad news is that you cannot restore to a point older than the “retention window”, which is one day by default.
One day of automated backups is free. If you extend the retention period past the default one day, you pay for the incremental storage beyond that first day…
And oh yes, these automated backups are incremental and only Amazon has access to the physical files, you don’t have to worry about how they are keeping all the “incrementals” around, it just happens – and it is a good thang ;-)…
When you restore a snapshot – whether it is an automated snapshot or a manual snapshot – you are creating a new cluster. You can, however, restore an individual table from a snapshot into an existing cluster.
You can create manual snapshots (yes you are charged for the storage of these) either from the console, or via a script and manual snapshots stay around until you manually delete them. Manual snapshots are not incremental.
Snapshot Manager, which creates snapshots using a scheduled Lambda (I think), is in the Redshift Utils GitHub repo.
This blog explains a similar approach pretty well.
If you are not using a large Redshift database very frequently, and don’t want to go thru the time consuming act of restoring it every time you want the database “up”, here’s another possible approach or two:
1.) keep your data in Spectrum (which is S3), and partition it in S3… The metadata will reside in Redshift, but the DB will be small, and backups and restores will be fast and cheap… I haven’t tried this but, it sounds reasonable to me.
Here’s another backup approach:
Kinesis and Spark with integration
Note: Use caution when running the following Kinesis to Spark and Spark to Redshift examples, as they could cost QUITE A BIT OF $$$ MONEY. I would cut down on the # of shards and the volume of data that they use, and pay close attention to the instance sizes you are spinning up and to ECS and EMR cluster sizes and instance types. But if you want to learn Kinesis and Spark and Kinesis to Spark integration… here you go.
This most excellent blog gives you a step by step descriptive recipe for generating a test data stream, putting it on Kinesis Streams, taking it off KS, and manipulating it in an EMR cluster with Spark – and more.
Here is just the command to create the cluster from the AWS CLI. Note that this blog is a couple of years old, so check parameters like instance type for current valid values and for cost effective instance types that you want to use (it uses InstanceType=m3.xlarge):
aws emr create-cluster --release-label emr-4.2.0 --applications Name=Spark Name=Hive --ec2-attributes KeyName=myKey --use-default-roles --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --bootstrap-actions Path=s3://aws-bigdata-blog/artifacts/Querying_Amazon_Kinesis/DownloadKCLtoEMR400.sh,Name=InstallKCLLibs
But the big learning opportunity here is in the example Scala program that reads from the Kinesis Stream (known as a “consumer”) – it seems to be very well documented. I have not tried it yet, but implementing the entire example appears fairly straightforward for an operation such as this:
Kinesis Streams Related Note:
How to send event messages to Kinesis synchronously or asynchronously for critical payloads versus not so critical (informational) payloads:
- The critical events must be sent synchronously, and informational events can be sent asynchronously.
- The Kinesis Producer Library (KPL) implements an asynchronous send function, so it can be used for the informational non-critical messages as the write returns before it is confirmed successful.
- PutRecords API is a synchronous send function (call to put records does not return until it is confirmed success/failure), so it must be used for the critical events.
- Stream data is stored across multiple Availability Zones in a region for 24 hours by default
- An Amazon Kinesis stream is made up of one or more shards. Each shard gives you a read capacity of 5 transactions per second, up to a maximum total of 2 MB of data read per second. Each shard can support up to 1000 record writes per second, up to a maximum total of 1 MB of data written per second.
- Kinesis Streams Producers can write data to a Kinesis Stream using one of three methods:
1.) the Amazon Kinesis PUT API (synchronously),
2.) the Amazon Kinesis Producer Library (KPL) (asynchronously),
3.) or the Amazon Kinesis Agent
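The per-shard limits above turn into a quick sizing calculation. Here is a minimal sketch that uses only the numbers quoted in this post (1 MB/s and 1000 records/s written, 2 MB/s read per shard); the function name `shards_needed` is my own, and it ignores the 5 read transactions per second limit and any burst or hot partition key effects.

```python
import math

# Per-shard limits as quoted above.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    by_write_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_count, by_read_bytes)


# Many small records: the records/second limit drives the answer, not MB/s.
print(shards_needed(write_mb_per_sec=0.2, records_per_sec=5000, read_mb_per_sec=0.4))
```

Whichever limit you hit first decides the shard count, which is why lots of tiny records can need more shards than the raw MB/s would suggest.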
Below are a couple more examples / blog posts for Kinesis to Spark
Example 2.) this time with a Zeppelin notebook (a Spark oriented development and test environment that also supports some visualizations)
Example 4.) Redshift Analytics With Spark based Machine Learning Using Databricks Library
Example 5.) Spark SQL For ETL
6.) Github DynamoDB connector to EMR
7.) Github Redshift To Spark Connector Databricks
Amazon EMR / EC2 Instance Types For Big Data
Because your requirements and the type of processing you focus on may vary when dealing with Big Data sets, a variety of AWS EC2 instance types can be of value. Here, very generally, is how you would decide on an instance type.
- For Hadoop or other IO intensive workloads (that use HDFS, Instance Stores – Ephemeral Volumes, EBS Local volumes) – consider using I3 or D2 instance types
- For general purpose workloads – Consider M3 or M4 Instance Types
- For Machine Learning or other compute intensive workloads – consider P2 or C3/C4 Instance Types
- For Spark / Redis / Memcached memory intensive workloads – consider R3 or R4 Instance Types
Amazon EMR HDFS Default Replication Factor Depends on Cluster Size
HDFS replicates blocks for durability. The setting is in hdfs-site.xml and defaults to:
- 3 for clusters of 10 or more core nodes
- 2 for clusters from 4 to 9 core nodes
- 1 for clusters of 1 to 3 core nodes
hdfs-site.xml is found in the …/conf directory of the hadoop installation directory
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Block Replication</description>
</property>
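The default replication rule above is easy to write down as a function. A minimal sketch follows; the function name `default_dfs_replication` is mine, the thresholds are the ones listed in this post, and an explicit dfs.replication value in hdfs-site.xml would override whatever this returns.

```python
def default_dfs_replication(core_nodes):
    """Return the EMR default HDFS replication factor for a given core node count."""
    if core_nodes < 1:
        raise ValueError("a cluster needs at least one core node")
    if core_nodes >= 10:
        return 3
    if core_nodes >= 4:
        return 2
    return 1


# Raw HDFS capacity is divided by the replication factor, so a small cluster
# keeps more usable space but has less protection against losing a node.
for nodes in (2, 6, 12):
    print(nodes, "core nodes ->", default_dfs_replication(nodes), "replica(s)")
```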
EMR Best Practices and Optimization
Various blogs about personal experiences with the Amazon AWS Big Data Specialty Certification test:
Google and watch AWS Reinvent youtubes for various Big Data subjects – Including But Not limited to:
Kinesis, Redshift, DynamoDB, EMR, Data Pipeline, Compression, S3
Be careful though: at the time of this writing (Aug 2018) the Big Data Certification test is getting old, so technologies newer to the AWS landscape (e.g. Glue, Redshift Spectrum, S3 Select) will NOT be on the test. You may want to filter your searches for slightly older re:Invent deep dives.
AWS Big Data White Papers
A key quote from the white paper above is:
The following services are described in order from collecting, processing, storing, and analyzing big data:
Amazon Kinesis Streams
Amazon Elastic MapReduce
Amazon Machine Learning
Amazon Elasticsearch Service
Google “AWS Redshift github” for the github containing AWS’s canned utility Redshift scripts
Ok – I’ll save you some time – here are a couple of the links:
In this zip / github repo – there is an AdminScripts dir with a table_inspector.sql script that tells you a lot about what you have done or not – in terms of skew, compression, sort key, and distribution key definitions. Again, this just tells you if they are there – not if it is appropriate. See screenshot below for sample output:
Github Lambda Redshift Loader Scripts
A bit dated but: https://aws.amazon.com/blogs/big-data/a-zero-administration-amazon-redshift-database-loader/
TPC Link (to generate test data sets like TPC-H / TPC-DS): http://www.tpc.org/
AWS Docs. For WorkLoad Manager / WLM Assignment Rules
Note: for query group based queues, use set query_group to queue_name; in the SQL code to direct a query to a specific query group queue.
Upload From Internet To S3 – Use MultiPart Uploads For Parallel
S3 copies / copy to Redshift
select * from stv_slices; -- slice numbering starts at zero
Query this table and run your copies in parallel based on # of slices and nature of the data / size
s3 copy with no manifest is easy – google it – use iam role based credentials
Note: compress large files (gzip, bzip2 etc.). Redshift COPY can load compressed files when you name the format in the command (e.g. GZIP or BZIP2). Also split the data into files of roughly even size, up to 1 GB each if possible.
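Here is a minimal sketch of the file splitting arithmetic, assuming you already know the total compressed size and the slice count (from stv_slices). The function name `plan_copy_files` is made up, and rounding the file count up to a multiple of the slice count is my reading of the general AWS guidance, so every slice gets the same number of files.

```python
ONE_GB = 1024 ** 3


def plan_copy_files(total_compressed_bytes, slice_count, max_file_bytes=ONE_GB):
    """Return (file_count, bytes_per_file) for an evenly split COPY load."""
    if total_compressed_bytes <= 0 or slice_count <= 0:
        raise ValueError("size and slice count must be positive")
    # Fewest files that keeps every file at or under the size cap.
    min_files = (total_compressed_bytes + max_file_bytes - 1) // max_file_bytes
    # Round up to a whole multiple of the slice count.
    rounds = (min_files + slice_count - 1) // slice_count
    file_count = rounds * slice_count
    # Even split, rounded up so the files together cover all the data.
    bytes_per_file = (total_compressed_bytes + file_count - 1) // file_count
    return file_count, bytes_per_file


# 10 GB of compressed data on a cluster with 4 slices.
files, size = plan_copy_files(10 * ONE_GB, slice_count=4)
print(files, "files of about", size // (1024 ** 2), "MB each")
```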
Here are a few examples of DynamoDB, EMR, and S3 with manifest to Redshift copies
S3 Copy To Redshift With Manifest – Features
- load only files listed in manifest no matter what is in the directory / bucket / folder
- load files from different buckets
- load files with different prefixes
- better error handling – and check STL_LOAD_ERRORS & STL_LOADERROR_DETAIL
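A manifest is just a small JSON file listing the exact objects to load. Below is a minimal sketch that builds one and the matching COPY statement; the helper names, bucket names, table name and role ARN are all placeholders for illustration, and the COPY text shows only the bare minimum options (add your own format and compression options).

```python
import json


def build_manifest(s3_urls, mandatory=True):
    """Return manifest JSON listing exactly the files COPY should load."""
    entries = [{"url": url, "mandatory": mandatory} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


def copy_statement(table, manifest_url, iam_role_arn):
    """Return a bare-bones COPY statement that reads the manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "MANIFEST;"
    )


# Files can come from different buckets and prefixes (placeholder names).
print(build_manifest([
    "s3://example-bucket-a/sales/2018/part-000.gz",
    "s3://example-bucket-b/archive/sales-part-001.gz",
]))
print(copy_statement(
    "sales",
    "s3://example-bucket-a/manifests/sales.manifest",
    "arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
))
```

With "mandatory" set to true, COPY fails if a listed file is missing instead of quietly loading less data than you expected.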
Three machine learning supervised learning models and how model ACCURACY is SCORED
Model accuracy is determined differently depending on the type of modelling you are doing.
- Linear Regression Models – least squares curve fit – RMSE – lower is better
- Multi-Class Models – F1 – 0 to 1 – closer to 1 is better – relative score, meaning a model with a higher score is better
- Binary Models (e.g. yes/no) – AUC – between 0 and 1 – closer to 1 is better – .5 means 50/50 no better than guessing – less than .5 throw it out / worse than guessing
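To make the three scores concrete, here is a minimal pure Python sketch of each. It assumes the multi-class F1 is the macro average (the plain mean of the per-class F1 scores) and computes AUC by comparing every positive/negative pair; the function names are mine, and a real project would use a library rather than these toy versions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error for regression; lower is better."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def macro_f1(actual, predicted):
    """Mean of per-class F1 scores for multi-class; closer to 1 is better."""
    labels = sorted(set(actual) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def auc(actual, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# A model that ranks every positive above every negative is a perfect ranker.
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
```

The pairwise view of AUC also shows why .5 is the "no better than guessing" line: a model that cannot tell positives from negatives wins about half the comparisons.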
Athena Under The Covers
The advantage of using Athena is that it is serverless, and self-managing.
Today – it is integrated with AWS Glue – but when the AWS Big Data Cert. test was written Glue did not exist.
Update after taking the test: Athena and Glue are both on the test – the test has been updated since it was released.
IOT Framework and Authentication Diagrams
Note: IoT Device Gateway is also known as the Device Broker
Cognito Identity Flow – Generally Used For Mobile Authentication – There are several other ways to authenticate
Security – Key Management – Comparison AWS CloudHSM Versus AWS KMS
BTW – Asymmetric versus symmetric encryption general definition is:
symmetric algorithms: (also called "secret key") use the same key for both encryption and decryption; asymmetric algorithms: (also called "public key") use different keys for encryption and decryption.
AWS KMS – supports symmetric only (as of this writing) – CloudHSM supports both symmetric and asymmetric
Customer Master Keys – for encryption – support “envelope” encryption – where a master key is used to encrypt another key (often a data key).
Amazon AWS Elasticsearch – Tutorial To Move Apache Logs From EC2 Thru Kinesis Firehose and Lambda To ES / Elasticsearch and Kibana
Passed The Amazon AWS Big Data Certification Yesterday – Only By Two Or Three Questions…
The test was harder than I anticipated. It had lots of scenario-based questions with long, often complex, multi-sentence requirements, many with “select two” answers. About 85 percent of the 65 questions (170 minutes total) were like that, with very few easy questions. Nearing the end of the test, I was certain I had flunked it until I saw the result.
Among The Subject Matter:
- On-Prem relational to cloud / Redshift
- Encrypted Redshift backups to another region
- Kinesis, both Streams and Firehose – many questions of the type where Kinesis may fit (thought I knew this stuff but “Collection” score below proves otherwise…)
- Elasticsearch in there once or twice as a valid answer
- Compression one or two
- EMR hadoop and spark, hive, the right instance types, six or seven questions – better know this stuff
- Glue and Athena one each – use case
- Security, encryption key management, views to restrict access – knew I was weak on security and score below confirmed – not many easy questions on the subject
- Quite a few scenarios where DynamoDB fit or not, use cases (several) and RCU, WCU
- A few where Aurora or MySQL fit
- About three questions on what I consider complex Machine Learning – no basic Supervised, Binary / Multi-Class / Regression / Scoring questions – in other words, the ML questions on the test were over and above my basic study of Supervised ML – I probably missed them
- Visualization – basic what kind of chart type questions – one pivot scenario
- D3.js versus Jupyter versus Zeppelin
Had enough time to complete the test but again, had to read fast throughout the entire 2.8 hours – as a result guessed a couple times to save time – ended up with 15 or 20 minutes left over
Didn’t flag and go back, just trusted initial instinct
Used the typical MO of first removing answers that did not fit the question, then considering what was left – usually two
Here are the category scores I received on the AWS Big Data Specialty Certification Test – I’m a 25+ year combo DBA, DevOps guy with heavy database OLTP and DW experience (mostly on-prem, with a few years cloud)
email me at: email@example.com if you have questions – I will reply if time permits
Passed! Congratulations again on your achievement!
Overall Score: 72%
Topic Level Scoring:
1.0 Collection: 50%
2.0 Storage: 100%
3.0 Processing: 62%
4.0 Analysis: 75%
5.0 Visualization: 100%
6.0 Data Security: 50%