Databases
Choosing the Right Database
• We have a lot of managed databases on AWS to choose from
• Questions to choose the right database based on your architecture:
• Read-heavy, write-heavy, or balanced workload? What are the throughput needs? Will they change, and do they need to scale or fluctuate during the day?
• How much data to store and for how long? Will it grow? Average object size? How are they accessed?
• Data durability? Is it the source of truth for the data?
• Latency requirements? Concurrent users?
• Data model? How will you query the data? Joins? Structured? Semi-Structured?
• Strong schema? More flexibility? Reporting? Search? RDBMS / NoSQL?
• Licensing costs? Would switching to a cloud-native DB such as Aurora help?
Database Types
• RDBMS (= SQL / OLTP): RDS, Aurora – great for joins
• NoSQL database: DynamoDB (~JSON), ElastiCache (key / value pairs), Neptune (graphs) – no joins, no SQL
• Object Store: S3 (for big objects) / Glacier (for backups / archives)
• Data Warehouse (= SQL Analytics / BI): Redshift (OLAP), Athena
• Search: ElasticSearch (JSON) – free text, unstructured searches
• Graphs: Neptune – displays relationships between data
OLAP = Online Analytical Processing
RDS Overview
• Managed PostgreSQL / MySQL / Oracle / SQL Server
• Must provision an EC2 instance & EBS Volume type and size
• Support for Read Replicas and Multi AZ
• Security through IAM, Security Groups, KMS, SSL in transit
• Backup / Snapshot / Point in time restore feature
• Managed and Scheduled maintenance
• Monitoring through CloudWatch
• Use case: store relational datasets (RDBMS / OLTP), perform SQL queries, transactional inserts / updates / deletes (see the example below)
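As a rough sketch of the transactional SQL use case above: connect to an RDS PostgreSQL instance with a standard driver and commit two statements atomically. The endpoint, credentials, and table names below are made up for illustration.

```python
import psycopg2  # standard PostgreSQL driver; any SQL client works the same way

# Hypothetical RDS endpoint and credentials, for illustration only
conn = psycopg2.connect(
    host="mydb.abc123xyz.us-east-1.rds.amazonaws.com",  # RDS instance endpoint
    port=5432,
    dbname="appdb",
    user="app_user",
    password="example-password",
    sslmode="require",  # SSL in transit
)

# Transactional insert + update: both statements commit together or not at all
with conn:
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (customer_id, total) VALUES (%s, %s)",
            (42, 99.90),
        )
        cur.execute(
            "UPDATE customers SET order_count = order_count + 1 WHERE id = %s",
            (42,),
        )

conn.close()
```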
RDS for Solutions Architect
• Operations: small downtime when failover or maintenance happens; scaling read replicas / EC2 instances / restoring EBS requires manual intervention and application changes
• Security: AWS responsible for OS security, we are responsible for setting up KMS, security groups, IAM policies, authorizing users in DB, using SSL
• Reliability: Multi AZ feature, failover in case of failures
• Performance: depends on EC2 instance type, EBS volume type, ability to add Read Replicas. Storage auto-scaling & manual scaling of instances
• Cost: Pay per hour based on provisioned EC2 and EBS
Aurora Overview
• API-compatible with PostgreSQL / MySQL
• Data is held in 6 replicas, across 3 AZs
• Auto healing capability
• Multi AZ, Auto Scaling Read Replicas
• Read Replicas can be Global
• Aurora database can be Global for DR or latency purposes
• Auto scaling of storage from 10 GB to 128 TB
• Define EC2 instance type for Aurora instances
• Same security / monitoring / maintenance features as RDS
• Aurora Serverless – for unpredictable / intermittent workloads
• Aurora Multi-Master – multiple writer instances, for continuous write availability if a writer fails
• Use case: same as RDS, but with less maintenance / more flexibility / more performance
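A minimal sketch of how an application might use an Aurora cluster's writer and reader endpoints: writes go to the writer (cluster) endpoint, reads can be spread over the auto scaling read replicas via the reader endpoint. The endpoint names, credentials, and tables below are illustrative.

```python
import psycopg2

# Hypothetical Aurora PostgreSQL-compatible cluster endpoints (illustrative names)
WRITER_ENDPOINT = "mycluster.cluster-abc123xyz.us-east-1.rds.amazonaws.com"
READER_ENDPOINT = "mycluster.cluster-ro-abc123xyz.us-east-1.rds.amazonaws.com"

def connect(host):
    return psycopg2.connect(
        host=host, port=5432, dbname="appdb",
        user="app_user", password="example-password", sslmode="require",
    )

# Writes go to the cluster (writer) endpoint
with connect(WRITER_ENDPOINT) as wconn, wconn.cursor() as cur:
    cur.execute("INSERT INTO events (name) VALUES (%s)", ("signup",))

# Reads are load-balanced across the read replicas behind the reader endpoint
with connect(READER_ENDPOINT) as rconn, rconn.cursor() as cur:
    cur.execute("SELECT count(*) FROM events")
    print(cur.fetchone()[0])
```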
Aurora for Solutions Architect
• Operations: less operations, auto scaling storage
• Security: AWS responsible for OS security, we are responsible for setting up KMS, security groups, IAM policies, authorizing users in DB, using SSL
• Reliability: Multi AZ, highly available, possibly more than RDS, Aurora Serverless option, Aurora Multi-Master option
• Performance: 5x performance (according to AWS) due to architectural optimizations. Up to 15 Read Replicas (only 5 for RDS)
• Cost: Pay per hour based on EC2 and storage usage. Possibly lower costs compared to Enterprise grade databases such as Oracle
ElastiCache Overview
• Managed Redis / Memcached (similar offering as RDS, but for caches)
• In-memory data store, sub-millisecond latency
• Must provision an EC2 instance type
• Support for Clustering (Redis) and Multi AZ, Read Replicas (sharding)
• Security through IAM, Security Groups, KMS, Redis Auth
• Backup / Snapshot / Point in time restore feature
• Managed and Scheduled maintenance
• Monitoring through CloudWatch
• Use Case: key/value store, frequent reads, fewer writes, caching results of DB queries, storing session data for websites, cannot use SQL (see the cache-aside sketch below)
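A minimal cache-aside sketch with redis-py against a hypothetical ElastiCache Redis endpoint: read from the cache first, fall back to the database on a miss, and cache the result with a TTL. The endpoint, auth token, and the query_database helper are assumptions.

```python
import json
import redis

# Hypothetical ElastiCache Redis endpoint; Redis AUTH + TLS assumed enabled
cache = redis.Redis(
    host="my-redis.abc123.ng.0001.use1.cache.amazonaws.com",
    port=6379,
    password="example-auth-token",
    ssl=True,
)

def get_user(user_id):
    """Cache-aside pattern: try the cache first, fall back to the database."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)            # cache hit: sub-millisecond

    user = query_database(user_id)           # cache miss: hypothetical helper hitting the slower DB
    cache.setex(key, 300, json.dumps(user))  # store with a 5-minute TTL
    return user
```

The TTL is the main design choice: a longer TTL means fewer database reads but potentially staler data.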
ElastiCache for Solutions Architect
• Operations: same as RDS
• Security: AWS responsible for OS security, we are responsible for setting up KMS, security groups, IAM policies, users (Redis Auth), using SSL
• Reliability: Clustering, Multi AZ
• Performance: Sub-millisecond performance, in memory, read replicas for sharding, very popular cache option
• Cost: Pay per hour based on EC2 and storage usage
DynamoDB Overview
• AWS proprietary technology, managed NoSQL database
• Serverless, provisioned capacity, auto scaling, on demand capacity (Nov 2018)
• Can replace ElastiCache as a key/value store (storing session data for example)
• Highly Available, Multi AZ by default, Read and Writes are decoupled, DAX for read cache
• Reads can be eventually consistent or strongly consistent
• Security, authentication and authorization is done through IAM
• DynamoDB Streams to integrate with AWS Lambda
• Backup / Restore feature, Global Table feature
• Monitoring through CloudWatch
• Can only query on primary key, sort key, or indexes
• Use Case: serverless application development (small documents, hundreds of KB at most), distributed serverless cache; no SQL query language; transactions capability added in Nov 2018 (see the query example below)
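A short boto3 sketch of the point above that queries only work against the primary key / sort key (or an index); the Orders table, its key schema, and the attribute names are hypothetical.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Orders")  # hypothetical table: partition key "customer_id", sort key "order_date"

# Write an item (single-digit millisecond latency)
table.put_item(Item={"customer_id": "42", "order_date": "2023-01-15", "total": 99})

# Query only works on the primary key / sort key (or an index), not arbitrary attributes
response = table.query(
    KeyConditionExpression=Key("customer_id").eq("42") & Key("order_date").begins_with("2023-")
)
for item in response["Items"]:
    print(item)
```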
DynamoDB for Solutions Architect
• Operations: no operations needed, auto scaling capability, serverless
• Security: full security through IAM policies, KMS encryption, SSL in transit
• Reliability: Multi AZ, Backups
• Performance: single digit millisecond performance, DAX for caching reads, performance doesn’t degrade if your application scales
• Cost: Pay per provisioned capacity and storage usage (no need to guess in advance any capacity – can use auto scaling)
S3 Overview
• S3 is a key / value store for objects
• Great for big objects, not so great for small objects
• Serverless, scales infinitely, max object size is 5 TB
• Strong consistency
• Tiers: S3 Standard, S3 IA, S3 One Zone IA, Glacier for backups
• Features: Versioning, Encryption, Cross Region Replication, etc…
• Security: IAM, Bucket Policies, ACL
• Encryption: SSE-S3, SSE-KMS, SSE-C, client side encryption, SSL in transit
• Use Case: static files, key value store for big files, website hosting
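A quick boto3 sketch of S3 as a key/value store for objects; the bucket name and object key are made up, and SSE-S3 encryption is requested explicitly just for illustration.

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"  # hypothetical bucket name

# S3 as a key/value store: the key is the object path, the value is the object bytes
s3.put_object(
    Bucket=bucket,
    Key="reports/2023/summary.csv",
    Body=b"date,total\n2023-01-15,99\n",
    ServerSideEncryption="AES256",  # SSE-S3
)

obj = s3.get_object(Bucket=bucket, Key="reports/2023/summary.csv")
print(obj["Body"].read().decode())
```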
S3 for Solutions Architect
• Operations: no operations needed
• Security: IAM, Bucket Policies, ACL, Encryption (Server/Client), SSL
• Reliability: 99.999999999% durability / 99.99% availability, Multi AZ, CRR
• Performance: scales to thousands of read / writes per second, transfer acceleration / multi-part for big files
• Cost: pay per storage usage, network cost, requests number
Athena Overview
• Fully serverless query service with SQL capabilities
• Used to query data in S3
• Pay per query
• Output results back to S3
• Secured through IAM
• Use Case: one-time SQL queries, serverless queries on S3, log analytics
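A hedged boto3 sketch of a serverless Athena query over data in S3: the logs_db database, access_logs table, and results bucket are assumptions; results are written back to the S3 output location.

```python
import time
import boto3

athena = boto3.client("athena")

# Submit a SQL query over data sitting in S3 (hypothetical database, table, and results bucket)
response = athena.start_query_execution(
    QueryString="SELECT status, count(*) FROM access_logs WHERE year = '2023' GROUP BY status",
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then read the results (also stored back in S3)
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)
```

Since Athena bills per data scanned, partitioning the S3 data and using columnar formats keeps query costs down.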
Athena for Solutions Architect
• Operations: no operations needed, serverless
• Security: IAM + S3 security
• Reliability: managed service, uses Presto engine, highly available
• Performance: queries scale based on data size
• Cost: pay per query / per TB of data scanned, serverless
Redshift Overview
• Redshift is based on PostgreSQL, but it’s not used for OLTP
• It’s OLAP – online analytical processing (analytics and data warehousing)
• 10x better performance than other data warehouses, scale to PBs of data
• Columnar storage of data (instead of row based)
• Massively Parallel Processing (MPP) query execution
• Pay as you go based on the instances provisioned
• Has a SQL interface for performing the queries
• BI tools such as AWS Quicksight or Tableau integrate with it
Redshift Continued…
• Data is loaded from S3, DynamoDB, DMS, other DBs…
• From 1 node to 128 nodes, up to 128 TB of space per node
• Leader node: for query planning, results aggregation
• Compute node: for performing the queries, send results to leader
• Redshift Spectrum: perform queries directly against S3 (no need to load)
• Backup & Restore, Security VPC / IAM / KMS, Monitoring
• Redshift Enhanced VPC Routing: COPY / UNLOAD goes through VPC
Redshift – Snapshots & DR
• Redshift has no “Multi-AZ” mode
• Snapshots are point-in-time backups of a cluster, stored internally in S3
• Snapshots are incremental (only what has changed is saved)
• You can restore a snapshot into a new cluster
• Automated: every 8 hours, every 5 GB, or on a schedule. Set retention
• Manual: snapshot is retained until you delete it
• You can configure Amazon Redshift to automatically copy snapshots (automated or manual) of a cluster to another AWS Region
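A small boto3 sketch of the snapshot features above: taking a manual snapshot and enabling automatic cross-Region snapshot copy for DR. Cluster, snapshot, and Region names are illustrative.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Take a manual snapshot (kept until you delete it)
redshift.create_cluster_snapshot(
    SnapshotIdentifier="analytics-before-migration",
    ClusterIdentifier="analytics-cluster",
)

# Automatically copy snapshots (automated or manual) to another Region
redshift.enable_snapshot_copy(
    ClusterIdentifier="analytics-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,  # days to keep copied automated snapshots in the destination Region
)
```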
Loading data into Redshift
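The usual way to bulk-load large datasets is a COPY from S3 (as noted above, data can also come from DynamoDB, DMS, or other databases). Below is a hedged sketch that issues the COPY through the Redshift Data API; the cluster, database, IAM role, bucket, and table names are all assumptions.

```python
import boto3

data_api = boto3.client("redshift-data")

# Bulk-load CSV files from S3 with COPY (all identifiers below are illustrative)
copy_sql = """
    COPY sales
    FROM 's3://my-data-bucket/sales/2023/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
"""

data_api.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
```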

Redshift Spectrum
• Query data that is already in S3 without loading it
• Must have a Redshift cluster available to start the query
• The query is then submitted to thousands of Redshift Spectrum nodes
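A sketch of what Spectrum looks like in SQL (issued here through the Redshift Data API): register an external schema backed by the Glue Data Catalog once, then query the S3-backed table and join it with local Redshift tables. All identifiers and the IAM role are hypothetical.

```python
import boto3

data_api = boto3.client("redshift-data")

# One-time setup: expose a Glue Data Catalog database as an external schema
data_api.execute_statement(
    ClusterIdentifier="analytics-cluster", Database="dev", DbUser="awsuser",
    Sql="""
        CREATE EXTERNAL SCHEMA spectrum
        FROM DATA CATALOG DATABASE 'logs_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';
    """,
)

# Query the S3-backed external table and join it with a local Redshift table
data_api.execute_statement(
    ClusterIdentifier="analytics-cluster", Database="dev", DbUser="awsuser",
    Sql="""
        SELECT c.name, count(*)
        FROM spectrum.access_logs l
        JOIN customers c ON c.id = l.customer_id
        GROUP BY c.name;
    """,
)
```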

Redshift for Solutions Architect
• Operations: like RDS
• Security: IAM, VPC, KMS, SSL (like RDS)
• Reliability: auto healing features, cross-region snapshot copy
• Performance: 10x performance vs other data warehousing, compression
• Cost: pay per node provisioned, 1/10th of the cost vs other warehouses
• vs Athena: faster queries / joins / aggregations thanks to indexes
• Remember: Redshift = Analytics / BI / Data Warehouse
AWS Glue
• Managed extract, transform, and load (ETL) service
• Useful to prepare and transform data for analytics
• Fully serverless service
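For a feel of what a Glue ETL job looks like, here is a minimal PySpark sketch using the standard Glue job boilerplate; the catalog database, table, and output bucket are assumptions.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read a table registered in the Glue Data Catalog (hypothetical names)
raw = glue_context.create_dynamic_frame.from_catalog(database="logs_db", table_name="raw_sales")

# Transform: keep only the columns needed for analytics
clean = raw.select_fields(["order_id", "customer_id", "total", "order_date"])

# Load: write the result back to S3 as Parquet for Athena / Redshift Spectrum
glue_context.write_dynamic_frame.from_options(
    frame=clean,
    connection_type="s3",
    connection_options={"path": "s3://my-curated-bucket/sales/"},
    format="parquet",
)
job.commit()
```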


Neptune
• Fully managed graph database
• When do we use Graphs?
• Highly connected data with many relationships
• Social networking: users are friends with users, reply to comments on other users' posts, and like other comments
• Knowledge graphs (Wikipedia)
• Highly available across 3 AZ, with up to 15 read replicas
• Point-in-time recovery, continuous backup to Amazon S3
• Support for KMS encryption at rest + HTTPS
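A small Gremlin sketch (using the gremlinpython client) against a hypothetical Neptune endpoint, to show the kind of relationship traversal a graph database is built for; the endpoint and the vertex/edge labels are made up.

```python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Hypothetical Neptune cluster endpoint; Gremlin is served over WebSocket on port 8182
conn = DriverRemoteConnection(
    "wss://my-neptune.cluster-abc123xyz.us-east-1.neptune.amazonaws.com:8182/gremlin", "g"
)
g = traversal().withRemote(conn)

# Add two users and a friendship edge between them
alice = g.addV("User").property("name", "alice").next()
bob = g.addV("User").property("name", "bob").next()
g.V(alice).addE("FRIENDS_WITH").to(__.V(bob)).next()

# Walk the graph: who are alice's friends?
friends = g.V().has("User", "name", "alice").out("FRIENDS_WITH").values("name").toList()
print(friends)

conn.close()
```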
Neptune for Solutions Architect
• Operations: similar to RDS
• Security: IAM, VPC, KMS, SSL (similar to RDS) + IAM Authentication
• Reliability: Multi-AZ, clustering
• Performance: best suited for graphs, clustering to improve performance
• Cost: pay per node provisioned (similar to RDS)
• Remember: Neptune = Graphs
Amazon OpenSearch Service
• Amazon OpenSearch Service is the successor to Amazon ElasticSearch Service
• Example: In DynamoDB, you can only find by primary key or indexes
• With OpenSearch, you can search any field, even with partial matches
• It’s common to use OpenSearch as a complement to another database
• OpenSearch also has some usage for Big Data applications
• You can provision a cluster of instances
• Built-in integrations: Amazon Kinesis Data Firehose, AWS IoT, and Amazon CloudWatch Logs for data ingestion
• Security through Cognito & IAM, KMS encryption, SSL & VPC
• Comes with OpenSearch Dashboards (visualization)
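A short sketch with the opensearch-py client to illustrate searching on arbitrary fields with partial/fuzzy matching; the domain endpoint, credentials, index, and documents are all hypothetical, and this assumes fine-grained access control with basic auth rather than IAM request signing.

```python
from opensearchpy import OpenSearch

# Hypothetical OpenSearch domain endpoint and master-user credentials
client = OpenSearch(
    hosts=[{"host": "search-my-domain-abc123.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "example-password"),
    use_ssl=True,
)

# Index a document (e.g. mirrored from DynamoDB via a stream + Lambda)
client.index(index="products", id="42", body={"name": "wireless mouse", "brand": "Acme"})

# Full-text search on any field; fuzziness tolerates partial / misspelled terms
results = client.search(
    index="products",
    body={"query": {"match": {"name": {"query": "wireles", "fuzziness": "AUTO"}}}},
)
print(results["hits"]["hits"])
```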
OpenSearch for Solutions Architect
• Operations: similar to RDS
• Security: Cognito, IAM, VPC, KMS, SSL
• Reliability: Multi-AZ, clustering
• Performance: based on the open-source ElasticSearch project, petabyte scale
• Cost: pay per node provisioned (similar to RDS)
• Remember: OpenSearch = Search / Indexing