
Amazon S3

  • Amazon S3 allows us to store objects (files) in "buckets" (directories)

S3 Buckets and Objects

  • Buckets must have a globally unique name
  • Buckets are defined at the region level
  • Buckets must follow a naming convention:
    • No uppercase
    • No underscore
    • Names must be 3-63 characters long
    • Name should not be an IP
    • Must start with a lower case letter or number
  • Objects (files) have a key, which is the full path:
    • s3://my-bucket/my_file.txt
    • s3://my-bucket/my_folder/another_folder/my_file.txt
  • The key is composed of the prefix + the object name (see the sketch after this list):
    • s3://my-bucket/ my_folder/another_folder/ my_file.txt (prefix: my_folder/another_folder/, object name: my_file.txt)
  • There is no concept of directories within buckets, although the UI will trick us into thinking otherwise
  • The keys can be very long names which contain slashes ("/")
  • Object values are the content of the body
  • Max object size is 5TB
  • If uploading more than 5 GB, we must use multi-part upload
  • Each object in S3 can have metadata, tags and (optional) version ID
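
  • A minimal boto3 (Python) sketch of the idea above - the bucket name, key and content are hypothetical; it shows that "folders" are just part of the object key and that the console's folder view is really a prefix listing:

      import boto3  # AWS SDK for Python

      s3 = boto3.client("s3")

      # The key is a plain string; the slashes only simulate folders in the UI
      s3.put_object(
          Bucket="my-bucket",                          # hypothetical bucket name
          Key="my_folder/another_folder/my_file.txt",  # prefix + object name
          Body=b"hello S3",
      )

      # Listing by prefix is how the console builds its "folder" view
      response = s3.list_objects_v2(Bucket="my-bucket", Prefix="my_folder/another_folder/")
      for obj in response.get("Contents", []):
          print(obj["Key"], obj["Size"])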

S3 Versioning

  • Versioning should be enabled at the bucket level
  • Versioning means: objects uploaded with the same key will increment the version
  • It is best practice to version files in S3, because:
    • It will protect against unintended deletes (ability to restore version)
    • Provides the ability to rollback a file
  • Notes:
    • Any file that is not versioned prior to enabling versioning will have the version "null"
    • Suspending versioning does not delete the previous versions of existing files
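
  • A short boto3 sketch (hypothetical bucket name) for turning versioning on; "Suspended" would pause it without deleting existing versions:

      import boto3

      s3 = boto3.client("s3")

      # Versioning is a bucket-level setting
      s3.put_bucket_versioning(
          Bucket="my-bucket",  # hypothetical bucket name
          VersioningConfiguration={"Status": "Enabled"},  # "Suspended" pauses versioning
      )

      # Every new PUT of an existing key now creates a new version instead of overwriting it
      versions = s3.list_object_versions(Bucket="my-bucket", Prefix="my_file.txt")
      for v in versions.get("Versions", []):
          print(v["Key"], v["VersionId"], v["IsLatest"])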

S3 Encryption for Objects

  • There are 4 methods of encrypting objects in S3:
    • SSE-S3: encrypts S3 objects using keys handled and managed by AWS
    • SSE-KMS: leverages AWS KMS to manage encryption keys
    • SSE-C: keys are managed by the users
    • Client Side Encryption

SSE-S3:

  • Encryption using keys handled and managed by AWS
  • Objects are encrypted server side
  • Uses AES-256 encryption
  • In order to upload and set the SSE-S3 encryption, the upload request must have a header "x-amz-server-side-encryption": "AES256"
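
  • With boto3 the header is set for us when we pass ServerSideEncryption; a minimal sketch with a hypothetical bucket and key:

      import boto3

      s3 = boto3.client("s3")

      # boto3 sends the "x-amz-server-side-encryption: AES256" header on our behalf
      s3.put_object(
          Bucket="my-bucket",             # hypothetical bucket name
          Key="report.csv",
          Body=b"col1,col2\n1,2\n",
          ServerSideEncryption="AES256",  # SSE-S3
      )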

SSE-KMS

  • Encryption using keys handled and managed by KMS
  • KMS advantages: user control + audit trail
  • Objects are encrypted server side
  • In order to upload and set the SSE-KMS encryption, the upload request must have a header "x-amz-server-side-encryption": "aws:kms"
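
  • The same upload with SSE-KMS, sketched with boto3 (bucket name and key alias are hypothetical; omitting SSEKMSKeyId falls back to the default aws/s3 key):

      import boto3

      s3 = boto3.client("s3")

      # boto3 sends the "x-amz-server-side-encryption: aws:kms" header on our behalf
      s3.put_object(
          Bucket="my-bucket",              # hypothetical bucket name
          Key="report.csv",
          Body=b"col1,col2\n1,2\n",
          ServerSideEncryption="aws:kms",
          SSEKMSKeyId="alias/my-app-key",  # hypothetical KMS key alias
      )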

SSE-C

  • Server-side encryption using the keys provided by the customers
  • S3 does not store the encryption key the customers provide
  • To upload the data, HTTPS must be used
  • Encryption key must be provided in HTTP headers, for every HTTP request

Client Side Encryption

  • Files are encrypted by the customers before uploading them to S3
  • Client side libraries can be used to accomplish this, for example the Amazon S3 Encryption Client
  • Customers are responsible for decrypting the data as well, when retrieving it
  • Encryption keys are fully managed by the customers

S3 Security

  • User based security:
    • IAM policies - which API calls should be allowed for a specific user (managed from the IAM console)
  • Resource based:
    • Bucket policies - bucket wide rules from the S3 console, allows cross account
    • Object Access Control List
    • Bucket Access Control List
  • An IAM principal can access an S3 object if:
    • the user's IAM permissions allow it OR the resource policy allows it
    • AND there is no explicit DENY

S3 Bucket Policies

  • JSON based documents, which can contain:
    • Resources: buckets and objects
    • Actions: set of APIs to Allow or Deny
    • Effects: Allow or Deny
    • Principal: the account or user to apply the policy to
  • Use S3 bucket policies to (see the sketch after this list):
    • Grant public access to the bucket
    • Force objects to be encrypted at upload
    • Grant access to another account
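
  • A hedged sketch of the first use case - granting public read access - applied with boto3 (bucket name is hypothetical; Block Public Access, covered next, must be turned off for this to take effect):

      import json
      import boto3

      s3 = boto3.client("s3")

      # Allow anyone to GET objects from the bucket
      policy = {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "PublicReadGetObject",
                  "Effect": "Allow",
                  "Principal": "*",
                  "Action": "s3:GetObject",
                  "Resource": "arn:aws:s3:::my-bucket/*",  # hypothetical bucket
              }
          ],
      }

      s3.put_bucket_policy(Bucket="my-bucket", Policy=json.dumps(policy))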

Bucket Settings for Block Public Access

  • Block public access to buckets and objects granted through:
    • new access control lists (ACLs)
    • any access control lists (ACLs)
    • new public bucket or access point policies
  • Block public and cross-account access to buckets and objects through any public or access point policies
  • These settings were created to prevent company data leaks
  • ACL (Access Control List): define read and write permissions at the object level

S3 Security - Other

  • Networking: supports VPC Endpoints
  • Logging and audit:
    • S3 Access Logs can be stored in other S3 buckets
    • API calls can be logged in AWS CloudTrail
  • User Security:
    • MFA Delete
    • Pre-Signed URLs: URLs that are valid only for a limited time

S3 Websites

CORS

  • An origin is a scheme (protocol), host (domain) and a port
  • CORS: Cross-Origin Resource Sharing
  • Web browser based mechanism to allow requests to other origins while visiting the main origin
  • The request won't be fulfilled unless the other origin allows the request, using CORS headers (ex: Access-Control-Allow-Origin, Access-Control-Allow-Methods)
  • If a client does a cross-origin request on our S3 bucket, we need to enable the correct CORS headers
  • We can allow a specific origin or * (all origins) - see the sketch after this list
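
  • A boto3 sketch of a CORS configuration (bucket name and origin are hypothetical) allowing one front-end origin to read objects cross-origin:

      import boto3

      s3 = boto3.client("s3")

      s3.put_bucket_cors(
          Bucket="my-bucket",  # hypothetical bucket name
          CORSConfiguration={
              "CORSRules": [
                  {
                      "AllowedOrigins": ["https://www.example.com"],  # or ["*"] for all origins
                      "AllowedMethods": ["GET", "HEAD"],
                      "AllowedHeaders": ["*"],
                      "MaxAgeSeconds": 3000,
                  }
              ]
          },
      )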

S3 Consistency Model

  • S3 provides read after write consistency for PUTS of new objects
    • As soon as a new object is written, we can retrieve it (PUT 200 => GET 200)
    • This is true, except if we did a GET before to see if the object existed (GET 404 => PUT 200 => GET 404)
  • S3 provides eventual consistency for DELETES and PUTS of existing objects
    • If we read an object after updating it, we might get the older version (PUT 200 => PUT 200 => GET 200 (which might return the older version))
    • If we delete an object, we might still be able to retrieve it for a short amount of time (DELETE 200 => GET 200)
  • There is no way to request strong consistency for S3!

S3 Advanced Versioning

  • S3 Versioning creates a new version each time a file is changed
  • That includes when we encrypt a file
  • It is a nice way to get protected against unintentional file encryption (example: ransomware)
  • Deleting a file in an S3 bucket just adds a delete marker in the versioning
  • To delete a bucket, we need to remove all the files and their versions from it

S3 MFA-Delete

  • MFA forces users to generate a code on a device (usually a mobile phone) before doing important operations on S3
  • To use MFA-Delete, we have to enable versioning on the S3 bucket
  • If MFA-Delete is active, we would require MFA generated codes in order to:
    • Permanently delete an object version
    • Suspend versioning on the bucket
  • We don't need MFA codes for:
    • Enabling versioning
    • Listing deleted versions
  • Only bucket owner (root account) can enable/disable MFA-Delete!
  • MFA-Delete can currently be enabled using the CLI only!

S3 Default Encryption and Bucket Policies

  • The old way of enabling default encryption was to use a bucket policy and refuse any HTTP command without proper headers
  • The new way is to use the default encryption option in S3
  • Bucket policies are evaluated before the default encryption flag

S3 Access Logs

  • For audit purposes, we may want to log all access to S3 buckets
  • Any request made to S3, from any account, authorized or denied, will be logged into another S3 bucket
  • This data can be analyzed using data analysis tools or Amazon Athena
  • WARNING: We should not set the logging bucket to be the monitored bucket. This will create a logging loop and the bucket will grow in size exponentially!

S3 Replication

  • Replication can be cross-region (CRR) or same-region (SRR)
  • It happens asynchronously
  • In order to enable S3 replication, we must have versioning enabled in the source and destination buckets
  • Buckets can be in different accounts and must have proper IAM permissions
  • CRR - use cases: compliance, lower latency access, replication across accounts
  • SRR - use cases: log aggregation, live replication between production and test accounts
  • After activating replication, only new objects are replicated
  • For delete operations:
    • If we delete without a version ID, S3 adds a delete marker, which is not replicated
    • If we delete an object with a version ID, it deletes the object from the source bucket; the operation is not replicated
    • Delete marker replication is an option which can be enabled
  • There is no "chaining" of replication: if a bucket 1 has replication in bucket 2, which has replication in bucket 3, then objects in bucket 1 are not replicated in bucket 3
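
  • A hedged boto3 sketch of enabling replication (bucket names and IAM role ARN are hypothetical; versioning must already be enabled on both buckets):

      import boto3

      s3 = boto3.client("s3")

      s3.put_bucket_replication(
          Bucket="my-source-bucket",  # hypothetical source bucket
          ReplicationConfiguration={
              "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # hypothetical role
              "Rules": [
                  {
                      "ID": "replicate-everything",
                      "Status": "Enabled",
                      "Priority": 1,
                      "Filter": {"Prefix": ""},  # empty prefix = replicate the whole bucket
                      "DeleteMarkerReplication": {"Status": "Disabled"},  # optional feature
                      "Destination": {"Bucket": "arn:aws:s3:::my-destination-bucket"},
                  }
              ],
          },
      )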

S3 Pre-Signed URLs

  • We can generate pre-signed URLs using SDK or CLI
  • A pre-signed URL is valid for 3600 seconds by default; the timeout can be changed with the --expires-in [SECONDS] argument
  • Users given a pre-signed URL inherit the permissions of the person who generated the URL for GET/PUT
  • Use cases for pre-signed URLs:
    • Allow only logged-in users to download a premium video from a bucket
    • Allow an ever-changing list of users to download files by generating URLs dynamically
    • Allow temporarily a user to upload a file to a precise location in a bucket
  • Command to generate a presigned url: aws s3 presign s3://bucket-name --region us-east-2 --expires-in 300
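
  • The SDK equivalent of the CLI command above, sketched with boto3 (bucket and key are hypothetical):

      import boto3

      s3 = boto3.client("s3")

      # Generate a GET URL valid for 300 seconds; the caller inherits our permissions
      url = s3.generate_presigned_url(
          "get_object",
          Params={"Bucket": "bucket-name", "Key": "premium_video.mp4"},  # hypothetical object
          ExpiresIn=300,
      )
      print(url)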

CloudFront

  • It is a content delivery network (CDN)
  • Improves read performance by caching content at the edge locations
  • Can provide DDoS protection, integration with AWS Shield and AWS WAF
  • Can expose external HTTPS and can talk to internal HTTPS back-ends

CloudFront - Origins

  • S3 buckets:
    • For distributing files and caching them at the edge
    • Enhanced security with CloudFront Origin Access Identity (OAI)
    • CloudFront can be used as an ingress to upload files
  • Custom Origin (HTTP)
    • Application Load Balancer
    • EC2 instance
    • S3 website (must first enable the bucket as a static S3 website)
    • Any HTTP backend

CloudFront Geo Restriction

  • We can restrict who can access our distribution by using:
    • Whitelists: allow our users to access our content only if they are in one of the countries on a list of approved countries
    • Blacklists: prevent our users from accessing our content if they are in one of the countries on a blacklist of banned countries

CloudFront vs S3 CRR

  • CloudFront:
    • Uses global edge network
    • Files are cached for a TTL
    • Great for static content which must be available everywhere
  • S3 Cross Region Replication:
    • Must be set up for each region we want replication to happen
    • Files are updated in near real-time
    • Great for dynamic content that needs to be available at low-latency in a few regions

CloudFront Access Logs

  • CloudFront access logs: contain every request made to CloudFront
  • Access logs are stored in S3

CloudFront Reports

  • It is possible to generate reports on:
    • Cache statistics report
    • Popular object reports
    • Top referrers report
    • Usage report
    • Viewers report
  • These reports are based on the data from access logs

CloudFront Troubleshooting

  • CloudFront caches HTTP 4xx and 5xx status codes returned by S3 or the origin server
  • 4xx error code indicates the user does not have access to the underlying resources (403) or the object requested does not exist (404)
  • 5xx error codes indicate gateway issues

S3 Inventory

  • Amazon S3 inventory helps manage our storage
  • Used to audit and report on the replication and encryption status of our objects
  • Use cases:
    • Business
    • Compliance
    • Regulatory needs
  • We can query all the data using Amazon Athena, Redshift, Presto, Hive, Spark, etc.
  • We can set up multiple inventories
  • Inventory setup: we have to set up a target bucket with a policy; the inventory data will be saved there

S3 Storage Classes

  • Standard - General Purpose
  • Standard - Infrequent Access (IA)
  • One Zone - Infrequent Access
  • Intelligent Tiering
  • Glacier Instant Retrieval
  • Glacier Flexible Retrieval
  • Glacier Deep Archive
  • Reduced Redundancy Storage (deprecated)

S3 Standard - General Purpose

  • Provides high durability of objects across multiple AZs
  • 99.99% availability over a year
  • Can sustain 2 concurrent facility failures
  • Suitable for frequently accessed data; use cases: big data, mobile and gaming applications, content distribution

S3 Standard - Infrequent Access (IA)

  • Provides high durability of objects across multiple AZs
  • 99.9% availability
  • Lower cost compared to S3 Standard General Purpose
  • Can sustain 2 concurrent facility failures
  • Suitable for data that is less frequently accessed, but requires rapid access when needed. Use cases: data store for disaster recovery, backups

S3 One Zone - Infrequent Access (IA)

  • Similar to IA, but data is stored in a single AZ
  • 99.5% availability
  • Provides low latency and high throughput
  • Supports SSL for data in transit and encryption at rest
  • Lower cost compared to IA (by about 20%)
  • Use cases: store secondary backup copies or data that can be recreated

S3 Intelligent Tiering

  • Same low latency and high throughput performance of S3 Standard
  • Requires a small monthly fee for monitoring and auto-tiering
  • Automatically moves objects between two access tiers based on changing access patterns
  • Resilient against events that impact an entire AZ

Amazon Glacier

  • Low cost object storage meant for archiving, backup
  • Data is retained for longer terms (tens of years)
  • It is an alternative to on-premise magnetic tape storage
  • Each item in Glacier is called an Archive. An Archive can be up to 40TB in size
  • Archives are stored in Vaults
  • Amazon Glacier Instant Retrieval provides data retrieval in milliseconds with the same performance as S3 Standard
  • Amazon Glacier Flexible Retrieval data retrieval options:
    • Expedited (1 to 5 minutes)
    • Standard (3 to 5 hours)
    • Bulk (5 to 12 hours)
  • Minimum storage duration for Glacier Instant and Flexible Retrieval is 90 days
  • Amazon Glacier Deep Archive data retrieval options:
    • Standard (up to 12 hours)
    • Bulk (up to 48 hours)
  • Minimum storage duration for Deep Archive is 180 days

S3 Moving Between Storage Classes

  • We can transition objects between storage classes
  • For infrequently accessed objects, we can move them to Standard IA
  • For archiving objects, we should move them to Glacier or Deep Archive
  • Moving objects can be automated using lifecycle configurations

S3 Lifecycle Rules

  • Transition actions: define when objects are transitioned to another storage class. Examples:
    • Move objects to Standard IA 60 days after creation
    • Move objects to Glacier for archiving after 6 months
  • Expiration actions: automatically expire (delete) objects after some time. Examples:
    • Access logs can be set to be deleted after 1 year
    • Can be used to delete old versions of files if versioning is enabled
    • Clean up incomplete multi-part upload parts
  • Rules can be created for a certain prefix
  • Rules can be created for certain tags as well (see the sketch after this list)
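
  • A boto3 sketch of a lifecycle configuration combining both action types (bucket name, prefix and day counts are illustrative):

      import boto3

      s3 = boto3.client("s3")

      # Move "logs/" objects to Standard IA after 60 days, to Glacier after 180 days,
      # delete them after 365 days, and clean up incomplete multi-part uploads
      s3.put_bucket_lifecycle_configuration(
          Bucket="my-bucket",  # hypothetical bucket name
          LifecycleConfiguration={
              "Rules": [
                  {
                      "ID": "archive-then-expire-logs",
                      "Status": "Enabled",
                      "Filter": {"Prefix": "logs/"},
                      "Transitions": [
                          {"Days": 60, "StorageClass": "STANDARD_IA"},
                          {"Days": 180, "StorageClass": "GLACIER"},
                      ],
                      "Expiration": {"Days": 365},
                      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
                  }
              ]
          },
      )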

S3 Baseline Performance

  • S3 automatically scales to a high request rate, with a latency of 100-200 ms
  • Our application can achieve at least 3500 PUT/COPY/POST/DELETE and 5500 GET/HEAD requests per second per prefix in a bucket
  • There are no limits to the number of prefixes in a bucket

S3 - KMS Limitation

  • If we use SSE-KMS, we may be impacted by the KMS limits
  • When we upload a file to a bucket with KMS encryption enabled, S3 calls the GenerateDataKey KMS API
  • When we download an encrypted file, it will call the Decrypt KMS API
  • These calls count towards the KMS quota per second (which can be 5500, 10000, 30000 req/s based on the region)
  • Quota increase for KMS can not be requested

S3 Performance Optimization

S3 Uploads

  • Multi-Part upload:
    • Recommended for files > 100MB, required for files > 5GB
    • Can help parallelize uploads speeding up transfers
  • S3 Transfer Acceleration:
    • Increase the transfer speed by transferring the file to an AWS edge location, which will forward the data to the S3 bucket in the target region
    • Compatible with multi-part upload
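
  • A boto3 sketch of a multi-part upload (file and bucket names are hypothetical); upload_file switches to multi-part automatically above the configured threshold:

      import boto3
      from boto3.s3.transfer import TransferConfig

      s3 = boto3.client("s3")

      config = TransferConfig(
          multipart_threshold=100 * 1024 * 1024,  # use multi-part for files > 100 MB
          multipart_chunksize=64 * 1024 * 1024,   # 64 MB parts
          max_concurrency=10,                     # upload parts in parallel
      )

      # For Transfer Acceleration, the bucket must have acceleration enabled and the client
      # created with botocore Config(s3={"use_accelerate_endpoint": True})
      s3.upload_file(
          "backup.tar.gz",           # hypothetical local file
          "my-bucket",               # hypothetical bucket name
          "backups/backup.tar.gz",
          Config=config,
      )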

S3 Byte-Range Fetches

  • Parallelize GETs by requesting specific byte ranges
  • Better resilience in case of failures
  • Can be used to speed up downloads
  • Can be used to retrieve only partial data (for example the head of a file)
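
  • A boto3 sketch of a byte-range fetch (bucket and key are hypothetical) that downloads only the first kilobyte of a large object:

      import boto3

      s3 = boto3.client("s3")

      response = s3.get_object(
          Bucket="my-bucket",    # hypothetical bucket name
          Key="big_file.csv",    # hypothetical large object
          Range="bytes=0-1023",  # only the first 1024 bytes
      )
      head = response["Body"].read()
      print(len(head), "bytes read")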

S3 Select and Glacier Select

  • Retrieve less data using SQL by performing server side filtering
  • Can filter by rows and columns (simple SQL statements)
  • Using selects, we get less network usage and less CPU cost on the client side
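
  • A boto3 sketch of S3 Select against a hypothetical CSV object; only the matching rows cross the network:

      import boto3

      s3 = boto3.client("s3")

      response = s3.select_object_content(
          Bucket="my-bucket",  # hypothetical bucket name
          Key="sales.csv",     # hypothetical CSV object with a header row
          ExpressionType="SQL",
          Expression="SELECT s.* FROM S3Object s WHERE CAST(s.amount AS FLOAT) > 100",
          InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
          OutputSerialization={"CSV": {}},
      )

      # The response payload is an event stream; Records events carry the filtered rows
      for event in response["Payload"]:
          if "Records" in event:
              print(event["Records"]["Payload"].decode())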

S3 Event Notifications

  • React to changes, events happening in a bucket. Example of events:
    • S3:ObjectCreated
    • S3:ObjectRemoved
    • S3:ObjectRestore
    • S3:Replication
  • Use cases: generate thumbnails for images uploaded to S3
  • Event consumers can be:
    • SNS
    • SQS
    • Lambda Functions
  • S3 event notifications typically deliver events in seconds, but sometimes can take over a minute
  • If two writes are made to a single non-versioned object at the same time, it is possible that only a single event notification will be sent. To solve this issue, we have to enable versioning on the bucket
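
  • A boto3 sketch of the thumbnail use case (bucket name and Lambda ARN are hypothetical; the Lambda function must separately grant S3 permission to invoke it):

      import boto3

      s3 = boto3.client("s3")

      # Trigger the Lambda function whenever a .jpg object is created in the bucket
      s3.put_bucket_notification_configuration(
          Bucket="my-bucket",  # hypothetical bucket name
          NotificationConfiguration={
              "LambdaFunctionConfigurations": [
                  {
                      "Id": "generate-thumbnails",
                      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:make-thumbnail",
                      "Events": ["s3:ObjectCreated:*"],
                      "Filter": {
                          "Key": {"FilterRules": [{"Name": "suffix", "Value": ".jpg"}]}
                      },
                  }
              ]
          },
      )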

S3 Analytics - Storage Class Analysis

  • We can set up analytics to help determine when to transition objects from standard to standard IA
  • Does not work for one zone IA or Glacier
  • Report is updated on daily basis
  • Takes about 24h to 48h to first start
  • Helps put together lifecycle rules

Glacier Overview

  • Low cost storage for archiving and backups
  • Data is retained for longer terms
  • Alternative to on-premise tape storage
  • Cost per storage per month ($0.004/GB + retrieval cost)
  • Each item in Glacier is called an Archive, which can be up to 40TB
  • Archives are stored in Vaults

Glacier Operations

  • Provides 3 types of operations:
    • Upload: single operation or by parts (multi-part upload) for larger archives
    • Download: first initiates a retrieval job for the particular archive, then prepares the data for download. Users have a limited time to download the data from the staging server
    • Delete: single operation, can be achieved by using the Glacier REST API and specifying the archive ID
  • Restore links have an expiry date
  • Glacier provides 3 retrieval options:
    • Expedited (1 to 5 minutes retrieval time): $0.03 per GB and $0.01 per request
    • Standard (3 to 5 hours retrieval time): $0.01 per GB and $0.05 per 1000 requests
    • Bulk (5 to 12 hours retrieval time): $0.0025 per GB and $0.025 per 1000 requests

Glacier Vault Policies and Vault Locks

  • Vault is a collection of archives
  • Each vault has:
    • One vault access policy
    • One vault lock policy
  • Vault policies are written in JSON
  • Vault access policies are similar to bucket policies
  • Vault lock policies are used for regulatory and compliance requirements:
    • The policy is immutable, it can never be changed
    • Example: forbid deleting an archive if it is less than 1 year old
    • Example: WORM policy (write once read many times)

Snowball

  • Physical data transport solution for moving TBs or PBs of data in and out of AWS
  • Alternative to move data over the network
  • It is secure, tamper resistant, uses KMS 256 bit encryption
  • Tracking is done using SNS and text messages. It has an E-ink shipping label
  • Pay per data transfer job
  • Use cases: large data cloud migration, DC decommission, disaster recovery
  • If it would take more than a week to transfer the data over the network, it is probably a good idea to use Snowball

Snowball Transferring Process

  1. Request a Snowball from the AWS console for delivery
  2. Install the device on the server
  3. Copy the files to the device using the Snowball client
  4. Ship back the device
  5. Data will be loaded into an S3 bucket
  6. Snowball will be wiped completely

Snowball Edge

  • Snowball Edge adds computational capability to the device
  • Provides up to 100TB of storage with either:
    • 24 vCPU - storage optimized (100TB storage)
    • 52 vCPU and optional GPU - compute optimized (42TB storage HDD, 7.68TB SSD)
  • Supports a custom EC2 AMI for data processing on the go
  • Supports custom Lambda Functions
  • Very useful for data preprocessing
  • Use cases: data migration, image collation, IoT capture, machine learning

Snowmobile

  • It is a truck
  • It can transfer exabytes of data
  • Each Snowmobile has 100PB of capacity
  • Better option than Snowball in case we want to transfer more than 10PB of data

Snowball into Glacier

  • Snowball cannot import data into Glacier directly
  • We have to use S3 first and create an S3 lifecycle policy to transfer data to Glacier

Hybrid Cloud for Storage - Storage Gateway

  • It is a bridge between on-premise data and cloud data in S3
  • Provides 3 types of gateways:
    • File Gateway
    • Volume Gateway
    • Tape Gateway
  • File Gateway:
    • Configured S3 buckets are accessible using NFS and SMB protocols
    • Supports S3 standard, IA, One Zone IA
    • Buckets will be accessed by the File Gateway, the File Gateway will have its own IAM role
    • Most recently used data will be cached in the file gateway
    • File gateway can be mounted in many servers
    • RefreshCache API:
      • The Storage Gateway updates the File Share Cache automatically when we write files to the File Gateway
      • When we upload files directly to S3 buckets, users connected to the File Gateway may not see the files on the File Share
      • We need to invoke the RefreshCache API (see the sketch after this list)
    • Automating Cache Refresh:
      • A File Gateway feature that enables us to automatically refresh the File Gateway cache to stay up to date with the changes in the S3 bucket
      • Ensures that users are not accessing stale data on their file shares
      • No need to manually and periodically invoke the RefreshCache API
  • Volume Gateway:
    • Provides block storage using iSCSI protocol backed by S3
    • Backed by EBS snapshots which can help restore on-premise volumes
    • The gateway supports the following volume configurations:
      • Cached volumes:
        • We store our data in S3 and retain a copy of frequently accessed data subsets locally
      • Stored volumes:
        • If we need low-latency access to our entire dataset, first we configure our on-premises gateway to store all our data locally
        • Then we asynchronously back up point-in-time snapshots of this data to Amazon S3 - This configuration provides durable and inexpensive offsite backups that we can recover to our local data center or Amazon EC2
  • Tape Gateway:
    • Provides backup processes similar to the ones using physical tapes
    • It is a Virtual Tape Library (VTL) backed by S3 and Glacier
    • Back up data using existing tape-based processes (and iSCSI interface)
    • Works with leading backup software vendors
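
  • A boto3 sketch of the RefreshCache call referenced above (the file share ARN is hypothetical):

      import boto3

      storagegateway = boto3.client("storagegateway")

      # Refresh the File Gateway cache after objects were written directly to the backing S3 bucket
      storagegateway.refresh_cache(
          FileShareARN="arn:aws:storagegateway:us-east-1:123456789012:share/share-12345678",
          FolderList=["/"],  # folders to refresh
          Recursive=True,
      )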

Athena

  • Serverless service used to perform analytics directly against S3 files
  • Uses SQL language to query the files
  • Provides JDBC/ODBC driver
  • We are charged per query and amount of data scanned
  • Supports CSV, JSON, ORC, Avro and Parquet. In the backend it uses Presto
  • Use cases: business intelligence, analytics, reporting, analyze and query VPC Flow Logs, ELB Logs and CloudTrail logs, etc.
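
  • A boto3 sketch of running a query (database, table and result bucket are hypothetical; the table is assumed to be already defined over S3 data):

      import boto3

      athena = boto3.client("athena")

      response = athena.start_query_execution(
          QueryString="SELECT status, COUNT(*) FROM elb_logs GROUP BY status",
          QueryExecutionContext={"Database": "my_logs_db"},                   # hypothetical database
          ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical results bucket
      )
      print(response["QueryExecutionId"])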