- Amazon S3 allows us to store objects (files) in "buckets" (directories)
- Buckets must have a globally unique name
- Buckets are defined at the region level
- Buckets do have a naming convention:
- No uppercase
- No underscore
- 3-63 character long names
- Name should not be an IP
- Must start with a lower case letter or number
- Objects (files) have a key, which is the full path:
- s3://my-bucket/my_file.txt
- s3://my-bucket/my_folder/another_folder/my_file.txt
- The key is composed of the prefix + the object name (see the upload sketch at the end of this overview):
- s3://my-bucket/ my_folder/another_folder/ my_file.txt
- There is no concept of directories within buckets, although the UI will trick us to think otherwise
- The keys can be very long names which contain slashes ("/")
- Object values are the content of the body
- Max object size is 5TB
- If uploading more than 5GB, we must use multi-part upload
- Each object in S3 can have metadata, tags and (optional) version ID
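- As a minimal sketch (bucket name, region and file are hypothetical), creating a bucket and uploading an object whose key is the prefix plus the object name could look like this with the AWS CLI:

```bash
# Create a bucket (the name must be globally unique) in a chosen region
aws s3 mb s3://my-example-bucket --region eu-west-1

# Upload a file; the resulting key is "my_folder/another_folder/my_file.txt"
# (prefix "my_folder/another_folder/" + object name "my_file.txt")
aws s3 cp my_file.txt s3://my-example-bucket/my_folder/another_folder/my_file.txt
```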
- Versioning should be enabled at the bucket level
- Versioning means: objects uploaded with same key will increment the version
- It is best practice to version files in S3, because:
- It will protect against unintended deletes (ability to restore version)
- Provides the ability to rollback a file
- Notes:
- Any file that is not versioned prior to enabling versioning, will have the version of null
- Suspending versioning does not delete the previous versions of existing files
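- A minimal sketch (bucket name is hypothetical) of enabling versioning at the bucket level with the AWS CLI:

```bash
# Turn on versioning for the bucket
aws s3api put-bucket-versioning \
  --bucket my-example-bucket \
  --versioning-configuration Status=Enabled

# Check the current versioning state
aws s3api get-bucket-versioning --bucket my-example-bucket
```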
- There are 4 methods of encrypting objects in S3:
- SSE-S3: encrypts S3 objects using keys handled and managed by AWS
- SSE-KMS: leverages AWS KMS to manage encryption keys
- SSE-C: keys are managed by the users
- Client Side Encryption
- SSE-S3: encryption using keys handled and managed by AWS
- Objects are encrypted server side
- Uses AES-256 encryption
- In order to upload and set the SSE-S3 encryption, the upload request must have the header `"x-amz-server-side-encryption": "AES256"` (see the CLI sketch at the end of the encryption notes)
- SSE-KMS: encryption using keys handled and managed by KMS
- KMS advantages: user control + audit trail
- Objects are encrypted in the server side
- In order to upload and set the SSE-KMS encryption, the upload request must have the header `"x-amz-server-side-encryption": "aws:kms"`
- SSE-C: server-side encryption using keys provided by the customers
- S3 does not store the encryption key the customers provide
- To upload the data, HTTPS must be used
- Encryption key must be provided in HTTP headers, for every HTTP request
- Client Side Encryption: files are encrypted by the customers before uploading them to S3
- Client-side libraries can be used to accomplish this, for example the Amazon S3 Encryption Client
- Customers are responsible for encrypting the data as well
- Encryption keys are fully managed by the customers
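- A hedged sketch (bucket, file, KMS key alias and customer key file are hypothetical) of uploading objects with the three server-side encryption options; the AWS CLI sets the corresponding headers for us:

```bash
# SSE-S3: keys handled and managed by AWS ("x-amz-server-side-encryption": "AES256")
aws s3 cp my_file.txt s3://my-example-bucket/my_file.txt --sse AES256

# SSE-KMS: keys managed through AWS KMS ("x-amz-server-side-encryption": "aws:kms")
aws s3 cp my_file.txt s3://my-example-bucket/my_file.txt \
  --sse aws:kms --sse-kms-key-id alias/my-example-key

# SSE-C: customer-provided key, must be sent over HTTPS with every request
aws s3 cp my_file.txt s3://my-example-bucket/my_file.txt \
  --sse-c AES256 --sse-c-key fileb://my-encryption-key.bin
```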
- User based security:
- IAM policies - which API calls should be allowed for a specific user from IAM console
- Resource based:
- Bucket policies - bucket wide rules from the S3 console, allows cross account
- Object Access Control List
- Bucket Access Control List
- An IAM principal can access an S3 object if:
- the user IAM permissions allow it or the resource policy allows it
- there is no explicit DENY
- Bucket policies are JSON based documents, which can contain:
- Resources: buckets and objects
- Actions: set of API to Allow or Deny
- Effects: Allow or Deny
- Principal: the account or user to apply the policy to
- Use S3 bucket policies to (see the policy sketch after this list):
- Grant public access to the bucket
- Force objects to be encrypted at upload
- Grant access to another account
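- A hedged sketch (bucket name is hypothetical) of a bucket policy that forces objects to be encrypted at upload, applied with the AWS CLI:

```bash
# Deny any PutObject request that does not ask for SSE-S3 encryption
cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-example-bucket/*",
      "Condition": {
        "StringNotEquals": { "s3:x-amz-server-side-encryption": "AES256" }
      }
    }
  ]
}
EOF

aws s3api put-bucket-policy --bucket my-example-bucket --policy file://policy.json
```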
- Block public access to buckets and objects granted through
- new access control lists (ACLs)
- any access control lists (ACLs)
- new public bucket or access point policies
- Block public and cross-account access to buckets and objects through any public or access point policies
- These settings were created to prevent company data leaks
- ACL (Access Control List): define read and write permissions at the object level
- Networking: supports VPC Endpoints
- Logging and audit:
- S3 Access Logs can be stored in other S3 buckets
- API calls can be logged in AWS CloudTrail
- User Security:
- MFA Delete
- Pre-Signed URLs: URLS that are valid only for a limited time
- S3 can host static websites and have them accessible on the web
- The website url will be something like this:
- s3-website dash (-) Region: http://bucket-name.s3-website-Region.amazonaws.com
- s3-website dot (.) Region: http://bucket-name.s3-website.Region.amazonaws.com
- If we forget to allow public reads in bucket policies, we will get a 403 - Forbidden when accessing the website
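- A minimal sketch (bucket name is hypothetical) of enabling static website hosting with the AWS CLI; public reads must still be allowed through the bucket policy:

```bash
# Turn an existing bucket into a static website
aws s3 website s3://my-example-bucket/ \
  --index-document index.html \
  --error-document error.html
```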
- An origin is a scheme (protocol), host (domain) and a port
- CORS: Cross-Origin Resource Sharing
- Web browser based mechanism to allow requests to other origins while visiting the main origin
- The request won't be fulfilled unless the other origin allows for the request, using CORS headers (example: `Access-Control-Allow-Origin`, `Access-Control-Allow-Methods`)
- If a client does a cross-origin request on our S3 bucket, we need to enable the correct CORS headers
- We can allow for a specific origin or for `*` (all origins) (see the configuration sketch below)
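- A hedged sketch (bucket name and origin are hypothetical) of setting a CORS configuration on a bucket with the AWS CLI:

```bash
# Allow GET requests coming only from one specific origin
aws s3api put-bucket-cors --bucket my-example-bucket --cors-configuration '{
  "CORSRules": [
    {
      "AllowedOrigins": ["https://www.example.com"],
      "AllowedMethods": ["GET"],
      "AllowedHeaders": ["*"],
      "MaxAgeSeconds": 3000
    }
  ]
}'
```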
- S3 provides read after write consistency for PUTS of new objects
- As soon as a new object is written, we can retrieve it (PUT 200 => GET 200)
- This is true, except if we did a GET before to see if the object existed (GET 404 => PUT 200 => GET 404)
- S3 provides eventual consistency for DELETES and PUTS of existing objects
- If we read an object after updating it, we might get the older version (PUT 200 => PUT 200 => GET 200, which might return the older version)
- If we delete an object, we might still be able to retrieve it for a short amount of time (DELETE 200 => GET 200)
- There is no way to request strong consistency for S3!
- S3 Versioning creates a new version each time a file is changed
- That includes when we encrypt a file
- It is a nice way to get protected against unintended file encryption (example: ransomware)
- Deleting a file in an S3 bucket just adds a delete marker on the versioning
- To delete a bucket, we need to remove all the files and their versions from it
- MFA forces users to generate a code on a device (usually a mobile phone) before doing important operations on S3
- To use MFA-Delete, we have to enable versioning on the S3 bucket
- If MFA-Delete is active, we would require MFA generated codes in order to:
- Permanently delete an object version
- Suspend versioning on the bucket
- We don't need MFA codes for:
- Enabling versioning
- Listing deleted versions
- Only bucket owner (root account) can enable/disable MFA-Delete!
- MFA-Delete currently can be enabled using the CLI only (see the sketch below)!
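- A hedged sketch of enabling MFA-Delete with the CLI using root credentials (bucket name, MFA device ARN and code are hypothetical):

```bash
# Versioning must be enabled; the MFA serial number and the current code are passed together
aws s3api put-bucket-versioning \
  --bucket my-example-bucket \
  --versioning-configuration Status=Enabled,MFADelete=Enabled \
  --mfa "arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456"
```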
- The old way of enabling default encryption was to use a bucket policy and refuse any HTTP command without proper headers
- The new way is to use the default encryption option in S3 (see the sketch below)
- Bucket policies are evaluated before the default encryption flag
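- A minimal sketch (bucket name is hypothetical) of turning on default encryption with the AWS CLI:

```bash
# Apply SSE-S3 (AES256) by default to every new object in the bucket
aws s3api put-bucket-encryption \
  --bucket my-example-bucket \
  --server-side-encryption-configuration '{
    "Rules": [
      { "ApplyServerSideEncryptionByDefault": { "SSEAlgorithm": "AES256" } }
    ]
  }'
```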
- For audit purposes, we may want to log all access to S3 buckets
- Any request made to S3, from any account, authorized or denied, will be logged into another S3 bucket
- This data can be analyzed using data analysis tools or Amazon Athena
- WARNING: We should not set the logging bucket to be the monitored bucket. This will create a logging loop and the bucket will grow in size exponentially!
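- A hedged sketch (bucket names are hypothetical) of enabling access logs into a separate logging bucket with the AWS CLI; the target bucket must allow the S3 log delivery service to write to it:

```bash
# Send access logs for my-example-bucket to a different bucket under the "logs/" prefix
aws s3api put-bucket-logging \
  --bucket my-example-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "my-example-logs-bucket",
      "TargetPrefix": "logs/"
    }
  }'
```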
- Replication can be cross-region (CRR) or same-region (SRR)
- It happens asynchronously
- In order to enable S3 replication, we must have versioning enabled in the source and destination buckets
- Buckets can be in different accounts and must have proper IAM permissions
- CRR - use cases: compliance, lower latency access, replication across accounts
- SRR - use cases: log aggregation, live replication between production and test accounts
- After activating replication, only new objects are replicated
- For delete operations:
- If we delete without a version ID, S3 adds a delete marker, which is not replicated
- If we delete an object specifying a version ID, the object is deleted from the source bucket, but the operation is not replicated
- Delete marker replication is an option which can be enabled
- There is no "chaining" of replication: if a bucket 1 has replication in bucket 2, which has replication in bucket 3, then objects in bucket 1 are not replicated in bucket 3
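- A hedged sketch (bucket names, account ID and IAM role are hypothetical) of enabling replication between two versioned buckets with the AWS CLI:

```bash
# Both buckets must have versioning enabled and the role must grant S3 replication permissions
aws s3api put-bucket-replication \
  --bucket my-source-bucket \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/my-replication-role",
    "Rules": [
      {
        "ID": "replicate-everything",
        "Status": "Enabled",
        "Priority": 1,
        "Filter": {},
        "DeleteMarkerReplication": { "Status": "Disabled" },
        "Destination": { "Bucket": "arn:aws:s3:::my-destination-bucket" }
      }
    ]
  }'
```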
- We can generate pre-signed URLs using SDK or CLI
- A pre-signed URL is valid for a default of 3600 seconds; the timeout can be changed with the `--expires-in [SECONDS]` argument
- Users given a pre-signed URL inherit the permissions of the person who generated the URL for GET/PUT
- Use cases for pre-signed URLs:
- Allow only logged-in users to download a premium video from a bucket
- Allow an ever-changing list of users to download files by generating URLs dynamically
- Allow temporarily a user to upload a file to a precise location in a bucket
- Command to generate a pre-signed URL: `aws s3 presign s3://bucket-name --region us-east-2 --expires-in 300`
- CloudFront is a content delivery network (CDN)
- Improves read performance by caching content at the edge locations
- Can provide DDoS protection, integration with AWS Shield and AWS WAF
- Can expose external HTTPS and can talk to internal HTTPS back-ends
- S3 buckets:
- For distributing files and caching them at the edge
- Enhanced security with CloudFront Origin Access Identity (OAI)
- CloudFront can be used as an ingress to upload files
- Custom Origin (HTTP)
- Application Load Balancer
- EC2 instance
- S3 website (must first enable the bucket as a static S3 website)
- Any HTTP backend
- We can restrict who can access our distribution by using:
- Whitelist: allow our users to access our content only if they are in one of the countries on a list of approved countries
- Blacklist: prevent our users from accessing our content if they are in one of the countries on a blacklist of banned countries
- CloudFront:
- Uses global edge network
- Files are cached for a TTL
- Great for static content which must be available everywhere
- S3 Cross Region Replication:
- Must be set up for each region we want replication to happen
- Files are updated in near real-time
- Great for dynamic content that needs to be available at low-latency in a few regions
- CloudFront access logs: contain every request made to CloudFront
- Access logs are stored in S3
- It is possible to generate reports on:
- Cache statistics report
- Popular object reports
- Top referrers report
- Usage report
- Viewers report
- These reports are based on the data from access logs
- CloudFront caches HTTP 4xx and 5xx status codes returned by S3 or the origin server
- 4xx error code indicates the user does not have access to the underlying resources (403) or the object requested does not exist (404)
- 5xx error codes indicate gateway issues
- Amazon S3 inventory helps manage our storage
- Used to audit and report on the replication and encryption status of our objects
- Use cases:
- Business
- Compliance
- Regulatory needs
- We can query all the data using Amazon Athena, Redshift, Presto, Hive, Spark, etc.
- We can set up multiple inventories
- Inventory setup: we have to set up a target bucket with a policy; the inventory data will be saved there
- Standard - General Purpose
- Standard - Infrequent Access (IA)
- One Zone - Infrequent Access
- Intelligent Tiering
- Glacier Instant Retrieval
- Glacier Flexible Retrieval
- Glacier Deep Archive
- Reduced Redundancy Storage (deprecated)
- Provides high durability of objects across multiple AZs
- 99.99% availability over a year
- Can sustain 2 concurrent facility failures
- Suitable for frequently accessed data; use cases: big data, mobile and gaming applications, content distribution
- Provides high durability of objects across multiple AZs
- 99.9% availability
- Lower cost compared to S3 Standard General Purpose
- Can sustain 2 concurrent facility failures
- Suitable for data that is less frequently accessed, but requires rapid access when needed. Use cases: data store for disaster recovery, backups
- Similar to IA, but data is stored in a single AZ
- 99.5% availability
- Provides low latency and high throughput
- Supports SSL for data in transit and encryption at rest
- Lower cost compared to IA (by about 20%)
- Use cases: store secondary backup copies or data that can be recreated
- Same low latency and high throughput performance of S3 Standard
- Requires a small monthly fee for monitoring and auto-tiering
- Automatically moves objects between two access tiers based on changing access patterns
- Resilient against events that impact an entire AZ
- Low cost object storage meant for archiving, backup
- Data is retained for longer terms (tens of years)
- It is an alternative to on-premise magnetic tape storage
- Each item in Glacier is called an Archive. An Archive can be up to 40TB in size
- Archives are stored in Vaults
- Amazon Glacier Instant Retrieval provides data retrieval in milliseconds with the same performance as S3 Standard
- Amazon Glacier Flexible Retrieval data retrieval options:
- Expedited (1 to 5 minutes)
- Standard (3 to 5 hours)
- Bulk (5 to 12 hours)
- Minimum storage duration for Glacier Instant and Flexible Retrieval is 90 days
- Amazon Glacier Deep Archive data retrieval options:
- Standard (up to 12 hours)
- Bulk (up to 48 hours)
- Minimum storage duration for Deep Archive is 180 days
- We can transition objects between storage classes
- For infrequently accessed objects, we can move them to Standard IA
- For archiving objects, we should move them to Glacier or Deep Archive
- Moving objects can be automated using lifecycle configurations
- Transition actions: define when objects are transitioned to another storage class. Examples:
- Move objects to Standard IA 60 days after creation
- Move objects to Glacier for archiving after 6 months
- Expiration actions: automatically expire (delete) objects after some time. Examples:
- Access logs can be set to be deleted after 1 year
- Can be used to delete old versions of files if versioning is enabled
- Clean up incomplete multi-part upload parts (see the sketch after this list)
- Rules can be created for a certain prefix
- Rules can be created for certain tags as well
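- A hedged sketch (bucket name and prefix are hypothetical) of a lifecycle configuration combining transition, expiration and multi-part cleanup actions, applied with the AWS CLI:

```bash
# Transition to Standard IA after 60 days, to Glacier after 180 days, delete after 1 year,
# and clean up incomplete multi-part uploads after 7 days, for objects under "logs/"
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-example-bucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "archive-then-expire-logs",
        "Status": "Enabled",
        "Filter": { "Prefix": "logs/" },
        "Transitions": [
          { "Days": 60, "StorageClass": "STANDARD_IA" },
          { "Days": 180, "StorageClass": "GLACIER" }
        ],
        "Expiration": { "Days": 365 },
        "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
      }
    ]
  }'
```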
- S3 automatically scales to high request rates, having a latency of 100-200 ms
- Our application can achieve at least 3500 PUT/COPY/POST/DELETE and 5500 GET/HEAD requests per second per prefix in a bucket
- There are no limits to the number of prefixes in a bucket
- If we use SSE-KMS, we may be impacted by the KMS limits
- When we upload a file to a bucket with KMS encryption enabled, S3 calls the `GenerateDataKey` KMS API
- When we download an encrypted file, it will call the `Decrypt` KMS API
- These calls count towards the KMS quota per second (which can be 5500, 10000 or 30000 req/s, based on the region)
- Quota increases for KMS cannot be requested
- Multi-Part upload:
- Recommended for files > 100MB, required for files > 5GB
- Can help parallelize uploads speeding up transfers
- S3 Transfer Acceleration:
- Increase the transfer speed by transferring files to an AWS edge location, which will forward the data to the S3 bucket in the target region (see the sketch below)
- Compatible with multi-part upload
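- A hedged sketch (bucket name is hypothetical) of tuning the CLI multi-part threshold and enabling Transfer Acceleration:

```bash
# Make the AWS CLI switch to multi-part upload for files larger than 100MB
aws configure set default.s3.multipart_threshold 100MB

# Enable Transfer Acceleration on the bucket (uploads go through edge locations)
aws s3api put-bucket-accelerate-configuration \
  --bucket my-example-bucket \
  --accelerate-configuration Status=Enabled
```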
- Parallelize GETs by requesting specific byte ranges
- Better resilience in case of failures
- Can be used to speed up downloads
- Can be used to retrieve only partial data (for example, the head of a file), as in the sketch below
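- A minimal sketch (bucket and key are hypothetical) of a byte-range fetch retrieving only the first kilobyte of an object:

```bash
# Download only bytes 0-1023 of the object into head.bin
aws s3api get-object \
  --bucket my-example-bucket \
  --key my_folder/big_file.bin \
  --range bytes=0-1023 \
  head.bin
```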
- Retrieve less data using SQL by performing server side filtering
- Can filter by rows and columns (simple SQL statements)
- Using selects we can have less network usage, less CPU cost on the client-side
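- A hedged sketch (bucket, key and column names are hypothetical) of S3 Select filtering a CSV file server-side:

```bash
# Return only the rows and columns we need instead of downloading the whole CSV
aws s3api select-object-content \
  --bucket my-example-bucket \
  --key data/sales.csv \
  --expression "SELECT s.product, s.price FROM S3Object s WHERE s.country = 'RO'" \
  --expression-type SQL \
  --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}}' \
  --output-serialization '{"CSV": {}}' \
  filtered.csv
```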
- React to changes/events happening in a bucket. Examples of events: `S3:ObjectCreated`, `S3:ObjectRemoved`, `S3:ObjectRestore`, `S3:Replication`
- Use cases: generate thumbnails for images uploaded to S3
- Event consumers can be:
- SNS
- SQS
- Lambda Functions
- S3 event notifications typically deliver events in seconds, but can sometimes take over a minute
- If two writes are made to a single non-versioned object at the same time, it is possible that only a single event notification will be sent. To solve this issue, we have to enable versioning on the bucket
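- A hedged sketch (bucket name and Lambda ARN are hypothetical) of wiring object-created events to a Lambda function; the function must already allow S3 to invoke it:

```bash
# Invoke the Lambda function for every newly created object in the bucket
aws s3api put-bucket-notification-configuration \
  --bucket my-example-bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [
      {
        "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:generate-thumbnail",
        "Events": ["s3:ObjectCreated:*"]
      }
    ]
  }'
```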
- We can set up analytics to help determine when to transition objects from standard to standard IA
- Does not work for One Zone IA or Glacier
- Report is updated on daily basis
- Takes about 24h to 48h to first start
- Helps put together lifecycle rules
- Low cost storage for archiving and backups
- Data is retained for longer terms
- Alternative to on-premise tape storage
- Low storage cost per month ($0.004/GB) plus retrieval cost
- Each item in Glacier is called an Archive, which can be up to 40TB
- Archives are stored in Vaults
- Provides 3 types of operations:
- Upload: single operation or by parts (multi-part upload) for larger archives
- Download: first initiates a retrieval job for the particular archive, then prepares the data for download. Users have a limited time to download the data from the staging server
- Delete: single operation, can be achieved by using the Glacier REST API and specifying the archive ID
- Restore links have an expiry date
- Glacier provides 3 retrieval options:
- Expedited (1 to 5 minutes retrieval time): $0.03 per GB, $0.01 per request
- Standard (3 to 5 hours retrieval time): $0.01 per GB, $0.05 per 1000 requests
- Bulk (5 to 12 hours retrieval time): $0.0025 per GB, $0.025 per 1000 requests
- Vault is a collection of archives
- Each vault has:
- ONE vault access policy
- One vault lock policy
- Vault policies are written in JSON
- Vault access policies are similar to bucket policies
- Vault lock policies are used for regulatory and compliance requirements:
- The policy is immutable, it can never be changed
- Example: forbid deleting an archive if it is less than 1 year old
- Example: WORM policy (write once read many times)
- Physical data transport solution for moving TBs or PBs of data in and out of AWS
- Alternative to move data over the network
- It is secure, tamper resistant, uses KMS 256 bit encryption
- Tracking is done using SNS and text messages. It has an E-ink shipping label
- Pay per data transfer job
- Use cases: large data cloud migration, DC decommission, disaster recovery
- If it takes more than a week to transfer over the network, it is probably a good idea to use Snowball
- Request a Snowball device from the AWS console for delivery
- Connect the device to our servers and install the Snowball client
- Copy the files to the device using the Snowball client
- Ship back the device
- Data will be loaded into an S3 bucket
- Snowball will be wiped completely
- Snowball Edge adds computational capability to the device
- Provides up to 100TB of storage with either:
- 24 vCPU - storage optimized (100TB storage)
- 52 vCPU and optional GPU - compute optimized (42TB storage HDD, 7.68TB SSD)
- Supports a custom EC2 AMI for data processing on the go
- Supports custom Lambda Functions
- Very useful for data preprocessing
- Use cases: data migration, image collation, IoT capture, machine learning
- It is a truck
- It can transfer exabytes of data
- Each Snowmobile has 100PB of capacity
- Better option than Snowball if we want to transfer more than 10PB of data
- Snowball cannot import data into Glacier directly
- We have to use S3 first and create an S3 lifecycle policy to transfer data to Glacier
- It is a bridge between on-premise data and cloud data in S3
- Provides 3 types of gateways:
- File Gateway
- Volume Gateway
- Tape Gateway
- File Gateway:
- Configured S3 buckets are accessible using NFS and SMB protocols
- Supports S3 standard, IA, One Zone IA
- Buckets will be accessed by the File Gateway, which will have its own IAM role
- Most recently used data will be cached in the file gateway
- File gateway can be mounted in many servers
- RefreshCache API:
- The Storage Gateway updates the File Share Cache automatically when we write files to the File Gateway
- When we upload files directly to S3 buckets, users connected to the File Gateway may not see the files on the File Share
- We need to invoke the RefreshCache API
- Automating Cache Refresh:
- A File Gateway feature that enables us to automatically refresh the File Gateway cache to stay up to date with the changes in the S3 bucket
- Ensures that users are not accessing stale data on their file shares
- No need to manually and periodically invoke the RefreshCache API
- Volume Gateway:
- Provides block storage using iSCSI protocol backed by S3
- Backed by EBS snapshots which can help restore on-premise volumes
- The gateway supports the following volume configurations:
- Cached volumes:
- We store our data in S3 and retain a copy of frequently accessed data subsets locally
- Stored volumes:
- If we need low-latency access to our entire dataset, first we configure our on-premises gateway to store all our data locally
- Then we asynchronously back up point-in-time snapshots of this data to Amazon S3
- This configuration provides durable and inexpensive off-site backups that we can recover to our local data center or to Amazon EC2
- Tape Gateway:
- Provides backup processes similar to the ones using physical tapes
- It is a Virtual Tape Library (VTL) backed by S3 and Glacier
- Back up data using existing tape-based processes (and iSCSI interface)
- Works with leading backup software vendors
- Serverless service used to perform analytics directly against S3 files
- Uses SQL language to query the files
- Provides JDBC/ODBC driver
- We are charged per query and amount of data scanned
- Supports CSV, JSON, ORC, Avro and Parquet. In the backend, it uses Presto
- Use cases: business intelligence, analytics, reporting, analyze and query VPC Flow Logs, ELB Logs and CloudTrail logs, etc.
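- A hedged sketch (database, table and results bucket are hypothetical) of running an Athena query over files in S3 from the AWS CLI:

```bash
# Start a query; results are written to the given S3 location
aws athena start-query-execution \
  --query-string "SELECT status, COUNT(*) FROM elb_logs GROUP BY status" \
  --query-execution-context Database=my_logs_db \
  --result-configuration OutputLocation=s3://my-athena-results-bucket/

# Fetch the results once the query has finished, using the returned QueryExecutionId
aws athena get-query-results --query-execution-id <query-execution-id>
```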