AWS Storage

AWS offers a comprehensive suite of storage solutions tailored to meet diverse needs in the cloud. This article explores various types of storage available in AWS, including file storage, block storage, and object storage. We will delve into services such as Amazon EFS, Amazon FSx, Amazon EBS, and Amazon S3, highlighting their features and use cases to help you better understand them.

  • there are three categories of AWS storage:
    • file storage (data is stored as files in a hierarchy)
    • block storage (data is stored in fixed-size blocks)
    • object storage (data is stored as objects in buckets)

File storage

  • files stored in tree-like hierarchy of folders and subfolders
  • files have metadata (e.g. file name, size, date of creation)
  • ideal when you need centralized access to files that are shared and managed by multiple host computers
  • typically mounted onto multiple hosts
  • requires file locking and integration with existing file system communication protocols.

Use cases for file storage:

  • web serving (cloud file storage solutions)
  • analytics (for workloads that interact with data through a file interface and rely on features such as file lock or writing to portions of a file)
  • media and entertainment (for hybrid cloud deployments that need standardized access using file system protocols (NFS or SMB) or concurrent protocol access)
  • home directories (extending access to home directories for many users).

Amazon Elastic File System (Amazon EFS)

  • fully managed elastic file system
  • set-and-forget file system that automatically grows and shrinks when you add/remove files
  • no need for provisioning or managing storage
  • petabytes of capacity
  • can be used with AWS compute services and on-premises resources
  • you can connect thousands of compute instances to an Amazon EFS file system at the same time with consistent performance to each compute instance
  • you pay only for the storage used, you can choose from a range of storage classes:
    • standard storage classes (EFS Standard and EFS Standard-Infrequent Access (Standard-IA)) – offer Multi-AZ resilience and the highest levels of durability and availability
    • One Zone storage classes (EFS One Zone and EFS One Zone-Infrequent Access (EFS One Zone-IA)) – provide cost savings by storing your data in a single Availability Zone
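
Since there is nothing to provision, creating a file system is a single API call. A minimal boto3 sketch, where the region, creation token, and tag values are illustrative assumptions:

```python
import boto3

efs = boto3.client("efs", region_name="eu-west-1")

# CreationToken makes the call idempotent; reusing it returns the same file system.
response = efs.create_file_system(
    CreationToken="my-app-shared-storage",   # hypothetical token
    PerformanceMode="generalPurpose",
    Encrypted=True,
    Tags=[{"Key": "Name", "Value": "my-app-efs"}],
)
print(response["FileSystemId"])  # e.g. fs-0123456789abcdef0
```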

Amazon FSx

  • fully managed shared file system to launch, run and scale high-performance file systems in the cloud
  • multiple instances can be connected at once
  • you can choose between four file systems:
    • Amazon FSx for NetApp ONTAP:

      • fully managed service
      • can serve as a drop-in replacement for existing ONTAP deployments, giving customers the ability to run ONTAP file systems in the cloud
      • broadly accessible from Linux, Windows, and macOS compute instances running in AWS or on premises
    • Amazon FSx for OpenZFS:

      • fully managed file storage service
      • allows you to move data from on-premises ZFS or other Linux-based file servers to AWS without changing your application
      • good choice for latency-sensitive and small-file workloads with popular NAS data management capabilities (snapshots, and cloning)
    • Amazon FSx for Windows File Server:

      • fully managed Microsoft Windows file servers, backed by a fully native Windows file system
      • it is accessible over the Server Message Block (SMB) protocol
    • Amazon FSx for Lustre:

      • the open-source Lustre file system is designed for applications that require fast storage that can keep up with compute
      • FSx for Lustre file systems can be linked to data repositories on Amazon S3 or to on-premises data stores
      • FSx for Lustre delivers the highest levels of throughput (up to 1+ TB/s) and IOPS (millions)

Block storage

  • splits files into fixed-size chunks of data called blocks that have their own addresses (thanks to which blocks can be retrieved efficiently)
  • addresses are used to organize the blocks in the correct order to form a complete file presented to the requester
  • no additional metadata is associated with blocks
  • optimized for low-latency operations
  • to change one character in a file, you only rewrite the block containing that character (this makes block storage faster and more bandwidth-efficient for small changes, e.g. changing one character in a 1 GB file)

Use cases for block storage:

  • transactional workloads (block storage allows setting up a robust, scalable, and highly efficient transactional database; each block is a self-contained unit, so the database performs optimally even as the stored data grows)
  • containers (to store containerized applications in the cloud, which lets you migrate containers seamlessly between servers, locations, and operating environments)
  • virtual machines (block storage supports popular virtual machine (VM) hypervisors; users can install the operating system, file system, and other computing resources on a block storage volume, turning it into a VM file system; this lets you increase or decrease the virtual drive size and transfer the virtualized storage from one host to another)

Amazon EC2 instance store

  • EC2 instance store provides temporary block-level storage for an instance
  • it is located on disks that are physically attached to the host computer (the lifecycle of the data is tied to the lifecycle of the EC2 instance)
  • if you stop or terminate the EC2 instance, the instance store is deleted (ephemeral storage)
  • used for applications that replicate data to other EC2 instances, such as a Hadoop cluster; replication provides resilience combined with locally attached volume performance, so you can distribute data at high performance
  • used for temporary storage of information that changes frequently (buffers, caches, scratch data, other temporary content)

Amazon EBS (Elastic Block Store):

  • block-level storage that you can attach to an Amazon EC2 instance
  • EBS volumes are:
    • Detachable – you can detach an EBS volume from one EC2 instance and attach it to another EC2 instance in the same AZ
    • Distinct – if EC2 goes down you still have your data on EBS volume
    • Size-limited – have a maximum size limit on how much you can store (64 TiB)
    • 1-to-1 connection – most EBS volumes can only be connected with one computer at a time
  • Amazon EBS multi-attach feature permits Provisioned IOPS SSD (io1 or io2) volumes to be attached to multiple EC2 instances at one time (it is not available for all instance types, and all instances must be in the same AZ)
  • can be encrypted
  • supports sending snapshots to S3
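
A minimal boto3 sketch of creating an encrypted gp3 volume and attaching it to an instance; the region, Availability Zone, instance ID, and device name are illustrative assumptions:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Create an encrypted gp3 volume; it must live in the same AZ as the instance.
volume = ec2.create_volume(
    AvailabilityZone="eu-west-1a",
    Size=100,               # GiB
    VolumeType="gp3",
    Encrypted=True,
)

ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Attach it to an (assumed) running instance as a secondary device.
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",  # hypothetical instance ID
    Device="/dev/sdf",
)
```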

Scaling Amazon EBS volumes:

  • You can scale EBS volumes in two ways:
    • increase the volume size (up to max limit of 64 TiB)
    • attach multiple volumes to single EC2 (Amazon EC2 has a one-to-many relationship with EBS volumes)
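
Increasing a volume's size is a single API call; a short sketch with a hypothetical volume ID:

```python
import boto3

ec2 = boto3.client("ec2")

# Grow an existing volume in place; the instance keeps running.
ec2.modify_volume(VolumeId="vol-0123456789abcdef0", Size=200)  # new size in GiB

# After the modification completes, the file system on the volume still
# has to be extended (e.g. with resize2fs or xfs_growfs) to use the new space.
```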

Amazon EBS use cases:

  • general app data storage
  • when you must retrieve data quickly and have data persist long term
  • operating system (root device for an instance launched from an AMI is typically an EBS volume, commonly referred to as EBS-backed AMIs)
  • stand-alone databases (storage layer for databases running on Amazon EC2)
  • enterprise applications (high availability and high durability block storage to run business-critical applications)
  • big data analytics engine (you can resize clusters for big data analytics)

EBS volume types:

EBS volumes are organized into two main categories:

  • solid-state drives (SSDs) (SSD-backed):

    • for cases where the focus is on IOPS (input/output operations per second)
    • transactional workloads with frequent read/write operations with small I/O size
    • gp2/gp3 – General Purpose SSD volumes; the default type; dev environments, low-latency applications
    • io1/io2 – Provisioned IOPS SSD volumes for latency-sensitive workloads; support EBS Multi-Attach
  • hard-disk drives (HDDs):

    • in case the emphasis is on throughput
    • streaming workloads that need high throughput performance
    • Throughput optimized HDD (st1) – frequent access, large throughput, large datasets
    • Cold HDD (sc1) – lowest cost per GB among all EBS, lower frequency, large, cold dataset

Amazon EBS benefits

  • high availability: a newly created EBS volume is automatically replicated within its Availability Zone
  • data persistence: in the event of an EC2 instance failure, data is not lost
  • data encryption: all EBS volume types support encryption, which can be activated by the user
  • flexibility: on-the-fly change, modify volume type, volume size, and input/output operations per second (IOPS) capacity without stopping your instance
  • backups: provides the ability to create backups of any EBS volume.

Amazon EBS snapshots:

  • EBS volumes contain the data from your EC2 instance, and you should make backups of these volumes (these backups are called snapshots)
  • EBS snapshots are incremental backups (saves only blocks that have changed after the last snapshot)
  • when you take a snapshot of your EBS volumes, the backups are stored redundantly in multiple Availability Zones using Amazon S3
  • EBS snapshots can be used to create new volumes (exact copies), whether they’re in the same Availability Zone or a different one
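
A short boto3 sketch of taking a snapshot and restoring it as a new volume in a different Availability Zone; the IDs are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Take an incremental snapshot of a volume (only changed blocks are stored).
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",   # hypothetical volume ID
    Description="nightly backup",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

# Restore the snapshot as a new volume, possibly in a different AZ.
ec2.create_volume(
    AvailabilityZone="eu-west-1b",
    SnapshotId=snapshot["SnapshotId"],
    VolumeType="gp3",
)
```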

Object storage

  • in object storage, files are stored as objects
  • objects are treated as a single, distinct unit of data when stored
  • objects are stored in a bucket using a flat structure, meaning there are no folders, directories or complex hierarchies
  • each object contains a unique identifier
  • to change one character in an object, the entire object must be updated (significantly worse compared to block storage)
  • you can store almost any type of data, and there is no limit to the number of objects stored (readily scalable)
  • generally useful when storing large or unstructured data sets.

Use cases for object storage:

  • data archiving (long-term data retention, cost-effectively archive large amounts of content, retain mandated regulatory data for extended periods of time)
  • backup and recovery (object storage can use replication)
  • rich media (reduce the cost of storing rich media files such as videos, digital images, and music)

Amazon S3 (Amazon Simple Storage Service)

  • standalone storage solution that isn’t tied to compute
  • object storage service (stores data as objects in a flat structure)
  • object = file + metadata
  • you can store unlimited objects, and each individual object can be up to 5 TB
  • objects are stored in containers called buckets
  • you can’t upload an object to Amazon S3 without creating a bucket first
  • the combination of a bucket name, key, and version ID uniquely identifies the object
  • to create a bucket you need to specify, at the very minimum: the bucket name and the AWS Region
  • buckets are created on region level
  • Amazon S3 supports global buckets (therefore, each bucket name must be unique across all AWS accounts in all AWS Regions within a partition)
    • partition is a grouping of Regions, of which AWS currently has three: Standard Regions, China Regions, and AWS GovCloud (US)
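
A minimal boto3 sketch of creating a bucket in a chosen Region and uploading a first object; the bucket name is a placeholder and must be globally unique within the partition:

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

# Outside us-east-1, the Region must be given as a LocationConstraint.
s3.create_bucket(
    Bucket="my-example-bucket-2024",   # placeholder; may already be taken
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Upload a first object; the key "hello.txt" identifies it within the bucket.
s3.put_object(Bucket="my-example-bucket-2024", Key="hello.txt", Body=b"Hello, S3!")
```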

Bucket names rules:

  • you should avoid using “AWS” or “Amazon” in your bucket name
  • names must be between 3–63 characters long
  • names can consist only of lowercase letters, numbers, dots (.), and hyphens (-)
  • names must begin and end with a letter or number
  • names cannot be formatted as an IP address
  • bucket names need to be globally unique – a name cannot be used by another AWS account in the same partition until the bucket is deleted

Amazon S3 objects

  • each object in S3 has five components:
    • object key – the name we assign to the object; we use it to retrieve the object
    • version ID – together with the key, it uniquely identifies the object
    • value – the content being stored, an immutable sequence of bytes; to modify it, we need to upload a new version
    • metadata – name-value pairs describing the object; there is system metadata and user-defined metadata
    • subresources – additional object-specific information

Object key names:

  • object key (key name) uniquely identifies the object in an Amazon S3 bucket
  • you specify it when creating the object
  • there is no hierarchy of subbuckets or subfolders (the Amazon S3 console does, however, support the concept of folders: you can imply a logical hierarchy by using key name prefixes and delimiters, e.g. 2022-03-01/AmazonS3.html and 2022-03-01/Cats.jpg, where the console uses the key name prefix, 2022-03-01, and the delimiter (/) to present a folder structure)
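
A short sketch of implying a folder structure with key name prefixes and then listing by prefix and delimiter; the bucket name is a placeholder carried over from the earlier sketch:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket-2024"  # hypothetical bucket

# Keys sharing a prefix imply a folder structure without real directories.
s3.put_object(Bucket=bucket, Key="2022-03-01/AmazonS3.html", Body=b"...")
s3.put_object(Bucket=bucket, Key="2022-03-01/Cats.jpg", Body=b"...")

# Listing with Prefix and Delimiter presents the flat keys as a "folder".
response = s3.list_objects_v2(Bucket=bucket, Prefix="2022-03-01/", Delimiter="/")
for obj in response.get("Contents", []):
    print(obj["Key"])
```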

Amazon S3 use cases:

  • backup and storage
  • media hosting – each object has a unique HTTP URL and can be served directly from S3; S3 can be the origin for a CDN (e.g., CloudFront)
  • software delivery (S3 to host your software applications that customers can download)
  • data lakes (virtually unlimited scalability)
  • static websites (you can configure your S3 bucket to host a static website of HTML, CSS, and client-side scripts)
  • static content

Security in Amazon S3:

  • everything in Amazon S3 is private by default (a resource can be viewed only by the one who created it); buckets are private and objects are protected
  • you can choose to make your buckets and objects public (then everyone on the internet can see it)
  • Amazon S3 provides several security management features for more granular access:
    • Block Public Access feature (enabled by default)
    • IAM policies,
    • S3 bucket policies,
    • ACL (legacy)
    • S3 Access Point
    • Presigned URLs (time-limited access via a URL link; see the sketch after this list)
    • AWS Trusted Advisor
    • encryption (to develop and implement your own security policies)
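
Generating a presigned URL takes a single boto3 call; a sketch with placeholder bucket and key names:

```python
import boto3

s3 = boto3.client("s3")

# Anyone holding this URL can GET the object until it expires,
# without needing AWS credentials of their own.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-example-bucket-2024", "Key": "hello.txt"},
    ExpiresIn=3600,  # seconds; the link stops working after one hour
)
print(url)
```
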
Encryption of data in S3

Options of data encryption in AWS S3:

  • server-side encryption (default) – Amazon S3 automatically encrypts all objects on upload, applying server-side encryption with S3-managed keys (SSE-S3) as the base level of encryption for every bucket at no additional cost
  • client-side encryption – data is encrypted on the client before upload, so it remains encrypted in transit (as it travels to and from Amazon S3) and at rest

Amazon S3 and IAM policies:

  • IAM policies attached to your resources (buckets and objects) or to IAM users, groups, and roles define which actions can be performed
  • access policies attached to your resources are referred to as resource-based policies, and access policies attached to users in your account are called user policies
  • you should use IAM policies for private buckets in the following scenarios:
    • you have many buckets with different permission requirements (instead of defining an S3 bucket policy for each bucket, you define IAM policies)
    • you want all policies in a centralized location (with IAM you can manage all access policies in one place)

Amazon S3 bucket policies:

  • defined in a JSON format
  • can only be attached to S3 buckets
  • the policy placed on the bucket applies to every object in that bucket
  • you should use S3 bucket policies in the following scenarios:
    • you need a simple way to grant cross-account access to Amazon S3 without using IAM roles
    • your IAM policies exceed the defined size limit (S3 bucket policies have a larger size limit)
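
A sketch of attaching a simple bucket policy that grants another (made-up) account read access; the account ID and bucket name are illustrative:

```python
import json

import boto3

s3 = boto3.client("s3")

# Hypothetical policy allowing another account to read objects in the bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},  # assumed account ID
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::my-example-bucket-2024/*",
    }],
}

# The policy applies to every object in the bucket.
s3.put_bucket_policy(Bucket="my-example-bucket-2024", Policy=json.dumps(policy))
```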

Amazon S3 storage classes:

  • When you upload an object to Amazon S3 without specifying a storage class, it is stored in the default storage class, S3 Standard (see the sketch after this list)
  • most storage classes store data in a minimum of three Availability Zones
  • Available storage classes:
    • S3 Standard: frequently accessed data

      • general-purpose,
      • cloud applications,
      • mobile and gaming applications,
      • big data analytics
    • S3 Intelligent-Tiering: unknown or changing access patterns; stores objects in one of three tiers, based on frequency of access, in the most cost-effective way:

      • a frequent access tier,
      • an infrequent access tier,
      • an archive instant access tier
      • there are no additional costs for moving data between tiers with S3 Intelligent-Tiering
      • objects are moved between tiers automatically based on access patterns (e.g. objects that have not been accessed for 30 consecutive days move to the infrequent access tier)
    • S3 Standard-Infrequent Access (S3 Standard-IA): long-lived, infrequently accessed data that requires rapid access when needed.

      • higher costs of accessing data than S3 standard
      • ideal if you want to store long-term backups or disaster recovery files
    • S3 One Zone-Infrequent Access (S3 One Zone-IA): long-lived, infrequently accessed, non-critical data

      • stores data in a single Availability Zone, which makes it less expensive
      • lower-cost option for infrequently accessed data that does not require the availability and resilience of S3 Standard
      • for storing secondary backup copies of on-premises data or easily recreatable data
    • S3 Glacier Instant Retrieval: for archiving data that is rarely accessed and requires millisecond retrieval.

      • up to 68% lower cost compared to the S3 Standard-IA storage class, with the same latency and throughput performance
    • S3 Glacier Flexible Retrieval: low-cost storage for archived data that is accessed 1–2 times per year.

      • data can be accessed in 1–5 minutes using an expedited retrieval
      • you can also request free bulk retrievals in up to 5–12 hours
      • ideal solution for backup, disaster recovery, offsite data storage needs, and for when some data occasionally must be retrieved in minutes
    • S3 Glacier Deep Archive: lowest-cost Amazon S3 storage class.

      • offers two retrieval options:
        • standard retrievals typically complete within 12 hours
        • bulk retrievals typically complete within 48 hours
      • data that might be accessed 1–2 times per year
      • default retrieval time of 12 hours
      • for data that is retained 7–10 years or longer (especially to meet regulatory compliance requirements)
    • S3 on Outposts: delivers object storage to your on-premises AWS Outposts environment

      • for workloads that require satisfying local data residency requirements or need to keep data close to on premises applications for performance reasons
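As referenced above, the storage class is chosen per object at upload time; a minimal boto3 sketch with placeholder names:

```python
import boto3

s3 = boto3.client("s3")

# Omitting StorageClass would store the object in S3 Standard.
s3.put_object(
    Bucket="my-example-bucket-2024",
    Key="backups/2024-01.tar.gz",
    Body=b"...",
    StorageClass="STANDARD_IA",  # e.g. GLACIER, DEEP_ARCHIVE, INTELLIGENT_TIERING
)
```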

Amazon S3 versioning:

  • versioning keeps multiple versions of a single object in the same bucket
  • without versioning, you can run into issues such as the following:
    • objects with common names can be overwritten by mistake
    • you may want to keep different versions of the same file
  • if you enable versioning for a bucket, Amazon S3 automatically generates a unique version ID for the object
  • deleting an object from a versioning-enabled bucket does not remove the object permanently; instead, Amazon S3 puts a delete marker on the object. If you want to restore the object, you can remove the marker, and the object is reinstated.

Versioning states:

  • buckets can be in one of three states (applies to all objects in the bucket):
    • unversioned (default) – neither new nor existing objects in the bucket have a version
    • versioning-enabled – versioning is enabled for all objects in the bucket. After you version-enable a bucket, it can never return to an unversioned state. However, you can suspend versioning on that bucket
    • versioning-suspended – versioning is suspended for new objects; new objects added to the bucket will not get a version, while existing objects keep their object versions
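
Versioning is configured per bucket; a short boto3 sketch with a placeholder bucket name:

```python
import boto3

s3 = boto3.client("s3")

# Once enabled, the bucket can later be suspended but never unversioned again.
s3.put_bucket_versioning(
    Bucket="my-example-bucket-2024",
    VersioningConfiguration={"Status": "Enabled"},  # or "Suspended"
)

# Individual object versions can then be listed explicitly.
versions = s3.list_object_versions(Bucket="my-example-bucket-2024", Prefix="hello.txt")
```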

Managing your storage lifecycle:

  • Amazon S3 lifecycle – lifecycle configuration for an object or group of objects
  • two types of actions:
    • transition actions – define when objects should transition to another storage class
    • expiration actions – define when objects expire and should be permanently deleted
  • e.g. you might transition objects to S3 Standard-IA storage class 30 days after you create them
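
The 30-day transition example could be expressed as a lifecycle rule roughly like the following sketch; the rule ID and key prefix are illustrative assumptions:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket-2024",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-logs",          # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},     # applies to keys under logs/
            # Transition action: move to Standard-IA 30 days after creation.
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            # Expiration action: delete permanently after one year.
            "Expiration": {"Days": 365},
        }]
    },
)
```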

S3 consistency model

  • S3 is strongly consistent for all new and existing objects in all Regions (an object can be read immediately after it is saved); this comes automatically, with no loss of performance or availability
  • there is read-after-write consistency for GET, LIST, PUT
  • bucket configuration is eventually consistent (when you delete a bucket and then list all available buckets, the removed bucket may still be visible)
  • if you want to know more about consistency models, check out our article: System Design Concepts: Consistency Models

S3 costs

  • we pay for:
    • GB of stored data (different fees for different regions and storage classes)
    • transfer OUT: to different regions or to the internet
    • PUT, COPY, POST, LIST, GET, SELECT, lifecycle transition, data retrieval requests
  • we don’t pay for:
    • transfer IN: from the internet to S3
    • transfer between S3 buckets
    • transfer from S3 to other services in the same region
    • transfer out to Amazon CloudFront
    • DELETE and CANCEL requests

Ways to upload files to S3

  • through AWS management console
  • through AWS CLI
  • through AWS Tools and SDKs

Multipart upload

  • when uploading, the file may be divided into parts that are assembled into a single object after everything has been uploaded
  • to use multipart upload, parts must be at least 5 MB (except the last one); it is typically used for files larger than 100 MB
  • advantages:
    • it helps in case of network problems: you can stop and resume uploads, and throughput improves because parts can be uploaded in parallel
    • you can start uploading a file before you know its final size, or upload a file while it is still being created
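
With boto3, multipart upload happens automatically once a file crosses a configurable threshold; a sketch with assumed file and bucket names:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Files above multipart_threshold are split into parts and uploaded in parallel.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # switch to multipart above 100 MB
    multipart_chunksize=16 * 1024 * 1024,   # 16 MB parts
    max_concurrency=8,
)

s3.upload_file(
    "backup.tar.gz",                 # hypothetical local file
    "my-example-bucket-2024",
    "backups/backup.tar.gz",
    Config=config,
)
```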

Amazon S3 Transfer Acceleration

  • uses CloudFront and edge locations to speed up data uploads to S3 using an optimized network
  • from 50 to 500% improvement for cross-country transfers (the Amazon S3 Transfer Acceleration Speed Comparison Tool lets you measure the difference)
  • it is worth using when:
    • you have clients all over the world who upload data to a central bucket
    • you are not fully utilizing the available bandwidth when sending data to S3 over the internet
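
A sketch of enabling Transfer Acceleration on a bucket and then uploading through the accelerate endpoint; the names are placeholders:

```python
import boto3
from botocore.config import Config

s3 = boto3.client("s3")

# Enable acceleration on the bucket (one-time configuration).
s3.put_bucket_accelerate_configuration(
    Bucket="my-example-bucket-2024",
    AccelerateConfiguration={"Status": "Enabled"},
)

# Subsequent clients opt into the accelerate endpoint explicitly.
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
s3_accel.upload_file("video.mp4", "my-example-bucket-2024", "uploads/video.mp4")
```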

Methods for uploading large amounts of data (one-time):

  • AWS Snowball (petabyte scale) – you create a job and choose Snowball as the upload method; AWS Snowball is a storage-optimized device that AWS sends to you, you load your data onto it and ship it back, and AWS uploads the data to S3
  • AWS Snowmobile (exabyte scale) – transfers up to 100 PB; it is a truck carrying a container full of storage-optimized devices

Relating back to traditional storage systems:

  • block storage in the cloud – analogous to direct-attached storage (DAS), or a storage area network (SAN)
  • file storage systems are often supported with a network-attached storage (NAS) server.

Choosing the Right Storage Service:

  • Amazon EC2 instance store:

    • temporary storage of information that is constantly changing, such as buffers, caches and scratch data
  • Amazon EBS:

    • block storage
    • for data that changes frequently and must persist through instance stops, terminations or hardware failures
      • SSD-backed volume performance depends on IOPS; ideal for transactional workloads, such as databases and boot volumes
      • HDD-backed volume performance depends on throughput in megabytes per second (MBps); ideal for throughput-intensive workloads, such as big data, data warehouses, log processing, and sequential data I/O
    • pay for what you provision (you have to provision storage in advance)
    • replicated across multiple servers in a single Availability Zone
    • can only be attached to a single EC2 instance at a time
  • Amazon S3:

    • storing static web content and media, backups and archiving, and data for analytics
    • object storage
    • pay for what you use (you don’t have to provision storage in advance)
    • replicates your objects across multiple Availability Zones in a Region
  • Amazon EFS:

    • only cloud-native shared file system with fully automatic lifecycle management
    • can automatically scale from gigabytes to petabytes of data without needing to provision storage
    • thousands of compute instances can access an Amazon EFS file system at the same time
      • Amazon EFS Standard storage classes for workloads that require durability and availability
      • EFS One Zone storage classes for workloads such as development, build, and staging environments
  • Amazon FSx:

    • native compatibility with third-party file systems (NetApp ONTAP, OpenZFS, Windows File Server, and Lustre)
    • machine learning, analytics, high performance computing (HPC) applications, and media and entertainment

AWS provides a versatile suite of storage solutions tailored to various needs, including file storage, block storage, and object storage. Amazon EFS, Amazon FSx, Amazon EBS, and Amazon S3 each offer unique features and use cases, enabling users to choose the right service for their specific requirements. Whether it’s for centralized access, high-performance workloads, scalable block storage, or long-term data archiving, AWS storage services ensure efficient and reliable management of your data in the cloud.
