Storage Systems - Currently Upating...
Course Evaluation (Final grade: N/A)
Reading summary
Remzi - SSD
https://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf
Hence, flash chips are organized into banks or
planes which consist of a large number of cells.
A bank is accessed in two different sized units: blocks (sometimes
called erase blocks), which are typically of size 128 KB or 256 KB, and
pages, which are a few KB in size (e.g., 4KB).
Each page has a state associated with it: Invalid, Erased, Valid
Writing a page is trickier; the entire block must first be erased
Perf read > program -> erase,
Reliability program:
1. wear out!
increasingly difficult to differentiate between a 0 and a 1 -> block becomes unusable
2. disturbance
When accessing a particular page within a flash, it is possible that some bits get flipped in neighboring pages
Internally, an SSD consists of some number of flash chips (for persistent storage).
An SSD also contains some amount of volatile (i.e., nonpersistent) memory (e.g., SRAM); such memory is useful for caching and buffering of data as well as for mapping tables.
Finally, an SSD contains control logic to orchestrate device operation (satisfy client
reads and writes, turning them into internal flash operations as need be. The flash translation layer, or FTL, goal: performance and high reliability)
Techniques:
utilize multiple flash chips in parallel,
reduce write amplification(total write traffic (in bytes) issued to the flash chips by the FTL divided by the total write traffic (in bytes)),
wear leveling(o spread writes across the blocks of the flash as evenly as possible)
FTL Organization:
Why direct map bad (pref + reliability): severe write amplification (proportional to the number of pages in a block); If file system metadata or user file data is repeatedly overwritten, the same block is erased and programmed, over and over.
Log-Structured FTL
Upon a write to logical block N, the device appends the write to the next free spot in the currently-beingwritten-to block; we call this style of writing logging.
To allow for subsequent reads of block N, the device keeps a mapping table (in its memory, and persistent, in some form, on the device)
-: old versions of data around the drive and taking up space. The device has to periodically perform garbage collection (GC). Example: same logical blocks written again, written to a new block. Old block has garbage data. find a block that contains one or more garbage pages, read in the live (non-garbage) pages from that block, write out those live pages to the log, and (finally) reclaim the entire block for use in writing.
high cost of in-memory mapping tables. page-level FTL scheme is impractical due to
large size. Issue for block-based mapping: small write (read a large amount of live
data from the old block and copy it into a new one).
hybrid mapping:
per-page mappings for these log blocks.The FTL thus logically has two types of mapping table in its memory: a small set of per-page mappings in what we’ll call the log table, and a larger set of per-block mappings in the data table.
+ Caching:
+ Wear leveling: long-lived data that does not get over-written; in this case, garbage collection will never reclaim the block, and thus it does not receive its fair share of the write load.
Design Tradeoffs for SSD Performance (Agrawal2008)
SSD performance and lifetime is highly workloadsensitive, and that complex systems problems that normally appear higher in the storage stack, or even in distributed systems, are relevant to device firmware
Interleaving:
Interleaving can provide considerable speedups when the operation latency is greater than the serial access latency.
For example, a costly erase command can in some cases proceed in parallel with other commands.
constraints: operations on the same flash plane cannot be interleaved.
SSD Basics: host interface logic to support some form of physical host
interface connection. (flash translation layer)
SSD Controller contains processor, buffer manager (holds pending and satisfied requests along the primary data path), A multiplexer(Flash Demux/Mux)emits com
mands and handles transport of data along the serial connections to the flash packages.
They are typically ASIC/FPGA.
each write of a single logical-disk block address (LBA) corresponds to a write
of a different flash page.
Variants(allocation pools): Static map, Dynamicmap, Logical page size, Pagespan
Bounded by constraints: LB, Parallel access, Block erasure.
if a large portion of the LBA space is statically mapped, then there is little
scope for load-balancing.
If a contiguous range of LBAs is mapped to the same physical die, performance for se
quential access in large chunks will suffer.
With a small logical page size, more work will be required to eliminate valid pages from erasure candidates.
If the logical page size (with unit span) is equal to the block size, then
erasure is simplified because the write unit and erase unit are the same,
however all writes smaller than the logical page size result in a
read-modify-write operation involving the portions of the logical page
not being modified.
Flash blocks are the basic units for allocation, and a garbage collector is used to recycle used
blocks by erasing and preparing them for reuse. Cleaning involves moving valid data from blocks
before they are erased. The goal is to maximize cleaning efficiency, which is the ratio of outdated
pages to total pages during block cleaning.
Wear-leveling is crucial to ensure that all blocks age evenly, preventing premature failure of any
single block. SSDs are always full relative to their advertised capacity, requiring spare blocks to
facilitate ongoing cleaning and block replacement. Unlike log-cleaning in Log-Structured File
Systems, SSDs face constant cleaning pressure due to the absence of free sectors, making efficient
block management essential.
Parallel Requests: Each flash element operates independently, handling separate I/O requests. This requires complex logic to manage queues for each element, which might be challenging in systems with limited processing power.
Ganging: Multiple flash packages operate in synchrony to optimize multi-page requests, reducing complexity. However, idle elements can occur if the request doesn't span all packages.
The choice of parallelism technique depends on the workload:
Sequential Workloads: Benefit from ganging.
Parallel Workloads: Require deep parallel request queuing.
Workloads with Poor Cleaning Efficiency: Need a cleaning strategy compatible with the foreground load.
Technique Positives Negatives
Large allocation pool
Load balancing, but Few intra-chipops
Large page size
Small pagetable, but Read-modify-writes
Overprovisioning Less cleaning, but Reduced capacity
Ganging Sparser wiring, but Reduced parallelism
Striping Concurrency, but Loss of locality
Even wear: algorithm tracks the erase count of each block and proposes managing blocks based on
their remaining lifetime relative to the average. Blocks that fall below certain thresholds are
either recycled at a lower rate or have cold data migrated into them to balance wear.Rate-limiting
Worn Blocks, Cold data is migrated into worn-out blocks before they hit critical wear thresholds.this
approach incurs a small performance overhead due to the additional operations involved in data
migration.
Scheduling Algorithms for Modern Disk Drives
examines how dynamically ordering pending disk requests can improve disk subsystem performance,
particularly in the presence of complex logical-to-physical mappings and large prefetching caches.
Incorporating complex mapping information into the scheduler provides only a marginal improvement
(less than 2%) in response times for seek-reducing algorithms.
Algorithms that utilize prefetching disk caches effectively can significantly improve performance for
workloads with read sequentiality, with the cyclical scan algorithm (C-LOOK) showing the highest
performance among such algorithms.
The highest performance is achieved by algorithms that reduce overall positioning delays while
recognizing and exploiting a prefetching cache.
Disk Scheduler Role: The disk scheduler is responsible for ordering pending requests to minimize
mechanical positioning delays and optimize overall performance, particularly in environments with
intense bursts of disk activity.
Structure and Functionality:
A disk drive consists of rotating platters with magnetic media, where data is stored in circular
tracks. The basic storage unit is a sector, typically holding 512 bytes of data.
Tracks on different surfaces but at the same distance from the disk’s center form a cylinder. Data
location on the disk is defined by the cylinder, surface, and sector.
The disk arm, which holds the read/write heads, moves to position the heads over the correct cylinder
to access data.
Accessing data involves mechanical delays such as the disk arm moving to the target cylinder (seek
time) and waiting for the target sector to reach the read/write head (rotational latency).
Additional delays can occur if the requested data spans multiple tracks or cylinders.
Modern disk drives have evolved to be more self-contained, with features that can affect scheduling
accuracy. These include the host interface, data layout, and on-board cache.
The actual media access details are typically hidden from the host, reducing the host's management
overhead but also limiting the external scheduler's knowledge of data layout, cache status, and
overhead delays.
Many systems use LBN-based approximations to reduce seek times, relying on the sequential mapping of
logical to physical blocks.
Disk drives now include large on-board caches that can automatically prefetch data, significantly
speeding up sequential read operations. Prefetched data in the cache can be accessed much faster than
data on the media, influencing scheduling decisions. Aggressive prefetching complicates scheduling
because the read/write head's position may not be predictable based on the most recent request
location and length.
3. Disk Scheduling Algorithms
FCFS (First Come First Served):
Simple but often leads to suboptimal performance.
Numerous alternative algorithms have been proposed to improve performance by considering individual request details and the disk subsystem's current state.
3.1 Seek Delay Reduction
SSTF (Shortest Seek Time First):
Overview: Prioritizes the request that requires the shortest seek distance, approximating the
shortest seek time.
Advantages: Reduces average response time across various workloads.
Drawbacks: Can lead to starvation, particularly under heavy workloads, as the disk arm may hover over
a subset of cylinders, neglecting requests from other regions.
SCAN (Elevator Algorithm):
Overview: The disk arm moves back and forth across the cylinder range, servicing all requests in its
path, changing direction only at the disk's innermost and outermost cylinders.
Advantages: Provides lower response time variance compared to SSTF with only a slight increase in
average response time. Central cylinders receive better service due to more frequent passes.
Drawbacks: Can still marginally favor requests in the middle of the disk over those at the edges.
Variations of SCAN:
C-SCAN (Cyclical SCAN):
Moves the arm in one direction only; after reaching the last cylinder, it performs a full-stroke seek
back to the first cylinder without servicing requests on the return.
Treats all cylinders equally, reducing bias towards central cylinders.
LOOK:
Changes direction if no pending requests exist in the current direction, avoiding unnecessary scans.
C-LOOK:
A combination of C-SCAN and LOOK, combining their advantages.
VSCAN(R):
Overview: Introduces a continuum of algorithms between SSTF and LOOK, controlled by the R parameter.
Advantages: Balances average response time and starvation resistance.
Example: VSCAN(0.2) is suggested as an optimal balance.
3.2 Positioning Delay Reduction
SPTF (Shortest Positioning Time First):
Overview: Considers both seek delay and rotational latency by choosing the request with the minimum
overall positioning delay.
Names: Also known as STF (Shortest Time First) and SATF (Shortest Access Time First) in other studies.
Drawbacks: Like SSTF, SPTF can suffer from starvation issues.
Mitigating Starvation in SPTF:
Priority Adjustments: Priority can be given to requests that have waited too long in the queue,
either by gradually increasing priority with age or by setting a time limit after which requests are
prioritized.
Summary
The section discusses various disk scheduling algorithms aimed at reducing seek delay and positioning
delay. While SSTF and its variations focus on minimizing seek times, algorithms like SPTF also
consider rotational latency for a more holistic reduction in access times. However, many of these
algorithms face challenges related to starvation, especially under heavy workloads. Variants like
SCAN, C-SCAN, and VSCAN(R) attempt to balance performance with starvation resistance, providing
different trade-offs based on workload characteristics.
Primary Metrics:
The average response time across all disks is the primary metric used to compare scheduling algorithms.
The squared coefficient of variation is also used to measure the variance in response times, with a lower coefficient indicating better starvation resistance.
6.1 Scheduling by Logical Block Number (LBN)
Performance Insights:
C-LOOK consistently outperforms other algorithms (LOOK, VSCAN(0.2), SSTF) in most traced workloads,
particularly when a prefetching cache is enabled. This is because C-LOOK maintains logically
ascending order, which aligns well with the sequential data prefetching behavior of modern disks.
Cello Trace Exception: In the Cello environment, which is dominated by large bursts of write requests
and has a low fraction of sequential reads, C-LOOK does not perform as well as other seek-reducing
algorithms. This is due to its reduced benefit from the prefetching cache.
Cache's Role: The performance advantage of C-LOOK diminishes significantly when the disk cache is
used only as a speed-matching buffer rather than for prefetching, indicating the importance of
prefetching in these algorithms.
Key Observations:
The advantage of algorithms like LOOK and VSCAN(0.2) over SSTF is due to their ability to maintain
sequential requests in logically ascending order during part of the scan cycle.
When requests are scheduled in a descending logical order, the performance gains from the prefetching
cache are negated.
6.2 Scheduling with Known Mapping
Impact of Full Mapping Knowledge:
Using full LBN-to-PBN mapping information results in performance improvements for LOOK, VSCAN(0.2),
and SSTF under most workloads, except for Cello. The improvement is not due to better seek time
prediction but rather the use of C-LOOK within cylinders to exploit the prefetching cache more
effectively.
The complexity of maintaining a full LBN-to-PBN mapping is generally not justified for seek-reducing
algorithms, as the performance gains are minimal (less than 1%).
Starvation Resistance:
Algorithms that satisfy all requests within a cylinder before moving to another one are slightly less
resistant to starvation, but the overall starvation characteristics of the full-map algorithms align
closely with their LBN-based counterparts
6.3 Scheduling with Full Knowledge
SPTF-Based Algorithms:
SPTF and Cache Awareness: SPTF-based algorithms, which include rotational latency information, have
the potential to provide more accurate positioning delay predictions. However, they do not inherently
preserve sequential read order, reducing their effectiveness in utilizing the prefetching cache.
Performance Trade-offs: For workloads like Cello, SPTF-based algorithms outperform seek-reducing
algorithms, especially when cache knowledge is incorporated. For other workloads, SPTF and its
variants may saturate more quickly than C-LOOK but can provide better performance under certain
conditions.
Cache Utilization: Incorporating cache knowledge into SPTF-based algorithms (SPCTF, ASPCTF) widens
the performance advantage over C-LOOK in some cases.
Starvation and Performance:
The aging factor in algorithms like ASPCTF can improve both starvation resistance and performance,
especially for workloads with a significant fraction of sequential reads.
The traced workload analysis highlights the strengths and limitations of different disk scheduling
algorithms under real-world conditions.
C-LOOK, with its alignment to sequential read patterns and effective cache utilization, generally
performs well across most workloads.
SPTF-based algorithms, while offering more precise positioning delay predictions, struggle with cache
utilization unless specifically adapted to recognize and optimize for on-board disk caches.
The study emphasizes the importance of considering disk cache behavior in scheduling decisions and
suggests that the complexity of full LBN-to-PBN mappings may not be worth the marginal performance
gains for most algorithms.
Blog Chapters
** Chapter 1: Flash SSD Operation **
Performance
- single-page random read is fast
- write is usually much worse because of erasing the block
Why have to erase by block
- energy can impact a region. To make dense design, we use some pages (block of pages), then have gaps between these blocks. Pure hardware reason.
Write Amplification
- What happens when writing 2 pages of data? read, copy from flash block to DRAM, Erase original block, copy from DRAM to erase block
- battery/capasitor enough to finish few seconds of such operation
- total write traffic (in bytes) issued to the flash chips by the FTL divided by the total write traffic (in bytes)
- Reduce? freely remap between host & NAND addresses
-
- Write LBAs to some place other than where they were before
-
- Group bunch of different small writes into full blocks
-
- Leaves “holes” in other blocks (where old data was)
-
- Reducing cleaning costs
-
-
- More overprovisioning, more aggresive GC
-
-
-
- Have host provide “delete” notifications (aka “TRIM”). Host reduces cleaning costs and need for overprovisioning
-
Wear Leveling
- floating gate insulator degrades
- Wear leveling is remapping of addresses to better balance the number of erase/program cycles seen by each block
- Simple: if any flash block has erase remaining > threshold, clean. Else, clean but migrate cold data into A(could be one of the flash block that is not rewritten for a long time). If < Throttle Threshold, change algorithm to reduce prob of choosing A.
Increased Density through more bits per cell
- lower cost-per-bit but lower endurance and lower performance Back to Blog Chapters
** Chapter 2: HDD Operation **
Performance
- Response time = Queue time + Access time
- Access time = Command + Seek + Rotation + Transfer
- Ony one head at a time, but can read and write in parallel
- Ideally, we want to have Transfer time dominated
Seek time
- accelerate
- coast at max velocity
- decelerate
- settle onto correct track
Rotational Latency
- Rotations Per Minute
- average rotational latency is time for 1/2 revolution
Media transfer time
- all on one track: sectors desired * time for revolution / sectors per track
- spread across 2+ tracks, head switch time, one or more track switch times
Enjoy Reading This Article?
Here are some more articles you might like to read next: