Course Evaluation (Final grade: N/A)

A great introductory course to distributed systems with a focus on distributed file systems! I never took such a course during my undergrad, and I learned a lot about DFS.

Similar to other systems courses, the programming assignment is interesting (and the workload is OK!)

Blog Chapters

  1. Chapter 1: Remote Procedure Call (RPC)
  2. Chapter 2: Caching
  3. Chapter 3: Scalability
  4. Chapter 4: Failure Resiliency
  5. Chapter 5: Paxos
  6. Chapter 6: Datacenter storage
  7. Chapter 7: Cluster computation

** Chapter 1: Remote Procedure Call (RPC) **

  • Make a remote call look like a local procedure call
  • Why? reduce programming complexity for distributed systems
  • client-server model, per interface
  • two aspects: control flow, invocation syntax
  • with network delays (theoretically best at speed of light)
limitations of RPC
  • No address space sharing between client and server: can’t share pointers (call by reference), can’t share global data..
  • Delayed binding in RPC
Failure independence
  • caller and callee live and die together in local setup
  • in a distributed setting, one side can fail while the other keeps running, which is hard to witness in the local case
  • failure handling must consider the visibility of failures
  • Security: different domains
Typical RPC
  • client: makerpc(request_packet, &reply_packet): blocks until reply or failure
  • server: getrequest(&request_packet) blocks until receives request, sendresponse(reply_packet)
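
A minimal sketch of these blocking primitives (the names makerpc/getrequest/sendresponse come from the notes; the UDP transport and pickle encoding are my own assumptions for illustration):

```python
import pickle
import socket

def makerpc(sock, server_addr, request_packet, timeout=2.0):
    """Client side: send a request and block until a reply arrives or a timeout failure."""
    sock.settimeout(timeout)
    sock.sendto(pickle.dumps(request_packet), server_addr)
    data, _ = sock.recvfrom(64 * 1024)   # blocks; raises socket.timeout on failure
    return pickle.loads(data)

def getrequest(sock):
    """Server side: block until a request packet arrives."""
    data, client_addr = sock.recvfrom(64 * 1024)
    return pickle.loads(data), client_addr

def sendresponse(sock, client_addr, reply_packet):
    """Server side: send the reply back to the caller."""
    sock.sendto(pickle.dumps(reply_packet), client_addr)
```
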
Stub routines
  • generated by stub generator
  • sit between the high-level application call and the low-level packing/unpacking and send/recv
procedure:

--: network communication

        Client                                             Server
App -> Stub -> Transport --Network Com--- Transport -> Stub -> App

App calls Stub
Stub pack/unpack
Transport transmit/receive
Server App does the actual work, then returns
Packing and unpacking is usually not elastic in dev env
Correctness and API design are more important

  • Marshalling/Unmarshalling, Serialization/De-serialization
RPC packet format
  • Network transport header (Ethernet, IP, Transport (TCP/UDP))
  • RPC header
    RPC Version ID
    Opcode (Stub)
    Flags
    parameters + Len
    
Stub with dynamically allocated buffer
  • a length field indicates how many bytes are coming
  • malloc a buffer of that size
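
As a sketch of marshalling with such a length field (the header fields follow the list above; the exact field widths are assumptions):

```python
import struct

# Hypothetical RPC header: version (1 byte), opcode (1 byte), flags (2 bytes),
# parameter length (4 bytes), followed by the parameters themselves.
HDR = struct.Struct("!BBHI")

def marshal(version, opcode, flags, params: bytes) -> bytes:
    return HDR.pack(version, opcode, flags, len(params)) + params

def unmarshal(packet: bytes):
    version, opcode, flags, length = HDR.unpack_from(packet)
    params = packet[HDR.size:HDR.size + length]   # buffer sized from the length field
    return version, opcode, flags, params
```
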
Failure independence of clients & servers adds complexity
Outsourcing (for example, relying on TCP). TCP guarantees reliable, in-order delivery of an unbounded byte stream.
Pain:
- no preservation of write() boundaries
- data is re-framed in transit
- a read may return fewer bytes than requested

  1. The buck has to stop somewhere. Do it yourself.
    • Retransmission
    • duplicate delivery/execution violates RPC semantics; use sequence numbering to eliminate duplicates
    • based on UDP
Timeout values in distributed systems
  • derive a reasonable value from statistics
  • no matter what, it could be too soon
What does the server do when it receives a duplicate request?
  • it indicates one of the following:
    1. reply lost
    2. reply crossed retransmitted request
    3. compute time was excessive
    4. client too impatient.
Knowledge at the server is always stale relative to the client, and vice versa
  • the client cannot tell processing time from network transit time
  • the best the server can do is retransmit the reply
  • should preserve the reply, not re-compute, because the computation can be substantial
  • *my question: what if replies are too large? maybe best effort
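
A sketch of the duplicate handling described above (the dictionaries standing in for per-client state, and the names, are assumptions): the server remembers the last sequence number and saved reply per client, and retransmits the stored reply instead of re-computing.

```python
last_seq = {}     # client_id -> highest sequence number executed
saved_reply = {}  # client_id -> reply preserved for that sequence number

def handle_request(client_id, seq, request, execute):
    if seq == last_seq.get(client_id):        # duplicate: retransmit saved reply, don't re-compute
        return saved_reply[client_id]
    if seq < last_seq.get(client_id, -1):     # stale duplicate: already superseded
        return None
    reply = execute(request)                  # new request: execute it once
    last_seq[client_id] = seq
    saved_reply[client_id] = reply
    return reply
```
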
Exactly-once semantics
  • How long to keep old replies and sequence numbers
  • Rigorous interpretation of “RPC semantics” -> keep them forever
  • Server crashes:
    • replies and sequence numbers saved in non-volatile storage
    • server response has to be after non-volatile write
    • disk/flash latency per RPC
    • clean undo of partial computations before crash
  • exactly-once RPC:
    • success return -> call executed exactly once
    • call blocks indefinitely, no failure return
In practice for RPC package: At-most-once semantics
Avoid indefinite blocking
Declare a timeout beyond a certain delay: did the call succeed or not?
many timeout reasons...
Slow Servers & Long-Running Calls
Solution: 
probes to check server health during long calls
server responds busy while working
essentially a keepalive mechanism
Orphaned Computations
network failure, then server continues, unaware its work is useless
Orphan detection and extermination are difficult - but important
At-least-once semantics (weakest)
Even simpler to implement
Requires operation idempotency
Idempotency is a property of certain operations or API requests which guarantees that performing the operation multiple times yields the same result as executing it once.
- for example, a read request on a locked or read-only object, or current_stock_price(MSFT)
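
A toy illustration of the difference (the function names are made up):

```python
def set_stock_price(prices, symbol, value):
    prices[symbol] = value        # idempotent: repeating it yields the same final state

def add_to_stock_price(prices, symbol, delta):
    prices[symbol] += delta       # not idempotent: re-executing a duplicate changes the result
```
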
Choice of semantics (less strong)
  • Exactly-once semantics: not provided in any real RPC package; requires application-level duplicate elimination built on top of at-most-once RPC, plus a write to disk before replying
  • At-most-once (what RPC packages mostly provide) avoids:
    • transactional storage
    • non-volatile storage of replies and sequence #s
    • indefinite storage of replies

If the server crashes, either nothing was executed, or the partial work has to be undone and then redone exactly once using the instructions saved in non-volatile storage.


##### Safety and liveness properties

safety: correct functionality (“At most one entity can execute in a critical section”)

  • bad things never happen

liveness: characterizes timely execution progress (e.g., “This code is deadlock-free”)

  • good things will eventually happen
  • “Exactly-once semantics” -> safety property
  • Existence of timeout in at-most-once RPC -> liveness property
Placement of Functionality: what are you promising vs. what do you deliver?
  • Protocol layering
TCP for RPC
TCP timeout -> reconnect
- new connection is unaware of old
- server needs to do duplicate elimination
- orphans still possible
- exactly-once RPC no easier with TCP

TCP simplifies at-most-once RPC
TCP hurts since it has independent acks in each direction


Using TCP rather than UDP:
- does not simplify an exactly-once implementation
- worse performance in the best case
- simplifies an at-most-once implementation

End-to-end argument:
For a given functionality:
- correctness is expressed relative to two endpoints (safety)
- implementation requires support of those two end points
- support below end points cannot suffice (may improve performance(liveness))

Critical question: where should a function be placed in a distributed system?
- what guarantee is being promised, and which end points must provide it?
- support packaged below the end points can simplify the implementation, but cannot by itself guarantee the properties
Performance of distributed system: delay
  1. Processing delay
  2. Queueing delay <- dominates
  3. Transmission delay <- usually small unless files are very large
Queueing
Any shared resources
Arrival pattern: uniform, poisson, ...

Service time also varies

---- without considering real world complications:----
1. Utilization
2. Latency
3. Freedom (how unconstrained the arrival discipline can be)
We can optimize at most 2 out of 3




Latency is the killer

** Chapter 2: Caching **

Metrics
Miss Ratio = Misses/References
Hit Ratio = Hits/References = (1 − Miss Ratio)
Expected cost of a reference = (Miss Ratio * cost of miss) + (Hit Ratio * cost of hit)
Cache Advantage = (Cost of Miss / Cost of Hit)
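
A quick worked example with assumed numbers (20 misses out of 1000 references, a 10 ms miss vs. a 0.1 ms hit):

```python
references, misses = 1000, 20
miss_ratio = misses / references                      # 0.02
hit_ratio = 1 - miss_ratio                            # 0.98
expected_cost = miss_ratio * 10.0 + hit_ratio * 0.1   # 0.298 ms per reference
cache_advantage = 10.0 / 0.1                          # a miss is 100x more expensive than a hit
```
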
Focus: distributed file systems
Fetch policy
Full Replication?
-Storage for entire subtree consumed on every replica
-Significant update traffic on hot spots
-Machines receive updates whether they care or not
Coarse-grain, non-selective management of data 

A Much Better Approach:
on-demand caching 
- requires operating system modifications
+ total application transparency
+ enable demand caching

Multi-OS On-Demand Caching 


  1. Update propagation policy
  2. Cache replacement policy (prefetching)
Caching in the Real World
  1. Cost of remote data access often not uniform
  2. spatial locality
  3. Remote data more coarsely addressable than local: fetch more than you need on a miss
Spatial & Temporal Locality
Update Propagation
One-copy semantics
• there are no externally observable functional differences
• relative to same system without caching (or replication)

This model aims to provide the illusion that although data may be replicated across multiple servers or nodes (to improve reliability, performance, and fault tolerance), users interact with the data as if there is only one copy.

Challenges:
Physical master copy may not exist
Network may break between some users and master copy
Intense read- and write-sharing across sites
(The benefits of caching (reducing access time, decreasing bandwidth usage) are undermined in this scenario. The caches might spend more time synchronizing than serving the actual read and write requests, making them "effectively useless")
Cache Consistency Strategies
Broadcast Invalidations
Notification to All Caching Sites: every caching site in the network is notified, regardless of whether it actually has the cached object.

Handling at Cache Sites:
Upon receiving the invalidation notice, if a cache has the invalidated object, it will mark it as invalid.

Strengths
- Strict One-Copy Semantics:
- Race Condition Prevention:
If the updating process is blocked until all caches have invalidated the item, it prevents race conditions, ensuring that no stale reads occur.
- Simplicity



Limitations
- Wasted Traffic:
- Blocking Updating Process:
- Scalability Issues:
As the number of nodes in the system increases, the overhead of sending invalidations to every node and the corresponding acknowledgments becomes impractical. 
Check on Use:
Reader checks master copy before each use
Has to be done at coarse granularity (e.g. entire file or large block)
Whole-file granularity -> “session semantics”

Advantages
• strict consistency at coarse granularity
• easy to implement, no server state
• servers don’t need to know of caching sites

"session semantics at open-close granularity," where changes made to a file during a session (from open to close) are not visible to other clients until the session ends (the file is closed).
principled weakening of strict one-copy semantics


Strict One-Copy Semantics With Write-Sharing
any write operation performed by any client or process in a distributed system is immediately visible to all other clients or processes. 
Sharing Taxonomy
Multiple concurrent read-only sessions 
 “read sharing”

Multiple concurrent read-only sessions + one read-write session
 “read-write sharing”

Multiple concurrent read-write sessions “write-write sharing”
1. “Last Close After Write Wins”: AFS
2. “Raise Conflict Exception”: Coda File System
3. “Live Happily and Be Blissfully Unaware”:  NFS, Google Docs, DropBox, etc.
None of these approaches prevents concurrent updates on server


How do these three approaches work in detail?


“Last Close After Write Wins”
Mechanism: This approach resolves write-write conflicts by accepting the changes made by the process that closes the file last. All previous writes to the file during the conflict period are overwritten by the last writer's data.
Pros: Simplifies conflict resolution by enforcing a clear rule.
Cons: Can lead to data loss for earlier writes, as changes made by all but the last writer are discarded.
Use Case: AFS uses this method, prioritizing simplicity and predictability over the preservation of every change


“Raise Conflict Exception”
Mechanism: When a write-write conflict is detected, the system raises an exception, alerting the involved parties (applications or users) of the conflict.
Pros: Prevents data loss by not automatically overwriting any data. It requires intervention to resolve the conflict, which can ensure that important data isn't inadvertently lost or overwritten.
Cons: Requires additional mechanisms for conflict resolution and may interrupt the user workflow.
Use Case: The Coda File System employs this strategy to maintain high data integrity, especially in environments where data consistency is critical.


“Live Happily and Be Blissfully Unaware”
Mechanism: This approach allows concurrent writes to proceed without immediate conflict resolution. Systems employing this strategy might merge changes automatically, keep all versions of the file, or simply ignore the conflict entirely.
Pros: Enhances user experience by avoiding interruptions. Systems like Google Docs merge changes in real-time, allowing seamless collaboration.
Cons: Can lead to inconsistencies or unexpected results if automatic merging is not possible or if changes are incompatible.
Use Case: NFS (Network File System) traditionally does not handle write-write conflicts explicitly at the file system level. Google Docs and Dropbox provide user-friendly collaboration features, allowing multiple users to edit documents simultaneously, with changes reflected in real-time or through version history.

Check-on-open disadvantages
slows read access on loaded servers & high-latency networks
the check almost always succeeds: frivolous traffic
load on network and server
Callback
targeted notification of caching sites
on update, all sites with cached data notified (“callback”)

Advantages:
        excellent scalability for Unix workloads (which often involve a mix of read and write operations on files)
        zero network traffic for read of cached-valid objects 
        precursor to caching for disconnected operation
        biases the design in favor of read performance over write performance

Disadvantages:
        sizable state on server
        complexity of tracking cached state on clients
        NAT networks with masquerading firewalls
        
        Clients may not be able to distinguish between a network failure that prevents 
        callbacks from reaching them and a situation where no data changes have occurred


        Periodic “Keepalive” Probes: To mitigate some of the issues with lost callbacks and ambiguous silence, systems might implement periodic "keepalive" probes. These probes allow clients to verify their connection to the server and the validity of their cached data. However, this approach introduces additional network traffic and can only partially address the problem, as data could still become stale between probes.


Prevents concurrent updates on server
4. Leases
Caching site obtains finite-duration control from master copy
few seconds
multiple sites can obtain read lease; only one can get write lease

lease duration = 0: Check on Use
lease duration = ∞: callback (targeted notification)
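
A minimal lease sketch (the duration, names, and structure are assumptions): a cached copy is trusted only while its lease is unexpired, so duration 0 degenerates to check-on-use and an infinite duration behaves like callbacks.

```python
import time

class Lease:
    def __init__(self, duration_s):
        self.duration_s = duration_s
        self.granted_at = time.monotonic()

    def valid(self):
        return time.monotonic() - self.granted_at < self.duration_s

def cached_read(obj_id, cache, leases, fetch_from_master, lease_duration_s=5.0):
    lease = leases.get(obj_id)
    if lease is None or not lease.valid():          # lease expired: go back to the master copy
        cache[obj_id] = fetch_from_master(obj_id)   # refetch and obtain a fresh lease
        leases[obj_id] = Lease(lease_duration_s)
    return cache[obj_id]                            # within the lease, read locally with no traffic
```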

Advantages
• generalizes the check on use and callback schemes
• lease duration can be tuned to adapt to mutation rate
Lease duration is a clean tuning knob for design flexibility
For data that changes frequently, shorter leases ensure that caches are updated or invalidated promptly, maintaining data consistency. For more stable data, longer leases can reduce the overhead of lease renewals and data checks, improving system efficiency. 
• conceptually simple yet flexible

Key Assumption
Clocks tick at the same rate everywhere
• clocks do NOT have to be synchronized
• absolute time does not matter
• only relative time (i.e., clock tick rate) matters
Time becomes a hidden communication channel



Disadvantages
• lease-holder has total autonomy during lease; revocation?
• writers delayed while read lease holders complete their leases
• more traffic than callback (but less than check on use)
keepalives for callback only one per server, not per lease


What To Do When There is Intense Write-Sharing?
5. Skip Scary Parts
• When write-sharing detected, turn off caching everywhere
All references go directly to the master copy
• Resume caching when write-sharing ends

Original Use
• Sprite (circa 1987) (in conjunction with check on use)

Advantages
• Precise single-copy semantics (even at byte-level consistency)
• Excellent fallback position
• Good adaptation of caching aggressiveness to workload
Disadvantages
• Server maintains state
• Server aware of every use of data (open)

Even weaker safety property?

6. Faith-Based Caching
Basic Idea
• blindly assume cached data is valid for a while
• periodically check (based on time since last check)
• no communication needed during trust period
Original use
• Sun NFSv3 file system
cached file blocks assumed current for X seconds
X = 3 for files, 30 for directories
• small variant is a TTL field for each object
used in web caching, gives content creator modicum of control
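
A sketch of faith-based (TTL) caching; the 3 s / 30 s values mirror the NFSv3 numbers above, while the structure and names are assumptions:

```python
import time

TTL = {"file": 3.0, "directory": 30.0}   # trust periods, as in the NFSv3 example above

def get(name, kind, cache, last_checked, fetch):
    now = time.monotonic()
    if name not in cache or now - last_checked.get(name, 0.0) > TTL[kind]:
        cache[name] = fetch(name)        # trust period over: refetch / revalidate
        last_checked[name] = now
    return cache[name]                   # otherwise blindly trust the cached copy
```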

Imprecise and weak approximation to one-copy semantics
• not just in the presence of network failures (partitions)
• no clear basis for reasoning about state of system
Methods 1-5 used controlled and precise approximations
• e.g., session semantics at open-close granularity
• offered clear basis for reasoning

Advantages
• Simple implementation
• Server is stateless

Disadvantages
• User-visible inconsistencies sometimes seen (e.g., with make)
• Blind faith sometimes misplaced!
• Not as efficient as callback-based schemes



7. Pass the Buck
Basic Idea
• Let the user trigger cache revalidation (hit “reload”)
• otherwise, all cached copies assumed valid forever
• Equivalent to infinite-TTL faith-based caching
• Arose in the context of the Web

Advantages
• trivial to implement, no server changes
• avoids frivolous cache maintenance traffic

Disadvantages
• places burden on user
user may be clueless about level of consistency needed
• assumes existence of user
• a pain for scripts/programs that write



What old data do you throw out to free up space?
Cache replacement policy

Ideal victim: large object that is not needed for a long time, and cheap to refetch

Non-Serviceable Miss: The pain of evicting an object that cannot be easily refetched due to failures or disconnected operation is also a vital factor. This aspect is particularly relevant in environments with unreliable connectivity or when dealing with unique or hard-to-replace data.

Uniform vs. Non-Uniform Fetch Cost: While the assumption of uniform fetch cost might hold near the hardware level (e.g., when all data resides on the same disk or in the same data center), it becomes less tenable at higher system layers. At these layers, factors such as variable object sizes, the physical characteristics of storage media (like rotational and seek delays in disks), and differing network qualities to different servers introduce non-uniform fetch costs.

Abstract Problem Formulation: 
The problem can be abstracted to managing a set of 
equal-sized data containers (frames) and 
a large set of equal-sized, equal-importance data objects (pages), 
with access patterns represented by a sequence of integers (reference string). 
The primary metric of interest in this model is the miss ratio.

Predict the distance to the next reference
• the optimal replacement algorithm (OPT) evicts the object whose next reference is farthest away
• but prediction is hard
• LRU is often a good approximation to OPT (not always)
assumes recent past is a good predictor of the near future

LRU
Loop (good for LRU)
- while (some test)
Sequential scan (bad for LRU)
- memset (start address, 0, many bytes)


Stack Property: “Adding cache space never hurts”
LRU has this property
Other cache replacement algorithms may not
FIFO exhibits “Belady’s anomaly”
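
A minimal LRU cache sketch (illustrative only): recently used entries move to the back of the list, and the least recently used entry at the front is evicted when capacity is exceeded.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key, fetch):
        if key in self.entries:
            self.entries.move_to_end(key)        # hit: mark as most recently used
            return self.entries[key]
        value = fetch(key)                       # miss: fetch, insert, maybe evict
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict the least recently used entry
        return value
```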


When is LRU Ineffective?
• purely sequential access (aka “scan”)
caching cannot help at all; only adds overhead

• purely random access
ratio of cache size to total data size is all that matters

Examples:
sequential scan of large files & databases in data mining

video/audio playback (“streaming data”)

hash-based data structures cause accesses to be “spread out”


Example: small tight loop accessing huge array sequentially
• code shows high locality, but data does not
• the sequential data access will “pollute” an LRU-based cache
• even code (which shows locality) will be flushed out of the cache
Working Set
Given a time interval T,
WorkingSet(T) is the set of distinct data objects accessed in T
This captures adequacy of cache size relative to program behavior
• small working set: a small cache is enough, high locality
• large working set: poor locality
size and pages in working set may change over time
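
A tiny sketch of the idea (the window size and reference string are made up):

```python
def working_set(reference_string, window):
    """Distinct objects touched in the last `window` references."""
    return set(reference_string[-window:])

refs = [1, 2, 1, 3, 1, 2, 7, 7, 7, 2]
print(working_set(refs, 5))   # {2, 7}: a small working set, so a small cache suffices
```
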
Back to queueing theory: caching improves utilization by reducing the load on critical resources,
thereby preventing bottlenecks and allowing for more efficient resource usage.
Caching also directly reduces latency,
and it provides greater flexibility in handling different arrival disciplines,
ensuring the system can cope with varying patterns of demand.

utilization vs peak

The Curse of Uncacheability
Businesses want to know your every click
Client caching hides this knowledge from the server, so businesses often make content uncacheable,
which hurts response time, network load & server load


Can businesses benefit from caching without giving up control?

Content Distribution Networks
A Weak Solution
Effectively third-party caching sites trusted by businesses
Late binding of parasitic content by caching site
Pioneered by Akamai in the late 1990s
• many examples now
• e.g., CloudFront (Amazon), Windows Azure CDN, Streamzilla

Not as effective as client caching
But better than no caching at all, and better monetizes cached resources

** Chapter 3: Scalability **

Scalability:

1. Load Scalability
Scale with more users.
How? It can't be as simple as just buying more infrastructure..
VIRTUALIZATION
- VMs
- SDN
- SDS

+: 
(1): elasticity
(2): Monitoring, fault tolerance to local failures
(3): Multiple OSes
(4): Isolation, Sandboxing



VM: perfect software abstraction of
OS-visible hardware

Software Abstraction
• Behaves just like hardware
• Allows multiple OSes

Properties of VMs:

Isolation
- Fault isolation, performance isolation, software isolation

Encapsulation and portability
- Cleanly capture all VM state, Enables VM snapshots, clones
- Independent of physical hardware
- Enables migration of live, running VMs

Interposition
- Transformations on instructions, memory, I/O
- Enables encryption, compression, …


VMM (hypervisor):
• transparent layer between OS and hardware
• multiplexes hardware among multiple VMs

+ Fidelity: provides an environment for programs which is essentially identical with the original machine

+ Performance: programs run in this environment show at worst only
minor decreases in speed

+ Safety and isolation: VMM is in complete control of system resources






Virtualization can transform CAPEX into OPEX
capital expenses -> operational expenses
Elasticity: check from advanced cloud computing topic 


Types of System Virtualization
Type 1: Native/Bare Metal Hypervisors
Type 1 hypervisors run directly on the host's hardware to control the hardware and to manage guest operating systems. This type of hypervisor is also known as a "bare metal" hypervisor because it does not need an underlying operating system to function. The hypervisor acts as the operating system itself and has direct access to physical resources, which contributes to higher performance and efficiency.

Advantages:
+ Higher Performance: Direct access to physical hardware without going through an additional operating system layer allows for better performance.
+ Isolation: Provides strong isolation between virtual machines, improving security.

Examples:
VMWare ESX/ESXi
KVM (Kernel-based Virtual Machine)
Xen
Microsoft Hyper-V (when installed as a stand-alone hypervisor)


Type 2: Hosted Hypervisors
Type 2 hypervisors run on a conventional operating system just like other computer programs. This type of hypervisor is also known as a "hosted" hypervisor because it is hosted by an operating system. The hypervisor creates and runs virtual machines (VMs) that are one level removed from the physical hardware, relying on the host operating system to manage calls to the CPU, memory, and other hardware resources.

Advantages:
+ Ease of Use: Generally easier to install and use compared to Type 1 hypervisors. Suitable for development, testing, and educational purposes.
+ Cost-effective: Often less expensive and can be more suited for smaller scale or personal use.

Disadvantages:
- Higher Latency: The additional layer (the host OS) can introduce latency and reduce performance compared to Type 1 hypervisors.
- Dependence on Host OS: Relies on the host operating system's drivers and kernel, which can affect stability and performance.

Examples:
VMware Workstation
Oracle VirtualBox
Parallels Desktop
CPU Virtualization

Privileged instructions (e.g., IO requests, Update CPU state, Manipulate page table)


Non-privileged instructions (e.g., Load from mem)


For OS:
Privileged instructions from user mode (e.g., system calls): trap to the OS and execute in kernel mode
Non-privileged instructions: Run directly from user mode

For virtualization:
Privileged instructions from user mode: Trap to VMM
Non-privileged instructions: Run directly on native CPU
User Mode: Privileged Ins -> Trap -> Kernel Mode -> VMM(emulate, update to VCPU) -> return result to user mode
Trap and Emulate → Full Control for VMM




Memory Virtualization
OS assumes that it has full control over memory
- Management: Assumes it owns it all
- Mapping: Assumes it can map any Virtual→ Physical 

However, VMM partitions memory among VMs
- VMM needs to assign hardware pages to VMs
- VMM needs to control mapping for isolation
-> Cannot allow OS to map any Virtual ⇒ hardware page


Logical pages (process address space in a VM)
-> physical pages of process (abstraction of hardware memory, guest OS)
-> machine pages (physical hardware memory, managed by VMM)
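
A sketch of this two-level mapping (all page numbers are made up): the VMM composes the guest's logical-to-physical mapping with its own physical-to-machine mapping, so the hardware only ever sees machine pages.

```python
guest_page_table = {0: 5, 1: 9}      # logical page -> guest-"physical" page (guest OS)
vmm_page_table = {5: 112, 9: 87}     # guest-physical page -> machine page (VMM)

# Composed (shadow-table-like) mapping the hardware can actually use:
shadow = {lp: vmm_page_table[pp] for lp, pp in guest_page_table.items()}
print(shadow)   # {0: 112, 1: 87}
```
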
Live Migration of VMs
Running guest VMs can be moved between systems, without interrupting user access to the apps
- Supported by type 1 and type 2 hypervisors
- Very useful for
-- resource management/efficiency
-- no downtime for upgrades/maintenance, etc.
- Key enabler: Encapsulation


A migrate to B

A: running guest src
A->B: establish
B: Create guest target
A: send R/O pages
A: send R/W pages
A: send dirty pages(repeat)
B->A: run guest target
A: terminate guest source
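
A pre-copy sketch following the step list above (the object methods and the stop-copy threshold are assumptions):

```python
def live_migrate(source, target, max_rounds=10, small_enough=64):
    target.create_guest()
    source.send_pages(target, source.read_only_pages())
    source.send_pages(target, source.read_write_pages())
    for _ in range(max_rounds):                       # iterative pre-copy while guest keeps running
        dirty = source.dirty_pages_since_last_send()
        if len(dirty) <= small_enough:                # dirty set small: time to stop and copy
            break
        source.send_pages(target, dirty)
    source.pause_guest()                              # brief downtime window
    source.send_pages(target, source.dirty_pages_since_last_send())
    target.run_guest()
    source.terminate_guest()
```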


Containers vs Virtualization
Containers provide isolation, not virtualization
+ less overhead (operations, boot, ...)
+ high density (large number per machine)
+ no CPU support needed
-: no encapsulation or interposition, cannot migrate, state leaks into the OS


• Multiple isolated instances of programs
• Running in user-space (shared OS/kernel); no VMM
• Instances see only resources (files, devices) assigned to their container
Scaling?
Scale Up:
no app change, no new failure modes, no added latency concerns
more expensive
hits limit

Scale Out
application change
complex failure modes, latency
more economical

What can be scaled? 
- physical machine
- VM
- container
- process

• all threads in a process share same virtual address space
• all processes in a VM share same local file system
• all physical machines in cluster share a distributed file system


When to Scale Out?
- when response time starts getting longer in a nonlinear way

Latency comes from
- Queueing time 
average arrival rate
average service rate

avg server load (=fraction of time server is busy)
load = avg arrival rate / avg service rate


Why nonlinear? Why is there latency even when load < 1?
- because these are averages: bursts of arrivals still build up a queue


If arrival rate  > service rate, the latency goes to infinity
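
For random (e.g., Poisson) arrivals, the M/M/1 result W = 1/(mu - lambda) makes the non-linearity concrete; a small illustration (the service rate of 100 requests/s is made up):

```python
def avg_time_in_system(arrival_rate, service_rate):
    # M/M/1: W = 1 / (mu - lambda); blows up as load approaches 1
    assert arrival_rate < service_rate, "load >= 1: the queue (and latency) grows without bound"
    return 1.0 / (service_rate - arrival_rate)

for load in (0.5, 0.8, 0.9, 0.99):
    print(load, avg_time_in_system(load * 100, 100))   # 0.02, 0.05, 0.1, 1.0 seconds
```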

We do not want
  1. Latency to spike up.
  2. To overprovision the system so much that latency is very low but server load is also very low
Timing of scaling
  1. Scaling too late: response time
  2. Scaling too soon: overhead, underutilized resources
Sweet spot?
  1. Good heuristic: queue length (see the sketch below)
    • as the server queue builds up, use a threshold to trigger scale out
    • when load drops, shrink back (e.g., on empty queues)
    • Hysteresis: leave a gap between thresholds to avoid wasteful oscillations
    • *Caution: queue length alone is often not sufficient
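
A sketch of the queue-length heuristic with hysteresis (the thresholds are made up): scale out past a high-water mark, and scale back in only after the queue falls well below a lower mark, so the gap absorbs small oscillations.

```python
SCALE_OUT_AT, SCALE_IN_AT = 100, 10   # the gap between them is the hysteresis band

def desired_servers(queue_len, num_servers, min_servers=1):
    if queue_len > SCALE_OUT_AT:
        return num_servers + 1                        # queue building up: add a server
    if queue_len < SCALE_IN_AT and num_servers > min_servers:
        return num_servers - 1                        # load dropped: shrink back
    return num_servers                                # inside the band: do nothing
```
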
Scale out front end servers
  • Connection termination
  • TLS termination
  • Open to probing from the internet
  • Handling connections
Scale out back end servers (APP server)
  • different layer design
  • scale independently
What to do about DB
  • very difficult to scale out
Reliability
  • Failures are likely
  • if stateless, we restart
  • the storage layer? not as simple as a restart
  • logging, replication.

Back to Blog Chapters

** Chapter 4: Failure Resiliency **

Go beyond transient communication failures
Failure?
  • many factors, env, people, hardware..
    How does one build a failure-resilient system?
    Modularity & Hierarchy
    
Layering Model of Resiliency
Use code and data modularity to bound impact of failures. 


Level i+2: 
---------------------------------------------------------
Level i+1: Expected event handled here, Failure masking
---------------------------------------------------------
Level i: Unexpected event occurs here, detect it!

Best strategy: detect the error close to where it happens

Limits of Failure Masking
    1. Not all failures can be masked
  • limited by fundamentals, cannot recover..
  • limited by cost, impatience of user..
    2. Unmasked failures are visible to next higher layer
  • the upper layer may become more heavyweight
  • involve much greater semantics knowledge
  • affect many more system components
Resiliency is never free; there are tradeoffs..
  • how hard should layer i+1 work to mask failures from layer i?
Step 1: Failure detection
Transient vs. Persistent Failures
Because of the dynamic nature of the internet, retransmission will often work

Transient failure (like deadlock)
manifested only in very unlikely combination of circumstances
undoing and retrying could solve
== soft failures == Heisenbugs

Persistent failures (like broken ethernet cable)
continues until repaired
retry does not help
duration of failures and repair are random variables
means of distributions are MTBF and MTTR
* we cannot wait for the fix -> may need to mask some failures
* can take forever to fix -> cannot be masked

Empirical Failure Data

Observations about Gray's study
- people are the source of failures

Operator and Software failures are the main source now.


Kinds of Damage
- data corruption (memory clobber, DRAM refresh failure)
- error during computation

Step 2: Failure Confinement and cleanup
Transition: Transient -> Committed
In software, "mess" -> "system update"
- committed state (permanently visible, lasting external effects)
- transient state (temporarily and locally visible), we can throw away this state as the easiest strategy to cleanup

=> atomic transactions: simplify reasoning about complex failure-prone code
every successful transaction has a commit point
before this, undo possible by explicit abort
single point of transition from transient to committed state
- it is a group of actions that either all succeed or none
- failure or explicit abort prior to commit has zero residual effects
- only works in fail-fast settings!

Properties of transactions:
- Atomicity: all or nothing
- Consistency: execution takes the system from one consistent state to another
- Isolation: as if only one transaction is running
- Durability: permanence

Atomicity & durability? Implementation techniques:
- shadowing
- intentions lists
- write-ahead logging

Shadowing?
- fast recovery, slower forward processing
- based on an atomic pointer; assumes hardware guarantees atomicity of the pointer swing
- typical use is for hierarchical file systems: swing at the least common ancestor of the modified nodes
- weakness: a high LCA means a large subtree must be copied

Logging (and intentions lists)
- fast forward processing, slower recovery
- a crash may occur during recovery; intense recovery storms are common


Intentions lists?
Keep list of proposed changes in "stable storage"
- in addition to changing transient state
- "stable storage" = disk (high reliably)

On commit
- write a completion record to the list, marking it as done
- apply changes to stable storage
- delete list after application, or mark as done

On abort
- just delete intentions list

Crash recovery
- discard all incomplete intentions lists
- reapply all complete intentions lists

Intentions-list entries must be idempotent:
write(x) is ok, but increment(X) is not


After a crash at any point, check the intentions lists: reapply those with a done record (before that
record is written, the new state cannot be read by others) and throw away those without one

Once we have written done to the intentions list, the changes must be applied, and the user should 
be notified that this commit is successful 

We can merge intention list instructions -> like compiler optimization (propagations)
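
A sketch of commit and recovery with an intentions list (the JSON file standing in for stable storage, and all names, are assumptions); the done record is the commit point and each entry is an idempotent write(key, value):

```python
import json, os

INTENTIONS, STATE = "intentions.json", "state.json"

def apply_writes(writes):
    state = json.load(open(STATE)) if os.path.exists(STATE) else {}
    state.update(writes)                       # each entry is "set key to value": idempotent
    json.dump(state, open(STATE, "w"))

def commit(writes):
    with open(INTENTIONS, "w") as f:
        json.dump({"done": True, "writes": writes}, f)   # completion record + proposed changes
        f.flush(); os.fsync(f.fileno())                  # force to stable storage: the commit point
    apply_writes(writes)                                 # now apply to the real state
    os.remove(INTENTIONS)                                # delete the list after applying

def recover():
    if not os.path.exists(INTENTIONS):
        return
    try:
        record = json.load(open(INTENTIONS))
        complete = record.get("done", False)
    except ValueError:                                   # torn write: treat as incomplete
        complete = False
    if complete:
        apply_writes(record["writes"])                   # reapplying is safe (idempotent)
    os.remove(INTENTIONS)                                # incomplete lists are simply discarded
```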


Write-Ahead Logging
Use append-only log to record intentions
• temporal order of commits accurately preserved in log
• preserve log records until log truncation (i.e. garbage collection)


Logging Strategies
Physical logging (or old-value / new-value (ov/nv) logging )
Logical logging (or operation logging )

• logical logging is more compact but requires app help
• physical logging is more general and requires no app help

4 possible recovery strategies before forward processing can resume
• undo / redo (rare)
• undo / no-redo (rare)
• no-undo / redo (most common)
• no-undo / no-redo (effectively shadowing!)
Undo rule
“Recovery requires undo iff uncommitted transactions can modify committed state”

Redo rule
“Recovery requires redo iff committed transactions may have outstanding modifications to committed state”




Robust Update Amidst Failures
Consider a data item with copies at multiple sites
Suppose we want them to be updated as one unit
• that is, all updated or none updated
• called the distributed commitment problem
Distributed commitment
• cannot be accomplished in one round of messages
• requires >= two rounds of messages
• much like arranging a dinner party for N guests
Step 3: Failure masking
Distributed Transaction
Data item copied at multiple sites
updated as one unit -> distributed commitment problem
cannot be accomplished in one round of messages
like arranging a dinner party for N guests


- Two phase commit
3 key steps
1. prepare phase, vote gathering
2. decision: state recording in stable storage
3. commit/abort phase, result distribution & cleanup

1. 
Coordinator -> Prepare(stable record list of sites) -> Each site (decide yes or no)

if yes, record the decision in stable storage; the site has to locally commit (as a local transaction) the fact that its decision is yes.
if no, don't save state or decisions, amnesia after crash -> implicit "no"


Coordinator <- Vote
                (after that, indefinite blocking window on "yes"; data locked at this site until decision is known)
                (pending state)
                *BLOCKING means blocking the use of the locked resources at this site
                can clever protocol design avoid this window? no
                No in the face of fail-fast server and network failures

                Danger for availability: if all answer yes and then the coordinator is killed, participants stay blocked
Gather info about all sites
RPC timeout -> no
Unanimous yes -> commit 
Even one no -> abort 
record decision stably <- commit point

2.
Coordinator -> Distribute decision


Ideal vs Reality
A weaker consensus
"Leader election" is about weaker consensus: simple majority (Paxos)

1. exactly one new master
2. majority of slaves accept the master
3. selection process is fast and efficient

In practice: 
1. #1 is too hard, instead at most one master
2. #2 is non-negotiable
3. #3 is too hard: best effort, may take forever
Using Two-Phase Commit
Achieving consensus among N nodes, already described

Shared file called MASTER contains
1. name of new master
2. names of majority list of slaves that accept new master
any node can try to update this file using two-phase commit
if succeeds, it is the new master

it can block on coordinator failure
it achieves unanimity, which is too strong for the entire set of nodes; only a majority is needed
Roles of Nodes in Paxos
Intuition: every node has a stake in the correctness of the outcome.

All nodes agree that an election is in progress, (prior to leader election)

Normal operation suspended until
- new leader is elected and
- new leader completes recovery

Any node can play role of "Proposer"
- this is the current coordinator; the previous one must have died after achieving consensus
- must respect any commitments made earlier
- there can be multiple proposers (by accident)

Node works as "Acceptor"


Roles of nodes in Paxos:
all nodes agree to participate and have a stake in the correctness of the outcome; no Byzantine behavior
all are cooperating

** Chapter 5: Paxos **

Paxos



v: to be accepted by all 
ak = value of v stably recorded at site k

n = small integer
- possibly 1 initially
- monotonically increases on each new round


Acceptor(k) remembers two things in stable storage:
1. Np = highest n seen so far, or NULL
2. already_accepted(nk, ak) or NULL, the most recently accepted value


1. Proposer -> Prepare(n) -> ACC, n: round number
2. Acceptor: look at local storage. Am I already working with someone at a higher n? If so, reply NAK; otherwise promise, attaching any already_accepted value
if (n>Np){
        Np = n
        write(Np) to stable storage
}
reply? maybe: Promise(n, already_accepted(nk,ak)) or NAK(Np)
case A: "Btw, you should know that I already accepted some previous value ak, happened in nk"
case B: "I am already working at higher round!"

3. If NAK or too few Promise() received, then {give up this round; repeat 1 with higher n}
If Promise() received from majority of Acceptors then {proceed to Phase2, v=ak of highest nk seen}
If all feedback has NULL, then proposer can pick any values

Special Example: if both proposers send the same N, they can both get NAK and give up the round.

Then the proposer sends PleaseAccept(n,v) to all acceptors
4. if(n>=Np), the acceptor write(n,v) to stable storage (this may or may not happen)
The acceptor can send back
A: Accept OK()
B: NAK(Np)
If (Accept_OK() from majority){
        we have leader;
        broadcast new leader to everyone;
}
If (NAK or too few ACCEPT_OK){
        end this round;
        Start new round with higher n;
}



The algorithm stays correct no matter how many proposers there are or how often they start new rounds; a minimal acceptor sketch follows.
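
A minimal acceptor sketch of the two phases above (the dict standing in for stable storage and the message tuples are assumptions):

```python
class Acceptor:
    def __init__(self, stable):
        self.stable = stable                         # Np and already_accepted must survive crashes

    def on_prepare(self, n):
        np = self.stable.get("Np")
        if np is None or n > np:
            self.stable["Np"] = n                    # promise: ignore anything below n from now on
            return ("Promise", n, self.stable.get("already_accepted"))   # may carry (nk, ak)
        return ("NAK", np)                           # already working at a higher round

    def on_please_accept(self, n, v):
        np = self.stable.get("Np")
        if np is None or n >= np:
            self.stable["Np"] = n
            self.stable["already_accepted"] = (n, v)  # recorded stably before replying
            return ("AcceptOK", n)
        return ("NAK", np)
```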

Back to Blog Chapters

** Chapter 6: Datacenter storage: Cluster Filesystems **

Datacenter Storage
Topics GFS, HDFS
GFS
Latency between and within rack varies
Many commodity servers, disks
Failures are normal...
Unavailability is high..

Workload assumption: Large files, streaming reads, sequential writes mostly append
Concurrent appends by multiple clients
Want atomicity for appends without sync overhead among clients


* DESIGN GOALS
- High availability 
- Handle failures transparently (cannot have downtime)
- Low synchronization overhead between entities of GFS
- Exploit parallelism of numerous disks/servers
- choose high sustained throughput for individual reads/writes
(more important than low latency)

GFS Arch
One master server
- only stores metadata:
Name space
Access control per file
mapping from files to chunks
location of chunks

All metadata in RAM for fast operations. What if it is lost?

- Migrate chunks between chunkservers
Why migration needed?


Many chunkservers; files are split into fixed-size 64 MB chunks
- each with version number and checksum
- why 64MB? Reduce GFS overhead per chunk, master server does
not have to store a lot of metadata in RAM
- Chunks replicated to 3 different chunk servers, 2 in same rack, 1 in different rack
- No caching of file data (Long streaming read, caching may not be helpful)
- Send periodic heartbeats to Master

Many clients accessing different files stored on same cluster

Client:
Issues control(metadata) requests to master server
Issues data requests directly to chunkservers (so the master server does not become the bottleneck)
No caching of file data at the client; again, this matches the workload
However, the client caches metadata: which chunkserver holds which chunk

No file system interface at the operating-system level,
not a traditional in-kernel file system
- user-level
- does not support all features of POSIX FS access

* Two special operations:

- Append: appends data to a file as an atomic op without having to lock the file
- Snapshot: creates a copy of a file or directory at low cost



Client Read Operation
Client sends master: 
read(file name, chunk index)


Master's reply: chunk ID, version num, location of replicas

Client sends request to "closest" chunkserver with replica
- read(chunk ID, byte range)

Chunkserver replies with actual data
Client Write Operation
3 replicas for each chunk -> must write to all 

Key points:
• Data pushed linearly along a chain
• Flow of data decoupled from flow of control
Why?
Helps to
• fully utilize each machine’s network bandwidth
• avoid network bottlenecks and high-latency links
• minimize the latency to push through all the data.

GFS FT
High Availability
- Chunk replication
        - Each chunk is replicated on multiple chunkservers
- Master (i.e., state of the master) replication
        - Operation log and checkpoints replicated on multiple machines

Data Integrity
- Checksum checks
        - Each chunk has checksums
        - Checksum verified for every read and write
        - Checksum also verified periodically for inactive chunks


What if master node down?
- replays logs from disk
  - recovers namespace (dirs) and file-to-chunk-ID mapping (but not the location of chunks)
- asks chunkservers which chunks they hold
  - recovers the chunk-ID-to-chunkserver mapping
- if a chunkserver has a newer chunk version, adopt its version number
  - the master may have failed while granting a lease

Logs cannot be too long -> master uses log to rebuild the fs state
- How to avoid too long? Checkpoints
GFS Consistency Model
Changes to namespace (i.e., metadata) are atomic
- file creation
- due to namespace locking (granular) + operation log

Changes to data are ordered by a primary
- concurrent writes can be overwritten
- record appends complete at least once, at offset of GFS's choosing
-> Applications must cope with possible duplicates


- Failed operations can cause inconsistency
- concurrent successful writes to the same region
- ..

Replication vs erasure codes
3-replication: 
a
a
a
b
b
b

Storage overhead: 3x

erasure code:
a
b
a+b
a+2b
Storage overhead 2x

Erasure codes:
10 data chunks, 4 parity chunks
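
A tiny illustration of the a, b, a+b, a+2b scheme above, using plain integers (real codes such as Reed-Solomon work over finite fields): any two of the four stored values recover both a and b.

```python
a, b = 7, 3
stored = {"a": a, "b": b, "a+b": a + b, "a+2b": a + 2 * b}

# Suppose the disks holding "a" and "b" both fail; recover from the two parity values.
s1, s2 = stored["a+b"], stored["a+2b"]
b_rec = s2 - s1            # (a + 2b) - (a + b) = b
a_rec = s1 - b_rec         # (a + b) - b = a
assert (a_rec, b_rec) == (a, b)
```
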
HDFS

Back to Blog Chapters

** Chapter 7: Cluster Computation **

1. HPC: MPI
Characteristics: 
- Long-lived interdependent processes
- Partitioning the problem space: exploit spatial locality
+: High utilization of resources, effective for scientific analysis
-: intolerant of failures

MPI framework:
- Virtual topology, Synchronization, Communication

Tightly coupled processes
-> Failure of one process prevents all others from progressing


2. Cluster computing: MapReduce


Back to Blog Chapters