Course Evaluation (Final grade: A)

This is definitely a course you should take if you are interested in how programs get compiled into high-performance executables. The workload is large, but this is one of the most interesting courses I have taken at CMU.

Does it help one become a better SDE? Yes! I now find myself reasoning about why one way of writing code performs better than another (e.g. dynamic dispatch vs. static dispatch in Rust).

It’s also a great opportunity to learn a new language, like Rust or OCaml. I learned Rust in this course, and I really love this programming language!

Blog Chapters

  1. Chapter 1: Overview of Compiler Design
  2. Chapter 2: Instruction Selection
  3. Chapter 3: Register Allocation
  4. Chapter 4: Elaboration (after parsing/lexing)
  5. Chapter 5: Static Semantics
  6. Chapter 6: Grammars
  7. Chapter 7: Lexical Analysis
  8. Chapter 8: Function calls
  9. Chapter 9: SSA
  10. Chapter 10: Dynamic Semantics
  11. Chapter 11: Mutable
  12. Chapter 12: Dataflow Analysis
  13. Chapter 13: Register allocation optimization
  14. Chapter 14: Peephole optimization
  15. Chapter 15: Memory optimization
  16. Chapter 16: Loop optimization
  17. Chapter 17: Garbage Collection
  18. Chapter 18: More advanced features

**Chapter 1: Overview of Compiler Design **

What makes a good Compiler: metrics

  • correctness
  • code quality: compiled code runs fast
  • efficiency: compilation runs fast
  • usability: provides errors/warnings, …

Compiler Design

  • how to structure compilers
  • applied algorithms & data structures
  • focus on sequential imperative programming languages
    • not functional, parallel, distributed, OOP…
  • code generation and optimization

Organizing a compiler

Front

  • split work into different phases
  • Lexical analysis -> Token stream
  • Parsing -> Abstract syntax tree (mark body of while loop…)
  • Semantic analysis (type check, variable initialization)

Middle

  • IR(intermediate representation) Generation -> Intermediate representations
  • Optimize (most challenging)

Back

  • Instruction selection -> Abstract assembly
  • Register allocation -> ASM
The distinction between the middle and back ends is unclear.

Back to Blog Chapters

**Chapter 2: Instruction Selection **

  • Compiler phase
  • IR tree -> abstract assembly

Example:

x = 5
return x+x+x*2

-> Instruction selection
x = 5
temp1 = x + x
temp2 = x * 2
ret_reg = temp1 + temp2
ret

IR tree (more expressions, statements...)
Programs     p ::= s1, ..., sn   (sequence of statements)
Statements   s ::= x = e
                 | return e

Expressions:
e ::= c          int const
    | x          variable
    | e1 ⊕ e2    binary op (nested)
⊕ ::= + | * | / | ...

Abstract Assembly (flat)
Programs     p ::= i1, ..., in   (instructions)
Instructions i ::= d <- s         move
                 | d <- s1 ⊕ s2   bin op (sometimes one of the sources doubles as the dst)
                 | ret            return
Operands:
    d,s ::= r   register (fixed, finite number)
          | c   int const
          | t   temp (compiler-introduced variable)
          | x   var

Translating Expressions
translate(e1 ⊕ e2) = translate(e1); translate(e2); ... but where do the results res1 and res2 live?
Better:
trans(d,e): sequence of instructions that stores the value of e in destination d.


e           trans(d,e)
x           d <- x
c           d <- c
e1 ⊕ e2     trans(t1, e1), trans(t2, e2), d<-t1⊕t2, (t1 and t2 are fresh temps)


Translating statements
trans'(s): sequence of instructions that implements s
s           trans'(s)
x = e       trans(x,e)
return e    trans(ret,e), return   (ret: return register)
Example
IR prog: 
z = (x + 1) * (y * 4)
return z

trans'(p) 
= trans'(z = (x + 1) * (y * 4)), trans'(return z)
= trans(z,(x + 1) * (y * 4)),trans(ret, z), return
= trans(t1,x+1), trans(t2,y * 4), z<- t1 * t2, ret<-z, return
= t3 <- x, t4 <- 1, t1 <- t3 + t4, t5 <- y, t6 <- 4, t2 <- t5 * t6, z <- t1*t2, ret<-z, return
Optimize? directly use x and y instead of moving them to temps
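To make this concrete, here is a minimal Rust sketch of trans(d, e) (my own toy Expr/Instr types, not the lab's IR): fresh temps come from a counter, and each call appends instructions that leave the value of e in destination d.

enum Expr {
    Const(i64),
    Var(String),
    Bin(Box<Expr>, char, Box<Expr>),
}

enum Instr {
    Move { d: String, s: String },
    BinOp { d: String, s1: String, op: char, s2: String },
}

struct Gen {
    next_temp: u32,
    code: Vec<Instr>,
}

impl Gen {
    fn fresh(&mut self) -> String {
        self.next_temp += 1;
        format!("t{}", self.next_temp)
    }

    // trans(d, e): append instructions that store the value of e in destination d
    fn trans(&mut self, d: String, e: &Expr) {
        match e {
            Expr::Const(c) => self.code.push(Instr::Move { d, s: c.to_string() }),
            Expr::Var(x) => self.code.push(Instr::Move { d, s: x.clone() }),
            Expr::Bin(e1, op, e2) => {
                let (t1, t2) = (self.fresh(), self.fresh());
                self.trans(t1.clone(), e1);
                self.trans(t2.clone(), e2);
                self.code.push(Instr::BinOp { d, s1: t1, op: *op, s2: t2 });
            }
        }
    }
}

Calling trans on (x + 1) * (y * 4) with destination z reproduces the sequence above, extra moves included; those are exactly what the later passes clean up.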

How to improve
  1. Add special cases: for example c ⊕ e2
  2. Optimization pass after the first pass of translation (common approach)
  3. Different translation
Constant propagation
  • goal: eliminate a move "t <- c; p" by replacing t with c in the rest of the program p
  • but: stop replacing t if it is written again
    Example: 
    t <- 4
    x <- t+1   ---> x <- 4+1 
    t <- 5
    ret <- t   --NO--> ret <- 4
    return 
    
Copy propagation
  • goal: eliminate a move "d <- t; p" by replacing d with t in the rest p; but: stop replacing if d or t is written


t <- 5+1
d <- t 
x <- d+1 ----> x <- t+1
t <- 5+2
ret <- d+1 ---No---> ret <- t + 1
ret
Static Single Assignment Form
- every temp is assigned at most once
- no need to check for later writes: we can replace all occurrences in propagations
- conversion to SSA -> use version numbers

Converting the copy-propagation example above to SSA:

t0 <- 5+1
d0 <- t0
x0 <- d0+1
t1 <- 5+2
...
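Since SSA removes the "stop if written again" caveat, constant propagation becomes a single forward pass. A small Rust sketch over a toy straight-line instruction list (my own types, not the lab infrastructure):

use std::collections::HashMap;

#[derive(Clone)]
enum Operand { Const(i64), Temp(String) }

enum Instr {
    Move { d: String, s: Operand },
    BinOp { d: String, s1: Operand, s2: Operand },
}

// Constant propagation on straight-line SSA code: every temp has exactly one
// definition, so a temp known to hold a constant can be replaced everywhere.
fn const_prop(code: &mut [Instr]) {
    let mut consts: HashMap<String, i64> = HashMap::new();
    let subst = |s: &mut Operand, consts: &HashMap<String, i64>| {
        if let Operand::Temp(t) = s {
            if let Some(c) = consts.get(t) {
                *s = Operand::Const(*c);
            }
        }
    };
    for instr in code.iter_mut() {
        match instr {
            Instr::Move { d, s } => {
                subst(s, &consts);
                if let Operand::Const(c) = s {
                    consts.insert(d.clone(), *c); // remember d <- c; no later write can invalidate it
                }
            }
            Instr::BinOp { s1, s2, .. } => {
                subst(s1, &consts);
                subst(s2, &consts);
            }
        }
    }
}

Copy propagation is the same idea with a temp-to-temp map; removing the now-dead moves is left to a later dead-code pass.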

Back to Blog Chapters

**Chapter 3: Register Allocation **

  • Goal: assign registers and stack locations to temps
    X86-64: 16 registers, no temps
  • stack locations, when keeping track of more variables than registers
    Strategy
    1. Store all temps on the stack (CON: inefficient, still need registers for efficiency)
IR trees(simplified syntax tree) 
--> Instruction selection --> ASM 
--> reg alloc --> ASM
--> x86 asm


Example:
d <- s1 ⊕ s2
-> reg alloc

%r11d <- %edx * 4(%rsp)

Difficulty: x86-64 has only 15 usable general-purpose registers (16 minus %rsp)
  • Goal: assign each variable a register
  • may have to use stack locations and clever use of registers for variables
Interference:
x <- 14
y <- 15 
z <- x+y

x <- 14
y <- 15 + x 
ret <- 4+y
ret
if x is not used again, we can overwrite x's register with y
Right IR for reg alloc?
3 addr abs asm
d <- s1 + s2
d <- s1 / s2
...
2 addr abs asm
d <- s1
d <- d + s2

d <- s1
d <- d / s2
...

abstract x86
MOVL s1, d
ADDL s2, d
...
IDIVL s2: edx:eax / s2 -> quotient in eax, remainder in edx

MOVL s1, %eax
CLTD (sign-extends eax into edx:eax.)
----
https://stackoverflow.com/questions/17170388/trying-to-understand-the-assembly-instruction-cltd-on-x86
What this means in practice is that edx is filled with the most significant bit of eax (the sign bit). For example, if eax is 0x7F000000 edx would become 0x00000000 after cdq. And if eax is 0x80000000 edx would become 0xFFFFFFFF.
----
IDIVL s2 (edx:eax / s2)
MOVL %eax, d

Reg Alloc at 3-Addr Assem
  • leave one register unassigned for later conversion (%r11d)
For example

d <- s1 ⊕ s2
--> 
MOVL s1, %r11d
ADDL s2, %r11d
MOVL %r11d, d
(at least one operand has to be a register;
this always works, but may not be the optimal solution)

Reg Alloc at 2-Addr Assem
For example

d <- s1 + s2
--> 
d <- s1
d <- s2 + d
How to allocate registers?
Goal: minimize spilling to the stack
One method: graph-based, greedy register allocation
Build interference graph
Vertices are registers and temps
Edges indicate interference (need diff registers)

Example: 
t1  -- t2 -- t3 -- eax
        |           |
         -----------

Complexity of deciding if two temps interfere? Undecidable
Why? Reduction from the halting problem
x <- 5
y <- 6
complex code (doesn't use x and y)
ret<- x + y
return

x and y interfere iff complex code terminates


Approximate Interference with liveness
Liveness: temp t is live at line l if t might be used in the 
following computation

For straight-line code (as in L1): work backwards
temp t is live at l if either of the following holds:
- t is read at l
- t is live at l+1 and not written at line l

Example: 
x1 <- 1            -
x2 <- 1           x1
x3 <- x1 + x2     x2,x1
x4 <- x3 + x2     x3,x2
x5 <- x4 + x3     x3,x4
ret <- x5         x5
return ret        ret register

live range of x1: lines 2-3
live range of x2: lines 3-4
..


t1 <- 1     -
t1 <- t1+1  t1
t2 <- t1    t1

Construct the interference graph:
Option 1: add an edge between t1 and t2 if they have overlapping live ranges
Option 2: rule 1: for every instruction d <- s1 ⊕ s2,
                  add an edge {d,t} for every t ≠ d that is live at the next instruction
          rule 2: for every move d <- s, add an edge {d,t} for every t ∉ {s,d}
                  that is live at the next instruction
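A sketch of Option 2 in Rust, assuming we already have the live-out set for each line (my own simplified instruction type):

use std::collections::HashSet;

enum Instr {
    BinOp { d: String, s1: String, s2: String },
    Move { d: String, s: String },
}

// live_after[l] = temps live at the next instruction (live-out of line l)
fn interference_edges(code: &[Instr], live_after: &[HashSet<String>]) -> HashSet<(String, String)> {
    let mut edges = HashSet::new();
    for (l, instr) in code.iter().enumerate() {
        for t in &live_after[l] {
            match instr {
                // rule 1: d <- s1 ⊕ s2 interferes with every t live afterwards, t != d
                Instr::BinOp { d, .. } if t != d => { edges.insert((d.clone(), t.clone())); }
                // rule 2: for a move d <- s, skip t = s so the move stays coalescable
                Instr::Move { d, s } if t != d && t != s => { edges.insert((d.clone(), t.clone())); }
                _ => {}
            }
        }
    }
    edges
}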


Find coloring with as few colors as possible: Adjacent vertices have different colors
Given: Graph G = (V,E) and k colors
Question: Is there a coloring of V with k colors such that adjacent vertices have different colors?
Complexity: NP complete for k >= 3 

What follows for register allocation?

For a Turing-complete language,
it is undecidable whether, for a given program, there is an equivalent program that uses at most k registers (and no temps).

Assign colors to registers and stack locations
Greedy graph coloring: get a minimal coloring for most programs

Assume colors are numbered 1, 2, 3, ...

Let N(v) be the neighbors of v
Input: G = (V,E) and an ordering V = v1, ..., vn
Output: coloring of V, col: V -> {1, ..., k}
For i ← 1 to n: assign vi the lowest color not used by its already-colored neighbors

- Order matters

x5 

x4-x3-x2-x1

Order 1: x5,x4,x3,x2,x1,ret

Order 2: x1,x4,x3,x2,x5,ret


Theorem: There exists an ordering that produces an optimal coloring
How do we find an optimal ordering for chordal graphs?

Def: A graph is chordal if every cycle with >= 4 vertices has a chord
(an edge between two vertices of the cycle that is not itself part of the cycle)

Example

a - b
|   |
c - d
not chordal

a - b
| / |
c - d
chordal

a - b
| x |
c - d
chordal


Intuition: how would you get a long cycle without a chord?
a's live range overlaps b's and c's,
but we would need d's live range not to overlap a's.


MCS (maximum cardinality search)
Input: G = (V,E)
Output: an ordering v1, ..., vn of V

For all v ∈ V set wt(v) ← 0
Let W ← V
For i ← 1 to n do
      Let v be a node of maximal weight in W
      Set vi ← v
      For all u ∈ W ∩ N(v) set wt(u) ← wt(u) + 1
      Set W ← W \ {v}

MCS returns a simplicial elimination ordering if G is chordal
Def: v is simplicial in G if N(v) is a clique
A simplicial vertex is one whose neighbors form a clique: every two of its neighbors are adjacent.


Def: A simplicial elimination ordering is an ordering v1, ..., vn such that vi is simplicial in
G[{v1, ..., vi}], the subgraph induced by the vertices picked so far
Theorem 1: A graph is chordal iff it has a simplicial elimination ordering
Proof..
Theorem 2: The greedy coloring algorithm finds an optimal coloring if run with a simplicial elimination ordering
Proof sketch: Let k be the number of colors used
Observations:
- k <= max_i |N_i(v_i)| + 1, where N_i(v_i) are the neighbors of v_i among v1, ..., vi
- minimum #colors >= max_i |N_i(v_i)| + 1, since v_i together with those neighbors forms a clique
-> Theorem: MCS returns a simplicial elimination ordering iff G is chordal
Spill: assign the remaining colors to stack locations
strategies:
1. count uses of each temp in the code: higher use frequency gets a register
2. incorporate loop nesting
3. spill the highest colors (a high color tends to mean the temp is used less, an approximation of 1)

Summary

  1. Build the interference graph (different ways)
  2. Order the vertices with MCS
  3. Color with greedy alg
  4. Spill if # colors > 13

Liveness analysis & inference rules

live(l,x) ~ x is live at line l

l: d <- x ⊕ y
live(l+1, u)   u != d
--------------------------
      live(l, u)


l: d<- x ⊕ y
------------------
    live(l, x)
    live(l, y)

l: return
------------------
   live(l, ret)


l: x <- c
live(l+1, u), x!=u
-------------------
      live(l,u)

l: x<-y, u!=x
live(l+1, u)
--------------
live(l,u)


l:x<-y
-------
live(l,y)

Use a derivation tree to prove that x is live at line l

General Saturation Algorithm
Can be used for all predicates
1: start with an empty set of facts (the derived predicates)
2: pick arguments (here: l and t) and a rule with live(l,t) in the conclusion
such that the premises are already facts; add the conclusion as a new fact
3: repeat until no new facts can be derived

Always terminates, since there are only finitely many possible facts
Refactoring liveness rules
use(l,x)
---------
live(l,x)

*: l' is a possible successor
live(l',x)   succ(l,l')   ¬def(l,x)
---------------------------------
      live(l,x)

Need to define:
use(l,x) ~ x is used at l 
def(l,x) ~ x is def at l
succ(l,l') ~ l' can be a succ of l
Example
l: d <- x ⊕ y
-------------
use(l,x)
use(l,y)
def(l,d)
succ(l,l+1)
Adding loops and conditionals, instructions now include:
l: d <- x ⊕ y
.
.
.
l: goto l'                       (unconditional jump)
l: if (x ? c) then lt else lf    (conditional jump)

New rules:

l: goto l'
-----------
succ(l,l')

l: if (x ? c) then lt
else lf
-------------------
succ(l,lt)
succ(l,lf)
use(l,x)


Keep iterating until nothing new can be added to the liveness sets
pass 1, pass 2, ...
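In code the saturation loop looks like this (a Rust sketch with use/def/succ precomputed per line; my own simplified types): keep re-running the two rules until a full backward pass adds nothing new.

use std::collections::HashSet;

struct Line {
    uses: HashSet<String>, // use(l, x)
    defs: HashSet<String>, // def(l, x)
    succ: Vec<usize>,      // succ(l, l')
}

// Backward liveness by saturation: iterate until no live set changes.
fn liveness(lines: &[Line]) -> Vec<HashSet<String>> {
    let mut live: Vec<HashSet<String>> = vec![HashSet::new(); lines.len()];
    let mut changed = true;
    while changed {
        changed = false;
        for (l, line) in lines.iter().enumerate().rev() {
            // live(l, x) if use(l, x), or x live at a successor and not defined at l
            let mut new_live = line.uses.clone();
            for &s in &line.succ {
                for x in &live[s] {
                    if !line.defs.contains(x) {
                        new_live.insert(x.clone());
                    }
                }
            }
            // live sets only ever grow, so a size change means the set changed
            if new_live.len() != live[l].len() {
                live[l] = new_live;
                changed = true;
            }
        }
    }
    live
}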

Interference graph
1. Overlapping live ranges

live(l,x) live(l,y)
-------------------
inter(x,y)

But this doesn't work well in the presence of dead code

Example: 
a <- 1
b <- 2
ret <- a
return


2. Assignment based
More sparse

l: x <- y ⊕ z live(l+1, u), u!=x
---------------------------------
inter(x,u)

l:x<-y, y!=u
live(l+1, u) x!=u
------------------
inter(x,u)

l: x<-c, u!=x
live(l+1, u)
--------------
inter(x,u)

**Chapter 4: Elaboration **

Goal: a minimal and clear representation of the program
  • remove syntactic sugar
  • make scope explicit
Parsing -> parse tree -> elaboration -> AST (input to semantic analysis)

Example: 
turn for loops into while loops

for (int x = 4; x < 8128; x++) {
      y = y + x
}
------> elab
using declaration(x, int, s):
{ int x = 4; while (x < 8128) ... }
decl(x, int, seq(assign(x, 4), while(x < 8128, ...)))

// Before elaboration
for (int x = 4; x < 30; x++) { y = y + x }


// After elaboration
{
int x = 4;
while (x < 30) { y = y + x; x += 1 }
}

abstract syntax

declare(x, int,seq(assign(x, 4),
                  while(x < 30,
                  seq(assign(y, y + x), assign(x, x + 1)))))

The extra scope is necessary
Rather than continuing to perform manipulations on surface syntax, we introduce the BNF of an elaborated abstract syntax


Expressions e ::= n | x | e1 ⊕ e2 | e1 /0 e2 | f(e1, . . . , en)
| e1 ? e2 | !e | e1 && e2 | e1 || e2

/0: potentially effectful operators (such as division or shift, which could raise an exception)

⊕: effect-free operators
?: comparison operators returning a boolean; !, &&, and || are logical negation, conjunction, and disjunction, respectively

Statements s ::= 
declare(x, τ, s) | 
assign(x, e) | 
if(e, s1, s2) | 
while(e, s)| 
return(e) | 
nop | 
seq(s1, s2)


Inference rule
<tp> elaborates to τ
<ident> elaborates to x
<exp1> elaborates to e1
<exp2> elaborates to e2
<stmt1> elaborates to s1
<stmt2> elaborates to s2
-----------------------------------------------------
for (<tp> <ident> = <exp1>; <exp2>; <stmt1>) <stmt2>
 elaborates to declare(x, τ, seq(assign(x, e1), while(e2, seq(s2, s1))))

When we want to translate a for loop that
matches the pattern at the bottom, we have to do six things: 
elaborate the type,
elaborate the identifier, 
elaborate both expressions, 
elaborate the afterthought, and
then elaborate the loop body. 
x++ -> assign(x, x + 1)
make sure any nested for loops in <stmt2>, the body of our for loop, are elaborated
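As a sketch, the whole translation is one constructor call once the pieces are elaborated (toy AST types of my own; the real parse-tree and AST types in the lab are richer):

struct Expr; // placeholder for elaborated expressions

enum Stmt {
    Declare(String, String, Box<Stmt>), // declare(x, τ, s), with τ as a plain string here
    Assign(String, Expr),               // assign(x, e)
    While(Expr, Box<Stmt>),             // while(e, s)
    Seq(Box<Stmt>, Box<Stmt>),          // seq(s1, s2)
}

// for (τ x = e1; e2; s1) s2
//   ~>  declare(x, τ, seq(assign(x, e1), while(e2, seq(s2, s1))))
fn elab_for(tau: String, x: String, e1: Expr, e2: Expr, s1: Stmt, s2: Stmt) -> Stmt {
    Stmt::Declare(
        x.clone(),
        tau,
        Box::new(Stmt::Seq(
            Box::new(Stmt::Assign(x, e1)),
            Box::new(Stmt::While(
                e2,
                Box::new(Stmt::Seq(Box::new(s2), Box::new(s1))),
            )),
        )),
    )
}

Here s1 is the already-elaborated afterthought (x++ becomes assign(x, x + 1)) and s2 is the already-elaborated body, so nested for loops are handled by the recursion that produced s2.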

Why IR trees
  • isolate potentially effectful expressions, making their order of execution explicit; this simplifies instruction selection and means the remaining pure expressions can be optimized much more effectively
  • make the control flow explicit in the form of conditional or unconditional branches, which is closer to the assembly-language target and allows us to apply standard program analyses based on an explicit control-flow graph
Pure Expressions p ::= n | x | p1 ⊕ p2

Commands c 
::= 
x ← p
| x ← p1 /0 p2
| x ← f(p1, . . . , pn)
| if (p1 ? p2) then lt else lf
| goto l
| l :
| return(p)
Programs r ::= c1 ; . . . ; cn


Translating Expressions
tr(e) = <ˇe, eˆ>
where eˇ is a sequence of commands r that we need to write down to compute the
effects of e and eˆ is a pure expression p that we can use to compute the value of e back up. 


tr(n) = <·, n>
tr(x) = <·, x>
tr(e1 ⊕ e2) = <(ˇe1 ; ˇe2), eˆ1 ⊕ eˆ2>?


Translating Statements
tr(assign(x, e)) = ˇe ; x ← eˆ
tr(return(e)) = ˇe ; return(ˆe)
tr(nop) = ·
tr(seq(s1, s2)) = ˇs1 ; ˇs2


tr(if(e, s1, s2)) = ˇe ;
if (ˆe != 0) then l1 else l2 ;
l1 : ˇs1 ;
goto l3 ;
l2 : ˇs2 ;
l3 :



tr(while(e, s)) = l1 : ˇe;
if (ˆe != 0) then l2 else l3 ;
l2 : ˇs ; goto l1;
l3 : 
Translating Boolean Expressions
cp(e1 ? e2, l, l') = 
ˇe1 ; ˇe2 ;
if (ˆe1 ? eˆ2) then l else l'


cp(!e, l, l') = cp(e, l', l)

cp(e1 && e2, l, l') = 
cp(e1, l2, l');
l2 : cp(e2, l, l')


cp(e1 || e2, l, l') = left to the reader

cp(0, l, l') = goto l'

cp(1, l, l') = goto l

cp(e, l, l') = 
ˇe ;
if (ˆe != 0) then l else l'


tr(if(b, s1, s2)) = cp(b, l1, l2)
l1 : tr(s1) ; goto l3
l2 : tr(s2) ; goto l3
l3 :


tr(e) = <cp(e, l1, l2);
l1 : t ← 1 ; goto l3
l2 : t ← 0 ; goto l3
l3 :
, t>
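A Rust sketch of cp(e, t, f) (simplified: comparison operands are plain temps and the effect sequences ě are ignored; my own toy types):

enum BoolExp {
    Cmp(String, String, String),      // e1 ? e2 (lhs, comparison operator, rhs)
    Not(Box<BoolExp>),
    And(Box<BoolExp>, Box<BoolExp>),
    Or(Box<BoolExp>, Box<BoolExp>),
}

enum Cmd {
    Label(u32),
    CondJump { lhs: String, op: String, rhs: String, then_l: u32, else_l: u32 },
}

struct LabelGen { next_label: u32, code: Vec<Cmd> }

impl LabelGen {
    fn fresh_label(&mut self) -> u32 { self.next_label += 1; self.next_label }

    // cp(e, t, f): emit commands that jump to t if e is true and to f otherwise
    fn cp(&mut self, e: &BoolExp, t: u32, f: u32) {
        match e {
            BoolExp::Cmp(x, op, y) => self.code.push(Cmd::CondJump {
                lhs: x.clone(), op: op.clone(), rhs: y.clone(), then_l: t, else_l: f,
            }),
            // !e: swap the targets
            BoolExp::Not(e1) => self.cp(e1, f, t),
            // e1 && e2: if e1 is false the whole expression is false
            BoolExp::And(e1, e2) => {
                let l2 = self.fresh_label();
                self.cp(e1, l2, f);
                self.code.push(Cmd::Label(l2));
                self.cp(e2, t, f);
            }
            // e1 || e2: if e1 is true the whole expression is true
            BoolExp::Or(e1, e2) => {
                let l2 = self.fresh_label();
                self.cp(e1, t, l2);
                self.code.push(Cmd::Label(l2));
                self.cp(e2, t, f);
            }
        }
    }
}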
Extended basic blocks
 instead of basic blocks being sequences of commands ending
in a jump, they are trees of commands that branch at conditional statements and
have unconditional jumps (goto or return) as their leaves

**Chapter 5: Static Semantics **

abstract syntax
After lexing and parsing, a compiler will usually apply elaboration to translate the parse tree to a high-level intermediate form often called abstract syntax. Then we verify that the abstract syntax satisfies the requirements of the static semantics.


C0:
• Initialization: variables must be defined before they are used.
• Proper returns: functions that return a value must have an explicit return statement on every control flow path starting at the beginning of the function.
• Types: the program must be well-typed.
Abstract syntax as defined in the last chapter.

We would like to raise an error if there is any possibility that
an uninitialized variable may be used.

may-property
use(e1/2, x)
-----------
use(e1 && e2, x)

use(e1/2, x)
-----------
use(e1 ⊕ e2, x)


must-property
def(s, x) if the execution of statement s will define x.


--------------------
def(assign(x, e), x)



def(s1, x) def(s2, x)
-----------------------
def(if(e, s1, s2), x)

A conditional only defines a variable if it is defined along
 both branches, and a while loop does not define any variable
 (since the body may never be executed).

def(s1/2, x)
-----------------
def(seq(s1, s2), x)

def(s, x) y != x
---------------------
def(decl(y, τ, s), x)
x is declared at most once on a control-flow
path. 

--------------------
def(return(e), x)
Since a return statement will never
pass the control flow to the next instruction, any subsequent statements are unreachable. It is therefore permissible to claim that all variables currently in scope have been defined. 
Liveness

We observe that liveness is indeed a may-property, since a variable is live in a conditional if it is used in the condition or live in one or more of the branches.


use(e, x)
----------------
live(assign(y, e), x)


use(e, x)
----------------
live(if(e, s1, s2), x)


live(s1, x)
----------------
live(if(e, s1, s2), x)


live(s2, x)
----------------
live(if(e, s1, s2), x)

use(e, x)
----------------
live(while(e, s), x)

live(s, x)
----------------
live(while(e, s), x)

use(e, x)
----------------
live(return(e), x)

no rule for
live(nop, x)

live(s, x) y != x
----------------
live(decl(y, τ, s), x)


live(s1, x)
----------------
live(seq(s1, s2), x)

¬def(s1, x) live(s2, x)
----------------
live(seq(s1, s2), x)

Initialization
The following error rule captures the general condition.


It should be read from the premises
to the conclusion.
decl(x, τ, s) in p   live(s, x)
------------------------------
error


The init rules below are read from the conclusion to the premises:

------------------------------
init(nop)


init(s1) init(s2)
------------------------------
init(seq(s1, s2))


init(s) ¬live(s, x)
------------------------------
init(decl(x, τ, s))
From Judgments to Functions
init : stm → bool
init(nop) = T
init(seq(s1, s2)) = init(s1) ∧ init(s2)
init(decl(x, τ, s)) = init(s) ∧ ¬live(s, x)

live(nop, x) = ⊥
live(seq(s1, s2), x) = live(s1, x) ∨ (¬def(s1, x) ∧ live(s2, x))
live(decl(y, τ, s), x) = y != x ∧ live(s, x)
. . .

...
init(δ, s, δ'): 
assuming all the variables in δ are defined when s is reached, 
no uninitialized variable will be referenced and after its execution 
all the variables in δ' will be defined.

use(δ, e): e will only reference variables defined in δ.

δ |- s ⇒ δ' for init(δ, s, δ').
δ |- e for use(δ, e).

δ ⊢ s1 ⇒ δ1
δ1 ⊢ s2 ⇒ δ2
______________
δ ⊢ seq(s1, s2) ⇒ δ2


δ ⊢ e
_________
δ ⊢ assign(x, e) ⇒ δ ∪ {x}


δ ⊢ e
δ ⊢ s1 ⇒ δ1
δ ⊢ s2 ⇒ δ2
_________________
δ ⊢ if(e, s1, s2) ⇒ δ1 ∩ δ2


δ ⊢ e
δ ⊢ s ⇒ δ'
_________
δ ⊢ while(e, s) ⇒ δ



In particular, declare(x, τ, s) means that the variable x is declared (only) within the statement s.
δ ⊢ s ⇒ δ'
___________
δ ⊢ decl(y, τ, s) ⇒ δ' - {y}


δ ⊢ e
_________
δ ⊢ return(e) ⇒ {x | x in scope}
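These rules translate almost one-to-one into a checker. A Rust sketch (toy statement type where each expression is just its set of used variables; return's "all variables in scope" is simplified):

use std::collections::HashSet;

type Defined = HashSet<String>;

enum Stmt {
    Nop,
    Assign(String, Vec<String>),           // assign(x, e): x plus the vars used by e
    If(Vec<String>, Box<Stmt>, Box<Stmt>), // if(e, s1, s2)
    While(Vec<String>, Box<Stmt>),         // while(e, s)
    Return(Vec<String>),                   // return(e)
    Seq(Box<Stmt>, Box<Stmt>),             // seq(s1, s2)
    Decl(String, Box<Stmt>),               // declare(x, τ, s), type elided
}

// δ ⊢ e: every variable used by e must already be defined
fn check_use(delta: &Defined, uses: &[String]) -> Result<(), String> {
    for x in uses {
        if !delta.contains(x) {
            return Err(format!("{} may be used uninitialized", x));
        }
    }
    Ok(())
}

// δ ⊢ s ⇒ δ': returns the set of variables defined after s
fn check_init(delta: &Defined, s: &Stmt) -> Result<Defined, String> {
    match s {
        Stmt::Nop => Ok(delta.clone()),
        Stmt::Assign(x, uses) => {
            check_use(delta, uses)?;
            let mut d = delta.clone();
            d.insert(x.clone());
            Ok(d)
        }
        Stmt::Seq(s1, s2) => {
            let d1 = check_init(delta, s1)?;
            check_init(&d1, s2)
        }
        Stmt::If(uses, s1, s2) => {
            check_use(delta, uses)?;
            let d1 = check_init(delta, s1)?;
            let d2 = check_init(delta, s2)?;
            Ok(d1.intersection(&d2).cloned().collect())
        }
        Stmt::While(uses, body) => {
            check_use(delta, uses)?;
            check_init(delta, body)?; // the body may never run: its definitions are discarded
            Ok(delta.clone())
        }
        Stmt::Decl(x, body) => {
            let mut d = check_init(delta, body)?;
            d.remove(x); // x goes out of scope
            Ok(d)
        }
        Stmt::Return(uses) => {
            check_use(delta, uses)?;
            Ok(delta.clone()) // simplification: the real rule yields "all variables in scope"
        }
    }
}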



Typing Judgment for Statements:
The notation Γ ⊢ s : [τ] is a typing judgment for statements,
where Γ is the type environment (context), 
s is a statement, 
and τ is the type. 
This notation signifies that the statement s is well-typed within the 
context Γ and ultimately returns a value of type τ

**Chapter 6: Grammars **

Grammars are a general way to describe languages.
For a given grammar G with start symbol S, a derivation in G is a sequence of rewritings
S → γ1 → · · · → γn = w

A context-free grammar consists of a set of productions of the form X −→ γ, where X is a non-terminal symbol and γ is a potentially mixed sequence of terminal and non-terminal symbols.

While the parse tree removes some ambiguity, it turns out that the example grammar is ambiguous in other, more important, ways.

we can add that string specifically as the new base case
It is important that programming languages be unambiguous in practice. 
 
We can usually rewrite grammars to remove ambiguity, but sometimes we extend the language of context-free grammars to resolve ambiguity. One example is explicitly stating precedence and associativity as a way of resolving ambiguities.

CYK Parsing

The Cocke-Younger-Kasami (CYK) parsing algorithm is a method used to check if a given string can be generated by a context-free grammar (CFG). 
It's especially useful in computational linguistics and computer science for parsing strings. The algorithm requires the CFG to be in Chomsky Normal Form (CNF), 
where production rules are either a non-terminal producing two non-terminals, or a non-terminal producing a terminal.

How CYK Works
Input and Initialization: The algorithm takes a string w of length n and a CFG G in CNF. It initializes a table to store which grammar symbols can generate which substrings of w.

Filling the Table: The algorithm fills a table P, where P[i,j] holds non-terminal symbols that can generate the substring w[i,j]. It starts with the smallest substrings 
(single characters) and combines them to form larger substrings.

For each terminal in w, it adds corresponding non-terminals to P as per the grammar rules.
For each possible split of substrings, it checks if their combination can be generated by any non-terminals as per the grammar rules.
Iterating Over Substrings: It considers all possible substrings of w by their length, from shortest to longest, and checks all possible splits for each substring, looking 
for non-terminals that could generate these parts.

Checking for Acceptance: After filling the table, it checks if the start symbol S is in the top-right cell of the table (P[1,n]). 
If S is present, the string w is accepted by the grammar.

Complexity
The CYK algorithm has a time complexity of O(n^3 * |G|), where n is the length of the string and |G| is the size of the grammar. This complexity comes from considering all substrings, 
all possible splits of each substring, and the grammar size.

Applications
The CYK algorithm is used in syntax checking, natural language processing, and compiler design, where it's important to determine if strings follow specific grammatical rules.

By using dynamic programming, the CYK algorithm efficiently parses strings and determines their grammatical structure under a given CFG.

https://www.youtube.com/watch?v=VTH1k-xiswM
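The algorithm is short enough to write out; a Rust sketch for a CNF grammar (my own toy grammar representation):

use std::collections::HashSet;

struct Cnf {
    term_rules: Vec<(String, char)>,          // A -> a
    bin_rules: Vec<(String, String, String)>, // A -> B C
    start: String,
}

// CYK: table[i][len] = nonterminals that derive the substring w[i .. i+len]
fn cyk(g: &Cnf, w: &[char]) -> bool {
    let n = w.len();
    let mut table = vec![vec![HashSet::<String>::new(); n + 1]; n];
    // length-1 substrings come from terminal rules
    for i in 0..n {
        for (a, c) in &g.term_rules {
            if *c == w[i] {
                table[i][1].insert(a.clone());
            }
        }
    }
    // longer substrings: try every split point and every binary rule
    for len in 2..=n {
        for i in 0..=n - len {
            for split in 1..len {
                for (a, b, c) in &g.bin_rules {
                    if table[i][split].contains(b) && table[i + split][len - split].contains(c) {
                        table[i][len].insert(a.clone());
                    }
                }
            }
        }
    }
    n > 0 && table[0][n].contains(&g.start)
}

The three nested loops over substrings and split points give the O(n^3 * |G|) bound mentioned above.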
Shift-reduce parsing
Deductive parsing involves two primary rules originally used in the CYK parsing algorithm, but when we interpret these for shift-reduce parsing, 
we make some modifications.

Rule D1: This is an assertion rule where for any terminal symbol 'a' found in the input string, we can deduce that 'a' derives itself. This is represented as:

a : a


Rule D2: This is a combination rule that states if you have a production in the grammar where a non-terminal X leads to a sequence of symbols (γ1 ... γn), 
and you have established that for substrings w1 through wn of your input string w0, each wi derives γi, 
then you can conclude that the concatenated string w1...wn is derived from X. This rule is written as:

[r] X → γ1 ... γn
w1 : γ1 ... wn : γn
------------------------
w1 ... wn : X


When we adapt these rules for shift-reduce parsing, we modify the form of the facts used:

Instead of w : γ, where γ is a sequence of terminals and non-terminals, we look to conclude facts of the form w : β, 
where β is a possibly empty series of terminals and non-terminals.

We then have two operations: shift and reduce, focusing on the rightmost elements of w or β. The rules are:

Start Rule:
This is the base case for parsing, where you start with an empty string deriving an empty string:
ε : ε

Shift Rule:
If you have a fact w : β and 'a' is the next terminal symbol in the input, you can append 'a' to both w and β:
w : β
----------
w a : β a


Reduce Rule:
If there's a production where X leads to α and you have a string w followed by α in β, you can reduce this by recognizing α as a unit derived from X:

[r] X → α
w : β α
------------
w : β X

The reduce rule is the key to recognizing higher-level structures. Each time you apply it, you replace a sequence of symbols 
(which matches the right-hand side of a production rule) with the non-terminal symbol from the left-hand side of that production rule.

If a grammar is unambiguous, this means
that, as we apply a series of rules to try to derive w : S from ε : ε, there is at most one rule we can apply that will lead us to success.

If we can successfully predict what the next step should be at every point, then we can implement this proof search with a stack 
holding the terminals and nonterminals (on the left) and a queue or array index tracking the unprocessed tokens (on the right).

What would we need to know to always make the right decision for the grammar above?



We will begin constructing a parse table, where the columns
correspond to the next unshifted token and the rows correspond to patterns that we match against the stack β.
Parse Table
Conflicts
Cases where we can make different decisions and still successfully parse the string are called conflicts.

Example:
we have decided to resolve the conflicts by giving a precedence to the operators and declaring both of them to be left-associative

It is also possible to have reduce/reduce conflicts

For many parser generators, the default behavior of a shift/reduce conflict is to shift, and for a reduce/reduce conflict to apply the textually first production in the grammar.
LR(1) and more?

The set of languages that we can parse with shift-reduce parsers that have lookahead 1 is called LR(1). But even though a language may be 
describable with an LR(1) grammar, it’s not necessarily the case that every grammar for an LR(1) language can be parsed with a shift-reduce parser. 


Even though the grammar is unambiguous, to parse it correctly, we’d need arbitrary lookahead – we’d need to look over an arbitrary 
number of b tokens to find whether they were followed by a c or a d.

**Chapter 7: Lexical Analysis **

turn a raw byte or character input stream coming from the source file into a token stream by chopping the input into pieces and skipping over irrelevant details.

classify input tokens into types like INTEGER or IDENTIFIER or WHILE-keyword or OPENINGBRACKET...

Lexers are specified by regular expressions. Classically, however, they are implemented by finite automata.


deterministic finite automaton (DFA). At every state and every input there is at most one edge enabling a transition.


But in general, finite automata can be nondeterministic finite automata (NFA). That is, for the same input, one path may lead to
an accepting state while another attempt fails.


Regular Expressions -> Nondeterministic Finite Automata
Nondeterministic Finite Automata -> Deterministic Finite Automata
 instead of a single state, we now consider the set of states in which we could be.


Another operation that is often done by lexer generator tools is to minimize the resulting DFA by merging states and reducing the number of states and transitions in the automaton.

Lexical analysis reduces the complexity of subsequent syntactical analysis by first dividing the raw input stream up into a shorter sequence of tokens, 
each classified by its type (INT, IDENTIFIER, REAL, IF, ...). 
The lexer also filters out irrelevant whitespace and comments from the input stream so that the parser does not have to deal with that anymore. 
The steps for generating a lexer are


1. Specify the token types to be recognized from the input stream by a sequence of regular expressions

2. Bear in mind that the longest possible match rule applies and the first production that matches longest takes precedence.

3. Lexical analysis is implemented by DFA.

4. Convert the regular expressions into NFAs (or directly into DFAs using derivatives).

5. Join them into a master NFA that chooses between the NFAs for each regular expression by a spontaneous ε-transition

6. Determinize the NFA into a DFA

7. Optional: minimize the DFA for space

8. Implement the DFA as a recognizer. Respect the longest-possible-match rule by storing the last accepted token and
backtracking the input to it if the DFA run cannot otherwise complete (see the sketch below).
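Here is a toy maximal-munch driver over such a DFA (hand-written transition table rather than one generated from regular expressions; the token kinds are made up for illustration):

use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Tok { Ident, Number }

struct Dfa {
    start: usize,
    accept: HashMap<usize, Tok>,        // accepting states and their token kinds
    trans: HashMap<(usize, u8), usize>, // transition table
}

// Longest possible match: run the DFA as far as it will go, remembering the
// last accepting position; when stuck, back up and emit that token.
fn next_token(dfa: &Dfa, input: &[u8], start_pos: usize) -> Option<(Tok, usize)> {
    let mut state = dfa.start;
    let mut pos = start_pos;
    let mut last_accept = None;
    while pos < input.len() {
        match dfa.trans.get(&(state, input[pos])) {
            Some(&next) => {
                state = next;
                pos += 1;
                if let Some(&tok) = dfa.accept.get(&state) {
                    last_accept = Some((tok, pos)); // longest match so far
                }
            }
            None => break, // no transition: fall back to the last accepted token
        }
    }
    last_accept // None means a lexical error at start_pos
}

A driver loop then repeatedly calls next_token with the returned end position as the new start, skipping whitespace and comment tokens.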

Back to Blog Chapters

**Chapter 8: Function Calls **

IR
two lower-level forms of intermediate representation for function calls. 
1. d ← f(s1, . . . , sn)
2. call f

calling conventions
Parameters
first six arguments are passed in registers, the remaining arguments
are passed on the stack
all arguments take 8 bytes of space on the stack, even if the type of
argument would indicate that only 4 bytes need to be passed

callee, should set up its stack frame, reserving space for local variables, spilled temps that could not be assigned to registers, 
and arguments passed to functions it calls in turn

It is recommended to calculate the total space needed statically and then decrement the stack pointer %rsp by the appropriate amount only once within the function

%rsp should be aligned 0 mod 16 before another function is called,
and may be assumed to be aligned 8 mod 16 on function entry.


Return
The result is returned in a specific return register %rax.





...
n + 16(%rsp)   argument 8
n + 8(%rsp)    argument 7
n(%rsp)        return address
               local variables   (callee)
(%rsp)         end of frame      (callee)


Register conventions
Abstract form   Register   Usage           Preserved across calls
res0            %rax       return value*   No
arg1            %rdi       argument 1      No
arg2            %rsi       argument 2      No
arg3            %rdx       argument 3      No
arg4            %rcx       argument 4      No
arg5            %r8        argument 5      No
arg6            %r9        argument 6      No
ler7            %r10       caller-saved    No
ler8            %r11       caller-saved    No
lee9            %rbx       callee-saved    Yes
lee10           %rbp       callee-saved*   Yes
lee11           %r12       callee-saved    Yes
lee12           %r13       callee-saved    Yes
lee13           %r14       callee-saved    Yes
lee14           %r15       callee-saved    Yes
                %rsp       stack pointer   Yes


Typical Calling Sequence
d ← f(s1, s2, s3):

arg3 ← s3
arg2 ← s2
arg1 ← s1
call f
d ← res0


l: call f    caller-save(r)
----------------
def(l, r)

caller-save(r) is true of register r among %rax, %rdi, %rsi, %rdx, %rcx, %r8, %r9, %r10, and %r11.

Now if a temp t is live after a function call, we have to add an interference edge connecting t with each of these fixed registers.


Callee-saved Registers
%rbx, %rbp, %r12, %r13, %r14 and %r15
The standard approach is to save those that are needed onto the stack in the function prologue and restore them from the stack in the function epilogue, just before returning.
**: saving and restoring them all is safe, but may be overkill
for small functions

If we need more than the available number of caller-saved registers, we assign callee-saved registers before we resort to spilling,
but we make sure to save them at the beginning of the function and restore them at the end.


f :
t1 ← lee9
t2 ← lee10
· · ·
function body
· · ·
lee10 ← t2
lee9 ← t1
ret


Liveness about calling convention
1. During a call, res0, arg1-6 and ler7, ler8 are defined.
2. For each line l and each temp t defined at l, we create an edge between t and any variable live in the successor.



All precolored registers implicitly interfere with each other, so we don’t include that in the interference graph.


**Chapter 9: SSA **

relabel variables in the code so that each variable is defined
only once in the program text. If the program has this form, called static single assignment (SSA),


then we can perform constant propagation immediately.
There are other program analyses and optimizations for which it is convenient to have this property.



Basic blocks
int dist(int x, int y) {
x = x * x;
y = y * y;
return isqrt(x+y);
}

dist(x0,y0):
x1 <- x0 * x0
y1 <- y0 * y0
t0 <- x1 + y1
t1 <- isqrt(t0)
return t1



Loops
int pow(int b, int e)
//@requires e >= 0;
{
int r = 1;
while (e > 0)
//@loop_invariant e >= 0;
// r*b^e remains invariant
{
r = r * b;
e = e - 1;
}
return r;
}

pow(b0,e0):
      r0 <- 1
      goto loop(b0,e0,r0)
loop(b1,e1,r1):
      if (e1 > 0)
            then body(b1,e1,r1)
            else done(b1,e1,r1)
body(b2,e2,r2):
      r3 <- r2 * b2
      e3 <- e2 - 1
      goto loop(b2,e3,r3)
done(b3,e4,r4):
      return r4

Minimizing SSA Form
From φ-function form to minimized form.
It’s quite simple: repeatedly remove trivial φ-functions of the form
ti = φ(tx1, tx2, . . . , txk)
where every argument txj is either ti itself or one single other temp tj, and replace ti by tj everywhere.

pow(b0,e0):
r0 <- 1
      goto loop
loop:
      b1 <- phi(b0,b2)
      e1 <- phi(e0,e3)
      r1 <- phi(r0,r3)
      if (e1 > 0) then body else done
body:
      b2 <- phi(b1)
      e2 <- phi(e1)
      r2 <- phi(r1)
      r3 <- r2 * b2
      e3 <- e2 - 1
      goto loop

done:
      b3 <- phi(b1)
      e4 <- phi(e1)
      r4 <- phi(r1)
      return r4

-------------------------------------------
pow(b0,e0):
      r0 <- 1
      goto loop
loop:
      e1 <- phi(e0,e3)
      r1 <- phi(r0,r3)
      if (e1 > 0) then body else done
body:
      r3 <- r1 * b0
      e3 <- e1 - 1
      goto loop
done:
      return r1


Once the remaining φ-functions are replaced by moves on the incoming edges (conversion out of SSA), the code is no longer in SSA form, and one cannot apply any SSA-based optimization afterwards. Conversion out of SSA should therefore be one of the last steps before code emission.

At this point register allocation, possibly with register coalescing, can do a good job of eliminating redundant moves.
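A Rust sketch of the "trivial φ" test used in the minimization above (toy representation; renaming the φ's destination afterwards is a simple substitution pass):

use std::collections::HashMap;

// A φ is trivial if all of its arguments are either the destination itself
// or one single other temp; the destination can then be renamed to that temp.
fn trivial_phi(dest: &str, args: &[String]) -> Option<String> {
    let mut other: Option<&String> = None;
    for a in args {
        if a == dest {
            continue;
        }
        match other {
            None => other = Some(a),
            Some(o) if o == a => {}
            Some(_) => return None, // two distinct sources: not trivial
        }
    }
    other.cloned()
}

// Apply the renamings collected from removed φ-functions to a list of temps.
fn rename(temps: &mut [String], renaming: &HashMap<String, String>) {
    for t in temps.iter_mut() {
        if let Some(new) = renaming.get(t) {
            *t = new.clone();
        }
    }
}

In the pow example, b2 = φ(b1) is trivial, so b2 becomes b1; then b1 = φ(b0, b1) is trivial too, so b1 becomes b0, which is exactly what the minimized version uses.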



**Chapter 10: Dynamics **

Denotational Semantics
  • Each part of a program is associated with a denotation
Axiomatic Semantics
  • Strongly related to program logic
  • Gives meaning to phrases using logical axioms
Operational Semantics
  • Related to interpreters and abstract machines
  • Most popular and flexible form of semantics
Expressions
  • Many different styles
    - Natural semantics (or big-step semantics or evaluation dynamics)
    - Structural operational semantics
    - Substructural operational semantics
    - Abstract machine (or small-step with continuation)

abstract machine:
Very general
Low-level and elaborate

e ▷ K
evaluate expression e and pass the result to the continuation K

Binary operations
With effects

Constant and the empty continuation, stop

Boolean, short cutting

Variable: env that maps variables to values
Never changes when evaluating expressions




Executing Statements
Don't pass values to the continuation
Usually have effects on the env
Function Calls
What needs to happen at a function call?
• Evaluate the arguments in left-to-right order
• Save the environment of the caller to continue the execution after the function call
• Save the continuation of the caller
• Execute the body of the callee in a new environment that maps the
formal parameters to the argument values
• Pass the return value to the environment of the caller

Back to Blog Chapters

**Chapter 11: Mutable **

Pointers and Arrays
• Static semantics of pointers
Extend:
types with pointer types
expressions with alloc, deref, and null ptr
use an indefinite (polymorphic) type any* for synthesis for NULL
We can compare two pointers using p == q if they have the same type


• Dynamic semantics of pointers
a value of type τ* is an address that stores a value of type τ
(or the special address 0, i.e. NULL)
Allocations return fresh (unused) addresses
Dereferencing retrieves a stored value
Need heap that maps addresses to values


• Static semantics of arrays
type[]
expression has: alloc_array(type, e) | e1[e2]
destination: d[e]

• Dynamic semantics of arrays
Array Evaluation: Access

H ; S ; η ⊢ e1[e2] ▷ K
   ⟶ H ; S ; η ⊢ e1 ▷ (_[e2], K)

H ; S ; η ⊢ a ▷ (_[e2], K)
   ⟶ H ; S ; η ⊢ e2 ▷ (a[_], K)
   // Need types.

H ; S ; η ⊢ i ▷ (a[_], K)
   ⟶ H ; S ; η ⊢ H(a + i·|τ|) ▷ K
   // a ≠ 0, 0 ≤ i < length(a), a : τ[]
   // Need array sizes.

H ; S ; η ⊢ i ▷ (a[_], K)
   ⟶ exception(mem)
   // a = 0 or i < 0 or i ≥ length(a)


Default Values of Array Type
We also need a default value for array types
• We will just use 0 as the default value again
• It represents an array of length 0
• We can never legally access an array element in the default array
• Warning: Arrays can be compared with equality
• Make sure that alloc_array(t,0) returns a fresh address different from 0
• If arrays have address a=0 then you should not access M[a-8]

* We cannot translate d1[e2] += e3 to d1[e2] = d1[e2] + e3
Effects of e2 and d1 would be evaluated twice

Heap
H : (N U {next}) -> Val


Memory exceptions -> SIGUSR2
better for debugging

Array len
Alternative 1: store the length at the front; the array address points to the start (the length field)
Alternative 2: the array address points to the first element
- simplifies address arithmetic
- allows passing pointers to C
Struct
Decl:
struct s;

Defn:
struct s {t1 f1; ... tn fn; };


Arrays are represented with pointers (but cannot be dereferenced)
-> they can be compared and stored in registers

Structs are usually also pointers but they can be dereferenced

Structs are large types that do not fit in registers
C0 restrictions
Local variables, function parameters, and return values must have
small type

Left and right-hand sides of assignments must have small type

Conditional expressions must have small type

Equality and disequality must compare expressions of small type

Expressions used as statements must have small type
Struct static semantics rules

Field names occupy their own namespace: allowed to overlap with
variable, function, or type names (but they must be distinct from
keywords)
• The same field names can be used in different struct definitions
• In a given struct definition, all field names must be distinct
• A struct may be defined at most once

Types struct s that have not (yet) been defined may be referenced as
long as their size is irrelevant

Size is relevant for
- alloc(struct s)
- alloc_array(struct s,e)
- definitions of structs if structs are types of fields

• Struct declarations are optional
‣ An occurrence of struct s in a context where its size is irrelevant
serves as an implicit declaration of the type struct s.

Types:
Extend types with struct types

Expr:
Field access

Lval

Define:
Elab

struct s* x = alloc(struct s)
e->f = (*e).f
must be defined before access

Struct sizes
Struct sizes determined by laying out the fields from left to right
Ints/bools aligned by 4
Pointers aligned by 8
Struct aligned by most restrictive fields


- Need to pick the right instructions (movl vs movq, cmpl vs cmpq)
- Could always use 8 bytes for spilling.
- Maintain size information in IRs!
- It is a good idea to keep temp/registers of different sizes separate
- If you want moves from small to large temps then make conversion
explicit
zeroextend s^32
signextend s^32

Back to Blog Chapters

**Chapter 12: Dataflow analysis **

  1. Liveness analysis -> interfere?
  2. Neededness analysis -> dead code?
  3. reaching definitions -> const/copy prop
Recap liveness
use(l,x) => instr at line l uses x 
def(l,x) => instr at line l defines x
succ(l,l') => line l' is a successor of l
live(l,x) => temp x is live at line l

use(l,x)
----------
live(l,x)

propagate backwards
only if not defined in l' 
use(l',u), succ(l,l'), -def(l', u)
----------------------------------
live(l,u)

Memory instr in IR
M[y] <- x => store x at addr y
x <- M[y] => load the value at addr y into x


for a store, both the source and the address are used:
l: M[y] <- x
-------------
use(l,x)
use(l,y)
succ(l, l+1)


for a load, the destination x is defined and the address y is used:
l: x <- M[y]
-------------
def(l,x)
use(l,y)
succ(l, l+1)
Dead code elimination
Dead code: operations that don’t influence the result of progr/function
Result of progr
- mem effects
- return val
- errors
- non-terminate ? 
Can exist in either source code
or translated ASM,... 
Usually we run it multiple times 




Example: Remove dead code in code Factorial using liveness info
l1: p <- 1
l2: p <- p * x
l3: z <- p + 1 ----------- dead
l4: x <- x-1
l5: if(x > 0) then l2 else l6
l6: return p

Liv anal


l1: p <- 1                          x
l2: p <- p * x                      p,x
l3: z <- p + 1                      p,x
l4: x <- x-1                        p,x   
l5: if(x > 0) then l2 else l6       p
l6: return p                        p      

z is not live at l4 --> can replace l3 with nop
- only if op at l3 has no effects
- like memory operations, division

But if we change that line to z <- z + 1:
because the liveness of z now propagates through the loop, we can't use this rule to remove it,
even though z <- z + 1 is dead code
l1: p <- 1                          x,z
l2: p <- p * x                      p,x,z
l3: z <- z + 1                      p,x,z
l4: x <- x-1                        p,x,z
l5: if(x > 0) then l2 else l6       p,x,z
l6: return p                        p


Conclusion: liveness alone does not help much in some cases for removing dead code.
Neededness Analysis
----Goal: remove ops that----
(a) have no effect and 
(b) don't have a "needed" result

nec(l,x) => x is necessary at l 

l: x <- y /0 z   (/0 a potentially effectful op, e.g. division)
-----------------------
nec(l,y)
nec(l,z)


l: return x
-----------------------
nec(l,x)


l: M[y] <- x
-----------------------
nec(l,x)
nec(l,y)


c0 no need to worry about this:
l: x <- M[y]
-----------------------
nec(l,y)


l: if (x?c) then l' else l"
---------------------------
nec(l,x)


~~~~~~~~~~ Like liveness ~~~~~~~~~~

nec(l,x)
----------
needed(l,x)


needed(l',x), -def(l,x), succ(l,l')
-------------------------------
needed(l,x)



A more complex rule: the sources of a definition become needed if its destination is needed later.
l: x <- y
.
.
l': return x

l: x <- y ⊕ z    needed(l', x)    succ(l, l')
---------------------------------------------
needed(l, y)    needed(l, z)    (and analogously for a move x <- y)



Note that we do not accidentally remove non-termination:
for an infinite loop:

f:
l1: goto f
l2: ret 0

Complexity O(#vars * #lines)
Look at this problematic version again using neededness:
z <- z + 1 is dead code

First pass, we start at l6, propagate
                                    Needed
l0: z <- 1
l1: p <- 1                          x,
l2: p <- p * x                      p,x (the added complex rule)
l3: z <- z + 1                      p
l4: x <- x-1                        p
l5: if(x > 0) then l2 else l6       p
l6: return p                        p


Second pass, we start at l5, propagate
                                    Needed
l0: z <- 1
l1: p <- 1                          x,
l2: p <- p * x                      p,x (the added complex rule)
l3: z <- z + 1                      p,x
l4: x <- x-1                        p,x
l5: if(x > 0) then l2 else l6       p,x
l6: return p                        p


?? TODO: Maybe merge in one pass? 

Example for optimization
s => elem size
a => addr of array
n => length of array

Step 1:
l1: i <- 0
l2: if ( 0 <0) then error else l3 -------> const propagation ----> const folding---> jump l3 ----> one more pass ---> (nop)
l3: if ( 0 >=n) then error else l4
l4: t <- 0 * s ----------------------> t <- 0 --------> then maybe t is not needed, so removed
l5: u <- a + t ----------------------> copy propagation, u <- a
l6: x <- M[u]

what if 
l7: i <- i+1
l7: if (i<n) then l2 else l9
l9: return x 

The problem: this is not in SSA form; i is defined twice,
at l1 and at l7, and both definitions reach l2.
We cannot do constant propagation, so we are not sure i is 0 at l2.

predicate we want to define: 

reach(l,x,l') 
complexity: O(#line * #line), much larger than O(#var * #line)







Back to Blog Chapters

**Chapter 13: Optimizations **

Reaching definitions
  • if we do not do SSA, then this helps us figure out what definitions reach which line
Rules of reach analysis
reaches(l,x,l') = def of temp x at line l reaches line l'


def(l,x)
succ(l,l')
----------
reaches(l,x,l')


reaches(l,x,l')
not def(l',x)
succ(l', l'')
----------------
reaches(l,x,l'')


l' could be like this: x <- x + y


Complexity O((#line) ^ 2)

Higher than O((#var) * (#line))
What is an optimization?
  • Compiler phase
  • optional
  • goal: make code more efficient
What to optimize
  • *Runtime
  • *Memory use
  • code size
  • domain specific (data center, network, …)
  • energy use (battery-powered or embedded environments, where it becomes important)
How to pick optimizations?
  • Cannot implement all
  • Compare compiled code to other compilers
Register Allocation
  1. Construct interference graph
  2. Find coloring (MCS + greedy coloring)
  3. Assign 13 colors to registers
  4. Assign remaining colors to stack locations
  5. Coalesce non-interfering moves
Coalescing
Remove a move t <- s:
- if t and s do not interfere,
      they can use the same register;
      the move becomes a self-move t <- t, which has no effect
      and can also be eliminated

Costs more compile time,
and the interference graph gets denser

Greedy Coalescing
1. Consider each move t <- s
2. If there's an edge between t and s in the interference graph, then skip
3. Otherwise: if there's a color c not in col(N(s) ∪ N(t)) that corresponds
to a register, then we coalesce t and s into a fresh temp w:
      - add w to the graph with color c
      - create edges between w and N(s) ∪ N(t)
      - remove s and t from the graph

Can this be done before stack allocation?
Splitting live ranges
 introduce a move
-> split a temp into two
-> graph gets sparser

problem: no good and simple heuristics

done:             liveness
a <- x + 8128     x
b <- a + a        x,a
return x * b      x,b

Example:
                                    live vars
x <- y                              y,u,v
n <- u + v                          x,u,v
i <- n                              x,n
l1: if i <= 0 then l2 else done     x,i
l2: i <- i-1                        x,i
      x <- x * x                    x,i
      goto l1                       x,i

where to split and when to split?
Graph:
u, v, n, i, a, b, x

u(0)-|    4     |-i(1)
v(1)-|--- x ----|-a(2)
n(2)-|          |-b(3)


make x' to replace x
                                    live vars
x <- y                              y,u,v
n <- u + v                          x,u,v
i <- n                              x,n
x' <- x
l1: if i <= 0 then l2 else done     x',i
l2: i <- i-1                        x',i
      x' <- x' * x'                 x',i
      goto l1                       x',i


u(0)-|    3 0     |-i(1)
v(1)-|--- x x'----|-a(2)
n(2)-|            |-b(3)


which to split? may use a heuristic
where to split?
One round of reg allocation -> another round of reg allocation

Back to Blog Chapters

**Chapter 14: Peephole Optimizations **

local optimizations that modify a few lines
Constant folding
Example:
replace binop with a const

l: x <-  c1 ⊕ c2 ->
l:  x <- c if c = c1 ⊕ c2

l: x <-  c1 ⊕ c2 ->
l: raise(arith) if c1 ⊕ c2 is undefined

Gen notation
l: inst0 -> inst0'
...
l: instn -> instn'

In LLVM there are >= 1000 peephole optimizations -> a source of bugs

Optimize conditionals
l: if c1 ? c2 then l1 else l2  ->  l: goto li (when c1 ? c2 can be decided at compile time); the untaken block may become dead code


Example:
l1: x <- y + c1
l2: z <- x + c2

-> 
l1: x <- y + c1  <--- dead code ?
l2: z <- y + c
where c =  c1 + c2


Strength reduction
Replace complex instruction with simpler one

Examples:
a + 0 = a 
a - 0 = a
a * 0 = 0
a * 1 = a
a * 2^n  = a << n
a * b + a * c  = a * (b + c)

Don't introduce bugs!
x + 1 > x 
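A couple of these rewrites as a Rust sketch (toy instruction type; the guards are what keep us from introducing bugs, e.g. only shifting for positive powers of two and folding with wrapping arithmetic to match the two's-complement target):

#[derive(Clone, Debug, PartialEq)]
enum Operand { Const(i32), Temp(String) }

#[derive(Clone, Debug, PartialEq)]
enum Instr {
    Move { d: String, s: Operand },
    Add { d: String, s1: Operand, s2: Operand },
    Mul { d: String, s1: Operand, s2: Operand },
    Shl { d: String, s: Operand, amount: u32 },
}

// Peephole rewrites on a single instruction; anything unmatched is returned unchanged.
fn peephole(i: Instr) -> Instr {
    match i {
        // constant folding: c1 + c2 -> c (wrapping, to match the target's arithmetic)
        Instr::Add { d, s1: Operand::Const(c1), s2: Operand::Const(c2) } =>
            Instr::Move { d, s: Operand::Const(c1.wrapping_add(c2)) },
        // strength reduction: a * 2^n -> a << n, only for positive powers of two
        Instr::Mul { d, s1, s2: Operand::Const(c) } if c > 0 && (c as u32).is_power_of_two() =>
            Instr::Shl { d, s: s1, amount: (c as u32).trailing_zeros() },
        // a + 0 -> a
        Instr::Add { d, s1, s2: Operand::Const(0) } => Instr::Move { d, s: s1 },
        other => other,
    }
}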

Null sequences
instructions that have no effect
l: x <- x -> l: nop


l1: x <- y -> l1: x <- y (may later become dead code)
l2: y <- x -> l2: nop



Example: 
a <- a + x
a <- a + y

-----------> a,x,y are on the stack

r1 <- a
r1 <- x + r1
a <- r1 ----> dead code
r1 <- a -----> dead code
r1 <- y + r1
a <- r1

l: goto l'   (where l' is the very next line)
l': instr
=>
l: nop
l': instr

Common subexpression elim
l: x <- s1 ⊕ s2
..
l': y <- s1 ⊕ s2


-> 
l: x <- s1 ⊕ s2
..
l': y <- x

problem: every code path that reaches l' needs to go through l

Example:
lab1: 
x <- a + b
if a < 0 then lab2
      else lab3

lab2:
y <- a + b
goto lab4

lab3:
z <- x + b
goto lab4

lab4:
n <- a + b    can we replace a + b? It appears three times.
1. replace n <- a + b with n <- y? No: the path through lab3 does not compute y.
2. replace n <- a + b with n <- x: yes, every path to lab4 goes through lab1.


Back to Blog Chapters

**Chapter 15: Memory Optimizations **

Dominator Graph
l > l' = l dominates l' = every path to l' goes through l

(1) compute l > l' on the CFG
- Lengauer-Tarjan (asymptotically faster)
- Cooper et al. (simpler & faster in practice) <- idea: construct the dominator tree

Dominator tree:
- nodes are lines
- edge between each line and its immediate dominator

decide l > l' by
- start at l' in the dominator tree
- traverse the tree upwards until
      - we find l -> l > l'
      - we arrive at the function entry -> l does not dominate l'

Example: 
loops

while(e,s)

l0: instr(e)
l0': if (res(e)==0) then l2 else l1
l1: insts(s)
l1': goto l0
l2: 

control flow graph
                                            ------------ l2
                                           |
k(before loop) -----> l0 - --e- - -> l0'  -| 
                        ^                   ------ l1 - s -> l1'
                        |                                      |
                         --------------------------------------

l0' has predecessor of K
l1,l2 has predecessor of l0'



High-level view:
the algorithms solve the dataflow equation:

Dom(l) =
 {l0}                                              if l = l0 (function entry)
 {l} ∪ ⋂ { Dom(l') : l' a predecessor of l }       otherwise


Naive algorithm:
1. Dom(l0) = {l0}
   Dom(l) = L (all lines) for l != l0
2. apply the second equation repeatedly until nothing changes
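The naive algorithm is a few lines of Rust (a sketch; preds is the predecessor list of the CFG, and unreachable lines are simply left as {l} here):

use std::collections::HashSet;

// Naive dominator fixpoint: preds[l] lists the CFG predecessors of line l.
fn dominators(preds: &[Vec<usize>], entry: usize) -> Vec<HashSet<usize>> {
    let n = preds.len();
    let all: HashSet<usize> = (0..n).collect();
    let mut dom: Vec<HashSet<usize>> = vec![all; n]; // Dom(l) = L initially
    dom[entry] = [entry].into_iter().collect();      // Dom(l0) = {l0}
    let mut changed = true;
    while changed {
        changed = false;
        for l in 0..n {
            if l == entry {
                continue;
            }
            // Dom(l) = {l} ∪ intersection of Dom(l') over all predecessors l'
            let mut new_dom = preds[l]
                .iter()
                .map(|&p| dom[p].clone())
                .reduce(|a, b| a.intersection(&b).cloned().collect::<HashSet<usize>>())
                .unwrap_or_default();
            new_dom.insert(l);
            if new_dom != dom[l] {
                dom[l] = new_dom;
                changed = true;
            }
        }
    }
    dom
}

Building the dominator tree then just links each line to its immediate dominator (the closest strict dominator).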


Construct dom tree during code gen
Example: conditional if(e,s1,s2)
l0: instr(e)
l'0: if(res(e)!=0) then l1 else l2
l1: instr(s1); goto l3
l2: instr(s2); goto l3
l3: 


l3 -> l0'
l1 -> l0'
l2 -> l0'

Memory Optimization
- can give significant speedups
- difficult to implement

Example:
M[q] <- 4
M[p] <- 8128
x <- M[q] ---> x <- 4

valid only if p and q don't alias (q != p)
Alias analysis!
Alias analysis: One approach..
Assume we already have 
may-alias (a,b) = temps a and b can contain the same address

l: t <- M[a]
.
.
.
l': t' <- M[a]


=>

l: t <- M[a]
.
.
.
l': t' <- t 


What are the conditions that need to hold for us to do this?
- No other def of t reaches l' (SSA, good)
- a has the same value in l and l' (SSA, good)
- M[a] is still "available" (not overwritten)
- t is still defined at l'

copy propagation applies within basic blocks


if avail(l, l') and l > l'
Easier: def unavail(l,l')

unavail(l,l') = l: t <- M[a]
M[a] is potentially defined on a path from l to l' and l > l'(for efficiency only)



l: t<- M[a]
l > l'
l': M[b] <- S
may-alias(a,S)
succ(l',k)
-----------------------
unavailable(l,k)
l > k


unavail(l,k)
succ(k,k')
l > k'
--------------
unavail(l,k')


Function calls can be handled similarly;
if the called function takes no pointers and does not use memory, it is easier



Loop optimizations
hoisting loop-invariant computation out of a loop, 
and optimizations based on induction variables.

loop(h, l): line l is in the loop with header h

Fusion 
If there are two adjacent loops that iterate over the same data then these
loops can be combined in some situations.

Interchange 
Sometimes it can improve locality to switch the position of an inner
and outer loop.


Unrolling 
Sometimes it is beneficial to place a copy of the body of the loop in a
new code block in front of the loop. This can lead to new opportunities for
optimization and in some situations we can avoid some executions for the
loop guard.

Hoisting invariant computation 
If the result of an operation in the loop body is
the same in every loop iteration then we can perform it before entering the
loop body.


Inversion 
Replacing a while loop with a do while loop can reduce the number of
jumps and improve the effectiveness of instruction pipelines.
Induction variable substitution If a variable changes by a constant between loop
iterations then it can enable optimizations to make this relationship explicit.

A (pure) expression is loop invariant if its value does not change throughout the
loop. We can then define the predicate inv(h, p), where p is a pure expression, as
follows:

c constant
----------------
inv(h,c)

def(l, x) ¬loop(h, l)
-----------------------
inv(h, x)


inv(h, s1) inv(h, s2)
-----------------------
inv(h, s1 ⊕ s2)

Since we are concerned only with programs in SSA form, it is easy to see that variables are loop invariant if they are not parameters of the header label.

l : t ← p inv(h, p) loop(h, l)
-------------------------------
inv(h, t)

In order to hoist loop invariant computations out of a loop we should have a
loop pre-header in the control-flow graph, which immediately dominates the loop
header. We then move all the loop invariant computations to the pre-header, in
order.


Hoisting Loop-Invariant Computation

Back to Blog Chapters

**Chapter 17: Garbage Collection **

Memory Management
so far: we only allocate
mem: HOW TO FREE? 
two options: manual or automatic

Autom mem mgmt:
advantages: 
- easier for the programmer
- correct: "no use after free"
- prevents memory leak(some)
disadvantages:
- some mem leaks possible
- potentially less efficient
- more mem use
- time overhead

Two main approaches to autom mem mgmt: 
1. garbage collection
2. reference counting


Tracing GC
- Identify objects on the heap that are still needed
- start with "root pointers"
- root pointers: addresses on the stack(in local variable) and registers (and global vars)
- find all objs that are reachable from the roots

1:
Conservative collector
- all words on the stack are treated as pointers
Better 
- store type info for stack frames
* Need to maintain type info for allocated obj (to ident pointers)
Mark and Sweep
  • reserve a bit for the GC in each object
  • during tracing: mark reachable objects
  • sweep: reclaim memory that is not marked (see the sketch below)
  • “Stop the world” GC, undesirable in terms of performance
  • fragmentation
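In miniature (a toy object graph indexed by integers, nothing like a real heap), mark & sweep is just a reachability traversal followed by collecting the unmarked objects:

use std::collections::HashSet;

// fields[o] lists the objects that object o points to; roots come from the
// stack, registers and globals.
fn mark_and_sweep(fields: &[Vec<usize>], roots: &[usize]) -> HashSet<usize> {
    // mark: everything reachable from the roots
    let mut marked = HashSet::new();
    let mut stack: Vec<usize> = roots.to_vec();
    while let Some(o) = stack.pop() {
        if marked.insert(o) {
            stack.extend(&fields[o]);
        }
    }
    // sweep: everything unmarked can be reclaimed
    (0..fields.len()).filter(|o| !marked.contains(o)).collect()
}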
Copying GC
divide heap into two "semi spaces"


Two semi-spaces:
A | B
Mark phase at A;
instead of sweeping, copy the reachable objects from A to B,
i.e. move the reachable objects to the other space

Can be done incrementally:
Copy some parts of A to B

Back to Blog Chapters

**Chapter 18: More advanced features **

Function pointers

typedef int optype(int, int);

typedef int (*optype_pt)(int, int);


int f(int x, int y){
      return x + y;
}

int (*g)(int x, int y) = &f;

int main(){
      (*g)(1,2);
}


rewrite local f2 by linking... 
similar idea to xz Utils Backdoor, function hooking with broken dynamic linking semantics

Static resource analysis
Information about resource consumption at development time
- upper and lower bounds, security, ...

Why not testing:
- it doesn't provide guarantees and may miss the worst case (e.g. quicksort's quadratic behavior)

Resource Usage in Safety-Critical Systems
Memory Usage -> Toyota cars
Timing -> ICE 3 Velaro D delivery delay 
Cloud computing, job scheduling if we know the resource usage

Software Security
- side channels: Meltdown, Spectre, ...
- algorithmic complexity attacks: worst-case inputs enable low-bandwidth DoS attacks

Analysis of Alg: 
Big-O
Abstracts from low level details 
Disconnected from actual code
Asymptotic bounds not useful


Automatic Amortized Resource Analysis
- Given program, worse-case resource consumption of P as a function of the 
size of the input?
- potential functions p(before) >= p(after) + cost --> p(init state) >= sum(cost)


Example:
append(x,y)
Heap-space usage is 2n if n is len of x

f(x,y,z)
let t = append(x,y) in append(t,z)

Heap usage of f(x,y,z) is 2n + 2(n+m)

Evaluator example: Emult nodes cost 2, Einc nodes cost 1, Econst nodes cost 0
Bound: 2n, where n is the number of nodes in e
Better bound: 2m + k  (m: number of Emult nodes, k: number of Einc nodes)




Back to Blog Chapters