

## Simple Machine Model

- Instructions are executed in sequence.
  - Fetch, decode, execute, store results.
  - One instruction at a time.
- For branch instructions, start fetching from a different location if needed.
  - Check branch condition.
  - Next instruction may come from a new location given by the branch instruction.

Advanced Compiler Techniques 22/04/2005

## Simple Execution Model

5-stage pipeline:

| Cycle | 1     | 2      | 3       | 4      | 5          |
|-------|-------|--------|---------|--------|------------|
| Stage | fetch | decode | execute | memory | write back |

- Fetch: get the next instruction.
- Decode: figure out what that instruction is.
- Execute: perform the ALU operation (address calculation in a memory op).
- Memory: do the memory access in a memory op.
- Write Back: write the results back.



























```
Example

r2 = *(r1 + 4)

r3 = *(r1 + 8)

noop

r4 = r2 + r3

r5 = r2 - 1

goto L1

noop
```



# Example

Final code after delay slot filling:

1. `r2 = *(r1 + 4)`
2. `r3 = *(r1 + 8)`
3. `r5 = r2 - 1` (fills the load delay slot)
4. `goto L1`
5. `r4 = r2 + r3` (fills the branch delay slot)

# From a Simple Machine Model to a Real Machine Model

- Many pipeline stages.
  - MIPS R4000 has 8 stages.
- Different instructions take different amounts of time to execute.
  - mult: 10 cycles
  - div: 69 cycles
  - ddiv: 133 cycles
- Hardware stalls the pipeline if an instruction uses a result that is not ready.

#### Real Machine Model cont.

- Most modern processors have multiple execution units (superscalar).
  - If the instruction sequence is right, multiple operations take place in the same cycle.
  - This makes it even more important to have the right instruction sequence.

## **Instruction Scheduling**

Goal: Reorder instructions so that pipeline stalls are minimized.

#### Constraints on Instruction Scheduling:

- Data dependencies.
- Control dependencies.
- Resource constraints.

### Data Dependencies

- If two instructions access the same variable, they can be dependent.
- Kinds of dependencies:
  - True: write → read. (Read After Write, RAW)
  - Anti: read → write. (Write After Read, WAR)
  - Output: write → write. (Write After Write, WAW)
- What to do if two instructions are dependent?
  - The order of execution cannot be reversed.
  - This reduces the possibilities for scheduling.
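The three kinds of dependence can be told apart by comparing the registers that each instruction writes and reads. The sketch below is only an illustration: the `Instr` record and the three sample instructions are assumptions, not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: registers written and read."""
    text: str
    defs: frozenset
    uses: frozenset


def dependences(first, second):
    """Kinds of dependence that force `second` to stay after `first`."""
    kinds = []
    if first.defs & second.uses:
        kinds.append("true (RAW)")
    if first.uses & second.defs:
        kinds.append("anti (WAR)")
    if first.defs & second.defs:
        kinds.append("output (WAW)")
    return kinds


i1 = Instr("r2 = r1 + 4", frozenset({"r2"}), frozenset({"r1"}))
i2 = Instr("r1 = r2 * r3", frozenset({"r1"}), frozenset({"r2", "r3"}))
i3 = Instr("r1 = r5 - 1", frozenset({"r1"}), frozenset({"r5"}))

print(dependences(i1, i2))  # ['true (RAW)', 'anti (WAR)']
print(dependences(i2, i3))  # ['output (WAW)']
print(dependences(i1, i3))  # ['anti (WAR)']
```

An empty result means the two instructions may be reordered freely as far as registers are concerned.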

### Computing Data Dependencies

- For basic blocks, compute dependencies by walking through the instructions.
- Identifying register dependencies is simple.
  - Is it the same register?
- For memory accesses:
  - Simple: base + offset1 ?= base + offset2
  - Data dependence analysis: a[2i] ?= a[2i+1]
  - Interprocedural analysis: global ?= parameter
  - Pointer alias analysis: p1 ?= p
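The simplest of these memory tests, comparing `base + offset` pairs, might look like the sketch below. The function name `may_alias` and the 4-byte access size are assumptions made for the illustration. The caller is also assumed to guarantee that the base register is not written between the two accesses.

```python
def may_alias(base1, off1, base2, off2, size=4):
    """Conservative overlap test for two `size`-byte accesses at base+offset.

    Same base register: the offsets decide. Different base registers:
    nothing is known without further analysis, so answer "maybe".
    """
    if base1 == base2:
        return abs(off1 - off2) < size
    return True


print(may_alias("r1", 4, "r1", 8))  # False: disjoint words
print(may_alias("r1", 4, "r1", 4))  # True: same word
print(may_alias("r1", 4, "r6", 4))  # True: unknown, assume a dependence
```

Answering "maybe" for unrelated bases keeps the schedule correct at the cost of extra edges. The more precise analyses in the list exist to remove those edges.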

## Representing Dependencies

- Using a dependence DAG, one per basic block.
- Nodes are instructions, edges represent dependencies.



Each edge is labeled with a latency:

$v(i \rightarrow j)$ = delay required between the initiation times of i and j, minus the execution time required by i.

## Example

1. `r2 = *(r1 + 4)`
2. `r3 = *(r2 + 4)`
3. `r4 = r2 + r3`
4. `r5 = r2 - 1`
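Walking the block pairwise is enough to recover the edges of the dependence DAG for this example. The following is a minimal sketch, assuming each instruction is described by hypothetical sets of written and read registers. Memory dependences are ignored, which is safe here only because the block contains no stores.

```python
# Each instruction: (text, registers written, registers read).
block = [
    ("r2 = *(r1 + 4)", {"r2"}, {"r1"}),
    ("r3 = *(r2 + 4)", {"r3"}, {"r2"}),
    ("r4 = r2 + r3", {"r4"}, {"r2", "r3"}),
    ("r5 = r2 - 1", {"r5"}, {"r2"}),
]


def build_dag(block):
    """Return {(i, j): kinds} for every pair i < j that must stay in order."""
    edges = {}
    for j, (_, defs_j, uses_j) in enumerate(block):
        for i in range(j):
            _, defs_i, uses_i = block[i]
            kinds = []
            if defs_i & uses_j:
                kinds.append("RAW")
            if uses_i & defs_j:
                kinds.append("WAR")
            if defs_i & defs_j:
                kinds.append("WAW")
            if kinds:
                edges[(i + 1, j + 1)] = kinds  # 1-based, as on the slide
    return edges


for (i, j), kinds in sorted(build_dag(block).items()):
    print(f"{i} -> {j}: {', '.join(kinds)}")
# 1 -> 2: RAW
# 1 -> 3: RAW
# 1 -> 4: RAW
# 2 -> 3: RAW
```

The pairwise comparison is conservative: it also reports edges that are already implied by others (such as 1 → 3 here), which does not affect correctness.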



## Another Example

1. `r2 = *(r1 + 4)`
2. `*(r1 + 4) = r3`
3. `r3 = r2 + r3`



### Control Dependencies and Resource Constraints

- For now, let's only worry about basic blocks.
- For now, let's look at simple pipelines.

## Example

- Assume:
  - Memory cached, available in 1 cycle.
  - Mul 3 cycles
  - Div 4 cycles
  - Other 1 cycle

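#### Example

| #  | Instruction          | Results available in |
|----|----------------------|----------------------|
| 1  | LA r1, array         | 1 cycle              |
| 2  | LD r2, 4(r1)         | 1 cycle              |
| 3  | AND r3, r3, 0x00FF   | 1 cycle              |
| 4  | MULC r6, r6, 100     | 3 cycles             |
| 5  | ST r7, 4(r6)         |                      |
| 6  | DIVC r5, r5, 100     | 4 cycles             |
| 7  | ADD r4, r2, r5       | 1 cycle              |
| 8  | MUL r5, r2, r4       | 3 cycles             |
| 9  | ST r4, 0(r1)         |                      |

Executed in program order (st = stall): 1 2 3 4 st st 5 6 st st st 7 8 9

14 cycles!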
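The 14-cycle figure can be reproduced by issuing one instruction per cycle and inserting stalls until the operands are available. This is a sketch under two assumptions: only the register true dependences readable from the listing are modelled, and the stores are given a latency of 1 cycle (the slide gives none).

```python
# (number, latency in cycles, instructions whose results it needs)
program = [
    (1, 1, []),      # LA   r1, array
    (2, 1, [1]),     # LD   r2, 4(r1)
    (3, 1, []),      # AND  r3, r3, 0x00FF
    (4, 3, []),      # MULC r6, r6, 100
    (5, 1, [4]),     # ST   r7, 4(r6)    (latency assumed)
    (6, 4, []),      # DIVC r5, r5, 100
    (7, 1, [2, 6]),  # ADD  r4, r2, r5
    (8, 3, [2, 7]),  # MUL  r5, r2, r4
    (9, 1, [1, 7]),  # ST   r4, 0(r1)    (latency assumed)
]


def timeline(order):
    """Issue slots for `order`; 'st' marks a stall cycle."""
    info = {n: (lat, preds) for n, lat, preds in program}
    issued, slots = {}, []
    for n in order:
        # First cycle in which every operand of n is available.
        ready = max((issued[p] + info[p][0] for p in info[n][1]), default=1)
        while len(slots) + 1 < ready:
            slots.append("st")
        slots.append(str(n))
        issued[n] = len(slots)
    return slots


slots = timeline(range(1, 10))
print(" ".join(slots), "->", len(slots), "cycles")
# 1 2 3 4 st st 5 6 st st st 7 8 9 -> 14 cycles
```

`timeline` accepts any order in which every instruction comes after the instructions it depends on, so it can be used to compare candidate schedules.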

## List Scheduling Algorithm

- Idea:
  - Do a topological sort of the dependence DAG.
  - Consider when an instruction can be scheduled without causing a stall.
  - Schedule the instruction if it causes no stall and all its predecessors are already scheduled.
- Optimal list scheduling is NP-complete.
  - Use heuristics when necessary.


### List Scheduling Algorithm

- Create a dependence DAG of a basic block.
- Topological sort:
  - **READY** = nodes with no predecessors.
  - Loop until **READY** is empty:
    - Schedule each node in **READY** when it causes no stall.
    - **READY** += nodes whose predecessors have all been scheduled.
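The loop above can be sketched as a small cycle-by-cycle scheduler. The data below encodes the nine-instruction example with register true dependences only and an assumed store latency of 1 cycle. The `priority` parameter is a hypothetical hook for ranking READY nodes; the default simply prefers the lowest instruction number.

```python
latency = {1: 1, 2: 1, 3: 1, 4: 3, 5: 1, 6: 4, 7: 1, 8: 3, 9: 1}
preds = {1: [], 2: [1], 3: [], 4: [], 5: [4], 6: [],
         7: [2, 6], 8: [2, 7], 9: [1, 7]}


def list_schedule(latency, preds, priority=lambda n: -n):
    """Return {instruction: issue cycle}; `priority` ranks the READY nodes."""
    succs = {n: [] for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
    waiting = {n: len(ps) for n, ps in preds.items()}
    ready = [n for n in preds if not preds[n]]
    issued, cycle = {}, 0
    while ready:
        cycle += 1
        # READY nodes whose operands are available now (no stall).
        ok = [n for n in ready
              if all(issued[p] + latency[p] <= cycle for p in preds[n])]
        if not ok:
            continue  # every READY node would stall: idle cycle
        n = max(ok, key=priority)
        ready.remove(n)
        issued[n] = cycle
        for s in succs[n]:
            waiting[s] -= 1
            if waiting[s] == 0:
                ready.append(s)
    return issued


schedule = list_schedule(latency, preds)
order = sorted(schedule, key=schedule.get)
print(order, max(schedule.values()))  # [1, 2, 3, 4, 6, 5, 7, 8, 9] 11
```

Even with this naive priority the sketch needs 11 cycles rather than 14. A better `priority` function can shorten the schedule further.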


#### Heuristics for selection

Heuristics for selecting from the READY list (the priority of the node):

1. Pick the node with the longest path to a leaf in the dependence graph.
2. Pick a node with the most immediate successors.
3. Pick a node that can go to a less busy pipeline (in a superscalar implementation).


#### Heuristics for selection

Pick the node with the longest path to a leaf in the dependence graph.

Algorithm (for node x):

- If x has no successors, $d_x = 0$.
- Otherwise, $d_x = \max_{y \in succ(x)} \left( d_y + v(x \rightarrow y) \right)$.

Use reverse breadth-first visiting order.
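As an illustration, the recurrence can be evaluated for the nine-instruction example. This is a sketch under two assumptions: the successor lists contain register true dependences only, and the edge label $v(x \rightarrow y)$ is approximated by the result latency of x. Successors are visited first by walking the instruction numbers in decreasing order, which works because every edge in a basic block points to a later instruction.

```python
latency = {1: 1, 2: 1, 3: 1, 4: 3, 5: 1, 6: 4, 7: 1, 8: 3, 9: 1}
succs = {1: [2, 9], 2: [7, 8], 3: [], 4: [5], 5: [],
         6: [7], 7: [8, 9], 8: [], 9: []}

d = {}
for x in sorted(succs, reverse=True):
    # d_x = 0 for a leaf, else the max over successors y of d_y + v(x -> y).
    d[x] = max((d[y] + latency[x] for y in succs[x]), default=0)

# Number of immediate successors, usable as a tie-breaker.
f = {x: len(ys) for x, ys in succs.items()}

ready = [1, 3, 4, 6]  # nodes without predecessors
ranked = sorted(ready, key=lambda x: (d[x], f[x]), reverse=True)
print(d[6], d[1], d[4], d[3])  # 5 3 3 0
print(ranked)                  # [6, 1, 4, 3]
```

Instruction 6 (the 4-cycle divide) heads the longest chain and is ranked first. Instructions 1 and 4 tie on $d_x$, and the successor count breaks the tie in favour of 1.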


#### Heuristics for selection

Pick a node with the most immediate successors.

Algorithm (for node x):

- $f_x$ = number of immediate successors of x


#### Heuristics for selection from the READY list

The priority of the node:

1. Pick the node with the longest path to a leaf in the dependence graph: largest $d_x$.
2. Pick a node with the most immediate successors: largest $f_x$.

```
Example
                         Results available in
1: LA   r1, array        1 cycle
2: LD   r2, 4(r1)        1 cycle
3: AND  r3, r3, 0x00FF   1 cycle
4: MULC r6, r6, 100      3 cycles
5: ST   r7, 4(r6)
6: DIVC r5, r5, 100      4 cycles
7: ADD  r4, r2, r5       1 cycle
8: MUL  r5, r2, r4       3 cycles
9: ST   r4, 0(r1)
```













































#### Example

[Figure: dependence DAG of instructions 1 to 9, with the READY set updated at each scheduling step.]

Final schedule: 6 1 2 4 7 3 5 8 9
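A schedule such as 6 1 2 4 7 3 5 8 9 can be checked mechanically: every instruction must come after its predecessors, and the stall count shows how well the latencies are hidden. The sketch below again assumes register true dependences only and a store latency of 1 cycle; `check` is a hypothetical helper name.

```python
latency = {1: 1, 2: 1, 3: 1, 4: 3, 5: 1, 6: 4, 7: 1, 8: 3, 9: 1}
preds = {1: [], 2: [1], 3: [], 4: [], 5: [4], 6: [],
         7: [2, 6], 8: [2, 7], 9: [1, 7]}


def check(order):
    """Return (total cycles, stall cycles); raise if a dependence is broken."""
    issued, cycle, stalls = {}, 0, 0
    for n in order:
        missing = [p for p in preds[n] if p not in issued]
        if missing:
            raise ValueError(f"{n} scheduled before its predecessors {missing}")
        ready = max((issued[p] + latency[p] for p in preds[n]), default=0)
        start = max(cycle + 1, ready)
        stalls += start - (cycle + 1)
        issued[n] = cycle = start
    return cycle, stalls


print(check([6, 1, 2, 4, 7, 3, 5, 8, 9]))  # (9, 0)
print(check(range(1, 10)))                 # (14, 5)
```

Under these assumptions the reordered block issues in 9 cycles without a stall, against 14 cycles with 5 stalls in program order.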



#### **Resource Constraints**

- Modern machines have many resource constraints.
- Superscalar architectures:
  - can run a few operations in parallel,
  - but have constraints.


# Resource Constraints of a Superscalar Processor

#### Example:

- 1 integer operation, e.g., `ALUop dest, src1, src2` (1 clock cycle), in parallel with
- 1 memory operation, e.g., `LD dst, addr` (2 clock cycles) or `ST src, addr` (1 clock cycle).



# List Scheduling Algorithm with Resource Constraints

- Represent the superscalar architecture as multiple pipelines.
  - Each pipeline represents some resource.
- Example:
  - One single-cycle ALU unit.
  - One two-cycle pipelined memory unit.

[Figure: one column per pipeline stage: ALUop, MEM 1, MEM 2.]
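A minimal sketch of scheduling under these resource constraints is given below. It assumes one ALU issue and one memory issue per cycle, with loads taking 2 cycles and stores and ALU operations 1 cycle. The five-instruction block and its names are invented for the illustration, and ties are broken by program order rather than by a priority heuristic.

```python
# Hypothetical block: name -> (unit, latency, predecessors).
block = {
    "ld1": ("MEM", 2, []),
    "ld2": ("MEM", 2, []),
    "add": ("ALU", 1, ["ld1", "ld2"]),
    "sub": ("ALU", 1, []),
    "st": ("MEM", 1, ["add"]),
}


def schedule(block):
    """Greedy cycle-by-cycle issue with one ALU slot and one MEM slot."""
    issued = {}
    cycle = 0
    while len(issued) < len(block):
        cycle += 1
        busy = set()  # units that already issued an operation this cycle
        for name, (unit, _, preds) in block.items():
            if name in issued or unit in busy:
                continue
            # All operands must have been produced by the start of this cycle.
            if all(p in issued and issued[p] + block[p][1] <= cycle
                   for p in preds):
                issued[name] = cycle
                busy.add(unit)
    return issued


print(schedule(block))
# {'ld1': 1, 'sub': 1, 'ld2': 2, 'add': 4, 'st': 5}
```

The memory unit is pipelined, so `ld2` can issue one cycle after `ld1` even though `ld1` has not finished. The independent `sub` shares cycle 1 with `ld1` because it uses the other unit.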

















































# Trace Scheduling

- Find the most common trace of basic blocks.
  - Use profile information.
- Combine the basic blocks in the trace and schedule them as one block.
- Create compensating (clean-up) code if the execution goes off-trace.

















## Scheduling for Loops

- Loop bodies are typically small.
- But a lot of time is spent in loops due to their iterative nature.
- Need better ways to schedule loops.


# Loop Example

#### Machine:

- One load/store unit
  - load: 2 cycles
  - store: 2 cycles
- Two arithmetic units
  - add: 2 cycles
  - branch: 2 cycles (no delay slot)
  - multiply: 3 cycles
- Both units are pipelined (initiate one op each cycle)













# Loop Unrolling

- Rename registers.
  - Use different registers in different iterations.
- Eliminate unnecessary dependencies.
  - Again, use more registers to eliminate true, anti and output dependencies.
  - Eliminate dependent chains of calculations when possible.

```
Loop Example

loop: ld  r6, (r1)
      mul r6, r6, r3
      st  r6, (r1)
      add r2, r1, 4
      ld  r7, (r2)
      mul r7, r7, r3
      st  r7, (r2)
      add r1, r2, 4
      ble r1, r5, loop
```
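At source level, the unrolled loop above corresponds roughly to the following sketch, which scales an array in place. The function names and the temporaries `t0` and `t1` are assumptions. The temporaries play the role of `r6` and `r7`: each unrolled copy gets its own name, so the two multiplications do not depend on each other.

```python
def scale(a, k):
    """Original loop: one element per iteration."""
    for i in range(len(a)):
        a[i] = a[i] * k


def scale_unrolled(a, k):
    """Unrolled by two, with a separate temporary for each copy."""
    n = len(a)
    i = 0
    while i + 1 < n:
        t0 = a[i] * k      # first copy
        t1 = a[i + 1] * k  # second copy, independent of t0
        a[i] = t0
        a[i + 1] = t1
        i += 2
    if i < n:              # leftover element when n is odd
        a[i] = a[i] * k


data = [1, 2, 3, 4, 5]
scale_unrolled(data, 10)
print(data)  # [10, 20, 30, 40, 50]
```

The clean-up step for an odd element count has no counterpart in the assembly listing, which runs until `r1` passes `r5`.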



# Software Pipelining

- Try to overlap multiple iterations so that the slots will be filled.
- Find the steady-state window so that:
  - all the instructions of the loop body are executed,
  - but from different iterations.
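The idea can be illustrated at source level on a loop that scales an array in place, split into three stages: load, multiply, store. This is only a sketch under assumed names (`scale_pipelined`, `loaded`, `product`); a real compiler performs the transformation on machine instructions. In the steady state each pass executes the store of iteration i, the multiply of iteration i+1 and the load of iteration i+2.

```python
def scale_pipelined(a, k):
    """Scale `a` in place with the load/multiply/store stages overlapped."""
    n = len(a)
    if n == 0:
        return
    if n == 1:
        a[0] = a[0] * k
        return
    # Prologue: start iterations 0 and 1.
    loaded = a[0]             # load(0)
    product = loaded * k      # mul(0)
    loaded = a[1]             # load(1)
    # Steady state: one stage from each of three different iterations.
    for i in range(n - 2):
        a[i] = product        # store(i)
        product = loaded * k  # mul(i + 1)
        loaded = a[i + 2]     # load(i + 2)
    # Epilogue: drain the two iterations still in flight.
    a[n - 2] = product
    a[n - 1] = loaded * k


data = [1, 2, 3, 4]
scale_pipelined(data, 10)
print(data)  # [10, 20, 30, 40]
```

The two live values `loaded` and `product` belong to different iterations at the same time, and the prologue and epilogue are the extra code needed to fill and drain the pipeline.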







# Software Pipelining

- Optimal use of resources.
- Need a lot of registers.
  - Values in multiple iterations need to be kept.
- Dependence issues.
  - A store from one iteration may execute before the branch of a previous iteration has executed (writing when it should not have).
  - Loads and stores are issued out of order (dependencies need to be figured out before doing this).
- Code generation issues.
  - Generate preamble and postamble code.
  - Multiple blocks so no register copy is needed.

Advanced Compiler Techniques 22/04/2005 http://lamp.epfl.ch/teaching/advancedCompiler/