#### CDA 4150 Lecture 4: Vector Processing (CRAY-like Machines)

#### Amdahl's Law

- $T_S$ = time spent in sequential processing
- $T_P$ = time spent in parallel processing
- $S_P$ = speedup
- $P$ = number of processors
- $\alpha$ = fraction of the work that must be done sequentially

### Amdahl's Law (cont.)

$$S_{p} = \frac{T_{s}}{T_{p}}$$

$$S_{p} = \frac{T_{s}}{\alpha T_{s} + \frac{(1-\alpha)T_{s}}{P}}$$

$$S_{p} = \frac{P}{P\alpha + (1-\alpha)}$$

$$S_{p} = \frac{1}{\frac{1}{P} + (1-\frac{1}{P})\alpha}$$

$$\lim_{P \to \infty} S_{p} = \lim_{P \to \infty} \frac{1}{\frac{1}{P} + (1 - \frac{1}{P})\alpha} = \frac{1}{\alpha}$$

# Amdahl's Law (revisited)

$$S_{p} = \frac{1}{\frac{1}{p} + \left(1 - \frac{1}{p}\right)\alpha} \Rightarrow \lim_{p \to \infty} S_{p} = \frac{1}{\alpha}$$

• Using $\alpha$ as a function of the problem size $n$, where $\alpha(n) = \frac{1}{n}$, the speedup becomes

$$S_{p} = \frac{p}{1 + (p-1)\alpha(n)}, \qquad \lim_{n \to \infty} \frac{p}{1 + (p-1)\frac{1}{n}} = p$$

# An extension of Amdahl's Law in terms of a matrix multiplication equation (AX = Y).

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}$$

$$y_1 = a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + a_{14}x_4$$

$$y_2 = a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4$$

$$y_3 = a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4$$

$$y_4 = a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4$$

Compute each element of the result vector in parallel by partitioning the matrix.
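As a minimal sketch of this partitioning, the Python below hands each row of A to a separate worker, which computes one element of Y. The names `row_times_vector` and `parallel_matvec` and the sample values are hypothetical, chosen only for illustration. A thread pool stands in for the four CPUs; in CPython it shows the structure of the partitioning rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def row_times_vector(row, x):
    """Compute one element y_i as the dot product of row i of A with x."""
    return sum(a * b for a, b in zip(row, x))


def parallel_matvec(A, x, workers=4):
    """Compute y = A x with one task per row (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the rows of A.
        return list(pool.map(lambda row: row_times_vector(row, x), A))


# Illustrative 4x4 data (assumed values, not from the lecture).
A = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
x = [1, 0, 2, 1]

y = parallel_matvec(A, x)
print(y)
```

Each task needs only its own row of A plus the whole vector X, which is why every worker carries its own copy of X.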
| CPU | CPU | CPU | CPU |
|------------|------------|------------|------------|
| <b>A</b>1 | <b>A</b>2 | <b>A</b>3 | <b>A</b>4 |
| X | X | X | X |

Each CPU holds one partition of <b>A</b> together with the vector X.

This introduces the CRAY-1 as a vector processing architecture.

#### CRAY-1

#### **Functional Units**

| Instruction | Operands | Function |
|----------------------------|----------------------------|---------------------------------------------------|
| ADDV | V1, V2, V3 | V1 ← V2 + V3 |
| ADDSV (add scalar to vector) | V1, F0\*, V2 | V1 ← V2 + F0 |
| MULTV | V1, V2, V3 | V1 ← V2 \* V3 |
| LV (load vector) | V1, R1 | Load V1 from memory starting at address [R1] |
| SV (store vector) | R1, V1 | Store V1 into memory starting at address [R1] |

\* F0 is a floating point number.

NOTE: Each vector register (Vn) holds floating point numbers.

## **Timing**

A pipelined machine can initiate several instructions within one clock tick; these instructions then execute in parallel.

- Related Concepts:
  - Convoys
  - Chimes

### Convoy

The set of vector instructions that could potentially begin execution together in one clock period.

#### Example:

| Instruction | Operands | Function |
|--------|------------|----------------------------------|
| LV | V1, Rx | Load vector X |
| MULTSV | V2, F0, V1 | Vector-scalar multiplication |
| LV | V3, Ry | Load vector Y |
| ADDV | V4, V2, V3 | Add |
| SV | Ry, V4 | Store the result |

Note: MULTSV V2, F0, V1 || LV V3, Ry is an example of a convoy, where two independent instructions are initiated within the same chime.

#### Chime

- Not a specific amount of time, but rather a timing concept representing the number of clock periods required to complete a vector operation.
- A CRAY-1 chime is 64 clock periods.
- Note: A CRAY-1 clock cycle takes 12.5 ns.
- 5 chimes would take: 5 \* 64 \* 12.5 ns = 4000 ns

### Chime – Example #1

How many chimes will the vector sequence above take?

ANSWER: 4 chimes

```
1st chime: LV V1, Rx
2nd chime: MULTSV V2, F0, V1 || LV V3, Ry
3rd chime: ADDV V4, V2, V3
4th chime: SV Ry, V4
```

The 2nd chime is a convoy: MULTSV V2, F0, V1 and LV V3, Ry are independent, so both are initiated within the same chime.

#### Chime – Example #2

#### CRAY-1

For I ← 1 to 64: A[I] ← 3.0 \* A[I] + (2.0 + B[I]) \* C[I]

#### To execute this:

- 1st chime: $V_0 \leftarrow A$
- 2nd chime: $V_1 \leftarrow B$, $V_3 \leftarrow 2.0 + V_1$, $V_4 \leftarrow 3.0 * V_0$
- 3rd chime: $V_5 \leftarrow C$, $V_6 \leftarrow V_3 * V_5$, $V_7 \leftarrow V_4 + V_6$
- 4th chime: $A \leftarrow V_7$

Operations that use array values can be initiated immediately after those values have been loaded into vector registers.

## Chaining

Dynamically building a larger pipeline by increasing the number of stages.

### Chaining – Example #1

For J ← 1 to 64: C[J] ← A[J] + B[J]; D[J] ← F[J] \* E[J]; END

\* No chaining: the two statements are independent!

### Chaining – Example #2

#### Latency

# More Chaining and Storing Matrices

Thanks to Dusty Price

64 elements in sequence, with 8 time units per addition and 9 per multiplication:

$$T_s = 64 * (8 + 9) = 1088$$

#### Using Pipeline Approach...

With pipelining, the addition pipeline takes 8 units of time to fill and produce its first result; each unit of time after that produces another result:

$$T_{p+} = 8 + 63 = 71$$

The multiplication pipeline takes 9 units of time to fill, and produces another result after each additional unit of time:

$$T_{p*} = 9 + 63 = 72$$

The combination of the two:

$$T_p = T_{p+} + T_{p*} = 8 + 63 + 9 + 63 = 143$$

With chaining, the two pipelines act as a single pipeline of 8 + 9 = 17 stages, so the first result appears after 17 units of time and each unit of time after that produces another result:

$$T_c = 17 + 63 = 80$$

Review of time differences in the three approaches...
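- Sequential: $T_s = 17 * 64 = \boxed{1088}$
- Pipelining: $T_p = 8 + 63 + 9 + 63 = \boxed{143}$
- Chaining: $T_c = 17 + 63 = \boxed{80}$

#### Storing Matrices for Parallel Access (Memory Interleaving)

#### Matrix

| A<sub>11</sub> | A<sub>12</sub> | A<sub>13</sub> | A<sub>14</sub> |
|-----------------|-----------------|-----------------|-----------------|
| A<sub>21</sub> | A<sub>22</sub> | A<sub>23</sub> | A<sub>24</sub> |
| A<sub>31</sub> | A<sub>32</sub> | A<sub>33</sub> | A<sub>34</sub> |
| A<sub>41</sub> | A<sub>42</sub> | A<sub>43</sub> | A<sub>44</sub> |

#### 4 Memory Modules

| M<sub>1</sub> | M<sub>2</sub> | M<sub>3</sub> | M<sub>4</sub> |
|-----------------|-----------------|-----------------|-----------------|
| A<sub>11</sub> | A<sub>21</sub> | A<sub>31</sub> | A<sub>41</sub> |
| A<sub>12</sub> | A<sub>22</sub> | A<sub>32</sub> | A<sub>42</sub> |
| A<sub>13</sub> | A<sub>23</sub> | A<sub>33</sub> | A<sub>43</sub> |
| A<sub>14</sub> | A<sub>24</sub> | A<sub>34</sub> | A<sub>44</sub> |

One column of the matrix can be accessed in parallel, because its four elements lie in four different modules.

#### Storing the Matrix by Column...

#### Matrix

| A<sub>11</sub> | A<sub>12</sub> | A<sub>13</sub> | A<sub>14</sub> |
|-----------------|-----------------|-----------------|-----------------|
| A<sub>21</sub> | A<sub>22</sub> | A<sub>23</sub> | A<sub>24</sub> |
| A<sub>31</sub> | A<sub>32</sub> | A<sub>33</sub> | A<sub>34</sub> |
| A<sub>41</sub> | A<sub>42</sub> | A<sub>43</sub> | A<sub>44</sub> |

#### 4 Memory Modules

| M<sub>1</sub> | M<sub>2</sub> | M<sub>3</sub> | M<sub>4</sub> |
|-----------------|-----------------|-----------------|-----------------|
| A<sub>11</sub> | A<sub>12</sub> | A<sub>13</sub> | A<sub>14</sub> |
| A<sub>21</sub> | A<sub>22</sub> | A<sub>23</sub> | A<sub>24</sub> |
| A<sub>31</sub> | A<sub>32</sub> | A<sub>33</sub> | A<sub>34</sub> |
| A<sub>41</sub> | A<sub>42</sub> | A<sub>43</sub> | A<sub>44</sub> |

One row can be accessed in parallel with this storage technique.

Sometimes we need to access both rows and columns fast...

#### Matrix

| A<sub>11</sub> | A<sub>12</sub> | A<sub>13</sub> | A<sub>14</sub> |
|-----------------|-----------------|-----------------|-----------------|
| A<sub>21</sub> | A<sub>22</sub> | A<sub>23</sub> | A<sub>24</sub> |
| A<sub>31</sub> | A<sub>32</sub> | A<sub>33</sub> | A<sub>34</sub> |
| A<sub>41</sub> | A<sub>42</sub> | A<sub>43</sub> | A<sub>44</sub> |

#### 4 Memory Modules

By using a skewed matrix representation, we can now access each row and each column in parallel.

Sometimes we need access to the main diagonal as well as rows and columns...

At the cost of adding another memory module and some wasted space, we can now access the matrix in parallel by row, column, and main diagonal.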
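As a minimal sketch of skewed storage, assume element A[i][j] (0-based indices) is placed in module `(i + j) mod M`, where M is the number of memory modules. This mapping and the helper names `module_of`, `conflict_free`, and `report` are assumptions made for illustration, and the layout used in the lecture may differ. The check below reports which access patterns touch every module at most once, which is the condition for parallel access.

```python
N = 4  # the matrix is N x N


def module_of(i, j, num_modules):
    """Module holding A[i][j] under the assumed skew of one position per row."""
    return (i + j) % num_modules


def conflict_free(cells, num_modules):
    """True if no two of the given cells fall into the same memory module."""
    modules = [module_of(i, j, num_modules) for i, j in cells]
    return len(set(modules)) == len(modules)


rows = [[(i, j) for j in range(N)] for i in range(N)]
cols = [[(i, j) for i in range(N)] for j in range(N)]
diag = [(i, i) for i in range(N)]


def report(num_modules):
    """Summarize which access patterns are conflict-free for this module count."""
    return {
        "rows": all(conflict_free(r, num_modules) for r in rows),
        "columns": all(conflict_free(c, num_modules) for c in cols),
        "main diagonal": conflict_free(diag, num_modules),
    }


four = report(4)  # 4 modules: rows and columns only
five = report(5)  # 5 modules: rows, columns, and main diagonal
print(four)
print(five)
```

Under this assumed mapping, four modules place A[0][0] and A[2][2] in the same module, so the main diagonal conflicts. With five modules every diagonal element lands in a different module, and each stored row leaves one of the five slots unused, which is the wasted space mentioned above.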
#### **Program Transformation**

## Scalar Expansion

### **Loop Unrolling**

The loop

```
FOR I ← 1 TO n do
    X[I] ← A[I] * B[I]
ENDFOR
```

is replaced by its body, repeated once for each value of I:

```
X[1] ← A[1] * B[1]
X[2] ← A[2] * B[2]
X[3] ← A[3] * B[3]
...
X[n] ← A[n] * B[n]
```

## Loop Fusion or Jamming

Two loops over the same index range:

```
FOR I ← 1 TO n do
    X[I] ← Y[I] * Z[I]
ENDFOR
FOR I ← 1 TO n do
    M[I] ← P[I] + X[I]
ENDFOR
```

can be fused into one loop (a), and the intermediate array X can then be eliminated (b), provided X is not needed afterwards:

```
a) FOR I ← 1 TO n do
       X[I] ← Y[I] * Z[I]
       M[I] ← P[I] + X[I]
   ENDFOR

b) FOR I ← 1 TO n do
       M[I] ← P[I] + Y[I] * Z[I]
   ENDFOR
```
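The fusion steps can be sketched in Python to check that all three forms compute the same M. The function names `two_loops`, `fused_a`, and `fused_b` and the sample data are hypothetical, chosen only for illustration. Form (b) no longer produces X, so it applies only when X is not used later.

```python
def two_loops(Y, Z, P):
    """Original form: one loop computes X, a second loop computes M."""
    n = len(Y)
    X = [0.0] * n
    M = [0.0] * n
    for i in range(n):
        X[i] = Y[i] * Z[i]
    for i in range(n):
        M[i] = P[i] + X[i]
    return X, M


def fused_a(Y, Z, P):
    """Form (a): both statements share a single loop."""
    n = len(Y)
    X = [0.0] * n
    M = [0.0] * n
    for i in range(n):
        X[i] = Y[i] * Z[i]
        M[i] = P[i] + X[i]
    return X, M


def fused_b(Y, Z, P):
    """Form (b): X is eliminated and M is computed directly."""
    n = len(Y)
    M = [0.0] * n
    for i in range(n):
        M[i] = P[i] + Y[i] * Z[i]
    return M


# Illustrative data (assumed values).
Y = [1.0, 2.0, 3.0]
Z = [4.0, 5.0, 6.0]
P = [0.5, 0.5, 0.5]

X_ref, M_ref = two_loops(Y, Z, P)
X_a, M_a = fused_a(Y, Z, P)
M_b = fused_b(Y, Z, P)
print(M_ref, M_a, M_b)
```

Fusion is safe here because iteration I of the second loop reads only X[I], which the same iteration of the first loop has already written.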