Problem C
Parallel Analysis
                                                                                    
  Today’s computer architectures are moving away from single processing cores toward multiple cores. Due to this shift, many computer programmers are honing their parallel programming skills to take advantage of new hardware capabilities.
Writing efficient parallel programs can be difficult, but software tools can make the job easier for programmers. You have decided to write a program to help with the analysis of existing serial programs. Your program will take as input a trace of the executed instructions in a serial program and identify how fast it might run on a parallel platform with many cores.
Your program will measure the time to execute the trace on a shared-memory parallel architecture. Assume that every instruction executes in one clock tick, and that all cores share a single clock. The parallel execution time is measured from the beginning of execution until all instructions have finished. Thus, if a serial trace has $n$ executed instructions, it will take $n$ clock ticks to execute on a single core. With $p$ cores, execution time could be as short as $\lceil n/p\rceil $, or as long as $n$ clock ticks. The actual time will depend on how much parallelism is available in the trace – that is, how many instructions can be executed simultaneously.
Two instructions can be executed simultaneously if they do not have a data dependency. Suppose instruction $B$ occurs after instruction $A$ in the serial trace. Then $B$ has a data dependency on $A$ if $B$ reads from a memory address that $A$ writes to, and no intervening instruction writes to that memory address. This type of dependency is called read-after-write. If $B$ has such a data dependency on $A$, instruction $A$ must complete execution before $B$ begins execution. If two instructions do not share a dependency, they may be executed in any order. For now, we will ignore other types of dependencies like write-after-write and write-after-read, since they can be handled by memory renaming techniques.
Execution traces are described in a simplified format that includes just the memory references of each instruction. This format ignores the semantics of the instructions executed (e.g. add, subtract) and the data values. Each instruction reads zero or more values from memory, and produces exactly 1 value and writes it to memory.
Here is an example trace. The middle column, which would be the input to your program, shows the number of memory references followed by the references for each instruction. The read addresses are first, followed by the write address. The right column gives pseudocode to help explain what these references mean.
instruction memory references pseudocode ------------------------------------------------------- A 2 1 0 max <- x B 3 1 2 3 if (y > x) C 3 3 2 0 max <- y D 3 1 4 4 a <- a + x E 3 2 4 4 a <- a + y F 3 0 4 4 a <- a + max
In the trace given above, instruction B is not dependent on instruction A, so they can be executed in parallel (or out of order). Thus, if two cores were available, in one clock tick they could complete the first two instructions. Instructions D, E, and F must execute sequentially (due to dependencies involving memory address 4), but instruction D may start at any time. However, instruction F may not execute until instruction C has completed. Further investigation would yield other insights about data dependencies in this trace.
Input
Input contains multiple traces. The first line of input contains an integer $1 \le t \le 10$ indicating the number of traces that follow. Each trace starts with an integer $1 \leq n\leq 40\, 000$ indicating the length of the serial instruction trace. This will be followed by $n$ lines, each representing a single instruction in the serial execution order. Each line will start with an integer $1 \leq m \leq 20$, indicating the number of memory references for that instruction, followed by $m$ space-separated memory references for that instruction. The last reference for an instruction is always its write address. Every memory reference will be a decimal integer in the range $[0,2^{31}-1]$.
Output
For each trace, print out the trace number and the minimum parallel run time for that trace running on $p=n$ cores. Follow the format shown in the sample output.
| Sample Input 1 | Sample Output 1 | 
|---|---|
| 4 6 2 1 0 3 1 2 3 3 3 2 0 2 1 4 2 2 4 2 0 4 7 2 0 1 2 1 2 2 2 3 2 3 4 2 4 5 2 5 6 2 6 7 8 4 0 1 2 3 3 4 5 6 2 7 8 1 9 3 10 11 12 1 13 3 14 15 16 4 17 18 19 20 6 1 0 1 1 1 2 2 0 3 2 0 4 4 0 1 2 5 | 1 3 2 7 3 1 4 2 | 
