Concepts of Parallel and Distributed Processing 
Fall 2002
U.C.F.

This page is under construction and more or less out of date. I did add the Beowulf batch-mode files to the ray tracer section, though!

Erik Reinhard
Computer Science Department
University of Central Florida


reinhard@cs.ucf.edu
Thanks to Prof. Charles Hughes for creating these course notes. I have made only small changes (dates, assignments).

Erik.


Structure: TR 14:30-15:45, CSB 221; 28 class periods, each 75 minutes long.

Instructor: Erik Reinhard; CSB 250; 823-4289; reinhard@cs.ucf.edu
Office Hours:

Text: Gregory R. Andrews, Multithreaded, Parallel and Distributed Programming, Addison-Wesley, 2000.
Supplement: Kumar et al, Introduction to Parallel Computing, Benjamin-Cummings, 1994.

Prerequisites: COP3530 (CS3), COT3100 (Discrete Structures), CDA3103 (Computer Organization), COP3402 (Computer System Concepts).

Implementation Environments: You will use Java regularly. You do not need to be a Java expert, but you must be ready to learn the language and its environment quickly. Feel free to ask others for help with trivial problems involving Java syntax, Borland JBuilder, or the JDK. Later in the term you may use a Beowulf system, which runs the Linux operating system and is programmed in C. You do not need to be a UNIX internals expert, but you do need to develop a reasonable level of comfort. Again, feel free to ask others for help with trivial problems involving C syntax or UNIX commands. If you have serious problems, come see me.

Assignments: 4 to 6 small to moderate programming assignments (some are multi-part) using a variety of parallel and distributed programming paradigms. Around 4 non-programming assignments. One large project using a combination of distributed paradigms.

Exams: One quiz, a midterm and a final.

Material: From Andrews: Chapters 1 through 10, plus parts of 11 and 12. Class notes. Java Overheads. Web tutorials on Java: (http://www.javasoft.com/docs/books/tutorial/index.html).

Important Dates (Quiz Dates are Subject to Change): Quiz#1-- September 24; MidTerm -- October 3; Withdraw Deadline -- October 11; Final -- at end of course.

Evaluation (Tentative):
One Quiz -- 60 points
Mid Term -- 90 points
Final Exam -- 150 points
Assignments -- 300 points
Total Available: 600
Grading will be A 90%+, B 80%+, C 70%+, D 50%+, F below 50%

Resources:
Free Software IDEs: JBuilder 7
Java Tutorial
Java at Sun
Java Madness (help with JBuilder)
Information about MPI - Message Passing Interface
MPI Resources
A book on MPI


Ray tracer

Sequential ray tracer for programming assignment

After unzipping and untarring this file, type 'make' to compile and 'run' to render an example image. There is some documentation in the files 'run' and 'balls.txt'. The file 'datastructure.h' contains most of the important data structures and is partially commented. The function 'main' is located in 'ray.c', and the loop over all pixels is located in 'trace.c'.

New: Using the Beowulf cluster

Thanks to Matt O'Connor, here are a few files that may help you run your program on the Beowulf cluster. You will have to adapt these scripts to your own situation.

The Batch script will call your run script in batch mode. The first two lines redirect standard error and standard output. It currently calls the run script with a request for 27 processors.

The Run script takes the number of processors on the command line, and calls mpirun. It uses the -machinefile flag to specify which nodes may be used. A useful machine file can be found here (the nodes that are currently not available are commented out in this file).

To run this ensemble of files in batch mode, you type 'qsub batch1'. To view the status of your batch jobs, you may type 'qstat'.

He has also figured out a way to kill hanging processes. Hope this helps!

Beowulf programming assignment 1.

Familiarize yourself with the code and with rendering images. Then, using MPI, create two processes. The first process renders an image; the second process displays the image on your screen. The displaying process is the master: it first tells the other process to start tracing an image, and when the image is complete it receives all pixels from the rendering process and displays the resulting image on screen. Later, we will parallelize the render process.

Beowulf assignments 2 - 4.

These were discussed in class. If you can't remember what they were, please come and see me.


Week#1: (8/20, 8/22) -- Chapters 1 and 2 from Andrews
  1. Concurrent Programming Concepts
  2. Introduction, even-odd transposition algorithm, analysis
  3. Concepts of analysis of parallel algorithms
  4. SIMD solution to even/odd transposition sort
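
As a rough illustration of the even-odd transposition sort listed above, the following Java sketch simulates the algorithm sequentially. The class and method names are invented for this example and do not come from the course notes. Within one phase the compare-exchange steps touch disjoint pairs, which is why a SIMD machine with one processor per element could carry out each phase in a single step.

```java
import java.util.Arrays;

final class EvenOddTranspositionSort {
    /** Returns a sorted copy of the input, using n phases for n elements. */
    static int[] sort(int[] input) {
        int[] a = Arrays.copyOf(input, input.length);
        int n = a.length;
        for (int phase = 0; phase < n; phase++) {
            // Even phases compare pairs (0,1), (2,3), ...;
            // odd phases compare pairs (1,2), (3,4), ...
            // The pairs are disjoint, so a parallel machine could do them all at once.
            for (int i = phase % 2; i + 1 < n; i += 2) {
                if (a[i] > a[i + 1]) {
                    int tmp = a[i];
                    a[i] = a[i + 1];
                    a[i + 1] = tmp;
                }
            }
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sort(new int[] {5, 1, 4, 2, 3})));
    }
}
```

With n processors the parallel version takes n steps, so its cost (processors times time) is n^2.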


    Assignment #1:

      Problem 1.5 from text.
    Due: Tuesday, Week#3 (September 3)

Week#2: (8/27, 8/29) -- Chapters 1 and 2 from Andrews
  1. Taxonomies (control, address space, interconnection network, granularity)
  2. Taxonomies (control)
  3. Taxonomies (address space)
  4. Taxonomies (interconnection network)

Week#3: (9/3, 9/5)
  1. More on static networks
  2. Routing in static networks
  3. Reduction (all to one), broadcast (one to all), all to all on hypercube
  4. Granularity (BSP model)
  5. PRAM Model
  6. Programming Styles
  7. Example of Max in book's notation
  8. State, history, properties
  9. Notation for concurrency
  10. Programming Assignment #1:
      Using Quicksort as a model, develop a multi-threaded Java program that solves the sorting problem.
      Turn in a zip file with your project (be sure that the zip keeps all directory structures) and a write-up. You must give reasons for your choice of programming style (iterative or recursive parallelism, producer/consumer, client/server or communicating peers) and you must analyze your algorithm. Your analysis must include time, cost, work, cost efficiency and work efficiency.
    Due: September 24

Week#4: (9/10, 9/12)
  1. Critical reference
  2. Locks and Barriers
  3. Axiomatic Semantics (very brief)
  4. Fairness
  5. SpinLocks
  6. TicTacToe: An Example of Java Concurrency
  7. Java Support for Concurrency
  8. Locks and Our First Two Java Programs
  9. Fair Solutions
  10. Barrier Synchronization
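
A minimal sketch of barrier synchronization, assuming Java 8 or later and the library class java.util.concurrent.CyclicBarrier rather than the barrier constructions developed in the text. BarrierDemo and its run method are hypothetical names. Each worker records its progress, waits at the barrier, and then checks that no other worker is still behind in an earlier phase.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

final class BarrierDemo {
    /** Runs the workers through the phases and returns how many stale reads were seen. */
    static int run(int workers, int phases) {
        CyclicBarrier barrier = new CyclicBarrier(workers);
        AtomicIntegerArray progress = new AtomicIntegerArray(workers);
        AtomicInteger violations = new AtomicInteger();
        Thread[] threads = new Thread[workers];
        for (int id = 0; id < workers; id++) {
            final int me = id;
            threads[id] = new Thread(() -> {
                try {
                    for (int phase = 1; phase <= phases; phase++) {
                        progress.set(me, phase); // this worker's contribution
                        barrier.await();         // wait until every worker has arrived
                        for (int j = 0; j < workers; j++) {
                            // After the barrier, nobody may still be in an earlier phase.
                            if (progress.get(j) < phase) {
                                violations.incrementAndGet();
                            }
                        }
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[id].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return violations.get();
    }
}
```

Calling BarrierDemo.run(4, 5) should report zero violations; removing the await call makes stale reads possible.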


    Assignment #2:

      Problem 2.16 from the book.
    Due: September 19

    Programming Assignment #2:

      Problem 2.35(b) from Andrews' book, in multi-threaded Java. Please hand in a zip file and a write-up, as in the previous programming assignment. What are the time, cost, work, cost efficiency and work efficiency of your solution? Is your program safe?
    Due: October 8

Week#5: (9/17, 9/19)
  1. Data Parallel
  2. Semaphores
  3. Race conditions in EOSort
  4. Counter Example
  5. Hangman's Noose
  6. Multiplayer Hangman


    Assignment #3:
            Problems 3.2 and 3.4 from Andrews text
    Due: September 26


Week#6: (9/24, 9/26)
Quiz#1 on 9/24
  1. Skipping 4.5 and 4.6 (done in OS course)
  2. Single lane bridge problem
  3. Monitor overview
  4. Single lane bridge problem using monitors (Chapter 5)
  5. Paths -- declarative approach to concurrency control
  6. Bitonic Sort
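
The bitonic sort named in the last item can be sketched as follows. This is a sequential simulation in Java, assuming the input length is a power of two; the class name is made up for illustration. For fixed k and j, all compare-exchange operations in the innermost loop are independent, so a parallel machine could perform them in one step.

```java
import java.util.Arrays;

final class BitonicSortSketch {
    /** Returns a sorted copy; the length must be a power of two (or zero). */
    static int[] sort(int[] input) {
        int n = input.length;
        if ((n & (n - 1)) != 0) {
            throw new IllegalArgumentException("length must be a power of two");
        }
        int[] a = Arrays.copyOf(input, n);
        // k is the size of the bitonic sequences being merged in this stage.
        for (int k = 2; k > 0 && k <= n; k <<= 1) {
            // j is the distance between the two elements of each compared pair.
            for (int j = k >> 1; j > 0; j >>= 1) {
                for (int i = 0; i < n; i++) {
                    int partner = i ^ j;
                    if (partner > i) {
                        // Blocks alternate between ascending and descending order.
                        boolean ascending = (i & k) == 0;
                        boolean outOfOrder = ascending ? a[i] > a[partner] : a[i] < a[partner];
                        if (outOfOrder) {
                            int tmp = a[i];
                            a[i] = a[partner];
                            a[partner] = tmp;
                        }
                    }
                }
            }
        }
        return a;
    }
}
```

The two outer loops run lg n (lg n + 1) / 2 times in total, which is the number of parallel compare-exchange steps with n/2 comparators available.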
Assignment #4:
        Look at slides 97 and 98 that deal with monitor solutions for the one-way bridge problem. Analyse these with respect to non-exclusion and deadlock.
Due: October 1

Week#7: (10/1, 10/3)
  1. Bitonic Sort
  2. Monitors from chapter 5 in text
  3. flip atomic action with n processes
  4. Discussion of Quiz#1
  5. Topics and Promises for MidTerm
    MidTerm on October 3

Week#8: (10/8, 10/10)
  1. Distributed Computing Paradigms
  2. Concurrent Objects
  3. Remote Method Invocation (Related to RPC from Chapters 8 and 10 in Text)
  4. Tuple Space


    Note: Last Day to Withdraw is October 11.
     


Week#9: (10/15, 10/17)
  1. Java's Solution:
  2. RMI Examples
  3. Tuple Space Implementations
  4. Sun JavaSpace
  5. IBM TSpaces
  6. Bid.com in TSpaces
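
To make the tuple space idea concrete, here is a toy, single-JVM sketch in Java. It is not the JavaSpaces or TSpaces API; the class ToyTupleSpace and its Linda-style operations out, in and rd are assumptions made for this example, with null fields in a template acting as wildcards.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class ToyTupleSpace {
    private final List<Object[]> tuples = new ArrayList<>();

    /** Adds a tuple to the space and wakes any waiting readers. */
    synchronized void out(Object... tuple) {
        tuples.add(tuple.clone());
        notifyAll();
    }

    /** Blocks until a tuple matches the template, then removes and returns it. */
    synchronized Object[] in(Object... template) {
        return awaitMatch(template, true);
    }

    /** Blocks until a tuple matches the template; returns a copy and leaves it in place. */
    synchronized Object[] rd(Object... template) {
        return awaitMatch(template, false);
    }

    synchronized int size() {
        return tuples.size();
    }

    // Only called from synchronized methods, so the monitor lock is held here.
    private Object[] awaitMatch(Object[] template, boolean remove) {
        while (true) {
            for (Iterator<Object[]> it = tuples.iterator(); it.hasNext();) {
                Object[] candidate = it.next();
                if (matches(template, candidate)) {
                    if (remove) {
                        it.remove();
                    }
                    return candidate.clone();
                }
            }
            try {
                wait(); // released by notifyAll() in out()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a tuple", e);
            }
        }
    }

    private static boolean matches(Object[] template, Object[] tuple) {
        if (template.length != tuple.length) {
            return false;
        }
        for (int i = 0; i < template.length; i++) {
            if (template[i] != null && !template[i].equals(tuple[i])) {
                return false;
            }
        }
        return true;
    }
}
```

A bidder could, for example, call out("bid", "lamp", 40) while an auctioneer blocks in in("bid", "lamp", null) until such a tuple appears.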

    Week#10: (10/22, 10/24)
    1. More analysis of sort algorithms + Brent's scheduling
      • Specifically look at max and sum algorithms and try to determine when we are using an appropriate number of processors.
      • Measuring the number of processors that can be used while keeping constant, i.e. Θ(1), efficiency. For the Max or Sum algorithm, Tp = O(N/p) + O(lg p), so the cost is p Tp = O(N + p lg p).
      • So E = O(N) / O(N + p lg p) = 1/(1 + (p lg p)/N).
      • This is Θ(1) if p = N/lg N (Brent's scheduling), since then p lg p <= N.
    2. Accelerated Cascading
      • Review of analysis of binary tree reduction and CRCW max
      • Concept of doubly log tree
      • CRCW max and doubly log tree
      • Analysis of doubly log tree (T=lg lg N; W=N lg lg N)
      • Idea of using binary tree reduction to reduce problem size and CRCW max to pick up speed at end
      • Formal analysis
      • Reduce for lg lg lg N steps (T<=lg lg N; W<=N; Remaining Size=N/lg lg N)
      • Use CRCW to complete (T=lg lg N; W=N)
      • Combined algorithm is reasonably fast (lg lg N) and work efficient (N)
    3. Virtualizing sorts
      • Tree (merge sort)
      • Linear array
      • Bitonic
    4. Oblivious Comparison Exchange Sorts
      • Proof of correctness for 0-1 data implies correctness for all data (the 0-1 principle)
      • Correctness of Even-Odd Transposition Sort
    5. Analysis of Sorts
      • ShearSort on Maspar
      • Oblivious Compare Exchange (OCE) sorts
      • Shear Sort and RevSort
      • Order, Cost, Work, Cost Efficiency, Work Efficiency.
      • Discuss the general principle of "getting out of the way" employed in Shearsort.
      • Extend this notion to the technique used in Revsort. Note this is not a snake sort like shear.
      • Note that Revsort is not a sort. It just gets close (within 8 rows of being right).
      • Revsort gets there fast. It cuts number of dirty rows, not in halves, but to square root of current number of dirty ones.
        Do a formal proof of ShearSort correctness and timing. (Detailed Proof)
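
A sequential Java sketch of Shearsort may help when working through the proof. It assumes a rectangular mesh stored as an int[][] and uses Arrays.sort in place of the per-row and per-column parallel sorts a real mesh would run; the class and method names are invented for this example. Rows are sorted in alternating directions (snake order), and the number of dirty rows at least halves with each row-then-column pair of phases.

```java
import java.util.Arrays;

final class ShearSortSketch {
    /** Sorts the mesh in place into snake order (row 0 left to right, row 1 right to left, ...). */
    static void sort(int[][] mesh) {
        int rows = mesh.length;
        if (rows == 0) {
            return;
        }
        int cols = mesh[0].length;
        // ceil(lg rows) column phases, interleaved with ceil(lg rows) + 1 row phases.
        int columnPhases = 32 - Integer.numberOfLeadingZeros(rows - 1);
        for (int phase = 0; phase <= columnPhases; phase++) {
            for (int r = 0; r < rows; r++) {
                Arrays.sort(mesh[r]);
                if (r % 2 == 1) {
                    reverse(mesh[r]); // odd rows run right to left
                }
            }
            if (phase == columnPhases) {
                break; // the final phase is a row phase only
            }
            int[] column = new int[rows];
            for (int c = 0; c < cols; c++) {
                for (int r = 0; r < rows; r++) {
                    column[r] = mesh[r][c];
                }
                Arrays.sort(column);
                for (int r = 0; r < rows; r++) {
                    mesh[r][c] = column[r];
                }
            }
        }
    }

    /** Reads the mesh out in snake order. */
    static int[] snakeOrder(int[][] mesh) {
        int rows = mesh.length;
        int cols = rows == 0 ? 0 : mesh[0].length;
        int[] out = new int[rows * cols];
        int k = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                out[k++] = mesh[r][r % 2 == 0 ? c : cols - 1 - c];
            }
        }
        return out;
    }

    private static void reverse(int[] a) {
        for (int i = 0, j = a.length - 1; i < j; i++, j--) {
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }
}
```

On a 4 x 4 mesh this performs three row phases and two column phases.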
      Assignment #6:
        Analyze fast/efficient algorithm if binary tree reduction cutoff is:
          lg lg lg lg N
          lg lg N
          lg N
        See which are fast and/or efficient. Do precise analysis
      Due: Tuesday of Week#12, November 5.

    Week#11: (10/29, 10/31)
    1. Brief discussion of fork and join from chapter 6 in text
      • Unix fork and wait
      • Implementing monitors via semaphores
    2. Message Passing from chapter 7 in text
      • channels: send (non-blocking); receive (blocking)
      • simple channel examples: char-to-line; sorting network
      • client server examples
        • one op; multi-ops; condition variables
        • resource allocator
        • disk server
        • file server
        • centralized vs symmetric vs ring reduction algorithms
        • synchronous message passing and deadlock
    3. PCN (Program Composition Notation)
    4. Parallel Logic Programming (Prolog, CLP(R), Strand)
    5. CSP; reasoning through CSP; Modern CSP
      • synchronous communication; guarded communication
      • dest ! port(expression list); source ? port(parameter list)
        • blocks until match of sender and receiver
      • Guarded communication (succeed, block, fail)
        • if B1 ; C1 -> S1; [] B2 ; C2 -> S2; ... fi
        • do B1 ; C1 -> S1; [] B2 ; C2 -> S2; ... od
        • all fail (no effect; we're done)
        • one or more successful (choose one non-det.)
        • all block (wait for one to succeed)
      • Sieve of Eratosthenes
      • CSP and event-based reasoning
        ParaDOS style reasoning
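
The Sieve of Eratosthenes mentioned above is often written as a pipeline of filter processes. The sketch below approximates CSP-style synchronous communication in Java 8 or later with java.util.concurrent.SynchronousQueue, whose put blocks until a matching take. This is an assumption chosen for illustration, not the notation used in the text, and the class name and the END sentinel are invented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.SynchronousQueue;

final class SievePipeline {
    private static final int END = -1; // sentinel that shuts the pipeline down

    /** Returns the primes up to limit, found by a chain of filter threads. */
    static List<Integer> primesUpTo(int limit) {
        List<Integer> primes = Collections.synchronizedList(new ArrayList<>());
        SynchronousQueue<Integer> first = new SynchronousQueue<>();
        Thread head = startFilter(first, primes);
        try {
            for (int n = 2; n <= limit; n++) {
                first.put(n); // rendezvous with the first filter
            }
            first.put(END);
            head.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sieving", e);
        }
        return new ArrayList<>(primes);
    }

    // Each filter keeps the first number it receives (a prime) and
    // forwards only the non-multiples of that prime to its successor.
    private static Thread startFilter(SynchronousQueue<Integer> in, List<Integer> primes) {
        Thread t = new Thread(() -> {
            try {
                int p = in.take();
                if (p == END) {
                    return;
                }
                primes.add(p);
                SynchronousQueue<Integer> out = new SynchronousQueue<>();
                Thread next = startFilter(out, primes);
                for (int v = in.take(); v != END; v = in.take()) {
                    if (v % p != 0) {
                        out.put(v);
                    }
                }
                out.put(END);
                next.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }
}
```

The pipeline creates one thread per prime found plus one idle tail filter, so it is meant for small limits only.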
        Assignment #7:
          Stable Marriage Problem (7.15 (a) in text)
        Due: Tuesday of Week#13, November 12.

      Week#12: (11/5, 11/7)
      1. MPI
        • Library of commands
      2. Broadcasting on various architectures
        • One-to-all SF on ring, mesh and hypercube
          Ring (ts + tw m) ⌈p/2⌉
          Mesh 2(ts + tw m) ⌈√p/2⌉
          3d Mesh 3(ts + tw m) p^(1/3)/2
          HyperCube (ts+ tw m) lg p
          All-to-one variation - reverse order and direction, accumulate messages
          All-to-one can be used for sum and other associative operations
        • One-to-all CT on ring, mesh and hypercube
          Ring (ts + tw m) lg p + th(p–1)
          Mesh (ts + tw m) lg p + 2th(√p–1)
          HyperCube (ts + tw m) lg p -- no gain! (why?)
      3. All-to-All Broadcasting on various architectures
        • All-to-all SF on ring, mesh and hypercube
          Ring (ts + tw m) (p - 1)
          Mesh 2 ts (√p - 1) + tw m (p - 1)
          HyperCube ts lg p + tw m (p - 1) requires bi-directional links
          Note lower bound of tw m (p - 1)
        • Reduction on hypercube
          (ts + tw ) lg p
          Prefix Sums as extension to reduction
        • One-to-all SF personalized on ring, mesh and hypercube
          Ring (ts + tw m) (p - 1)
          Mesh 2 ts (√p - 1) + tw m (p - 1)
          HyperCube ts lg p + tw m (p - 1) requires bi-directional links
          Note lower bound of tw m (p - 1)
        • All-to-all SF personalized
          Ring (ts+ tw m p/2) (p - 1) the ubiquitous sum of i goes from 0 to p-1
          Mesh (2 ts + tw m p) (√p - 1)
          HyperCube (ts + tw m p/2) lg p
          Note lower bound of tw m (p - 1)
        • All-to-all CT personalized
          HyperCube (ts + tw m) (p–1) + (th/2) p lg p
        • Circular q-shift SF
          Ring (ts + tw m) * ⌊p/2⌋
          Mesh (ts + tw m) * (2⌊√p/2⌋ + 1)
          Hypercube (ts + tw m) * (2 lg p – 1)
        • Circular q-shift CT
          Hypercube ts + tw m + th ( lg p – g(q))
          g(q) is the number of times 2 divides q. lg p – g(q) is longest path for a circular q-shift
        • Note lower bound tw m (p - 1) for all-all, one-all personalized, all-all personalized
      4. SF Broadcasting
           
          Operation                Ring                     2d Mesh                       Hypercube
          One-to-all               (ts + tw m) ⌈p/2⌉        2(ts + tw m) ⌈√p/2⌉           (ts + tw m) lg p
          All-to-all               (ts + tw m) (p–1)        2 ts (√p–1) + tw m (p–1)      ts lg p + tw m (p–1)
          One-to-all personalized  (ts + tw m) (p–1)        2 ts (√p–1) + tw m (p–1)      ts lg p + tw m (p–1)
          All-to-all personalized  (ts + tw m p/2) (p–1)    (2 ts + tw m p) (√p–1)        (ts + tw m p/2) lg p
          Circular q-shift         (ts + tw m) ⌊p/2⌋        (ts + tw m) (2⌊√p/2⌋ + 1)     (ts + tw m) (2 lg p – 1)
      5. Cut Through Broadcasting
           
          Operation                Ring                          2d Mesh                          Hypercube
          One-to-all               (ts + tw m) lg p + th (p–1)   (ts + tw m) lg p + 2 th (√p–1)   -
          All-to-all personalized  -                             -                                (ts + tw m) (p–1) + (th/2) p lg p
          Circular q-shift         -                             -                                ts + tw m + th (lg p – g(q))
          g(q) is the number of times 2 divides q; lg p – g(q) is the longest path for a circular q-shift.
      Program Flow Analysis (Related to Chapter 12.2 in text)
      6.  Control Flow
        • Basic terminology
        • Control vs data flow
        • Inter vs intraprocedural analysis
        • Program flowgraph
        • Basic blocks
        • Domination
        • Loop extraction
        • Depth first ordering (reverse postorder)
        • Categorizing arcs (forward, back, cross)
      7. Data Flow
        • Data flow analysis
              Notation including May/Must, Forward/Backward Flow
        • Reaching Definitions Algorithm
      8. Flow Analysis and Parallelizing Code
        • Scalar data dependence
          • true, anti and output dependencies
        • Vector data dependence
        • Diophantine analysis
        • GCD Test
          • Vectorizing loops
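
The GCD test listed above can be sketched in a few lines of Java. The notation is an assumption made for this example: one array reference has subscript a*i + b and the other c*j + d, with integer constants a, b, c, d and loop indices i, j. The two references can touch the same element only if a*i - c*j = d - b has an integer solution, which requires gcd(a, c) to divide d - b. The test ignores loop bounds, so a "may depend" answer is conservative.

```java
final class GcdTest {
    /** Greatest common divisor of |x| and |y|; gcd(0, 0) is 0. */
    static int gcd(int x, int y) {
        x = Math.abs(x);
        y = Math.abs(y);
        while (y != 0) {
            int t = x % y;
            x = y;
            y = t;
        }
        return x;
    }

    /**
     * Returns false only when references A[a*i + b] and A[c*j + d]
     * can never touch the same element, whatever the loop bounds are.
     */
    static boolean mayDepend(int a, int b, int c, int d) {
        int g = gcd(a, c);
        if (g == 0) {
            return b == d; // both subscripts are constants
        }
        return (d - b) % g == 0;
    }

    public static void main(String[] args) {
        // A[2*i] written, A[2*i + 1] read: even versus odd elements.
        System.out.println(mayDepend(2, 0, 2, 1));
        // A[4*i + 1] written, A[2*j + 3] read.
        System.out.println(mayDepend(4, 1, 2, 3));
    }
}
```

When mayDepend returns false, the loop carries no dependence between the two references and is a candidate for vectorization with respect to them.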


      Week#13: (11/12, 11/14)
      1. Running RMI Examples
        • The Bid.com example
          • You may need to change my paths to match your arrangement of directories
            • I assume you are placing Bid.com directly under the root (C:\Bid.com)
              • This is important for http paths; if you do otherwise, change the root path in 1_https.bat
            • I assume the JDK bin directory is on your executable path (used in all bats except 2_gormi.bat)
            • I assume Jini is in the directory \jini1_1 (only needed in 1_https.bat)
          • run 1_https.bat; 2_gormi.bat
          • You can run the server from bidServer via bat file or in JBuilder
          • You can run the clients from bidClient via three bat files or three projects in JBuilder
          • You can run a pretty reasonable working version of the assignment from the Working subdirectory
            • This uses a different bid service name, so both examples can run simultaneously
            • There is no attached source
      2. Flow Analysis and Parallelizing Code
        • Transformations Used to Parallelize Code
      3. Parallelizing Graph Algorithms
        • Greedy algorithms
        • Spanning trees
        • Minimum spanning tree (Prim's Algorithm)
          • alternate data structures for adjacency (N^2 versus E lg N)
        • Block Striped Partitioning
        • Analysis of Prim's using p processors
              computation cost N^2/p
              communication cost
                  hypercube N lg p
                  mesh N √p
      4. Scheduling Problems
        • Basic definition
        • Motivation for scheduling problems
              scheduling of independent tasks and bin packing
              heuristics versus perfect scheduling
        • Examples with two processors, no dependencies
        • Anomalous behaviors with precedence graphs
        • NP problems
        • NP Completeness
        • How bad can anomalies be?
        • How bad can a bad schedule be?
        • Fast scheduling algorithm for unit execution trees
        • Adaptations for forests and anti-trees
        • Discussion of fast DAG scheduling with 2 processors
          • Union-Find algorithm (amortization)
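
For reference, a common form of the Union-Find structure is sketched below in Java, assuming union by rank and path halving (one variant of path compression). These are standard choices but not necessarily the exact variant presented in class, and the class name is invented.

```java
final class UnionFind {
    private final int[] parent;
    private final int[] rank;

    UnionFind(int n) {
        parent = new int[n];
        rank = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i; // every element starts in its own set
        }
    }

    /** Returns the representative of x's set, shortening the path as it goes. */
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }

    /** Merges the sets of a and b; returns false if they were already the same set. */
    boolean union(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra == rb) {
            return false;
        }
        if (rank[ra] < rank[rb]) {
            int tmp = ra;
            ra = rb;
            rb = tmp;
        }
        parent[rb] = ra; // attach the shallower tree under the deeper one
        if (rank[ra] == rank[rb]) {
            rank[ra]++;
        }
        return true;
    }
}
```

Individual find calls can still follow long paths, but the amortized cost per operation over a long sequence is nearly constant, which is where the amortization argument comes in.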

      Week#14: (11/19, 11/21)
      1. Showtime for Projects

      Week#15: (11/26)
      1. Showtime for Projects
      2. Discussion and Return of Assignments
      3. Review for Final Exam

      Final Exam:

      © Charles Hughes, Erik Reinhard, reinhard@cs.ucf.edu -- Last Modified September 2, 2002