Concepts of Parallel and Distributed Processing, Fall 2002
This page is under construction and more-or-less out of date. I
did add the beowulf batch mode files to the ray tracer section,
though!
Thanks to Prof. Charles Hughes for creating these course notes. I have made only small changes (dates, assignments). Erik.
Structure: TR 14:30-15:45, CSB 221; 28
class periods, each 75 minutes long.
Instructor: Erik Reinhard; CSB 250; 823-4289; reinhard@cs.ucf.edu
Office Hours:
Text: Gregory R. Andrews, Multithreaded, Parallel and Distributed
Programming, Addison-Wesley, 2000.
Supplement: Kumar et al,
Introduction to Parallel Computing, Benjamin-Cummings, 1994.
Prerequisites: COP3530 (CS3), COT3100 (Discrete Structures), CDA3103
(Computer Organization), COP3402 (Computer System Concepts).
Implementation Environments: You will be regularly using Java. You do
not need to be a Java expert, but you must be ready to learn the language and
the environment quickly. Feel free to ask others for guidance to quickly solve
trivial problems with Java syntax or with using Borland JBuilder and the JDK.
Later in the term you may be using a Beowulf system. This will use the Linux
operating system and the C programming language. You do not need to be a UNIX
internals expert, but you do need to develop a reasonable level of comfort. Again, feel
free to ask others for guidance to quickly solve trivial problems with C syntax
or with using UNIX commands. If you have serious problems, come see me.
Assignments: 4 to 6 small to moderate programming assignments (some
are multi-part) using a variety of parallel and distributed programming
paradigms. Around 4 non-programming assignments. One large project using a
combination of distributed paradigms.
Exams: One quiz, a midterm and a final.
Material: From Andrews: Chapters 1 through 10, plus parts of 11 and 12. Class
notes. Java Overheads. Web tutorials on Java (http://www.javasoft.com/docs/books/tutorial/index.html).
Important Dates (Quiz Dates are Subject to Change): Quiz#1 --
September 24; MidTerm -- October 3; Withdraw Deadline -- October 11;
Final -- at end of course.
Evaluation (Tentative):
One Quiz -- 60 points
Mid Term -- 90 points
Final Exam -- 150 points
Assignments -- 300 points
Total Available: 600 points
Grading will be A-90%+, B-80%+, C-70%+, D-50%+, F-below 50%
Resources:
Free Software IDEs:
JBuilder 7
Java Tutorial
Java at Sun
Java Madness (help with JBuilder)
Information about MPI - Message Passing Interface
MPI Resources
A book on MPI
Ray tracer
Sequential ray tracer for programming assignment
After unzipping and untarring this file, type 'make' to compile and
'run' to render an example image. There is some documentation in the
files 'run' and 'balls.txt'. The file 'datastructure.h' contains most
of the important data structures and is commented to some extent.
The function 'main' is located in 'ray.c', and the loop over all pixels
is located in 'trace.c'.
New: Using the beowulf cluster
Thanks to Matt O'Connor, here are a few files that may help you run your
program on the beowulf cluster. You will have to adapt these scripts
to your own situation.
The Batch script will call your run script in
batch mode. The first two lines redirect standard error and standard
output. It currently calls the run script with a request for 27
processors.
The Run script takes the number of processors on the
command line, and calls mpirun. It uses the -machinefile flag to
specify which nodes may be used. A useful machine file can be found here (the nodes that are currently not available
are commented out in this file).
To run this ensemble of files in batch mode, you type 'qsub batch1'.
To view the status of your batch jobs, you may type 'qstat'.
He has also figured out a way to kill hanging processes.
Hope this helps!
Beowulf programming assignment 1.
Familiarize yourself with the code and rendering images. Then, using
MPI, create two processes. The first process renders an image. The
second process displays the image on your display. The process that
displays the image will be the master. It will first tell the other
process that it should start tracing an image. When the image is
completed, the master process will receive all pixels from the
rendering process and will display the resulting image on
screen. Later, we will parallelize the render process.
Beowulf assignments 2 - 4.
These were discussed in class. If you can't remember what they were,
please come and see me.
Week#1: (8/20, 8/22) -- Chapters 1 and 2 from Andrews
- Concurrent Programming Concepts
- Introduction, even-odd transposition algorithm, analysis
- Introduce basic concepts of concurrency and some of the course's goals.
- Introduce sorting on a linear array and on a tree. In particular,
present a pretty detailed description of the even-odd transposition sort on
an array of processors. Discuss the theoretical bounds on sorting with such
an architecture. Discuss what happens when you have a ring, not just an array
of processors. Discuss the notion of oblivious comparison exchange sorting
as a forward pointer to later proofs of correctness.
- Concepts of analysis of parallel algorithms
- Architectural considerations -- synchronous versus asynchronous; barrier
synchronization; centralized control
- Issues of communication and coordination in parallel and distributed
implementations
- Present notions of Time, Cost, Speedup, Work, Cost Efficiency and Work
Efficiency.
- Virtualizing an algorithm -- focus on even-odd transposition.
- Look at scalability of even-odd transposition sort.
- SIMD solution to even/odd transposition sort
- Discussion of implications of this architecture
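The even-odd transposition sort discussed this week can be sketched sequentially as follows. This is only a minimal illustration, assuming one value per (virtual) processor; the class name `EvenOddTransposition` is hypothetical, and the inner loop stands in for compare-exchanges that would all happen simultaneously on a linear array of processors.

```java
class EvenOddTransposition {
    // Sorts a in place; a[i] plays the role of the value held by processor i.
    static void sort(int[] a) {
        int n = a.length;
        // n phases are enough to sort n values on a linear array.
        for (int phase = 0; phase < n; phase++) {
            // Even phases pair (0,1), (2,3), ...; odd phases pair (1,2), (3,4), ...
            // On a real machine every pair in a phase is handled in parallel.
            for (int i = phase % 2; i + 1 < n; i += 2) {
                if (a[i] > a[i + 1]) {
                    int tmp = a[i];
                    a[i] = a[i + 1];
                    a[i + 1] = tmp;
                }
            }
        }
    }
}
```

With n processors each phase takes constant time, so the parallel time is O(n) and the cost (processors times time) is O(n^2), which is why the sort is not cost efficient compared with an O(n lg n) sequential sort.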
Assignment #1:
Problem 1.5 from text.
Due: Tuesday, Week#3 (September 3)
Week#2: (8/27, 8/29) -- Chapters 1 and 2 from Andrews
- Taxonomies (control, address space, interconnection network,
granularity)
- Taxonomies (control)
- SISD, SIMD and MIMD. Data-parallel versus task parallel. Data parallel's
relation to SIMD.
- Taxonomies (address space)
- Regarding address-space organization, differentiate between private
memory (separate address spaces), also called distributed memory, and shared
address space (often called shared memory). Discuss UMA (uniform / symmetric
multiprocessors (SMP)) versus NUMA (non-uniform) memory access. Discuss
concepts of cache and the cache coherence (consistency) problem.
- A multicomputer is a distributed memory multiprocessor in which the
nodes and network are in a single cabinet. Such a system is tightly
coupled and communication is over a dedicated high speed interconnection
network.
- Note that a network of workstations is a form of distributed memory
multiprocessor. Such a system is loosely coupled and communication is
usually through message passing.
- Beowulf is a network or cluster of workstations generally running
Linux.
- Distributed, shared memory refers to a distributed implementation of
the shared memory model.
- Make the connection of multicomputer to distributed memory and
multiprocessor to shared address space. Differentiate between
shared-address and shared-memory parallel computers. Note how the same
picture, showing all processors having memory and no global memory, can
represent either a distributed or a shared address space (NUMA).
- Taxonomies (interconnection network)
- Regarding interconnection networks, we can have static or dynamic
interconnections: dynamic interconnections such as bus and crossbar versus static
interconnections such as linear array and completely connected.
- Dynamic interconnection networks use switches (indirect), rather than
direct connects. Discuss crossbar, bus-based networks.
- Regarding the crossbar, discuss the cost of scaling (high) versus the
performance degradation of scaling (low) for a crossbar. Note the cost of
$O(p^2)$ switches.
- Regarding bus-based, relate this to networks of workstations. Note its
cost scalability, but the difficulty of scaling its performance.
- Discuss multistage. Regarding multistage, present the notion of a
butterfly.
- Introduce static interconnection networks, noting connections are
direct. Discuss completely-connected, star connected, linear array, ring,
2-d mesh, torus, 3-d mesh and torus, tree, hypercube.
- Note that completely connected are static counterparts of crossbars,
but they allow multiple messages to emanate from one node to several
destinations.
- Note that the central node in a star is the bottleneck, just as the
bus is in a bus scheme. This is also true of the root of a tree.
- Discuss trees with switches on interior nodes and note the use of fat
trees to avoid the bottleneck of the root.
- Hamming distance as shortest path in a Hypercube
- Introduce the notions of diameter, connectivity, and bisection width.
- Show these for star connected, completely connected, linear array,
ring, 2-d mesh, torus, and 3-d mesh and torus.
- Generalize to k-ary, d-cubes
- Note the hypercube is a 2-ary d-cube, having $2^d$
processors. A ring is a p-ary 1-cube. A 2-d torus of p processors is a
$\sqrt{p}$-ary 2-cube. A k-ary d-cube can
be created from k k-ary (d-1)-cubes by connecting identical positions.
- Embedding Other Networks in Hypercubes
- Reflected Gray Code; rings and meshes
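The ring-in-hypercube embedding via the reflected Gray code, together with Hamming distance as shortest path length, can be sketched as below. This is an illustrative sketch only; the class and method names are hypothetical, and it assumes a ring of exactly $2^d$ positions mapped onto a d-dimensional hypercube.

```java
class GrayCodeRing {
    // i-th word of the reflected Gray code.
    static int gray(int i) {
        return i ^ (i >>> 1);
    }

    // Hamming distance between two node labels: the number of hypercube
    // links on a shortest path between them.
    static int hamming(int a, int b) {
        return Integer.bitCount(a ^ b);
    }

    // Maps ring position i (0 .. 2^d - 1) to a hypercube node label.
    static int[] ringEmbedding(int d) {
        int p = 1 << d;
        int[] node = new int[p];
        for (int i = 0; i < p; i++) {
            node[i] = gray(i);
        }
        return node;
    }
}
```

For d = 3 the ring positions map to nodes 0, 1, 3, 2, 6, 7, 5, 4. Consecutive labels, including the wrap-around from 4 back to 0, differ in exactly one bit, so every ring link lands on a single hypercube link.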
Week#3: (9/3, 9/5)
- More on static networks
- Reflected Gray Code; trees
- Discuss reduction on a hypercube vs a linear array. Note this can be
viewed as a mapping of the tree reduction onto hypercube.
- Routing in static networks
- Present XY routing and E-cube routing as examples of dimension-ordered
routing.
- Describe communication costs associated with static networks. Parameters
are Startup Time ($t_s$), Per-Hop Time ($t_h$), Per-Word Transfer
Time ($t_w$).
- Switching Techniques: Store-and-forward versus cut-through.
Store-and-forward cost for m words traversing l links is
$t_{comm} = t_s + (m t_w + t_h) l$. Since $t_h$ is usually
quite small, we simplify this to $t_{comm} = t_s + m t_w l$.
Cut-through routing advances the message as soon as
it arrives at a node. Wormhole routing is a specific type of cut-through in
which the message is divided into small parts called flits (flow-control
digits). Flits are pipelined through the network with an intermediate node
needing to store the flit, but not the whole message. Since flits are of
fixed size, the communication cost is $t_{comm} = t_s + l t_h + m t_w$. Thus,
store-and-forward is $O(ml)$, whereas cut-through is $O(m+l)$.
- Discuss deadlocking in wormhole routing.
- Reduction (all to one), broadcast (one to all), all to all on
hypercube
- Granularity (BSP model)
- PRAM Model
- Programming Styles
- Iterative parallelism: co // and process notation
- Recursive parallelism
- Producer / Consumer
- Client / Server
- Peers: worker, send and receive notation
- Example of Max in book's notation
- State, history, properties
- s1 -> s2 -> s3 ... -> sk : trace or history of states; a concurrent system can have many traces
- states are altered by atomic action
- safety property : never enter a bad state
- liveness property : eventually enter a good state
- mutual exclusion is a safety property
- partial correctness is a safety property
- termination is a liveness property (finite histories)
- total correctness is both a safety and liveness property
- Notation for concurrency
- co s1; // s2; // ... // sn; oc : concurrency
- process name { ... } : background process
- < S; > : atomic action; critical section; mutual exclusion;
granularity considerations
- < await(B) > : conditional synchronization; barrier
synchronization
- < await(B) S; > : conditional atomic action
- { precondition } actions { postcondition } : basis for axiomatic proofs
of correctness
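As a rough illustration of how the co ... oc and < S; > notation above might map onto Java threads, here is a sketch of a parallel Max. It is not the book's code: the class `CoMax`, the strided work split, and the `workers` parameter (assumed to be at least 1) are assumptions made for this example.

```java
class CoMax {
    private int max = Integer.MIN_VALUE;

    // < if (v > max) max = v; > : the atomic action becomes a synchronized method.
    private synchronized void update(int v) {
        if (v > max) {
            max = v;
        }
    }

    static int parallelMax(int[] a, int workers) {
        CoMax shared = new CoMax();
        Thread[] arms = new Thread[workers];
        for (int w = 0; w < workers; w++) {          // co [w = 0 to workers-1]
            final int id = w;
            arms[w] = new Thread(() -> {
                int local = Integer.MIN_VALUE;        // each arm scans its own stride
                for (int i = id; i < a.length; i += workers) {
                    local = Math.max(local, a[i]);
                }
                shared.update(local);                 // one atomic update per arm
            });
            arms[w].start();
        }
        for (Thread arm : arms) {                     // oc : wait for every arm
            try {
                arm.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        synchronized (shared) {
            return shared.max;
        }
    }
}
```

Each arm does most of its work on a local variable and enters the atomic action only once, which keeps the granularity of the critical section small.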
Programming Assignment #1:
Using Quicksort as a model,
develop a multi-threaded Java program that solves the sorting
problem.
Turn in a zip file with your project (be sure that the
zip keeps all directory structures) and a write-up. You must give
reasons for your choice of programming style (iterative or recursive
parallelism, producer/consumer, client/server or communicating
peers) and you must analyze your algorithm. Your analysis must
include time, cost, work, cost efficiency and work
efficiency.
Due: September 24
Week#4: (9/10, 9/12)
- Critical reference
- A critical reference is a reference to a variable that is changed by another process
- At-Most-Once Property (x = e); appearance of atomicity
- e contains at most one critical reference and x is not read by any
other process; OR
- e contains no critical references
- Examples
- Locks and Barriers
- Critical section problem (vaguely stated here)
- Simple producer/consumer
- Axiomatic Semantics (very brief)
- Triples {P} S {Q}
- Note this is a safety condition; need all histories finite for liveness
- Fairness
- Unconditional Fairness:
- every unconditional eligible atomic action is eventually executed
- Weak Fairness
- unconditionally fair; OR
- every conditional eligible atomic action is eventually executed
- assumes condition is true long enough to be observed
- Strong Fairness (an impractical consideration)
- unconditionally fair; OR
- every conditional eligible atomic action is eventually executed,
assuming the condition is infinitely often true
- SpinLocks
- Critical section problem
- mutual exclusion
- absence of deadlock and livelock
- absence of unnecessary delays
- eventual entry (relates to fairness)
- We now want to consider how to implement the < ... > primitive of the text
- How do we handle code like <await (!lock) lock = true;> critical; lock = false;?
- test and set from IBM 360/67 2 processor machine
- while (TS(lock)) ; // returns entry value of lock (before this set)
- one memory cycle -- basically an atomic spin lock
- no guarantee of fairness
- results in serious memory contention for shared lock
- while (lock); while (TS(lock)) { while (lock); } // Test and Test and Set
- reduces memory contention
- To implement unconditional atomic action < S; >
- CSEnter; S; CSExit; // CSEnter is entry protocol; CSExit is exit protocol
- To implement conditional atomic action <await (B) S; >
- CSEnter; while (!B) { CSExit; delay; CSEnter; } S; CSExit;
- if B satisfies the at-most-once property, can do < await(B);> as while(!B);
- Relation to Java synchronized
- synchronized (lock) { S; } is like <S;> // every process uses the same lock object
- synchronized (lock) { while (!B) try { lock.wait(); } catch (InterruptedException e) {} S; lock.notifyAll(); } is like <await(B) S;>
- TicTacToe: An Example of Java Concurrency
- Java Support for Concurrency
- Threads : either inherit from Thread class or implement Runnable interface
- Thread constructor taking a Runnable specifies the object that provides the code for the thread
- constructor can also specify a string to identify the thread
- Synchronize : specifies critical section using an object as lock
- can do at granularity of method
- can do at granularity of a block
- Locks are reentrant
- Locks can be temporarily given up : wait and notify
- Locks and Our First Two Java Programs
- Look back at our EOSort. It is risky in that it depends on a
one-processor environment
- EOSort can backtrack (undo its own progress)
- Look back at our TicTacToe. It uses a spin lock as a barrier and so can
suffer from livelock
- livelocked threads are alive, but never making progress
- Fair Solutions
- Tie Breaker
- Ticket Algorithm
- Bakery Algorithm
- Barrier Synchronization
- Shared Counter
- Flags and Coordinators
- Symmetric Barriers
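The test-and-test-and-set protocol from the SpinLocks items above could be sketched in Java as follows. This is a sketch under assumptions: `AtomicBoolean.getAndSet` stands in for the hardware TS instruction, the names `TTASLock`, `enter`, `exit`, and `countWithLock` are hypothetical, and the `Thread.yield()` call inside the spin is a courtesy added here rather than part of the protocol.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TTASLock {
    private final AtomicBoolean lock = new AtomicBoolean(false);

    // CSEnter: spin on an ordinary read, and attempt TS only when the lock looks free.
    void enter() {
        while (true) {
            while (lock.get()) {
                Thread.yield();                 // read-only spin reduces contention
            }
            if (!lock.getAndSet(true)) {        // TS(lock) returned the old value false
                return;
            }
        }
    }

    // CSExit
    void exit() {
        lock.set(false);
    }

    // Demo: several threads increment a shared counter inside the critical section.
    static int countWithLock(int threads, int increments) {
        TTASLock l = new TTASLock();
        int[] counter = {0};
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < increments; k++) {
                    l.enter();
                    try {
                        counter[0]++;
                    } finally {
                        l.exit();
                    }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter[0];
    }
}
```

Like the plain test-and-set lock, this gives mutual exclusion but no guarantee of fairness; the tie-breaker, ticket, and bakery algorithms listed above address eventual entry.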
Assignment #2:
Problem 2.16 from the book.
Due: September 19
Programming Assignment #2:
Problem 2.35(b) from the Andrews text, in multi-threaded
Java. Please hand in a zip file and a write-up, as in the previous
programming assignment. What is the time, cost, work, cost
efficiency and work efficiency of your solution? Is your program
safe?
Due: October 8
Week#5: (9/17, 9/19)
- Data Parallel
- MasPar Examples
- Parallel Prefix
- Semaphores
- abstraction with two services P (wait) and V (signal)
- sem s;
- P(s): <await(s>0) s--;>
- V(s): <s++;>
- internal state is
- a non-negative int value -- counting or general semaphore; or
- a binary value (0 or 1) -- binary semaphore
- fairness can be assured with proper implementation of await.
- Critical section problem and semaphores
- Java synchronized and semaphores
- Barriers and semaphores
- Producer / Consumer Problem
- Dining Philosophers Problem
- Race conditions in EOSort
- Counter Example
- UDP solution
- TCP/IP solution
- Running both at same time; cache coherence problem
- Hangman's Noose
- TCP/IP solution (client-server approach)
- Possibility of server-less approach (blackboard)
- Multiplayer Hangman
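The semaphore definitions given earlier this week, P(s): <await(s>0) s--;> and V(s): <s++;>, might be realized with Java's synchronized, wait and notify roughly as below. This is a sketch rather than a library-quality implementation; the class `CountingSemaphore` and the `alternate` demo are hypothetical, and no fairness among waiting threads is assumed.

```java
class CountingSemaphore {
    private int s;   // the non-negative internal value

    CountingSemaphore(int initial) {
        if (initial < 0) {
            throw new IllegalArgumentException("initial value must be >= 0");
        }
        s = initial;
    }

    // P(s): < await (s > 0) s--; >
    synchronized void p() throws InterruptedException {
        while (s == 0) {
            wait();          // gives up the lock until some V() notifies
        }
        s--;
    }

    // V(s): < s++; >
    synchronized void v() {
        s++;
        notify();            // wake one thread blocked in p(), if any
    }

    // Demo: two threads take strict turns using a pair of semaphores.
    static String alternate(int rounds) {
        CountingSemaphore turnA = new CountingSemaphore(1);
        CountingSemaphore turnB = new CountingSemaphore(0);
        StringBuilder out = new StringBuilder();
        Thread a = new Thread(() -> take(rounds, turnA, turnB, out, 'a'));
        Thread b = new Thread(() -> take(rounds, turnB, turnA, out, 'b'));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    private static void take(int rounds, CountingSemaphore mine,
                             CountingSemaphore other, StringBuilder out, char c) {
        try {
            for (int i = 0; i < rounds; i++) {
                mine.p();        // wait for my turn
                out.append(c);
                other.v();       // hand the turn over
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling `CountingSemaphore.alternate(3)` yields the string "ababab": the two semaphores act as signaling semaphores, the same pattern used in the producer/consumer problem with a single-slot buffer.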
Assignment #3:
Problems 3.2 and 3.4 from Andrews text
Due: September 26
Week#6: (9/24, 9/26)
Quiz#1 on 9/24
- Skipping 4.5 and 4.6 (done in OS course)
- Single lane bridge problem
- Monitor overview
- monitors and conds
- wait(cv), signal(cv), empty(cv)
- signal and wait versus signal and continue
- semaphores implemented via monitors
- Single lane bridge problem using monitors (Chapter 5)
- Paths -- declarative approach to concurrency control
- Bitonic Sort
Assignment #4:
Look at slides 97 and 98 that deal with monitor solutions for the one-way bridge problem. Analyse these with respect to non-exclusion and deadlock.
Due: October 1
Week#7: (10/1, 10/3)
- Bitonic Sort
- Monitors from chapter 5 in text
- monitors and conds
- wait(cv), wait(cv, rank), signal(cv), signal_all(cv), empty(cv), minrank(cv)
- signal and wait versus signal and continue
- queues, priority queues, BPOTs, heaps and analysis
- bitonic lists
- monitor examples
- semaphores
- bounded buffers
- readers/writers
- shortest-job-next
- timers
- covering condition versus priority wait
- sleeping barber
- CSCAN disk scheduler
- SCAN disk scheduler
- Java synchronized, wait/notify/notifyAll
- flip atomic action with n processes
- Discussion of Quiz#1
- Topics and Promises for MidTerm
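The bounded buffer monitor listed among this week's examples could look roughly like this in Java, where every object acts as a monitor with a single implicit condition variable and signal-and-continue semantics. The class `BoundedBuffer`, its method names, and the `transfer` demo are assumptions for illustration, not the text's code.

```java
class BoundedBuffer<T> {
    private final Object[] slots;
    private int front = 0, rear = 0, count = 0;

    BoundedBuffer(int capacity) {
        slots = new Object[capacity];
    }

    synchronized void deposit(T item) throws InterruptedException {
        while (count == slots.length) {
            wait();          // signal-and-continue: recheck the condition after waking
        }
        slots[rear] = item;
        rear = (rear + 1) % slots.length;
        count++;
        notifyAll();         // covering condition: wakes producers and consumers alike
    }

    @SuppressWarnings("unchecked")
    synchronized T fetch() throws InterruptedException {
        while (count == 0) {
            wait();
        }
        T item = (T) slots[front];
        slots[front] = null;
        front = (front + 1) % slots.length;
        count--;
        notifyAll();
        return item;
    }

    // Demo: a producer thread deposits 1..n while the caller fetches and sums them.
    static int transfer(int n, int capacity) {
        BoundedBuffer<Integer> buf = new BoundedBuffer<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) {
                    buf.deposit(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < n; i++) {
                sum += buf.fetch();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }
}
```

Because Java offers only one wait set per object, `notifyAll` plays the role of a covering condition; a monitor with separate not_full and not_empty condition variables, as in the text, could signal more selectively.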
MidTerm on October 3
Week#8: (10/8, 10/10)
- Distributed Computing Paradigms
- Concurrent Objects
- Synchronous versus asynchronous method invocation
- Single versus multiple server threads
- Remote Method Invocation (Related to RPC from Chapters 8 and 10 in
Text)
- Serializing (marshaling) objects / persistency
- Distributed Objects:
- Creating a service: Compile code producing skeleton (server side) and stub (client side). Note: Usually involves IDL (Interface Definition Language).
- Server action: Register service with Object Manager (sometimes just registry on server). Note: Object manager is often called an ORB (object request broker).
- Client action: Request service from Object Manager. Note: This usually involves a Name Server.
- Object manager action: Negotiate with service providers for an object. Note: This sometimes involves Reflection. Deliver remote object to client.
- Client/Server Interaction: Client uses remote object to get services. Server receives remote messages, provides results. Arguments and results can be by-value (local copy) or by reference (remote handle). This exchange involves serializing/deserializing objects. Note: Serializing is sometimes called marshalling.
- Roles are reversed when server sends messages to remote objects from client
- Tuple Space
- General concepts and C-Linda: out( tuple ); in( tuple ) and inp( tuple ); rd( tuple ) and rdp( tuple ); eval( tuple )
Note: Last Day to Withdraw is October 11.
Week#9: (10/15, 10/17)
- Java's Solution:
- Remote Method Invocation
- Serialization
- RMI IDL is Java Interface; rmic compiles skeleton and stub. Write interface, which must extend the Remote interface. Generate skeleton and stubs.
- Server Application: Write service implementation corresponding to interface. Server registers a service by name on some RMI port. Server must start an RMI registry prior to this. Note: default port is 1099.
- Client Applet: Write client applet. Client "looks up" service. Note: Need server IP address, port, and name of service. Lookup returns remote object.
- Note: Class of remote object is name of Interface (stub)
- RMI Examples
- Tuple Space Implementations
- Sun JavaSpace
- Entry
- templates
- nulls as wild cards
- matching by type and value; just type on wildcard
- matching is via MarshalledObject.equals
- write(entry, transaction, lease)
- read(entry, transaction, lease)
- take(template, transaction, lease)
- notify(template, transaction, listener, lease, handback)
- IBM TSpaces
- Field
- Tuple and SuperTuple versus SubclassableTuple
- TupleSpace (many)
- write( tuple )
- take( template_tuple )
- waitToTake( template_tuple )
- read( template_tuple )
- waitToRead( template_tuple )
- scan( template_tuple )
- consumingScan( template_tuple )
- countN( template_tuple )
- eventRegister( op, template_tuple, callback, newThread )
- transaction model
- adding spaces to a Transaction
- beginTrans ... commitTrans or abortTrans
- query types
- match queries
- index queries
- and/or queries
- XML queries
- management
- status, exists, delete, deleteAll, cleanup
- access control
- command handlers
- others
- multiWrite
- FieldPS
- readTupleById, deleteTupleById, update( tupleId, tuple ), multiUpdate
- Bid.com in TSpaces
Week#10: (10/22, 10/24)
- More analysis of sort algorithms + Brent's scheduling
- Specifically look at max and sum algorithms and try to determine when we
are using an appropriate number of processors.
- Measuring the number of processors that can be used and still have O(1)
efficiency. For the Max or Sum algorithm, $T_p = O(N/p) + O(\lg p)$.
- So $E = O(N) / O(N + p \lg p) = 1/(1 + (p \lg p)/N)$.
- This is $O(1)$ if $p = N/\lg N$ (Brent's scheduling)
- Accelerated Cascading
- Review of analysis of binary tree reduction and CRCW max
- Concept of doubly log tree
- CRCW max and doubly log tree
- Analysis of doubly log tree (T=lg lg N; W=N lg lg N)
- Idea of using binary tree reduction to reduce problem size and CRCW max
to pick up speed at end
- Formal analysis
- Reduce for lg lg lg N steps (T <= lg lg N; W <= N; Remaining Size = N/lg lg N)
- Use CRCW to complete (T=lg lg N; W=N)
- Combined algorithm is reasonably fast (lg lg N) and work efficient (N)
- Virtualizing sorts
- Tree (merge sort)
- Linear array
- Bitonic
- Oblivious Comparison Exchange Sorts
- Proof of correctness for 0-1 data implies correctness for all data (the 0-1 principle)
- Correctness of Even-Odd Transposition Sort
- Analysis of Sorts
- ShearSort on MasPar
- Oblivious Compare Exchange (OCE) sorts
- Shear Sort and RevSort
- Order, Cost, Work, Cost Efficiency, Work Efficiency.
- Discuss the general principle of "getting out of the way" employed in
Shearsort.
- Extend this notion to the technique used in Revsort. Note this is not a
snake sort like shear.
- Note that Revsort is not a sort. It just gets close (within 8 rows of
being right).
- Revsort gets there fast. It cuts the number of dirty rows, not in half,
but to the square root of the current number of dirty rows.
- Do a formal proof of ShearSort correctness and timing. (Detailed Proof)
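A sequential sketch of ShearSort on a mesh stored as a 2-d array is given below. It is illustrative only: the class name is hypothetical, the mesh is assumed rectangular with at least one column, and `Arrays.sort` stands in for the oblivious comparison-exchange sort (for example even-odd transposition) that each row or column of processors would run in parallel.

```java
import java.util.Arrays;

class ShearSort {
    // Sorts m into snake order: even rows ascending, odd rows descending.
    static void sort(int[][] m) {
        int rows = m.length;
        if (rows == 0) {
            return;
        }
        int pairs = 0;
        while ((1 << pairs) < rows) {
            pairs++;                     // pairs = ceil(lg rows)
        }
        // Each row phase followed by a column phase at least halves the dirty rows.
        for (int k = 0; k < pairs; k++) {
            sortRowsSnake(m);
            sortColumns(m);
        }
        sortRowsSnake(m);                // clean up the last dirty row
    }

    private static void sortRowsSnake(int[][] m) {
        for (int r = 0; r < m.length; r++) {
            Arrays.sort(m[r]);
            if (r % 2 == 1) {            // odd rows run right to left
                for (int i = 0, j = m[r].length - 1; i < j; i++, j--) {
                    int t = m[r][i];
                    m[r][i] = m[r][j];
                    m[r][j] = t;
                }
            }
        }
    }

    private static void sortColumns(int[][] m) {
        int[] col = new int[m.length];
        for (int c = 0; c < m[0].length; c++) {
            for (int r = 0; r < m.length; r++) {
                col[r] = m[r][c];
            }
            Arrays.sort(col);
            for (int r = 0; r < m.length; r++) {
                m[r][c] = col[r];
            }
        }
    }
}
```

The alternating row directions are what let smaller and larger values "get out of the way" of each other during the column phases.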
Assignment #6:
Analyze fast/efficient algorithm if binary tree reduction cutoff is:
lg lg lg lg N
lg lg N
lg N
See which are fast and/or efficient. Do precise analysis.
Due: Tuesday of Week#12, November 5.
Week#11: (10/29, 10/31)
- Brief discussion of fork and join from chapter 6 in text
- Unix fork and wait
- Implementing monitors via semaphores
- Message Passing from chapter 7 in text
- channels: send (non-blocking); receive (blocking)
- simple channel examples: char-to-line; sorting network
- client server examples
- one op; multi-ops; condition variables
- resource allocator
- disk server
- file server
- centralized vs symmetric vs ring reduction algorithms
- synchronous message passing and deadlock
- PCN (Program Composition Notation)
- Parallel Logic Programming (Prolog, CLP(R), Strand)
- CSP; reasoning through CSP; Modern CSP
- synchronous communication; guarded communication
- dest ! port(expression list); source ? port(parameter list)
- blocks until match of sender and receiver
- Guarded communication (succeed, block, fail)
- if B1 ; C1 -> S1; [] B2 ; C2 -> S2; ... fi
- do B1 ; C1 -> S1; [] B2 ; C2 -> S2; ... od
- all fail (no effect; we're done)
- one or more successful (choose one non-det.)
- all block (wait for one to succeed)
- Sieve of Eratosthenes
- CSP and event-based reasoning
ParaDOS style reasoning
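The Sieve of Eratosthenes as a pipeline of filter processes connected by channels can be approximated in Java as below. This is a sketch under assumptions: an unbounded `LinkedBlockingQueue` plays the role of a channel with non-blocking send and blocking receive (so this is asynchronous message passing, not CSP's synchronous form), `DONE` is a hypothetical end-of-stream marker, and one thread is created per prime found, which is only sensible for small limits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class SievePipeline {
    private static final int DONE = -1;   // end-of-stream marker (assumption)

    static List<Integer> primesUpTo(int limit) {
        List<Integer> primes = Collections.synchronizedList(new ArrayList<>());
        BlockingQueue<Integer> first = new LinkedBlockingQueue<>();
        Thread head = startFilter(first, primes);
        try {
            for (int n = 2; n <= limit; n++) {
                first.put(n);             // send: never blocks on an unbounded queue
            }
            first.put(DONE);
            head.join();                  // each filter joins its successor first
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(primes);
    }

    // Starts one filter process: its first input is a prime, and it passes on
    // every later input that is not a multiple of that prime.
    private static Thread startFilter(BlockingQueue<Integer> in, List<Integer> primes) {
        Thread t = new Thread(() -> {
            try {
                int p = in.take();        // receive: blocks until a value arrives
                if (p == DONE) {
                    return;
                }
                primes.add(p);
                BlockingQueue<Integer> out = new LinkedBlockingQueue<>();
                Thread next = startFilter(out, primes);
                for (int n = in.take(); n != DONE; n = in.take()) {
                    if (n % p != 0) {
                        out.put(n);
                    }
                }
                out.put(DONE);
                next.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }
}
```

For example, `SievePipeline.primesUpTo(30)` returns 2, 3, 5, 7, 11, 13, 17, 19, 23, 29. The primes come out in increasing order because each filter records its prime before forwarding anything to its successor.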
Assignment #7:
Stable Marriage Problem (7.15 (a) in text)
Due: Tuesday of Week#13, November 12.
Week#12: (11/5, 11/7)
- MPI
- Broadcasting on various architectures
- One-to-all SF (store-and-forward) on ring, mesh and hypercube
Ring: $(t_s + t_w m) \lceil p/2 \rceil$
Mesh: $2(t_s + t_w m) \lceil \sqrt{p}/2 \rceil$
3d Mesh: $3(t_s + t_w m) p^{1/3}/2$
HyperCube: $(t_s + t_w m) \lg p$
All-to-one variation - reverse order and direction, accumulate messages
All-to-one can be used for sum and other associative operations
- One-to-all CT (cut-through) on ring, mesh and hypercube
Ring: $(t_s + t_w m) \lg p + t_h (p-1)$
Mesh: $(t_s + t_w m) \lg p + 2 t_h (\sqrt{p}-1)$
HyperCube: $(t_s + t_w m) \lg p$ -- no gain! (why?)
- All-to-All Broadcasting on various architectures
- All-to-all SF on ring, mesh and hypercube
Ring: $(t_s + t_w m)(p-1)$
Mesh: $2 t_s (\sqrt{p}-1) + t_w m (p-1)$
HyperCube: $t_s \lg p + t_w m (p-1)$ (requires bi-directional links)
Note lower bound of $t_w m (p-1)$
- Reduction on hypercube
$(t_s + t_w) \lg p$
Prefix Sums as extension to reduction
- One-to-all SF personalized on ring, mesh and hypercube
Ring: $(t_s + t_w m)(p-1)$
Mesh: $2 t_s (\sqrt{p}-1) + t_w m (p-1)$
HyperCube: $t_s \lg p + t_w m (p-1)$ (requires bi-directional links)
Note lower bound of $t_w m (p-1)$
- All-to-all SF personalized
Ring: $(t_s + t_w m p/2)(p-1)$ (the ubiquitous sum of i, for i going from 0 to p-1)
Mesh: $(2 t_s + t_w m p)(\sqrt{p}-1)$
HyperCube: $(t_s + t_w m p/2) \lg p$
Note lower bound of $t_w m (p-1)$
- All-to-all CT personalized
HyperCube: $(t_s + t_w m)(p-1) + (t_h/2) p \lg p$
- Circular q-shift SF
Ring: $(t_s + t_w m) \lfloor p/2 \rfloor$
Mesh: $(t_s + t_w m)(2 \lfloor \sqrt{p}/2 \rfloor + 1)$
Hypercube: $(t_s + t_w m)(2 \lg p - 1)$
- Circular q-shift CT
Hypercube: $t_s + t_w m + t_h (\lg p - g(q))$
$g(q)$ is the number of times 2 divides q. $\lg p - g(q)$ is the longest path for a circular q-shift
- Note lower bound $t_w m (p-1)$ for all-all, one-all
personalized, all-all personalized
- SF Broadcasting
| Operation | Ring | 2d Mesh | Hypercube |
| One-to-all | $(t_s + t_w m) \lceil p/2 \rceil$ | $2(t_s + t_w m) \lceil \sqrt{p}/2 \rceil$ | $(t_s + t_w m) \lg p$ |
| All-to-all | $(t_s + t_w m)(p-1)$ | $2 t_s (\sqrt{p}-1) + t_w m (p-1)$ | $t_s \lg p + t_w m (p-1)$ |
| One-to-all personalized | $(t_s + t_w m)(p-1)$ | $2 t_s (\sqrt{p}-1) + t_w m (p-1)$ | $t_s \lg p + t_w m (p-1)$ |
| All-to-all personalized | $(t_s + t_w m p/2)(p-1)$ | $(2 t_s + t_w m p)(\sqrt{p}-1)$ | $(t_s + t_w m p/2) \lg p$ |
| Circular q-shift | $(t_s + t_w m) \lfloor p/2 \rfloor$ | $(t_s + t_w m)(2 \lfloor \sqrt{p}/2 \rfloor + 1)$ | $(t_s + t_w m)(2 \lg p - 1)$ |
- Cut Through Broadcasting
| Operation | Ring | 2d Mesh | Hypercube |
| One-to-all | $(t_s + t_w m) \lg p + t_h (p-1)$ | $(t_s + t_w m) \lg p + 2 t_h (\sqrt{p}-1)$ | |
| All-to-all personalized | | | $(t_s + t_w m)(p-1) + (t_h/2) p \lg p$ |
| Circular q-shift | | | $t_s + t_w m + t_h (\lg p - g(q))$ |
- $g(q)$ is the number of times 2 divides q. $\lg p - g(q)$ is the longest path for a circular q-shift
- Program Flow Analysis (Related to Chapter 12.2 in text)
- Control Flow
- Basic terminology
- Control vs data flow
- Inter vs intraprocedural analysis
- Program flowgraph
- Basic blocks
- Domination
- Loop extraction
- Depth first ordering (reverse postorder)
- Categorizing arcs (forward, back, cross)
- Data Flow
- Data flow analysis: Notation including May/Must, Forward/Backward Flow
- Reaching Definitions Algorithm
- Flow Analysis and Parallelizing Code
- Scalar data dependence
- true, anti and output dependencies
- Vector data dependence
- Diophantine analysis
- GCD Test
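The GCD test for vector data dependence can be sketched as follows. The setting is an assumption made for illustration: a loop writes A[a*i + b] and reads A[c*j + d], so a dependence requires an integer solution of the Diophantine equation a*i - c*j = d - b. The class and method names are hypothetical, and loop bounds are ignored, so a result of true only means a dependence cannot be ruled out.

```java
class GcdTest {
    static int gcd(int a, int b) {
        a = Math.abs(a);
        b = Math.abs(b);
        while (b != 0) {
            int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Returns false only when the two references can never touch the same element.
    static boolean mayDepend(int a, int b, int c, int d) {
        int g = gcd(a, c);
        if (g == 0) {
            return b == d;            // both subscripts are constants
        }
        // a*i - c*j = d - b has integer solutions iff gcd(a, c) divides d - b.
        return (d - b) % g == 0;
    }
}
```

For A[2i] versus A[2j + 1], gcd(2, 2) = 2 does not divide 1, so the references are independent and the loop iterations can run in parallel. For A[4i + 1] versus A[2j + 3], 2 divides 2, so the test is inconclusive and a dependence must be assumed.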
Week#13: (11/12, 11/14)
- Running RMI Examples
- The Bid.com example
- You may need to change my paths to match your arrangement of
directories
- I assume you are placing Bid.com directly under the root
(C:\Bid.com)
- This is important for http paths; if you do otherwise, change the
root path in 1_https.bat
- I assume the jdk bin directory is on your executable path (used in all
bats except 2_gormi.bat)
- I assume jini is in the directory \jini1_1 (only needed in
1_https.bat)
- run 1_https.bat; 2_gormi.bat
- You can run the server from bidServer via bat file or in JBuilder
- You can run the clients from bidClient via three bat files or three
projects in JBuilder
- You can run a pretty reasonable working version of the assignment from
the Working subdirectory
- This uses a different bid service name, so both examples can run
simultaneously
- There is no attached source
- Flow Analysis and Parallelizing Code
- Transformations Used to Parallelize Code
- Parallelizing Graph Algorithms
- Greedy algorithms
- Spanning trees
- Minimum spanning tree (Prim's Algorithm)
- alternate data structures for adjacency ($N^2$ versus $E \lg N$)
- Block Striped Partitioning
- Analysis of Prim's using p processors: computation cost $N^2/p$; communication cost: hypercube $N \lg p$, mesh $N \sqrt{p}$
- Scheduling Problems
- Basic definition
- Motivation for scheduling problems: scheduling of independent tasks and bin packing; heuristics versus perfect scheduling
- Examples with two processors, no dependencies
- Anomalous behaviors with precedence graphs
- NP problems
- NP Completeness
- How bad can anomalies be?
- How bad can a bad schedule be?
- Fast scheduling algorithm for unit execution trees
- Adaptations for forests and anti-trees
- Discussion of fast DAG scheduling with 2 processors
- Union-Find algorithm (amortization)
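For reference when analyzing the parallel version, here is a sequential sketch of Prim's algorithm on an adjacency matrix, the $N^2$ variant mentioned above. The class name, the `INF` marker for a missing edge, and the choice of vertex 0 as the start are assumptions of this example; the graph is assumed undirected.

```java
import java.util.Arrays;

class PrimMst {
    static final int INF = Integer.MAX_VALUE;   // marks "no edge" (assumption)

    // w[i][j] is the weight of edge i-j, or INF if absent. Returns the MST weight.
    static long mstWeight(int[][] w) {
        int n = w.length;
        if (n == 0) {
            return 0;
        }
        boolean[] inTree = new boolean[n];
        int[] dist = new int[n];                // cheapest known edge into the tree
        Arrays.fill(dist, INF);
        dist[0] = 0;
        long total = 0;
        for (int step = 0; step < n; step++) {
            // Select the closest vertex not yet in the tree: O(N) per step.
            int u = -1;
            for (int v = 0; v < n; v++) {
                if (!inTree[v] && (u == -1 || dist[v] < dist[u])) {
                    u = v;
                }
            }
            if (dist[u] == INF) {
                throw new IllegalArgumentException("graph is not connected");
            }
            inTree[u] = true;
            total += dist[u];
            // Update distances using row u of the matrix: O(N) per step.
            for (int v = 0; v < n; v++) {
                if (!inTree[v] && w[u][v] < dist[v]) {
                    dist[v] = w[u][v];
                }
            }
        }
        return total;
    }
}
```

Under block striped partitioning each of the p processors would own N/p entries of `dist` and the matching columns of `w`, so both inner loops shrink to N/p work per step, while choosing the global minimum requires a reduction and a broadcast in every one of the N steps.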
Week#14: (11/19, 11/21)
- Showtime for Projects
Week#15: (11/26)
- Showtime for Projects
- Discussion and Return of Assignments
- Review for Final Exam
Final Exam:
© Charles Hughes, Erik Reinhard, reinhard@cs.ucf.edu -- Last Modified September 2, 2002