COP 4020 Lecture                                        -*- Outline -*-

* Parallelism Techniques

Most of this material is taken from chapters 2 and 3 of
"Parallel and Concurrent Programming in Haskell" by Simon Marlow,
published by O'Reilly Media, Inc., 2013.
http://chimera.labs.oreilly.com/books/1230000000929/index.html

> module Strategies where
> import Control.Parallel.Strategies hiding (parMap)

** Basic Concepts

------------------------------------------
DO WE NEED LANGUAGE SUPPORT FOR PARALLELISM?

Consider

    e0 e1 e2

Why doesn't Haskell just evaluate all
subexpressions in parallel?
------------------------------------------
...
 - overhead for managing parallel threads
     (scheduling, etc.)
 - the granularity will be wrong
     (most subexpressions are too small to justify the overhead)
 - it changes the semantics drastically
     (no longer lazy)

Conclusion: we need the programmer to divide up the work
            (we are still writing algorithms)

------------------------------------------
EXAMPLE PARALLEL ALGORITHM

How can we sort a list in parallel?

    psort :: (Ord a) => [a] -> [a]
------------------------------------------
See the paper "Parallel Sorting Pattern" by Vivek Kale and
Edgar Solomonik, ParaPLoP 2010.

ideas:

  quicksort: partition the data into 2 lists;
    keep doing that until we run out of processors,
    then merge.

> qsort :: (Ord a) => [a] -> [a]
> qsort [] = []
> qsort (x:xs) = let (small,large) = split x xs
>                in merge (qsort small) (x:(qsort large))
>
> split x xs = ([e | e <- xs, e <= x], [e | e <- xs, e > x])
> merge [] ys = ys
> merge xs [] = xs
> merge (x:xs) (y:ys) = if x <= y
>                       then x:(merge xs (y:ys))
>                       else y:(merge (x:xs) ys)

  sample sort: each processor helps figure out a set of keys
    that partition the data; move the data to the appropriate
    partition, sort each partition, then merge.

  radix sort: use the bitwise representation of the keys,
    with 1 bucket per processor.

*** what has to be expressed?

------------------------------------------
WHAT DO WE HAVE TO EXPRESS?

Say:
 - how to divide up the work into tasks
 - how to order the parts of the computation
     (i.e., what cannot be done in parallel)
------------------------------------------

Q: Do we need to say what computations execute on what processors?

   no

Q: Do we need to say what percentage of a processor each task gets?

   no

------------------------------------------
HASKELL'S EVAL MONAD

in module Control.Parallel.Strategies

  data Eval a = Done a

  runEval :: Eval a -> a

  rpar :: a -> Eval a
  rseq :: a -> Eval a

  instance Monad Eval where
    return = Done
    m >>= k = case m of
                (Done x) -> k x
------------------------------------------
...
comments on the Eval operations:
  runEval -- get the result (magic here)
  rpar    -- spark (magic) the argument for parallel evaluation
  rseq    -- evaluate the argument before moving on

See http://hackage.haskell.org/packages/archive/parallel/3.2.0.3/doc/html/Control-Parallel-Strategies.html#t:Eval
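As a concrete illustration, here is a minimal sketch (the names
expensive and parPair are hypothetical, not from the lecture's code)
that sparks one computation with rpar while evaluating another with
rseq, then waits for the spark before returning:

> -- hypothetical example of the rpar/rseq pattern:
> expensive :: Integer -> Integer
> expensive n = sum [1..n]          -- stands in for some real work
>
> parPair :: (Integer, Integer)
> parPair = runEval $ do
>   a <- rpar (expensive 20000000)  -- spark this for parallel evaluation
>   b <- rseq (expensive 10000000)  -- evaluate this one here, meanwhile
>   _ <- rseq a                     -- wait for the spark to finish
>   return (a, b)

Compiled with -threaded and run with +RTS -N, the two sums can be
evaluated on different cores.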
NFData types are ones that can be fully evaluated
  (Int, Bool, and other primitive values)

Definitions in Eval (skip), which explain it in terms of the
underlying primitives:

  rseq x = x `pseq` (return x)
  rpar x = x `par` (return x)

where pseq and par are defined in Control.Parallel

In e1 `par` e2, e1 should be a shared expression.

------------------------------------------
CREATING THREADS

import Control.Parallel.Strategies

rpar exp
  - sparks exp as a thread
  - the result is a promise (will eventually be the value of exp)
  - its type is (Eval t), where exp :: t
  - use runEval to get the value out

Advantages of Eval and Strategies:
  - works with lazy evaluation
  - compositional strategies for parallel evaluation
------------------------------------------

Sparks are put in a "spark pool",
which is used to find work for the cores.

------------------------------------------
EXAMPLE: PARMAP

> parMap :: (a -> b) -> [a] -> Eval [b]
> parMap f [] = return []
> parMap f (a:as) =
>   do b <- rpar (f a)
>      bs <- parMap f as
>      return (b:bs)

from "Parallel and Concurrent Programming in Haskell"
by Simon Marlow, p. 13
------------------------------------------

Note the use of b and bs (shared expressions) in parMap!

*** separating and composing specifications of parallelism

------------------------------------------
STRATEGIES

Goal: separate the specification of how to parallelize
      from the specification of the computation

module Control.Parallel.Strategies

  type Strategy a = a -> Eval a

  using :: a -> Strategy a -> a
  x `using` s = runEval (s x)

  r0 :: Strategy a       -- do nothing
  rseq :: Strategy a     -- evaluate the argument to WHNF (minimally)
  rdeepseq :: NFData a => Strategy a
                         -- evaluate the argument to NF (fully)
------------------------------------------

Helps automate the use of the Eval monad

------------------------------------------
EXAMPLE: PARLIST

Problem: parMap mixes the algorithm with the parallelism directives

  parMap :: (a -> b) -> [a] -> Eval [b]
  parMap f [] = return []
  parMap f (a:as) =
    do b <- rpar (f a)
       bs <- parMap f as
       return (b:bs)

Goal: extract the strategy for parallelism,
      separating it from the algorithm

idea: map a given strategy on elements to a strategy
on the whole list (fill in the bodies):

  parList :: Strategy a -> Strategy [a]
  parList strat [] =
  parList strat (x:xs) = do

With parList we can define

> myParMap :: (a -> b) -> [a] -> [b]
> myParMap f xs = (map f xs) `using` parList rseq
------------------------------------------
...
  parList strat [] = return []
  parList strat (x:xs) =
    do x' <- rpar (x `using` strat)
       xs' <- parList strat xs
       return (x':xs')

Notice that myParMap produces a result of type [b], not Eval [b].
This is a benefit of using strategies (it follows from the type
of "using").

*** Examples

**** Hailstone or 3x+1 problem again

Now try to parallelize hailstone, putting the code in
HailstonePeaksRun2.

compile with
  ghc -O2 HailstonePeaksRun2.hs -threaded -rtsopts -eventlog
run with
  ./HailstonePeaksRun2 +RTS -N -s -ls
  threadscope HailstonePeaksRun2.eventlog

Problems: too many sparks, which overflows the heap!
The reason is that parMap eagerly creates sparks
from an infinite list.

The right way is to use the parBuffer strategy
(as in HailstonePeaksRun3 and HailstonePeaksRun4),
which uses a rolling buffer: it sparks the first n computations,
and then when the ith is removed from the buffer it sparks
the (i+n)th, where n is the size of the buffer:

  `using` (parBuffer 100 rdeepseq)

According to the reference by Marlow,
a buffer size of 50-5000 usually works well.
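For instance, here is a minimal sketch in the spirit of the
HailstonePeaks programs (hailLen and hailLens are hypothetical
names, not the actual course code):

> -- hypothetical sketch, not the HailstonePeaksRun code:
> -- length of the hailstone (3x+1) sequence starting from n > 0
> hailLen :: Int -> Int
> hailLen n = go n 0
>   where go 1 acc = acc
>         go m acc
>           | even m    = go (m `div` 2) (acc + 1)
>           | otherwise = go (3 * m + 1) (acc + 1)
>
> -- spark at most 100 list elements at a time
> hailLens :: [Int] -> [Int]
> hailLens ns = map hailLen ns `using` parBuffer 100 rdeepseq

Unlike parMap, this keeps the spark pool bounded even when the
input list is very long (or infinite).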
compile with
  ghc -O2 HailstonePeaksRun4.hs -threaded -rtsopts -eventlog
run with
  ./HailstonePeaksRun4 +RTS -N -s -ls
  threadscope HailstonePeaksRun4.eventlog

**** Sudoku

See par-tutorial-1.2/code/sudoku3.hs

compile with
  cd par-tutorial-1.2/code/
  ghc -O2 sudoku3.hs -rtsopts -threaded
run with
  ./sudoku3 sudoku17.1000.txt +RTS -N -s

** Amdahl's law

------------------------------------------
AMDAHL'S LAW

Speedup = serial clock time / parallel clock time

Why can't we speed up our program 4 times if we have 4 cores?

 - There are parts of the program we can't parallelize.

Let P be the fraction of the computation that can be parallelized.
Let S be the speedup achieved for P.

Then the serial fraction of the execution is (1-P),
and the fraction of the time taken by the parallel part is P/S.

So the overall speedup is:

         1
   ---------------
    (1-P) + (P/S)
------------------------------------------
Speedup is what you divide the time by
("4 times faster" means the program runs in 1/4 of the serial time).
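A quick worked example (the helper amdahl is hypothetical, not part
of the lecture's code): with P = 0.9 on 4 cores (S = 4), the speedup
is 1 / (0.1 + 0.9/4) = 1 / 0.325, or about 3.08; and no matter how
many cores we add, it can never exceed 1 / (1-P) = 10.

> -- hypothetical helper for Amdahl's law:
> -- overall speedup given parallel fraction p and speedup s on that part
> amdahl :: Double -> Double -> Double
> amdahl p s = 1 / ((1 - p) + p / s)

For example, amdahl 0.9 4 is about 3.08, and amdahl 0.9 1000000
is just under 10, the serial cap.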