CS 541 Lecture		-*- Outline -*-

* Overview of Distributed Systems with discussion of Argus and SR...

  Reference: G. LeLann, "Motivations, Objectives, and Characterization of
  Distributed Systems", in Distributed Systems, Architecture and
  Implementation, LNCS Vol. 105, Springer-Verlag, 1981.

** Characteristics of Distributed Systems

  A system of physically separated, autonomous computers that are used to
  solve a common problem.

*** Examples (problems to solve in a language)
    electronic banking systems (money cards), operating systems,
    airline reservation systems.
    Note that these examples have persistent data.

    OS-type services: file service, printer spooler, process creation,
    terminal management, mail, time

*** Logical characteristics

**** Modular architecture, with a possibly varying number of processors

**** Implementation by message passing on a network (no shared memory)
     That doesn't mean the system can't provide an abstraction of shared
     memory, e.g., Linda.

**** Interprocess message delays are variable
     Some non-zero time always elapses between production and receipt of a
     message, so a processor cannot have a complete, consistent view of the
     system.

**** Independent failures of processors and network
     (nodes crash, messages are lost)
     Garbling and duplication of messages can be prevented by various
     techniques (checksums, ...).

**** Reactive
     that is, users can talk to the system and get answers

** Objectives of Distributed Systems

*** Performance (more processors)
    shorter response time (time from initiation to completion)
    higher throughput (number of jobs per minute)

*** Extensibility (ease of adding new hardware or software)
    add or delete processors, even dynamically!
    integrate new functions at low cost, because of the modular design
    ease of maintenance (unplug an element, fix it, plug it back in)

*** Availability (probability that an average-length user task, started at
    any instant, can be successfully finished)
    from redundancy of hardware and replication of data + function

*** Reliability (probability that critical data is not lost in a year)
    from redundant storage of data

*** Resource sharing
    physical devices such as printers, CPUs

** Issues for Distributed Systems
   describe how Argus resolved some of these

*** Fault tolerance (availability, reliability)
    crashes of computers, links failing
    Can specify systems that are nondeterministic and reflect some failures.

**** Will users see failures?
     Does the system try to behave as if it were centralized?
       e.g., keep the user waiting indefinitely until problems are fixed?
     Does the system show some failures to users?
       e.g., exceptions in programming languages
     Are the system's guarantees such that no failures can be dealt with
     in programs?
     Can the programmer tune (program) availability and reliability, or is
     it all provided by the system?
     Argus doesn't try to hide failures...

**** What tools are available to deal with failures?
     atomic actions, stable storage, exceptions
     What primitives do users have to help with these questions?
     Argus has atomic actions, stable storage, exceptions, at-most-once RPC.

*** Transparency (will users be made aware of distributed aspects?)
    logically centralized system: behaves like a centralized system
    Do users get to place computations, etc., at different nodes in the
    net?
    What tools are there to deal with distribution?
      in Argus: guardians
      in SR: virtual machines, resources

*** Communication primitives

**** Request, reply (send, receive)
     reliable?
       shot in the dark (programmer handles acks, etc.)
       received at most once (expensive)
       exactly once (unrealistic)
     blocking?
       nonblocking is more flexible, but harder to program
       can use blocking if process fork is cheap
     buffering?
       can do flow control if the sender blocks until the message is
       received, and then no buffers are needed
       buffering raises several issues: flow control, protection,
       priorities, etc.

**** RPC
     semantics? like a procedure call with value-result passing,
     but what about failures? (transparency)
     It will be more expensive, so perhaps it should look different.
     representation of data in messages
       standard format in general for each type (Herlihy)

*** Consistency (due to lack of global state information)
    What guarantees for an RPC (e.g., one that has an external effect)?
      zero or more, zero or one (at most once), at least once
      exactly once is impossible (?)
    Need transactions or physical redundancy to ensure at-most-once
    execution.

**** Transactions
     read-compute-update (in a database)
     Serializability: transactions appear to have executed in some serial
     order.
     Can commit (make permanent) or abort.
     Important because the invariant is preserved by each transaction.
     Recoverability: the effect of committed transactions is not lost.
     Implemented with physical redundancy (stable storage).

*** Overhead of communications
    message send efficiency: should reduce copying
    end-to-end arguments: reliability can only be achieved at the highest
    level of a protocol, therefore it is a waste to insist on 100%
    reliability at lower levels (the system may crash after a message is
    delivered).

*** Naming
    Name servers are availability and reliability bottlenecks:
    if they are unavailable, clients can't find other services,
    so the name service has to be distributed.
      Can partition the name space (joined at a global top node)
        not too available
      multiple name servers
        single name space, but no redundancy (pointers to other servers)
      broadcast for lookup

*** Location of functionality
    centralized (bottleneck for availability and performance)
    Voting
    Mutual agreement (commit) problem
      Byzantine generals problem
      fundamental limits may cause indefinite waiting
      (Fischer, Lynch, Paterson)
      commit algorithms (2-phase commit)

*** Protection of resources
    controlling access to data, computers, ...

*** Autonomy of computers
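The 2-phase commit algorithm mentioned above can be sketched briefly. This is a minimal sketch of the coordinator's side only: the participant objects with prepare/commit/abort operations are hypothetical names chosen for illustration, and a real implementation must also log each step to stable storage so the decision survives crashes.

```python
def two_phase_commit(participants):
    """Coordinator for 2-phase commit.

    Phase 1: ask every participant to prepare (vote yes/no).
    Phase 2: commit everywhere only if all voted yes; otherwise
    abort everywhere, so all sites reach the same outcome.
    """
    prepared = []
    for p in participants:
        if p.prepare():          # participant votes yes and holds its locks
            prepared.append(p)
        else:
            # A single 'no' vote aborts the whole transaction.
            for q in prepared:
                q.abort()
            p.abort()
            return False
    for p in participants:
        p.commit()               # safe: every participant is prepared
    return True
```

The key point the sketch shows is that between `prepare()` and the phase-2 decision, each participant is in doubt and must be able to go either way; this is why the coordinator is an availability bottleneck if it crashes at that point.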
** Case Study: Distributed Operating Systems
   (Ref: Tanenbaum and van Renesse)

   Def: logically centralized, but runs on multiple, independent CPUs.

   network OS: computers run different OSs, users work on their own
   machines, use "remote logins," are aware of where their files are kept,
   little or no fault tolerance.

   Issues that distinguish a distributed OS from a network OS:

   file system
     keep the file systems separate
     adjoin the file systems under a new node at the top
     single global file system (e.g., Locus)
       - location and replication under system control
   protection
     user ids on one system not meaningful on other systems (need to
     rlogin)
       - programs cannot easily do rlogin
     mapping between userids (.rhosts files)
     single uid for each user (valid everywhere)
   execution location (where is the process created for a job?)
     just say "create process" (ok if the system chooses the machine)
     rlogins name the machine to run the process on

*** Example systems:
   Cambridge DOS
     based on a ring network
     processor bank and servers (dedicated to disks, printers, etc.)
     users start on a given processor from the processor bank
   Amoeba DOS
     capability-based, object-oriented (as in Hydra)
     workstations for each user, pool processors, and specialized servers
     uses RPC
   V kernel (Stanford)
     only communication is provided
     collection of workstations, each running the kernel
     send-receive
   Eden
     users and workstations (no processor pool)
     objects have both data and processes and are mobile
     uses RPC
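Several of the systems above (Amoeba, Eden) are built on RPC, and the "received at most once" semantics discussed under communication primitives can be sketched by having the server remember which requests it has already executed. This is a minimal single-process sketch under assumed conventions: the caller attaches a unique request id to each call, and all names here (AtMostOnceServer, receive) are hypothetical; the reply table is precisely the cost that makes at-most-once "expensive" compared to shot-in-the-dark sends.

```python
class AtMostOnceServer:
    """Suppresses duplicate requests: each request id is executed at
    most once; retransmissions get the cached reply replayed instead
    of re-executing the (possibly non-idempotent) operation."""

    def __init__(self, handler):
        self._handler = handler   # application procedure, run once per id
        self._replies = {}        # request id -> cached reply

    def receive(self, request_id, payload):
        if request_id in self._replies:
            # Duplicate (e.g., the client retried after a lost reply):
            # replay the stored answer, do NOT re-execute the handler.
            return self._replies[request_id]
        reply = self._handler(payload)
        self._replies[request_id] = reply
        return reply
```

Note the limitation the sketch makes visible: if the server crashes and loses the reply table, duplicates can no longer be detected, which is why true at-most-once across failures again requires stable storage or transactions.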