CS 541 Lecture		-*- Outline -*-

* Overview of Distributed Systems with discussion of Argus and SR...

  Reference: G. LeLann, "Motivations, Objectives, and Characterization of
  Distributed Systems", in Distributed Systems, Architecture and
  Implementation, LNCS Vol. 105, Springer-Verlag, 1981.

** Characteristics of Distributed Systems

  A system of physically separated, autonomous computers that are used to
  solve a common problem.

*** Examples (problems to solve in a language)
    electronic banking systems (money cards), operating systems,
    airline reservation systems.
    Note that these examples have persistent data.

    OS-type services: file service, printer spooler, process creation,
    terminal management, mail, time

*** Logical characteristics

**** Modular architecture, with a possibly varying number of processors

**** Implementation by message passing on a network (no shared memory)
     That doesn't mean the system can't provide an abstraction of shared
     memory, e.g., Linda.

**** Interprocess message delays are variable
     Some non-zero time always elapses between production and receipt of a
     message, so a processor cannot have a complete, consistent view of the
     system.

**** Independent failures of processors and network
     (nodes crash, messages are lost)
     Garbling and duplication of messages can be prevented by various
     techniques (checksums, ...).

**** Reactive
     that is, users can talk to the system and get answers

** Objectives of Distributed Systems

*** Performance (more processors)
    shorter response time (time from initiation to completion)
    higher throughput (number of jobs per minute)

*** Extensibility (ease of adding new hardware or software)
    add or delete processors, even dynamically!
    integrate new functions at low cost, because of the modular design
    ease of maintenance (unplug an element, fix it, plug it back in)

*** Availability (probability that an average-length user task, started at
    any instant, can be successfully finished)
    from redundancy of hardware and replication of data + function

*** Reliability (probability that critical data is not lost in a year)
    from redundant storage of data

*** Resource sharing
    physical devices such as printers, CPUs

** Issues for Distributed Systems
   describe how Argus resolved some of these

*** Fault tolerance (availability, reliability)
    crashes of computers, links failing
    Can specify systems that are nondeterministic and reflect some failures.

**** Will users see failures?
     Does the system try to behave as if it were centralized?
       e.g., keep the user waiting indefinitely until problems are fixed?
     Does the system show some failures to users?
       e.g., exceptions in programming languages
     Are the system's guarantees such that no failures can be dealt with
     in programs?
     Can the programmer tune (program) availability and reliability, or is
     it all provided by the system?
     Argus doesn't try to hide failures...

**** What tools are available to deal with failures?
     atomic actions, stable storage, exceptions
     What primitives do users have to help with these questions?
     Argus has atomic actions, stable storage, exceptions, at-most-once RPC.

*** Transparency (will users be made aware of distributed aspects?)
    logically centralized system: behaves like a centralized system
    Do users get to place computations, etc., at different nodes in the
    net?
    What tools are there to deal with distribution?
      in Argus: guardians
      in SR: virtual machines, resources

*** Communication primitives

**** Request, reply (send, receive)
     reliable?
       shot in the dark (programmer handles acks, etc.)
       received at most once (expensive)
       exactly once (unrealistic)
     blocking?
       nonblocking is more flexible, but harder to program
       can use blocking if process fork is cheap
     buffering?
       can do flow control if the sender blocks until the message is
       received, and then no buffers are needed
       buffering raises several issues: flow control, protection,
       priorities, etc.

**** RPC
     semantics? like a procedure call with value-result passing,
     but what about failures? (transparency)
     It will be more expensive, so perhaps it should look different.
     representation of data in messages
       standard format in general for each type (Herlihy)

*** Consistency (due to lack of global state information)
    What guarantees for an RPC (e.g., one that has an external effect)?
      zero or more, zero or one (at most once), at least once
      exactly once is impossible (?)
    Need transactions or physical redundancy to ensure at-most-once
    execution.

**** Transactions
     read-compute-update (in a database)
     Serializability: transactions appear to have executed in some serial
     order.
     Can commit (make permanent) or abort.
     Important because the invariant is preserved by each transaction.
     Recoverability: the effect of committed transactions is not lost.
     Implemented with physical redundancy (stable storage).

*** Overhead of communications
    message send efficiency: should reduce copying
    end-to-end arguments: reliability can only be achieved at the highest
    level of a protocol, therefore it is a waste to insist on 100%
    reliability at lower levels (the system may crash after a message is
    delivered).

*** Naming
    Name servers are availability and reliability bottlenecks:
    if they are unavailable, clients can't find other services,
    so the name service has to be distributed.
      Can partition the name space (joined at a global top node)
        not too available
      multiple name servers
        single name space, but no redundancy (pointers to other servers)
      broadcast for lookup

*** Location of functionality
    centralized (bottleneck for availability and performance)
    Voting
    Mutual agreement (commit) problem
      Byzantine generals problem
      fundamental limits may cause indefinite waiting
      (Fischer, Lynch, Paterson)
      commit algorithms (2-phase commit)

*** Protection of resources
    controlling access to data, computers, ...

*** Autonomy of computers
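The 2-phase commit algorithm mentioned above can be sketched briefly. This is a minimal sketch of the coordinator's side only: the participant objects with prepare/commit/abort operations are hypothetical names chosen for illustration, and a real implementation must also log each step to stable storage so the decision survives crashes.

```python
def two_phase_commit(participants):
    """Coordinator for 2-phase commit.

    Phase 1: ask every participant to prepare (vote yes/no).
    Phase 2: commit everywhere only if all voted yes; otherwise
    abort everywhere, so all sites reach the same outcome.
    """
    prepared = []
    for p in participants:
        if p.prepare():          # participant votes yes and holds its locks
            prepared.append(p)
        else:
            # A single 'no' vote aborts the whole transaction.
            for q in prepared:
                q.abort()
            p.abort()
            return False
    for p in participants:
        p.commit()               # safe: every participant is prepared
    return True
```

The key point the sketch shows is that between `prepare()` and the phase-2 decision, each participant is in doubt and must be able to go either way; this is why the coordinator is an availability bottleneck if it crashes at that point.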
** Case Study: Distributed Operating Systems
   (Ref: Tanenbaum and van Renesse)

   Def: logically centralized, but runs on multiple, independent CPUs.

   network OS: computers run different OSs, users work on their own
   machines, use "remote logins," are aware of where their files are kept,
   little or no fault tolerance.

   Issues that distinguish a distributed OS from a network OS:

   file system
     keep the file systems separate
     adjoin the file systems under a new node at the top
     single global file system (e.g., Locus)
       - location and replication under system control
   protection
     user ids on one system not meaningful on other systems (need to
     rlogin)
       - programs cannot easily do rlogin
     mapping between userids (.rhosts files)
     single uid for each user (valid everywhere)
   execution location (where is the process created for a job?)
     just say "create process" (ok if the system chooses the machine)
     rlogins name the machine to run the process on

*** Example systems:
   Cambridge DOS
     based on a ring network
     processor bank and servers (dedicated to disks, printers, etc.)
     users start on a given processor from the processor bank
   Amoeba DOS
     capability-based, object-oriented (as in Hydra)
     workstations for each user, pool processors, and specialized servers
     uses RPC
   V kernel (Stanford)
     only communication is provided
     collection of workstations, each running the kernel
     send-receive
   Eden
     users and workstations (no processor pool)
     objects have both data and processes and are mobile
     uses RPC
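Several of the systems above (Amoeba, Eden) are built on RPC, and the "received at most once" semantics discussed under communication primitives can be sketched by having the server remember which requests it has already executed. This is a minimal single-process sketch under assumed conventions: the caller attaches a unique request id to each call, and all names here (AtMostOnceServer, receive) are hypothetical; the reply table is precisely the cost that makes at-most-once "expensive" compared to shot-in-the-dark sends.

```python
class AtMostOnceServer:
    """Suppresses duplicate requests: each request id is executed at
    most once; retransmissions get the cached reply replayed instead
    of re-executing the (possibly non-idempotent) operation."""

    def __init__(self, handler):
        self._handler = handler   # application procedure, run once per id
        self._replies = {}        # request id -> cached reply

    def receive(self, request_id, payload):
        if request_id in self._replies:
            # Duplicate (e.g., the client retried after a lost reply):
            # replay the stored answer, do NOT re-execute the handler.
            return self._replies[request_id]
        reply = self._handler(payload)
        self._replies[request_id] = reply
        return reply
```

Note the limitation the sketch makes visible: if the server crashes and loses the reply table, duplicates can no longer be detected, which is why true at-most-once across failures again requires stable storage or transactions.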