Replication Of Non-Deterministic Objects

Introduction

I completed my PhD thesis in autumn 1998. It was accepted on Nov. 17, 1998 upon proposition of the jury composed by

Prof. Alfred Strohmeier, EPFL, Switzerland
Prof. Hans-Heinrich Nägeli, University of Neuchâtel, Switzerland
Prof. André Schiper, EPFL, Switzerland
Prof. Andy Wellings , University of York, UK

On this page, you can find the thesis' abstract in HTML:

in English
in French
in German

as well as two gzipped postscript versions of the full thesis:

A4 paper (21cm x 29.7cm, 337 Kb)
US letter (8.5" x 11", 339 Kb)

The postscript files print best in recto/verso: the left hand pages have a different layout, and there are some empty pages in the document to make sure that all chapters start on a recto.

This thesis discusses replication of non-deterministic objects in distributed systems to achieve fault tolerance against crash failures. The objects replicated are the virtual nodes of a distributed application. Replication is viewed as an issue that is to be dealt with only during the configuration of a distributed application and that should not affect the development of the application. Hence, replication of virtual nodes should be transparent to the application.

Like all measures to achieve fault tolerance, replication introduces redundancy in the system. Not surprisingly, the main difficulty is guaranteeing the consistency of all replicas such that they behave in the same way as if the object was not replicated (replication transparency). This is further complicated if active objects (like virtual nodes) are replicated, and these objects themselves can be clients of still further objects in the distributed application.

The problems of replication of active non-deterministic objects are analyzed in the context of distributed Ada 95 applications. The ISO standard for Ada 95 defines a model for distributed execution based on remote procedure calls (RPC). Virtual nodes in Ada 95 use this as their sole communication paradigm, but they may contain tasks to execute activities concurrently, thus making the execution potentially non-deterministic due to implicit timing dependencies. Such non-determinism cannot be avoided by choosing deterministic tasking policies.

I present two different approaches to maintain replica consistency despite this non-determinism. In a first approach, I consider the run-time support of Ada 95 as a black box (except for the part handling remote communications). This corresponds to a non-deterministic computation model. I show that replication of non-deterministic virtual nodes requires that remote procedure calls are implemented as nested transactions. Unfortunately, effects of failures are not local to the replicas of a virtual node: when a failure occurs, nested remote calls made to other virtual nodes must be undone. Also, using transactional semantics for RPCs necessitates a compromise regarding transparency: the application must identify global state for it cannot be determined reliably in an automatic way. Further study reveals that this approach cannot be implemented in a transparent way at all because the consistency criterion of Ada 95 (linearizability) is much weaker than that of transactions (serializability). An execution of remote procedure calls as transactions may thus lead to incompatibilities with the semantics of the programming language. If remotely called subprograms on a replicated virtual node perform partial operations, i.e., entry calls on global protected objects, deadlocks that cannot be broken can occur in certain cases. Such deadlocks do not occur when the virtual node is not replicated. The transactional semantics of RPCs must therefore be exposed to the application.

A second approach is based on a piecewise deterministic computation model, i.e., the execution of a virtual node is seen as a sequence of deterministic state intervals. Whenever a non-deterministic event occurs, a new state interval is started. I study replica organization under this computation model (semi-active replication). In this model, all non-deterministic decisions are made on one distinguished replica (the leader), while all other replicas (the followers) are forced to follow the same sequence of non-deterministic events. I show that it suffices to synchronize the followers with the leader upon each observable event, i.e., when the leader sends a message to some other virtual node. It is not necessary to synchronize upon each and every non-deterministic event - which would incur a prohibitively high overhead. Non-deterministic events occurring on the leader between observable events are logged and sent to the followers just before the leader executes an observable event. Consequently, it is guaranteed that the followers will reach the same state as the leader, and thus the effects of failures remain mostly local to the replicas.

A prototype implementation called RAPIDS (Replicated Ada Partitions In Distributed Systems) serves as a proof of concept for this second approach, demonstrating its feasibility. RAPIDS is an Ada 95 implementation of a replication manager for semi-active replication for the GNAT development system for Ada 95. It is entirely contained within the run-time support and hence largely transparent for the application.

Back to the top

Résumé

Cette thèse traite de la duplication d'objets non-dé terministes dans des systèmes répartis afin de les rendre tolérants aux pannes, plus particulièrement aux défaillances par arrêt. Les objets dupliqués sont les noeuds virtuels d'une application répartie. La duplication de noeuds virtuels est censée être transparente pour l'application: on considère la duplication comme un mécanisme qui n'intervient qu'au moment de la configuration de l'application et qui ne devrait pas affecter le développement de celleci.

Comme toutes les mesures ayant comme but la tolérance aux pannes, la duplication introduit de la redondance dans le système. La difficulté principale à résoudre est é videmment de garantir la cohérence de tous les duplicata pour que l'ensemble de duplicata se comporte envers les autres objets comme un objet singulier (transparence de duplication). Le problème est encore amplifié si l'on considère la duplication d'objets actifs, qui peuvent eux-mêmes être des clients d'autres objets dans l'application répartie.

Dans cette thèse, les problèmes de la duplication d'objets actifs non-deterministes sont étudiés dans le cadre d'applications réparties mises en oeuvre avec le langage de programmation Ada 95. Le standard ISO de ce langage définit un modèle pour l'exécution répartie fondé sur des appels à distance (RPC). Les noeuds virtuels en Ada 95 ne communiquent que par des appels à distance; ils peuvent néanmoins contenir des tâches, ce qui rend leur comportement non-déterministe en raison de certaines dépendances temporelles implicites. Ce genre de non-déterminisme ne peut pas être résolu en choisissant des règles d'ordonnancement de tâches déterministes.

Cette thèse présente deux approches différentes pour maintenir la cohérence des duplicata malgré ce non-déterminisme. Dans la première, le support d'exécution de Ada 95 est considéré comme une boîte noire, à l'exception de sa partie responsable du traitement de la communication entre les noeuds virtuels. Ceci correspond à un modèle de calcul non-déterministe. Je montre que la duplication de noeuds virtuels non-déterministes né cessite une sémantique de transactions imbriquées pour les appels à distance. Les effets de défaillances ne sont pourtant pas locaux à l'ensemble des duplicata d'un noeud virtuel: lors d'une défaillance, les appels à distance imbriqués faits à d'autres noeuds doivent être annulés. La sémantique transactionnelle nécessite également un compromis en ce qui concerne la transparence: l'application doit identifier son état global, car il ne peut pas être identifié avec certitude de manière transparente dans le support d'exé cution. Mais le problème le plus sévère est sans doute qu'il est même impossible de mettre en oeuvre cette approche d'une manière transparente car le critère de cohérence de Ada 95 (linéarisabilité) est beaucoup moins fort que celui de transactions (sérialisabilité). L'exécution d'appels à distance comme des transactions peut donc conduire à des incompatibilités avec la sémantique du langage de programmation lui-même. Si un sous-programme appelé à distance sur un noeud virtuel dupliqué accomplit des opérations partiels, c.-à-d. des appels d'entrées d'objets protégés globaux, des interblocages qui ne peuvent pas être résolus sont possibles. De tels interblocages ne surviendraient pas si le noeud virtuel n'était pas dupliqué! Il est donc indispensable que la nature transactionnelle des appels à distance soit révélée à l'application.

La deuxième approche se fonde sur un modèle de calcul à étapes déterministes (piecewise deterministic computation model) où l'exécution d'un noeud virtuel est modélisé e par une séquence d'intervalles déterministes d'états. À chaque événement non-déterministe, un nouvel intervalle d'état commence. J'étudie l'organisation de duplicata dans ce modèle de calcul par le moyen de la duplication semi-active. Dans cette forme d'organisation, toutes les décisions non-déterministes se font sur un duplicata distingué (appelé leader ou meneur) et tous les autres duplicata (les followers, ou suiveurs) sont forcés de suivre la même séquence d'événements non-déterministes. Je démontre qu'il suffit de synchroniser les suiveurs avec le meneur chaque fois que ce dernier est sur le point d'exécuter un événement observable, c'est-à-dire quand il envoie un message à un autre noeud virtuel. Il n'est pas nécessaire de les synchroniser à chaque événement - ceci entraînerait une baisse de performance prohibitive. Des événements non-déterministes qui surviennent entre deux événements observables sur le meneur peuvent être enregistrés dans un journal qui est envoyé aux suiveurs juste avant que le meneur exécute un événement observable. On garantit ainsi que les suiveurs vont atteindre le même état que le meneur. Avec ce procédé, les effets de dé faillances restent presque complètement locaux aux duplicata.

Une première mise en oeuvre, appelée RAPIDS (Replicated Ada Partitions In Distributed Systems), montre la faisabilité de la deuxième approche fondée sur le modèle de calcul à étapes déterministes. RAPIDS est un gestionnaire de duplicata pour la duplication semi-active pour le système de développement Ada 95 GNAT. Il est entièrement encapsulé dans le support d'exécution et il est donc largement transparent pour l'application.

Retour au début de page

Zusammenfassung

Diese Dissertation analysiert die Replikation von nichtdeterministischen Objekten in verteilten Systemen zum Zwecke der Fehlertoleranz bezüglich Abstürzen (crash failures). Die replizierten Objekte sind hier die logischen Knoten (virtual nodes) einer verteilten Anwendung. Replikation von logischen Knoten soll für die Anwendung transparent sein: da ihre Funktionalität unverändert bleibt, wenn Knoten repliziert werden, soll auch ihre Entwicklung nicht tangiert werden. Erst bei der Konfiguration der verteilten Anwendung muss die Replikation von Knoten berücksichtigt werden.

Das Replizieren von Knoten führt - wie alle Methoden der Fehlertoleranz - zu Redundanz im System. Erwartungsgemäss ist die Koordination der Repliken somit das grösste zu lösende Problem. Ziel dieser Koordination ist Replikationstransparenz, d.h. ein repliziertes Objekt soll sich gegenüber dem Rest des Systems so verhalten, als ob es nicht repliziert wäre. Betrachtet man die Replikation von aktiven Objekten, so wird dies noch erschwert, wenn ein repliziertes Objekt selbst Dienste von weiteren Objekten benutzt.

Die Problematik der Replikation von aktiven, nichtdeterministischen Objekten wird in dieser Dissertation anhand der Programmiersprache Ada 95 untersucht. Der ISO-Standard dieser Sprache definiert ein Modell für die Programmierung und Ausführung von verteilten Anwendungen, das im wesentlichen auf dem Konzept des Remote Procedure Calls (RPC) beruht. Die einzelnen Knoten einer verteilten Anwendung in Ada 95 kommunizieren zwar ausschliesslich durch RPC, können jedoch ohne weiteres Tasks beinhalten. Aufgrund impliziter Zeitabhängigkeiten wird das Verhalten eines solchen Knotens nichtdeterministisch, selbst wenn ein deterministischer Scheduling- Algorithmus für Tasks gewählt wird.

In dieser Dissertation stelle ich zwei verschiedene Methoden für die Koordination nichtdeterministischer Repliken vor. In einem ersten Ansatz wird die Laufzeitbibliothek von Ada 95, mit Ausnahme der Unterstützung der Kommunikation in verteilten Anwendungen, als "black box" betrachtet. Dies entspricht einem nichtdeterministischen Ablaufmodell. Das Replizieren nichtdeterministischer Knoten führt dazu, dass Remote Procedure Calls die Semantik von geschachtelten Transaktionen haben müssen. Dabei sind jedoch die Auswirkungen eines Absturzes nicht auf die Gruppe der Repliken eines Knotens beschränkt: tritt ein Absturz während der Ausführung eines Remote Procedure Calls auf, so müssen alle geschachtelten RPCs zu anderen Knoten zurückgesetzt werden. Der gravierendste Mangel dieses Ansatzes ist jedoch, dass er nicht in transparenter Art und Weise implementiert werden kann. Einerseits muss eine Anwendung ihre globalen Daten identifizieren, da diese nicht zuverlässig automatisch erkannt werden können, andererseits ist die Konsistenzbedingung von Ada 95 (Linearisierbarkeit) weniger strikt ist als jene von Transaktionen (Serialisierbarkeit). Damit kann eine Ausführung von Remote Procedure Calls als Transaktionen zu Inkompatibilitäten mit der Semantik der Programmiersprache führen, falls in einem replizierten Knoten partielle Operationen, d.h. Aufrufe von Entries von globalen Protected Objects, ausgeführt werden. In einem solchen Fall können Deadlocks, die nicht aufgelöst werden können und zu denen es nicht gekommen wä re, wenn der Knoten nicht repliziert worden wäre, auftreten! Somit muss die transaktionelle Implementierung von Remote Procedure Calls zwangslä ufig der Anwendung sichtbar gemacht werden, damit diese den neuen Rahmenbedingungen Rechnung tragen kann.

Der zweite Ansatz geht von einem stückweise deterministischen Ablaufmodell (piecewise deterministic computation model) aus. Dabei wird die Ausführung einer Anwendung (bzw. eines Knotens) durch eine Sequenz von deterministischen Zustandsintervallen abgebildet. Mit jedem nichtdeterministischen Ereignis wird ein neues Zustandsintervall begonnen. Die Koordination mittels semi-aktiver Replikation wird untersucht. Dabei werden alle nichtdeterministischen Entscheidungen auf einer ausgezeichneten Replik (dem Leader) getroffen, während alle anderen Repliken (die Followers) diesen Entscheidungen zu folgen haben, d.h. alle nichtdeterministische Ereignisse in der selben Reihenfolge ausführen müssen. Es wird gezeigt, dass eine Synchronisation zwischen dem Leader und seinen Followers bei jedem extern sichtbaren Ereignis, d.h. wenn der Leader eine Nachricht an einen anderen Knoten sendet, ausreichend ist. Es ist nicht erforderlich, die Repliken bei jedem nichtdeterministischen Ereignis zu synchronisieren - was auf Grund des hohen Aufwands kaum praktikabel wäre. Das Auftreten nichtdeterministischer Ereignisse auf dem Leader zwischen zwei extern sichtbaren Ereignissen wird in einem Log aufgezeichnet. Bevor der Leader ein extern sichtbares Ereignis ausfü hrt, wird dieses Log allen Followers übermittelt. Damit kann sichergestellt werden, dass diese den gleichen Zustand wie der Leader erreichen werden. Auf diese Art und Weise können die Auswirkungen von Abstürzen grösstenteils in der Gruppe von Repliken verkapselt werden.

Die Implementation des Prototyps RAPIDS (Replicated Ada Partitions In Distributed Systems) zeigt, dass dieser zweite Ansatz des stückweisen Determinismus machbar ist. RAPIDS ist ein Replikationsmanager für semi-aktive Replikation für GNAT (ein Entwicklungssystem für Ada 95), der gänzlich in den Laufzeitbibliotheken verkapselt ist. Die Replikation von Knoten ist somit weitgehend transparent für Anwendungen.

Zurück zum Seitenanfang

Replication Of Non-Deterministic Objects

Introduction

Abstract

Résumé

Zusammenfassung