Fault Tolerance

13 minute read

Updated:

Introduction – Dependability

  • Being fault tolerant is strongly related to dependable systems.
  • image
  • The figure that was introduced for SOA – A component provides services to clients, and to provide services it may require services from other components, and they interact with each other. → imp a component may depend on some other component
  • Logically, a component C depends on c* if the correctness of C’s behaviour depends on the correctness of C* behaviour.
  • Dependability is 1 to 1 logical matching, where component c depends on c* if the correctness of C depends on correctness od c* → correctness is key
    • C can be a process, component, channel etc

Dependability – Requirements

  • In order to put the concept of fault tolerance on DS, we have to look at dependability. We need to look at it bc in fault tolerance, there’s this aspect of dependability, but also correctness.
  • We go back to first lecture, and we look at the challenges of DS. → fault tolerance!
  • If this concept of dependability is key, there is some key requirements. Understand the differences.
    • Availability – system is available and is ready to be used.
      • Its constantly there, you want to use it instantly → usually refers to the concept of ‘instant time
    • Reliability - the continuity of service delivery without failure.
      • Refers to the ‘interval of time’.
      • High reliability means that it continues to work without interruption during a relatively long period of time.
      • The system has been reliable for the past hour/ days. Bc to the continuity of the delivery, it will be related to time interval.
    • Safety – when a system temporarily fails to operate correctly, there is very low probability of something catastrophic situations
      • When sending someone in the space we need high safety. Failure for a brief moment can result in sth disastrous.
    • Maintainability – how easy can a failed system be repaired
      • Once ur system fails it needs to detect the fault and deal with the failure, and get the system back to service.

Reliability vs Availability

  • Example
    • Suppose you have a comp system with 100 machines, and for one hour, ur system fails for one millisecond and back to service. In this case, you could say that the system was 99.9% available, but it has low reliability – in a sense that in a millisecond, I can’t trust that the system was functioning properly
    • There’s a system that works 12 month a year – in July and August you decide to shut down the system. In this case, the availability is pretty good, for 95%, but for those 10-month you know it worked without crash. And the reliability is high!!
  • → difference btwn: availability – instant time, and reliability – time interval
  • Reliability of component C R(t)
    • Is a conditional probability that C has been function correctly during time interval 0-t, given C was functioning correctly at T=0.
  • 3 VERY important metrics
  • Screenshot 2020-02-25 at 9 23 34 pm
    • Mean Time To Failure - MTTF – the average time until a component fails
      • System is up, and it works properly but it fails and its detected.
      • The red parts in the diagram
    • Mean Time To Repair - MTTR – the average time needed to replier a component
      • The red part
    • Mean Time Between Failures - MTBF – MTTF + MTTR -
    • Well known and v imp metrics

Reliability vs Availability II

  • Availability of component C – A(t), aka Is the system instantly ready – you can look at the availability of component C, which is the average fraction of the time that component C has been up and running in interval 0 and t.
    • Screenshot 2020-02-25 at 9 24 02 pm→ fraction of. time that its been available
  • Ideally, in building DS, the long-term availability – A(∞), A. expect the system to be always be available. If we look at the metrics, availability is:
    • Mean time to failure / (mean time to failure + mean time to repair)
  • if this is an imp metric I can make sense of, you will hopefully have an accurate notion of when the failure occurs and its consequences.
  • If we look at how we define A, its MTTF / (MTTF+MTTR) If MTTR is small, the mean time to repair, then it’s in ur interest coz then ur system is up and running quickly!!

Terminology

  • Do we understand the diff btwn failure, error and fault. We may tend to look at the definition and interchange them.
  • Screenshot 2020-02-25 at 9 25 14 pm

Failure models

  • We’d like to look at the type of failure that’s likely to happen in DS
  • Screenshot 2020-02-25 at 9 26 19 pm

Dependability vs Security

  • Omission failures– Its dangerous bc in omission failure, it could’ve forgotten an action that it should’ve taken - to react to the system
  • Commission failures- a component takes an action that it shouldn’t have taken
  • Deliberate failures – common in any DS when it comes to security – gotta distinguish btwn deliberate and unintentional
    • In DS, context of security, say there was a security breach – is it deliberate or unintentional? hard to prove.

Failure Masking by Redundancy

  • In order to deal with fault tolerance, DS adopts redundancy. By redundancy, hopefully you will mask those failures
  • 3 different failure masking redundancy
    • Information- you can add an extra bit of data so that errors can be recovered when bits are garbled
      • E.g. in networks when sending a packet, the sender does CRC, adds extra bits In case there’s a noise on the medium so that you can detect that error has happened
    • Time redundancy – Design a system such that an action can be performed again if anything went wrong.
      • TCP – retransmission request to a server when lacking an expected response. you go back to sender and contact.
      • To do with timing, and therefore timer can play imp role
    • Physical – Add more resources so when a system fails it can still function properly
      • To do with software processes or hardware system – you add more server and services to prevent from failing

** Process Resilience**

  • In order to support fault tolerance, we replicate processes into groups.
  • Process replication is a key in DS, so more process you add, unlikely ur DS will fail. In order to protect malfunction processes, you needa organise them according to groups, so that if one process fails some other can take over.

    Screenshot 2020-02-25 at 9 27 19 pm

  • Flat group
    • All processes are equal and there is no leader – decisions are made collectively
    • Advantage is that it is symmetrical and has no single point of failure. If one fails, the group simply gets smaller
    • Disadvantage is that decision making is more complicating and could introduce delay and overhead
  • Hierarchical group
    • There is a coordinator and all the others are workers. And tasks and requests are sent to cordis.
    • If the coordinator fails, it brings the entire group down, but as long as its running it can make decisions without bothering others.
    • If the cordi fails, you needa elect new coordinator, using Leader election algorithm.

Groups and Failure Masking

  • K-fault tolerant group – when a group of processors can mask any k concurrent member failures.
    • K is called degree of fault tolerance
    • So when the system can survive faults in k components and still meet its specifications.
  • Q: What needs to be the size for my system to be fault tolerant?
    • For halting failures (crash/ omission/ timing failures)
      • You will need k+1 members in ur system. if I have 4 members, We need one that’s alive lmao. Hence k+1
    • With arbitrary failures
      • Minimum of 2k+1 processes are need to achieve k-fault tolerance.
      • Coz you need consensus that rest of the processors would agree – its to do with majority vote. If I have 5 and 2 fails, 3 would say that its still alive.

Consensus

  • Concept of consensus in fault tolerance is really important
  • Non faulty will have to do the job – and this work will consist of executing same command in the same order- need to progress all together.
  • With consensus, the non-faulty processors, in order to get an agreement, they need to reach a decision on what command to execute next
    • In recent blockchain, consensus plays a role - due to the distribution of the database, consensus is key!! And there’s always a consensus algorithm.

Consensus in faulty systems with crash failures

  • We have a system and we have potential faults, processes, and you accept the fact that failures may happen.
  • In order to reach an agreement, we look at flooding base consensus
  • System model
    • Process group - you have a system made of number of processors – P = {P1,P2…}
    • if a failure occurs, a fail will be detected and dealt with.
    • There’s a client that contacts Pi requesting it to execute a command
    • Each process in the group in the process maintains a list of proposed commands, and they need to agree on which to execute
  • Algorithm steps
    1. In round r, the process Pi multicasts its known set of commands, C ri, to all others.
    2. At the end of the round r, each Pi is gonna merge all the received commands into a new C i r+1. We have moved on to the next stage all together
  • To decide which command to execute, it selects a globally shared, deterministic function.

Consensus in faulty systems with crash failures II

  • But issue happens when they don’t have the same list hence can’t agree on the command.
  • image
  • You have 4 processors, and each process multicasts its known commands, and these commands are gonna be merged.
  • P2, P3, P4, sent all its commands, but p1 has crashed in round r. but it managed to send info to p2. Interesting is that which command are they gonna decide what to execute.
  • As P2 is the lucky one coz it received everything, so it makes a decision
  • P3, P4 detects that P1 failed as it hasn’t received p1, this case they cannot move forward coz they have sth missing. And it doesn’t know if P2 received anything or not!! → cannot make decision
  • P3 and P4 has to wait for the next round.
  • P2 makes a decision and broadcast decisions to the others
  • Now P3, P4 received the command from P2, now they should be able make decisions.
    • In round r, nothing happened – but since they’ve all moved on, they can execute → consensus!!
  • They can decide to execute the same command selected by P2
  • -> need 2k+1 servers to survive k crashed member, under assumption that the messages are consistent

Consensus in faulty systems with arbitrary failures

  • Here you complicate the problem even more bc the communication btwn process in itself is inconsistent.

    Screenshot 2020-02-25 at 9 29 10 pm

1) 3 processors. P1 seem to have sent a to P2 and P3. P2 is a nasty process so it forwards improper message to the others. It forwarded b!! if you are P3, you received 2 different messages

2) P1 (the master process) is misbehaving and sending diff things to p2 and P3. P2 being naïve and sends b to P3 and now P3 is confused

  • → Problematic!

Consensus in faulty systems with arbitrary failures II

  • System model
    • Process group consisting of n members, which one is designated to be the primary P and n-1 backups B1… Bn-1.
    • A client sends a value v, of true or false to P (primary)
    • Messages can be lost but this can be detected
    • Messages can’t be corrupted beyond detection
    • A receiver of the message can reliably detect its sender.
  • Byzantine Agreement: requirements
    1. BA1 - For every non-faulty backup process, it stores the same value
    2. BA2 - If the primary is non-faulty, then every nonfaulty backup process stores what the primary had sent.
  • Observation
    1. Primary is faulty → BA1 says that backups may store the same, but different value than originally sent by the client.
    2. Primary is non faulty → satisfying BA2 implies that BA1 is satisfied.

Why having 3k process is not enough

  • Situation where we want to tolerate a single process, aka k=1.

Screenshot 2020-02-25 at 9 30 50 pm

1) P is a faulty process - It means that process P is misbehaving as it sends T to B1 and F to B2. B2 as part of the process receives F, and forwards that to B1. B1 forward T to B2. B1 having T,F and B2 having T,F → so what is it? → There is no consensus as they cannot agree.

2) Assuming process P is non faulty, it sends T to B1 and B2. B1 is the faulty one and sends diff one to B2 hence B2 has T and F → no consensus!

→ 2 eg where there’s no agreement

Why having 3k+1 processes is enough

  • If we have 3k+1, and you tolerate the failure of a single process, then 3(1)+1 is possible.

Screenshot 2020-02-25 at 9 31 47 pm

1) Primary is faulty and provides inconsistent information to backups – in first round, T to B1 B3, F to B2. All the backups will forward what they got to all the others. They all have these extra values. In the second round, if all 3 forward the values they received, then you can agree on consensus on value T.

2) One of the backup fails- P is non faulty, but B2 misbehaves and sends diff value to B1 and B3. In this case, with 2 rounds, B1 and B3 will come to same conclusion – since they can agree and it’s the same as what P sent out, consensus is reached! → fulfils BA2.

→ for 1 fault, you need 3k+1 processes- as you need to go 2 rounds

The Byzantine Generals problem

  • It’s to do with reference to byzantine empire in eastern Europe.
  • The generals need to communicate other by messenger, and after observing the enemy they need to come up with a plan of action – whether its attack or retreat. They need to agree on a common decision!
  • Some are royal and some are traitors – and you need algorithm to reach agreement
    1. All loyal generals decide upon the same plan of action
    2. AND that a small number of traitors cannot cause the loyal generals to adopt a bad plan

1 commander, 2 Lieutenants and 2 types of messages

  • Screenshot 2020-02-25 at 9 32 23 pm

1) 2 is a traitor and 1 has received 2 diff messages and doesn’t know what to follow.

  • 1 follows the commander bc of the hierarchy of army – and still 1/3 of the army is weaker by force and this creates confusion

2) Commander is a traitor

  • And 2/3 of the total army has followed the incorrect order!! → failure is certain

1 commander, n Lieutenants and m types of messages

  • Screenshot 2020-02-25 at 9 33 35 pm→ 1 commander, 3 Lieutenants and 3 msgs
  • If the commander itself is faulty, and there are 3 Lieutenants, each of them are receiving 3 different messages. Hence, they don’t know what to do! → this is what happens in DS and they need consensus

The Byzantine Generals problem

  • V imp problem in DS → having consensus.
  • Problem definition: A commanding general must send an order to his n-1 lieutenant generals such that
  • 2 v imp condition to ensure there are consensus
    1. All loyal lieutenants obey the same order
    2. If the commanding general is loyal, then every loyal lieutenant obeys the order he sends
  • In DS terms – given a network of n process which can comm with one another only by meas of messages on a bi-directional channel, ensure that a process sends an item of data to n-1 such that
    1. Reliably operating processes receive the same item
    2. If the sending process sis operating reliably then the item received is identical to the item sent
      • → refer to BA1, BA2
  • In blockchain, you need consensus and its more complex – and they rely on byzantine algorithm

Leave a comment