Failures to Tolerate (FTT)

Reading Time: 5 mins

FTT is a familiar name when working with vSAN, and I consider it the most important policy when designing a vSAN cluster. In this post, I am going to quickly walk you through the use of this policy and how it works.

The Number of Failures to Tolerate capability addresses the key customer and design requirement of availability. With FTT, availability is provided by maintaining replica copies of data to mitigate the risk of a host failure resulting in lost connectivity to data or potential data loss. To start with, let me tell you about the number of hosts required to achieve a specific FTT value:

If you want to tolerate n failures, you need 2n + 1 ESXi hosts in the vSAN cluster, as shown below:

  • FTT=1 requires 3 hosts
  • FTT=2 requires 5 hosts
  • FTT=3 requires 7 hosts

What if the cluster does not have the required number of hosts to satisfy the FTT value defined in the storage policy and you try to deploy a VM? The VM creation fails with the error “Cannot complete file creation operation”.
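To sanity-check the rule, here is a minimal Python sketch (purely illustrative and not any vSAN API; the function names are my own) that applies the 2n + 1 math and the kind of pre-check that produces the error above:

```python
# Minimal illustrative sketch of the 2n + 1 rule for RAID-1 mirroring.
# This is not a vSAN API; it only reproduces the arithmetic behind the
# host-count requirement and the deploy-time check described above.

def min_hosts_for_ftt(ftt: int) -> int:
    """Hosts needed to tolerate `ftt` failures with RAID-1 mirroring:
    ftt + 1 data replicas plus ftt witness components = 2*ftt + 1."""
    return 2 * ftt + 1

def can_deploy(cluster_hosts: int, ftt: int) -> bool:
    """True if the cluster has enough hosts to satisfy the policy."""
    return cluster_hosts >= min_hosts_for_ftt(ftt)

if __name__ == "__main__":
    for ftt in (1, 2, 3):
        print(f"FTT={ftt}: needs {min_hosts_for_ftt(ftt)} hosts")
    # A 4-host cluster cannot satisfy an FTT=3 policy, which is when
    # you would hit "Cannot complete file creation operation".
    print(can_deploy(cluster_hosts=4, ftt=3))  # False
```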

FTT supports different types of availability, which can be chosen when creating a policy. In its most basic form, let’s think about the FTT policy that provides mirrored (RAID 1) availability. For instance, FTT=1 provides n+1 availability by keeping a second copy of the data on a separate host in the cluster. However, as you would expect, the resulting impact on capacity is that it is doubled. For example, if you deploy a virtual machine with a 200 GB disk and copy 40 GB of data to the VM, without any further configuration of space efficiency technologies like deduplication and compression, the following impact on sizing would result (a quick sketch reproducing these numbers follows the list):
  • FTT=0 Results in 40GB of used capacity (not recommended)
  • FTT=1 Results in 80GB of used capacity (n+1)
  • FTT=2 Results in 120GB of used capacity (n+2)
  • FTT=3 Results in 160GB of used capacity (n+3)
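Here is a quick back-of-the-envelope Python sketch that reproduces those numbers (illustrative only; it ignores witness components, swap objects, and any deduplication or compression savings):

```python
# Illustrative sketch of the RAID-1 capacity math used above.
# Ignores witness components, object metadata, and space efficiency.

def raid1_used_capacity_gb(written_gb: float, ftt: int) -> float:
    """RAID-1 mirroring keeps ftt + 1 full copies of the written data."""
    return written_gb * (ftt + 1)

if __name__ == "__main__":
    for ftt in range(4):
        print(f"FTT={ftt}: {raid1_used_capacity_gb(40, ftt):.0f} GB used")
    # FTT=0: 40 GB, FTT=1: 80 GB, FTT=2: 120 GB, FTT=3: 160 GB
```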

Limitations of a Two-Host or Three-Host Cluster Configuration:

In a two-host cluster (two hosts plus a witness) and in a three-host cluster, the only supported FTT value is 1. vSAN saves each of the two required replicas of VM data on separate hosts, with the witness object on a third host. Because of the small number of hosts in the cluster, the following limitations exist:

  • When a host fails, vSAN cannot rebuild data on another host to protect against another failure, so the data remains non-compliant until the host comes back online.

  • If a host must enter maintenance mode, vSAN cannot evacuate data from the host to maintain policy compliance. While the host is in maintenance mode, data is exposed to a potential failure or inaccessibility if an additional failure occurs. So, full data migration is not supported in these two-node or three-node clusters.
  • In any situation where a two-host or three-host cluster has an inaccessible host or disk group, vSAN objects are at risk of becoming inaccessible should another failure occur.

Erasure Coding (RAID 5):

Erasure coding provides the same level of redundancy as mirroring but with a reduced capacity requirement. With RAID 5, a minimum of four hosts is required. Capacity consumption with RAID 5 erasure coding is reduced by 33 percent compared to RAID 1 while still providing an FTT of 1.

Erasure coding is a method of taking data, breaking it into multiple pieces, and spreading it across multiple devices while adding parity data. This allows the data to be recreated if one or more of the pieces is corrupted or lost. Although several methods of erasure coding exist, vSAN supports RAID 5 and RAID 6 data placement and parity patterns as a way of surviving failures while providing space efficiency compared to RAID 1 mirroring. With RAID 5, the data is placed in a 3 + 1 pattern across hosts. If a single host fails, data is still available. As with any other storage solution, vSAN RAID 5 requires less capacity than mirroring, but a performance penalty might exist for workloads that are extremely write intensive or very sensitive to latency.
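To see where the 33 percent figure comes from, here is a small illustrative Python sketch comparing the 3 + 1 RAID 5 layout against RAID 1 at the same FTT=1 (function names are my own; witness and metadata overheads are ignored):

```python
# Illustrative capacity comparison: RAID-5 (3 data + 1 parity) versus
# RAID-1 mirroring, both tolerating a single failure (FTT=1).

def raid5_used_gb(written_gb: float) -> float:
    """3 + 1 stripe: raw consumption is 4/3 of the written data (1.33x)."""
    return written_gb * 4 / 3

def raid1_ftt1_used_gb(written_gb: float) -> float:
    """RAID-1 with FTT=1 keeps two full copies (2x)."""
    return written_gb * 2

if __name__ == "__main__":
    written = 100
    r5, r1 = raid5_used_gb(written), raid1_ftt1_used_gb(written)
    print(f"RAID-5: {r5:.1f} GB vs RAID-1: {r1:.1f} GB")
    print(f"Reduction vs RAID-1: {(1 - r5 / r1) * 100:.0f}%")  # ~33%
```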

Erasure Coding (RAID 6):

With RAID 6, the number of failures to tolerate is two, and a minimum of six hosts is required. Capacity consumption with RAID 6 erasure coding is reduced by 50 percent compared to RAID 1 while still providing an FTT of 2.

With RAID 6, the data is placed in a 4 + 2 pattern across hosts. If a host fails, data is still available and protected from an additional failure.
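The same style of sketch shows the 50 percent figure for the 4 + 2 RAID 6 layout against RAID 1 at FTT=2 (again purely illustrative):

```python
# Illustrative capacity comparison: RAID-6 (4 data + 2 parity) versus
# RAID-1 mirroring, both tolerating two failures (FTT=2).

def raid6_used_gb(written_gb: float) -> float:
    """4 + 2 stripe: raw consumption is 6/4 of the written data (1.5x)."""
    return written_gb * 6 / 4

def raid1_ftt2_used_gb(written_gb: float) -> float:
    """RAID-1 with FTT=2 keeps three full copies (3x)."""
    return written_gb * 3

if __name__ == "__main__":
    written = 100
    r6, r1 = raid6_used_gb(written), raid1_ftt2_used_gb(written)
    print(f"RAID-6: {r6:.1f} GB vs RAID-1: {r1:.1f} GB")
    print(f"Reduction vs RAID-1: {(1 - r6 / r1) * 100:.0f}%")  # 50%
```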
 

Note: RAID 5/6 (erasure coding) does not support 3 failures to tolerate.

 

Thanks for Reading

6 thoughts on “Failures to Tolerate (FTT)”

  1. FTT 1, 3 failures supported, 2*3 + 1 = 7 hosts.
    What happens if 3 hosts fail simultaneously and the VM’s 2 copies and witness are on those hosts?

    1. FTT 1 supports only one failure, whether that is a drive or a host. In case of a host or drive failure, the components will be rebuilt on other hosts in the cluster. Note: a rebuild can only happen when the cluster has more hosts than the policy minimum; if the cluster has only 3 hosts and FTT is set to 1, the data will stay non-compliant after a host failure because there is no spare host for the rebuild. If you want to survive 3 host failures, change the policy to FTT 3, and yes, you should have sufficient storage space available to accommodate the components.

  2. Hi, if you have a cluster with, for instance, 10 ESXi hosts, a RAID-1 policy with FTT=1, plenty of free space, and one host fails, the component will be rebuilt on another host in the cluster. In this case, can the cluster tolerate another ESXi host failure? Thanks

    1. Hi Jose,

      Once the failed host’s components are rebuilt successfully and all objects are compliant with no other errors, the vSAN cluster can sustain another host failure, regardless of whether the previously failed host is healthy again or still in a failed state.

    2. After the component is rebuilt, you will again be able to sustain a single node failure. If that second node fails before the initial rebuild is completed, you won’t be able to access that VM until the first rebuild completes. Your VM will remain accessible while the second node is rebuilt.
