ovn/OVN-GW-HA.md

OVN Gateway High Availability Plan
==================================
```
         +---------------------------+
         |                           |
         |     External Network      |
         |                           |
         +-------------^-------------+
                       |
                       |
                 +-----------+
                 |           |
                 |  Gateway  |
                 |           |
                 +-----------+
                       ^
                       |
                       |
         +-------------v-------------+
         |                           |
         |    OVN Virtual Network    |
         |                           |
         +---------------------------+

OVN Gateway
```

The OVN gateway is responsible for shuffling traffic between the tunneled
overlay network (governed by ovn-northd) and the legacy physical network.  In
a naive implementation, the gateway is a single x86 server or hardware VTEP.
For most deployments, a single system has enough forwarding capacity to service
the entire virtualized network.  However, it introduces a single point of
failure.  If this system dies, the entire OVN deployment becomes unavailable.
To mitigate this risk, an HA solution is critical -- by spreading
responsibility across multiple systems, no single server failure can take down
the network.

An HA solution is both critical to the manageability of the system and
extremely difficult to get right.  The purpose of this document is to propose
a plan for OVN Gateway High Availability which takes into account our past
experience building similar systems.  It should be considered a fluid, changing
proposal, not a set-in-stone decree.

Basic Architecture
------------------
In an OVN deployment, the set of hypervisors and network elements operating
under the guidance of ovn-northd are in what's called "logical space".  These
servers use VXLAN, STT, or Geneve to communicate, oblivious to the details of
the underlying physical network.  When these systems need to communicate with
legacy networks, traffic must be routed through a Gateway which translates from
OVN-controlled tunnel traffic to raw physical network traffic.

Since the gateway is typically the only system with a connection to the
physical network, all traffic between logical space and the WAN must travel
through it.  This makes it a critical single point of failure -- if
the gateway dies, communication with the WAN ceases for all systems in logical
space.

To mitigate this risk, multiple gateways should be run in a "High Availability
Cluster" or "HA Cluster".  The HA cluster will be responsible for performing
the duties of a gateway while being able to recover gracefully from
individual member failures.

```
         +---------------------------+
         |                           |
         |     External Network      |
         |                           |
         +-------------^-------------+
                       |
                       |
+----------------------v----------------------+
|                                             |
|          High Availability Cluster          |
|                                             |
| +-----------+  +-----------+  +-----------+ |
| |           |  |           |  |           | |
| |  Gateway  |  |  Gateway  |  |  Gateway  | |
| |           |  |           |  |           | |
| +-----------+  +-----------+  +-----------+ |
+----------------------^----------------------+
                       |
                       |
         +-------------v-------------+
         |                           |
         |    OVN Virtual Network    |
         |                           |
         +---------------------------+

OVN Gateway HA Cluster
```

##### L2 vs L3 High Availability
There are two broad approaches one can take to build such a cluster.
The HA cluster can appear to the network like a giant Layer 2 Ethernet switch,
or like a giant IP router.  These approaches are called L2HA and L3HA,
respectively.  L2HA allows Ethernet broadcast domains to extend into logical
space, a significant advantage, but this comes at a cost: the need to avoid
transient L2 loops during failover significantly complicates the design of an
L2HA cluster.  On the other hand, L3HA works for most use cases, is simpler,
and fails more gracefully.  For these reasons, it is suggested that OVN support
an L3HA model, leaving L2HA for future work (or third-party VTEP providers).
Both models are discussed further below.

L3HA
----
In this section, we'll work through a basic L3HA implementation, on top
of which we'll gradually build more sophisticated features, explaining their
motivations and implementations as we go.

### Naive Active-Backup
Let's assume that a tenant has asked for a collection of logical routers.  Our
task is to schedule each of these logical routers on one of N gateways, and to
gracefully reschedule the routers whose gateways have failed.  The
absolute simplest way to achieve this is what we'll call "naive active-backup".

```
+----------------+   +----------------+
| Leader         |   | Backup         |
|                |   |                |
|      A B C     |   |                |
|                |   |                |
+----+-+-+-+----++   +-+--------------+
     ^ ^ ^ ^    |      |
     | | | |    |      |
     | | | |  +-+------+---+
     + + + +  | ovn-northd |
     Traffic  +------------+

Naive Active Backup HA Implementation
```

In a naive active-backup, one of the Gateways is chosen (arbitrarily) as a
leader.  All logical routers (A, B, C in the figure) are scheduled on this
leader gateway and all traffic flows through it.  ovn-northd monitors this
gateway via OpenFlow echo requests (or some equivalent), and if the gateway
dies, it recreates the routers on one of the backups.

This approach works in most cases and should likely be the starting
point for OVN -- it's strictly better than no HA solution and is a good
foundation for more sophisticated solutions.  That said, it's not without its
limitations.  Specifically, this approach doesn't coordinate with the physical
network to minimize disruption during failures, it tightly couples failover
to ovn-northd (we'll discuss why this is bad in a bit), and it wastes resources
by leaving backup gateways completely unutilized.

##### Router Failover
When ovn-northd notices the leader has died and decides to migrate routers
to a backup gateway, the physical network has to be notified to direct traffic
to the new gateway.  Otherwise, traffic could be blackholed for longer than
necessary, making failovers worse than they need to be.

For now, let's assume that OVN requires all gateways to be on the same IP
subnet on the physical network.  If this isn't the case,
gateways would need to participate in routing protocols to orchestrate
failovers, something which is difficult and out of scope of this document.

Since all gateways are on the same IP subnet, we simply need to worry about
updating the MAC learning tables of the Ethernet switches on that subnet.
Presumably, they all have entries for each logical router pointing to the old
leader.  If these entries aren't updated, all traffic will be sent to the (now
defunct) old leader, instead of the new one.

In order to mitigate this issue, it's recommended that the new gateway send a
Reverse ARP (RARP) onto the physical network for each logical router it now
controls.  Reverse ARP is a benign protocol that many hypervisors use to
update L2 forwarding tables when virtual machines migrate.  In this case, the
Ethernet source address of the RARP is that of the logical router it
corresponds to, and its destination is the broadcast address.  This causes the
RARP to travel to every L2 switch in the broadcast domain, updating forwarding
tables accordingly.  This strategy is recommended in all failover mechanisms
discussed in this document -- when a router newly boots on a new leader, it
should RARP its MAC address.
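
To make the frame layout concrete, here is a minimal Python sketch of the
announcement described above.  It only builds the bytes; how a gateway would
actually transmit them is left out.  The EtherType (0x8035) and the "request
reverse" opcode (3) come from the RARP specification, while the function
names, the zeroed IP fields, and the example MAC address are assumptions made
for illustration.

```python
import struct

ETH_P_RARP = 0x8035        # EtherType assigned to Reverse ARP
RARP_OP_REQUEST = 3        # RARP "request reverse" opcode
BROADCAST_MAC = b"\xff" * 6


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into six raw bytes."""
    raw = bytes(int(octet, 16) for octet in mac.split(":"))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return raw


def build_rarp_announcement(router_mac: str) -> bytes:
    """Build a broadcast RARP frame sourced from a logical router's MAC."""
    src = mac_to_bytes(router_mac)
    # Destination is broadcast so every switch in the domain sees the frame.
    ethernet_header = BROADCAST_MAC + src + struct.pack("!H", ETH_P_RARP)
    rarp_body = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                 # hardware type: Ethernet
        0x0800,            # protocol type: IPv4
        6,                 # hardware address length
        4,                 # protocol address length
        RARP_OP_REQUEST,
        src, b"\x00" * 4,  # sender MAC; sender IP left unset (assumption)
        src, b"\x00" * 4,  # target MAC; target IP left unset (assumption)
    )
    return ethernet_header + rarp_body


# Hypothetical logical router MAC, taken from the documentation range.
frame = build_rarp_announcement("00:00:5e:00:53:01")
print(len(frame), frame[:14].hex())  # 42 ffffffffffff00005e0053018035
```

Switches learn from the Ethernet source address alone, so the payload matters
less than the header: any broadcast frame sourced from the router's MAC would
refresh the forwarding tables.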

### Controller Independent Active-Backup
```
+----------------+   +----------------+
| Leader         |   | Backup         |
|                |   |                |
|      A B C     |   |                |
|                |   |                |
+----------------+   +----------------+
     ^ ^ ^ ^
     | | | |
     | | | |
     + + + +
     Traffic

Controller Independent Active-Backup Implementation
```

The fundamental problem with naive active-backup is that it tightly couples the
failover solution to ovn-northd.  This can significantly increase downtime in
the event of a failover, as the (often already busy) ovn-northd controller has
to recompute state for the new leader.  Worse, if ovn-northd goes down, we
can't perform gateway failover at all.  This violates the principle that
control plane outages should have no impact on dataplane functionality.

In a controller independent active-backup configuration, ovn-northd is
responsible for initial configuration while the HA cluster is responsible for
monitoring the leader, and failing over to a backup if necessary.  ovn-northd
sets HA policy, but doesn't actively participate when failovers occur.

Of course, in this model, ovn-northd is not without some responsibility.  Its
role is to pre-plan what should happen in the event of a failure, leaving it
to the individual switches to execute this plan.  It does this by assigning
each gateway a unique leadership priority.  Once assigned, it communicates this
priority to each node it controls.  Nodes use the leadership priority to
determine which gateway in the cluster is the active leader by using a simple
metric: the leader is the healthy gateway with the highest priority.
If that gateway goes down, leadership falls to the next highest priority, and
conversely, if a new gateway comes up with a higher priority, it takes over
leadership.

Thus, in this model, leadership of the HA cluster is determined simply by the
status of its members.  Therefore, if we can communicate the status of each
gateway to each transport node, they can individually figure out which is the
leader, and direct traffic accordingly.
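
The election rule above is small enough to sketch directly.  The following
Python is an illustration only, not OVN code: the function name, the gateway
names, and the priority values are all hypothetical, and priorities are
assumed to be unique as the text requires.

```python
from typing import Mapping, Optional, Set


def elect_leader(priorities: Mapping[str, int], alive: Set[str]) -> Optional[str]:
    """Return the live gateway with the highest priority, or None."""
    candidates = [gw for gw in priorities if gw in alive]
    if not candidates:
        return None
    return max(candidates, key=lambda gw: priorities[gw])


# Hypothetical priorities as ovn-northd might hand them out.
priorities = {"gw1": 30, "gw2": 20, "gw3": 10}

print(elect_leader(priorities, {"gw1", "gw2", "gw3"}))  # gw1
print(elect_leader(priorities, {"gw2", "gw3"}))         # gw2 (gw1 has failed)
print(elect_leader(priorities, set()))                  # None
```

Because the result depends only on the priority table and the set of live
gateways, every node that sees the same liveness information reaches the same
answer without consulting ovn-northd.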

##### Tunnel Monitoring
Since in this model leadership is determined exclusively by the health status
of member gateways, a key problem is how to communicate this information to
the relevant transport nodes.  Luckily, we can do this fairly cheaply using
tunnel monitoring protocols like BFD.

The basic idea is pretty straightforward.  Each transport node maintains a
tunnel to every gateway in the HA cluster (not just the leader).  These
tunnels are monitored using the BFD protocol to see which are alive.  Given
this information, hypervisors can trivially compute the highest priority live
gateway, and thus the leader.

In practice, this leadership computation can be performed using the
bundle or group action.  Rather than using OpenFlow to simply output to the
leader, all gateways could be listed in an active-backup bundle action ordered
by their priority.  The bundle action will automatically take into account the
tunnel monitoring status to output the packet to the highest priority live
gateway.
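
As a rough sketch of what ovn-controller might generate, the Python below
renders an active-backup bundle action from a priority table.  The helper and
its arguments are hypothetical.  The action text follows the
`bundle(fields,basis,algorithm,ofport,members:...)` form accepted by
ovs-ofctl; older Open vSwitch releases spell the last keyword `slaves`, so the
keyword is a parameter here and should be checked against the installed
version.

```python
from typing import Mapping


def active_backup_bundle(priorities: Mapping[int, int],
                         members_keyword: str = "members") -> str:
    """Render a bundle action listing tunnel ports in priority order.

    priorities maps the OpenFlow port of each gateway tunnel to that
    gateway's leadership priority (higher wins).
    """
    ordered = sorted(priorities, key=priorities.get, reverse=True)
    ports = ",".join(str(port) for port in ordered)
    # With active_backup, the first live port in the list is chosen; the
    # hash fields and basis are still required by the action syntax.
    return f"bundle(eth_src,0,active_backup,ofport,{members_keyword}:{ports})"


# Hypothetical tunnel ports 4, 6 and 8 with priorities 10, 20 and 30.
print(active_backup_bundle({4: 10, 6: 20, 8: 30}))
# bundle(eth_src,0,active_backup,ofport,members:8,6,4)
```

The flow itself never changes during a failover.  Only the liveness of the
listed ports changes, which is what keeps the controller out of the failover
path.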

##### Inter-Gateway Monitoring
One somewhat subtle aspect of this model is that failovers are not globally
atomic.  When a failover occurs, it will take some time for all hypervisors to
notice and adjust accordingly.  Similarly, if a new high priority Gateway comes
up, it may take some time for all hypervisors to switch over to the new leader.
In order to avoid confusing the physical network under these circumstances,
it's important for the backup gateways to drop traffic they've received
erroneously.  In order to do this, each Gateway must know whether or not it is,
in fact, active.  This can be achieved by creating a mesh of tunnels between
gateways.  Each gateway monitors the other gateways in its cluster to determine
which are alive, and therefore whether or not that gateway happens to be the
leader.  If leading, the gateway forwards traffic normally; otherwise it drops
all traffic.

##### Gateway Leadership Resignation
Sometimes a gateway may be healthy, but still may not be suitable to lead the
HA cluster.  This could happen for several reasons including:

* The physical network is unreachable.
* BFD (or ping) has detected the next hop router is unreachable.
* The Gateway recently booted and isn't fully configured.

In this case, the Gateway should resign leadership by holding its tunnels down
using the other_config:cpath_down flag.  This indicates to participating
hypervisors and Gateways that this gateway should be treated as if it's down,
even though its tunnels are still healthy.
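
Seen from a single gateway, the two subsections above reduce to one decision:
forward or drop.  The Python sketch below is one hypothetical way to express
it.  The `LocalHealth` fields mirror the three resignation reasons listed
above, and `live_peers` is assumed to contain only those peers whose tunnels
are up and which have not resigned.

```python
from dataclasses import dataclass
from typing import Mapping, Set


@dataclass
class LocalHealth:
    physical_link_up: bool
    next_hop_reachable: bool
    fully_configured: bool

    def fit_to_lead(self) -> bool:
        return (self.physical_link_up
                and self.next_hop_reachable
                and self.fully_configured)


def should_forward(me: str, priorities: Mapping[str, int],
                   live_peers: Set[str], health: LocalHealth) -> bool:
    """Decide whether this gateway forwards traffic or drops it."""
    if not health.fit_to_lead():
        return False  # resign: the gateway would also hold its tunnels down
    # Lead only if no live peer outranks us.
    return not any(priorities[gw] > priorities[me]
                   for gw in live_peers if gw != me)


priorities = {"gw1": 30, "gw2": 20}
healthy = LocalHealth(True, True, True)
no_uplink = LocalHealth(True, False, True)

print(should_forward("gw2", priorities, {"gw1"}, healthy))    # False
print(should_forward("gw2", priorities, set(), healthy))      # True
print(should_forward("gw1", priorities, {"gw2"}, no_uplink))  # False
```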

### Router Specific Active-Backup
```
+----------------+ +----------------+
|                | |                |
|      A C       | |     B D E      |
|                | |                |
+----------------+ +----------------+
              ^ ^   ^ ^
              | |   | |
              | |   | |
              + +   + +
               Traffic

Router Specific Active-Backup
```

Controller independent active-backup is a great advance over naive
active-backup, but it still has one glaring problem -- it under-utilizes the
backup gateways.  In an ideal scenario, all traffic would split evenly among the
live set of gateways.  Getting all the way there is somewhat tricky, but as a
step in that direction, one could use the "Router Specific Active-Backup"
algorithm.  This algorithm looks a lot like active-backup on a per logical
router basis, with one twist: it chooses a different active Gateway for each
logical router.  Thus, in situations where there are several logical routers,
all with somewhat balanced load, this algorithm performs better.

Implementation of this strategy is quite straightforward if built on top of
basic controller independent active-backup.  On a per logical router basis, the
algorithm is the same: leadership is determined by the liveness of the
gateways.  The key difference here is that the gateways must have a different
leadership priority for each logical router.  These leadership priorities can
be computed by ovn-northd just as they had been in the controller independent
active-backup model.

Once we have these per logical router priorities, they simply need to be
communicated to the members of the gateway cluster and the hypervisors.  The
hypervisors, in particular, simply need an active-backup bundle action (or
group action) per logical router listing the gateways in priority order for
*that router*, rather than having a single bundle action shared for all the
routers.

Additionally, the gateways need to be updated to take into account individual
router priorities.  Specifically, each gateway should drop traffic of backup
routers it's running, and forward traffic of active routers, instead of simply
dropping or forwarding everything.  This should likely be done by having
ovn-controller recompute OpenFlow for the gateway, though other options exist.

The final complication is that ovn-northd's logic must be updated to choose
these per logical router leadership priorities in a more sophisticated manner.
It doesn't matter much exactly what algorithm it chooses to do this, beyond
that it should provide good balancing in the common case, i.e., each logical
router's priorities should be different enough that routers balance to
different gateways even when failures occur.
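
Since the text leaves the algorithm open, here is one candidate, sketched in
Python: rendezvous (highest-random-weight) hashing.  This is a technique
chosen for illustration, not something the proposal prescribes.  Each
(router, gateway) pair gets a pseudo-random weight, and a router's gateways
are ranked by that weight.  The function names and the router and gateway
names are hypothetical.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple


def rank_gateways(router: str, gateways: Sequence[str]) -> List[str]:
    """Order gateways for one router, highest priority first."""
    def weight(gw: str) -> Tuple[int, str]:
        digest = hashlib.sha256(f"{router}/{gw}".encode()).digest()
        # The gateway name breaks the (very unlikely) tie between weights.
        return (int.from_bytes(digest[:8], "big"), gw)
    return sorted(gateways, key=weight, reverse=True)


def assign_priorities(routers: Sequence[str],
                      gateways: Sequence[str]) -> Dict[str, Dict[str, int]]:
    """Map router -> gateway -> unique leadership priority (higher wins)."""
    table = {}
    for router in routers:
        ranked = rank_gateways(router, gateways)
        table[router] = {gw: len(ranked) - i for i, gw in enumerate(ranked)}
    return table


table = assign_priorities(["A", "B", "C", "D", "E"], ["gw1", "gw2", "gw3"])
for router, prios in table.items():
    print(router, prios)
```

Because each weight depends only on the router and the gateway, removing a
gateway never reorders the remaining ones.  The routers it was leading fall
back to their own second choices, which differ from router to router, so the
displaced load tends to spread out rather than land on a single backup.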

##### Preemption
In an active-backup setup, one issue that users will run into is that of
gateway leader preemption.  If a new Gateway is added to a cluster, or for some
reason an existing gateway is rebooted, we could end up in a situation where
the newly activated gateway has higher priority than any other in the HA
cluster.  In this case, as soon as that gateway appears, it will
preempt leadership from the currently active leader, causing an unnecessary
failover.  Since failover can be quite expensive, this preemption may be
undesirable.

The controller can optionally avoid preemption by cleverly tweaking the
leadership priorities.  For each router, new gateways should be assigned
priorities that put them second in line or later when they eventually come up.
Furthermore, if a gateway goes down for a significant period of time, its old
leadership priorities should be revoked and new ones should be assigned as if
it's a brand new gateway.  Note that this should only happen if a gateway has
been down for a while (several minutes); otherwise a flapping gateway could
have wide-ranging, unpredictable consequences.

Note that preemption avoidance should be optional depending on the deployment.
One necessarily sacrifices optimal load balancing to satisfy these
requirements, as new gateways will get no traffic on boot.  Thus, this feature
represents a trade-off which must be made on a per installation basis.
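
One way the controller could place a new (or long-absent) gateway "second in
line" is sketched below in Python.  The function is hypothetical and works on
a single router's ranking, a list of gateways ordered from highest to lowest
priority.  Deciding when a gateway has been down long enough to count as new
is left to the caller.

```python
from typing import List, Set


def add_gateway_without_preemption(ranking: List[str], live: Set[str],
                                   new_gw: str) -> List[str]:
    """Insert new_gw directly behind the current leader of one router."""
    # Revoke any stale position the gateway held before it went away.
    ranking = [gw for gw in ranking if gw != new_gw]
    for position, gw in enumerate(ranking):
        if gw in live:
            # gw is the current leader; the newcomer goes right after it.
            return ranking[:position + 1] + [new_gw] + ranking[position + 1:]
    # No live leader exists, so there is nobody to preempt.
    return ranking + [new_gw]


# gw1 is down, so gw2 currently leads this router.
print(add_gateway_without_preemption(["gw1", "gw2", "gw3"],
                                     {"gw2", "gw3"}, "gw4"))
# ['gw1', 'gw2', 'gw4', 'gw3']
```

In the example, gw2 keeps leadership and gw4 only takes over if gw2 later
fails, which is exactly the load-balancing cost described above.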

### Fully Active-Active HA
```
+----------------+ +----------------+
|                | |                |
|   A B C D E    | |    A B C D E   |
|                | |                |
+----------------+ +----------------+
              ^ ^   ^ ^
              | |   | |
              | |   | |
              + +   + +
               Traffic
```

The final step in L3HA is to have true active-active HA.  In this scenario each
router has an instance on each Gateway, and a mechanism similar to ECMP is used
to distribute traffic evenly among all instances.  This mechanism would require
Gateways to participate in routing protocols with the physical network to
attract traffic and signal failures.  It is out of scope of this document,
but may eventually be necessary.

L2HA
----
L2HA is very difficult to get right.  Unlike L3HA, where the consequences of
problems are minor, in L2HA if two gateways are both transiently active, an L2
loop triggers and a broadcast storm results.  In practice, to get around this,
gateways end up implementing an overly conservative "when in doubt drop all
traffic" policy, or they implement something like MLAG.

MLAG has multiple gateways work together to pretend to be a single L2 switch
with a large LACP bond.  In principle, it's the right solution to the problem as
it solves the broadcast storm problem, and has been deployed successfully in
other contexts.  That said, it's difficult to get right and not recommended.
 374 it solves the broadcast storm problem, and has been deployed successfully in
 375 other contexts.  That said, it's difficult to get right and not recommended.