ovn/TODO

   1 -*- outline -*-
   2
   3 * L3 support
   4
   5 ** OVN_Northbound schema
   6
   7 *** Needs to support interconnected routers
   8
   9 It should be possible to connect one router to another, e.g. to
  10 represent a provider/tenant router relationship.  This requires
  11 an OVN_Northbound schema change.
  12
  13 *** Needs to support extra routes
  14
  15 Currently a router port has a single route associated with it, but
  16 presumably we should support multiple routes.  For connections from
  17 one router to another, this doesn't seem to matter (just put more than
  18 one connection between them), but for connections between a router and
  19 a switch it might matter because a switch has only one router port.
  20
  21 ** OVN_Southbound schema
  22
  23 *** Logical datapath interconnection
  24
  25 There needs to be a way in the OVN_Southbound database to express
  26 connections between logical datapaths, so that packets can pass from a
  27 logical switch to its logical router (and vice versa) and from one
  28 logical router to another.
  29
  30 One way to do this would be to introduce logical patch ports, closely
  31 modeled on the "physical" patch ports that OVS has had for ages.  Each
  32 logical patch port would consist of two rows in the Port_Binding table
  33 (one in each logical datapath), with type "patch" and an option "peer"
  34 that names the other logical port in the pair.
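
As a rough illustration of the patch port idea, here is a minimal Python
sketch that models Port_Binding rows as plain dicts.  The helper names
(make_patch_pair, peer_of) and the dict keys are assumptions made for
the example, not a proposed schema; the only things taken from the text
above are the "patch" type, the "peer" option, and the requirement that
port names be unique.

```python
# Hypothetical sketch: Port_Binding rows are modeled as plain dicts.
def make_patch_pair(dp_a, name_a, dp_b, name_b):
    """Return the two Port_Binding-like rows that form one logical patch port."""
    if name_a == name_b:
        raise ValueError("logical port names must be unique")
    row_a = {"datapath": dp_a, "logical_port": name_a,
             "type": "patch", "options": {"peer": name_b}}
    row_b = {"datapath": dp_b, "logical_port": name_b,
             "type": "patch", "options": {"peer": name_a}}
    return row_a, row_b


def peer_of(rows, name):
    """Follow the "peer" option from one end of a patch pair to the other."""
    by_name = {r["logical_port"]: r for r in rows}
    return by_name[by_name[name]["options"]["peer"]]
```

With rows = make_patch_pair("sw0", "sw0-to-lr0", "lr0", "lr0-to-sw0"),
peer_of(rows, "sw0-to-lr0") returns the row that lives in datapath "lr0".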
  35
  36 If we do it this way then we'll need to figure out one odd special
  37 case.  Currently the ACL table documents that the logical router port
  38 is always named "ROUTER".  This can't be done directly with this patch
  39 port technique, because every row in the Logical_Port table must have
  40 a unique name.  This probably means that we should change the
  41 convention for the ACL table so that the logical router port name is
  42 unique; for example, we could change the Logical_Router_Port table to
  43 require the 'name' column to be unique, and then use that name in the
  44 ACL table.
  45
  46 *** Allow output to ingress port
  47
  48 Sometimes when a packet ingresses into a router, it has to egress the
  49 same port.  One example is a "one-armed" router that has multiple
  50 routes on a single port (or in which a host is (mis)configured to send
  51 every IP packet to the router, e.g. due to a bad netmask).  Another is
  52 when a router needs to send an ICMP reply to an ingressing packet.
  53
  54 To some degree this problem is layered, because there are two
  55 different notions of "ingress port".  The first is the OpenFlow
  56 ingress port, essentially a physical port identifier.  This is
  57 implemented as part of ovs-vswitchd's OpenFlow implementation.  It
  58 prevents a reply from being sent across the tunnel on which it
  59 arrived.  It is questionable whether this OpenFlow feature is useful
  60 to OVN.  (OVN already has to override it to allow a packet from one
  61 nested container to be forwarded to a different nested container.)
  62 OVS makes it possible to disable this feature of OpenFlow by setting
  63 the OpenFlow input port field to 0.  (If one does this too early, of
  64 course, it means that there's no way to actually match on the input
  65 port in the OpenFlow flow tables, but one can work around that by
  66 instead setting the input port just before the output action, possibly
  67 wrapping these actions in push/pop pairs to preserve the input port
  68 for later.)
  69
  70 The second is the OVN logical ingress port, which is implemented in
  71 ovn-controller as part of the logical abstraction, using an OVS
  72 register.  Dropping packets directed to the logical ingress port is
  73 implemented through an OpenFlow table not directly visible to the
  74 logical flow table.  Currently this behavior can't be disabled, but
  75 there are various ways to make that possible, e.g. the same way as for
  76 OpenFlow, by allowing the logical inport to be zeroed, or by
  77 introducing a new action that ignores the inport.
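
The logical-inport variant of this idea might look like the following
Python sketch.  It assumes, purely for illustration, that logical ports
are small integers and that 0 can serve as a "cleared" inport; neither
the function names nor the numbering come from OVN itself.

```python
CLEARED = 0   # hypothetical "no inport" value, mirroring OpenFlow's port 0 trick


def should_drop(inport, outport):
    """Loopback check: drop a packet that would egress its logical ingress port.

    Clearing the inport disables the check.
    """
    return inport != CLEARED and inport == outport


def icmp_reply_ports(inport):
    """Ports for a reply that goes back out the ingress port: the inport is
    cleared first so that the loopback check lets the packet through."""
    return CLEARED, inport
```

A packet that arrived on port 5 and is output to port 5 is dropped,
while a reply built with icmp_reply_ports(5) passes the check.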
  78
  79 ** ovn-northd
  80
  81 *** What flows should it generate?
  82
  83 See description in ovn-northd(8).
  84
  85 ** New OVN logical actions
  86
  87 *** arp
  88
  89 Generates an ARP packet based on the current IPv4 packet and allows it
  90 to be processed as part of the current pipeline (and then pop back to
  91 processing the original IPv4 packet).
  92
  93 TCP/IP stacks typically limit the rate at which ARPs are sent, e.g. to
  94 one per second for a given target.  We might need to do this too.
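
One possible shape for such a limit is a per-target timestamp table.
The Python sketch below is only an assumption about how it could be
structured (the class name and the caller-supplied clock are invented
for the example); it does not say where in OVN this state would live.

```python
class ArpRateLimiter:
    """Allow at most one ARP per `interval` seconds for each target IP."""

    def __init__(self, interval=1.0):
        self.interval = interval
        self.last_sent = {}   # target IP -> time of the last ARP we allowed

    def allow(self, target_ip, now):
        last = self.last_sent.get(target_ip)
        if last is not None and now - last < self.interval:
            return False      # too soon: suppress this ARP
        self.last_sent[target_ip] = now
        return True
```

Passing the time in explicitly keeps the sketch easy to test; a real
implementation would presumably use a monotonic clock.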
  95
  96 We probably need to buffer the packet that generated the ARP.  I don't
  97 know where to do that.
  98
  99 *** icmp4 { action... }
 100
 101 Generates an ICMPv4 packet based on the current IPv4 packet and
 102 processes it according to each nested action (and then pops back to
 103 processing the original IPv4 packet).  The intended use case is for
 104 generating "time exceeded" and "destination unreachable" errors.
 105
 106 ovn-sb.xml includes a tentative specification for this action.
 107
 108 Tentatively, the icmp4 action sets a default icmp_type and icmp_code
 109 and lets the nested actions override it.  This means that we'd have to
 110 make icmp_type and icmp_code writable.  Because changing icmp_type and
 111 icmp_code can change the interpretation of the rest of the data in the
 112 ICMP packet, we would want to think this through carefully.  If it
 113 seems like a bad idea then we could instead make the type and code a
 114 parameter to the action: icmp4(type, code) { action... }
 115
 116 It is worth considering what should be considered the ingress port for
 117 the ICMPv4 packet.  It's quite likely that the ICMPv4 packet is going
 118 to go back out the ingress port.  Maybe the icmp4 action, therefore,
 119 should clear the inport, so that output to the original inport won't
 120 be discarded.
 121
 122 *** tcp_reset
 123
 124 Transforms the current TCP packet into a RST reply.
 125
 126 ovn-sb.xml includes a tentative specification for this action.
 127
 128 *** Other actions for IPv6.
 129
 130 IPv6 will probably need an action or actions for ND that is similar to
 131 the "arp" action, and an action for generating
 132
 133 *** ovn-controller translation to OpenFlow
 134
 135 The following two translation strategies come to mind.  Some of the
 136 new actions we might want to implement one way, some of them the
 137 other, depending on the details.
 138
 139 *** Implementation strategies
 140
 141 One way to do this is to define new actions as Open vSwitch extensions
 142 to OpenFlow, emit those actions in ovn-controller, and implement them
 143 in ovs-vswitchd (possibly pushing the implementations into the Linux
 144 and DPDK datapaths as well).  This is the only acceptable way for
 145 actions that need high performance.  None of these actions obviously
 146 need high performance, but it might be necessary to have fairness in
 147 handling e.g. a flood of incoming packets that require these actions.
 148 The main disadvantage of this approach is that it ties ovs-vswitchd
 149 (and the Linux kernel module) to supporting these actions essentially
 150 forever, which means that we'd want to make sure that they are
 151 general-purpose, well designed, maintainable, and supportable.
 152
 153 The other way to do this is to send the packets across an OpenFlow
 154 channel to ovn-controller and have ovn-controller process them.  This
 155 is acceptable for actions that don't need high performance, and it
 156 means that we don't add anything permanently to ovs-vswitchd or the
 157 kernel (so we can be more casual about the design).  The big
 158 disadvantage is that it becomes necessary to add a way to resume the
 159 OpenFlow pipeline when it is interrupted in the middle by sending a
 160 packet to the controller.  This is not as simple as doing a new flow
 161 table lookup and resuming from that point.  Instead, it is equivalent
 162 to the (very complicated) recirculation logic in ofproto-dpif-xlate.c.
 163 Much of this logic can be translated into OpenFlow actions (e.g. the
 164 call stack and data stack), but some of it is entirely outside
 165 OpenFlow (e.g. the state of mirrors).  To implement it properly, it
 166 seems that we'll have to introduce a new Open vSwitch extension to
 167 OpenFlow, a "send-to-controller" action that causes extra data to be
 168 sent to the controller, where the extra data packages up the state
 169 necessary to resume the pipeline.  Maybe the bits of the state that
 170 can be represented in OpenFlow can be embedded in this extra data in a
 171 controller-readable form, but other bits we might want to be opaque.
 172 It's also likely that we'll want to change and extend the form of this
 173 opaque data over time, so this should be allowed for, e.g. by
 174 including a nonce in the extra data that is newly generated every time
 175 ovs-vswitchd starts.
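
To make the nonce idea concrete, here is a small Python sketch.  The
Switch class, the JSON encoding, and the field names are all
assumptions for illustration; the real extra data would be an OpenFlow
extension, not JSON.  The point is only that a continuation produced by
one run of the switch is rejected by the next run.

```python
import json
import os


class Switch:
    """Stand-in for ovs-vswitchd; a new nonce is drawn on every start."""

    def __init__(self):
        self.nonce = os.urandom(8).hex()

    def freeze(self, visible, opaque):
        # 'visible' is the state a controller may read (e.g. stacks);
        # 'opaque' is state it should only hand back unchanged.
        return json.dumps({"nonce": self.nonce,
                           "visible": visible,
                           "opaque": opaque})

    def resume(self, blob):
        state = json.loads(blob)
        if state["nonce"] != self.nonce:
            raise ValueError("continuation from a previous run; discarding")
        return state["visible"], state["opaque"]
```

Because stale continuations are refused outright, the layout of the
opaque part can change between versions without compatibility code.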
 176
 177 *** OpenFlow action definitions
 178
 179 Define OpenFlow wire structures for each new OpenFlow action and
 180 implement them in lib/ofp-actions.[ch].
 181
 182 *** OVS implementation
 183
 184 Add code for action translation.  Possibly add datapath code for
 185 action implementation.  However, none of these new actions should
 186 require high-bandwidth processing so we could at least start with them
 187 implemented in userspace only.  (ARP field modification is already
 188 userspace-only and no one has complained yet.)
 189
 190 ** IPv6
 191
 192 *** ND versus ARP
 193
 194 *** IPv6 routing
 195
 196 *** ICMPv6
 197
 198 ** IP to MAC binding
 199
 200 Somehow it has to be possible for an L3 logical router to map from an
 201 IP address to an Ethernet address.  This can happen statically or
 202 dynamically.  Probably both cases need to be supported eventually.
 203
 204 *** Static IP to MAC binding
 205
 206 Commonly, for a VM, the binding of an IP address to a MAC is known
 207 statically.  The Logical_Port table in the OVN_Northbound schema can
 208 be revised to make these bindings known.  Then ovn-northd can
 209 integrate the bindings into the logical router flow table.
 210 (ovn-northd can also integrate them into the logical switch flow table
 211 to terminate ARP requests from VIFs.)
 212
 213 *** Dynamic IP to MAC bindings
 214
 215 Some bindings from IP address to MAC will undoubtedly need to be
 216 discovered dynamically through ARP requests.  It's straightforward
 217 enough for a logical L3 router to generate ARP requests and forward
 218 them to the appropriate switch.
 219
 220 It's more difficult to figure out where the reply should be processed
 221 and stored.  It might seem at first that a first-cut implementation
 222 could just keep track of the binding on the hypervisor that needs to
 223 know, but that can't happen easily because the VM that sends the reply
 224 might not be on the same HV as the VM that needs the answer (that is,
 225 the VM that sent the packet that needs the binding to be resolved) and
 226 there isn't an easy way for it to know which HV needs the answer.
 227
 228 Thus, the HV that processes the ARP reply (which is unknown when the
 229 ARP is sent) has to tell all the HVs the binding.  The most obvious
 230 place for this is the OVN_Southbound database.
 231
 232 Details need to be worked out, including:
 233
 234 **** OVN_Southbound schema changes.
 235
 236 Possibly bindings could be added to the Port_Binding table by adding
 237 or modifying columns.  Another possibility is that another table
 238 should be added.
 239
 240 **** Logical_Flow representation
 241
 242 It would be really nice to maintain the general-purpose nature of
 243 logical flows, but these bindings might have to include some
 244 hard-coded special cases, especially when it comes to the relationship
 245 with populating the bindings into the OVN_Southbound database.
 246
 247 **** Tracking queries
 248
 249 It's probably best to only record in the database responses to queries
 250 actually issued by an L3 logical router, so somehow they have to be
 251 tracked, probably by putting a tentative binding without a MAC address
 252 into the database.
 253
 254 **** Renewal and expiration.
 255
 256 Something needs to make sure that bindings remain valid and expire
 257 those that become stale.
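
The tracking and expiration points above can be sketched together.
The Python below is a guess at the bookkeeping only: the class, its
method names, and the 60-second lifetime are hypothetical, and an
in-memory dict stands in for whatever OVN_Southbound table ends up
holding the bindings.

```python
class BindingTable:
    """IP-to-MAC bindings with tentative entries and expiration."""

    def __init__(self, lifetime=60.0):
        self.lifetime = lifetime
        self.entries = {}   # ip -> (mac or None, time recorded)

    def query_sent(self, ip, now):
        # Record a tentative binding (no MAC yet) for a query we issued.
        self.entries.setdefault(ip, (None, now))

    def reply_seen(self, ip, mac, now):
        # Only accept replies to queries that a logical router issued.
        if ip not in self.entries:
            return False
        self.entries[ip] = (mac, now)
        return True

    def lookup(self, ip):
        mac, _ = self.entries.get(ip, (None, None))
        return mac

    def expire(self, now):
        # Remove bindings older than the lifetime and report which ones went.
        stale = [ip for ip, (_, t) in self.entries.items()
                 if now - t >= self.lifetime]
        for ip in stale:
            del self.entries[ip]
        return stale
```

An unsolicited reply is ignored because no tentative entry exists for
it, which is the property the "Tracking queries" item asks for.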
 258
 259 *** MTU handling (fragmentation on output)
 260
 261 ** Ratelimiting.
 262
 263 *** ARP.
 264
 265 *** ICMP error generation, TCP reset, UDP unreachable, protocol unreachable, ...
 266
 267 As a point of comparison, Linux doesn't ratelimit TCP resets but I
 268 think it does everything else.
 269
 270 * ovn-controller
 271
 272 ** ovn-controller parameters and configuration.
 273
 274 *** SSL configuration.
 275
 276     Can probably get this from Open_vSwitch database.
 277
 278 ** Security
 279
 280 *** Limiting the impact of a compromised chassis.
 281
 282     Every instance of ovn-controller has the same full access to the central
 283     OVN_Southbound database.  This means that a compromised chassis can
 284     interfere with the normal operation of the rest of the deployment.  Some
 285     specific examples include writing to the logical flow table to alter
 286     traffic handling or updating the port binding table to claim ports that are
 287     actually present on a different chassis.  In practice, the compromised host
 288     would be fighting against ovn-northd and other instances of ovn-controller
 289     that would be trying to restore the correct state.  The impact could include
 290     at least temporarily redirecting traffic (so the compromised host could
 291     receive traffic that it shouldn't) and potentially a more general denial of
 292     service.
 293
 294     There are different potential improvements to this area.  The first would be
 295     to add some sort of ACL scheme to ovsdb-server.  A proposal for this should
 296     first include an ACL scheme for ovn-controller.  An example policy would
 297     be to make Logical_Flow read-only.  Table-level control is needed, but is
 298     not enough.  For example, ovn-controller must be able to update the Chassis
 299     and Encap tables, but should only be able to modify the rows associated with
 300     that chassis and no others.
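
    A policy with both table-level and row-level rules might be sketched
    as below.  This is not an ovsdb-server feature; the POLICY mapping,
    the rule names, and may_write() are invented for illustration, and
    only the three tables named above are modeled.

```python
# Hypothetical policy: table -> rule.  "ro" = read-only,
# "own-rows" = a chassis may modify only rows it owns.
# Tables not listed here are simply not modeled by this sketch.
POLICY = {
    "Logical_Flow": "ro",
    "Chassis": "own-rows",
    "Encap": "own-rows",
}


def may_write(client_chassis, table, row_owner):
    """Return True if `client_chassis` may modify a row owned by `row_owner`."""
    rule = POLICY[table]
    if rule == "own-rows":
        return client_chassis == row_owner
    return False
```

    Here may_write("hv1", "Chassis", "hv1") is allowed, while writes to
    Logical_Flow or to another chassis's Encap row are refused.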
 301
 302     A more complex example is the Port_Binding table.  Currently, ovn-controller
 303     is the source of truth of where a port is located.  There seems to be no
 304     policy that can prevent malicious behavior of a compromised host with this
 305     table.
 306
 307     An alternative scheme for port bindings would be to provide an optional mode
 308     where an external entity controls port bindings and make them read-only to
 309     ovn-controller.  This is actually how OpenStack works today, for example.
 310     The part of OpenStack that manages VMs (Nova) tells the networking component
 311     (Neutron) where a port will be located, as opposed to the networking
 312     component discovering it.
 313
 314 * ovsdb-server
 315
 316   ovsdb-server should have adequate features for OVN but it probably
 317   needs work for scale and possibly for availability as deployments
 318   grow.  Here are some thoughts.
 319
 320   Andy Zhou is looking at these issues.
 321
 322 *** Reducing amount of data sent to clients.
 323
 324     Currently, whenever a row monitored by a client changes,
 325     ovsdb-server sends the client every monitored column in the row,
 326     even if only one column changes.  It might be valuable to reduce
 327     this to only the columns that change.
 328
 329     Also, whenever a column changes, ovsdb-server sends the entire
 330     contents of the column.  It might be valuable, for columns that
 331     are sets or maps, to send only added or removed values or
 332     key-value pairs.
 333
 334     Currently, clients monitor the entire contents of a table.  It
 335     might make sense to allow clients to monitor only rows that
 336     satisfy specific criteria, e.g. to allow an ovn-controller to
 337     receive only Logical_Flow rows for logical networks on its hypervisor.
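
    The three reductions above are easy to state in code.  The Python
    sketch below uses plain dicts and sets as stand-ins for rows and
    columns; the function names are made up for the example and say
    nothing about how the OVSDB protocol would encode the result.

```python
def changed_columns(old_row, new_row):
    """Return only the columns whose values differ between two row versions."""
    return {col: val for col, val in new_row.items()
            if old_row.get(col) != val}


def set_delta(old, new):
    """For a set-valued column, return (added, removed) instead of the whole set."""
    return new - old, old - new


def wanted_rows(rows, predicate):
    """Conditional monitoring: keep only the rows a client asked for."""
    return [r for r in rows if predicate(r)]
```

    For a row whose "up" flag and "macs" set both change, changed_columns()
    omits the untouched "name" column, and set_delta() on the two "macs"
    values yields just the one added and one removed address.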
 338
 339 *** Reducing redundant data and code within ovsdb-server.
 340
 341     Currently, ovsdb-server separately composes database update
 342     information to send to each of its clients.  This is fine for a
 343     small number of clients, but it wastes time and memory when
 344     hundreds of clients all want the same updates (as will be the
 345     case in OVN).
 346
 347     (This is somewhat opposed to the idea of letting a client monitor
 348     only some rows in a table, since that would increase the diversity
 349     among clients.)
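
    One way to picture the saving is to compose an update once per
    distinct monitor request rather than once per client.  The Python
    sketch below assumes, for illustration only, that a monitor request
    is just a set of column names and that updates are JSON strings;
    UpdateCache and its fields are hypothetical.

```python
import json


class UpdateCache:
    """Compose each update once per distinct monitor request, not per client."""

    def __init__(self):
        self.compose_count = 0

    def broadcast(self, clients, change):
        # clients: mapping of client name -> frozenset of monitored columns.
        composed = {}        # monitored columns -> serialized update
        out = {}
        for name, columns in clients.items():
            if columns not in composed:
                self.compose_count += 1
                visible = {c: v for c, v in change.items() if c in columns}
                composed[columns] = json.dumps(visible, sort_keys=True)
            out[name] = composed[columns]
        return out
```

    With a hundred clients sharing one monitor request and one client
    using another, only two updates are composed.  The more the requests
    differ from client to client, the less this helps.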
 350
 351 *** Multithreading.
 352
 353     If it turns out that other changes don't let ovsdb-server scale
 354     adequately, we can multithread ovsdb-server.  Initially one might
 355     only break protocol handling into separate threads, leaving the
 356     actual database work serialized through a lock.
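
    The first step described above, with per-client threads and one
    lock around the database, could look roughly like this Python
    sketch.  ovsdb-server is written in C, so this only illustrates the
    structure; Database, handle_client(), and serve() are invented names.

```python
import threading


class Database:
    """Database work is serialized through a single lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.rows = {}

    def transact(self, key, value):
        with self.lock:
            self.rows[key] = value
            return len(self.rows)


def handle_client(db, client_id, requests):
    # Protocol handling (here just unpacking a tuple) runs in parallel,
    # outside the lock; only the database update takes the lock.
    for key, value in requests:
        db.transact((client_id, key), value)


def serve(db, workloads):
    """Run one protocol-handling thread per client and wait for all of them."""
    threads = [threading.Thread(target=handle_client, args=(db, cid, reqs))
               for cid, reqs in workloads.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

    Eight clients issuing fifty writes each leave exactly 400 rows in the
    database, since every update goes through the same lock.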
 357
 358 ** Increasing availability.
 359
 360    Database availability might become an issue.  The OVN system
 361    shouldn't grind to a halt if the database becomes unavailable, but
 362    it would become impossible to bring VIFs up or down, etc.
 363
 364    My current thought on how to increase availability is to add
 365    clustering to ovsdb-server, probably via the Raft consensus
 366    algorithm.  As an experiment, I wrote an implementation of Raft
 367    for Open vSwitch that you can clone from:
 368
 369        https://github.com/blp/ovs-reviews.git raft
 370
 371 ** Reducing startup time.
 372
 373    As-is, if ovsdb-server restarts, every client will fetch a fresh
 374    copy of the part of the database that it cares about.  With
 375    hundreds of clients, this could cause heavy CPU load on
 376    ovsdb-server and use excessive network bandwidth.  It would be
 377    better to allow incremental updates even across connection loss.
 378    One way might be to use "Difference Digests" as described in
 379    Epstein et al., "What's the Difference? Efficient Set
 380    Reconciliation Without Prior Context".  (I'm not yet aware of
 381    previous non-academic use of this technique.)
 382
 383 ** Support multiple tunnel encapsulations in Chassis.
 384
 385    So far, both ovn-controller and ovn-controller-vtep allow a
 386    chassis to have only one tunnel encapsulation entry.  We should extend
 387    the implementation to support multiple tunnel encapsulations.
 388
 389 ** Update learned MAC addresses from VTEP to OVN
 390
 391    The VTEP gateway stores all MAC addresses learned from its
 392    physical interfaces in the 'Ucast_Macs_Local' and the
 393    'Mcast_Macs_Local' tables.  ovn-controller-vtep should be
 394    able to propagate that information back to the ovn-sb database,
 395    so that other chassis know where to send packets destined
 396    to the extended external network instead of broadcasting.
 397
 398 ** Translate ovn-sb Multicast_Group table into VTEP config
 399
 400    The ovn-controller-vtep daemon should be able to translate
 401    the Multicast_Group table entries in the ovn-sb database into
 402    Mcast_Macs_Remote table configuration in the VTEP database.
 403
 404 * Use BFD as tunnel monitor.
 405
 406    Both ovn-controller and ovn-controller-vtep should use BFD to
 407    monitor tunnel liveness.  Both the ovs-vswitchd schema and the
 408    VTEP schema support BFD.
 409
 410 * ACL
 411
 412 ** Support FTP ALGs.
 413
 414 ** Support reject action.
 415
 416 ** Support log option.