ovn/TODO

-*- outline -*-

* L3 support

** OVN_Northbound schema

*** Needs to support extra routes

Currently a router port has a single route associated with it, but
presumably we should support multiple routes.  For connections from
one router to another, this doesn't seem to matter (just put more than
one connection between them), but for connections between a router and
a switch it might matter because a switch has only one router port.

*** Logical router port names in ACLs

Currently the ACL table documents that the logical router port is
always named "ROUTER".  This can't work directly using logical patch
ports to connect a logical switch to its logical router, because every
row in the Logical_Port table must have a unique name.  This probably
means that we should change the convention for the ACL table so that
the logical router port name is unique; for example, we could change
the Logical_Router_Port table to require the 'name' column to be
unique, and then use that name in the ACL table.

Another alternative would be to add a way to have aliases for logical
ports, but I'm not sure that's a rathole we really want to go down.

** OVN_SB schema

*** Allow output to ingress port

Sometimes when a packet ingresses into a router, it has to egress the
same port.  One example is a "one-armed" router that has multiple
routes on a single port (or in which a host is (mis)configured to send
every IP packet to the router, e.g. due to a bad netmask).  Another is
when a router needs to send an ICMP reply to an ingressing packet.

To some degree this problem is layered, because there are two
different notions of "ingress port".  The first is the OpenFlow
ingress port, essentially a physical port identifier.  This is
implemented as part of ovs-vswitchd's OpenFlow implementation.  It
prevents a reply from being sent across the tunnel on which it
arrived.  It is questionable whether this OpenFlow feature is useful
to OVN.  (OVN already has to override it to allow a packet from one
nested container to be forwarded to a different nested container.)
OVS makes it possible to disable this feature of OpenFlow by setting
the OpenFlow input port field to 0.  (If one does this too early, of
course, it means that there's no way to actually match on the input
port in the OpenFlow flow tables, but one can work around that by
instead setting the input port just before the output action, possibly
wrapping these actions in push/pop pairs to preserve the input port
for later.)

The second is the OVN logical ingress port, which is implemented in
ovn-controller as part of the logical abstraction, using an OVS
register.  Dropping packets directed to the logical ingress port is
implemented through an OpenFlow table not directly visible to the
logical flow table.  Currently this behavior can't be disabled, but
there are various ways to make that possible, e.g. the same way as for
OpenFlow, by allowing the logical inport to be zeroed, or by
introducing a new action that ignores the inport.
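
To make the two options for the logical inport concrete, here is a toy
Python model of the drop check.  It is only a sketch: Packet,
allow_loopback, and should_deliver are made-up names, and the real
check lives in OpenFlow tables generated by ovn-controller, not in
code like this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    """Toy per-packet logical metadata; every name here is hypothetical."""
    inport: Optional[str]         # logical ingress port; None models a zeroed inport
    outport: str                  # logical egress port chosen by the pipeline
    allow_loopback: bool = False  # models a new action that ignores the inport


def should_deliver(pkt: Packet) -> bool:
    """Return False only when the packet would egress its own ingress port."""
    if pkt.allow_loopback or pkt.inport is None:
        return True
    return pkt.outport != pkt.inport


# A one-armed router sending a packet back out the port it arrived on.
dropped = should_deliver(Packet("lrp0", "lrp0"))
zeroed = should_deliver(Packet(None, "lrp0"))
flagged = should_deliver(Packet("lrp0", "lrp0", allow_loopback=True))
print(dropped, zeroed, flagged)  # prints: False True True
```

Zeroing the inport loses the ability to match on it later, which is
the same trade-off described above for the OpenFlow input port; a
separate flag avoids that at the cost of one more piece of state.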

** ovn-northd

*** What flows should it generate?

See description in ovn-northd(8).

** New OVN logical actions

*** arp

Generates an ARP packet based on the current IPv4 packet and allows it
to be processed as part of the current pipeline (and then pops back to
processing the original IPv4 packet).

TCP/IP stacks typically limit the rate at which ARPs are sent, e.g. to
one per second for a given target.  We might need to do this too.

We probably need to buffer the packet that generated the ARP.  I don't
know where to do that.
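
The per-target rate limit mentioned above could look roughly like the
following Python sketch.  ArpRateLimiter is a hypothetical helper, not
an existing OVS or OVN interface, and the one-second default just
mirrors the figure quoted above.

```python
import time
from typing import Callable, Dict


class ArpRateLimiter:
    """Allow at most one ARP request per target IP per interval (hypothetical)."""

    def __init__(self, interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval = interval
        self.clock = clock                      # injectable so tests can fake time
        self.last_sent: Dict[str, float] = {}   # target IP -> time of last ARP

    def allow(self, target_ip: str) -> bool:
        """Return True if an ARP for target_ip may be sent now, and record it."""
        now = self.clock()
        last = self.last_sent.get(target_ip)
        if last is not None and now - last < self.interval:
            return False
        self.last_sent[target_ip] = now
        return True
```

This sketch never forgets a target, so a real version would also need
to bound or age out the table.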

*** icmp4 { action... }

Generates an ICMPv4 packet based on the current IPv4 packet and
processes it according to each nested action (and then pops back to
processing the original IPv4 packet).  The intended use case is for
generating "time exceeded" and "destination unreachable" errors.

ovn-sb.xml includes a tentative specification for this action.

Tentatively, the icmp4 action sets a default icmp_type and icmp_code
and lets the nested actions override it.  This means that we'd have to
make icmp_type and icmp_code writable.  Because changing icmp_type and
icmp_code can change the interpretation of the rest of the data in the
ICMP packet, we would want to think this through carefully.  If it
seems like a bad idea then we could instead make the type and code a
parameter to the action: icmp4(type, code) { action... }

It is worth considering what should be considered the ingress port for
the ICMPv4 packet.  It's quite likely that the ICMPv4 packet is going
to go back out the ingress port.  Maybe the icmp4 action, therefore,
should clear the inport, so that output to the original inport won't
be discarded.
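
For reference, this is roughly what building the ICMPv4 error body
involves, following the RFC 792 layout for "time exceeded" and
"destination unreachable" (type, code, checksum, four unused bytes,
then the original IP header plus the first 8 bytes of its payload).
It is a Python sketch with made-up function names, and it takes type
and code as parameters in the style of icmp4(type, code); it says
nothing about how the action would really be implemented.

```python
import struct

ICMP_DEST_UNREACH = 3    # RFC 792 "destination unreachable"
ICMP_TIME_EXCEEDED = 11  # RFC 792 "time exceeded"


def inet_checksum(data: bytes) -> int:
    """Internet checksum: one's complement of the one's-complement 16-bit sum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_icmp4_error(orig_ip_packet: bytes, icmp_type: int, icmp_code: int) -> bytes:
    """Build the ICMP message (without an outer IP header) for an error reply."""
    ihl = (orig_ip_packet[0] & 0x0F) * 4   # original IP header length in bytes
    quoted = orig_ip_packet[:ihl + 8]      # IP header + first 8 payload bytes
    header = struct.pack("!BBHI", icmp_type, icmp_code, 0, 0)
    csum = inet_checksum(header + quoted)
    return struct.pack("!BBHI", icmp_type, icmp_code, csum, 0) + quoted
```

Because the checksum covers the type and code, any nested action that
rewrites icmp_type or icmp_code would also have to trigger a checksum
update, which is one more argument for passing them as parameters.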

*** tcp_reset

Transforms the current TCP packet into a RST reply.

ovn-sb.xml includes a tentative specification for this action.
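
The transformation itself is small.  The Python sketch below follows
the reset-generation rules in RFC 793 for a segment that arrives for a
nonexistent connection; TcpSegment is a simplified, hypothetical
representation, not the format the tentative specification in
ovn-sb.xml uses.

```python
from dataclasses import dataclass

TCP_FIN, TCP_SYN, TCP_RST, TCP_ACK = 0x01, 0x02, 0x04, 0x10


@dataclass
class TcpSegment:
    """Only the fields needed here (hypothetical, simplified representation)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    seq: int
    ack: int
    flags: int
    payload_len: int = 0


def tcp_reset(seg: TcpSegment) -> TcpSegment:
    """Return the RST that RFC 793 prescribes in reply to seg."""
    if seg.flags & TCP_ACK:
        # The incoming segment carries an ACK: the RST takes its sequence
        # number from that acknowledgment field.
        seq, ack, flags = seg.ack, 0, TCP_RST
    else:
        # Otherwise use sequence number 0 and acknowledge everything
        # received; SYN and FIN each occupy one sequence number.
        seg_len = (seg.payload_len + bool(seg.flags & TCP_SYN)
                   + bool(seg.flags & TCP_FIN))
        seq, ack, flags = 0, (seg.seq + seg_len) & 0xFFFFFFFF, TCP_RST | TCP_ACK
    # Swap addresses and ports so the reply goes back to the sender.
    return TcpSegment(seg.dst_ip, seg.src_ip, seg.dst_port, seg.src_port,
                      seq, ack, flags)
```

A caller should not apply this to a segment that already has RST set,
since a reset is never sent in response to a reset.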

*** Other actions for IPv6.

IPv6 will probably need an action or actions for ND that is similar to
the "arp" action, and an action for generating

*** ovn-controller translation to OpenFlow

The following two translation strategies come to mind.  Some of the
new actions we might want to implement one way, some of them the
other, depending on the details.

*** Implementation strategies

One way to do this is to define new actions as Open vSwitch extensions
to OpenFlow, emit those actions in ovn-controller, and implement them
in ovs-vswitchd (possibly pushing the implementations into the Linux
and DPDK datapaths as well).  This is the only acceptable way for
actions that need high performance.  None of these actions obviously
need high performance, but it might be necessary to have fairness in
handling e.g. a flood of incoming packets that require these actions.
The main disadvantage of this approach is that it ties ovs-vswitchd
(and the Linux kernel module) to supporting these actions essentially
forever, which means that we'd want to make sure that they are
general-purpose, well designed, maintainable, and supportable.

The other way to do this is to send the packets across an OpenFlow
channel to ovn-controller and have ovn-controller process them.  This
is acceptable for actions that don't need high performance, and it
means that we don't add anything permanently to ovs-vswitchd or the
kernel (so we can be more casual about the design).  The big
disadvantage is that it becomes necessary to add a way to resume the
OpenFlow pipeline when it is interrupted in the middle by sending a
packet to the controller.  This is not as simple as doing a new flow
table lookup and resuming from that point.  Instead, it is equivalent
to the (very complicated) recirculation logic in ofproto-dpif-xlate.c.
Much of this logic can be translated into OpenFlow actions (e.g. the
call stack and data stack), but some of it is entirely outside
OpenFlow (e.g. the state of mirrors).  To implement it properly, it
seems that we'll have to introduce a new Open vSwitch extension to
OpenFlow, a "send-to-controller" action that causes extra data to be
sent to the controller, where the extra data packages up the state
necessary to resume the pipeline.  Maybe the bits of the state that
can be represented in OpenFlow can be embedded in this extra data in a
controller-readable form, but other bits we might want to be opaque.
It's also likely that we'll want to change and extend the form of this
opaque data over time, so this should be allowed for, e.g. by
including a nonce in the extra data that is newly generated every time
ovs-vswitchd starts.
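
The nonce idea can be illustrated with a small Python sketch.
ContinuationCodec and its JSON encoding are invented for illustration;
the real extra data would be a wire format defined by the new OpenFlow
extension.  The point is only that state saved before a restart is
recognized and refused rather than misinterpreted.

```python
import json
import os
from typing import Optional


class ContinuationCodec:
    """Toy model of the "extra data" sent along with a packet to the controller."""

    NONCE_LEN = 16

    def __init__(self) -> None:
        # Regenerated on every (simulated) ovs-vswitchd start.
        self.nonce = os.urandom(self.NONCE_LEN)

    def freeze(self, state: dict) -> bytes:
        """Package pipeline state as bytes: the nonce followed by the encoded state."""
        return self.nonce + json.dumps(state, sort_keys=True).encode()

    def thaw(self, blob: bytes) -> Optional[dict]:
        """Return the saved state, or None if it was produced by another start-up."""
        if blob[:self.NONCE_LEN] != self.nonce:
            return None
        return json.loads(blob[self.NONCE_LEN:].decode())
```

The nonce here only detects stale data; it is not authentication, so
it does nothing to stop a controller from forging a continuation.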

*** OpenFlow action definitions

Define OpenFlow wire structures for each new OpenFlow action and
implement them in lib/ofp-actions.[ch].

*** OVS implementation

Add code for action translation.  Possibly add datapath code for
action implementation.  However, none of these new actions should
require high-bandwidth processing so we could at least start with them
implemented in userspace only.  (ARP field modification is already
userspace-only and no one has complained yet.)

** IPv6

*** ND versus ARP

*** IPv6 routing

*** ICMPv6

** IP to MAC binding

Somehow it has to be possible for an L3 logical router to map from an
IP address to an Ethernet address.  This can happen statically or
dynamically.  Probably both cases need to be supported eventually.

*** Static IP to MAC binding

Commonly, for a VM, the binding of an IP address to a MAC is known
statically.  The Logical_Port table in the OVN_Northbound schema can
be revised to make these bindings known.  Then ovn-northd can
integrate the bindings into the logical router flow table.
(ovn-northd can also integrate them into the logical switch flow table
to terminate ARP requests from VIFs.)
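
As a sketch of what "integrate the bindings into the logical router
flow table" might mean, the Python below turns a static IP-to-MAC map
into (match, actions) pairs.  The function name is hypothetical, and
matching directly on ip4.dst is a simplification: a real router would
presumably match on the next hop chosen by routing, in whatever table
layout ovn-northd ends up using.

```python
from typing import Dict, List, Tuple


def static_binding_flows(bindings: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn {IPv4 address: MAC} into (match, actions) pairs for a router table."""
    flows = []
    for ip, mac in sorted(bindings.items()):
        # Illustrative strings only; the final match and action text is
        # whatever the Logical_Flow design settles on.
        flows.append((f"ip4.dst == {ip}", f"eth.dst = {mac}; next;"))
    return flows


for match, actions in static_binding_flows({"10.0.0.2": "00:00:00:00:00:02"}):
    print(match, "=>", actions)
```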

*** Dynamic IP to MAC bindings

Some bindings from IP address to MAC will undoubtedly need to be
discovered dynamically through ARP requests.  It's straightforward
enough for a logical L3 router to generate ARP requests and forward
them to the appropriate switch.

It's more difficult to figure out where the reply should be processed
and stored.  It might seem at first that a first-cut implementation
could just keep track of the binding on the hypervisor that needs to
know, but that can't happen easily because the VM that sends the reply
might not be on the same HV as the VM that needs the answer (that is,
the VM that sent the packet that needs the binding to be resolved) and
there isn't an easy way for it to know which HV needs the answer.

Thus, the HV that processes the ARP reply (which is unknown when the
ARP is sent) has to tell all the HVs the binding.  The most obvious
place for this is the OVN_Southbound database.

Details need to be worked out, including:

**** OVN_Southbound schema changes.

Possibly bindings could be added to the Port_Binding table by adding
or modifying columns.  Another possibility is that another table
should be added.

**** Logical_Flow representation

It would be really nice to maintain the general-purpose nature of
logical flows, but these bindings might have to include some
hard-coded special cases, especially when it comes to the relationship
with populating the bindings into the OVN_Southbound table.

**** Tracking queries

It's probably best to only record in the database responses to queries
actually issued by an L3 logical router, so somehow they have to be
tracked, probably by putting a tentative binding without a MAC address
into the database.

**** Renewal and expiration.

Something needs to make sure that bindings remain valid and expire
those that become stale.
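
The tracking and expiration ideas above fit together as in this Python
sketch.  BindingTable is a hypothetical in-memory stand-in for
whatever rows end up in OVN_Southbound, and the 60-second lifetime is
an arbitrary placeholder.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Binding:
    mac: Optional[str]  # None marks a tentative binding: query sent, no reply yet
    expires: float      # time (per the table's clock) after which the row is stale


class BindingTable:
    """Hypothetical stand-in for binding rows kept in OVN_Southbound."""

    def __init__(self, ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self.rows: Dict[str, Binding] = {}

    def query_sent(self, ip: str) -> None:
        """Record that a logical router issued an ARP query for ip."""
        if ip not in self.rows:
            self.rows[ip] = Binding(None, self.clock() + self.ttl)

    def reply_received(self, ip: str, mac: str) -> bool:
        """Store a reply only if it answers a tracked query; refresh its lifetime."""
        row = self.rows.get(ip)
        if row is None:
            return False          # nobody asked, so don't record it
        row.mac = mac
        row.expires = self.clock() + self.ttl
        return True

    def lookup(self, ip: str) -> Optional[str]:
        row = self.rows.get(ip)
        return row.mac if row else None

    def expire(self) -> List[str]:
        """Drop stale rows (tentative or resolved) and return their IP addresses."""
        now = self.clock()
        stale = [ip for ip, row in self.rows.items() if row.expires <= now]
        for ip in stale:
            del self.rows[ip]
        return stale
```

Expiring tentative rows as well keeps unanswered queries from
accumulating; renewal would amount to re-querying shortly before a
resolved row's expiry.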

*** MTU handling (fragmentation on output)

** Ratelimiting.

*** ARP.

*** ICMP error generation, TCP reset, UDP unreachable, protocol unreachable, ...

As a point of comparison, Linux doesn't ratelimit TCP resets but I
think it does everything else.

* ovn-controller

** ovn-controller parameters and configuration.

*** SSL configuration.

    Can probably get this from Open_vSwitch database.

** Security

*** Limiting the impact of a compromised chassis.

    Every instance of ovn-controller has the same full access to the central
    OVN_Southbound database.  This means that a compromised chassis can
    interfere with the normal operation of the rest of the deployment.  Some
    specific examples include writing to the logical flow table to alter
    traffic handling or updating the port binding table to claim ports that are
    actually present on a different chassis.  In practice, the compromised host
    would be fighting against ovn-northd and other instances of ovn-controller
    that would be trying to restore the correct state.  The impact could include
    at least temporarily redirecting traffic (so the compromised host could
    receive traffic that it shouldn't) and potentially a more general denial of
    service.

    There are different potential improvements to this area.  The first would be
    to add some sort of ACL scheme to ovsdb-server.  A proposal for this should
    first include an ACL scheme for ovn-controller.  An example policy would
    be to make Logical_Flow read-only.  Table-level control is needed, but is
    not enough.  For example, ovn-controller must be able to update the Chassis
    and Encap tables, but should only be able to modify the rows associated with
    that chassis and no others.
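
    The combination of table-level and row-level control described above
    can be pictured with a few lines of Python.  This is only a sketch of
    the policy, with hypothetical names; ovsdb-server has no such hook
    today, and how a row is tied to its owning chassis is an assumption.

```python
from typing import Optional

READ_ONLY_TABLES = {"Logical_Flow"}      # table-level policy
OWNED_ROW_TABLES = {"Chassis", "Encap"}  # row-level policy: only the owner writes


def may_write(client_chassis: str, table: str, row_chassis: Optional[str]) -> bool:
    """Decide whether an ovn-controller on client_chassis may modify a row.

    row_chassis is the chassis the row is associated with, if any.
    """
    if table in READ_ONLY_TABLES:
        return False
    if table in OWNED_ROW_TABLES:
        return row_chassis == client_chassis
    # No policy is modeled for other tables here; in this sketch they
    # stay writable.
    return True
```

    A real scheme would also have to authenticate which chassis a
    connection belongs to before a check like this means anything.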

    A more complex example is the Port_Binding table.  Currently, ovn-controller
    is the source of truth of where a port is located.  There seems to be no
    policy that can prevent malicious behavior of a compromised host with this
    table.

    An alternative scheme for port bindings would be to provide an optional mode
    where an external entity controls port bindings and make them read-only to
    ovn-controller.  This is actually how OpenStack works today, for example.
    The part of OpenStack that manages VMs (Nova) tells the networking component
    (Neutron) where a port will be located, as opposed to the networking
    component discovering it.

* ovsdb-server

  ovsdb-server should have adequate features for OVN but it probably
  needs work for scale and possibly for availability as deployments
  grow.  Here are some thoughts.

  Andy Zhou is looking at these issues.

*** Reducing amount of data sent to clients.

    Currently, whenever a row monitored by a client changes,
    ovsdb-server sends the client every monitored column in the row,
    even if only one column changes.  It might be valuable to reduce
    this to only the columns that change.

    Also, whenever a column changes, ovsdb-server sends the entire
    contents of the column.  It might be valuable, for columns that
    are sets or maps, to send only added or removed values or
    key-value pairs.
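
    Both reductions are simple to state in code.  The Python sketch below
    uses plain dicts and sets as stand-ins for OVSDB rows and columns;
    the function names are made up and no wire format is implied.

```python
from typing import Any, Dict, Set, Tuple


def changed_columns(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the columns whose values differ between two versions of a row."""
    return {col: val for col, val in new.items()
            if col not in old or old[col] != val}


def set_delta(old: Set[Any], new: Set[Any]) -> Tuple[Set[Any], Set[Any]]:
    """Return (added, removed) for a set-valued column."""
    return new - old, old - new


def map_delta(old: Dict[Any, Any],
              new: Dict[Any, Any]) -> Tuple[Dict[Any, Any], Set[Any]]:
    """Return (added or changed key-value pairs, removed keys) for a map column."""
    upserts = {k: v for k, v in new.items() if k not in old or old[k] != v}
    return upserts, set(old) - set(new)
```

    For a large set column where one element changes, the delta is a
    couple of values instead of the whole column.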

    Currently, clients monitor the entire contents of a table.  It
    might make sense to allow clients to monitor only rows that
    satisfy specific criteria, e.g. to allow an ovn-controller to
    receive only Logical_Flow rows for logical networks on its hypervisor.

*** Reducing redundant data and code within ovsdb-server.

    Currently, ovsdb-server separately composes database update
    information to send to each of its clients.  This is fine for a
    small number of clients, but it wastes time and memory when
    hundreds of clients all want the same updates (as will be the
    case in OVN).

    (This is somewhat opposed to the idea of letting a client monitor
    only some rows in a table, since that would increase the diversity
    among clients.)
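
    One way to picture the saving is to compose each distinct update once
    and hand the same bytes to every client with the same monitor
    request.  The Python sketch below does that with a cache keyed by
    table and column set; UpdateComposer and its JSON output are
    illustrative only and ignore per-client details of the real protocol.

```python
import json
from typing import Any, Dict, FrozenSet, Tuple


class UpdateComposer:
    """Compose each distinct update once per change and share the bytes."""

    def __init__(self) -> None:
        self.cache: Dict[Tuple[str, FrozenSet[str]], bytes] = {}
        self.compositions = 0   # how many times an update was actually serialized

    def begin_change(self) -> None:
        """Forget cached updates; call once per database change."""
        self.cache.clear()

    def compose(self, table: str, columns: FrozenSet[str],
                row: Dict[str, Any]) -> bytes:
        key = (table, columns)
        if key not in self.cache:
            self.compositions += 1
            update = {table: {c: row[c] for c in sorted(columns) if c in row}}
            self.cache[key] = json.dumps(update).encode()
        return self.cache[key]


composer = UpdateComposer()
row = {"match": "ip4", "actions": "next;", "priority": 100}
spec = frozenset(["match", "actions"])
messages = [composer.compose("Logical_Flow", spec, row) for _client in range(300)]
print(composer.compositions)  # prints: 1
```

    The cache hit rate drops as clients' monitor requests diverge, which
    is the tension with per-client row filtering noted above.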

*** Multithreading.

    If it turns out that other changes don't let ovsdb-server scale
    adequately, we can multithread ovsdb-server.  Initially one might
    only break protocol handling into separate threads, leaving the
    actual database work serialized through a lock.

** Increasing availability.

   Database availability might become an issue.  The OVN system
   shouldn't grind to a halt if the database becomes unavailable, but
   it would become impossible to bring VIFs up or down, etc.

   My current thought on how to increase availability is to add
   clustering to ovsdb-server, probably via the Raft consensus
   algorithm.  As an experiment, I wrote an implementation of Raft
   for Open vSwitch that you can clone from:

       https://github.com/blp/ovs-reviews.git raft

** Reducing startup time.

   As-is, if ovsdb-server restarts, every client will fetch a fresh
   copy of the part of the database that it cares about.  With
   hundreds of clients, this could cause heavy CPU load on
   ovsdb-server and use excessive network bandwidth.  It would be
   better to allow incremental updates even across connection loss.
   One way might be to use "Difference Digests" as described in
   Epstein et al., "What's the Difference? Efficient Set
   Reconciliation Without Prior Context".  (I'm not yet aware of
   previous non-academic use of this technique.)

** Support multiple tunnel encapsulations in Chassis.

   So far, both ovn-controller and ovn-controller-vtep allow a
   chassis to have only one tunnel encapsulation entry.  We should
   extend the implementation to support multiple tunnel encapsulations.

** Update learned MAC addresses from VTEP to OVN

   The VTEP gateway stores all MAC addresses learned from its
   physical interfaces in the 'Ucast_Macs_Local' and the
   'Mcast_Macs_Local' tables.  ovn-controller-vtep should be
   able to propagate that information back to the ovn-sb database,
   so that other chassis know where to send packets destined
   to the extended external network instead of broadcasting.

** Translate ovn-sb Multicast_Group table into VTEP config

   The ovn-controller-vtep daemon should be able to translate
   the Multicast_Group table entries in the ovn-sb database into
   Mcast_Macs_Remote table configuration in the VTEP database.

* Use BFD as tunnel monitor.

   Both ovn-controller and ovn-controller-vtep should use BFD to
   monitor tunnel liveness.  Both the ovs-vswitchd schema and the
   VTEP schema support BFD.

* ACL

** Support FTP ALGs.

** Support reject action.

** Support log option.