ovn/TODO

   1 -*- outline -*-
   2
   3 * L3 support
   4
   5 ** OVN_Northbound schema
   6
   7 *** Needs to support interconnected routers
   8
   9 It should be possible to connect one router to another, e.g. to
  10 represent a provider/tenant router relationship.  This requires
  11 an OVN_Northbound schema change.
  12
  13 *** Needs to support extra routes
  14
  15 Currently a router port has a single route associated with it, but
  16 presumably we should support multiple routes.  For connections from
  17 one router to another, this doesn't seem to matter (just put more than
  18 one connection between them), but for connections between a router and
  19 a switch it might matter because a switch has only one router port.
  20
  21 *** Logical router port names in ACLs
  22
  23 Currently the ACL table documents that the logical router port is
  24 always named "ROUTER".  This can't work directly using logical patch
  25 ports to connect a logical switch to its logical router, because every
  26 row in the Logical_Port table must have a unique name.  This probably
  27 means that we should change the convention for the ACL table so that
  28 the logical router port name is unique; for example, we could change
  29 the Logical_Router_Port table to require the 'name' column to be
  30 unique, and then use that name in the ACL table.
  31
  32 Another alternative would be to add a way to have aliases for logical
  33 ports, but I'm not sure that's a rathole we really want to go down.
  34
  35 ** OVN_SB schema
  36
  37 *** Allow output to ingress port
  38
  39 Sometimes when a packet ingresses into a router, it has to egress the
  40 same port.  One example is a "one-armed" router that has multiple
  41 routes on a single port (or in which a host is (mis)configured to send
  42 every IP packet to the router, e.g. due to a bad netmask).  Another is
  43 when a router needs to send an ICMP reply to an ingressing packet.
  44
  45 To some degree this problem is layered, because there are two
  46 different notions of "ingress port".  The first is the OpenFlow
  47 ingress port, essentially a physical port identifier.  This is
  48 implemented as part of ovs-vswitchd's OpenFlow implementation.  It
  49 prevents a reply from being sent across the tunnel on which it
  50 arrived.  It is questionable whether this OpenFlow feature is useful
  51 to OVN.  (OVN already has to override it to allow a packet from one
  52 nested container to be forwarded to a different nested container.)
  53 OVS makes it possible to disable this feature of OpenFlow by setting
  54 the OpenFlow input port field to 0.  (If one does this too early, of
  55 course, it means that there's no way to actually match on the input
  56 port in the OpenFlow flow tables, but one can work around that by
  57 instead setting the input port just before the output action, possibly
  58 wrapping these actions in push/pop pairs to preserve the input port
  59 for later.)
  60
  61 The second is the OVN logical ingress port, which is implemented in
  62 ovn-controller as part of the logical abstraction, using an OVS
  63 register.  Dropping packets directed to the logical ingress port is
  64 implemented through an OpenFlow table not directly visible to the
  65 logical flow table.  Currently this behavior can't be disabled, but
  66 there are various ways to make that possible, e.g. the same way as
  67 for OpenFlow, by allowing the logical inport to be zeroed, or by
  68 introducing a new action that ignores the inport.
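
As a rough illustration of the logical-inport option described above, the
following Python sketch models the "drop output to the ingress port" check
and a way to bypass it by clearing the inport.  The names LogicalPacket,
output_allowed, and clear_inport are hypothetical and exist only for this
sketch; they are not OVN identifiers, and the real check lives in OpenFlow
tables generated by ovn-controller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogicalPacket:
    # Hypothetical model of a packet's logical metadata.
    inport: Optional[str]  # None stands for a "zeroed" logical inport
    payload: bytes = b""


def output_allowed(pkt: LogicalPacket, outport: str) -> bool:
    """Return False when the packet would egress the port it ingressed on."""
    return pkt.inport is None or pkt.inport != outport


def clear_inport(pkt: LogicalPacket) -> LogicalPacket:
    """Model of zeroing the logical inport so the check no longer applies."""
    return LogicalPacket(inport=None, payload=pkt.payload)
```

With this model, a packet that arrived on "lrp1" is dropped on output to
"lrp1" unless its inport is cleared first.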
  69
  70 ** ovn-northd
  71
  72 *** What flows should it generate?
  73
  74 See description in ovn-northd(8).
  75
  76 ** New OVN logical actions
  77
  78 *** arp
  79
  80 Generates an ARP packet based on the current IPv4 packet and allows it
  81 to be processed as part of the current pipeline (and then pop back to
  82 processing the original IPv4 packet).
  83
  84 TCP/IP stacks typically limit the rate at which ARPs are sent, e.g. to
  85 one per second for a given target.  We might need to do this too.
  86
  87 We probably need to buffer the packet that generated the ARP.  I don't
  88 know where to do that.
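
A minimal sketch of the per-target ARP rate limit mentioned above, assuming
a limit of one ARP per second per target IP.  The class name ArpRateLimiter
and its interface are hypothetical; the sketch does not address where the
triggering packet would be buffered.

```python
class ArpRateLimiter:
    """Allow at most one ARP per `min_interval` seconds for each target."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_sent = {}  # target IP -> time the last ARP was sent

    def allow(self, target_ip, now):
        """Return True if an ARP for target_ip may be sent at time `now`."""
        last = self.last_sent.get(target_ip)
        if last is not None and now - last < self.min_interval:
            return False
        self.last_sent[target_ip] = now
        return True
```

The caller passes in the current time (for example from time.monotonic()),
which keeps the limiter easy to test.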
  89
  90 *** icmp4 { action... }
  91
  92 Generates an ICMPv4 packet based on the current IPv4 packet and
  93 processes it according to each nested action (and then pops back to
  94 processing the original IPv4 packet).  The intended use case is for
  95 generating "time exceeded" and "destination unreachable" errors.
  96
  97 ovn-sb.xml includes a tentative specification for this action.
  98
  99 Tentatively, the icmp4 action sets a default icmp_type and icmp_code
 100 and lets the nested actions override it.  This means that we'd have to
 101 make icmp_type and icmp_code writable.  Because changing icmp_type and
 102 icmp_code can change the interpretation of the rest of the data in the
 103 ICMP packet, we would want to think this through carefully.  If it
 104 seems like a bad idea then we could instead make the type and code a
 105 parameter to the action: icmp4(type, code) { action... }
 106
 107 It is worth considering what should be treated as the ingress port for
 108 the ICMPv4 packet.  It's quite likely that the ICMPv4 packet is going
 109 to go back out the ingress port.  Maybe the icmp4 action, therefore,
 110 should clear the inport, so that output to the original inport won't
 111 be discarded.
 112
 113 *** tcp_reset
 114
 115 Transforms the current TCP packet into a RST reply.
 116
 117 ovn-sb.xml includes a tentative specification for this action.
 118
 119 *** Other actions for IPv6.
 120
 121 IPv6 will probably need an action or actions for ND that is similar to
 122 the "arp" action, and an action for generating
 123
 124 *** ovn-controller translation to OpenFlow
 125
 126 The following two translation strategies come to mind.  Some of the
 127 new actions we might want to implement one way, some of them the
 128 other, depending on the details.
 129
 130 *** Implementation strategies
 131
 132 One way to do this is to define new actions as Open vSwitch extensions
 133 to OpenFlow, emit those actions in ovn-controller, and implement them
 134 in ovs-vswitchd (possibly pushing the implementations into the Linux
 135 and DPDK datapaths as well).  This is the only acceptable way for
 136 actions that need high performance.  None of these actions obviously
 137 need high performance, but it might be necessary to have fairness in
 138 handling e.g. a flood of incoming packets that require these actions.
 139 The main disadvantage of this approach is that it ties ovs-vswitchd
 140 (and the Linux kernel module) to supporting these actions essentially
 141 forever, which means that we'd want to make sure that they are
 142 general-purpose, well designed, maintainable, and supportable.
 143
 144 The other way to do this is to send the packets across an OpenFlow
 145 channel to ovn-controller and have ovn-controller process them.  This
 146 is acceptable for actions that don't need high performance, and it
 147 means that we don't add anything permanently to ovs-vswitchd or the
 148 kernel (so we can be more casual about the design).  The big
 149 disadvantage is that it becomes necessary to add a way to resume the
 150 OpenFlow pipeline when it is interrupted in the middle by sending a
 151 packet to the controller.  This is not as simple as doing a new flow
 152 table lookup and resuming from that point.  Instead, it is equivalent
 153 to the (very complicated) recirculation logic in ofproto-dpif-xlate.c.
 154 Much of this logic can be translated into OpenFlow actions (e.g. the
 155 call stack and data stack), but some of it is entirely outside
 156 OpenFlow (e.g. the state of mirrors).  To implement it properly, it
 157 seems that we'll have to introduce a new Open vSwitch extension to
 158 OpenFlow, a "send-to-controller" action that causes extra data to be
 159 sent to the controller, where the extra data packages up the state
 160 necessary to resume the pipeline.  Maybe the bits of the state that
 161 can be represented in OpenFlow can be embedded in this extra data in a
 162 controller-readable form, but other bits we might want to be opaque.
 163 It's also likely that we'll want to change and extend the form of this
 164 opaque data over time, so this should be allowed for, e.g. by
 165 including a nonce in the extra data that is newly generated every time
 166 ovs-vswitchd starts.
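
To make the nonce idea concrete, here is a hedged Python sketch of how a
switch might package pipeline state for the controller and refuse to resume
from state that was produced by an earlier run.  The Switch class, the
dictionary layout, and the field names are all assumptions made for this
sketch; no such wire format exists in the source text.

```python
import os


class Switch:
    """Toy stand-in for ovs-vswitchd; the nonce changes on every start."""

    def __init__(self):
        self.nonce = os.urandom(16)

    def pause(self, table_id, stack):
        """Package the state needed to resume the pipeline later."""
        return {
            "nonce": self.nonce,
            # Controller-readable part (representable in OpenFlow terms).
            "table_id": table_id,
            "stack": list(stack),
            # Opaque part whose format may change between versions.
            "opaque": b"mirror-state",
        }

    def resume(self, continuation):
        """Return (table_id, stack), or None if the state is from another run."""
        if continuation["nonce"] != self.nonce:
            return None
        return continuation["table_id"], continuation["stack"]
```

A continuation handed back to a restarted switch fails the nonce check, so
a changed opaque format is never misinterpreted.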
 167
 168 *** OpenFlow action definitions
 169
 170 Define OpenFlow wire structures for each new OpenFlow action and
 171 implement them in lib/ofp-actions.[ch].
 172
 173 *** OVS implementation
 174
 175 Add code for action translation.  Possibly add datapath code for
 176 action implementation.  However, none of these new actions should
 177 require high-bandwidth processing so we could at least start with them
 178 implemented in userspace only.  (ARP field modification is already
 179 userspace-only and no one has complained yet.)
 180
 181 ** IPv6
 182
 183 *** ND versus ARP
 184
 185 *** IPv6 routing
 186
 187 *** ICMPv6
 188
 189 ** IP to MAC binding
 190
 191 Somehow it has to be possible for an L3 logical router to map from an
 192 IP address to an Ethernet address.  This can happen statically or
 193 dynamically.  Probably both cases need to be supported eventually.
 194
 195 *** Static IP to MAC binding
 196
 197 Commonly, for a VM, the binding of an IP address to a MAC is known
 198 statically.  The Logical_Port table in the OVN_Northbound schema can
 199 be revised to make these bindings known.  Then ovn-northd can
 200 integrate the bindings into the logical router flow table.
 201 (ovn-northd can also integrate them into the logical switch flow table
 202 to terminate ARP requests from VIFs.)
 203
 204 *** Dynamic IP to MAC bindings
 205
 206 Some bindings from IP address to MAC will undoubtedly need to be
 207 discovered dynamically through ARP requests.  It's straightforward
 208 enough for a logical L3 router to generate ARP requests and forward
 209 them to the appropriate switch.
 210
 211 It's more difficult to figure out where the reply should be processed
 212 and stored.  It might seem at first that a first-cut implementation
 213 could just keep track of the binding on the hypervisor that needs to
 214 know, but that can't happen easily because the VM that sends the reply
 215 might not be on the same HV as the VM that needs the answer (that is,
 216 the VM that sent the packet that needs the binding to be resolved) and
 217 there isn't an easy way for it to know which HV needs the answer.
 218
 219 Thus, the HV that processes the ARP reply (which is unknown when the
 220 ARP is sent) has to tell all the HVs the binding.  The most obvious
 221 place for this is the OVN_Southbound database.
 222
 223 Details need to be worked out, including:
 224
 225 **** OVN_Southbound schema changes.
 226
 227 Possibly bindings could be added to the Port_Binding table by adding
 228 or modifying columns.  Another possibility is that another table
 229 should be added.
 230
 231 **** Logical_Flow representation
 232
 233 It would be really nice to maintain the general-purpose nature of
 234 logical flows, but these bindings might have to include some
 235 hard-coded special cases, especially when it comes to the relationship
 236 with populating the bindings into the OVN_Southbound table.
 237
 238 **** Tracking queries
 239
 240 It's probably best to only record in the database responses to queries
 241 actually issued by an L3 logical router, so somehow they have to be
 242 tracked, probably by putting a tentative binding without a MAC address
 243 into the database.
 244
 245 **** Renewal and expiration.
 246
 247 Something needs to make sure that bindings remain valid and expire
 248 those that become stale.
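
The query tracking and expiration ideas above could look roughly like the
following Python sketch.  It keeps the bindings in an in-memory dictionary
rather than in the OVN_Southbound database, and the class name
MacBindingTable, its methods, and the 60-second default age are assumptions
made only for illustration.

```python
class MacBindingTable:
    """Track tentative and resolved IP-to-MAC bindings with expiration."""

    def __init__(self, max_age=60.0):
        self.max_age = max_age
        self.entries = {}  # (logical_port, ip) -> (mac or None, timestamp)

    def record_query(self, port, ip, now):
        """Add a tentative binding (no MAC yet) for a query we issued."""
        self.entries.setdefault((port, ip), (None, now))

    def record_reply(self, port, ip, mac, now):
        """Store a reply only if it answers a query that was recorded."""
        if (port, ip) not in self.entries:
            return False
        self.entries[(port, ip)] = (mac, now)
        return True

    def lookup(self, port, ip):
        """Return the MAC for (port, ip), or None if unknown or tentative."""
        mac, _ = self.entries.get((port, ip), (None, 0.0))
        return mac

    def expire(self, now):
        """Drop entries older than max_age; return how many were dropped."""
        stale = [k for k, (_, ts) in self.entries.items()
                 if now - ts > self.max_age]
        for k in stale:
            del self.entries[k]
        return len(stale)
```

Unsolicited replies are ignored because no tentative entry exists for them,
and a renewed reply refreshes the timestamp.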
 249
 250 *** MTU handling (fragmentation on output)
 251
 252 ** Ratelimiting.
 253
 254 *** ARP.
 255
 256 *** ICMP error generation, TCP reset, UDP unreachable, protocol unreachable, ...
 257
 258 As a point of comparison, Linux doesn't ratelimit TCP resets but I
 259 think it does everything else.
 260
 261 * ovn-controller
 262
 263 ** ovn-controller parameters and configuration.
 264
 265 *** SSL configuration.
 266
 267     Can probably get this from Open_vSwitch database.
 268
 269 ** Security
 270
 271 *** Limiting the impact of a compromised chassis.
 272
 273     Every instance of ovn-controller has the same full access to the central
 274     OVN_Southbound database.  This means that a compromised chassis can
 275     interfere with the normal operation of the rest of the deployment.  Some
 276     specific examples include writing to the logical flow table to alter
 277     traffic handling or updating the port binding table to claim ports that are
 278     actually present on a different chassis.  In practice, the compromised host
 279     would be fighting against ovn-northd and other instances of ovn-controller
 280     that would be trying to restore the correct state.  The impact could include
 281     at least temporarily redirecting traffic (so the compromised host could
 282     receive traffic that it shouldn't) and potentially a more general denial of
 283     service.
 284
 285     There are different potential improvements to this area.  The first would be
 286     to add some sort of ACL scheme to ovsdb-server.  A proposal for this should
 287     first include an ACL scheme for ovn-controller.  An example policy would
 288     be to make Logical_Flow read-only.  Table-level control is needed, but is
 289     not enough.  For example, ovn-controller must be able to update the Chassis
 290     and Encap tables, but should only be able to modify the rows associated with
 291     that chassis and no others.
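
    The table-level and row-level policy described above might be sketched
    as follows.  This is only an illustration of the idea, not an
    ovsdb-server feature: the function may_write and the notion of a
    per-row owner are assumptions of the sketch.

```python
# Tables that ovn-controller may read but never modify.
READ_ONLY_TABLES = {"Logical_Flow"}
# Tables where a chassis may modify only the rows it owns.
OWNED_TABLES = {"Chassis", "Encap"}


def may_write(client_chassis, table, row_owner):
    """Decide whether client_chassis may modify a row owned by row_owner."""
    if table in READ_ONLY_TABLES:
        return False
    if table in OWNED_TABLES:
        return row_owner == client_chassis
    # Tables without a policy are left unrestricted in this sketch.
    return True
```

    Port_Binding falls into the unrestricted case here, which is exactly
    the gap discussed next.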
 292
 293     A more complex example is the Port_Binding table.  Currently, ovn-controller
 294     is the source of truth for where a port is located.  There seems to be no
 295     policy that can prevent malicious behavior of a compromised host with this
 296     table.
 297
 298     An alternative scheme for port bindings would be to provide an optional mode
 299     where an external entity controls port bindings and makes them read-only to
 300     ovn-controller.  This is actually how OpenStack works today, for example.
 301     The part of OpenStack that manages VMs (Nova) tells the networking component
 302     (Neutron) where a port will be located, as opposed to the networking
 303     component discovering it.
 304
 305 * ovsdb-server
 306
 307   ovsdb-server should have adequate features for OVN but it probably
 308   needs work for scale and possibly for availability as deployments
 309   grow.  Here are some thoughts.
 310
 311   Andy Zhou is looking at these issues.
 312
 313 *** Reducing amount of data sent to clients.
 314
 315     Currently, whenever a row monitored by a client changes,
 316     ovsdb-server sends the client every monitored column in the row,
 317     even if only one column changes.  It might be valuable to reduce
 318     this to only the columns that change.
 319
 320     Also, whenever a column changes, ovsdb-server sends the entire
 321     contents of the column.  It might be valuable, for columns that
 322     are sets or maps, to send only added or removed values or
 323     key-value pairs.
 324
 325     Currently, clients monitor the entire contents of a table.  It
 326     might make sense to allow clients to monitor only rows that
 327     satisfy specific criteria, e.g. to allow an ovn-controller to
 328     receive only Logical_Flow rows for logical networks on its hypervisor.
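
    The first two ideas above (send only changed columns, and send only the
    added or removed members of set columns) can be sketched in Python as
    below.  The function row_delta and the "added"/"removed"/"value" layout
    are hypothetical and are not part of the OVSDB protocol; map columns
    could be handled analogously.

```python
def row_delta(old, new):
    """Describe how row `new` differs from row `old`, column by column."""
    delta = {}
    for col, new_val in new.items():
        old_val = old.get(col)
        if new_val == old_val:
            continue  # unchanged columns are not sent at all
        both_sets = (isinstance(new_val, (set, frozenset))
                     and isinstance(old_val, (set, frozenset)))
        if both_sets:
            # For set columns, send only the members that changed.
            delta[col] = {"added": new_val - old_val,
                          "removed": old_val - new_val}
        else:
            delta[col] = {"value": new_val}
    return delta
```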
 329
 330 *** Reducing redundant data and code within ovsdb-server.
 331
 332     Currently, ovsdb-server separately composes database update
 333     information to send to each of its clients.  This is fine for a
 334     small number of clients, but it wastes time and memory when
 335     hundreds of clients all want the same updates (as will be the
 336     case in OVN).
 337
 338     (This is somewhat opposed to the idea of letting a client monitor
 339     only some rows in a table, since that would increase the diversity
 340     among clients.)
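
    One way to avoid composing the same update repeatedly is to cache the
    composed message per distinct set of monitored columns, as in this
    Python sketch.  The class UpdateComposer and its JSON encoding are
    assumptions for illustration only and do not reflect ovsdb-server
    internals.

```python
import json


class UpdateComposer:
    """Compose one update per distinct monitor request, not per client."""

    def __init__(self):
        self.compose_count = 0  # how many times an update was built

    def compose_for_clients(self, change, clients):
        """Map each client ID to its update text.

        `change` maps column names to new values; `clients` maps a client
        ID to the list of columns that the client monitors.
        """
        cache = {}
        updates = {}
        for client_id, columns in clients.items():
            key = frozenset(columns)
            if key not in cache:
                self.compose_count += 1
                cache[key] = json.dumps(
                    {c: change[c] for c in sorted(key) if c in change})
            updates[client_id] = cache[key]
        return updates
```

    Clients that monitor the same columns share one composed message, so
    the cost grows with the number of distinct monitor requests rather than
    with the number of clients.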
 341
 342 *** Multithreading.
 343
 344     If it turns out that other changes don't let ovsdb-server scale
 345     adequately, we can multithread ovsdb-server.  Initially one might
 346     only break protocol handling into separate threads, leaving the
 347     actual database work serialized through a lock.
 348
 349 ** Increasing availability.
 350
 351    Database availability might become an issue.  The OVN system
 352    shouldn't grind to a halt if the database becomes unavailable, but
 353    it would become impossible to bring VIFs up or down, etc.
 354
 355    My current thought on how to increase availability is to add
 356    clustering to ovsdb-server, probably via the Raft consensus
 357    algorithm.  As an experiment, I wrote an implementation of Raft
 358    for Open vSwitch that you can clone from:
 359
 360        https://github.com/blp/ovs-reviews.git raft
 361
 362 ** Reducing startup time.
 363
 364    As-is, if ovsdb-server restarts, every client will fetch a fresh
 365    copy of the part of the database that it cares about.  With
 366    hundreds of clients, this could cause heavy CPU load on
 367    ovsdb-server and use excessive network bandwidth.  It would be
 368    better to allow incremental updates even across connection loss.
 369    One way might be to use "Difference Digests" as described in
 370    Epstein et al., "What's the Difference? Efficient Set
 371    Reconciliation Without Prior Context".  (I'm not yet aware of
 372    previous non-academic use of this technique.)
 373
 374 ** Support multiple tunnel encapsulations in Chassis.
 375
 376    So far, both ovn-controller and ovn-controller-vtep allow a
 377    chassis to have only one tunnel encapsulation entry.  We should extend
 378    the implementation to support multiple tunnel encapsulations.
 379
 380 ** Update learned MAC addresses from VTEP to OVN
 381
 382    The VTEP gateway stores all MAC addresses learned from its
 383    physical interfaces in the 'Ucast_Macs_Local' and the
 384    'Mcast_Macs_Local' tables.  ovn-controller-vtep should be
 385    able to propagate that information back to the ovn-sb database,
 386    so that other chassis know where to send packets destined
 387    to the extended external network instead of broadcasting.
 388
 389 ** Translate ovn-sb Multicast_Group table into VTEP config
 390
 391    The ovn-controller-vtep daemon should be able to translate
 392    the Multicast_Group table entry in the ovn-sb database into
 393    Mcast_Macs_Remote table configuration in the VTEP database.
 394
 395 * Use BFD as tunnel monitor.
 396
 397    Both ovn-controller and ovn-controller-vtep should use BFD to
 398    monitor tunnel liveness.  Both the ovs-vswitchd schema and the
 399    VTEP schema support BFD.
 400
 401 * ACL
 402
 403 ** Support FTP ALGs.
 404
 405 ** Support reject action.
 406
 407 ** Support log option.