ovn/TODO

   1 -*- outline -*-
   2
   3 * L3 support
   4
   5 ** OVN_Northbound schema
   6
   7 *** Needs to support extra routes
   8
   9 Currently a router port has a single route associated with it, but
  10 presumably we should support multiple routes.  For connections from
  11 one router to another, this doesn't seem to matter (just put more than
  12 one connection between them), but for connections between a router and
  13 a switch it might matter because a switch has only one router port.
  14
  15 ** OVN_SB schema
  16
  17 *** Allow output to ingress port
  18
  19 Sometimes when a packet ingresses into a router, it has to egress the
  20 same port.  One example is a "one-armed" router that has multiple
  21 routes on a single port (or in which a host is (mis)configured to send
  22 every IP packet to the router, e.g. due to a bad netmask).  Another is
  23 when a router needs to send an ICMP reply to an ingressing packet.
  24
  25 To some degree this problem is layered, because there are two
  26 different notions of "ingress port".  The first is the OpenFlow
  27 ingress port, essentially a physical port identifier.  This is
  28 implemented as part of ovs-vswitchd's OpenFlow implementation.  It
  29 prevents a reply from being sent across the tunnel on which it
  30 arrived.  It is questionable whether this OpenFlow feature is useful
  31 to OVN.  (OVN already has to override it to allow a packet from one
  32 nested container to be forwarded to a different nested container.)
  33 OVS makes it possible to disable this feature of OpenFlow by setting
  34 the OpenFlow input port field to 0.  (If one does this too early, of
  35 course, it means that there's no way to actually match on the input
  36 port in the OpenFlow flow tables, but one can work around that by
  37 instead setting the input port just before the output action, possibly
  38 wrapping these actions in push/pop pairs to preserve the input port
  39 for later.)
  40
  41 The second is the OVN logical ingress port, which is implemented in
  42 ovn-controller as part of the logical abstraction, using an OVS
  43 register.  Dropping packets directed to the logical ingress port is
  44 implemented through an OpenFlow table not directly visible to the
  45 logical flow table.  Currently this behavior can't be disabled, but
  46 there are various ways to make that possible, e.g. the same as for
  47 OpenFlow by allowing the logical inport to be zeroed, or by
  48 introducing a new action that ignores the inport.
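
A minimal sketch of the intended semantics, written as a Python model
rather than as real ovn-controller code.  The names LogicalPacket,
output, and clear_inport are hypothetical; the only point is that a
zeroed (here: None) inport stops the "drop if outport == inport" check
from firing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogicalPacket:
    inport: Optional[str]   # logical ingress port; None models a zeroed inport
    outport: str            # logical egress port chosen by the pipeline


def output(pkt: LogicalPacket) -> bool:
    """Return True if the packet is delivered, False if it is dropped."""
    # Current behavior: a packet directed to its own ingress port is dropped.
    if pkt.inport is not None and pkt.inport == pkt.outport:
        return False
    return True


def clear_inport(pkt: LogicalPacket) -> LogicalPacket:
    """Hypothetical action: forget the ingress port so hairpin output works."""
    return LogicalPacket(inport=None, outport=pkt.outport)
```

With this model, a one-armed router would apply clear_inport before
output when it routes a packet back out the port it arrived on.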
  49
  50 ** New OVN logical actions
  51
  52 *** arp
  53
  54 Generates an ARP packet based on the current IPv4 packet and allows it
  55 to be processed as part of the current pipeline (and then pop back to
  56 processing the original IPv4 packet).
  57
  58 TCP/IP stacks typically limit the rate at which ARPs are sent, e.g. to
  59 one per second for a given target.  We might need to do this too.
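
As a rough illustration only, per-target rate limiting could look like
the sketch below.  ArpRateLimiter and its parameters are hypothetical
names, the one-second default just mirrors the example above, and
nothing here says where in OVN the state would live.

```python
import time
from typing import Callable, Dict


class ArpRateLimiter:
    """Allow at most one ARP per target IP per min_interval seconds."""

    def __init__(self, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval = min_interval
        self.clock = clock                      # injectable for testing
        self.last_sent: Dict[str, float] = {}   # target IP -> last send time

    def allow(self, target_ip: str) -> bool:
        now = self.clock()
        last = self.last_sent.get(target_ip)
        if last is not None and now - last < self.min_interval:
            return False                        # too soon, suppress this ARP
        self.last_sent[target_ip] = now
        return True
```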
  60
  61 We probably need to buffer the packet that generated the ARP.  I don't
  62 know where to do that.
  63
  64 *** icmp4 { action... }
  65
  66 Generates an ICMPv4 packet based on the current IPv4 packet and
  67 processes it according to each nested action (and then pops back to
  68 processing the original IPv4 packet).  The intended use case is for
  69 generating "time exceeded" and "destination unreachable" errors.
  70
  71 ovn-sb.xml includes a tentative specification for this action.
  72
  73 Tentatively, the icmp4 action sets a default icmp_type and icmp_code
  74 and lets the nested actions override it.  This means that we'd have to
  75 make icmp_type and icmp_code writable.  Because changing icmp_type and
  76 icmp_code can change the interpretation of the rest of the data in the
  77 ICMP packet, we would want to think this through carefully.  If it
  78 seems like a bad idea then we could instead make the type and code a
  79 parameter to the action: icmp4(type, code) { action... }
  80
  81 It is worth considering what should be considered the ingress port for
  82 the ICMPv4 packet.  It's quite likely that the ICMPv4 packet is going
  83 to go back out the ingress port.  Maybe the icmp4 action, therefore,
  84 should clear the inport, so that output to the original inport won't
  85 be discarded.
  86
  87 *** tcp_reset
  88
  89 Transforms the current TCP packet into a RST reply.
  90
  91 ovn-sb.xml includes a tentative specification for this action.
  92
  93 *** Other actions for IPv6.
  94
  95 IPv6 will probably need an action or actions for ND that is similar to
  96 the "arp" action, and an action for generating
  97
  98 *** ovn-controller translation to OpenFlow
  99
 100 The following two translation strategies come to mind.  Some of the
 101 new actions we might want to implement one way, some of them the
 102 other, depending on the details.
 103
 104 *** Implementation strategies
 105
 106 One way to do this is to define new actions as Open vSwitch extensions
 107 to OpenFlow, emit those actions in ovn-controller, and implement them
 108 in ovs-vswitchd (possibly pushing the implementations into the Linux
 109 and DPDK datapaths as well).  This is the only acceptable way for
 110 actions that need high performance.  None of these actions obviously
 111 need high performance, but it might be necessary to have fairness in
 112 handling e.g. a flood of incoming packets that require these actions.
 113 The main disadvantage of this approach is that it ties ovs-vswitchd
 114 (and the Linux kernel module) to supporting these actions essentially
 115 forever, which means that we'd want to make sure that they are
 116 general-purpose, well designed, maintainable, and supportable.
 117
 118 The other way to do this is to send the packets across an OpenFlow
 119 channel to ovn-controller and have ovn-controller process them.  This
 120 is acceptable for actions that don't need high performance, and it
 121 means that we don't add anything permanently to ovs-vswitchd or the
 122 kernel (so we can be more casual about the design).  The big
 123 disadvantage is that it becomes necessary to add a way to resume the
 124 OpenFlow pipeline when it is interrupted in the middle by sending a
 125 packet to the controller.  This is not as simple as doing a new flow
 126 table lookup and resuming from that point.  Instead, it is equivalent
 127 to the (very complicated) recirculation logic in ofproto-dpif-xlate.c.
 128 Much of this logic can be translated into OpenFlow actions (e.g. the
 129 call stack and data stack), but some of it is entirely outside
 130 OpenFlow (e.g. the state of mirrors).  To implement it properly, it
 131 seems that we'll have to introduce a new Open vSwitch extension to
 132 OpenFlow, a "send-to-controller" action that causes extra data to be
 133 sent to the controller, where the extra data packages up the state
 134 necessary to resume the pipeline.  Maybe the bits of the state that
 135 can be represented in OpenFlow can be embedded in this extra data in a
 136 controller-readable form, but other bits we might want to be opaque.
 137 It's also likely that we'll want to change and extend the form of this
 138 opaque data over time, so this should be allowed for, e.g. by
 139 including a nonce in the extra data that is newly generated every time
 140 ovs-vswitchd starts.
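
To make the nonce idea concrete, here is a toy Python model.  It is
not an OpenFlow wire format: the Switch class, the pause and resume
methods, and the JSON layout are all assumptions made up for this
sketch.  It only shows the split between controller-readable and
opaque state, and how a per-start nonce lets the switch reject state
saved by an earlier run.

```python
import json
import os
from typing import Optional


class Switch:
    """Toy stand-in for ovs-vswitchd; all names here are hypothetical."""

    def __init__(self) -> None:
        # Regenerated on every start, so continuations from a previous
        # run (possibly using a different opaque format) are rejected.
        self.nonce = os.urandom(8).hex()

    def pause(self, next_table: int, stack: list, opaque: dict) -> str:
        """Package the state needed to resume the pipeline later."""
        return json.dumps({
            "nonce": self.nonce,
            "readable": {"next_table": next_table, "stack": stack},
            "opaque": opaque,   # e.g. mirror state; controller must not parse
        })

    def resume(self, continuation: str) -> Optional[dict]:
        """Return the saved state, or None if the continuation is stale."""
        state = json.loads(continuation)
        if state.get("nonce") != self.nonce:
            return None
        return state
```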
 141
 142 *** OpenFlow action definitions
 143
 144 Define OpenFlow wire structures for each new OpenFlow action and
 145 implement them in lib/ofp-actions.[ch].
 146
 147 *** OVS implementation
 148
 149 Add code for action translation.  Possibly add datapath code for
 150 action implementation.  However, none of these new actions should
 151 require high-bandwidth processing so we could at least start with them
 152 implemented in userspace only.  (ARP field modification is already
 153 userspace-only and no one has complained yet.)
 154
 155 ** IPv6
 156
 157 *** ND versus ARP
 158
 159 *** IPv6 routing
 160
 161 *** ICMPv6
 162
 163 ** Dynamic IP to MAC bindings
 164
 165 Some bindings from IP address to MAC will undoubtedly need to be
 166 discovered dynamically through ARP requests.  It's straightforward
 167 enough for a logical L3 router to generate ARP requests and forward
 168 them to the appropriate switch.
 169
 170 It's more difficult to figure out where the reply should be processed
 171 and stored.  It might seem at first that a first-cut implementation
 172 could just keep track of the binding on the hypervisor that needs to
 173 know, but that can't happen easily because the VM that sends the reply
 174 might not be on the same HV as the VM that needs the answer (that is,
 175 the VM that sent the packet that needs the binding to be resolved) and
 176 there isn't an easy way for it to know which HV needs the answer.
 177
 178 Thus, the HV that processes the ARP reply (which is unknown when the
 179 ARP is sent) has to tell all the HVs the binding.  The most obvious
 180 place for this is the OVN_Southbound database.
 181
 182 Details need to be worked out, including:
 183
 184 *** OVN_Southbound schema changes.
 185
 186 Possibly bindings could be added to the Port_Binding table by adding
 187 or modifying columns.  Another possibility is that another table
 188 should be added.
 189
 190 *** Logical_Flow representation
 191
 192 It would be really nice to maintain the general-purpose nature of
 193 logical flows, but these bindings might have to include some
 194 hard-coded special cases, especially when it comes to the relationship
 195 with populating the bindings into the OVN_Southbound table.
 196
 197 *** Tracking queries
 198
 199 It's probably best to only record in the database responses to queries
 200 actually issued by an L3 logical router, so somehow they have to be
 201 tracked, probably by putting a tentative binding without a MAC address
 202 into the database.
 203
 204 *** Renewal and expiration.
 205
 206 Something needs to make sure that bindings remain valid and expire
 207 those that become stale.
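
The tracking and expiration ideas above can be sketched together as a
small in-memory table.  This is a Python model under assumptions, not
a schema proposal: BindingTable, the method names, and the 300-second
default lifetime are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Binding:
    mac: Optional[str]      # None marks a tentative entry (query outstanding)
    expires_at: float


class BindingTable:
    def __init__(self, ttl: float = 300.0) -> None:
        self.ttl = ttl                          # assumed lifetime in seconds
        self.rows: Dict[str, Binding] = {}

    def query_sent(self, ip: str, now: float) -> None:
        # Record a tentative binding only if nothing is known yet.
        if ip not in self.rows:
            self.rows[ip] = Binding(mac=None, expires_at=now + self.ttl)

    def reply_seen(self, ip: str, mac: str, now: float) -> bool:
        # Ignore replies that no logical router asked for.
        if ip not in self.rows:
            return False
        self.rows[ip] = Binding(mac=mac, expires_at=now + self.ttl)
        return True

    def expire(self, now: float) -> None:
        # Drop tentative and resolved entries whose lifetime has run out.
        self.rows = {ip: b for ip, b in self.rows.items()
                     if b.expires_at > now}

    def lookup(self, ip: str) -> Optional[str]:
        b = self.rows.get(ip)
        return b.mac if b else None
```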
 208
 209 ** MTU handling (fragmentation on output)
 210
 211 ** Ratelimiting.
 212
 213 *** ARP.
 214
 215 *** ICMP error generation, TCP reset, UDP unreachable, protocol unreachable, ...
 216
 217 As a point of comparison, Linux doesn't ratelimit TCP resets but I
 218 think it does everything else.
 219
 220 * ovn-controller
 221
 222 ** ovn-controller parameters and configuration.
 223
 224 *** SSL configuration.
 225
 226     Can probably get this from Open_vSwitch database.
 227
 228 ** Security
 229
 230 *** Limiting the impact of a compromised chassis.
 231
 232     Every instance of ovn-controller has the same full access to the central
 233     OVN_Southbound database.  This means that a compromised chassis can
 234     interfere with the normal operation of the rest of the deployment.  Some
 235     specific examples include writing to the logical flow table to alter
 236     traffic handling or updating the port binding table to claim ports that are
 237     actually present on a different chassis.  In practice, the compromised host
 238     would be fighting against ovn-northd and other instances of ovn-controller
 239     that would be trying to restore the correct state.  The impact could include
 240     at least temporarily redirecting traffic (so the compromised host could
 241     receive traffic that it shouldn't) and potentially a more general denial of
 242     service.
 243
 244     There are different potential improvements to this area.  The first would be
 245     to add some sort of ACL scheme to ovsdb-server.  A proposal for this should
 246     first include an ACL scheme for ovn-controller.  An example policy would
 247     be to make Logical_Flow read-only.  Table-level control is needed, but is
 248     not enough.  For example, ovn-controller must be able to update the Chassis
 249     and Encap tables, but should only be able to modify the rows associated with
 250     that chassis and no others.
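
    A sketch of what such a policy check might look like, as a Python
    model.  ovsdb-server has no such interface in this document; Rule,
    POLICY, and may_write are hypothetical names, and row ownership is
    reduced to a single chassis name for simplicity.

```python
from typing import Dict, NamedTuple


class Rule(NamedTuple):
    writable: bool          # may an ovn-controller write to the table at all?
    own_rows_only: bool     # if so, only to rows owned by its own chassis?


# Example policy following the text: Logical_Flow is read-only, while
# Chassis and Encap are writable for the client's own rows only.
POLICY: Dict[str, Rule] = {
    "Logical_Flow": Rule(writable=False, own_rows_only=False),
    "Chassis": Rule(writable=True, own_rows_only=True),
    "Encap": Rule(writable=True, own_rows_only=True),
}


def may_write(client_chassis: str, table: str, row_owner: str) -> bool:
    rule = POLICY.get(table)
    if rule is None or not rule.writable:
        return False        # default deny for unlisted or read-only tables
    if rule.own_rows_only and row_owner != client_chassis:
        return False        # row-level check: someone else's row
    return True
```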
 251
 252     A more complex example is the Port_Binding table.  Currently, ovn-controller
 253     is the source of truth for where a port is located.  There seems to be no
 254     policy that can prevent malicious behavior of a compromised host with this
 255     table.
 256
 257     An alternative scheme for port bindings would be to provide an optional mode
 258     where an external entity controls port bindings and make them read-only to
 259     ovn-controller.  This is actually how OpenStack works today, for example.
 260     The part of OpenStack that manages VMs (Nova) tells the networking component
 261     (Neutron) where a port will be located, as opposed to the networking
 262     component discovering it.
 263
 264 * ovsdb-server
 265
 266   ovsdb-server should have adequate features for OVN but it probably
 267   needs work for scale and possibly for availability as deployments
 268   grow.  Here are some thoughts.
 269
 270   Andy Zhou is looking at these issues.
 271
 272 *** Reducing amount of data sent to clients.
 273
 274     Currently, whenever a row monitored by a client changes,
 275     ovsdb-server sends the client every monitored column in the row,
 276     even if only one column changes.  It might be valuable to reduce
 277     this to only the columns that change.
 278
 279     Also, whenever a column changes, ovsdb-server sends the entire
 280     contents of the column.  It might be valuable, for columns that
 281     are sets or maps, to send only added or removed values or
 282     key-value pairs.
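
    Both reductions amount to sending a delta instead of a snapshot.
    The Python sketch below shows the two computations on plain dicts
    and sets; changed_columns and set_delta are hypothetical helper
    names, and the output shape is not the OVSDB protocol format.

```python
from typing import Any, Dict


def changed_columns(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the columns whose value differs between old and new."""
    return {col: val for col, val in new.items()
            if col not in old or old[col] != val}


def set_delta(old: set, new: set) -> Dict[str, set]:
    """For a set-valued column, return only the added and removed members."""
    return {"added": new - old, "removed": old - new}
```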
 283
 284     Currently, clients monitor the entire contents of a table.  It
 285     might make sense to allow clients to monitor only rows that
 286     satisfy specific criteria, e.g. to allow an ovn-controller to
 287     receive only Logical_Flow rows for logical networks on its hypervisor.
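
    In its simplest form, such a criterion is a per-client row filter.
    A sketch in Python, assuming rows are dicts carrying a
    logical_datapath key and that the client supplies the set of
    datapaths it cares about (rows_for_client is a hypothetical name):

```python
from typing import Any, Dict, Iterable, List, Set


def rows_for_client(rows: Iterable[Dict[str, Any]],
                    wanted_datapaths: Set[str]) -> List[Dict[str, Any]]:
    """Keep only the rows whose logical datapath the client asked for."""
    return [r for r in rows if r["logical_datapath"] in wanted_datapaths]
```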
 288
 289 *** Reducing redundant data and code within ovsdb-server.
 290
 291     Currently, ovsdb-server separately composes database update
 292     information to send to each of its clients.  This is fine for a
 293     small number of clients, but it wastes time and memory when
 294     hundreds of clients all want the same updates (as will be the
 295     case in OVN).
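
    One possible shape for the fix is to compose each update once per
    distinct monitor request and reuse the serialized result.  The
    Python sketch below is only a model: UpdateComposer and its
    methods are hypothetical, and a monitor is reduced to the set of
    columns it watches.

```python
import json
from typing import Any, Dict, FrozenSet


class UpdateComposer:
    """Compose one serialized update per distinct set of monitored columns."""

    def __init__(self) -> None:
        self.compose_count = 0                       # how often we really composed
        self._cache: Dict[FrozenSet[str], str] = {}

    def begin_transaction(self) -> None:
        # Cached updates are only valid for the current database change.
        self._cache.clear()

    def update_for(self, monitored_columns: FrozenSet[str],
                   row_change: Dict[str, Any]) -> str:
        cached = self._cache.get(monitored_columns)
        if cached is None:
            self.compose_count += 1
            cached = json.dumps({c: row_change[c]
                                 for c in sorted(monitored_columns)
                                 if c in row_change})
            self._cache[monitored_columns] = cached
        return cached
```

    With hundreds of clients that monitor the same columns, the
    composing work is done once per change instead of once per client.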
 296
 297     (This is somewhat opposed to the idea of letting a client monitor
 298     only some rows in a table, since that would increase the diversity
 299     among clients.)
 300
 301 *** Multithreading.
 302
 303     If it turns out that other changes don't let ovsdb-server scale
 304     adequately, we can multithread ovsdb-server.  Initially one might
 305     only break protocol handling into separate threads, leaving the
 306     actual database work serialized through a lock.
 307
 308 ** Increasing availability.
 309
 310    Database availability might become an issue.  The OVN system
 311    shouldn't grind to a halt if the database becomes unavailable, but
 312    it would become impossible to bring VIFs up or down, etc.
 313
 314    My current thought on how to increase availability is to add
 315    clustering to ovsdb-server, probably via the Raft consensus
 316    algorithm.  As an experiment, I wrote an implementation of Raft
 317    for Open vSwitch that you can clone from:
 318
 319        https://github.com/blp/ovs-reviews.git raft
 320
 321 ** Reducing startup time.
 322
 323    As-is, if ovsdb-server restarts, every client will fetch a fresh
 324    copy of the part of the database that it cares about.  With
 325    hundreds of clients, this could cause heavy CPU load on
 326    ovsdb-server and use excessive network bandwidth.  It would be
 327    better to allow incremental updates even across connection loss.
 328    One way might be to use "Difference Digests" as described in
 329    Epstein et al., "What's the Difference? Efficient Set
 330    Reconciliation Without Prior Context".  (I'm not yet aware of
 331    previous non-academic use of this technique.)
 332
 333 ** Support multiple tunnel encapsulations in Chassis.
 334
 335    So far, both ovn-controller and ovn-controller-vtep only allow
 336    a chassis to have one tunnel encapsulation entry.  We should extend
 337    the implementation to support multiple tunnel encapsulations.
 338
 339 ** Update learned MAC addresses from VTEP to OVN
 340
 341    The VTEP gateway stores all MAC addresses learned from its
 342    physical interfaces in the 'Ucast_Macs_Local' and the
 343    'Mcast_Macs_Local' tables.  ovn-controller-vtep should be
 344    able to propagate that information back to the ovn-sb database,
 345    so that other chassis know where to send packets destined
 346    to the extended external network instead of broadcasting.
 347
 348 ** Translate ovn-sb Multicast_Group table into VTEP config
 349
 350    The ovn-controller-vtep daemon should be able to translate
 351    the Multicast_Group table entry in the ovn-sb database into
 352    Mcast_Macs_Remote table configuration in the VTEP database.
 353
 354 * Use BFD as tunnel monitor.
 355
 356    Both ovn-controller and ovn-controller-vtep should use BFD to
 357    monitor tunnel liveness.  Both the ovs-vswitchd schema and
 358    the VTEP schema support BFD.
 359
 360 * ACL
 361
 362 ** Support FTP ALGs.
 363
 364 ** Support reject action.
 365
 366 ** Support log option.