ovn/TODO

-*- outline -*-

* L3 support

** New OVN logical actions

*** arp

Generates an ARP packet based on the current IPv4 packet and allows it
to be processed as part of the current pipeline (and then pops back to
processing the original IPv4 packet).

TCP/IP stacks typically limit the rate at which ARPs are sent, e.g. to
one per second for a given target.  We might need to do this too.
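
A minimal sketch of such a per-target limit, assuming a one-second
interval and an in-memory table keyed by target IP.  The class and
method names are hypothetical and do not correspond to existing OVN
code.

```python
import time


class ArpRateLimiter:
    """Allow at most one ARP request per target per interval (hypothetical helper)."""

    def __init__(self, interval=1.0, clock=time.monotonic):
        self.interval = interval      # seconds between ARPs for one target
        self.clock = clock            # injectable so the limiter can be tested
        self.last_sent = {}           # target IP -> time the last ARP was sent

    def allow(self, target_ip):
        """Return True if an ARP for target_ip may be sent now."""
        now = self.clock()
        last = self.last_sent.get(target_ip)
        if last is not None and now - last < self.interval:
            return False              # too soon: suppress this ARP
        self.last_sent[target_ip] = now
        return True
```

A real implementation would also have to age out entries in the table
so that it does not grow without bound.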

We probably need to buffer the packet that generated the ARP.  I don't
know where to do that.

*** icmp4 { action... }

Generates an ICMPv4 packet based on the current IPv4 packet and
processes it according to each nested action (and then pops back to
processing the original IPv4 packet).  The intended use case is for
generating "time exceeded" and "destination unreachable" errors.

ovn-sb.xml includes a tentative specification for this action.

Tentatively, the icmp4 action sets a default icmp_type and icmp_code
and lets the nested actions override it.  This means that we'd have to
make icmp_type and icmp_code writable.  Because changing icmp_type and
icmp_code can change the interpretation of the rest of the data in the
ICMP packet, we would want to think this through carefully.  If it
seems like a bad idea then we could instead make the type and code a
parameter to the action: icmp4(type, code) { action... }
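
To make the parameterized form concrete, here is a rough sketch of
what building the ICMP message might look like when type and code are
passed in.  It follows the RFC 792 error layout (8-byte ICMP header,
then the original IP header plus the first 8 bytes of its payload).
The function names are hypothetical, and this is not the specification
in ovn-sb.xml.

```python
import struct

ICMP4_DST_UNREACH = 3     # "destination unreachable" (RFC 792)
ICMP4_TIME_EXCEEDED = 11  # "time exceeded" (RFC 792)


def inet_checksum(data):
    """16-bit one's-complement Internet checksum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def icmp4_error(orig_ip_packet, icmp_type, icmp_code):
    """Build the ICMP message (without outer IP header) for an error."""
    ihl = (orig_ip_packet[0] & 0x0F) * 4      # IPv4 header length in bytes
    quoted = orig_ip_packet[:ihl + 8]         # header + first 8 payload bytes
    header = struct.pack("!BBHI", icmp_type, icmp_code, 0, 0)
    csum = inet_checksum(header + quoted)
    return struct.pack("!BBHI", icmp_type, icmp_code, csum, 0) + quoted
```

Because the type and code are fixed before the quoted data is laid
out, nothing later in the pipeline can change how the rest of the
message is interpreted, which is the concern with writable fields.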

It is worth deciding which port should be treated as the ingress port
for the ICMPv4 packet.  It's quite likely that the ICMPv4 packet is
going to go back out the ingress port.  Maybe the icmp4 action,
therefore, should clear the inport, so that output to the original
inport won't be discarded.

*** tcp_reset

Transforms the current TCP packet into a RST reply.
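
As an illustration only, the transformation could follow the reset
generation rules of RFC 793: if the offending segment carries an ACK,
the RST takes its sequence number from that ACK field; otherwise the
RST has sequence number zero and acknowledges the segment.  The
TcpSegment type and function below are hypothetical stand-ins, not
the ovn-sb.xml specification.

```python
from dataclasses import dataclass

TCP_FIN, TCP_SYN, TCP_RST, TCP_ACK = 0x01, 0x02, 0x04, 0x10


@dataclass
class TcpSegment:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    seq: int
    ack: int
    flags: int
    payload_len: int = 0


def tcp_reset(seg):
    """Return the RST segment that answers seg, per the RFC 793 reset rules."""
    if seg.flags & TCP_ACK:
        seq, ack, flags = seg.ack, 0, TCP_RST
    else:
        # SYN and FIN each occupy one sequence number.
        seg_len = (seg.payload_len
                   + bool(seg.flags & TCP_SYN)
                   + bool(seg.flags & TCP_FIN))
        seq, ack, flags = 0, (seg.seq + seg_len) & 0xFFFFFFFF, TCP_RST | TCP_ACK
    # Swap addresses and ports so the reply goes back to the sender.
    return TcpSegment(seg.dst_ip, seg.src_ip, seg.dst_port, seg.src_port,
                      seq, ack, flags)
```

A real implementation should not answer a segment that itself has RST
set.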

ovn-sb.xml includes a tentative specification for this action.

*** Other actions for IPv6.

IPv6 will probably need an action or actions for ND that is similar to
the "arp" action, and an action for generating

*** ovn-controller translation to OpenFlow

The following two translation strategies come to mind.  Some of the
new actions we might want to implement one way, some of them the
other, depending on the details.

*** Implementation strategies

One way to do this is to define new actions as Open vSwitch extensions
to OpenFlow, emit those actions in ovn-controller, and implement them
in ovs-vswitchd (possibly pushing the implementations into the Linux
and DPDK datapaths as well).  This is the only acceptable way for
actions that need high performance.  None of these actions obviously
need high performance, but it might be necessary to have fairness in
handling e.g. a flood of incoming packets that require these actions.
The main disadvantage of this approach is that it ties ovs-vswitchd
(and the Linux kernel module) to supporting these actions essentially
forever, which means that we'd want to make sure that they are
general-purpose, well designed, maintainable, and supportable.

The other way to do this is to send the packets across an OpenFlow
channel to ovn-controller and have ovn-controller process them.  This
is acceptable for actions that don't need high performance, and it
means that we don't add anything permanently to ovs-vswitchd or the
kernel (so we can be more casual about the design).  The big
disadvantage is that it becomes necessary to add a way to resume the
OpenFlow pipeline when it is interrupted in the middle by sending a
packet to the controller.  This is not as simple as doing a new flow
table lookup and resuming from that point.  Instead, it is equivalent
to the (very complicated) recirculation logic in ofproto-dpif-xlate.c.
Much of this logic can be translated into OpenFlow actions (e.g. the
call stack and data stack), but some of it is entirely outside
OpenFlow (e.g. the state of mirrors).  To implement it properly, it
seems that we'll have to introduce a new Open vSwitch extension to
OpenFlow, a "send-to-controller" action that causes extra data to be
sent to the controller, where the extra data packages up the state
necessary to resume the pipeline.  Maybe the bits of the state that
can be represented in OpenFlow can be embedded in this extra data in a
controller-readable form, but other bits we might want to be opaque.
It's also likely that we'll want to change and extend the form of this
opaque data over time, so this should be allowed for, e.g. by
including a nonce in the extra data that is newly generated every time
ovs-vswitchd starts.
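
The nonce idea can be sketched as follows.  This is only an
illustration under stated assumptions: the Continuation fields and the
Switch class are hypothetical, and the real extra data would carry far
more state than a table number and a stack.

```python
import os
from dataclasses import dataclass


@dataclass
class Continuation:
    nonce: bytes          # identifies the ovs-vswitchd run that paused the packet
    table_id: int         # controller-readable: where to resume
    stack: list           # controller-readable: data stack contents
    opaque: bytes = b""   # switch-private state (e.g. mirrors)


class Switch:
    """Stand-in for ovs-vswitchd; every name here is hypothetical."""

    def __init__(self):
        self.nonce = os.urandom(16)   # newly generated on every start

    def pause(self, table_id, stack, opaque=b""):
        """Package up the state sent to the controller with the packet."""
        return Continuation(self.nonce, table_id, list(stack), opaque)

    def resume(self, cont):
        """Accept only continuations produced by this running instance."""
        if cont.nonce != self.nonce:
            raise ValueError("stale continuation: switch restarted")
        return cont.table_id, cont.stack
```

Rejecting a mismatched nonce means a restarted ovs-vswitchd never has
to parse opaque data written by an older version of itself.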

*** OpenFlow action definitions

Define OpenFlow wire structures for each new OpenFlow action and
implement them in lib/ofp-actions.[ch].

*** OVS implementation

Add code for action translation.  Possibly add datapath code for
action implementation.  However, none of these new actions should
require high-bandwidth processing so we could at least start with them
implemented in userspace only.  (ARP field modification is already
userspace-only and no one has complained yet.)

** IPv6

*** ND versus ARP

*** IPv6 routing

*** ICMPv6

** Dynamic IP to MAC bindings

Some bindings from IP address to MAC will undoubtedly need to be
discovered dynamically through ARP requests.  It's straightforward
enough for a logical L3 router to generate ARP requests and forward
them to the appropriate switch.

It's more difficult to figure out where the reply should be processed
and stored.  It might seem at first that a first-cut implementation
could just keep track of the binding on the hypervisor that needs to
know, but that can't happen easily because the VM that sends the reply
might not be on the same HV as the VM that needs the answer (that is,
the VM that sent the packet that needs the binding to be resolved) and
there isn't an easy way for it to know which HV needs the answer.

Thus, the HV that processes the ARP reply (which is unknown when the
ARP is sent) has to tell all the HVs the binding.  The most obvious
place for this is in the OVN_Southbound database.

Details need to be worked out, including:

*** OVN_Southbound schema changes.

Possibly bindings could be added to the Port_Binding table by adding
or modifying columns.  Another possibility is that another table
should be added.

*** Logical_Flow representation

It would be really nice to maintain the general-purpose nature of
logical flows, but these bindings might have to include some
hard-coded special cases, especially when it comes to the relationship
with populating the bindings into the OVN_Southbound table.

*** Tracking queries

It's probably best to only record in the database responses to queries
actually issued by an L3 logical router, so somehow they have to be
tracked, probably by putting a tentative binding without a MAC address
into the database.
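
A small sketch of the tentative-binding idea, using a dictionary as a
stand-in for whatever table ends up holding the bindings.  The class,
its methods, and the (logical port, IP) key are assumptions made for
illustration.

```python
class BindingTable:
    """In-memory stand-in for the database table (hypothetical)."""

    def __init__(self):
        self.rows = {}   # (logical_port, ip) -> MAC string, or None while pending

    def query_sent(self, logical_port, ip):
        """Record a tentative binding when a logical router issues an ARP."""
        self.rows.setdefault((logical_port, ip), None)

    def reply_received(self, logical_port, ip, mac):
        """Store the MAC only if a matching query was recorded."""
        key = (logical_port, ip)
        if key not in self.rows:
            return False          # unsolicited reply: do not record it
        self.rows[key] = mac
        return True
```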

*** Renewal and expiration.

Something needs to make sure that bindings remain valid and expire
those that become stale.
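
One possible shape for the expiration half, assuming each binding
records when it was last confirmed and that a single maximum age
applies to all of them.  Both assumptions, and the function name, are
illustrative only.

```python
def expire_bindings(bindings, now, max_age):
    """Split bindings into (fresh, stale) by age.

    bindings maps IP -> (mac, last_confirmed_time).
    """
    fresh, stale = {}, {}
    for ip, (mac, confirmed) in bindings.items():
        target = stale if now - confirmed > max_age else fresh
        target[ip] = (mac, confirmed)
    return fresh, stale
```

Stale entries could be re-queried before being deleted, which would
cover renewal as well.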

** MTU handling (fragmentation on output)

** Ratelimiting.

*** ARP.

*** ICMP error generation, TCP reset, UDP unreachable, protocol unreachable, ...

As a point of comparison, Linux doesn't ratelimit TCP resets but I
think it does everything else.

* ovn-controller

** ovn-controller parameters and configuration.

*** SSL configuration.

    Can probably get this from Open_vSwitch database.

** Security

*** Limiting the impact of a compromised chassis.

    Every instance of ovn-controller has the same full access to the central
    OVN_Southbound database.  This means that a compromised chassis can
    interfere with the normal operation of the rest of the deployment.  Some
    specific examples include writing to the logical flow table to alter
    traffic handling or updating the port binding table to claim ports that are
    actually present on a different chassis.  In practice, the compromised host
    would be fighting against ovn-northd and other instances of ovn-controller
    that would be trying to restore the correct state.  The impact could include
    at least temporarily redirecting traffic (so the compromised host could
    receive traffic that it shouldn't) and potentially a more general denial of
    service.

    There are different potential improvements to this area.  The first would be
    to add some sort of ACL scheme to ovsdb-server.  A proposal for this should
    first include an ACL scheme for ovn-controller.  An example policy would
    be to make Logical_Flow read-only.  Table-level control is needed, but is
    not enough.  For example, ovn-controller must be able to update the Chassis
    and Encap tables, but should only be able to modify the rows associated with
    that chassis and no others.
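
    The example policy can be written down as a small check.  This is a
    sketch of the policy only, not of any ovsdb-server mechanism; the function
    name and the idea of passing in the row's owning chassis are assumptions.

```python
READ_ONLY_TABLES = {"Logical_Flow"}
OWN_ROWS_ONLY_TABLES = {"Chassis", "Encap"}


def may_write(client_chassis, table, row_chassis):
    """Decide whether client_chassis may modify a row owned by row_chassis."""
    if table in READ_ONLY_TABLES:
        return False                          # table-level control
    if table in OWN_ROWS_ONLY_TABLES:
        return row_chassis == client_chassis  # row-level control
    return True   # tables with no policy stay writable in this sketch
```

    Port_Binding is deliberately left without a rule here, because a
    per-row ownership test does not settle who may claim a port.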

    A more complex example is the Port_Binding table.  Currently, ovn-controller
    is the source of truth of where a port is located.  There seems to be no
    policy that can prevent malicious behavior of a compromised host with this
    table.

    An alternative scheme for port bindings would be to provide an optional mode
    where an external entity controls port bindings and make them read-only to
    ovn-controller.  This is actually how OpenStack works today, for example.
    The part of OpenStack that manages VMs (Nova) tells the networking component
    (Neutron) where a port will be located, as opposed to the networking
    component discovering it.

* ovsdb-server

  ovsdb-server should have adequate features for OVN but it probably
  needs work for scale and possibly for availability as deployments
  grow.  Here are some thoughts.

  Andy Zhou is looking at these issues.

*** Reducing amount of data sent to clients.

    Currently, whenever a row monitored by a client changes,
    ovsdb-server sends the client every monitored column in the row,
    even if only one column changes.  It might be valuable to reduce
    this only to the columns that change.

    Also, whenever a column changes, ovsdb-server sends the entire
    contents of the column.  It might be valuable, for columns that
    are sets or maps, to send only added or removed values or
    key-value pairs.
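
    A sketch of both reductions, with rows modeled as dictionaries and
    set-valued columns as Python sets.  The "added"/"removed" format is
    invented for illustration and is not an OVSDB wire format.

```python
def row_delta(old, new):
    """Return only what changed between two versions of a row.

    Scalar columns map to their new value; set columns map to
    {"added": ..., "removed": ...}.
    """
    delta = {}
    for column, new_value in new.items():
        old_value = old.get(column)
        if new_value == old_value:
            continue                      # unchanged column: send nothing
        if isinstance(new_value, (set, frozenset)) and \
                isinstance(old_value, (set, frozenset)):
            delta[column] = {"added": new_value - old_value,
                             "removed": old_value - new_value}
        else:
            delta[column] = new_value
    return delta
```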

    Currently, clients monitor the entire contents of a table.  It
    might make sense to allow clients to monitor only rows that
    satisfy specific criteria, e.g. to allow an ovn-controller to
    receive only Logical_Flow rows for logical networks on its hypervisor.

*** Reducing redundant data and code within ovsdb-server.

    Currently, ovsdb-server separately composes database update
    information to send to each of its clients.  This is fine for a
    small number of clients, but it wastes time and memory when
    hundreds of clients all want the same updates (as will be the
    case in OVN).

    (This is somewhat opposed to the idea of letting a client monitor
    only some rows in a table, since that would increase the diversity
    among clients.)
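
    One way to picture the saving: compose the update once per distinct
    monitor request rather than once per client.  The class below is a
    hypothetical sketch that keys the composed message on the set of
    monitored columns; it says nothing about how ovsdb-server is structured.

```python
import json


class UpdateCache:
    """Compose one change once per distinct set of monitored columns."""

    def __init__(self, change):
        self.change = change      # column -> new value for one changed row
        self.composed = 0         # how many times a message was serialized
        self._by_columns = {}

    def update_for(self, monitored_columns):
        """Return the serialized update for clients monitoring these columns."""
        key = frozenset(monitored_columns)
        if key not in self._by_columns:
            visible = {c: v for c, v in self.change.items() if c in key}
            self._by_columns[key] = json.dumps(visible, sort_keys=True)
            self.composed += 1
        return self._by_columns[key]
```

    With hundreds of identical monitors the message is serialized once;
    every additional distinct monitor adds one more composition, which is
    the tension noted in the parenthetical above.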

*** Multithreading.

    If it turns out that other changes don't let ovsdb-server scale
    adequately, we can multithread ovsdb-server.  Initially one might
    only break protocol handling into separate threads, leaving the
    actual database work serialized through a lock.

** Increasing availability.

   Database availability might become an issue.  The OVN system
   shouldn't grind to a halt if the database becomes unavailable, but
   it would become impossible to bring VIFs up or down, etc.

   My current thought on how to increase availability is to add
   clustering to ovsdb-server, probably via the Raft consensus
   algorithm.  As an experiment, I wrote an implementation of Raft
   for Open vSwitch that you can clone from:

       https://github.com/blp/ovs-reviews.git raft

** Reducing startup time.

   As-is, if ovsdb-server restarts, every client will fetch a fresh
   copy of the part of the database that it cares about.  With
   hundreds of clients, this could cause heavy CPU load on
   ovsdb-server and use excessive network bandwidth.  It would be
   better to allow incremental updates even across connection loss.
   One way might be to use "Difference Digests" as described in
   Epstein et al., "What's the Difference? Efficient Set
   Reconciliation Without Prior Context".  (I'm not yet aware of
   previous non-academic use of this technique.)

** Support multiple tunnel encapsulations in Chassis.

   So far, both ovn-controller and ovn-controller-vtep only allow
   chassis to have one tunnel encapsulation entry.  We should extend
   the implementation to support multiple tunnel encapsulations.

** Update learned MAC addresses from VTEP to OVN

   The VTEP gateway stores all MAC addresses learned from its
   physical interfaces in the 'Ucast_Macs_Local' and the
   'Mcast_Macs_Local' tables.  ovn-controller-vtep should be
   able to update that information back to ovn-sb database,
   so that other chassis know where to send packets destined
   to the extended external network instead of broadcasting.

** Translate ovn-sb Multicast_Group table into VTEP config

   The ovn-controller-vtep daemon should be able to translate
   the Multicast_Group table entry in ovn-sb database into
   Mcast_Macs_Remote table configuration in VTEP database.

* Consider the use of BFD as tunnel monitor.

  The use of BFD for hypervisor-to-hypervisor tunnels is probably not worth it,
  since there's no alternative to switch to if a tunnel goes down.  It could
  make sense at a slow rate if someone does OVN monitoring system integration,
  but not otherwise.

  When OVN gets to supporting HA for gateways (see ovn/OVN-GW-HA.md), BFD is
  likely needed as a part of that solution.

  There's more commentary in this ML post:
  http://openvswitch.org/pipermail/dev/2015-November/062385.html

* ACL

** Support FTP ALGs.

** Support reject action.

** Support log option.