-*- outline -*-

* L3 support

** New OVN logical actions

*** arp

Generates an ARP packet based on the current IPv4 packet and allows it
to be processed as part of the current pipeline (and then pops back to
processing the original IPv4 packet).

TCP/IP stacks typically limit the rate at which ARPs are sent, e.g. to
one per second for a given target.  We might need to do this too.

We probably need to buffer the packet that generated the ARP.  I don't
know where to do that.

*** icmp4 { action... }

Generates an ICMPv4 packet based on the current IPv4 packet and
processes it according to each nested action (and then pops back to
processing the original IPv4 packet).  The intended use case is for
generating "time exceeded" and "destination unreachable" errors.

ovn-sb.xml includes a tentative specification for this action.

Tentatively, the icmp4 action sets a default icmp_type and icmp_code
and lets the nested actions override them.  This means that we'd have
to make icmp_type and icmp_code writable.  Because changing icmp_type
and icmp_code can change the interpretation of the rest of the data in
the ICMP packet, we would want to think this through carefully.  If it
seems like a bad idea, then we could instead make the type and code
parameters to the action: icmp4(type, code) { action... }

It is also worth considering what should be treated as the ingress
port for the ICMPv4 packet.  It's quite likely that the ICMPv4 packet
is going to go back out the ingress port.  Maybe the icmp4 action,
therefore, should clear the inport, so that output to the original
inport won't be discarded.

*** tcp_reset

Transforms the current TCP packet into an RST reply.

ovn-sb.xml includes a tentative specification for this action.

*** Other actions for IPv6.

IPv6 will probably need an action or actions for ND similar to the
"arp" action, and an action for generating ICMPv6 errors analogous to
"icmp4".

** IPv6

*** ND versus ARP

*** IPv6 routing

*** ICMPv6

** Dynamic IP to MAC bindings

Some bindings from IP address to MAC will undoubtedly need to be
discovered dynamically through ARP requests.  It's straightforward
enough for a logical L3 router to generate ARP requests and forward
them to the appropriate switch.

It's more difficult to figure out where the reply should be processed
and stored.  It might seem at first that a first-cut implementation
could just keep track of the binding on the hypervisor that needs to
know, but that can't happen easily, because the VM that sends the
reply might not be on the same HV as the VM that needs the answer
(that is, the VM that sent the packet that needs the binding to be
resolved), and there isn't an easy way for it to know which HV needs
the answer.

Thus, the HV that processes the ARP reply (which is unknown when the
ARP is sent) has to tell all the HVs the binding.  The most obvious
place for this is the OVN_Southbound database.  Details need to be
worked out, including:

*** OVN_Southbound schema changes.

Possibly bindings could be added to the Port_Binding table by adding
or modifying columns.  Another possibility is that another table
should be added (one possible shape for such a table is sketched
below).

*** Logical_Flow representation

It would be really nice to maintain the general-purpose nature of
logical flows, but these bindings might have to include some
hard-coded special cases, especially when it comes to the relationship
with populating the bindings into the OVN_Southbound table.
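To make the "OVN_Southbound schema changes" idea above a little more
concrete, here is a minimal sketch of a separate binding table, written
as the JSON fragment that would be added to the OVN_Southbound schema
(wrapped in a small Python snippet only so it can be printed and
checked).  The table name "MAC_Binding", the column names, and the
choice of a new table rather than new Port_Binding columns are all
illustrative assumptions, not a settled design:

    # Sketch only: a hypothetical table for dynamically learned
    # IP-to-MAC bindings.  Nothing below is part of the current
    # OVN_Southbound schema.
    import json

    mac_binding_table = {
        "MAC_Binding": {
            "columns": {
                "logical_port": {"type": "string"},
                "ip": {"type": "string"},
                "mac": {"type": "string"},
                "datapath": {"type": {"key": {"type": "uuid",
                                              "refTable": "Datapath_Binding"}}},
            },
            # At most one row per (logical_port, ip) pair; renewal
            # rewrites the "mac" column in place.
            "indexes": [["logical_port", "ip"]],
        },
    }
    print(json.dumps(mac_binding_table, indent=2))

Whichever chassis happens to process an ARP reply would insert or
update a row in such a table with an ordinary OVSDB transaction, and
every other chassis would learn the binding through the monitor it
already maintains on the OVN_Southbound database.  That also gives the
logical flows, however they end up being represented, a single place
to look bindings up.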
*** Tracking queries

It's probably best to only record in the database responses to queries
actually issued by an L3 logical router, so somehow they have to be
tracked, probably by putting a tentative binding without a MAC address
into the database.

*** Renewal and expiration.

Something needs to make sure that bindings remain valid and expire
those that become stale.

** MTU handling (fragmentation on output)

** Ratelimiting.

*** ARP.

*** ICMP error generation, TCP reset, UDP unreachable, protocol unreachable, ...

As a point of comparison, Linux doesn't ratelimit TCP resets but I
think it ratelimits everything else.

* ovn-controller

** ovn-controller parameters and configuration.

*** SSL configuration.

Can probably get this from the Open_vSwitch database.

** Security

*** Limiting the impact of a compromised chassis.

Every instance of ovn-controller has the same full access to the
central OVN_Southbound database.  This means that a compromised
chassis can interfere with the normal operation of the rest of the
deployment.  Some specific examples include writing to the logical
flow table to alter traffic handling or updating the port binding
table to claim ports that are actually present on a different chassis.
In practice, the compromised host would be fighting against ovn-northd
and other instances of ovn-controller that would be trying to restore
the correct state.  The impact could include at least temporarily
redirecting traffic (so the compromised host could receive traffic
that it shouldn't) and potentially a more general denial of service.

There are different potential improvements to this area.  The first
would be to add some sort of ACL scheme to ovsdb-server.  A proposal
for this should first include an ACL scheme for ovn-controller.  An
example policy would be to make Logical_Flow read-only.  Table-level
control is needed, but is not enough.  For example, ovn-controller
must be able to update the Chassis and Encap tables, but should only
be able to modify the rows associated with that chassis and no others.

A more complex example is the Port_Binding table.  Currently,
ovn-controller is the source of truth about where a port is located.
There seems to be no policy that can prevent malicious behavior by a
compromised host with this table.

An alternative scheme for port bindings would be to provide an
optional mode in which an external entity controls port bindings and
ovn-controller treats them as read-only.  This is actually how
OpenStack works today, for example.  The part of OpenStack that
manages VMs (Nova) tells the networking component (Neutron) where a
port will be located, as opposed to the networking component
discovering it.

* ovsdb-server

ovsdb-server should have adequate features for OVN, but it probably
needs work for scale and possibly for availability as deployments
grow.  Here are some thoughts.

Andy Zhou is looking at these issues.

*** Reducing amount of data sent to clients.

Currently, whenever a row monitored by a client changes, ovsdb-server
sends the client every monitored column in the row, even if only one
column changes.  It might be valuable to reduce this to only the
columns that change.

Also, whenever a column changes, ovsdb-server sends the entire
contents of the column.  It might be valuable, for columns that are
sets or maps, to send only the added or removed values or key-value
pairs.

Currently, clients monitor the entire contents of a table.  It might
make sense to allow clients to monitor only rows that satisfy specific
criteria, e.g. to allow an ovn-controller to receive only Logical_Flow
rows for logical networks on its hypervisor.
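For reference, this is roughly the shape of an RFC 7047 "monitor"
request as ovsdb-server accepts it today, expressed here as a Python
dict for readability, with a purely hypothetical "where" member added
to show how per-row filtering might look.  The "columns" and "select"
members exist in the current protocol; the "where" member and its
clause syntax are invented for illustration only:

    # Sketch of a JSON-RPC "monitor" call on the OVN_Southbound
    # database.  "columns" and "select" are standard (RFC 7047);
    # "where" is a hypothetical extension for row filtering.
    import json

    monitor_call = {
        "method": "monitor",
        "id": 1,
        "params": [
            "OVN_Southbound",
            None,  # opaque value echoed back in "update" notifications
            {
                "Logical_Flow": {
                    "columns": ["logical_datapath", "pipeline",
                                "table_id", "priority", "match",
                                "actions"],
                    "select": {"initial": True, "insert": True,
                               "delete": True, "modify": True},
                    # Hypothetical: only flows for datapaths that have
                    # at least one port bound on this chassis.
                    "where": [["logical_datapath", "in",
                               ["<datapath uuids>"]]],
                },
            },
        ],
    }
    print(json.dumps(monitor_call, indent=2))

Something along these lines would let each ovn-controller subscribe to
only the logical networks present on its hypervisor instead of the
whole table.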
*** Reducing redundant data and code within ovsdb-server.

Currently, ovsdb-server separately composes database update
information to send to each of its clients.  This is fine for a small
number of clients, but it wastes time and memory when hundreds of
clients all want the same updates (as will be the case in OVN).

(This is somewhat opposed to the idea of letting a client monitor only
some rows in a table, since that would increase the diversity among
clients.)

*** Multithreading.

If it turns out that other changes don't let ovsdb-server scale
adequately, we can multithread ovsdb-server.  Initially one might only
break protocol handling into separate threads, leaving the actual
database work serialized through a lock.

** Increasing availability.

Database availability might become an issue.  The OVN system shouldn't
grind to a halt if the database becomes unavailable, but it would
become impossible to bring VIFs up or down, etc.

My current thought on how to increase availability is to add
clustering to ovsdb-server, probably via the Raft consensus algorithm.
As an experiment, I wrote an implementation of Raft for Open vSwitch
that you can clone from:

    https://github.com/blp/ovs-reviews.git raft

** Reducing startup time.

As-is, if ovsdb-server restarts, every client will fetch a fresh copy
of the part of the database that it cares about.  With hundreds of
clients, this could cause heavy CPU load on ovsdb-server and use
excessive network bandwidth.  It would be better to allow incremental
updates even across connection loss.

One way might be to use "Difference Digests" as described in Eppstein
et al., "What's the Difference? Efficient Set Reconciliation Without
Prior Context".  (I'm not yet aware of previous non-academic use of
this technique.)

** Support multiple tunnel encapsulations in Chassis.

So far, both ovn-controller and ovn-controller-vtep only allow a
chassis to have one tunnel encapsulation entry.  We should extend the
implementation to support multiple tunnel encapsulations.

** Update learned MAC addresses from VTEP to OVN

The VTEP gateway stores all MAC addresses learned from its physical
interfaces in the 'Ucast_Macs_Local' and the 'Mcast_Macs_Local'
tables.  ovn-controller-vtep should be able to update that information
back to the ovn-sb database, so that other chassis know where to send
packets destined to the extended external network instead of
broadcasting.

** Translate ovn-sb Multicast_Group table into VTEP config

The ovn-controller-vtep daemon should be able to translate
Multicast_Group table entries in the ovn-sb database into
Mcast_Macs_Remote table configuration in the VTEP database.

* Consider the use of BFD as a tunnel monitor.

The use of BFD for hypervisor-to-hypervisor tunnels is probably not
worth it, since there's no alternative to switch to if a tunnel goes
down.  It could make sense at a slow rate if someone does OVN
monitoring system integration, but not otherwise.

When OVN gets to supporting HA for gateways (see ovn/OVN-GW-HA.md),
BFD is likely to be needed as part of that solution.

There's more commentary in this ML post:
http://openvswitch.org/pipermail/dev/2015-November/062385.html

* ACL

** Support FTP ALGs.

** Support reject action.

** Support log option.