ovn: Add initial design documentation.

author Ben Pfaff <blp@nicira.com>

Thu, 26 Feb 2015 22:49:26 +0000 (14:49 -0800)

committer Ben Pfaff <blp@nicira.com>

Thu, 26 Feb 2015 22:49:26 +0000 (14:49 -0800)
diff --git a/Makefile.am b/Makefile.am

index 0480d20..699a580 100644 (file)
--- a/Makefile.am
+++ b/Makefile.am
@@ -370,3 +370,4 @@ include tutorial/automake.mk
  include vtep/automake.mk
  include datapath-windows/automake.mk
  include datapath-windows/include/automake.mk
+include ovn/automake.mk
diff --git a/configure.ac b/configure.ac

index d2d02ca..795f876 100644 (file)
--- a/configure.ac
+++ b/configure.ac
@@ -1,4 +1,4 @@
-# Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013, 2014 Nicira, Inc.
+# Copyright (c) 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015 Nicira, Inc.
  #
  # Licensed under the Apache License, Version 2.0 (the "License");
  # you may not use this file except in compliance with the License.
@@ -182,6 +182,7 @@ dnl This makes sure that include/openflow gets created in the build directory.
  AC_CONFIG_COMMANDS([include/openflow/openflow.h.stamp])
  
  AC_CONFIG_COMMANDS([utilities/bugtool/dummy], [:])
+AC_CONFIG_COMMANDS([ovn/dummy], [:])
  
  m4_ifdef([AM_SILENT_RULES], [AM_SILENT_RULES])
  
diff --git a/ovn/TODO b/ovn/TODO

new file mode 100644 (file)

index 0000000..e405c7c
--- /dev/null
+++ b/ovn/TODO
@@ -0,0 +1,306 @@
+* Flow match expression handling library.
+
+  ovn-controller is the primary user of flow match expressions, but
+  the same syntax and I imagine the same code ought to be useful in
+  ovn-nbd for ACL match expressions.
+
+** Definition of data structures to represent a match expression as a
+   syntax tree.
+
+** Definition of data structures to represent variables (fields).
+
+   Fields need names and prerequisites.  Most fields are numeric and
+   thus need widths.  We also need a way to represent nominal
+   fields (currently just logical port names).  It might be
+   appropriate to associate fields directly with OXM/NXM code points;
+   we have to decide whether we want OVN to use the OVS flow structure
+   or work with OXM more directly.
+
+   Probably should be defined so that the data structure is also
+   useful for references to fields in action parsing.
+
+** Lexical analysis.
+
+   Probably should be defined so that the lexer can be reused for
+   parsing actions.
+
+** Parsing into syntax tree.
+
+** Semantic checking against variable definitions.
+
+** Applying prerequisites.
+
+** Simplification into conjunction-of-disjunctions (CoD) form.
+
+** Transformation from CoD form into OXM matches.
+
+* ovn-controller
+
+** Flow table handling in ovn-controller.
+
+   ovn-controller has to transform logical datapath flows from the
+   database into OpenFlow flows.
+
+*** Definition (or choice) of data structure for flows and flow table.
+
+    It would be natural enough to use "struct flow" and "struct
+    classifier" for this.  Maybe that is what we should do.  However,
+    "struct classifier" is optimized for searches based on packet
+    headers, whereas all we care about here can be implemented with a
+    hash table.  Also, we may want to make it easy to add and remove
+    support for fields without recompiling, which is not possible with
+    "struct flow" or "struct classifier".
+
+    On the other hand, we may find that it is difficult to decide that
+    two OXM flow matches are identical (to normalize them) without a
+    lot of domain-specific knowledge that is already embedded in struct
+    flow.  It's also going to be a pain to come up with a way to make
+    anything other than "struct flow" work with the ofputil_*()
+    functions for encoding and decoding OpenFlow.
+
+    It's also possible we could use struct flow without struct
+    classifier.
+
+*** Assembling conjunctive flows from flow match expressions.
+
+    This transformation explodes logical datapath flows into multiple
+    OpenFlow flow table entries, since a flow match expression in CoD
+    form requires several OpenFlow flow table entries.  It also
+    requires merging together OpenFlow flow table entries that contain
+    "conjunction" actions (really just concatenating their actions).
+
+*** Translating logical datapath port names into port numbers.
+
+    Logical ports are specified by name in logical datapath flows, but
+    OpenFlow only works in terms of numbers.
+
+*** Translating logical datapath actions into OpenFlow actions.
+
+    Some of the logical datapath actions do not have natural
+    representations as OpenFlow actions: they require
+    packet-in/packet-out round trips through ovn-controller.  The
+    trickiest part of that is going to be making sure that the
+    packet-out resumes the control flow that was broken off by the
+    packet-in.  That's tricky; we'll probably have to restrict control
+    flow or add OVS features to make resuming in general possible.  Not
+    sure which is better at this point.
+
+*** OpenFlow flow table synchronization.
+
+    The internal representation of the OpenFlow flow table has to be
+    synced across the controller connection to OVS.  This probably
+    boils down to the "flow monitoring" feature of OF1.4 which was then
+    made available as a "standard extension" to OF1.3.  (OVS hasn't
+    implemented this for OF1.4 yet, but the feature is based on an OVS
+    extension to OF1.0, so it should be straightforward to add it.)
+
+    We probably need some way to catch cases where OVS and OVN don't
+    see eye-to-eye on what exactly constitutes a flow, so that OVN
+    doesn't waste a lot of CPU time hammering at OVS trying to install
+    something that it's not going to do.
+
+*** Logical/physical translation.
+
+    When a packet comes into the integration bridge, the first stage of
+    processing needs to translate it from a physical to a logical
+    context.  When a packet leaves the integration bridge, the final
+    stage of processing needs to translate it back into a physical
+    context.  ovn-controller needs to populate the OpenFlow flow
+    tables to do these translations.
+
+*** Determine how to split logical pipeline across physical nodes.
+
+    From the original OVN architecture document:
+
+    The pipeline processing is split between the ingress and egress
+    transport nodes.  In particular, the logical egress processing may
+    occur at either hypervisor.  Processing the logical egress on the
+    ingress hypervisor requires more state about the egress vif's
+    policies, but reduces traffic on the wire that would eventually be
+    dropped.  Whereas, processing on the egress hypervisor can reduce
+    broadcast traffic on the wire by doing local replication.  We
+    initially plan to process logical egress on the egress hypervisor
+    so that less state needs to be replicated.  However, we may change
+    this behavior once we gain some experience writing the logical
+    flows.
+
+    The pipeline processing split will influence how tunnel keys
+    are encoded.
+
+** Interaction with Open_vSwitch and OVN databases:
+
+*** Monitor VIFs attached to the integration bridge in Open_vSwitch.
+
+    In response to changes, add or remove corresponding rows in
+    Bindings table in OVN.
+
+*** Populate Chassis row in OVN at startup.  Maintain Chassis row over time.
+
+    (Warn if any other Chassis claims the same IP address.)
+
+*** Remove Chassis and Bindings rows from OVN on exit.
+
+*** Monitor Chassis table in OVN.
+
+    Populate Port records for tunnels to other chassis into
+    Open_vSwitch database.  As a scale optimization later on, one can
+    populate only records for tunnels to other chassis that have
+    logical networks in common with this one.
+
+*** Monitor Pipeline table in OVN, trigger flow table recomputation on change.
+
+** ovn-controller parameters and configuration.
+
+*** Tunnel encapsulation to publish.
+
+    Default: VXLAN? Geneve?
+
+*** Location of Open_vSwitch database.
+
+    We can probably use the same default as ovs-vsctl.
+
+*** Location of OVN database.
+
+    Probably no useful default.
+
+*** SSL configuration.
+
+    Can probably get this from Open_vSwitch database.
+
+* ovn-nbd
+
+** Monitor OVN_Northbound database, trigger Pipeline recomputation on change.
+
+** Translate each OVN_Northbound entity into Pipeline logical datapath flows.
+
+   We have to first sit down and figure out what the general
+   translation of each entity is.  The original OVN architecture
+   description at
+   http://openvswitch.org/pipermail/dev/2015-January/050380.html had
+   some sketches of these, but they need to be completed and
+   elaborated.
+
+   Initially, the simplest way to do this is probably to write
+   straight C code to do a full translation of the entire
+   OVN_Northbound database into the format for the Pipeline table in
+   the OVN database.  As scale increases, this will probably be too
+   inefficient since a small change in OVN_Northbound requires a full
+   recomputation.  At that point, we probably want to adopt a more
+   systematic approach, such as something akin to the "nlog" system
+   used in NVP (see Koponen et al. "Network Virtualization in
+   Multi-tenant Datacenters", NSDI 2014).
+
+** Push logical datapath flows to Pipeline table.
+
+** Monitor OVN database Bindings table.
+
+   Sync rows in the OVN Bindings table to the "up" column in the
+   OVN_Northbound database.
+
+* ovsdb-server
+
+  ovsdb-server should have adequate features for OVN but it probably
+  needs work for scale and possibly for availability as deployments
+  grow.  Here are some thoughts.
+
+  Andy Zhou is looking at these issues.
+
+** Scaling number of connections.
+
+   In typical use today a given ovsdb-server has only a single-digit
+   number of simultaneous connections.  The OVN database will have a
+   connection from every hypervisor.  This use case needs testing and
+   probably coding work.  Here are some possible improvements.
+
+*** Reducing amount of data sent to clients.
+
+    Currently, whenever a row monitored by a client changes,
+    ovsdb-server sends the client every monitored column in the row,
+    even if only one column changes.  It might be valuable to reduce
+    this to only the columns that change.
+
+    Also, whenever a column changes, ovsdb-server sends the entire
+    contents of the column.  It might be valuable, for columns that
+    are sets or maps, to send only added or removed values or
+    key-value pairs.
+
+    Currently, clients monitor the entire contents of a table.  It
+    might make sense to allow clients to monitor only rows that
+    satisfy specific criteria, e.g. to allow an ovn-controller to
+    receive only Pipeline rows for logical networks on its hypervisor.
+
+*** Reducing redundant data and code within ovsdb-server.
+
+    Currently, ovsdb-server separately composes database update
+    information to send to each of its clients.  This is fine for a
+    small number of clients, but it wastes time and memory when
+    hundreds of clients all want the same updates (as will be the
+    case in OVN).
+
+    (This is somewhat opposed to the idea of letting a client monitor
+    only some rows in a table, since that would increase the diversity
+    among clients.)
+
+*** Multithreading.
+
+    If it turns out that other changes don't let ovsdb-server scale
+    adequately, we can multithread ovsdb-server.  Initially one might
+    only break protocol handling into separate threads, leaving the
+    actual database work serialized through a lock.
+
+** Increasing availability.
+
+   Database availability might become an issue.  The OVN system
+   shouldn't grind to a halt if the database becomes unavailable, but
+   it would become impossible to bring VIFs up or down, etc.
+
+   My current thought on how to increase availability is to add
+   clustering to ovsdb-server, probably via the Raft consensus
+   algorithm.  As an experiment, I wrote an implementation of Raft
+   for Open vSwitch that you can clone from:
+
+       https://github.com/blp/ovs-reviews.git raft
+
+** Reducing startup time.
+
+   As-is, if ovsdb-server restarts, every client will fetch a fresh
+   copy of the part of the database that it cares about.  With
+   hundreds of clients, this could cause heavy CPU load on
+   ovsdb-server and use excessive network bandwidth.  It would be
+   better to allow incremental updates even across connection loss.
+   One way might be to use "Difference Digests" as described in
+   Epstein et al., "What's the Difference? Efficient Set
+   Reconciliation Without Prior Context".  (I'm not yet aware of
+   previous non-academic use of this technique.)
+
+* Miscellaneous:
+
+** Write ovn-nbctl utility.
+
+   The idea here is that we need a utility to act on the OVN_Northbound
+   database in a way similar to a CMS, so that we can do some testing
+   without an actual CMS in the picture.
+
+   No details yet.
+
+** Init scripts for ovn-controller (on HVs), ovn-nbd, OVN DB server.
+
+** Distribution packaging.
+
+* Not yet scoped:
+
+** Neutron plugin.
+
+*** Create stackforge/networking-ovn repository based on OpenStack's
+cookiecutter git repo generator
+
+*** Document mappings between Neutron data model and the OVN northbound DB
+
+*** Create a Neutron ML2 mechanism driver that implements the mappings
+on Neutron resource requests
+
+*** Add synchronization for when we need to sanity check that the OVN
+northbound DB reflects the current state of the world as intended by
+Neutron (needed for various failure scenarios)
+
+** Gateways.
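
The "Simplification into conjunction-of-disjunctions (CoD) form" item in
ovn/TODO above can be illustrated with a minimal Python sketch.  It assumes a
hypothetical syntax tree made of nested tuples: ("eq", field, value) leaves
plus ("and", ...) and ("or", ...) nodes.  The tuple layout, the to_cod() name,
and the field names are assumptions made for illustration only; the TODO
leaves the real data structures undecided.

```python
from itertools import product


def to_cod(expr):
    """Rewrite a match expression as a conjunction of disjunctions.

    Returns a list of clauses.  Each clause is a frozenset of leaves that
    are ORed together; the clauses themselves are ANDed together.
    The tuple-based tree format is hypothetical, not OVN's.
    """
    op = expr[0]
    if op == "eq":
        # A single comparison is a one-leaf clause.
        return [frozenset([expr])]
    if op == "and":
        # AND of sub-expressions: concatenate their clauses, dropping repeats.
        clauses = []
        for sub in expr[1:]:
            for clause in to_cod(sub):
                if clause not in clauses:
                    clauses.append(clause)
        return clauses
    if op == "or":
        # OR distributes over AND: take one clause from every operand and
        # merge them into a single, larger disjunction.
        operand_clauses = [to_cod(sub) for sub in expr[1:]]
        clauses = []
        for combo in product(*operand_clauses):
            merged = frozenset().union(*combo)
            if merged not in clauses:
                clauses.append(merged)
        return clauses
    raise ValueError("unknown operator: %r" % (op,))


# Hypothetical field names, used only to exercise the sketch.
A = ("eq", "ip.src", "10.0.0.1")
B = ("eq", "ip.proto", 6)
C = ("eq", "tcp.dst", 80)

# (A && B) || C  becomes  (A || C) && (B || C)
cod = to_cod(("or", ("and", A, B), C))
```

Here cod holds two clauses, {A, C} and {B, C}.  A later step, not shown, would
turn each clause into the OpenFlow flow table entries that the "Assembling
conjunctive flows" item describes.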
diff --git a/ovn/automake.mk b/ovn/automake.mk

new file mode 100644 (file)

index 0000000..a4951dc
--- /dev/null
+++ b/ovn/automake.mk
@@ -0,0 +1,77 @@
+# OVN schema and IDL
+EXTRA_DIST += ovn/ovn.ovsschema
+pkgdata_DATA += ovn/ovn.ovsschema
+
+# OVN E-R diagram
+#
+# If "python" or "dot" is not available, then we do not add a graphical
+# diagram to the documentation.
+if HAVE_PYTHON
+if HAVE_DOT
+ovn/ovn.gv: ovsdb/ovsdb-dot.in ovn/ovn.ovsschema
+       $(AM_V_GEN)$(OVSDB_DOT) --no-arrows $(srcdir)/ovn/ovn.ovsschema > $@
+ovn/ovn.pic: ovn/ovn.gv ovsdb/dot2pic
+       $(AM_V_GEN)(dot -T plain < ovn/ovn.gv | $(PERL) $(srcdir)/ovsdb/dot2pic -f 3) > $@.tmp && \
+       mv $@.tmp $@
+OVN_PIC = ovn/ovn.pic
+OVN_DOT_DIAGRAM_ARG = --er-diagram=$(OVN_PIC)
+DISTCLEANFILES += ovn/ovn.gv ovn/ovn.pic
+endif
+endif
+
+# OVN schema documentation
+EXTRA_DIST += ovn/ovn.xml
+DISTCLEANFILES += ovn/ovn.5
+man_MANS += ovn/ovn.5
+ovn/ovn.5: \
+       ovsdb/ovsdb-doc ovn/ovn.xml ovn/ovn.ovsschema $(OVN_PIC)
+       $(AM_V_GEN)$(OVSDB_DOC) \
+               $(OVN_DOT_DIAGRAM_ARG) \
+               --version=$(VERSION) \
+               $(srcdir)/ovn/ovn.ovsschema \
+               $(srcdir)/ovn/ovn.xml > $@.tmp && \
+       mv $@.tmp $@
+
+# OVN northbound schema and IDL
+EXTRA_DIST += ovn/ovn-nb.ovsschema
+pkgdata_DATA += ovn/ovn-nb.ovsschema
+
+# OVN northbound E-R diagram
+#
+# If "python" or "dot" is not available, then we do not add a graphical
+# diagram to the documentation.
+if HAVE_PYTHON
+if HAVE_DOT
+ovn/ovn-nb.gv: ovsdb/ovsdb-dot.in ovn/ovn-nb.ovsschema
+       $(AM_V_GEN)$(OVSDB_DOT) --no-arrows $(srcdir)/ovn/ovn-nb.ovsschema > $@
+ovn/ovn-nb.pic: ovn/ovn-nb.gv ovsdb/dot2pic
+       $(AM_V_GEN)(dot -T plain < ovn/ovn-nb.gv | $(PERL) $(srcdir)/ovsdb/dot2pic -f 3) > $@.tmp && \
+       mv $@.tmp $@
+OVN_NB_PIC = ovn/ovn-nb.pic
+OVN_NB_DOT_DIAGRAM_ARG = --er-diagram=$(OVN_NB_PIC)
+DISTCLEANFILES += ovn/ovn-nb.gv ovn/ovn-nb.pic
+endif
+endif
+
+# OVN northbound schema documentation
+EXTRA_DIST += ovn/ovn-nb.xml
+DISTCLEANFILES += ovn/ovn-nb.5
+man_MANS += ovn/ovn-nb.5
+ovn/ovn-nb.5: \
+       ovsdb/ovsdb-doc ovn/ovn-nb.xml ovn/ovn-nb.ovsschema $(OVN_NB_PIC)
+       $(AM_V_GEN)$(OVSDB_DOC) \
+               $(OVN_NB_DOT_DIAGRAM_ARG) \
+               --version=$(VERSION) \
+               $(srcdir)/ovn/ovn-nb.ovsschema \
+               $(srcdir)/ovn/ovn-nb.xml > $@.tmp && \
+       mv $@.tmp $@
+
+man_MANS += ovn/ovn-controller.8 ovn/ovn-architecture.7
+EXTRA_DIST += ovn/ovn-controller.8.in ovn/ovn-architecture.7.xml
+
+SUFFIXES += .xml
+%: %.xml
+       $(AM_V_GEN)$(run_python) $(srcdir)/build-aux/xml2nroff \
+               --version=$(VERSION) $< > $@.tmp && mv $@.tmp $@
+
+EXTRA_DIST += ovn/TODO
diff --git a/ovn/ovn-architecture.7.xml b/ovn/ovn-architecture.7.xml

new file mode 100644 (file)

index 0000000..9ffa036
--- /dev/null
+++ b/ovn/ovn-architecture.7.xml
@@ -0,0 +1,339 @@
+<?xml version="1.0" encoding="utf-8"?>
+<manpage program="ovn-architecture" section="7" title="OVN Architecture">
+  <h1>Name</h1>
+  <p>ovn-architecture -- Open Virtual Network architecture</p>
+
+  <h1>Description</h1>
+
+  <p>
+    OVN, the Open Virtual Network, is a system to support virtual network
+    abstraction.  OVN complements the existing capabilities of OVS to add
+    native support for virtual network abstractions, such as virtual L2 and L3
+    overlays and security groups.  Services such as DHCP are also desirable
+    features.  Just like OVS, OVN's design goal is to have a production-quality
+    implementation that can operate at significant scale.
+  </p>
+
+  <p>
+    An OVN deployment consists of several components:
+  </p>
+
+  <ul>
+    <li>
+      <p>
+        A <dfn>Cloud Management System</dfn> (<dfn>CMS</dfn>), which is
+        OVN's ultimate client (via its users and administrators).  OVN
+        integration requires installing a CMS-specific plugin and
+        related software (see below).  OVN initially targets OpenStack
+        as CMS.
+      </p>
+
+      <p>
+        We generally speak of ``the'' CMS, but one can imagine scenarios in
+        which multiple CMSes manage different parts of an OVN deployment.
+      </p>
+    </li>
+
+    <li>
+      An OVN Database physical or virtual node (or, eventually, cluster)
+      installed in a central location.
+    </li>
+
+    <li>
+      One or more (usually many) <dfn>hypervisors</dfn>.  Hypervisors must run
+      Open vSwitch and implement the interface described in
+      <code>IntegrationGuide.md</code> in the OVS source tree.  Any hypervisor
+      platform supported by Open vSwitch is acceptable.
+    </li>
+
+    <li>
+      <p>
+       Zero or more <dfn>gateways</dfn>.  A gateway extends a tunnel-based
+       logical network into a physical network by bidirectionally forwarding
+       packets between tunnels and a physical Ethernet port.  This allows
+       non-virtualized machines to participate in logical networks.  A gateway
+       may be a physical host, a virtual machine, or an ASIC-based hardware
+       switch that supports the <code>vtep</code>(5) schema.  (Support for the
+       latter will come later in OVN implementation.)
+      </p>
+
+      <p>
+       Hypervisors and gateways are together called <dfn>transport nodes</dfn>
+       or <dfn>chassis</dfn>.
+      </p>
+    </li>
+  </ul>
+
+  <p>
+    The diagram below shows how the major components of OVN and related
+    software interact.  Starting at the top of the diagram, we have:
+  </p>
+
+  <ul>
+    <li>
+      The Cloud Management System, as defined above.
+    </li>
+
+    <li>
+      <p>
+       The <dfn>OVN/CMS Plugin</dfn> is the component of the CMS that
+       interfaces to OVN.  In OpenStack, this is a Neutron plugin.
+       The plugin's main purpose is to translate the CMS's notion of logical
+       network configuration, stored in the CMS's configuration database in a
+       CMS-specific format, into an intermediate representation understood by
+       OVN.
+      </p>
+
+      <p>
+       This component is necessarily CMS-specific, so a new plugin needs to be
+       developed for each CMS that is integrated with OVN.  All of the
+       components below this one in the diagram are CMS-independent.
+      </p>
+    </li>
+
+    <li>
+      <p>
+       The <dfn>OVN Northbound Database</dfn> receives the intermediate
+       representation of logical network configuration passed down by the
+       OVN/CMS Plugin.  The database schema is meant to be ``impedance
+       matched'' with the concepts used in a CMS, so that it directly supports
+       notions of logical switches, routers, ACLs, and so on.  See
+       <code>ovn-nb</code>(5) for details.
+      </p>
+
+      <p>
+       The OVN Northbound Database has only two clients: the OVN/CMS Plugin
+       above it and <code>ovn-nbd</code> below it.
+      </p>
+    </li>
+
+    <li>
+      <code>ovn-nbd</code>(8) connects to the OVN Northbound Database above it
+      and the OVN Database below it.  It translates the logical network
+      configuration in terms of conventional network concepts, taken from the
+      OVN Northbound Database, into logical datapath flows in the OVN Database
+      below it.
+    </li>
+
+    <li>
+      <p>
+       The <dfn>OVN Database</dfn> is the center of the system.  Its clients
+       are <code>ovn-nbd</code>(8) above it and <code>ovn-controller</code>(8)
+       on every transport node below it.
+      </p>
+
+      <p>
+       The OVN Database contains three kinds of data: <dfn>Physical
+       Network</dfn> (PN) tables that specify how to reach hypervisor and
+       other nodes, <dfn>Logical Network</dfn> (LN) tables that describe the
+       logical network in terms of ``logical datapath flows,'' and
+       <dfn>Binding</dfn> tables that link logical network components'
+       locations to the physical network.  The hypervisors populate the PN and
+       Binding tables, whereas <code>ovn-nbd</code>(8) populates the LN
+       tables.
+      </p>
+
+      <p>
+       OVN Database performance must scale with the number of transport nodes.
+       This will likely require some work on <code>ovsdb-server</code>(1) as
+       we encounter bottlenecks.  Clustering for availability may be needed.
+      </p>
+    </li>
+  </ul>
+
+  <p>
+    The remaining components are replicated onto each hypervisor:
+  </p>
+
+  <ul>
+    <li>
+      <code>ovn-controller</code>(8) is OVN's agent on each hypervisor and
+      software gateway.  Northbound, it connects to the OVN Database to learn
+      about OVN configuration and status and to populate the PN and <code>Bindings</code>
+      tables with the hypervisor's status.  Southbound, it connects to
+      <code>ovs-vswitchd</code>(8) as an OpenFlow controller, for control over
+      network traffic, and to the local <code>ovsdb-server</code>(1) to allow
+      it to monitor and control Open vSwitch configuration.
+    </li>
+
+    <li>
+      <code>ovs-vswitchd</code>(8) and <code>ovsdb-server</code>(1) are
+      conventional components of Open vSwitch.
+    </li>
+  </ul>
+
+  <pre fixed="yes">
+                                  CMS
+                                   |
+                                   |
+                       +-----------|-----------+
+                       |           |           |
+                       |     OVN/CMS Plugin    |
+                       |           |           |
+                       |           |           |
+                       |   OVN Northbound DB   |
+                       |           |           |
+                       |           |           |
+                       |        ovn-nbd        |
+                       |           |           |
+                       +-----------|-----------+
+                                   |
+                                   |
+                                +------+
+                                |OVN DB|
+                                +------+
+                                   |
+                                   |
+                +------------------+------------------+
+                |                  |                  |
+ HV 1           |                  |    HV n          |
++---------------|---------------+  .  +---------------|---------------+
+|               |               |  .  |               |               |
+|        ovn-controller         |  .  |        ovn-controller         |
+|         |          |          |  .  |         |          |          |
+|         |          |          |     |         |          |          |
+|  ovs-vswitchd   ovsdb-server  |     |  ovs-vswitchd   ovsdb-server  |
+|                               |     |                               |
++-------------------------------+     +-------------------------------+
+  </pre>
+
+  <h3>Life Cycle of a VIF</h3>
+
+  <p>
+    Tables and their schemas presented in isolation are difficult to
+    understand.  Here's an example.
+  </p>
+
+  <p>
+    The steps in this example refer often to details of the OVN and OVN
+    Northbound database schemas.  Please see <code>ovn</code>(5) and
+    <code>ovn-nb</code>(5), respectively, for the full story on these
+    databases.
+  </p>
+
+  <ol>
+    <li>
+      A VIF's life cycle begins when a CMS administrator creates a new VIF
+      using the CMS user interface or API and adds it to a switch (one
+      implemented by OVN as a logical switch).  The CMS updates its own
+      configuration.  This includes associating unique, persistent identifier
+      <var>vif-id</var> and Ethernet address <var>mac</var> with the VIF.
+    </li>
+
+    <li>
+      The CMS plugin updates the OVN Northbound database to include the new
+      VIF, by adding a row to the <code>Logical_Port</code> table.  In the new
+      row, <code>name</code> is <var>vif-id</var>, <code>macs</code> contains
+      <var>mac</var>, <code>switch</code> points to the OVN logical switch's
+      Logical_Switch record, and other columns are initialized appropriately.
+    </li>
+
+    <li>
+      <code>ovn-nbd</code> receives the OVN Northbound database update.  In
+      turn, it makes the corresponding updates to the OVN database, by adding
+      rows to the OVN database <code>Pipeline</code> table to reflect the new
+      port, e.g. add a flow to recognize that packets destined to the new
+      port's MAC address should be delivered to it, and update the flow that
+      delivers broadcast and multicast packets to include the new port.
+    </li>
+
+    <li>
+      On every hypervisor, <code>ovn-controller</code> receives the
+      <code>Pipeline</code> table updates that <code>ovn-nbd</code> made in the
+      previous step.  As long as the VM that owns the VIF is powered off,
+      <code>ovn-controller</code> cannot do much; it cannot, for example,
+      arrange to send packets to or receive packets from the VIF, because the
+      VIF does not actually exist anywhere.
+    </li>
+
+    <li>
+      Eventually, a user powers on the VM that owns the VIF.  On the hypervisor
+      where the VM is powered on, the integration between the hypervisor and
+      Open vSwitch (described in <code>IntegrationGuide.md</code>) adds the VIF
+      to the OVN integration bridge and stores <var>vif-id</var> in
+      <code>external-ids</code>:<code>iface-id</code> to indicate that the
+      interface is an instantiation of the new VIF.  (None of this code is new
+      in OVN; this is pre-existing integration work that has already been done
+      on hypervisors that support OVS.)
+    </li>
+
+    <li>
+      On the hypervisor where the VM is powered on, <code>ovn-controller</code>
+      notices <code>external-ids</code>:<code>iface-id</code> in the new
+      Interface.  In response, it updates the local hypervisor's OpenFlow
+      tables so that packets to and from the VIF are properly handled.
+      Afterward, it updates the <code>Bindings</code> table in the OVN DB,
+      adding a row that links the logical port from
+      <code>external-ids</code>:<code>iface-id</code> to the hypervisor.
+    </li>
+
+    <li>
+      Some CMS systems, including OpenStack, fully start a VM only when its
+      networking is ready.  To support this, <code>ovn-nbd</code> notices the
+      new row in the <code>Bindings</code> table, and pushes this upward by
+      updating the <ref column="up" table="Logical_Port" db="OVN_NB"/> column
+      in the OVN Northbound database's <ref table="Logical_Port" db="OVN_NB"/>
+      table to indicate that the VIF is now up.  The CMS, if it uses this
+      feature, can then react by allowing the VM's execution to proceed.
+    </li>
+
+    <li>
+      On every hypervisor but the one where the VIF resides,
+      <code>ovn-controller</code> notices the new row in the
+      <code>Bindings</code> table.  This provides <code>ovn-controller</code>
+      the physical location of the logical port, so each instance updates the
+      OpenFlow tables of its switch (based on logical datapath flows in the OVN
+      DB <code>Pipeline</code> table) so that packets to and from the VIF can
+      be properly handled via tunnels.
+    </li>
+
+    <li>
+      Eventually, a user powers off the VM that owns the VIF.  On the
+      hypervisor where the VM was powered on, the VIF is deleted from the OVN
+      integration bridge.
+    </li>
+
+    <li>
+      On the hypervisor where the VM was powered on,
+      <code>ovn-controller</code> notices that the VIF was deleted.  In
+      response, it removes the logical port's row from the
+      <code>Bindings</code> table.
+    </li>
+
+    <li>
+      On every hypervisor, <code>ovn-controller</code> notices the row removed
+      from the <code>Bindings</code> table.  This means that
+      <code>ovn-controller</code> no longer knows the physical location of the
+      logical port, so each instance updates its OpenFlow table to reflect
+      that.
+    </li>
+
+    <li>
+      Eventually, when the VIF (or its entire VM) is no longer needed by
+      anyone, an administrator deletes the VIF using the CMS user interface or
+      API.  The CMS updates its own configuration.
+    </li>
+
+    <li>
+      The CMS plugin removes the VIF from the OVN Northbound database,
+      by deleting its row in the <code>Logical_Port</code> table.
+    </li>
+
+    <li>
+      <code>ovn-nbd</code> receives the OVN Northbound update and in turn
+      updates the OVN database accordingly, by removing or updating the
+      rows from the OVN database <code>Pipeline</code> table that were related
+      to the now-destroyed VIF.
+    </li>
+
+    <li>
+      On every hypervisor, <code>ovn-controller</code> receives the
+      <code>Pipeline</code> table updates that <code>ovn-nbd</code> made in the
+      previous step.  <code>ovn-controller</code> updates OpenFlow tables to
+      reflect the update, although there may not be much to do, since the VIF
+      had already become unreachable when it was removed from the
+      <code>Bindings</code> table in a previous step.
+    </li>
+  </ol>
+
+</manpage>
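
The "Life Cycle of a VIF" steps in ovn-architecture.7.xml above can be traced
with a small in-memory Python model.  This is only a sketch under stated
assumptions: the Deployment class, its method names, and the flow strings are
hypothetical, the three tables are reduced to plain dicts and sets, and the
separate daemons are collapsed into method calls.  It is meant to show which
component writes which table, not how OVN implements any of it.

```python
class Deployment:
    """Toy model of the tables touched during a VIF's life cycle."""

    def __init__(self):
        self.logical_port = {}  # OVN_Northbound Logical_Port: name -> row
        self.pipeline = set()   # OVN Pipeline: simplified (match, action) rows
        self.bindings = {}      # OVN Bindings: logical port name -> chassis

    def nbd_sync(self):
        """Stand-in for ovn-nbd (steps 3, 7, and 14)."""
        # Recompute logical datapath flows from the northbound contents.
        # The match/action strings are invented for this sketch.
        self.pipeline = {
            ("eth.dst == " + mac, "output to " + name)
            for name, row in self.logical_port.items()
            for mac in row["macs"]
        }
        # Reflect Bindings rows upward into the "up" column.
        for name, row in self.logical_port.items():
            row["up"] = name in self.bindings

    def cms_create_vif(self, vif_id, mac):
        """CMS plugin adds a Logical_Port row (steps 1 and 2)."""
        self.logical_port[vif_id] = {"macs": {mac}, "up": False}
        self.nbd_sync()

    def vif_plugged(self, chassis, vif_id):
        """ovn-controller sees iface-id and adds a Bindings row (steps 5, 6)."""
        self.bindings[vif_id] = chassis
        self.nbd_sync()

    def vif_unplugged(self, vif_id):
        """ovn-controller removes the Bindings row (steps 9 and 10)."""
        self.bindings.pop(vif_id, None)
        self.nbd_sync()

    def cms_delete_vif(self, vif_id):
        """CMS plugin deletes the Logical_Port row (steps 12 and 13)."""
        self.logical_port.pop(vif_id, None)
        self.nbd_sync()


d = Deployment()
d.cms_create_vif("vif1", "00:00:00:00:00:01")  # flows exist, port is down
d.vif_plugged("hv1", "vif1")                   # Bindings row added, port up
d.vif_unplugged("vif1")                        # Bindings row gone, port down
d.cms_delete_vif("vif1")                       # flows removed
```

The per-hypervisor OpenFlow updates (steps 4, 8, 11, and 15) are left out
because they depend on flow translation that the design has not yet specified.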
diff --git a/ovn/ovn-controller.8.in b/ovn/ovn-controller.8.in

new file mode 100644 (file)

index 0000000..59fcb59
--- /dev/null
+++ b/ovn/ovn-controller.8.in
@@ -0,0 +1,41 @@
+.\" -*- nroff -*-
+.de IQ
+.  br
+.  ns
+.  IP "\\$1"
+..
+.TH ovn\-controller 8 "@VERSION@" "Open vSwitch" "Open vSwitch Manual"
+.ds PN ovn\-controller
+.
+.SH NAME
+ovn\-controller \- OVN local controller
+.
+.SH SYNOPSIS
+\fBovn\-controller\fR [\fIoptions\fR]
+.
+.SH DESCRIPTION
+\fBovn\-controller\fR is the local controller daemon for OVN, the Open
+Virtual Network.  It connects northbound to the OVN database (see
+\fBovn\fR(5)) over the OVSDB protocol, and southbound to the Open
+vSwitch database (see \fBovs\-vswitchd.conf.db\fR(5)) over the OVSDB
+protocol and to \fBovs\-vswitchd\fR(8) via OpenFlow.  Each hypervisor
+and software gateway in an OVN deployment runs its own independent
+copy of \fBovn\-controller\fR; thus, \fBovn\-controller\fR's
+southbound connections are machine-local and do not run over a
+physical network.
+.PP
+XXX this is completely skeletal.
+.
+.SH OPTIONS
+.SS "Public Key Infrastructure Options"
+.so lib/ssl.man
+.so lib/ssl-peer-ca-cert.man
+.ds DD
+.so lib/daemon.man
+.so lib/vlog.man
+.so lib/unixctl.man
+.so lib/common.man
+.
+.SH "SEE ALSO"
+.
+\fBovn\-architecture\fR(7)
diff --git a/ovn/ovn-nb.ovsschema b/ovn/ovn-nb.ovsschema

new file mode 100644 (file)

index 0000000..ad675ac
--- /dev/null
+++ b/ovn/ovn-nb.ovsschema
@@ -0,0 +1,62 @@
+{
+    "name": "OVN_Northbound",
+    "tables": {
+        "Logical_Switch": {
+            "columns": {
+                "router_port": {"type": {"key": {"type": "uuid",
+                                                 "refTable": "Logical_Router_Port",
+                                                 "refType": "strong"},
+                                         "min": 0, "max": 1}},
+                "external_ids": {
+                    "type": {"key": "string", "value": "string",
+                             "min": 0, "max": "unlimited"}}}},
+        "Logical_Port": {
+            "columns": {
+                "switch": {"type": {"key": {"type": "uuid",
+                                            "refTable": "Logical_Switch",
+                                            "refType": "strong"}}},
+                "name": {"type": "string"},
+                "macs": {"type": {"key": "string",
+                                  "min": 0,
+                                  "max": "unlimited"}},
+                "port_security": {"type": {"key": "string",
+                                           "min": 0,
+                                           "max": "unlimited"}},
+                "up": {"type": {"key": "boolean", "min": 0, "max": 1}},
+                "external_ids": {
+                    "type": {"key": "string", "value": "string",
+                             "min": 0, "max": "unlimited"}}},
+            "indexes": [["name"]]},
+        "ACL": {
+            "columns": {
+                "switch": {"type": {"key": {"type": "uuid",
+                                            "refTable": "Logical_Switch",
+                                            "refType": "strong"}}},
+                "priority": {"type": {"key": {"type": "integer",
+                                              "minInteger": 0,
+                                              "maxInteger": 65535}}},
+                "match": {"type": "string"},
+                "action": {"type": {"key": {"type": "string",
+                                            "enum": ["set", ["allow", "allow-related", "drop", "reject"]]}}},
+                "log": {"type": "boolean"},
+                "external_ids": {
+                    "type": {"key": "string", "value": "string",
+                             "min": 0, "max": "unlimited"}}}},
+        "Logical_Router": {
+            "columns": {
+                "ip": {"type": "string"},
+                "default_gw": {"type": {"key": "string", "min": 0, "max": 1}},
+                "external_ids": {
+                    "type": {"key": "string", "value": "string",
+                             "min": 0, "max": "unlimited"}}}},
+        "Logical_Router_Port": {
+            "columns": {
+                "router": {"type": {"key": {"type": "uuid",
+                                            "refTable": "Logical_Router",
+                                            "refType": "strong"}}},
+                "network": {"type": "string"},
+                "mac": {"type": "string"},
+                "external_ids": {
+                    "type": {"key": "string", "value": "string",
+                             "min": 0, "max": "unlimited"}}}}},
+    "version": "1.0.0"}
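As a rough illustration of the constraints that the schema above places on `ACL` rows, the following Python sketch checks a candidate row against the `priority` range and the `action` enum. The `ACL_COLUMNS` dict is a hand-copied excerpt of the schema, and `check_atom` and `check_acl_row` are hypothetical helper names, not part of any OVSDB library. In a real deployment the database server would be expected to enforce these constraints; this is only meant to make them concrete.

```python
# Excerpt of the "ACL" column types, copied by hand from ovn-nb.ovsschema.
ACL_COLUMNS = {
    "priority": {"type": "integer", "minInteger": 0, "maxInteger": 65535},
    "action": {"type": "string",
               "enum": ["set", ["allow", "allow-related", "drop", "reject"]]},
}


def check_atom(base_type, value):
    """Return an error string if value violates base_type, else None."""
    kind = base_type["type"]
    if kind == "integer":
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            return "expected integer"
        lo = base_type.get("minInteger")
        hi = base_type.get("maxInteger")
        if lo is not None and value < lo:
            return "below minimum %d" % lo
        if hi is not None and value > hi:
            return "above maximum %d" % hi
    elif kind == "string":
        if not isinstance(value, str):
            return "expected string"
        if "enum" in base_type:
            # In this schema the enum is written as ["set", [member, ...]].
            allowed = base_type["enum"][1]
            if value not in allowed:
                return "not one of %s" % ", ".join(allowed)
    return None


def check_acl_row(row):
    """Return a dict mapping column name to error for each invalid column."""
    errors = {}
    for column, base_type in ACL_COLUMNS.items():
        if column not in row:
            errors[column] = "missing"
            continue
        error = check_atom(base_type, row[column])
        if error:
            errors[column] = error
    return errors
```

For instance, `check_acl_row({"priority": 100, "action": "drop"})` returns an empty dict, while a row whose action is `deny` is reported as invalid because `deny` is not a member of the enum.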
diff --git a/ovn/ovn-nb.xml b/ovn/ovn-nb.xml

new file mode 100644 (file)

index 0000000..80190ca
--- /dev/null
+++ b/ovn/ovn-nb.xml
@@ -0,0 +1,245 @@
+<?xml version="1.0" encoding="utf-8"?>
+<database name="ovn-nb" title="OVN Northbound Database">
+  <p>
+    This database is the interface between OVN and the cloud management system
+    (CMS), such as OpenStack, running above it.  The CMS produces almost all of
+    the contents of the database.  The <code>ovn-nbd</code> program monitors
+    the database contents, transforms them, and stores them in the <ref
+    db="OVN"/> database.
+  </p>
+
+  <p>
+    We generally speak of ``the'' CMS, but one can imagine scenarios in
+    which multiple CMSes manage different parts of an OVN deployment.
+  </p>
+
+  <h2>External IDs</h2>
+
+  <p>
+    Each of the tables in this database contains a special column, named
+    <code>external_ids</code>.  This column has the same form and purpose
+    wherever it appears.
+  </p>
+
+  <dl>
+    <dt><code>external_ids</code>: map of string-string pairs</dt>
+    <dd>
+      Key-value pairs for use by the CMS.  The CMS might use certain pairs, for
+      example, to identify entities in its own configuration that correspond to
+      those in this database.
+    </dd>
+  </dl>
+
+  <table name="Logical_Switch" title="L2 logical switch">
+    <p>
+      Each row represents one L2 logical switch.  A given switch's ports are
+      the <ref table="Logical_Port"/> rows whose <ref table="Logical_Port"
+      column="switch"/> column points to its row.
+    </p>
+
+    <column name="router_port">
+      <p>
+        The router port to which this logical switch is connected, or empty if
+        this logical switch is not connected to any router.  A switch may be
+        connected to at most one logical router, but this is not a significant
+        restriction because logical routers may be connected into arbitrary
+        topologies.
+      </p>
+    </column>
+
+    <group title="Common Columns">
+      <column name="external_ids">
+        See <em>External IDs</em> at the beginning of this document.
+      </column>
+    </group>
+  </table>
+
+  <table name="Logical_Port" title="L2 logical switch port">
+    <p>
+      A port within an L2 logical switch.
+    </p>
+
+    <column name="switch">
+      The logical switch to which the logical port is connected.
+    </column>
+
+    <column name="name">
+      The logical port name.  The name used here must match the one used in the
+      <ref key="iface-id" table="Interface" column="external_ids"
+      db="Open_vSwitch"/> in the <ref db="Open_vSwitch"/> database's <ref
+      table="Interface" db="Open_vSwitch"/> table, because hypervisors use <ref
+      key="iface-id" table="Interface" column="external_ids"
+      db="Open_vSwitch"/> as a lookup key for logical ports.
+    </column>
+
+    <column name="up">
+      This column is populated by <code>ovn-nbd</code>, rather than by the CMS
+      plugin as is most of this database.  When a logical port is bound to a
+      physical location in the OVN database <ref db="OVN" table="Bindings"/>
+      table, <code>ovn-nbd</code> sets this column to <code>true</code>;
+      otherwise, or if the port becomes unbound later, it sets it to
+      <code>false</code>.  This allows the CMS to wait for a VM's networking to
+      become active before it allows the VM to start.
+    </column>
+
+    <column name="macs">
+      The logical port's own Ethernet address or addresses, each in the form
+      <var>xx</var>:<var>xx</var>:<var>xx</var>:<var>xx</var>:<var>xx</var>:<var>xx</var>.
+      Like a physical Ethernet NIC, a logical port ordinarily has a single
+      fixed Ethernet address.  The string <code>unknown</code> is also allowed
+      to indicate that the logical port has an unknown set of (additional)
+      source addresses.
+    </column>
+
+    <column name="port_security">
+      <p>
+        A set of L2 (Ethernet) or L3 (IPv4 or IPv6) addresses or L2+L3 pairs
+        from which the logical port is allowed to send packets and at which it
+        is allowed to receive packets.  If this column is empty, all addresses
+        are permitted.
+      </p>
+
+      <p>
+        Exact syntax is TBD.  One could simply use comma- or space-separated L2
+        and L3 addresses in each set member, or replace this by a subset of the
+        general-purpose expression language used for the <ref column="match"
+        table="Pipeline" db="OVN"/> column in the OVN database's <ref
+        table="Pipeline" db="OVN"/> table.
+      </p>
+    </column>
+
+    <group title="Common Columns">
+      <column name="external_ids">
+        See <em>External IDs</em> at the beginning of this document.
+      </column>
+    </group>
+  </table>
+
+  <table name="ACL" title="Access Control List (ACL) rule">
+    <p>
+      Each row in this table represents one ACL rule for the logical switch in
+      its <ref column="switch"/> column.  The <ref column="action"/> column for
+      the highest-<ref column="priority"/> matching row in this table
+      determines a packet's treatment.  If no row matches, packets are allowed
+      by default.  (Default-deny treatment is possible: add a rule with <ref
+      column="priority"/> 0, <code>true</code> as <ref column="match"/>, and
+      <code>drop</code> as <ref column="action"/>.)
+    </p>
+
+    <column name="switch">
+      The switch to which the ACL rule applies.  The expression in the
+      <ref column="match"/> column may match against logical ports
+      within this switch.
+    </column>
+
+    <column name="priority">
+      The ACL rule's priority.  Rules with numerically higher priority take
+      precedence over those with lower.  If two ACL rules with the same
+      priority both match, then the one actually applied to a packet is
+      undefined.
+    </column>
+
+    <column name="match">
+      The packets that the ACL should match, in the same expression language
+      used for the <ref column="match" table="Pipeline" db="OVN"/> column in
+      the OVN database's <ref table="Pipeline" db="OVN"/> table.  Match
+      <code>inport</code> and <code>outport</code> against names of logical
+      ports within <ref column="switch"/> to implement ingress and egress ACLs,
+      respectively.  In logical switches connected to logical routers, the
+      special port name <code>ROUTER</code> refers to the logical router port.
+    </column>
+
+    <column name="action">
+      <p>The action to take when the ACL rule matches:</p>
+      
+      <ul>
+       <li>
+         <code>allow</code>: Forward the packet.
+       </li>
+
+       <li>
+         <code>allow-related</code>: Forward the packet and related traffic
+         (e.g. inbound replies to an outbound connection).
+       </li>
+
+       <li>
+         <code>drop</code>: Silently drop the packet.
+       </li>
+
+       <li>
+         <code>reject</code>: Drop the packet, replying with a RST for TCP or
+         ICMP unreachable message for other IP-based protocols.
+       </li>
+      </ul>
+    </column>
+
+    <column name="log">
+      If set to <code>true</code>, packets that match the ACL will trigger a
+      log message on the transport node or nodes that perform ACL processing.
+      Logging may be combined with any <ref column="action"/>.
+    </column>
+
+    <group title="Common Columns">
+      <column name="external_ids">
+        See <em>External IDs</em> at the beginning of this document.
+      </column>
+    </group>
+  </table>
+
+  <table name="Logical_Router" title="L3 logical router">
+    <p>
+      Each row represents one L3 logical router.  A given router's ports are
+      the <ref table="Logical_Router_Port"/> rows whose <ref
+      table="Logical_Router_Port" column="router"/> column points to its row.
+    </p>
+
+    <column name="ip">
+      The logical router's own IP address.  The logical router uses this
+      address for ICMP replies (e.g. network unreachable messages) and other
+      traffic that it originates, and it responds to traffic destined to this
+      address (e.g. ICMP echo requests).
+    </column>
+
+    <column name="default_gw">
+      IP address to use as default gateway, if any.
+    </column>
+
+    <group title="Common Columns">
+      <column name="external_ids">
+        See <em>External IDs</em> at the beginning of this document.
+      </column>
+    </group>
+  </table>
+
+  <table name="Logical_Router_Port" title="L3 logical router port">
+    <p>
+      A port within an L3 logical router.
+    </p>
+
+    <p>
+      A router port is always attached to a logical switch.  The connection
+      can be identified by following the <ref column="router_port"
+      table="Logical_Switch"/> column from an appropriate <ref
+      table="Logical_Switch"/> row.
+    </p>
+
+    <column name="router">
+      The router to which the port belongs.
+    </column>
+
+    <column name="network">
+      The IP network and netmask of the network on the router port.  Used for
+      routing.
+    </column>
+
+    <column name="mac">
+      The Ethernet address that belongs to this router port.
+    </column>
+
+    <group title="Common Columns">
+      <column name="external_ids">
+        See <em>External IDs</em> at the beginning of this document.
+      </column>
+    </group>
+  </table>
+</database>
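The ACL semantics documented in ovn-nb.xml above (the highest-priority matching row decides, and packets are allowed when nothing matches) can be sketched in Python as follows. This is only an illustration under stated assumptions: `AclRule` and `acl_verdict` are hypothetical names, packets are plain dicts, and each rule's `match` is a Python predicate standing in for the match expression language, whose syntax is not yet final.

```python
from collections import namedtuple

# Hypothetical in-memory form of an ACL row; "match" is a Python predicate
# standing in for the match expression language, whose syntax is not final.
AclRule = namedtuple("AclRule", ["priority", "match", "action"])


def acl_verdict(rules, packet):
    """Return the action of the highest-priority rule matching packet.

    If no rule matches, return "allow", the documented default.  Ties among
    equal-priority matching rules are undefined by the documentation; this
    sketch happens to pick whichever of them appears first in rules.
    """
    matching = [rule for rule in rules if rule.match(packet)]
    if not matching:
        return "allow"
    return max(matching, key=lambda rule: rule.priority).action


rules = [
    # Default-deny: priority 0, match everything, drop.
    AclRule(0, lambda pkt: True, "drop"),
    # Allow traffic entering on the (made-up) logical port "vif1".
    AclRule(100, lambda pkt: pkt.get("inport") == "vif1", "allow"),
]
```

With these two rules, a packet arriving on `vif1` is allowed by the priority-100 rule, and any other packet falls through to the priority-0 catch-all and is dropped. With an empty rule list, every packet is allowed.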
diff --git a/ovn/ovn.ovsschema b/ovn/ovn.ovsschema

new file mode 100644 (file)

index 0000000..5597df4
--- /dev/null
+++ b/ovn/ovn.ovsschema
@@ -0,0 +1,50 @@
+{
+    "name": "OVN",
+    "tables": {
+        "Chassis": {
+            "columns": {
+                "name": {"type": "string"},
+                "encap": {"type": {"key": {"type": "string",
+                                           "enum": ["set", ["stt", "vxlan", "gre"]]}}},
+                "encap_options": {"type": {"key": "string",
+                                           "value": "string",
+                                           "min": 0,
+                                           "max": "unlimited"}},
+                "ip": {"type": "string"},
+                "gateway_ports": {"type": {"key": "string",
+                                           "value": {"type": "uuid",
+                                                     "refTable": "Gateway",
+                                                     "refType": "strong"},
+                                           "min": 0,
+                                           "max": "unlimited"}}},
+            "isRoot": true,
+            "indexes": [["name"]]},
+        "Gateway": {
+            "columns": {"attached_port": {"type": "string"},
+                        "vlan_map": {"type": {"key": {"type": "integer",
+                                                      "minInteger": 0,
+                                                      "maxInteger": 4095},
+                                              "value": {"type": "string"},
+                                              "min": 0,
+                                              "max": "unlimited"}}}},
+        "Pipeline": {
+            "columns": {
+                "table_id": {"type": {"key": {"type": "integer",
+                                              "minInteger": 0,
+                                              "maxInteger": 127}}},
+                "priority": {"type": {"key": {"type": "integer",
+                                              "minInteger": 0,
+                                              "maxInteger": 65535}}},
+                "match": {"type": "string"},
+                "actions": {"type": "string"}},
+            "isRoot": true},
+        "Bindings": {
+            "columns": {
+                "logical_port": {"type": "string"},
+                "chassis": {"type": "string"},
+                "mac": {"type": {"key": "string",
+                                 "min": 0,
+                                 "max": "unlimited"}}},
+            "indexes": [["logical_port"]],
+            "isRoot": true}},
+    "version": "1.0.0"}
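The `Bindings` table in the schema above is indexed by `logical_port`, and ovn-nb.xml states that `ovn-nbd` sets a logical port's `up` column to `true` while the port is bound to a physical location. A minimal sketch of that derivation, assuming rows are represented as plain Python dicts keyed by column name (`compute_up` is a hypothetical helper, not an existing function, and the port and chassis names are made up):

```python
def compute_up(logical_port_names, bindings_rows):
    """Map each logical port name to True if some Bindings row names it."""
    bound = {row["logical_port"] for row in bindings_rows}
    return {name: name in bound for name in logical_port_names}


# Example data: one bound port, one port without a Bindings row.
bindings = [
    {"logical_port": "vif1", "chassis": "hv1", "mac": ["00:00:00:00:00:01"]},
]
up = compute_up(["vif1", "vif2"], bindings)
```

Here `up` maps `vif1` to `True` and `vif2` to `False`, which is the value a CMS would wait on before starting the corresponding VM.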
diff --git a/ovn/ovn.xml b/ovn/ovn.xml

new file mode 100644 (file)

index 0000000..ccc2001
--- /dev/null
+++ b/ovn/ovn.xml
@@ -0,0 +1,497 @@
+<?xml version="1.0" encoding="utf-8"?>
+<database name="ovn" title="OVN Database">
+  <p>
+    This database holds logical and physical configuration and state for the
+    Open Virtual Network (OVN) system to support virtual network abstraction.
+    For an introduction to OVN, please see <code>ovn-architecture</code>(7).
+  </p>
+
+  <p>
+    The OVN database sits at the center of the OVN architecture.  It is the one
+    component that speaks both southbound directly to all the hypervisors and
+    gateways, via <code>ovn-controller</code>, and northbound to the Cloud
+    Management System, via <code>ovn-nbd</code>.
+  </p>
+
+  <h2>Database Structure</h2>
+
+  <p>
+    The OVN database contains three classes of data with different properties,
+    as described in the sections below.
+  </p>
+
+  <h3>Physical Network (PN) data</h3>
+
+  <p>
+    PN tables contain information about the chassis nodes in the system,
+    including all the information necessary to wire the overlay, such as IP
+    addresses, supported tunnel types, and security keys.
+  </p>
+
+  <p>
+    The amount of PN data is small (O(n) in the number of chassis) and it
+    changes infrequently, so it can be replicated to every chassis.
+  </p>
+
+  <p>
+    The <ref table="Chassis"/> and <ref table="Gateway"/> tables comprise the
+    PN tables.
+  </p>
+
+  <h3>Logical Network (LN) data</h3>
+
+  <p>
+    LN tables contain the topology of logical switches and routers, ACLs,
+    firewall rules, and everything needed to describe how packets traverse a
+    logical network, represented as logical datapath flows (see Logical
+    Datapath Flows, below).
+  </p>
+
+  <p>
+    LN data may be large (O(n) in the number of logical ports, ACL rules,
+    etc.).  Thus, to improve scaling, each chassis should receive only data
+    related to logical networks in which that chassis participates.  Past
+    experience shows that in the presence of large logical networks, even
+    finer-grained partitioning of data, e.g. designing logical flows so that
+    only the chassis hosting a logical port needs related flows, pays off
+    scale-wise.  (This is not necessary initially but it is worth bearing in
+    mind in the design.)
+  </p>
+
+  <p>
+    The LN is a slave of the cloud management system running northbound of OVN.
+    That CMS determines the entire OVN logical configuration and therefore the
+    LN's content at any given time is a deterministic function of the CMS's
+    configuration, although that happens indirectly via the OVN Northbound DB
+    and <code>ovn-nbd</code>.
+  </p>
+
+  <p>
+    LN data is likely to change more quickly than PN data.  This is especially
+    true in a container environment where VMs are created and destroyed (and
+    therefore added to and deleted from logical switches) quickly.
+  </p>
+
+  <p>
+    The <ref table="Pipeline"/> table is currently the only LN table.
+  </p>
+
+  <h3>Bindings data</h3>
+
+  <p>
+    The Bindings tables contain the current placement of logical components
+    (such as VMs and VIFs) onto chassis and the bindings between logical ports
+    and MACs.
+  </p>
+
+  <p>
+    Bindings change frequently, at least every time a VM powers up or down
+    or migrates, and especially quickly in a container environment.  The
+    amount of data per VM (or VIF) is small.
+  </p>
+
+  <p>
+    Each chassis is authoritative about the VMs and VIFs that it hosts at any
+    given time and can efficiently flood that state to a central location, so
+    the consistency needs are minimal.
+  </p>
+
+  <p>
+    The <ref table="Bindings"/> table is currently the only Bindings table.
+  </p>
+
+  <table name="Chassis" title="Physical Network Hypervisor and Gateway Information">
+    <p>
+      Each row in this table represents a hypervisor or gateway (a chassis) in
+      the physical network (PN).  Each chassis, via
+      <code>ovn-controller</code>, adds and updates its own row, and keeps a
+      copy of the remaining rows to determine how to reach other hypervisors.
+    </p>
+
+    <p>
+      When a chassis shuts down gracefully, it should remove its own row.
+      (This is not critical because resources hosted on the chassis are equally
+      unreachable regardless of whether the row is present.)  If a chassis
+      shuts down permanently without removing its row, some kind of manual or
+      automatic cleanup is eventually needed; we can devise a process for that
+      as necessary.
+    </p>
+
+    <column name="name">
+      A chassis name, taken from <ref key="system-id" table="Open_vSwitch"
+      column="external_ids" db="Open_vSwitch"/> in the Open_vSwitch
+      database's <ref table="Open_vSwitch" db="Open_vSwitch"/> table.  OVN does
+      not prescribe a particular format for chassis names.
+    </column>
+
+    <group title="Encapsulation">
+      <p>
+        These columns together identify how OVN may transmit logical dataplane
+        packets to this chassis.
+      </p>
+
+      <column name="encap">
+        The encapsulation to use to transmit packets to this chassis.
+      </column>
+
+      <column name="encap_options">
+        Options for configuring the encapsulation, e.g. IPsec parameters when
+        IPsec support is introduced.  No options are currently defined.
+      </column>
+
+      <column name="ip">
+        The IPv4 address of the encapsulation tunnel endpoint.
+      </column>
+    </group>
+
+    <group title="Gateway Configuration">
+      <p>
+        A <dfn>gateway</dfn> is a chassis that forwards traffic between a
+        logical network and a physical VLAN.  Gateways are typically dedicated
+        nodes that do not host VMs.
+      </p>
+
+      <column name="gateway_ports">
+        Maps from the name of a gateway port, which is typically a physical
+        port (e.g. <code>eth1</code>) or an Open vSwitch patch port, to a <ref
+        table="Gateway"/> record that describes the details of the gatewaying
+        function.
+      </column>
+    </group>
+  </table>
+
+  <table name="Gateway" title="Physical Network Gateway Ports">
+    <p>
+      The <ref column="gateway_ports" table="Chassis"/> column in the <ref
+      table="Chassis"/> table refers to rows in this table to connect a chassis
+      port to a gateway function.  Each row in this table describes the logical
+      networks to which a gateway port is attached.  Each chassis, via
+      <code>ovn-controller</code>(8), adds and updates its own rows, if any
+      (since most chassis are not gateways), and keeps a copy of the remaining
+      rows to determine how to reach other chassis.
+    </p>
+
+    <column name="vlan_map">
+      Maps from a VLAN ID to a logical port name.  Thus, each named logical
+      port corresponds to one VLAN on the gateway port.
+    </column>
+
+    <column name="attached_port">
+      The name of the gateway port in the chassis's Open vSwitch integration
+      bridge.
+    </column>
+  </table>
+
+  <table name="Pipeline" title="Logical Network Pipeline">
+    <p>
+      Each row in this table represents one logical flow.  The cloud management
+      system, via its OVN integration, populates this table with logical flows
+      that implement the L2 and L3 topology specified in the CMS configuration.
+      Each hypervisor, via <code>ovn-controller</code>, translates the logical
+      flows into OpenFlow flows specific to its hypervisor and installs them
+      into Open vSwitch.
+    </p>
+
+    <p>
+      Logical flows are expressed in an OVN-specific format, described here.  A
+      logical datapath flow is much like an OpenFlow flow, except that the
+      flows are written in terms of logical ports and logical datapaths instead
+      of physical ports and physical datapaths.  Translation between logical
+      and physical flows helps to ensure isolation between logical datapaths.
+      (The logical flow abstraction also allows the CMS to do less work, since
+      it does not have to separately compute and push out physical flows to
+      each chassis.)
+    </p>
+
+    <p>
+      The default action when no flow matches is to drop packets.
+    </p>
+
+    <column name="table_id">
+      The stage in the logical pipeline, analogous to an OpenFlow table number.
+    </column>
+
+    <column name="priority">
+      The flow's priority.  Flows with numerically higher priority take
+      precedence over those with lower.  If two logical datapath flows with the
+      same priority both match, then the one actually applied to the packet is
+      undefined.
+    </column>
+
+    <column name="match">
+      <p>
+        A matching expression.  OVN provides a superset of OpenFlow matching
+        capabilities, using a syntax similar to Boolean expressions in a
+        programming language.
+      </p>
+
+      <p>
+        Matching expressions have two important kinds of primary expression:
+        <dfn>fields</dfn> and <dfn>constants</dfn>.  A field names a piece of
+        data or metadata.  The supported fields are:
+      </p>
+
+      <ul>
+        <li>
+          <code>metadata</code> <code>reg0</code> ... <code>reg7</code>
+          <code>xreg0</code> ... <code>xreg3</code>
+        </li>
+        <li><code>inport</code> <code>outport</code> <code>queue</code></li>
+        <li><code>eth.src</code> <code>eth.dst</code> <code>eth.type</code></li>
+        <li><code>vlan.tci</code> <code>vlan.vid</code> <code>vlan.pcp</code> <code>vlan.present</code></li>
+        <li><code>ip.proto</code> <code>ip.dscp</code> <code>ip.ecn</code> <code>ip.ttl</code> <code>ip.frag</code></li>
+        <li><code>ip4.src</code> <code>ip4.dst</code></li>
+        <li><code>ip6.src</code> <code>ip6.dst</code> <code>ip6.label</code></li>
+        <li><code>arp.op</code> <code>arp.spa</code> <code>arp.tpa</code> <code>arp.sha</code> <code>arp.tha</code></li>
+        <li><code>tcp.src</code> <code>tcp.dst</code> <code>tcp.flags</code></li>
+        <li><code>udp.src</code> <code>udp.dst</code></li>
+        <li><code>sctp.src</code> <code>sctp.dst</code></li>
+        <li><code>icmp4.type</code> <code>icmp4.code</code></li>
+        <li><code>icmp6.type</code> <code>icmp6.code</code></li>
+        <li><code>nd.target</code> <code>nd.sll</code> <code>nd.tll</code></li>
+      </ul>
+
+      <p>
+        Subfields may be addressed using a <code>[]</code> suffix,
+        e.g. <code>tcp.src[0..7]</code> refers to the low 8 bits of the TCP
+        source port.  A subfield may be used in any context a field is allowed.
+      </p>
+
+      <p>
+        Some fields have prerequisites.  OVN implicitly adds clauses to satisfy
+        these.  For example, <code>arp.op == 1</code> is equivalent to
+        <code>eth.type == 0x0806 &amp;&amp; arp.op == 1</code>, and
+        <code>tcp.src == 80</code> is equivalent to <code>(eth.type == 0x0800
+        || eth.type == 0x86dd) &amp;&amp; ip.proto == 6 &amp;&amp; tcp.src ==
+        80</code>.
+      </p>
+
+      <p>
+        Most fields have integer values.  Integer constants may be expressed in
+        several forms: decimal integers, hexadecimal integers prefixed by
+        <code>0x</code>, dotted-quad IPv4 addresses, IPv6 addresses in their
+        standard forms, and as Ethernet addresses as colon-separated hex
+        digits.  A constant in any of these forms may be followed by a slash
+        and a second constant (the mask) in the same form, to form a masked
+        constant.  IPv4 and IPv6 masks may be given as integers, to express
+        CIDR prefixes.
+      </p>
+
+      <p>
+        The <code>inport</code> and <code>outport</code> fields have string
+        values.  The useful values are <ref column="logical_port"
+        table="Bindings"/> names from the <ref table="Bindings"/> and <ref
+        table="Gateway"/> tables.
+      </p>
+
+      <p>
+        The available operators, from highest to lowest precedence, are:
+      </p>
+
+      <ul>
+        <li><code>()</code></li>
+        <li><code>==   !=   &lt;   &lt;=   &gt;   &gt;=   in   not in</code></li>
+        <li><code>!</code></li>
+        <li><code>&amp;&amp;</code></li>
+        <li><code>||</code></li>
+      </ul>
+
+      <p>
+        The <code>()</code> operator is used for grouping.
+      </p>
+
+      <p>
+        The equality operator <code>==</code> is the most important operator.
+        Its operands must be a field and an optionally masked constant, in
+        either order.  The <code>==</code> operator yields true when the
+        field's value equals the constant's value for all the bits included in
+        the mask.  The <code>==</code> operator translates simply and naturally
+        to OpenFlow.
+      </p>
+
+      <p>
+        The inequality operator <code>!=</code> yields the inverse of
+        <code>==</code> but its syntax and use are the same.  Implementation of
+        the inequality operator is expensive.
+      </p>
+
+      <p>
+        The relational operators are &lt;, &lt;=, &gt;, and &gt;=.  Their
+        operands must be a field and a constant, in either order; the constant
+        must not be masked.  These operators are most commonly useful for L4
+        ports, e.g. <code>tcp.src &lt; 1024</code>.  Implementation of the
+        relational operators is expensive.
+      </p>
+
+      <p>
+        The set membership operator <code>in</code>, with syntax
+        ``<code><var>field</var> in { <var>constant1</var>,
+        <var>constant2</var>,</code> ... <code>}</code>'', is syntactic sugar
+        for ``<code>(<var>field</var> == <var>constant1</var> ||
+        <var>field</var> == <var>constant2</var> || </code>...<code>)</code>''.
+        Conversely, ``<code><var>field</var> not in { <var>constant1</var>,
+        <var>constant2</var>, </code>...<code> }</code>'' is syntactic sugar
+        for ``<code>(<var>field</var> != <var>constant1</var> &amp;&amp;
+        <var>field</var> != <var>constant2</var> &amp;&amp;
+        </code>...<code>)</code>''.
+      </p>
+
+      <p>
+        The unary prefix operator <code>!</code> yields its operand's inverse.
+      </p>
+
+      <p>
+        The logical AND operator <code>&amp;&amp;</code> yields true only if
+        both of its operands are true.
+      </p>
+
+      <p>
+        The logical OR operator <code>||</code> yields true if at least one of
+        its operands is true.
+      </p>
+
+      <p>
+        Finally, the keywords <code>true</code> and <code>false</code> may also
+        be used in matching expressions.  <code>true</code> is useful by itself
+        as a catch-all expression that matches every packet.
+      </p>
+
+      <p>
+        (The above is pretty ambitious.  It probably makes sense to initially
+        implement only a subset of this specification.  The full specification
+        is written out mainly to get an idea of what a fully general matching
+        expression language could include.)
+      </p>
+    </column>
+
+    <column name="actions">
+      <p>
+        Below, a <var>value</var> is either a <var>constant</var> or a
+        <var>field</var>.  The following actions seem most likely to be useful:
+      </p>
+
+      <dl>
+        <dt><code>drop;</code></dt>
+        <dd>syntactic sugar for no actions</dd>
+
+        <dt><code>output(<var>value</var>);</code></dt>
+        <dd>output to port</dd>
+
+        <dt><code>broadcast;</code></dt>
+        <dd>output to every logical port except ingress port</dd>
+
+        <dt><code>resubmit;</code></dt>
+        <dd>execute next logical datapath table as subroutine</dd>
+
+        <dt><code>set(<var>field</var>=<var>value</var>);</code></dt>
+        <dd>set data or metadata field, or copy between fields</dd>
+      </dl>
+
+      <p>
+        The following are not well thought out:
+      </p>
+
+      <dl>
+          <dt><code>learn</code></dt>
+
+          <dt><code>conntrack</code></dt>
+
+          <dt><code>with(<var>field</var>=<var>value</var>) { <var>action</var>, </code>...<code> }</code></dt>
+          <dd>execute <var>actions</var> with temporary changes to <var>fields</var></dd>
+
+          <dt><code>dec_ttl { <var>action</var>, </code>...<code> } { <var>action</var>; </code>...<code>}</code></dt>
+          <dd>
+            decrement TTL; execute first set of actions if
+            successful, second set if TTL decrement fails
+          </dd>
+
+          <dt><code>icmp_reply { <var>action</var>, </code>...<code> }</code></dt>
+          <dd>generate ICMP reply from packet, execute <var>action</var>s</dd>
+
+          <dt><code>arp { <var>action</var>, </code>...<code> }</code></dt>
+          <dd>generate ARP from packet, execute <var>action</var>s</dd>
+      </dl>
+
+      <p>
+        Other actions can be added as needed
+        (e.g. <code>push_vlan</code>, <code>pop_vlan</code>,
+        <code>push_mpls</code>, <code>pop_mpls</code>).
+      </p>
+
+      <p>
+        Some of the OVN actions do not map directly to OpenFlow actions, e.g.:
+      </p>
+
+      <ul>
+        <li>
+          <code>with</code>: Implemented as <code>stack_push;
+          set(</code>...<code>); <var>actions</var>; stack_pop</code>.
+        </li>
+
+        <li>
+          <code>dec_ttl</code>: Implemented as <code>dec_ttl</code> followed
+          by the actions for the successful case.  The failure case has to be
+          implemented by <code>ovn-controller</code> interpreting packet-ins.
+          It might be difficult for <code>ovn-controller</code> to identify
+          the particular place in the processing pipeline where the failure
+          occurred; maybe some restrictions will be necessary.
+        </li>
+
+        <li>
+          <code>icmp_reply</code>: Implemented by sending the packet to
+          <code>ovn-controller</code>, which generates the ICMP reply and sends
+          the packet back to <code>ovs-vswitchd</code>.
+        </li>
+      </ul>
+    </column>
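
The lowering of `with` described above can be sketched in a few lines of Python. This is only an illustration, not how `ovn-controller` is implemented: the helper name `lower_with` is hypothetical, and passing the field name to `stack_push` and `stack_pop` is an assumption (the text above writes them without arguments).

```python
def lower_with(field, value, actions):
    """Expand `with(field=value) { actions }` into a flat action list.

    Hypothetical helper for illustration only.  Giving stack_push and
    stack_pop the field as an argument is an assumption; the design text
    only names the actions.
    """
    return (
        [f"stack_push({field})", f"set({field}={value})"]  # save, then override
        + list(actions)                                    # body runs with the override
        + [f"stack_pop({field})"]                          # restore the saved value
    )


lowered = lower_with("reg0", "5", ["output(reg1)"])
print("; ".join(lowered) + ";")
```

The point of the sketch is the ordering: the original field value is saved before the `set`, and restored only after every nested action has run.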
+  </table>
+
+  <table name="Bindings" title="Physical-Logical Bindings">
+    <p>
+      Each row in this table identifies the physical location of a logical
+      port.  On each hypervisor, <code>ovn-controller</code> populates this
+      table with rows for the logical ports located on that hypervisor.
+      <code>ovn-controller</code> discovers those ports by monitoring the
+      local hypervisor's Open_vSwitch database, which identifies logical ports
+      via the conventions described in <code>IntegrationGuide.md</code>.
+    </p>
+
+    <p>
+      When a chassis shuts down gracefully, it should remove its bindings.
+      (This is not critical because resources hosted on the chassis are equally
+      unreachable regardless of whether their rows are present.)  To handle the
+      case where a VM is shut down abruptly on one chassis, then brought up
+      again on a different one, <code>ovn-controller</code> must delete any
+      existing <ref table="Bindings"/> record for a logical port when it adds a
+      new one.
+    </p>
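
The delete-before-add rule above can be sketched against an in-memory stand-in for the table. This is a minimal sketch under stated assumptions: rows are modeled as plain dictionaries, the function name `rebind` is hypothetical, and no real OVSDB API is used. A real implementation would presumably perform the delete and the insert in one database transaction.

```python
def rebind(bindings, logical_port, chassis, mac):
    """Return a new row list in which logical_port is bound only to chassis.

    Hypothetical in-memory model: each row is a dict with the three
    columns documented for this table.
    """
    # Drop any stale row for this port, e.g. one left behind by a chassis
    # where the VM was shut down abruptly.
    kept = [row for row in bindings if row["logical_port"] != logical_port]
    kept.append({"logical_port": logical_port, "chassis": chassis, "mac": list(mac)})
    return kept


# The VM first appears on hv1, then comes back up on hv2.
table = rebind([], "vm1-eth0", "hv1", ["00:11:22:33:44:55"])
table = rebind(table, "vm1-eth0", "hv2", ["00:11:22:33:44:55"])
print(table)
```

After the second call the table holds a single row for `vm1-eth0`, so the port never appears to be located on two chassis at once.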
+
+    <column name="logical_port">
+      A logical port, taken from <ref key="iface-id" table="Interface"
+      column="external_ids" db="Open_vSwitch"/> in the Open_vSwitch database's
+      <ref table="Interface" db="Open_vSwitch"/> table.  OVN does not prescribe
+      a particular format for the logical port ID.
+    </column>
+
+    <column name="chassis">
+      The physical location of the logical port.  To successfully identify a
+      chassis, this column must match the <ref table="Chassis" column="name"/>
+      column in some row in the <ref table="Chassis"/> table.
+    </column>
+
+    <column name="mac">
+      <p>
+        The Ethernet address or addresses used as a source address on the
+        logical port, each in the form
+        <var>xx</var>:<var>xx</var>:<var>xx</var>:<var>xx</var>:<var>xx</var>:<var>xx</var>.
+        The string <code>unknown</code> is also allowed to indicate that the
+        logical port has an unknown set of (additional) source addresses.
+      </p>
+
+      <p>
+        A VM interface would ordinarily have a single Ethernet address.  A
+        gateway port might initially only have <code>unknown</code>, and then
+        add MAC addresses to the set as it learns new source addresses.
+      </p>
+    </column>
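
As an illustration of the values this column accepts, the following sketch checks a single entry. It assumes that each <var>xx</var> is a pair of hexadecimal digits, which is the usual Ethernet notation but is not spelled out above; the function name `is_valid_mac_entry` is hypothetical.

```python
import re

# Six colon-separated pairs of hex digits (assumed meaning of xx:xx:xx:xx:xx:xx).
_MAC_RE = re.compile(r"[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}")


def is_valid_mac_entry(entry):
    """Accept either the literal string "unknown" or one Ethernet address."""
    return entry == "unknown" or _MAC_RE.fullmatch(entry) is not None


# A gateway port might start with only "unknown" and add learned addresses.
gateway_macs = {"unknown"}
gateway_macs.add("00:11:22:33:44:55")
print(all(is_valid_mac_entry(m) for m in gateway_macs))
```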
+  </table>
+</database>