cascardo/linux.git
Daniel Borkmann [Wed, 15 Jun 2016 20:47:12 +0000 (22:47 +0200)]
bpf, maps: add release callback

Add a release callback for maps that is invoked when the last
reference to the map's struct file is gone and the struct file is about
to be released by the VFS. The handler will be used by fd array maps.
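A minimal sketch of the idea, not the kernel's actual code: an ops table
with an optional release hook that fires only when the last file reference
is dropped. All names here (demo_map, demo_map_ops, demo_map_file_put) are
hypothetical stand-ins chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's map structures. */
struct demo_map;

struct demo_map_ops {
	/* Optional: called once when the last file reference goes away. */
	void (*release)(struct demo_map *map);
};

struct demo_map {
	const struct demo_map_ops *ops;	/* may be NULL */
	int file_refs;			/* references held on the backing file */
	int released;			/* set by the demo release handler */
};

/* Example handler, standing in for what an fd array map might do. */
static void demo_fd_array_release(struct demo_map *map)
{
	map->released = 1;
}

/* Drop one file reference; invoke the callback only on the last one. */
static void demo_map_file_put(struct demo_map *map)
{
	if (--map->file_refs > 0)
		return;
	if (map->ops && map->ops->release)
		map->ops->release(map);
}
```

Map types that do not set the hook are unaffected, which is why the
callback is checked for NULL before it is called.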

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 16 Jun 2016 05:26:27 +0000 (22:26 -0700)]
Merge branch 'sfc-rx-vlan-filtering'

Edward Cree says:

====================
sfc: RX VLAN filtering

Adds support for VLAN-qualified receive filters on EF10 hardware.
This is needed when running as a guest if the hypervisor has enabled
vfs-vlan-restrict, in which case the firmware rejects filters not qualified
with VLAN 0.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Rybchenko [Wed, 15 Jun 2016 16:52:08 +0000 (17:52 +0100)]
sfc: Fix VLAN filtering feature if vPort has VLAN_RESTRICT flag

If the vPort has the VLAN_RESTRICT flag, VLAN-tagged traffic will not be
delivered without corresponding Rx filters, which may be proxied to and
moderated by the hypervisor.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Wed, 15 Jun 2016 16:51:48 +0000 (17:51 +0100)]
sfc: Update MCDI protocol definitions

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Rybchenko [Wed, 15 Jun 2016 16:51:36 +0000 (17:51 +0100)]
sfc: Disable VLAN filtering by default if not strictly required

It should be done after net_dev->hw_features initialization, so that the
feature is kept there and can be enabled later using ethtool.

VLAN filtering is enforced and fixed if the vPort requires the use of VLAN
filters to receive tagged traffic.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Habets [Wed, 15 Jun 2016 16:51:07 +0000 (17:51 +0100)]
sfc: VLAN filters must only be created if the firmware supports this.

If it is not supported we simply disable the feature.

For the feature to work we need firmware filter support for
OUTER_VID + LOC_MAC and for OUTER_VID + LOC_MAC_IG.
The low-latency firmware can match on OUTER_VID + LOC_MAC but not on
OUTER_VID + LOC_MAC_IG.
For the capture packet firmware it is the other way around.
Only the full-feature variant can match on both combinations.
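
As an illustration only, the capability test described above could look
like the sketch below. The flag values and function names are hypothetical
and do not reflect the driver's MCDI definitions; the point is that the
feature is usable only when both OUTER_VID combinations are reported.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical match-flag bits, for illustration only. */
enum {
	DEMO_MATCH_LOC_MAC    = 1u << 0,
	DEMO_MATCH_LOC_MAC_IG = 1u << 1,
	DEMO_MATCH_OUTER_VID  = 1u << 2,
};

/* True if the firmware's list of supported matches contains `wanted`. */
static bool demo_fw_supports(const unsigned int *supported, size_t n,
			     unsigned int wanted)
{
	for (size_t i = 0; i < n; i++)
		if (supported[i] == wanted)
			return true;
	return false;
}

/* VLAN filtering needs both OUTER_VID combinations to be available. */
static bool demo_vlan_filtering_usable(const unsigned int *supported, size_t n)
{
	return demo_fw_supports(supported, n,
				DEMO_MATCH_OUTER_VID | DEMO_MATCH_LOC_MAC) &&
	       demo_fw_supports(supported, n,
				DEMO_MATCH_OUTER_VID | DEMO_MATCH_LOC_MAC_IG);
}
```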

Incorporates a fix by Andrew Rybchenko <Andrew.Rybchenko@oktetlabs.ru>
in the net_dev->[hw_]features handling.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Rybchenko [Wed, 15 Jun 2016 16:49:30 +0000 (17:49 +0100)]
sfc: Fix dup unknown multicast/unicast filters after datapath reset

Filter match flags are not a unique criterion for mapping to a priority,
because both unknown unicast and unknown multicast are mapped to
LOC_MAC_IG, so the local MAC is also required to map a filter to a priority.
MCDI filter flags are a unique criterion for finding the filter priority.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Wed, 15 Jun 2016 16:49:05 +0000 (17:49 +0100)]
sfc: Refactor checks for invalid filter ID

Nearly every time we call efx_ef10_filter_remove_unsafe, we first check
for EFX_EF10_FILTER_ID_INVALID, in which case we do nothing.  So move
that check into the function, simplifying all the call sites.

Also, change the return type to void, since none of the callers check it.
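
The shape of this refactoring, sketched with hypothetical names (the
sentinel value and the counter are assumptions made for the example, not
the driver's definitions):

```c
#include <stdint.h>

#define DEMO_FILTER_ID_INVALID 0xffffu	/* hypothetical sentinel value */

static int demo_removed_count;		/* stand-in for real removal work */

/*
 * The sentinel check lives in the callee, so call sites need no guard,
 * and the return type is void because no caller used the result.
 */
static void demo_filter_remove(uint16_t id)
{
	if (id == DEMO_FILTER_ID_INVALID)
		return;
	demo_removed_count++;
}
```

A call site then shrinks from "if (id != INVALID) remove(id);" to a plain
call.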

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Habets [Wed, 15 Jun 2016 16:48:49 +0000 (17:48 +0100)]
sfc: Take mac_lock before calling efx_ef10_filter_table_probe

When trying to enslave an SFC interface to a bond the following BUG_ON was
hit:

 kernel BUG [in ef10.c]!
 CPU: 0 PID: 4383 Comm: ifenslave Tainted: G
...
 Call Trace:
  efx_ef10_filter_add_vlan+0x121/0x180 [sfc]
  efx_ef10_filter_table_probe+0x2a2/0x4f0 [sfc]
  efx_ef10_set_mac_address+0x370/0x6d0 [sfc]
  efx_set_mac_address+0x7d/0x120 [sfc]
  dev_set_mac_address+0x43/0xa0
  bond_enslave+0x337/0xea0 [bonding]
This comes from function efx_ef10_filter_vlan_sync_rx_mode.

To solve the bug we ensure the mac_lock is taken before calling
efx_ef10_filter_add_vlan. But to avoid a priority inversion mac_lock must
be taken before filter_sem.
To satisfy these requirements we end up taking mac_lock in
efx_ef10_vport_set_mac_address, efx_ef10_set_mac_address,
efx_ef10_sriov_set_vf_vlan and efx_probe_filters.
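
The ordering rule can be modelled with a small bookkeeping sketch. This is
not real locking and not the driver's code; the ranks and names are
hypothetical. It only shows the discipline: mac_lock ranks before
filter_sem, so it may not be taken while filter_sem is already held.

```c
#include <stdbool.h>

/* Hypothetical ranks: lower ranks must be acquired first. */
enum demo_lock { DEMO_MAC_LOCK = 0, DEMO_FILTER_SEM = 1, DEMO_NLOCKS };

struct demo_ctx {
	bool held[DEMO_NLOCKS];
};

/* Returns false (and takes nothing) if the acquisition breaks the order. */
static bool demo_lock(struct demo_ctx *c, enum demo_lock l)
{
	for (int i = l + 1; i < DEMO_NLOCKS; i++)
		if (c->held[i])
			return false;	/* a later-ranked lock is already held */
	c->held[l] = true;
	return true;
}

static void demo_unlock(struct demo_ctx *c, enum demo_lock l)
{
	c->held[l] = false;
}
```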

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Rybchenko [Wed, 15 Jun 2016 16:48:32 +0000 (17:48 +0100)]
sfc: Implement ndo_vlan_rx_{add, kill}_vid() callbacks

Supports HW VLAN filtering, en/disabled using ethtool.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Rybchenko [Wed, 15 Jun 2016 16:48:14 +0000 (17:48 +0100)]
sfc: Implement list of VLANs added over interface

Right now it contains only a dummy VLAN entry with an unspecified VID.
The entry is used when HW VLAN filtering is not in use.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Rybchenko [Wed, 15 Jun 2016 16:47:36 +0000 (17:47 +0100)]
sfc: Make EF10 filter management helper functions VLAN-aware

This is a step towards supporting VLAN filtering in HW.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Rybchenko [Wed, 15 Jun 2016 16:45:56 +0000 (17:45 +0100)]
sfc: Store unicast and multicast promisc flag with address cache

These flags are built when the address cache is updated.
The information will be required when VLAN filtering is added and the
address cache is used without re-sync.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Rybchenko [Wed, 15 Jun 2016 16:45:36 +0000 (17:45 +0100)]
sfc: Move filter IDs to per-VLAN data structure

This is a step towards supporting VLAN filtering in HW.
Until then, there is only one struct efx_ef10_filter_vlan per struct
efx_ef10_filter_table, with no VLAN information yet.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Rybchenko [Wed, 15 Jun 2016 16:44:20 +0000 (17:44 +0100)]
sfc: Forget filter ID when the filter is marked old

This is needed in order to remove the setting of filter IDs to invalid
from the multicast and unicast address caching functions.
Add initialization to invalid when the filter table is created.
Add paranoid checks to track consistency.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Wed, 15 Jun 2016 16:43:43 +0000 (17:43 +0100)]
sfc: Assert filter_sem write locked when required

Based on a patch by Andrew Rybchenko <Andrew.Rybchenko@oktetlabs.ru>

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Rybchenko [Wed, 15 Jun 2016 16:43:20 +0000 (17:43 +0100)]
sfc: Add efx_nic member with fixed netdev features

This allows the set of fixed features to be changed on datapath reset.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Rybchenko [Wed, 15 Jun 2016 16:43:00 +0000 (17:43 +0100)]
sfc: Move last mc_promisc flag to EF10 filter table state

It is used for EF10 only and logically belongs to EF10 filter table state.
It is OK that it is reset to false on filter table recreation since all
filters are removed on destruction.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Rybchenko [Wed, 15 Jun 2016 16:42:26 +0000 (17:42 +0100)]
sfc: Define macro with EF10 offload feature

This simplifies adding features.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 16 Jun 2016 05:22:17 +0000 (22:22 -0700)]
Merge tag 'rxrpc-rewrite-20160615' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Rework endpoint record handling

Here's the next part of the AF_RXRPC rewrite.  In this set I rework
endpoint record handling.  There are two types of endpoint record, local
and peer.  The local endpoint record is used as an anchor for the transport
socket that AF_RXRPC uses (at the moment a UDP socket).  Local endpoints
can be shared between AF_RXRPC sockets under certain restricted
circumstances.

The peer endpoint is a record of the remote end.  It is (or will be) used
to keep track of MTU and RTT values and, with these changes, is used to find
the call(s) to abort when a network error occurs.

The following significant changes are made:

 (1) The local endpoint event handling code is split out into its own file.

 (2) The local endpoint list bottom half-excluding spinlock is removed as
     things are arranged such that sk_user_data will not change whilst the
     transport socket callbacks are in progress.

 (3) Local endpoints can now only be shared if they have the same transport
     address (as before) and have a local service ID of 0 (ie. they're not
     listening for incoming calls).  This prevents callbacks from a server
     to one process being picked up by another process.

 (4) Local endpoint destruction is now accomplished by the same work item
     that processes events, meaning that the destructor doesn't need to wait
     for the event processor.

 (5) Peer endpoints are now held in a hash table rather than a flat list.

 (6) Peer endpoints are now destroyed by RCU rather than by work item.

 (7) Peer endpoints are now differentiated by local endpoint and remote
     transport port in addition to remote transport address and transport
     type and family.

     This means that a firewall that excludes access between a particular
     local port and remote port won't cause calls to be aborted that use a
     different port pair.

 (8) Error report handling now no longer assumes that the source is always
     an IPv4 ICMP message from a UDP port and has assumptions that an ICMP
     message comes from an IPv4 socket removed.  At some point IPv6 support
     will be added.

 (9) Peer endpoints rather than local endpoints are now the anchor point
     for distributing network error reports.

(10) Both types of endpoint records are now disposed of as soon as all
     references to them are gone.  There is less hanging around and once
     their usage counts hit zero, records can no longer be resurrected.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 16 Jun 2016 04:44:33 +0000 (21:44 -0700)]
Merge branch 'liquidio-next'

Raghu Vatsavayi says:

====================
liquidio: Updates and Bug fixes

The following are updates for liquidio: bug fixes and driver
support for the new firmware interface. The updates are divided
into smaller logical patches, as you mentioned. This set of
nine patches should be applied in the following order, as some of
them depend on earlier patches in the list.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Raghu Vatsavayi [Tue, 14 Jun 2016 23:54:51 +0000 (16:54 -0700)]
liquidio: Introduce new octeon2/3 header

Add support for the new instruction header (ih) for octeon2/octeon3, with
the corresponding changes.

Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghu Vatsavayi [Tue, 14 Jun 2016 23:54:50 +0000 (16:54 -0700)]
liquidio: Replace ifidx for FW commands

This patch decouples the firmware-side ifidx from the host-side interface
number. It also makes a minor name change to a linkinfo struct field.

Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghu Vatsavayi [Tue, 14 Jun 2016 23:54:49 +0000 (16:54 -0700)]
liquidio: New driver FW command structure

This patch is for the new driver/firmware control command structure
(octnic_packet_params and octnic_cmd_setup) and the resultant code changes.

Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghu Vatsavayi [Tue, 14 Jun 2016 23:54:48 +0000 (16:54 -0700)]
liquidio: Consider PTP for packet size calculations

This patch refactors the packet size calculations to handle both 66xx and
68xx cards with PTP enabled and other cards that do not support PTP.

Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghu Vatsavayi [Tue, 14 Jun 2016 23:54:47 +0000 (16:54 -0700)]
liquidio: RX desc alloc changes

This patch adds page-based buffers for the receive-side descriptors of
the driver and separate free routines for rx and tx buffers.

Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghu Vatsavayi [Tue, 14 Jun 2016 23:54:46 +0000 (16:54 -0700)]
liquidio:RX queue alloc changes

This patch allocates the rx queue's memory based on NUMA node and also uses
page-based buffers for rx traffic improvements.

Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghu Vatsavayi [Tue, 14 Jun 2016 23:54:45 +0000 (16:54 -0700)]
liquidio:Scatter gather list per IQ

This patch allocates and manages scatter gather lists per
input queue (IQ) and removes the queues' interdependence.

Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghu Vatsavayi [Tue, 14 Jun 2016 23:54:44 +0000 (16:54 -0700)]
liquidio: Host queue mapping changes

This patch allocates the input queues based on NUMA node in the tx path
and changes queue mapping based on the mapping info provided by firmware.

Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Raghu Vatsavayi [Tue, 14 Jun 2016 23:54:43 +0000 (16:54 -0700)]
liquidio: Avoid double free during soft command

This patch resolves the double free issue by checking for the proper return
values from the soft command.
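
One common way such a double free arises, sketched under assumptions: the
names, return codes and ownership rule below are hypothetical and are not
taken from the liquidio driver. The caller frees the command only when the
return value says ownership was not handed over.

```c
#include <stdlib.h>

/* Hypothetical return codes. */
enum { DEMO_SEND_OK = 0, DEMO_SEND_FAILED = -1 };

static int demo_frees;			/* counts frees, for illustration */

struct demo_cmd { int payload; };

static void demo_cmd_free(struct demo_cmd *c)
{
	demo_frees++;
	free(c);
}

/*
 * On success the send path owns the command and frees it on completion
 * (done immediately here for brevity). On failure the caller still owns it.
 */
static int demo_send(struct demo_cmd *c, int fail)
{
	if (fail)
		return DEMO_SEND_FAILED;
	demo_cmd_free(c);
	return DEMO_SEND_OK;
}

static int demo_submit(int fail)
{
	struct demo_cmd *c = malloc(sizeof(*c));
	int ret;

	if (!c)
		return DEMO_SEND_FAILED;
	c->payload = 0;
	ret = demo_send(c, fail);
	if (ret == DEMO_SEND_FAILED)
		demo_cmd_free(c);	/* free only if ownership stayed here */
	return ret;
}
```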

Signed-off-by: Derek Chickles <derek.chickles@caviumnetworks.com>
Signed-off-by: Satanand Burla <satananda.burla@caviumnetworks.com>
Signed-off-by: Felix Manlunas <felix.manlunas@caviumnetworks.com>
Signed-off-by: Raghu Vatsavayi <raghu.vatsavayi@caviumnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 14 Jun 2016 23:29:15 +0000 (16:29 -0700)]
ila: Fix checksum neutral mapping

The algorithm for checksum neutral mapping is incorrect. This problem
was being hidden since we were previously always performing checksum
offload on the translated addresses and only with IPv6 HW csum.
Enabling an ILA router shows the issue.

Corrected algorithm:

old_loc is the original locator in the packet, new_loc is the value
to overwrite with and is found in the lookup table. old_flag is
the old flag value (zero or CSUM_NEUTRAL_FLAG) and new_flag is
then (old_flag ^ CSUM_NEUTRAL_FLAG) & CSUM_NEUTRAL_FLAG.

Need SUM(new_id + new_flag + diff) == SUM(old_id + old_flag) for
checksum neutral translation.

Solving for diff gives:

diff = (old_id - new_id) + (old_flag - new_flag)

compute_csum_diff8(new_id, old_id) gives old_id - new_id

If old_flag is set
   old_flag - new_flag = old_flag = CSUM_NEUTRAL_FLAG
Else
   old_flag - new_flag = -new_flag = ~CSUM_NEUTRAL_FLAG
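
A worked sketch of the arithmetic above, using 16-bit one's-complement
sums. The address layout (locator in words 0..3, flag bit in word 4,
adjusted word 7), the flag value and all demo_* names are assumptions made
for the example and are not the ILA code. Subtraction is addition of the
complement, and the diff is folded into one word so the sum over the whole
address stays the same.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NEUTRAL_FLAG 0x1000u	/* hypothetical flag bit in word 4 */

/* One's-complement add with end-around carry. */
static uint16_t demo_add16(uint16_t a, uint16_t b)
{
	uint32_t s = (uint32_t)a + b;

	return (uint16_t)((s & 0xffffu) + (s >> 16));
}

/* One's-complement sum of n 16-bit words. */
static uint16_t demo_sum16(const uint16_t *w, size_t n)
{
	uint16_t s = 0;

	for (size_t i = 0; i < n; i++)
		s = demo_add16(s, w[i]);
	return s;
}

/* 0x0000 and 0xffff both represent zero in one's complement. */
static uint16_t demo_norm(uint16_t s)
{
	return s == 0xffffu ? 0 : s;
}

/* Overwrite the locator, toggle the flag, fold the diff into word 7. */
static void demo_translate(uint16_t addr[8], const uint16_t new_loc[4])
{
	/* old locator minus new locator */
	uint16_t diff = demo_add16(demo_sum16(addr, 4),
				   (uint16_t)~demo_sum16(new_loc, 4));

	if (addr[4] & DEMO_NEUTRAL_FLAG)	/* old_flag set, new_flag clear */
		diff = demo_add16(diff, DEMO_NEUTRAL_FLAG);
	else					/* old_flag clear, new_flag set */
		diff = demo_add16(diff, (uint16_t)~DEMO_NEUTRAL_FLAG);

	for (int i = 0; i < 4; i++)
		addr[i] = new_loc[i];
	addr[4] ^= DEMO_NEUTRAL_FLAG;
	addr[7] = demo_add16(addr[7], diff);
}
```

Comparing demo_norm(demo_sum16(addr, 8)) before and after a call to
demo_translate() is the same kind of check as the user space test
described in this message.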

Tested:
  - Implemented a user space program that creates random addresses
    and random locators to overwrite. Compares the checksum over
    the address before and after translation (must always be equal)
  - Enabled ILA router and showed proper operation.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Philip Prindeville [Tue, 14 Jun 2016 21:53:02 +0000 (15:53 -0600)]
net: ipv4: Add ability to have GRE ignore DF bit in IPv4 payloads

    In the presence of firewalls which improperly block ICMP Unreachable
    (including Fragmentation Required) messages, Path MTU Discovery is
    prevented from working.

    A workaround is to handle IPv4 payloads opaquely, ignoring the DF bit--as
    is done for other payloads like AppleTalk--and doing transparent
    fragmentation and reassembly.

    Redux includes the enforcement of mutual exclusion between this feature
    and Path MTU Discovery as suggested by Alexander Duyck.

Cc: Alexander Duyck <alexander.duyck@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Philip Prindeville <philipp@redfish-solutions.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 14 Jun 2016 18:37:21 +0000 (11:37 -0700)]
net: vrf: Switch dst dev to loopback on device delete

Attempting to delete a VRF device with a socket bound to it can stall:

  unregister_netdevice: waiting for red to become free. Usage count = 1

The unregister is waiting for the dst to be released and with it
references to the vrf device. Similar to dst_ifdown, switch the dst
dev to loopback on delete for all of the dst's for the vrf device
and release the references to the vrf device.

Fixes: 193125dbd8eb2 ("net: Introduce VRF device driver")
Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 16 Jun 2016 04:37:10 +0000 (21:37 -0700)]
Merge tag 'shared' of git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma

Mellanox shared code between RDMA and net-next trees

This is Mellanox mlx5_core shared code for both net-next and RDMA
trees for 4.8 kernel cycle.

Arnd Bergmann [Tue, 14 Jun 2016 10:03:17 +0000 (12:03 +0200)]
mdio: mux: avoid 'maybe-uninitialized' warning

The latest changes to the MDIO code introduced a false-positive
warning with gcc-6 (possibly others):

drivers/net/phy/mdio-mux.c: In function 'mdio_mux_init':
drivers/net/phy/mdio-mux.c:188:3: error: 'parent_bus_node' may be used uninitialized in this function [-Werror=maybe-uninitialized]

It's easy to avoid the warning by making sure the parent_bus_node
is initialized in both cases at the start of the function, since
the later 'of_node_put()' call is also valid for a NULL pointer
argument.
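
The pattern, sketched with hypothetical stand-ins (demo_node and
demo_node_put are not the OF API; they only mimic a put function that
accepts NULL, as described for of_node_put() above):

```c
#include <stddef.h>

struct demo_node { int refs; };

/* Mimics a put function for which a NULL argument is a no-op. */
static void demo_node_put(struct demo_node *n)
{
	if (n)
		n->refs--;
}

/* Hypothetical init: only one of the two branches takes a node. */
static int demo_init(struct demo_node *maybe_parent, int use_parent)
{
	struct demo_node *parent_node = NULL;	/* set for both branches */

	if (use_parent) {
		parent_node = maybe_parent;
		parent_node->refs++;
	}

	/* ... setup work would go here ... */

	demo_node_put(parent_node);	/* safe even if no node was taken */
	return 0;
}
```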

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: f20e6657a875 ("mdio: mux: Enhanced MDIO mux framework for integrated multiplexers")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 16 Jun 2016 03:41:24 +0000 (20:41 -0700)]
Merge branch '6lowpan-ndisc'

Alexander Aring says:

====================
6lowpan: introduce 6lowpan-nd

David, can you please pick up this patch series for your net-next tree?
Thanks in advance.

This patch series introduces the ndisc ops callback structure to add different
handling for IPv6 neighbour discovery cache functionality. As a first step, it
implements the following two use-cases:

 - 6CO handling as userspace option (For all 6LoWPAN layers, BTLE/802.15.4) [0]
 - short address handling for 802.15.4 6LoWPAN only [1]

Since my last patch series, I completely changed the whole ndisc_ops callback
structure so that it no longer replaces the whole ndisc functionality at the
recv/send level of NS/NA/RS/RA, as it did in my previous patch series
"6lowpan: introduce basic 6lowpan-nd". It now adds different handling at a
very low level of the ndisc functionality.

The ndisc_ops no longer have to be registered in dev->ndisc_ops; if they are
not set, then no additional ipv6 ndisc handling will be done.

This patch series now introduces complete handling of short addresses for
802.15.4 6LoWPAN when sending/receiving NA/NS/RS and RA. In case of RA
(receive only) and PIO we also need a second prefix + short-address based
address.

This callback structure can be used later (I hope) for RFC 6775 [0]. This RFC
defines some new option fields and messages for 6LoWPAN-ND. This patch series
does not implement RFC 6775 (except we decide now to handle 6CO in userspace).

Additionally, we can use the current ops for parse/fill ndisc options for
kernel-handled ndisc messages to add 6CIO, see [2].

I tested RA/NS/NA/RS messages with short addresses, which seems to work. What I
didn't test is redirect messages, since I don't know how to generate them.
The short address for redirect messages is also a special case here, because
the short address for an L3 target address, looked up in the neighbour cache,
needs to be added.

btw:
According to [3], sending redirect messages should also be disabled by default
on 6lowpan interfaces, but can be activated afterwards. This is maybe
something for the ipv6_devconf structure. There is an "accept_redirects" but
no "disable_redirects".

- Alex

[0] https://tools.ietf.org/html/rfc6775
[1] https://tools.ietf.org/html/rfc4944#section-8
[2] https://tools.ietf.org/html/rfc7400

changes since v3:
 - add acked-by and reviewed-by tags
 - fix url references in cover-letter
 - add cover-letter that this patch series is okay to go through net-next tree

changes since RFC:
 - add low-level functions __ndisc_opt_addr_space,
   __ndisc_opt_addr_data and __ndisc_fill_addr_option for the corresponding
   functions; the low-level variants don't require a net_device argument.
 - move ndisc_ops e.g. ndisc_ops_fill_addr_option function call into the
   corresponding device argument function ndisc_fill_addr_option.
   (Introduced a special static inline function for redirect handling).
 - fix error handling in addrconf_prefix_rcv_add_addr.
   (Please note: this introduces new API handling, in that a failed second
    address registration (in case of 802.15.4 6LoWPAN) will still be
    notified, because the dev->addr one was successful.)
 - add ieee802154 sub-directory in short address entry for 6lowpan UAPI.
 - add lowpan_802154_is_valid_src_short_addr, because 802.15.4 6lowpan
   defines the first bit as multicast (I don't know how this can work
   in the end, because some hardware will handle such addresses
   in L2 as unicast). See:
   https://www.iana.org/assignments/_6lowpan-parameters/_6lowpan-parameters.xhtml#_6lowpan-parameters-2

changes since v2:
 - Introduce ndisc_ops to have our own implementation for dealing with NS/NA
   which allows also to support RFC6775 (e.g. ARO).
 - add handling of 6CO as a userspace option for RA messages in
   case of 6LoWPAN interfaces.
 - change lowpan_is_ll to check on linklayer type only.
 - added some reviewed-by's.
 - move short addr slaac to net/6lowpan instead of ipv6 handling.
 - add handling for context based address compression in case for
   short address as link-layer address.
 - change strategy to use short address, a short address will always be used
   when it's available.
 - Handle override flag in NA messages to update short address information or
   not.
====================

Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Aring [Wed, 15 Jun 2016 19:20:27 +0000 (21:20 +0200)]
6lowpan: add support for 802.15.4 short addr handling

This patch adds the necessary handling to use the short address for
802.15.4 6lowpan. It contains support for IPHC address compression
and a new matching algorithm to decide which link layer address will be
used for the 802.15.4 frame.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Aring [Wed, 15 Jun 2016 19:20:26 +0000 (21:20 +0200)]
6lowpan: add support for getting short address

In case of sending RA messages we need some way to get the short address
from an 802.15.4 6LoWPAN interface. This patch will add a temporary
debugfs entry for experimental userspace api.

Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Aring [Wed, 15 Jun 2016 19:20:25 +0000 (21:20 +0200)]
6lowpan: introduce 6lowpan-nd

This patch introduces different 6lowpan handling for receiving and
transmitting NS/NA messages for the ipv6 neighbour discovery. The first
use-case is supporting 802.15.4 short addresses inside the option fields and
handling the RFC6775 6CO option field as a userspace option.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Aring [Wed, 15 Jun 2016 19:20:24 +0000 (21:20 +0200)]
ipv6: export several functions

This patch exports some neighbour discovery functions which can then be used
by the 6lowpan neighbour discovery ops functionality.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Aring [Wed, 15 Jun 2016 19:20:23 +0000 (21:20 +0200)]
ipv6: introduce neighbour discovery ops

This patch introduces the neighbour discovery ops callback structure. The
idea is to separate the handling for 6LoWPAN into the 6lowpan module.

These callbacks offer 6lowpan different handling, such as 802.15.4 short
address handling or RFC6775 (Neighbor Discovery Optimization for IPv6
over 6LoWPANs).
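
A rough sketch of such a callback structure, with hypothetical names that
are not the kernel's ndisc_ops members. A low-level helper works on
explicit arguments only; the device-aware wrapper calls it and then
consults an optional per-device hook. Neighbour discovery options are
sized in units of 8 octets including a 2-byte type/length header, which
the helper's rounding reflects.

```c
#include <stddef.h>

struct demo_dev;

/* Hypothetical ops table; the member name is illustrative only. */
struct demo_ndisc_ops {
	/* Extra option bytes a link layer wants, e.g. for a short address. */
	size_t (*opt_addr_space)(const struct demo_dev *dev);
};

struct demo_dev {
	size_t addr_len;
	const struct demo_ndisc_ops *ndisc_ops;	/* may be NULL */
};

/* Low-level helper: depends only on its arguments, not on a device. */
static size_t demo_opt_space(size_t addr_len)
{
	/* 2-byte option header plus address, rounded up to 8-byte units */
	return (addr_len + 2 + 7) & ~(size_t)7;
}

/* Device-aware wrapper: default handling plus the optional hook. */
static size_t demo_dev_opt_space(const struct demo_dev *dev)
{
	size_t space = demo_opt_space(dev->addr_len);

	if (dev->ndisc_ops && dev->ndisc_ops->opt_addr_space)
		space += dev->ndisc_ops->opt_addr_space(dev);
	return space;
}

/* Example hook: room for a second option carrying a 2-byte short address. */
static size_t demo_short_addr_space(const struct demo_dev *dev)
{
	(void)dev;
	return demo_opt_space(2);
}
```

A device that leaves ndisc_ops unset gets only the default handling.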

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Aring [Wed, 15 Jun 2016 19:20:22 +0000 (21:20 +0200)]
addrconf: put prefix address add in an own function

This patch moves the functionality to add a RA PIO prefix generated
address into its own function. This move prepares for adding a hook that
adds a second address for a second link-layer address, e.g. the short
address for 802.15.4 6LoWPAN.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Aring [Wed, 15 Jun 2016 19:20:21 +0000 (21:20 +0200)]
ndisc: add __ndisc_fill_addr_option function

This patch adds __ndisc_fill_addr_option as a low-level function for
ndisc_fill_addr_option which doesn't depend on a net_device parameter.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agondisc: add __ndisc_opt_addr_data function
Alexander Aring [Wed, 15 Jun 2016 19:20:20 +0000 (21:20 +0200)]
ndisc: add __ndisc_opt_addr_data function

This patch adds __ndisc_opt_addr_data as a low-level function for
ndisc_opt_addr_data which doesn't depend on a net_device parameter.

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agondisc: add __ndisc_opt_addr_space function
Alexander Aring [Wed, 15 Jun 2016 19:20:19 +0000 (21:20 +0200)]
ndisc: add __ndisc_opt_addr_space function

This patch adds __ndisc_opt_addr_space as a low-level function for
ndisc_opt_addr_space which doesn't depend on a net_device parameter.
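
For illustration only, the size calculation that such a helper performs
can be sketched in userspace C. The name lladdr_opt_space is hypothetical
and is not the kernel function; the sketch only assumes the RFC 4861 rule
that an option carries a 2-byte type/length header and is padded to a
multiple of 8 octets.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not the kernel API): size in bytes of a
 * link-layer address option that carries addr_len address bytes plus
 * pad bytes of padding. Two bytes of type/length header are added and
 * the total is rounded up to the next multiple of 8.
 */
static size_t lladdr_opt_space(size_t addr_len, size_t pad)
{
	return (addr_len + pad + 2 + 7) & ~(size_t)7;
}
```

Because only the address length is passed in, the same calculation works
for a 6-byte Ethernet address, an 8-byte 802.15.4 extended address or a
2-byte short address without looking at a net_device.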

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago6lowpan: remove ipv6 module request
Alexander Aring [Wed, 15 Jun 2016 19:20:18 +0000 (21:20 +0200)]
6lowpan: remove ipv6 module request

Since we use exported functions from the ipv6 kernel module, we don't need
to request the module anymore to have ipv6 functionality.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago6lowpan: add 802.15.4 short addr slaac
Alexander Aring [Wed, 15 Jun 2016 19:20:17 +0000 (21:20 +0200)]
6lowpan: add 802.15.4 short addr slaac

This patch adds address autoconfiguration when a valid 802.15.4 short
address is available for 802.15.4 6LoWPAN interfaces.
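
As a rough sketch of what a short-address based interface identifier
looks like, the following userspace C follows the construction described
in RFC 4944, section 6. The function name, the host-byte-order arguments
and the validity check are assumptions of this example and are not taken
from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the RFC 4944 interface identifier for a 16-bit short
 * address: PAN ID, 16 zero bits and the short address form a pseudo
 * 48-bit address, into which ff:fe is inserted as for Ethernet.
 * Returns false when the short address is not usable as a source.
 */
static bool short_addr_to_iid(uint16_t pan_id, uint16_t short_addr,
			      uint8_t iid[8])
{
	/* Assumption: 0xffff (broadcast) and 0xfffe (no short address
	 * assigned) are treated as invalid here. */
	if (short_addr == 0xffff || short_addr == 0xfffe)
		return false;

	iid[0] = (uint8_t)((pan_id >> 8) & 0xfd);	/* U/L bit cleared */
	iid[1] = (uint8_t)(pan_id & 0xff);
	iid[2] = 0x00;
	iid[3] = 0xff;
	iid[4] = 0xfe;
	iid[5] = 0x00;
	iid[6] = (uint8_t)(short_addr >> 8);
	iid[7] = (uint8_t)(short_addr & 0xff);
	return true;
}
```

The resulting 8 bytes would be combined with a 64-bit prefix to form the
autoconfigured address.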

Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years ago6lowpan: add private neighbour data
Alexander Aring [Wed, 15 Jun 2016 19:20:16 +0000 (21:20 +0200)]
6lowpan: add private neighbour data

This patch introduces 6lowpan neighbour private data. Like the
interface private data we handle private data for generic 6lowpan and
for link-layer specific 6lowpan.

The current first use case is to save the short address for a 802.15.4
6lowpan neighbour.

Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Alexander Aring <aar@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'cxgb4-sriov-sysfs'
David S. Miller [Wed, 15 Jun 2016 21:46:05 +0000 (14:46 -0700)]
Merge branch 'cxgb4-sriov-sysfs'

Hariprasad Shenai says:

====================
Add SRIOV configuration via sysfs and few fixes

This series adds support for configuring SR-IOV via the PCI sysfs interface
and reduces resource allocation in the kdump kernel by disabling offload. It
also synchronizes unicast and multicast MAC addresses, even if the interface
is in promiscuous mode.

This patch series has been created against net-next tree and includes
patches on cxgb4 and cxgb4vf driver.

We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agocxgb4/cxgb4vf: Synchronize all MAC addresses
Hariprasad Shenai [Tue, 14 Jun 2016 09:09:32 +0000 (14:39 +0530)]
cxgb4/cxgb4vf: Synchronize all MAC addresses

Even if the interface is in promiscuous or allmulti mode, synchronize
MAC addresses.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agocxgb4: Enable SR-IOV configuration via PCI sysfs interface
Hariprasad Shenai [Tue, 14 Jun 2016 09:09:31 +0000 (14:39 +0530)]
cxgb4: Enable SR-IOV configuration via PCI sysfs interface

Implement callback in the driver for the new PCI bus driver
interface that allows the user to enable/disable SR-IOV
virtual functions in a device via the sysfs interface.

Deprecate the module parameter used to configure SR-IOV.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agocxgb4: Force cxgb4 driver as MASTER in kdump kernel
Hariprasad Shenai [Tue, 14 Jun 2016 09:09:30 +0000 (14:39 +0530)]
cxgb4: Force cxgb4 driver as MASTER in kdump kernel

When is_kdump_kernel() is true, force the cxgb4 driver as Master so we can
reinitialize the Firmware/Chip. Also reduce memory usage by disabling
offload.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'sched_skb_free_defer'
David S. Miller [Wed, 15 Jun 2016 21:08:36 +0000 (14:08 -0700)]
Merge branch 'sched_skb_free_defer'

Eric Dumazet says:

====================
net_sched: defer skb freeing while changing qdiscs

qdiscs/classes are changed under RTNL protection and often
while blocking BH and root qdisc spinlock.

When lots of skbs need to be dropped, we free
them under these locks causing TX/RX freezes,
and more generally latency spikes.

I saw spikes of 50+ ms on quite fast hardware...

This patch series adds a simple queue protected by RTNL
where skbs can be placed until RTNL is released.

Note that this might also serve in the future for optional
reinjection of packets when a qdisc is replaced.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: sch_fq: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:59 +0000 (20:21 -0700)]
net_sched: sch_fq: defer skb freeing

sfq_reset() can use rtnl_kfree_skbs() instead of kfree_skb()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: sch_pie: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:58 +0000 (20:21 -0700)]
net_sched: sch_pie: defer skb freeing

pie_change() can use rtnl_qdisc_drop() to benefit from
deferred freeing.

pie_reset() is already using qdisc_reset_queue()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: sch_netem: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:57 +0000 (20:21 -0700)]
net_sched: sch_netem: defer skb freeing

rtnl_kfree_skbs() can be used in tfifo_reset()

It would be nice if we could iterate through rb tree instead
of removing one skb at a time, and build a single skb chain.
But this is left for a future patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: sch_htb: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:56 +0000 (20:21 -0700)]
net_sched: sch_htb: defer skb freeing

Both htb_reset() and htb_destroy() can use __qdisc_reset_queue()
instead of __skb_queue_purge() to defer skb freeing of internal
queues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: sch_hhf: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:55 +0000 (20:21 -0700)]
net_sched: sch_hhf: defer skb freeing

Both hhf_reset() and hhf_change() can use rtnl_kfree_skbs()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: fq_codel: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:54 +0000 (20:21 -0700)]
net_sched: fq_codel: defer skb freeing

Both fq_codel_change() and fq_codel_reset() can use rtnl_kfree_skbs()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: sch_fq: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:53 +0000 (20:21 -0700)]
net_sched: sch_fq: defer skb freeing

Both fq_change() and fq_reset() can use rtnl_kfree_skbs()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: sch_codel: defer skb freeing in codel_change()
Eric Dumazet [Tue, 14 Jun 2016 03:21:52 +0000 (20:21 -0700)]
net_sched: sch_codel: defer skb freeing in codel_change()

codel_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.

codel_reset() already has support for deferred skb freeing
because it uses qdisc_reset_queue()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: sch_choke: defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:51 +0000 (20:21 -0700)]
net_sched: sch_choke: defer skb freeing

choke_reset() and choke_change() can use rtnl_qdisc_drop()
to defer expensive skb freeing after locks are released.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: add the ability to defer skb freeing
Eric Dumazet [Tue, 14 Jun 2016 03:21:50 +0000 (20:21 -0700)]
net_sched: add the ability to defer skb freeing

qdiscs are changed under RTNL protection and often
while blocking BH and root qdisc spinlock.

When lots of skbs need to be dropped, we free
them under these locks causing TX/RX freezes,
and more generally latency spikes.

This commit adds rtnl_kfree_skbs(), used to queue
skbs for deferred freeing.

Actual freeing happens right after RTNL is released,
with appropriate scheduling points.

rtnl_qdisc_drop() can also be used in place
of qdisc_drop() when RTNL is held.

qdisc_reset_queue() and __qdisc_reset_queue() get
the new behavior, so standard qdiscs like pfifo, pfifo_fast...
have their ->reset() method automatically handled.
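
The general pattern can be illustrated with a small userspace sketch.
It is not the kernel implementation: struct pkt, defer_free() and
flush_deferred() are hypothetical names, and the lock itself is left
out. The point is only that queuing under the lock is O(1), while the
expensive freeing happens after the lock has been dropped.

```c
#include <stdlib.h>

struct pkt {
	struct pkt *next;
};

struct defer_list {
	struct pkt *head;
	struct pkt *tail;
};

/* Called while the (hypothetical) lock is held: constant time, no free(). */
static void defer_free(struct defer_list *dl, struct pkt *p)
{
	p->next = NULL;
	if (dl->tail)
		dl->tail->next = p;
	else
		dl->head = p;
	dl->tail = p;
}

/* Called after the lock is released: detach the list, then free it.
 * Returns the number of packets freed. */
static unsigned int flush_deferred(struct defer_list *dl)
{
	struct pkt *p = dl->head;
	unsigned int freed = 0;

	dl->head = NULL;
	dl->tail = NULL;
	while (p) {
		struct pkt *next = p->next;

		free(p);
		p = next;
		freed++;
	}
	return freed;
}
```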

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotipc: add neighbor monitoring framework
Jon Paul Maloy [Tue, 14 Jun 2016 00:46:22 +0000 (20:46 -0400)]
tipc: add neighbor monitoring framework

TIPC based clusters are by default set up with full-mesh link
connectivity between all nodes. Those links are expected to provide
a short failure detection time, by default set to 1500 ms. Because
of this, the background load for neighbor monitoring in an N-node
cluster increases with a factor N on each node, while the overall
monitoring traffic through the network infrastructure increases at
a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
scale well beyond ~100 nodes unless we significantly increase failure
discovery tolerance.

This commit introduces a framework and an algorithm that drastically
reduces this background load, while basically maintaining the original
failure detection times across the whole cluster. Using this algorithm,
background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
now have to actively monitor 38 neighbors in a 400-node cluster, instead
of as before 399.

This "Overlapping Ring Supervision Algorithm" is completely distributed
and employs no centralized or coordinated state. It goes as follows:

- Each node makes up a linearly ascending, circular list of all its N
  known neighbors, based on their TIPC node identity. This algorithm
  must be the same on all nodes.

- The node then selects the next M = sqrt(N) - 1 nodes downstream from
  itself in the list, and chooses to actively monitor those. This is
  called its "local monitoring domain".

- It creates a domain record describing the monitoring domain, and
  piggy-backs this in the data area of all neighbor monitoring messages
  (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
  the cluster eventually (default within 400 ms) will learn about
  its monitoring domain.

- Whenever a node discovers a change in its local domain, e.g., a node
  has been added or has gone down, it creates and sends out a new
  version of its node record to inform all neighbors about the change.

- A node receiving a domain record from anybody outside its local domain
  matches this against its own list (which may not look the same), and
  chooses to not actively monitor those members of the received domain
  record that are also present in its own list. Instead, it relies on
  indications from the direct monitoring nodes if an indirectly
  monitored node has gone up or down. If a node is indicated lost, the
  receiving node temporarily activates its own direct monitoring towards
  that node in order to confirm, or not, that it is actually gone.

- Since each node is actively monitoring sqrt(N) downstream neighbors,
  each node is also actively monitored by the same number of upstream
  neighbors. This means that all non-direct monitoring nodes normally
  will receive sqrt(N) indications that a node is gone.

- A major drawback with ring monitoring is how it handles failures that
  cause massive network partitionings. If both a lost node and all its
  direct monitoring neighbors are inside the lost partition, the nodes in
  the remaining partition will never receive indications about the loss.
  To overcome this, each node also chooses to actively monitor some
  nodes outside its local domain. Those nodes are called remote domain
  "heads", and are selected in such a way that no node in the cluster
  will be more than two direct monitoring hops away. Because of this,
  each node, apart from monitoring the members of its local domain, will
  also typically monitor sqrt(N) remote head nodes.

- As an optimization, local list status, domain status and domain
  records are marked with a generation number. This saves senders from
  unnecessarily conveying unaltered domain records, and receivers from
  performing unneeded re-adaptations of their node monitoring list, such
  as re-assigning domain heads.

- As a measure of caution we have added the possibility to disable the
  new algorithm through configuration. We do this by keeping a threshold
  value for the cluster size; a cluster that grows beyond this value
  will switch from full-mesh to ring monitoring, and vice versa when
  it shrinks below the value. This means that if the threshold is set to
  a value larger than any anticipated cluster size (default size is 32)
  the new algorithm is effectively disabled. A patch set for altering the
  threshold value and for listing the table contents will follow shortly.

- This change is fully backwards compatible.
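
The selection of the local monitoring domain described above can be
sketched in userspace C as follows. This is an illustration under stated
assumptions, not the TIPC code: the function names are hypothetical, the
array is assumed to hold all cluster members including the local node in
ascending identity order, and rounding sqrt(N) down is a choice made for
this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Integer square root (rounded down), to avoid a libm dependency. */
static unsigned int isqrt(unsigned int n)
{
	unsigned int r = 0;

	while ((unsigned long long)(r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* M = sqrt(N) - 1, never negative. */
static unsigned int local_domain_size(unsigned int n_members)
{
	unsigned int s = isqrt(n_members);

	return s > 0 ? s - 1 : 0;
}

/*
 * ids:      all members, sorted ascending by node identity
 * self_idx: position of the local node in ids
 * out:      receives the next M members downstream, wrapping around
 * Returns the number of entries written to out.
 */
static size_t select_local_domain(const uint32_t *ids, size_t n,
				  size_t self_idx, uint32_t *out)
{
	size_t m = local_domain_size((unsigned int)n);
	size_t i;

	for (i = 0; i < m; i++)
		out[i] = ids[(self_idx + 1 + i) % n];
	return i;
}
```

With N = 400 this gives M = 19, which is consistent with the figure of 38
monitored neighbors quoted above if a similar number of remote domain
heads is added on top of the local domain.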

Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: vrf: Update flags and features settings
David Ahern [Tue, 14 Jun 2016 00:14:12 +0000 (17:14 -0700)]
net: vrf: Update flags and features settings

1. Default VRF devices to not having a qdisc (IFF_NO_QUEUE). Users
   can add one as desired.

2. Disable adding a VLAN to a VRF device.

3. Enable offloads and hardware features similar to other logical
   devices (e.g., dummy, veth)

Change provides a significant boost in TCP stream Tx performance,
from ~2,700 Mbps to ~18,100 Mbps and makes throughput close to the
performance without a VRF (18,500 Mbps). netperf TCP_STREAM benchmark
using qemu with virtio+vhost for the NICs

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agotun: fix csum generation for tap devices
Paolo Abeni [Mon, 13 Jun 2016 22:00:04 +0000 (00:00 +0200)]
tun: fix csum generation for tap devices

The commit 34166093639b ("tuntap: use common code for virtio_net_hdr
and skb GSO conversion") replaced the tun code for header manipulation
with the generic helpers. While doing so, it implicitly moved the
skb_partial_csum_set() invocation after eth_type_trans(), which
invalidates the current gso start/offset values.
Fix it by moving the helper invocation before the mac pulling.

Fixes: 34166093639 ("tuntap: use common code for virtio_net_hdr and skb GSO conversion")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'skb_array'
David S. Miller [Wed, 15 Jun 2016 20:58:34 +0000 (13:58 -0700)]
Merge branch 'skb_array'

Michael S. Tsirkin says:

====================
skb_array: array based FIFO for skbs

This is in response to the proposal by Jason to make tun
rx packet queue lockless using a circular buffer.
My testing seems to show that at least for the common usecase
in networking, which isn't lockless, circular buffer
with indices does not perform that well, because
each index access causes a cache line to bounce between
CPUs, and index access causes stalls due to the dependency.

By comparison, an array of pointers where NULL means invalid
and !NULL means valid, can be updated without messing up barriers
at all and does not have this issue.
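
To make the idea concrete, here is a minimal single-threaded sketch of
such an array in userspace C. It is not the ptr_ring code from this
series: the names, the fixed RING_SIZE and the absence of any locking or
compiler barriers are simplifications made for this example.

```c
#include <stddef.h>

#define RING_SIZE 4	/* illustrative; a real user picks the queue size */

struct ring {
	void *queue[RING_SIZE];	/* NULL = empty slot, non-NULL = valid entry */
	size_t producer;	/* only advanced by the producer */
	size_t consumer;	/* only advanced by the consumer */
};

/* Returns 0 on success, -1 if the ring is full or ptr is NULL. */
static int ring_produce(struct ring *r, void *ptr)
{
	if (!ptr || r->queue[r->producer])
		return -1;
	r->queue[r->producer] = ptr;
	r->producer = (r->producer + 1) % RING_SIZE;
	return 0;
}

/* Returns the oldest entry, or NULL if the ring is empty. */
static void *ring_consume(struct ring *r)
{
	void *ptr = r->queue[r->consumer];

	if (ptr) {
		r->queue[r->consumer] = NULL;
		r->consumer = (r->consumer + 1) % RING_SIZE;
	}
	return ptr;
}
```

Note that the producer never reads the consumer index and vice versa;
fullness and emptiness are both decided by looking at the slot itself.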

On the flip side, cache pressure may be caused by using large queues.
tun has a queue of 1000 entries by default and that's 8K.
At this point I'm not sure this can be solved efficiently.
The correct solution might be sizing the queues appropriately.

Here's an implementation of this idea: it can be used more
or less whenever sk_buff_head can be used, except you need
to know the queue size in advance.

As this might be useful outside of networking, I implemented
a generic array of void pointers, with a type-safe wrapper for skbs.

It remains to be seen whether resizing is required. In case it is,
I included patches implementing resizing by holding both the
consumer and the producer locks.

I think this code works fine without any extra memory barriers since we
always read and write the same location, so the accesses can not be
reordered.
Multiple writes of the same value into memory would mess things up
for us, I don't think compilers would do it though.
But if people feel it's better to be safe wrt compiler optimizations,
specifying queue as volatile would probably do it in a cleaner way
than converting all accesses to READ_ONCE/WRITE_ONCE. Thoughts?

The only issue is with calls within a loop using the __ptr_ring_XXX
accessors - in theory compiler could hoist accesses out of the loop.

Following volatile-considered-harmful.txt I merely
documented that callers that busy-poll should invoke cpu_relax().
Most people will use the external skb_array_XXX APIs with a spinlock,
so this should not be an issue for them.

Eric Dumazet suggested adding an extra pointer to skb for when
we have a single outstanding packet. I could not figure out
a way to implement this without a shared consumer/producer lock
though, which would cause cache line bounces by itself.

Jesper, Jason, I know that both of you tested this,
please post Tested-by tags for whatever was tested.

changes since v7
fix typos noticed by Jesper Brouer

changes since v6
resize implemented. peek/full calls are no longer lockless

replaced _FIELD macros with _CALL which invoke a function
on the pointer rather than just returning a value

destroy now scans the array and frees all queued skbs

changes since v5
implemented a generic ptr_ring api, and
made skb_array a type-safe wrapper
apis for taking the spinlock in different contexts
following expected usecase in tun

changes since v4 (v3 was never posted)
documentation
dropped SKB_ARRAY_MIN_SIZE heuristic
unit test (in userspace, included as patch 2)

changes since v2:
        fixed integer overflow pointed out by Eric.
        added some comments.

changes since v1:
        fixed bug pointed out by Eric.
====================

Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoskb_array: resize support
Michael S. Tsirkin [Mon, 13 Jun 2016 20:54:50 +0000 (23:54 +0300)]
skb_array: resize support

Update skb_array after ptr_ring API changes.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoptr_ring: resize support
Michael S. Tsirkin [Mon, 13 Jun 2016 20:54:45 +0000 (23:54 +0300)]
ptr_ring: resize support

This adds ring resize support. Seems to be necessary as
users such as tun allow userspace control over queue size.

If resize is used, this costs us ability to peek at queue without
consumer lock - should not be a big deal as peek and consumer are
usually run on the same CPU.

If ring is made bigger, ring contents are preserved.  If ring is made
smaller, extra pointers are passed to an optional destructor callback.

Cleanup function also gains destructor callback such that
all pointers in queue can be cleaned up.

This changes some APIs but we don't have any users yet,
so it won't break bisect.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoskb_array: array based FIFO for skbs
Michael S. Tsirkin [Mon, 13 Jun 2016 20:54:41 +0000 (23:54 +0300)]
skb_array: array based FIFO for skbs

A simple array based FIFO of pointers.  Intended for net stack so uses
skbs for type safety. Implemented as a set of wrappers around ptr_ring.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoptr_ring: ring test
Michael S. Tsirkin [Mon, 13 Jun 2016 20:54:36 +0000 (23:54 +0300)]
ptr_ring: ring test

Add ringtest based unit test for ptr ring.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoptr_ring: array based FIFO for pointers
Michael S. Tsirkin [Mon, 13 Jun 2016 20:54:31 +0000 (23:54 +0300)]
ptr_ring: array based FIFO for pointers

A simple array based FIFO of pointers.  Intended for net stack which
commonly has a single consumer/producer.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: make tcf_hash_check() boolean
WANG Cong [Mon, 13 Jun 2016 20:46:28 +0000 (13:46 -0700)]
net_sched: make tcf_hash_check() boolean

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'vrf-ipv6-mcast-link-local'
David S. Miller [Wed, 15 Jun 2016 19:34:34 +0000 (12:34 -0700)]
Merge branch 'vrf-ipv6-mcast-link-local'

David Ahern says:

====================
net: vrf: Handle ipv6 multicast and link-local addresses

IPv6 multicast and link-local addresses require special handling by the
VRF driver. Rather than using the VRF device index and full FIB lookups,
packets to/from these addresses should use direct FIB lookups based on
the VRF device table.

Multicast routes do not make sense for the L3 master device directly.
Accordingly, do not add mcast routes for the device, and the VRF driver
should fail attempts to send packets to ipv6 mcast addresses on the
device (e.g., ping6 ff02::1%<vrf> should fail)

With this change connections into and out of a VRF enslaved device work
for multicast and link-local addresses (icmp, tcp, and udp).  e.g.,

1. packets into VM with VRF config:
    ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
    ping6 -c3 ff02::1%br1
    ssh -6 fe80::e0:f9ff:fe1c:b974%br1

2. packets going out a VRF enslaved device:
    ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
    ping6 -c3 ff02::1%eth1
    ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: vrf: Handle ipv6 multicast and link-local addresses
David Ahern [Mon, 13 Jun 2016 20:44:19 +0000 (13:44 -0700)]
net: vrf: Handle ipv6 multicast and link-local addresses

IPv6 multicast and link-local addresses require special handling by the
VRF driver:
1. Rather than using the VRF device index and full FIB lookups,
   packets to/from these addresses should use direct FIB lookups based on
   the VRF device table.

2. fail sends/receives on a VRF device to/from a multicast address
   (e.g., make ping6 ff02::1%<vrf> fail)

3. move the setting of the flow oif to the first dst lookup and revert
   the change in icmpv6_echo_reply made in ca254490c8dfd ("net: Add VRF
   support to IPv6 stack"). Linklocal/mcast addresses require use of the
   skb->dev.
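
As an illustration of the address classes involved, the following
userspace C sketch tests whether an IPv6 address is multicast or
link-local using the standard macros from <netinet/in.h>. The function
name is hypothetical, and the kernel uses its own address-type helpers
rather than these macros. Note that inet_pton() does not accept a
%<device> scope suffix, so the address is passed without it.

```c
#include <stdbool.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: true if the textual IPv6 address is multicast
 * (ff00::/8) or link-local unicast (fe80::/10), i.e. one of the
 * address classes that needs the special handling listed above.
 * Returns false for anything that does not parse as an IPv6 address.
 */
static bool needs_direct_lookup(const char *text)
{
	struct in6_addr addr;

	if (inet_pton(AF_INET6, text, &addr) != 1)
		return false;
	return IN6_IS_ADDR_MULTICAST(&addr) || IN6_IS_ADDR_LINKLOCAL(&addr);
}
```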

With this change connections into and out of a VRF enslaved device work
for multicast and link-local addresses (icmp, tcp, and udp),
e.g.,

1. packets into VM with VRF config:
    ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
    ping6 -c3 ff02::1%br1

    ssh -6 fe80::e0:f9ff:fe1c:b974%br1

2. packets going out a VRF enslaved device:
    ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
    ping6 -c3 ff02::1%eth1
    ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: ipv6: Do not add multicast route for l3 master devices
David Ahern [Mon, 13 Jun 2016 20:44:18 +0000 (13:44 -0700)]
net: ipv6: Do not add multicast route for l3 master devices

L3 master devices are virtual devices similar to the loopback
device. Link local and multicast routes for these devices do
not make sense. The ipv6 addrconf code already skips adding a
linklocal address; do the same for the mcast route.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: l3mdev: Remove const from flowi6 arg to get_rt6_dst
David Ahern [Mon, 13 Jun 2016 20:44:17 +0000 (13:44 -0700)]
net: l3mdev: Remove const from flowi6 arg to get_rt6_dst

Allow drivers to pass flow arg to functions where the arg is not const
and allow the driver to make updates as needed (e.g., setting oif).

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'af_iucv-big-bufs'
David S. Miller [Wed, 15 Jun 2016 19:21:05 +0000 (12:21 -0700)]
Merge branch 'af_iucv-big-bufs'

Ursula Braun says:

====================
s390: af_iucv patches

Here are improvements for af_iucv relaxing the pressure to allocate
big contiguous kernel buffers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoaf_iucv: use paged SKBs for big inbound messages
Eugene Crosser [Mon, 13 Jun 2016 16:46:16 +0000 (18:46 +0200)]
af_iucv: use paged SKBs for big inbound messages

When an inbound message is bigger than a page, allocate a paged SKB,
and subsequently use IUCV receive primitive with IPBUFLST flag.
This relaxes the pressure to allocate big contiguous kernel buffers.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoaf_iucv: remove fragment_skb() to use paged SKBs
Eugene Crosser [Mon, 13 Jun 2016 16:46:15 +0000 (18:46 +0200)]
af_iucv: remove fragment_skb() to use paged SKBs

Before introducing paged skbs in the receive path, get rid of the
function `iucv_fragment_skb()` that replaces one large linear skb
with several smaller linear skbs.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoaf_iucv: use paged SKBs for big outbound messages
Eugene Crosser [Mon, 13 Jun 2016 16:46:14 +0000 (18:46 +0200)]
af_iucv: use paged SKBs for big outbound messages

When an outbound message is bigger than a page, allocate and fill
a paged SKB, and subsequently use IUCV send primitive with IPBUFLST
flag. This relaxes the pressure to allocate big contiguous kernel
buffers.

Signed-off-by: Eugene Crosser <Eugene.Crosser@ru.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agodt: bindings: Add bindings for Cirrus Logic CS89x0 ethernet chip
Alexander Shiyan [Mon, 13 Jun 2016 15:52:17 +0000 (18:52 +0300)]
dt: bindings: Add bindings for Cirrus Logic CS89x0 ethernet chip

Add device tree binding documentation details for Cirrus Logic
CS8900/CS8920 ethernet chip.

Signed-off-by: Alexander Shiyan <shc_work@mail.ru>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet: cx89x0: Add DT support
Alexander Shiyan [Mon, 13 Jun 2016 15:51:05 +0000 (18:51 +0300)]
net: cx89x0: Add DT support

Add DT support to the Cirrus Logic CS89x0 driver.

Signed-off-by: Alexander Shiyan <shc_work@mail.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agorxrpc: Rework local endpoint management
David Howells [Mon, 4 Apr 2016 13:00:35 +0000 (14:00 +0100)]
rxrpc: Rework local endpoint management

Rework the local RxRPC endpoint management.

Local endpoint objects are maintained in a flat list as before.  This
should be okay as there shouldn't be more than one per open AF_RXRPC socket
(there can be fewer as local endpoints can be shared if their local service
ID is 0 and they share the same local transport parameters).

Changes:

 (1) Local endpoints may now only be shared if they have local service ID 0
     (ie. they're not being used for listening).

     This prevents a scenario where process A is listening on the Cache
     Manager port and process B contacts a fileserver - which may then
     attempt to send CM requests back to B.  But if A and B are sharing a
     local endpoint, A will get the CM requests meant for B.

 (2) We use a mutex to handle lookups and don't provide RCU-only lookups
     since we only expect to access the list when opening a socket or
     destroying an endpoint.

     The local endpoint object is pointed to by the transport socket's
     sk_user_data for the life of the transport socket - allowing us to
     refer to it directly from the sk_data_ready and sk_error_report
     callbacks.

 (3) atomic_inc_not_zero() now exists and can be used to only share a local
     endpoint if the last reference hasn't yet gone.

 (4) We can remove rxrpc_local_lock - a spinlock that had to be taken with
     BH processing disabled given that we assume sk_user_data won't change
     under us.

 (5) The transport socket is shut down before we clear the sk_user_data
     pointer so that we can be sure that the transport socket's callbacks
     won't be invoked once the RCU destruction is scheduled.

 (6) Local endpoints have a work item that handles both destruction and
     event processing.  This means that destruction doesn't then need to
     wait for event processing.  The event queues can then be cleared after
     the transport socket is shut down.

 (7) Local endpoints are no longer available for resurrection beyond the
     life of the sockets that had them open.  As soon as their last ref
     goes, they are scheduled for destruction and may not have their usage
     count moved from 0.
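
The "take a reference only if the count has not yet reached zero" rule
from points (3) and (7) can be sketched in userspace with C11 atomics.
This is an analogue of the idea behind atomic_inc_not_zero(), not the
kernel primitive, and the function name is made up for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Try to take a reference. Succeeds only while the usage count is
 * non-zero; once it has dropped to 0 the object is being destroyed
 * and must not be resurrected.
 */
static bool ref_get_unless_zero(atomic_int *usage)
{
	int old = atomic_load(usage);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(usage, &old, old + 1))
			return true;
	}
	return false;
}
```

A lookup that finds a matching endpoint would call this and fall back to
creating a new endpoint when it returns false.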

Signed-off-by: David Howells <dhowells@redhat.com>
7 years agorxrpc: Separate local endpoint event handling out into its own file
David Howells [Mon, 4 Apr 2016 13:00:34 +0000 (14:00 +0100)]
rxrpc: Separate local endpoint event handling out into its own file

Separate local endpoint event handling out into its own file preparatory to
overhauling the object management aspect (which remains in the original
file).

Signed-off-by: David Howells <dhowells@redhat.com>
7 years agorxrpc: Use the peer record to distribute network errors
David Howells [Mon, 4 Apr 2016 13:00:34 +0000 (14:00 +0100)]
rxrpc: Use the peer record to distribute network errors

Use the peer record to distribute network errors rather than the transport
object (which I want to get rid of).  An error from a particular peer
terminates all calls on that peer.

For future consideration:

 (1) For ICMP-induced errors it might be worth trying to extract the RxRPC
     header from the offending packet, if one is returned attached to the
     ICMP packet, to better direct the error.

     This may be overkill, though, since an ICMP packet would be expected
     to be relating to the destination port, machine or network.  RxRPC
     ABORT and BUSY packets give notice at RxRPC level.

 (2) To also abort connection-level communications (such as CHALLENGE
     packets) where indicated by an error - but that requires some revamping
     of the connection event handling first.

Signed-off-by: David Howells <dhowells@redhat.com>
7 years agorxrpc: Do a little bit of tidying in the ICMP processing
David Howells [Mon, 4 Apr 2016 13:00:34 +0000 (14:00 +0100)]
rxrpc: Do a little bit of tidying in the ICMP processing

Do a little bit of tidying in the ICMP processing code.

Signed-off-by: David Howells <dhowells@redhat.com>
7 years agorxrpc: Don't assume anything about the address in an ICMP packet
David Howells [Mon, 4 Apr 2016 13:00:33 +0000 (14:00 +0100)]
rxrpc: Don't assume anything about the address in an ICMP packet

Don't assume anything about the address in an ICMP packet in
rxrpc_error_report() as the address may not be IPv4 in future, especially
since we're just printing these details.

Signed-off-by: David Howells <dhowells@redhat.com>
7 years agorxrpc: Break MTU determination from ICMP into its own function
David Howells [Mon, 4 Apr 2016 13:00:33 +0000 (14:00 +0100)]
rxrpc: Break MTU determination from ICMP into its own function

Break MTU determination from ICMP out into its own function to reduce the
complexity of the error report handler.

Signed-off-by: David Howells <dhowells@redhat.com>
7 years agorxrpc: Rename rxrpc_UDP_error_report() to rxrpc_error_report()
David Howells [Mon, 4 Apr 2016 13:00:32 +0000 (14:00 +0100)]
rxrpc: Rename rxrpc_UDP_error_report() to rxrpc_error_report()

Rename rxrpc_UDP_error_report() to rxrpc_error_report() as it might get
called for something other than UDP.

Signed-off-by: David Howells <dhowells@redhat.com>
7 years agorxrpc: Rework peer object handling to use hash table and RCU
David Howells [Mon, 4 Apr 2016 13:00:32 +0000 (14:00 +0100)]
rxrpc: Rework peer object handling to use hash table and RCU

Rework peer object handling to use a hash table instead of a flat list and
to use RCU.  Peer objects are no longer destroyed by passing them to a
workqueue to process, but rather are just passed to the RCU garbage
collector as kfree'able objects.

The hash function uses the local endpoint plus all the components of the
remote address, except for the RxRPC service ID.  Peers thus represent a
UDP port on the remote machine as contacted by a UDP port on this machine.
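
As an illustration of which fields feed the hash, here is a userspace
sketch.  The struct layout, the field names and the choice of FNV-1a are
all assumptions made for the example; the patch's actual hash function is
not shown in this message.  The only property taken from the text above is
that the service ID is left out:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical lookup key; field names are illustrative only. */
struct peer_key {
	uint32_t local_id;	/* stands in for the local endpoint */
	uint32_t remote_addr;	/* remote IPv4 address */
	uint16_t remote_port;	/* remote UDP port */
	uint16_t service_id;	/* deliberately left out of the hash */
};

/* Fold the low 'nbytes' bytes of 'v' into an FNV-1a hash state. */
static uint32_t fnv1a_mix(uint32_t h, uint32_t v, unsigned int nbytes)
{
	assert(nbytes <= 4);
	for (unsigned int i = 0; i < nbytes; i++) {
		h ^= (v >> (8 * i)) & 0xffu;
		h *= 16777619u;		/* 32-bit FNV prime */
	}
	return h;
}

/* Hash the local endpoint and the remote address, but not the service. */
static uint32_t peer_hash(const struct peer_key *k)
{
	uint32_t h = 2166136261u;	/* 32-bit FNV offset basis */

	h = fnv1a_mix(h, k->local_id, 4);
	h = fnv1a_mix(h, k->remote_addr, 4);
	h = fnv1a_mix(h, k->remote_port, 2);
	return h;
}
```

Two keys that differ only in service_id hash identically here, which is
what lets a single peer record cover every service reached through the same
remote UDP port.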

The RCU read lock is used to handle non-creating lookups so that they can
be called from bottom half context in the sk_error_report handler without
having to lock the hash table against modification.
rxrpc_lookup_peer_rcu() *does* take a reference on the peer object because,
in the future, the peer will be passed to a work item for error
distribution in the error_report path and this function will cease being
used in the data_ready path.

Creating lookups are done under spinlock rather than mutex as they might be
set up due to an external stimulus if the local endpoint is a server.

Captured network error messages (ICMP) are handled with respect to this
struct and MTU size and RTT are cached here.

Signed-off-by: David Howells <dhowells@redhat.com>
7 years agoact_police: rename tcf_act_police_locate() to tcf_act_police_init()
WANG Cong [Mon, 13 Jun 2016 17:47:44 +0000 (10:47 -0700)]
act_police: rename tcf_act_police_locate() to tcf_act_police_init()

This function is just ->init(), rename it to make it obvious.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agonet_sched: remove internal use of TC_POLICE_*
WANG Cong [Mon, 13 Jun 2016 17:47:43 +0000 (10:47 -0700)]
net_sched: remove internal use of TC_POLICE_*

These should have gone when we removed CONFIG_NET_CLS_POLICE.
We cannot totally remove them since they are exposed
to userspace.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoMerge branch 'rds-mprds-foundations'
David S. Miller [Wed, 15 Jun 2016 06:50:44 +0000 (23:50 -0700)]
Merge branch 'rds-mprds-foundations'

Sowmini Varadhan says:

====================
RDS: multiple connection paths for scaling

Today RDS-over-TCP is implemented by demux-ing multiple PF_RDS sockets
between any 2 endpoints (where endpoint == [IP address, port]) over a
single TCP socket between the 2 IP addresses involved. This has the
limitation that it ends up funneling multiple RDS flows over a single
TCP flow, thus the rds/tcp connection
   (a) is upper-bounded by the single-flow bandwidth, and
   (b) suffers from head-of-line blocking for the RDS sockets.

Better throughput (for a fixed small packet size, MTU) can be achieved
by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
RDS (mprds).  Each such TCP/IP flow constitutes a path for the rds/tcp
connection. RDS sockets will be attached to a path based on some hash
(e.g., of local address and RDS port number) and packets for that RDS
socket will be sent over the attached path using TCP to segment/reassemble
RDS datagrams on that path.
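
A rough sketch of what hash-based path attachment could look like is given
below.  NUM_PATHS, pick_path() and the mixing constant are assumptions made
for illustration; the cover letter only says "some hash" of values such as
the local address and RDS port number, and this series does not yet enable
multipathing:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PATHS 8u	/* hypothetical; matches the 8-path row in the table */

/*
 * Map a socket's (local address, RDS port) pair onto a path index.  The
 * same pair always yields the same index, so a socket stays on one path.
 */
static unsigned int pick_path(uint32_t local_addr, uint16_t rds_port)
{
	uint32_t h = local_addr ^ (((uint32_t)rds_port << 16) | rds_port);
	unsigned int idx;

	h *= 0x9e3779b1u;		/* multiplicative mixing step */
	idx = (h >> 16) % NUM_PATHS;	/* fold the upper bits into range */
	assert(idx < NUM_PATHS);
	return idx;
}
```

Because the mapping is a pure function of its inputs, all datagrams from
one RDS socket share a path and keep their ordering on it, while different
sockets can spread across the available paths.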

The table below, generated using a prototype that implements mprds,
shows that this is significant for scaling to 40G.  Packet sizes
used were: 8K byte req, 256 byte resp. MTU: 1500.  The parameters for
RDS-concurrency used below are described in the rds-stress(1) man page;
the number listed is proportional to the number of threads at which max
throughput was attained.

  -------------------------------------------------------------------
     RDS-concurrency   Num of       tx+rx K/s (iops)       throughput
     (-t N -d N)       TCP paths
  -------------------------------------------------------------------
        16             1             600K -  700K            4 Gbps
        28             8            5000K - 6000K           32 Gbps
  -------------------------------------------------------------------

FAQ: what is the relation between mprds and mptcp?
  mprds is orthogonal to mptcp. Whereas mptcp creates
  sub-flows for a single TCP connection, mprds parallelizes tx/rx
  at the RDS layer. MPRDS with N paths will allow N datagrams to
  be sent in parallel; each path will continue to send one
  datagram at a time, with sender and receiver keeping track of
  the retransmit and dgram-assembly state based on the RDS header.
  If desired, mptcp can additionally be used to speed up each TCP
  path. That acceleration is orthogonal to the parallelization benefits
  of mprds.

This patch series lays down the foundational data-structures to support
mprds in the kernel. It implements the changes to split up the
rds_connection structure into a common (to all paths) part,
and a per-path rds_conn_path. All I/O workqs are driven from
the rds_conn_path.

Note that this patchset does not (yet) actually enable multipathing
for any of the transports; all transports will continue to use a
single path with the refactored data-structures. A subsequent patchset
will add the changes to the rds-tcp module to actually use mprds
in rds-tcp.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoRDS: Update rds_conn_destroy to be MP capable
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:42 +0000 (09:44 -0700)]
RDS: Update rds_conn_destroy to be MP capable

Refactor rds_conn_destroy() so that the per-path dismantling
is done in rds_conn_path_destroy, and then iterate as needed
over rds_conn_path_destroy().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoRDS: Update rds_conn_shutdown to work with rds_conn_path
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:41 +0000 (09:44 -0700)]
RDS: Update rds_conn_shutdown to work with rds_conn_path

This commit changes rds_conn_shutdown to take a rds_conn_path *
argument, allowing it to shut down paths other than c_path[0] for
MP-capable transports.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoRDS: Initialize all RDS_MPATH_WORKERS in __rds_conn_create
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:40 +0000 (09:44 -0700)]
RDS: Initialize all RDS_MPATH_WORKERS in __rds_conn_create

Add a for() loop in __rds_conn_create to initialize all the
conn_paths, in preparation for MP-capable transports.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoRDS: Add rds_conn_path_error()
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:39 +0000 (09:44 -0700)]
RDS: Add rds_conn_path_error()

rds_conn_path_error() is the MP-aware analog of rds_conn_error,
to be used by multipath-capable callers.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoRDS: update rds-info related functions to traverse multiple conn_paths
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:38 +0000 (09:44 -0700)]
RDS: update rds-info related functions to traverse multiple conn_paths

This commit updates the callbacks related to the rds-info command
so that they walk through all the rds_conn_path structures and
report the requested info.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
7 years agoRDS: Add rds_conn_path_connect_if_down() for MP-aware callers
Sowmini Varadhan [Mon, 13 Jun 2016 16:44:37 +0000 (09:44 -0700)]
RDS: Add rds_conn_path_connect_if_down() for MP-aware callers

rds_conn_path_connect_if_down() works on the rds_conn_path
that it is passed. Callers who are not t_m_capable may continue
calling rds_conn_connect_if_down, which will invoke
rds_conn_path_connect_if_down() with the default c_path[0].
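
The wrapper arrangement described above can be sketched in userspace as
follows.  The structs, the state enum and the function names are
simplified, hypothetical stand-ins rather than the real rds_connection and
rds_conn_path definitions; only the forwarding to c_path[0] is taken from
the text:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PATHS 8	/* hypothetical number of paths per connection */

enum path_state { PATH_DOWN, PATH_CONNECTING, PATH_UP };

/* Simplified per-path state. */
struct conn_path {
	enum path_state state;
};

/* Simplified connection: a common part would sit alongside the paths. */
struct conn {
	struct conn_path c_path[MAX_PATHS];
};

/* MP-aware variant: acts only on the path it is handed. */
static bool conn_path_connect_if_down(struct conn_path *cp)
{
	assert(cp != NULL);
	if (cp->state != PATH_DOWN)
		return false;
	cp->state = PATH_CONNECTING;	/* a real version would queue work */
	return true;
}

/* Legacy entry point: forwards to the default path, c_path[0]. */
static bool conn_connect_if_down(struct conn *conn)
{
	return conn_path_connect_if_down(&conn->c_path[0]);
}
```

Keeping the old entry point as a thin forwarder means single-path callers
need no changes while MP-aware callers pick the path explicitly.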

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>