% Network Subsystem
% Thadeu Cascardo

# Socket Buffers

* include linux/skbuff.h
* struct sk\_buff
    - struct net\_device *dev
* alloc\_skb(len, gfp)
* dev\_alloc\_skb(len) - uses GFP\_ATOMIC, reserves NET\_SKB\_PAD bytes of headroom
* netdev\_alloc\_skb(netdev, len) - uses GFP\_ATOMIC, since 2.6.18
* kfree\_skb, dev\_kfree\_skb

# Diagram

* head - start of the allocated buffer
* end - end of the allocated buffer
* data - start of data, after the headroom
* tail - end of used space
* headroom - space between head and data
* tailroom - space between tail and end

# Socket Buffer operations

* skb\_put(skb, len) - appends data at the end of the skb
* skb\_push(skb, len) - prepends data at the start of the skb
* skb\_pull(skb, len) - removes data from the start of the skb
* skb\_headroom and skb\_tailroom
* skb\_reserve - only allowed on an empty buffer, reserves headroom
* skb\_orphan - releases the skb from its owning socket
* A usage sketch appears in the example slides near the end of this deck

# Network Device

* include linux/netdevice.h
* struct net\_device
    - char name[]
    - features
    - stats
    - netdev\_ops
    - ethtool\_ops
    - header\_ops
    - flags
    - mtu
    - type
    - hard\_header\_len

# Network Device Setup

* alloc\_netdev(szpriv, name, setup)
* setup function
* include linux/etherdevice.h
* ether\_setup
* alloc\_etherdev
* register\_netdev
* unregister\_netdev
* free\_netdev

# Network Device Operations

* struct net\_device\_ops
* ndo\_init
* ndo\_open - should call netif\_start\_queue
* ndo\_stop - should call netif\_stop\_queue

# Network Device Address

* struct net\_device
    - dev\_addr
* random\_ether\_addr
* struct net\_device\_ops
    - ndo\_set\_mac\_address
* eth\_mac\_addr

# Transmission

* ndo\_start\_xmit
* Called with softirqs disabled or in softirq context
* Called with the device transmit lock held

# Limits on transmission

* When TX buffers are full, xmit may call netif\_stop\_queue
* The driver should arrange for netif\_wake\_queue to be called once TX buffers are free again (usually from the TX-done interrupt)

# Transmission timeout

* Transmission may time out after the queue has been stopped
* The core already records the time the last packet was transmitted
* The driver should set watchdog\_timeo and ndo\_tx\_timeout

# Reception

* Usually happens in an interrupt handler
* The driver allocates the skb: some drivers allocate buffers at setup time and arrange for the device to write into them directly
* Must set the skb protocol field: for Ethernet drivers, eth\_type\_trans does it
* Finally, call netif\_rx

# NAPI

* For better performance, NAPI introduces polling, avoiding an interrupt per packet when load is high
* The driver disables interrupts and enables polling in its interrupt handler when RX happens
* The network subsystem uses a softirq to do the polling
* The driver poll function disables polling and re-enables interrupts when it is done with its hardware queue

# NAPI

* struct napi\_struct
* netif\_napi\_add(dev, napi, poll\_func, weight)
* napi\_enable: called in open
* napi\_disable: called in stop - waits for any running poll to finish
* napi\_schedule
    - napi\_schedule\_prep
    - \_\_napi\_schedule
* napi\_complete: called in poll when all work is done
* Use netif\_receive\_skb instead of netif\_rx

# NAPI step by step

* In the interrupt handler:
    - Check that the interrupt received is RX
    - Call napi\_schedule\_prep to check that NAPI isn't already scheduled
    - Disable RX interrupts
    - Call \_\_napi\_schedule

# Weight and Budget

* The weight is the starting budget for the interface, usually 16
* The poll function must not dequeue more frames than the budget
* It must call napi\_complete if and only if it has exhausted the hardware queues using less than the budget
* It must return the number of entries processed from the queue
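# Example: socket buffer operations

A minimal sketch of the buffer geometry, assuming hypothetical `payload`/`plen` data and a prebuilt Ethernet header `eth`; real drivers would use the helpers shown on the earlier slides in much the same way.

```c
#include <linux/string.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>

static struct sk_buff *my_build_frame(const void *eth, const void *payload,
				      unsigned int plen)
{
	struct sk_buff *skb;

	skb = alloc_skb(ETH_HLEN + plen, GFP_KERNEL);
	if (!skb)
		return NULL;

	skb_reserve(skb, ETH_HLEN);		/* headroom for the header */
	memcpy(skb_put(skb, plen), payload, plen);	/* tail grows by plen */
	memcpy(skb_push(skb, ETH_HLEN), eth, ETH_HLEN);	/* data moves back
							   into the headroom */

	/* now skb_headroom(skb) == 0 and skb->len == ETH_HLEN + plen */
	return skb;
}
```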
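# Example: device setup and net\_device\_ops

The next few slides build a hypothetical skeleton driver (all `my_*` names are invented for illustration). This one covers allocation, the ops table, and registration; `my_start_xmit` and `my_tx_timeout` are filled in on the TX slide.

```c
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>

struct my_priv {
	struct net_device *dev;
	struct napi_struct napi;	/* used on the NAPI slide */
};

static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev);
static void my_tx_timeout(struct net_device *dev);

static int my_open(struct net_device *dev)
{
	/* bring the hardware up, request the IRQ... (elided) */
	netif_start_queue(dev);
	return 0;
}

static int my_stop(struct net_device *dev)
{
	netif_stop_queue(dev);
	/* shut the hardware down, free the IRQ... (elided) */
	return 0;
}

static const struct net_device_ops my_netdev_ops = {
	.ndo_open		= my_open,
	.ndo_stop		= my_stop,
	.ndo_start_xmit		= my_start_xmit,
	.ndo_tx_timeout		= my_tx_timeout,
	.ndo_set_mac_address	= eth_mac_addr,	/* generic helper */
};

static struct net_device *my_dev;

static int __init my_init(void)
{
	int err;

	my_dev = alloc_etherdev(sizeof(struct my_priv));  /* ether_setup() applied */
	if (!my_dev)
		return -ENOMEM;

	my_dev->netdev_ops = &my_netdev_ops;
	my_dev->watchdog_timeo = 5 * HZ;	/* arm the TX watchdog */
	random_ether_addr(my_dev->dev_addr);	/* random local MAC */

	err = register_netdev(my_dev);
	if (err)
		free_netdev(my_dev);
	return err;
}
module_init(my_init);

static void __exit my_exit(void)
{
	unregister_netdev(my_dev);
	free_netdev(my_dev);
}
module_exit(my_exit);
```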
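# Example: transmission

Continuing the skeleton: a TX sketch assuming hypothetical `my_hw_*()` helpers that copy the frame synchronously into a TX ring, which is why the skb can be freed right away.

```c
static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	my_hw_queue_frame(priv, skb->data, skb->len);	/* hypothetical */

	if (my_hw_tx_ring_full(priv))	/* hypothetical: no room left */
		netif_stop_queue(dev);	/* stop until TX-done frees space */

	dev_kfree_skb(skb);	/* safe here: the data was already copied */
	return NETDEV_TX_OK;
}

/* Called from the TX-done interrupt once descriptors are free again. */
static void my_tx_done(struct net_device *dev)
{
	/* reclaim descriptors... (elided) */
	if (netif_queue_stopped(dev))
		netif_wake_queue(dev);
}

/* Watchdog hook, fired when no TX completes within watchdog_timeo. */
static void my_tx_timeout(struct net_device *dev)
{
	dev->stats.tx_errors++;
	/* reset the hardware... (elided) */
	netif_wake_queue(dev);
}
```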
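# Example: reception

A sketch of the classic (non-NAPI) RX path, run from the interrupt handler; `my_hw_rx_len()`/`my_hw_rx_copy()` stand in for real hardware accessors.

```c
static void my_rx(struct net_device *dev)
{
	unsigned int len = my_hw_rx_len(dev);	/* hypothetical */
	struct sk_buff *skb;

	skb = netdev_alloc_skb(dev, len + NET_IP_ALIGN);
	if (!skb) {
		dev->stats.rx_dropped++;
		return;
	}
	skb_reserve(skb, NET_IP_ALIGN);		/* align the IP header */
	my_hw_rx_copy(dev, skb_put(skb, len));	/* copy the frame in */

	skb->protocol = eth_type_trans(skb, dev);  /* also pulls the MAC header */
	netif_rx(skb);				   /* hand to the stack */

	dev->stats.rx_packets++;
	dev->stats.rx_bytes += len;
}
```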
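# Example: NAPI

A NAPI sketch under the same assumptions: the interrupt handler only disables RX interrupts and schedules the poll, and the poll function honors the budget. Setup would be `netif_napi_add(dev, &priv->napi, my_poll, 16)` in init, `napi_enable` in open and `napi_disable` in stop.

```c
#include <linux/interrupt.h>

static irqreturn_t my_interrupt(int irq, void *dev_id)
{
	struct net_device *dev = dev_id;
	struct my_priv *priv = netdev_priv(dev);

	if (!my_hw_irq_is_rx(priv))		/* hypothetical status check */
		return IRQ_NONE;

	if (napi_schedule_prep(&priv->napi)) {	/* not already scheduled? */
		my_hw_disable_rx_irq(priv);	/* hypothetical */
		__napi_schedule(&priv->napi);
	}
	return IRQ_HANDLED;
}

/* Runs in softirq context with the given budget. */
static int my_poll(struct napi_struct *napi, int budget)
{
	struct my_priv *priv = container_of(napi, struct my_priv, napi);
	int work_done = 0;

	while (work_done < budget && my_hw_rx_pending(priv)) {
		struct sk_buff *skb = my_hw_build_skb(priv);  /* alloc + fill */

		skb->protocol = eth_type_trans(skb, priv->dev);
		netif_receive_skb(skb);		/* not netif_rx in NAPI mode */
		work_done++;
	}

	if (work_done < budget) {	/* hardware queue exhausted under budget */
		napi_complete(napi);
		my_hw_enable_rx_irq(priv);	/* back to interrupt mode */
	}
	return work_done;
}
```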
# Changes in net device

* Use netdev\_priv; there is no priv member anymore
* struct net\_device\_ops introduced in 2.6.29, with compatibility provided for old drivers
* Compatibility removed in 2.6.31
* netdev\_tx\_t: NETDEV\_TX\_OK, NETDEV\_TX\_BUSY, NETDEV\_TX\_LOCKED

# Other recent changes

* Some members moved to netdev\_queue to improve cache-line usage
* GRO/GSO - generic receive offload and generic segmentation offload
    - Handle hardware checksum acceleration
* Multi-queue support, for devices with multiple hardware queues, so they can be handled on different CPUs
* RPS - Receive Packet Steering, which distributes protocol processing among multiple CPUs on a single-device, single-queue system
* RFS - Receive Flow Steering, which tries to handle the packet on the CPU where the consuming application is running
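# Example: the new-style xmit signature

A small sketch of the porting direction described above: private data comes from netdev\_priv, and ndo\_start\_xmit returns netdev\_tx\_t instead of int (`my_hw_tx_ring_full` is, again, hypothetical).

```c
/* Old style (removed): struct my_priv *priv = dev->priv; */
static netdev_tx_t my_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct my_priv *priv = netdev_priv(dev);

	if (my_hw_tx_ring_full(priv)) {
		netif_stop_queue(dev);
		return NETDEV_TX_BUSY;	/* the core requeues the skb */
	}
	/* ...queue the frame... (elided) */
	dev_kfree_skb(skb);
	return NETDEV_TX_OK;
}
```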