From: Herbert Poetzl (herbert_at_13thfloor.at)
Date: Mon 29 Mar 2004 - 04:43:45 BST
Hello Community!
I finished investigating the options we have regarding the network
(interface) development in future linux-vserver versions, and I'd
like to get your opinion on several issues and/or ideas ...
this is going to be a little longer, so I'd suggest reading it
thoroughly and thinking about it before replying (but you probably
do that anyway ;)
I'll do this in several parts, so I can accumulate questions,
suggestions, answers, etc., and respond to them as I proceed.
so do not expect this to be something final, and do not hesitate
to ask questions and/or provide feedback ...
------------
first, a short overview of the basic principles in use and the
'building blocks' I identified and researched.
Network Interfaces [ip link]
- provide a handle to a physical or virtual device
- have a physical address (eg. MAC for ethernet)
- do traffic accounting (rx/tx, errors/drops/...)
IP Addresses [ip addr]
- provide an internet address ipv4/ipv6/...
- associated with a link (interface) as primary/secondary
- have/define a network (address/netmask)
Network Sockets [netstat -atuw]
- provide an interface to send/receive messages
- associated with an address (not an interface)
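the last point is worth demonstrating: a socket binds to an address,
not to an interface. a minimal Python sketch (192.0.2.1 is just an
example of an address that is not configured locally):

```python
import socket

# Sockets attach to an (address, port) pair, never to an interface;
# the kernel picks the interface from its routing tables, and it
# rejects bind() for addresses that are not configured locally.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("127.0.0.1", 0))            # any locally configured address works
local = s.getsockname()             # ('127.0.0.1', <ephemeral port>)
s.close()

t = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    t.bind(("192.0.2.1", 0))        # TEST-NET address, not local here
    bound = True
except OSError:                     # EADDRNOTAVAIL
    bound = False
t.close()
```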
what we currently use in linux-vserver:
Network Context [chbind]
- limits the usable addresses to a given set of addresses
- is inherited from parent to child process
- is applied to socket operations
- limits the visibility of addresses
- doesn't know anything about interfaces
- can not be modified or migrated into
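to make those semantics concrete, here is a conceptual Python sketch
of such an address-set check (purely illustrative, not the actual
kernel code; the class and method names are made up):

```python
class NetworkContext:
    """Illustrative model of a network context: a fixed set of
    permitted addresses, checked on socket operations."""

    def __init__(self, addrs):
        self.addrs = frozenset(addrs)   # fixed: can not be modified later

    def visible(self, all_addrs):
        # limits the visibility of addresses to the permitted set
        return [a for a in all_addrs if a in self.addrs]

    def check_bind(self, addr):
        # applied to socket operations: binding outside the set fails
        if addr not in self.addrs:
            raise PermissionError(addr + " not in context")
        return addr

ctx = NetworkContext(["10.0.0.2", "127.0.0.1"])
seen = ctx.visible(["10.0.0.1", "10.0.0.2", "127.0.0.1"])
```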
how does this differ from the UML/QEMU/VMware network device?
basically, a network interface is the point where a packet
enters or leaves the host (server), and that is what the
tun/tap device on the host and the network driver in the
UML/QEMU/VMware client do.
consider the following setup:
host: eth0: <some-network-ip>
tun0: 10.0.0.1/24
lo: 127.0.0.1/8
client: eth0: 10.0.0.2/24
lo: 127.0.0.1/8
what happens on a 'ping -c 1 10.0.0.2' issued on the host?
H# ping -c 1 10.0.0.2
PING 10.0.0.2 (10.0.0.2) from 10.0.0.1 : 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=0 ttl=64 time=44.554 msec
HOST (MAC-H, 10.0.0.1) (MAC-C, 10.0.0.2) CLIENT
| |
| arp: who-has 10.0.0.2 tell 10.0.0.1 ---------------------> |
| <------------------------- arp: reply 10.0.0.2 is-at MAC-C |
| |
| icmp: 10.0.0.1 > 10.0.0.2: echo request -----------------> |
| |
| <--------------------- arp: who-has 10.0.0.1 tell 10.0.0.2 |
| arp: reply 10.0.0.1 is-at MAC-H -------------------------> |
| |
| <------------------- icmp: 10.0.0.2 > 10.0.0.1: echo reply |
and ifconfig on the client (and on the host, except for
some differences in the packet sizes[1]) now shows:
eth0 Link encap:Ethernet HWaddr MAC-C
inet addr:10.0.0.2 Bcast:10.0.0.255 Mask: ...
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3 errors:0 dropped:0 overruns:0 frame:0
TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:218 (218.0 b) TX bytes:218 (218.0 b)
what did UML/QEMU/VMware do in that process? simple: the
application received 3 packets from the host via tun0 and
transmitted them to the client kernel via eth0, and it also
received 3 packets from the client, which it delivered via
the tun0 device to the network stack of the host.
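that forwarding role can be sketched as a plain copy loop between two
file descriptors. in the sketch below, pipes stand in for the tun fd
and the emulated NIC (a real implementation would read frames from
/dev/net/tun, which needs privileges):

```python
import os

def forward_frames(src_fd, dst_fd, count, bufsize=2048):
    """Copy up to `count` frames from src_fd to dst_fd -- the way a
    user-space emulator shuttles packets between the host's tun
    device and the guest's virtual NIC."""
    moved = 0
    for _ in range(count):
        frame = os.read(src_fd, bufsize)
        if not frame:
            break
        os.write(dst_fd, frame)
        moved += 1
    return moved

# demo with pipes instead of real devices
tun_r, tun_w = os.pipe()        # stands in for the host's tun0
nic_r, nic_w = os.pipe()        # stands in for the client's eth0
os.write(tun_w, b"arp who-has 10.0.0.2")
n = forward_frames(tun_r, nic_w, 1)
frame = os.read(nic_r, 2048)
```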
now, let's have a look at the same ping on the client side:
C# ping -c 1 10.0.0.2
PING 10.0.0.2 (10.0.0.2) from 10.0.0.2 : 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=0 ttl=64 time=4.391 msec
CLIENT (MAC-C, 10.0.0.2) (MAC-C, 10.0.0.2) CLIENT
| |
| icmp: 10.0.0.2 > 10.0.0.2: echo request -----------------> |
| <------------------- icmp: 10.0.0.2 > 10.0.0.2: echo reply |
and the ifconfig on the client shows:
eth0 Link encap:Ethernet HWaddr MAC-C
inet addr:10.0.0.2 Bcast:10.0.0.255 Mask: ...
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:2 errors:0 dropped:0 overruns:0 frame:0
TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:168 (168.0 b) TX bytes:168 (168.0 b)
what was the part of UML/QEMU/VMware in that process?
nothing network related, at least: the entire ping was
handled on the client, which used the loopback interface
to reach one of its local addresses; disabling the lo
device would cause the ping to fail.
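the same effect is easy to reproduce with two UDP sockets on one
machine: request and reply both traverse lo, which is also why lo
accounts 2 packets for a single ping (these sockets are merely a
stand-in for the ICMP echo pair):

```python
import socket

# both endpoints live on the same machine, so request and reply
# travel over the loopback device -- no NIC is involved.
srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.bind(("127.0.0.1", 0))
cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.bind(("127.0.0.1", 0))

cli.sendto(b"echo request", srv.getsockname())   # packet 1 on lo
msg, peer = srv.recvfrom(64)
srv.sendto(b"echo reply", peer)                  # packet 2 on lo
reply, _ = cli.recvfrom(64)
srv.close()
cli.close()
```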
interesting things to spend a second thought on:
- why does the host->client ping take ~10 times longer?
- why does lo show 2 packets received and 2 transmitted?
- why does lo account a different size than tun0?
- why does tun0 account a different size than eth0?
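as a hint for the lo size question: loopback carries bare IP packets
(no ethernet header), so the arithmetic works out exactly, using the
standard IPv4 and ICMP header sizes:

```python
# ping reported "56(84) bytes": 56 bytes of data plus 8 bytes of
# ICMP header plus 20 bytes of IPv4 header = 84 bytes per packet;
# request + reply then give the 168 bytes lo accounts above.
ip_hdr, icmp_hdr, payload = 20, 8, 56
packet = ip_hdr + icmp_hdr + payload
total_on_lo = 2 * packet
```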
next part: routing and netfilter (probably)
best,
Herbert
[1] if you do a detailed dump and have a close look at the
accounted network data, you will find that the client
receives more data than the host transmits (via tun0)
_______________________________________________
Vserver mailing list
Vserver_at_list.linux-vserver.org
http://list.linux-vserver.org/mailman/listinfo/vserver