
3. The arp Problem

3.1 The problem

If you follow the instructions and set up the examples in the LVS-mini-HOWTO, then you don't need to know about the arp problem. If you're going to set up grander LVSs, then you'll need to understand the arp problem.

Although this section comes early in the HOWTO, it has lots of pitfalls. You shouldn't be reading this unless you've at least set up a working LVS-NAT and LVS-DR LVS using the canned instructions in the mini-HOWTO.

The LVS allows several machines to function as one machine. For LVS-DR and LVS-Tun some trickery was needed to split the various handshakes etc involved in establishing and maintaining a TCP/IP connection so that some parts of it came from one machine and other parts from another machine. Most of these problems are handled, and some problems only occur for certain services (eg identd) and we've learned to live with them. The worst problem, which ironically only happens with realservers running Linux 2.2.x and 2.4.x kernels, is the "arp problem" (it's just as well we have the source code).

With LVS-DR and LVS-Tun, all the machines (director, realservers) in the LVS have an extra IP, the VIP. Here's an LVS-DR in a test setup where all machines and IPs are on the same physical network (i.e. are using the same link layer and can hear each other's broadcasts).



                        ________
                       |        |
                       | client |
                       |________|
                           |
                           |
                        (router)
                           |
                           |
                           |       __________
                           |  DIP |          |
                           |------| director |
                           |  VIP |__________|
                           |
                           |
                           |
         ------------------------------------
         |                 |                |
         |                 |                |
     RIP1, VIP         RIP2, VIP        RIP3, VIP
   ______________    ______________    ______________
  |              |  |              |  |              |
  | realserver1  |  | realserver2  |  | realserver3  |
  |______________|  |______________|  |______________|


When the client requests a connection to the VIP, it must connect to the VIP on the director and not to the VIP on the realservers.

The director box acts as an IP router, accepting packets destined for the VIP and then sending them on to a realserver (where the real work is done and a reply is generated). When the client (or router) puts out the arp request "who has VIP, tell client", the client/router must receive the MAC address of the director for the LVS to work. After receiving the arp reply, the client will send the connect request to the director. (The director will update its ipvsadm tables to keep track of the connections that it's in charge of and then forward the connect request packet to the chosen realserver).

If instead the client gets the MAC address of one of the realservers, then the packets will be sent directly to that realserver, bypassing the LVS action of the director. If the client's packets are consistently sent to the same realserver, then the client will have a normal session connected to that realserver. You can't count on this happening: the MAC address the client gets might change in the middle of a session, and a new realserver will start getting packets for connections it knows nothing about (the realserver will send tcp resets).

If nothing is done to direct arp requests for the VIP specifically to the director, then in some setups one particular realserver's MAC address will be in the client/router's arp table for the VIP and the client will only see one realserver. (In my setup, the machine with the fastest CPU is in the client's arp table, suggesting that it's the first machine to reply that gets in. Horms and Stephen Williams have written that they think it's the last machine to reply whose entry is in the client's arp table.) In other setups where the realservers are identical, the client will connect to different realservers each time the arp cache times out (see comment by Stephen Williams elsewhere). There the client's connection will hang as the new realserver will be presented with packets from an established connection that it knows nothing about.

If the director always gets its MAC address in the router's arp table, then the LVS will work without any changes to the realservers (as happened in my case with a director with the fastest CPU in the LVS), although this may not be a reliable solution for production.

Getting the MAC address of the director (instead of the realservers) to the client when the client/router does an arp request is the key to solving the "arp problem".

The arp problem was handled in 2.0.x kernels, as several devices which don't reply to arp requests (eg dummy0, tunl0, lo:0) were available for the VIP. For other OS's, the NOARP flag for ifconfig would stop the VIP on the realservers from replying to arp requests.

However with 2.2.x (and now 2.4.x) kernels, the devices which didn't reply to arp requests in 2.0.x, now reply to arp requests. There is a "-arp" (NOARP) option for ifconfig which (according to the man pages) turns off replies to arp requests for that device, and an "arp" option which turns them back on again. Linux does not always honour this flag. You couldn't turn on replies to arp requests for the dummy0 devices in 2.0.36 kernels and you can't turn it off for tunl0 in 2.2.x kernels. eth0 behaves properly in 2.0.36 but in 2.2.x kernels it arps even when you tell it not to arp. This behaviour of not honouring the NOARP flag in the Linux 2.2.x kernels is not regarded as a "problem" by those writing the Linux TCPIP code and is not going to be "fixed".

Julian 22 May 2001

The flag is used to allow arp requests for the specified device. Although "lo" doesn't reply to arp requests, the requests for the VIP go through eth*, and so the NOARP flag is of no help to us. We can't drop the flag for eth.

Another wrinkle is that in 2.0.36 kernels, aliased devices (eg eth0:1) could be set up independently of the options on the primary (eth0) device. Thus eth0:1 behaved as if it were on a separate NIC and its arp'ing behaviour could be set independently of the primary interface. The settings of an aliased device belonged to the IP. With the 2.2.x kernels, the aliased devices are now just alternate names for each other: if you change an option (eg -arp) or up/down one alias (or the primary), the other aliases follow. With 2.2.x kernels, the settings of the aliased device belong to the primary device (there is only one device with several IPs).

When LVS was running on 2.0.36 machines, the VIP was usually configured as an alias (eg lo:0, tunl0) on the main ethernet device (eth0), allowing the nodes in an LVS to have only one NIC.

With 2.2.x kernels care is needed when only one NIC is used on the realserver (the usual case). On a realserver with only one NIC, with eth0 carrying the RIP, eth0 must reply to arp requests (to receive packets), and so eth0:1 carrying the VIP will reply to arp requests too, even if you ifconfig it with -arp. Thus if a realserver is running a 2.2.x kernel and has the VIP on an ip_alias, then the VIP on the realserver will reply to arp requests received from the router.

With the 2.4 kernels, the use of ip_aliases is still allowed. However the new tools that come with 2.4 (iproute2 and ip_tables) do not recognise aliases, which have been replaced by secondary IPs.

3.2 The cure(s)

Several cures have been produced in an attempt to solve the arp problem. They are described in turn below; pick one.

Note: Some of these cures involve applying a patch to the kernel on the Linux 2.2.x or 2.4.x realserver. The patch (e.g. the "hidden" patch), which you apply to the realserver, is different to the patch which you apply to the director (the "ipvs" patch). For the full scoop on the hidden patch see Julian's page.

2.2.x kernels

The "hidden" patches for kernel >=2.2.14 are now in the standard linux distribution (i.e. you can use the "hidden" feature with a standard kernel and don't have to patch the kernel on the realserver anymore). The arp patches allow you to hide a device from arp requests, returning to the no_arp behaviour of the 2.0.x kernels.

To hide devices from arp calls, on the realservers do

       #to activate the hidden feature
       echo 1 > /proc/sys/net/ipv4/conf/all/hidden
       #to make lo:0 -arp, put lo here
       echo 1 > /proc/sys/net/ipv4/conf/<interface_name>/hidden

To test that the network device (here lo:0) is hidden from arp requests, see "How to tell if an interface is replying to arp requests" (section 3.7).

There is a possible race condition in hiding the VIP -

Kyle Sparger, 15 Feb 2001

I've found an interesting, but not totally unexpected race condition under DR in 2.2.x that I've managed to create when installing VIP's on a machine in DR mode. Basically, the cause is this:

ifconfig dummy0 10.0.1.15
echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden

You'll notice that there's going to be a small gap between the two which allows an ARP request to come in, and for the server to reply. And yes, it is big enough to be bitten by -- I've been bitten twice by it so far :)

Julian

On boot:

echo 1 > /proc/sys/net/ipv4/conf/all/hidden
# For each hidden interface (dummy, lo, tunl):
modprobe dummy0
ifconfig dummy0 0.0.0.0 up
echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden
# Now set any other IP address 

Kyle's suggestion

echo 1 > /proc/sys/net/ipv4/conf/default/hidden
ifconfig dummy0 10.0.1.15
echo 0 > /proc/sys/net/ipv4/conf/default/hidden

The echo 0 command is in case I want to configure other interfaces later that I _do_ want responding to ARP requests. Technically, it's not necessary, I just find it useful in my particular setup.
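The ordering in Julian's boot recipe above (set the hidden flag on the device before the VIP is assigned, so there is no window in which the VIP can answer an ARP request) can be wrapped in a small shell function. This is only a sketch, assuming a realserver kernel with the hidden feature and a device that already exists (the modprobe step is left out). The names hide_then_assign, PROC_ROOT and IFCONFIG are made up for this example; the two variables are there only so the steps can be dry-run against a scratch directory.

```shell
#!/bin/sh
# Sketch only: hide_then_assign, PROC_ROOT and IFCONFIG are illustrative
# names, not part of LVS or of the hidden patch.
PROC_ROOT="${PROC_ROOT:-/proc/sys/net/ipv4/conf}"
IFCONFIG="${IFCONFIG:-ifconfig}"

# usage: hide_then_assign <device> <VIP>
hide_then_assign() {
    dev="$1"
    vip="$2"
    # 1. activate the hidden feature
    echo 1 > "$PROC_ROOT/all/hidden" || return 1
    # 2. bring the device up with 0.0.0.0 so that its conf/ entry exists
    "$IFCONFIG" "$dev" 0.0.0.0 up || return 1
    # 3. hide the device while it still has no real address
    echo 1 > "$PROC_ROOT/$dev/hidden" || return 1
    # 4. only now put the VIP on the device
    "$IFCONFIG" "$dev" "$vip"
}

# On a realserver (as root), for example:
#   hide_then_assign dummy0 10.0.1.15
```

If any step fails the function returns before the VIP is assigned, so a failure leaves the realserver without the VIP rather than with a VIP that replies to ARP requests.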

Older 2.2 kernels (<2.2.12)

These are getting old now and it would be better to upgrade. However if you have them, you apply the arp patches to the kernel code of the 2.2.x realservers. These patches are separate from the ipvs patch applied to the kernel on the director.

For kernels <2.2.12, Julian's patch is on the lvs website.

http://www.linuxvirtualserver.org/arp_invisible-2213-2.diff

The patch by Stephen Williams is at

http://www.linuxvirtualserver.org/sdw_fullarpfix.patch

This patch is against a 2.2.5 kernel but can be applied to later kernels (tested to 2.2.13). The file appears to have DOS carriage control. Depending what you get on your disk, you may have to convert the file to unix carriage control (with `tr -d '\015'`) (the unix line extension of '\' doesn't work in combination with DOS carriage control).

The whitespace may not match your file so do

$ cd /usr/src/linux
$ patch -p1 -l < sdw_fullarpfix.patch
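The carriage-return check and conversion mentioned above can be done before running patch. The following is a sketch; the helper names has_cr and strip_cr, and the output filename in the usage comment, are made up for illustration.

```shell
#!/bin/sh
# Sketch only: has_cr and strip_cr are illustrative helper names.

# usage: has_cr <file>
# returns 0 if the file contains at least one carriage return (\015)
has_cr() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# usage: strip_cr < dos_file > unix_file
# deletes every carriage return, as with tr -d '\015' in the text
strip_cr() {
    tr -d '\015'
}

# For example:
#   has_cr sdw_fullarpfix.patch && \
#       strip_cr < sdw_fullarpfix.patch > sdw_fullarpfix.unix.patch
```

Writing to a new file rather than converting in place keeps the downloaded patch around in case the conversion was not needed.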

(If you are running one of these old kernels, you could upgrade your kernel instead.)

If you are using Julian's martian modification you will need the forward_shared-hidden patch (applied to both director and realservers).

2.4.x kernels

Julian's hidden patch to the standard 2.2.x kernel is not being included in the 2.4.x kernels.

For early 2.4.x kernels (eg x=0), the patch is available at http://www.linuxvirtualserver.org/hidden-2.3.41-1.diff. (This patches a part of the kernel that isn't being actively fiddled with, so hopefully the patch will work against later 2.4.x kernels too.)

The 2.4.x "hidden" patch is now being actively maintained and is included in ipvs-x.x.x/contrib/patches/hidden-x.x.x.diff

Assuming you are patching 2.4.2 with the ipvs-0.2.5 files

cd /usr/src/linux
patch -p1 <../ipvs-0.2.5/contrib/patches/hidden-2.4.2-1.diff

Then build the kernel (can use same options as for the 2.4 director kernel build).

You activate the hidden feature as for 2.2 (see the 2.2.x kernels instructions above).

As to why the hidden patch is in the 2.2 kernels but not the 2.4 kernels, see the mailing list archives.

Put an extra NIC on the realserver to carry the VIP (on eth1)

Possible cards would be a discarded ISA card (WD80x3), or a cheap 100Mbit PCI card (eg Netgear FA310TX, $16 in USA in Nov 99). There is no traffic going through this NIC and it doesn't matter that it's an old slow card. The extra card is only required so that the realserver can have the VIP on the machine. With 2.2.x kernels you can't stop this device (eth1) from replying to arp requests, but if you don't connect the cable to it or don't put a route to it in the realserver's routing table, then the client won't be able to send it an arp request.

To set this up with the configure script, enter eth1 as the device for the VIP on the realserver.

Put the realservers on a different network to the VIP

Set up routing tables so that the client cannot route to the realserver network (Lars' method). This method requires the director to not forward packets for the VIP (easy to implement if 2 NICs on the director).

On the client(router), set the routing to the VIP to go only to the director

You can hardwire the MAC address of the director as the MAC address of the VIP. You can do this with

# arp -s lvs.mack.net 00:80:C8:CA:A7:E4

or

# arp -f /etc/ethers

Here is my /etc/ethers file (on the client)

lvs.mack.net 00:80:C8:CA:A7:E4

This requires no extra NICs or patching of realservers. However in a production environment, redundant directors with heartbeat/failover may be required and some method (eg running send-arp) will be needed to change the static arp entry as the failover occurs. If multiple NICs are involved, it is possible that the above instruction will result in a route through the wrong NIC. In this case bring up the NIC of interest first and then run the above command.
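When a failover does occur, the static entry on the client/router has to be repointed at the MAC address of whichever director is now active. A minimal sketch of that step, using only arp -d and arp -s, follows. The function name repoint_vip and the ARP_CMD variable are made up for this example, and how the client/router learns that a failover has happened is outside what is shown here.

```shell
#!/bin/sh
# Sketch only: repoint_vip and ARP_CMD are illustrative names.
ARP_CMD="${ARP_CMD:-arp}"

# usage: repoint_vip <VIP hostname or IP> <MAC of the active director>
repoint_vip() {
    # remove the old entry; ignore the error if there was none
    "$ARP_CMD" -d "$1" 2>/dev/null || true
    # install the static entry for the director that is now active
    "$ARP_CMD" -s "$1" "$2"
}

# On the client/router (as root), for example:
#   repoint_vip lvs.mack.net 00:80:C8:CA:A7:E4
```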

Alternately if the router has several NICs, use one for the director and another for the realservers. Route the VIP to the director.

Use transparent proxy to allow the incoming packet to be accepted locally - Horms' method.

See LVS-DR and LVS-Tun for details. The configure script will set this up for you.

3.3 The ARP problem, the first inklings

History: ARP behaviour changed with 2.2.x kernels. Here's the original posting by Wensong and a reply from Alexey Kuznetsov (author of the 2.2 TCP/IP code)

Wensong Zhang wensong@iinchina.net 24 Mar 1999

Today I upgraded the kernel to 2.2.3 with tunneling support on one of the real servers, and found a problem: the Linux 2.2.3 tunnel device answers ARP requests, even if I use the NOARP option as follows:

realserver:# ifconfig tunl0 172.26.20.110 -arp netmask 255.255.255.255 broadcast 172.26.20.110

It still answers the ARP requests. This will greatly affect the virtual server via tunneling work properly. In fact, the tunnel device shouldn't answer the ARP requests from the ethernet. I think it is a bug of linux/net/ipv4/ipip.c, which is now a clone of ip_gre.c not the original tunneling code.

If you are interested, you can test yourself on kernel 2.2.3, choose a free IP address of your ethernet and configure it on the tunl0 device, then telnet to that IP address from other host, I guess you can. Finally, have a look at the ipip.c, maybe you can debug it. :-) --

But, what is the IFF_NOARP flag of the tunnel device for?

kuznet@ms2.inr.ac.ru

IFF_NOARP means that ARP is not used by THIS device. On normal IPIP tunnels it does not make much of sense, but may be used f.e. to turn on/off endpoint reachability detection.

I do not see any reasons to disable answering ARP in such circumstances. Isolation of VPNs on adjacent segments is impossible at routing/arp level, it is just not well-defined behaviour.

If the isolation is made with firewall policy rules, then it is clear that arp policy must be handled at this level too.

In kernel 2.0.x, the tunnel device doesn't answer ARP requests.

Yes.

Yeah, we can have link-local addresses that don't answer ARP requests in kernel 2.2.x. For example, we can configure all the hosts in a network with the following command:

ifconfig lo:0 192.168.0.10 up

There will be no collision. The loopback alias interfaces don't answer ARP requests.

Are you sure? I am not. Please, test.

BTW you risk adding non-loopback addresses on loopback device. They have the HIGHEST preference to be used as router identifier. so that VPN addresses cannot be added to loopback at all.

No, it doesn't fail. I tested it with kernel 2.0.36, it worked.

It does not work under 2.2. To be honest, I am about to stop to understand you. You talk about 2.2, but all your tests are made for 2.0. 8)

3.4 A posting to the mailinglist by Peter Kese explaining the "arp problem"

(saved for posterity by Ted Pavlic, minor editing by Joe)

peter.kese@ijs.si

Before we start, let's assume we have following network configuration for an LVS running LVS-DR.

client          10.10.10.10

gw              192.168.1.1

director        192.168.1.10    IP for admin (director IP)
                192.168.1.110   VIP (responds to arp requests)

real server     192.168.1.11    IP to which each service is listening (realserver IP)
                192.168.1.110   VIP (DOES NOT respond to arp requests)

The virtualserver is the combination of the director and the realserver running LVS.

Our goal is:

  1. Virtual server should respond to arp requests for both the VIP and the director IP.
  2. The realserver should respond to arp requests for the realserver IP but NOT the VIP.
  3. Gateway sends packets for the VIP to the director (load balancer) no matter what.

Problem 1: Interface aliases

Realserver and director need to have an interface with the VIP in order to respond to packets for virtual server. A real interface is not needed, an IP alias will do just fine and this interface alias could be either eth0:0 or lo:0.

On the 2.0 kernels, the ARP responding ability of an interface alias (eg eth0:0) could either be enabled or disabled independently of the main (eth0) interface. If you wanted eth0:0 not to respond to ARP requests, you could simply say:

ifconfig eth0:0 192.168.1.2 -arp up

Thus in the 2.0 kernels it is possible, on a realserver, to have the realserver IP (on eth0) respond to arp requests and for the VIP (on eth0:0) to not respond.

In the 2.2 kernels this doesn't work any more. Whether an interface alias responds to ARP requests or not depends only on the way the real interface is configured. So if eth0 responds to ARP requests (which it normally will), eth0:0 carrying the VIP will also respond to ARP requests no matter what.

This means an ethernet alias (eth0:0) is not permitted on real servers, because real servers should not respond to ARP requests for the VIP.

On the other hand, loopback aliases never respond to ARP requests, which means that the loopback alias (lo:0) must not be used on the director for the VIP.

Problem 2: Loopback aliases

I haven't done much checking on loopback interface problem, but it seems that if an alias is used on a loopback interface (as is required for LVS-DR) on a real server running kernel 2.2.x, the whole ARP gets screwed.

It appears that loopback interfaces get special ARP treatment in the kernel, so I suggest avoiding the loopback aliases as a whole.

The question now is: What kind of an interface can I use on real servers?

As I already noted, eth0:0 alias can not be used, because such aliases respond to ARP requests. lo:0 aliases can not be used, because they make ARP problems too.

In case of tunneling VS configuration, the answer is trivial: tunl0. But to be honest, tunl0 interface can also be used for direct routing.

(from Joe, the dummy device is OK too)

With direct routing, the only thing we need an interface for is to let the kernel know we possess an additional IP address. This means we can set up any kind of an interface, as long as it doesn't respond to ARP requests. Instead of tunl0, you could also set up a ppp0, slip0, eth1 or whatever. I suggest setting up a tunl0:

ifconfig tunl0 192.168.1.2 -arp up

Problem 3: Real server ARP requests.

Suppose we have set up a virtual server as described at the beginning. All computers are running, but no requests have been made.

Then the client sends a request to the VIP.

When the packet arrives to gateway, the gateway makes an ARP query for the VIP and the director responds. Gateway remembers the director's MAC address and sends the packet to the director. Director receives the packet, looks up its ipvsadm/LVS tables and chooses the real server and forwards the packet to the real server by direct routing or tunneling method.

Real server receives the packet and generates a response packet with destination=client, source=VIP.

(until now everything works correctly)

When real server wants to send the response packet to the gateway, it finds out, that it does not know the gateway's MAC address.

It sends an ARP request to the local network and asks for the gateway MAC address. This should look like:

ARP, who has 192.168.1.1 (gw), tell 192.168.1.11 (realserver IP)

But in reality, real server asks something like:

ARP, who has 192.168.1.1 (gw), tell 192.168.1.110 (VIP),

because it takes the source address from the packet it wants to send.

Here the problems come in.

Gateway receives the packet and responds to it, which is correct. But at the same time, the gateway does a little optimization. It finds out that the realserver's MAC address is not listed in its ARP tables and adds the entry into the table, just in case it might need that address in the near future.

The ARP request contained the VIP address and the realserver's MAC address, so from now on, the gateway will send all packets destined for the VIP to the real server instead (due to the MAC address). This means all packets that follow will avoid the virtual server as a whole and be answered by the realserver.

If the real server's ARP request would be:

ARP, who has 192.168.1.1 (gw), tell 192.168.1.11 (realserver IP)

all this would not have happened. Therefore I have patched the 2.2 VS kernel in such a way, that it composes ARP requests based on the address of the interface selected by the routing tables instead of the address taken from the packet itself.

In order for virtual server to work correctly, the real servers should have patched kernels as well, or at least copy the patched /usr/src/linux/net/ipv4/arp.c file to the real servers before compiling the kernels.

Conclusion

Those were my experiences with ARP problems and the 2.2 kernel virtual server.

I think it would be wise to add this letter to the web site and notify the network developers about our findings at some point in time.

Here are some golden rules I stick to, when I do virtual server configuration:

Rule 1:
        Do not use lo:0 alias on the director.
        Use eth0:0 alias instead.

Rule 2:
        Avoid using lo:0 alias, not even on realservers.
        Use tunl0 or some other simulated interface
        on real servers instead. (Joe: use dummy0)


Rule 3:
        Apply the VS patch to kernels on real servers.

3.5 random mailings on the arp problem

symptoms of realservers arp'ing :

Stephen Williams sdw@lig.net, Stephen wrote one of the patches that stop devices in 2.2.x kernels from replying to arp requests

If you don't use the patch you'll find that the 'active' box will bounce from machine to machine as each one sends an ARP reply that is heard last. Additionally you will get TCP Reset's as connections that were on one box suddenly start going to others. Very nasty and unusable.

This is called Lars' method

Lars

I have thought about how the ARP problem can occur at all with direct routing, because I never noticed it. Then it occurred to me that your VIP comes from the same subnet as the RIP of the LVS and also all the real servers share this media.

To avoid the "ARP problem" in this case without adding a kernel patch or anything else, you can just add a direct route for the VIP using the RIP of the LVS as a gateway address on the router in front of the LVS. ("ip route VIP 255.255.255.255 real_ip" on a Cisco, or "route add -host VIP gw RIP" on Linux)

Since I just used 2 ethernet cards and had the LVS act as gateway/firewall anyway, I never noticed the ARP problem. (We have 2 LVS in a standby configuration to eliminate the SPOF)

The arp problem is handled if the router in front of the director has a static route for the VIP to the director (i.e. packets for the VIP from the outside world are sent to the director and cannot get to the realservers).

Wensong

For the clients who reach the virtual server through the router, there is no problem if a static route for VIP is added.

However, for the clients who are in the network of the virtual server, the "ARP problem" will arise. There is a fight over the ARP response, and the clients don't know whether to send the packets to the load balancer or the real server.

In my point of view, the VIP address is shared by the director and realservers in LVS-Tun or LVS-DR, only the director does ARP response for VIP to accept request packets, and the realservers has the VIP but don't, so that they can process packets destined for VIP.

Joe, 21 May 2001

Was looking at the ip notes and it says

ip arp on|off

--change NOARP flag on the device

NB. This operation is not allowed if the device is in state UP.
Though neither ip utility nor kernel check for this condition, you can
get unpredictable results changing the flag while the device is running.

Julian Anastasov ja@ssi.bg 21 May 2001

This is the device ARP flag, same as ifconfig [-]arp. The flag is used to allow ARP packets for the specified device. It is correct that "lo" does not talk ARP, but you connect to the VIPs on "lo" through eth*, so the flag is of no help for LVS. We can't drop the flag for eth device.

Andreas J. Koenig, 02 Jun 2001

kernel 2.4.5 has arp_filter

Julian Anastasov ja@ssi.bg

arp_filter does not solve the ARP problem for LVS

This is a new proposal to control the ARP probes and replies based on route flag "noarp". It will be discussed on the netdev mailing list and may be something like this is going to be included in 2.4, may be in 2.2 too, not sure. All you know that the hidden feature is not considered to 2.4. The net developers have the final word. I'll try to maintain the hidden flag in all next kernels while this flag is more usable than the new feature and because the hidden flag has other semantic. And because may be there are some user space tools that rely on this.

3.6 Is the arp behaviour of 2.2.x kernel a bug?

Julian Anastasov is replying to correct an error in a previous version of the HOWTO where I state that the dummy0 device in 2.2.x kernels does not arp. Julian wrote one of the realserver patches which fix the "arp problem".

In fact, the documentation is incorrect. There is no difference, all devices are reported in the ARP replies: lo, tunl and dummy. So, only the ARP patch can solve the problem. This can be tested using this configuration with any device (before the patch applied):

Host A:
         eth:x 192.168.0.1

Host B:
         eth:x 192.168.0.2
         lo, dummy, tunl: 192.168.0.3

On host A try: ping 192.168.0.3

Host B replies for 192.168.0.3 through 192.168.0.2 device

So, the ARP problem means: "All local interfaces are reported" until the ARP patch is used. In fact, all ARP patches which use IFF_NOARP to hide the interface are incorrect. I don't expect them in the kernel.

Stephen Williams, who wrote another of the patches to fix the arp problem.

Of course the ARP code in the kernel needs to be fixed so my filter code isn't needed. Still, I'm confused by this statement. The IFF_NOARP flag determines whether a device arp replies or not. What's wrong with honoring that?

If you mean that arp replies should never be sent on another interface, that is what I currently believe to be correct.

Julian

My understanding is that 2.2.x ARP code is not buggy and there is no need to be "fixed". I must say that your patch is working for the LVS folks but not for all linux users.

IFF_NOARP means "Don't talk ARP on this device", from the 'man ifconfig':

[-]arp Enable or disable the use of the ARP protocol on this interface.

So, where is the bug ? The ARP code never talks through lo, dummy and tunl devices when they are set NOARP. It uses eth (ARP) device. If You hide all NOARP interfaces from the ARP protocol this is a bug. One example:

 +--------+ppp0                          +------+
 | Host A |------------ppp link----------|ROUTER|------ The World
 +--------+A.B.C.1 (www.domain.com)      +------+
   |eth0
   |A.B.C.2
   |
   |A.B.C.3
 +--------+
 | Host B |
 +--------+

Is it possible after your patch Host B to access www.domain.com ? How ? Host A doesn't send replies for A.B.C.1 through eth0 after your patch. OK, may be this is not fatal. Tell it to all kernel users. You hide all their NOARP interfaces. May be there are other examples where this is a problem too. Or may be there is something wrong in this configuration?

I want to say that this patch hurts all users if present in the kernel. On Nov 6 I posted one patch proposal to the linux-kernel list which adds the ability to hide interfaces from the ARP queries and replies. But the difference is that only specified interfaces are not replied, not all NOARP interfaces. Its arp_invisible sysctl can be used by LVS folks to hide lo, tunl or dummy interfaces but this feature doesn't hurt all kernel users. I think, this patch is more acceptable and can be included in the 2.2 kernel, may be after some tuning. And I'm still expecting comments from the net folks and from all LVS users.

3.7 How to tell if an interface is replying to arp requests

on the machine with that IP (usually the VIP)

$ ping VIP

look in /proc/net/arp for MAC address

on a machine on a network (eg 192.168.1.0/24) to see which addresses are replying to arp requests

$ ping 192.168.1.255

then before the arp tables expire (15secs - 2mins depending on the OS)

$ arp -a
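Looking up the MAC address by eye gets tedious when you are testing repeatedly. The following sketch pulls the MAC address for one IP out of /proc/net/arp and compares it with the MAC address you expect (normally the director's). It assumes the usual /proc/net/arp layout, with a header line, the IP address in the first column and the hardware address in the fourth. The function names mac_for_ip and vip_points_at are made up for this example.

```shell
#!/bin/sh
# Sketch only: mac_for_ip and vip_points_at are illustrative names.
# Assumes /proc/net/arp has a header line, the IP in column 1 and the
# hardware (MAC) address in column 4.

# usage: mac_for_ip <ip> [arp_table_file]
# prints the MAC address(es) listed for <ip>; prints nothing if none
mac_for_ip() {
    awk -v ip="$1" 'NR > 1 && $1 == ip { print $4 }' "${2:-/proc/net/arp}"
}

# usage: vip_points_at <vip> <expected_mac> [arp_table_file]
# returns 0 only if <vip> has exactly the expected MAC (case-insensitive)
vip_points_at() {
    found=$(mac_for_ip "$1" "${3:-/proc/net/arp}" | tr 'A-F' 'a-f')
    want=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ -n "$found" ] && [ "$found" = "$want" ]
}

# For example, after pinging the VIP from the client:
#   vip_points_at 192.168.1.110 00:80:C8:CA:A7:E4 && echo "VIP is on the director"
```

If the VIP shows up with a realserver's MAC address, or with different MAC addresses on successive runs, a realserver is replying to arp requests for the VIP.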

3.8 Arp caching defeats Heartbeat switchover

From: Claudio Di-Martino claudio@claudio.csita.unige.it

I've set up a VS using direct routing composed of two linux-2.2.9 boxes with the 0.4 patch applied. The load balancer acts as a local node too. I configured mon to monitor the state of the services and update the redirect table accordingly. I also configured heartbeat so that when the load balancer fails the second machine takes over the virtual ip, sets up the redirect table and starts mon. When the load balancer restarts, the backup reconfigures itself as a real server, drops the interface alias that carries the virtual ip, stops mon, clears the redirect table. Although the configuration of the two machines is set up correctly it fails to restore the load balancer due to arp caching problems.

It seems that the local gateway keeps routing requests for the virtual ip to the load balancer backup. Sending gratuitous arp packets from the load balancer doesn't have effect since the interface of the backup is still alive and responding.

Has anyone encountered a similar problem and is there a hack or a proper solution to take back control of the virtual ip?

From: "Antony Lee" AntonyL@hwl.com.hk

I am new to LVS and I have a problem in setting up two LVSes for failover. The problem is related to the ARP caching of the primary LVS' MAC address in the real servers and the router connected to the Internet. The problem leads to all the Internet connections stalling until the ARP caches in the Web Servers and router have expired. Can anyone help to solve the problem by making some changes in the Linux LVS? (I am not able to change the router's ARP cache time. The router is owned by the Web hosting company, not by me.)

In each LVS, there are two network card installed. The eth0 is connected to a router which is connected to the Internet. The eth1 is connected to a private network which is the same segment as the two NT IIS4.

The eth0 of the primary LVS is assigned an IP address 202.53.128.56
The eth0 of the backup LVS is assigned an IP address 202.53.128.57
The eth1 of the primary LVS is assigned an IP address 192.128.1.9
The eth1 of the backup LVS is assigned an IP address 192.128.1.10

In addition, both primary and backup LVS have enabled the IPV4 FORWARD and
IPV4 DEFRAG. In the file /etc/rc.d/rc.local the following command was also
added:
ipchains -A forward -j MASQ -s 192.168.1.0/24 -d 0.0.0.0/0

I use the piranha to configure the LVS so that the two LVS have a common IP address 202.53.128.58 in the eth0 as eth0:1. And have a IP address 192.128.1.1 in the eth1 as eth1:1

The pulse daemon is also run automatically when the two LVSes boot.

In my configuration, the Internet clients can still access our web server when one of the NT machines is disconnected from the LVS. The backup LVS --CAN AUTOMATICALLY-- take up the role of the primary LVS when the primary LVS is shut down or disconnected from the backup LVS. However, I found that none of the NT web servers can reach the backup LVS through the common IP address 192.128.1.1, and all the Internet clients stall when connecting to our web servers.

Later, I found that the problem may be due to the ARP caching in the web servers and the router. I tried limiting the ARP cache time to 5 seconds on the NT servers and half of the problem was solved, i.e. the NT web servers can reach the backup LVS through the common IP address 192.128.1.1 when the primary LVS is down. However, the Internet clients still cannot connect when the LVS failover occurs.

Wensong

I just tried two LVS boxes with piranha 0.3.15. When the primary LVS stops or fails, the backup will take over and send out 5 Gratuitous Arp packets for the VIP and the NAT router IP respectively, which should clean the ARP caching in both the web servers and the external router.

After the LVS failover occurs, the established connections from the clients will be lost in the current version, and the clients need to reconnect to the LVS.

.. 5 ARP packets for each IP address, and 10 for both the VIP and
the NAT router IP. I saw the log file as follows:

Mar  3 11:12:14 PDL-Linux2 pulse[4910]: running command "/sbin/ifconfig" "eth0:5" "192.168.10.1" "up"
Mar  3 11:12:14 PDL-Linux2 pulse[4908]: running command "/usr/sbin/send_arp" "-i" "eth0" "192.168.10.1" "00105A839CBE" "172.26.20.255" "ffffffffffff"
Mar  3 11:12:14 PDL-Linux2 pulse[4913]: running command  "/sbin/ifconfig" "eth0:1" "172.26.20.118" "up"
Mar  3 11:12:14 PDL-Linux2 kernel: send_arp uses obsolete (PF_INET,SOCK_PACKET)
Mar  3 11:12:14 PDL-Linux2 pulse[4909]: running command "/usr/sbin/send_arp" "-i" "eth0" "172.26.20.118" "00105A839CBE" "172.26.20.255" "ffffffffffff"
Mar  3 11:12:17 PDL-Linux2 nanny[4911]: making 192.168.10.2:80 available

I don't know if the target addresses of the 2 send_arp commands are set correctly. I am not sure if it is different when broadcast or source IP is used as target address, or any target address is OK.

Horms

Are there just 5 ARPs, or 5 to start with and then more gratuitous ARPs at regular intervals? If the gratuitous ARPs only occur at fail-over then once the ARP caches on hosts expire there is a chance that a failed host - whose kernel is still functional - could reply to an ARP request.

wanger@redhat.com

When we put this together, I talked to Alan Cox about this. His opinion was to send 5 ARPs out, 2 seconds apart. If there is something out there listening and cares, then it will pick it up.

The way piranha works, as long as the kernel is alive, the backup (or failed node) will not maintain any interfaces that are Piranha managed. In other words, it removes any of those IPs/interfaces from its routing table upon failure recovery.

3.9 The device doesn't reply to arp requests, the kernel does.

ARP requests/replies are thought of as coming from a device and people make statements like

"the dummy device in 2.0.x kernels does not reply to arp requests while the same device in 2.2.x kernels does reply".

It is the kernel that handles arp requests according to a set of rules and not the device. The code for the dummy device is the same in 2.0.x and 2.2.x kernels and is not responsible for the change in arp behaviour.

(The RFC for ARP is at ftp://ftp.isi.edu/in-notes/std/std37.txt - also see rfc826 and rfc1122. The model system used there is 2 machines on a single ethernet. It doesn't shed any light on the implementation of ARP on multi-interface systems like LVS.)

3.10 Properties of devices for the VIP

In a previous version of the HOWTO I stated that the dummy0 device did not arp in 2.2.x kernels and therefore could be used as the device for the VIP on an unpatched 2.2.13 realserver. Julian Anastasov replied that they did arp (see below for his posting and the ensuing discussions).

I hadn't actually tested whether the dummy0 device arp'ed but had concluded that it wasn't arp'ing because I had a working LVS using the dummy0 interface for the VIP on unpatched 2.2.x realservers and because as everyone knows ;-) an LVS needs to have a non-arp'ing device on the VIP of the realservers.

I had a LVS-DR LVS which worked with dummy0, lo:0 and tunl0 as the VIP device and which on further testing, I found also worked with eth0:1 or eth1 as the VIP device on 2.2.13 realservers. Whatever the arp'ing status of dummy0, lo:0 or tunl0, clearly eth1 replies to arp requests, so despite the conventional wisdom, it is possible to build an LVS with arp'ing VIP's on the realservers.

On investigating why this LVS worked, I found that the MAC address for the VIP in the client's arp cache (# arp -a) was always the director. I assume this was because the director is 3-4x the speed of the other machines in the LVS and it replies to arp requests first for the VIP (another posting from Stephen Williams says that the address which replies last is stored in the arp cache - we'll figure out what's really going on here eventually). On another LVS where the realservers were all identical hardware with 2.2.13 unpatched kernels, one particular realserver always was the machine in the client's arp cache for the VIP (to check, delete entry for VIP with arp -d, then ping again, then look in arp cache).

I found that I could get a working LVS using almost anything to hold the VIP on the realservers, including eth0:1 and eth1 (another NIC in the realserver). These devices carrying the VIP were pingable from the client and I could get the corresponding MAC addresses in the arp table of the client if the director was not setup with a VIP. When I setup a working LVS this way, I found each time that the MAC address for the VIP in the client's arp cache was the director's MAC address. For some reason, that I don't know, whenever the client does an arp request for the VIP, it gets the director's MAC address.

Possible reasons for the MAC address of the director always being associated with the VIP in my LVS -

1. I configure the director first and then the realservers. I don't make requests for a service till the realservers are setup. (Still I can't imagine the client asking for the MAC address of the VIP until it makes a connect request.)

2. The director is 3 times faster (CPU speed) than the next machine in the LVS and it always replies to arp request first.

3. I was lucky.

Since you can make a working LVS-DR LVS with the realserver VIP on an arp'ing eth0:1 device I decided that the relevant piece of information about arp'ing was (ta da!)

* an LVS will work if the client always gets the MAC address of the director when it asks for the MAC address of the VIP *

This provides an easy solution - you tell the client (or the router) the MAC address of the VIP with arp -s or arp -f .

here's my /etc/ethers

lvs.mack.net 00:A0:CC:55:7D:47

After installing the MAC address of the DIP (director) as the MAC address of the VIP (lvs) in the arp table (`$arp -f /etc/ethers`) I get

client:/usr/src/temp/lvs# arp -a
realserver1.mack.net (192.168.1.1) at 00:90:27:66:CE:EB [ether] on eth0
lvs.mack.net (192.168.1.110) at 00:A0:CC:55:7D:47 [ether] PERM on eth0
director.mack.net (192.168.1.10) at 00:A0:CC:55:7D:47 [ether] on eth0

notice the "PERM" in the VIP entry on the client.

removing the permanent entry

client:/usr/src/temp/lvs# arp -d lvs.mack.net
client:/usr/src/temp/lvs# arp -a
realserver1.mack.net (192.168.1.1) at 00:90:27:66:CE:EB [ether] on eth0
lvs.mack.net (192.168.1.110) at <incomplete> on eth0
director.mack.net (192.168.1.10) at 00:A0:CC:55:7D:47 [ether] on eth0

If I edited /etc/ethers changing the MAC address of lvs to anything else, the LVS did not work anymore. So the arp information is coming from /etc/ethers rather than some uncontrolled variable I'm not aware of.
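
The check described above (look in the client's arp cache and see whose MAC address is listed for the VIP) can be scripted. Here's a minimal sketch in Python, assuming `arp -a` output in the layout shown above. The function names parse_arp_table and shares_mac_with are made up for this example, and other versions of net-tools may print a different layout.

```python
import re

# Matches lines in the style of the `arp -a` output shown above, e.g.
#   lvs.mack.net (192.168.1.110) at 00:A0:CC:55:7D:47 [ether] PERM on eth0
#   lvs.mack.net (192.168.1.110) at <incomplete> on eth0
ARP_LINE = re.compile(
    r"^(?P<host>\S+) \((?P<ip>[\d.]+)\) at (?P<mac>\S+)"
    r"(?: \[(?P<hwtype>\w+)\])?(?P<flags>(?: [A-Z]+)*) on (?P<dev>\S+)$"
)


def parse_arp_table(text):
    """Return {ip: (mac or None, is_permanent)} from `arp -a` style output."""
    table = {}
    for line in text.splitlines():
        m = ARP_LINE.match(line.strip())
        if not m:
            continue  # skip anything that is not an arp entry
        mac = m.group("mac")
        mac = None if mac == "<incomplete>" else mac.upper()
        is_permanent = "PERM" in m.group("flags").split()
        table[m.group("ip")] = (mac, is_permanent)
    return table


def shares_mac_with(table, ip):
    """Return the other IPs in the table that have the same MAC as `ip`."""
    mac = table.get(ip, (None, False))[0]
    if mac is None:
        return []
    return sorted(other for other, (m, _) in table.items()
                  if other != ip and m == mac)
```

Fed the first listing above, shares_mac_with(table, "192.168.1.110") returns the director's IP, which is what you want to see on the client of a working LVS.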

I had thought that in an LVS with the VIP on realservers on an arping device that the VIP would hop from one machine to another (see the postings in the MISC section). Since naturally occurring LVS's with arping VIP's on realservers existed and worked well (mine), I set up an LVS by making a permanent entry for the VIP of the director in the arp cache of the client (router). This can be done by

$ arp -f /etc/ethers
or
$ arp -s 192.168.1.110 MAC_ADDRESS

There are 2 results of this

  1. the realservers can have the VIP on an arp'ing device (eg eth0:1, eth1) - you don't need lo or dummy0, tunl0 for realservers with 2.0.36 and 2.2.x kernels.
  2. If two (or more) directors are setup in failover mode, the mechanism for changing the VIP from one to another is broken by making a permanent entry for the VIP on the director in the arp cache of the router. This is not a problem for a test setup to demonstrate an LVS but may be a problem in a high availability environment (a solution may be found in the meantime too).
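
Both results follow from how a permanent entry behaves. Here's a toy model in Python, not the kernel's code. The class ArpCache is invented for the illustration, and the model assumes that a later ARP reply overwrites a dynamic entry but never a permanent one (which machine wins the race for a dynamic entry is left open in the text above).

```python
class ArpCache:
    """Toy ARP cache: dynamic entries can be overwritten, permanent ones cannot."""

    def __init__(self):
        self._entries = {}  # ip -> (mac, is_permanent)

    def learn(self, ip, mac):
        """Record a MAC seen in an ARP reply or gratuitous ARP.

        Returns False if a permanent entry blocked the update.
        """
        entry = self._entries.get(ip)
        if entry is not None and entry[1]:
            return False
        self._entries[ip] = (mac, False)
        return True

    def set_permanent(self, ip, mac):
        """Equivalent in spirit to `arp -s ip mac`."""
        self._entries[ip] = (mac, True)

    def lookup(self, ip):
        entry = self._entries.get(ip)
        return entry[0] if entry else None


VIP = "192.168.1.110"
DIRECTOR_MAC = "00:A0:CC:55:7D:47"    # from the /etc/ethers example above
REALSERVER_MAC = "00:90:27:66:CE:EB"  # realserver1 in the arp -a listing above
```

With set_permanent(VIP, DIRECTOR_MAC) on the client, a reply from an arp'ing realserver (learn(VIP, REALSERVER_MAC)) is ignored, which is result 1. A gratuitous arp from a backup director is ignored in exactly the same way, which is result 2.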

The normal method for changing directors (eg with heartbeat) includes a gratuitous arp. To force a gratuitous arp

Julian

You can use Yuri Volobuev's send_arp.c from the 'fake' package or Alexey Kuznetsov's arping from its iputils package:
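
For illustration, here is roughly what such a tool puts on the wire. This is a sketch in Python, not the code of send_arp or arping. The function names are made up, and it assumes the common form of a gratuitous ARP: a broadcast ARP request in which the sender IP and the target IP are both the address being announced. Sending needs Linux and root. The defaults of 5 packets, 2 seconds apart, follow the piranha behaviour discussed earlier in this section.

```python
import socket
import struct
import time


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' (or 'aabbccddeeff') to 6 raw bytes."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("MAC address must be 6 bytes: %r" % mac)
    return raw


def gratuitous_arp_frame(ip, mac):
    """Build a 42 byte ethernet frame announcing that `ip` is at `mac`."""
    src_mac = mac_to_bytes(mac)
    ip_raw = socket.inet_aton(ip)
    broadcast = b"\xff" * 6
    # ethernet header: destination, source, ethertype 0x0806 (ARP)
    eth_header = broadcast + src_mac + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += src_mac + ip_raw    # sender hardware and protocol address
    arp += broadcast + ip_raw  # target hardware address, target IP = sender IP
    return eth_header + arp


def send_gratuitous_arp(ifname, ip, mac, count=5, interval=2.0):
    """Send the frame `count` times on `ifname` (Linux only, needs root)."""
    frame = gratuitous_arp_frame(ip, mac)
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((ifname, 0))
        for i in range(count):
            s.send(frame)
            if i < count - 1:
                time.sleep(interval)
```

A machine taking over the VIP would call something like send_gratuitous_arp("eth0", "192.168.1.110", its_own_mac). Whether a particular router takes any notice of the packet is a separate question.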

Here's some tests I did

LVS equipment: 2.2.13 client, and 0.9.4/2.2.13 director.
2 realservers
a) 2.0.36 kernel, libc5, gcc-2.7.2.3, net-tools 1.42.
b) 2.2.13 kernel, glibc, gcc-2.95,    net-tools 1.52

Experiment 1: Result - arp'ing is independent of [-]arp

Summary: the -arp/+arp option for ifconfig had no effect on any devices back to 2.0.36 kernels with net-tools 1.42. If a device normally arps then -arp had no effect; if it normally doesn't arp, then "arp" doesn't turn it on (data below).

Method: IP=192.168.1.1/24 with VIP=192.168.1.110/32. The VIP was on dummy0. The test was to see if the VIP was pingable from another (external) machine on the 192.168.1.0/24 network or pingable from the machine itself (ie internally from the console). (I assume I had a route add -host for the VIP although I didn't record this). The test was done with ifconfig using arp or -arp (the output of ifconfig -a didn't change)

                 -----2.0.36------- -----2.2.13------
ping from        internal  external internal external
VIP device
dummy   ARP        +         -        +        +
        NOARP      +         -        +        +
        down       -         -        -        - (control)

Experiment 2: Can the VIP be on a separate NIC?

Summary: yes, as long as the NIC doesn't have a cable plugged into it.

Method: same as above except VIP on eth1 (another NIC).

                 -----2.0.36-------
ping from        internal  external
VIP device
eth1 has cable connected to 192.168.1.0 network
eth1    ARP        +         +
        NOARP      +         +

eth1 cable to network removed
eth1    ARP        +         -
        NOARP      +         -
        works as realserver in LVS - yes

One of the reasons a no_arp interface is used on the realserver is that it is not visible to the rest of the network. Does the LVS work if the eth1 VIP on the realserver is not visible to the rest of the network?

Conclusion: for 2.0.36 dummy0 doesn't arp, and eth1 does arp. The arp/-arp option to ifconfig has no effect on arp behaviour. LVS works with both dummy0 and eth1, I assume since the VIP need only be resolved as local on the realserver and does not need to be visible to the network.

Experiment 3: What devices and netmasks are necessary for a working LVS?

Using the /etc/ethers approach for setting the MAC address of the VIP I then set up an LVS with a pair of realservers serving telnet. All IPs are 192.168.1.x, all machines have a route to 192.168.1.0 via eth0. There is no default route.

1. 2.0.36, libc5, gcc 2.7.2.3, net-tools 1.42
2. 2.2.13, glibc-2.1.2, gcc-2.95, net-tools 1.52

with the following devices holding the VIP, tunl0, eth0:1, lo:0, dummy0, eth1. In each case there was no route entry for the VIP device and there was no cable connected to eth1 when it was used for the VIP. The table below shows whether the LVS worked. The VIP is installed with

ifconfig $DEVICE 192.168.1.110 netmask $NETMASK broadcast $BROADCAST

with $NETMASK="255.255.255.255" $BROADCAST="192.168.1.110"
or   $NETMASK="255.255.255.0"   $BROADCAST="192.168.1.255"

the results belong to 1 of 3 groups

+ works fine
- doesn't work (at $ prompt on client get
  "unable to connect to remote host.  Protocol not available"
  then client returns to regular unix $ prompt)
hang - client hangs, realserver cannot access network anymore,
  have to run rc.inet1 from console prompt on realserver to
  start network again.

netmask of VIP=255.255.255.255 (normal LVS setup)

LVS type  -----LVS-Tun------     ----LVS-DR------
kernel    2.0.36     2.2.13     2.0.36   2.2.13

VIP on
tunl0      +           +         +         +
eth0:1     +           -         +         +
lo:0       +           -         +         +
dummy0     +           -         +         +
eth1       +           -         +         +

netmask of VIP=255.255.255.0 (not normally used for LVS)

VIP on
tunl0      +           +         +         +
eth0:1     +           -         +         +
lo:0       +           hangs     +         hangs
dummy0     +           -         +         +
eth1       +           -         +         +

It would seem that any device and any netmask can be used for the VIP on a 2.0.36 realserver for both LVS-Tun and LVS-DR.

For a 2.2.13 realserver:

LVS-Tun: VIP on a tunl0 device only, any netmask (ie you need tunl0 on LVS-Tun with 2.2.x kernels)
LVS-DR:  lo:0 device, netmask /32 only
         all other devices, any netmask

For LVS-DR then on solaris/DEC/HP/NT... LVS can probably use a regular eth0 device rather than an lo:0 device (more work for Ratz to do :-).

Does anyone know why the lo:0 device has to be /32 for LVS-DR on kernel 2.2.13 while the other devices can be /24?

Jean-Francois Nadeau jna@microflex.ca 6 Dec 99

In kernel 2.2.1x, with a virtual interface on lo:0 and a netmask of 255.255.255.0, the interface no longer arps.

Does anyone know why only the tunl0 device works for LVS-Tun on 2.2.x kernels?

Experiment 4: Effect of route entry for VIP and connection to VIP

The VIP normally has an entry in the routing table, eg

route add -host 192.168.1.110 $DEVICE

I found in Experiment 2 that a route entry was not necessary for the LVS to work when the realserver had the VIP on eth0:1. Since I had always used a route entry for the VIP I wanted to find out when it was needed. The same LVS was used as for Experiment 3. The variables were

1) a route entry/no route entry for VIP/32
2) for eth1 whether the NIC was connected to the network by a cable.



kernel            ------2.0.36-------     -------2.2.13-------
VIP               eth1 eth1_nc eth0:1     eth1  eth1_nc eth0:1

no route
   LVS             +     +      +          +      +       +
   ping internal   -     -      -          +      +       +
   ping external   +     -      +          +      +       +

route
   LVS             +     +      +          +      +       +
   ping internal   +     +      +          +      +       +
   ping external   +     -      +          +      +       +

Conclusion 1: LVS works for both cases of route/no_route for the VIP for eth0:1 and eth1 (ie you don't need a route entry for the VIP on the realservers).

Conclusion 2: having a network cable/no network cable does not affect whether the LVS works.

Conclusion 3: for 2.0.36 kernels you can choose to have the VIP pingable from the outside world but not pingable by the local host by having it on eth1 with a cable connection (this seems weird and I can't think of any use for it just yet) or the reverse - pingable from the localhost but not by the external world by not having a cable connection.

(Note: using a host's routable IP as the target - the IP on eth0 say - you can make a host unpingable from the console if you down the lo. The host is still pingable from elsewhere on the net.)

3.11 Topologies for LVS-DR and LVS-Tun LVS's

Traditional

The conventional LVS-DR/LVS-Tun topology which allows maximum scalability has each realserver with its own default gateway (to a router). (In a routerless test setup, the client would be the default gateway for the realservers. In a setup which is not network bound, i.e. is disk- or compute-bound, only one router may be needed. The changes in topology/routing are made by changing the IP of the default gw for the realservers)

Some method of handling the arp problem is needed here.

The packets sent to the realservers from the director generate replies which go directly to the client. Failure messages (eg if a realserver is not available) do not get returned to the director, which cannot tell if a realserver has failed (see discussion of monitoring agents).


                       -------------clients-----------------------
                       |                         |       |       |
                    (router)                  (router)(router)(router)
                       |                         |       |       |
         __________    |                         |       |       |
        |          |   |    VIP                  |       |       |
        | director |---     DIP                  |       |       |
        |__________|   |                         |       |       |
                       |                         |       |       |
                       |                         |       |       |
        ---------------------------------        |       |       |
        |              |                |        |       |       |
        |              |                |        |       |       |
       RIP1           RIP2             RIP3      |       |       |
       VIP            VIP              VIP       |       |       |
 _____________   _____________   _____________   |       |       |
|             | |             | |             |  |       |       |
| realserver  | | realserver  | | realserver  |  |       |       |
|_____________| |_____________| |_____________|  |       |       |
        |              |                |        |       |       |
        |              |                ----------       |       |
        |              -----------------------------------       |
        ----------------------------------------------------------

Director sees replies

(from Julian Anastasov)

This discussion led to Julian's martian modification.

If the default gw for each realserver is changed to the DIP (see the Martian modification section) then

1. The director has to handle the reply packets as well as the incoming packets, doubling the network load.

2. The director sees all the reply packets. Connection failure can be detected (in principle).


                        clients
                           |
                         router
                           |
             __________    |
            |          |   |    VIP
            | director |---     DIP
            |__________|   |
                           |
                           |
          ------------------------------------
          |                |                 |
          |                |                 |
         RIP1             RIP2              RIP3
         VIP              VIP               VIP
   _____________     _____________     _____________
  |             |   |             |   |             |
  | realserver  |   | realserver  |   | realserver  |
  |_____________|   |_____________|   |_____________|

Here's the original posting by Horms horms@vergenet.net

Hi, I have been setting up a test network to benchmark IPVS, the topology is as follows.

       node-1      node-6     node-7
       (client)   (client)   (client)
           |         |          |        client-net
  ---------+---------+----------+------ 192.168.2.0/24
                     |
                   node-3 (router)
                     |                   server-net
      ------+--------+----------+---     192.168.1.0/24
            |        |          |
         node-2    node-4     node-5
         (IPVS)   (server)   (server)

The question that I have is that the network I would really like to be testing is:

       node-1       node-6     node-7
       (client)   (client)   (client)
           |         |          |        client-net
  ---------+---------+----------+------ 192.168.2.0/24
                     |
                   node-2 (IPVS)
                     |                   server-net
      ---------+-----+----+---------     192.168.1.0/24
               |          |
             node-4     node-5
            (server)   (server)

.. other than using NAT, which has performance problems, is this possible? I tried this topology with direct routing and packets from the clients were multiplexed to the servers fine, but return packets from the servers to the client were not routed by the IPVS box.

Lars

Yes. The LVS box silently drops the return packets, since they have a src ip which is also bound as a local interface on the LVS. This is meant to be a simple anti-spoofing protection.

(Note from Joe: The return packet from the realserver has src=VIP, dest=CIP. If this packet is routed via the director, which also has the VIP, the director will be receiving a packet from another machine with the src being one of its own IPs, and the director will drop the packet.)

You can enable logging these packets via
echo 1 >/proc/sys/net/ipv4/conf/all/log_martians
The only way around this with current Linux kernels is to disable the check in the kernel source or to use a separate box as the outward gateway. (Which is how DR is meant to be used for full performance)
This is not a problem as such, as it probably makes a lot of sense not to use an IPVS box as your gateway router.
Actually it makes a lot of sense to do just that IMHO. Less points of failure, less hard- & software to duplicate in a failover configuration.
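
The check that causes the drop can be pictured like this. It's a sketch in Python of the rule as Lars and Joe describe it, not the kernel's routing code. The function name is made up, and the DIP/VIP values are the example addresses used earlier in this section.

```python
import ipaddress


def is_martian_source(src, local_ips):
    """True if a packet arriving from the network claims one of our own IPs as its source."""
    local = {ipaddress.ip_address(a) for a in local_ips}
    return ipaddress.ip_address(src) in local


# A director holding the DIP and the VIP (example values)
DIRECTOR_IPS = ["192.168.1.10", "192.168.1.110"]
```

A reply from a realserver has src=VIP, so is_martian_source("192.168.1.110", DIRECTOR_IPS) is true and the director drops the packet instead of routing it. A packet with src=RIP would pass the same check.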

Ray Bellis rpb@community.net.uk

It needs to be made more explicit in the documentation that LVS-DR will *only* work if you have a different return path.

Lars Marowsky-Bree lmb@teuto.net

... or if you have a suitably patched kernel.

We spent several man days trying to get this to work before figuring out why the packets were being dropped, at which point we had no alternative but to use LVS-NAT instead.
I agree. We still assume too much knowledge on the network admin side.
FYI, we have our LVS system working now, with LVS redundancy achieved by running OSPF routing (gated) on the LVS-NAT servers and having the VIP within the same IP subnet as the RIPs so that IGP routing policies automatically determine which LVS router the packets arrive on.
Yes, that's one option. Even better than heartbeat and IPAT, if all your systems support running a routing protocol. (IPAT = IP address takeover, part of heartbeat) In essence, heartbeat & IPAT is nothing but reinventing a subset of the functionality of a hardened routing protocol like OSPF/RIPv2/EIGRP.

On other schemes for director/realservers to exchange roles

Julian Anastasov uli@linux.tu-varna.acad.bg has pointed out on the mailing list that the prototype LVS can be redrawn as

                        ________
                       |        |
                       | client |
                       |________|
                           |
                           |
                        (router)
                           |
                           |
         ------------------------------------
         |                 |                |
         |                 |                |
      DIP, VIP         RIP1, VIP        RIP2, VIP
    ____________    ______________    ______________
   |            |  |              |  |              |
   |  director  |  | realserver1  |  | realserver2  |
   |____________|  |______________|  |______________|

and that any realserver is in a position to replace a failed director. No-one has bothered to write the code for this. It seems it's easier to have extra boxes in the director role (ready for failover) and others in realserver role. It's easier to wheel in another box for a spare director than to configure realservers to do two jobs reliably.

Julian

The director and the backup are in a shared network for incoming traffic. The backup sniffs packets and changes its connection state the same way as the director (because the director only sees the client-to-server half of the connection in LVS/TUN and LVS/DR), then drops the packets.

It needs some investigation and probably lots of additional code too. ;-)

Wensong Zhang wensong@iinchina.net

I don't even think so - the main trick is getting the kernel to sniff the packets, which is probably quite easy with a little messing around. Not sending the packets out again (which would confuse the realservers) is easy with a ipchains output rule which silently drops them.

This doesn't work with a switch though, you need a shared network like a hub.

However, I have been talking with rusty about this. The problem is more general - HA shared-state firewalls are asked for all the time, so we want to do a generic thing for everything which builds upon Netfilter's state machine. This would not only cover LVS, but also masquerading and packet filtering in general. We intend to discuss this in greater detail at the Ottawa Linux Symposium latest.

You can see, the connections depend on the initialize status and the realservers' realtime status. So another method is that when the director is down, the backup server sets up the ipvs with the connections, but it seems too late. What do you think about this?

TCP/IP should be able to cope with a few seconds delay and lost packets. You want to heartbeat once per second and take over after 3-4s though - this usually means takeover is complete in <10s, which TCP/IP should swallow.
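
The arithmetic behind those numbers can be sketched as follows. This is an illustration in Python only, it is not how heartbeat is implemented. The function name and the parameters deadtime and check_interval are invented for the example.

```python
def takeover_time(heartbeat_times, deadtime=3.0, check_interval=1.0, horizon=60.0):
    """Return the first check time at which the backup declares the director dead.

    heartbeat_times: times (in seconds) at which a heartbeat was received.
    The backup checks every `check_interval` seconds and takes over once
    no heartbeat has been seen for `deadtime` seconds.
    Returns None if that never happens before `horizon`.
    """
    beats = sorted(heartbeat_times)
    t = 0.0
    while t <= horizon:
        seen = [b for b in beats if b <= t]
        last = seen[-1] if seen else 0.0
        if t - last >= deadtime:
            return t
        t += check_interval
    return None
```

If the director sends a heartbeat every second and its last one arrives at t=10, then takeover_time(range(11)) gives 13.0, i.e. the backup starts taking over 3 seconds after the last heartbeat. Bringing up the VIP and sending the gratuitous arps then has to fit into the remaining few seconds.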

Geographically distributed LVS

This has moved.

3.12 A discussion about the arp problem

(Joe and Julian)

Julian Anastasov uli@linux.tu-varna.acad.bg

There is no difference between devices in 2.2.x, all devices are reported in the ARP replies: lo, tunl and dummy. This can be tested using this configuration with any device:


Host A:
        eth:x 192.168.0.1

Host B:
        eth:x 192.168.0.2
        lo, dummy, tunl: 192.168.0.3

On host A try: ping 192.168.0.3

Host B replies for 192.168.0.3 through 192.168.0.2 device

The ARP problem means: "All local interfaces are reported" until the ARP patch is used. In fact, all ARP patches which use IFF_NOARP to hide the interface are incorrect. I don't expect them in the kernel.

ARP problem, some rules:

ARP responses

For example:

realserver# ifconfig lo:0 192.168.1.1 netmask 255.255.255.0 broadcast 192.168.1.255 up

"real" treats all packets with a source address from 192.168.1.0/24 which arrive on the other devices (eth0) as invalid, i.e. source address validation works in this case and the ARP requests are not answered. The kernel thinks: "The incoming packet arrived with saddr=local_IP1 and daddr=local_IP2(VIP), so it is invalid". In this way a host on the LAN can't talk to the real server if its lo alias is configured with netmask != 255.255.255.255

        ifconfig dummy0 192.168.1.1 netmask 255.255.255.255

registers only 192.168.1.1 as local ip but:

        ifconfig lo:0 192.168.1.1 netmask 255.255.255.0         

makes all 256 IPs local. All IFF_LOOPBACK devices treat all IPs covered by the configured netmask as local.

Joe

I assume IFF_LOOPBACK devices are lo, lo:0..n?

Yes, currently only lo is marked as loopback. It is used to mark whole subnets as local.

lo:0 is not marked as loopback?

lo:0 is just an IP address attached to the same device "lo". You can try "ifconfig lo:0 192.168.0.1 netmask 255.255.255.255" and display the interfaces using "ifconfig". There is a LOOPBACK flag for lo:0 which is inherited from the device "lo". In Linux 2.2 all aliases inherit the device flags. Only the IFF_UP flag is used to add/delete the aliases.

Joe

Assume LVS-DR with VIP, RIPs all on the same /24 network on eth0 devices, realservers all have lo:0 with VIP/24 and have the standard 2.2.x kernel (no patches to hide interfaces). Router says "who has VIP", the arp request arrives at the realservers via eth0. Device lo:0 finds arp request which arrived on eth0 from router is on the same subnet as lo:0 and does not reply to the arp request.

Before deciding whether to answer the ARP request, the routing tables are checked, i.e. source validation of the packet is performed. If 192.168.1.2 asks "who-has 192.168.1.1 tell 192.168.1.2", the real server assumes that this is an invalid packet, i.e. from one local IP to another local IP (from me to me => drop).

Joe

I notice that with the 2.2.x kernel, that lo:0 has to have netmask=255.255.255.255 to work, whereas with the 2.0.x kernels (where lo:0 doesn't reply to arp requests), that lo:0 can have the VIP on a 255.255.255.0 netmask and still work.

The rule is to use netmask 255.255.255.255 and to hide lo. ARP works in a different way in 2.2. It looks up the "local" table to validate the source of the ARP request and after that it looks up the same table to check if the daddr of the ARP request is a local ip.
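
The two lookups can be modelled in a few lines. This is a simplified sketch in Python of the behaviour Julian describes for an unpatched 2.2 kernel, not a transcription of the kernel source. The function names, the eth0 address 192.168.1.5 and the router address 192.168.1.254 are made up for the example.

```python
import ipaddress


def build_local_table(interfaces):
    """Build the list of networks the host treats as local.

    interfaces: list of (name, "addr/prefixlen", is_loopback) tuples.
    A loopback device makes its whole subnet local; any other device
    contributes only its own address.
    """
    local = []
    for _name, cidr, is_loopback in interfaces:
        iface = ipaddress.ip_interface(cidr)
        if is_loopback:
            local.append(iface.network)
        else:
            local.append(ipaddress.ip_network(f"{iface.ip}/32"))
    return local


def is_local(addr, local_table):
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in local_table)


def arp_decision(sender_ip, target_ip, local_table):
    """Decide what to do with 'who-has target_ip tell sender_ip'."""
    if is_local(sender_ip, local_table):
        return "drop"    # source validation: from me to me
    if is_local(target_ip, local_table):
        return "reply"   # unpatched 2.2: every local address is answered
    return "ignore"
```

With lo:0 at 192.168.1.1/24 the router's request for 192.168.1.1 is dropped, because the router's own address falls inside the "local" /24. With lo:0 at 192.168.1.1/32 the same request gets a reply, which is the arp problem that hiding the interface is meant to solve.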

ARP requests: - all local addresses can be used by the kernel to announce them as the source for the ARP request.

is it OK to say

the kernel can (does?) use all local addresses as the source of ARP requests

It can and does. The real server thinks that it can use any local ip address as saddr in the ARP request and the answer will be returned back if this ip is unique in the LAN.

Joe

do you mean "the realserver will receive a reply if the s_addr is unique in the LAN"?

The real server will receive an answer if it uses the RIP as saddr in the ARP request because the VIP(HIP) is hidden or when using transparent proxy because it is not local (the VIP). The real server must know how to ask (using a unique IP) or the traffic for the asked IP (ROUTER) will be blocked.

But the hidden addresses are not used because they are not unique (2.2.14) and the answer will be returned to the Director.

Joe

do you mean "the non-hidden VIP on the director"?

Yes, when the real server asks "who-has ROUTER tell VIP" the ARP reply is received in the Director and the transmission in the real servers is stopped. The ROUTER sends everything destined to VIP to the Director. This is true for all clients on the LAN too if they are not in this cluster (if they don't handle packets for VIP).

Joe

I would have thought that the main device on each NIC, eg eth0, eth1 would have been used as the source address.

No, it is extracted from the outgoing datagram and if saddr is local ip it is used. But if this is not local ip, i.e. when using transparent proxy or the address is marked as hidden the main device ip is used.

Joe

how is arping part of transparent proxy?

It is not. When the VIP is not a local IP address in the real server this IP is not used by the ARP code. It is not in the "local" table. But TCP, UDP and ICMP use it via transparent proxy support.

They are extracted from the outgoing packet.

Joe

what is "They"? the source addresses? When you say "extracted", do you mean "removed from packet" or "looked at/detected"

The saddr from the data packet is used to build the ARP request.

We tell the kernel that these addresses are not unique by setting <interface>/hidden=1 (starting with kernel 2.2.14). This way the kernel selects the device's primary IP as the source of the ARP request.

Joe

the kernel can use any local address as s_addr but the code for hiding IPs from arp requests prevents the kernel from using hidden addresses as s_addr in an arp request?

Yes, the code to hide the addresses is already part of the source address autoselection (saddr in the ARP request in our case). We never autoselect hidden addresses, i.e. when the source address is not specified by the higher level. The code to hide an interface:

- ignores ARP replies for hidden local addresses
- doesn't select hidden local addresses as source of the ARP request
- doesn't autoselect hidden local addresses for the IP level
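
The selection rules described in this exchange can be modelled as a small shell function. This is only a sketch of the behaviour as described here, not the kernel code: the function name arp_saddr, its argument order and the example addresses are all assumptions made for illustration.

```shell
#!/bin/sh
# Model (NOT kernel code) of how the source address of an ARP request is
# chosen on a host that may have hidden addresses.
# Hypothetical helper: arp_saddr PKT_SADDR PRIMARY_IP "LOCAL_IPS" "HIDDEN_IPS"
arp_saddr() {
    pkt_saddr=$1   # saddr of the outgoing datagram that triggers the ARP request
    primary=$2     # primary IP of the outgoing device (e.g. the RIP on eth0)
    locals=$3      # space-separated list of local IPs
    hidden=$4      # space-separated list of IPs on interfaces marked hidden

    # A hidden address is never used as the source of the ARP request.
    for ip in $hidden; do
        if [ "$ip" = "$pkt_saddr" ]; then
            echo "$primary"
            return 0
        fi
    done

    # A local, non-hidden address is used as it is.
    for ip in $locals; do
        if [ "$ip" = "$pkt_saddr" ]; then
            echo "$pkt_saddr"
            return 0
        fi
    done

    # Not a local address (e.g. VIP via transparent proxy): use the primary IP.
    echo "$primary"
}

# Assumed example addresses.
RIP=192.168.1.2
VIP=192.168.1.110
arp_saddr "$VIP" "$RIP" "$RIP $VIP" "$VIP"   # VIP local and hidden
arp_saddr "$VIP" "$RIP" "$RIP $VIP" ""       # VIP local, not hidden
arp_saddr "$VIP" "$RIP" "$RIP" ""            # VIP not local
```

Only the second call selects the VIP as the ARP source, which is the case that sends the router's reply to the director.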

Joe

When you say "We expect it is unique in the LAN" do you mean - we expect you've set up your network properly and that you don't have the same RIP on 2 realservers? :-)

The LVS administrator must ensure that the RIPs are unique; only the VIP is shared. We tell the kernel that the VIP addresses are not unique by setting <interface>/hidden=1 (2.2.14). This way the kernel selects the device's primary IP as the source of the ARP request. We expect it to be unique in the LAN.

So, the recommendation for using the "lo" interface in the real servers is:

- use netmask 255.255.255.255 when configuring the lo alias. This way source validation doesn't drop the incoming packets to this IP. LVS users usually define the net route through the eth interface, so we can talk to other hosts on this network, for example to send the packets to the client through the default gateway. There is no need to configure the alias with any netmask other than 255.255.255.255.

So, the interfaces which can be used in the real servers to listen for VIP are:

- lo aliases with netmask 255.255.255.255
- tunl*
- dummy*

All these devices must be marked as hidden to solve the ARP problem when using Linux 2.2.
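
As a minimal sketch of the recommendation above, the following puts the VIP on a hidden lo alias with a host netmask. The VIP value is an assumed example address, the run helper and DRY_RUN variable are hypothetical conveniences, and the /proc paths and the ordering (hide first, then configure the VIP) are assumptions based on the 2.2.14 hidden feature rather than something spelled out in this section. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch: put the VIP on a hidden lo alias on a Linux 2.2.14+ realserver.
# VIP is an assumed example address; DRY_RUN=1 only prints the commands.
VIP="${VIP:-192.168.1.110}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        sh -c "$*"
    fi
}

setup_hidden_vip() {
    # Enable hiding, then mark lo as hidden (assumed /proc paths).
    run "echo 1 > /proc/sys/net/ipv4/conf/all/hidden"
    run "echo 1 > /proc/sys/net/ipv4/conf/lo/hidden"
    # Host netmask, so that source validation does not drop packets to the VIP.
    run "ifconfig lo:0 $VIP netmask 255.255.255.255 broadcast $VIP up"
}

setup_hidden_vip
```

Run as root with DRY_RUN=0 to apply the commands; the same pattern would apply to a tunl or dummy device by substituting the device name.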

On the director there is no problem configuring the VIP, even on a lo alias or dummy interface. If the interface is not marked as hidden, this VIP is visible to all hosts on the LAN.

3.13 ATM/ethernet and router problems

LVS has only been tested on ethernet. One person had an ATM setup which didn't work with LVS-DR as the ATM router expects packets from the VIP to have the same MAC address (in LVS-DR packets coming from the VIP could have the MAC address of any of the realservers). Apparently this is not easily fixable in the ATM world. It should be possible to use one of Julian's martian modifications to make LVS-DR work on ATM, but the person with the ATM setup disappeared off the mailing list without us convincing him of the joy in having the first ATM LVS.

Other people have found similar problems with ethernet -

From: Kyle Sparger ksparger@dialtoneinternet.net

I don't know if someone has gone over this, but here's a consideration I've come across when setting up LVS in DR mode:

When the real servers reply, Cisco routers (ours do, at least) will pick up on the fact that the reply comes from a different MAC address, and will start arping soon thereafter. This is sub-optimal, as it causes a constant flood of arp requests on the network. Our solution has been to hardcode the MAC address into the router, but this can cause other issues, for example during failover. That can be worked around, as you can set the MAC address on most cards, but that in itself may cause other issues.

Has anyone else experienced this? Has anyone else come up with a better solution than hardcoding it into the router?
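
The failover workaround Kyle mentions, setting the MAC address on the card so that it matches the one hardcoded in the router, might look something like the sketch below. IFACE and MAC are assumed example values, the run helper and DRY_RUN variable are hypothetical, and, as the post warns, changing a MAC address may cause other issues. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch: give a NIC the MAC address that the router has hardcoded.
# IFACE and MAC are assumed example values; DRY_RUN=1 only prints the commands.
IFACE="${IFACE:-eth0}"
MAC="${MAC:-00:11:22:33:44:55}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        sh -c "$*"
    fi
}

take_over_mac() {
    # The interface is taken down while the hardware address is changed.
    run "ifconfig $IFACE down"
    run "ifconfig $IFACE hw ether $MAC"
    run "ifconfig $IFACE up"
}

take_over_mac
```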

