
5. Choose LVS Forwarding Type

5.1 Comparison of LVS-NAT, LVS-DR and LVS-Tun

The instructions in the following sections show how to set up LVS in LVS-NAT, LVS-Tun and LVS-DR modes for the service telnet.

If you just want to demonstrate to yourself that you can set up an LVS, then LVS-NAT has the advantage that any OS can be used on the realservers, and that no modifications are needed to the kernel on the realserver(s).

If you have a Linux machine with a 2.0.x kernel, then it can be used as a realserver for an LVS operating in any mode, without any modifications.

Because LVS-NAT was the first mode of LVS developed, it was the type first used by people setting up an LVS. For production work, LVS-DR scales to much higher throughput and is the most common setup. However, for a simple test, LVS-NAT only requires patching 1 machine (the director) and an unmodified machine of any OS for the realserver. After a simple test, unless you need the features of LVS-NAT (the ability to use realservers that provide services not found on Linux machines, port remapping, realservers with primitive tcpip stacks - e.g. printers - and services that initiate connect requests, such as identd), it would be best to move to LVS-DR.

Here are the constraints for choosing the various flavors of LVS: LVS-NAT (network address translation), LVS-Tun (tunnelling) and LVS-DR (direct routing).

                       LVS-NAT      LVS-Tun            LVS-DR

realserver OS          any          must tunnel        most
realserver mods        none         tunl must not arp  lo must not arp
port remapping         yes          no                 no
realserver network     private      on internet        local
                       (remote or local)
realserver number      low          high               high
client connects to     VIP          VIP                VIP
realserver default gw  director     own router         own router
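
For orientation, here is a minimal sketch of how the forwarding type is selected when configuring the director with ipvsadm: the mode is set per realserver, with -m for LVS-NAT, -g for LVS-DR and -i for LVS-Tun (the addresses below are hypothetical placeholders).

    VIP=192.168.100.110                       # hypothetical virtual IP
    RIP=192.168.1.2                           # hypothetical realserver IP
    ipvsadm -A -t $VIP:telnet -s rr           # add the virtual telnet service
    ipvsadm -a -t $VIP:telnet -r $RIP -m      # LVS-NAT (masquerading)
    #ipvsadm -a -t $VIP:telnet -r $RIP -g     # LVS-DR (direct routing)
    #ipvsadm -a -t $VIP:telnet -r $RIP -i     # LVS-Tun (ipip tunnelling)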

5.2 Expected LVS performance

unknown:

What is the maximum number of servers I can have behind the LVS without any risk of failure?

Horms horms@vergenet.net 03 Jul 2001

LVS does not set artificial limits on the number of servers that you can have. The real limitations are the number of packets you can get through the box, the amount of memory you have for storing connection information and, in the case of LVS-NAT, the number of ports available for masquerading. These limitations affect the number of concurrent connections you can handle and your maximum throughput. This indirectly affects how many servers you can have.

(also see the section on port range limitations.)

Palmer J.D.F:

I know the JANet Web Cache Service does this, but I was hoping that someone had done it on a smaller scale.

Martin Hamilton martin@net.lut.ac.uk Nov 14 2001

We (JWCS) also use LVS on our home institutional caches. These are somewhat smaller scale, e.g. some 10M URLs/day at the moment for Loughborough's campus caches vs. 130M per day typically on the JANET caches. The good news is that LVS in tunnelling mode is happily load balancing peaks of 120Mbit/s of traffic on a 550MHz PIII.

Folk in ac.uk are welcome to contact us at support@wwwcache.ja.net for advice on setting up and operating local caches. I'm afraid we can only provide this to people on the JANET network, like UK Universities and Colleges.

Michael McConnell:

Top doesn't display CPU usage of ipchains or ipvsadm. vmstat doesn't display CPU usage of ipchains or ipvsadm.

Joe

ipchains and ipvsadm are user tools that configure the kernel. After you've run them, they exit and the kernel does its new thing (which you'll see in "system"). Unfortunately, for some reason that no-one has explained to me, "top/system" doesn't see everything. I can have an LVS-DR director which is running 50Mbps on a 100Mbps link while the load average doesn't get above 0.03 and the system CPU time is negligible. I would expect it to be higher.

Julian Anastasov ja@ssi.bg 10 Sep 2001

Yes, the column is named "%CPU", i.e. the CPU time spent by one process relative to all processes. As for the load average, it is based on the length of the run queue (the number of processes in the running state, not counting the current one). As we know, LVS does not interact with any processes except ipvsadm. So the normal mode is for the LVS box to just forward packets without spending any CPU cycles on processes. This is the reason we see a load average of 0.00.

OTOH, vmstat reads /proc/stat, which holds the counters for all the CPU times. Considering the current value of jiffies (the kernel tick counter), user apps can work out the system, user and idle CPU time. LVS is somewhere in the system time. For more accurate measurement of the CPU cycles spent in the kernel, there are kernel patches/tools built exactly for this job - to see how much CPU time is spent in particular kernel functions.
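
For reference, a one-line sketch of where these numbers live: the first line of /proc/stat holds the cumulative user, nice, system and idle counters (in jiffies) that vmstat and top parse.

    # the aggregate CPU counters behind vmstat's us/sy/id columns
    head -1 /proc/stat
    # -> cpu  <user> <nice> <system> <idle>   (jiffies)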

If you are just setting up an LVS to see if you can set one up, then you don't care what your performance is. When you want to put one on-line for other people to use, you'll want to know the expected performance.

On the assumption that you have tuned/tweaked your farm of realservers and you know that they are capable of delivering data to clients at some total rate in bits/sec or packets/sec, you need to design a director capable of routing this number of requests and replies for the clients.

Before you can do this, some background information on networking hardware is required. At least for Linux (the OS I've measured, see performance data for single realserver LVS), a network rated at 100Mbps is not 100Mbps all the time. It's only 100Mbps when continuously carrying packets of mtu size (1500 bytes). A packet with 1 bit of data takes as long to transmit as a full mtu sized packet. If your packets are <ack>s, or 1 character packets from your telnet editing session, or requests for http pages and images, you'll barely reach 1Mbps on the same network. On the performance page, you'll notice that you can get higher hit rates on a website as the size of the hit targets (in bytes) gets smaller. Hit rate is not necessarily a good indicator of network throughput.

Tcpip can't use the full 100Mbps of 100Mbps network hardware, as most packets come in pairs (data, ack; request, ack). A link carrying full mtu data packets and their corresponding <ack>s will presumably be carrying only 50Mbps of data. A better measure of network capacity is the packet throughput. An estimate of the packet throughput comes from the network capacity (100Mbps) divided by the mtu size (1500 bytes = 12000 bits), giving 8333 packets/sec.
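
As a sanity check on that estimate, here's the arithmetic:

    # packets/sec on a 100Mbps link saturated with mtu (1500 byte) packets
    echo $(( 100000000 / (1500 * 8) ))        # -> 8333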

Thinking of a network as 100Mbps rather than as ca. 8000 packets/sec is a triumph of marketing. When offered the choice, everyone will buy the network hardware rated at 100Mbps, even though this capacity can't be used with their protocols, over another network rated to run continuously at 8000 packets/sec for all protocols. Only for applications like ftp will near full network capacity be reached (and then you'll only be running at 50% of the rated capacity, as half the packets are <ack>s).

A netpipe test (my realservers are 75MHz pentiums and can't saturate the 100Mbps network) shows that some packets must be "small". Julian's show_traffic script shows that for small packets (<128 bytes), the throughput is constant at 1200 packets/sec. As packets get bigger (up to mtu size), the packet throughput decreases to 700 packets/sec, and then increases to 2600 packets/sec for large packets.

The constant throughput in packets/sec is a first order approximation of tcpip network throughput, and is the best information we have for predicting director performance.

In the case where a client is exchanging small packets (<mtu size) with a realserver in an LVS-DR LVS, each of the links (client-director, director-realserver, realserver-client) will be saturated with packets, although the bps rate will be low. This is the typical case for non-persistent http, where 7 packets are required for the setup and termination of the connection, 2 packets are required for data passing (e.g. the request GET /index.html and the reply), plus an <ack> for each of these. Thus only 1 out of 11 packets is likely to be near mtu size, and throughput will be about 10% of the rated bps throughput even though the network is saturated.
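
To make the 10% figure concrete, here is a rough sketch; the 60 byte size assumed for the small packets (roughly a headers-only tcpip packet) is an assumption, not a measured value.

    # bytes moved by a hypothetical 11 packet http fetch, relative to the
    # same 11 packets all being mtu (1500 byte) sized
    echo "scale=2; (1*1500 + 10*60) / (11*1500)" | bc    # -> .12, i.e. ~10%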

The first thing to determine then is the rate at which the realservers are generating/receiving packets. If the realservers are network limited, i.e. the realservers are returning data in memory cache (eg a disk-less squid) and have 100Mbps connections, then each realserver will saturate a 100Mbps link. If the service on the realserver requires disk or CPU access, then each realserver will be using proportionately less of the network. If the realserver is generating images on demand (and hence is compute bound) then it may be using very little of the network and the director can be handling packets for another realserver.

The forwarding method affects packet throughput. With LVS-NAT, all packets go through the director in both directions. As well, the LVS-NAT director has to rewrite incoming and reply packets for each realserver. This is a compute intensive process (but less so for 2.4 LVS-NAT). In an LVS-DR or LVS-Tun LVS, the incoming packets are just forwarded (requiring little intervention by the director's CPU), and replies from the realservers return to the client directly by a separate path (via the realserver's default gw), so they aren't seen by the director.

In a network limited LVS on the same hardware, because there are separate paths for incoming and returning packets with LVS-DR and LVS-Tun, the maximum (packet) throughput is twice that of LVS-NAT. Because of the rewriting of packets in LVS-NAT, the load average on an LVS-NAT director will be higher than on an LVS-DR or LVS-Tun director managing twice the number of packets.

In a network bound situation, a single realserver will saturate a director of similar hardware. This is a relatively unusual case for the LVSs deployed so far. However, it's the situation where replies come from data in the memory cache on the realservers (eg squids).

With an LVS-DR LVS, where the realservers have their own connection to the internet, the rate limiting step is the NIC on the director, which accepts packets (mostly <ack>s) from the clients. The incoming network is saturated with packets but is only carrying low bps traffic, while the realservers are sending full mtu sized packets out their default gw (presumably at the full 100Mbps).

The information needed to design your director then is simply the number of packets/sec your realserver farm is delivering. The director doesn't know what's in the packets (being an L4 switch) and doesn't care how big they are (1 byte of payload or full mtu size).

If the realservers are network limited, then the director will need the same CPU and network capacity as the total of your realservers. If the realservers are not network limited, then the director will need correspondingly less capacity.

If you have 7 network limited realservers with 100Mbps NICs, then they'll be generating an average of 7x8000 = 56k packets/sec. Assuming the packets arrive randomly, the standard deviation for 1 second's worth of packets is +/- sqrt(56000), i.e. about 240 (small compared to the rate of arrival of packets). You should be able to connect these realservers to a 1Gbps NIC via a switch without saturating your outward link.
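
The same numbers as a sketch (assuming random, i.e. Poisson, arrivals as above):

    echo $(( 7 * 8000 ))                # aggregate rate: 56000 packets/sec
    echo "sqrt(56000)" | bc -l          # standard deviation: ~237 packets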

If you are connected to the outside world by a slow connection (eg a T1 line), then no matter how many 8000 packets/sec realservers you have, you are only going to get 1.5Mbps throughput (or half that, since half the packets are <ack>s).

Note: The 8000 packets/sec carrying capacity of a 100Mbps network may only apply to tcpip exchanges. My 100Mbps network will carry 10,000 SYN packets/sec when tested with Julian's testlvs program.

Wayne wayne@compute-aid.com 03 Apr 2001

The performance page calculates the <ack>s as 50% or so of the total packets. I think that might not be accurate. With twisted pair in full duplex mode, acks and requests travel on two different pairs. Even in half duplex mode, the packets for the two directions are transmitted over two pairs, one for send and one for receive; it's only the card and driver that determine whether they are handled in full or half duplex mode. So the throughput would be 8000 packets/sec in each direction, all the time, for full duplex cards.

Unfortunately we can only approximately predict the performance of an LVS director. Still, the best estimates come from comparing with a similar machine.

The performance page shows that a 133MHz pentium director can handle 50Mbps throughput. With LVS-NAT the load average on the director is unusably high, but with LVS-DR, the director has a low load average.

Statements on the website indicate that a 300MHz pentium LVS-DR director running a 2.2.x kernel can handle the traffic generated by a 100Mbps link to the clients. (A 550MHz PIII can direct 120Mbps.)

Other statements indicate that single CPU high end (800MHz) directors cannot handle 1Gbps networks. Presumably multiple directors or SMP directors will be needed for Gbps networks.

From: Jeffrey A Schoolcraft dream@dr3amscap3.com 7 Feb 2001

I'm curious whether there are any known DR LVS bottlenecks. My company had the opportunity to put LVS to the test the day following the Super Bowl, when we delivered 12TB of data in 1 day and peaked at about 750Mbps.

In doing this we had a couple of problems with LVS (I think they were with LVS). I was using the latest LVS for 2.2.18, and ldirectord to move the machines in and out of the LVS. The LVS servers were running redhat with an EEPro100. I had two clusters, web and video. The web cluster was a couple of 1U's with an acenic gig card, running 2.4.0, thttpd, with a somewhat performance tuned system (parts of the C10K). At peak our LVS got slammed with 40K active connections (so said ipvsadm). When we reached this number, or sometime before, LVS became inaccessible. I could however pull content directly from a server, just not through the LVS. LVS was running on a single proc P3, and the load never went much above 3% the entire time; I could execute tasks on the LVS box but http requests weren't getting passed along.

A similar thing occurred with our video LVS. While our realservers aren't quite capable of handling the C10K, we did about 1500 apiece and maxed out at about 150Mbps per machine. I think this is primarily the fault of modem users. I think we would have pushed more bandwidth to a smaller number of high bandwidth users (of course).

I know this volume of traffic choked LVS. What I'm wondering is whether there is anything I could do to prevent this. Until we got hit with too many connections (mostly modems I imagine), LVS performed superbly. I wonder if we could have better performance with a gig card, or some other algorithm (I started with wlc, but quickly changed to wrr, because all the rr calculations should be done initially and never need to be done again unless we change weights; I thought this would save us).

Another problem I had was with ldirectord and the test (negotiate, connect). It seemed like I needed some type of test to put the servers in initially; then too many connections happened, so I wanted no test (off), but the servers would still drop out from ldirectord. That's a snowball type problem for my amount of traffic: one server gets bumped because it's got too many connections, then the other servers get overloaded and dropped too, and then I'll have an LVS directing to localhost.

So, if anyone has pushed DR LVS to the limits and has ideas to share on how to maximize its potential for given hardware, please let me know.

5.3 Initial setup steps

Here are some rules of thumb to use until you know enough to make informed decisions.

Choose forwarding type

Choose in this order: LVS-NAT for a first test (any OS on the realservers, no modifications needed); LVS-DR for production (much higher throughput); LVS-Tun only if the realservers must be on remote networks.

Choose number of networks

The realservers are normally run on a private network (eg 192.168.1.0/24). They are not contacted by clients directly. Sometimes the realservers are machines (on the local network) also being used for other things, in which case leave them on their original network.

The director is contacted by the client(s) on the VIP. If the VIP must be publicly available, then it will usually be on a different network to the realservers, in which case you will have a two network LVS. The VIP will then be on the network that connects the director to the router (or test client). If the client(s) are on the same network as the realservers, then you'll have a one network LVS.

Choose number of NICs on director

If the director is in a 2 network LVS, then having 2 NICs on the director (one for each network) will increase throughput (as long as something else doesn't become rate limiting, eg the CPU speed or the PCI bus).

You can have 1 NIC on the director with a 2 network LVS. This is easy to do for LVS-NAT (for which there is an example conf file). Doing the same thing for LVS-DR or LVS-Tun requires more thought and is left as an exercise for the reader.

The configure script will handle 1 or 2 NICs on the director. In the 1 NIC case, the NIC connects to the outside world and to the realserver network. In the 2 NIC case, these two networks are physically separated. To increase throughput further, the director could have a NIC for each realserver. The configure script doesn't handle this (yet - let me know if you'd like it).
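
As an illustration, here is a minimal sketch of bringing up the interfaces on a 2 NIC director for a two network LVS-NAT setup (the device names and addresses are hypothetical; the configure script does the equivalent for you from your conf file).

    # eth0: outside network, carrying the VIP as an alias
    ifconfig eth0 192.168.100.2 netmask 255.255.255.0 up
    ifconfig eth0:110 192.168.100.110 netmask 255.255.255.0 up   # the VIP
    # eth1: inside (realserver) network
    ifconfig eth1 192.168.1.1 netmask 255.255.255.0 up
    # an LVS-NAT director must forward packets between the two networks
    echo 1 > /proc/sys/net/ipv4/ip_forward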

Pick a configure script

If you're using the configure script to set up your LVS, pick the appropriate lvs*.conf.* file and edit it to suit your setup.

