ipvsadm is the user interface to LVS.
There are patches for ipvsadm
You use ipvsadm from the command line (or in rc files) to set up the virtual services, the scheduler, the forwarding method and the realservers (with their weights).

doug@deja.com points out that the *lc schedulers will not work properly if a particular realserver is used in two different LVSs.

Besides the standard schedulers (rr, wrr, lc, wlc), there are schedulers for web caches (LBLC, and DH, destination hash, by Thomas Proell proellt@gmx.de) and for multiple firewalls (SH, source hash, by Henrik Nordstrom hno@safecore.se). Any of these will do for a test setup (round robin will cycle connections to each realserver in turn, allowing you to check that all realservers are functioning in the LVS).
Compile and install ipvsadm on the director using the supplied Makefile. You can optionally compile ipvsadm with popt libraries, which allows ipvsadm to handle more complicated arguments on the command line. If your libpopt.a is too old, your ipvsadm will segv. (I'm compiling with a newer dynamic libpopt).
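A minimal sketch of the install step (the directory name is illustrative, and the popt linking depends on what your Makefile offers; use whatever your ipvs tarball provides):

cd ipvsadm-x.y.z      # the ipvsadm source shipped with your ipvs tarball
make                  # link against popt here if your Makefile supports it
make install
ipvsadm -L -n         # sanity check: ipvsadm can talk to the ipvs code in the running kernel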
Since you compile ipvs and ipvsadm independently, and you cannot compile ipvsadm until you have patched the kernel headers, a common mistake is to compile the kernel and reboot, forgetting to compile/install the matching ipvsadm.
Unfortunately there is only rudimentary version detection code in ipvs/ipvsadm. If you have a mismatched ipvs/ipvsadm pair, often there won't be problems, as any particular version of ipvsadm will work with a wide range of patched kernels. Usually with 2.2.x kernels, if the ipvs/ipvsadm versions are mismatched, you'll get weird but non-obvious errors about not being able to install your LVS. Other possibilities are that the output of ipvsadm -L will have IPs that are clearly not IPs (or not the IPs you put in) and ports that are all wrong. It will look something like this
[root@infra /root]# ipvsadm
IP Virtual Server version 1.0.4 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  C0A864D8:0050 rr
  -> 01000000:0000               Masq    0      0          0
rather than
director:/etc/lvs# ipvsadm
IP Virtual Server version 0.9.4 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:ssh rr
  -> bashfull.mack.net:ssh       Route   1      0          0
There was a change in the /proc file system for ipvs at about 2.2.14 which caused problems for anyone with a mismatched ipvsadm/ipvs. The ipvsadm binaries from the different kernel series (2.2/2.4) do not recognise the ipvs kernel patches from the other series (the kernel appears not to be patched for ipvs).
The later 2.2.x ipvsadms know the minimum version of ipvs that they'll run on, and will complain about a mismatch. They don't know the maximum version (produced presumably some time in the future) that they will run on. This protects you against the unlikely event of installing a new 2.2.x version of ipvsadm on an older version of ipvs, but will not protect you against the more likely scenario where you forget to recompile ipvsadm after building your kernel. The ipvsadm maintainers are aware of the problem. Fixing it will break the current code and they're waiting for the next code revision which breaks backward compatibility.
If you didn't even apply the kernel patches for ipvs, then ipvsadm will complain about missing modules and exit (i.e. you can't even do `ipvsadm -h`).
Ty Beede tybeede@metrolist.net

on a slackware 4.0 machine I went to compile ipvsadm and it gave me an error indicating that the iphdr type was undefined and it didn't like that when it saw the ip_fw.h header file. I added #include <linux/ip.h> in ipvsadm.c, which is where the iphdr structure is defined, and everything went OK.
Doug Bagley doug@deja.com
The reason that it fails "out of the box" is that fwp_iph's type definition (struct iphdr) was #ifdef'd out in <linux/ip_fw.h> (and not included anywhere else) since the symbol __KERNEL__ was undefined. Including <linux/ip.h> before <linux/ip_fw.h> in the .c file did the trick.
On receiving a connect request from a client, the director assigns a realserver to the client based on a "schedule". The scheduler type is set with ipvsadm. The schedulers available are rr (round robin), wrr (weighted round robin), lc (least connection), wlc (weighted least connection), lblc (locality-based least connection), dh (destination hash) and sh (source hash).
The rr,wrr,lc,wlc schedulers should all work similarly when the director is directing identical realservers with identical services. The lc scheduler will better handle situations where machines are brought down and up again (see thundering herd problem). If the realservers are offering different services and some have clients connected for a long time while others are connected for a short time, or some are compute bound, while others are network bound, then none of the schedulers will do a good job of distributing the load between the realservers. LVS doesn't have any load monitoring of the realservers. Figuring out a way of doing this that will work for a range of different types of services isn't simple (see load and failure monitoring).
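As a sketch of how the scheduler is chosen (the VIP, port and realserver names here are made up for illustration), the -s option selects the scheduler when the virtual service is added:

ipvsadm -A -t 192.168.1.110:80 -s rr              # round robin for a test setup
ipvsadm -a -t 192.168.1.110:80 -r rs1:80 -m -w 1  # realserver 1, LVS-NAT
ipvsadm -a -t 192.168.1.110:80 -r rs2:80 -m -w 1  # realserver 2, LVS-NAT
# later, switch the running service to weighted least connection
ipvsadm -E -t 192.168.1.110:80 -s wlc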
The LBLC code (from Julian) and the dh scheduler (from Thomas Proell) are designed for web caching realservers (e.g. squids). For normal LVS services (e.g. ftp, http), the content offered by each realserver is the same and it doesn't matter which realserver the client is connected to. For a web cache, after the first fetch has been made, the web caches have different content, and as more pages are fetched, the contents of the web caches will diverge. Since the web caches will be set up as peers, they can communicate by ICP (internet caching protocol) and find the cache(s) with the required page. This is faster than fetching the page from the original webserver. However, it would be better, after the first fetch of a page from http://www.foo.com/*, for all subsequent clients wanting a page from http://www.foo.com/ to be connected to that realserver.
The original method for handling this was to make connections to the realservers persistent, so that all fetches from a client went to the same realserver.
The -dh (destination hash) algorithm makes a hash from the target IP, and all requests to that IP will be sent to the same realserver. This means that content from a URL will not be retrieved multiple times from the remote server. The realservers (the squids in this case) will each be retrieving content from different URLs.
The -sh (source hash) scheduler is for directors with multiple firewalls. The director hashes on the source address of the packet, so that all traffic from a particular source is sent through the same firewall. It's from Henrik Nordstrom, who is involved with developing web caches (squids).
Henrik Nordstrom 14 Feb 2001

Julian: who uses NFC_ALTERED?

Henrik: Here is a small patch to make LVS keep the MARK, and have return traffic inherit the mark.
We use this for routing purposes on a multihomed LVS server, to have return traffic routed back the same way as it was received. What we do is set the mark in the iptables mangle chain depending on the source interface, and then use this mark in the routing table to have return traffic routed back in the same (opposite) direction (a sketch of this kind of setup follows the exchange below).
The patch also moves the priority of the LVS INPUT hook back to in front of the iptables filter hook, to make it possible to filter the traffic that is not picked up by LVS but matches its service definitions. We are not (yet) interested in filtering traffic to the virtual servers, but we are very interested in filtering what traffic reaches the Linux LVS-box itself.
Henrik: Netfilter. The packet is accepted by the hook but altered (the mark is changed).

Julian: Give us an example (with dummy addresses) of a setup that requires such fwmark assignments.
For a start you need an LVS setup with more than one real interface receiving client traffic for this to be of any use. Some clients (due to routing outside the LVS server) come in on one interface, other clients on another interface. In this setup you might not want to have an equally complex routing table on the actual LVS server itself.

Regarding iptables / ipvs I currently "only" have three main issues.
- As the "INPUT" traffic bypasses most normal routes, the iptables conntrack will get quite confused by the return traffic.
- Sessions will be tracked twice: both by iptables conntrack and by IPVS.
- There is no obvious choice whether the IPVS LOCAL_IN hook should be placed before or after the iptables filter hook. Having it after enables the use of many fancy iptables options, but it requires one to have rules in iptables allowing the ipvs traffic, and any mismatches (either in the rulesets or in IPVS operation) will cause the packets to actually hit the IP interface of the LVS server, which in most cases is not what was intended.
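As a rough illustration of the mark-based return routing Henrik describes (not his actual configuration: the interface names, mark values and gateway addresses are invented, and his patch must be applied so that reply packets inherit the connection's mark), the mangle table marks traffic by ingress interface and iproute2 rules select the return route by mark:

# mark client traffic according to the interface it arrived on
iptables -t mangle -A PREROUTING -i eth1 -j MARK --set-mark 1
iptables -t mangle -A PREROUTING -i eth2 -j MARK --set-mark 2
# route return traffic back out the interface it came in on
ip rule add fwmark 1 table 101
ip rule add fwmark 2 table 102
ip route add default via 10.0.1.254 dev eth1 table 101
ip route add default via 10.0.2.254 dev eth2 table 102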
Wensong Zhang wensong@gnuchina.org 16 Feb 2001

Please see "man ipvsadm" for a short description of the DH and SH schedulers. Here are some examples of how to use those two schedulers.
Example1: cache cluster shared by several load balancers.
                Internet
                    |
                    |------ cache array
                    |
      -------------------------------
      |                             |
      DH                            DH
      |                             |
   Access                        Access
  Network1                      Network2

The DH scheduler can keep the two load balancers redirecting requests destined for the same IP address to the same cache server. If the server is dead or overloaded, the load balancer can use the cache_bypass feature to send requests to the original server directly. (Make sure that the cache servers are added to the two load balancers in the same order.)

Note that the DH development was inspired by the consistent hashing scheduler patch from Thomas Proell proellt@gmx.de
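A minimal sketch of one way Example 1 is commonly wired up (not Wensong's exact configuration; the mark value and cache addresses are invented, and interception of port 80 traffic via a fwmark service is an assumption). The important point is that both load balancers run the same commands, so the caches hash in the same order:

# run the same commands on both load balancers
iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark 1  # catch web requests from the access networks
ipvsadm -A -f 1 -s dh                                                    # one fwmark service, destination-hash scheduling
ipvsadm -a -f 1 -r 192.168.10.1 -g -w 1                                  # cache 1 (add caches in the same order on both boxes)
ipvsadm -a -f 1 -r 192.168.10.2 -g -w 1                                  # cache 2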
Example2: Firewall Load Balancing
                    |-- FW1 --|
Internet ----- SH --|         |-- DH -- Protected Network
                    |-- FW2 --|

Make sure that the firewall boxes are added to the load balancers in the same order. Then, if the request packets of a session are sent to one firewall, e.g. FW1, the DH scheduler will forward the response packets from the protected network through FW1 too. However, I don't have enough hardware to test this setup myself. Please let me know if any of you make it work for you. :)
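A sketch of how Example 2 might be expressed with ipvsadm (all addresses, marks and interfaces are hypothetical; the real setup also needs routing so that the chosen firewall is the next hop, and the same firewall order on both balancers):

# on the load balancer facing the Internet: pick a firewall per client (source hash)
iptables -t mangle -A PREROUTING -i eth0 -j MARK --set-mark 1
ipvsadm -A -f 1 -s sh
ipvsadm -a -f 1 -r 192.168.100.1 -g -w 1   # FW1 (outside address)
ipvsadm -a -f 1 -r 192.168.100.2 -g -w 1   # FW2

# on the load balancer facing the protected network: return traffic to the same firewall (destination hash)
iptables -t mangle -A PREROUTING -i eth0 -j MARK --set-mark 2
ipvsadm -A -f 2 -s dh
ipvsadm -a -f 2 -r 192.168.200.1 -g -w 1   # FW1 (inside address)
ipvsadm -a -f 2 -r 192.168.200.2 -g -w 1   # FW2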
For initial discussions on the -dh and -sh scheduler see on the mailing list under "some info for DH and SH schedulers" and "LVS with mark tracking".
I ran the polygraph simple.pg test on an LVS-NAT LVS with 4 realservers using rr scheduling. Since the responses from the realservers should average out, I would have expected the number of connections and the load average on the realservers to be equally distributed over the realservers.
Here's the output of ipvsadm shortly after the number of connections had reached steady state (about 5 mins).
IP Virtual Server version 0.2.12 (size=16384)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:polygraph rr
  -> doc.mack.net:polygraph      Masq    1      0          883
  -> dopey.mack.net:polygraph    Masq    1      0          924
  -> bashfull.mack.net:polygraph Masq    1      0          1186
  -> sneezy.mack.net:polygraph   Masq    1      0          982
The servers were identical hardware. I expect (but am not sure) that the utils/software on the machines are identical (I set up doc and dopey about 6 months after sneezy and bashfull). Bashfull was running 2.2.19, while the other 3 machines were running 2.4.3 kernels. The number of connections (all in TIME_WAIT) at the realservers was different for each (otherwise apparently identical) realserver, being in the range 450-500 for the 2.4.3 machines and 1000 for the 2.2.19 machine (measured with netstat -an | grep $polygraph_port | wc), and varied about 10% over a long period.
This run had been done immediately after another run, before InActConn had been allowed to drop to 0. I then repeated the run, after first waiting for InActConn to drop to 0.
IP Virtual Server version 0.2.12 (size=16384)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:polygraph rr
  -> doc.mack.net:polygraph      Masq    1      0          994
  -> dopey.mack.net:polygraph    Masq    1      0          994
  -> bashfull.mack.net:polygraph Masq    1      0          994
  -> sneezy.mack.net:polygraph   Masq    1      1          992
TCP  lvs2.mack.net:netpipe rr
Bashfull (the 2.2.19 machine) had 900 connections in TIME_WAIT while the other (2.4.3) machines were 400-600. Bashfull was also delivering about 50% more hits to the client.
Repeating the run using "lc" scheduling, the InActConn remains constant.
IP Virtual Server version 0.2.12 (size=16384)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:polygraph lc
  -> doc.mack.net:polygraph      Masq    1      0          994
  -> dopey.mack.net:polygraph    Masq    1      0          994
  -> bashfull.mack.net:polygraph Masq    1      0          994
  -> sneezy.mack.net:polygraph   Masq    1      0          993
The number of connections (all in TIME_WAIT) at the realservers did not change.
Joe, 14 May 2001
according to the ipvsadm man page, for "lc" scheduling, the new connections are assigned according to the number of "active connections". Is this the same as "ActConn" in the output of ipvsadm?
If the number of "active connections" used to determine the scheduling is "ActConn", then for services which don't maintain connections, the scheduler won't have much information, just "0" for all realservers?
Julian: The formula is: ActConn * K + InActConn, where K can be 32 to 50 (I don't remember the last value used). So it is not only the active conns; using the active conns alone would break UDP.
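For example, you can approximate the metric the lc scheduler compares from the ipvsadm output (this is only an illustration, assuming K=50 and the usual column layout of ipvsadm -L -n):

ipvsadm -L -n | awk '$1 == "->" && $5 ~ /^[0-9]+$/ { print $2, $5 * 50 + $6 }'   # realserver, ActiveConn*K + InActConn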
I've been running the polygraph simple.pg test over the weekend using rr scheduling on what (AFAIK) are 4 identical realservers in a LVS-NAT LVS. There are no ActConn and a large number of InActConn. Presumably the client makes a new connection for each request.
The implicit persistence of TCP connection reuse can cause such side effects even for rr. When the setup includes a small number of hosts and the request rate is high enough for the client's ports to be reused, LVS detects the existing connections and new connections are not created. This is the reason you can see some of the realservers not being used at all, even with a method like rr.
the client is using ports from 1025-4999 (has about 2000 open at one time) and it's not going above the 4999 barrier. ipvsadm shows a constant InActConn of 990-995 for all realservers, but the number of connections on each of the realservers (netstat -an) ranges from 400-900.
So if the client is reusing ports (I thought you always incremented the port by 1 till you got to 64k and then it rolled over again), LVS won't create a new entry in the hash table if the old one hasn't expired?
Yes, it seems you have (5000-1024) connections that never expire in LVS.
Presumably because the director doesn't know the number of connections at the realservers (it only has the number of entries in its tables), and because even apparently identical realservers aren't identical (the hardware here is the same, but I set them up at different times, presumably not all the files and time outs are the same), the throughput of different realservers may not be the same.
When setting up a service, you set the weight with a command like the following (the default for -w is 1).
ipvsadm -a -t $VIP:$SERVICE -r $REALSERVER_NAME:$SERVICE $FORWARDING -w 1
If you set the weight for the service to "0", then no new connections will be made to that service (see also man ipvsadm, about the -w option).
Lars Marowsky-Bree lmb@suse.de 11 May 2001

Setting weight = 0 means that no further connections will be assigned to the machine, but current ones remain established. This allows one to smoothly take a realserver out of service, e.g. for maintenance.
Removing the server hard cuts all active connections. This is the correct response to a monitoring failure, so that clients receive immediate notice that the server they are connected to died so they can reconnect.
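As a sketch of the two alternatives Lars describes (using the same illustrative variables as the command above):

# quiesce: no new connections are scheduled, existing ones are left alone
ipvsadm -e -t $VIP:$SERVICE -r $REALSERVER_NAME:$SERVICE -w 0
# remove: the realserver is taken out of the service entirely
ipvsadm -d -t $VIP:$SERVICE -r $REALSERVER_NAME:$SERVICE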
Laurent Lefoll Laurent.Lefoll@mobileway.com 11 May 2001

Is there a way to clear some entries in the ipvs tables? If a server reboots or crashes, the connection entries remain in the ipvsadm table. Is there a way to remove some entries manually? I have tried to remove the realserver from the service (with ipvsadm -d ....), but the entries are still there.
Joe: After a service (or realserver) failure, some agent external to LVS will run ipvsadm to delete the entry for the service. Once this is done no new connections can be made to that service, but the entries are kept in the table till they time out. (If the service is still up, you can delete the entries and then re-add the service, and the client will not have been disconnected.) You can't "remove" those entries, you can only change the timeout values.
Any clients connected through those entries to the failed service(s) will find their connection hung or deranged in some way. We can't do anything about that. The client will have to disconnect and make a new connection. For http where the client makes a new connection almost every page fetch, this is not a problem. Someone connected to a database may find their screen has frozen.
If you are going to set the weight of a connection, you need to first know the state of the LVS. If the service is not already in the ipvsadm table, you add (-a) it. If the service is already in the ipvsadm table, you edit (-e) it. There is no command to just set the weight no matter what the state. A patch exists to do this (from Horms) but Wensong doesn't want to include it. Scripts which dynamically add, delete or change weights on services will have to know the state of the LVS before making any changes, or else trap errors from running the wrong command.
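A sketch of the kind of state check such a script has to make (the variables are illustrative; a production script would match the realserver within the particular virtual service rather than anywhere in the table, or trap the error from the wrong command):

if ipvsadm -L -n | grep -q "$RIP:$PORT"; then
    ipvsadm -e -t $VIP:$PORT -r $RIP:$PORT -w $WEIGHT   # already in the table: edit
else
    ipvsadm -a -t $VIP:$PORT -r $RIP:$PORT -w $WEIGHT   # not there yet: add
fi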
This section is a bit out of date now. See the new ipvsadm schedulers by Thomas Proell (for web caches) and by Henrik Nordstrom (for firewalls). Ratz ratz@tac.ch has produced a scheduler which will keep the activity on a particular realserver below a fixed level.
For the next piece of code, write to Ty or grab the code off the list server.
Ty Beede tybeede@metrolist.net 23 Feb 2000

This is a hack to the ip_vs_wlc.c scheduling algorithm. It is currently implemented in a quick, ad hoc fashion. Its purpose is to support limiting the total number of connections to a realserver. Currently it is implemented using the weight value as the upper limit on the number of activeconns (connections in an established TCP state). This is a very simple implementation and only took a few minutes after reading through the source. I would like, however, to develop it further.
Due to its simple nature it will not function in several types of environments: those based on connectionless protocols (UDP; this uses the inactconns variable to keep track of things, so simply change the activeconns variable in the weight check to inactconns for UDP), and it may introduce complications when persistence is used. The current algorithm simply checks that weight > activeconns before including a server in the standard wlc scheduling. This works for my environment, but could be changed to perhaps (weight * 50) > (activeconns * 50) + inactconns to include the inactconns but make the activeconns more important in the decision.
Currently the greatest weight value a user may specify is approximately 65000, independent of this modification. As long as the user keeps the weight values correct for the total number of connections, and in proportion to one another, things should function as expected.
In the event that the cluster is full (all realservers have maxed out), some overflow control might be necessary, or the client's end will hang. I haven't tested this idea, but it could be implemented simply by specifying the overflow server last, after the realservers, using the ipvsadm tool. This would work because as each realserver is added using ipvsadm it is put on a list, with the last one added being last on the list. The scheduling algorithm traverses this list linearly from start to finish, and if it finds that all servers are maxed out, then the last one will be the overflow and that will be the only one to send traffic to.
Anyway this is just a little hack; read the code and it should make sense. It has been included as an attachment. If you would like to test this, simply replace the old ip_vs_wlc.c scheduling file in /usr/src/linux/net/ipv4 with this one. Compile it in and set the weight on the realservers to the maximum number of connections in an established TCP state, or modify the source to your liking.
From: Ty Beede tybeede@metrolist.net 28 Feb 2000

I wrote a little patch and posted it a few days ago... I indicated that overflow might be accomplished by adding the overflow server to the lvs last. This statement is completely off the wall wrong. I'm not really sure why I thought that would work, but it won't: first of all, the linked list adds each new instance of a realserver to the start of the realservers list, not the end like I thought. Also, it would be impossible to distinguish the overflow server from the realservers in the case that not all the realservers were busy. I don't know where I got that idea from, but I'm going to blame it on my "bushy eyed youth". In response to the need for overflow support, I'm thinking about implementing "priority groups" in the lvs code. This would logically group the realservers into different groups, where a higher priority group would fill up before those with a lower priority. If anybody could comment on this it would be nice to hear what the rest of you think about overflow code.
Julian
It seems to me it would be useful in some cases to use the total number of connections to a real server in the load balancing calculation, in the case where the real server participates in servicing a number of different VIPs.
Wensong: Yes, if a realserver is used from two or more directors, the "lc" method is useless. Yeah, it is true. Sometimes we need a tradeoff between simplicity/performance and functionality. Let me think more about this, and probably about maximum connection scheduling too. For a rather big server cluster, there may be a dedicated load balancer for web traffic and another load balancer for mail traffic; the two load balancers may then need to exchange status periodically, which is rather complicated.
Actually, I just thought that dynamic weight adaptation, according to periodic load feedback from each server, might solve all the above problems.
Joe - this is part of a greater problem with LVS: we don't have good monitoring tools and we don't have a lot of information on the varying loads that realservers have, in order to develop strategies for informed load regulation. See load and failure monitoring.
Julian: From my experience with realservers for web, the only useful parameters for the realserver load are:
- cpu idle time
If you use realservers with equal CPUs (MHz) the cpu idle time in percent can be used. In other cases the MHz must be included in an expression for the weight.
- free ram
Depending on the web load, the right expression must be used, combining the cpu idle time and the free ram.
- free swap
Very bad if the web is swapping.
The easiest parameter to get, the load average, is always < 5 here, so it can't be used for weights in this case. Maybe for SMTP? The sendmail guys use only the load average when evaluating the load in sendmail :)
So, the monitoring software must send these parameters to all directors. But even now each of the directors uses these weights to create connections proportionally. So it is useful for these load parameters to be updated at short intervals, and they must be averaged over that period. It is very bad to use the current value of a parameter to evaluate the weight in the director. For example, it is very useful to use something like "the average cpu idle time over the last 10 seconds" and to broadcast this value to the director every 10 seconds. If the cpu idle time is 0, the free ram must be used. It depends on which resource is exhausted first: the cpu idle time or the free ram. The weight must be changed slightly :)
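A toy sketch of the feedback loop Julian describes (none of this is existing LVS code; the addresses, the ssh transport and the choice of cpu idle time as the only input are all illustrative assumptions):

#!/bin/sh
# runs on a realserver: average the cpu idle time over 10 seconds and
# push it to the director as this realserver's weight
VIP=192.168.1.110; SERVICE=80; RIP=192.168.1.11      # made-up addresses
while true; do
    set -- $(head -1 /proc/stat); i1=$5; t1=$(($2+$3+$4+$5))
    sleep 10
    set -- $(head -1 /proc/stat); i2=$5; t2=$(($2+$3+$4+$5))
    w=$(( 100 * (i2 - i1) / (t2 - t1) ))             # % idle over the interval
    [ $w -lt 1 ] && w=1                              # don't quiesce the server by accident
    ssh director ipvsadm -e -t $VIP:$SERVICE -r $RIP:$SERVICE -w $w
done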
The "*lc" algorithms help for simple setups, e.g. with one director and for some of the services, e.g. http and https. It is difficult even for ftp and smtp to use these schedulers. When the requests are very different, the only valid information is the load on the realserver.
Another useful parameter is the network traffic (for ftp). But again, all these parameters must be used by the director to build the weight, using a complex expression.
I think a complex weight for the realserver based on connection number (lc) is not useful, due to the different load from each of the services. Maybe for the "wlc" scheduling method? I know that users want LVS to do everything, but load balancing is a very complex job. If you handle web traffic you can be happy with any of the current scheduling methods. I haven't tried to balance ftp traffic, but I don't expect much help from the *lc methods. The realserver can be loaded, for example, if you build a new Linux kernel while the server is in the cluster :) That's a very easy way to switch to swap mode if your load is near 100%.