The director maintains a hash table of connections, each entry keyed by the tuple
<CIP, CPort, VIP, VPort, RIP, RPort>
(client IP and port, virtual IP and port, realserver IP and port).
The hash table speeds up connection lookup and keeps state, so that packets belonging to a connection from a client are sent to the realserver allocated to it. If you are editing the .config by hand, look for CONFIG_IP_MASQUERADE_VS_TAB_BITS.
The default LVS hash table size (2^12 entries) originally meant 2^12 simultaneous connections. These early versions of ipvs would crash your machine if you allotted too much memory to this table.
Julian 7 Jun 2001
This was because the resulting bzImage was too big. Users selected a value too big for the hash table, and even the empty table (without linked connections) couldn't fit in the available memory.
This problem has been fixed in ipvs versions > 0.9.9, where the connection table is a linked list.
(Note: If you're looking for memory use with "top", it reports memory allocated, not memory you are using. No matter how much memory you have, Linux will eventually allocate all of it as you continue to run the machine and load programs.)
Each connection entry takes 128 bytes, so 2^12 connections require 512kbytes. (Note: not all connections are active - some are waiting to timeout.)
As of ipvs-0.9.9 the hash table is different.
Julian Anastasov uli@linux.tu-varna.acad.bg
With CONFIG_IP_MASQUERADE_VS_TAB_BITS we specify not the max number of entries (connections in your case) but the number of rows in a hash table. This table has columns which are unlimited. You can set your table to 256 rows and have 1,800,000 connections in 7000 columns average, but the lookup is slower: the lookup function chooses one row using a hash function and starts to search all these 7000 entries for a match. So, by increasing the number of rows we speed up the lookup. There is _no_ connection limit; it depends on the free memory. Try to tune the number of rows so that the columns will not exceed 16 (average), for example. It is not fatal if the average is higher; if your CPU is fast enough this is not a problem.
All entries are included in a table with (1 << IP_VS_TAB_BITS) rows and an unlimited number of columns. 2^16 rows is enough. Currently, LVS 0.9.7 can eat all your memory for entries (using any number of rows). Memory checks are planned in the next LVS versions (are they in 0.9.9?).
Julian 7 Jun 2001
Here is the picture: the hash table is an array of double-linked list heads, i.e.
struct list_head *ip_vs_conn_tab;
Some versions ago (before 0.9.9?) it was a static array, i.e.
struct list_head ip_vs_table[IP_VS_TAB_SIZE];
struct list_head is 8 bytes (a double-linked list head: the next and prev pointers).
In the second variant, when IP_VS_TAB_SIZE is selected too high, the kernel crashes on boot. Currently (the first variant), vmalloc(IP_VS_TAB_SIZE*sizeof(struct list_head)) is used to allocate the space for the empty connection hash table. Once the table is created, more memory is allocated only for connections, not for the table itself.
In any case, after boot, before any connections are created, the memory occupied by this empty table is IP_VS_TAB_SIZE*8 bytes. For 20 bits this is (2^20)*8 bytes=8MB. When we start to create connections they are enqueued in one of these 2^20 double-linked lists after evaluating a hash function. In the ideal case you can have one connection per row (a dream), so 2^20 connections. When I'm talking about columns, in this example we have 2^20 rows and on average 1 column used.
The *TAB_BITS define only the number of rows (the power of 2 is useful to mask the hash function result with IP_VS_TAB_SIZE-1 instead of using the '%' modulo operation). But this is not a limit for the number of connections. When the value is selected by the user, the real number of connections must be considered. For example, if you think your site can accept 1,000,000 simultaneous connections, you have to select a number of hash rows that will spread all connections into short rows. You can create these 1,000,000 conns with TAB_BITS=1 too, but then all these connections will be linked in two rows and the lookup process will take too much time walking 500,000 entries. This lookup is performed on each received packet.
The selection of *TAB_BITS is entirely based on the recommendation to keep the d-linked lists short (fewer than 20 entries, not 500,000). This speeds up the lookup dramatically.
So, for our example of 1,000,000 we must select a table with 1,000,000/20 rows, i.e. 50,000 rows. In this case the minimum TAB_BITS value is 16 (2^16=65536 >= 50000). If we select 15 bits (32768 rows) we can expect 30 entries in one row (d-linked list), which increases the average time to access these connections.
So, the TAB_BITS selection is a compromise between the memory used by the empty table and the lookup speed in one table row: more rows => more memory => faster access. For 1,000,000 entries (which is a real limit for 128MB directors) you don't need more than 16 bits for the conn hash table, and the space occupied by such an empty table is 65536*8=512KBytes. More than 16 bits can speed up the lookup further, but we waste too much memory. And usually we don't achieve 1,000,000 conns with 128MB directors; some memory is occupied by other things.
The reason to move to a vmalloc-ed buffer is that a 65536-row table occupies 512KB, and if the table is statically defined in the kernel the boot image is 512KB longer, which is obviously very bad. So, the new definition is a pointer (4 bytes in the bzImage instead of 512KB) to the vmalloc'ed area.
Ratz's code adds limits per service while this sysctl can limit everything. Or it can be an additional strategy (oh, another one), vs/lowmem. The semantic can be "Don't allocate memory for new connections when the low memory threshold is reached". It can work for the masquerading connections too (2.2). This way we reserve memory for the user space. Very dangerous option, though.
Joe
what's dangerous about it?
One user process can allocate too much memory and cause the LVS to drop new connections because the lowmem threshold is reached.
Joe
Why have a min_conn_limit in here? If you put more memory into the director, then you'll have to recompile your kernel. Is it because finding conn_number is cheaper than finding free_memory?
Maybe conn_limit is better, or something like this:
if (conn_number > min_conn_limit && free_memory < lowmem_thresh) DROP_THIS_PACKET_FOR_NEW_CONN
:) The above example with real numbers:
if (conn_number > 500000 && free_memory < 10MB) DROP
I.e. don't allow the user processes to use memory that LVS can use. But when there are "enough" LVS connections created, we can consider reserving 10MB for the user space and start dropping new connections early, i.e. when there is less than 10MB free memory. If conn_number < 500000, LVS will simply hit the 0MB free memory point and the user space will be hurt, because these processes allocated too much memory in this case. But obtaining the "free_memory" may cost CPU cycles. Maybe we can stick with a snapshot taken each second.
The number of valid connections shouldn't change dramatically in 1 sec. However a DoS might still cause problems.
Yes, the problem is on SYN attack.
Ratz
max amount of concurrent connections: 3495. We assume having 4 realservers equally load balanced, thus we have to limit the upper threshold per realserver to 873. Like this you would never have a memory problem, but a security problem.
what's the security problem?
SYN/RST flood. My patch will set the weight of the realserver to 0 in case the upper threshold is reached. But I do not test whether the requesting traffic is malicious or not, so in case of a SYN-flood it may be that 99% of the packets cause the server to be taken out of service. In the end we have set all servers to weight 0 and the load balancer is non-functional too. But you don't have the memory problem :)
And it hasn't crashed either.
Ratz
I kinda like it but as you said, there is the amem_thresh, my approach (which was not actually done because of this problem :) and now having a lowmem_thresh. I think this will end up in an orthogonal semantic for memory allocation. For example, if you enable the amem_thresh, the case (conn_number > min_conn_limit && free_memory < lowmem_thresh) would never occur. OTOH if you set the lowmem_thresh too low, the amem_thresh is ineffective. My patch would suffer from this too.
Julian Anastasovja@ssi.bg
08 Jun 2001
lowmem_thresh is not related to amemthresh, but when amemthresh < lowmem_thresh the strategies will never be activated. lowmem_thresh should be less than amemthresh. Then the strategies will try to keep the free memory in the lowmem_thresh:amemthresh range instead of the current range 0:amemthresh.
Example (I hope you have memory to waste):
lowmem_thresh=16MB (think of it as reserved for user processes and kernel)
amemthresh=32MB (when the defense strategies trigger)
min_conn_limit=500000 (think of it as 60MB reserved for LVS connections)
So, conn_number can grow far beyond min_conn_limit, but only while lowmem_thresh is not reached. If conn_number < 500000 and free_memory < lowmem_thresh, we will wait for the OOM killer to help us. So, we have 2 tuning parameters: the desired number of connections and some space reserved for user processes. And maybe this is difficult to tune; we don't know how the kernel prevents problems in VM before activating the killer, i.e. swapping, etc. And the cluster software can take some care when allocating memory.
How long are the connection entries held for? (Column 8 of /proc/net/ip_masquerade?)
Julian
The default timeout value for a TCP session is 15 minutes, a TCP session after receiving FIN is 2 minutes, and a UDP session 5 minutes. You can use "ipchains -M -S tcp tcpfin udp" to set your own timeout values.
If we assume a clunky set of web servers being balanced that take 3s to serve an object, then if the connection entries are dropped immediately then we can balance about 20 million web requests per minute with 128M RAM. If however the connection entries are kept for a longer time period this puts a limit on the balancer.
Yeah, it is true.
Eg (assuming column 8 is the thing I'm after!)
Actually, the column 8 is the delta value in sequence numbers. The timeout value is in column 10.
[zathras@consus /]$ head -n 1000 /proc/net/ip_masquerade |sed -e "s/ */ /g"|cut -d" " -f8|sort -nr|tail -n500|head -n1
8398
i.e. held for about 2.3 hours, which would limit a 128Mb machine to balancing about 10.4 million requests per day. (Which is definitely on the low side knowing our throughput...)
Horms horms@vergenet.net
When a connection is received by an IPVS server and forwarded (by whatever means) to a back-end server, at what stage is this connection entered into the IPVS table? Is it before or as the packet is sent to the back-end server, or delayed until after the 3 way handshake is complete?
Lars
The first packet is when the connection is assigned to a real server, thus it must be entered into the table then; otherwise the 3 way handshake would likely hit 3 different real servers.
It has been alleged that IBM's Net Director waits until the completion of the three way handshake to avoid the table being filled up in the case of a SYN flood. To my mind the existing SYN flood protection in Linux should protect the IPVS table in any case, and the connection needs to be in the IPVS table to enable the 3 way handshake to be completed.
Wensong
There is state management in the connection entries in the IPVS table. A connection in different states has different timeout values; for example, the timeout of the SYN_RECV state is 1 minute, the timeout of the ESTABLISHED state is 15 minutes (the default). Each connection entry occupies 128 bytes of effective memory. Supposing that there are 128 Mbytes of free memory, the box can have 1 million connection entries. A SYN flood at over 16,667 packets/second can make the box run out of memory, and the syn-flooding attacker probably needs to allocate a T3 link or more to perform the attack. It is difficult to syn-flood an IPVS box. It would be much more difficult to attack a box with more memory.
I assume that the timeout is tunable, though reducing the timeout could have implications for prematurely dropping connections. Is there a possibility of implementing random SYN drops if too many SYNs are received, as I believe is implemented in the kernel TCP stack?
Yup, I should have implemented random early drop of SYN entries a long time ago, as Alan Cox suggested. Actually, it would be simple to add this feature into the existing IPVS code, because the slow timer handler is activated every second to collect stale entries. I just need to add some code to that handler: if over 90% (or 95%) of memory is used, run drop_random_entry to randomly traverse 10% (or 5%) of entries and drop the SYN-RECV entries among them.
A second, related question: if a packet is forwarded to a server, and this server has failed and is subsequently removed from the available pool using something like ldirectord, is there a window where the packet can be retransmitted to a second server? This would only really work if the packet was a new connection.
Yes, it is true. If the primary load balancer fails over, all the established connections will be lost after the backup takes over. We probably need to investigate how to exchange the state (connection entries) periodically between the primary and the backup without too much performance degradation.
If persistent connections are being used and a client is cached but doesn't have any active connections, does this count as a connection as far as load balancing, particularly lc and wlc, is concerned? I am thinking no. This being the case, is the memory requirement for each client that is cached but has no connections 128 bytes, as per the memory required for a connection?
The reason the existing code uses one template and creates different entries for different connections from the same client is to manage the state of the different connections, and it was easy to seamlessly add into the existing IP Masquerading code. If only one template were used for all the connections from the same client, then when the box receives a RST packet it would be impossible to identify which connection it belongs to.
We use a hash table to record established network connections. How do we know the data transmission on a connection is over, and when should we delete it from the hash table?
Julian Anastasovja@ssi.bg
24 Dec 2000
OK, here we'll analyze the LVS and mostly the MASQ transition tables from net/ipv4/ip_masq.c. LVS support adds some extensions to the original MASQ code but the handling is the same.
First, we have three protocols handled: TCP, UDP and ICMP. The first one (TCP) has many states with different timeout values, most of them set to reasonable values corresponding to the recommendations from some TCP related rfc* documents. For UDP and ICMP there are other timeout values that try to keep both ends connected for a reasonable time without creating many connection entries for each packet.
There are some rules that keep the things working:
- when a packet is received for an existing connection, or when a new connection is created, a timer is started/restarted for this connection. The timeout used is selected according to the connection state. If a packet is received for this connection (from either end) the timer is restarted again (maybe after a state change). If no packet is received during the selected period, the masq_expire() function is called to try to release the connection entry. It is possible for masq_expire() to restart the timer again for this connection if it is used by other entries. This is the case for the templates used to implement the persistent timeout: they occupy one entry with the timer set to the value of the persistence interval. There are other cases, mostly in the MASQ code, where helper connections are used and masq_expire() can't release the expired connection because it is used by others.
- according to the direction of the packet we distinguish two cases: INPUT, where the packet comes in the demasq direction (from the world), and OUTPUT, where the packet comes from an internal host in the masq direction.
What does "masq direction" mean for packets that are not translated using NAT (masquerading), for example, for Direct Routing or Tunneling? The short answer is: there is no masq direction for these two forwarding methods. It is explained in the LVS docs. In short, we have packets in both directions when NAT is used and packets in only one direction (INPUT) when DR or TUN are used. The packets are not demasqueraded for the DR and TUN methods. LVS just hooks the LOCAL_IN chain, as the MASQ code is privileged in Linux 2.2 to inspect the incoming traffic when the routing decides that the traffic must be delivered locally. After some hacking, demasquerading is avoided for these two methods, of course after some changes in the packet and in its next destination - the real servers. Don't forget that without LVS or MASQ rules, these packets hit the local socket listeners.
How are the connection states changed? Let's analyze for example the masq_tcp_states table (we analyze the TCP states here; UDP and ICMP are trivial). The columns specify the current state. The rows specify the TCP flag used to select the next TCP state and its timeout. The TCP flag is selected by masq_tcp_state_idx(). This function analyzes the TCP header and decides which flag (if many are set) is meaningful for the transition. The row (flag index) in the state table is returned. masq_tcp_state() is called to change ms->state according to the current ms->state and the TCP flag, looking in the transition table. The transition table is selected according to the packet direction: INPUT or OUTPUT. This helps us to react differently when packets come from different directions. This is explained later, but in short the transitions are separated in such a way (between INPUT and OUTPUT) that transitions to states with longer timeouts are avoided when they are caused by packets coming from the world. Everyone understands the reason for this: the world can flood us with many packets that can eat all the memory in our box. This is the reason for this complex scheme of states and transitions. The ideal case would be no different timeouts for the different states, i.e. one timeout value for all TCP states as for UDP and ICMP. Why not one for all these protocols? The world is not ideal. We try to give more time to the established connections, and if they are active (i.e. they don't expire in the 15 mins we give them by default) they can live forever (at least until the next kernel crash^H^H^H^H^Hupgrade).
How does LVS extend this scheme? For the DR and TUN methods we have packets coming from the world only. We don't use the OUTPUT table to select the next state (the director doesn't see packets returning from the internal hosts). We need to relax our INPUT rules and to switch to the state required by the external hosts :( We can't derive our transitions from the trusted internal hosts; we change the state based only on the packets coming from the clients. When we use the INPUT_ONLY table (for DR and TUN), LVS expects a SYN packet and then an ACK packet from the client to enter the established state. The director enters the established state after a two-packet sequence from the client, without knowing what happens in the real server, which can drop the packets (if they are invalid) or establish a connection. When an attacker sends SYN and ACK packets to flood a LVS-DR or LVS-Tun director, many connections enter the established state. Each established connection allocates resources (memory) for 15 mins by default. If the attacker uses many different source addresses for this attack, the director will run out of memory.
For these two methods LVS introduces one more transition table: the INPUT_ONLY table, which is used for the connections created for the DR and TUN forwarding methods. The main goal: don't enter the established state too easily - make it harder.
Oh, maybe you're just reading the TCP specifications. There are sequence numbers that both ends attach to each TCP packet, and you don't see the masq or LVS code trying to filter the packets according to the sequence numbers. This can be fatal for some connections, as an attacker can cause a state change by hitting a connection with a RST packet, for example (ES->CL). The only info needed for this kind of attack is the source and destination IP addresses and ports. Such attacks are possible but not always fatal for the active connections. The MASQ code tries to mitigate such attacks by selecting minimal timeouts that are enough for the active connections to resurrect. For example, if the connection is hit by a TCP RST packet from an attacker, this connection has 10 seconds to give evidence of its existence by passing an ACK packet through the masq box.
To make things complex and harder for an attacker trying to block a masq box with many established connections, LVS extends the NAT mode (INPUT and OUTPUT tables) by introducing internal-server-driven state transitions: the secure_tcp defense strategy. When enabled, the TCP flags in the client's packets can't trigger switching to the established state without acknowledgement from the internal end of this connection. secure_tcp changes the transition tables and the state timeouts to achieve this goal. The mechanism is simple: keep the connection in SR state with a timeout of 10 seconds instead of the default 60 seconds used when secure_tcp is not enabled.
This trick depends on the different defense power in the real servers. If they don't implement SYN cookies and so sometimes don't send SYN+ACK (because the incoming SYN is dropped from their full backlog queue), the connection expires in LVS after 10 seconds. This action assumes the connection was created by an attacker, since the SYN packet is not followed by others, as it would be by the retransmissions from a real client's TCP stack.
We give the real server 10 seconds to reply with SYN+ACK (even 2 are enough). If the real server implements SYN cookies, the SYN+ACK reply follows the SYN request immediately. But if there are no SYN cookies implemented, SYN requests are dropped when the backlog queue length is exceeded. So secure_tcp is mostly useful for real servers that don't implement SYN cookies. In this case the LVS expires the connections in SYN state in a short time, releasing the memory resources allocated for them. In any case, secure_tcp does not allow switching to the established state by looking at the client's packets; we expect an ACK from the realserver to allow the transition to the EST state.
The main goal of the defense strategies is to keep the LVS box with more free memory for other connections. Defense for the real servers can be built into the real servers. But maybe I'll propose to Wensong to add a per-connection packet rate limit. This would help against attacks that create a small number of connections but send many packets, loading the real servers dramatically. Maybe two values: a rate limit for all incoming packets and a rate limit per connection.
The good news is that all these timeout values can be changed in the LVS setup, but only when the secure_tcp strategy is enabled. An SR timeout of 2 seconds is a good value for LVS clusters whose realservers don't implement SYN cookies: if there is no SYN+ACK from the realserver, drop the entry at the director.
The bad news is, of course, for the DR and TUN methods. The director doesn't see the packets returning from the realservers, so LVS-DR and LVS-Tun forwarding can't use the internal-server-driven mechanism. There are other defense strategies that help when using these methods. All these defense strategies keep the director with memory free for more new connections. There is no known way to pass only valid requests to the internal servers, because the realservers don't provide information to the director and we don't know which packet is dropped or accepted by the socket listener. We can know this only by receiving an ACK packet from the internal server when the three-way handshake is completed and the client is identified by the internal server as a valid client, not a spoofed one. This is possible only for the NAT method.
ksparger@dialtoneinternet.net (29 Jan 2001) rephrases this by saying that LVS-NAT is layer-3 aware.
For example, NAT can 'see' if a real server responds to a packet it's been
sent or not, since it's watching all of the traffic anyway. If the
server doesn't respond within a certain period of time, the director
can automatically route that packet to another server.
LVS doesn't support this right now, but, NAT would be the
more likely candidate to support it in
the future, as NAT understands all of the IP layer concepts, and DR
doesn't necessarily.
Julian
Someone must put the real server back when it is alive. This sounds like a user space job. The traffic will not start until we send requests. Do we have to send L4 probes to the real server (from user space), or probe it with requests (LVS from kernel space)?
The tcp timeout values are what they are for good reasons (even if you don't know them), and realservers operating in an LVS must appear as normal tcp servers to the clients.
Wayne, 19 Oct 2001
I have a question about the IP_MASQ_S_FIN_TIMEOUT values in net/ipv4/ip_masq.c for the 2.2.x kernel. What purpose is served by having the terminated masqueraded TCP connection entries remain in memory for the default timeout of 2 minutes? Why isn't the entry freed immediately?
Julian Anastasov ja@ssi.bg 20 Oct 2001
Because the TCP connection is full-duplex. The internal end sends FIN and waits for the FIN from the external host. Then TIME_WAIT is entered.
Perhaps what I'm really asking is why there is an mFW state at all.
[IP_MASQ_S_FIN_WAIT] = 2*60*HZ,

/* OUTPUT */
/*          mNO, mES, mSS, mSR, mFW, mTW, mCL, mCW, mLA, mLI */
/*syn*/   {{mSS, mES, mSS, mSR, mSS, mSS, mSS, mSS, mSS, mLI }},
/*fin*/   {{mTW, mFW, mSS, mTW, mFW, mTW, mCL, mTW, mLA, mLI }},
/*ack*/   {{mES, mES, mSS, mES, mFW, mTW, mCL, mCW, mLA, mES }},
/*rst*/   {{mCL, mCL, mSS, mCL, mCL, mTW, mCL, mCL, mCL, mCL }},
};
This state has a timeout corresponding to the similar state in the internal end. The remote end may still be sending data while the internal side is in FIN_WAIT state (after a shutdown). The remote end can claim that it is still in the established state, not having seen the FIN packet from the internal side. In any case, the remote end has 2 minutes to reply. It can even work for a longer time if each packet arrives within these two minutes, not allowing the timer to expire. It depends on what state the internal end is in, FIN_WAIT1 or FIN_WAIT2. Maybe the socket in the internal end is already closed.
The only thing I can think of is if the other end of the TCP connection spontaneously issues a half close before the initiator sends his half close. Then it might be desirable to wait a while for the initiator to send his half close prior to disposing of the connection totally. What would be the consequences of using "ipchains -M -S" to set this value to, say, 1 second?
In any case, timeout values lower than those in the internal hosts are not recommended. If we drop the entry in LVS, the internal end can still retransmit its FIN packet, and the remote end has two minutes to flush its sending queue and to ack the FIN. IMO, what you could argue is that the timer in FIN_WAIT state should not be restarted on packets coming from the remote end. Anything else is not good because it can drop valid packets that fit into the normal FIN_WAIT time range.
Yasser Nabi
IP Virtual Server version 0.9.0 (size=16777216)
Julian Anastasov ja@ssi.bg
25 May 2001
Too much: that takes 128MB for the table alone. Use 16 bits, for example.
Is this a hidden/undocumented problem with IPVS, or is it just an observation of memory waste? (we use 18 bits in production)
Empty hash tables:
18 bits occupy 2MB RAM
24 bits occupy 128MB RAM
If the box has 128MB and the bits are 24, a kernel crash is mandatory, sooner or later. And this is a good reason for the virtual service not to be hit. Expect funny things to happen on a box with low memory.
I forgot that not everyone uses 256Mb or more RAM on directors :)
Yes, 256MB in a real situation is 1,500,000 connections at 128 bytes each, with 64MB for other things ... until someone experiments with a SYN attack.
However, for me it makes sense to use up to 66% of total memory for LVS, especially on high-traffic directors (given that the director doesn't run all the desktop garbage that comes with most distros).
If the bits used are 24, an empty hash table is 128MB. In the remaining 128MB you can allocate 1048576 entries at 128 bytes each ... after the kernel has killed all processes.
Some calcs considering the magic value 16 as average bucket length and for 256MB memory:
For 17 bits:
2^17=131072 => 1MB for empty hash table
131072*16=2097152 entries=256MB for connections
For 18 bits:
2^18=262144 => 2MB for empty hash table
For each MB used for the hash table we lose space for 8192 entries, but we speed up the lookup.
So, even for 1GB directors, 19 or 20 is the recommended value. Anything above is a waste of memory for hash table. In 128MB we can put 1048576 entries. In the 24-bit case they are allocated for d-linked list heads.
Joe 6 Jun 2001
what happens after the table fills up? Does ipvs handle new connect requests gracefully (ie drops them and doesn't crash)?
Julian
The table has a fixed number of rows and an unlimited number of columns (d-lists where the connection structures are enqueued). The number of connections allocated depends on the free memory.
Once there is no memory to allocate a connection structure, connection requests will be dropped. Expect crashes, maybe at another place (usually user space) :)
I'm not sure what the kernel will decide in this situation, but don't rely on processes not being killed. There is constant network activity and a need for memory for packets (floods/bursts).
And the reason the defense strategies exist is to free memory for new connections by removing the stale ones. The defense strategy can be automatically activated on a memory threshold. Killing the cluster software on memory pressure is not good.
So, the memory use can be controlled, for example, by setting drop_entry to 1 and tuning amemthresh. On floods it can be increased. It depends on the network speed too: 100/1000mbit. Thresholds of 16 or 32 megabytes can be used in such situations, of course, when there are more memory chips.
Roberto Nibali ratz@tac.ch
The director never crashes because of memory exhaustion. If it tries to allocate memory for a new entry in the table and kmalloc returns NULL, we return, or better, drop the packet being processed, rather than generate a page fault.
You could use my threshold limitation patch. You calculate how many connections you can sustain with your memory (each connection entry takes 128 bytes), divide by the number of realservers, and set the limits accordingly. Example:
128MByte, persistency 300s: max amount of concurrent connections: 3495. We assume having 4 realservers equally load balanced, thus we have to limit the upper threshold per realserver to 873. Like this you would never have a memory problem, but a security problem.
Joe
It would seem that we need a method of stopping the director hash table from using all memory, whether as a result of a DoS attack or in normal service. Let's say you fill up RAM with the hash table and all user processes go to swap; then there will be problems - I don't know what, but it doesn't sound great - at a high number of connections I expect the user space processes will be needed too. I expect we need to leave a certain amount for user space processes and not allow the director to take more than a certain amount of memory.
It would be nice if the director didn't crash when the number of connections got large. Presumably a director would be functioning only as a director and the amount of memory allocated to user space processes wouldn't change a whole lot (ie you'd know how much memory it needed).
Joe Feb 2001
With a sufficient number of connections, could a director start to swap out its tables? In that case throughput could slow to a crawl: I presume the kernel would have to retrieve parts of the table to find the realserver associated with incoming packets. I would think it would be better to drop connect requests than to accept them in this state.
Julian: IMO, this is not true. LVS uses GFP_ATOMIC allocations and, as far as I know, such allocations can't be swapped out.
If it's possible for LVS to start the director to swap, is there some way to stop this?
You can check with testlvs whether LVS uses swap. Boot the kernel with the LILO option mem=8M and a large swap area, then check whether more than 8MB of swap is used.
In earlier versions of LVS, you set the amount of memory for the tables (in bytes). Now you allocate a number of hash rows, whose columns can grow without limit, allowing an unlimited number of connections. Once the number of connections becomes sufficiently large, other resources will become limiting.
The ipvs code doesn't handle this; presumably the director will crash (also see the threshold patch). Instead you handle this by brute force, adding enough memory to accept the maximum number of connections your setup will ever be asked to handle (e.g. under a DoS attack). This memory size can be determined by multiplying the rate at which your network connection can push connect requests to the director by the timeout values, which are set by FIN_WAIT or the persistence timeout.
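As a hedged sizing sketch (the rate and timeout figures below are hypothetical, not taken from the text):

```shell
# Sizing sketch: if the link can push `rate` connect requests/sec at the
# director and each entry lives `timeout` seconds before expiring, the
# table can grow to rate*timeout entries of 128 bytes each.
rate=10000        # hypothetical SYNs/sec the link can deliver
timeout=60        # hypothetical entry lifetime (e.g. a SYN_RECV timeout)
entry=128         # bytes per connection entry
entries=$((rate * timeout))
mem_mb=$((entries * entry / 1024 / 1024))
echo "$entries entries, about ${mem_mb} MB of director memory"
```

Scale the rate and timeout for your own link speed and persistence settings; longer timeouts (e.g. persistence) multiply the requirement directly.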
You can expand the number of ports to 65k, but eventually you'll reach the 65k port limit.
Sometimes client processes on the realservers need to connect with machines on the internet (see clients on LVS-NAT realservers and clients on LVS-DR realservers).
Wayne wayne@compute-aid.com
Nov 5 2001: Say you have a web page that has to retrieve on-line ads from one of your advertisers (people who pay you for showing their ads). If you have 50,000 visitors on your site, you will open 50,000 connections between your web server and the ad server out there somewhere. The masquerade limit is 4,096 per pair of IP addresses, and 40,960 per LVS box. In our case, the realserver is behind the LVS-NAT director, which also functions as the firewall, so the realserver MUST use the director to reach the ad servers.
Usually the RIP is private (e.g. 192.168/16) and will have to be NAT'ed to the outside world. This can be done with LVS-NAT or LVS-DR by adding masquerading rules to the director's iptables/ipchains rules. (With LVS-DR, you also have to route the packets from the RIP - this routing is set up by default with the configure script.)
Less often you want to use more ports on your LVS client machines.
Wang Haiguang: My client machine uses port numbers between 1024 and 4096. After reaching 4096, it loops back to 1024 and reuses the ports. I want to use more port numbers.
michael_e_brown@dell.com
06 Feb 2001
echo 1024 65000 > /proc/sys/net/ipv4/ip_local_port_range

(see /usr/src/linux/Documentation/networking/ip-sysctl.txt)
While normal client processes start using ports at 1024, masqueraded ports start at 61000 (see clients on LVS-NAT realservers). The masquerading code does not check if other processes are requesting ports and thus port collisions could occur. It is assumed on a NAT box that no other processes are initiating connections (i.e. you aren't running a passive ftp server).
Wayne wayne@compute-aid.com
14 May 2000: If running a load balancer tester, say the one from IXIA, to issue connections to 100 powerful web servers, would all the parameters in Julian's description need to be changed, or should it not be a problem to have many, many connections from a single tester?
Julian
There is no limit for the connections from the internal hosts. Currently, the masquerading allows one internal host to create 40960 TCP connections. But the limit of 4096 connections to one external service is still valid.
If 10 internal hosts try to connect to one external service, each internal host can create 4096/10 => 409 connections.
For UDP the problem is sometimes worse. It depends on the /proc/sys/net/ipv4/ip_masq_udp_dloose value.
Joe: which is internal and which is external here? The client, the realservers?
This is a plain masquerading so internal and external refer to masquerading. These limits are not for the LVS connections, they are only for the 2.2 masquerading.
                              / 65095
Internal Servers ----- MADDR           External Server:PORT
                              \ 61000
When many internal clients try to connect to same external real service, the total number of TCP connections from one MADDR to this remote service can be 4096 because the masq uses only 4096 masq ports by default. This is a normal TCP limit, we distinguish the TCP connections by the fact they use different ports, nothing more. And the masq code is restricted by default to use the above range of 4096 ports.
In the whole masquerading table there is a space only for 40960 TCP, 40960 UDP and 40960 ICMP connections. These values can be tuned by changing ip_masq.c:PORT_MASQ_MUL.
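The 40960 figure follows from the defaults: the masq port range (61000-65095, i.e. 4096 ports) times PORT_MASQ_MUL. A quick check of the arithmetic (the multiplier value of 10 is inferred from the two figures quoted above):

```shell
# Per-protocol masq connection limit = masq port range * PORT_MASQ_MUL
ports=4096   # default mport range, 61000-65095
mul=10       # multiplier consistent with the 40960 limit quoted above
echo "limit per protocol: $((ports * mul))"
```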
Wayne wayne@compute-aid.com, 1 Nov 2001 (Julian continuing)
PORT_MASQ_MUL appears to serve only as a check to make sure the masquerading facility does not hog all the memory, and that actually things would still work no matter how large PORT_MASQ_MUL is, or even if the checks using it are disabled. Is this true?
Julian: By multiplying this constant with the masq port range, you define the connection limit for each protocol. This is related to the memory used for masquerading. It is a real limit, but not for LVS connections: they are usually not limited by port collisions, and LVS does not check this limit.
What about using more than the 32k range? What is the maximum I could select?
Peter Mueller pmueller@sidestep.com
You should be able to use about 60k, i.e. 1024-61000. I hope you have lots of RAM :-)
The PORT_MASQ_MUL value simply determines the recommended length of one row in the masq hash table for connections, but in fact it enters into the above connection limits. Busy masq routers should increase this value, and maybe the 4096 masq port range too. This affects, for example, squid servers behind a masq router.
LVS uses another table without limits. For LVS setups the same TCP restrictions apply but for the external clients:
         4999 \
               Client --------- VIP:VPORT  LVS Director
         1024 /

The number of client connections to one VIP:VPORT is limited by the number of client ports in use on the same client IP.
The same restrictions apply to UDP; UDP has the same port ranges. But for UDP the 2.2 kernel can apply different restrictions, caused by optimizations that try to create one UDP entry for many connections. The reason for this is that one UDP client can connect to many UDP servers, while this is not common for TCP.
Joe: when you increase the port range, you need more memory. Is this only because you can have more connections and hence will need a bigger ipvsadm table?
Yes, the first need is for more masqueraded connections, and they allocate memory. LVS uses a separate table, which is not limited. We distinguish LVS-NAT from masquerading: LVS-NAT (and any other forwarding method) does not allocate extra ports, even from other ranges. It shadows only the defined port; no other ports are involved unless masquerading is used.
ipvs doesn't check port ranges, and so collisions can occur with regular services (ftp was mentioned). I would have thought that a process needing to open an IP connection would ask the tcp code in the kernel for a connection and let that code handle the assignment of the port.
LVS does not allocate local ports. When the masquerade is used to help with some protocol, the masquerade performs the check (ftp for example).
The port range has nothing to do with LVS; it helps the masquerading code create more connections, because there is a fixed limit for each protocol. But sometimes LVS for 2.2 uses ip_masq_ftp, and maybe only then is this mport range used.
X-window connections are at port 6000 and up. Will you be able to start an X session if these ports are in use by a director masquerading out connections from the realservers?
If we put LVS (ipvsadm -A) in front of port 6000, then X sessions will be stopped. OTOH, the masquerade does not select ports in this range; the default start is 61000. So FTP sessions will not disturb local ports - unless, of course, you increase the mport range to cover well-known server ports such as X.
LVS Account, 27 Feb 2001: I'm trying to do some load testing of LVS using a reverse proxy cache server as the load-balanced app. The error I get is from a load-generating app. Here is the error:
byte count wrong 166/151
Julian Anastasov ja@ssi.bg
>
Broken app.
this goes on for a few hundred requests, then I start getting:
Address already in use
App uses too many local ports.
This is when I can't telnet to port 80 any more... If I try to telnet to 10.0.0.80 80 I get this:
$ telnet 10.0.0.80 80
Trying 10.0.0.80...
telnet: Unable to connect to remote host: Resource temporarily unavailable
No more free local ports.
If I go directly to the web server OR if I go directly to the IP of the reverse proxy cache server, I don't get these errors.
Hm, there are free local ports now.
I'm calling the load-generating app this way:
/home/httpload/load -sequential -proxyaddr 10.0.0.80 -proxyport 0 -parallel 120 -seconds 6000000 /home/httpload/url

Upping the local port range has helped tremendously.
Here's a case where a realserver ran out of udp ports doing DNS lookups while serving http.
Hendrik Thiel thiel@falkag.de
I am using IP Virtual Server version 0.9.14 (size=4096). We have 6 realservers, each like this:
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
-> server1:www        Masq    1      68         12391

Today we reached a new peak (very fast, within a few minutes) of 30 Mbit/s, up from the normal 15 Mbit/s. Afterwards the following kernel messages (dmesg) showed up...
IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=31894).
IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=31894).
IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=31888).
Julian Anastasov ja@ssi.bg
20 Nov 2001 (heavily edited by Joe)
It seems you are flooding a single remote host with UDP requests from a realserver. Your service, www, is TCP and is not directly connected to these messages. You've reached the UDP limit per destination (4096), there are still free UDP ports on the realserver for other destinations.
Hendrik: yes, it's DNS; each realserver is a caching DNS server.
resolv.conf:
nameserver 127.0.0.1
nameserver external IP
LVS is vulnerable to DoS by an attacker making repeated connection requests. Each connection requires 128 bytes of memory, so eventually the director will run out. This will take a while, but an attacker has plenty of time if you're asleep. As well, with LVS-DR and LVS-Tun the director doesn't have access to the TCP/IP tables in the realserver(s), which show whether a connection has closed (see director hash table). The director can only guess that the connection has really closed, and does so using timeouts.
For information on DoS strategies for LVS see DoS page.
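To get a feel for the time scale involved, a rough sketch (the flood rate is hypothetical; the 128 bytes per entry is from the text above):

```shell
# How long until a flood fills a given amount of director memory with entries?
mem_mb=256   # hypothetical memory available for the connection table
rate=5000    # hypothetical spoofed connection requests per second
entries=$((mem_mb * 1024 * 1024 / 128))
echo "capacity: $entries entries"
echo "time to exhaust: $((entries / rate)) seconds"
```

With these assumptions the table fills in roughly seven minutes - long enough that the defense strategies have time to react, short enough that a sleeping admin does not.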
Laurent Lefoll Laurent.Lefoll@mobileway.com
14 Feb 2001: If I am not misunderstanding something, the variable /proc/sys/net/ipv4/vs/timeout_established gives the time a TCP connection can be idle, after which the entry corresponding to this connection is cleared. My problem is that sometimes this seems not to be the case. For example, I have a system (2.2.16 and ipvs 0.9.15) with /proc/sys/net/ipv4/vs/timeout_established = 480, but the entries are created with a real timeout of 120! On another system
Julian Anastasov ja@ssi.bg
Read the secure_tcp defense strategy section, where the timeouts are explained. They are valid for the defense strategies only. For the TCP EST state you need to read the ipchains man page.
For more explanation of the secure_tcp strategy also see the explanation of the director's hash table.
when I play with "ipchains -M -S [value] 0 0", the variable /proc/sys/net/ipv4/vs/timeout_established is modified even when /proc/sys/net/ipv4/vs/secure_tcp is set to 0, i.e. when I'm not using the secure TCP defense. The "real" timeout is of course set to [value] when a new TCP connection appears. So should I understand that timeout_established, timeout_udp, ... are always modified by "ipchains -M -S ...", whether or not I use the secure TCP defense, but that if secure_tcp is set to 0, other variables give the timeouts to use? If so, are these variables accessible, and how do I check their values?
ipchains -M -S modifies the two TCP timeouts and the UDP timeout in both secure_tcp modes (off and on); i.e. ipchains changes all three timeout_XXX vars. When you change the timeout_* vars directly, you change them for secure_tcp=on only. Think of the timeouts as two sets, one for each secure_tcp mode: ipchains changes the 3 vars in both sets, while writing timeout_* changes only the "on" set. While secure_tcp is off, changing timeout_* does not affect the connection timeouts; they are used when secure_tcp is on.
(Joe: `ipchains -M -S 0 value 0`, where value=10, does not change the timeout values or the number of entries seen in InActConn, or seen with netstat -M, or ipchains -M -L -n).
LVS has its own tcpip state table, when in secure_tcp mode.
carl.huang: what are the vs_tcp_states[] and vs_tcp_states_dos[] elements in the ip_vs_conn structure for?
Roberto Nibali ratz@tac.ch
16 Apr 2001
The vs_tcp_states[] table is the modified state transition table for the TCP state machine. vs_tcp_states_dos[] is a further modified state table used when we are under attack and secure_tcp is enabled; it is tighter, but no longer conforms to the RFC. Let's take an example of how to read it:
static struct vs_tcp_states_t vs_tcp_states [] = {
/*      INPUT */
/*        sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */
/*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSR }},
/*fin*/ {{sCL, sCW, sSS, sTW, sTW, sTW, sCL, sCW, sLA, sLI, sTW }},
/*ack*/ {{sCL, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }},
/*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sSR }},
The elements 'sXX' mean state XX; for example, sFW means TCP state FIN_WAIT, sSR means TCP state SYN_RECV, and so on. The table describes the transition of the TCP state machine from one TCP state to another after a state event occurs. The commentary row at the top lists the current states (columns); at the start of each data row you see the incoming TCP flag (syn, fin, ack, rst) that drives the transition. So the rest is easy: take the column headed sES - if you're in sES and get a fin, you go from sES to sCW, which should conform to the RFC and Stevens.
Short illustration:
/*           sES,
/*syn*/ {{ ,    ,
/*fin*/ {{ , sCW,
It was some months ago last year that Wensong, Julian and I discussed a security enhancement for the TCP state transitions, and after some heavy discussion they implemented it. So the second table, vs_tcp_states_dos[], was born (look in the mailing list in early 2000).
joern maier 22 Nov 2000
I've got a problem protecting my VS from SYN-flood attacks. Somehow the drop_entry mechanism seems not to work. During a SYN flood from 3 clients against my VS (1 director + 3 realservers), the system becomes unreachable. A single realserver under the same attack by those clients stays alive.
Julian: You can't SYN flood the director with only 3 clients. You need more clients (or, as an alternative, you can download "testlvs" from the web site). What does ipvsadm -Ln show under attack? How do you activate drop_entry? What does "cat drop_entry" show?
All realservers have tcp_syncookies enabled (1) and tcp_max_syn_backlog=128; on the director I set the drop_entry var to 1 (echo 1 > drop_entry). Before compiling the kernel I set the table size to 2^20. My director has 256 MB and no other applications running.
You don't need such a large table, really.
With testlvs and two clients, my LVS gets to a denial of service, although "cat drop_entry" shows me a "1".
ipvsadm -Ln:
192.168.10.1:80 lc
  192.168.1.4:80 Tunnel 1 0 33246
  192.168.1.3:80 Tunnel 1 0 33244
  192.168.1.2:80 Tunnel 1 0 33246
run testlvs with 100,000 source addresses.
During the flooding attack the connection counts stay around this size. Using the SYN-flood tool I tried before, ipvsadm shows me this:
192.168.10.1:80 lc
  192.168.1.4:80 Tunnel 1 0 356046
  192.168.1.3:80 Tunnel 1 0 355981
  192.168.1.2:80 Tunnel 1 0 356013
So it shows about ten times as many connections as your tool. I took a look at the packets; both are quite similar - they only differ in the window size (testlvs has 0, the other tool uses 65534) and sequence numbers (OK, checksums as well).
I am activating drop entry like this:
I switch on my computer (director), start linux with the LVS kernel, then:
cd /proc/sys/net/ipv4/vs
echo 1 > drop_entry
Julian: Maybe you need to tune amemthresh; 1024 pages (4MB) is too low a value. How much memory does "free" show under attack? You can try 1/8 of the RAM size, for example. The main goal of these defense strategies is to keep free memory in the director, nothing more. The defense strategies are activated according to the free memory size; the packet rate is not considered.
joern maier
That all sounds good to me, but what I'm really wondering about is why the drop_entry variable still has a value of 1. I thought it had to be 2 when my system is under attack? To me it looks like LVS doesn't even think it's under attack, and therefore doesn't use the drop_entry mechanism.
You are right - you forgot to specify when LVS should consider itself under attack. drop_entry switches automatically from 1 to 2 when the free memory reaches amemthresh. Do you know whether your free memory is below 4MB? See http://www.linuxvirtualserver.org/defense.html. So, 1,000,000 entries created by the other tool occupy 128MB of memory; you have 256MB :) Boot with mem=128M (in lilo), or set amemthresh to 32768, or run testlvs with more source addresses (2,000,000). I'm not sure the last will help if the other tool you use does not limit the number of spoofed addresses, but don't run testlvs with less than -srcnum 2000000. If the setup allows a rate > 33,333 packets/sec, LVS can create 2,000,000 entries that expire in 60 seconds (the SYN_RECV timeout). Better not to use the -random option in testlvs for this test.
So you can test with such large values, but make sure you tune amemthresh in production to the best value for your director. The default value is not very useful. You can test whether 1/8 of RAM is a good value (8192 pages for a 4K page size).
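The 1/8-of-RAM suggestion translates into a number of pages like this (a sketch assuming 256 MB of RAM and 4 KB pages, as in the discussion above):

```shell
# amemthresh is measured in pages; 1/8 of RAM is a suggested starting point.
ram_mb=256     # director RAM (illustrative)
page_kb=4      # page size
pages=$((ram_mb * 1024 / page_kb / 8))
echo "echo $pages > /proc/sys/net/ipv4/vs/amemthresh"
```

The printed command is what you would then run on the director; tune the value up or down while watching "free" under load.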
Alan Cox alan@lxorguk.ukuu.org.uk
>
The biggest problem with load balancing, when you need to do this sort of trickery (and it's one the existing load balancing patches seem to have), is that if you store per-connection state, then a synflood will take out your box (if you run out of RAM), or run a delightfully efficient DoS attack if you don't. The moment you factor time into your state tables, you are basically well and truly screwed.
Lars Marowsky-Bree lmb@teuto.net
8 Jun 1999: This can be solved with a hash table, where you take the source IP as the key and look up the server to direct the request to. Since the hash table is fixed size, we can do with fixed resources.
Given a proper hash function, this scheme is _ideal_ for basic round-robin and weighted round-robin with fixed weights and we should look at implementing this. Keeping state if not necessary _is_ a bug.
However, we are screwed and can't do this if we want to do least-connections, dynamic load-based balancing, adding servers at a later time, etc., and still deliver sticky connections (i.e. connections from client A stay on server B until a timeout happens or server B dies).
Basically, since we _need_ to keep state on a per-client basis for this we can be screwed easily by bombarding us with a randomized source IP.
Now - for all but the most simple load balancing, we NEED to keep state. So we need to weasel our way out of this mess somehow.
One approach would be to integrate SYN cookies into the load balancer itself and only pass on the connection if the TCP handshake succeeded. There are a few obvious problems with this: it is a very complex task, and it still screws us in the case of a UDP flood.
"The easy way out" for TCP connections is to do this stuff in user space - a load-balancing proxy, which connects to the backend servers. Problems with this are that it isn't transparent to the backend servers anymore (all connections come from the IP of the loadbalancer), it does not scale as well (no direct routing approach etc possible), and we still did not solve UDP.
I propose the following: we continue to maintain state like we always did, but when we hit, let's say, 32,000 simultaneous connections, we go into "overload" mode - that is, all new connections are mapped using the hash table as Alan proposed, but we still search the stateful database first.
There are a few problems with this too: it is not as fast as the pure hash table, since we need to look into the stateful database before consulting the hash table; and if weights change during overload mode, sticky connections can't easily be guaranteed (I thus propose NOT to change weights during overload mode, or at least to ignore the changes with regard to the hashing).
However, these disadvantages only appear under attack. At the moment we would simply crash, so it IS an improvement. It is a fully transparent approach and works with UDP too. The effort to implement this is acceptable (if it were userspace I would give it a try sometime ;)
And if we implement this scheme for fixed load balancing, which someone else definitely should, reusing the code here might not be that much of a problem.
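A toy illustration of the stateless hash mapping discussed above (the octet-sum "hash" is purely illustrative, not the function anyone proposed; a real implementation would use a proper hash over the source IP):

```shell
# Map a source IP to a realserver with a fixed hash - no per-connection state,
# so a flood of spoofed sources costs no memory at all.
src_ip="192.168.1.77"   # hypothetical client
n_servers=3
# toy hash: sum the four octets (illustration only)
sum=$(echo "$src_ip" | tr '.' ' ' | awk '{print $1+$2+$3+$4}')
idx=$((sum % n_servers))
echo "client $src_ip -> realserver $idx"
```

The same client always maps to the same realserver, which is why the scheme gives sticky connections for free as long as the server set and weights stay fixed.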
Michael McConnell michaelm@eyeball.com
08 Oct 2001
the command
#ipchains -L -M

returns a list of masqueraded connections, i.e.
TCP 01:38.01 10.1.1.41 21.1.112.43  80 (80) -> 4052
TCP 01:38.08 10.1.1.41 21.1.112.43  80 (80) -> 4053
TCP 00:25.09 10.1.1.11 20.170.180.17 80 (80) -> 4430
If ipchains (kernel 2.2) has been set with a long TCP timeout
ipchains -M -S 7200 0 0 (2 hour TCP timeout)
these connections remain (and populate the ipvsadm table) for 2 hours. Does anyone have a suggestion for how to purge this table manually? If I run out of ports, I get a DoS (2 hr timeout, 30,000 TCP connections... DoS).
Peter Mueller: If you alter /proc/net/ip_masquerade, it will break the established connection. Isn't that what you want to do?
No matter what I do I can not seem to reset, clear or modify this manually.
If you do not like the prospect of altering it directly, perhaps try a shell script:
#!/bin/sh
# hopefully this works and you won't shoot yourself in the foot...
ipchains -M -S 1 0 0
sleep 5
ipchains -M -S 7200 0 0
Setting this value only affects *NEW* connections; already-established connections are unaffected.
Julian Anastasov ja@ssi.bg
>
Without timeout values specific to each LVS virtual service, and others for the masqueraded connections, it is difficult to play such games. It seems only one timeout needs to be separated out: the TCP EST timeout. The reason such support is not in 2.2 is that nobody wants to touch the user structures. IMO it can be added for 2.4 if there are enough free options in ipvsadm, but it also depends on some implementation details.
If you are worried about free memory, you can use one of the LVS DoS defense strategies:
echo 1 > drop_entry
iptables (2.4 kernels) is different to ipchains (2.2 kernels). For one thing there is no "iptables -C" to check your rules (at least not yet - one is promised).
Ratz: If you're dealing with netfilter, packets don't travel through all chains anymore. Julian once wrote something about it:
packets coming from outside to the LVS do:
PRE_ROUTING -> LOCAL_IN (LVS in) -> POST_ROUTING

packets leaving the LVS travel:
PRE_ROUTING -> FORWARD(LVS out) -> POST_ROUTING
From the iptables howto, COMPATIBILITY WITH IPCHAINS: This iptables is very similar to ipchains by Rusty Russell. The main difference is that the chains INPUT and OUTPUT are only traversed for packets coming into the local host and originating from the local host respectively. Hence every packet only passes through one of the three chains; previously a forwarded packet would pass through all three.
I don't yet understand all of iptables, but it has made the whole filtering and NAPT stuff smoother and more flexible. You have more possibilities to manage the traffic. Whether it is better has to be proven in reality; so far there are not many setups with complex firewall settings, like merging different advanced routing aspects with QoS and custom targets over different networks with all kinds of non-TCP/UDP traffic and an IPv6 connection.
2.4 director:
Packets coming into the director (out->in): PRE_ROUTING -> LOCAL_IN (LVS in) -> POST_ROUTING
packets leaving the LVS travel (in->out): PRE_ROUTING -> FORWARD (LVS out) -> POST_ROUTING
2.2 director:
INPUT in 2.2 is similar to PRE_ROUTING in 2.4; INPUT, OUTPUT and FORWARD are the 2.2 firewall chains.
input routing:  ip_route_input()
output routing: ip_route_output()
forwarding:     ip_forward()
local delivery: ip_local_deliver()
Matthew S. Crocker matthew@crocker.com
31 Aug 2001: How do I filter LVS? Does LVS grab the packets before or after iptables?
Julian: LVS is designed to work after any kind of firewall rules, so you can put your ipchains/iptables rules in safely. If you are using iptables, put them on LOCAL_IN, not on FORWARD. The LVS packets do not go through FORWARD.
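For example (hypothetical VIP and port; the point is only the chain choice - INPUT/LOCAL_IN rather than FORWARD):

```shell
# Accept web traffic to the VIP on the 2.4 director's INPUT (LOCAL_IN) chain.
# LVS packets never traverse FORWARD, so a rule there would not see them.
iptables -A INPUT -d 10.0.0.80 -p tcp --dport 80 -j ACCEPT
```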
Joe
If you are being attacked, it might be better to filter upstream (e.g. the router or your ISP), to prevent the LAN from being flooded.
The output of ipvsadm lists connections, either as ActiveConn or InActConn.
Entries in the ActConn column come from connections in the ESTABLISHED state.
Entries in the InActConn column come from connections in any other state (e.g. the handshake states, FIN_WAIT, TIME_WAIT).
The 3-way handshake to establish a connection takes only 3 exchanges of packets (i.e. it's quick on any normal network), so you won't be quick enough with ipvsadm to see the connection in the states before it becomes ESTABLISHED. However, if the service on the realserver is running under identd, you'll see an InActConn entry during the delay period.
In this case the 3 way handshake will never complete, the connection will hang, and there'll be an entry in the InActConn column.
Usually the number of InActConn will be larger or very much larger than the number of ActConn.
Here's a LVS-DR LVS, setup for ftp, telnet and http, after telnetting from the client (the client command line is at the telnet prompt).
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
  -> bashfull.mack.net:www       Route   1      0          0
  -> sneezy.mack.net:www         Route   1      0          0
TCP  lvs2.mack.net:0 rr persistent 360
  -> sneezy.mack.net:0           Route   1      0          0
TCP  lvs2.mack.net:telnet rr
  -> bashfull.mack.net:telnet    Route   1      1          0
  -> sneezy.mack.net:telnet      Route   1      0          0
showing the ESTABLISHED telnet connection (here to realserver bashfull).
Here's the output of netstat -an | grep (appropriate IP) for the client and the realserver, showing that the connection is in the ESTABLISHED state.
client:# netstat -an | grep VIP
tcp        0      0 client:1229        VIP:23        ESTABLISHED

realserver:# netstat -an | grep CIP
tcp        0      0 VIP:23             client:1229   ESTABLISHED

Here's immediately after the client logs out from the telnet session.

director# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
  -> bashfull.mack.net:www       Route   1      0          0
  -> sneezy.mack.net:www         Route   1      0          0
TCP  lvs2.mack.net:0 rr persistent 360
  -> sneezy.mack.net:0           Route   1      0          0
TCP  lvs2.mack.net:telnet rr
  -> bashfull.mack.net:telnet    Route   1      0          0
  -> sneezy.mack.net:telnet      Route   1      0          0

client:# netstat -an | grep VIP
#ie nothing, the client has closed the connection
#the realserver has closed the session in response
#to the client's request to close out the session.
#The telnet server has entered the TIME_WAIT state.
realserver:/home/ftp/pub# netstat -an | grep 254
tcp        0      0 VIP:23             CIP:1236      TIME_WAIT
#a minute later, the entry for the connection at the realserver is gone.
Here's the output after ftp'ing from the client and logging in, but before running any commands (like `dir` or `get filename`).
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
  -> bashfull.mack.net:www       Route   1      0          0
  -> sneezy.mack.net:www         Route   1      0          0
TCP  lvs2.mack.net:0 rr persistent 360
  -> sneezy.mack.net:0           Route   1      1          1
TCP  lvs2.mack.net:telnet rr
  -> bashfull.mack.net:telnet    Route   1      0          0
  -> sneezy.mack.net:telnet      Route   1      0          0

client:# netstat -an | grep VIP
tcp        0      0 CIP:1230           VIP:21        TIME_WAIT
tcp        0      0 CIP:1233           VIP:21        ESTABLISHED

realserver:# netstat -an | grep 254
tcp        0      0 VIP:21             CIP:1233      ESTABLISHED
The client opens 2 connections to the ftpd and leaves one open (the ftp prompt). The other connection, used to transfer the user/passwd information, is closed down after the login. The entry in the ipvsadm table corresponding to the TIME_WAIT state at the realserver is listed as InActConn. If nothing else is done at the client's ftp prompt, the connection will expire in 900 secs. Here's the realserver during this 900 secs.
realserver:# netstat -an | grep CIP
tcp        0      0 VIP:21             CIP:1233      ESTABLISHED
realserver:# netstat -an | grep CIP
tcp        0     57 VIP:21             CIP:1233      FIN_WAIT1
realserver:# netstat -an | grep CIP
#ie nothing, the connection has dropped
#if you then go to the client, you'll find it has timed out.
ftp> dir
421 Timeout (900 seconds): closing control connection.
http 1.0 connections are closed immediately after retrieving the URL (i.e. you won't see any ActConn in the ipvsadm table immediately after the URL has been fetched). Here are the outputs after retrieving a webpage from the LVS.
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
  -> bashfull.mack.net:www       Route   1      0          1
  -> sneezy.mack.net:www         Route   1      0          0

client:~# netstat -an | grep VIP
bashfull:/home/ftp/pub# netstat -an | grep CIP
tcp        0      0 VIP:80             CIP:1238      TIME_WAIT
Ty Beede wrote: I am curious about the implementation of the inactconns and activeconns variables in the lvs source code.
Julian
Info about LVS <= 0.9.7:
TCP
  active:   all connections in ESTABLISHED state
  inactive: all connections not in ESTABLISHED state
UDP
  active:   0 (none)        (LVS <= 0.9.7)
  inactive: all             (LVS <= 0.9.7)
active + inactive = all
Look in this table for the timeouts used for each protocol/state:
/usr/src/linux/net/ipv4/ip_masq.c, masq_timeout_table
For VS/TUN and VS/DR, the TCP states are changed by checking only the TCP flags of the incoming packets. For these methods UDP entries can expire (5 minutes?) if only the realserver sends packets and there are no packets from the client.
For info about the TCP states:
- /usr/src/linux/net/ipv4/tcp.c
- rfc793.txt
From: Jean-francois Nadeau jf.nadeau@videotron.ca
I've done some testing (netmon) on this, and here are my observations:
1. A connection becomes active when LVS sees the ACK flag in the TCP header coming into the cluster, i.e. when the socket gets established on the realserver.
2. A connection becomes inactive when LVS sees the ACK-FIN flag in the TCP header coming into the cluster. This does NOT correspond to the socket closing on the realserver.
Example with my Apache Web server.
Client <--> Server

A client requests an object on the web server on port 80:

SYN REQUEST    ---->
SYN ACK        <----
ACK            ---->   *** ActiveConn=1 and 1 ESTABLISHED socket on realserver.
HTTP get       ---->   *** The client requests the object
HTTP response  <----   *** The server sends the object

APACHE closes the socket:
                       *** ActiveConn=1 and 0 ESTABLISHED sockets on realserver
The CLIENT receives the object (took 15 seconds in my test).
ACK-FIN        ---->   *** ActiveConn=0 and 0 ESTABLISHED sockets on realserver
Conclusion: ActiveConn is the number of active CLIENT connections..... not connections on the server, in the case of short transmissions (like objects on a web page). It's hard to calculate a server's capacity based on this, because slower clients make ActiveConn greater than what the server is really processing. You won't be able to reproduce this effect on a LAN, because the client receives the segment too fast.
In the LVS mailing list, many people explained that the correct way to balance the connections is to use monitoring software. The weights must be evaluated using values from the real server. In VS/DR and VS/TUN, the Director can be easily fooled with invalid packets for some period and this can be enough to unbalance the cluster when using "*lc" schedulers.
I reproduce the effect connecting at 9600 bps and getting a 100k gif from Apache, while monitoring established sockets on port 80 on the real server and ipvsadm on the cluster.
Julian: You are probably using VS/DR or VS/TUN in your test. Right? Using these methods, the LVS changes the TCP state based only on the incoming packets, i.e. from the clients. This is the reason that the Director can't see the FIN packet from the real server. This is the reason that LVS can be easily SYN flooded, or even flooded with an ACK following the SYN packet. The LVS can't change the TCP state according to the state in the real server. This is possible only for VS/NAT mode. So, in some situations you can have invalid entries in ESTABLISHED state which do not correspond to the connections in the real server, which effectively ignores these SYN packets using cookies. VS/NAT looks the better solution against SYN flood attacks. Of course, the ESTABLISHED timeout can be changed to 5 minutes for example. Currently, the max timeout interval (excluding the ESTABLISHED state) is 2 minutes. If you think that you can serve the clients using a smaller timeout for the ESTABLISHED state when under an "ACK after SYN" attack, you can change it with ipchains. You don't need to change it under 2 minutes in LVS 0.9.7. In the latest LVS version SYN+FIN switches the state to TIME_WAIT, which can't be controlled using ipchains. In other cases, you can change the timeout for the ESTABLISHED and FIN-WAIT states. But you can change it only down to 1 minute. If this doesn't help, buy 2GB RAM or more for the Director.
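On a 2.2 director, Julian's suggestion to shorten the timeouts with ipchains might look like the following sketch. The three values are the masquerading timeouts, in seconds, for TCP (ESTABLISHED), TCP after a FIN, and UDP; the particular numbers here are only an illustration, and as noted above your LVS version imposes a minimum on how low the effective timeouts can go.

```shell
# sketch: set masquerading timeouts on a 2.2 director
# ipchains -M -S <tcp> <tcpfin> <udp>   (seconds; 0 leaves a value unchanged)
ipchains -M -S 120 60 0
```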
One thing that can be done, but this is may be paranoia:
change the INPUT_ONLY table:
from: FIN SR ---> TW
to:   FIN SR ---> FW

OK, this is an incorrect interpretation of the TCP states, but it is a hack which allows the min state timeout to be 1 minute. Now using ipchains we can set the timeout for all TCP states to 1 minute.
If this is changed you can now set ESTABLISHED and FIN-WAIT timeouts down to 1 minute. In current LVS version the min effective timeout for ESTABLISHED and FINWAIT state is 2 minutes.
Jean-Francois Nadeau jf.nadeau@videotron.ca
I'm using DR on the cluster with 2 real servers. I'm trying to control the number of connections to achieve this:
The cluster in normal mode balances requests on the 2 real servers. If the real servers reach a point where they can't serve clients fast enough, a new entry with a weight of 10000 is entered in LVS to send the overflow locally to a web server with a static web page saying "we're too busy". It's a cgi that intercepts 'deep links' into our site and returns a predefined page. A 600 second persistence ensures that already connected clients stay on the server they began to browse. The client only has to hit refresh until the number of ActiveConns (I hoped) on the real servers gets lower and the overflow entry gets deleted.
Got the idea... Load balancing with overflow control.
Julian: Good idea. But the LVS can't help you here. When the clients are redirected to the Director they stay there for 600 seconds.
But when we activate the local redirection of requests due to overflow, ActiveConn continues to grow in LVS, while InActConn decreases as expected. So the load on the real server gets OK... but LVS doesn't see it and doesn't let new clients in. (It takes 12 minutes before ActiveConn decreases enough to reopen the site.)
I need a way, a value to check that says the server is overloaded, to begin redirecting locally, and the opposite.
I know that seems a little complicated....
Julian
What about trying to:
- use persistent timeout 1 second for the virtual service.
If you have one entry for this client you have all entries from this client to the same real server. I haven't tested it, but maybe a client will load the whole web page. If the server is overloaded, the next web page will be "we're too busy".
- switch the weight for the Director between 0 and 10000. Don't delete the Director as real server.
Weight 0 means "No new connections to the server". You have to play with the weight for the Director, for example:
- if your real servers are loaded near 99% set the weight to 10000
- if your real servers are loaded below 95% set the weight to 0
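Julian's two suggestions can be sketched with ipvsadm as follows. The VIP, port, scheduler and the use of 127.0.0.1 for the director's local "sorry server" are placeholders, not values from the discussion:

```shell
# sketch: virtual service with a 1 second persistence timeout
ipvsadm -A -t 192.168.2.110:80 -s wlc -p 1

# overflow on: send new connections to the local "too busy" server
ipvsadm -e -t 192.168.2.110:80 -r 127.0.0.1 -w 10000

# overflow off: weight 0 means "no new connections to this server",
# but don't delete the entry
ipvsadm -e -t 192.168.2.110:80 -r 127.0.0.1 -w 0
```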
From: Jean-Francois Nadeau jf.nadeau@videotron.ca
Will a weight of 0 redirect traffic to the other real servers (persistency remains ?)
Julian: If the persistent timeout is small, I think so.
I can't get rid of the 600 second persistence because we run a transactional engine, i.e. if a client begins on a real server, he must complete the transaction on that server or get an error (transactional contexts are stored locally).
Such a timeout can't help redirect the clients back to the real servers. You can check the free RAM or the CPU idle time on the real servers. That way you can correctly set the weights for the real servers and switch the weight for the Director.
These recommendations can be completely wrong. I've never tested them. If they don't help, try setting httpd.conf:MaxClients to some reasonable value. Why not put the Director in as a real server permanently? With 3 real servers it's better.
Jean
Those are already optimized; the bottleneck is when 1500 clients try our site in less than 5 minutes.
One of ours has suggested that the real servers check their own state (via TCP in use given by sockstat) and command the director to redirect traffic when needed.
Can you explain in more detail why the number of ActiveConn on a real server continues to grow while redirecting traffic locally with a weight of 10000 (while InActConn on that real server decreases normally)?
Julian: Only the new clients are redirected to the Director at this moment. Where do the active connections continue to grow, on the real servers or on the Director (weight=10000)?
testlvs (by Julian ja@ssi.bg) is available on Julian's software page.
It sends a stream of SYN packets (SYN flood) from a range of addresses (default starting at 10.0.0.1) simulating connect requests from many clients. Running testlvs from a client will occupy most of the resources of your director and the director's screen/mouse/keyboard will/may lock up for the period of the test. To run testlvs, I export the testlvs directory (from my director) to the realservers and the client and run everything off this exported directory.
The realserver is configured to reject packets with src_address 10.0.0.0/8.
Here's my modified version of Julian's show_traffic.sh , which is run on the realserver to measure throughput. Start this on the realserver before running testlvs on the client. For your interest you can look on the realserver terminal to see what's happening during a test.
#!/bin/sh
#show_traffic.sh
#by Julian Anastasov ja@ssi.bg
#modified by Joseph Mack jmack@wm7d.net
#
#run this on the realserver before starting testlvs on the client
#when finished, exit with ^C.
#
#suggested parameters for testlvs
#testlvs VIP:port -tcp -packets 20000
#where
# VIP:port - target IP:port for test
#
#packets are sent at about 10000 packets/sec on my
#100Mbps setup using 75 and 133MHz pentium classics.
#
#------------------------------------------
# setup a few things
to=10                   #sleep time
trap return INT         #trap ^C from the keyboard (used to exit the program)
iface="$1"              #NIC to listen on
#------------------------------------------
#user defined variables
#network has to be the network of the -srcnet IP
#that is used by the copy of testlvs being run on the client
#(default for testlvs is 10.0.0.0)
network="10.0.0.0"
netmask="255.0.0.0"
#-------------------------------------------
function get_packets() {
        cat /proc/net/dev | sed -n "s/.*${iface}:\(.*\)/\1/p" | \
        awk '{ packets += $2} ; END { print packets }'
}

function call_get_packets() {
        while true
        do
                sleep $to
                p1="`get_packets "$iface"`"
                echo "$((($p1-$p0)/$to)) packets/sec"
                p0="$p1"
        done
}
#-------------------------------------------
echo "Hit control-C to exit"

#initialise packets at $iface
p0="`get_packets "$iface"`"

#reject packets from $network
route add -net $network netmask $netmask reject

call_get_packets

#restore routing table on exit
route del -net $network netmask $netmask reject
#-------------------------------------------
I used LVS-NAT on a 2.4.2 director, with netpipe (port 5002) as the service on two realservers. You won't be using netpipe for this test, i.e. you won't need a netpipe server on the realserver. You just need a port that you can set up an LVS on, and netpipe is in my /etc/services, so the port shows up as a name rather than a number.
Here's my director
director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.6 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:netpipe rr
  -> bashfull.mack.net:netpipe    Masq    1      0          0
  -> sneezy.mack.net:netpipe      Masq    1      0          0
run testlvs (I used v0.1) on the client. Here testlvs is sending 256 packets from 254 addresses (the default) in the 10.0.0.0 network. (My setup handles 10,000 packets/sec. 256 packets appears to be instantaneous.)
client: #./testlvs 192.168.2.110:5002 -tcp -packets 256
when the run has finished, go to the director
director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.6 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:netpipe rr
  -> bashfull.mack.net:netpipe    Masq    1      0          127
  -> sneezy.mack.net:netpipe      Masq    1      0          127
(If you are running a 2.2.x director, you can get more information from ipchains -M -L -n, or netstat -M. For 2.4.x use cat /proc/net/ip_conntrack.)
This output shows 254 connections that have closed and are waiting to timeout. A minute or so later, the InActConn will have cleared (on my machine, it takes 50secs).
If you send the same number of packets (256), from 1000 different addresses, (or 1000 packets to 256 addresses), you'll get the same result in the output of ipvsadm (not shown)
client: #./testlvs 192.168.2.110:5002 -tcp -srcnum 1000 -packets 256
In all cases, you've made 254 connections.
If you send 1000 packets from 1000 addresses, you'd expect 1000 connections.
./testlvs 192.168.2.110:5002 -tcp -srcnum 1000 -packets 1000
Here's the total number of InActConn as a function of the number of packets (connection attempts). Results are for 3 consecutive runs, allowing the connections to timeout in between.
The numbers are not particularly consistent between runs (aren't computers deterministic?). Sometimes the blinking lights on the switch stopped during a test, possibly a result of the tcp race condition (see the performance page)
packets   InActConn
 1000     356,  368,  377
 2000     420,  391,  529
 4000     639,  770,  547
 8000     704,  903, 1000
16000    1000, 1000, 1000
You don't get 1000 InActConn with 1000 packets (connection attempts). We don't know why this is.
Julian: I'm not sure what's going on. In my tests there are dropped packets too. They are dropped before reaching the director, maybe from the input device queue or from the routing cache. We have to check it.
repeating the control experiment above, but using the drop_entry strategy (see the DoS strategies for more information).
director:/etc/lvs# echo "3" >/proc/sys/net/ipv4/vs/drop_entry
packets   InActConn, drop_entry=3
 1000     369, 368, 371
 2000     371, 380, 409
 4000     467, 578, 458
 8000     988, 725, 790
16000     999, 994, 990
The drop_entry strategy drops 1/32 of the entries every second, so the number of InActConn decreases linearly during the timeout period, rather than dropping suddenly at the end of the timeout period.
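The decay described above can be sketched numerically. This assumes (an assumption on my part) that 1/32 of the *remaining* entries is dropped each second, starting from a notional table of 1000 entries:

```shell
# sketch: expected connection-table decay under drop_entry,
# dropping 1/32 of the remaining entries every second
awk 'BEGIN {
    e = 1000
    for (t = 0; t <= 60; t += 10)
        printf "t=%2ds entries=%d\n", t, e * (31/32)^t
}'
```

After a minute roughly 85% of the entries are gone, rather than all of them vanishing at once when the timeout expires.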
repeating the control experiment above, but using the drop_packet strategy (see the DoS strategies for more information).
director:/etc/lvs# echo "3" >/proc/sys/net/ipv4/vs/drop_packet
packets InActConn, drop_packet=3 1000 338,339,336 2000 331,421,382 4000 554,684,629 8000 922,897,480,662 16000 978,998,996
The drop_packet=3 strategy will drop 1/10 of the packets before sending them to the realserver. The connections will all timeout at the same time (as for the control experiment, about 1min), unlike for the drop_entry strategy. With the variability of the InActConn number, it is hard to see the drop_packet defense working here.
repeating the control experiment above, but using the secure_tcp strategy (see the DoS strategies for more information). The SYN_RECV value is the suggested value for LVS-NAT.
director:/etc/lvs# echo "3" >/proc/sys/net/ipv4/vs/secure_tcp director:/etc/lvs# echo "10" >/proc/sys/net/ipv4/vs/timeout_synrecv
packets   InActConn, secure_tcp=3
 1000     338,  372, 359
 2000     405,  367, 362
 4000     628,  507, 584
 8000     642, 1000, 886
16000    1000, 1000, 1000
This strategy drops the InActConn from the ipvsadm table after 10secs.
If you want to get the maximum number of InActConn, you need to run the test for longer than the FIN timeout period (here 50secs). 2M packets is enough here. As well you want as many different addresses used as possible. Since testlvs is connecting from the 10.0.0.0/8 network, you could have 254^3=16M connections. Since only 2M packets can be passed before connections start to timeout and the director connection table reaches a steady state with new connections arriving and old connections timing out, there is no point in sending packets from more than 2M source addresses.
Note: you can view the contents of the connection table with

2.2: netstat -M (or ipchains -M -L -n)
2.4: cat /proc/net/ip_conntrack
Here's the InActConn with various defense strategies. The InActConn is the maximum number reachable; the -srcnum and -packets values are the numbers needed to saturate the director. The duration of the test must exceed the timeouts. InActConn was determined by running a command like this
client: #./testlvs 192.168.2.110:5002 -tcp -srcnum 1000000 -packets 2000000
and then adding the (two) entries in the InActConn column from the output of ipvsadm.
kernel       DoS strategy   InActConn    -srcnum     -packets (10k/sec)
SYN cookie
no           secure_tcp       13,400     200,000       200,000
             syn_recv=10
no           none             99,400     500,000     1,000,000
yes          none             70,400   1,000,000     2,000,000
edited from Julian: The memory used is 128 bytes/connection, so 60k connections will tie up about 7M of memory. LVS does not use system sockets; LVS has its own connection table. The limit is the amount of memory you have - virtually unlimited. The masq table (by default 40960 connections per protocol) is a separate table and is used only for LVS/NAT FTP or for normal MASQ connections.
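A quick sanity check of Julian's figure, using the 128 bytes/connection from above:

```shell
# 60,000 connections at 128 bytes each, in Mbytes
awk 'BEGIN { printf "%.1f Mbyte\n", 60000 * 128 / (1024 * 1024) }'
```

which agrees with the "7M of memory" quoted above.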
However the director was quite busy during the testlvs test. Attempts to connect to other LVS'ed services (not shown in the above ipvsadm table) failed. Netpipe tests run at the same time from the client's IP (in the 192.168.1.0/24 network) stopped, but resumed at the expected rate after the testlvs run completed (i.e. but before the InActConn count dropped to 0).
Matthijs van der Klip matthijs.van.der.klip@nos.nl
10 Nov 2001
used a fast (Origin 200) single client to generate between 3000 and 3500 hits/connections per second to his LVS'ed web cluster. No matter how many/few realservers were in the cluster, he could only get 65k connections.
Julian
You are missing one reason for this problem: the fact that your client(s) create connections from a limited number of addresses and ports. Try to answer for yourself from how many different client saddr/sport pairs you hit the LVS cluster. IMO, you have reached this limit. I'm not sure how many test client hosts you are using. If there is only one client host then there is a limit of 65536 TCP ports per src IP addr. Each connection has an expiration time according to its proto state. When the rate is high enough not to allow the old entries to expire, you reach a situation where the connections are reused, i.e. the connection number shown by ipvsadm -L does not increase.
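A rough sketch of why the count tops out: at the quoted connection rate, a single client IP runs through its entire 16-bit source-port space long before the shortest LVS timeouts (1-2 minutes, see above) let the old entries expire, so saddr/sport pairs must be reused:

```shell
# 65536 source ports per client IP, at ~3500 connections/sec
awk 'BEGIN { printf "port space wraps in %.0f seconds\n", 65536 / 3500 }'
```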
echo x > /proc/sys/net/ipv4/debug_level   (where 0 < x < 9)
Is there any way to debug/watch the path between the director and the realserver?
Wensong: below the entry
CONFIG_IP_MASQUERADE_VS_WLC=m
in /usr/src/linux/.config, add the line
CONFIG_IP_VS_DEBUG=y
This switch affects ip_vs.h and ip_vs.c. Run make clean in /usr/src/linux/net/ipv4 and rebuild the kernel and modules.
(Other switches you will find in the code are IP_VS_ERR, IP_VS_DBG, IP_VS_INFO.)
Look in syslog/messages for the output. The actual location of the output is determined by /etc/syslog.conf. For instance
kern.*          /usr/adm/kern

sends kernel messages to /usr/adm/kern (re-HUP syslogd if you change /etc/syslog.conf). Here's the output when the LVS is first set up with ipvsadm
$ tail /var/adm/kern Nov 13 17:26:52 grumpy kernel: IP_VS: RR scheduling module loaded.
( Note CONFIG_IP_VS_DEBUG is not a debug level output, so you don't need to add
*.=debug /usr/adm/debug
to your syslog.conf file )
Finally check whether packets are forwarded successfully through direct routing. (You can also use tcpdump to watch packets between machines.)
Ratzratz@tac.ch
In recent LVS versions, extensive debugging can be enabled to get more information about what's exactly going on, or to help you understand the process of packet handling within the director's kernel. Be sure to have compiled in debug support for LVS (CONFIG_IP_VS_DEBUG=y in .config).
You can enable debugging by setting:
echo $DEBUG_LEVEL > /proc/sys/net/ipv4/vs/debug_level

where DEBUG_LEVEL is between 0 and 10. Then do a tail -f /var/log/kernlog and watch the output fly by while connecting to the VIP from a CIP.
If you want to disable debug messages in kernlog do:
echo 0 > /proc/sys/net/ipv4/vs/debug_level

If you run tcpdump on the director and see a lot of packets with the same ISN and only SYN and RST, then either
- you haven't handled the arp problem (most likely)
- you're trying to connect directly to the VIP from within the cluster itself
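To watch for the SYN/RST pattern described above, something like this on the director may help. This is a sketch: eth0 and the flag filter are assumptions for a typical setup (byte 13 of the TCP header holds the flags; 0x02 is SYN, 0x04 is RST):

```shell
# show only TCP packets with the SYN or RST flag set
tcpdump -n -i eth0 'tcp[13] & 0x06 != 0'
```

Repeated SYNs carrying the same ISN, each answered by a RST, point at the arp problem.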
The HOWTO doesn't discuss securing your LVS (we can't do everything at once). However you need to handle it somehow.
Roberto Nibali ratz@tac.ch
03 May 2001
It doesn't matter whether you're running an e-gov site or your mom's homepage. You have to secure it anyway, because the webserver is not the only machine on the net: a breach of the webserver will lead to a breach of the other systems too.

The load balancer is basically only as secure as Linux itself. ipchains settings don't affect LVS functionality (unless by mistake you use the same mark for ipchains and ipvsadm). LVS itself has some built-in security, mainly to try to protect the realservers in case of a DoS attack. There are several parameters you might want to set in the proc-fs.
- /proc/sys/net/ipv4/vs/amemthresh
- /proc/sys/net/ipv4/vs/am_droprate
- /proc/sys/net/ipv4/vs/drop_entry
- /proc/sys/net/ipv4/vs/drop_packet
- /proc/sys/net/ipv4/vs/secure_tcp
- /proc/sys/net/ipv4/vs/debug_level
With debug_level you select the debug level (0: no debug output; >0: debug output in kernlog; the higher the number, the higher the verbosity).
The following are timeout settings. For more information see TCP/IP Illustrated Vol. I, R. Stevens.
- /proc/sys/net/ipv4/vs/timeout_close - CLOSE
- /proc/sys/net/ipv4/vs/timeout_closewait - CLOSE_WAIT
- /proc/sys/net/ipv4/vs/timeout_established - ESTABLISHED
- /proc/sys/net/ipv4/vs/timeout_finwait - FIN_WAIT
- /proc/sys/net/ipv4/vs/timeout_icmp - ICMP
- /proc/sys/net/ipv4/vs/timeout_lastack - LAST_ACK
- /proc/sys/net/ipv4/vs/timeout_listen - LISTEN
- /proc/sys/net/ipv4/vs/timeout_synack - SYN_ACK
- /proc/sys/net/ipv4/vs/timeout_synrecv - SYN_RECEIVED
- /proc/sys/net/ipv4/vs/timeout_synsent - SYN_SENT
- /proc/sys/net/ipv4/vs/timeout_timewait - TIME_WAIT
- /proc/sys/net/ipv4/vs/timeout_udp - UDP
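For example, to apply the values (in seconds) suggested in the secure_tcp discussion earlier:

```shell
# shorten the ESTABLISHED timeout to 2 minutes
echo 120 > /proc/sys/net/ipv4/vs/timeout_established
# shorten SYN_RECEIVED to 10 seconds (the suggested LVS-NAT value
# when the secure_tcp defense strategy is enabled)
echo 10 > /proc/sys/net/ipv4/vs/timeout_synrecv
```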
You don't want your director replying to pings.
(code for this was added sometime in 2000)
Eric Mehlhaff wrote: I was just updating ipchains rules and it struck me that I don't know what LVS does with the ICMP needs-fragmentation packets required for path MTU discovery to work. What does LVS do with such packets, when it's not immediately obvious which real server they are supposed to go to?
Wensong
Sorry that there is no LVS code to handle ICMP packets and send them to the corresponding real servers. But, I am thinking about adding some code to handle this.
(later, after the code had been added.)
joern maier 13 Dec 2000

What happens with ICMP messages destined for a realserver? Or more exactly, what happens if, for example, an ICMP host unreachable message is sent to the LVS because a client went down? Are the entries removed from the connection table?
Julian Anastasov ja@ssi.bg
Wed, 13 Dec 2000
No
Are the messages forwarded to the Realservers ?
Julian 13 Dec 2000
Yes, the embedded TCP or UDP datagram is inspected and this information is used to forward the ICMP message to the right real server. All other messages that are not related to existing connections are accepted locally.
Eric Mehlhaff mehlhaff@cryptic.com
passed on more info
Theoretically, path-MTU discovery happens on every new tcp connection. In most cases the default path MTU is fine. It's the weird cases (ethernet LAN connections combined with low-MTU WAN connections) that point out broken path-MTU discovery. E.g. for a while I had my home LAN (MTU 1500) hooked up via a modem connection for which I had set the MTU to 500. The minimum MTU in this case was the 500 at my home end, but there were many broken web sites I could not see because they had blocked the ICMP must-fragment packets on their servers. One can also see the effects of broken path-MTU discovery on FDDI local networks.
Anyway, here's some good web pages about it:
http://www.freelabs.com/~whitis/isp_mistakes.html http://www.worldgate.com/~marcs/mtu/
What happens if a realserver is connected to a client which is no longer reachable? ICMP replies go back to the VIP and will not necessarily be forwarded to the correct realserver.
Jivko Velev jiko@tremor.net
Assume that we have TCP connections and the real server is trying to respond to the client, but cannot reach it (the client is down, the route doesn't exist anymore, an intermediate gateway is congested). In these cases your VIP will receive ICMP dest unreachable, source quench and friends. If you don't route these packets to the correct realserver you will affect the performance of the LVS. For example, the realserver will continue to resend packets to the client because they are not acknowledged, and gateways will continue to send ICMP packets back to the VIP for every packet they dropped. The TCP stack will drop these kinds of connections after its timeouts expire, but if the director forwards the ICMP packets to the appropriate realserver, this will occur a little earlier, and will avoid overloading the director with ICMP stuff.
When you receive an ICMP packet it contains the full IP header of the packet that caused the ICMP to be generated + 64 bytes of its data, so you can assume that you have the TCP/UDP header too. So it is possible to implement "persistence rules" for ICMP packets.
Summary: This problem was handled in kernel 2.2.12 and earlier by having the configure script turn off icmp redirects in the kernel (through the proc interface). For 2.2.13 the ipvs patch handles this. The configure script knows which kernel you are using on the director and does the Right Thing (TM).
Joe: from a posting I picked off Dejanews by Barry Margolin
the criteria for sending a redirect are:
- The packet is being forwarded out the same physical interface that it was received from,
- The IP source address in the packet is on the same Logical IP (sub)network as the next-hop IP address,
- The packet does not contain an IP source route option.
Routers ignore redirects and shouldn't even be receiving them in the first place, because redirects should only be sent if the source address and the preferred router address are in the same subnet. If the traffic is going through an intermediary router, that shouldn't be the case. The only time a router should get redirects is if it's originating the connections (e.g. you do a "telnet" from the router's exec), but not when it's doing normal traffic forwarding.
unknown
Well, remember that ICMP redirects are just bandages to cover routing problems. No one really should be routing that way.
ICMP redirects are easily spoofed, so many systems ignore them. Otherwise they risk having their connectivity being disconnected on whim. Also, many systems no longer send ICMP redirects because some people actually want to pass traffic through an intervening system! I don't know how FreeBSD ships these days, but I suggest that it should ship with ignore ICMP redirects as the default.
and not for LVS-DR and LVS-Tun
Julian: 12 Jan 2001
Only for LVS-NAT do the packets from the real servers hit the forward chain, i.e. the outgoing packets. LVS-DR and LVS-Tun deliver packets only to LOCAL_IN, i.e. the FORWARD chain, where the redirect is sent, is skipped. The incoming packets for LVS/NAT use ip_route_input() for the forwarding, so they can hit the FORWARD chain too and generate ICMP redirects after the packet is translated. So the problem always exists for LVS/NAT, for packets in both directions, because after the packets are translated we always use ip_forward to send the packets to both ends.
I'm not sure, but maybe the old LVS versions used ip_route_input() to forward the DR traffic to the real servers. But this was not true for the TUN method. This call to ip_route_input() can generate ICMP redirects, and maybe you are right that for the old LVS versions this is a problem for DR. Looking in the Changelog it seems this change occurred in LVS version 0.9.4, near Linux 2.2.13. So, in the HOWTO there is something that is true: there is no ICMP redirect problem for LVS/DR starting from Linux 2.2.13 :) But the problem remains for LVS/NAT even in the latest kernel. This change in LVS was not made to solve the ICMP redirect problem, though. Yes, the problem is solved for DR, but the goal was to speed up forwarding for the DR method by skipping the forward chain. When the forward chain is skipped, the ICMP redirect is not sent.
ICMP redirects and LVS: (Joe and Wensong)
The test setups shown in this HOWTO for LVS-DR and LVS-Tun have the client, director and realservers on the same network. In production the client will connect via a router from a remote network (and for LVS-Tun the realservers could be remote and all on separate networks).
The client forwards the packet for the VIP to the director; the director receives the packet on eth0 (eth0:1 is an alias of eth0), then forwards the packet to the real server through eth0. The director will think that the packet came in and left through the same interface without any change, so an icmp redirect is sent to the client to notify it to send packets for the VIP directly to the RIP.
However, when all machines are on the same network, the client is not a router and is directly connected to the director, and ignores the icmp redirect message and the LVS works properly.
If there is a router between the client and the director, and it listens to icmp redirects, the director will send an icmp redirect to the router to make it send the packet for VIP to the real server directly, the router will handle this icmp redirect message and change its routing table, then the LVS/DR won't work.
The symptom is that once the load balancer sends an ICMP redirect to the router, the router will change its routing table entry for the VIP to the real server, and then the whole LVS won't work. Since you did your test in the same network (your LVS client is in the same network as the load balancer and the server, so it doesn't need to pass through a router to reach the LVS), you won't see such a symptom. :)
Only when LVS/DR is used and there is only one interface both to receive packets for the VIP and to connect to the real servers is there a need to suppress ICMP redirects on that interface.
Joe: ICMP redirects are turned on in the 2.2 kernel by default. The configure.pl script turns off icmp redirects on the director using sysctl
echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects
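If the director has several interfaces, the same can be done for all of them (a sketch; the all and default entries are part of the standard proc layout and cover interfaces brought up later):

```shell
# turn off ICMP redirects on every interface, present and future
for f in /proc/sys/net/ipv4/conf/*/send_redirects; do
    echo 0 > "$f"
done
```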
(Wensong) In the reverse direction, replies coming back from the realserver to the client
        |<------------------------|
        |                    real server
client <--> tunlhost1=======tunlhost2 --> director ------->|
After the first response packet arrives from the realserver at tunlhost2, tunlhost2 will try to send the packet through the tunnel. If the packet is too big, then tunlhost2 will send an ICMP packet to the VIP asking it to fragment the packet. In previous versions of ipvs, the director wouldn't forward this ICMP packet to (any) real server. With 2.2.13, code has been added to handle such icmp packets and make the director forward them to the corresponding realservers.
If a realserver goes down after the connection is established, will the client get a dest_unreachable from the director?
No. Here is a design issue. If the director sends an ICMP_DEST_UNREACH immediately, all transferred data for the established connection will be lost and the client needs to establish a new connection. Instead, we would rather wait for the timeout of the connection; if the real server recovers from being temporarily down (such as an overloaded state) before the connection expires, then the connection can continue. If the real server doesn't recover before the connection expires, then an ICMP_DEST_UNREACH is sent to the client.
If the client goes down after the connection is established, where do the dest_unreachable icmp packets generated by the last router go?
If the client is unreachable, some router will generate an ICMP_DEST_UNREACH packet and send it to the VIP, and the director will then forward the ICMP packet to the real server.
Since icmp packets are connectionless, are the icmp packets routed through the director independently of the services that are being LVS'ed? I.e. if the director is only forwarding port 80/tcp from the CIP to a particular RIP, does the LVS code which handles icmp forward all icmp packets from the CIP to that RIP? What if the client has a telnet session to one realserver and http to another realserver?
It doesn't matter, because the header of the original packet is encapsulated in the icmp packet. It is easy to identify which connection is the icmp packet for.
If the client has two connections to the LVS (say telnet and http) each to 2 different realservers and the client goes down, the director gets 2 ICMP_DEST_UNREACH packets. The director knows from the CIP:port which realserver to send the icmp packet to?
Wensong Zhang 21 Jan 2000
The director has handled ICMP packets for virtual services for a long time; please check the ChangeLog of the code.
ChangeLog for 0.9.3-2.2.13

The incoming ICMP packets for virtual services will be forwarded to the right real servers, and outgoing ICMP packets from virtual services will be altered and sent out correctly. This is important for error and control notification between clients and servers, such as MTU discovery.
JoeIf a realserver goes down after the connection is established, will the client get a dest_unreachable from the director?
No. Here is a design issue. If the director sends an ICMP_DEST_UNREACH immediately, all tranfered data for the established connection will be lost, the client needs to establish a new connection. Instead, we would rather wait for the timeout of connection, if the real server recovers from the temporary down (such as overloaded state) before the connection expires, then the connection can continue. If the real server doesn't recover before the expire, then an ICMP_DEST_UNREACH is sent to the client.
If the client goes down after the connection is established, where do the dest_unreachable icmp packets generated by the last router go?
If the client is unreachable, some router will generate an ICMP_DEST_UNREACH packet and send it to the VIP; the director will then forward the ICMP packet to the real server.
Since icmp packets are connectionless (like udp), are the icmp packets routed through the director independently of the services that are being LVS'ed? i.e. if the director is only forwarding port 80/tcp from the CIP to a particular RIP, does the LVS code which handles icmp forward all icmp packets from that CIP to that RIP? What if the client has a telnet session to one realserver and http to another realserver?
It doesn't matter, because the header of the original packet is encapsulated in the icmp packet. It is easy to identify which connection the icmp packet is for.
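The lookup described here can be sketched in user-space C: an ICMP error message embeds the original datagram's IP header plus at least the first 8 bytes of its payload, which for TCP/UDP includes both port numbers, so the <CIP, CPort, VIP, VPort> tuple of the affected connection can be recovered. This is an illustrative sketch, not the actual ipvs code; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Connection tuple recoverable from an ICMP error payload. */
struct tuple {
    uint32_t saddr, daddr;   /* CIP, VIP (network byte order) */
    uint16_t sport, dport;   /* CPort, VPort (host byte order here) */
    uint8_t  protocol;
};

/* Parse the embedded (original) IP packet found after the ICMP header.
 * Returns 0 on success, -1 if the payload is too short. */
static int parse_icmp_payload(const uint8_t *p, size_t len, struct tuple *t)
{
    if (len < 20)
        return -1;
    size_t ihl = (size_t)(p[0] & 0x0f) * 4;  /* embedded IP header length */
    if (len < ihl + 4)                       /* need both port fields */
        return -1;
    t->protocol = p[9];                      /* protocol field of inner IP hdr */
    memcpy(&t->saddr, p + 12, 4);
    memcpy(&t->daddr, p + 16, 4);
    t->sport = (uint16_t)((p[ihl] << 8) | p[ihl + 1]);
    t->dport = (uint16_t)((p[ihl + 2] << 8) | p[ihl + 3]);
    return 0;
}
```

With the tuple in hand, a director can look up the existing connection entry and forward the ICMP error to the realserver that owns it.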
(This problem pops up in the mailing list occasionally, e.g. Ted Pavlic on 2000-08-01.)
Jerry Glomph Black
The kernel debug log (dmesg) occasionally gets bursts of messages of the following form on the LVS box:
IP_MASQ:reverse ICMP: failed checksum from 199.108.9.188!
IP_MASQ:reverse ICMP: failed checksum from 199.108.9.188!
IP_MASQ:reverse ICMP: failed checksum from 199.108.9.188!
IP_MASQ:reverse ICMP: failed checksum from 199.108.9.188!
IP_MASQ:reverse ICMP: failed checksum from 199.108.9.188!

What is this, is it a serious problem, and how do I deal with it?
Joe
I looked in dejanews. No-one there knows either and people there are wondering if they are being attacked too. It appears in non-LVS situations, so it probably isn't an LVS problem. The posters don't know the identity of the sending node.
Wensong
I don't think it is a serious problem. If these messages are generated, the ICMP packets must have failed the checksum. Maybe the ICMP packets from 199.108.9.188 are malformed for some unknown reason.
Here are some other reports
Hendrik Thiel thiel@falkag.de
18 Jun 2001

I noticed this in dmesg and messages:

kernel: IP_MASQ:reverse ICMP:failed checksum from 213.xxx.xxx.xxx!
last message repeated 1522 times

Is this lvs specific (using nat)? Or can this be an attack?
Alois Treindl alois@astro.ch

I see those too:

Jun 17 22:16:19 wwc kernel: IP_MASQ:reverse ICMP: failed checksum from 193.203.8.8!

Not as many as you, but every few hours a bunch.
Juri Haberland juri@koschikode.com

From time to time I see them also on a firewall masquerading the company's net. I always assumed it was a corrupted ICMP packet... Who knows...
The client can be assigned to any realserver. One of the assumptions of LVS is that all realservers have the same content. This assumption is easy to fulfill for services like http, where the administrator updates the files on all realservers when needed. For services like mail or databases, the client writes to storage on one realserver. The other realservers do not see the updates unless something intervenes. Various tricks are described elsewhere in this HOWTO for mailservers and databases. These require the realservers to write to common storage (for mail, the mailspool is nfs mounted; for databases, the LVS client connects to a database client on each realserver and these database clients write to a single database daemon on a backend machine, or the database daemons on each realserver are capable of replication).
One solution is to have a file system which can propagate changes to other realservers. We have mentioned gfs and coda in several places in this HOWTO as holding out hope for the future. People now have these working.
Wensong Zhang wensong@gnuchina.org
05 May 2001
It seems to me that Coda is becoming quite stable. I have run coda-5.3.13 with the root volume replicated on two coda file servers for nearly two months, and so far I haven't hit a problem needing manual maintenance. BTW, I just use it for testing purposes; it is not running in a production site.
Mark Hlawatschek hlawatschek@atix.de
2001-05-04
We have had good experiences with GFS. We have been using LVS in conjunction with GFS for about a year (with older versions) and it has been quite stable. We successfully demonstrated the solution with a newer version of GFS (4.0) at CeBIT 2001. Several domains (e.g. http://www.atix.org) will be served by the new configuration next week.
Mark's slides from his talk (in German) at DECUS in Berlin (2001) are available.
Tao Zhao taozhao@cs.nyu.edu
11 Jul 2001
The source code of LVS adds ip_vs_in() to the netfilter hook NF_IP_LOCAL_IN to change the destination of packets. As I understand it, this hook is called AFTER the routing decision has been made. So how can it forward the packet to the newly assigned destination without routing?
Henrik Nordstrom hno@marasystems.com
Instead of rewriting the packet inside the normal packet flow of Linux-2.4, IPVS accepts the packet and constructs a new one, routes it and sends it out. This approach does not make much sense for LVS-NAT within the netfilter framework, but fits quite well for the other modes.
Julian
LVS does not follow the netfilter recommendations. What happens if we don't change the destination (e.g. the DR and TUN methods, which don't change the IP header)? When such a packet hits the routing code, the IP header fields are used for the routing decision. Netfilter can forward only by using NAT methods.
LVS tries not to waste CPU cycles in the routing cache. There is an output routing call involved, but there is an optimization you can find even in TCP: the destination cache. The output routing call is avoided in most cases. This model is near the one achieved in netfilter, i.e. calling the input routing function only once (2.2 calls it twice for DNAT). I'm now testing a patch for 2.2 (on top of LVS) that avoids the second input routing call and that can reroute the masqueraded traffic to the right gateway when many gateways are used, mostly when these gateways are on the same device. The tests will show how different the speed is between this patched LVS for 2.2 and the 2.4 one (one CPU of course).
We decided to use the LOCAL_IN hook for many reasons. Maybe you can find more info on the LVS integration into the netfilter framework by searching the LVS mailing list archive for "netfilter".
Julian 29 Oct 2001
IPVS uses only netfilter's hooks. It uses its own connection tracking and NAT. You can see how LVS fits into the framework on the mailing list archive.
Ratz
I see that the defense_level is triggered via a sysctl and invoked in the sltimer_handler, as is *_dropentry. If we pushed those functions one level higher and introduced a metalayer that registers the defense_strategy, which would be selectable via sysctl and would currently contain update_defense_level, we would have the possibility to register other defense strategies, e.g. a limiting threshold. Is this feasible? I mean, instead of calling update_defense_level() and ip_vs_random_dropentry() in the sltimer_handler we would just call the registered defense_strategy[sysctl_read] function. In the existing case defense_strategy[0]=update_defense_level(), which also merges ip_vs_dropentry. Do I make myself sound stupid? ;)
The different strategies work in different places and it is difficult to use one hook. The current implementation allows them to work together. But maybe there is another solution, considering how LVS is called: to drop packets or to drop entries. There are not many places for such hooks, so maybe something can be done. But first let's see what other kinds of defense strategies come up.
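Ratz's proposed metalayer can be sketched as a table of function pointers indexed by a sysctl-style integer: the timer handler dispatches through the table instead of hard-coding one strategy. This is a hypothetical user-space sketch of the idea, not the actual ipvs code; all names are invented.

```c
#include <assert.h>
#include <stddef.h>

/* Counters so the sketch can show which strategy ran. */
static int drop_entry_calls;
static int drop_packet_calls;

/* Stand-ins for update_defense_level()/ip_vs_random_dropentry() and a
 * hypothetical packet-dropping alternative. */
static void defense_drop_entry(void)  { drop_entry_calls++;  }
static void defense_drop_packet(void) { drop_packet_calls++; }

typedef void (*defense_strategy_fn)(void);

/* The registration table Ratz describes: slot 0 is the current
 * behaviour; more strategies could be registered later. */
static defense_strategy_fn defense_strategy[] = {
    defense_drop_entry,   /* 0: drop random connection entries */
    defense_drop_packet,  /* 1: drop incoming packets at a threshold */
};

/* What sltimer_handler would do: dispatch on the sysctl value. */
static int run_defense(unsigned int sysctl_value)
{
    size_t n = sizeof(defense_strategy) / sizeof(defense_strategy[0]);
    if (sysctl_value >= n)
        return -1;            /* unknown strategy selected */
    defense_strategy[sysctl_value]();
    return 0;
}
```

The design question Julian raises remains: strategies that hook different places (packet drop vs. entry drop) may not fit behind one dispatch point like this.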
Yes, the project has grown larger and gained more of a reputation than some of us initially expected. The code is very clear and stable; it's time to enhance it. The only very big problem that I see is that it looks like we're going to have two separate code paths: one patch for 2.2.x kernels and one for 2.4.x.
Yes, this is the reality. We can try to keep things looking the same to user space.
This would be a pain in the ass if we had two ipvsadm's. IMHO the userspace tools should recognize (at compile time) what kernel they are working with and enable the appropriate feature set. This will of course bloat them in future, the more feature differences we get between the 2.2.x and 2.4.x series.
Not possible, the sockopts are different in 2.4.
Could you point me to a sketch showing what the control path for a packet looks like in kernel 2.4? I mean something like I would use for 2.2.x kernels:
        ----------------------------------------------------------------
        |            ACCEPT/                              lo interface |
        v           REDIRECT                  _______                  |
--> C --> S --> ______ --> D --> ~~~~~~~~ -->|forward|----> _______ --> ACCEPT
    h     a    |input |    e    {Routing }   |Chain  |     |output |
    e     n    |Chain |    m    {Decision}   |_______| --->|Chain  |
    c     i    |______|    a     ~~~~~~~~        |         |_______|
    k     t       |        s        |            |             |
    s     y       |        q        v            v             v
    u     |       v        u   Local Process   DENY/         DENY/
    m     |     DENY/      e        |          REJECT        REJECT
    |     v    REJECT      r        |
    v    DENY              a        ----------------------------
   DENY                    d
                           e
Here is some info I maintain (it may not be current; the new ICMP hooks are missing). Look for "LVS" to see where LVS is placed.
Linux IP Virtual Server for Netfilter and Linux 2.4

The Netfilter hooks:

Priorities:
        NF_IP_PRI_FIRST = INT_MIN,
        NF_IP_PRI_CONNTRACK = -200,
        NF_IP_PRI_MANGLE = -150,
        NF_IP_PRI_NAT_DST = -100,
        NF_IP_PRI_FILTER = 0,
        NF_IP_PRI_NAT_SRC = 100,
        NF_IP_PRI_LAST = INT_MAX,

PRE_ROUTING (ip_input.c:ip_rcv):
        CONNTRACK=-200, ip_conntrack_core.c:ip_conntrack_in
        MANGLE=-150, iptable_mangle.c:ipt_hook
        NAT_DST=-100, ip_nat_standalone.c:ip_nat_fn
        FILTER=0, ip_fw_compat.c:fw_in, defrag, firewall, demasq, redirect
        FILTER+1=1, net/sched/sch_ingress.c:ing_hook

LOCAL_IN (ip_input.c:ip_local_deliver):
        FILTER=0, iptable_filter.c:ipt_hook
        LVS=100, ip_vs_in
        LAST-1, ip_fw_compat.c:fw_confirm
        CONNTRACK=LAST-1, ip_conntrack_standalone.c:ip_confirm

FORWARD (ip_forward.c:ip_forward):
        FILTER=0, iptable_filter.c:ipt_hook
        FILTER=0, ip_fw_compat.c:fw_in, firewall, LVS:check_for_ip_vs_out,
                masquerade
        LVS=100, ip_vs_out

LOCAL_OUT (ip_output.c):
        CONNTRACK=-200, ip_conntrack_standalone.c:ip_conntrack_local
        MANGLE=-150, iptable_mangle.c:ipt_local_out_hook
        NAT_DST=-100, ip_nat_standalone.c:ip_nat_local_fn
        FILTER=0, iptable_filter.c:ipt_local_out_hook

POST_ROUTING (ip_output.c:ip_finish_output):
        FILTER=0, ip_fw_compat.c:fw_in, firewall, unredirect,
                mangle ICMP replies
        LVS=NAT_SRC-1, ip_vs_post_routing
        NAT_SRC=100, ip_nat_standalone.c:ip_nat_out
        CONNTRACK=LAST, ip_conntrack_standalone.c:ip_refrag

CONNTRACK: PRE_ROUTING, LOCAL_IN, LOCAL_OUT, POST_ROUTING
FILTER: LOCAL_IN, FORWARD, LOCAL_OUT
MANGLE: PRE_ROUTING, LOCAL_OUT
NAT: PRE_ROUTING, LOCAL_OUT, POST_ROUTING

Running variants:
1. Only lvs - the fastest
2. lvs + ipfw NAT
3. lvs + iptables NAT

Where is LVS placed:
        LOCAL_IN:100 ip_vs_in
        FORWARD:100 ip_vs_out
        POST_ROUTING:NF_IP_PRI_NAT_SRC-1 ip_vs_post_routing

The chains:

The out->in LVS packets (for any forwarding method) walk:
        pre_routing -> LOCAL_IN -> ip_route_output or dst cache -> POST_ROUTING
        LOCAL_IN: ip_vs_in -> ip_route_output/dst cache ->
                set skb->nfmark with special value -> ip_send -> POST_ROUTING
        POST_ROUTING: ip_vs_post_routing - check skb->nfmark and exit from
                the chain

The in->out LVS packets (for LVS/NAT) walk:
        pre_routing -> FORWARD -> POST_ROUTING
        FORWARD: ip_vs_out -> NAT -> NF_ACCEPT
        POST_ROUTING: ip_vs_post_routing - check skb->nfmark and exit from
                the chain
There may be a nice ascii diagram in the netfilter docs, but I hope the above info is more useful if you already know what each hook means.
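The key mechanism in the table above is that netfilter traverses each chain's hooks in ascending priority order, which is why ip_vs_in at priority 100 runs after iptable_filter at 0 in LOCAL_IN but before ip_confirm near LAST. A small user-space model (hypothetical names, not kernel code) of that ordering, assuming only that hooks on a chain are sorted by their priority:

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* A registered hook: its name and its NF_IP_PRI_* style priority. */
struct hook {
    const char *name;
    int priority;
};

/* Ascending comparison on priority, as netfilter orders a chain. */
static int by_priority(const void *a, const void *b)
{
    const struct hook *x = a, *y = b;
    return (x->priority > y->priority) - (x->priority < y->priority);
}

/* Sort a chain's hooks into traversal order. */
static void order_chain(struct hook *hooks, size_t n)
{
    qsort(hooks, n, sizeof *hooks, by_priority);
}
```

Registering at priority 100 is thus how LVS guarantees it sees LOCAL_IN packets only after the filter tables have had their say.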
The biggest problem I see here is that maybe the user space daemons don't get enough scheduling time to be accurate enough.
That is definitely true. When the CPU(s) are busy transferring packets the processes can be delayed. So the director had better not spend many cycles in user space. This is the reason I prefer all these health checks to run on the realservers, but this is not always good/possible.
No, considering the fact that not all RS are running Linux. We would need to port the healthchecks to every possible RS architecture.
Yes, this is a drawback.
Tell me, which scheduler should I take? None of the existing ones currently gives me good enough results with persistence. We have to accept that 3-tier application programmers don't know about load balancing or clustering; they mostly use Java, and that is just about the end of trying to load balance the application smoothly.
WRR + load-informed cluster software. But I'm not sure about the case when persistence is on (it can do bad things).
I currently get some values via a daemon coded in Perl on the realservers, started via xinetd. The director connects to the healthcheck port and gets some prepared results. It then puts this data into a db and starts calculating the next steps to reconfigure the LVS cluster to smooth out the imbalance. The longer you let it run, the more data you get and the fewer adjustments you have to make. I reckon some guy who showed up on this list once had this idea, in the direction of fuzzy logic. Hey Julian, maybe we should accept the fact that the wlc scheduler also isn't a very advanced one:

Julian
Not sure :) I don't have results from experiments with wlc :) You can put it in /proc and make different experiments, for example :) But warning: ip_vs_wlc can be a module; check how the lblc* schedulers register /proc vars.

loh = atomic_read(&least->activeconns)*50+atomic_read(&least->inactconns);

Ratz
What would you think would change if we made this 50 dynamic?
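The quoted line is wlc's cost function: active connections are weighted 50 times more heavily than inactive ones, and the realserver with the lowest result is chosen. A user-space sketch with the constant turned into a variable (the change being asked about) might look like this; the names are illustrative, not the ipvs source:

```c
#include <assert.h>

/* The "50" under discussion: in ipvs it is a compile-time constant;
 * making it dynamic would mean exporting it, e.g. via /proc. */
static int active_weight = 50;

/* wlc-style overhead for one realserver. */
static unsigned long overhead(unsigned int activeconns,
                              unsigned int inactconns)
{
    return (unsigned long)activeconns * (unsigned long)active_weight
           + inactconns;
}

/* Choose between two realservers: return 0 for the first, 1 for the
 * second, picking the lower overhead (ties go to the first). */
static int pick(unsigned int a_act, unsigned int a_inact,
                unsigned int b_act, unsigned int b_inact)
{
    return overhead(a_act, a_inact) <= overhead(b_act, b_inact) ? 0 : 1;
}
```

Making active_weight dynamic would let an operator tune how strongly active connections dominate the choice, which matters when connection durations vary widely.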
Jul 2001
Linux IP Virtual Server for Netfilter and Linux 2.4

Running variants:
1. Only lvs - the fastest
2. lvs + ipfw NAT
3. lvs + iptables NAT

Where is LVS placed:
        LOCAL_IN:100 ip_vs_in
        FORWARD:99 ip_vs_forward_icmp
        FORWARD:100 ip_vs_out
        POST_ROUTING:NF_IP_PRI_NAT_SRC-1 ip_vs_post_routing

The chains:

The out->in LVS packets (for any forwarding method) walk:
        pre_routing -> LOCAL_IN -> ip_route_output or dst cache -> POST_ROUTING
        LOCAL_IN: ip_vs_in -> ip_route_output/dst cache ->
                mark skb->nfcache with special bit value -> ip_send ->
                POST_ROUTING
        POST_ROUTING: ip_vs_post_routing - check skb->nfcache and exit from
                the chain if our bit is set

The in->out LVS packets (for LVS/NAT) walk:
        pre_routing -> FORWARD -> POST_ROUTING
        FORWARD (check for related ICMP): ip_vs_forward_icmp ->
                local delivery -> mark skb->nfcache -> POST_ROUTING
        FORWARD: ip_vs_out -> NAT -> mark skb->nfcache -> NF_ACCEPT
        POST_ROUTING: ip_vs_post_routing - check skb->nfcache and exit from
                the chain if our bit is set
Why LVS is placed there:
Sorry, we can't waste time here. The netfilter connection tracking can mangle packets here, and at this point we don't know whether a packet is for one of our virtual services (a new connection) or for an existing connection (which needs a lookup in the LVS connection table). We are sure that we can't make decisions about creating new connections at this place, but lookup of existing connections is possible under some conditions: the packets must be defragmented, etc.
LVS works with defragmented packets only
There are so many nice modules in this chain that can feed LVS with packets (probably modified)
ip_local_deliver() defragments the packets for us
We detect here packets for the virtual services or packets for existing connections. In either case we send them to POST_ROUTING via ip_send and return NF_STOLEN. This means we remove the packet from the LOCAL_IN chain before it reaches priority LAST-1. The LocalNode feature just returns NF_ACCEPT without mangling the packet.
In this chain, if a packet is for an LVS connection (even a newly created one), LVS calls ip_route_output (or uses a destination cache), marks the packet as LVS property (sets a bit in skb->nfcache) and calls ip_send() to jump to the POST_ROUTING chain. There our ip_vs_post_routing hook must call the okfn for packets with our special nfcache bit value (is skb->nfcache used after the routing calls? We rely on the fact that it is not) and return NF_STOLEN.
One side effect: LVS can forward packets even when ip_forward=0, but only for the DR and TUN methods. For these methods the TTL is not even decremented, nor is the data checksum checked.
LVS checks first for ICMP packets related to TCP or UDP connections. Such packets are handled as if they had been received in the LOCAL_IN chain - they are locally delivered. Used for transparent proxy setups.
LVS looks in this chain for in->out packets, but only for the LVS/NAT method. In any case new connections are not created here; the lookup is for existing connections only.
In this chain the ip_vs_out function can be called from many places:
FORWARD:0 - the ipfw compat mode calls ip_vs_out between the forward firewall and the masquerading. This way LVS can grab the outgoing packets for its connections and prevent them from being handled by netfilter's NAT code.
FORWARD:100 - ip_vs_out is registered after FILTER=0. We can come here twice if the ipfw compat module is used, because ip_vs_out is called once from FORWARD:0 (fw_in) and then from pri=100, where LVS always registers the ip_vs_out function. We detect this second call by looking at the skb->nfcache bit value; if the bit is set we return NF_ACCEPT. In fact, the second ip_vs_out call is avoided if the first returns NF_STOLEN after calling the okfn function.
LVS marks the packets for debugging and they appear to come from LOCAL_OUT, but this chain is not traversed. The only thing LVS requires from the POST_ROUTING chain is the fragmentation code. But even the ICMP messages are generated and mangled ready for sending long before the POST_ROUTING chain: ip_send() does not call ip_fragment() for LVS packets because LVS returns ICMP_FRAG_NEEDED when the mtu is too short.
LVS makes MTU checks when accepting packets and selecting the output device, so the ip_refrag POST_ROUTING hook is not used by LVS.
The result is: LVS must hook POST_ROUTING first (maybe only after the ipfw compat filter) and return NF_STOLEN for its packets (detected by checking the special skb->nfcache bit value).
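The nfcache trick summarized above can be modeled in user-space C: a bit set on the packet in LOCAL_IN identifies it as LVS property, and the early POST_ROUTING hook steals marked packets before the NAT code at priority NAT_SRC can touch them. The bit value, verdict names, and struct here are illustrative stand-ins, not the exact kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reserved bit marking a packet as LVS property. */
#define NFC_IPVS_PROPERTY 0x10000u

enum verdict { MODEL_NF_ACCEPT, MODEL_NF_STOLEN };

/* Minimal stand-in for the sk_buff fields the trick relies on. */
struct fake_skb {
    uint32_t nfcache;
};

/* LOCAL_IN (ip_vs_in): claim the packet for LVS before ip_send(). */
static void ip_vs_mark(struct fake_skb *skb)
{
    skb->nfcache |= NFC_IPVS_PROPERTY;
}

/* POST_ROUTING (ip_vs_post_routing, registered at NAT_SRC-1):
 * steal marked packets so the NAT hook at NAT_SRC never sees them;
 * pass everything else down the chain unchanged. */
static enum verdict ip_vs_post_routing_model(const struct fake_skb *skb)
{
    if (skb->nfcache & NFC_IPVS_PROPERTY)
        return MODEL_NF_STOLEN;   /* exit the chain early */
    return MODEL_NF_ACCEPT;
}
```

Registering just below NAT_SRC is what makes the steal possible: any lower priority and the NAT hook would rewrite the LVS packet first.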