LVS-NAT is based on Cisco's LocalDirector.
This method was used for the first LVS. If you want to set up a test LVS, this requires no modification of the realservers and is still probably the simplest setup.
With LVS-NAT, the incoming packets are rewritten by the director to have the destination address of one of the realservers and are then forwarded to that realserver. The replies from the realserver are sent to the director, where they are rewritten to have the source address of the VIP.
Unlike the other two methods of forwarding used in an LVS (LVS-DR and LVS-Tun), the realserver only needs a functioning TCP/IP stack (eg a networked printer). I.e. the realserver can have any operating system and no modifications are made to the configuration of the realservers (except setting their route tables).
Here the client is on the same network as the VIP (in a production LVS, the client will be coming in from an external network via a router). The director can have 1 or 2 NICs (two NICs will allow higher throughput of packets, since the traffic on the realserver network will be separated from the traffic on the client network).
Machine                       IP
client                        CIP=192.168.1.254
director VIP                  VIP=192.168.1.110 (the IP for the LVS)
director internal interface   DIP=10.1.1.1
realserver1                   RIP1=10.1.1.2
realserver2                   RIP2=10.1.1.3
realserver3                   RIP3=10.1.1.4
.
.
realserverN                   RIPn=10.1.1.n+1
dip                           DIP=10.1.1.9 (director interface on the LVS-NAT network)
                        ________
                       |        |
                       | client |
                       |________|
                       CIP=192.168.1.254
                           |
                        (router)
                           |
                        __________
                       |          |
                       |          | VIP=192.168.1.110 (eth0:110)
                       | director |
                       |__________|
                       DIP=10.1.1.9 (eth0:9)
                           |
         -----------------------------------
         |                 |                 |
         |                 |                 |
  RIP1=10.1.1.2     RIP2=10.1.1.3     RIP3=10.1.1.4 (all eth0)
   _____________     _____________     _____________
  |             |   |             |   |             |
  | realserver  |   | realserver  |   | realserver  |
  |_____________|   |_____________|   |_____________|
Here's the lvs_nat.conf file for this setup.
LVS_TYPE=VS_NAT
INITIAL_STATE=on
VIP=eth0:110 lvs 255.255.255.0 192.168.1.255
DIP=eth0 dip 192.168.1.0 255.255.255.0 192.168.1.255
DIRECTOR_DEFAULT_GW=client
SERVICE=t telnet rr realserver1:telnet realserver2:telnet realserver3:telnet
SERVER_NET_DEVICE=eth0
SERVER_DEFAULT_GW=dip
#----------end lvs_nat.conf------------------------------------
The VIP is the only IP known to the client. The RIPs here are on a different network to the VIP (although with only 1 NIC on the director, the VIP and the RIPs are on the same wire).
In normal NAT, masquerading is the rewriting of packets originating behind the NAT box. With LVS-NAT, the incoming packet (src=CIP,dst=VIP, abbreviated to CIP->VIP) is rewritten by the director (becoming CIP->RIP). The action of the LVS director is called demasquerading. The demasqueraded packet is forwarded to the realserver. The reply packet (RIP->CIP) is generated by the realserver.
For LVS-NAT to work, the director must be the default gw for the realservers, i.e. the reply packets must return through the director. Getting this wrong is the single most common cause of problems when setting up an LVS-NAT LVS.
For a 2 NIC director with different networks for the realservers and the clients, it is enough for the default gw of the realservers to be the director. For a 1 NIC, two network setup, the realservers must, in addition, only have routes to the director. For a 1 NIC, 1 network setup, ICMP redirects must also be turned off on the director (the configure script does this for you).
In a normal server farm, the default gw of the realserver would be the router to the internet and the packet RIP->CIP would be sent directly to the client. In a LVS-NAT LVS, the default gw of the realservers must be the director. The director masquerades the packet from the realserver (rewrites it to VIP->CIP) and the client receives a rewritten packet with the expected source IP of the VIP.
Note: the packet must be routed via the director; there must be no other path to the client. A packet arriving at the client directly from the realserver will not be seen as a reply to the client's request and the connection will hang. If the director is not the default gw for the realservers, and you use tcpdump on the director to watch an attempt to telnet from the client to the VIP (run tcpdump with `tcpdump port telnet`), you will see the request packet (CIP->VIP), the rewritten packet (CIP->RIP) and the reply packet (RIP->CIP). You will not see the rewritten reply packet (VIP->CIP). (Remember that if you have a switch on the realserver's network, rather than a hub, each node only sees the packets to/from it; tcpdump won't see packets between other nodes on the same network.)
Part of the setup of LVS-NAT, then, is to make sure that the reply packet goes via the director, where it will be rewritten to have the addresses (VIP->CIP). In some cases (e.g. 1 net LVS-NAT) icmp redirects have to be turned off on the director so that the realserver doesn't get a redirect telling it to forward packets directly to the client.
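As a rough sketch (using the addresses from the diagram above; the configure script normally does this for you), the realserver side of an LVS-NAT setup comes down to:

#on each realserver: the director's DIP is the default (and only) route out
realserver# route add default gw 10.1.1.9
#check that no other route leads back towards the client network
realserver# netstat -rn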
In a production system, a router would prevent a machine on the outside exchanging packets with machines on the RIP network. As well, the realservers will be on a private network (eg 192.168.x.x/24) and will not be routable.
In a test setup (no router), these safeguards don't exist. All machines (client, director, realservers) are on the same piece of wire, and if routing information is added to the hosts, the client can connect to the realservers independently of the LVS. This will stop LVS-NAT from working (your connection will hang), or it may only appear to work (you'll be connecting directly to the realserver).
In a test setup, traceroute from the realserver to the client should go through the director (2 hops). The configure script will test that the director's gw is 2 hops from the realserver and that the route to the director's gw is via the director, hopefully to prevent this type of error.
In production you should _not_ be able to ping from the realservers to the client. The realservers should not know about any other network than their own (here 10.1.1.0). The connection from the realservers to the client is through ipchains (for 2.2.x kernels) and LVS-NAT tables setup by the director.
In my first attempt at LVS-NAT setup, I had all machines on a 192.168.1.0 network and added a 10.1.1.0 private network for the realservers/director, without removing the 192.168.1.0 network on the realservers. All replies from the servers were routed onto the 192.168.1.0 network rather than back through LVS-NAT and the client didn't get any packets back.
The LVS-NAT setup can have a separate NIC for the DIP and the VIP putting the realserver network and the LAN for the VIP on different wires (the director could be a firewall for the realservers). This should prevent realservers routing packets directly to the client (at least it has for me).
Here's the general setup I use for testing. The client (192.168.2.254) connects to the VIP on the director. (The VIP on the realserver is present only for LVS-DR and LVS-Tun.) For LVS-DR, the default gw for the realservers is 192.168.1.254. For LVS-NAT, the default gw for the realservers is 192.168.1.9.
        ____________
       |            |192.168.1.254 (eth1)
       |   client   |----------------------
       |____________|                      |
     CIP=192.168.2.254 (eth0)              |
             |                             |
             |                             |
     VIP=192.168.2.110 (eth0)              |
        ____________                       |
       |            |                      |
       |  director  |                      |
       |____________|                      |
     DIP=192.168.1.9 (eth1, arps)          |
             |                             |
          (switch)-------------------------
             |
     RIP=192.168.1.2 (eth0)
     VIP=192.168.2.110 (for LVS-DR, lo:0, no_arp)
        _____________
       |             |
       |  realserver |
       |_____________|
This setup works for both LVS-NAT and LVS-DR.
Here's the routing table for one of the realservers as in the LVS-NAT setup.
bashfull:# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     0.0.0.0         255.255.255.0   U        40 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U        40 0          0 lo
0.0.0.0         192.168.1.9     0.0.0.0         UG       40 0          0 eth0
Here's a traceroute from the realserver to the client showing 2 hops.
traceroute to client2.mack.net (192.168.2.254), 30 hops max, 40 byte packets
 1  director.mack.net (192.168.1.9)  1.089 ms  1.046 ms  0.799 ms
 2  client2.mack.net (192.168.2.254)  1.019 ms  1.16 ms  1.135 ms
icmp redirects are on at the director, but the director doesn't issue a redirect (see icmp_redirects ) because the packet RIP->CIP from the realserver emerges from a different NIC on the director than it arrived on (and with different source IP). The client machine doesn't send a redirect since it is not forwarding packets, it's the endpoint of the connection.
Use lvs_nat.conf as a template (sample here will setup LVS-NAT in the diagram above assuming the realservers are already on the network and using the DIP as the default gw).
#--------------lvs_nat.conf----------------------
LVS_TYPE=VS_NAT
INITIAL_STATE=on
#director setup:
VIP=eth0:110 192.168.1.110 255.255.255.0 192.168.1.255
DIP=eth0:10 10.1.1.10 10.1.1.0 255.255.255.0 10.1.1.255
#Services on realservers:
#telnet to 10.1.1.2
SERVICE=t telnet wlc 10.1.1.2:telnet
#http to 10.1.1.2 (with weight 2) and to high port on 10.1.1.3
SERVICE=t 80 wlc 10.1.1.2:http,2 10.1.1.3:8080 10.1.1.4
#realserver setup (nothing to be done for LVS-NAT)
#----------end lvs_nat.conf------------------------------------
The output is a commented rc.lvs_nat file. Run the rc.lvs_nat file on the director and then the realservers (the script knows whether it is running on a director or realserver).
The configure script will set up masquerading and forwarding on the director, and the default gw for the realservers.
The packets coming in from the client are being demasqueraded by the director.
In 2.2.x you need to masquerade the replies. Here's the masquerading code in rc.lvs_nat, that runs on the director (produced by configure.pl).
echo "turning on masquerading " #setup masquerading echo "1" >/proc/sys/net/ipv4/ip_forward echo "installing ipchain rules" /sbin/ipchains -A forward -j MASQ -s 10.1.1.2 http -d 0.0.0.0/0 #repeated for each realserver and service .. .. echo "ipchain rules " /sbin/ipchains -L
In this example, the http replies are being masqueraded by the director, allowing the realserver to answer the client requests that were demasqueraded by the director as part of the 2.2.x LVS code.
In 2.4.x, masquerading of LVS'ed services is done explicitly by the LVS code and no extra masquerading commands (e.g. iptables) need be run.
You may want to allow clients on the realservers to connect to servers on the internet. Setting this up is independent of the LVS, and the connections from clients on the realservers are unrelated to the functioning of the LVS.
Example: a client making a telnet request from a realserver will be doing so from a high (>1024) port to 0.0.0.0:23. If the LVS is also forwarding requests for the same service, those LVS connections are instead between 0.0.0.0:high_port and RIP:23 (LVS-NAT) or VIP:23 (LVS-DR).
In a normal LVS, connection requests from clients on the realserver will originate at the RIP and be sent through the realserver's default gw (the director in the case of LVS-NAT, a router for LVS-DR) without any masquerading. Since the realservers will usually have private IPs, the packets for the connection requests will not be routable. Instead you will need to NAT out the client requests using the director as the NAT box (and default gw for the client's requests). For LVS-NAT, the director is already the default gw. For LVS-DR, you have to route the packets from clients on the realservers through the director, while packets associated with the LVS, i.e. from the VIP on the realserver, are routed through a router (see the sketch below).
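For the LVS-DR case, one way to separate the two streams is source-based routing with iproute2 on the realserver. This is a hedged sketch (a kernel with iproute2, addresses from the test diagram above), not a tested recipe:

#packets sourced from the RIP (local clients) go out via the director, which masquerades them;
#packets sourced from the VIP keep the normal default route via the router
realserver# ip rule add from 192.168.1.2 table 100
realserver# ip route add default via 192.168.1.9 table 100
realserver# ip route flush cache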
If you want telnet clients on the realservers to be masqueraded by the director, to the outside world, then on the director you need to run a command like
director:/etc/lvs# /sbin/ipchains -p tcp -A forward -j MASQ -s $RIP -d 0.0.0.0 telnet
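On a 2.4.x director the corresponding rule uses netfilter NAT instead of ipchains (a sketch; adjust the protocol/port to whatever service you want to let out):

director:/etc/lvs# iptables -t nat -A POSTROUTING -p tcp -s $RIP --dport 23 -j MASQUERADE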
Telnet clients on the realservers will now be masqueraded out by the director, independently of the LVS setup. Masqueraded connections from the director will come from the primary IP on the outside of the director. If the VIP is an alias on the outside of the director (the usual situation), then the masqueraded connection will not come from the VIP.
Here are the IP:port, seen by `netstat -an` on each machine, for two cases
                                        client              director    realserver
connection from client to LVS           CIP:1041->VIP:23       -        CIP:1041->RIP:23
telnet connection from realserver
  to telnetd on LVS client              CIP:23<-DIP:61000      -        CIP:23<-RIP:1030
The masqueraded connection to the LVS client comes from the primary IP of the director (here the DIP) and not from the VIP, which in this setup is an alias (secondary IP) of the DIP.
The director doesn't have connections to any of its ports for either connection. In the case of LVS, the director is just forwarding packets. In the case of masquerading, the masqueraded ports can be seen on the director with
director:/etc/lvs# ipchains -M -L -n
IP masquerading entries
prot expire   source               destination          ports
TCP  14:53.91 RIP                  CIP                  1030 (61000) -> 23
Connections originating on a machine use ports starting at high_port=1024. The masqueraded ports start at port=61000 (not 1024). The port number increments for each new connection in both cases. When a machine is both connecting to the outside world (using ports starting at 1024) and masquerading connections from other machines (using ports starting at 61000), there is no port collision detection. This can be a problem if the machine is masquerading a large number of connections and the port range has been increased.
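If you want to see or change where locally originated connections start, the /proc knob below exists on 2.2 and 2.4 kernels (a sketch; the masquerade range itself, starting at 61000, is a compile-time constant in the 2.2 sources and can't be changed here):

#ports used for connections originating on this machine (roughly 1024-4999 by default)
director:/etc/lvs# cat /proc/sys/net/ipv4/ip_local_port_range
#widen the range if the box runs out of local ports
director:/etc/lvs# echo "1024 32000" > /proc/sys/net/ipv4/ip_local_port_range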
With LVS-NAT, the ports can be re-mapped. A request to port 80 on the director can be sent to port 8000 on a realserver. This is possible because the source and destination of the packets are already being rewritten and no extra overhead is required to rewrite the port numbers. The rewriting is slow (60usec/packet on a pentium classic) and limits the throughput of LVS-NAT (for 536byte packets, this is 72Mbit/sec or about 100BaseT). While LVS-NAT throughput does not scale well with the number of realservers, the advantage of LVS-NAT is that realservers can have any OS, no modifications are needed to the realserver to run it in an LVS, and the realserver can have services not found on Linux boxes.
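As a sketch of such a re-mapping with ipvsadm (addresses taken from the examples in this section; the weight is illustrative):

#requests to VIP:80 are sent to port 8000 on the realserver (LVS-NAT, round robin)
director:/etc/lvs# /sbin/ipvsadm -A -t 192.168.1.110:80 -s rr
director:/etc/lvs# /sbin/ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.2:8000 -m -w 1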
For the earlier versions of LVS-NAT (with 2.0.36 kernels) the timeouts were set in linux/include/net/ip_masq.h; the default masquerading timeouts are:
#define MASQUERADE_EXPIRE_TCP      15*16*Hz
#define MASQUERADE_EXPIRE_TCP_FIN   2*16*Hz
#define MASQUERADE_EXPIRE_UDP       5*16*Hz
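On 2.2.x, the corresponding masquerading timeouts can be changed at run time with ipchains rather than a recompile (a sketch; the three values are the tcp, tcpfin and udp timeouts in seconds):

#set tcp, tcpfin and udp masquerading timeouts
director:/etc/lvs# ipchains -M -S 7200 10 160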
Julian has his latest fool-proof setup doc at http://www.linuxvirtualserver.org/~julian/L4-NAT-HOWTO.txt. I will try to keep the copy here as up-to-date as possible.
Q.1 Can the real server ping the client?

    rs# ping -n client

    A.1 Yes => good
    A.2 No  => bad

    Some settings for the director:

    Linux 2.2/2.4: ipchains -A forward -s RIP -j MASQ
    Linux 2.4:     iptables -t nat -A POSTROUTING -s RIP -j MASQUERADE

Q.2 Does a traceroute to the client go through the LVS box and reach the client?

    traceroute -n -s RIP CLIENT_IP

    A.1 Yes => good
    A.2 No  => bad, same ipchains command as in Q.1

    For client and server on the same physical media use these on the director:

    echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
    echo 0 > /proc/sys/net/ipv4/conf/<DEV>/send_redirects

Q.3 Is the traffic forwarded from the LVS box, in both directions?

    For all interfaces on the director:

    tcpdump -ln host CLIENT_IP

    The right sequence, i.e. the IP addresses and ports at each step
    (the reversed sequence for the in->out direction is not shown):

    CLIENT
       |
       CIP:CPORT -> VIP:VPORT
       |
       ||
       \/
    out|  CIP:CPORT -> VIP:VPORT
       ||                           LVS box
       \/
     in|  CIP:CPORT -> RIP:RPORT
       |
       ||
       \/
       CIP:CPORT -> RIP:RPORT
       +
    REAL SERVER

    A.1 Yes, in both directions => good (for Layer 4, probably not for L7)
    A.2 The packets from the real server are dropped => bad:
        - rp_filter protection on the incoming interface, probably hit from a local client
        - firewall rules drop the replies
    A.3 The packets from the real servers leave the director unchanged
        - missing -j MASQ ipchains rule in the LVS box
        For client and server on the same physical media:
        the packets simply do not reach the director. The real server is
        ICMP redirected to the client. On the director:
        echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
        echo 0 > /proc/sys/net/ipv4/conf/<DEV>/send_redirects
    A.4 All packets from the client are dropped
        - the requests are received on the wrong interface with rp_filter protection
        - firewall rules drop the requests
    A.5 The client connections are refused or are served from a service in the LVS box
        - client and LVS are on the same host => not valid
        - the packets are not marked by the firewall and don't hit a
          firewall mark based virtual service

Q.4 Is the traffic replied from the real server?

    For the outgoing interface on the real server:

    tcpdump -ln host CLIENT_IP

    A.1 Yes, SYN+ACK => good
    A.2 TCP RST => bad, no listening real service
    A.3 ICMP message => bad, blocked by firewall / no listening service
    A.4 The same request packet leaves the real server => missing accept rules
        or RIP is not defined
    A.5 No reply => real server problem:
        - the rp_filter protection drops the packets
        - the firewall drops the request packets
        - the firewall drops the replies
    A.6 Replies go through another device or don't go to the LVS box => bad
        - the route to the client is direct and so doesn't pass through the LVS box,
          for example:
          - client on the LAN
          - client and real server on the same host
        - the wrong route to the LVS box is used => use another

        Check the route:

        rs# ip route get CLIENT_IP from RIP

        The result: start the following tests

        rs# tcpdump -ln host CIP
        rs# traceroute -n -s RIP CIP
        lvs# tcpdump -ln host CIP
        client# tcpdump -ln host CIP

For deeper problems use tcpdump -len, i.e. sometimes the link layer addresses help a bit.

For FTP:

LVS-NAT in Linux 2.2 requires:
    - modprobe ip_masq_ftp (before 2.2.19)
    - modprobe ip_masq_ftp in_ports=21 (2.2.19+)
LVS-NAT in Linux 2.4 requires:
    - ip_vs_ftp
LVS-DR/TUN require the persistent flag.

FTP reports with debug mode enabled are useful:

    # ftp
    ftp> debug
    ftp> open my.virtual.ftp.service
    ftp> ...
    ftp> dir
    ftp> passive
    ftp> dir

There are reports that sometimes the status strings reported from the FTP real servers are not matched with the string constants encoded in the kernel FTP support.
For example, Linux 2.2.19 matches "227 Entering Passive Mode (xxx,xxx,xxx,xxx,ppp,ppp)".

Julian Anastasov
ipvsadm does the following
#setup connection for telnet, using round robin
/sbin/ipvsadm -A -t 192.168.1.110:23 -s rr
#connections to x.x.x.110:telnet are sent to
#realserver 10.1.1.2:telnet
#using LVS-NAT (the -m) with weight 1
/sbin/ipvsadm -a -t 192.168.1.110:23 -r 10.1.1.2:23 -m -w 1
#and to realserver 10.1.1.3
#using LVS-NAT with weight 2
/sbin/ipvsadm -a -t 192.168.1.110:23 -r 10.1.1.3:23 -m -w 2
(if the service was http, the webserver on the realhost could be listening on port 8000 instead of 80)
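To see what the director has installed, list the ipvsadm table (the output format varies between ipvsadm versions; -n stops ipvsadm trying to resolve the IPs to names):

director:/etc/lvs# /sbin/ipvsadm -L -n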
Example: client requests a connection to 192.168.1.110:23
director chooses real server 10.1.1.2:23, updates connection tables, then
packet                   source      dest
incoming                 CIP:3456    VIP:23
inbound rewriting        CIP:3456    RIP1:23
reply (routed to DIP)    RIP1:23     CIP:3456
outbound rewriting       VIP:23      CIP:3456
The client gets back a packet with the source_address = VIP.
For the verbally oriented...
The request packet is sent to the VIP. The director looks up its tables and sends the connection to realserver1. The packet is rewritten with a new destination (in this case with the same port, but the port could be changed too) and sent to RIP1. The realserver replies, sending back a packet to the client. The default gw for the realserver is the director. The director accepts the packet and rewrites the packet to have source=VIP and sends the rewritten packet to the client.
Why isn't the source of the incoming packet rewritten to be the DIP or VIP?
Wensong: ...changing the source of the packet to the VIP sounds good too; it doesn't require that default route rule, but requires additional code to handle it.
Joe
In normal NAT, where a bunch of machines are sitting behind a NAT box, all outward going packets are given the IP on the outside of the NAT box. What if there are several IPs facing the outside world? For NAT it doesn't really matter as long as the same IP is used for all packets. The default value is usually the first interface address (eg eth0). With LVS-NAT you want the outgoing packets to have the source of the VIP (probably on eth0:1) rather than the IP on the main device on the director (eth0).
With a single realserver LVS-NAT LVS serving telnet, the incoming packet does this:
CIP:high_port -> VIP:telnet     #client sends a packet
CIP:high_port -> RIP:telnet     #director demasquerades packet, forwards to realserver
RIP:telnet    -> CIP:high_port  #realserver replies
The reply arrives on the director (being sent there because the director is the default gw for the realserver). To get the packet from the director to the client, you have to reverse the masquerading done by the LVS. To do this (in 2.2 kernels), on the director you add an ipchains rule
director:# ipchains -A forward -p tcp -j MASQ -s realserver1 telnet -d 0.0.0.0/0
If the director has multiple IPs facing the outside world (eg eth0=192.168.2.1, the regular IP for the director, and eth0:1=192.168.2.110, the VIP), the masquerading code has to choose the correct IP for the outgoing packet. Only a packet with src_addr=VIP will be accepted by the client. A packet with any other src_addr will be dropped. The normal default for masquerading (eth0) should not be used in this case. The required m_addr (masquerade address) is the VIP.
Does LVS fiddle with the ipchains tables to do this?
Julian Anastasov ja@ssi.bg
01 May 2001

No, ipchains only delivers packets to the masquerading code. It doesn't matter how the packets are selected in the ipchains rule.
The m_addr (masqueraded_address) is assigned when the first packet is seen (the connect request from the client to the VIP). LVS sees the first packet in the LOCAL_IN chain when it comes from the client. LVS assigns the VIP as maddr.
The MASQ code sees the first packet in the FORWARD chain when there is a -j MASQ target in the ipchains rule. The routing selects the m_addr. If the connection already exists the packets are masqueraded.
The LVS can see packets in the FORWARD chain but they are for already created connections, so no m_addr is assigned and the packets are masqueraded with the address saved in the connections structure (the VIP) when it was created.
There are 3 common cases:
1. The connection is created in response to a packet.

2. The connection is created in response to a packet belonging to another connection.

3. The connection is already created.
Case (1) can happen in the plain masquerading case where the in->out packets hit the masquerading rule. In this case, where nothing dictates the s_addr for the packets going to the external side of the MASQ, the masq code uses the routing to select the m_addr for this new connection. This address is not always the DIP; it can be the preferred source address for the route used, for example an address from another device.
Case (1) happens also for LVS but in this case we know:
- the client address/port (from the received datagram)
- the virtual server address/port (from the received datagram)
- the real server address/port (from the LVS scheduler)
But this happens on an out->in packet, and we are talking about in->out packets.
Case (2) happens for related connections where the new connection can be created when all addresses and ports are known or when the protocol requires some wildcard address/port matching, for example, ftp. In this case we expect the first packet for the connection after some period of time.
It seems you are interested in how case (3) works. The answer is that the NAT code remembers all these addresses and ports in a connection structure with these components:
- external address/port (LVS: client)
- masquerading address/port (LVS: virtual server)
- internal address/port (LVS: real server)
- protocol
- etc
LVS and the masquerading code simply hook into the packet path and perform the header/data mangling. In this process they use the information from the connection table(s). The rule is simple: when a packet belongs to an already established connection we must remember all addresses and ports and always use the same values when mangling the packet header. If we selected different addresses or ports each time, we would simply break the connection. After the packet is mangled, the routing is called to select the next hop. Of course, you can expect problems if there are fatal route changes.
So, the short answer is: the LVS knows what m_addr to use when a packet from the real server is received because the connection is already created and we know what addresses to use. Only in the plain masquerading case (where LVS is not involved) can connections be created and a masquerading address selected without a rule recommending it. In all other cases there is a rule that recommends what addresses to use at creation time. After creation the same values are used.
The disadvantage of the 2 network LVS-NAT is that the realservers are not able to connect to machines in the network of the VIP. You couldn't make a LVS-NAT setup out of machines already on your LAN, which were also required for other purposes to stay on the LAN network.
Here's a one network LVS-NAT LVS.
                        ________
                       |        |
                       | client |
                       |________|
                       CIP=192.168.1.254
                           |
                           |
                        __________
                       |          |
                       |          | VIP=192.168.1.110 (eth0:110)
                       | director |
                       |__________|
                       DIP=192.168.1.9 (eth0:9)
                           |
        ------------------------------------
        |                  |                  |
        |                  |                  |
 RIP1=192.168.1.2   RIP2=192.168.1.3   RIP3=192.168.1.4 (all eth0)
   _____________      _____________      _____________
  |             |    |             |    |             |
  | realserver  |    | realserver  |    | realserver  |
  |_____________|    |_____________|    |_____________|
The problem:
A return packet from the realserver (with address RIP->CIP) will be sent to the realserver's default gw (the director). ICMP redirects will be sent from the director telling the realserver of the better route directly to the client. The realserver will then send the packet directly to the client and it will not be demasqueraded by the director. The client will get a reply from the RIP rather than the VIP and the connection will hang.
The cure:
In the previous HOWTO I said that initial attempts to handle this by turning off redirects had not worked. The problem appears now to be solved.
Thanks to michael_e_brown@dell.com and Julian ja@ssi.bg for help sorting this out.
To get a LVS-NAT LVS to work on one network -
1. On the director, turn off icmp redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/default/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects
(Note: eth0 may be eth1 etc, on your machine).
2. Make the director the default and only route for outgoing packets.
You will probably have set the routing on the realserver up like this
realserver:/etc/lvs# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     0.0.0.0         255.255.255.0   U         0 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
0.0.0.0         director        0.0.0.0         UG        0 0          0 eth0
Note the route to 192.168.1.0/24. This allows the realserver to send packets to the client by just putting them out on eth0, where the client will pick them up directly (without being demasqueraded) and the LVS will not work.
Remove the route to 192.168.1.0/24.
realserver:/etc/lvs# route del -net 192.168.1.0 netmask 255.255.255.0 dev eth0
This will leave you with
realserver:/etc/lvs# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
0.0.0.0         director        0.0.0.0         UG        0 0          0 eth0
The LVS-NAT LVS now works. If LVS is forwarding telnet, you can telnet from the client to the VIP and connect to the realserver.
You can ping from the client to the realserver.
You can also connect _directly_ to services on the realserver _NOT_ being forwarded by LVS (in this case e.g. ftp).
You can no longer connect directly to the realserver for services being forwarded by the LVS. (In the example here, telnet ports are not being rewritten by the LVS, ie telnet->telnet).
client:~# telnet realserver
Trying 192.168.1.11...
^C
(i.e. connection hangs)
Here's tcpdump on the director. Since the network is switched the director can't see packets between the client and realserver. The client initiates telnet. `netstat -a` on the client shows a SYN_SENT from port 4121.
director:/etc/lvs# tcpdump
tcpdump: listening on eth0
16:37:04.655036 realserver.telnet > client.4121: S 354934654:354934654(0) ack 1183118745 win 32120 <mss 1460,sackOK,timestamp 111425176[|tcp]> (DF)
16:37:04.655284 director > realserver: icmp: client tcp port 4121 unreachable [tos 0xc0]
(repeats every second until I kill telnet on client)
I don't see the connect request from client->realserver. The first packet I see is the ack from the realserver, which will be forwarded via the director. The director will rewrite the ack to be from the director. The client will not accept an ack to port 4121 from director:telnet.
Here's an untested solution from Julian for a one network LVS-NAT
put the client in the external logical network. This way the client, the director and the real server(s) are on the same physical network, but the client can't be on the masqueraded logical network. So, change the client from 192.168.1.80 to 166.84.192.80 (or something else). Don't add it through the DIP (I don't see such an IP for the director). Why, in your setup, is DIP==VIP? If you add a DIP (166.84.192.33 for example) on the director, you can later add a path for 192.168.1.0/24 through 166.84.192.33. There is no need to use masquerading with 2 NICs. Just remove the client from the internal logical network used by the LVS cluster.
A different working solution from Ray Bellis rpb@community.net.uk
the same *logical* subnet. I still have a dual-ethernet box acting as a director, and the VIP is installed as an alias interface on the external side of the director, even though the IP address it has is in fact assigned from the same subnet as the
Ray Bellis rpb@community.net.uk has used a 2 NIC director to have the RIPs on the same logical network as the VIP (ie RIP and VIP numbers are from the same subnet), although they are on different physical networks.
The throughput of LVS-NAT is limited by the time taken by the director to rewrite a packet. The limit for a pentium classic 75MHz is about 80Mbit/sec (100baseT). Increasing the number of realservers does not increase the throughput.
The performance page shows a slightly higher latency with LVS-NAT compared to LVS-DR or LVS-Tun, but the same maximum throughput. The load average on the director is high (>5) at maximum throughput, and the keyboard and mouse are quite sluggish. The same director box operating at the same throughput under LVS-DR or LVS-Tun has no perceptible load as measured by top or by mouse/keyboard responsiveness.
Wayne: NAT takes some CPU and memory copying. With a slower CPU, it will be slower.
Julian Anastasov ja@ssi.bg
19 Jul 2001
This is a myth from the 2.2 age. In 2.2 there are 2 input route calls for the out->in traffic and this reduces the performance. By default, in 2.2 (and 2.4 too) the data is not copied when the IP header is changed. Updating the checksum in the IP header does not cost too much time compared to the total packet handling time.
To check the difference between the NAT and DR forwarding methods in the out->in direction you can use testlvs from http://www.linux-vs.org/~julian/ and flood a 2.4 director in 2 setups: DR and NAT. My tests show that I can't see a visible difference. We are talking about 110,000 SYN packets/sec with 10 pseudo clients and the same cpu idle during the tests (there is not enough client power in my setup for a full test), 2 CPU x 866MHz, 2 100mbit internal i82557/i82558 NICs, switched hub:
3 testlvs client hosts -> NIC1-LVS-NIC2 -> packets/sec.
I use a small number of clients because I don't want to spend time in routing cache or LVS table lookups.
Of course, NAT involves in->out traffic as well, and this can halve the performance if the CPU or the PCI bus is not powerful enough to handle the traffic in both directions. This is the real reason the NAT method looks so slow in 2.4. IMO, the overhead from the TUN encapsulation or from the NAT processing is negligible.
Here come the surprises:
The basic setup: 1 CPU PIII 866MHz, 2 NICs (1 IN and 1 OUT), LVS-NAT, SYN flood using testlvs with 10 pseudo clients, no ipchains rules. Kernels: 2.2.19 and 2.4.7pre7.
In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 99% (strange)
In 110,000 SYNs/sec, Out 88,000 SYNs/sec, CPU idle: 0%

In 80,000 SYNs/sec, Out 55,000 SYNs/sec, CPU idle: 0%
In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 0%
In 110,000 SYNs/sec, Out 63,000 SYNs/sec (strange), CPU idle: 0%

In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 20%
In 110,000 SYNs/sec, Out 96,000 SYNs/sec, CPU idle: 2%

In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 30%
In 110,000 SYNs/sec, Out 96,000 SYNs/sec, CPU idle: 0%

In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 45%
In 110,000 SYNs/sec, Out 96,000 SYNs/sec, CPU idle: 15%, 30000 ctxswitches/sec
What I see is that:
- limits: 2.2=88,000P/s, 2.4=96,000P/s, i.e. 8% difference

- 110,000->96,000P/s, 2-3% idle, so I can't claim that there is a NAT-specific overhead.
I performed other tests, testlvs with UDP flood. The packet rate is lower; the cpu idle time in the LVS box increased dramatically, but the client hosts show 0% cpu idle, so maybe more testlvs client hosts are needed.
(with Julian)
The routes added with the route command go into the kernel FIB (Forwarding Information Base) route table. The contents are displayed with the route (or netstat -r) command.
Following an icmp redirect, the route updates go into the kernel's route cache (route -C).
You can flush the route cache with
echo 1 > /proc/sys/net/ipv4/route/flush

or

ip route flush cache
Here's the route cache on the realserver before any packets are sent.
realserver:/etc/rc.d# route -C
Kernel IP routing cache
Source          Destination     Gateway         Flags Metric Ref    Use Iface
realserver      director        director              0      1        0 eth0
director        realserver      realserver      il    0      0        9 lo
With icmp redirects enabled on the director, repeatedly running traceroute to the client shows the routes changing from 2 hops to 1 hop. This indicates that the realserver has received an icmp redirect packet telling it of a better route to the client.
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.932 ms  0.562 ms  0.503 ms
 2  client (192.168.1.254)  1.174 ms  0.597 ms  0.571 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.72 ms  0.581 ms  0.532 ms
 2  client (192.168.1.254)  0.845 ms  0.559 ms  0.5 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  client (192.168.1.254)  0.69 ms  *  0.579 ms
Although the route command shows no change in the FIB, the route cache has changed. (The new route of interest is bracketed by >< signs.)
realserver:/etc/rc.d# route -C
Kernel IP routing cache
Source          Destination     Gateway         Flags Metric Ref    Use Iface
client          realserver      realserver      l     0      0        8 lo
realserver      realserver      realserver      l     0      0     1038 lo
realserver      director        director              0      1      138 eth0
>realserver     client          client                0      0        6 eth0<
director        realserver      realserver      l     0      0        9 lo
director        realserver      realserver      l     0      0      168 lo
Packets to the client now go directly to the client instead of via the director (which you don't want).
It takes about 10 mins for the realserver's route cache entry to expire (experimental result). The timeouts may be in /proc/sys/net/ipv4/route/gc_*, but their location and values are well encrypted in the sources :) (some more info from Alexey at LVS archives )
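If you want to poke at them anyway, the candidate files live under /proc (names and defaults vary between kernel versions):

realserver# ls /proc/sys/net/ipv4/route/
realserver# cat /proc/sys/net/ipv4/route/gc_timeout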
Here's the route cache after 10mins.
realserver:/etc/rc.d# route -C
Kernel IP routing cache
Source Destination Gateway Flags Metric Ref Use Iface
realserver realserver realserver l 0 0 1049 lo
realserver director director 0 1 139 eth0
director realserver realserver l 0 0 0 lo
director realserver realserver l 0 0 236 lo
There are no routes to the client anymore. Checking with traceroute shows that 2 hops are initially required to get to the client (i.e. the routing cache has reverted to using the director as the route to the client). After 2 iterations, icmp redirects route the packets directly to the client again.
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.908 ms 0.572 ms 0.537 ms
2 client (192.168.1.254) 1.179 ms 0.6 ms 0.577 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.695 ms 0.552 ms 0.492 ms
2 client (192.168.1.254) 0.804 ms 0.55 ms 0.502 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 client (192.168.1.254) 0.686 ms 0.533 ms *
Now turn off icmp redirects on the director -
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/default/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects
Checking routes on the realserver -
realserver:/etc/lvs# netstat -rn
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
127.0.0.0 0.0.0.0 255.0.0.0 U 0 0 0 lo
0.0.0.0 director 0.0.0.0 UG 0 0 0 eth0
nothing has changed here.
Flush the kernel route cache and display it -
realserver:/etc/lvs# ip route flush cache
realserver:/etc/lvs# route -C
Kernel IP routing cache
Source Destination Gateway Flags Metric Ref Use Iface
realserver director director 0 1 0 eth0
director realserver realserver l 0 0 1 lo
There are no routes to the client.
Now when you send packets to the client, the route stays via the director, needing 2 hops to get to the client. There are no one-hop packets to the client.
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.951 ms 0.56 ms 0.491 ms
2 client (192.168.1.254) 0.76 ms 0.599 ms 0.574 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.696 ms 0.562 ms 0.583 ms
2 client (192.168.1.254) 0.62 ms 0.603 ms 0.576 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.692 ms * 0.599 ms
2 client (192.168.1.254) 0.667 ms 0.603 ms 0.579 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.689 ms 0.558 ms 0.487 ms
2 client (192.168.1.254) 0.61 ms 0.63 ms 0.567 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.705 ms 0.563 ms 0.526 ms
2 client (192.168.1.254) 0.611 ms 0.595 ms *
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
1 director (192.168.1.9) 0.706 ms 0.558 ms 0.535 ms
2 client (192.168.1.254) 0.614 ms 0.593 ms 0.573 ms
The kernel route cache
realserver:/etc/rc.d# route -C
Kernel IP routing cache
Source Destination Gateway Flags Metric Ref Use Iface
client realserver realserver l 0 0 17 lo
realserver realserver realserver l 0 0 2 lo
realserver director director 0 1 0 eth0
>realserver client director 0 0 35 eth0<
director realserver realserver l 0 0 16 lo
director realserver realserver l 0 0 63 lo
shows that the only route to the client (labelled with >< ) is via the director.
For send_redirects, what's the difference between all, default and eth0?
Julian: see the LVS archives
When the kernel needs to check for one feature (send_redirects for example) it uses calls like:

if (IN_DEV_TX_REDIRECTS(in_dev)) ...

These macros are defined in /usr/src/linux/include/linux/inetdevice.h. A macro returns a value using an expression built from all/<var> and <dev>/<var>. So, these macros check, for example, for:

all/send_redirects || eth0/send_redirects

or

all/hidden && eth0/hidden

When you create eth0 for the first time using ifconfig eth0 ... up, default/send_redirects is copied to eth0/send_redirects by the kernel, internally. I.e. default/ contains the initial values a device inherits when it is created. This is the safest way for a device to appear with correct conf/<dev>/ values.

When we put a value in all/<var>, you can assume that we set the <var> for all devices in this way:

all/<var>    the macro returns:
for &&  0    0
for &&  1    the value from <dev>/<var>
for ||  0    the value from <dev>/<var>
for ||  1    1
This scheme allows the different devices to have different values for their vars. For example, if we set 0 in all/send_redirects, the 3rd line applies, i.e. the result from the macro is the real value in <dev>/send_redirects. If we set 1 in all/send_redirects, then according to the 4th line the macro always returns 1, regardless of <dev>/send_redirects.
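A concrete example for an OR'd variable like send_redirects (a sketch; substitute your own interface name):

#all=1: redirects are sent on every device, whatever eth0/send_redirects says
echo 1 > /proc/sys/net/ipv4/conf/all/send_redirects
#all=0: the per-device value decides, so this now turns redirects off on eth0
echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects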
how to debug/understand TCP/IP packets?
Julian: The RFC documents are your friends:
http://www.ietf.cnri.reston.va.us/rfc.html
The numbers you need:
793   TRANSMISSION CONTROL PROTOCOL
1122  Requirements for Internet Hosts -- Communication Layers
1812  Requirements for IP Version 4 Routers
826   An Ethernet Address Resolution Protocol

For tcpdump, see man tcpdump.
for Microsoft NT _server_
Steve.Gonczi@networkengines.com
there is a uSoft supplied packet capture utility as well.
also - W. Richard Stevens: TCP/IP Illustrated, Vol 1, a good intro to packet layouts and protocol basics. (anything by Stevens is good - Joe).
Ivan Figueredo idf@weewannabe.com
for windump - http://netgroup-serv.polito.it/windump/
frederic.defferrard@ansf.alcatel.fr
Would it be possible to use LVS-NAT to load-balance virtual-IPs to ssh-forwarded real-IPs?
Ssh can also be used to create a local access that is forwarded to a remote access through the ssh protocol. For example you can use ssh to securely map a local access to a remote POP server:
local:localport ==> local:ssh  --- ssh port forwarding ---  remote:ssh ==> remote:pop
And when you connect to local:localport you are transparently/securely connected to remote:pop.
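A minimal sketch of such a forward (the hostnames and the local port here are placeholders, not from the original mail):

#map local port 1110 to port 110 (pop) on the remote machine, over ssh
local$ ssh -L 1110:localhost:110 user@remote
#a POP client pointed at localhost:1110 on the local machine now reaches remote:pop through the tunnel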
The main idea is to allow realservers in different LANs, with realservers that are non-Linux (precluding LVS-Tun).
Example:
                              - VS:81 ---- ssh ---- RS:80
                             /
INTERNET - - - - > VS:80 (NAT)-- VS:82 ---- ssh ---- RS:80
                             \
                              - VS:83 ---- ssh ---- RS:80
Wensong: you can use VPN (or CIPE) to map some external real servers into your private cluster network. If you use LVS-NAT, make sure the routing on the real server is configured properly so that the response packets go through the load balancer to the clients.
I think it isn't necessary to have the default route pointing to the load balancer when using ssh, because the RS address is the same as the VS address (different ports).
With the NAT method, your example won't work because LVS/NAT treats the packets as local ones and forwards them to the upper layers without any change. However, your example gives me an idea: we could dynamically redirect port 80 to ports 81, 82 and 83 respectively for different connections, and then your example could work. However, the performance won't be good, because a lot of work is done at the application level, and the overhead of copying between kernel and user space is high.
Another thought is that we might be able to set up LVS/DR with real servers in different LANs by using CIPE/VPN. For example, we use CIPE to establish tunnels from the load balancer to the real servers like this:
                  10.0.0.1================10.0.1.1 realserver1
                  10.0.0.2================10.0.1.2 realserver2
Load Balancer --- 10.0.0.3================10.0.1.3 realserver3
                  10.0.0.4================10.0.1.4 realserver4
                  10.0.0.5================10.0.1.5 realserver5

Then, you can add LVS/DR configuration commands as:
ipvsadm -A -t VIP:www
ipvsadm -a -t VIP:www -r 10.0.1.1 -g
ipvsadm -a -t VIP:www -r 10.0.1.2 -g
ipvsadm -a -t VIP:www -r 10.0.1.3 -g
ipvsadm -a -t VIP:www -r 10.0.1.4 -g
ipvsadm -a -t VIP:www -r 10.0.1.5 -g

I haven't tested it. Please let me know the result if anyone tests this configuration.