
11. Services

In principle setting up a service on an LVS is simple - you run the service on the realservers and forward the packets from the director. The simplest service to LVS is telnet: the client types a string of characters and the server returns a string of characters. In practice some services interact more with their environment. Ftp needs a second port. With http, the server needs to know its name (it will have the IP of the realserver, but will need to proclaim to the client that it has the VIP). https does not listen on an IP, but for requests to a nodename. This section shows the steps needed to get the common services working.

When trying something new on an LVS, always have the service telnet LVS'ed. If something is not working with your service, check how telnet is doing. Telnet has the advantage of being a simple, non-persistent, one port service: a connection shows up immediately as an ActConn in the ipvsadm output (see the telnet section below).

11.1 setting up a new service

When setting up an LVS on a new service, the client-server semantics are maintained: the client connects to the VIP as if it were a single machine, and each realserver runs the service as if it were a standalone server.

Example: nfs over LVS - the realserver exports its disk and the client mounts the disk from the LVS (this example is taken from the performance data for a single realserver LVS).

realserver:/etc/exports (realserver exports disk to client, here a host called client2)

/       client2(rw,insecure,link_absolute,no_root_squash) 

The client mounts the disk from the VIP. Here's client2:/etc/fstab (client mounts disk from machine with an /etc/hosts entry of VIP=lvs).

lvs:/   /mnt            nfs     rsize=8192,wsize=8192,timeo=14,intr 0 0

The client makes requests to VIP:nfs. The director must forward these packets to the realservers. Here's the conf file for the director.

#lvs_dr.conf for nfs on realserver1
.
.
VIP=eth1:110 lvs 255.255.255.255 192.168.1.110
DIP=eth0 dip 192.168.1.0 255.255.255.0 192.168.1.255 
DIRECTOR_DEFAULT_GW=client2
SERVICE=t telnet rr realserver1 realserver2     #for sanity check on LVS
#to call NFS the name "nfs" put the following in /etc/services
#nfs             2049/udp
#note the 'u' for "udp" in the next line
SERVICE=u nfs rr realserver1                    #the service of interest
SERVER_VIP_DEVICE=lo:0
SERVER_NET_DEVICE=eth0
SERVER_DEFAULT_GW=client
#----------end lvs_dr.conf------------------------------------
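
For reference, the nfs entry above corresponds roughly to the following raw ipvsadm commands on the director (a sketch; the exact commands the configure script emits may differ; -u is a udp service, -g is direct routing):

#add the udp virtual service on the VIP, round robin scheduling
ipvsadm -A -u 192.168.1.110:2049 -s rr
#forward it to realserver1 by direct routing
ipvsadm -a -u 192.168.1.110:2049 -r realserver1 -g -w 1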

11.2 services must be setup for forwarding type

The services must be set up to listen on the correct IP. With telnet this is easy (telnetd listens on 0.0.0.0 under inetd), but most other services need to be configured to listen on a particular IP.

For LVS-NAT, the packets will arrive with dst_addr=RIP, i.e. the service will be listening on the RIP of the realserver. When the realserver replies, the name of the machine returned will be that of the realserver, but the src_addr will be rewritten by the director to be the VIP.

With LVS-DR and LVS-Tun the packets will arrive with dst_addr=VIP, i.e. the service will be listening on an IP which is NOT the RIP of the realserver. Configuring the httpd to listen on the RIP rather than the VIP is a common cause of problems for people setting up http/https.

In both cases, in production, you will need the name returned by the realserver to be the name associated with the VIP.

Note: if the realserver is Linux 2.4 and is accepting packets by transparent proxy, then see the section on TP for the IP the service should listen on.
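
A quick check on a realserver is to look at the listening sockets (a sketch; an httpd on port 80 is used as the example here):

realserver:~# netstat -an | grep LISTEN
#0.0.0.0:80       - listening on all IPs, works for any forwarding method
#192.168.1.110:80 - listening on the VIP, what LVS-DR/LVS-Tun need
#192.168.1.x:80   - listening only on the RIP, what LVS-NAT needs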

11.3 ftp general

ftp is a 2 port service in both active and passive modes. In general, multiport services, or services which need to run together on the one realserver (eg http/https), can be handled by persistence or by Ted Pavlic's adaptation of fwmark (see fwmark for passive ftp).

ftp comes in 2 flavors: active and passive.

11.4 ftp (active) - the classic command line ftp

This is a 2 port service.

ip_vs_ftp/ip_masq_ftp module helpers

As part of the ip_vs build, the modules ip_masq_ftp (2.2.x) and ip_vs_ftp (2.4.x) are produced. The ip_masq_ftp module is a patched version of the file which allowed ftp through a NAT box; the patch stopped it performing its original function (at least in early kernels).

The 2.2.x ftp module is only available as a module (i.e. it can't be built into the kernel).

Juri Haberland juri@koschikode.com 30 Apr 2001

AFAIK the IP_MASQ_* parts can only be built as modules. They are automagically selected if you select CONFIG_IP_MASQUERADE.

Julian Anastasov May 01, 2001

Starting from 2.2.19 the following module parameter is required:

modprobe ip_masq_ftp in_ports=21
Joe

I don't see this mentioned in /usr/src/linux/Documentation, ipvs-1.0.7-2.2.19/Changelog, google or dejanews. Is this an ip_vs feature or is it a new kernel feature?

I see info only in the source. This is a new 2.2.19 feature.

ratz

It's /usr/src/linux/net/ipv4/ip_masq_ftp.c:

 * Multiple Port Support
 *      The helper can be made to handle up to MAX_MASQ_APP_PORTS (normally 12)
 *      with the port numbers being defined at module load time.  The module
 *      uses the symbol "ports" to define a list of monitored ports, which can
 *      be specified on the insmod command line as
 *              ports=x1,x2,x3...
 *      where x[n] are integer port numbers.  This option can be put into
 *      /etc/conf.modules (or /etc/modules.conf depending on your config)
 *      where modload will pick it up should you use modload to load your
 *      modules.
 * Additional portfw Port Support
 *      Module parameter "in_ports" specifies the list of forwarded ports
 *      at firewall (portfw and friends) that must be hooked to allow
 *      PASV connections to inside servers.
 *      Same as before:
 *              in_ports=fw1,fw2,...
 *      Eg:
 *              ipmasqadm portfw -a -P tcp -L a.b.c.d 2021 -R 192.168.1.1 21
 *              ipmasqadm portfw -a -P tcp -L a.b.c.d 8021 -R 192.168.1.1 21
 *              modprobe ip_masq_ftp in_ports=2021,8021
And it is a new kernel feature, not LVS feature.

What are these modules for? From ipvsadm(8) (ipvs 0.2.11):

If a virtual service is to handle FTP connections then persistence must be set for the virtual service if Direct Routing or Tunnelling is used as the forwarding mechanism. If Masquerading is used in conjunction with an FTP service then persistence is not necessary, but the ip_vs_ftp kernel module must be used. This module may be manually inserted into the kernel using insmod(8).

From Julian 3 May 2001, when the modules are required:

The modules are NOT used for LVS-DR or LVS-Tun: in these cases persistence is used (or fwmarks version of persistence).

LVS-NAT, 2.2.x director

I found that ftp worked just fine without the module for 2.2.x (1.0.3-2.2.18 kernel).

LVS-NAT, 2.4.x director

For 2.4.x you can connect with ftp without any extra modules, but you can't "ls" the contents of the ftp directory. For that you need to load the ip_vs_ftp module. Without this module your client's screen won't lock up; it just does nothing. If you then load the module, you can list the contents of the directory.

LVS-DR, LVS-Tun

For LVS-DR and LVS-Tun, active ftp needs persistence. Otherwise it does not work, with or without ip_masq_ftp loaded. You can login, but attempting to do an `ls` will lock up the client screen. Checking the realserver shows connections on ports 20,21 to paired ports on the client.
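
A minimal sketch of the director side for LVS-DR, reusing the VIP and realserver names from the conf file in section 11.1 and a 360 second persistence timeout:

#persistent (-p) ftp virtual service, direct routing (-g) to the realservers
ipvsadm -A -t 192.168.1.110:21 -s rr -p 360
ipvsadm -a -t 192.168.1.110:21 -r realserver1 -g -w 1
ipvsadm -a -t 192.168.1.110:21 -r realserver2 -g -w 1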

11.5 ftp (passive)

Passive ftp is used by netscape to get files from an ftp url like ftp://ftp.domain.com/pub/ . Here's an explanation of passive ftp from http://www.tm.net.my/learning/technotes/960513-36.html

If you can't open connections from Netscape Navigator through a firewall to ftp servers outside your site, then try configuring the firewall to allow outgoing connections on high-numbered ports.

Usually, ftp'ing involves opening a connection to an ftp server and then accepting a connection from the ftp server back to your computer on a randomly-chosen high-numbered TCP port. The connection from your computer is called the "control" connection, and the one from the ftp server is known as the "data" connection. All commands you send and the ftp server's responses to those commands will go over the control connection, but any data sent back (such as "ls" directory lists or actual file data in either direction) will go over the data connection.

However, this approach usually doesn't work through a firewall, which typically doesn't let any connections come in at all; in this case you might see your ftp connection appear to work, but then as soon as you do an "ls" or a "dir" or a "get", the connection will appear to hang.

Netscape Navigator uses a different method, known as "PASV" ("passive ftp"), to retrieve files from an ftp site. This means it opens a control connection to the ftp server, tells the ftp server to expect a second connection, then opens the data connection to the ftp server itself on a randomly-chosen high-numbered port. This works with most firewalls, unless your firewall restricts outgoing connections on high-numbered ports too, in which case you're out of luck (and you should tell your sysadmins about this).

"Passive FTP" is described as part of the ftp protocol specification in RFC 959 ("http://www.cis.ohio-state.edu/htbin/rfc/rfc959.html").

If you are setting up an LVS ftp farm, it is likely that users will retrieve files with a browser, so you will need to set up the LVS to handle passive ftp. You will need either persistence (also see persistence handling in LVS, under documentation on the LVS website) or the fwmark persistent connection setup for ftp.

For passive ftp, the ftpd sets up a listener on a high port for the data transfer. The problem for LVS is that the IP for the listener is the RIP and not the VIP.

Wenzhuo Zhang 1 May 2001

I've been using 2.2.19 on my dialup masquerading box for quite some time. It doesn't seem to me that the option is required, whether in PASV or PORT mode. We can actually get ftp to work in NAT mode without using the ip_masq_ftp module. The trick is to tell the real ftp servers to use the VIP as the passive address for connections from outside; e.g. in wu-ftpd, add the following lines to the /etc/ftpaccess:

passive address RIP <localnet>
passive address 127.0.0.1 127.0.0.0/8
passive address VIP 0.0.0.0/0
Of course, the ftp virtual service has to be persistent port 0.
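
For an LVS-NAT director, a persistent port 0 virtual service looks something like this as raw ipvsadm commands (a sketch, reusing the example VIP from section 11.1; port 0 means all ports from one client go to the same realserver, so the high data port follows the control connection):

ipvsadm -A -t 192.168.1.110:0 -s rr -p 360
ipvsadm -a -t 192.168.1.110:0 -r realserver1 -m -w 1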

Alois Treindl, 3 May 2001

I found (with kernel 2.2.19) that I needed the command

modprobe ip_masq_ftp in_ports=21
so that (passive mode) ftp from Netscape would work. Without the in_ports=21 it did not work.
Julian Anastasov ja@ssi.bg 03 May 2001

Yes, it seems this option is not useful for active FTP transfers: if the data connection is not created while the client's PORT command is detected in the command stream, then it is created later when the internal real server creates a normal in->out connection to the client. So it is not a fatal problem for active FTP to omit this option. The only problem is that these two connections are then independent and the command connection can die before the data connection, for long transfers. With the in_ports option used this can not happen.

The fatal problems come with passive transfers, when the data connection from the client must hit the LVS service. For this, the ip_masq_ftp module must detect the 227 response from the real server in the in->out packets and open a hole for the client's data connection. And the "good" news is that this works only with the in_ports/in_mark options used.

Alois

I am using proftpd as the ftp server, which does not seem to have an option to make the server give the VIP to clients making a PASV request; it always gives the realserver IP address in replies to such requests.

Bad ftpd :) It seems the following rules are valid:

11.6 ftp is difficult to secure

Roberto Nibali ratz@tac.ch 06 May 2001

We have multiple choices if we want to narrow down the input ipchains rules on the front interface of the director.

The biggest problem is with the ip_masq_ftp module. It should create an ip_fw entry in the masq_table for the PORT port. It doesn't do this and we have to open the whole port range. For PASV we have to DNAT the range.

ipchains -A forward -i $EXT_IF -s $INTERNAL_NET $UNPRIV_PORTS -d $DEP -j MASQ

FTP is made up of two connections, the control connection and the data connection.

If we have to protect a client, we would like to only allow passive ftp, because then we do not have to allow incoming connections. If we have to protect a server, we would like to only allow active ftp, because then we only have to allow the incoming control-connection. This is a deadlock.

Example ftp sessions with netcat. (Note this URL now refers through to "atstake.com" which has the NT binary of netcat, but not the unix source. If you know where it is now or want the original code, let me know.)

We need 2 xterms (x1, x2), netcat and an ftp-server (here "zar" 172.23.2.30).

First passive mode (because it is conceptually easier).

#x1: Open the control-connection to the server,
#and send the command "pasv" to the server.
$ netcat zar 21
220 zar.terreactive.ch FTP server (Version 6.4/OpenBSD/Linux-ftpd-0.16) ready.
user ftp
331 Guest login ok, send your complete e-mail address as password.
pass ftp
230 Guest login ok, access restrictions apply.
pasv
227 Entering Passive Mode (172,23,2,30,169,29)

The server replied with 6 numbers: the first 4 are the server's IP (172.23.2.30) and the last 2 encode the data port (169*256+29=43293).

In x2 I open a second connection with a second netcat

$ netcat 172.23.2.30 43293
# x2 will now display output from this connection

Now in x1 (the control-connection)

$ list
list
150 Opening ASCII mode data connection for '/bin/ls'.
226 Transfer complete.

and in x2 the listing appears.

Active ftp

I use the same control-connection in x1 as above, but I want the server to open a connection. Therefore I first need a listener. I do it with netcat in x2:

$ netcat -l -p 2560

Now I tell the server on the control connection to connect (2560=10*256+0)

port 172,23,2,8,10,0
200 PORT command successful.


Now you see why I used port 2560. 172.23.2.8 is, of course, my own IP-address. And now, using x1, I ask for a directory-listing with the list command, and it appears in x2. For completeness' sake, here is the full in/output.

First the xterm 1:

netcat zar 21
220 zar.terreactive.ch FTP server (Version 6.4/OpenBSD/Linux-ftpd-0.16) ready.
user ftp
331 Guest login ok, send your complete e-mail address as password.
pass ftp
230 Guest login ok, access restrictions apply.
pasv
227 Entering Passive Mode (172,23,2,30,169,29)
list
150 Opening ASCII mode data connection for '/bin/ls'.
226 Transfer complete.
port 172,23,2,8,10,0
200 PORT command successful.
list
150 Opening ASCII mode data connection for '/bin/ls'.
226 Transfer complete.
quit
221 Goodbye.

xterm 2:

netcat 172.23.2.30 43293
total 7
dr-x--x--x   2 root     root         1024 Jul 26  2000 bin
drwxr-xr-x   2 root     root         1024 Jul 26  2000 dev
dr-x--x--x   2 root     root         1024 Aug 20  2000 etc
drwxr-xr-x   2 root     root         1024 Jul 26  2000 lib
drwxr-xr-x   2 root     root         1024 Jul 26  2000 msgs
dr-xr-xr-x  11 root     root         1024 Mar 15 14:26 pub
drwxr-xr-x   3 root     root         1024 Mar 11  2000 usr

netcat -l -p 2560
total 7
dr-x--x--x   2 root     root         1024 Jul 26  2000 bin
drwxr-xr-x   2 root     root         1024 Jul 26  2000 dev
dr-x--x--x   2 root     root         1024 Aug 20  2000 etc
drwxr-xr-x   2 root     root         1024 Jul 26  2000 lib
drwxr-xr-x   2 root     root         1024 Jul 26  2000 msgs
dr-xr-xr-x  11 root     root         1024 Mar 15 14:26 pub
drwxr-xr-x   3 root     root         1024 Mar 11  2000 usr

11.7 evaluation of SuSE ftp proxy

Roberto Nibali ratz@tac.ch 08 May 2001

There has been some talk about ftp, security and LVS recently and different opinions appeared. I wasn't aware of the fact that people still heavily use the ftp protocol through a firewall, rather than putting a completely secluded box in a corner. Back here at terreActive we have been fighting with the ftp problem for 4 years already and we do not yet have the ultimate solution. As such we also evaluated the SuSE FTP proxy.

What follows is an evaluation mostly done by one of our coworkers Martin Trampler and me (ratz). We're not yet finished with testing everything and all possible setups (NAT, non-NAT, client behind a firewall, etc.) but the result looks rather good in terms of improving security. A better paper will probably follow, but we're too busy right now and ftp is anyway not allowed in our company policy unless the customer has a special SLA.

Motivated by

I started in March/01 with a search for FTP proxy software which could be used as a drop-in on pab1/2-machines to increase the security of the machines (clients and servers) behind the packetfilters. Since it was a requirement that this software should be able to transparently proxy external clients (i.e. the clients don't realize that there is a proxy inbetween), there was only one package which deserved a closer look: the FTP-Proxy from the SuSE Proxy Suite (which actually consists of nothing but this FTP-Proxy). This proxy now includes support for transparent proxying mode.

Mode of operation

The proxy consists of a single binary (ftp-proxy, stripped about 50k). All configuration options can be set in a single configuration file which by default is named ftp-proxy.conf and searched for in whatever directory was given with the configure option --sysconfdir=. If this option was not given, it is searched for in /usr/local/proxy-suite/etc/, which shows SuSE's BSD heritage. The best way is to give the config file at runtime with the -f cmdline option.

It is very useful to compile debugging-support into the binary during evaluation and to run it with the cmdline-option -v 4 for maximum debugging. Debugging output is then appended to /tmp/ftp-proxy.debug. It can be run from (x)inetd or in standalone-mode as daemon. I only evaluated the daemon.

It reads the config-file (which must exist) and binds to some local port (e.g. 3129, which is IANA-unassigned and squid+1). The packetfilter has to be configured to redirect all packets which come in on port 21 to this port (more later). As soon as it gets a request it handles it by first replying to the client only. After the initial USER <username> command it connects to the server (or, more exactly, to port 21 of the host whose IP was the destination of the redirected packet).

The configfile may contain user-specific sections which direct special users to special servers. This feature may be very useful but was not evaluated either. It then continues as an agent between client and server; checking either side's communication for correct syntax, as a good application-proxy should.

As soon as the client prepares a data connection (either by sending a PASV or a PORT command), the proxy acknowledges it and, in the case of a requested passive connection, establishes a listener which binds to the server's IP (!!). This came rather as a surprise to me. It actually works and means that the data connection is transparent for the client as well. The range of ports on which it listens is configurable, as is the range of ports it uses for outgoing connections (to either the server or to the client in the active-ftp case).

As soon as the client actually wants to retrieve data, the connection to the server is established and the data is shuffled around. Since the connection to the server is completely separate from the client connection, its mode doesn't have to be the one the client requests (although by default it is). The data connection to the server may also be configured to always be active or passive. Here it is clearly desirable to always use passive mode to avoid opening another listener on the packetfilter.

Evaluation

After initially having some minor problems compiling the proxy (it has to be configured --with-regex) and getting it running (by default it thinks it is started by inetd, i.e. standalone mode is not the default), it ran without problems and also wrote informative messages into the debugging file. Almost everything can be configured in the configuration file (although not everything is documented unambiguously), but in general the quality of the documentation, the logging and the debugging messages seems quite high.

The proxy is, as already mentioned, completely transparent for the client and of course intransparent for the server (i.e. the server sees the connections coming from the proxy).

First more extensive stress-testing and code-review performed on 04/Apr/01 showed the following irregularities:

The failed connections result from port numbers being reused where they should be increased. I think this problem would also be found on the client side of the connection if the stress test issued more than 1 data-retrieving command. It may help to undefine the Destination[Min|Max]Port configuration directive to get a port assigned by the system.

The attack-like behaviour vanished after disabling debugging output but may nevertheless be an issue. We found a questionable use of a static char* in a formatting routine.

Integration into a firewall suite

Packetfilter-ruleset

First, the packetfilter port on which the proxy listens must be closed. It is quite possible to bind the proxy to e.g. localhost, but the ipchains ... -j REDIRECT (see below) only allows the specification of a port, not of port+IP. If the proxy is bound to an IP it doesn't get the packets. It has to be universally bound and therefore its port must be closed.

In the following I use:

The latter pair of rules covers the case that all data connections from the proxy to the server are passive. Note, that no rules for the forward-chain are necessary at all.
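
The original ruleset is site-specific; a minimal sketch of the idea, with hypothetical interface names, addresses and port range:

EXT_IF=eth0                 #towards the clients
INT_IF=eth1                 #towards the ftp server
SERVER=172.23.2.30          #internal ftp server (zar in the netcat example)
UNPRIV_PORTS=1024:65535

#redirect the client's control connection (tcp/21 to the server) to the proxy on 3129
ipchains -A input -i $EXT_IF -p tcp -d $SERVER 21 -j REDIRECT 3129
#let the proxy open the control connection and the passive data connections to the server
ipchains -A output -i $INT_IF -p tcp -d $SERVER 21 -j ACCEPT
ipchains -A output -i $INT_IF -p tcp -d $SERVER $UNPRIV_PORTS -j ACCEPT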

This diagram shows the control-connection.


                        +--------------------------+
                        |      |     Proxy      |  |
                        |      |3129____________|  |
  +--------+   tcp/21   |-----   ^         |  -----|  tcp/21   +--------+
  | Client |----------->|eth0|   |         +->|eth1|---------->| Server |
  +--------+ to server  |--------+redirect    -----|           +--------+
                        |Packetfilter              |
                        +--------------------------+

The obvious problem is to formulate a fw-ftp-proxy script which accepts 2 NEs (external client, internal server) as input and does not generate redundant rules. Because the server-side connection is completely independent of the client, its rules must only be added once for each server, while the client-side rules (including the redirect) depend on both server and client. Probably the best way would be to add a script which only handles the client side and to add each ftp server separately with a "tcp@fw" rule. Since tcp@fw does not allow specification of source ports, this rule would then be wider than necessary.

In the client-side script, the port range used in the proxy config file would then have to be hardwired. It would be necessary to verify that for every server used as a target in a client-side script there is at least one (or even better, exactly one) tcp@fw rule as described above.

Security considerations

We had a swift look at the code and it looks rather clean and well documented to me. Unfortunately some features are incorrectly documented or not documented at all while some features are already documented but not yet implemented.

As already mentioned, the port on which the proxy listens has to be closed. The servers should only be driven in passive mode, which should be possible for any server. The PASV_PORTS should be restricted to a dozen or so (depending on the load).

For the maintainers of the servers, the major drawback is the proxy's intransparency.

Features not yet evaluated

Conclusion

General Aspects

Given the current situation, where we shoot huge holes in the firewall to fully enable (passive) ftp connections to servers located inside, the use of this proxy would greatly increase the security of these systems.

Prior to deployment I think the code should be reviewed more closely (remember that the proxy opens listeners on the PF!) and some more efforts should be undertaken to find a configuration which is as tight as possible by providing the required functionality (cf. the section above).

Extensions

It should, in general, be possible to have a second proxy running for inside clients. There we still have the problem that we have to open the whole UNPRIV range for connections coming from source port 20. Basically I think that this problem should be handled differently: providing the functionality is the business of the server (hence the name). The FTP protocol provides passive mode exactly for this case (firewalled client). So we should in general not allow clients behind our firewalls/packetfilters to make active FTP connections.

We found out that it should not be too difficult to enable "bidirectional transparency".

11.8 sshd

Surprisingly (considering that it negotiates a secure connection), nothing special is needed here either. You do not need a persistent port/client connection for this.

jeremy@xxedgexx.com

I'm using ipvs to balance ssh connections but for some reason ipvs is only using one realserver and persists in using that server until I delete its arp entry from the ipvs machine and remove the virtual loopback on the realserver. Also, I noticed that connections do not register with this behavior.

Wensong

do you use the persistent port for your VIP:22? If so, the default timeout of persistent port is 360 seconds, once the ssh session finishes, it takes 360 seconds to expire the persistent session. (In ipvs-0.9.1, you can flexibly set the timeout for the persistent port.) There is no need to use persistent port for ssh service, because the RSA keys are exchanged in each ssh session, and each session is not related.

The director will time out an idle tcp connection (e.g. ssh, telnet) in 15 mins, independently of any settings on the client or server. You will want to change these timeouts.
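
How you change the timeouts depends on the kernel (a sketch; values are seconds for tcp, tcpfin and udp):

#2.4.x director
ipvsadm --set 3600 120 300
#2.2.x director (the masquerading timeouts)
ipchains -M -S 3600 120 300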

keys for realservers running sshd

If you install sshd and generate the host keys for the realservers using the default settings, you'll get a working LVS'ed sshd. However you should be aware of what you've done. The default sshd listens on 0.0.0.0 and you will have generated host keys for a machine whose name corresponds to the RIP (and not the VIP). Since the client will be displaying a prompt with the name of the realserver (rather than the name associated with the VIP) this will work just fine. However the client will get a different realserver each connection (which is OK too) and will accumulate keys for each realserver. If instead you want the client to be presented with one virtual machine, you will need each machine to have its hostname be the name associated with the VIP, the sshd will have to listen on the VIP (if LVS-DR, LVS-Tun) and the hostkeys will have to be generated for the name of the VIP.
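
A sketch of the realserver side for the "one virtual machine" setup (LVS-DR/LVS-Tun), assuming the VIP is 192.168.1.110 and resolves to lvs.mack.net as in the earlier examples:

#/etc/ssh/sshd_config (or /etc/sshd_config) on each realserver
ListenAddress 192.168.1.110
#the realserver's hostname is set to the VIP's name (lvs.mack.net);
#for the client to see a single machine, the same host keys (generated
#under that name) would be used on all realservers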

11.9 telnet

Simple one port service. Use telnet (or netcat) for initial testing of your LVS. It is a simpler client/service than http (it is not persistent) and a connection shows up as an ActConn in the ipvsadm output.

(Also note the director timeout problem, explained in the ssh section).

11.10 dns

This is from Ted Pavlic. Two (independent) services, tcp and udp on port 53, are needed.

(from the IPCHAINS-HOWTO) DNS doesn't always use UDP; if the reply from the server exceeds 512 bytes, the client uses a TCP connection to port number 53 to get the data. Usually this is for a zone transfer.

Here is part of an lvs.conf file which has dns on two realservers.

#dns, note: need both udp and tcp
#A realserver must be able to determine its own name.
#(log onto machine from console and use nslookup
# to see if it knows who it is)
# and to do DNS on the VIP and name associated with the VIP
#To test a running LVS, on client machine, run nslookup and set server = VIP.
SERVICE=t dns wlc 192.168.1.1 192.168.1.8
SERVICE=u dns wlc 192.168.1.1 192.168.1.8
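
A sketch of a test from the client, with a hypothetical domain name and the VIP of the earlier examples (192.168.1.110); the first query uses udp, "set vc" repeats it over tcp (the SERVICE=t entry):

client:~$ nslookup
> server 192.168.1.110
> set type=soa
> somedomain.com
> set vc
> somedomain.com
> exit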

If the LVS is run without mon, then any setup that allows the realservers to resolve names is fine (ie if you can sit at the console of each realserver and run nslookup, you're OK).

If the LVS is run with mon (eg for production), then dns needs to be set up in a way that dns.monitor can tell if the LVS'ed form of dns is working. When dns.monitor tests a realserver for valid dns service, it first asks for the zone serial number from the authoritative (SOA) nameserver of the virtualserver's domain. This is compared with the serial number for the zone returned from the realserver. If these match then dns.monitor declares that the realserver's dns is working.

The simplest way of setting up an LVS dns server is for the realservers to be secondaries (writing their secondary zone info to local files, so that you can look at the date and contents of the files) and some other machine (eg the director) to be the authoritative nameserver. Any changes to the authoritative nameserver (say the director) will have to be propagated to the secondaries (here the realservers) (delete the secondary's zone files and HUP named on the realservers). After the HUP, new files will be created on the secondary nameservers (the realservers) with the time of the HUP and with the new serial numbers. If the files on the secondary nameservers are not deleted before the HUP, then they will not be updated till the refresh/expire time in the zonefile and the secondary nameservers will appear to dns.monitor to not be working.

There is no reason to create an LVS just to do DNS. DNS has its own caching and hierarchical method of load balancing. However if you already have an LVS running serving http, ftp... then it's simple to throw in dns as well (Ted).

11.11 sendmail/smtp/pop3/qmail

For mail which is being passed through, LVS is a good solution.

If the mail is being delivered to the realserver, then the mail will arrive randomly at any one of the realservers and be written to that realserver's filesystem. This is the many reader/many writer problem that LVS has. Since you probably want your mail to arrive at one place only, the only way of handling this right now is to have the /home directory nfs mounted on all the realservers from a backend fileserver which is not part of the LVS (an nfs.monitor is in the works). Each realserver will have to be configured to accept mail for the virtual server DNS name (say lvs.domain.com).

It should be possible to use Coda (http://www.coda.cs.cmu.edu/) to keep /home directories synchronised, or InterMezzo or GFS, all of which look nice, but we haven't tested them.

To maintain user passwds on the realservers -

Gabriel Neagoe Gabriel.Neagoe@snt.ro

for syncing the passwords - IF THE ENVIRONMENT IS SAFE- you could use NIS or rdist

identd (auth) problems

You will not be explicitly configuring identd in an LVS. However identd is used by sendmail and tcpwrappers and will cause problems. Sendmail can't use identd when running on an LVS (see identd and sendmail). Running identd as an LVS service doesn't fix this.

To fix this, in the sendmail.cf file set the value

Timeout.ident=0

Also see http://www.sendmail.org/faq/section3.html - Why do connections to the smtp port take such a long time?
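
Depending on how you build your sendmail configuration, this is set in one of two places (a sketch; check the documentation for your sendmail version):

# directly in sendmail.cf
O Timeout.ident=0

# or in the m4 master file (sendmail.mc) before regenerating sendmail.cf
define(`confTO_IDENT', `0')dnl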

for qmail:

Martin Lichtin lichtin@bivio.com

To test an LVS'ed smtp server (connect to lvs:smtp from the client)

client:~# telnet lvs.cluster.org smtp
 trying 192.168.1.110...
 Connected to lvs.cluster.org
 Escape character is '^]'.
220 lvs.cluster.org ESMTP Sendmail 8.9.1a/8.9.0; Sat 6 Nov 1999 13:16:30 GMT
 HELO client.cluster.org
250 client.cluster.org Hello root@client.cluster.org [192.168.1.12], pleased to meet you
 quit
221 client.cluster.org closing connection

check that you can access each realserver in turn (here 192.168.1.12 was accessed).

pop3

pop3 - as for smtp. The mail agents must see the same /home file system, so /home should be mounted on all realservers from a single file server.

Thoughts about sendmail/pop

(another variation on the many reader/many writer problem)

loc@indochinanet.com wrote:

I need this to convince my boss that LVS is THE SOLUTION for very Scalable and High Available Mail/POP server.

Rob Thomas rob@rpi.net.au

This is about the hardest clustering thing you'll ever do. Because of the constant read/write accesses you -will- have problems with locking and file corruption. The 'best' way to do this is (IMHO):

  1. NetCache Filer as the NFS disk server.
  2. Several SMTP clients using NFS v3 to the NFS server.
  3. Several POP/IMAP clients using NFS v3 to the NFS server.
  4. At least one dedicated machine for sending mail out (smarthost)
  5. LinuxDirector box in front of 2 and 3 firing requests off

Now, items 1 2 -and- 3 can be replaced by Linux boxes, but, NFS v3 is still in Alpha on linux. I -believe- that NetBSD (FreeBSD? One of them) has a fully functional NFS v3 implementation, so you can use that.

The reason why I emphasize NFSv3 is that it -finally- has 'real' locking support. You -must- have atomic locks to the file server, otherwise you -will- get corruption. And it's not something that'll happen occasionally. Picture this:


  [client]  --  [ l.d ] -- [external host]
                   |
     [smtp server]-+-[pop3 server]
                   |
               [filesrv]

Whilst [client] is reading mail (via [pop3 server]), [external host] sends an email to his mailbox. The pop3 client has a file handle on the mail spool, and suddenly data is appended to it. Now the problem is, the pop3 client has a copy of (what it thinks is) the mail spool in memory, and when the user deletes a file, the mail that's just been received will be deleted, because the pop3 client doesn't know about it.

This is actually rather a simplification, as just about every pop3 client understands this, and will let go of the file handle.. But, the same thing will happen if a message comes in -whilst the pop3d is deleting mail-.


                           POP Client    SMTP Client
  I want to lock this file <--
  I want to lock this file               <--
  You can lock the file    -->
  You can lock the file                  -->
  Consider it locked       <--
  File is locked           -->
  Consider it locked                     <--
  Ooh, I can't lock it                   -->

The issue with NFS v1 and v2 is that whilst it has locking support, it's not atomic. NFS v3 can do this:

                           POP Client    SMTP Client
  I want to lock this file <--
  I want to lock this file               <--
  File is locked           -->
  Ooh, I can't lock it                   -->

That's why you want NFSv3. Plus, it's faster, and it works over TCP, rather than UDP 8-)

This is about the hardest clustering thing you'll ever do.

Stefan Stefanov sstefanov@orbitel.bg

I think this might not be so hard to achieve with CODA and Qmail.

Coda (http://www.coda.cs.cmu.edu) allows "clustering" of file system space. Qmail's (http://www.qmail.org) default mailbox format is Maildir, which is a very lock-safe format (even on NFS without lockd).

(I haven't implemented this, it's just a suggestion.)

11.12 Mail farms

Peter Mueller pmueller@sidestep.com 10 May 2001

what open source mail programs have you guys used for SMTP mail farm with LVS? I'm thinking about Qmail or Sendmail?

Michael Brown Michael_E_Brown@Dell.com, Joe and Greg Cope gjjc@rubberplant.freeserve.co.uk 10 May 2001

You can do load balancing against multiple mail servers without LVS. Use multiple MX records to load balance, and mailing list management software (Mailman, maybe?). DNS responds with all MX records for a request. The MTA should then choose one at random from the same priority. (A caching DNS will also return all MX records.) You don't get persistent use of one MX record. If the chosen MX record points to a machine that's down, the MTA will choose another MX record.
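
As a sketch (hypothetical domain and addresses), equal-preference MX records in a zone file look like this; the MTA picks one of the preference-10 hosts at random:

example.com.        IN  MX  10  mail1.example.com.
example.com.        IN  MX  10  mail2.example.com.
mail1.example.com.  IN  A   192.0.2.1
mail2.example.com.  IN  A   192.0.2.2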

Wensong

I think that central load balancing is more efficient in resource utilization than clients randomly picking servers; basic queuing theory can prove this. For example, if there are two mail servers grouped by multiple DNS MX records, it is quite possible that a mail server with a load near 1 is still receiving new connections (QoS is bad here), while in the meantime the other mail server has a load of only 0.1. If central load balancing can keep the load of the two servers around 0.7 each, the resource utilization and QoS are better than in the above case. :)

Michael Brown Michael_E_Brown@Dell.com 15 May 2001

I agree, but... :-)

  1. You can configure most mail programs to start refusing connections when load rises above a certain limit. The protocol itself has built-in redundancy and error-recovery. Connections will automatically fail-over to the secondary server when the primary refuses connections. Mail will _automatically_ spool on the sender's side if the server experiences temporary outage.
  2. Mail service is a special case. The protocol/RFC itself specified application-level load balancing, no extra software required.
  3. Central load balancer adds complexity/layers that can fail.

I maintain that mail serving (smtp only, pop/imap is another case entirely) is a special case that does not need the extra complexity of LVS. Basic Queuing theory aside, the protocol itself specifies load-balancing, failover, and error-recovery which has been proven with years of real-world use.

LVS is great for protocols that do not have the built-in protocol-level load-balancing and error recovery that SMTP inherently has (HTTP being a great example). All I am saying is use the right tool for the job.

Note this discussion applies to mail which is being forwarded by the MTA. The final target machine has the single-writer, many-reader problem as before (which is fine if it's a single node).

Joe

How would someone like AOL handle the mail farm problem? How do users get to their mail? Does everyone in AOL get their mail off one machine (or replicated copies of it) or is each person directed to one of many smaller machines to get their mail?

Michael Brown

Tough question... AOL has a system of inbound mail relays to receive all their user's mail. Take a look:

[mebrown@blap opt]$ nslookup
Default Server:  ausdhcprr501.us.dell.com
Address:  143.166.227.254

> set type=mx
> aol.com
Server:  ausdhcprr501.us.dell.com
Address:  143.166.227.254

aol.com preference = 15, mail exchanger = mailin-03.mx.aol.com
aol.com preference = 15, mail exchanger = mailin-04.mx.aol.com
aol.com preference = 15, mail exchanger = mailin-01.mx.aol.com
aol.com preference = 15, mail exchanger = mailin-02.mx.aol.com
aol.com nameserver = dns-01.ns.aol.com
aol.com nameserver = dns-02.ns.aol.com
mailin-03.mx.aol.com    internet address = 152.163.224.88
mailin-03.mx.aol.com    internet address = 64.12.136.153
mailin-03.mx.aol.com    internet address = 205.188.156.186
mailin-04.mx.aol.com    internet address = 152.163.224.122
mailin-04.mx.aol.com    internet address = 205.188.158.25
mailin-04.mx.aol.com    internet address = 205.188.156.249
mailin-01.mx.aol.com    internet address = 152.163.224.26
mailin-01.mx.aol.com    internet address = 64.12.136.57
mailin-01.mx.aol.com    internet address = 205.188.156.122
mailin-01.mx.aol.com    internet address = 205.188.157.25
mailin-02.mx.aol.com    internet address = 64.12.136.89
mailin-02.mx.aol.com    internet address = 205.188.156.154
mailin-02.mx.aol.com    internet address = 64.12.136.121
dns-01.ns.aol.com       internet address = 152.163.159.232
dns-02.ns.aol.com       internet address = 205.188.157.232

So that is the receive side. On the side of the actual user reading their mail, things are much different. AOL doesn't use normal SMTP mail. They have their own proprietary system, which interfaces to the normal internet SMTP system through gateways. I don't know how AOL does their internal, proprietary stuff, but I would guess it is a massively distributed system.

Basically, you can break down your mail-farm problem into two, possibly three, areas.

1) Mail receipt (from the internet)
2) Users reading their mail
3) Mail sending (to the internet)

Items 1 and 3 can normally be hosted on the same set of machines, but it is important to realize that these are separate functions, and can be split up, if need be.

For item #1, the listing above showing what AOL does is probably a good example of how to set up a super-high-traffic mail gateway system. I normally prefer to add one more layer of protection on top of this: a super low-priority MX at an offsite location. (example: aol.com preference = 100, mail exchanger = disaster-recovery.offsite.aol.com )

For item #2, that is going to be a site policy, and can be handled many different ways depending on what mail software you use (imap, pop, etc). The good IMAP software has LDAP integration. This means you can separate groups of users onto separate IMAP servers. The mail client then can get the correct server from LDAP and contact it with standard protocols (IMAP/POP/etc).

For item #3, you will solve this differently depending on what software you have for #2. If the client software wants to send mail directly to a smart gateway, you are probably going to DNS round-robin between several hosts. If the client expects its server (from #2) to handle sending email, then things will be handled differently.

Wenzhuo Zhang wenzhuo@zhmail.com

Here's an article on paralleling mail servers by Derek Balling.

Shain Miley 25 May 2001

I am planning on setting up an LVS IMAP cluster. I read some posts that talk about file locking problems with NFS that might cause mailbox corruption. Do you think NFS will do the trick or is there a better (faster, journaling) file system out there that will work in a production environment?

Matthew S. Crocker matthew@crocker.com 25 May 2001

NFS will do the trick but you will have locking problems if you use mbox format e-mail. You *must* use MailDir instead of mbox to avoid the locking issues.

You can also use GFS (www.globalfilesystem.org) which has a fault tolerant shared disk solution.

Don Hinshaw dwh@openrecording.com

I do this. I use Qmail as it stores the email in Maildir format, which uses one file per message as opposed to mbox which keeps all messages in a single file. On a cluster this is an advantage since one server may have a file locked for writing while another is trying to write. Since they are locking two different files it eases the problems with NFS file locking.

Courier also supports Maildir format as I believe does Postfix.

I use Qmail+(many patches) for SMTP, Vpopmail for a single UID mail sandbox (shell accounts my ass, not on this rig), and Courier-Imap. Vpopmail is configured to store userinfo in MySQL and Courier-Imap auths out of Vpopmail's tables.

Joe: I've always had the creeps about pop and imap sending clear text passwds. How do you handle passwds?

It's a non-issue on that particular system, which is a webmail server. There is no pop, just imapd, and it's configured to allow connections only from localhost. The webmail is configured to connect to imapd on localhost. No outside connections are allowed.

But, this is another reason that I started using Vpopmail. Since it is a mail sandbox that runs under a single UID, email users don't get a shell account, so even if their passwords are sniffed, it only gets the cracker a look into that user's mailbox, nothing more.

At least on our system. If a cracker grabs someone's passwd and then finds that the user uses the same passwd on every account they have, there's not much I can do about that.

On systems where users do have an ftp or shell login, I make sure that their login is not the same as their email login and I also gen random passwords for all such accounts, and disallow the users changing it.

I'm negotiating a commercial contract to host webmail for a company (that you would recognize if I weren't prohibited by NDA from disclosing the name), and if it goes through then I'll gen an SSL cert for that company and auth the webmail via SSL.

You can also support SSL for regular pop or imap clients such as Netscape Messenger or MS Outlook or Outlook Express.

Everything is installed in /var/qmail/* and /var/qmail/ is an NFS v3 export from a RAID server. All servers connect to a dedicated MySQL server that also stores its databases on another NFS share from the RAID. Also each server mounts /www from the RAID.

Each realserver runs all services: smtpd, imapd, httpd and dns. I use TWIG as a webmail imap client, which is configured to connect to imapd on localhost (each server does this). Incoming smtp, httpd and dns requests are load balanced, but not imapd, since those are local connections on each server. Each server stores its logs locally, then they are combined with a cron script and moved to the RAID.

It's been working very well in a devel environment for over a year (round-robin dns, not lvs). I've recently begun the project to rebuild the system and scale it up into a commercially viable system, which is quite a task since most of the software packages are at least a year old, and I'll be using a pair of LVS directors instead of the RRDNS.

Users will also be using some sort of webmail (IMP/HORDE) to get their mail when they are off site...other than that standard Eudora/Netscape will be used for retrieval.

I settled on TWIG mainly because of its vhost support. With Vpopmail, I can execute
# /var/qmail/vpopmail/bin/vadddomain somenewdomain.com <postmaster passwd>
and add that domain to dns and begin adding users and serving it up. I had to tweak TWIG just a bit to get it to properly deal with the "user@domain" style login that Vpopmail requires, but otherwise it works great. Each vhost domain can have its own config file, but there is only one copy of TWIG in /www. TWIG uses MySQL, and though it doesn't require it, I also create a separate database for each vhost domain.

IMP's development runs along at about Mach 0.00000000004 and I got tired of waiting for them to get a working addressbook. That plus it doesn't vhost all that well. SquirrelMail is very nice, but again not much vhost support. Plus TWIG includes the kitchen sink: email, contacts, schedule, notes, todo, bookmarks and even USENET (which I don't use). Each module can be enabled/disabled in the config file for that domain, and it's got a very complete advanced security module (which I also don't use). It's all PHP and using mod_gzip it is pretty fast. I tested the APC optimizer for PHP, but every time I made a change to a script I had to reload Apache. Not very handy for a devel system, but it did add some noticeable speed increases, until I unloaded it.

The realservers would need access to both the users' home directories as well as the /var/mail directory. I am not too familiar with the actual locking problems... I understand the basics, but I also hear that NFS V3 was supposed to fix some of the locking issues with V2... I also saw some links to GFS, AFS, etc. but am not too sure how they would work...

Just a quick note: About a week ago I tried compiling a kernel that had been patched by SGI for XFS. The kernel (2.4.2) compiled fine, but choked once the LVS patches had been applied. Not having a lot of time to play around with it, I simply moved to 2.4.4+lvs 0.9 and decided not to bother with XFS on the director boxes.

Also, I thought about samba, but only found one post from last year where someone was going to try it and there was no more info there.

Well, there's how I do it. I've tried damned near every combination of GPL software available over about the last 2 years to finally arrive at my current setup. Now if I could just load balance MySQL...

Greg Cope

MySQL connections / data transfers work much faster (20% ish) when on the local host - so how about running mysql on each host as a select-only system, with each localhost using replication from a master DB that is used for inserts and updates?

Ultimately I think I'll have to. After I get done rebuilding the system to use kernel 2.4 and LVS and everything is stabilized, then I'll be looking very hard at just this sort of thing.

Joe, 04 Jun 2001

SMTP servers need access to DNS for reverse name lookup. If they are LVS'ed in a LVS-DR setup, won't this be a problem?

Matthew S. Crocker matthew@crocker.com

You only need to make sure you have the proper forward and reverse lookups set. Inbound mail to an SMTP server gets load balanced by the LVS but the server still sees the original from IP of the sender and can do reverse lookups as normal. Outbound mail from an SMTP server makes connections from its real IP address, which can be NAT'd by a firewall or not. That IP address can also be reverse looked up.

Normally the realservers in an LVS-DR setup have private IPs for the RIPs and hence they can't receive replies from calls made to external name servers.

I would also assume that people would write filter rules to only allow packets in and out of the realservers that belong to the services listed in the director's ipvsadm tables.

I take it that your LVS'ed SMTP servers can access external DNS servers, either by NAT through the director, or in the case of LVS-DR by having public IPs and making calls from those IPs to external nameservers via the default gw of the realservers?

We currently have our real servers with public IP addresses.

Bowie Bailey Bowie_Bailey@buc.com

You can also do this by NAT through a firewall or router. I am not doing SMTP, but my entire LVS setup (VIPs and all) is private. I give the VIPs a static conduit through the firewall for external access. The realservers can access the internet via NAT, the same as any computer on the network.

11.13 authd/identd (port 113) and tcpwrappers (tcpd)

You do not explicitly set up authd (==identd) as an LVS service. It is used with some services (eg sendmail and services running inside tcpwrappers). authd initiates calls from the realservers to the client. LVS is designed for services which receive connect requests from clients. LVS does not allow authd to work any more and this must be taken into account when running services that cooperate with authd. The inability of authd to work with LVS is important enough that there is a separate section on authd.

11.14 http name and IP-based (with LVS-DR or LVS-Tun)

http, with either name- or ip-based virtual hosts, is a simple one port service. Your httpd must be listening on the VIP, which will be on lo:0 or tunl0:0. The httpd can be listening on the RIP too (on eth0) for mon, but for the LVS you need the httpd listening on the VIP.

Thanks to Doug Bagley doug@deja.com for getting this info on ip and name based http into the HOWTO.

Both ip-based and name-based webserving in an LVS are simple. In ip-based (HTTP/1.0) webserving, the client sends a request to a hostname which resolves to an IP (the VIP on the director). The director sends the request to the httpd on a realserver. The httpd looks up its httpd.conf to determine how to handle the request (e.g. which DOCUMENTROOT).

In named-based (HTTP/1.1) webserving, the client passes the HOST: header to the httpd. The httpd looks up the httpd.conf file and directs the request to the appropriate DOCUMENTROOT. In this case all URL's on the webserver can have the same IP.

The difference between ip- and name-based web support is handled by the httpd running on the realservers. LVS operates at the IP level and has no knowledge of ip- or name-based httpd and has no need to know how the URLs are being handled.

For the definitive word on ip-based and name-based web support see

http://www.apache.org/docs/vhosts/index.html

Here are some excerpts.

The original (HTTP/1.0) form of http was IP-based, ie the httpd accepted a call to an IP:port pair, eg 192.168.1.110:80. In the single server case, the machine name (www.foo.com) resolves to this IP and the httpd listens to calls to this IP. Here's the lines from httpd.conf

Listen 192.168.1.110:80
<VirtualHost 192.168.1.110>
        ServerName lvs.mack.net
        DocumentRoot /usr/local/etc/httpd/htdocs
        ServerAdmin root@chuck.mack.net
        ErrorLog logs/error_log
        TransferLog logs/access_log
</VirtualHost>

To make an LVS with IP-based httpds, this IP is used as the VIP for the LVS and if you are using LVS-DR/LVS-Tun, then you set up multiple realservers, each with the httpd listening to the VIP (ie its own VIP). If you are running an LVS for 2 urls (www.foo.com, www.bar.com), then you have 2 VIPs on the LVS and the httpd on each realserver listens to 2 IPs.

The problem with ip-based virtual hosts is that an IP is needed for each url and ISPs charge for IPs.

Doug Bagley doug@deja.com

Name based virtual hosting uses the HTTP/1.1 "Host:" header, which HTTP/1.1 clients send. This allows the server to know what host/domain the client thinks it is connecting to. A normal HTTP request line only has the request path in it, no hostname, hence the new header. IP-based virtual hosting works for older browsers that use HTTP/1.0 and don't send the "Host:" header, and requires the server to use a separate IP for each virtual domain.

The httpd.conf file then has

NameVirtualHost 192.168.1.110

<VirtualHost 192.168.1.110>
ServerName www.foo.com
DocumentRoot /www.foo.com/
..
</VirtualHost>

<VirtualHost 192.168.1.110>
ServerName www.bar.com
DocumentRoot /www.bar.com/
..
</VirtualHost>

DNS for both hostnames resolves to 192.168.1.110 and the httpd determines the hostname to accept the connection from the "Host:" header. Old (HTTP/1.0) browsers will be served the webpages from the first VirtualHost in the httpd.conf.

For LVS again nothing special has to be done. All the hostnames resolve to the VIP and on the realservers, VirtualHost directives are setup as if the machine was a standalone.

Ted Pavlic pavlic@netwalk.com.

Note that in 2000, ARIN (look for "name based web hosting" announcements, the link changes occasionally) announced that IP based webserving would be phased out in favor of name based webserving for ISPs who have more than 256 hosts. This will only require one IP for each webserver. (There are exceptions: ftp, ssl, frontpage...)

11.15 http with LVS-NAT

Summary: make sure the httpd on the realserver is listening on the RIP not the VIP (this is the opposite of what was needed for LVS-DR or LVS-Tun). (Remember, there is no VIP on the realserver with LVS-NAT).

tc lewis had an (ip-based) non-working http LVS-NAT setup. The VIP was a routable IP, while the realservers were virtual hosts on the non-routable 192.168.1.0/24 network.

Michael Sparks michael.sparks@mcc.ac.uk

What's happening is a consequence of using NAT. Your LVS is accepting packets for the VIP, and re-writing them to either 192.168.123.3 or 192.168.123.2. The packets therefore arrive at those two servers marked for address 192.168.123.2 or 192.168.123.3, not the VIP.

As a result when apache sees this:

<VirtualHost w1.bungalow.intra>
...
</VirtualHost>

It notices that the packets are arriving on either 192.168.123.2 or 192.168.123.3 and not w1.bungalow.intra, hence your problem.

Solutions

Joe 10 May 2001

It just occurred to me that a realserver in an LVS-NAT LVS is listening on the RIP. The client is sending to the VIP. In an HTTP 1.1 or name based httpd, doesn't the server get a request with the URL (which will have the VIP) in the payload of the packet (where an L4 switch doesn't see it)? Won't the server be unhappy about this? This has come up before with name based services like https and for indexing of webpages. Does anyone know how to force an HTTP 1.1 connection (or to check whether the connection was HTTP 1.0 or 1.1) so we can check this?

Paul Baker pbaker@where2getit.com 10 May 2001

The HTTP 1.1 request (and also 1.0 requests from any modern browser) contains a Host: header which specifies the hostname of the server. As long as the webservers on the realservers are aware that they are serving this hostname, there should be no issue with 1.1 vs 1.0 http requests.

So both VirtualHost and ServerName should be the reverse dns of the VIP?

Yes. Your Servername should be the reverse dns of the VIP and you need to have a Virtualhost entry for it as well. In the event that you are serving more than one domain on that VIP, then you need to have a VirtualHost entry for each domain as well.
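
A minimal sketch of the httpd.conf on one LVS-NAT realserver, using the RIP from the setup above (192.168.123.2) and assuming www.foo.com is the name that resolves to the VIP:

Listen 192.168.123.2:80
NameVirtualHost 192.168.123.2

<VirtualHost 192.168.123.2>
        ServerName www.foo.com
        DocumentRoot /www.foo.com/
</VirtualHost>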

what if instead of the name of the VIP, I surf to the actual IP? There is no device with the VIP on the LVS-NAT realserver. Does there need to be one? Will an entry in /etc/hosts that maps the VIP to the public name do?

Ilker Gokhan IlkerG@sumerbank.com.tr

If you write the URL with an IP address such as http://123.123.123.123/, the Host: header is filled with this IP address, not a hostname. You can see it using any network monitor program (tcpdump).

11.16 httpd normally closes connections

If you look with ipvsadm to see the activity on an LVS serving httpd, you won't see much. A non-persistent httpd on the realserver closes the connection after sending the packets. Here's the output from ipvsadm, immediately after retrieving a gif-filled webpage from a 2 realserver LVS.

director:# ipvsadm
IP Virtual Server version 0.2.5 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
  -> bashfull.mack.net:www       Masq    1      2          12        
  -> sneezy.mack.net:www         Masq    1      1          11        

The InActConn column shows connections that have transferred their hits and have been closed; they are in the FIN state waiting to time out. You may see "0" in the InActConn column, leading you to think that you are not getting the packets via the LVS.

11.17 Persistence with http; browser opens many connections to httpd

With the first version of the http protocol, HTTP/1.0, a client would request a hit/page from the httpd. After the transfer, the connection was dropped. It is expensive to set up a tcp connection just to transfer a small number of packets, when it is likely that the client will be making several more requests immediately afterwards (e.g. if the client downloads a page with references to gif images in it, then after parsing the html page, it will issue requests to fetch the gifs). With HTTP/1.1, persistent connections became possible. The client/server pair negotiate to see if a persistent connection is available. The httpd will keep the connection open for a period (KeepAliveTimeout, usually 15sec) after a transfer in case further transfers are requested. The client can drop the connection any time it wants to (i.e. when it has got all the hits on a page).

Alois Treindl alois@astro.ch 30 Apr 2001

when I reload a page on the client, the browser makes several http hits on the server for the graphics in the page. These hits are load balanced between the real servers. I presume this is normal for HTTP/1.0 protocol, though I would have expected Netscape 4.77 to use HTTP/1.1 with one connection for all parts of a page.

Joe

Here's the output of ipvsadm after downloading a test page consisting of 80 different gifs (80 lines of <img src="foo.gif">).

director:/etc/lvs# ipvsadm
IP Virtual Server version 1.0.7 (size=4096)                    
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs.mack.net:http rr
  -> bashfull.mack.net:http         Route   1      2          0         
  -> sneezy.mack.net:http           Route   1      2          0         

It would appear that the browser has made 4 connections which are left open. The client shows (netstat -an) 4 connections which are ESTABLISHED, while the realservers show 2 connections each in FIN_WAIT2. Presumably each connection was used to transfer an average of 20 requests.

If the client-server pair were using persistent connection, I would expect only one connection to have been used.

Andreas J. Koenig andreas.koenig@anima.de 02 May 2001

Netscape just doesn't use a single connection, and not only Netscape. All major browsers fire mercilessly a whole lot of connections at the server. They just don't form a single line, they try to queue up on several ports simultaneously...

...and that is why you should never set KeepAliveTimeout to 15 unless you want to burn your money. You keep several gates open for a single user who doesn't use them most of the time while you lock others out.

Julian

Hm, I think the browsers fetch the objects by creating 3-4 connections (not sure how many exactly). If there is a KeepAlive option in the httpd.conf you can expect a small number of inactive connections after the page download is completed. Without this option the client is forced to create new connections after each object is downloaded and the HTTP connections are not reused.

The browsers reuse the connections, but there is more than one connection.

KeepAlive Off can be useful for banner serving, but a short KeepAlive period has its advantages in some cases with long rtt, where the connection setup costs time, and because modern browsers are limited in the number of connections they open. Of course, the default period can be reduced, but its value depends on the served content, i.e. whether the client is expected to open many connections for a short period or just one.

Peter Mueller pmueller@sidestep.com 01 May 2001

I was searching around on the web and found the following relevant links..

http://thingy.kcilink.com/modperlguide/performance/KeepAlive.html
http://httpd.apache.org/docs/keepalive.html -- not that useful
http://www.apache.gamma.ru/docs/misc/fin_wait_2.html -- old but interesting

Andreas J. Koenig andreas.koenig@anima.de 02 May 2001

If you have 5 servers with 15 secs KeepAliveTimeout, then you can serve

60*60*24*5/15 = 28800 requests per day

Joe

don't you actually have MaxClients=150 servers available and this can be increased to several thousand presumably?

Peter Mueller

I think a factor of 64000 is forgotten here (number of possible reply ports), plus the fact that most http connections do seem to terminate immediately, despite the KeepAlive.

Andreas (?)

Sure, and people do this and buy lots of RAM for them. But many of those servers are just in 'K' state, waiting for more data on these KeepAlive connections. Moreover, they do not compile the status module into their servers and never notice.

Let's rewrite the above formula:

MaxClients / KeepAliveTimeout

denotes the number of requests per second that can be satisfied if all clients *send* a keepalive header (I think that's "Connection: keepalive") but *do not actually use* the kept-alive line. If they actually use the kept-alive line, you can serve more, of course.

Try this: start apache with the -X flag, so it will not fork children, and set the KeepAliveTimeout to 60. Then load a page from it with Netscape that contains many images. You will notice that many pictures arrive quickly and a few pictures arrive after a long, long, long, looooong time.

When the browser parses the incoming HTML stream and sees the first IMG tag it will fire off the first IMG request. It will do likewise for the next IMG tag. At some point it will reach an IMG tag and be able to re-use an open keepalive connection. This is good and does save time. But if a whole second has passed after a keepalive request it becomes very unlikely that this connection will be re-used ever, so 15 seconds is braindead. One or two seconds is OK.
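As a sketch of the Apache 1.3 directives being discussed (the values are only illustrative, following the suggestion of a 1-2 second timeout):

# httpd.conf keepalive tuning (illustrative values only)
KeepAlive On
KeepAliveTimeout 2          # 1-2s rather than the default 15s
MaxKeepAliveRequests 100    # requests allowed on one kept-alive connection
MaxClients 150              # upper limit on simultaneous httpd children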

In the above experiment my Netscape loaded 14 images immediately after the HTML page was loaded, but it took about a minute for each of the remaining 4 images which happened to be the first in the HTML stream.

Joe

Here's the output of ipvsadm after downloading the same 80 gif page with the -X option on apache (only one httpd is seen with ps, rather than the 5 I usually have).

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.11 (size=16384)                  
Prot LocalAddress:Port Scheduler Flags                         
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs.mack.net:http rr
  -> bashfull.mack.net:http         Route   1      1          1         
  -> sneezy.mack.net:http           Route   1      0          2         

The page shows a lot of loading at the status line, then stops, showing 100% of 30k. However the downloaded page is blank. A few seconds later the gifs are displayed. The client shows 4 connections in CLOSE_WAIT and the realservers each show 2 connections in FIN_WAIT2.

Paul J. Baker pbaker@where2getit.com 02 May 2001

The KeepAliveTimeout value is NOT the connection timeout. It says how long Apache will keep an active connection open waiting for a new request to come on the SAME connection after it has fulfilled a request. Setting this to 15 seconds does not mean apache cuts all connections after 15 seconds.

I write server load-testing software so I have done quite a bit of research into the behaviour of each browser. If Netscape hits a page with a lot of images on it, it will usually open about 8 connections. It will use these 8 connections to download things as quickly as it can. If the server cuts each connection after 1 request is fulfilled, then the Netscape browser has to keep reconnecting. This costs a lot of time. KeepAlive is a GOOD THING. Netscape does close the connections when it is done with them, which will be well before the 15 seconds since the last request expire.

Think of KeepAliveTimeout as being like an Idle Timeout in FTP. Imagine it being set to 15 seconds.

11.18 Dynamically generated images on web pages

On static webpages, all realservers serve identical content. Dynamically generated images are only generated on the webserver that gets the request. The director will send the client's request for that image to any of the realservers, and not necessarily the realserver that generated the image.

The solutions are described in the section using fwmark for dynamically generated images.

11.19 http: logs, shutting down, cookies, url_parsing, squids, mod_proxy, indexing programs, htpasswd

Logs

The logs from the various realservers need to be merged. From the postings below, using a common nfs filesystem doesn't work, and no-one knows whether this is a locking problem fixable by NFS-v3.0 or not. The way to go seems to be mergelog.

Emmanuel Anne emanne@absysteme.fr

..the problem with the logs. Apparently the best approach is to have each web server write its log file to a local disk, and then to make stats from all the files for the same period... It can become quite complex to handle; is there not a way to have only one log file for all the servers?

Joe - (this is quite old i.e. 2000 or older and hasn't been tested).

log to a common nfs mounted disk? I don't know whether you can have httpds running on separate machines writing to the same file. I notice (using truss on Solaris) that apache does write locking on files while it is running. Possibly it write-locks the log files. Normally multiple forked httpds are running. Presumably each of them writes to the log files and presumably each of them locks the log files for writing.

Webstream Technical Support mmusgrove@webstream.net 18 May 2001

I've got 1 host and 2 realservers running apache (ver 1.3.12-25). The 2nd server NFS exports a directory called /logs. The 1st acts as a client and mounts that directory. I have apache on the 1st server saving the access_log file for each site into that directory as access1.log. The 2nd server saves it as access2.log in the same directory. Our stats program on another server looks for *.log files in that directory. The problem is that whenever I access a site (basically browse through all the pages of a site), the 2nd server adds the access info into the access2.log file and everything is fine. The 1st server saves it to the access1.log file for a few seconds, then all of a sudden the file size goes down to 0 and it's empty.

Alois Treindl alois@astro.ch

I am running a similar system, but with Linux 2.4.4 which has NFS version 3, which is supposed to have safe locking. Earlier NFS versions are said to have buggy file locking, and as Apache must lock the access_log for each entry, this might be the cause of your problem.

I have chosen not to use a shared access_log between the real servers, i.e. not sharing it via NFS. I share the documents directory and a lot else via NFS between all realservers, but not the logfiles.

I use remote syslog logging to collect all access logs on one server.

1. On server w1, which holds the collective access_log and error_log, I have in /etc/syslog.conf the entries:

local0.=info    /var/log/httpd/access_log
local0.err      /var/log/httpd/error_log

2. on all other servers, I have entries which send the messages to w1:

local0.info     @w1
local0.err      @w1

3. On all servers, I have in http.conf the entry: CustomLog "|/usr/local/bin/http_logger" common

4. and the utility http_logger, which sends the log messages to w1, contains:


#!/usr/bin/perl
# script: http_logger
# reads apache log lines on stdin and sends them to syslog,
# which forwards them to the central log host (w1)
use Sys::Syslog;
$SERVER_NAME = 'w1';
$FACILITY = 'local0';
$PRIORITY = 'info';
Sys::Syslog::setlogsock('unix');
openlog ($SERVER_NAME,'ndelay',$FACILITY);
while (<>) {
  chomp;
  syslog($PRIORITY, '%s', $_);  # '%s' so a '%' in a log line isn't treated as a format
}
closelog;

5. I also do error_log logging to the same central server. This is even easier, because Apache allows you to configure in httpd.conf: ErrorLog syslog:local0

On all realservers except w1, these log entries are sent to w1 by the syslog.conf given above.

I think it is superior to using NFS. The access_log entries of course contain additional fields in front of the Apache log lines, which originate from the syslogd daemon.

It is also essential that the realservers are well synchronized, so that the log entries appear in correct timestamp sequence.

I have a shared directory setup and both realservers have their own access_log files that are put into that directory (access1.log and access2.log... I do it this way so the stats server can grab both files and only use 1 license), so I don't think it's a file locking issue at all. Each apache server is writing to its own separate access log file; it's just that they happen to be in the same shared directory. How would the httpd daemon on server A know to lock the access log from server B?

Alois

Why do you think it is NOT a file locking problem? On each realserver, you have a lot of httpd daemons running, and to write into the same file without interfering, they have to use file locking to get exclusive access. On each server, you do not have just one httpd daemon, but many forked copies. All these processes on ONE server need to write to the SAME logfile. For this shared write access, they use file locking.

If this file sits on an NFS server, and NFS file locking is buggy (which I only know as rumor, not from experience), then it might well be the cause of your problem.

Why don't you keep your access_log local on each server, and rotate them frequently, to collect them on one server (merge-sorted by date/time), and then use your Stats server on it?

If you use separate log files anyway, I cannot see the need to create them on NFS. Nothing prevents you from rotating them every 6 hours, and you will probably not need more current stats.

So the log files HAVE to be on a local disk or else one may run into such a problem as I am having now?

Alois

I don't know. I have only read that NFS file locking before NFS 3.0 is broken. It is not a problem related to LVS. You may want to read http://httpd.apache.org/docs/mod/core.html#lockfile

Thanks, but I've seen that before. Each server saves that lock file to its own local directory.

Anyone have a quick and dirty script to merge-sort by date/time the combined apache logs?

Martin Hierling mad@cc.fh-lippe.de

try mergelog

Alois

assuming that all files contain only entries from the same month, I think you can try:

sort -m -k 4 file1 file2 file3 ...

Arnaud Brugnon arnaud.brugnon@24pmteam.com

We successfully use mergelog (you can find it on freshmeat or SourceForge) for merging logs (gzipped or not) from our cluster nodes. We use a simple perl script for downloading them to a single machine.

Juri Haberland list-linux.lvs.users@spoiled.org Jul 13 2001

I'm looking for a script that merges and sorts the access_log files of my three real servers running apache. The logs can be up to 500MB if combined.

Michael Stiller ms@2scale.net Jul 13 2001

You want to look at mod_log_spread

Stuart Fox stuart@fotango.com

cat one log to the end of the other then run

sort -t - -k 3 ${WHEREVER}/access.log > new.log

then you can run webalizer on it.

That's what I use; it doesn't take more than about 30 seconds. If you can copy the logs from your realservers to another box and run sort there, it seems to be better.

Heck, here's the whole (sanitized) script


#!/bin/bash

##
## Set constants
##

DATE=`date "+%d-%b-%Y"`
YESTERDAY=`date --date="1 day ago" "+%d-%b-%Y"`
ROOT="/usr/stats"
SSH="/usr/local/bin/ssh"

## First(1) Remove the tar files left yesterday
find ${ROOT} -name "*.tar.bz2" | xargs -r rm -v

##
## First get the access logs
## Make sure some_account has read-only access to the logs

su - some_account -c "$SSH some_account@real.server1 \"cat /usr/local/apache/logs/access.log\" >  ${ROOT}/logs/$DATE.log"
su - some_account -c "$SSH some_account@real.server2 \"cat /usr/local/apache/logs/access.log\" >> ${ROOT}/logs/$DATE.log"

##
## Second sort the contents in date order
##

sort -t - -k 3 ${ROOT}/logs/$DATE.log > ${ROOT}/logs/access.log

##
## Third run webalizer on the sorted files
## Just set webalizer to dump the files in ${ROOT}/logs

/usr/local/bin/webalizer -c /usr/stats/conf/webalizer.conf

##
## Fourth remove all the crud
## You still got the originals on the real servers

find ${ROOT} -name "*.log" | xargs -r rm -v

##
## Fifth tar up all the files for transport to somewhere else

cd ${ROOT}/logs && tar cfI ${DATE}.tar.bz2 *.png *.tab *.html && chown some_account.some_account ${DATE}.tar.bz2

Stuart Fox stuart@fotango.com

OK, scrub my last post, I just tested mergelog. On a 2 x 400MB log it took 40 seconds; my script did it in 245 seconds.

Juri Haberland list-linux.lvs.users@spoiled.org

Ok, thanks to you all very much! That was quick and successful :-)

I tried mergelog, but I had some difficulties to compile it on Solaris 2.7 until I found that I was missing GNU make...

But now: Happy happy, joy joy!

karkoma abambala@genasys.es

Another posibility... http://www.xach.com/multisort/

Stuart Fox stuart@fotango.com

mergelog seems to be 33% faster than multisort using exactly the same file

Shutting down http

You need to shut down httpd gracefully, by bringing the weight to 0 and letting connections drop, or you will not be able to bind to port 80 when you restart httpd. If you want to do on the fly modifications to your httpd, and keep all realservers in the same state, you may have problems.
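As a sketch of the sequence (the VIP/RIP are placeholders, and you should add your usual forwarding flag (-g/-m/-i) to the ipvsadm -e lines; ipvsadm -e edits an existing realserver entry):

# on the director: quiesce the realserver - weight 0 means no new
# connections are scheduled to it, existing ones continue
ipvsadm -e -t <VIP>:http -r <RIP>:http -w 0

# watch ActiveConn/InActConn for that realserver drain to 0
ipvsadm -L -n

# on the realserver: restart apache once the connections have cleared
apachectl graceful

# on the director: bring the realserver back into service
ipvsadm -e -t <VIP>:http -r <RIP>:http -w 1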

Thornton Prime thornton@jalan.com 05 Jan 2001

I have been having some problems restarting apache on servers that are using LVS-NAT and was hoping someone had some insight or a workaround.

Basically, when I make a configuration change to my webservers and I try to restart them (either with a complete shutdown or even just a graceful restart), Apache tries to close all the current connections and re-bind to the port. The problem is that invariably it takes several minutes for all the current connections to clear even if I kill apache, and the server won't start as long as any socket is open on port 80, even if it is in a 'CLOSING' state.

Michael E Brown wrote:

Catch-22. I think the proper way to do something like this is to take the affected server out of the LVS table _before_ making any configuration changes to the machine. Wait until all connections are closed, then make your change and restart apache. You should run into fewer problems this way. After the server has restarted, then add it back into the pool.

I thought of that, but unfortunately I need to make sure that the servers in the cluster remain in a near identical state, so the reconfiguration time should be minimal.

Julian wrote

Hm, I don't have such problems with Apache. I use the default configuration-time settings, may be with higher process limit only. Are you sure you use the latest 2.2 kernels in the real servers?

I'm guessing that my problem is that I am using LVS persistent connections, and combined with apache's lingering close this makes it difficult for apache to know the difference between a slow connection and a dead connection when it tries to close down, so the time it takes to clear some of the sockets approaches my LVS persistence time.

I haven't tried turning off persistence, and I haven't tried re-compiling apache without lingering-close. This is a production cluster with rather heavy traffic and I don't have a test cluster to play with. In the end rebooting the machine has been faster than waiting for the ports to clear so I can restart apache, but this seems really dumb, and doesn't work well because then my cluster machines have different configuration states.

One reason for your servers to block is a very low value for the maximum number of clients. You can build apache in this way:

CFLAGS=-DHARD_SERVER_LIMIT=2048 ./configure ...

and then increase MaxClients (up to the above limit). Try different values, and don't play too much with MinSpareServers and MaxSpareServers; values near the default are preferred. Is your kernel compiled with a higher value for the number of processes:

/usr/src/linux/include/linux/tasks.h

Is there any way anyone knows of to kill the sockets on the webserver, other than simply waiting for them to clear out or rebooting the machine? (I also tried taking the interface down and bringing it up again... that didn't work either.)

Is there any way to 'reset' the MASQ table on the LVS machine to force a reset?

No way! The masq follows the TCP protocol and it is transparent to both ends. The expiration timeouts in the LVS/MASQ box are high enough to allow the connection termination to complete. Do you remove the realservers from the LVS configuration before stopping the apaches? This can block the traffic and can delay the shutdown. It seems the fastest way to restart apache is apachectl graceful, of course, if you don't change anything in apachectl (in the httpd args).

Cookies

see cookie

URL parsing

unknown

Is there any way to do URL parsing for http requests (ie send cgi-bin requests to one server group, static to another group?)

John Cronin jsc3@havoc.gtf.org 13 Dec 2000

Probably the best way to do this is to do it in the html code itself; make all the cgis hrefs to cgi.yourdomain.com. Similarly, you can make images hrefs to image.yourdomain.com. You then set these up as additional virtual servers, in addition to your www virtual server. That is going to be a lot easier than parsing URLs; this is how they have done it at some of the places I have consulted for; some of those places were using Extreme Networks load balancers, or Resonate, or something like that, using dozens of Sun and Linux servers, in multiple hosting facilities.

Horms

What you are after is a layer-7 switch, that is, something that can inspect HTTP packets and make decisions based on that information. You can use squid to do this; there are other options. A post was made to this list about doing this a while back. Try hunting through the archives.

LVS on the other hand is a layer-4 switch; the only information it has available is IP address, port and protocol (TCP/IP or UDP/IP). It cannot inspect the data segment, and so cannot even tell that the request is an HTTP request, let alone that the URL requested is /cgi-bin or whatever.

There has been talk of doing this, but to be honest it is a different problem to that which LVS solves, and arguably should live in user space rather than kernel space as a _lot_ more processing is required.

mod_proxy

Sean, 25 Dec 2000

I need to forward requests to a server using the direct routing method. However I determine which server to send the request to depending on the file requested in the HTTP GET, not based on its load.

Michael E Brown

Use LVS to balance the load among several servers set up to reverse-proxy your realservers, and set up the proxy servers to load-balance to the realservers based upon content.

Atif Ghaffar atif@4unet.net

On the LVS servers you can run apache with mod_proxy compiled in, then redirect traffic with it.

Example

        ProxyPass /files/downloads/ http://internaldownloadserver/ftp/
        ProxyPass /images/ http://internalimagesserver/images/

more on ProxyPass: http://www.linuxfocus.org/English/March2000/article147.html

or you can use mod_rewrite; in that case your REAL servers should be reachable from the net.
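A sketch of the mod_rewrite variant (hostnames are invented; because the [R] flag sends a redirect back to the client, downloads.example.com must be reachable from the net, which is the point above):

# httpd.conf (with mod_rewrite loaded) - redirect download requests elsewhere
RewriteEngine On
RewriteRule ^/files/downloads/(.*)$ http://downloads.example.com/ftp/$1 [R,L]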

there is also a transparent proxy module for apache http://www.stevek.com/projects/mod_tproxy/

squids

Palmer J.D.F J.D.F.Palmer@swansea.ac.uk Nov 05, 2001

Using iptables etc. on the directors, can you arrange for various URLs/IPs (such as ones requiring NTLM authentication, like FrontPage servers) not to go through the caches, but to be transparently routed to their destination?

Horms

This can only be done if the URLs to be passed directly through can be identified by IP address and/or port. LVS only understands IP addresses and ports, whether it is TCP or UDP, and other low level data that can be matched using ipchains/iptables.

In particular LVS does _not_ understand HTML; it cannot differentiate between, for instance, http://blah/index.html and http://blah/index.asp. Rather you would need to set up something along the lines of http://www.blah/index.html and http://asp.blah/index.asp, and have www.blah and asp.blah resolve to different IP addresses.

Further to this you may want to take a look at http://wwwcache.ja.net/JanetService/PilotService.html

Running indexing programs (eg htdig) on the LVS

(From Ted I think)

Setup -

realservers are node1.foobar.com, node2.foobar.com... nodeN.foobar.com, director has VIP=lvs.foobar.com (all realservers appear as lvs.foobar.com to users).

Problem -

if you run the indexing program on one of the (identical) realservers, the urls of the indexed files will be

http://nodeX.foobar.com/filename

These urls will be unusable by clients out in internetland since the realservers are not individually accessible by clients.

If instead you run the indexing program from outside the LVS (as a user), you will get the correct urls for the files, but you will have to move/copy your index back to the realservers.

Solution (from Ted Pavlic, edited by Joe).

On the indexing node, if you are using LVS-NAT add a non-arping device (eg lo:0, tunl0, ppp0, slip0 or dummy) with IP=VIP as if you were setting up LVS-DR (or LVS-Tun). With LVS-DR/LVS-Tun this device with the VIP is already setup. The VIP is associated in dns with the name lvs.foobar.com. To index, on the indexing node, start indexing from http://lvs.foobar.com and the realserver will index itself giving the URLs appropriate for the user in the index.

Alternately (for LVS-NAT), on the indexing node, add the following line to /etc/hosts

127.0.0.1 localhost lvs.foobar.com

make sure your resolver looks to /etc/hosts before it looks to dns and then run your indexing program. This is a less general solution, since if the name of lvs.foobar.com was changed to lvs.bazbar.com, or if lvs.foobar.com is changed to be a CNAME, then you would have to edit all your hosts files. The solution with the VIP on every machine would be handled by dns.

There is no need to fool with anything unless you are running LVS-NAT.

htpasswd with http

Noah Roberts wrote:

If anyone has had success with htpasswords in an LVS cluster please tell me how you did it.

Thornton Prime thornton@jalan.com Fri, 06 Jul 2001

We have HTTP authentication working on dozens of sites through LVS with all sorts of different password storage from old fashioned htpasswd files to LDAP. LVS when working properly is pretty transparent to HTTP between the client and server.

11.20 HTTP 1.0 and 1.1 requests

Joe

Does anyone know how to force an HTTP 1.1 connection?

Patrick O'Rourke orourke@missioncriticallinux.com

httperf has an 'http-version' flag which will cause it to generate 1.0 or 1.1 requests.

11.21 https

http is an IP based protocol, while https is a name based protocol.

http: you can test an httpd from the console by configuring it to listen on the RIP of the realserver. Then when you bring up the LVS you can re-configure it to listen on the VIP.

https: requires a certificate with the official (DNS) name of the server as the client sees it (the DNS name of the LVS cluster which is associated with the VIP). The https on the realserver then must be setup as if it had the name of the LVS cluster. To do this, activate the VIP on a device on the realserver (it can be non-arping or arping - make sure there are no other machines with the VIP on the network or disconnect your realserver), make sure that the realserver can resolve the DNS name of the LVS to the VIP (by dns or /etc/hosts), setup the certificate and conf file for https and startup the httpd. Check that a netscape client running on the realserver (so that it connects to the realserver's VIP and not to the arping VIP on the director) can connect to https://lvs.clustername.org

Do this for all the realservers, then use ipvsadm on the director to forward https requests to each of the RIPs.

The scheduling method for https must be persistent for keys to remain valid.
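For example (a sketch only: the VIP and realserver names are placeholders, -g is for LVS-DR, and the 3600 second persistence timeout is arbitrary):

# director: persistent https, so a client keeps hitting the same realserver
ipvsadm -A -t <VIP>:443 -p 3600
ipvsadm -a -t <VIP>:443 -r realserver1 -g
ipvsadm -a -t <VIP>:443 -r realserver2 -g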

When compiling in Apache... what kind of certificate should I create for a real application with Thawte?

Alexandre Cassen Alexandre.Cassen@wanadoo.fr

When you generate your CSR use the CN (Common Name) of the DNS entry of your VIP.
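A sketch with openssl (the filenames and 1024 bit key size are arbitrary; when prompted for the Common Name, give the DNS name that resolves to the VIP, e.g. lvs.clustername.org):

# generate a key and a certificate signing request for the LVS name
openssl genrsa -out lvs.key 1024
openssl req -new -key lvs.key -out lvs.csr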

11.22 Named Based Virtual Hosts for https

Dirk Vleugels dvl@2scale.net 05 Jul 2001

I want to host several https domains on a single LVS-DR cluster. The setup of http virtual hosts is straightforward, but what about https? The director needs to be known with several VIP's (or it would be impossible to select the correct server certificate).

Matthew S. Crocker matthew@crocker.com

SSL certs are labelled with the URL name, but the SSL session is established before any HTTP requests. So you can only have one SSL cert tied to an IP address. If you want to have a single host handle multiple SSL certs you need a separate IP for each cert. You also need to set up the director to handle all the IPs.

Name based HTTP virtual hosts DO NOT WORK with SSL because the SSL cert is sent BEFORE the HTTP, so the server won't know which cert to send.

Martin Hierling mad@cc.fh-lippe.de

You can't do name based vhosts, because the SSL stuff is done before HTTP kicks in. So at the beginning there is only the IP:port and no www.domain.com. Look at "Why can't I use SSL with name-based/non-IP-based virtual hosts?"

(Here reproduced in its entirety)

The reason is very technical. Actually it's some sort of a chicken and egg problem: The SSL protocol layer stays below the HTTP protocol layer and encapsulates HTTP. When an SSL connection (HTTPS) is established Apache/mod_ssl has to negotiate the SSL protocol parameters with the client. For this mod_ssl has to consult the configuration of the virtual server (for instance it has to look for the cipher suite, the server certificate, etc.). But in order to dispatch to the correct virtual server Apache has to know the Host HTTP header field. For this the HTTP request header has to be read. This cannot be done before the SSL handshake is finished. But the information is already needed at the SSL handshake phase.

Bingo!

Dirk

With LVS-NAT this would be no problem (targeting different ports on the realservers). But with direct routing I need different virtual IPs on the realservers. The question: will the return traffic use the VIP by default? Otherwise the client will notice the mismatch during the SSL handshake.

"Matthew S. Crocker" matthew@crocker.com

Yes, on the realservers you will have multiple dummy interfaces, one for each VIP. Apache will bind itself to each interface. The sockets for the SSL session are also bound to the interface. The machine will send packets from the IP address of the interface the packet leaves the machine on. So it will work as expected. The clients will see packets from the IP address they connected to.
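A sketch of what this might look like on an LVS-DR realserver (the VIPs, names and certificate paths are invented; lo:0/lo:1 need whatever arp handling your kernel requires):

# realserver: one non-arping alias per VIP
ifconfig lo:0 192.168.1.110 netmask 255.255.255.255 up
ifconfig lo:1 192.168.1.111 netmask 255.255.255.255 up

# httpd.conf: one SSL virtual host per VIP, each with its own certificate
<VirtualHost 192.168.1.110:443>
    ServerName www.example.com
    SSLCertificateFile    /etc/httpd/ssl/www.example.com.crt
    SSLCertificateKeyFile /etc/httpd/ssl/www.example.com.key
</VirtualHost>

<VirtualHost 192.168.1.111:443>
    ServerName shop.example.com
    SSLCertificateFile    /etc/httpd/ssl/shop.example.com.crt
    SSLCertificateKeyFile /etc/httpd/ssl/shop.example.com.key
</VirtualHost>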

Julian Anastasov ja@ssi.bg

NAT:    one RIP for each name/key
DR/TUN: one VIP for each name/key
Is this correct?

James Ogley james.ogley@pinnacle.co.uk

The realservers also need a VIP for each https URL, as they need to be able to resolve that URL to themselves on a unique IP (this can be achieved with /etc/hosts of course)

Joe

Are you saying that https needs its own IP:port rather than just IP?

Dirk

Nope. A unique IP is sufficient. Apache has to decide which cert to use _before_ seeing the 'Host' header in the HTTP request (the SSL handshake comes first). A unique port is also sufficient to decide which virtual server is meant though (and via NAT easier to manage imho).

(I interpret Dirk as saying that the IP:port must be unique. Either the IP is unique and the port is the same for all urls, or the IP is common and there is a unique port for each url.)

11.23 Databases

Normal databaseds (eg mysqld, i.e. anything but Oracle's parallel database server for several 100k$) running under LVS suffer the same restrictions of single writer/many readers as does any other service (eg smtp) where the user can write to files on the realserver.

Databases running independently on several realservers have to be kept synchronised for content, just as webservers do. If the database files are read-only as far as the LVS clients are concerned, and the LVS administrator can update each copy of the database on the realservers at regular intervals (eg a script running at 3am), then you can run a copy of the databased on each realserver, reading the files which you are keeping synchronised.

Online transaction processing requires that LVS clients be able to write to the database.

If you try to do this by setting up an LVS where each realserver has a databased and its own database files, then writes from a particular user will go to only one of the realservers. The database files on the other realservers will not be updated and subsequent LVS users will be presented with inconsistent copies of the database files.

The Linux Scalable Database project http://lsdproject.sourceforge.net/ is working on code to serialise client writes so that they can be written to all realservers by an intermediate agent. Their code is experimental at the moment, but is a good prospect in the long term for setting up multiple databaseds and file systems on separate realservers. (Note: May 2001, the replication feature of mysql is functionally equivalent.)

Currently most databaseds are deployed in a multi-tier setup. The clients are out in internet land; they connect to a web-server which has clients for the database; the web-server database client connects to a single databased. In this arrangement the LVS should balance the webservers/database clients and not balance the database directly.

Production LVS databases, eg the service implemented by Ryan Hulsker RHulsker@ServiceIntelligence.com (sample load data at http://www.secretshopnet.com/mrtg/) have the LVS users connect to database clients (perl scripts running under a webpage) on each realserver. These database clients connect to a single databased running on a backend machine that the LVS user can't access. The databased isn't being LVS'ed - instead the user connects to LVS'ed database clients on the realserver(s) which handle intermediate dataprocessing, increasing your throughput.

The approach of having databaseds on each realserver accessing a common filesystem on a back-end server, fails. Tests with mysqld running on each of two realservers working off the same database files mounted from a backend machine, showed that reads were OK, but writes from any realserver either weren't seen by the other mysqld or corrupted the database files. Presumably each mysqld thinks it owns the database files and keeps copies of locks and pointers. If another mysqld is updating the filesystem at the same time then these first set of locks and pointers are invalid. Presumably any setup in which multiple databaseds were writing to one file system (whether NFS'ed, GFS'ed, coda, intermezzo...) would fail for the same reason.

In an early attempt to set up this sort of LVS, Jake Buchholz jake@execpc.com set up an LVS'ed mysql database with a webinterface. LVS was to serve http and each realserver to connect to the mysqld running on itself. Jake wanted the mysql service to be LVS'ed as well and each realserver to be a mysql client. The solution was to have 2 VIPs on the director, one for http and the other for mysqld. Each http realserver makes a mysql request to the mysql VIP. In this case no realserver is allowed to have both a mysqld and an httpd. A single copy of the database is nfs'ed from a fileserver. This works for reads.

mysql replication

MySQL (and most other databases) supports replication of databases.

Ted Pavlic tpavlic@netwalk.com 23 Mar 2001

When used with LVS, a replicated database is still a single database. The MySQL service is not load balanced. HOWEVER, it is possible to put some of your databases on one server and others on another. Replicate each SET of databases to the OTHER server and only access them from the other server when needed (at an application or at some fail-over level).

Doug Sisk sisk@coolpagehosting.com 9 May 2001

An article on mysql's built in replication facility

Michael McConnell michaelm@eyeball.com 13 Sep 2001

Can anyone see a down side or a reason why one could not have two systems in a failover relationship running MySQL? The database files would be synchronized between the two systems via a crontab and rsync. Can anyone see a reason why rsync would not work? I've on many occasions copied the entire mysql data directory to another system and started it up without a problem. I recognize that there are potential problems in that the rsync might take place while the master is writing and the sync will only have part of a table, but mysql's new table structure is supposed to account for this. If anything, a quick myisamchk should resolve these problems.

Paul Baker pbaker@where2getit.com

Why not just use MySQL's built in replication?

There are many fundamental problems with MySQL Replication. MySQL's replication requires that two systems be set up with identical data sources, activated in a master/slave relationship. If the master fails, all requests can be directed to the slave. Unfortunately this slave does not have a slave, and the only way to give it a slave is to turn it off, synchronize its data with another system and then activate them in a master/slave relationship, resulting in serious downtime when databases are in excess of 6 gigs (-:

This is the most important problem, but there are many many more, and I suggest people take a serious look at other options. Currently I use a method of syncing 3 systems using BinLog's.

Paul Baker pbaker@where2getit.com

What is the downtime when you have to run myisamchk against the 6 gig database because rsync ran at exactly the same time as mysql was writing to the database files and now your sync'd image is corrupted?

There is no reason you can not set up the slave as a master in advance from the beginning. You just use the same database image as from the original master.

When the original master goes down, set up a new slave by simply copying the original master image over to the new slave, then point it to the old slave that was already set up to be a master. You wouldn't need to take the original slave down at all to bring up a new one. You would essentially be setting up a replication chain, but only with the first 2 links active.

Michael McConnell michaelm@eyeball.com

In the configuration I described using rsync, the myisamchk would take place on the slave system. I recognize the time involved would be very large, but this is only the slave. This configuration would be set up so an rsync between the master and slave takes place every 2 hours, and then the slave would execute a myisamchk to ensure the data is ready for action.

I recognize that approximately 2 hours worth of data could be lost, but I would probably use the MySQL BinLogs, rotated at 15 minute intervals and stored on the slave, to allow this to be manually merged in, and keep the data loss down to only 15 minutes.

Paul, you said that I could simply copy the Data from the Slave to a new Slave, but you must keep in mind, in order to do this MySQL requires that the Master and Slave data files be IDENTICAL, that means the Master must be turned off, the data copied to the slave, and then both systems activated. Resulting in serious downtime.

Paul

You only have to make a copy of the data once, when you initially set up your master the first time. Your downtime is only as long as this takes:

   kill mysqlpid
   tar cvf mysql-snapshot.tar /path/to/mysql/datadir
   /path/to/mysql/bin/safe_mysqld

Your down time is essentially only how long it takes to copy your 6 gigs of data NOT across a network, but just on the same machine. (which is far less than a myisamchk on the same data) Once that is done, at your leisure you can copy the 6 gigs to the slave while the master is still up and serving requests.

You can then continue to make slave after slave after slave just by copying the original snapshot to each one. The master never has to be taken offline again.

Michael McConnell

You explained that I can kill the MySQL on the master, tar it up, copy the data to the slave and activate it as the slave. Unfortunately this is not how MySQL works. MySQL requires that the master and slave have identical data files, *IDENTICAL*; that means the master (tar file) cannot change before the slave comes online.

Paul

Well I suppose there was an extra step that I left out (because it doesn't affect the amount of downtime). The complete migration steps would be:

  1. Modify my.cnf file to turn on replication of the master server. This is done while the master mysql daemon is still running with the previous config in memory.
  2. shutdown the mysql daemon via kill.
  3. tar up the data.
  4. start up the mysql daemon. This will activate replication on the master and cause it to start logging all changes for replication since the time of the snapshot in step 3.

    At this point downtime is only as long as it takes you to do steps 2, 3, and 4.

  5. copy the snapshot to a slave server and activate replication in the my.cnf on the slave server as both a master and a slave.
  6. start up the slave daemon. at this time the slave will connect to the master and catch up to any of the changes that took place since the snapshot.

So as you see, the data can change on the master before the slave comes online. The data just can't change between when you make the snapshot and when the master daemon comes up configured for replication as a master.
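A sketch of the my.cnf entries involved (MySQL 3.23-era directives; the server-ids, hostname and replication account are made up):

# master my.cnf (step 1 above)
[mysqld]
log-bin
server-id=1

# slave my.cnf (step 5 above) - also logs binary updates so it can in
# turn act as a master for the next link in the chain
[mysqld]
log-bin
server-id=2
master-host=master.example.com
master-user=repl
master-password=secret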

Michael McConnell

Paul you are correct. I've just done this experiment.

A(master) -> B(slave)
B(master) -> C(slave)

A died. Turn off C's database, tar it up, replicate the data to A, activate A as slave to C. No data loss, and 0 downtime.

(there appears to have been an offline exchange in here.)

Michael McConnell michaelm@eyeball.com

I've just completed the rsync deployment method; this works very well. I find this method vastly superior to the other methods we discussed. Using rsync allows me to use only 2 hosts and I can still provide 100% uptime. In the other method I need 3 systems to provide 100% uptime.

In addition the Rsync method is far easier to maintain and setup.

I do recognize this is not *perfect*: I run the rsync every 20 minutes, and then I run myisamchk on the slave system immediately afterwards. I run the myisamchk to only scan tables that have changed since the last check; not all my tables change every 20 minutes. I will be timing operations and lowering this rsync interval down to approximately 12 minutes. This method works very effectively for managing a 6 gig database that is changing approximately 400 megs of data a day.
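A sketch of the sort of job Michael describes, run from cron on the slave (the paths, hostname and flags are only illustrative):

#!/bin/sh
# pull the master's MyISAM files to the slave, then check the copies
# (run from cron every ~20 minutes; hypothetical paths and hostname)
rsync -a -e ssh master.example.com:/var/lib/mysql/ /var/lib/mysql/
myisamchk --fast --silent /var/lib/mysql/*/*.MYI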

Keep in mind, there are no *real time* replication methods available for MySQL. Running with MySQL's built-in replication commonly results (at least with a 6 gig database changing 400 megs a day) in as much as 1 hour of data inconsistency. The only way to get true real time is to use a shared storage array.

Paul Baker

MySQL builtin replication is supposed to be "realtime" AFAIK. It should only fall behind when the slave is doing selects against a database that causes changes to wait until the selects are complete. Unless you have a select that is taking an hour, there is no reason it should fall that far behind. Have you discussed your findings with the MySQL developers?

Michael McConnell

I do not see MySQL making any claims to Real-time. There are many situations where a high load will result in systems getting backed up, especially if your Slave system performs other operations.

MySQL's built-in replication functions like so;

  1. Master writes updates to Binary Log
  2. Slave checks for Binary Updates
  3. Slave Downloads new Bin Updates / Installs

Alexander N. Spitzer aspitzer@3plex.com

how often are you running rsync? since this is not realtime replication, you run the risk of losing information if the master dies before the rsync has run (and new data has been added since the last rsync.)

Don Hinshaw dwh@openrecording.com

I have one client where we do this. At this moment (I just checked) their database is 279 megs. An rsync from one box to another across a local 100mbit connection takes 7-10 seconds if we run it at 15 minute intervals. If we run it at 5 minute intervals it takes < 3 seconds. If we run it too often, we get an error of "unexpected EOF in read_timeout".

I don't know what causes that error, and they are very happy with the current situation, so I haven't spent any time to find out why. I assume it has something to do with write caching or filesystem syncing, but that's just a wild guess with nothing to back it up. For all I know, it could be ssh related.

We also do an rsync of their http content and httpd logs, which total approximately 30 gigs. We run this sync hourly, and it takes about 20 minutes.

Benjamin Lee benjaminlee@consultant.com

For what it's worth, I have also been witness to the EOF error. I have also fingered ssh as the culprit.

John Cronin

What kind of CPU load are you seeing? rsync is more CPU intensive than most other replication methods, which is how it gains its bandwidth efficiency. How many files are you syncing up - a whole filesystem, or just a few key files? From your answer, I assume you are not seeing a significant CPU load.

Michael McConnell

I rsync 6 gigs worth of data, approximately 50 files (tables). Calculating checksums is really a very simple calculation; the cpu used to do this is 0 - 1% of a PIII 866 (according to vmstat).

I believe all of these articles you have found are related to rsync servers that function as a major FTP server would, for example ftp.linuxberg.org or ftp.cdrom.com.

11.24 Cookies

Cookies are not a service. Cookies are an application level protocol for maintaining state for a client when using the stateless http/https protocols. Other methods for maintaining state involve passing information to the client in the URL. (This can be done with e.g. php.) Cookies are passed between servers and clients which have http, https and/or database services and need to be considered when setting up an LVS.

For the cookie specification see netscape site.

Being a layer 4 switch, LVS doesn't inspect the content of packets and doesn't know what's in them. A cookie is contained in a packet and the packet looks just like any other packet to an LVS.

Eric Brown wrote:

Can LVS in any of its modes be configured to support cookie based persistent sessions?

Horms horms@vergenet.net 3 Jan 2001

No. This would require inspection of the TCP data section, and in fact an understanding of HTTP. LVS has access only to the TCP headers.

Roberto Nibali ratz@tac.ch 19 Apr 2001

LVS is a Layer4 load balancer and can't do content based (L7) load balancing.

You shouldn't try to solve this problem by changing the TCP Layer to provide a solution which should be handled by the Application Layer. You should never touch/tweak TCP settings out of the boundaries given in the various RFC's and their implementations.

If your application passes a cookie to the client, these are the general approaches:

11.25 r commands; rsh, rcp, and their ssh replacements

An example of using rsh to copy files is in performance data for single realserver LVS Sect 5.2,

Caution: The matter of rsh came up in a private e-mail exchange. The person had found that rshd, operating as an LVS'ed service, initiated a call (rsh client request) to the rshd running on the LVS client. (See Stevens "Unix Network Programming" Chapter 14, which explains rsh.) This call will come from the RIP rather than the VIP. This will require rsh to be run under LVS-NAT or else the realservers must be able to contact the client directly. Similar requests from the identd client and passive ftp on realservers cause problems for LVS.

11.26 NFS

failover of NFS

It is possible with LVS to export directories from realservers to a client, making an nfs fileserver (see performance data for single realserver LVS), near the end). This is all fine and dandy except that there is no easy way to fail-out the nfs service.

Joseph Mack

One of the problems with running NFS as an LVS'ed service (ie to make an LVS fileserver), that has come up on this mailing list is that a filehandle is generated from disk geometry and file location data. In general then the identical copies of the same file that are on different realservers will have different file handles. When a realserver is failed out (e.g. for maintenance) and the client is swapped over to a new machine (which he is not supposed to be able to detect), he will now have an invalid file handle.

Is our understanding of the matter correct?

Dave Higgen dhiggen@valinux.com 14 Nov 2000

In principle. The file handle actually contains a 'dev', indicating the filesystem, the inode number of the file, and a generation number used to avoid confusion if the file is deleted and the inode reused for another file. You could arrange things so that the secondary server has the same FS dev... but there is no guarantee that equivalent files will have the same inode number; (depends on order of file creation etc.) And finally the kicker is that the generation number on any given system will almost certainly be different on equivalent files, since it's created from a random seed.

If so is it possible to generate a filehandle only on the path/name of the file say?

Well, as I explained, the file handle doesn't contain anything explicitly related to the pathname. (File handles aren't big enough for that; only 32 bytes in NFS2, up to 64 in NFS3.)

Trying to change the way file handles are generated would be a MASSIVE redesign project in the NFS code, I'm afraid... In fact, you would really need some kind of "universal invariant file ID" which would have to be supported by the underlying local filesystem, so it would ramify heavily into other parts of the system too...

NFS just doesn't lend itself to replication of 'live' filesystems in this manner. It was never a design consideration when it was being developed (over 15 years ago, now!)

There HAVE been a number of heroic (and doomed!) efforts to do this kind of thing; for example, Auspex had a project called 'serverguard' a few years ago into which they poured millions in resources... and never got it working properly... :-(

Sorry. Not the answer you were hoping for, I guess...

shared scsi solution for NFS

(from a discussion with Horms at OLS 2001)

It seems that the code which calculates the filehandle in NFS is so entrenched in NFS that it can't be rewritten to allow disks with the same content (but not necessarily the same disk geometry) to act as failovers in NFS. The current way around this problem is for a reliable (eg RAID-5) disk to be on a shared scsi line. This way two machines can access the same disk. If one machine fails, then the other can supply the disk content. If the RAID-5 disk fails, then you're dead.

John Cronin jsc3@havoc.gtf.org 08 Aug 2001

You should be able to do it with shared SCSI, in an active-passive failover configuration (one system at a time active, the second system standing by to takeover if the first one fails). Only by using something like GFS could both be active simultaneously, and I am not familiar enough with GFS to comment on how reliable it is. If you get the devices right, you can have a seamless NFS failover. If you don't, you may have to umount and remount the file systems to get rid of stale file handle problems. For SMB, users will have to reconnect in the event of a server failure, but that is still not bad.

detecting failed NFS

Don Hinshaw

Would a simple TCP_CONNECT be the right way to test NFS?

Horms

If you are running NFS over TCP/IP then perhaps, but in my experience NFS deployments typically use NFS over UDP/IP. I'm suspecting a better test would be to issue some rpc calls to the NFS server and look at the response, if any. Something equivalent to what showmount can do might not be a bad start.

Joe

how about exporting the directory to the director as well and doing an `ls` every so often

Malcolm Cowe malcolm_cowe@agilent.com 7 Aug 2001

HP's MC/ServiceGuard cluster software monitors NFS using the "rpcinfo" command -- it can be quite sensitive to network congestion, but it is probably the best tool for monitoring this kind of service.

The problem with listing an NFS exported directory is that when the service bombs, ls will wait for quite a while -- you don't want the director hanging because of an NFS query that's gone awry.
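A minimal check along these lines (realserver1 is a placeholder; rpcinfo -u asks the remote portmapper to call the null procedure of the named program over udp, so it returns quickly rather than hanging like ls on a dead mount):

#!/bin/sh
# crude NFS health check: nfs and mountd must answer the RPC null procedure
RS=realserver1
rpcinfo -u $RS nfs    > /dev/null 2>&1 || echo "$RS: nfsd not responding"
rpcinfo -u $RS mountd > /dev/null 2>&1 || echo "$RS: mountd not responding"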

Nathan Patrick np@sonic.net 09 Aug 2001

Mon includes a small program which extends what "rpcinfo" can do (and shares some code!) Look for rpc.monitor.c in the mon package, available from kernel.org among other places. In a nutshell, you can check all RPC services or specific RPC services on a remote host to make sure that they respond (via the RPC null procedure). This, of course, implies that the portmapper is up and responding.

From the README.rpc.monitor file:

This program is a monitor for RPC-based services such as the NFS protocol, NIS, and anything else that is based on the RPC protocol. Some general examples of RPC failures that this program can detect are:
  - missing and malfunctioning RPC daemons (such as mountd and nfsd)
  - systems that are mostly down (responding to ping and maybe
    accepting TCP connections, but not much else is working)
  - systems that are extremely overloaded (and start timing out simple
    RPC requests)
To test services, the monitor queries the portmapper for a listing of RPC programs and then optionally tests programs using the RPC null procedure.

At Transmeta, we use:

  "rpc.monitor -a" to monitor Network Appliance filers
  "rpc.monitor -r mountd -r nfs" to monitor Linux and Sun systems

Michael E Brown michael_e_brown@dell.com 08 Aug 2001

how about rpcping?

(Joe - rpcping is at nocol_xxx.tar.gz for those of us who didn't know rpcping existed.)

NFS locking and persistence

Steven Lang Sep 26, 2001

The primary protocol I am interested in here is NFS. I have the director set up with DR with LC scheduling, no persistence, with UDP connections timing out after 5 seconds. I figured the time it would need to be accessing the same host would be when reading a file, so they are not all in contention for the same file, which seems to cost performance in GFS. That would all come in a series of accesses. So there is not much need to keep the traffic going to the same host beyond 5 seconds.

Horms horms@vergenet.net>

I know this isn't the problem you are asking about, but I think there are some problems with your architecture. I spent far too much of last month delving into NFS - for reasons not related to load balancing - and here are some of the problems I see with your design. I hope they are useful.

As far as I can work out you'll need the persistence to be much longer than 5s. NFS is stateless, but regardless, when a client connects to a server a small amount of state is established on both sides. The stateless aspect comes into play in that if either side times out, or a reboot is detected, then the client will attempt to reconnect to the server. If the client doesn't reconnect, but rather its packets end up on a different server because of load balancing, the server won't know anything about the client and nothing good will come of the situation. The solution to this is to ensure a client consistently talks to the same server, by setting a really long persistence.
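A sketch of the long persistence Horms is suggesting (the addresses are placeholders, -g is for LVS-DR, and the 7200 second timeout is arbitrary):

# director: persistent udp nfs so a client stays on one realserver/lockd
ipvsadm -A -u <VIP>:2049 -p 7200
ipvsadm -a -u <VIP>:2049 -r realserver1 -g
ipvsadm -a -u <VIP>:2049 -r realserver2 -g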

There is also the small issue of locks. lockd should be running on the server and keeping track of all the locks the client has. If the client has to reconnect, then it assumes all its locks are lost, but in the meantime it assumes everything is consistent. If it isn't accessing the same server (which wouldn't work for the reason given above) then the server won't know about any locks it thinks it has.

Of course unless you can get the lockd on the different NFS servers to talk to each other you are going to have a problem if different clients connected to different servers want to lock the same file. I think if you want to have any locking you're in trouble.

I actually specifically tested for this. Now it may just be a linux thing, not common to all NFS platforms, but in my tests, when packets were sent to a server other than the one mounted, it happily served up the requested data without even blinking. So whatever state is being saved (and I do know there is some minimal state) seems unimportant. It actually surprised me how seamlessly it all worked, as I was expecting the non-mounted server to reject the requests or something similar.

That is quite surprising, as the server should maintain state as to which clients have what mounted.

There is also the small issue of locks. lockd should be running on the server, keeping track of all the locks the client has. If the client has to reconnect, then it assumes all its locks are lost, but in the meantime it assumes everything is consistent. If it isn't accessing the same server (which wouldn't work for the reason given above) then the server won't know about any locks it thinks it has.

This could indeed be an issue; perhaps persistence could be set up for locks. But I don't think it is all bad. Of course, I am basing this on several assumptions that I have not verified at the moment. I am assuming that NFS and GFS will Do The Right Thing. I am assuming the NFS lock daemon will coordinate locks with the underlying OS. I am also assuming that GFS will then coordinate the locking with the rest of the servers in the cluster. Now as I understand locking on UNIX, it is only advisory and not enforced by the kernel; the programs are supposed to police themselves. So in that case, as long as it talks to a lock daemon, and keeps talking to the same lock daemon, it should be okay, even if the lock daemon is not on the server it is talking to, right?

That should be correct, the locks are advisory. As long as the lock daemons are talking to the underlying file system (GFS) then the behaviour should be correct, regardless of which lock daemon a client talks to, as long as the client consistently talks to the same lock daemon for a given lock.

Of course, in the case of a failure, I don't know what will happen. I will definitely have to look at the whole locking thing in more detail to know if things work right or not, and maybe get the lock daemons to coordinate locking.

Given that lockd currently lives entirely in the kernel, that is easier said than done.

11.27 RealNetworks streaming protocols

Jerry Glomph Black black@real.com August 25, 2000

RealNetworks' streaming protocols use RTSP (tcp 554), PNA (tcp 7070) and HTTP (tcp 8080), together with UDP for the data streams.

The server configuration can be altered to run on any port, but the above numbers are the customary, and almost universally used, ones.

Mark Winter, a network/system engineer in my group, wrote up the following detailed recipe on how we do it with LVS:

add IP binding in the G2 server config file

<List Name="IPBindings">
     <Var Address_1="<real ip address>"/>
     <Var Address_2="127.0.0.1"/>
     <Var Address_3="<virtual ip address>"/>
</List>

On the LVS side
./ipvsadm -A -u <VIP>:0  -p
./ipvsadm -A -t <VIP>:554  -p
./ipvsadm -A -t <VIP>:7070  -p
./ipvsadm -A -t <VIP>:8080  -p

./ipvsadm -a -u <VIP>:0 -r <REAL IP ADDRESS>
./ipvsadm -a -t <VIP>:554 -r <REAL IP ADDRESS>
./ipvsadm -a -t <VIP>:7070 -r <REAL IP ADDRESS>
./ipvsadm -a -t <VIP>:8080 -r <REAL IP ADDRESS>

(Ted)

I just wanted to add that if you use FWMARK, you might be able to make it a little simpler and not have to worry about forwarding EVERY UDP port.

# Mark packets with FWMARK1
ipchains -A input -d <VIP>/32 7070 -p tcp -m 1
ipchains -A input -d <VIP>/32 554 -p tcp -m 1
ipchains -A input -d <VIP>/32 8080 -p tcp -m 1
ipchains -A input -d <VIP>/32 6970:7170 -p udp -m 1

# Setup the LVS to listen to FWMARK1
ipvsadm -A -f 1 -p

# Setup the real server
ipvsadm -a -f 1 -r <RIP>

Not only is this six lines rather than eight, but now you've set up a persistent port grouping. You do not have to forward EVERY UDP port, and you're still free to set up non-persistent services (or other persistent services that are persistent based on other ports).

When you want to remove a real server, you now do not have to remove FOUR real servers, you just remove one. Same thing with adding. Plus, if you want to change what's forwarded to each real server, you can do so with ipchains and not bother with taking the LVS up and down. ALSO... if you have an entire network of VIPs, you can set up ipchains rules which will forward the entire network automatically rather than each VIP one by one.
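
On 2.4 kernels, where ipchains is replaced by iptables, the same marking is done in the mangle table. A sketch, not from the original posting, assuming the standard MARK target:

#mark the RealNetworks control ports and the UDP data range with fwmark 1 (2.4 kernels)
iptables -t mangle -A PREROUTING -d <VIP>/32 -p tcp --dport 554 -j MARK --set-mark 1
iptables -t mangle -A PREROUTING -d <VIP>/32 -p tcp --dport 7070 -j MARK --set-mark 1
iptables -t mangle -A PREROUTING -d <VIP>/32 -p tcp --dport 8080 -j MARK --set-mark 1
iptables -t mangle -A PREROUTING -d <VIP>/32 -p udp --dport 6970:7170 -j MARK --set-mark 1

#the ipvsadm fwmark rules (-A -f 1 -p and -a -f 1 -r <RIP>) are unchanged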

11.28 Synchronising content and backing up realservers

Realservers should have identical files/content for any particular service (since the client can be connected to any of them). This is not a problem for slowly changing sites (e.g. ftp servers), where the changes can be made by hand, but sites serving webpages may have to be changed daily or hourly. Some semi-automated method is needed to stage the content in a place where it is reviewed and then moved to the realservers.

For a database LVS, the changes have to be propagated in seconds. In an e-commerce site you have to either keep the client on the same machine when they transfer from http to https (using persistence), which may be difficult if they do some comparative shopping or have lunch in the middle of filling their shopping cart, or propagate the information (e.g. a cookie) to the other realservers.

Here are comments from the mailing list.

Wensong

If you just have two servers, it might be easy to use rsync to synchronize the backup server, and put the rsync job in the crontab in the primary. See http://rsync.samba.org/ for rsync.
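
A minimal sketch of such a crontab entry on the primary (the hostname backup and the path /var/www are assumptions):

#push the web content to the backup server every 10 minutes
*/10 * * * * rsync -a --delete -e ssh /var/www/ backup:/var/www/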

If you have a big cluster, you might be interested in Coda, a fault-tolerant distributed filesystem. See http://www.coda.cs.cmu.edu/ for more information.

Joe

From comments on the mailing list, Coda now (Aug 2001) seems to be a usable project. I don't know what has happened to the sister project Intermezzo.

J Saunders 27 Sep 1999

I plan to start a frequently updated web site (potentially every minute or so).

Baeseman, Cliff Cliff.Baeseman@greenheck.com

I use mirror to do this! I created an ftp point on the director. All nodes run against the director's ftp directory and update the local webs. It runs very fast and is very solid. Upload to a single point and the web content propagates across the nodes.

Paul Baker pbaker@where2getit.com 23 Jul 2001

PFARS Project on SourceForge

I have just finished committing the latest revisions to the PFARS project CVS up on SourceForge. PFARS, pronounced 'farce', is the "PFARS For Automatic Replication System."

PFARS is currently used to handle server replication for Where2GetIt.com's LVS cluster. It has been in the production environment for over 2 months so we are pretty confident with the code stability. We decided to open source this program under the GPL to give back to the community that provided us with so many other great FREE tools that we could not run our business without (especially LVS). It is written in Perl and uses rsync over SSH to replicate server file systems. It also natively supports Debian Package replication.

Although the current version number is 0.8.1, it's not quite ready for release yet. It is seriously lacking in documentation and there is no installation procedure yet. In the future we would also like to add support for RPM-based Linux distros, many more replication stages, and support for restarting server processes when certain files are updated. If anyone would like to contribute to this project in any way, do not be afraid to email me directly or join the development mailing list at pfars-devel@lists.sourceforge.net.

Please visit the project page at http://sourceforge.net/projects/pfars/ and check it out. You will need to check it out from CVS as there are no files released yet. Any feedback will be greatly appreciated.

Zachariah Mully wrote:

I am having a debate with one of my software developers about how to most efficiently sync content between realservers in an LVS system.

The situation is this... Our content production software that we'll be putting into active use soon will enable our marketing folks to insert the advertising into our newsletters without the tech and launch teams getting involved (this isn't necessarily a good thing, but I'm willing to remain open minded ;). This will require that the images they put into newsletters be synced between all the webservers... The problem though is that the web/app servers running this software are load-balanced so I'll never know which server the images are being copied to.

Obviously loading the images into the database backend and then out to the servers would be one method, but the software guys are convinced that there is a way to do it with rsync. I've looked over the documentation for rsync and I don't see any way to set up cron jobs on the servers to run an rsync job that will look at the other servers' content, compare it, and then either upload or download content to that server. Perhaps I am missing an obvious way of doing this, so can anyone give me some advice as to the best way of pursuing this problem?

Bjoern Metzdorf bm@turtle-entertainment.de 19 Jul 2001

You have at least 3 possibilities:

1. You let them upload to all RIPs (uploading to each real server)
2. You let them upload to a testserver, and after some checks you use rsync to put the images onto the RIPs.
3. You let them upload to one defined RIP instead of the VIP and rsync from there (no need for a testserver)

Stuart Fox stuart@fotango.com 19 Jul 2001

nfs mount one directory for upload and serve the images from there.

Write a small perl/bash script to monitor both upload directories remotely, then rsync the differences when detected.
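
A rough sketch of such a script (the hostname staging and the paths are assumptions; since rsync itself only transfers the differences, the script simply runs it periodically):

#!/bin/bash
#pull any new uploads from the staging host every minute
while true; do
    rsync -a -e ssh staging:/var/www/uploads/ /var/www/uploads/
    sleep 60
done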

Don Hinshaw dwh@openrecording.com 19 Jul 2001

You can use rsync, rsync over ssh or scp.

You can also use partition syncing with a network/distributed filesystem such as Coda, OpenAFS or drbd (drbd is still too experimental for me). Such a setup creates partitions which are mirrored in real time, i.e. changes to one are reflected on them all.

We use a common NFS share on a RAID array. In our particular setup, users connect to a "staging" server and make changes to the content on the RAID. As soon as they do this, the real-servers are immediately serving the changed content. The staging server will accept FTP uploads from authenticated users, but none of the real-servers will accept any FTP uploads. No content is kept locally on the real-servers so they never need be synced, except for config changes like adding a new vhost to Apache.
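
For reference, the realserver side of such a setup can be just an ordinary NFS mount of the shared content. A sketch, where raidserver and the paths are assumed names:

#realserver /etc/fstab: mount the shared content read-only from the NFS server
raidserver:/export/www   /var/www   nfs   ro,rsize=8192,wsize=8192,intr 0 0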

jik@foxinternet.net 19 Jul 2001

If you put the conf directory on the NFS mount along with htdocs, then you only need to edit one file, then ssh to each server and run "apachectl graceful".
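
That restart step can be scripted with something like this (the realserver names are assumptions):

#gracefully restart apache on each realserver after editing the shared config
for host in realserver1 realserver2 realserver3; do
    ssh $host apachectl graceful
done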

Don Hinshaw dwh@openrecording.com 20 Jul 2001

Um, no. We're serving a lot of: <VirtualHost x.x.x.x> and the IP is different for each machine. In fact the conf files for all the real-servers are stored in an NFS mounted dir. We have a script that manages the separate configs for each real-server.

I'm currently building a cluster for a client that uses a pair of NFS servers which will use OpenAFS to keep synced, then use linux-ha to make sure that one of them is always available. One thing to note about such a system is that the synced partitions are not "backups" of each other. This is really a "meme" (way of thinking about something). The distinction is simply that you cannot roll back changes made to a synced filesystem (because the change is made to them both), whereas with a backup you can roll back. So, if a user deletes a file, you must reload from backup. I mention this because many people that I've talked to think that if you have a synced filesystem, then you have backups.

What I'm wondering is why you would want to do this at all. From your description, your marketing people are creating newsletters with embedded advertising. If they are embedding a call for a banner (called a creative in adspeak) then normally that call would grab the creative/clickthrough link from the ad server not the web servers. For tracking the results of the advertising, this is a superior solution. Any decent ad server will have an interface that the marketing dept. can access without touching the servers at all.

Marc Grimme grimme@comoonics.atix.de 20 Jul 2001

Depending on how much data you need to sync, you could think about using a cluster filesystem, so that any node in the LVS cluster can concurrently access the same physical data. Have a look at http://www.sistina.com/gfs/ . We have a clustered webserver with 3 to 5 nodes with GFS underneath and it is pretty stable.

If you are sure which server the latest data is uploaded to, rsync is no problem. If not, I would consider using a network or cluster filesystem. That should save a lot of scripting work and is more storage efficient.

jgomez@autocity.com

We are using rsync as a server. We have a function that uploads the contents to the server and syncs the uploaded file to the other servers. The list of servers we have to sync is in a file like:

192.168.0.X
192.168.0.X
192.168.0.X
When a file is uploaded, the server reads this file and syncs the upload to all the other servers.
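
A sketch of what that sync step might look like (the file name servers.list and the content path are assumptions; the file holds one IP per line, as above):

#push the uploaded file tree to every server in the list
while read server; do
    rsync -a -e ssh /var/www/uploads/ $server:/var/www/uploads/
done < servers.list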

11.29 Timeouts for TCP/UDP connections

2.2 kernels

Julian, 28 Jun 00

LVS uses the default timeouts for idle connections set by MASQ (EST/FIN/UDP) of 15, 2 and 5 mins (set in /usr/src/linux/include/net/ip_masq.h). These values are fine for ftp or http, but if you have people sitting on an LVS telnet connection they won't like the 15 min timeout. You can't read these timeout values, but you can set them with ipchains. The format is

$ipchains -M -S tcp tcpfin udp

example:

$ipchains -M -S 36000 0 0

sets the timeout for established connections to 10hrs. The value "0" leaves the current timeout unchanged, in this case FIN and UDP.

2.4 kernels

Julian Anastasov ja@ssi.bg 31 Aug 2001

The timeout is set by ipvsadm. Although this feature has been around for a while, it didn't work till ipvsadm 1.20 (possibly 1.19) and ipvs-0.9.4 (thanks to Brent Cook for finding this bug).

$ipvsadm --set tcp tcpfin udp

The default timeout is 15 min for the LVS connections in established state. For any NAT-ed client connections ask iptables.

To set the tcp timeout to 10hrs, while leaving tcpfin and udp timeouts unchanged, do

#ipvsadm --set 36000 0 0

Brent Cook busterb@mail.utexas.edu 31 Aug 2001

I found the relevant code in the kernel to modify this behavior in 2.4 kernels without using ipchains. I got this info from http://www.cs.princeton.edu/~jns/security/iptables/iptables_conntrack.html . In /usr/src/linux/net/ipv4/netfilter/ip_conntrack_proto_tcp.c, change TCP_CONNTRACK_TIME_WAIT to however long you need to wait before a tcp connection times out.

Does anyone foresee a problem with other tcp connections as a result of this? A regular tcp program will probably close the connection anyway.

static unsigned long tcp_timeouts[]
= { 30 MINS,    /*      TCP_CONNTRACK_NONE,     */
    5 DAYS,     /*      TCP_CONNTRACK_ESTABLISHED,      */
    2 MINS,     /*      TCP_CONNTRACK_SYN_SENT, */
    60 SECS,    /*      TCP_CONNTRACK_SYN_RECV, */
    2 MINS,     /*      TCP_CONNTRACK_FIN_WAIT, */
    2 MINS,     /*      TCP_CONNTRACK_TIME_WAIT,        */
    10 SECS,    /*      TCP_CONNTRACK_CLOSE,    */
    60 SECS,    /*      TCP_CONNTRACK_CLOSE_WAIT,       */
    30 SECS,    /*      TCP_CONNTRACK_LAST_ACK, */
    2 MINS,     /*      TCP_CONNTRACK_LISTEN,   */
};

In the general case you cannot change the settings at the client. If you have access to the client, you can arrange for the client to send keepalive packets often enough to reset the timer above and keep the connection open.

Kyle Sparger ksparger@dialtone.com 5 Oct 2001

You can address this from the client side by reducing the tcp keepalive transmission intervals. Under Linux, reduce it to 5 minutes:

echo 300 > /proc/sys/net/ipv4/tcp_keepalive_time

where '300' is the number of seconds. I find this useful in all manner of situations where the OS times out connections.

director with many entries in FIN state

With the FIN timeout being about 2 min (2.2.x kernels), if most of your connections are non-persistent http (only taking 1 sec or so), most of your connections will show up as InActConn.

Hendrik Thiel, 20 Mar 2001

We are using an LVS in NAT mode and everything works fine... The only problem seems to be the huge number of (idle) connection entries.

ipvsadm shows a lot of InActConn entries (more than 10000 per realserver). ipchains -M -L -n shows that these connections last 2 minutes. Is it possible to reduce this time, e.g. to 10 seconds, to keep the masquerading table small?

Joe

For 2.2 kernels, you can use netstat -M instead of ipchains -M -L. For 2.4.x kernels use `cat /proc/net/ip_conntrack`.

Julian

One entry occupies 128 bytes. 10k entries mean 1.28MB memory. This is not a lot of memory and may not be a problem.

To reduce the number of entries in the ipchains table, you can reduce the timeout values. You can edit the TIME_WAIT, FIN_WAIT values in ip_masq.c, or enable the secure_tcp strategy and alter the proc values there. FIN_WAIT can also be changed with ipchains.
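
For example, using the ipchains format shown above, the FIN timeout can be dropped to 10 seconds while leaving the established tcp and udp timeouts unchanged:

#tcp tcpfin udp: a value of "0" leaves that timeout unchanged
ipchains -M -S 0 10 0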


Next Previous Contents