
20. Failover protection

Don't even think about doing this till you've got your LVS working properly. If you want the LVS to survive a server or director failure, you can add software to do this after you have the LVS working. For production systems, failover may be useful or required.

On director or realserver failure, the connection/session from the client to the realserver is lost. Preserving session information is a difficult problem. Do not expect a solution real soon. After failover, the client will be presented with a new connection. Unlike on a single server, only some of the clients will experience loss of connection.

The most common problem here is loss of network access or extreme slowdown. Hardware failure or an OS crash (on unix) is less likely. However, redundancy of services on the realservers is one of the useful features of LVS: one machine/service can be removed from the functioning virtual server for an upgrade or to move the machine, and can be brought back on line later without interruption of service to the client.

20.1 Director failure

What happens if the director dies? The usual solution is duplicate director(s), with one active and one inactive. If the active director fails, it is switched out. Automatic detection of failure in unreliable devices by other unreliable devices is not a simple problem. Although everyone seems to want reliable service, in most cases people are using the redundant boxes to maintain service through periods of planned maintenance rather than to handle boxes which just fail at random times. In critical situations, you should at least plan on replacing disks before their expected failure time.

HA solution

The usual HA solution is to set up a pair of directors and to run heartbeat between them. One director defaults to being the operational director and the other takes over when heartbeat detects that the default director has died.

                        ________
                       |        |
                       | client |
                       |________|
                           |
                           |
                        (router)
                           |
                           |
          ___________      |       ___________
         |           |     |  DIP |           |
         | director1 |-----|------| director2 |
         |___________|     |  VIP |___________|
               |     <- heartbeat->    |
               |---------- | ----------|
                           |
         ------------------------------------
         |                 |                |
         |                 |                |
     RIP1, VIP         RIP2, VIP        RIP3, VIP
   ______________    ______________    ______________
  |              |  |              |  |              |
  | realserver1  |  | realserver2  |  | realserver3  |
  |______________|  |______________|  |______________|

The Ultra Monkey project uses Heartbeat from the Linux-HA project and ldirectord to monitor the realservers. Fake, heartbeat and mon are available at the Linux High Availability site.

The setup of Ultra Monkey is also covered on the LVS website.

A write up by Peter Mueller on setting up Linux HA on directors is at the end of this section.
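
As a concrete sketch of the two-director picture above, heartbeat needs three files in /etc/ha.d on each director. This is only an illustration - the node names, heartbeat interface (eth1) and timings are made up, and the VIP is the example VIP used elsewhere in this HOWTO; see the heartbeat and Ultra Monkey documentation for the authoritative setup.

#/etc/ha.d/ha.cf (the same on both directors)
keepalive 2                 #secs between heartbeats
deadtime 10                 #declare the peer dead after this many secs
udpport 694
bcast eth1                  #link carrying the heartbeat (a serial link also works)
nice_failback on
node director1              #must match `uname -n` on each box
node director2

#/etc/ha.d/haresources (identical on both; director1 owns the resources by default)
director1 192.168.1.110 ldirectord::ldirectord.cf

#/etc/ha.d/authkeys (must be mode 600)
auth 1
1 crc

When heartbeat on director2 stops hearing from director1, it brings up the VIP (and the DIP, if listed as a resource) on director2 and starts ldirectord there.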

Two box HA LVS

Doug Sisk sisk@coolpagehosting.com 19 Apr 2001

Is it possible to create a two server LVS with fault tolerance? It looks straightforward with 4 servers (2 realservers and 2 directors), but can it be done with just two boxes, ie directors, with each director being a realserver for the other director and a realserver running localnode for itself?

Horms

Take a look at ultramonkey.org, that should give you all the bits you need to make it happen. You will need to configure heartbeat on each box, and then LVS (ldirectord) on each box to have two real servers: the other box, and localhost.
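
A sketch of the ldirectord.cf for such a two-box setup (the IPs here are made up, LVS-DR forwarding is assumed, and the exact directives depend on your ldirectord version - check its documentation):

#/etc/ha.d/ldirectord.cf
checktimeout=10
checkinterval=5

virtual=192.168.1.110:80
        real=192.168.1.3:80 gate        #the other box
        real=127.0.0.1:80 gate          #this box, handled by LVS localnode
        service=http
        request="index.html"
        receive="Test Page"
        scheduler=rr
        protocol=tcp
        checktype=negotiate

Whichever box currently holds the VIP balances between itself and its partner; if the partner's httpd stops answering the check, ldirectord removes it from the ipvsadm table until it recovers.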

Vrrpd

Alexandre Cassen alexandre.cassen@canal-plus.com, the co-author of keepalived and author of LVSGSP has produced a vrrpd demon for LVS. Installation instructions will be forthcoming (Nov 2001), but he hasn't written the docs yet :-)

The vrrpd fabricates a software ethernet device on the outside of the director (for the VIP) and another for the inside of the director (for the DIP) each with a MAC address from the private range of MAC addresses (i.e. will not be found on any manufactured NIC). When a director fails, vrrpd re-creates the ethernet devices, with the original IP and MACs, on the secondary director. The router does not see any change in the link layer and continues to route packets to the same MAC address.

In the Linux-HA situation by contrast, when the IPs (VIP, DIP) are moved to a new hardware NIC, the MAC address changes. Various types of trickery (e.g. using send-arp to flush the router's arp table) are required to tell the router that the IP has moved to a new MAC address. This can possibly interrupt service (some packets will have to be re-sent).

Padraig Brady padraig@antefacto.com 22 Nov 2001

Haven't Cisco got patents on this? What's the legal situation if someone wanted to deploy this?

Michael McConnell michaelm@eyeball.com - no, see ftp://ftp.isi.edu/in-notes/rfc2338.txt (the VRRP RFC).

20.2 Saving connection state on failover: Director demon for server state synchronisation

For seamless director failover, all connection state information from the failed director should be transferred/available to the new director. This is a similar problem to backing up a hot database. This problem has been discussed many times on the mailing list without any code being produced. Grabbing the bull by the horns, Ratz and Julian convened the Bulgarian Summit meeting in March 2001 where a design was set for a server state sync demon.

In ipvs-0.9.2 Wensong released a sync demon.

Wensong Zhang wensong@gnuchina.org 20 Jun 2001

The ipvs-0.9.2 tar ball is available on the LVS website. The major change is the new connection synchronization feature.

Added the feature of connection synchronization from the primary load balancer to the backup load balancers through multicast.

The ipvs syncmaster daemon is started inside the kernel on the primary load balancer, and it multicasts the queue of connection states that need synchronization. The ipvs syncbackup daemon is likewise started inside the kernel on the backup load balancers; it accepts the multicast messages and creates the corresponding connections.

Here are simple instructions for using connection synchronization.

On the primary load balancer, run

primary_director:# ipvsadm --start-daemon=master --mcast-interface=eth0 

On the backup load balancers, run

backup_director:# ipvsadm --start-daemon=backup --mcast-interface=eth0

To stop the daemon, run

director:# ipvsadm --stop-daemon

Note that the connection synchronization feature is still experimental, and there is some performance penalty when it is enabled, because a highly loaded load balancer may need to multicast a lot of connection information. If the daemon is not started, performance is not affected.
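
A quick way to convince yourself that the synchronisation is working (a sketch - the IPs are the example ones used elsewhere in this HOWTO, and the /proc layout varies between ipvs versions): make a few connections through the VIP from a client, then look at the connection table on the backup director. Connections that only ever went through the primary should show up there.

client:$ telnet 192.168.1.110 80

backup_director:# cat /proc/net/ip_vs_conn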

Alexandre Cassen alexandre.cassen@canal-plus.com 9 Jul 2001

Using ipvsadm you start the sync daemon on the master director. It then sends adverts to the backup servers using multicast (224.0.0.81). You need to start the ipvsadm sync daemon on the backup servers too...

The master multicasts messages to the backup load balancers in the following format.


       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |  Count Conns  |   Reserved    |            Size               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      |                    IPVS Sync Connection (1)                   |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                            .                                  |
      |                            .                                  |
      |                            .                                  |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      |                    IPVS Sync Connection (n)                   |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

I have planned to add an ICV like in IPSEC-AH (with anti-replay and a strong data exchange format) but I am still very busy.

There aren't a lot of people using the server state sync demon yet, so we don't have much experience with it.

There is now a sync demon write up.

Here's a comment from the days before the LVS sync demon.

Lars Marowski-Bree lmb@suse.de

The -sh and -dh schedulers in 2.4 should make this possible, as there is no state information to transfer ;-)

20.3 Realserver failure

An agent external to the ipvs code on the director is used to monitor the services. LVS itself can't monitor the services as LVS is just a packet switcher. If a realserver fails, the director doesn't get the failure, the client does. For the standard LVS-Tun and LVS-DR setups (ie receiving packets by an ethernet device and not by TP), the reply packets from the realserver go to its default gw and don't go through the director, so the LVS can't detect failure even if it wants to. For some of the mailings concerning why the LVS does not monitor itself and why an external agent (eg mon) is used instead, see the postings on external agents.

In a failure protected LVS, if the realserver at the end of a connection fails, the client will lose their connection to the LVS and will have to start with a new connection, as would happen on a regular single server. With a failure protected LVS, the failed realserver will be switched out of the LVS and a working server will be made available transparently (the client will connect to one of the still working servers, or possibly a new server if one is brought on-line).

If the service is http, losing the connection is not a problem for the client: they'll get a new connection next time they click on a link/reload. For services which maintain a connection, losing the connection will be a problem.

ratz ratz@tac.ch 16 Nov 2000

This is very nasty for persistent setups in an e-commerce environment. Take for example a simple e-com site offering some goods to buy. You can browse and view all their goodies. At a certain point you want to buy something. Ok, it is common nowadays that people can buy over the Internet with a CC. Obviously this is done f.e. with SSL. SSL needs persistence enabled in the lvs-configuration. Imagine having 1000 users (conn ESTABLISHED) entering their VISA information when the database server crashes and the healthcheck takes out the server; or even more simply, when the server/service itself crashes. Ok, all the already established connections (they have a persistence template in the kernel space) are lost and these 1000 users have to reauthenticate. How does this look from a client's point of view, who has no idea about the technology behind the site?

Here the functioning and setup of "mon" is described. In the Ultra Monkey version of LVS, ldirectord fills the same role. (I haven't compared ldirectord and mon. I used mon because it was available at the time, while ldirectord was either not available or I didn't know about it.) The configure script will setup mon to monitor the services on the realservers.

Get "mon" and "fping" from http://www.kernel.org/software/mon/ (I'm using mon-0.38.20)

(from ian.martins@mail.tju.edu - comment out line 222 in fping.c if compiling under glibc)

Get the perl package "Period" from CPAN (ftp://ftp.cpan.org)

To use the fping and telnet monitors, you'll need the tcp_scan binary which can be built from satan. The standard version of satan needs patches to compile on Linux. The patched version is at

ftp://sunsite.unc.edu/pub/Linux/system/network/admin

20.4 ethernet NIC failure

There was a lengthy thread on using multiple NICs to handle NIC failure. Software/hardware to handle such failures is more common for other unices which run expensive servers (e.g. Solaris) but is less common in Linux.

Beowulfs can use multiple NICs to increase throughput by bonding them together (channel bonding), but redundancy/HA is not important for beowulfs - if a machine fails, it is fixed on the spot. There is no easy way to un-bond NICs - you have to reboot the computer :-(

Michael McConnell michaelm@eyeball.com 06 Aug 2001

I want to take advantage of dual NICS in the real server to provide redundancy. Unfortunately the default gw issue comes up.

Michael E Brown michael_e_brown@dell.com 09 Aug 2001

Yes, this is a generally available feature with most modern NICs. It is called various things: channel bonding, trunking, link aggregation. There is a native linux driver that implements this feature in a generic way. It is called the bonding driver. It works with any NIC. Look in drivers/net/bond*. Each NIC vendor also has a proprietary version that works with only their NIC. I gave URLs for Intel's product, iANS. Broadcom and 3com also have this feature. I believe there is a standard for this: 802.1q.
John Cronin
It would be nice if it could work across multiple switches, so if a single switch failed, you would not lose connectivity (I think the adaptive failover can do this, but that does not improve bandwidth).

Jake Garver garver@valkyrie.net 08 Aug 2001

No it wouldn't be nice, because it would put a tremendous burden on the link connecting the switches. If you are lucky, this link is 1Gb/sec, much slower than backplanes, which are 10Gb/sec and up. In general, you don't want to "load balance" your switches. Keep as much as you can on the same backplane.
So, are there any Cisco Fast EtherChannel experts out there? Can FEC run across multiple switches, or at least across multiple Catalyst blades? I guess I can go look it up, but if somebody already knows, I don't mind saving myself the trouble.
Fast EtherChannel cannot run across multiple switches. A colleague spent weeks of our time proving that. In short, each switch will see a distinct link, for a total of two, but your server will think it has one big one. The switches will not talk to each other to bond the two links and you don't want them to for the reason I stated above. Over multiple blades, that depends on your switch. Do a "show port capabilities" to find out; it will list the ports that can be grouped into an FEC group.

Michael E Brown michael_e_brown@dell.com

If you want HA, have one machine (machine A) with bonded channels connected to switch A, and have another machine (machine B) with bonded channels connected to switch B.

If you want to go super-paranoid, and have money to burn on links that won't be used during normal operations: have one machine (machine A) with bonded channels connected to switch A, and have backup bonded channels to switch B. Have software that detects failure of all bonded channels to switch A and fails over your IP to switch B (still on machine A). Have another machine (B), with two sets of bonded channels connected to switch C and switch D. Lather, rinse, repeat. On Solaris, IP failover to a backup link is called IP Multipathing, IIRC. New feature of Solaris 8. Various HA software packages for Linux, notably Steeleye Lifekeeper and possibly LinuxFailsafe, support this as well.

John Cronin

For the scenario described above (two systems), in many cases machine A is active and machine B is a passive failover, in which case you have already burned some money on an entire system (with bonded channels, no less) that won't be used during normal operations.

Considering I can get four (two for each system) SMC EtherPower dual port cards for about $250 including shipping, or four Zynx Netblaster quad cards for about $820 if I shop around carefully (or $1000 for Intel Dual Port Server adapters or $1600 for Adaptec/Cogent ANA-6944 quad cards, if a name brand is important), the cost seems less significant when viewed in this light (not to mention the cost of two Cisco switches that can do FEC too).

Back to channel bonding (John Cronin)

I presume it's not doable.

I think "not doable" is an incorrect statement - "not done" would be more precise. For the most part, beowulf is about performance, not HA. I know that Intel NICs can use their own channel aggregation or Cisco Fast-EtherChannel to aggregate bandwidth AND provide redundancy. Unfortunately, these features are only available on the closed-source Microsoft and Novell platforms.

http://www.intel.com/network/connectivity/solutions/server_bottlenecks/config_1.htm

Having 2 NICs on a machine with one being spare is relatively new. No-one has implemented a protocol for redundancy AFAIK.

I assume that you mean both of these statements to apply to Linux and LVS only. Sun has had trunking for years, but IP multipathing is the way to go now as it is easier to set up. You do get some bandwidth improvements for OUTBOUND connections only, on a per connection basis, but the main feature is redundancy.

Look in http://docs.sun.com/ for IP, multipathing, trunking.

Sun also has had Network Adapter Fail-Over groups (NAFO groups) in Sun Cluster 2.X for years, and in Sun Cluster 3.0. Veritas Cluster Server has an IPmultiNIC resource that provides similar functionality. Both of these allow for a failed NIC to be more or less seamlessly replaced by another NIC. I would be surprised if IBM HACMP has not had a similar feature for quite some time. In most cases these solutions do not provide improved bandwidth.

The next question then is how often does a box fail in such a way that only 1 NIC fails and everything else keeps working? I would expect this to be an unusual failure mode and not worth protecting against. You might be better off channel bonding your 2 NICs and using the higher throughput (unless you're compute bound).

I would agree, with one exception. If you have the resources to implement redundant network paths farther out into your infrastructure, then having redundant NICs is much more likely to lead to improved availability. For example if you have two NICs, which are plugged into two different switches, which are in turn plugged into two different routers, then you start to get some real benefit. It is more complicated to set up (HA isn't easy most of the time), but with the dropping prices of switches and routers, and the increased need for HA in many environments, this is not as uncommon as it might sound, at least not in the ISP and hosting arena.

I am not trying to slam LVS and Linux HA products - to the contrary; I am trying to inspire some talented soul to write a multipathing NIC device driver we can all benefit from. ;) I make my living doing work on Sun boxes, but I use Linux on my Dell Inspiron 8000 laptop (my primary workstation, actually - it's a very capable system). I would recommend Linux solutions in many situations, but in most cases my employers won't bite, as they prefer vendor supported solutions in virtually every instance, while complaining about the official vendor support.

For channel bonding, both NICs on the host have the same IP and MAC address. You need to split the cabling for the two sets of NICs so you don't have address collisions - you'll need two switches.
John Cronin

You either need multiple switches, or switches that understand and are willing participants in the channel aggregation method being used. Cisco makes switches that do Fast EtherChannel, and Intel makes adapters that understand this protocol (but again, not currently using Linux). Intel adapters also have their own channel aggregation scheme, and I think the Intel switches could also facilitate this scheme, but Intel is getting out of the switch business. Unfortunately, none of the advanced Intel NIC features are available using Linux (it would be nice to have the hardware IPsec support on their newest adapters, for example).

Michael E Brown michael_e_brown@dell.com

Depends on which kind of bonding you do. Fast Etherchannel depends on all of the NICs being connected to the same switch. You have to configure the switch for trunking. Most of the standardized trunking methods I have seen require you to configure the switch and have all your NICs connected to the same switch.

You either need multiple switches, or switches that understand and are willing participants in the channel aggregation method being used. Cisco makes switches that do Fast EtherChannel, and Intel makes adapters that understand this protocol (but again, not currently using Linux).

Michael E Brown michael_e_brown@dell.com

Not true. You can download the iANS software from Intel. Not open source, but that is different from "not available".

look in http://isearch.intel.com for ians+linux.

Also, if you want channel bonding without intel proprietary drivers, see

/usr/src/linux/drivers/net/bonding.c:
/*
 * originally based on the dummy device.
 *
 * Copyright 1999, Thomas Davis, tadavis@lbl.gov.
 * Licensed under the GPL. Based on dummy.c, and eql.c devices.
 *
 * bond.c: a bonding/etherchannel/sun trunking net driver
 *
 * This is useful to talk to a Cisco 5500, running Etherchannel, aka:
 *      Linux Channel Bonding
 *      Sun Trunking (Solaris)
 *
 * How it works:
 *    ifconfig bond0 ipaddress netmask up
 *      will setup a network device, with an ip address.  No mac address
 *      will be assigned at this time.  The hw mac address will come from
 *      the first slave bonded to the channel.  All slaves will then use
 *      this hw mac address.
 *
 *    ifconfig bond0 down
 *         will release all slaves, marking them as down.
 *
 *    ifenslave bond0 eth0
 *      will attache eth0 to bond0 as a slave.  eth0 hw mac address will either
 *      a: be used as initial mac address
 *      b: if a hw mac address already is there, eth0's hw mac address
 *         will then  be set from bond0.
 *
 * v0.1 - first working version.
 * v0.2 - changed stats to be calculated by summing slaves stats.
 *
 */
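
Putting the comment above into practice, bringing up a bonded pair looks something like this (a sketch following the usage described in the driver comment; module parameters such as mode and miimon differ between bonding driver versions, so check the documentation shipped with your kernel):

realserver:# modprobe bonding
realserver:# ifconfig bond0 192.168.1.8 netmask 255.255.255.0 up
realserver:# ifenslave bond0 eth0       #bond0 takes its MAC address from eth0
realserver:# ifenslave bond0 eth1       #eth1 is given the same MAC address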

Michael McConnell

This definitely does it!

It creates this excellent kernel module; it contains ALL. I just managed to get this running on a Tyan 2515 motherboard that has two onboard Intel NICs.

I've just tested failover mode, works *PERFECT* - not even a single packet dropped! I'm gonna try out adaptive load balancing next, and I'll let you know how I make out.

ftp://download.intel.com/df-support/2895/eng/ians-1.3.34.tar.gz

Michael E Brown michael_e_brown@dell.com

Broadcom also has proprietary channel bonding drivers for linux. The problem is getting access to this driver. I could not find any driver downloads from their website. It is possible that only OEMs have this driver. Dell factory installs this driver for RedHat 7.0 (and will be on 7.1, 7.2). You might want to e-mail Broadcom and ask.

Also

Broadcom also has an SSL offload card which is coming out and it has open source drivers for linux.

http://www.broadcom.com/products/5820.html

You need the openssl library and the kernel driver.

The next release of Red Hat linux will have this support integrated in. The Broadcom folks are working closely with the OpenSSL team to get their userspace integrated directly into 0.9.7. Red Hat has backported this functionality into their 0.9.6 release.

If you look at Red Hat's latest public beta, all the support is there and is working.

Since there aren't docs yet, the "bcm5820" rpm is the one you want to install to enable everything. Install this RPM, and it contains an init script that enables and disables the OpenSSL "engine" support as appropriate. Engine is the new OpenSSL feature that enables hardware offload.

20.5 Service/realserver failout

To activate realserver failover, you can install mon on the director. Several people have indicated that they have written/are using other schemes. RedHat's piranha has monitoring code, and handles director failover and is documented there.

ldirectord handles realserver failover and is part of the Linux High Availability project. The author of ldirectord is Jacob Rief jacob.rief@tis.at with most of the later add-ons and code cleanup by Horms. ldirectord needs Net::SSLeay only if you are monitoring https (Emmanuel Pare emman@voxtel.com, Ian S. McLeod ian@varesearch.com)
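
To check whether Net::SSLeay is installed (only needed if ldirectord will be doing https checks), a one-liner like this will do - if the module is missing, perl complains that it can't locate Net/SSLeay.pm:

$ perl -MNet::SSLeay -e 'print "$Net::SSLeay::VERSION\n"'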

To get ldirectord -

Jacob Rief jacob.rief@tis.at

The newest version is available from 
cvs.linux-ha.org:/home/cvs/
user guest, 
passwd guest
module-name is: ha-linux
file: ha-linux/heartbeat/resource.d/ldirectord
documentation: ha-linux/doc/ldirectord

ldirectord is also available from http://reserve.tiscover.com/linux-ha/.

Andreas Koenig andreas.koenig@anima.de 7 Jun 2001

cvs access is described in http://lists.community.tummy.com/pipermail/linux-ha-dev/1999-October/000212.html

Here's a possible alternative to mon -

Doug Bagley doug@deja.com 17 Feb 2000

Looking at mon and ldirectord, I wonder what kind of work is planned for future service level monitoring?

mon is okay for general purposes, but it forks/execs each monitor process, if you have 100 real services and want to check every 10 seconds, you would fork 10 monitor processes per second. This is not entirely untenable, but why not make an effort to make the monitoring process as lightweight as possible (since it is running on the director, which is such an important host)?

ldirectord uses the perl LWP library, which is better than forking, but it is still slow. It also issues requests serially (because LWP doesn't really make parallel requests easy).

I wrote a very simple http monitor last night in perl that uses non-blocking I/O, and processes all requests in parallel using select(). It also doesn't require any CPAN libraries, so installation should be trivial. Once it is prototyped in perl, conversion to C should be straightforward. In fact, it is pretty similar to the Apache benchmark program (ab).

In order for the monitor (like ldirectord) to do management of the ipvs kernel information, it would be easier if the /proc interface to ipvs gave a more machine readable format.

From: Michael Sparks zathras@epsilon3.mcc.ac.uk

Agreed :-)

It strikes me that rather than having:
type serviceIP:port mechanism
  -> realip:port tunnel weight active inactive
  -> realip:port tunnel weight active inactive
  -> realip:port tunnel weight active inactive
  -> realip:port tunnel weight active inactive

If the table was more like:

type serviceIP:port mechanism realip:port tunnel weight active inactive

Then this would make shell/awk/perl/etc scripts that do things with this table easier to cobble together.

That seems like a far reaching precedent to me. On the other hand, if the ipvsadm command wished to have a option to represent that information in XML, I can see how that could be useful.

This reminds me I should really finish tweaking the prog I wrote to allow pretty printing of the ipvsadm table, and put it somewhere else for others to play with if they like - it allows you to specify a template file for formatting the output of ipvsadm, making displaying the stuff as XML, HTML, plain text, etc simpler/quicker. (It's got a few hardcoded settings at the mo which I want to ditch first :-)
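
Until something along those lines exists, a few lines of awk will flatten the current output into the one-line-per-realserver form described above (a sketch written against the ipvsadm output shown later in this HOWTO; column layouts differ between ipvs versions):

#flatten ipvsadm output: one "virtual service + realserver" line per realserver
ipvsadm | awk '
    /^(TCP|UDP)/         { svc = $1 " " $2 " " $3 }               #remember the current virtual service
    /^ *->/ && svc != "" { sub(/^ *-> */, ""); print svc, $0 }    #prepend it to each realserver line
'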

20.6 Mon for server/service failout

Here's the prototype LVS



                        ________
                       |        |
                       | client |
                       |________|
                           |
                           |
                        (router)
                           |
                           |
                           |       __________
                           |  DIP |          |
                           |------| director |
                           |  VIP |__________|
                           |
                           |
                           |
         ------------------------------------
         |                 |                |
         |                 |                |
     RIP1, VIP         RIP2, VIP        RIP3, VIP
   ______________    ______________    ______________
  |              |  |              |  |              |
  | realserver1  |  | realserver2  |  | realserver3  |
  |______________|  |______________|  |______________|

Mon has two parts: monitors (in mon.d), which test whether a service on a remote machine is up, and alerts (in alert.d), which are run when a monitored service changes state.

20.7 BIG CAVEAT

*Trap for the unwary*

Mon runs on the director, but...

Remember that you cannot connect to any of the LVS controlled services from within the LVS (including from the director) (see gotchas). You can only connect to the LVS'ed services from the outside (eg from the client). If you are on the director, the packets will not return to you and the connection will hang. If you are connecting from the outside (ie from a client) you cannot tell which server you have connected to. This means that mon (or any agent), running on the director (which is where it needs to be to execute ipvsadm commands), cannot tell whether an LVS controlled service is up or down.

With LVS-NAT an agent on the director can access services on the RIP of the realservers (on the director you can connect to the http on the RIP of each realserver). Normal (i.e. non LVS'ed) IP communication is unaffected on the private director/realserver network of LVS-NAT. If ports are not re-mapped then a monitor running on the director can watch the httpd on server-1 (at 10.1.1.2:80). If the ports are re-mapped (eg the httpd server is listening on 8080), then you will have to either modify the http.monitor (making an http_8080.monitor) or activate a duplicate http service on port 80 of the server.
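
For LVS-NAT then, a mon.cf fragment that watches the httpd on a realserver's RIP could look like this (a sketch in the style of the mon.cf example later in this HOWTO; the RIP and the mail alerts are just placeholders - in a real setup you would use virtualserver.alert as shown later):

hostgroup NAT1 10.1.1.2

watch NAT1
service http
        interval 15s
        monitor http.monitor
        period wd {Sun-Sat}
                alert mail.alert root
                upalert mail.alert root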

For LVS-DR and LVS-Tun the service on the realserver is listening on the VIP and you cannot connect to this from the director. The solution for monitoring services under the control of the LVS for LVS-DR and LVS-Tun is to monitor proxy services whose accessibility should closely track that of the LVS'ed service. Thus to monitor an LVS http service on a particular server, the same webpage should also be made available on another IP on the same machine (or on 0.0.0.0), one not controlled by the LVS.

Example:

LVS-Tun, LVS-DR
lvs IP (VIP): eth0 192.168.1.110
director:     eth0 192.168.1.1/24 (normal login IP)
              eth1 192.168.1.110/32 (VIP)
realserver:  eth0 192.168.1.2/24 (normal login IP)
              tunl0 (or lo:0) 192.168.1.110/32 (VIP)

On the realserver, the LVS service will be on the tunl (or lo:0) interface of 192.168.1.110:80 and not on 192.168.1.2:80. The IP 192.168.1.110 on the realserver 192.168.1.2 is a non-arp'ing device and cannot be accessed by mon. Mon running on the director at 192.168.1.1 can only detect services on 192.168.1.2 (this is the reason that the director cannot be a client as well). The best that can be done is to start a duplicate service on 192.168.1.2:80 and hope that its functionality goes up and down with the service on 192.168.1.110:80 (a reasonable hope).
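
For example, with apache on the realserver you can serve the same content on both the VIP and the RIP (a sketch; where the directives go depends on how your httpd.conf is laid out), and point mon at the RIP copy:

#httpd.conf fragment on the realserver
Listen 192.168.1.110:80        #the LVS'ed service (VIP on lo:0/tunl0), reached by clients
Listen 192.168.1.2:80          #duplicate on the RIP, monitored by mon from the director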

LVS-NAT
lvs IP (VIP): eth0 192.168.1.110
director:     eth0 192.168.1.1/24 (outside IP)
              eth0:1 192.168.1.110/32 (VIP)
              eth1 10.1.1.1/24 (DIP, default gw for realservers)
realserver:  eth0 10.1.1.2/24

Some services listen on 0.0.0.0:port, ie they listen on all IPs on the machine, and you will not have to start a duplicate service.
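
You can check what a daemon is bound to with netstat; a daemon listening on all IPs shows up as 0.0.0.0 (output trimmed):

realserver:# netstat -an | grep ':80 '
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN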

20.8 About Mon

Mon doesn't know anything about the LVS, it just detects the up/down state of services on remote machines and will execute the commands you tell it when the service changes state. We give Mon a script which runs ipvsadm commands to remove services from the ipvsadm table when a service goes down and another set of ipvsadm commands when the service comes back up.

Good things about Mon:

Bad things about Mon:

mon-0.37l keeps executing alerts every polling period following an up-down transition. Since you want your service polled reasonably often (eg every 15secs), this means you'll be getting a pager notice/email every 15secs once a service goes down. Tony Bassette kult@april.org let me know that mon-0.38 has a numalert command limiting the number of notices you'll get.

20.9 Mon Install

Mon is installed on the director.

Most of mon is a set of perl scripts. There are only a few files to be compiled - it is mostly ready to go (rpc.monitor needs to be compiled, but you don't need it for LVS).

You do the install by hand.

$ cd /usr/lib
$ tar -zxvof /your_dirpath/mon-x.xx.tar.gz

this will create the directory /usr/lib/mon-x.xx/ with mon and its files already installed.

LVS comes with virtualserver.alert (goes in alert.d) and ssh.monitor (goes in mon.d).

Make the directory "mon-x.xx" accessable as "mon" by linking it to "mon" or by renaming it

$ln -s mon-x.xx mon
or
$mv mon-x.xx mon

Copy the man files (mon.1 etc) into /usr/local/man/man1

Check that you have the perl packages required for mon to run

$perl -w mon

do the same for all the perl alerts and monitors that you'll be using (telnet.monitor, dns.monitor, http_t.monitor, ssh.monitor).
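
A quick way to check them all at once is a compile-only run - perl -c loads the modules each script uses without actually executing it, so a missing package (e.g. Period) shows up immediately, otherwise you get "syntax OK". A sketch, assuming the standard mon directory layout:

$ cd /usr/lib/mon/mon.d
$ for m in telnet.monitor dns.monitor http_t.monitor ssh.monitor; do perl -c $m; done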

DNS in /etc/services is known as "domain" and "nameserver" but not "dns". To allow the use of the string "dns" in the lvs_xxx.conf files and to enable configure_lvs.pl to autoinclude the dns.monitor, add the string "dns" to the port 53 services in /etc/services with an entry like

    domain              53/tcp          nameserver dns  # name-domain server
    domain              53/udp          nameserver dns

Mon expects executables to be in /bin, /usr/bin or /usr/sbin. The location of perl in the alerts is #!/usr/bin/perl (and not /usr/local/bin/perl) - make sure this is compatible with your setup. (Make sure you don't have one version of perl in /usr/bin and another in /usr/local/bin).

The configure script will generate the required mon.cf file for you (and, if you like, copy it to the canonical location of /etc/mon).

Add an auth.cf file to /etc/mon

I use

#auth.cf ----------------------------------
# authentication file
#
# entries look like this:
# command: {user|all}[,user...]
#
# THE DEFAULT IS TO DENY ACCESS TO ALL IF THIS FILE
# DOES NOT EXIST, OR IF A COMMAND IS NOT DEFINED HERE
#

list:           all
reset:          root
loadstate:      root
savestate:      root
term:           root
stop:           root
start:          root
set:            root
get:            root
dump:           root
disable:        root
enable:         root

#end auth.cf ----------------------------

20.10 Mon Configure

This involves editing /etc/mon/mon.cf, which contains information about the hostgroups (the realservers to watch), the services and monitors to run against them, how often to run them, and the alerts to execute when a service changes state.

An example mon.cf generated by configure_lvs.pl (mon_lvsdr.cf) is shown in the next section.

20.11 Testing mon without LVS

The instructions here show how to get mon working in two steps: first show that mon works independently of LVS, then bring in LVS.

The example here assumes a working LVS-DR with one realserver and the following IPs. LVS-DR is chosen for the example here as you can set up LVS-DR with all machines on the same network. This allows you to test the state of all machines from the client (ie using one kbd/monitor). (Presumably you could do it from the director too, but I didn't try it.)

lvs IP (VIP): eth0 192.168.1.110
director:     eth0 192.168.1.1/24 (admin IP)
              eth0:1 192.168.1.110/32 (VIP)
realserver:  eth0 192.168.1.8/24

On the director, test fping.monitor (in /usr/lib/mon/mon.d) with

$ ./fping.monitor 192.168.1.8

You should get the command prompt back quickly with no other output. As a control, test a machine that you know is not on the net

$ ./fping.monitor 192.168.1.250
192.168.1.250

fping.monitor will wait for a timeout (about 5secs) and then output the IP of the unpingable machine on exit.

Check test.alert (in /usr/lib/mon/alert.d) - it writes a file in /tmp

$ ./test.alert foo

you will get the date and "foo" in /tmp/test.alert.log

As part of generating the rc.lvs_dr script, you will also have produced the file mon_lvsdr.cf. To test mon, place this in /etc/mon/mon.cf

#------------------------------------------------------
#mon.cf 
#
#mon config info, you probably don't need to change this very much
#

alertdir   = /usr/lib/mon/alert.d 
mondir     = /usr/lib/mon/mon.d 
#maxprocs    = 20
histlength = 100
#delay before starting
#randstart = 60s

#------
hostgroup LVS1 192.168.1.8 

watch LVS1 
#the string/text following service (to EOL) is put in the header of mail messages
#service "http on LVS1 192.168.1.8" 
service fping 
        interval 15s 
        #monitor http.monitor 
        #monitor telnet.monitor 
        monitor fping.monitor 
        allow_empty_group 
        period wd {Sun-Sat} 
        #alertevery 1h 
                #alert mail.alert root 
                #upalert mail.alert root 
                alert test.alert  
                upalert test.alert  
                #-V is virtual service, -R is remote service, -P is protocol, -A is add/delete (t|u)
                #alert virtualserver.alert -A -d -P -t -V 192.168.1.9:21 -R 192.168.1.8 
                #upalert virtualserver.alert -A -a -P -t -V 192.168.1.9:21 -R 192.168.1.8 -T -m -w 1

#the line above must be blank

#mon.cf---------------------------

Now we will test mon on the realserver 192.168.1.8 independently of LVS. Edit /etc/mon/mon.cf and make sure that all the monitors/alerts except for fping.monitor and test.alert are commented out (there is an alert/upalert pair for each alert, leave both uncommented for test.alert).

Start mon with rc.mon (or S99mon)

Here is my rc.mon (copied from the mon package)

# rc.mon -------------------------------
# You probably want to set the path to include
# nothing but local filesystems.
#

echo -n "rc.mon "

PATH=/bin:/usr/bin:/sbin:/usr/sbin
export PATH

M=/usr/lib/mon
PID=/var/run/mon.pid

if [ -x $M/mon ]
        then
        $M/mon -d -c /etc/mon/mon.cf -a $M/alert.d -s $M/mon.d -f 2>/dev/null
        #$M/mon -c /etc/mon/mon.cf -a $M/alert.d -s $M/mon.d -f
fi
#-end-rc.mon----------------------------

After starting mon, check that mon is in the ps table (ps -auxw | grep perl). When mon comes up it will read mon.cf and then check 192.168.1.8 with the fping.monitor. On finding that 192.168.1.8 is pingable, mon will run test.alert and will enter a string like

Sun Jun 13 15:08:30 GMT 1999 -s fping -g LVS3 -h 192.168.1.8 -t 929286507 -u -l 0

into /tmp/test.alert.log. This is the date, the service (fping), the hostgroup (LVS3), the host monitored (192.168.1.8), unix time in secs, up (-u) and some other stuff I didn't need to figure out to get everything to work.

Check for the "-u" in this line, indicating that 192.168.1.8 is up.

If you don't see this file within 15-30secs of starting mon, then look in /var/adm/messages and syslog for hints as to what failed (both contain extensive logging of what's happening with mon). (Note: syslog appears to be buffered, it may take a few more secs for output to appear here).

If necessary, kill and restart mon

$ kill -HUP `cat /var/run/mon.pid`

Then pull the network cable on machine 192.168.1.8. In 15secs or so you should hear the whirring of disks and the following entry will appear in /tmp/test.alert.log

Sun Jun 13 15:11:47 GMT 1999 -s fping -g LVS3 -h 192.168.1.8 -t 929286703 -l 0

Note there is no "-u" near the end of the entry indicating that the node is down.

Watch for a few more entries to appear in the logfile, then connect the network cable again. A line with -u should appear in the log and no further entries should appear in the log.

If you've got this far, mon is working.

Kill mon and make sure root can send himself mail on the director. Make sure sendmail can be found in /usr/lib/sendmail (put in a link if necessary).

Next activate mail.alert and telnet.monitor in /etc/mon/mon.cf and comment out test.alert. (Do not restart mon yet)

Test mail.alert by doing

$ ./mail.alert root
hello
^D

root is the address for the mail, hello is some arbitrary STDIN and control-D exits the mail.alert. Root should get some mail with the string "ALERT" in the subject (indicating that a machine is down).

Repeat, this time you are sending mail saying the machine is up (the "-u")

$ ./mail.alert -u root
hello
^D

Check that root gets mail with the string "UPALERT" in the subject (indicating that a machine has come up).

Check the telnet.monitor on a machine on the net. You will need tcp_scan in a place that perl sees it. I moved it to /usr/bin. Putting it in /usr/local/bin (in my path) did not work.

$ ./telnet.monitor 192.168.1.8

the program should exit with no output. Test again on a machine not on the net

$ ./telnet.monitor 192.168.1.250
192.168.1.250

the program should exit outputting the IP of the machine not on the net.

Start up mon again (eg with rc.mon or S99mon), watch for one round of mail sending notification that telnet is up (an "UPALERT") (note: for mon-0.38.21 there is no initial UPALERT). There should be no further mail while the machine remains telnet-able. Then pull the network cable and watch for the first ALERT mail. Mail should continue arriving every mon time interval (set to 15secs in mon_lvs_test.cf). Then plug the network cable back in and watch for one UPALERT mail.

If you don't get mail, check that you re-edited mon.cf properly and that you did kill and restart mon (or you will still be getting test.alerts in /tmp). Sometimes it takes a few seconds for mail to arrive. If this happens you'll get an avalanche when it does start.

If you've got here you are really in good shape.

Kill mon (kill `cat /var/run/mon.pid`)

20.12 Can virtualserver.alert send commands to LVS?

(virtualserver.alert is a modified version of Wensong's original file, for use with 2.2 kernels. I haven't tested it back with a 2.0 kernel. If it doesn't work and the original file does, let me know)

run virtualserver.alert (in /usr/lib/mon/alert.d) from the command line and check that it detects your kernel correctly.

$ ./virtualserver.alert

you will get complaints about bad ports (which you can ignore, since you didn't give the correct arguments). If you have kernel 2.0.x or 2.2.x you will get no other output. If you get unknown kernel errors, send me the output of `uname -r`. Debugging print statements can be uncommented if you need to look for clues here.

Make sure you have a working LVS-DR LVS serving telnet on a realserver. If you don't have the telnet service on realserver 192.168.1.8 then run

$ipvsadm -a -t 192.168.1.110:23 -r 192.168.1.8

then run ipvsadm in one window.

$ipvsadm

and leave the output on the screen. In another window run

$ ./virtualserver.alert -V 192.168.1.110:23 -R 192.168.1.8

this will send the down command to ipvsadm. The entry for telnet on realserver 192.168.1.8 will be removed (run ipvsadm again).

Then run

$ ./virtualserver.alert -u -V 192.168.1.110:23 -R 192.168.1.8

and the telnet service to 192.168.1.8 will be restored in the ipvsadm table.

20.13 Running mon with LVS

Connect all network connections for the LVS and install a LVS-DR LVS with INITIAL_STATE="off" to a single telnet realserver. Start with a file like lvs_dr.conf.single_telnet_off, adapting the IPs for your situation, and produce the mon_xxx.cf and rc.lvs_xxx file. Run rc.lvs_xxx on the director and then the realserver.

The output of ipvsadm (on the director) should be

grumpy:/etc/mon# ipvsadm
IP Virtual Server (Version 0.9)
Protocol Local Address:Port Scheduler
      -> Remote Address:Port   Forward Weight ActiveConn FinConn
TCP 192.168.1.110:23 rr

showing that the scheduling (rr) is enabled, but with no entries in the ipvsadm routing table. You should NOT be able to telnet to the VIP (192.168.1.110) from a client.

Start mon (it runs on the director). Since the realserver is already online, mon will detect a functional telnet on it and trigger an upalert for mail.alert and for virtualserver.alert. At the same time as the upalert mail arrives, run ipvsadm again. You should get

grumpy:/etc/mon# ipvsadm
IP Virtual Server (Version 0.9)
Protocol Local Address:Port Scheduler
      -> Remote Address:Port   Forward Weight ActiveConn FinConn
TCP 192.168.1.110:23 rr
      -> 192.168.1.8:23        Route   1      0          0

which shows that mon has run ipvsadm and added direct routing of telnet to realserver 192.168.1.8. You should now be able to telnet to 192.168.1.110 from a client and get the login prompt for machine 192.168.1.8.

Logout of this telnet session, and pull the network cable to the realserver. You will get a mail alert and the entry for 192.168.1.8 will be removed from the ipvsadm output.

Plug the network cable back in and watch for the upalert mail and the restoration of LVS to the realserver (run ipvsadm again).

If you want to, confirm that you can do this for http instead of telnet.

You're done. Congratulations. You can use the mon_xxx.cf files generated by configure.pl from here.

20.14 Why is the LVS monitored for failures/load by an external agent rather than by the kernel?

Patrick Kormann pkormann@datacomm.ch

Wouldn't it be nice to have a switch that would tell ipvsadm 'If one of the realservers is unreachable/connection refused, take it out of the list of real servers for x seconds' or even 'check the availability of that server every x seconds, if it's not available, take it out of the list, if it's available again, put it in the list'.

Lars

That does not belong in the kernel. This is definitely the job of a userlevel monitoring tool.

I admit it would be nice if the LVS patch could check if connections directed to the realserver were refused and would log that to userlevel though, so we could have even more input available for the monitoring process.

It's a pain to need all these external tools and quirks to make lvs a real high-availability system. The problem is that all those external checks are never as effective as a decision by the 'virtual server' itself could be.

That's wrong.

A userlevel tool can check reply times, request specific URLs from the servers to check if they reply with the expected data, gather load data from the real servers etc. This functionality is way beyond kernel level code.

Michael Sparks zathras@epsilon3.mcc.ac.uk

Personally I think monitoring of systems is probably one of the things the lvs system shouldn't really get into in its current form. My rationale for this is that LVS is a fancy packet forwarder, and in that job it excels.

For the LVS code to do more than this, it would require for TCP services the ability to attempt to connect to the *service* the kernel is load balancing - which would be a horrible thing for a kernel module to do. For UDP services it would need to do more than pinging... However, in neither case would you have a convincing method for determining if the *services* on those machines were still running effectively, unless you put a large amount of protocol knowledge into the kernel. As a result, you would still need to have external monitoring systems to find out whether the services really are working or not.

For example, in the pathological case (of many that we've seen :-) of a SCSI subsystem failure resulting in indestructible inodes on a cache box, a cache box can reach total saturation in terms of CPU usage, but still respond correctly to pings and TCP connections. However nothing else (or nothing much) happens due to the effective system lockup. The only way round such a problem is to have a monitoring system that knows about this sort of failure, and can then take the service out.

There's no way this sort of failure could be anticipated by anyone, so putting this sort of monitoring into the kernel would create a false illusion of security - you'd still need an auxiliary monitoring system. E.g. it's not enough just for the kernel to mark the machine out of service - you need some useful way of telling people what's gone wrong (eg setting off people's pagers etc), and again, that's not a kernel thing.

Lars

Michael, I agree with you.

However, it would be good if LVS would log the failures it detects. ie, I _think_ it can notice if a client receives a port unreachable in response to a forwarded request if running masquerading, however it cannot know if it is running DR or tunneling because in that case it doesn't see the reply to the client.


Wensong

Currently, the LVS can handle ICMP packets for virtual services and forward them to the right place. It is easy to set the weight of the destination to zero, or temporarily remove the dest entry directly, if a PORT_UNREACH icmp from the server to the client passes through the LVS box.

If we want the kernel to notify the monitoring software that a realserver is down, in order to let the monitoring software keep a consistent view of the virtual service table, we need to design an efficient way to notify it; more code is required. Anyway, there is a need to develop efficient communication between the LVS kernel and the monitoring software, for example so that the monitoring software can get the connection numbers efficiently - it is time-consuming to parse the big IPVS table to get them. How do we efficiently support ipvsadm -L <protocol, ip, port>? It would be good for applications like Ted's 1024 virtual services. I checked the procfs code: it still requires one write_proc and one read_proc to get a per virtual service print, which is a little expensive. Any ideas?

Currently, the LVS can handle ICMP packets for virtual services and forward them to the right place. It is easy to set the weight of the destination to zero, or temporarily remove the dest entry directly, if a PORT_UNREACH icmp from the server to the client passes through the LVS box.

Julian Anastasov uli@linux.tu-varna.acad.bg

PORT_UNREACH can be returned when the packet is rejected by the real server's firewall. In fact, only UDP returns PORT_UNREACH when the service is not running; TCP returns a RST packet. We must carefully handle this (I don't know how) and not stop the real server for all clients if we see that one client is rejected. And this works only if the LVS box is the default gw for the real servers, i.e. for any mode: MASQ (it's always the def gw), DROUTE and TUNNEL (PORT_UNREACH can be one of the reasons not to select another router for the outgoing traffic for these two modes). But LVS can't detect the outgoing traffic for DROUTE/TUNNEL mode. For TUNNEL it can be impossible if the real servers are not on the LAN.

So, the monitoring software can solve more problems. The TCP stack can return PORT_UNREACH, but if the problem with the service on the real server is more complex (real server died, daemon blocked) we can't expect PORT_UNREACH. It is sent only when the host is working but the daemon is stopped - please restart this daemon. So, don't rely on the real server; in most cases it can't tell you "Please remove me from the VS configuration, I'm down" :) This is a job for the monitoring software: to exclude the destinations and even to delete the service (if we switch to local delivery only, i.e. when we switch from LVS to WEB only mode for example). So, I vote for the monitoring software to handle this :)

Wensong

Yeah, I prefer that monitoring software handles this too, because it is a unified approach for LVS-NAT, LVS-Tun and LVS-DR, and monitoring software can detect more failures and handle more things according to the failures.

What we discussed last time is that the LVS kernel sets the destination entry unavailable in the virtual server table if the LVS detects certain icmp packets (only for LVS-NAT) or a RST packet etc. This approach might detect these kinds of problems just a few seconds earlier than the monitoring software, however we would need more code in the kernel to notify the monitoring software that the kernel has changed the virtual server table, in order to let the monitoring software keep a consistent view of the virtual server table as soon as possible. There is a tradeoff here. Personally, I prefer keeping the system simple (and effective): only one thing (the monitoring software) makes decisions and keeps the consistent view of the VS table.

