Switching packets based on the content of the payload (L7 switching) is something of a holy grail in this area. However, inspecting packet contents is slow (or CPU intensive) compared to L4 switching, and commercial L7 switches are expensive. It is usually better to handle the problem at the L4 layer, something that can often be done by correct design of the applications.
Netfilter has code for inspecting the contents of packets through the u32 option. Presumably this could be coupled with fwmark to set up an LVS.
Preliminary code, KTCPVS, for this has been written by Wensong. Some documentation is in the source code.
Michael Sparks Michael.Sparks@wwwcache.ja.net
> 12 Jul 1999
Some of the emerging redirection switches on the market support something known as Level 7 redirection, which essentially allows the redirector to look at the start of the TCP data stream, by spoofing the initial connection and making load balancing based on what it sees there. (Apologies if I'm doing the equivalent of "teaching your grandma to suck eggs", but at least this way there's less misunderstanding of what I'm getting at/after.)
For example, if we have X proxy-cache servers, we could spoof an HTTP connection, grab the requested URL, and hash it to one of those X servers. If it were possible to look inside individual UDP packets as well, then we would be able to route ICP (Inter-Cache Protocol) packets in the same way. The result is a cluster that looks like a single server to clients.
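To make the idea concrete, here is a minimal user-space sketch in Python (it is not LVS or KTCPVS code, and the squid hostnames are made up): accept the client connection, read the request line, hash the URL to one of the X cache servers, then relay the stream in both directions. Doing all of this copying in user space is exactly the overhead that makes L7 switching expensive compared to L4.

# Minimal sketch of L7 (content-aware) dispatch: terminate the client's TCP
# connection, peek at the HTTP request line, hash the URL to one of X cache
# servers, then relay the traffic.  Backend names are hypothetical.
import hashlib
import socket
import threading

BACKENDS = [("squid1.example.org", 3128),
            ("squid2.example.org", 3128),
            ("squid3.example.org", 3128)]

def pick_backend(url):
    """Hash the requested URL to one of the X cache servers."""
    digest = hashlib.md5(url.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:4], "big") % len(BACKENDS)]

def relay(src, dst):
    """Copy bytes one way until the connection closes."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)

def handle(client):
    # Read up to the end of the request line, e.g. "GET http://host/path HTTP/1.0"
    head = b""
    while b"\r\n" not in head:
        chunk = client.recv(4096)
        if not chunk:
            client.close()
            return
        head += chunk
    url = head.split(b"\r\n", 1)[0].split(b" ")[1].decode(errors="replace")
    backend = socket.create_connection(pick_backend(url))
    backend.sendall(head)        # replay the bytes we have already consumed
    threading.Thread(target=relay, args=(client, backend), daemon=True).start()
    relay(backend, client)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("", 8080))
listener.listen(64)
while True:
    conn, addr = listener.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()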
Wensong: Do you mean that these X proxy-cache servers are not equivalent, and that they are statically partitioned to fetch different objects? For example, if proxy server 1 caches European URLs and proxy server 2 caches Asian URLs, then there is a need to parse the packets to grab the requested URL. Right?
If you want to do this, I think Apache's mod_rewrite and mod_proxy can be used to group these X proxy-cache servers into a single proxy server. Since the overhead of dispatching requests at the application level is high, its scalability may not be very good; the load balancer might become a bottleneck once there are 4 or more proxy servers.
The other way is to copy the data packet to userspace to grab the request (if the request fits in a single UDP packet); the userspace program selects a server based on the request and passes the decision back to the kernel.
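For the UDP case Wensong describes, a whole ICP query fits in one datagram, so a user-space selector only has to skip the fixed 20-byte ICP header plus the requester address field to reach the null-terminated URL (the layout assumed here is the one in RFC 2186, and the squid names are made up). A real implementation would hand the chosen server back to the kernel rather than just printing it.

# Sketch of a user-space selector for ICP queries: pull the URL out of a
# single ICP_OP_QUERY datagram (RFC 2186 layout assumed) and hash it to a
# cache server.
import hashlib
import socket

SQUIDS = ["squid1.example.org", "squid2.example.org", "squid3.example.org"]
ICP_OP_QUERY = 1

def url_from_icp_query(datagram):
    if len(datagram) < 25 or datagram[0] != ICP_OP_QUERY:
        return None
    # 20-byte header (opcode, version, length, request number, options,
    # option data, sender address), then a 4-byte requester address,
    # then the null-terminated URL.
    return datagram[24:].split(b"\x00", 1)[0].decode(errors="replace")

def pick_squid(url):
    digest = hashlib.md5(url.encode()).digest()
    return SQUIDS[int.from_bytes(digest[:4], "big") % len(SQUIDS)]

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", 3130))               # 3130 is the usual ICP port
while True:
    datagram, peer = sock.recvfrom(4096)
    url = url_from_icp_query(datagram)
    if url:
        print(peer, url, "->", pick_squid(url))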
For generic HTTP servers this could also allow the server to farm CGI requests out to individual machines, and the normal requests to the rest (e.g. allowing you to buy a dual/quad processor box solely to handle CGI requests, but cheap/fast servers for the main donkey work).
Wensong: It is statically partitioned. Not very flexible or scalable.
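For what it's worth, the statically partitioned mod_rewrite/mod_proxy grouping mentioned above could be driven by an external RewriteMap program: mod_rewrite's "prg:" maps read one lookup key per line on stdin and answer with one line on stdout. The proxy names, the partitioning rule, and the httpd.conf fragment in the comments below are only illustrative.

#!/usr/bin/env python
# Illustrative external RewriteMap program for a statically partitioned
# cache cluster: pick a parent proxy from the top-level domain of the
# requested host.  Hypothetical httpd.conf glue:
#   RewriteMap pickproxy prg:/usr/local/bin/pickproxy.py
#   (a RewriteRule with the [P] flag then proxies the request to the
#    host:port returned by ${pickproxy:...})
import sys

EUROPEAN = (".uk", ".de", ".fr", ".nl")
ASIAN = (".jp", ".cn", ".kr", ".sg")

def pick(host):
    if host.endswith(EUROPEAN):
        return "proxy-eu.example.org:3128"
    if host.endswith(ASIAN):
        return "proxy-asia.example.org:3128"
    return "proxy-other.example.org:3128"

# mod_rewrite writes one key per line and expects exactly one reply line
# ("NULL" meaning no match) for each.
for line in sys.stdin:
    host = line.strip()
    sys.stdout.write((pick(host) if host else "NULL") + "\n")
    sys.stdout.flush()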
We've spoken in the past to a number of commercial vendors who have been developing these things, but they've always failed to come up with the goods, mainly citing incompatibility between our needs and their designs :-/
Any ideas how complicated this would be to add to your system?
Wensong: If these proxy-cache servers are identical (any of them can handle any kind of URL request), I have a good solution that uses LVS to build a high-performance proxy server.
                                 |-<--fetch objects directly
                                 |
                                 |-----Squid 1---->reply to users directly
    request ->LinuxDirector -----|
              (VS-TUN/VS-DR)     |-----Squid 2
                                 |       ...
                                 |-----Squid i

Since VS-TUN/VS-DR handles only the client-to-server half of the connection, the Squid servers can fetch objects directly from the Internet and return them directly to the users. The overhead of forwarding requests is low, so scalability is very good. ICP is used to query among these Squid servers; in order to avoid a multicast storm, we can add one more NIC in each Squid server for ICP queries and call it the multicast channel.
Michael again:
The reason I asked if the code could be modified to do this:
> look at the start of the TCP data stream, by spoofing the initial
> connection and making load balancing based on what it sees there.

is to enable us to do this:
The reason for 1, 2 & 3 is to have deterministic location of cached data, to eliminate redundancy in the cache system, and to reduce intra-cache-cluster communication. The reason for 4 is that the clients of our caches are caches themselves - they're the UK National Academic root-level caches, servicing about 3/4 of a billion requests per month during peak periods.
Also, 2) can be used in the future to implement a cache-digest server, serving a single cache digest for the entire cluster and eliminating the delays for clients caused by ICP. (During peak periods this delay is large.)
The boxes from ArrowPoint can do 1-3 but not 4, for example, and they are proprietary hardware...
Essentially the ICP + cache digest thing for the cluster is the biggest nut to crack - Squid 2.x in CARP mode can do something similar to 1, 2 & 3, at the expense of having to handle a large number of TCP streams, but it wouldn't provide a useful ICP service (it would always return ICP_MISS) and can't provide the cache digest service (or would at least return empty digests).
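For reference, the CARP selection Michael mentions is essentially highest-random-weight (rendezvous) hashing: combine a hash of the URL with a hash of each member cache's name and send the request to the member with the highest combined score, so membership changes only remap the objects owned by the affected cache. A simplified sketch (md5 for brevity; not the exact hash functions from the CARP draft, and the member names are made up):

# Simplified CARP-style (highest-random-weight / rendezvous) selection:
# each URL goes to the member whose combined hash score is largest, so
# adding or removing a cache only remaps that cache's share of objects.
import hashlib

MEMBERS = ["squid1.example.org", "squid2.example.org", "squid3.example.org"]

def score(member, url):
    return int.from_bytes(hashlib.md5((member + url).encode()).digest()[:8], "big")

def carp_select(url, members=MEMBERS):
    return max(members, key=lambda m: score(m, url))

if __name__ == "__main__":
    for url in ("http://www.ja.net/", "http://www.linuxvirtualserver.org/"):
        print(url, "->", carp_select(url))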
> I have a good solution to use LVS to build a high-performance proxy
> server.
Fine (to an extent) for the case where requests are coming from browsers, but bad where the clients are caches.
It's even worse with cache digests...
We've had to say this sort of thing to people like Alteon, etc too... (Lots more details in an overview at http://epsilon3.mcc.ac.uk/zathras/WIP/Cache_Cooperation/).
later...
A slightly better discussion of how these techniques can be used is at: http://www.ircache.net/Cache/Workshop99/Papers/johnson-0.ps.gz
Wensong: I have read the paper "Increasing the performance of transparent caching with content-aware cache bypass". If no inter-cache cooperation is needed, it can easily be done on Linux; you don't need to buy an expensive ArrowPoint box. As for availability, Linux boxes are reliable now. :)
I can modify the transparent proxy code on Linux to do such content-aware bypass and content-aware switching. Content-aware bypass will let non-cacheable objects be fetched directly from the origin server, and content-aware switching can keep the contents of your cache cluster from overlapping.
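A sketch of the decision logic being described (the cacheability heuristics and cache names are just examples): requests for apparently non-cacheable objects bypass the caches and go straight to the origin server, while cacheable requests are hashed so that each object lives on exactly one cache and the cluster contents do not overlap.

# Sketch of content-aware bypass / content-aware switching: non-cacheable
# requests go straight to the origin server, cacheable ones are hashed so
# that each object is held by exactly one cache (no overlap).  The
# "non-cacheable" heuristics and cache names are illustrative only.
import hashlib
from urllib.parse import urlparse

CACHES = ["squid1.example.org", "squid2.example.org", "squid3.example.org"]

def looks_uncacheable(url):
    parsed = urlparse(url)
    return ("?" in url or
            "cgi-bin" in parsed.path or
            parsed.path.endswith((".cgi", ".asp")))

def route(url):
    if looks_uncacheable(url):
        return "BYPASS to origin " + urlparse(url).netloc
    digest = hashlib.md5(url.encode()).digest()
    return CACHES[int.from_bytes(digest[:4], "big") % len(CACHES)]

if __name__ == "__main__":
    print(route("http://www.example.org/index.html"))
    print(route("http://www.example.org/cgi-bin/search?q=lvs"))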