Building a redundant Syslog Server
Article created 2006-02-01 by Rainer Gerhards.
For many organizations, syslog data is of vital importance. Some may even be required by law or other regulations to make sure that no log data is lost. The question is now, how can this be accomplished?
Let’s first look at the easy part: once the data has been received by the syslog server and stored to files (or a database), you can use “just the normal tools” to ensure that you have a good backup of it. This is pretty straightforward technology. If you need to archive the data unaltered, you will probably write it to write-once media like CD or DVD and archive that. If you need to keep your archive for a long time, you will probably make sure that the media lasts long enough. If in doubt, I recommend copying the data to new media every now and then (e.g. every five years). Of course, it is always a good idea to keep vital data in at least two different locations and on two different media sets. But again, all of this is pretty easily manageable.
We get onto somewhat slippery ground when it comes to short-term failures. Your backup will not protect you from a hard disk failure. OK, we can use RAID 5 (or any other fault-tolerance level) to guard against that. You can even write an audit trail (this comes for free with data written to a database, but it needs to be configured and consumes resources).
But what about a server failure? By the nature of syslog, any data that is not received is probably lost. Especially with UDP-based (standard) syslog, the sender does not even know the receiver has died. Even with TCP-based syslog, many senders prefer to drop messages rather than stall processing (the only other option left – think about it).
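The fire-and-forget nature of UDP syslog is easy to demonstrate. The following is a hedged sketch (not from the article, and the host/port are placeholders): it builds an RFC 3164-style message and sends it via UDP, and the send call succeeds whether or not anything is listening on the other end.

```python
import socket

# Hedged sketch: build and send an RFC 3164-style syslog message over UDP.
# UDP is fire-and-forget -- sendto() reports success even if no server is
# listening, so the sender never learns that the receiver has died.
def send_syslog_udp(message, host="127.0.0.1", port=514,
                    facility=16, severity=6):
    pri = facility * 8 + severity        # e.g. local0.info -> <134>
    packet = ("<%d>%s" % (pri, message)).encode("utf-8")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet, (host, port))  # no error even if host is down
    finally:
        sock.close()
    return packet
```

Note that there is no acknowledgment of any kind here; detecting a dead receiver would require a protocol layered on top, which standard syslog does not provide.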
There are several ways to guard against such failures. The common denominator is that they all have some pros and cons and none is absolutely trouble-free. So plan ahead.
A very straightforward option is to have two separate syslog servers and make every device send messages to both of them. It is quite unlikely that both servers will go down at the same instant. Of course, you should make sure they are connected via separate network links, different switches, and differently fused power connections. If your organization is large, placing them at physically different locations (different buildings) can also be beneficial, if the network bandwidth allows. This configuration is probably the safest to use. It can even guard you against the notorious UDP packet losses that standard syslog is prone to (and which go unnoticed most of the time). The drawback of this configuration is that you have almost all messages at both locations. Only if a server fails (or a message is discarded by the network) do you have a single copy. So if you combine both event repositories into a central one, you will have lots of duplicates. The art of handling this is to find a good merge process, which correctly (correctly is a key word!) identifies duplicate lines and drops them. Identifying duplicates can be much harder than it initially sounds, but in most cases there is a good solution. Sometimes a bit of sender tweaking is required, but after all, that’s what makes an admin happy…
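The merge step can be sketched as follows. This is a minimal illustration under a strong assumption: that duplicates can be identified by the exact line text (host, timestamp and message combined). Real-world merging often needs a fuzzier key, for example when the two servers stamp different receive times on the same message.

```python
# Minimal merge sketch, assuming duplicates are identical line-for-line.
# A production merge would also re-sort by timestamp and may need a
# fuzzier duplicate key than the raw line text.
def merge_logs(primary_lines, secondary_lines):
    seen = set()
    merged = []
    for line in list(primary_lines) + list(secondary_lines):
        key = line.strip()
        if key and key not in seen:   # drop exact duplicates and blanks
            seen.add(key)
            merged.append(key)
    return merged
```

With this approach, a message that reached both servers appears once in the merged output, while a message that only one server recorded (because the other was down or the packet was lost) is still preserved.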
The next solution is to use some clustering technology. For example, you can use the Windows cluster service to define two machines which act as a single virtual syslog server. The OS (Windows in this sample case) itself keeps track of which machine is up and which one is not. For syslog, an active-passive clustering scheme is probably best, that is, one where one machine is always in hot standby (aka: not used ;)). This machine only takes over processing when the primary one (the one usually active) fails. The OS handles the task of virtualizing the IP address and the storage system. It also controls the handover from one syslog server instance to the other. So there is very little hassle from the application point of view. Senders also send messages only once, resulting in half the network traffic. You also do not have to think about how to consolidate messages into a single repository. Of course, this luxury comes at a price: most importantly, you will not be guarded against dropped UDP packets (because there is only one receiver at any time). Also, every failover logic has a little delay, so there will be a few seconds (up to maybe a minute or two) until the syslog server functionality has been carried over to the hot standby machine. During this period, messages will be lost. Finally, clustering is typically relatively expensive and hard to set up.
The third possible solution is to look at the syslog server application itself. My company offers WinSyslog and MonitorWare Agent, which can be configured to work in a failover-like configuration. There, the syslog server detects failures and transfers control, just like in the clustering scenario. However, the operating system does not handle the failover itself, so the OS does not need to be anything special. This approach offers basically the same pros and cons as the OS clustering approach described above. However, it is somewhat less expensive and probably easier to configure. If the two syslog server machines do not need to be dedicated, it can be considerably less expensive than clustering – because no additional hardware for the backup machine would be required. One drawback, however, is that the senders again need to be configured to send messages to both machines, thus doubling the network traffic compared to “real” clustering. However, syslog traffic bandwidth usage is typically no problem, so that should not be too much of a disadvantage.
Question now: how does it work? It’s quite easy! First of all, all senders are configured to send to both receivers simultaneously. The solution then depends on each receiver’s ability to see whether its peer is still operational. You define one active and one passive peer. The passive peer checks at short intervals whether the other one is alive. If the passive peer detects that the primary has failed, it enables message recording. Once it detects that the primary is up again, it disables message recording. With this approach, both syslog servers receive the messages, but only one actually records them. The message files can then be merged for a nearly complete picture. Why nearly? Well, as with OS clustering, there is a time frame during which the backup has not yet taken over control, so some messages may be lost. Furthermore, when the primary node comes up again, there is another small window where both machines record messages, thus resulting in duplicates (this latter problem is not seen with typical OS clustering). So this is not a perfect world, but pretty close to it – depending on your needs. If you are interested in how this is actually done, you can follow our step-by-step instructions for our product line. Similar methodologies might apply to other products, but for obvious reasons I have not researched that. Have a look yourself after you are inspired by the Adiscon sample.
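The passive peer's check-and-switch logic can be sketched like this. To be clear, this is a hypothetical illustration, not Adiscon's actual implementation: it assumes the primary's liveness can be probed by opening a TCP connection to a known port, and the record-on/record-off hooks stand in for whatever the real product does.

```python
import socket
import time

# Hypothetical liveness probe: try to open a TCP connection to the primary.
def primary_alive(host, port, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Sketch of the standby's loop: record only while the primary appears dead.
def failover_loop(check, record_on, record_off, interval=5, iterations=None):
    recording = False
    n = 0
    while iterations is None or n < iterations:
        if not check() and not recording:
            record_on()      # primary failed: standby starts recording
            recording = True
        elif check() and recording:
            record_off()     # primary is back: stop to limit duplicates
            recording = False
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval)
```

The check interval is the trade-off knob here: a shorter interval shrinks the window of lost messages at failover, at the cost of more probe traffic.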
What is the conclusion? There is no perfect way to handle syslog server failure. Probably the best solution is to run two syslog servers in parallel – the first solution I described. But depending on your needs, one of the others might be a better choice for what you are trying to accomplish. I have given pros and cons for each of them; hopefully this will help you judge what works best for you.