SECNOLOGY: Why Is High Availability (HA) a Big Challenge for SIEMs?

Many people in the SIEM world take it for granted that a SIEM can be protected with high availability or redundancy.

Not only do they have no doubt that this is possible, they also assume it isn’t even an issue, since it is already being done for other applications such as databases, web servers, or messaging systems.

So let’s demand that the SIEM be highly available. Why not?

Of course it can be achieved, but we still have to ask ourselves the right questions: under which conditions, with which constraints, with which technologies, on which solutions, at what price, and with what ROI?

Let’s try to explain the context and the problem in simple terms.

Cyber Security Paradigm:

We have sources from which we collect traces and logs. These sources can be security devices, but also routers, switches, servers, PCs, smartphones, IoT devices, and applications, located on the local network, on remote networks, or in the cloud.

We have to collect all the traces and logs from all these devices in real time and without loss, and store them securely. We have to process these traces in real time, store the results of that processing, and make them available for any user request. Finally, we must be able to deliver the results of these requests at all times.

So to summarize in simple terms:

No trace emitted by a source device may be lost.
As soon as a trace is received, it must always be accessible.
Processing these traces must always be possible.
The results of these operations must always be accessible.
Users must receive the correct response to their queries at any time.

If we analyze the overall process we have just described, we have quite simply required continuity of service:

At the network link level
At the communications level
At the processing level
At the session level
At the data level

That is already a key difference between Fault Tolerance and High Availability.

Because the SIEM is duplicated at the component level (storage, processing, collection), current SIEMs assume that what is required is Fault Tolerance rather than High Availability.

Fault Tolerance Strategy:

The first strategy is to create a Computing Cluster of 2 identical machines. Both units must have the same IP and MAC address. In this case, the primary and secondary machines are connected by a short cable (Fibre Channel or iSCSI). Shared devices in the cluster, such as the disks and the connection media used to access them, are owned and managed by one server at a time.

From a network point of view, there is only one machine, but it is fault tolerant. The secondary machine does nothing but check whether the primary machine is online. If the primary machine goes down, the secondary machine resets, becomes primary, and takes over.

Clusters are typically used for file servers, print servers, database servers, and mail servers, so it is assumed that they will be suitable for SIEMs as well.

The problem is that SIEMs have a specific characteristic: they collect very large data volumes, far outside the norm for a computer system. Few computer systems have to collect, store, and process as much data as a SIEM.

A SIEM deals with sensitive, confidential, and even critical data that cannot go uncollected, let alone be lost.

If we choose to build HA on a Cluster, then during the reset time we lose all the traces that the devices send, traces that were supposed to be collected, stored, and duplicated. An HA failure of even a few minutes is not without consequences for regulatory compliance. Losing traces can be very expensive.

In addition, in this case it is mandatory to attach a shared SAN, such as NetApp, EMC, or HDS with RAID 5 or related technologies, to both the primary and secondary machines. This adds a significant cost to the price of the solution and to its maintenance.

SIEMs such as Splunk, QRadar, LogRhythm, LogLogic, Exabeam, and others rely on this Fault Tolerance strategy.
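The loss window described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the original text: one trace is emitted per time tick, the primary fails at a chosen tick, and the secondary needs a fixed number of ticks to reset and take over. The function name and parameters are hypothetical.

```python
def simulate_active_passive(total_ticks, fail_tick, reset_ticks):
    """Return the ticks whose trace was lost in an active/passive cluster.

    Assumptions (illustrative only): one trace is emitted per tick, the
    primary collects until it fails at `fail_tick`, and the secondary
    starts collecting only after `reset_ticks` ticks of reset time.
    """
    takeover_tick = fail_tick + reset_ticks
    lost = []
    for tick in range(total_ticks):
        primary_up = tick < fail_tick
        secondary_up = tick >= takeover_tick
        if not (primary_up or secondary_up):
            # Nobody is listening during the reset window.
            lost.append(tick)
    return lost


lost_ticks = simulate_active_passive(total_ticks=10, fail_tick=4, reset_ticks=3)
print(lost_ticks)  # [4, 5, 6]
```

Every tick that falls inside the reset window is a trace that no machine received, which is exactly the gap that matters for compliance.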

The second strategy, which SECNOLOGY implements, achieves High Availability with a Grid Computing type of operational architecture. It consists of doubling the Collectors and ensuring that both are active simultaneously. The two Collectors are mirrored twins: they receive exactly the same traces, execute the same processes in parallel, and store and transmit the same data.

This doesn’t require specific hardware like NetApp, EMC, or HDS with RAID 5 or other such technology, but that isn’t the main advantage. The main advantage is that both Collectors receive the traces through different links, on two different sites, which guarantees trace reception, data collection, and data availability.

The other benefit is that any component, in this case the processors, has access to the collected data at all times.
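The twin-Collector idea can be sketched as follows. This is a toy model, not SECNOLOGY's implementation: the `Collector` class, the `send_to_all` helper, and the site names are hypothetical, and real network links are replaced by direct method calls.

```python
class Collector:
    """Toy collector that stores every trace it receives while online."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.store = []

    def receive(self, trace):
        if self.online:
            self.store.append(trace)
        return self.online


def send_to_all(trace, collectors):
    """Send the same trace to every collector (one link per collector).

    Returns True if at least one collector stored the trace.
    """
    results = [collector.receive(trace) for collector in collectors]
    return any(results)


site_a = Collector("site-a")
site_b = Collector("site-b")
lost = []
for seq in range(10):
    if seq == 4:
        site_a.online = False  # simulated failure of one Collector
    if not send_to_all(f"trace-{seq}", [site_a, site_b]):
        lost.append(seq)

print(len(site_a.store), len(site_b.store), lost)  # 4 10 []
```

Because both Collectors are active at the same time, the failure of one leaves no reset window: the surviving twin already holds every trace.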

With this architecture and HA implementation, Managers are able to operate in parallel in Primary / Guest mode. Only the Primary Manager is able to manage and modify the global configuration. All Guests can be configured to perform the same processing on the same data and to store the results locally, each on its own disks. Managers can also be configured as a Master / Backup pair, together with Guests.

In this case, the Backup Manager monitors the state of the Master Manager on several levels to detect a failure whether in the OS, services, network links, interfaces, data access, or data availability.

As long as the Master Manager is healthy, the Backup Manager has no other task than to synchronize the Master’s configuration and results to its own disk. If the Backup Manager considers that the Master Manager has failed, it reconfigures itself as Master Manager.

No data is lost and no process is skipped.

Finally, there is one last element to address in HA: continuity of service with regard to the user. Users have a binary view that completely ignores what is going on in the background. For them, the session is either up or down!
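The Master / Backup behaviour described above can be sketched as a small state machine. This is a minimal illustration under stated assumptions: the `BackupManager` class, the probe callables, and the rule "promote as soon as any level fails" are hypothetical choices, and the level names simply mirror the list in the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Monitoring levels named in the text (identifiers are illustrative).
LEVELS = ("os", "services", "network_links", "interfaces",
          "data_access", "data_availability")


@dataclass
class BackupManager:
    probes: Dict[str, Callable[[], bool]]      # one health probe per level
    read_master_state: Callable[[], dict]      # fetches config and results
    role: str = "backup"
    local_copy: dict = field(default_factory=dict)

    def failed_levels(self) -> List[str]:
        """Return the levels whose probe currently reports a failure."""
        return [level for level, probe in self.probes.items() if not probe()]

    def tick(self) -> str:
        """One monitoring cycle: synchronize, or promote on failure."""
        if self.role != "backup":
            return self.role
        if self.failed_levels():
            self.role = "master"   # assumed rule: any failed level promotes
        else:
            self.local_copy = dict(self.read_master_state())
        return self.role


master_state = {"config_version": 7}
health = {level: True for level in LEVELS}
backup = BackupManager(
    probes={level: (lambda level=level: health[level]) for level in LEVELS},
    read_master_state=lambda: master_state,
)

backup.tick()                   # all probes pass: stays backup and syncs
health["data_access"] = False   # simulated failure on one level
backup.tick()                   # backup promotes itself to master
print(backup.role, backup.local_copy)
```

Since the Backup synchronizes on every healthy cycle, it already holds the last known configuration and results at the moment it promotes itself.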

High Availability Strategy:

In the Cluster Computing option, it is clear that if the Primary fails and the Secondary is reconfigured, then the user session drops regardless of the front-end architecture put in place.

With the SECNOLOGY architecture, by contrast, the user session can be maintained. This is possible if user flows reach the Managers through Layer 4-7 switches, which redirect each user session to a Manager that can respond quickly to the user’s requests. If that is not the Master, it will be the Backup.

This architecture is recommended when the organization already has Layer 4-7 switches. Otherwise, the additional investment will be difficult to justify, and users will instead be asked to restart their sessions manually.
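The redirection logic can be sketched in a few lines. This is only a toy model of what such a switch does: the `SessionRouter` class and its methods are hypothetical, health state is set by hand rather than by real health checks, and session state on the Managers is assumed to be already synchronized.

```python
class SessionRouter:
    """Toy stand-in for a switch that forwards sessions to a healthy Manager."""

    def __init__(self, managers):
        # Managers in priority order, e.g. ["master", "backup"].
        self.managers = list(managers)
        self.healthy = {name: True for name in self.managers}

    def mark(self, name, is_healthy):
        """Record the result of a (simulated) health check."""
        self.healthy[name] = is_healthy

    def route(self, session_id):
        """Return the first healthy Manager for this session."""
        for name in self.managers:
            if self.healthy[name]:
                return name
        raise RuntimeError(f"no Manager available for session {session_id}")


router = SessionRouter(["master", "backup"])
before = router.route("session-42")   # "master"
router.mark("master", False)          # simulated Master failure
after = router.route("session-42")    # "backup"
print(before, after)
```

From the user's point of view the session identifier never changes; only the Manager answering behind the switch does.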

Real Cost of High Availability

Finally, if we look at the financial aspects of HA, then even apart from the mandatory doubling of hardware components, the cost can be so high that it cannot be justified.

For many SIEM vendors, HA may require doubling the license. This means that instead of paying for one SIEM, you have to pay for two! Or, in other instances, the backup license costs half of the primary license.

In this regard, SECNOLOGY stands out from the competition: high availability only requires the addition of a Collector license. This represents only a marginal cost on top of the global license, regardless of the volume to be processed, and therefore makes it the cheapest HA offer on the market.