Site availability and data integrity
Overview
A site is a set of computing resources joined by a wide communication channel and high mutual availability, which cannot disintegrate into several unrelated parts (except through the loss of individual servers) and which provides full functionality regardless of the availability of other parts of the system.
A site must not be able to break up into several unrelated groups of servers, because data integrity must be preserved.
To guarantee this, a quorum model is used: data microservices (redundant in active-passive mode) are active only on servers that belong to a cohesive group of more than half of the site's active servers.
Quorum mode is enabled per site by the check_quorum configuration parameter (off by default).
The borderline case of exactly half of the servers can be resolved with an elementary arbiter: an external IP address checked by a simple ping.
This holds when, for each server, communication with the other servers and communication with the outside world go over the same channel, so that a server crashing or being disconnected from the network looks the same to the other servers in the system and to devices in the outside world.
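As an illustration, a minimal sketch of this quorum rule (helper names are hypothetical; this is not the actual implementation):

```python
# Minimal sketch of the quorum rule with an external arbiter.
# visible_servers counts the servers in this cohesive group, including itself.

def has_quorum(visible_servers: int, total_servers: int,
               arbiter_reachable: bool) -> bool:
    """Decide whether this cohesive group of servers holds quorum."""
    if visible_servers * 2 > total_servers:   # strictly more than half
        return True
    if visible_servers * 2 == total_servers:  # borderline case: exactly half
        # The arbiter (an external, permanently available IP address) breaks
        # the tie: only the half that still reaches the outside world keeps
        # quorum; the other half unloads its active-passive data microservices.
        return arbiter_reachable
    return False                              # a minority never holds quorum
```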
If the only risk is an individual server disconnecting, there is no need for quorum mode. First, when a site is deployed across multiple servers, a single server does not carry full functionality, and it is highly likely that no business process could complete on it, since business processes need to interact with many other microservices; the probability that all of them reside on that one server is non-zero only if the server is intentionally configured to provide full functionality (e.g., a two-server site). Second, an individual server is connected by the same channel both to the external devices it serves and to the other servers of the site. Third, the loss of communication between a single server and the other servers of the same data center, while the server itself remains operational, is indistinguishable from a server failure. If these conditions do not hold, or their consequences are unacceptable, quorum mode should be used; the active-passive microservices on the disconnected server will then be taken out of service.
A special case is site redundancy through simultaneous deployment of computing resources in two data centers with a wide communication channel between them and low ping. This configuration is designed to ensure site availability against the risk of losing an entire data center (power, connectivity, or other causes).
Hosting the site in two data centers
The site must be configured so that:
- if one of the data centers becomes unavailable, the remaining part of the system provides full functionality, ensuring availability of services;
- if the direct (third) link between the data centers goes down, one and only one of the groups provides full functionality, ensuring data integrity.
For this purpose, the site’s servers are divided in half so that each group contains a full set of microservices:
- Instances of active-active microservices are represented in both groups.
- Instances of active-passive microservices are represented in each group with different priorities, balancing the load between the data centers. Those serving domain data are, in each group, capable of handling the full domain tree.
- Database instances are present in both groups and are configured in master-replica mode.
- File stores are available to both groups and automatically restore integrity once both groups are back up and available after a crash.
Depending on how communication between the data centers is provided - over a common external link (1) or over a separate direct link (2) - the configuration can be done in different ways.
1. The case of using an external channel for communication between the data centers.
A loss of datacenter A's computing resources and a loss of connectivity to it are indistinguishable to external systems and to the servers of the other datacenter B. The disconnected server group A will therefore receive no requests from outside, yet it poses a threat to data integrity.
It is necessary and sufficient that the remaining group B considers itself in quorum and remains operational.
Option 1: Split the servers strictly in half, enable quorum mode, and register any permanently available external address as the arbiter.
Then, when group A is unavailable, quorum mode in group B detects that exactly half of the servers are reachable and checks the availability of the arbiter. Since the arbiter responds, group B decides it has quorum and continues operating. Group A is of no interest if it is inactive. If it is active but without external communication, it likewise detects that exactly half of the servers are reachable and fails the arbiter availability check (unavailability of the other servers and of the arbiter are equivalent, since they share the common external channel); lacking quorum, it unloads the data microservices and goes into a quorum-wait state.
Option 2: Do not use quorum mode, relying on group A being guaranteed unavailable under both types of failure: loss of communication and loss of computing resources.
In this case the servers need not be distributed across the data centers strictly in half. It is only important that:
- each group can carry the full load of the site;
- the remaining group of servers cannot be further split into mutually unreachable parts, except through the disconnection of individual servers.
2. The case of separate external channels and a direct channel between the data centers.
In corporate networks, data centers are usually hardware-configured from the start to provide redundant operation of distributed services:
- one of the data centers is designated as the leader (A) and the other as the follower (B);
- the cross-channel between the data centers is an order of magnitude wider than their external channels;
- a VPN cross-channel between the data centers over the external interfaces is supported;
- the routers mutually send heartbeat messages simultaneously over the cross-channel and over the VPN over WAN;
- if an external link drops, the routers automatically reroute outgoing requests to external addresses, redirecting outgoing traffic to the other data center's router via the cross-channel;
- the router in datacenter B is configured as follows:
  - no cross-channel heartbeat but a VPN-over-WAN heartbeat present: traffic routing is automatically disabled;
  - no heartbeat from either direction: traffic routing is enabled;
  - heartbeat from both directions: traffic routing is enabled;
  - a cross-channel heartbeat present but no VPN-over-WAN heartbeat: traffic routing is enabled, and outgoing requests arriving via the cross-channel are additionally routed.
If these conditions are met, the asymmetric hardware configuration removes the threat to integrity. The automatic shutdown of the hardware routing service makes it impossible to have two integrity points in the two data centers, and the automatic redirection of outgoing traffic through the other data center's external channel makes it impossible for business processes to use microservice instances that have no connection to the outside world.
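The router-B rules above can be summarized as a decision table; the sketch below is illustrative, not an actual router configuration:

```python
# Decision table for the datacenter-B router, driven by two heartbeat inputs:
# the cross-channel and the VPN over WAN.

def router_b_decision(cross_hb: bool, vpn_hb: bool) -> tuple[bool, bool]:
    """Return (own_routing_enabled, route_for_other_dc_via_cross_channel)."""
    if not cross_hb and vpn_hb:
        return (False, False)  # A is alive but the cross link is down: B steps aside
    if not cross_hb and not vpn_hb:
        return (True, False)   # A is gone entirely: B routes its own traffic
    if cross_hb and vpn_hb:
        return (True, False)   # normal operation
    return (True, True)        # A lost its external link: B also routes A's
                               # outgoing requests arriving via the cross-channel
```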
When configuring a system in such an infrastructure, it is sufficient to ensure only one condition: to divide the servers in half so that each group can provide full functionality in the absence of the other group.
The discussion below concerns data centers that lack the hardware configuration described above and are symmetric.
2.1. The case of a loss of a data center's computing resources.
This case is equivalent to case 1 and presents no other difficulties.
2.2. The case of external link failure at one of the data centers (A).
There is no threat to quorum; the arbiter is not polled by any of the servers.
All back-end microservices continue to be active in groups on both data centers and available to other microservices in the system.
2.2.1. The outer-loop microservices in group A are left without connection to the outside world.
- ws (webserver). Works in active-active mode. Existing websocket connections are broken; new ones are established by clients to an available group B webserver. By default, ws does not make outgoing requests unless configured to do so in scripts.
- sg (sip gate). Works in active-active mode. New registrations and registration renewals go to the available group B instance. Existing calls going through group A's sg lose control. If group A's mediagate is used to serve traffic, audio is lost and the call has to be terminated by the callers.
- mg (media gate). Works in active-active mode. It is selected to service a call by other microservices of the system, so without special settings roughly half of the calls will land on group A's mediagates and leave subscribers unable to exchange traffic. The MG problem is considered separately: 3.2. MediaGate's problem.
- esg (external sip gate). Active-active, but each provider account is served by one and only one esg instance. The priority settings for selecting the esg to serve each provider account are set directly in the properties of the entity describing it (the provider). A different esg is selected on the basis of unavailability. In the considered case, where data center A has no external channel, the accounts bound in priority to group A's esg instances continue to be served there without the possibility of exchanging data with the provider. Providing PSTN connectivity for subscribers is considered separately: 3.1. ISP problem.
- bgmg (border media gate). Active on the servers alongside esg; it is not used unless the corresponding esg is active.
2.2.2. Inner-loop microservices whose processes access external systems over the external channel continue to operate but receive unsatisfactory responses (3.3. Problem of access to external systems).
- script (all types of scenarios). Works in active-active mode. Each scenario is served by a single instance, and instance selection is performed by other microservices of the system. Components that access web resources, mail servers, or external databases in scenarios executed by group A instances will fail with an error - in particular ASR and TTS when using Yandex SpeechKit. Instance selection (hashing) is considered separately: 3.4. Microservice instance selection problem.
- im (instant messaging). Provides the exchange of text messages with messenger servers. Works in active-passive mode. The group A instance will receive a communication error when attempting to exchange data, and message exchange with the messengers will be suspended. The problem of distributed services with a leader is considered separately: 3.5. The problem of distributed applications and services with a leader.
- email. Provides the exchange of mail with mail servers. Works in active-passive mode. The group A instance will receive a communication error when attempting to exchange data, and mail exchange with the mail servers will be suspended. The problem of distributed services with a leader is considered separately: 3.5. The problem of distributed applications and services with a leader.
- mware (middleware). The letsencrypt certificate generation service and access to telegram for administration needs. The group A instance will receive a communication error when attempting to exchange data. The problem of distributed services with a leader is considered separately: 3.5. The problem of distributed applications and services with a leader.
2.3. The case of a link drop between the data centers.
In this case, the external links in both data centers are active.
There is no communication between the system's servers on network interface 1 (local addresses), and the system cannot function in the standard way, because microservices address each other precisely by local addresses. Two unrelated groups emerge, each of which is operational, fully functional, and ready to serve client requests and business processes. Each holds exactly half of the site's servers.
Based on the results of polling the arbiter, both groups are prepared to continue working independently.
To ensure data integrity, one and only one of the groups, either A or B, must be taken out of service. The case is then reduced to cases 1 and 2.1 (loss of one data center's computing resources, or of its availability to everyone).
The determination is made in a coordinated manner: both groups reach the same decision.
Network interface 2 and the external channel are used to determine which group has priority. Group A servers poll group B servers at the external addresses specified in the configuration, and group B servers likewise poll group A servers. If mutual availability is detected, group priority is computed as a function of the servers in each group (the servers and their parameters in the two unrelated groups are guaranteed to differ). If the other group is found unreachable at its external addresses, the group considers itself the only one and, with the arbiter available and half of the site's servers reachable, continues to work.
The group prioritization function uses the per-server configuration parameter quorum_leadership_key and selects the group with the lexicographically minimal value. If the parameter is absent, the server's name in the configuration is taken as the default value. If the keys of different servers match, their names are compared as a secondary criterion.
Mutual polling between the groups is done over http/https by sending an HTTP request to a public URL (with a static key) on the current site's web servers in the unreachable zone.
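A minimal sketch of this coordinated choice, assuming a hypothetical server-record structure and an illustrative poll URL; both groups evaluate the same function and therefore reach the same decision:

```python
import urllib.request

def leadership_value(server: dict) -> tuple[str, str]:
    # quorum_leadership_key falls back to the server name;
    # names are the secondary comparison when keys match.
    return (server.get("quorum_leadership_key", server["name"]), server["name"])

def prioritized_group(group_a: list[dict], group_b: list[dict]) -> str:
    # The group holding the lexicographically minimal value wins.
    a_best = min(map(leadership_value, group_a))
    b_best = min(map(leadership_value, group_b))
    return "A" if a_best < b_best else "B"

def poll_other_group(external_url: str, static_key: str,
                     timeout: float = 2.0) -> bool:
    # HTTP request to a public URL (with a static key) on the other group's
    # web servers, reached at their external addresses. URL layout is assumed.
    try:
        req = urllib.request.Request(f"{external_url}?key={static_key}")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```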
2.4. Failure of multiple entities.
When both the external channel of one data center and the internal channel between the data centers are down simultaneously, the case is equivalent to a loss of that data center's computing resources. The external arbiter resolves the problem, because the disconnected group is unreachable from outside and has no way to query the external arbiter.
2.4.1. Channels and computing resources of one of the data centers
When a data center's computing resources and one of its links fail simultaneously, the case reduces to a simple loss of that data center's computing resources.
2.4.2. Multiple servers in different groups
When multiple servers scattered across different groups are lost, uptime depends on the configuration, and the case is equivalent to the usual case in which all site servers are located in the same data center. There is a case in which all servers that, according to the configuration, serve a certain type of microservice have failed; every business process that involves that microservice then becomes impossible on the site.
2.4.3. One of the data centers and a server in another data center
If one of the data centers is unavailable or down, quorum mode is used to ensure integrity. Each of the two data centers holds the minimum number of servers required for quorum: 50%.
In this arrangement, the site cannot afford to lose any additional server in the group that remains in quorum.
Should that happen, quorum is lost and the group goes into quorum standby mode, resulting in unavailability of the service (except for case 2.3, a link drop between the data centers).
And since the lost server is arbitrary, there is another problem besides the loss of quorum: the site configuration must be prepared for the loss of three arbitrary servers. That means that:
- active-active microservices must have 4 or more instances - at least 2 in each group;
- active-passive microservices should have not two but four instances - at least 2 in each group;
- DBMS instances must reside in highly available virtual environments rather than on site servers, or must be externally managed to provide tracking of 4 replicas.
2.4.4. Weighting models for quorum calculation
(TODO) Extended weighting models could potentially be applied to the quorum calculation, allowing groups with fewer than half of the site's total servers. This capability builds on the concept of arbiter servers and a predefined group boundary running between the data centers. In this case:
- the requirement to distribute the servers strictly in half can be lifted, up to and including distributing the site's servers across more than two data centers;
- low-performance servers located outside the data centers that do not affect quorum can be connected to the site. For example, this may be useful to prioritize media gateways located in remote locations for call handling within those locations.
3. Consideration of the identified problems.
3.1. ISP problem
3.1.1. Outgoing calls
Access to the provider is made redundant by creating multiple accounts, configured with opposite priorities for binding to esg instances.
In routing, multiple identical rules are created, differing only in their binding to different accounts (and possibly in priority). This ensures that routing rules keep being applied when the account status check, initiated on each attempt to apply a call routing rule toward the provider, fails.
To load the accounts and esg servers evenly in the normal state, the rules can be given the same priority; in that case random ordering is applied, as sketched below.
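A sketch of that ordering, assuming rules are (priority, account) pairs and that a lower number means a higher priority (an assumption; the source does not specify the direction):

```python
import random

def order_rules(rules: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Sort rules by priority; ties are randomized, so equal-priority
    accounts (and hence their esg servers) are loaded evenly on average."""
    shuffled = random.sample(rules, len(rules))  # randomize first...
    return sorted(shuffled, key=lambda r: r[0])  # ...stable sort keeps ties shuffled

# Two identical rules bound to different accounts, with equal priority:
rules = [(10, "provider-account-1"), (10, "provider-account-2")]
for _priority, account in order_rules(rules):
    # try the rule; if the account status check fails, fall through to the next
    pass
```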
3.1.2. Incoming calls.
Options:
- If one of the servers is unavailable to the provider, the INVITE request receives no 100 Trying acknowledgement. On this basis the provider's equipment promptly re-sends the request to another account (internal forwarding). The provider must support call forwarding on no-response within a short interval, much shorter than the transaction lifetime (32 seconds).
- The provider's equipment routes incoming calls to an external number to two accounts at once (a group number with a parallel call, or variants with a slight delay). When one of them answers, a CANCEL is sent to the other. The system can be configured to receive calls in a scenario with instant answer (and instant rejection of the duplicate call) and onward transfer through the routing table, in order to eliminate duplicates in the statistics.
- Requests are sent to both servers at the same time (forking); when one of them responds, the other receives a CANCEL. As in the previous option, receiving calls in an instant-answer scenario can be configured.
- (TODO) Two esg instances are registered simultaneously under the same account (forking). The provider must support this mode. Calls are sent by the provider simultaneously to the two devices registered under the same account. The system's task is to a) allow configuring simultaneous registration under the same account, and b) reject a duplicate call on input when its twin is detected on another instance based on a configured criterion, using a transaction.
3.2. MediaGate’s problem
The media gate is selected by the active mgc controller. Several active media groups, each consisting of a controller and several mediagates statically bound to it in the configuration, can be used at the same time. The media group to serve a call is then selected randomly and sequentially.
3.2.1. Configure persistent use of the bgmg edge mediagate, which always resides on the same server as the edge SIP server (esg, sg) handling the corresponding call leg. Then the unavailability of the external channel does not matter for the primary mediagate used to service the call. However, in this case failure of the edge server makes it impossible to continue exchanging traffic for existing calls.
3.2.2. When the controller selects a media gate, it automatically takes the arbiter's availability into account. The filter applies only to media contexts managed by the b2b microservice. It does not apply to media contexts managed by the ivr and conf microservices (contact with external subscribers is excluded for them: they are always connected to b2b media gates, and the internal channel is not broken). The microservice instance selection problem is considered separately: 3.4. Microservice instance selection problem.
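A sketch of this filter (the record structure is illustrative):

```python
# The offline filter is consulted only for media contexts owned by b2b;
# ivr and conf contexts never face external subscribers, so any mediagate
# reachable over the unbroken internal channel will do.

def mediagate_candidates(gates: list[dict], context_owner: str) -> list[dict]:
    if context_owner != "b2b":                # ivr, conf
        return gates
    return [g for g in gates if g["online"]]  # arbiter-based offline filter
```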
3.2.3. (TODO) Spread media groups across different sites along with their media gates. When building the random list, priority is given to the media group located in the same datacenter as the server initiating the call (sg, esg, ivr, conf).
3.3. Problem of access to external systems
The solution uses regular polling of the arbiter from each server that runs microservices from the list of those requiring external access. Based on the poll result, the server is assigned an offline or online state. The arbiter is thus used well beyond merely determining quorum for the cohesive half of the servers. A 20-second cache is used to reduce the number of requests.
Offline state tracking is enabled per site by the check_offline configuration parameter (off by default).
When the arbiter is unavailable, the server is considered offline.
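A minimal sketch of the offline-state check with its 20-second cache; the TCP probe stands in for the document's "simple ping" (ICMP requires raw-socket privileges) and is an assumption:

```python
import socket
import time

_CACHE_TTL = 20.0                           # seconds, per the text
_cache: dict[str, tuple[float, bool]] = {}  # arbiter ip -> (stamp, state)

def check_arbiter(ip: str, port: int = 443, timeout: float = 2.0) -> bool:
    # Illustrative reachability probe standing in for a ping.
    try:
        socket.create_connection((ip, port), timeout=timeout).close()
        return True
    except OSError:
        return False

def server_is_online(arbiter_ip: str) -> bool:
    """Online iff the arbiter poll succeeds; an unreachable arbiter
    means the server is considered offline."""
    now = time.monotonic()
    hit = _cache.get(arbiter_ip)
    if hit and now - hit[0] < _CACHE_TTL:
        return hit[1]
    state = check_arbiter(arbiter_ip)
    _cache[arbiter_ip] = (now, state)
    return state
```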
Among the microservices that have the problem, there are both active-active (e.g., script) and active-passive (e.g., email). Different schemes are used for them.
3.3.1. Active-active.
When selecting a node to execute a script, an offline filter based on arbiter availability information is applied to the servers running the script and ivrscript microservices. A cache with a 20-second update interval is used.
The behavior is similar when selecting media gates (3.2. MediaGate's problem).
3.3.2. Active-passive
When communication with the arbiter is absent, the microservice instance in group A switches to waiting for it to reappear and does not take part in the competition for leadership. The microservice instance in group B is thus activated regardless of the leadership priority set in the configuration. Recovery is likewise based on information about restored communication with the arbiter. The check interval is 10 seconds. The cached data is not used here, but it is updated.
3.4. Microservice instance selection problem
If there are multiple microservice instances, whenever an executor instance is selected, an external channel availability filter is applied.
Arbiter availability checks are initiated regularly (at a 20-second interval) from the server hosting the microservice instance, and the information is cached for later use during selection (for a mediagate, in particular, on the media controller). The arbiter is thus used not only when determining quorum for the cohesive half of the servers, but regularly from every server that runs microservices requiring a filter on external connectivity.
To eliminate the risk of the system's functioning depending on the health of the arbiter itself, several highly available external servers are registered in the site configuration: the odd_referee site option, a string of the form "IP1, IP2, IP3".
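The source does not state how the several referees are combined; a plausible reading of the odd count is a simple majority vote, sketched below (reusing check_arbiter() from the sketch in 3.3):

```python
def parse_referees(odd_referee: str) -> list[str]:
    """Parse the odd_referee site option, e.g. "IP1, IP2, IP3"."""
    return [ip.strip() for ip in odd_referee.split(",") if ip.strip()]

def online_by_majority(odd_referee: str) -> bool:
    # Assumption: the odd number of referees enables a majority vote,
    # tolerating the failure of any single referee.
    ips = parse_referees(odd_referee)
    reachable = sum(1 for ip in ips if check_arbiter(ip))
    return reachable * 2 > len(ips)
```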
3.5. The problem of distributed applications and services with a leader
Leader selection for distributed applications is based on the priorities set in the configuration for the role at the site. On failure, the instance with the next-highest priority on the site becomes the new leader (failover). When a higher-priority instance is restored, it forcibly takes leadership back (takeover).
Leader selection for dynamic distributed services is done by a capture transaction and is random in nature. A failover is followed by the next iteration of randomly capturing leadership by one of the remaining instances. The takeover operation does not apply.
Instances running on servers with no access to the external arbiter are removed from the competition for leadership. They are not activated at boot; at runtime, a decision is made after 3 consecutive failed attempts at 2-second intervals. Site option: check_offline.
As in 3.4, several highly available external servers are registered in the site configuration (the odd_referee option, a string of the form "IP1, IP2, IP3") to eliminate the risk of depending on the arbiter itself.
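A sketch of the priority-based scheme with the offline filter (the record layout and the lower-value-wins convention are assumptions):

```python
from typing import Optional

def elect_leader(instances: list[dict]) -> Optional[dict]:
    """instances: [{'name': ..., 'priority': int, 'online': bool}, ...].
    Instances without access to an external arbiter (online=False) are
    excluded from the competition; the lowest priority value wins."""
    candidates = [i for i in instances if i["online"]]
    return min(candidates, key=lambda i: i["priority"], default=None)

# failover: the failed leader drops out of `instances`; the next call
# returns the next-highest-priority instance.
# takeover: a restored higher-priority instance reappears in `instances`;
# the next call returns it, forcibly displacing the current leader.
```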
3.6. The PostgreSQL problem
All PostgreSQL instances have streaming replicas and are connected to a replication controller (an mware microservice running in active-passive mode). The replication controller is active only in the group that is in quorum and that remains the leader after mutual polling over the WAN. If the database instance in the active group is a replica, the active replication controller detects this. After detecting 30 seconds of master unavailability, the controller converts the replica into a master. The conversion is a point operation and completes in milliseconds. From then on, the microservices that interact with the database (mdc, dms, possibly script) are able to connect to it.
Once communication between the groups is restored, the controller detects two instances in the master state and initiates the conversion of the older master into a replica by creating a full backup. This operation is long and, depending on the data size, can take tens of minutes. The backup is left nearby (by default in /tmp, which is cleared every time the container/server restarts).
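A sketch of the controller's promotion decision, using the standard PostgreSQL functions pg_is_in_recovery() and pg_promote() (PostgreSQL 12+); psycopg2, the DSNs, and the polling loop are assumptions:

```python
import time
import psycopg2

MASTER_DOWN_THRESHOLD = 30.0  # seconds of master unavailability, per the text

def master_reachable(master_dsn: str) -> bool:
    try:
        psycopg2.connect(master_dsn, connect_timeout=3).close()
        return True
    except psycopg2.OperationalError:
        return False

def promote_replica_if_needed(replica_dsn: str, master_dsn: str) -> None:
    down_since = None
    while True:
        if master_reachable(master_dsn):
            down_since = None
        elif down_since is None:
            down_since = time.monotonic()
        elif time.monotonic() - down_since >= MASTER_DOWN_THRESHOLD:
            with psycopg2.connect(replica_dsn) as conn, conn.cursor() as cur:
                cur.execute("SELECT pg_is_in_recovery()")
                if cur.fetchone()[0]:
                    cur.execute("SELECT pg_promote()")  # point switch, milliseconds
            return
        time.sleep(1.0)
```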