Chapter 5. Quality attribute management techniques

Scaling

Scaling depends on the load profile and essentially means moving some of the system's functional blocks to separate physical or virtual servers. To make this possible, the system is designed from the outset as a set of functional roles, each with its own responsibilities.

Each functional role can be placed on a separate server, and under excessive load a role can additionally be divided into groups by responsibility.

The architecture of the "Incoplax" communication platform classifies roles by availability tactic: active-active and active-passive (see Step 6).

The entire system can be assembled from several independent parts, each with full functionality and separate physical resources (sites).

At full scale, the system looks like this:

  • Multiple sites (groups of highly available servers) connected by links with lower requirements on stability and bandwidth.

  • Each site contains multiple servers.

  • Each server contains one or more roles.

  • Each role is located on multiple servers within each site.

  • Each role interacts with other roles within the site in terms of services.

  • Each role interacts with its mirror on other sites in terms of data reconciliation.
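The structure above can be captured in a small data model. The sketch below is purely illustrative (class names, role names, and the availability-tactic enum are assumptions, not platform code); it only shows the site/server/role hierarchy and the expectation that a role is present on several servers of a site.

```python
# Illustrative sketch of the site/server/role hierarchy described above.
# Class and role names are examples, not the platform's actual data model.
from dataclasses import dataclass, field
from enum import Enum


class Tactic(Enum):
    ACTIVE_ACTIVE = "active-active"
    ACTIVE_PASSIVE = "active-passive"


@dataclass
class Role:
    name: str          # e.g. "b2b", "mgc", "registrar"
    tactic: Tactic


@dataclass
class Server:
    hostname: str
    roles: list = field(default_factory=list)      # list of Role


@dataclass
class Site:
    name: str
    servers: list = field(default_factory=list)    # list of Server

    def servers_with_role(self, role_name: str) -> list:
        """Each role is expected to be present on several servers of the site."""
        return [s for s in self.servers
                if any(r.name == role_name for r in s.roles)]
```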

On one server

Consider first a system installed on a single server. All roles are active. Assume a 4-core server.

In this configuration the system is not protected against a machine failure; when a particular role crashes, it is restarted, losing its current activities while the data is preserved. Roughly, one fully loaded server is able to process one of the following workloads:

  • 6000 SIP messages per second on the SIP roles; or

  • 20,000 media packets per second (200 simultaneous connections * 2 trunks * 50 packets per second) on an MG; or

  • 5000 routing rule lookup operations per second (~5 routes, ~10 vectors per route); or

  • 40,000 logging operations per second per core;

  • …​

One server can host many domains, many registrations, and many calls, but the combined total is limited.

Load calculations

Each subscriber sends a registration on average once every 5 minutes along the path UA - Gate - B2B, which amounts to 12 SIP messages.

10,000 subscribers therefore generate about 400 SIP messages per second for registration operations, roughly 8% of the limit.

Each established dialog with audio exchange consumes ~0.5% of the limit.

Each additional CPS costs (30 + 15*(number of forks) + 24*(number of re-INVITEs)) SIP messages plus (5 + 3*(number of re-INVITEs)) MEGACO messages. In the simple case this amounts to 0.75% + 0.5% + 0.2% of the server's capacity limit plus one routing operation.

As a rough guideline, 10,000 subscribers on one server can coexist with up to 150 simultaneous conversations or a CPS of up to 50 (each being the maximum when taken alone), or with intermediate combinations such as 100 conversations at 20 CPS or 50 conversations at 30 CPS.

State subscriptions, group calls, and multi-fork calls also add significant load.

The load increases slightly with complex cross-domain routing involving a large number of rules and forwardings.
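These estimates can be reproduced with a few lines of arithmetic. The sketch below only restates the figures given above (the 6000 SIP msg/s ceiling, 12 messages per registration every 5 minutes, and the per-CPS formulas); the function names are illustrative and not part of the platform.

```python
# Rough load estimator using the figures from this section.
# The numbers are taken from the text above; the code itself is illustrative.

SIP_LIMIT_PER_SEC = 6000          # single-server ceiling for SIP roles
REG_INTERVAL_SEC = 300            # one registration per subscriber every 5 minutes
SIP_MSGS_PER_REGISTRATION = 12    # UA - Gate - B2B


def registration_load(subscribers: int) -> float:
    """SIP messages per second generated by periodic registrations."""
    return subscribers / REG_INTERVAL_SEC * SIP_MSGS_PER_REGISTRATION


def sip_msgs_per_call(forks: int = 0, reinvites: int = 0) -> int:
    """SIP messages added by one call attempt (one CPS)."""
    return 30 + 15 * forks + 24 * reinvites


def megaco_msgs_per_call(reinvites: int = 0) -> int:
    """MEGACO messages added by one call attempt."""
    return 5 + 3 * reinvites


if __name__ == "__main__":
    reg = registration_load(10_000)
    print(f"10,000 subscribers: {reg:.0f} SIP msg/s of {SIP_LIMIT_PER_SEC} available")
    print(f"one simple call: {sip_msgs_per_call()} SIP + {megaco_msgs_per_call()} MEGACO messages")
```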

Finding bottlenecks

  1. If the number of simultaneous calls grows, the MGC and MG roles must be moved to separate servers. If the number of simultaneous calls exceeds 200, the number of MG role instances can and should be multiplied.

  2. If the number of simultaneous calls exceeds 2000, MGCs should be multiplied and media groups allocated. Media groups can also be multiplied to improve availability and resilience to failures.

  3. As CPS increases, B2B and SG should be moved to separate servers and multiplied.

  4. When the number of SGs is large, it makes sense to introduce the REDIRECT role.

  5. As CPS increases further, in some cases it makes sense to split STORE into groups.

  6. When the number of users approaches 50,000 or CPS approaches 1000 (routing, authorization, registration, subscriptions), users must be divided into domains. The recommended maximum is 30,000 users per domain.

  7. If subscriptions are used massively, StateStore must be split into domain-bound groups, and B2B and SG must be multiplied and moved to separate servers.

  8. When the load grows across several domains, DC, REGISTRAR, and StateStore must be split into server groups bound to domains.

  9. When the number of DC, Registrar, StateStore, Store, MGC, and RPC servers reaches 40, sites must be allocated.
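The thresholds above can be restated as a simple advisor function. The sketch below is illustrative only (the function name, arguments, and return format are assumptions); it mechanically applies the 200/2000 simultaneous-call, 1000 CPS, and 50,000/30,000 user thresholds from the list.

```python
# Illustrative advisor that applies the thresholds from the list above.
# Not a platform API; the names and return format are assumptions.

def scaling_hints(simultaneous_calls: int, cps: int,
                  users: int, users_per_domain: int) -> list:
    hints = []
    if simultaneous_calls > 200:
        hints.append("move MGC and MG to separate servers and multiply MG instances")
    if simultaneous_calls > 2000:
        hints.append("multiply MGCs and allocate media groups")
    if users > 50_000 or cps > 1000:
        hints.append("divide users into domains")
    if users_per_domain > 30_000:
        hints.append("keep domains at 30,000 users or fewer")
    return hints


print(scaling_hints(simultaneous_calls=2500, cps=1200,
                    users=60_000, users_per_domain=35_000))
```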

Scaling directions

Infrastructure:

  • adding servers;

  • adding sites.

Logic:

  • adding active-active roles and allocating them to new servers;

  • dividing active-passive roles into groups with non-overlapping responsibility (by domain, by hashing).

Data:

  • adding new domains to the tree and transferring parts of organizational structures to other domains;

  • moving databases to separate servers for each domain;

  • allocating a dedicated server with data replication for each PostgreSQL/ClickHouse role (dc, dms);

  • allocating separate S3 storage for each domain.

Improving availability

The system is based on functional roles that form active-active hash-ring groups and active-passive redundancy groups. The roles provide services to each other and to external users.

Within the architecture, each role is independently responsible for its own availability.

Each role can be deployed on multiple physical nodes.

While a node is active, role activity is guaranteed by the configuration supervisor, whose own activity is provided by OS daemons/services.

Improving system availability means increasing the availability of individual roles, communication channels, and physical units, and reducing the dependence of some units on others. Role availability is ensured by duplication across several servers:

  • Hot standby (active-passive) (see the takeover sketch after this list).

  • Parallel availability (active-active).

  • A mixed scheme in which active-passive groups are divided by responsibility (reg, dc, store) or by load (mgc, mg) into multiple parallel groups.

  • Fast application of configuration changes.

  • The cluster's external network perimeter contains only the WS, ESG, SG, REDIRECT, and possibly RPCO roles, each of which has mechanisms to protect against intrusion, account password mining, and performance loss from (D)DoS attacks, using built-in edge filters.
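For the active-passive tactic, the minimal behaviour is a standby instance that promotes itself when the active one stops responding. The sketch below is a generic illustration under assumptions (heartbeat source, timeouts, and the promote callback are hypothetical), not the platform's supervisor logic.

```python
# Minimal active-passive takeover sketch: the standby becomes active when
# heartbeats from the active instance stop. All names and timeouts are
# illustrative assumptions.
import time
from typing import Callable


def standby_loop(last_heartbeat: Callable[[], float],
                 promote: Callable[[], None],
                 timeout_sec: float = 5.0,
                 poll_sec: float = 1.0) -> None:
    """Poll the active instance's heartbeat and take over if it goes silent."""
    while True:
        if time.monotonic() - last_heartbeat() > timeout_sec:
            promote()        # become the active instance of the role
            return
        time.sleep(poll_sec)
```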

Increasing system availability with respect to storages external to the platform (S3, PostgreSQL, ClickHouse) depends directly on the availability of the storages themselves and on measures taken to increase their availability to the system. Accordingly, in the system configuration each connection to a storage can be duplicated by any number of alternatives: if the first connection line is unavailable, the system iterates over the remaining alternatives and settles on the next available one. Intradomain connection settings for S3 and ClickHouse can likewise be duplicated with any number of alternatives.

The primary use case is communication processing: even when the storage for statistics and call records is unavailable, the main work of the system does not stop.
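A minimal sketch of the failover behaviour described above, under the assumption that each storage client exposes a connect call that raises on an unreachable endpoint (the function and parameter names are illustrative, not the platform's API):

```python
# Try duplicated storage connection alternatives in order and settle on the
# first one that is reachable. connect_fn and the exception type are
# assumptions for illustration.
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def connect_with_alternatives(alternatives: Sequence[str],
                              connect_fn: Callable[[str], T]) -> T:
    """Iterate over the configured alternatives until one connection succeeds."""
    last_error = None
    for url in alternatives:
        try:
            return connect_fn(url)     # e.g. open a PostgreSQL/S3/ClickHouse client
        except OSError as exc:         # endpoint unreachable: try the next alternative
            last_error = exc
    raise ConnectionError(f"no storage alternative is available: {last_error}")
```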

Increasing the availability of roles

Roles ACTIVE-ACTIVE (B2B, MG, WS, …​)

  • Adding servers to the configuration that run the corresponding roles (in quantities of 1, 3, 15, …​).

Roles ACTIVE-PASSIVE (REGISTRAR, MDC, SDC, MGC, STATESTORE, …​)

  • Adding servers to the configuration that run the corresponding roles in standby mode (in quantities of 1, 2, 3).

  • For roles divided by domains: splitting a redundancy group into several redundancy groups by domain responsibility, with partial or complete transfer to separate servers and formation of non-overlapping active-passive groups (registrar, mdc, sdc, statestore, …​).

  • For the system object storage role (store): splitting into several groups on separate servers; requests are distributed automatically by key via a hash ring (see the sketch after this list).

  • For the media controller role (mgc): splitting into several groups on separate servers; requests are distributed automatically by random group selection when a media context is created. Each group must have MGs bound to it, and it is rational to use the same number of MGs in all MGC groups on a site.
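The two distribution schemes mentioned above can be illustrated as follows. This is a generic sketch, not the platform's internal algorithm: the hash function (MD5), the number of virtual points, and the group names are assumptions.

```python
# Key-based distribution across STORE groups via a small consistent-hash
# ring, and random group selection for MGC when a media context is created.
# Hash choice, replica count, and group names are illustrative assumptions.
import bisect
import hashlib
import random


class HashRing:
    """Map keys to the nearest group clockwise on the ring."""

    def __init__(self, groups, replicas=100):
        self._ring = sorted((self._hash(f"{g}#{i}"), g)
                            for g in groups for i in range(replicas))
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def group_for(self, key: str) -> str:
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


def pick_mgc_group(groups):
    """An MGC group is chosen at random when a new media context is created."""
    return random.choice(groups)


store_ring = HashRing(["store-g1", "store-g2", "store-g3"])
print(store_ring.group_for("object:12345"))
print(pick_mgc_group(["mgc-g1", "mgc-g2"]))
```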

Increasing the availability of services

Media processing

  • adding new servers to the configuration on which new MGC-MG groups with new numbers are deployed.

External connections to providers

  • duplicating connections on different ESG role servers and configuring them simultaneously as sequential or competing priority routes (see the sketch below).
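A minimal sketch of the two priority schemes just mentioned, under an assumed route structure and field names (not the platform's routing configuration): sequential priorities are tried in order, while competing routes of equal priority share the load.

```python
# Illustrative ordering of duplicated provider routes: lower priority value
# is attempted first; routes with equal priority compete and are shuffled
# so the load is spread across them. Field names are assumptions.
import random
from dataclasses import dataclass


@dataclass
class Route:
    esg_server: str   # which ESG role server carries this connection
    priority: int     # lower value = tried earlier


def order_routes(routes):
    """Return the routes in the order they should be attempted."""
    by_priority = {}
    for r in routes:
        by_priority.setdefault(r.priority, []).append(r)
    ordered = []
    for prio in sorted(by_priority):
        group = by_priority[prio][:]
        random.shuffle(group)   # competing routes of equal priority share the load
        ordered.extend(group)
    return ordered


print(order_routes([Route("esg-1", 10), Route("esg-2", 10), Route("esg-3", 20)]))
```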

SIP-signaling

  • multiplying SGs and entering the duplicate addresses into the outbound proxy settings of SIP devices.

  • multiplying SGs, with DNS routing to a backup server at the same address.

  • using the REDIRECT role.

Increased availability during physical disruptions

Lack of communication between cities

  • allocating sites that have full internal functionality and do not require constant communication with other sites;

  • linking domains to sites, providing caching and storage of domain information on the site.

(D)DoS server attacks

  • multiplying SG, ESG, and WS with different external addresses;

  • applying whitelists/blacklists on SG and ESG;

  • SG and ESG have a built-in network filter that blocks sender addresses with repeated failed authentications (a simplified sketch follows).
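The idea behind such a filter can be sketched as follows. This is a generic simplification (thresholds, time window, and class structure are assumptions), not the actual SG/ESG implementation:

```python
# Count failed authentications per sender address within a sliding window
# and block addresses that exceed a threshold. All parameters are
# illustrative assumptions, not the SG/ESG defaults.
import time
from collections import defaultdict, deque


class FailedAuthFilter:
    def __init__(self, max_failures=5, window_sec=60, ban_sec=600):
        self.max_failures = max_failures
        self.window_sec = window_sec
        self.ban_sec = ban_sec
        self._failures = defaultdict(deque)   # addr -> timestamps of failures
        self._banned_until = {}               # addr -> monotonic time of unban

    def register_failure(self, addr: str) -> None:
        now = time.monotonic()
        events = self._failures[addr]
        events.append(now)
        while events and now - events[0] > self.window_sec:  # drop old events
            events.popleft()
        if len(events) >= self.max_failures:
            self._banned_until[addr] = now + self.ban_sec

    def is_blocked(self, addr: str) -> bool:
        return time.monotonic() < self._banned_until.get(addr, 0.0)
```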

Site down

  • binding a domain to multiple sites (backup or parallel); data is replicated automatically.

  • configuring phones with an outbound proxy that points to another site.