Failure support

The cluster mode needs to be resistant to failures.

From Wikipedia:

Disaster recovery (DR) involves a set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. Disaster recovery focuses on the IT or technology systems supporting critical business functions, as opposed to business continuity, which involves keeping all essential aspects of a business functioning despite significant disruptive events. Disaster recovery is therefore a subset of business continuity.
Business continuity encompasses a loosely defined set of planning, preparatory and related activities which are intended to ensure that an organization's critical business functions will either continue to operate despite serious incidents or disasters that might otherwise have interrupted them, or will be recovered to an operational state within a reasonably short period. As such, business continuity includes three key elements:
  • Resilience: critical business functions and the supporting infrastructure are designed and engineered in such a way that they are materially unaffected by most disruptions, for example through the use of redundancy and spare capacity;
  • Recovery: arrangements are made to recover or restore critical and less critical business functions that fail for some reason.
  • Contingency: the organization establishes a generalized capability and readiness to cope effectively with whatever major incidents and disasters occur, including those that were not, and perhaps could not have been, foreseen. Contingency preparations constitute a last-resort response if resilience and recovery arrangements should prove inadequate in practice.

Network

Login

It keeps working: just use another login server. You should load-balance the traffic and exclude the non-functional node (via IP, DNS, ...).

Resuming normal operation: just repair the network; the login server will reconnect itself to the master and become available again.
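
As one possible illustration of that fallback (not the actual CatchChallenger client code; the address list and port are assumptions for illustration), a client can simply try each known login server in turn and keep the first one that answers:

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <unistd.h>
  #include <cstdint>
  #include <string>
  #include <vector>
  
  // Returns a connected socket, or -1 if every login server is down.
  int connectToAnyLoginServer(const std::vector<std::string> &ips, const uint16_t port)
  {
      for (const std::string &ip : ips) {
          int fd = socket(AF_INET, SOCK_STREAM, 0);
          if (fd < 0)
              continue;
          sockaddr_in addr{};
          addr.sin_family = AF_INET;
          addr.sin_port = htons(port);
          inet_pton(AF_INET, ip.c_str(), &addr.sin_addr);
          if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0)
              return fd; // this node is functional, use it
          close(fd);     // exclude the non-functional node, try the next one
      }
      return -1;
  }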

Master

Important: it is a single point of failure, but all players connected to a game server will transparently continue to play (via the direct mode or the proxy mode). No new connections are allowed during this period.

Resuming normal operation: just repair the network; every server will reconnect itself and the disconnections will be merged into the lock list.
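
A minimal sketch of that merge, assuming the master keeps the lock list as a map from player id to lock metadata (the names here are illustrative, not the real CatchChallenger symbols):

  #include <cstdint>
  #include <ctime>
  #include <unordered_map>
  #include <vector>
  
  struct PlayerLock {
      uint32_t gameServerId; // game server that last held this player
      time_t   lockedAt;     // when the disconnection was merged in
  };
  
  std::unordered_map<uint32_t, PlayerLock> lockList;
  
  // A reconnecting game server reports the players that disconnected while
  // the master was unreachable; merging them back in preserves the
  // one-login-per-player guarantee.
  void mergeDisconnections(const uint32_t gameServerId,
                           const std::vector<uint32_t> &disconnectedPlayers)
  {
      for (const uint32_t playerId : disconnectedPlayers)
          lockList[playerId] = PlayerLock{gameServerId, time(nullptr)};
  }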

Game server

All players will be disconnected; this mode of operation allows cheaper operation and faster performance. The players will wait for the repair or play on another game server.

Resuming normal operation: repair the network; after the master reconnects, the players will come back.

Hardware

Login

It keeps working: just use another login server. You should load-balance the traffic and exclude the non-functional node (via IP, DNS, ...).

  • Works with a dead HDD if configured with a mirror to offload the datapack loading.

Resuming normal operation: just repair the hardware and restart the login server if needed; the node will be re-added to the network. If HA balancing is used, it needs to take care of this re-addition.
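
One possible shape for that HA-side care, sketched with a plain TCP probe (the pool handling and probe interval are assumptions for illustration):

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <sys/socket.h>
  #include <unistd.h>
  #include <cstdint>
  #include <string>
  #include <vector>
  
  // True once the repaired node accepts TCP connections again.
  bool nodeIsUp(const std::string &ip, const uint16_t port)
  {
      int fd = socket(AF_INET, SOCK_STREAM, 0);
      if (fd < 0)
          return false;
      sockaddr_in addr{};
      addr.sin_family = AF_INET;
      addr.sin_port = htons(port);
      inet_pton(AF_INET, ip.c_str(), &addr.sin_addr);
      const bool up = connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) == 0;
      close(fd);
      return up;
  }
  
  // Probe the repaired node and re-add it to the balanced pool once it answers.
  void watchAndReadd(std::vector<std::string> &pool, const std::string &ip, const uint16_t port)
  {
      while (!nodeIsUp(ip, port))
          sleep(5);       // still down, keep probing
      pool.push_back(ip); // back up: re-add it to the load-balanced pool
  }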

Master

Important: it is a single point of failure, but all players connected to a game server will transparently continue to play (via the direct mode or the proxy mode). No new connections are allowed during this period.

  • Works with a dead HDD: no disk access is needed after a successful server start.

Resuming normal operation: just repair the hardware; every server will reconnect itself and the disconnections will be merged into the lock list. If the master server is restarted, the player ids/locks connected on the game servers will be merged to remake the lock list. In case of conflict, both players will be ejected and the player blocked for a determined time.
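
A minimal sketch of the lock-list remake and the conflict rule, with illustrative names (the actual message sent to eject the players is omitted):

  #include <cstdint>
  #include <ctime>
  #include <unordered_map>
  #include <vector>
  
  struct GameServerReport {
      uint32_t gameServerId;
      std::vector<uint32_t> playerIds; // players currently connected there
  };
  
  // Remake the lock list from the game server reports; on conflict, drop
  // the lock and block the player for 'blockSeconds'.
  void remakeLockList(const std::vector<GameServerReport> &reports,
                      std::unordered_map<uint32_t, uint32_t> &lockList,   // playerId -> gameServerId
                      std::unordered_map<uint32_t, time_t> &blockedUntil, // playerId -> unblock time
                      const time_t blockSeconds)
  {
      for (const GameServerReport &report : reports)
          for (const uint32_t playerId : report.playerIds) {
              if (blockedUntil.count(playerId) != 0)
                  continue; // already flagged as a conflict earlier in the merge
              const auto inserted = lockList.emplace(playerId, report.gameServerId);
              if (!inserted.second) {
                  // Same player id on two game servers: eject both sides
                  // and block the player for the determined time.
                  lockList.erase(playerId);
                  blockedUntil[playerId] = time(nullptr) + blockSeconds;
              }
          }
  }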

Game server

All players will be disconnected; this mode of operation allows cheaper operation and faster performance. The players will wait for the repair or play on another game server.

  • Works with a dead HDD if configured with a mirror to offload the datapack loading.

Resuming normal operation: repair the hardware; after the master reconnects, the players will come back.

Time

Some ARM boards don't have a real-time clock, which means that at NTP synchronization you can get a time jump of several years (e.g. switching from 2012 to 2016). CatchChallenger detects this and fixes it as well as it can.
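
One way such a jump can be detected, sketched here as an assumption rather than the exact CatchChallenger code: the monotonic clock cannot jump, so comparing its delta with the wall-clock delta exposes any step.

  #include <chrono>
  
  // Returns how far the wall clock was stepped since the previous call.
  // The monotonic clock advances at the same rate as the wall clock, so
  // any divergence between the two deltas is a step of the wall clock
  // (first NTP sync on an RTC-less board, dead clock battery, ...).
  std::chrono::seconds wallClockStep()
  {
      using namespace std::chrono;
      static system_clock::time_point lastWall   = system_clock::now();
      static steady_clock::time_point lastSteady = steady_clock::now();
      const auto wallDelta   = system_clock::now() - lastWall;
      const auto steadyDelta = steady_clock::now() - lastSteady;
      lastWall   = system_clock::now();
      lastSteady = steady_clock::now();
      return duration_cast<seconds>(wallDelta - steadyDelta);
  }

If the returned step is on the order of years, the server can shift its own stored timestamps by that offset rather than trusting the stepped clock.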

Other clocks return milliseconds where seconds were asked for, or seconds where milliseconds were asked for. After some time, CatchChallenger should detect and fix this.
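
A sketch of one plausible unit sanity check (a heuristic, not the actual detection code), usable for timestamps near the current epoch:

  #include <cstdint>
  #include <ctime>
  
  // A timestamp supposed to be in seconds that is ~1000x larger than the
  // current epoch time was really milliseconds; scale it back down. (The
  // symmetric case, a "milliseconds" value ~1000x too small, would
  // multiply by 1000 instead.)
  uint64_t normalizeToSeconds(const uint64_t timestamp)
  {
      const uint64_t now = static_cast<uint64_t>(time(nullptr));
      if (timestamp > now * 100) // far beyond any plausible value in seconds
          return timestamp / 1000;
      return timestamp;
  }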

Database

If the server gets disconnected from the database (network failure or database restart), it will loop, trying to reconnect to the database until it succeeds.
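
The loop itself is simple; here is a minimal sketch with a hypothetical attempt callback standing in for the real database driver call:

  #include <functional>
  #include <unistd.h>
  
  // Block until the connect attempt succeeds; 'attempt' stands in for the
  // real driver call (e.g. opening a PostgreSQL or MySQL connection).
  void reconnectToDatabase(const std::function<bool()> &attempt)
  {
      while (!attempt())
          sleep(1); // wait between retries instead of hammering the database
  }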