Multi-DC

This document describes the multi-datacenter design for GoAkt and how developers can use it. Each datacenter runs an independent GoAkt cluster. There is no global cluster state shared between datacenters. Cross-DC communication is DC-transparent to callers and relies on a DC control plane (registry) for routing.

Goals and Non-Goals

Goals

  • Isolation: Each datacenter remains a standalone GoAkt cluster.

  • DC-transparent messaging: Callers address actors/grains by logical identity; the runtime discovers which DC contains the target.

  • Control plane as source of truth: The DC control plane registry drives routing and placement.

  • DC-aware placement: SpawnOn with WithDataCenter spawns actors in a specific datacenter.

  • Liveness: Periodic heartbeats keep routing tables accurate.

Non-Goals

  • Global cluster membership or consensus across datacenters.

  • Transparent global actor identity with shared state.

  • Strong consistency across datacenters.

Quick Start

1. Choose a Control Plane

Use etcdarrow-up-right or NATS JetStreamarrow-up-right as the DC registry. One backend can serve both cluster discovery and the DC control plane (with different namespaces/keys per use).

2. Configure the Actor System

3. Start and Check Readiness

4. Cross-DC Messaging

Send messages to actors/grains without specifying a DC; the runtime discovers the target:

5. Spawn in a Specific Datacenter

Architecture

  • DC leader is the sole writer for its DataCenterRecord; followers do not run the controller.

  • Control plane stores records, endpoints, and liveness TTLs/leases.

  • Routing uses the control plane cache; local DC is tried first, then remote DCs in parallel.

Developer Guide

Endpoints Configuration

datacenter.Config.Endpoints lists remoting addresses (host:port) advertised to other DCs. Other DCs use these to discover and talk to actors/grains in this DC.

  • Single endpoint (e.g. leader only): All cross-DC traffic goes to one node.

  • Multiple endpoints: Cross-DC spawns and lookups are spread across nodes.

Example:

Endpoints must be reachable from other DCs.

Readiness and Cache Refresh

  • DataCenterReady(): Returns true when the DC controller is operational and the cache has been refreshed at least once. Use for readiness probes before cross-DC operations.

  • DataCenterLastRefresh(): Returns the time of the last successful cache refresh. Zero when multi-DC is disabled or the controller has never refreshed.

FailOnStaleCache

  • true (default): Cross-DC operations (e.g. SpawnOn with WithDataCenter) fail with ErrDataCenterStaleRecords if the cache is stale. Favours consistency.

  • false: Operations proceed with best-effort routing using a potentially stale cache. Favours availability.

DC-Aware Spawn (SpawnOn + WithDataCenter)

To spawn an actor in a different datacenter:

The runtime looks up the target DC by DataCenter.ID() (Zone+Region+Name) in the active records, then sends RemoteSpawn to one of that DC's Endpoints chosen at random.

Errors: ErrDataCenterNotReady, ErrDataCenterStaleRecords, ErrDataCenterRecordNotFound.

Control Plane Providers

Provider
Package
Notes

Etcd

datacenter/controlplane/etcd

Supports TLS, auth, watch

NATS JetStream

datacenter/controlplane/nats

Requires NATS with -js (JetStream)

Control Plane

Interface

  • Register: Create a DC record; returns ID and version.

  • Heartbeat: Renew lease; extends liveness.

  • SetState: Transition record state (e.g. ACTIVE β†’ DRAINING on shutdown).

  • ListActive: Return ACTIVE records with valid leases for routing.

  • Watch: Stream updates (optional; return ErrWatchNotSupported if unsupported).

  • Deregister: Remove record on shutdown for faster cleanup than lease expiry.

Data Model

DataCenterRecord:

Field
Description

ID

Stable identifier (Zone+Region+Name or Name)

DataCenter

Name, Region, Zone, Labels

Endpoints

Advertised remoting addresses

State

REGISTERED, ACTIVE, DRAINING, INACTIVE

LeaseExpiry

Liveness lease expiry

Version

Monotonic revision for conflict-free updates

State transitions:

  • REGISTERED β†’ ACTIVE: First heartbeat after start.

  • ACTIVE β†’ DRAINING: On actor system shutdown.

  • ACTIVE/DRAINING β†’ INACTIVE: TTL expiry or explicit disable.

Built-in Implementations

etcd

NATS JetStream

NATS must be started with JetStream: nats-server -js.

Contract Guarantees

  • Register, SetState, and Heartbeat are linearisable per record (CAS by Version).

  • Heartbeat extends the lease only when Version is current.

  • ListActive returns only ACTIVE records with unexpired leases.

  • Watch emits ordered updates (clients de-dupe by Version).

  • Deregister removes the record immediately for clean shutdown.

Configuration Reference

Field
Default
Description

ControlPlane

required

etcdarrow-up-right or NATSarrow-up-right control plane implementation

DataCenter

required

Name, Region, Zone, Labels

Endpoints

required

Remoting addresses (host:port)

HeartbeatInterval

10s

Heartbeat cadence

CacheRefreshInterval

10s

Cache refresh interval

MaxCacheStaleness

30s

Max age before cache is considered stale

LeaderCheckInterval

5s

Leadership recheck interval

RequestTimeout

5s

Control plane operation timeout

WatchEnabled

true

Use Watch when supported

FailOnStaleCache

true

Fail cross-DC ops when cache is stale

JitterRatio

0.1

Jitter for intervals

MaxBackoff

30s

Max backoff on errors

Cluster Integration

Multi-DC is enabled by adding datacenter config to the cluster config:

Failure Handling

  • Leader failover: The new leader takes over heartbeats.

  • Registry outages: Local DC operation continues; cross-DC routing may use cached data.

  • Stale cache: Configurable via FailOnStaleCache (strict vs best-effort).

  • Control plane unavailable: Does not block local DC; only affects cross-DC operations.

Routing Flow (DC-Transparent Messaging)

  1. Runtime checks local datacenter via ActorOf.

  2. If not found locally, fetches active DCs from the control plane cache.

  3. Queries all active DC endpoints in parallel via RemoteLookup.

  4. Uses the first DC that returns a valid actor address.

  5. Cancels remaining lookups.

  6. Sends the message via remote messaging to the chosen endpoint.

Sequence: DC Registration

  1. DC leader starts and registers DataCenterRecord (metadata + endpoints).

  2. Leader sends heartbeats at HeartbeatInterval (less than TTL).

  3. Registry expires the record if heartbeats stop.

  4. On shutdown, leader transitions ACTIVE β†’ DRAINING β†’ INACTIVE and calls Deregister.

Last updated