WSRouter — Production Hardening Guide
Everything you need to know before pointing WSRouter at real traffic.
This guide is paired with the framework code that was hardened across
commits dbc912c (capacity exceptions + ServerRegistry GC + graceful sweep)
and b4a101e (backpressure + stats + per-room rate limit). It covers what is
automatic, what you need to configure, and what stays your responsibility.
1. Capacity sizing
The two OpenSwoole\Table segments WSRouter allocates have demo-grade
defaults that you MUST size up for production:
| Table | Default | What it caps |
|---|---|---|
| ws_owner | 4 096 rows | concurrent connections (cluster-wide) |
| ws_room_members | 16 384 rows | (room × member) pairs (cluster-wide) |
When full, Store::set() previously returned false silently and the
join was lost. Now it throws ZealPHP\WS\CapacityException (a typed
subclass of StoreException).
Recipe
WSRouter::initOptions(
    ownerCapacity: 200_000,              // 200k concurrent connections
    roomMembersCapacity: 1_000_000,      // 1M (room × member) pairs
    slowConsumerBytes: 8 * 1024 * 1024,  // 8 MB per-fd send queue cap
);
WSRouter::init();
Sizing math
RAM ≈ maxRows × (4 × Σ column_bytes + ~32 B/row overhead)
For ws_owner (columns: server_id STRING(192) + conn_id STRING(32) = 224 bytes nominal;
client_id is the row key, not a column):
- 1k rows → ~930 KB
- 100k rows → ~93 MB
- 1M rows → ~930 MB
For ws_room_members (columns: room STRING(64) + client_id STRING(128) + server_id STRING(192)
+ joined_at INT(8) = 392 bytes nominal; keyed by {room}:{client_id}):
- 100k rows → ~160 MB
- 1M rows → ~1.6 GB
These segments are allocated at master fork — the RAM is committed up front, not lazy. Size against peak, not average.
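The sizing rule above can be turned into a small calculator for capacity planning. This is a minimal sketch of the guide's rule of thumb only; the function and constant names are hypothetical and not part of WSRouter.

```php
<?php
/**
 * Estimate shared-memory bytes for one table using the rule of thumb above:
 * RAM ≈ maxRows × (4 × Σ column_bytes + ~32 B/row overhead).
 * Hypothetical helper, not a WSRouter API.
 */
function estimateTableBytes(int $maxRows, int $columnBytes, int $rowOverhead = 32): int
{
    return $maxRows * (4 * $columnBytes + $rowOverhead);
}

// Nominal column widths quoted in this section.
const WS_OWNER_COLUMN_BYTES = 192 + 32;                  // server_id + conn_id
const WS_ROOM_MEMBERS_COLUMN_BYTES = 64 + 128 + 192 + 8; // room + client_id + server_id + joined_at

// The capacities from the recipe above.
$ownerBytes = estimateTableBytes(200_000, WS_OWNER_COLUMN_BYTES);
$roomBytes  = estimateTableBytes(1_000_000, WS_ROOM_MEMBERS_COLUMN_BYTES);

printf("ws_owner:        %.1f MB\n", $ownerBytes / 1e6); // 185.6 MB
printf("ws_room_members: %.1f MB\n", $roomBytes / 1e6);  // 1600.0 MB
```

With the recipe's capacities, the two tables together commit roughly 1.8 GB at master fork, so check that figure against the host's memory before deploying.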
Handling CapacityException
$app->ws('/chat',
    onOpen: function ($server, $request) {
        try {
            WSRouter::own($request->get['user'] ?? '', $request->fd);
        } catch (\ZealPHP\WS\CapacityException $e) {
            // WSRouter::CLOSE_CAPACITY (4013) is the semantic capacity close code.
            // 1013 (CLOSE_TRY_AGAIN_LATER) is also valid as the standard "try again later" hint.
            $server->disconnect($request->fd, WSRouter::CLOSE_CAPACITY, 'server at capacity');
            return;
        }
    },
);
2. Stale-row cleanup — automatic
Cleanup happens on two paths:
| Path | When | What it does |
|---|---|---|
| App::onWorkerStop sweep | Graceful shutdown / reload | Worker 0 drops all ws_owner + ws_room_members rows owned by this server, plus its own ws_servers row |
| App::tick(60_000) GC | Periodic, automatic | Worker 0 scans ws_servers; rows older than 90 s (no heartbeat for 3× the 30 s interval) are reaped along with their dependent ws_owner + ws_room_members rows |
Hard-crash recovery: the framework writes a heartbeat row to
ws_servers every 30 s. If a node kernel-panics, its row goes stale; the
next GC tick reaps everything it owned. Cleanup is eventually
consistent within 90–150 s of a crash.
Nothing for you to configure. Available for inspection:
// Manually trigger a GC sweep (e.g. from an admin endpoint)
$reaped = \ZealPHP\WSRouter::runStaleServerGC();
// Returns count of rows reaped.
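The 150 s upper bound quoted above follows from the stale threshold plus one GC interval. The sketch below is illustrative arithmetic only; the constants restate the timings from this section, and the function names are hypothetical rather than WSRouter APIs.

```php
<?php
// Timing values taken from this section; the names are hypothetical.
const STALE_AFTER_S = 90; // a ws_servers row older than this is reaped
const GC_INTERVAL_S = 60; // App::tick(60_000)

/** True once a heartbeat timestamp is old enough for the GC to reap it. */
function isStale(int $lastHeartbeatAt, int $now): bool
{
    return ($now - $lastHeartbeatAt) > STALE_AFTER_S;
}

/**
 * Worst case, measured from the last heartbeat: the row turns stale just
 * after a GC tick has run, so it waits one more full GC interval.
 */
function worstCaseCleanupSeconds(): int
{
    return STALE_AFTER_S + GC_INTERVAL_S;
}

var_dump(isStale(1_000, 1_090)); // exactly 90 s old: not yet stale
var_dump(isStale(1_000, 1_091)); // 91 s old: stale
echo worstCaseCleanupSeconds(), " s\n";
```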
3. Heartbeat — server-side TCP health checks
The framework's stale-server GC handles server-level recovery. For individual dead connections (laptop closed, network dropped without a close frame), use OpenSwoole's built-in heartbeat:
$app->run([
    'heartbeat_check_interval' => 30,  // check every 30 s
    'heartbeat_idle_time' => 90,       // disconnect after 90 s idle
]);
OpenSwoole sweeps idle fds and closes them; your onClose handler fires
naturally, releasing local state. No application-level ping/pong needed
unless you want app-level liveness signals.
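If you also want client → server pings (to detect application-level hangs even when TCP says the connection is alive), have the client send a WebSocket ping frame every 25 s; the server responds with pong automatically at the protocol level, with no server-side app code. Browser JavaScript cannot send ping frames through the WebSocket API, so a browser client needs an application-level ping message instead.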
4. Authentication — how to wire PHP session auth into WebSockets
WebSocket upgrade requests are HTTP requests; they carry cookies. Read
the session cookie in onOpen, resolve to the authenticated user, and
use that as the trusted identity throughout the connection's lifetime.
Pattern: bridge HTTP session → WS identity
$app->ws('/chat',
    onOpen: function ($server, $request) {
        // 1. Read session cookie
        $sid = $request->cookie['PHPSESSID'] ?? '';
        if ($sid === '') {
            $server->disconnect($request->fd, WSRouter::CLOSE_AUTH_REQUIRED, 'unauthenticated');
            return;
        }
        // 2. Resolve to user via your session backend
        session_id($sid);
        session_start();
        $user = $_SESSION['user'] ?? null;
        if (!$user) {
            $server->disconnect($request->fd, WSRouter::CLOSE_AUTH_INVALID, 'session expired');
            return;
        }
        // 3. Bind the AUTHENTICATED identity to this fd and IGNORE whatever
        //    the client sends in subsequent frames. This is the trust boundary.
        try {
            WSRouter::own($user['username'], $request->fd);
        } catch (\ZealPHP\WS\CapacityException) {
            $server->disconnect($request->fd, WSRouter::CLOSE_CAPACITY, 'server at capacity');
            return;
        }
    },
    onMessage: function ($server, $frame) {
        // 4. Look up identity from the SERVER-SIDE map; never trust the frame
        $local = WSRouter::localFds();
        $username = null;
        foreach ($local as $cid => $row) {
            if ($row['fd'] === $frame->fd) { $username = $cid; break; }
        }
        if ($username === null) {
            $server->disconnect($frame->fd, WSRouter::CLOSE_AUTH_REQUIRED, 'fd not bound');
            return;
        }
        // Now safe to use $username as the trusted sender identity
    },
);
Why not just trust the frame's username?
The widget JS sends {"type":"join", "username":"alice"}. A malicious
client can send username: "bob" and impersonate Bob. The framework
intentionally doesn't enforce identity — it leaves auth to you. The
pattern above pins identity to the WS upgrade's session cookie, which is
HMAC-signed and tamper-resistant.
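The onMessage lookup in the pattern above scans WSRouter::localFds() on every frame. If that scan shows up in profiles, one option is a per-worker fd → identity map that you maintain yourself. The class below is a hypothetical helper, not part of WSRouter, and it assumes that onOpen, onMessage, and onClose for a given fd all run in the same worker process; verify that against your dispatch settings before relying on it.

```php
<?php
/**
 * Hypothetical per-worker fd → identity map (not a WSRouter API).
 * Holds only the identity that was authenticated at upgrade time.
 */
final class FdIdentityMap
{
    /** @var array<int, string> fd => authenticated client id */
    private array $byFd = [];

    /** Call from onOpen, after WSRouter::own() succeeds. */
    public function bind(int $fd, string $clientId): void
    {
        $this->byFd[$fd] = $clientId;
    }

    /** Call from onMessage; null means the fd was never authenticated. */
    public function lookup(int $fd): ?string
    {
        return $this->byFd[$fd] ?? null;
    }

    /** Call from onClose so the map does not grow without bound. */
    public function unbind(int $fd): void
    {
        unset($this->byFd[$fd]);
    }
}

$identities = new FdIdentityMap();
$identities->bind(7, 'alice');
var_dump($identities->lookup(7)); // the identity bound at upgrade time
var_dump($identities->lookup(8)); // unknown fd: treat as unauthenticated
```

The lookup is O(1) per frame, and the trust boundary stays the same: the identity still comes from the session resolved at upgrade time, never from the frame.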
Close codes
WSRouter ships named constants for all semantic codes. Use them instead of hardcoded integers so client code can react specifically:
| Constant | Code | Meaning |
|---|---|---|
| WSRouter::CLOSE_NORMAL | 1000 | Clean close |
| WSRouter::CLOSE_GOING_AWAY | 1001 | Server going down / page navigating away |
| WSRouter::CLOSE_PROTOCOL_ERROR | 1002 | Malformed frame |
| WSRouter::CLOSE_POLICY_VIOLATION | 1008 | Generic policy violation (RFC 6455) |
| WSRouter::CLOSE_MESSAGE_TOO_BIG | 1009 | Message too large |
| WSRouter::CLOSE_INTERNAL_ERROR | 1011 | Server internal error |
| WSRouter::CLOSE_TRY_AGAIN_LATER | 1013 | Server temporarily overloaded, retry (IANA-registered; not defined in RFC 6455 itself) |
| WSRouter::CLOSE_AUTH_REQUIRED | 4001 | Not authenticated; client should log in |
| WSRouter::CLOSE_AUTH_INVALID | 4002 | Bad token / expired session |
| WSRouter::CLOSE_FORBIDDEN | 4003 | Authenticated but not permitted for this room |
| WSRouter::CLOSE_CAPACITY | 4013 | Server at capacity (paired with CapacityException) |
| WSRouter::CLOSE_RATE_LIMITED | 4029 | Client exceeded per-client rate limit (WS-4) |
| WSRouter::CLOSE_IDLE | 4040 | Connection idle / heartbeat missed |
Example:
$server->disconnect($request->fd, WSRouter::CLOSE_AUTH_REQUIRED, 'unauthenticated');
$server->disconnect($request->fd, WSRouter::CLOSE_FORBIDDEN, 'not in room');
$server->disconnect($request->fd, WSRouter::CLOSE_CAPACITY, 'server at capacity');
5. Reconnect — client-side responsibility
The framework provides NO automatic rejoin. If a client's WebSocket drops and the JS reconnects, the client must re-send its join frames and replay any state it needs.
Pattern: client-side reconnect with re-join
class ChatClient {
  constructor(url, room, user) {
    this.url = url;
    this.room = room;
    this.user = user;
    this.attempts = 0;
    this.connect();
  }
  connect() {
    this.ws = new WebSocket(this.url);
    this.ws.onopen = () => {
      this.attempts = 0;
      // Re-join immediately on every (re)connect.
      this.ws.send(JSON.stringify({ type: 'join', room: this.room, username: this.user }));
    };
    this.ws.onmessage = (ev) => { /* handle message */ };
    this.ws.onclose = (ev) => {
      // Don't retry on auth failures (close codes 4001/4003 are app-level)
      if (ev.code === 4001 || ev.code === 4003) { return; }
      const backoff = Math.min(1000 * 2 ** this.attempts, 30_000); // exp backoff, max 30 s
      this.attempts++;
      setTimeout(() => this.connect(), backoff);
    };
    this.ws.onerror = () => { /* logged in onclose */ };
  }
  send(body) { this.ws.send(JSON.stringify({ type: 'message', body })); }
}
History replay
If your handler sends a "history" frame on join (as the chatroom lesson does), reconnect automatically gets the latest N messages — clients catch up. Pair with SQLite-backed message history (the Lesson 22 pattern); no server-side state needs to survive the disconnect.
For at-least-once delivery semantics during a reconnect window, switch
to Store::publishReliable (Redis Streams) — the consumer-group machinery
handles missed messages via XPENDING + XAUTOCLAIM.
6. Backpressure — automatic
Each push runs in its own coroutine + checks getClientInfo()['send_queue_bytes']
before sending. If the per-fd send queue exceeds slowConsumerBytes (default
4 MB), the message is dropped for that fd and counted in
stats()['pushes_dropped_slow_consumer'].
Why drop instead of disconnect: a temporarily slow client (mobile on 3G) recovers. Disconnecting them means full reconnect + re-join + history replay. Dropping a few frames preserves the connection.
If you want disconnect-instead-of-drop semantics (e.g. for trading
systems where lost frames are intolerable), check send_queue_bytes
yourself and call $server->disconnect($fd, WSRouter::CLOSE_TRY_AGAIN_LATER, 'slow consumer').
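The drop-versus-disconnect choice described above comes down to one comparison per push. The sketch below isolates that decision as a pure function so it can be unit-tested; the function and constant names are hypothetical, and the wiring shown in the trailing comment assumes the send_queue_bytes field that this section describes.

```php
<?php
// Hypothetical outcome labels, not WSRouter constants.
const PUSH_SEND = 'send';
const PUSH_DROP = 'drop';
const PUSH_DISCONNECT = 'disconnect';

/**
 * Decide what to do with one outgoing frame for one fd.
 *
 * @param int  $sendQueueBytes       bytes currently queued for this fd
 * @param int  $capBytes             the slow-consumer cap (slowConsumerBytes)
 * @param bool $disconnectOnOverflow true for "lost frames are intolerable" workloads
 */
function decidePush(int $sendQueueBytes, int $capBytes, bool $disconnectOnOverflow): string
{
    if ($sendQueueBytes <= $capBytes) {
        return PUSH_SEND; // queue is within the cap
    }
    return $disconnectOnOverflow ? PUSH_DISCONNECT : PUSH_DROP;
}

$cap = 4 * 1024 * 1024; // the 4 MB default quoted above
echo decidePush(1024, $cap, false), "\n";            // healthy client
echo decidePush(5 * 1024 * 1024, $cap, false), "\n"; // slow client, lenient mode
echo decidePush(5 * 1024 * 1024, $cap, true), "\n";  // slow client, strict mode

// Possible wiring inside your own push path (assumption, adapt to your code):
//   $queued = $server->getClientInfo($fd)['send_queue_bytes'] ?? 0;
//   if (decidePush($queued, $cap, true) === PUSH_DISCONNECT) {
//       $server->disconnect($fd, WSRouter::CLOSE_TRY_AGAIN_LATER, 'slow consumer');
//   }
```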
Tuning
WSRouter::initOptions(slowConsumerBytes: 16 * 1024 * 1024); // 16 MB before drop
Bigger = more lenient (mobile clients on flaky networks); smaller = faster drop of stuck consumers.
7. Metrics and introspection
Cluster-level counts — /healthz dashboard helpers
// Total connected clients across the cluster (O(1) — SCARD on Redis, Atomic on Table).
// Cheap enough to call on every /healthz request.
$total = WSRouter::onlineCount();
// Per-server breakdown — O(N_total_clients), groups ws_owner by server_id.
// Useful for debugging load-balancer imbalance; avoid calling on every request.
$byServer = WSRouter::onlineByServer();
// Returns: ['hostname1:1234' => 512, 'hostname2:5678' => 488, ...]
// This server's cluster-unique id (defaults to hostname:pid, set in WSRouter::init()).
$me = WSRouter::serverId();
// Fire-and-forget pub/sub publish to any channel (sugar over Store::publish).
$receivers = WSRouter::broadcast('alerts:global', json_encode($payload));
Paginated room roster for large rooms
Room::members() loads the full roster into memory (SSCAN loop). For rooms
with more than ~10 000 members, use Room::membersPaged():
$room = WSRouter::room('general');
$cursor = '0';
do {
    ['cursor' => $cursor, 'members' => $batch] = $room->membersPaged($cursor, 200);
    foreach ($batch as $clientId) { /* process */ }
} while ($cursor !== '0');
membersPaged(string $cursor = '0', int $count = 100) returns one SSCAN batch
(['cursor' => string, 'members' => list<string>]). Cursor '0' starts a
fresh scan; a returned cursor of '0' signals end-of-scan. On the Table
backend it falls back to the full roster in one batch (no incremental SSCAN
available there).
Per-worker stats — WSRouter::stats()
$app->route('/healthz/ws', fn() => WSRouter::stats()->snapshot());
Counters (per-worker, accumulate since boot):
| Counter | What |
|---|---|
| owns_total | successful WSRouter::own() calls |
| releases_total | successful WSRouter::release() calls |
| sendToClient_total | published cross-server sends |
| sendToClient_owner_missing | sends where the client wasn't connected anywhere |
| room_joins_total | $room->join() calls |
| room_leaves_total | $room->leave() calls |
| room_pushes_total | $room->push() calls accepted |
| pushes_total | actual fd-level $server->push() calls that succeeded |
| pushes_failed_total | $server->push() returned false (rare) |
| pushes_dropped_slow_consumer | dropped due to send-queue cap |
| pushes_to_dead_fd_total | push attempted on an fd that wasn't isEstablished |
| capacity_exceeded_owner_total | CapacityException thrown on own() |
| capacity_exceeded_room_total | CapacityException thrown on join() |
| rate_limit_drops_total | pushes dropped by the per-room rate limit |
| client_rate_limit_drops_total | pushes dropped by the per-client rate limit (WS-4) |
| hmac_verify_failures_total | pub/sub messages rejected due to bad/missing HMAC (WS-3) |
For cluster-wide metrics, scrape each server's /healthz/ws and sum.
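Summing the scraped snapshots is a plain fold over counter names. The sketch below assumes each snapshot decodes to a flat counter-name => integer array, which is an assumption about the shape of WSRouter::stats()->snapshot(); the function name is hypothetical, and fetching the snapshots over HTTP is left to your scraper.

```php
<?php
/**
 * Sum per-server counter snapshots into cluster-wide totals.
 * Hypothetical helper; assumes flat counter => int snapshots.
 *
 * @param list<array<string, mixed>> $snapshots one decoded /healthz/ws body per server
 * @return array<string, int>
 */
function sumSnapshots(array $snapshots): array
{
    $totals = [];
    foreach ($snapshots as $snapshot) {
        foreach ($snapshot as $counter => $value) {
            if (!is_int($value)) {
                continue; // skip anything that is not a plain counter
            }
            $totals[$counter] = ($totals[$counter] ?? 0) + $value;
        }
    }
    ksort($totals); // stable output order for dashboards and diffs
    return $totals;
}

$cluster = sumSnapshots([
    ['owns_total' => 512, 'pushes_total' => 9_000, 'pushes_dropped_slow_consumer' => 3],
    ['owns_total' => 488, 'pushes_total' => 8_500],
]);
print_r($cluster);
```

Counters missing from one server are treated as zero, so servers running slightly different builds can still be summed.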
8. Rate limiting — per room and per client
Per-room cap
// 100 pushes / 60 s / room
WSRouter::setRoomRateLimit(100, 60);
// Disable
WSRouter::setRoomRateLimit(0);
Sliding-window via Counter::increment keyed by room:bucket-id. Old
buckets age out naturally (no explicit deletion). Over-rate pushes
silently drop + count in rate_limit_drops_total.
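The bucket-keyed counting described above can be sketched in a few lines. This is a simplified illustration, not the framework's implementation: an in-memory array stands in for Counter::increment, the class name is hypothetical, and the sketch counts a single bucket per window, so the framework's actual windowing may differ at bucket boundaries.

```php
<?php
/**
 * Hypothetical bucketed rate limiter keyed by "room:bucket-id".
 * The in-memory array is a stand-in for a shared counter store.
 */
final class BucketRateLimiter
{
    /** @var array<string, int> "room:bucketId" => pushes seen in that bucket */
    private array $counts = [];

    public function __construct(
        private int $limit,
        private int $windowSeconds = 60,
    ) {
    }

    /** Returns true if the push is within the limit, false if it should drop. */
    public function allow(string $room, int $now): bool
    {
        if ($this->limit <= 0 || $this->windowSeconds <= 0) {
            return true; // a limit of 0 disables the check
        }
        // All timestamps in the same window map to the same bucket id.
        $bucketId = intdiv($now, $this->windowSeconds);
        $key = $room . ':' . $bucketId;
        $this->counts[$key] = ($this->counts[$key] ?? 0) + 1;
        return $this->counts[$key] <= $this->limit;
    }
}

$limiter = new BucketRateLimiter(limit: 2, windowSeconds: 60);
var_dump($limiter->allow('general', 0));  // first push in the bucket
var_dump($limiter->allow('general', 1));  // second push, still allowed
var_dump($limiter->allow('general', 2));  // third push, over the limit
var_dump($limiter->allow('general', 60)); // next bucket, counter starts fresh
```

A new bucket id means a new key, which is why no explicit deletion is needed for the limit to reset. Unlike this sketch, a real store would also expire old keys so memory does not grow.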
Per-client cap (WS-4)
Per-client rate limiting is a first-class API — no hand-rolling needed:
// 50 messages / 10 s / client
// (applies to sendToClient always, and to Room::push only when a
// $fromClientId is attributed; server-originated broadcasts are not
// per-client-capped)
WSRouter::setClientRateLimit(50, 10);
When a client exceeds the limit, the push is dropped and counted in
client_rate_limit_drops_total. If you want to disconnect the client
instead of silently dropping, check the stat in your onMessage handler
and call $server->disconnect($frame->fd, WSRouter::CLOSE_RATE_LIMITED, 'rate limited')
(close code 4029).
// Disable
WSRouter::setClientRateLimit(0);
9. Ordering — what's guaranteed, what's not
| Guarantee | Status |
|---|---|
| $server->push() to the same fd preserves order | YES: OpenSwoole serializes the per-fd send queue |
| Pub/sub delivery preserves publisher's order on the channel | YES: Redis guarantees this |
| Cross-WORKER ordering of pushes from the same publish event | NO: subscribed workers receive the message in parallel; their pushes can interleave at the receiver |
| Cross-PUBLISH ordering of rapid-fire messages | NO: Redis pub/sub delivers in publish order BUT each worker schedules independently |
Mitigation: include a timestamp / seq in every message
$room->push([
    'type' => 'message',
    'body' => 'hello',
    'ts' => microtime(true),  // high-resolution timestamp
]);
Client reorders by ts before display. Two messages within a few µs
are effectively concurrent; users won't notice if displayed in
arrival-order. For trading / commit-log workloads where strict total
ordering matters, use Store::publishReliable (Redis Streams) — Streams
have a global monotonic message id (XADD returns it).
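The reorder step is a sort on ts with an optional tie-breaker. The sketch below shows it in PHP for consistency with the rest of this guide; a browser client would apply the same comparison in JavaScript. The seq field and the function name are hypothetical, and messages are assumed to be decoded arrays carrying at least ts.

```php
<?php
/**
 * Order buffered messages for display by (ts, seq).
 * Hypothetical helper; 'seq' is an optional tie-breaker you would add yourself.
 *
 * @param list<array<string, mixed>> $messages
 * @return list<array<string, mixed>>
 */
function orderForDisplay(array $messages): array
{
    usort($messages, static function (array $a, array $b): int {
        // Compare timestamps first, then sequence numbers for equal timestamps.
        return [$a['ts'], $a['seq'] ?? 0] <=> [$b['ts'], $b['seq'] ?? 0];
    });
    return $messages;
}

$ordered = orderForDisplay([
    ['body' => 'third',  'ts' => 1700000000.30, 'seq' => 1],
    ['body' => 'first',  'ts' => 1700000000.10, 'seq' => 1],
    ['body' => 'second', 'ts' => 1700000000.10, 'seq' => 2],
]);
foreach ($ordered as $message) {
    echo $message['body'], "\n";
}
```

Timestamps from different servers are only as comparable as those servers' clocks, so treat ts ordering as a display aid rather than a strict total order.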
10. Trust model
WSRouter assumes the Redis network is trusted — anyone who can
PUBLISH on ws:room:* or write to ws_owner can spoof messages /
membership. Same threat model as TieredBackend invalidation (C2 in the
hardening pass).
If your Redis is NOT in a trusted network (e.g. shared Redis-as-a-service across tenants), enable the built-in channel HMAC (WS-3):
// Set once at boot, before App::run().
WSRouter::setChannelHmacSecret(getenv('ZEALPHP_WS_HMAC') ?: null);
When a secret is set, every sendToClient() and Room::push() publish
wraps the payload in a signed {"v":1,"hmac":"...","payload":"..."} envelope.
The receiving worker verifies the HMAC before delivering the message; bad or
missing signatures are silently dropped and counted in
hmac_verify_failures_total (visible via WSRouter::stats()). Without a
secret the channel is unauthenticated (trusted-network default).
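To make the envelope concrete, here is a minimal sketch of signing and verifying a payload in the {"v":1,"hmac":"...","payload":"..."} shape described above. The function names, the choice of HMAC-SHA256, and the hex encoding are all assumptions for illustration; the framework's actual wire format may differ.

```php
<?php
/** Wrap a payload in a signed envelope (hypothetical helper, assumes HMAC-SHA256). */
function signEnvelope(string $payload, string $secret): string
{
    return json_encode([
        'v'       => 1,
        'hmac'    => hash_hmac('sha256', $payload, $secret),
        'payload' => $payload,
    ], JSON_THROW_ON_ERROR);
}

/** Return the payload if the signature verifies, or null to drop the message. */
function openEnvelope(string $envelope, string $secret): ?string
{
    $data = json_decode($envelope, true);
    if (!is_array($data)
        || ($data['v'] ?? null) !== 1
        || !is_string($data['hmac'] ?? null)
        || !is_string($data['payload'] ?? null)
    ) {
        return null; // malformed or unsigned message
    }
    $expected = hash_hmac('sha256', $data['payload'], $secret);
    // hash_equals compares in constant time, avoiding a timing side channel.
    return hash_equals($expected, $data['hmac']) ? $data['payload'] : null;
}

$envelope = signEnvelope('{"type":"message","body":"hello"}', 'example-secret');
var_dump(openEnvelope($envelope, 'example-secret')); // the original payload
var_dump(openEnvelope($envelope, 'wrong-secret'));   // NULL: signature mismatch
```

A publisher without the secret cannot produce an envelope that verifies, which is what lets receiving workers drop spoofed messages on an untrusted Redis.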
11. Production checklist
Before pointing real traffic at WSRouter:
- WSRouter::initOptions(ownerCapacity, roomMembersCapacity, slowConsumerBytes) sized for peak
- App::run(['heartbeat_check_interval' => 30, 'heartbeat_idle_time' => 90])
- onOpen reads session cookie → resolves identity → WSRouter::own($trustedId, $fd)
- onMessage looks up identity from WSRouter::localFds(), never trusts frame
- Client-side reconnect logic with exponential backoff + 4001/4003 stop conditions
- WSRouter::setRoomRateLimit($n, $window) for per-room spam protection
- WSRouter::setClientRateLimit($n, $window) for per-client spam protection (WS-4)
- Expose /healthz/ws returning WSRouter::stats()->snapshot()
- Store::defaultBackend(Store::BACKEND_REDIS) if you want cross-server federation
- If using Redis: capacity for ws_owner / ws_room_members keys; monitor Redis memory
- If using Table backend: capacity sized against OpenSwoole\Table RAM budget at master fork
- Document close codes your app uses; use WSRouter::CLOSE_* constants (4001 unauth, 4002 bad token, 4003 forbidden, 4013 capacity, 4029 rate-limited, 4040 idle)
- Load-test at expected peak: connections × broadcasts/sec × room sizes
- Decide ordering strategy: ts field + client reorder OR publishReliable Streams
- Trust model: is your Redis network trusted? If not, set WSRouter::setChannelHmacSecret(getenv('ZEALPHP_WS_HMAC') ?: null) (WS-3)
- Sweep test: kill -9 a server; verify the GC reaps its rows within 150 s
The first 8 items are configuration / wiring — under 50 LOC total. The last 5 are operational tests + decisions.
What WSRouter does NOT do (you own these)
| Concern | How you handle it |
|---|---|
| Identity / auth | You read session/cookie in onOpen |
| ACL on join | You check "can alice join chat:42?" before $room->join() |
| Message persistence | Use SQLite (Lesson 22 pattern) or your DB |
| Cross-region federation | Redis pub/sub is in-cluster; for multi-region run a Redis replica per region + dedicated pub/sub bus |
| At-least-once delivery | Use Store::publishReliable (Streams) instead of Store::publish |
| Client SDK | Ship your own; the JS in _chatroom_widget.php is the reference shape |
The framework gives you the **transport + cluster-wide state + recovery
+ observability**. The semantics on top are yours.