User Swarm Isolation

Before You Change Anything

These pages often point at shared systems. Confirm the cluster, namespace, and ownership boundary before running mutating commands.

Each user gets a separate runtime.

In plain language, those runtimes still live together inside shared cluster infrastructure. Isolation comes from private routing, ownership checks, and internal-only access, not from a separate namespace for every user.

This page answers one practical question: "how is one user's runtime kept separate from another user's runtime?"

UserSwarm is the cluster record we use to track one user's runtime.

The actual StatefulSets, Services, and PVCs are created in the shared userswarms namespace. Swarm pods are never directly reachable from the internet.

Why Swarm Runtimes Are Not Publicly Reachable

ZeroClaw pods have no public HTTPRoute. They are not exposed through the Envoy Gateway or any Ingress resource. There is no direct public path from the internet to a swarm pod.

If those terms are unfamiliar, the important point is simpler than the Kubernetes vocabulary: swarm runtimes are private internal services, not public apps.

Shared Namespace, Private Services

Swarm pods are reached through per-swarm internal Services in the shared userswarms namespace.

Those services are ClusterIP services, which means they are only reachable from inside the cluster.

The current webhook-managed runtime does not create a per-swarm NetworkPolicy.

That means the current model is "private by topology and backend mediation," not "hard isolated by one namespace and one network policy per user."

Isolation today comes from:

  • No public HTTPRoute or ingress to the runtime services
  • Per-workspace UserSwarm ownership and deterministic service naming
  • The orchestrator resolving the target service from the authenticated workspace
  • Cluster-internal DNS and service discovery instead of public exposure
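The second and third bullets can be sketched together: the backend derives the target service from the authenticated workspace, never from a client-supplied address. This is a minimal illustration; the `swarm-{workspace_id}` naming scheme and the port are assumptions for the example, not the real naming convention.

```python
# Hypothetical sketch of deterministic service naming and target
# resolution. Only the shared "userswarms" namespace is taken from
# this page; the service-name prefix and port are made up.

NAMESPACE = "userswarms"

def swarm_service_dns(workspace_id: str) -> str:
    """Deterministic cluster-internal DNS name for a workspace's swarm."""
    service = f"swarm-{workspace_id}"  # per-swarm ClusterIP Service
    return f"{service}.{NAMESPACE}.svc.cluster.local"

def resolve_target(authenticated_workspace_id: str, port: int = 8080) -> str:
    """Derive the proxy target from the *authenticated* workspace,
    never from anything the client supplies directly."""
    return f"http://{swarm_service_dns(authenticated_workspace_id)}:{port}"
```

Because the name is a pure function of the workspace, the orchestrator needs no per-swarm routing table: ownership checks plus deterministic naming are enough to find the right private endpoint.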

Orchestrator Proxy Model

All swarm traffic is proxied through the orchestrator.

The orchestrator is the component that holds the mapping between users, workspaces, and swarm endpoints.

That ensures:

  • Authentication and authorization happen before any swarm access
  • The backend can enforce rate limits, billing controls, and audit logging
  • Cross-user agent traffic is mediated by the backend rather than by public runtime addresses
Step 1: The app sends a message

The mobile app calls /v1/workspaces/{id}/conversations/{id}/messages.

Step 2: The orchestrator checks identity

It authenticates the user and resolves the workspace to the correct swarm service.

Step 3: The request stays private

The orchestrator proxies traffic to the swarm's ClusterIP service inside the cluster.

Step 4: The result comes back

The swarm processes the request, the orchestrator receives the response, and the app gets the final result.
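The four steps above can be sketched as one handler. Everything here is a stand-in: the token table, workspace store, and `forward` helper are hypothetical placeholders for the orchestrator's real auth layer and in-cluster HTTP hop.

```python
# Runnable sketch of the proxy flow, with in-memory stubs standing in
# for the real auth and forwarding layers. All names are illustrative.

from dataclasses import dataclass

@dataclass
class User:
    id: str

@dataclass
class Workspace:
    id: str
    owner_id: str

TOKENS = {"token-alice": User("alice")}            # stub auth table
WORKSPACES = {"ws-1": Workspace("ws-1", "alice")}  # stub workspace store

class Forbidden(Exception):
    pass

def forward(target: str, body: str) -> str:
    # Stand-in for the in-cluster HTTP hop to the ClusterIP service.
    return f"processed({body}) via {target}"

def handle_message(token: str, workspace_id: str, body: str) -> str:
    user = TOKENS[token]                    # Step 2: authenticate
    ws = WORKSPACES[workspace_id]
    if ws.owner_id != user.id:              # ownership check before routing
        raise Forbidden("workspace not owned by caller")
    target = f"http://swarm-{ws.id}.userswarms.svc.cluster.local:8080"
    return forward(target, body)            # Steps 3-4: proxy and return
```

The key property: the swarm address never appears in the request. It is computed server-side after authentication, so a client cannot point the proxy at another user's runtime.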

Lifecycle Cleanup

When a workspace is deleted, cleanup should remove the matching runtime and its supporting resources.

There are multiple layers because cluster cleanup can fail halfway through, and we do not want orphaned runtimes left behind.

The current 3-layer defense is:

| Layer | Mechanism | Description |
| --- | --- | --- |
| 1 | Orchestrator | Deletes the UserSwarm CR when a workspace is deleted |
| 2 | Metacontroller + webhook finalize hook | Deletes the StatefulSet, Services, PVC, ConfigMap, and ServiceAccount |
| 3 | Reaper | Periodic background job that finds orphaned cluster-scoped UserSwarm CRs and deletes them |
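Layer 3's core loop is simple to state: delete every UserSwarm CR whose owning workspace no longer exists. A hedged sketch, assuming the CRs carry a workspace ID and that the caller supplies the live-workspace set and a delete callback (the real job would talk to the Kubernetes API and the workspace database):

```python
# Illustrative reaper sweep. The CR shape ({"name", "workspace_id"})
# and the injected delete callback are assumptions for this sketch.

def reap_orphans(userswarm_crs, live_workspace_ids, delete_cr):
    """One sweep: delete every CR without a live owner.
    Returns the names of the CRs that were reaped."""
    reaped = []
    for cr in userswarm_crs:
        if cr["workspace_id"] not in live_workspace_ids:
            delete_cr(cr["name"])
            reaped.append(cr["name"])
    return reaped
```

Because the sweep is idempotent, it is safe to run periodically even when Layers 1 and 2 succeeded: a healthy CR with a live owner is never touched.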

Deletion Flow

Step 1: Delete the workspace

The user deletes the workspace through the API.

Step 2: Remove the UserSwarm record

The orchestrator deletes the cluster-scoped UserSwarm custom resource.

Step 3: Run final cleanup

Metacontroller calls the finalize hook and tears down the runtime children in userswarms.

Step 4: Catch leftovers

If a CR is left behind without a live owner, the reaper removes it on the next sweep.

The backend should wait for swarm verified=true, not just pod readiness, before routing user traffic.
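One way to express that gating rule as code: poll the UserSwarm status and only return once it reports verified=true. The `get_userswarm_status` accessor is hypothetical; in practice this would read the CR's status subresource.

```python
# Sketch of gating traffic on verification rather than pod readiness
# alone. The status accessor and field names are assumptions.

import time

def wait_until_verified(get_userswarm_status, timeout_s=120.0, poll_s=2.0):
    """Block until the status reports verified=true, or raise TimeoutError.
    Pod readiness is not enough: a pod can be Ready before the runtime
    has finished its own verification handshake."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_userswarm_status()
        if status.get("verified") is True:
            return status
        time.sleep(poll_s)
    raise TimeoutError("swarm never reported verified=true")
```

Routing code would call this after swarm creation and before proxying the first user request, so a Ready-but-unverified pod never receives traffic.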

🔗 Terms On This Page

If a term below is unfamiliar, open its glossary entry. For the full list, go to Internal Glossary.

  • UserSwarm: The Crawbl custom resource that represents one user runtime and its lifecycle.
  • HTTPRoute: The routing rule that tells the gateway which hostname and path should reach which service.
  • ClusterIP Service: A Kubernetes service that is reachable only from inside the cluster.
  • StatefulSet: The Kubernetes workload type used when pods need stable identities and persistent storage.
  • PVC: A PersistentVolumeClaim, which requests persistent storage for a workload.
  • Metacontroller: A controller framework used to create and clean up user runtime resources from custom resources.