Architecture Case Study


Slack: The Hidden Complexity of Real-Time Messaging at Enterprise Scale

How Slack architects real-time messaging for millions of concurrent users while maintaining search, delivery guarantees, and the illusion of simplicity.

Real-Time Systems · WebSockets · Search Infrastructure · Fan-Out Architecture

The System

Slack handles over 750,000 organizations, millions of concurrent connected users, and billions of messages per day. Each connected user maintains a persistent WebSocket connection to Slack's servers, receiving real-time updates for every channel they're a member of. When a message appears instantly in Slack without a refresh, that's a real-time push through a WebSocket, not a polling request.

But beneath the clean, simple chat interface lies an extraordinary amount of architectural complexity. Slack isn't just a messaging app: it's a real-time search engine, a file storage system, a notification router, an integration platform (with 2,600+ apps), and an audit log, all operating simultaneously with sub-second latency.

The Constraints

1. Real-time delivery with persistence. Messages must appear instantly for online users and be stored permanently for offline users and search. This dual requirement (ephemeral delivery plus durable storage) is the fundamental tension in Slack's architecture.

2. Channel cardinality explosion. A single enterprise organization might have 50,000 channels. A single user might be a member of 500 channels. When a message is posted to a channel with 10,000 members, 10,000 WebSocket connections need to be notified: the "fan-out" problem.

3. Search must be fast and accurate. Enterprise users search their message history frequently. Slack must index every message in near-real-time and serve search results quickly across potentially millions of messages. Full-text search at this scale is a hard infrastructure problem.

4. Ordering guarantees. Messages within a channel must be displayed in a consistent order for all users. In a distributed system with multiple servers handling writes, ensuring global ordering is non-trivial.

5. Enterprise compliance. Large organizations require message retention policies, audit logs, e-discovery support, and data residency guarantees. The architecture must support these enterprise requirements without degrading the consumer-grade user experience.

The Architecture

The Connection Layer โ€” Gateway Servers

Each connected client maintains a persistent WebSocket connection to a Gateway Server. The gateway is responsible for:

  • Authenticating the connection
  • Managing per-user channel subscriptions
  • Receiving real-time events from the message bus and pushing them to the correct WebSocket connections
  • Handling connection lifecycle (heartbeats, reconnection, graceful disconnects)

Gateway servers are stateful: each one holds the WebSocket connections for thousands of users. This statefulness is a deliberate trade-off: it allows efficient fan-out (the gateway knows exactly which connections care about a given channel) but complicates scaling and failover.
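The gateway's statefulness can be sketched as an in-memory index from channel to locally connected users, so a channel event fans out without scanning every connection. This is a minimal illustration; the class and method names are invented for the sketch and are not Slack's real API, and a plain list stands in for a WebSocket.

```python
from collections import defaultdict

class GatewayServer:
    """Sketch of a stateful gateway: it holds each user's connection and a
    channel -> local-subscribers index, so events route without a full scan."""

    def __init__(self):
        self.connections = {}                    # user_id -> "connection" (a list of delivered events)
        self.channel_members = defaultdict(set)  # channel_id -> locally connected user_ids

    def connect(self, user_id, channels):
        self.connections[user_id] = []           # stand-in for a WebSocket
        for ch in channels:
            self.channel_members[ch].add(user_id)

    def disconnect(self, user_id):
        self.connections.pop(user_id, None)
        for members in self.channel_members.values():
            members.discard(user_id)

    def push_event(self, channel_id, event):
        # Fan out to every local connection subscribed to this channel.
        for user_id in self.channel_members.get(channel_id, ()):
            self.connections[user_id].append(event)

gw = GatewayServer()
gw.connect("alice", ["#general", "#eng"])
gw.connect("bob", ["#general"])
gw.push_event("#eng", {"text": "deploy done"})
print(gw.connections["alice"])  # [{'text': 'deploy done'}]
print(gw.connections["bob"])    # []
```

Because this index lives only in the gateway's memory, a crash drops it along with the connections, which is exactly the failover cost the trade-off above describes.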

The Message Flow

When a user sends a message, the flow is:

1. Client sends message via WebSocket → Gateway Server
2. Gateway Server → Message Service (write to durable storage)
3. Message Service → writes to MySQL (source of truth)
4. Message Service → publishes event to Message Bus
5. Message Bus → all Gateway Servers with subscribers
6. Each Gateway Server → pushes to relevant WebSocket connections
7. Message Service → sends to Search Indexer (async)
8. Search Indexer → updates Elasticsearch/Solr index

The critical insight is the separation between the write path (steps 1-3, which must be durable and ordered) and the notification path (steps 4-6, which must be fast but can tolerate brief delays).
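The write/notify split can be modeled in a few lines: the message is acknowledged once the durable write succeeds, and the bus publish and search-index feed happen afterwards and may lag. Everything here (the `MessageService` class, the lists standing in for MySQL, the bus, and the indexer queue) is an illustrative assumption, not Slack's actual code.

```python
import itertools

class MessageService:
    """Toy model of the write path vs. notification path separation."""

    def __init__(self):
        self._seq = itertools.count(1)
        self.storage = []      # stand-in for MySQL (source of truth, steps 1-3)
        self.bus = []          # stand-in for the message bus (steps 4-6)
        self.index_queue = []  # stand-in for the async search-indexer feed (steps 7-8)

    def send(self, channel, text):
        msg = {"id": next(self._seq), "channel": channel, "text": text}
        self.storage.append(msg)          # durable write: must succeed before the ack
        ack = {"ok": True, "id": msg["id"]}  # the message is "sent" at this point
        self.bus.append(msg)              # notification path: fast, tolerates brief delay
        self.index_queue.append(msg)      # search path: async, eventually consistent
        return ack

svc = MessageService()
ack = svc.send("#general", "hello")
print(ack)  # {'ok': True, 'id': 1}
```

The ordering in `send` encodes the guarantee: if the process dies after the storage write but before the publish, the message is still durable and can be re-delivered, whereas the reverse order could notify users about a message that was never persisted.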

The Fan-Out Problem

The fan-out problem is Slack's most interesting architectural challenge. When a message is posted to #general in a 10,000-person organization, the system must:

  1. Determine which of the 10,000 members are currently connected
  2. Determine which gateway server each connected member is connected to
  3. Send the message event to each relevant gateway server
  4. Have each gateway server push the event to the correct WebSocket connections

Slack addresses this with a channel-to-gateway mapping maintained in a distributed cache. When a user connects to a gateway, the gateway registers that user's channel subscriptions in the mapping. When a message arrives, the message bus uses this mapping to route the event only to the gateway servers that have interested clients.
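The channel-to-gateway mapping described above can be sketched as a dictionary keyed by channel (in production it would be a distributed cache; a plain dict is an assumption of this sketch, as are all the names). Gateways register subscriptions on connect, and the bus consults the map so an event reaches only gateways with interested clients.

```python
from collections import defaultdict

class ChannelGatewayMap:
    """Sketch of the channel -> gateway-servers routing map."""

    def __init__(self):
        self.map = defaultdict(set)  # channel_id -> set of gateway_ids

    def register(self, gateway_id, channels):
        # Called when a gateway accepts a user whose subscriptions include `channels`.
        for ch in channels:
            self.map[ch].add(gateway_id)

    def gateways_for(self, channel_id):
        # The message bus routes a channel event only to these gateways.
        return self.map.get(channel_id, set())

mapping = ChannelGatewayMap()
mapping.register("gw-1", ["#general", "#eng"])  # a user on gw-1 is in both channels
mapping.register("gw-2", ["#general"])          # a user on gw-2 is only in #general
print(sorted(mapping.gateways_for("#general")))  # ['gw-1', 'gw-2']
print(sorted(mapping.gateways_for("#eng")))      # ['gw-1']
```

The payoff is that fan-out cost scales with the number of *gateways* holding subscribers, not the raw member count: a 10,000-member channel whose online users sit on 40 gateways needs 40 bus deliveries, with each gateway handling its own local WebSocket pushes.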

MySQL as the Source of Truth

In a surprising architectural choice for a system at this scale, Slack uses MySQL as its primary message store: not Cassandra, not DynamoDB, not a custom storage engine. Messages are sharded by workspace (organization), with each workspace's messages living on a specific MySQL shard.

This choice reflects a pragmatic philosophy:

  • MySQL is well-understood, well-tooled, and battle-tested
  • Workspace-level sharding provides natural data isolation for enterprise customers
  • ACID transactions within a shard simplify ordering guarantees
  • The team had deep MySQL expertise and could optimize it aggressively

The trade-off: cross-workspace queries are impossible at the database level. Search and analytics must be handled by separate systems (Elasticsearch) that aggregate data from multiple shards.
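Workspace-level sharding comes down to a stable routing function: every message for a workspace hashes to the same shard, which is what keeps a tenant's data together and lets per-shard ACID transactions order its messages. The scheme below is a generic sketch under assumed names (`NUM_SHARDS`, `shard_for_workspace`), not Slack's actual routing.

```python
import hashlib

NUM_SHARDS = 8  # illustrative; a real deployment runs far more shards

def shard_for_workspace(workspace_id: str) -> int:
    """Map a workspace id to a MySQL shard via a stable hash.

    Stability is the point: the same workspace always lands on the same
    shard, so all of its messages share one database and one transaction
    domain. The flip side is the trade-off above: a query spanning two
    workspaces spans two databases, so it can't be done in SQL alone.
    """
    digest = hashlib.sha256(workspace_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Same workspace, same shard, every time:
print(shard_for_workspace("acme-corp") == shard_for_workspace("acme-corp"))  # True
```

Note that simple modulo hashing makes resharding painful (changing `NUM_SHARDS` moves most workspaces); production systems typically layer a lookup table or consistent hashing on top for that reason.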

Search Architecture

Slack's search is powered by a separate indexing pipeline:

  1. Messages are written to MySQL (source of truth)
  2. A change data capture (CDC) pipeline streams new messages to the search indexer
  3. The indexer processes messages (tokenization, stemming, permission filtering) and writes to the search cluster
  4. Search queries hit the search cluster directly, with permission checks applied at query time

The search index is eventually consistent with the message store: a message might take 1-5 seconds to become searchable after being sent. This is an acceptable trade-off: users expect instant delivery but tolerate a brief delay in searchability.
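The indexer's job (step 3) and query-time permission filtering (step 4) can be illustrated with a toy inverted index. This is a deliberate simplification: real tokenization, stemming, and ACL models are far richer, and the class below is an invented stand-in for the search cluster, not Slack's implementation.

```python
from collections import defaultdict

class SearchIndex:
    """Toy inverted index fed asynchronously from the message store."""

    def __init__(self):
        self.postings = defaultdict(list)  # token -> messages containing it

    def index(self, msg):
        # Trivial "tokenization": lowercase and split on whitespace.
        for token in msg["text"].lower().split():
            self.postings[token].append(msg)

    def search(self, token, user_channels):
        # Permission check applied at query time: a user only sees hits
        # from channels they belong to, even though the index holds everything.
        return [m for m in self.postings.get(token.lower(), [])
                if m["channel"] in user_channels]

idx = SearchIndex()
idx.index({"channel": "#eng", "text": "Deploy finished"})
idx.index({"channel": "#secret", "text": "deploy key rotated"})
hits = idx.search("Deploy", user_channels={"#eng"})
print([m["channel"] for m in hits])  # ['#eng']
```

Filtering at query time rather than index time means a user's channel membership can change without reindexing their history, at the cost of doing the ACL check on every search.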

Notification Routing

Notifications add another layer of complexity. When a message is posted, Slack must determine:

  • Who needs a push notification (mobile)?
  • Who needs a desktop notification?
  • Who needs a badge count update?
  • Who has Do Not Disturb enabled?
  • Who has custom notification settings for this channel?

This is handled by a Notification Service that consumes the same message bus events as the gateway servers. The Notification Service evaluates each user's preferences and routes notifications to the appropriate delivery channels (APNs for iOS, FCM for Android, desktop notification APIs).
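The preference checklist above amounts to a per-user decision function over the same bus event. The sketch below assumes a hypothetical preference shape (`dnd`, `per_channel`, `mobile`, `desktop` fields); real notification models are considerably richer, and none of these names come from Slack's API.

```python
def route_notification(user_prefs, message):
    """Decide which delivery channels get this message for one user."""
    if user_prefs.get("dnd"):
        return []  # Do Not Disturb suppresses everything
    # Per-channel override: "all" (default), "mentions", or "muted".
    setting = user_prefs.get("per_channel", {}).get(message["channel"], "all")
    if setting == "muted":
        return []
    if setting == "mentions" and user_prefs["user"] not in message.get("mentions", []):
        return []
    targets = []
    if user_prefs.get("mobile"):
        targets.append("push")      # delivered via APNs (iOS) or FCM (Android)
    if user_prefs.get("desktop"):
        targets.append("desktop")   # desktop notification APIs
    targets.append("badge")         # badge counts update in every non-muted case
    return targets

prefs = {"user": "alice", "mobile": True, "desktop": False,
         "per_channel": {"#noisy": "mentions"}}
msg = {"channel": "#noisy", "mentions": ["alice"]}
print(route_notification(prefs, msg))  # ['push', 'badge']
```

Running this logic in a separate Notification Service, as the text describes, keeps the expensive per-user preference evaluation off the latency-critical WebSocket push path.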

The Trade-offs

Gained:

  • Real-time experience: Messages appear instantly for all connected users, creating a sense of presence and immediacy.
  • Searchability: The full history of every conversation is searchable, making Slack a knowledge base as well as a communication tool.
  • Enterprise features: Workspace-level sharding enables data isolation, compliance controls, and data residency requirements.
  • Integration ecosystem: The message bus architecture makes it natural to add integrations: bots, apps, and webhooks are just additional consumers of the same event stream.

Sacrificed:

  • Stateful connections: WebSocket gateway servers are stateful, which complicates horizontal scaling, deployment (connections must be drained during deploys), and failover (if a gateway crashes, all its connections drop).
  • Eventual consistency in search: Messages are immediately visible in the channel but take seconds to become searchable. This is usually acceptable but occasionally confusing.
  • Fan-out cost: Large channels with thousands of members are expensive to serve. Each message requires thousands of WebSocket pushes. This has led to architectural limits on channel membership in some cases.
  • MySQL at scale: While MySQL is reliable and well-understood, operating thousands of MySQL shards requires significant DBA expertise and custom tooling for migrations, backups, and failover.

The Lessons

1. Separate the write path from the notification path. Durable storage and real-time delivery have fundamentally different requirements (consistency vs. latency). By separating them, Slack can optimize each independently. The message is considered "sent" when it's written to MySQL, not when it's delivered via WebSocket.

2. Choose boring databases and optimize them. Slack chose MySQL not because it's the "best" database for messaging, but because their team understood it deeply and could optimize it aggressively. The lesson: deep expertise in a "boring" technology often outperforms shallow expertise in a "cutting-edge" one.

3. Fan-out is the hardest problem in real-time systems. Delivering a message to one user is trivial. Delivering it to 10,000 users simultaneously, in real time, with ordering guarantees, is an architectural challenge that shapes every design decision. Any system with broadcast semantics (chat, social feeds, collaborative editing) must solve the fan-out problem explicitly.

4. Eventual consistency is usually acceptable when it's documented. Users tolerate a 2-second delay in search results but would not tolerate a 2-second delay in message delivery. By identifying where eventual consistency is acceptable and where it's not, Slack can simplify its architecture without degrading the perceived experience.

5. Statefulness is a valid architectural choice. The industry trend toward stateless services is valuable, but some problems are inherently stateful. WebSocket connections are stateful by nature. Rather than fighting this with complex state externalization, Slack embraces it and engineers around the operational challenges.

Credits & References

  • Slack Engineering Blog (slack.engineering): Technical deep dives into Slack's infrastructure, including the migration to a new service architecture.
  • "Scaling Slack" by Keith Adams at QCon: Talk on the challenges of scaling Slack's real-time infrastructure.
  • "How Slack Built Shared Channels" (Slack Engineering): On the architectural challenge of connecting separate Slack workspaces.
  • Designing Data-Intensive Applications by Martin Kleppmann: Context on the distributed systems principles (fan-out, eventual consistency, event sourcing) that Slack applies.