Building Scalable Backend Systems at TikTok
When you join a team that serves hundreds of millions of users, the first thing you learn is that everything you thought you knew about "scale" was off by a few orders of magnitude.
The reality of scale
At TikTok, a "small" service might handle 100K QPS on a quiet day. The systems I work on process millions of events per second, and the margin for error is razor-thin. A 50ms latency spike that would be invisible in a smaller system can cascade into a full-blown incident here.
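To make that concrete with back-of-the-envelope arithmetic (illustrative numbers, not real traffic figures): at a million events per second, even a brief stall leaves a large backlog behind.

```go
package main

import "fmt"

func main() {
	const eventsPerSecond = 1_000_000 // hypothetical ingest rate
	const stallSeconds = 0.050        // a 50ms latency spike

	// Every second of stall queues eventsPerSecond events;
	// the backlog then has to be drained on top of live traffic.
	backlog := eventsPerSecond * stallSeconds
	fmt.Printf("a 50ms stall queues ~%.0f events\n", backlog)
}
```

That backlog is what turns a momentary blip into a cascade: downstream consumers now face a burst well above their steady-state capacity.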
What I've learned so far
1. Simplicity is the ultimate sophistication
The most reliable systems I've seen are not the cleverest ones — they're the simplest. Every abstraction layer is a potential failure point. When you're operating at this scale, you want boring, predictable code that does exactly one thing well.
```go
// Good: explicit, boring, works at 3am when you're on-call
func ProcessEvent(ctx context.Context, event *Event) error {
	if err := validate(event); err != nil {
		return fmt.Errorf("validation failed: %w", err)
	}
	return store.Write(ctx, event)
}
```

2. Observability is not optional
You cannot debug a distributed system by reading code. You need metrics, traces, and logs — and they need to be there before things break. I've learned to instrument first, implement second.
3. Graceful degradation over perfect reliability
100% uptime is a myth. The question isn't "will this fail?" but "how does it fail?" Circuit breakers, fallbacks, and rate limiting are not nice-to-haves — they're the foundation.
Looking ahead
I'm still early in my career, and every week brings a new lesson in distributed systems. The gap between textbook knowledge and production reality is vast, and I'm grateful to be learning on the job at this scale.