
My company has operated a cloud for three years that now manages hundreds of ClickHouse clusters on Kubernetes. We use the Altinity Kubernetes Operator for ClickHouse, aka "clickhouse operator," which we wrote and maintain.

I was very skeptical of data on Kubernetes when we first started, in part due to some initial experience with Kubernetes in 2018 but mostly due to prejudice against change. Overall it has worked out great. Here are 4 of many things we've learned.

1. Most modern databases are distributed systems. You don't just set up a single node but rather several or even dozens of nodes. Well-written operators make this relatively trivial even though it's quite complex underneath. In fact, the simplest way to learn how to set up a ClickHouse cluster is to bring it up under the operator and then look at the configuration on each container. That's how I learned it.
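For illustration, a minimal manifest for the operator might look like the sketch below. The field names follow the Altinity operator's ClickHouseInstallation CRD, but the cluster name and sizes here are made up; treat it as a sketch, not a production config.

```yaml
# Hypothetical minimal ClickHouseInstallation: a 2-shard, 2-replica cluster.
# The operator expands this into StatefulSets, Services, and per-node
# ClickHouse XML configuration that you can inspect on each container.
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "demo"
spec:
  configuration:
    clusters:
      - name: "demo"
        layout:
          shardsCount: 2
          replicasCount: 2
```

Applying this with kubectl and then reading the generated config on each pod is exactly the learning path described above.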

2. Kubernetes portability is overall quite good. We ported our cloud from AWS to GCP in 8 weeks. We've since expanded to run in many other environments as well.

3. We map ClickHouse server containers 1-to-1 to VMs spawned using Karpenter or native node groups. It makes it a lot easier to reason about performance, including things like network bandwidth to storage.
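One common way to get that 1-to-1 mapping is a pod anti-affinity rule so that no two ClickHouse pods ever share a node, paired with resource requests sized to nearly fill the VM. This is a sketch with hypothetical labels and sizes, not our exact configuration:

```yaml
# Hypothetical pod spec fragment: requests sized to the VM, plus
# anti-affinity so the scheduler never co-locates two ClickHouse pods.
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: clickhouse
          topologyKey: "kubernetes.io/hostname"
  containers:
    - name: clickhouse
      resources:
        requests:
          cpu: "15"        # nearly all of a hypothetical 16-vCPU node
          memory: "60Gi"
```

With one pod per VM, the node's network and EBS bandwidth limits map directly onto a single ClickHouse server, which is what makes performance easy to reason about.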

4. ClickHouse is still basically a shared-nothing architecture where individual servers own patches of storage. Kubernetes enables a great scaling model if you use VMs attached to block storage--you can scale nodes from 2 to 64 vCPUs in a few minutes, plus you can easily extend volumes. This scaling model is in my opinion highly underrated for databases. It's decoupled compute/storage that really works. With Kubernetes you get it essentially for free.
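The volume-extension half of that model hinges on the storage class permitting expansion, after which growing a data volume is just editing the PVC's storage request in place. A sketch, assuming the AWS EBS CSI driver and a made-up class name:

```yaml
# Hypothetical StorageClass: allowVolumeExpansion lets you grow a
# ClickHouse data volume later by raising the PVC's storage request,
# without detaching the volume or restarting from scratch.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-expandable
provisioner: ebs.csi.aws.com
allowVolumeExpansion: true
parameters:
  type: gp3
```

Compute scales independently: replace the VM under the pod with a bigger instance type and reattach the same volume, which is why going from 2 to 64 vCPUs takes minutes rather than a data rebalance.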

It's not all roses. Containers create new failure modes. You can't just ssh in, look at logs, and fix things. Pod crash loops [0] can be very problematic. Certain failure modes like bad EBS volumes (stuck in a kinda-alive state) are hard to fix if your operator cannot quickly replace a node. And operator bugs create a new class of very-hard-to-debug problems. The best solution to all of these problems is not to have them, which means you need to focus--often for years--on operator reliability and your day 2 infrastructure, such as monitoring.

[0] https://altinity.com/blog/fixing-the-dreaded-clickhouse-cras...

Disclaimer: I work for Altinity.


