Designing a distributed system is hard. Designing a stateful system is harder. Designing a distributed stateful system is one of the most complicated endeavors in modern technology. There are so many things that can go wrong, and so many early decisions along the road to building one of these systems, that it's rarely clear where you made a bad decision until it's too late to fix it.
It turns out this is something I'm kind of good at, so I thought it'd be a good place to start a new blog series. A little about me: my name is Keith McClellan, and I work on the Solutions Engineering team at Cockroach Labs, the makers of CockroachDB. I've been working on distributed systems almost my entire career, spanning technologies like MPP data warehouses (Netezza, Greenplum), Big Data (Hadoop, Platfora), cloud-native container orchestration (Mesosphere), NoSQL (DataStax), and now distributed SQL (CockroachDB). Oh yeah, and I built the original prototype and acted as the engineering manager for the CockroachDB Kubernetes Operator. Before that, I built a distributed grid ETL solution for a customer using Pentaho Data Integration with a scheduler called HTCondor and a distributed filesystem called Gluster.
That last one was awesome, but we made a TON of mistakes... we didn't have a backup of any of our configuration, and when we lost the Gluster head nodes one time, we had to rebuild the entire system from scratch. Ouch. I'm hoping that breaking down how to design these systems will help you avoid those kinds of problems. So we'll start at the beginning with some theory and work our way through designing and then building a distributed stateful application that you could run worldwide.