Talk from SREcon22

Adam McKaig and I gave a talk titled “How the Metrics backend works at Datadog” at SREcon22 Americas.

Overview

Datadog is a popular cloud monitoring service that operates at scale across all three major cloud providers, ingesting tens of GB/s of points across many billions of timeseries into PiBs of hot and cold storage. Naturally, reliability is paramount.

In this talk, we’ll show how our very large distributed system works today, and how it grew from a very small non-distributed system. We’ll share the most interesting scaling and reliability challenges we faced along the way, how we solved them (for now), and some important lessons and strategies that emerged. We’ll also share a couple of bonus problems that are still very much unsolved today, and what we’re planning next.

Talk

slides that I hand drew :’)