Preface
I started writing this book because I couldn’t find it anywhere.
There are plenty of books that teach you how to use Spark. How to set up a Databricks workspace. How to create a Delta table. O’Reilly has great ones. Manning too. They explain the features, walk you through the APIs, and show you how things work when everything goes right.
This is not that book.
This book is about what happens when things go wrong. When your MERGE rewrites half the table because you forgot to add a partition column to the ON clause. When your streaming checkpoint gets corrupted and you can’t restart the pipeline without losing data. When a Databricks Runtime (DBR) upgrade silently changes the default working directory and your init scripts start failing in production at 2 a.m.
These aren’t hypothetical problems. Every chapter in this book comes from something that actually happened — to me, to my teams, or to engineers I’ve spoken with in Databricks community forums, GitHub issues, and conference hallways. The incidents are real. The error messages are real. The late-night debugging sessions are very real.
Who this is for
You already know Spark. You’ve read the docs. You’ve probably built pipelines that run in production. What you haven’t done yet is hit every failure mode these tools have to offer.
This book is for:
- Data engineers running Spark/Delta/Databricks in production who want to learn from other people’s mistakes instead of making them all themselves
- Teams migrating to Unity Catalog or upgrading DBR versions who want to know what actually breaks
- Engineers debugging performance issues who need diagnostic checklists, not theoretical explanations
- Technical leads making architectural decisions about partitioning, medallion layers, or cluster economics who want to see where those decisions go wrong
If you’re still learning the basics of Spark, start with Learning Spark by Damji et al. Come back here when your first production pipeline starts doing something unexpected.
How this book works
Every chapter follows the same structure:
- An incident — a real production failure that opens the chapter. What happened, what the symptoms looked like, how long it took to diagnose.
- The mechanism — why it happened. Not the documentation explanation, but the practical “here’s what’s actually going on under the hood” explanation.
- A runnable example — code you can copy into a Databricks notebook or local Spark session using public datasets (NYC Taxi, TPC-H, Databricks sample data). Try it yourself. Break it. Fix it.
- The fix — what we changed. Configuration, code, architecture. Before and after.
- A diagnostic checklist — if you see symptom X, check Y, fix Z. The runbook distilled into a chapter.
You don’t need to read this book cover to cover. Each chapter is self-contained. If your VACUUM just broke your streaming consumer, go straight to Chapter 8. If your MERGE is slow, Chapter 9. Treat it as a reference you keep next to your terminal.
Early access
This book is being published one chapter at a time, roughly every two weeks. I’m releasing it this way deliberately — each chapter gets feedback from readers before the next one is written. If you find an error, a missing edge case, or a better fix, I want to hear about it.
The latest version is always available on this site. A downloadable PDF will be available once enough chapters are published to form a complete part.
Kiran Kumar
March 2026