A better way to learn a new codebase

Some people work on the same codebase for years, developing deep mastery. The rest of us have to start from scratch as we switch projects or jobs. Being able to learn a new codebase quickly and effectively is a programmer superpower.

Most guides I’ve seen on learning a codebase boil down to “read and ask questions”. For a small codebase, that can work. But as the codebase grows, that isn’t enough.

A trick I learned in high school gave me a better way.

In high school, I took an AP history class that really taught me how to study. Passing the AP exam meant knowing facts – lots of facts. My teacher forced us to use a rigid note-taking system to master the material:

Read a chapter
Re-read it, highlighting significant facts
Create an outline of the chapter from highlighted notes

The power of this method was the discipline to put the material into my own structure and words. And it worked! I passed the test with ease.

What works for studying history also works for studying code.

The problem with studying only by reading is that it’s a passive activity. Writing down what I’ve learned is an active exercise that requires me to more deeply process and retain the content.

When I’ve joined a new team and had an unfamiliar project to learn, I’ve drawn on these high school lessons to accelerate my understanding of the project’s codebase.

Outlining the structure of code means that I’m actively engaged in finding words to explain what I’ve just read. If I didn’t really understand it, I’ll struggle to summarize it. That internal dialog forces me to be honest with myself about whether I read with comprehension or lost focus and started skimming.

Summarizing requires deciding what is essential and what is trivial.

This is exactly my goal when learning a codebase. Summarizing is like lossy compression, where I’m constantly seeking out the essential signal that preserves enough of the meaning of the code.

That’s deep learning in a way that merely reading will never accomplish.

The other wonderful benefit is that it creates the ‘Cliffs Notes’ version of the code. I can refer back to it to refresh my memory, or I can give it to someone else to get oriented quickly. Done well, it’s a form of documentation.

If this kind of documentation already exists, it’s tempting to use it as a field guide alongside the codebase instead of going through the work of writing my own summary. This works okay, because reconciling what’s in the field guide to what’s in the code requires an active form of reading. But I don’t think it’s as effective as making my own.

Work breadth-first, mostly.

I work breadth-first to get a broad overview before diving too far down into subroutines. For an executable, I like to start with main and follow the flow of execution. For a service, I like to start with the most common request type and follow that. Often, I don’t need to grok the whole codebase right away, so I might start with a subsystem or library, tracing the main entry points.
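As a rough illustration of reading breadth-first, the sketch below groups the functions in a Python module by how many calls away they are from an entry point. Everything in it is an assumption made for the example: the helper names build_call_graph and breadth_first_levels, the toy EXAMPLE module, and the simplification that only direct calls to top-level functions by bare name are counted. It is no substitute for reading the code; it only suggests an order to read in.

```python
import ast
from collections import deque


def build_call_graph(source: str) -> dict[str, list[str]]:
    """Map each top-level function to the top-level functions it calls.

    Simplification for this sketch: only plain `def` functions and calls by
    bare name (e.g. `run(x)`) are tracked; methods and async functions are not.
    """
    tree = ast.parse(source)
    functions = {
        node.name: node for node in tree.body if isinstance(node, ast.FunctionDef)
    }
    graph: dict[str, list[str]] = {}
    for name, node in functions.items():
        calls = [
            (child.lineno, child.col_offset, child.func.id)
            for child in ast.walk(node)
            if isinstance(child, ast.Call)
            and isinstance(child.func, ast.Name)
            and child.func.id in functions
        ]
        callees: list[str] = []
        for _, _, callee in sorted(calls):  # keep source order, drop repeats
            if callee not in callees:
                callees.append(callee)
        graph[name] = callees
    return graph


def breadth_first_levels(graph: dict[str, list[str]], entry: str) -> list[list[str]]:
    """Group the functions reachable from `entry` by call depth."""
    levels: list[list[str]] = []
    seen = {entry}
    queue = deque([(entry, 0)])
    while queue:
        name, depth = queue.popleft()
        if depth == len(levels):
            levels.append([])
        levels[depth].append(name)
        for callee in graph.get(name, []):
            if callee not in seen:
                seen.add(callee)
                queue.append((callee, depth + 1))
    return levels


# Hypothetical module used only to exercise the sketch.
EXAMPLE = '''
def main():
    config = load_config()
    run(config)

def load_config():
    return parse("settings")

def run(config):
    parse(config)
    report()

def parse(text):
    return text

def report():
    pass
'''

if __name__ == "__main__":
    for depth, names in enumerate(breadth_first_levels(build_call_graph(EXAMPLE), "main")):
        print(depth, ", ".join(names))
```

For the toy module, the levels come out as main, then load_config and run, then parse and report, which matches the order a breadth-first outline would cover them.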

A good outline reads like ‘reverse stubbing’ the codebase.

In systems or languages that are object-oriented, I might also make a list of the major classes I encounter. I write down their purpose and notes about what I can see about their life cycles. I don’t write down every type. When I have to decide which are essential to the logic and which are mere utilities, I get another opportunity for internal dialog about significance.
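A first draft of that class list can be seeded mechanically and then pruned by hand. The sketch below is one way to do it for Python source, assuming the classes have docstrings worth reading; the function name list_classes and the EXAMPLE classes are hypothetical. Deciding which entries are essential is still the part that needs a human.

```python
import ast


def list_classes(source: str) -> list[tuple[str, str]]:
    """Return (class name, first docstring line) pairs in source order."""
    tree = ast.parse(source)
    classes = [node for node in ast.walk(tree) if isinstance(node, ast.ClassDef)]
    classes.sort(key=lambda node: node.lineno)
    inventory: list[tuple[str, str]] = []
    for node in classes:
        doc = ast.get_docstring(node) or ""
        # Without a docstring, leave a placeholder to fill in while reading.
        summary = doc.splitlines()[0] if doc else "(purpose unknown, fill in by hand)"
        inventory.append((node.name, summary))
    return inventory


# Hypothetical classes used only to exercise the sketch.
EXAMPLE = '''
class Session:
    """Holds one client connection from login to logout."""

class RetryPolicy:
    pass
'''

if __name__ == "__main__":
    for name, summary in list_classes(EXAMPLE):
        print(f"{name}: {summary}")
```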

Write in a style that feels comfortable.

In many cases, I write an outline in a sectioned document with bullet points within each section. I use headings for major functions and nested bullet points for that function’s logic and any small subroutines it calls. In that way, I ‘inline’ small utility functions or small call trees that are easy to summarize. But if there is a big function that seems like it needs its own section, I stub that as a new section and link to it.

You’ll notice that I don’t work strictly breadth-first. As I work across a function or type, I’ll keep a soft gaze with a ‘shallow depth of field’ into what subroutines are generally doing and whether they’re simple or not. I use this for context as I read with breadth and to decide what’s worth diving into when I’m ready to go deeper.

Other times, particularly for a smaller codebase or subsystem, I might summarize in prose instead of an outline, particularly if the logic flow is more linear or if branching is mostly an implementation detail rather than significant control flow.

As I go, I also keep a journal of questions or refactoring ideas. After I’m done, I go back to see if I’ve learned the answers and if my refactoring ideas still make sense.

Set aside time for learning.

When planning the time I’ll need, I’m guided by studies on code review speed. These suggest that high-quality code review speed is about 200-400 LOC/hour. Summarizing code isn’t the same as code review – I’m not reading closely line by line or judging correctness – but I think the time winds up comparable. I need to understand the code at a high level, plus I need time to write my notes on it.

Assume I have a 10k LOC project or a similarly sized subsystem of a larger one. According to one guide, this is a ‘medium’ project, big enough that it’s just barely possible to hold a model of it in your head. If I can study at 400 LOC/hour, it will take me 25 hours to learn it. If I can maintain concentrated study for 4 hours a day, that’s about 6 days.

Don’t be afraid of an 80/20 approach.

As I said in my 80/20 article, some things have diminishing returns to incremental effort. By working breadth-first, I’ll probably understand enough to be able to stop without covering the entire codebase, so 6 days is an upper bound.

Let’s assume in practice I’ll be able to stop halfway. That means learning a 10k LOC project will take me about 3 days. That seems a pretty reasonable amount of time not just to read, but to actually learn and retain the structure of the code.
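The arithmetic above fits in a few lines of Python. This is only a sketch: the function name study_days and the coverage parameter (the fraction of the code actually studied) are made up for illustration, and the rates are the ones quoted above, not measurements.

```python
def study_days(loc: int, loc_per_hour: float, hours_per_day: float,
               coverage: float = 1.0) -> tuple[float, float]:
    """Return (study hours, study days) for reading `coverage` of `loc` lines."""
    hours = loc * coverage / loc_per_hour
    return hours, hours / hours_per_day


# 10k LOC at 400 LOC/hour with 4 focused hours a day.
full_hours, full_days = study_days(10_000, 400, 4)       # 25.0 hours, 6.25 days
half_hours, half_days = study_days(10_000, 400, 4, 0.5)  # 12.5 hours, 3.125 days

# At the slow end of the quoted range (200 LOC/hour) the full pass doubles.
slow_hours, slow_days = study_days(10_000, 200, 4)       # 50.0 hours, 12.5 days
```

The spread between the fast and slow rates is wide, so a number like this works better as a planning range than as a deadline.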

Find tools to help you.

Good tooling makes this work easier. It helps a lot if you have a good programming editor or IDE that lets you jump to definitions and jump in and out of functions. I also like having a good ‘grep’ tool like ripgrep to search for patterns, phrases or types. That kind of search can show a lot of usage patterns at once, along with concentrations of usage by directory or file.
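To show what ‘concentrations of usage by directory’ can look like, here is a small Python sketch that counts matching lines per directory. It is a stand-in for illustration, not how ripgrep works or is invoked; the function name matches_by_directory, the default *.py glob, and the example pattern are all assumptions.

```python
import re
from collections import Counter
from pathlib import Path


def matches_by_directory(root: str, pattern: str, glob: str = "*.py") -> Counter:
    """Count lines matching `pattern`, grouped by directory relative to `root`."""
    regex = re.compile(pattern)
    root_path = Path(root)
    counts: Counter = Counter()
    for path in root_path.rglob(glob):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        hits = sum(1 for line in text.splitlines() if regex.search(line))
        if hits:
            # Files directly under root are grouped under ".".
            counts[path.parent.relative_to(root_path).as_posix()] += hits
    return counts
```

Calling matches_by_directory(".", r"\bSession\b").most_common(5) would list the five directories where a hypothetical Session type appears most often, which hints at where that type matters.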

Think also about where you’re going to write the notes. I like using Google Docs because it makes it easy to create out-of-band comment threads for things I didn’t understand and want to think about or ask about later.

When writing prose rather than an outline, I sometimes use Markdown, particularly if I’m planning on adding my work as project documentation. (But you can write in Google Docs and convert it to Markdown later if you need to.)

Occasionally, I’ve used mind-mapping software. If you’re a fan of mind-maps, it can work well for this, too.

The ‘trick’ is putting in the effort.

Writing down notes on what you read is age-old advice. Like a lot of simple advice, we often ignore it because it seems so much harder than trying to get by doing the easy thing. But putting in the effort does work! You’ll learn more and get more out of your study time.

The next time you’re feeling overwhelmed with a new codebase, I hope you’ll remember this article and give it a try.