Git Internals: What Happens When You Type git commit

A guided tour of the object model hiding under every commit

Smarc

18-06-2024

A few years ago I wiped out an afternoon’s work with git reset --hard, panicked for about ninety seconds, then recovered every byte of it in two commands. The difference between those two states, losing the work and getting it back, wasn’t luck. It was knowing that the commit I’d “destroyed” was still sitting in the object store, unreferenced but very much alive, with the reflog holding a pointer to it. That single fact has saved me more times than any backup.

Most of us use Git the way we use a microwave: press the buttons, food gets warm, never think about the magnetron. git add, git commit, git push, repeat. That works right up until it doesn’t — a detached HEAD, a botched rebase, a “lost” commit — and suddenly the buttons stop making sense. The cure is understanding what Git actually does when you commit, because the underlying model is far simpler and more elegant than the porcelain commands suggest. Spend an afternoon with the plumbing and Git stops being magic.

Why bother learning the internals at all

The “why” matters here, because nobody should learn implementation details for their own sake. The payoff is that Git’s surface commands — the porcelain, in Git’s own terminology — are a thin, sometimes inconsistent layer over a handful of plumbing commands that operate directly on the data model. Once you can see the data model, the porcelain stops being a collection of memorised incantations and becomes a set of obvious operations on a structure you understand. A rebase is no longer “the scary one”; it’s just rewriting a chain of commit objects. A detached HEAD isn’t an error state; it’s a pointer pointing somewhere slightly unusual. The model is small enough to hold in your head, and that’s exactly why it’s worth holding.

Git is a content-addressed object store

Strip away the commands and Git is a tiny key-value database. It stores four kinds of objects, and the key for every object is the hash of its contents. (Git is migrating towards SHA-256, but the classic 40-character hex hashes you see everywhere are SHA-1, and the model is identical either way.) Because the key is the hash of the content, the store is “content-addressed”: identical content always produces an identical hash, so Git stores it once and deduplication is automatic. Move a file, copy it, check out the same content on three branches — there’s still exactly one blob on disk.
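To make “content-addressed” concrete, here is a toy sketch of the idea in Python. It is not how Git is implemented (real Git hashes a short type-and-length header along with the content and writes to disk); the ContentStore class and its method names are made up for illustration.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is always derived from the value.

    Hypothetical illustration only; real Git also hashes a header and
    persists objects under .git/objects.
    """

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical content always maps to the same key, so it is kept once.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
a = store.put(b"same bytes")
b = store.put(b"same bytes")       # e.g. the same file on another branch
c = store.put(b"different bytes")
print(a == b, len(store))          # True 2
```

Storing the same bytes twice returns the same key and leaves one entry behind, which is all deduplication amounts to here.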

The four object types are:

blob — the raw bytes of a file. Just contents, no filename, no permissions.
tree — a directory listing: names, modes, and the hashes of the blobs and sub-trees it contains.
commit — a snapshot pointer plus metadata: the top-level tree, parent commit(s), author, committer, and message.
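tag — an annotated tag object: a name, tagger, date and message pointing at another object (usually a commit), optionally GPG-signed. Lightweight tags are not objects at all, just refs.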

Everything you think of as “a commit” is just a commit object pointing at a tree, which points at blobs and more trees. That’s the whole data model. You can prove it to yourself with git cat-file, the plumbing command that prints any object by hash. Pass it -t to ask the type of an object and -p to pretty-print the contents:

$ git cat-file -t HEAD
commit
$ git cat-file -p HEAD^{tree}      # the tree this commit points at
100644 blob ce013625...    greeting.txt
040000 tree a1b2c3d4...    src

Notice the tree contains a blob entry for the file and a tree entry for the subdirectory. Trees nest into trees the way directories nest into directories — it’s directories all the way down, terminating in blobs.
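If you are curious what a tree looks like before cat-file prettifies it, the following is a minimal sketch of the serialisation as I understand it: each entry is the mode, a space, the name, a NUL byte, then the 20 raw bytes of the object id, with entries sorted by name (directories compared as if their name ended in a slash). The tree_object function is a hypothetical helper, and the blob id is the greeting.txt hash used elsewhere in this article.

```python
import hashlib


def tree_object(entries):
    """Serialise a tree. entries: list of (mode, name, hex_object_id) tuples."""

    def sort_key(entry):
        mode, name, _ = entry
        # Directories (mode 40000) sort as if their name ended with "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_id)
    # Like every object, the tree is prefixed with "<type> <size>\0".
    return b"tree %d\x00" % len(body) + body


greeting_blob = "ce013625030ba8dba906f756967f9e9ca394464a"
tree = tree_object([("100644", "greeting.txt", greeting_blob)])
tree_id = hashlib.sha1(tree).hexdigest()

print(tree[:8])      # b'tree 40\x00'
print(len(tree_id))  # 40
```

Note that the stored mode for a directory is 40000; the leading zero in 040000 is added by cat-file -p for alignment.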

The staging area is a real file

When you git add a file, Git does two concrete things. It writes the file’s contents as a blob into the object store, and it records that blob’s hash, path and mode in the index — a single binary file at .git/index. The index is the staging area; it’s not an abstraction, it’s a file you can inspect. This is the detail that finally made git add -p, git reset HEAD <file> and “why is my file still showing as modified” click for me: every one of those is just editing the index.

$ echo "hello" > greeting.txt
$ git add greeting.txt

# the index now lists the staged file and its blob hash
$ git ls-files --stage
100644 ce013625030ba8dba906f756967f9e9ca394464a 0	greeting.txt

# and that blob really exists in the store
$ git cat-file -p ce013625
hello

That hash, ce0136..., is the hash of the blob (a short header plus the bytes hello\n). Git stored the content under a hash, but nowhere in the blob is the name greeting.txt: the filename lives in the index now, and in a tree later. This is why renaming a file with unchanged contents adds no new blob to the object store (only the trees that record the new name change): same bytes, same blob, same hash.
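You can reproduce that hash without Git at all. This short sketch assumes the classic SHA-1 object format, where the header is the word blob, a space, the content length in bytes, and a NUL; git_blob_hash is just a name I picked for the helper.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Hash content the way Git hashes a blob: header + raw bytes."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_hash(b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a
```

The filename never enters the calculation, which is the whole point: the same bytes under any name produce the same id.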

How the objects actually sit on disk

It’s worth knowing where these objects physically live, because it explains a couple of things that otherwise look like Git misbehaving. A freshly written object is a loose object: a single file under .git/objects/, named by its hash split into a two-character directory and a 38-character filename, with its contents zlib-compressed. The blob for greeting.txt lands at .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a.

# the loose object on disk (compressed — don't expect readable text)
$ ls .git/objects/ce/
013625030ba8dba906f756967f9e9ca394464a
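The path layout and compression can be sketched in a few lines of Python. This is an illustration under the SHA-1 format, not Git’s own code: loose_object is a hypothetical helper, and the compressed bytes it produces may differ from what Git writes (compression level is configurable), although both decompress to the same header plus content.

```python
import hashlib
import zlib
from pathlib import PurePosixPath


def loose_object(obj_type: str, content: bytes):
    """Return (path, compressed_bytes) for an object stored loose."""
    raw = obj_type.encode() + b" %d\x00" % len(content) + content
    object_id = hashlib.sha1(raw).hexdigest()
    # First two hex characters become the directory, the other 38 the filename.
    path = PurePosixPath(".git/objects") / object_id[:2] / object_id[2:]
    return path, zlib.compress(raw)


path, compressed = loose_object("blob", b"hello\n")
print(path)                         # .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
print(zlib.decompress(compressed))  # b'blob 6\x00hello\n'
```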

That’s fine for a handful of objects, but a busy repo would accumulate thousands of tiny files. So Git periodically runs git gc, which bundles loose objects into a compressed packfile under .git/objects/pack/ and stores related objects as deltas against one another. This is why du -sh .git can shrink after a gc despite no commits being deleted, and why a repo with a hundred thousand objects still clones quickly: it’s a couple of packfiles, not a hundred thousand files. The content-addressing is untouched by all this: an object’s hash is computed over its header plus its uncompressed content, so it is the same whether the object is loose or packed.

What commit actually does

Now the main event. When you run git commit, Git performs a tidy sequence of steps, all of which you can do by hand with plumbing commands.

# 1. Turn the current index into a tree object.
$ git write-tree
3c4e9cd789d88d8d89c1073707c3585e41b0e614

# 2. Create a commit object pointing at that tree,
#    with the current HEAD as its parent.
$ echo "Add a greeting" | git commit-tree 3c4e9cd -p HEAD
a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0

# 3. Move the current branch to point at the new commit.
$ git update-ref refs/heads/main a1b2c3d4

That’s it. git commit is sugar over those three operations. write-tree snapshots the index into a tree object (creating sub-trees for directories as needed). commit-tree wraps that tree with a parent pointer, your author/committer details and a message, hashing the lot into a commit object. Then the branch reference is updated to point at the new commit. Inspect the result and the structure is laid bare:

$ git cat-file -p HEAD
tree 3c4e9cd789d88d8d89c1073707c3585e41b0e614
parent 9f8e7d6c5b4a3f2e1d0c9b8a7f6e5d4c3b2a1f0e
author Smarc <[email protected]> 1718700000 +0100
committer Smarc <[email protected]> 1718700000 +0100

Add a greeting
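The text that cat-file prints is very nearly the object itself. As a rough sketch, assuming the SHA-1 format and an unsigned commit with no extra headers, a commit can be assembled and hashed like this. The commit_object helper and the example.com identity are placeholders of mine; the tree and parent ids are the illustrative ones from the listing above.

```python
import hashlib


def commit_object(tree, parents, author, committer, message):
    """Assemble the raw bytes of a simple, unsigned commit object."""
    lines = ["tree " + tree]
    lines += ["parent " + parent for parent in parents]  # zero, one or many
    lines += ["author " + author, "committer " + committer, "", message]
    body = ("\n".join(lines) + "\n").encode()
    return b"commit %d\x00" % len(body) + body


# Placeholder identity: name, email, Unix timestamp, timezone offset.
ident = "Example Author <author@example.com> 1718700000 +0100"
raw = commit_object(
    tree="3c4e9cd789d88d8d89c1073707c3585e41b0e614",
    parents=["9f8e7d6c5b4a3f2e1d0c9b8a7f6e5d4c3b2a1f0e"],
    author=ident,
    committer=ident,
    message="Add a greeting",
)
commit_id = hashlib.sha1(raw).hexdigest()
print(raw.decode().split("\x00", 1)[1])
```

Because the parent id and the timestamps are part of the hashed bytes, changing any of them yields a different commit id. That is why rewriting one commit forces new ids for every commit after it.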

A commit is a snapshot, not a diff. The “diffs” you see in git log -p are computed on demand by comparing a commit’s tree with its parent’s tree. Git stores whole snapshots and works out changes when asked — the opposite of how the diff-centric mental model imagines it. This explains a lot: it’s why git checkout of an old commit is instant rather than replaying a history of patches, and why a merge commit can have two parents and still just point at one resulting tree.
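Here is a small sketch of what “computed on demand” means. It assumes each snapshot has already been flattened into a dictionary of path to blob id (real Git walks nested trees and can skip whole subtrees whose ids match); diff_snapshots and the short ids are invented for the example.

```python
def diff_snapshots(old, new):
    """Compare two flattened trees, each a dict of path -> blob id."""
    changes = {}
    for path in sorted(old.keys() | new.keys()):
        before, after = old.get(path), new.get(path)
        if before == after:
            continue  # same id means same content: nothing to report
        if before is None:
            changes[path] = "added"
        elif after is None:
            changes[path] = "deleted"
        else:
            changes[path] = "modified"
    return changes


# Hypothetical, shortened ids for readability.
parent = {"greeting.txt": "ce01", "old.txt": "aaaa"}
child = {"greeting.txt": "beef", "src/main.py": "cafe"}
print(diff_snapshots(parent, child))
# {'greeting.txt': 'modified', 'old.txt': 'deleted', 'src/main.py': 'added'}
```

Comparing ids is enough to tell which files changed; only then does anything need to open the blobs and produce a line-by-line diff.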

Branches and HEAD are just pointers

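Here’s the realisation that demystifies most Git confusion: a branch is a 41-byte file containing a 40-character commit hash and a newline. (After a gc the ref may live as a line in .git/packed-refs instead, in which case the cat below finds no file; git rev-parse main works either way.) Look inside .git: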

$ cat .git/refs/heads/main
a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0

$ cat .git/HEAD
ref: refs/heads/main

HEAD normally points at a branch ref, which points at a commit. Creating a branch writes a new tiny file with the same hash — instant, regardless of repository size, because nothing is copied. A “detached HEAD” simply means .git/HEAD holds a commit hash directly instead of ref: refs/heads/...; you’re sitting on a commit with no branch pointing at it, which is fine to look around in but means new commits won’t be reachable unless you make a branch before you leave.
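The pointer-chasing is simple enough to sketch. The resolve_head function below is a hypothetical helper that only understands loose ref files; it deliberately ignores .git/packed-refs, worktrees and other cases real Git handles. The demo builds a throwaway fake .git directory rather than touching a real repository.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir):
    """Return (branch_ref, commit_id); branch_ref is None when HEAD is detached."""
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return None, head  # detached: HEAD holds a commit id directly
    ref = head[len("ref: "):]
    ref_file = git_dir / ref
    if ref_file.exists():
        return ref, ref_file.read_text().strip()
    return ref, None  # unborn branch, or a ref that only exists in packed-refs


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    commit = "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0"
    (git_dir / "refs" / "heads" / "main").write_text(commit + "\n")

    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    attached = resolve_head(git_dir)

    (git_dir / "HEAD").write_text(commit + "\n")
    detached = resolve_head(git_dir)

print(attached)  # ('refs/heads/main', 'a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0')
print(detached)  # (None, 'a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0')
```

Both cases land on the same commit; the only difference is whether a branch name sits in between.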

When it goes wrong: recovering lost commits

This is where the model earns its keep. A commit you’ve “lost” after a bad reset, an aborted rebase, or a force-push usually still exists in the object store — it just has no ref pointing at it. Git records every place HEAD has been in the reflog, so the orphaned hash is one command away:

# show where HEAD has been, newest first
$ git reflog
a1b2c3d HEAD@{0}: reset: moving to HEAD~3
7f3e9a1 HEAD@{1}: commit: the work I thought I destroyed
...

# get it back: branch off the orphaned commit, or hard-reset onto it
$ git branch rescue 7f3e9a1
# or
$ git reset --hard 7f3e9a1

A few troubleshooting notes from experience. The reflog is local and per-clone: it isn’t pushed, so a fresh clone has none of your history of moves; recover before you re-clone. Orphaned objects are not immortal either. By default, reflog entries pointing at commits that are no longer reachable from the branch tip expire after 30 days (gc.reflogExpireUnreachable), and once nothing refers to an object any more, git gc prunes it when it is more than two weeks old (gc.pruneExpire), so “I’ll fix it next month” is not a plan. If git reflog shows nothing useful, git fsck --lost-found will list dangling commits and blobs it can find directly in the object store. And if the work was only staged, never committed, a dangling blob from git fsck plus git cat-file -p can still pull the contents back. The recurring lesson is that Git rarely truly loses your work: the objects linger until garbage collection, long after the pointers have moved on. Knowing that turns “I’ve destroyed everything” into “where did the pointer go”.
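The reflog is itself just a text file, .git/logs/HEAD, with one line per move: old id, new id, identity, timestamp, then a tab and the message, oldest entry first. The sketch below parses that format and lists ids HEAD has visited but no longer points at. Treat the result as candidates only, since many of them will still be reachable from a branch. The function names, the padded ids and the example.com identity are all made up for illustration.

```python
def parse_reflog_line(line):
    """Split one line of .git/logs/HEAD into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old_id, new_id, identity = meta.split(" ", 2)
    return {"old": old_id, "new": new_id, "identity": identity, "message": message}


def previously_visited(lines):
    """Ids HEAD once pointed at but does not now, most recent first."""
    entries = [parse_reflog_line(line) for line in lines if line.strip()]
    if not entries:
        return []
    current = entries[-1]["new"]
    seen, result = set(), []
    for entry in reversed(entries):
        for object_id in (entry["new"], entry["old"]):
            is_null = set(object_id) == {"0"}  # all zeros marks "no previous commit"
            if object_id != current and not is_null and object_id not in seen:
                seen.add(object_id)
                result.append(object_id)
    return result


# Hypothetical ids, padded to 40 characters.
ZERO = "0" * 40
BASE = "a1b2c3d" + "0" * 33
LOST = "7f3e9a1" + "0" * 33
ident = "Example Author <author@example.com> 1718700000 +0100"
sample = [
    f"{ZERO} {BASE} {ident}\tcommit (initial): first commit\n",
    f"{BASE} {LOST} {ident}\tcommit: the work I thought I destroyed\n",
    f"{LOST} {BASE} {ident}\treset: moving to HEAD~1\n",
]
print(previously_visited(sample) == [LOST])  # True
```

In practice git reflog does this reading for you; the point is only that there is nothing hidden behind it.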

So is this worth learning?

If Git is just add, commit, push and it never goes wrong, you can happily skip all this and lose nothing. But the moment you hit a confusing rebase, an interactive history rewrite, a detached HEAD or a panic about lost commits, this model is the difference between flailing and fixing it in thirty seconds. Knowing that commits are immutable snapshots, branches are throwaway pointers, and nothing is gone until gc runs turns Git from an anxiety machine into a tool you trust.

You don’t need to use the plumbing day to day — I almost never type commit-tree in anger. But knowing it’s there, and roughly what commit is doing on your behalf, is one of the highest-return afternoons a developer can spend. It pairs well with the rest of the workflow, too: once you trust the object model, layering automation on top is far less nerve-wracking, whether that’s pre-commit hooks catching mistakes before they reach the repo or letting a model draft your commit messages. The structure underneath stays the same; you’re just deciding what writes to it.

Written by Smarc

Founder and editor of vo.rs. A lifelong tinkerer who self-hosts far more than is sensible, hardens Linux boxes for fun, and prods the latest AI tools to see what they can really do. The how-to guides here are the notes Smarc wishes had existed the first time round.

Tagged: #git #programming
