Git Packfiles from the Ground-Up: What they are and Why they Matter

A friendly, from-scratch walkthrough of how Git packfiles shrink storage and speed up transfer, backed by real outputs and a full reproducible script appendix.

Git Packfiles from the Ground-Up: What they are and Why they Matter

Most of us trust Git every day without thinking too hard about how it stores data.

Then one day you notice something odd: "I changed just one line. Why did Git create another full blob?"

That question is exactly where packfiles become interesting.

This article is a gentle walkthrough of what happens before and after packing, why it matters for both disk usage and network transfer, and how to verify all of it with real evidence.

If you prefer following along in video format, watch this walkthrough:

Primary reference:
https://git-scm.com/book/en/v2/Git-Internals-Packfiles

A Quick Mental Model

Think of Git storage in two phases:

  1. Write phase (loose objects):
    Git stores objects independently. This is simple and robust.

  2. Compaction phase (packfiles):
    Git rewrites many objects into a compact binary format and stores deltas where possible.

You can picture it like this:

Before gc (loose objects)

repo.rb v1  -> loose blob
repo.rb v2  -> another loose blob
commit/tree -> loose objects

After gc (packed)

pack file = [base objects + delta objects + metadata]
idx file  = [offset map into pack]

This "simple first, optimize later" strategy is one reason Git feels fast and reliable.

The Question We Want to Answer

If two versions of a large file are almost the same, how expensive is each version:

  • before packing, and
  • after packing?

To answer that, I ran a reproducible experiment and captured every relevant output.

What We Observed from a Real Run

In the experiment, we commit a large file (repo.rb), append one line, commit again, and then run manual git gc.

Here are the key measured values:

blob_v1_sha1=033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5
blob_v2_sha1=b042a60ef7dff760008df33cee372b945b6e884e
blob_v1_logical_bytes=22044
blob_v2_logical_bytes=22054
blob_v1_loose_bytes=6886
blob_v2_loose_bytes=6893
two_blob_loose_total_bytes=13779
loose_total_bytes=14191
pack_bytes=6237
idx_bytes=1240
pack_plus_idx_bytes=7477
saved_vs_pack_only_pct=56.05
saved_vs_pack_plus_idx_pct=47.31
delta_object_bytes=9
delta_packed_bytes=20
base_object_bytes=22054
base_packed_bytes=5799

If you only remember one thing from this article, remember this:

  • before packing, the two near-identical versions cost almost the full loose blob size each;
  • after packing, one version can become a tiny delta.

Step 1: What Loose Objects Are Doing

After first commit:

  • Blob SHA: 033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5
  • Logical size: 22044
  • Loose on-disk bytes: 6886

After adding one line and committing again:

  • Blob SHA: b042a60ef7dff760008df33cee372b945b6e884e
  • Logical size: 22054
  • Loose on-disk bytes: 6893

The important teaching point here is subtle:

Git loose objects are not trying to be globally optimal at write time. They are trying to be straightforward and correct. That is why tiny edits can still produce another standalone blob before compaction.

Step 2: What git gc Changes

When we run git gc, loose objects are repacked.

Observed sizes:

  • .pack: 6237 bytes
  • .idx: 1240 bytes
  • pack + index combined: 7477 bytes

Compared to loose total (14191), that is a major reduction:

  • 56.05% saved vs pack only
  • 47.31% saved even including index

This is not a corner case. It is the core mechanism Git relies on for repositories with lots of similar history.

Step 3: The Most Important Proof (git verify-pack)

These two lines explain almost everything:

033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5 blob   9 20 6197 1 b042a60ef7dff760008df33cee372b945b6e884e
b042a60ef7dff760008df33cee372b945b6e884e blob   22054 5799 306

What this means in plain terms:

  • b042... (newer version) is stored as the base representation.
  • 033b... (older version) is stored as a delta referencing b042....
  • That delta is tiny: logical 9 bytes, packed 20 bytes.

The numbers are the punchline. One large version plus one tiny delta beats storing two near-duplicate full blobs.

Step 4: A Small Byte-Level Peek

Packfiles are binary containers, not text files. We can still verify structure quickly.

Header excerpt:

00000000: 50 41 43 4b 00 00 00 02 00 00 00 06 ...

Interpretation:

  • 50 41 43 4b -> PACK
  • 00000002 -> format version 2
  • 00000006 -> 6 objects in this pack

From the same run, the delta object offset was 6197, and object-type decoding reported ofs-delta, matching verify-pack.

This gives us enough binary evidence without turning this post into a parser implementation guide.

Why This Helps Push and Pull Too

Packfiles are not only about local disk size.

When you fetch, pull, or push, Git negotiates object reachability and commonly transfers compact pack streams. So the same design helps on both axes:

  • smaller local storage footprint,
  • smaller network payloads.

That is why packfiles matter even if you never manually run low-level Git internals commands.

Space Savings Snapshot

Metric Value
Blob v1 logical size 22044 bytes
Blob v2 logical size 22054 bytes
Blob v1 loose file size 6886 bytes
Blob v2 loose file size 6893 bytes
Two loose blobs total 13779 bytes
All loose objects total 14191 bytes
Pack only 6237 bytes
Pack + idx 7477 bytes
Savings vs pack only 56.05%
Savings vs pack + idx 47.31%
Delta object (logical / packed) 9 / 20 bytes

Reproduce This Yourself (Optional)

You do not need to run anything to follow this article, but if you want to reproduce the same flow:

# full script in appendix

curl https://gist.githubusercontent.com/shrsv/a86381552282e25342c89fcc78898ae7/raw/6a0967ef4a34a8df65fc01126c8ff14be7d20f00/git-packfiles-demo.sh 

chmod git-packfiles-demo.sh

./git-packfiles-demo.sh --artifact-dir /tmp/git-packfiles-artifacts --keep-workdir

Artifact files generated:

  • run.log
  • summary.env
  • verify-pack-full.txt
  • verify-pack-focus.txt
  • pack-header-64.txt
  • delta-object-64.txt

Final Takeaway

Git chooses a practical strategy:

  • write objects simply first,
  • compact them intelligently later.

Packfiles are the second half of that strategy, and the measured 9/20 delta from this run makes the benefit concrete.


Appendix: Full Reproduction Script