Git Packfiles from the Ground Up: What They Are and Why They Matter
Most of us trust Git every day without thinking too hard about how it stores data.
Then one day you notice something odd: "I changed just one line. Why did Git create another full blob?"
That question is exactly where packfiles become interesting.
This article is a gentle walkthrough of what happens before and after packing, why it matters for both disk usage and network transfer, and how to verify all of it with real evidence.
Primary reference:
https://git-scm.com/book/en/v2/Git-Internals-Packfiles
A Quick Mental Model
Think of Git storage in two phases:
- Write phase (loose objects): Git stores each object independently. This is simple and robust.
- Compaction phase (packfiles): Git rewrites many objects into a compact binary format and stores deltas where possible.
You can picture it like this:
Before gc (loose objects)
repo.rb v1 -> loose blob
repo.rb v2 -> another loose blob
commit/tree -> loose objects
After gc (packed)
pack file = [base objects + delta objects + metadata]
idx file = [offset map into pack]
This "simple first, optimize later" strategy is one reason Git feels fast and reliable.
The Question We Want to Answer
If two versions of a large file are almost the same, how expensive is each version:
- before packing, and
- after packing?
To answer that, I ran a reproducible experiment and captured every relevant output.
What We Observed from a Real Run
In the experiment, we commit a large file (repo.rb), append one line, commit again, and then run git gc manually.
Here are the key measured values:
blob_v1_sha1=033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5
blob_v2_sha1=b042a60ef7dff760008df33cee372b945b6e884e
blob_v1_logical_bytes=22044
blob_v2_logical_bytes=22054
blob_v1_loose_bytes=6886
blob_v2_loose_bytes=6893
two_blob_loose_total_bytes=13779
loose_total_bytes=14191
pack_bytes=6237
idx_bytes=1240
pack_plus_idx_bytes=7477
saved_vs_pack_only_pct=56.05
saved_vs_pack_plus_idx_pct=47.31
delta_object_bytes=9
delta_packed_bytes=20
base_object_bytes=22054
base_packed_bytes=5799
If you only remember one thing from this article, remember this:
- before packing, each of the two near-identical versions costs a full loose blob (6886 and 6893 bytes in this run);
- after packing, one version can become a tiny delta (20 packed bytes in this run).
Step 1: What Loose Objects Are Doing
After the first commit:
- Blob SHA: 033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5
- Logical size: 22044
- Loose on-disk bytes: 6886
After adding one line and committing again:
- Blob SHA: b042a60ef7dff760008df33cee372b945b6e884e
- Logical size: 22054
- Loose on-disk bytes: 6893
The important teaching point here is subtle:
Git loose objects are not trying to be globally optimal at write time. They are trying to be straightforward and correct. That is why tiny edits can still produce another standalone blob before compaction.
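To make that concrete, here is a minimal Python sketch of how a loose blob is built in a SHA-1 repository: a `blob <size>\0` header, then the content, hashed with SHA-1 and compressed with zlib. The file contents below are a hypothetical stand-in, since the real repo.rb is not reproduced in this article, so the printed ids and sizes will not match the run above.

```python
import hashlib
import zlib


def loose_object(content: bytes) -> tuple[str, bytes]:
    """Return (object id, compressed bytes) for a blob stored as a loose object."""
    # A blob is stored as the header "blob <size>\0" followed by the raw content.
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    # The object id is the SHA-1 of header + content (SHA-1 repositories).
    oid = hashlib.sha1(store).hexdigest()
    # The loose file on disk holds the zlib-compressed header + content.
    return oid, zlib.compress(store)


# Hypothetical stand-in for repo.rb; not the file used in the measured run.
v1 = b"".join(b"def method_%04d; end\n" % i for i in range(1000))
v2 = v1 + b"# one more line\n"

oid1, loose1 = loose_object(v1)
oid2, loose2 = loose_object(v2)

# Two different ids and two full-size compressed objects, despite a one-line edit.
print(oid1, len(v1), len(loose1))
print(oid2, len(v2), len(loose2))
```

Nothing in this write path looks at the previous version of the file, which is exactly why the second blob is stored whole.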
Step 2: What git gc Changes
When we run git gc, loose objects are repacked.
Observed sizes:
- .pack: 6237 bytes
- .idx: 1240 bytes
- pack + index combined: 7477 bytes
Compared to loose total (14191), that is a major reduction:
- 56.05% saved when counting the pack alone
- 47.31% saved even when the index is included
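If you want to check the percentages yourself, the arithmetic is short. This sketch only uses the measured values from this run; the helper name `saved_pct` is mine.

```python
# Measured values copied from this run.
loose_total_bytes = 14191
pack_bytes = 6237
idx_bytes = 1240


def saved_pct(before: int, after: int) -> float:
    """Percentage of bytes saved when going from `before` to `after`."""
    return round((before - after) / before * 100, 2)


pack_plus_idx_bytes = pack_bytes + idx_bytes

print(pack_plus_idx_bytes)                                # 7477
print(saved_pct(loose_total_bytes, pack_bytes))           # 56.05
print(saved_pct(loose_total_bytes, pack_plus_idx_bytes))  # 47.31
```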
This is not a corner case. It is the core mechanism Git relies on for repositories with lots of similar history.
Step 3: The Most Important Proof (git verify-pack)
These two lines explain almost everything:
033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5 blob 9 20 6197 1 b042a60ef7dff760008df33cee372b945b6e884e
b042a60ef7dff760008df33cee372b945b6e884e blob 22054 5799 306
What this means in plain terms:
- b042... (newer version) is stored as the base representation.
- 033b... (older version) is stored as a delta referencing b042....
- That delta is tiny: logical 9 bytes, packed 20 bytes.
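The column layout of `git verify-pack -v` is: object id, type, size, size in pack, offset, and for deltified objects also the delta depth and the base object id. Here is a small sketch that reads the two lines above into named fields; `PackEntry` and `parse_verify_pack_line` are illustrative names, not part of Git.

```python
from typing import NamedTuple, Optional


class PackEntry(NamedTuple):
    oid: str
    kind: str
    size: int                # for a delta entry, the size of the delta data
    size_in_pack: int
    offset: int
    depth: Optional[int]     # only present for deltified entries
    base_oid: Optional[str]  # only present for deltified entries


def parse_verify_pack_line(line: str) -> PackEntry:
    fields = line.split()
    oid, kind = fields[0], fields[1]
    size, size_in_pack, offset = (int(f) for f in fields[2:5])
    if len(fields) >= 7:
        return PackEntry(oid, kind, size, size_in_pack, offset, int(fields[5]), fields[6])
    return PackEntry(oid, kind, size, size_in_pack, offset, None, None)


lines = [
    "033b4468fa6b2a9547a70d88d1bbe8bf3f9ed0d5 blob 9 20 6197 1 b042a60ef7dff760008df33cee372b945b6e884e",
    "b042a60ef7dff760008df33cee372b945b6e884e blob 22054 5799 306",
]
delta_entry, base_entry = (parse_verify_pack_line(line) for line in lines)

print(delta_entry.base_oid == base_entry.oid)                 # True
print(delta_entry.size_in_pack + base_entry.size_in_pack)     # 5819
```

Together the two blob versions take 5819 bytes inside the pack, compared with 13779 bytes as two loose files.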
The numbers are the punchline. One large version plus one tiny delta beats storing two near-duplicate full blobs.
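Why can the delta be as small as 9 bytes? A Git delta starts with two variable-length sizes (base and target) and is followed by copy and insert instructions. The sketch below is a reconstruction under assumptions, not bytes captured from the run: it assumes the older version is an exact prefix of the newer one (the edit appended one line) and that a single copy instruction covers it. The file contents are hypothetical stand-ins with the same sizes as the measured blobs.

```python
def encode_size(n: int) -> bytes:
    """Base-128 varint, least significant group first, as used in the delta header."""
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | 0x80 if n else byte)
        if not n:
            return bytes(out)


def copy_instruction(offset: int, size: int) -> bytes:
    """Encode 'copy `size` bytes starting at `offset` in the base'; zero bytes are omitted."""
    assert 0 < size < 0x10000 and 0 <= offset < 2**32
    cmd, payload = 0x80, bytearray()
    for i in range(4):                         # offset bytes, flags 0x01..0x08
        byte = (offset >> (8 * i)) & 0xFF
        if byte:
            cmd |= 1 << i
            payload.append(byte)
    for i in range(2):                         # size bytes, flags 0x10 and 0x20
        byte = (size >> (8 * i)) & 0xFF
        if byte:
            cmd |= 0x10 << i
            payload.append(byte)
    return bytes([cmd]) + bytes(payload)


def read_size(data: bytes, pos: int) -> tuple[int, int]:
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def apply_delta(base: bytes, delta: bytes) -> bytes:
    base_size, pos = read_size(delta, 0)
    target_size, pos = read_size(delta, pos)
    assert base_size == len(base)
    out = bytearray()
    while pos < len(delta):
        cmd = delta[pos]
        pos += 1
        if cmd & 0x80:                         # copy from base
            offset = size = 0
            for i in range(4):
                if cmd & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):
                if cmd & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            size = size or 0x10000             # an encoded size of 0 means 0x10000
            out += base[offset:offset + size]
        elif cmd:                              # insert the next `cmd` literal bytes
            out += delta[pos:pos + cmd]
            pos += cmd
        else:
            raise ValueError("command byte 0 is reserved")
    assert len(out) == target_size
    return bytes(out)


# Hypothetical contents with the measured sizes: v2 is 22054 bytes, v1 is its first 22044.
v2 = b"x" * 22043 + b"\n" + b"new line!\n"
v1 = v2[:22044]

delta = encode_size(len(v2)) + encode_size(len(v1)) + copy_instruction(0, len(v1))
print(len(delta))                  # 9
print(apply_delta(v2, delta) == v1)  # True
```

Three bytes for each size plus a three-byte copy instruction gives 9 bytes, which is consistent with the logical delta size reported by verify-pack. That agreement is suggestive rather than proof of the exact bytes Git wrote.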
Step 4: A Small Byte-Level Peek
Packfiles are binary containers, not text files. We can still verify structure quickly.
Header excerpt:
00000000: 50 41 43 4b 00 00 00 02 00 00 00 06 ...
Interpretation:
- 50 41 43 4b -> PACK
- 00000002 -> format version 2
- 00000006 -> 6 objects in this pack
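The same 12 bytes can be decoded in a few lines. This sketch hard-codes the header excerpt shown above instead of reading a real .pack file; the header is a 4-byte signature followed by two big-endian 32-bit integers.

```python
import struct

# The first 12 bytes of the pack, copied from the hex excerpt above.
header = bytes.fromhex("50 41 43 4b 00 00 00 02 00 00 00 06")

# ">4sII": 4-byte signature, then version and object count as big-endian uint32.
signature, version, object_count = struct.unpack(">4sII", header)

print(signature, version, object_count)  # b'PACK' 2 6
```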
From the same run, the delta object offset was 6197, and object-type decoding reported ofs-delta, matching verify-pack.
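For readers curious what "object-type decoding" involves: each pack entry starts with a small header whose first byte carries a 3-bit type and the low 4 bits of the size, with a continuation bit for larger sizes. The sketch below decodes that header. The sample bytes are constructed by hand to match the sizes in this run (a 9-byte ofs-delta and a 22054-byte blob); they are not copied from the actual pack.

```python
OBJECT_TYPES = {1: "commit", 2: "tree", 3: "blob", 4: "tag", 6: "ofs-delta", 7: "ref-delta"}


def decode_entry_header(data: bytes, pos: int = 0) -> tuple[str, int, int]:
    """Return (type name, uncompressed size, position just after the header)."""
    byte = data[pos]
    pos += 1
    type_id = (byte >> 4) & 0x07   # bits 4-6 of the first byte
    size = byte & 0x0F             # low 4 bits of the size
    shift = 4
    while byte & 0x80:             # high bit set: another size byte follows
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return OBJECT_TYPES[type_id], size, pos


# Hand-constructed examples, not bytes read from the pack in this run.
print(decode_entry_header(bytes([0x69])))              # ('ofs-delta', 9, 1)
print(decode_entry_header(bytes([0xB6, 0xE2, 0x0A])))  # ('blob', 22054, 3)
```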
This gives us enough binary evidence without turning this post into a parser implementation guide.
Why This Helps Push and Pull Too
Packfiles are not only about local disk size.
When you fetch, pull, or push, Git negotiates object reachability and commonly transfers compact pack streams. So the same design helps on both axes:
- smaller local storage footprint,
- smaller network payloads.
That is why packfiles matter even if you never run Git's low-level plumbing commands yourself.
Space Savings Snapshot
| Metric | Value |
|---|---|
| Blob v1 logical size | 22044 bytes |
| Blob v2 logical size | 22054 bytes |
| Blob v1 loose file size | 6886 bytes |
| Blob v2 loose file size | 6893 bytes |
| Two loose blobs total | 13779 bytes |
| All loose objects total | 14191 bytes |
| Pack only | 6237 bytes |
| Pack + idx | 7477 bytes |
| Savings vs pack only | 56.05% |
| Savings vs pack + idx | 47.31% |
| Delta object (logical / packed) | 9 / 20 bytes |
Reproduce This Yourself (Optional)
You do not need to run anything to follow this article, but if you want to reproduce the same flow:
# full script in appendix
curl -fsSL -o git-packfiles-demo.sh https://gist.githubusercontent.com/shrsv/a86381552282e25342c89fcc78898ae7/raw/6a0967ef4a34a8df65fc01126c8ff14be7d20f00/git-packfiles-demo.sh
chmod +x git-packfiles-demo.sh
./git-packfiles-demo.sh --artifact-dir /tmp/git-packfiles-artifacts --keep-workdir
Artifact files generated:
- run.log
- summary.env
- verify-pack-full.txt
- verify-pack-focus.txt
- pack-header-64.txt
- delta-object-64.txt
Final Takeaway
Git chooses a practical strategy:
- write objects simply first,
- compact them intelligently later.
Packfiles are the second half of that strategy, and the delta measured in this run (9 logical bytes, 20 packed bytes) makes the benefit concrete.