Friday, July 12, 2013

Git Fails On Large Files

Turns out git fails spectacularly when working with large files. I was surprised, but the behavior is pretty well documented. In typical git fashion, there is an obscure error message and an equally obscure command to fix it.

The Problem


A real-life example (with repository names changed):

artem@MBP:~/git$ git clone git@gitlab:has_a_large_file.git
Cloning into 'has_a_large_file'...
Identity added: /Users/artem/.ssh/devkey (/Users/artem/.ssh/devkey)
remote: Counting objects: 6, done.
error: git upload-pack: git-pack-objects died with error.
fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
remote: Compressing objects: 100% (5/5), done.
remote: fatal: Out of memory, malloc failed (tried to allocate 1857915877 bytes)
remote: aborting due to possible repository corruption on the remote side.
fatal: early EOF
fatal: index-pack failed

I pushed the large file without issues, but couldn't pull it again because the remote kept dying. The astute reader will notice the remote was running GitLab. The push also broke the GitLab web interface for the repository.

From my Googling, the problem is that the remote side runs out of memory while compressing the large file into a packfile, the compressed format git uses to store and transfer objects. Judging by the error, git attempts to malloc(size_of_large_file) and the malloc fails.
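To put the failed allocation in perspective, here is a quick bit of shell arithmetic on the byte count from the error message. It is only a sanity check, assuming a shell with 64-bit arithmetic (bash and most modern sh implementations); the variable names are my own.

```shell
# Byte count copied from the "malloc failed" line in the clone output above.
alloc_bytes=1857915877

# Whole mebibytes (integer division truncates).
alloc_mib=$((alloc_bytes / 1024 / 1024))
echo "allocation: ${alloc_mib} MiB"

# Compare against the largest signed 32-bit value.
int32_max=$(( (1 << 31) - 1 ))
if [ "$alloc_bytes" -le "$int32_max" ]; then
    echo "fits in a signed 32-bit integer"
else
    echo "exceeds a signed 32-bit integer"
fi
```

The request works out to roughly 1.7 GiB, which is below 2 GiB, so this looks like the remote genuinely running out of memory rather than a size that overflowed somewhere.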

This situation raises conundrums that may only be answered by Master Git:
  • Why was I able to push a large file, but not pull it?
  • Why would one malloc(size_of_large_file) ?
  • What happens when you push a >4 GB file to a 32-bit remote?

I was curious enough about the last one to look at the code: it will likely die gracefully (see line 49 of wrapper.c). Integer overflow is likely avoided, but I would need to read the code much more carefully to be sure.

The Solution


In theory, the solution is to repack the remote with a smaller pack size limit. That requires ssh access to the remote repository, which I don't have, so the following fix is untested; it is taken from http://www.kevinblake.co.uk/development/git-repack/. The obscure command in question (must be run on the remote):

git repack -a -f -d

Of course, repacking the remote but having non-repacked local repositories around may cause other problems.
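The repack command on its own does not set any limit, so the "smaller pack size limit" part has to come from git configuration. Below is a minimal sketch of how that might look, wrapped in a hypothetical helper function. The config keys (pack.windowMemory, pack.packSizeLimit, pack.threads) are real git settings, but the values are illustrative guesses rather than tuned numbers, and, like the command above, I have not tested whether this fixes the failure. It assumes a git new enough to support the -C option.

```shell
# Sketch only: cap the memory git uses when packing, then repack.
# repack_with_limits is a hypothetical helper; the limits are assumptions.
repack_with_limits() {
    repo_dir="$1"   # path to the repository on the remote

    # Cap the memory used by the delta-compression window (per thread).
    git -C "$repo_dir" config pack.windowMemory 100m
    # Split packs written to disk by repack into files of at most this size.
    git -C "$repo_dir" config pack.packSizeLimit 100m
    # Fewer threads means fewer windows held in memory at once.
    git -C "$repo_dir" config pack.threads 1

    # The command from the post: repack everything, recompute deltas,
    # and delete the now-redundant old packs.
    git -C "$repo_dir" repack -a -f -d
}
```

It would be called on the remote as, for example, repack_with_limits /path/to/has_a_large_file.git (the path is a placeholder). Note that the git documentation describes pack.packSizeLimit as affecting only packs written to disk during a repack, so pack.windowMemory may be the more relevant knob for a clone that dies in pack-objects.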

Just For Fun


Here is another large file fail:

artem@MBP:~/temp/largerandomfile$ dd if=/dev/urandom of=./random_big_file bs=4096 count=1048577
1048577+0 records in
1048577+0 records out
4294971392 bytes transferred in 437.836959 secs (9809522 bytes/sec)

artem@MBP:~/temp/largerandomfile$ git add random_big_file
artem@MBP:~/temp/largerandomfile$ git commit -m "Added a big random file"
[master 377db57] Added a big random file
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 random_big_file

artem@MBP:~/temp/largerandomfile$ git push origin master
Counting objects: 4, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (2/2), done.
error: RPC failed; result=22, HTTP code = 413 KiB/s
fatal: The remote end hung up unexpectedly
Writing objects: 100% (3/3), 4.00 GiB | 18.74 MiB/s, done.
Total 3 (delta 0), reused 1 (delta 0)
fatal: recursion detected in die handler
Everything up-to-date

Everything up-to-date, indeed.
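The odd-looking count in the dd command puts the file exactly one block past 4 GiB. The arithmetic below is a small sketch (assuming 64-bit shell arithmetic; the variable names are illustrative) showing the total, how far it sits past 2^32, and what the size would become if it were ever truncated to an unsigned 32-bit value.

```shell
# bs and count from the dd invocation above.
block_size=4096
block_count=1048577

total_bytes=$((block_size * block_count))
four_gib=$((1 << 32))

echo "total bytes:      $total_bytes"
echo "past 4 GiB by:    $((total_bytes - four_gib))"

# Reducing modulo 2^32 mimics truncation to an unsigned 32-bit size.
wrapped=$((total_bytes % four_gib))
echo "low 32 bits only: $wrapped"
```

The total matches the 4294971392 bytes dd reported. A 32-bit size field would see this file as a mere 4096 bytes, which is why the question about 32-bit remotes is worth asking.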
