Git Annex

The freesurfer repository contains many large binary files that can't be stored directly on GitHub. Instead, they are kept in a separate storage space and retrieved only when needed, for example when updating test data or performing a local installation. As a rule of thumb, source code consists of text files, while data files are binary files that are not required for compilation. This page explains how to work with git-annex, the software used to store and retrieve these data files in the freesurfer git repo. It is mostly meant as a guide for internal maintainers. For basic instructions on retrieving annex data, visit the build guide. Additional documentation can be found on the git-annex website.

General Concept

git-annex is a tool for storing large files outside of the main repository. When you commit a new data file with git-annex, the main repo tracks the file but stores only a symlink to it. For example, in a fresh clone of the freesurfer repository from GitHub, mri_convert/testdata.tar.gz is a broken relative symlink to a hashed file in the .git/annex directory of your repo. It's broken because you have yet to pull the actual data files from the separate annex repository in /space/freesurfer/repo/annex.git. The sections below describe how to do this.
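
To make the "broken symlink" state concrete, the following is a minimal sketch of a check that reports whether a path still lacks its annexed content. The function name annex_content_status is hypothetical and is not part of git-annex; the sketch only assumes, as described above, that annexed files appear as symlinks whose targets are missing until the data is retrieved.

```shell
# Hypothetical helper (not a git-annex command): report whether the content
# behind a path is available locally.
annex_content_status() {
    if [ -L "$1" ] && [ ! -e "$1" ]; then
        echo "missing"        # symlink exists, but its target does not
    elif [ -e "$1" ]; then
        echo "present"        # regular file, or symlink with an existing target
    else
        echo "no-such-path"   # nothing at this path at all
    fi
}
```

For example, running annex_content_status mri_convert/testdata.tar.gz in a fresh clone should print "missing" until the file has been retrieved.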

Setup

As described in the build guide, the annex data source must be set up as a remote repository. For those developing on the Martinos filesystem, cd into your repository and run:

git remote add datasrc file:///space/freesurfer/repo/annex.git

For those developing outside of Martinos, run:

git remote add datasrc https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git

Afterwards, the output of git remote -v should look something like this:

datasrc       file:///space/freesurfer/repo/annex.git (fetch)
datasrc       file:///space/freesurfer/repo/annex.git (push)
origin        git@github.com:ahoopes/freesurfer.git (fetch)
origin        git@github.com:ahoopes/freesurfer.git (push)
upstream      git@github.com:freesurfer/freesurfer.git (fetch)
upstream      git@github.com:freesurfer/freesurfer.git (push)
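
The choice between the two URLs can also be scripted. The sketch below is one possible approach, not an official setup script: it assumes that the presence of /space/freesurfer/repo/annex.git is a reasonable indicator of being on the Martinos filesystem. The function names annex_remote_url and setup_datasrc are hypothetical, and the optional path argument exists only so the logic can be tried out elsewhere.

```shell
# Hypothetical helper: print the datasrc URL suited to this machine.
# Optional argument: path to the local annex (defaults to the Martinos path).
annex_remote_url() {
    annex_dir="${1:-/space/freesurfer/repo/annex.git}"
    if [ -d "$annex_dir" ]; then
        echo "file://$annex_dir"
    else
        echo "https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git"
    fi
}

# Hypothetical helper: add the datasrc remote unless it is already configured.
# Must be run from inside the freesurfer repository.
setup_datasrc() {
    if git remote get-url datasrc >/dev/null 2>&1; then
        echo "datasrc already configured: $(git remote get-url datasrc)"
    else
        git remote add datasrc "$(annex_remote_url)"
    fi
}
```

After defining both functions, run setup_datasrc from the repository root and confirm the result with git remote -v.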

Adding a File

Generally, only the freesurfer source code administrators should add files to the annex, especially since only users at the Martinos Center have write access to the filesystem. The following example adds a test script (test.sh) and a test data tarball (testdata.tar.gz) to a subdirectory:

git add test.sh
git annex add testdata.tar.gz
git commit -m "added test to subdirectory"
git push
git annex copy --to datasrc

Getting a File

First, fetch the state of the remote data source. This must be done every time you want to download new annex data:

git fetch datasrc

Then, to retrieve the contents of a data file:

git annex get mri_convert/testdata.tar.gz

Or to retrieve everything under the current directory:

git annex get .

Modifying a File

To modify the contents of a data file, first unlock it, which replaces the symlink with a regular, editable copy of the file:

git annex unlock mri_convert/testdata.tar.gz

Then, after making modifications, re-add it to the annex:

git annex add mri_convert/testdata.tar.gz
git commit -m "updated the test data"
git push
git annex copy --to datasrc
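
Between the unlock and the re-add, it can be useful to confirm that the file really is editable before changing it. The following is a minimal sketch under the assumption that an unlocked file shows up as a regular file rather than a symlink; annex_is_unlocked is a hypothetical helper, not a git-annex command.

```shell
# Hypothetical helper: succeed only if the path is a regular file that is
# not a symlink, i.e. it looks like an unlocked (editable) annex file.
annex_is_unlocked() {
    [ -f "$1" ] && [ ! -L "$1" ]
}
```

A typical use would be: annex_is_unlocked mri_convert/testdata.tar.gz || echo "run git annex unlock first".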

Tagging

git-annex provides the ability to tag data files with metadata. Freesurfer uses tags so that subsets of the data can be retrieved without downloading everything. The data files are divided into the following two categories:

  1. Files required for build-time checks - makecheck

  2. Files required for a local installation - makeinstall

It is essential that data files are tagged properly so that our servers and disk space are not overwhelmed when only a known subset of the data is required.

Get Tagged Files

To get only the data files required for installation:

git fetch datasrc
git annex get --metadata fstags=makeinstall .
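
Since the fetch must precede every download and only two tags are in use, the two commands above can be wrapped in a small function. This is a sketch rather than an official script: annex_get_tagged and the DRY_RUN variable are hypothetical names introduced here for illustration. With DRY_RUN=1 the function only prints the commands it would run.

```shell
# Hypothetical helper: fetch annex state, then download files carrying one tag.
# Set DRY_RUN=1 to print the commands instead of running them.
annex_get_tagged() {
    tag="${1:-}"
    case "$tag" in
        makecheck|makeinstall) ;;   # the two tags described on this page
        *)
            echo "unknown tag '$tag' (expected makecheck or makeinstall)" >&2
            return 2
            ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "git fetch datasrc"
        echo "git annex get --metadata fstags=$tag ."
    else
        git fetch datasrc && git annex get --metadata "fstags=$tag" .
    fi
}
```

For example, annex_get_tagged makecheck retrieves only the files needed for build-time checks under the current directory, and a mistyped tag is rejected before any network access happens.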

Display Metadata

To show all the metadata associated with a file:

git annex metadata mri_convert/testdata.tar.gz

Assign Metadata

Assigning metadata to a data file is the job of a source code administrator, similar to adding a data file. When adding metadata to an annex file, it is best to start with a clean checkout of the repository and be on the 'dev' branch. Then add the tag as follows:

git annex metadata mri_convert/testdata.tar.gz -s fstags=makecheck
git annex copy --to datasrc

We can also append tags:

git annex metadata mri_convert/testdata.tar.gz -s fstags+=makeinstall
git annex copy --to datasrc

There is no need to commit, push, or open a pull request after this is done.

Listing Files with a Given Tag

To list all files under the current directory that carry a given tag, for example makeinstall:

git annex find --metadata fstags=makeinstall

Mirroring

The git annex repository exists on the local file system in the following directory:

/space/freesurfer/repo/annex.git

The public-facing git annex repository exists on the local file system in the following directory (mounted by our server):

/cluster/pubftp/dist/freesurfer/repo/annex.git

Currently, we mirror the two repositories daily using the following commands:

ssh pinto
rsync -av /space/freesurfer/repo/annex.git/* /cluster/pubftp/dist/freesurfer/repo/annex.git
git update-server-info

GitAnnex (last edited 2019-02-03 12:10:15 by AndrewHoopes)