The freesurfer repository contains many large binary files that can't be stored directly on GitHub. Instead, they are stored in a separate location and retrieved only when needed, for example when updating test data or performing a local installation. Think of source code as text files, and of data files as binary files that are not required for compilation. This page details how to work with git-annex, the software used for storing and retrieving these data files in the freesurfer git repo, and is mostly meant as a guide for internal maintainers. For basic instructions on retrieving annex data, visit the build guide. Additional documentation can be found on the git-annex website.
Git annex is a tool for storing large files outside of the main repository. When you commit a new data file with git annex, the main repo tracks the file, but stores it as a symlink rather than as the file contents. For example, if you were to create a fresh clone of the freesurfer repository from GitHub, you would see that mri_convert/testdata.tar.gz is actually a broken relative symlink to a hashed file in the .git/annex directory of your repo. It's broken because you have yet to pull the actual data files from the separate annex repository in /space/freesurfer/repo/annex.git. Instructions for doing this are given below.
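One way to see which annexed files still lack their content is to look for broken symlinks in the working tree. The following is a minimal sketch, assuming that every broken symlink outside .git is an annex pointer; the function name list_missing_annex_files is hypothetical and not part of git-annex.

```shell
# Hypothetical helper: list symlinks under a directory whose target is missing.
# In a fresh clone these are the annexed files whose content has not been
# retrieved yet (assumption: no other broken symlinks exist in the tree).
list_missing_annex_files() {
    dir="${1:-.}"
    # Skip the .git directory, keep only symlinks (-type l), and print those
    # for which "test -e" fails, i.e. the link target does not exist.
    find "$dir" -name .git -prune -o -type l ! -exec test -e {} \; -print
}
```

Called as list_missing_annex_files mri_convert in a fresh clone, this should list mri_convert/testdata.tar.gz until its content has been retrieved.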
As described in the build guide, the annex data source must be set up as a remote repository. For those developing on the Martinos filesystem, cd into your repository and run:
git remote add datasrc file:///space/freesurfer/repo/annex.git
For those developing outside of Martinos, run:
git remote add datasrc https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git
Afterwards, the output of git remote -v should look something like this:
datasrc file:///space/freesurfer/repo/annex.git (fetch)
datasrc file:///space/freesurfer/repo/annex.git (push)
origin git@github.com:ahoopes/freesurfer.git (fetch)
origin git@github.com:ahoopes/freesurfer.git (push)
upstream git@github.com:freesurfer/freesurfer.git (fetch)
upstream git@github.com:freesurfer/freesurfer.git (push)
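The choice between the two URLs can be scripted. The sketch below assumes that the presence of /space/freesurfer/repo/annex.git on disk is a good enough test for being on the Martinos filesystem; the helper names datasrc_url and add_datasrc_remote are hypothetical.

```shell
# Hypothetical helper: print the annex URL appropriate for this machine.
# The optional argument overrides the local path to probe.
datasrc_url() {
    local_repo="${1:-/space/freesurfer/repo/annex.git}"
    if [ -d "$local_repo" ]; then
        # Local annex is visible: use the file:// form
        printf 'file://%s\n' "$local_repo"
    else
        # Otherwise fall back to the public mirror
        printf '%s\n' 'https://surfer.nmr.mgh.harvard.edu/pub/dist/freesurfer/repo/annex.git'
    fi
}

# Hypothetical helper: add the datasrc remote only if it is not configured yet.
# Must be run from inside the freesurfer repository.
add_datasrc_remote() {
    if git remote get-url datasrc >/dev/null 2>&1; then
        echo "datasrc already configured: $(git remote get-url datasrc)"
    else
        git remote add datasrc "$(datasrc_url)"
    fi
}
```

After sourcing these definitions, running add_datasrc_remote from the repository root should leave git remote -v showing a datasrc entry.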
Adding a File
Generally, only the freesurfer source code administrators should add a file to the annex - especially since only users at the Martinos Center will have write access to the filesystem. The following example assumes we want to add an example test script and test data tarball called 'testdata.tar.gz' to a subdirectory:
git add test.sh
git annex add testdata.tar.gz
git commit -m "added test to subdirectory"
git push
git annex copy --to datasrc
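Because the order of these steps matters (the commit must exist before the content is copied to the data source), it can help to wrap them in one function that stops at the first failure. This is only a sketch: the names run, annex_add_file, and the DRY_RUN variable are hypothetical, and unlike the commands above it passes the file name to git annex copy so that only that file is uploaded.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
# Note: the printed form is for inspection only and is not re-quoted.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: annex one data file, commit, push, and upload its content.
# Usage: annex_add_file <file> <commit message>
annex_add_file() {
    file="$1"
    message="$2"
    # Each step runs only if the previous one succeeded
    run git annex add "$file" &&
    run git commit -m "$message" &&
    run git push &&
    run git annex copy --to datasrc "$file"
}
```

Setting DRY_RUN=1 before the call, as in DRY_RUN=1 annex_add_file testdata.tar.gz "added test data", prints the four commands without touching the repository. Ordinary source files such as test.sh still need a separate git add before the commit.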
Getting a File
First, fetch the state of the remote data source. This must be done every time you want to download new annex data:
git fetch datasrc
Then, to retrieve the contents of a data file:
git annex get mri_convert/testdata.tar.gz
Or to retrieve everything under the current directory:
git annex get .
Modifying a File
To modify the contents of a data file, first unlock it, which eliminates the symlink:
git annex unlock mri_convert/testdata.tar.gz
Then, after making modifications, re-add it to the annex:
git annex add mri_convert/testdata.tar.gz
git commit -m "updated the test data"
git push
git annex copy --to datasrc
Git-annex provides the ability to tag data files. Freesurfer uses tags so that subsets of the data can be retrieved without having to download everything. The data files have been broken down into the following two categories:
Files required for build time checks - makecheck
Files required for a local installation - makeinstall
It is essential that data files are tagged properly so that our servers and disk space are not overwhelmed when only a known subset of the data is required.
Get Tagged Files
To get only the data files required for installation:
git fetch datasrc
git annex get --metadata fstags=makeinstall .
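To retrieve files carrying either of the two tags in one pass, the metadata conditions can be joined with git-annex's --or matching option. The sketch below builds that argument list and rejects tag names other than makecheck and makeinstall; the function name fstag_match_args is hypothetical.

```shell
# Hypothetical helper: turn a list of fstags values into git-annex matching
# options, e.g. "--metadata fstags=makecheck --or --metadata fstags=makeinstall".
fstag_match_args() {
    args=""
    for tag in "$@"; do
        # Only the two categories described on this page are accepted
        case "$tag" in
            makecheck|makeinstall) ;;
            *) echo "unknown fstags value: $tag" >&2; return 1 ;;
        esac
        # Join successive conditions with --or
        if [ -n "$args" ]; then
            args="$args --or"
        fi
        args="$args --metadata fstags=$tag"
    done
    # Drop the leading space left by the first append
    printf '%s\n' "${args# }"
}
```

A possible use, after git fetch datasrc, is git annex get $(fstag_match_args makecheck makeinstall) . with the command substitution left unquoted on purpose so that the shell splits it into separate options.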
To show all the metadata associated with a file:
git annex metadata mri_convert/testdata.tar.gz
Assigning metadata to a data file is the job of a source code administrator, similar to adding a data file. When adding metadata to an annex file, it is best to start with a clean checkout of the repository and be on the 'dev' branch. Then add the tag as follows:
git annex metadata mri_convert/testdata.tar.gz -s fstags=makecheck
git annex copy --to datasrc
We can also append tags:
git annex metadata mri_convert/testdata.tar.gz -s fstags+=makeinstall
git annex copy --to datasrc
There is no need to perform any commits, pushes, or pull requests after this is done.
Listing Files with a Given Tag
git annex find --metadata fstags=makeinstall
The git annex repository exists on the local file system at /space/freesurfer/repo/annex.git.
The public-facing git annex repository exists on the local file system (mounted by our server) at /cluster/pubftp/dist/freesurfer/repo/annex.git.
Currently, we mirror the two repositories daily using the following commands:
ssh pinto
rsync -av /space/freesurfer/repo/annex.git/* /cluster/pubftp/dist/freesurfer/repo/annex.git
git update-server-info