smish.dev
cmake_external_data

Overview

One of the libraries we use at work has some large mesh data files that take up about 200 MB in its git repo. The library uses these files in some of its performance tests, but unless you're running those benchmarks or actively developing the library, you don't really need them. However, there's no good way to opt out of downloading those files when cloning the repo, so you end up downloading them whether or not you need them.

CMake has a module, ExternalData, that can help address this problem by associating data files with certain targets, and only downloading those files when those targets are built. This makes downloading the large files opt-in, so the costs are only incurred when those files are needed.

CMake's documentation on ExternalData is good, but there aren't very many example projects of this module in use, so this post will walk through an example of how to use this feature.

Imagine we have a C++ project in which one of the tests is an executable that takes the path to a data file as an argument.

We could register a test with CMake that runs this executable with a given data file:
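A minimal sketch of what that registration might look like. The project name, the target name my_test, and the paths tests/my_test.cpp and data/data_file.bin are placeholders chosen for illustration, not taken from the original project:

```cmake
cmake_minimum_required(VERSION 3.16)
project(external_data_example CXX)  # hypothetical project name

# Tests are opt-in, so a plain library build does not need the data file.
option(BUILD_TESTS "Build the tests" OFF)

if(BUILD_TESTS)
  enable_testing()

  # Hypothetical test executable that reads the file given as its first argument.
  add_executable(my_test tests/my_test.cpp)

  # Run the executable with the path to the data file in the source tree.
  add_test(NAME my_test
    COMMAND my_test "${CMAKE_CURRENT_SOURCE_DIR}/data/data_file.bin")
endif()
```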

If we did that, then after building we could run ctest, and it would run the command we specified.

So far so good, except this test assumes that data_file.bin is already available. Ideally, the user wouldn't have to download this file if BUILD_TESTS were off, and ExternalData lets us do that by making a few small changes. First, we need to hash the data file with one of the supported algorithms. Using MD5 in this example, we compute the hash of the data file and write that value to the file data_file.bin.md5. This will be the file that we keep in our repo, and it serves as a placeholder for the actual data.

Next, rename the actual data file to the hash value calculated above, and put it somewhere (e.g. on the local filesystem, or in a separate repo on GitHub) in a directory named MD5.
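The hashing and renaming steps can be done with any MD5 tool. As one possible sketch, here they are written as a CMake script so that everything stays in one language. The script name hash_data_file.cmake and the store directory are assumptions made for illustration:

```cmake
# hash_data_file.cmake (hypothetical name); run with:
#   cmake -P hash_data_file.cmake
# Assumes data_file.bin sits next to this script.
set(data_file "${CMAKE_CURRENT_LIST_DIR}/data_file.bin")
set(store_dir "${CMAKE_CURRENT_LIST_DIR}/store")  # hypothetical storage location

# Compute the MD5 hash of the file contents.
file(MD5 "${data_file}" data_hash)

# Write the placeholder that stays in the repo: data_file.bin.md5.
file(WRITE "${data_file}.md5" "${data_hash}")

# Move the real data to <store>/MD5/<hash>.
# Note: this removes data_file.bin from its original location.
file(MAKE_DIRECTORY "${store_dir}/MD5")
file(RENAME "${data_file}" "${store_dir}/MD5/${data_hash}")

message(STATUS "Stored data_file.bin as MD5/${data_hash}")
```

After running it, only data_file.bin.md5 remains next to the script, and the contents of the store directory can be uploaded to wherever the data will be hosted.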

For this example, I put this data file in a GitHub repo.

Then, we tell CMake how to locate the actual data files from their hashes, by using ExternalData:
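A sketch of that configuration is below. The two locations are placeholders, not the URLs used in the original example; ExternalData substitutes %(algo) with the algorithm name (MD5 here) and %(hash) with the contents of the .md5 placeholder file:

```cmake
include(ExternalData)

# Templates are tried in order until one of them yields the file.
# Both locations below are hypothetical.
set(ExternalData_URL_TEMPLATES
  "file:///path/to/local/store/%(algo)/%(hash)"
  "https://example.com/my_project_data/%(algo)/%(hash)"
)
```

This is why the data file was renamed to its hash and placed in a directory named MD5: the template expands to exactly that path.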

Finally, we use ExternalData_Add_Test instead of add_test to register the test. Note how the data file name is wrapped in DATA{...}.
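A minimal sketch of the registration, assuming the ExternalData module has already been included and its URL templates set. As before, the target names my_test and my_test_data and the file paths are placeholders:

```cmake
if(BUILD_TESTS)
  enable_testing()
  add_executable(my_test tests/my_test.cpp)  # hypothetical test executable

  # Same arguments as add_test, preceded by the name of a data target.
  # DATA{data/data_file.bin} refers to data/data_file.bin.md5 in the source
  # tree and is replaced with the path to the fetched file in the build tree.
  ExternalData_Add_Test(my_test_data
    NAME my_test
    COMMAND my_test DATA{data/data_file.bin}
  )

  # Create the custom target that fetches the referenced data when built.
  ExternalData_Add_Target(my_test_data)
endif()
```

Because both calls sit inside the BUILD_TESTS guard, no data target exists when tests are disabled, so nothing is fetched.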

Now, if we configure CMake with -DBUILD_TESTS=TRUE and build, we will see CMake fetch the data files associated with the targets that are being built.

If we had configured with -DBUILD_TESTS=FALSE, these downloads would have been skipped, because the targets that require the data files would not be built. Just like before, we can run ctest and verify that everything is working.

The complete example repo can be found on GitHub.

Summary

All in all, this feature of CMake does provide a way to manage these large files, but it seems like a considerable amount of work to set up and use. In practice, I'd rather just put the data files in a separate repo that is optionally included by something like ExternalProject or as a git submodule.
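For comparison, here is a rough sketch of the ExternalProject alternative. The repository URL, the tag, and the target names test_data and my_test are placeholders; the empty configure, build, and install commands turn the external project into a plain download step:

```cmake
include(ExternalProject)

if(BUILD_TESTS)
  # Clone a (hypothetical) data-only repo at build time.
  ExternalProject_Add(test_data
    GIT_REPOSITORY https://example.com/my_project_data.git
    GIT_TAG main
    SOURCE_DIR "${CMAKE_BINARY_DIR}/test_data"
    CONFIGURE_COMMAND ""
    BUILD_COMMAND ""
    INSTALL_COMMAND ""
  )

  # Make sure the data is cloned before the test executable is built.
  # Assumes a my_test target was defined earlier in this file.
  add_dependencies(my_test test_data)
endif()
```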