CMake, ExternalData, and custom fetch script

I failed to find a concrete example on how to use the CMake module ExternalData with a custom fetch script. Since I finally manage to work out how to use it I thought I’d try to help out the next person who needs to go down this route.

Why ExternalData?

I thought I’d start with a short justification of why I was looking at the module at all.

At work I work with a product that processes images and video. When writing tests we often need some rather large files (from MiB to GiB) as input. The two obvious options are:

  1. Check the files into our Git repo, or
  2. Put them on shared storage

Neither of these are very appealing. The former just doesn’t feel quite right, these are large binary files that rarely, if ever, change, why place them under version control at all? And if they do change the Git repo is likely to balloon in size and impact cloning times negatively. The latter makes it difficult to run our tests on a machine that isn’t on the office network and any changes to the files will break older tests, unless we always only add files, never modify any in place. On the other hand, the former guarantees that the files needed for testing are always available and it is possible to modify the files without breaking older tests. The pro of the latter is that we only download the files needed for the current tests.

ExternalData is one option to address this. On some level it feels like it offers a combination of both options above:

  • It’s possible to use the shared storage
  • When the shared storage isn’t available it’s possible to fall back on downloading the files via other means
  • The layout of the storage is such that modifying in place is much less likely
  • Only the files needed for the currest tests will be downloaded when building off-site

The object store

We do our building in docker images that do have our shared storage mapped in, so I’d like them to take advantage of that. At the same time I want the builds performed off-site to download the files. To get this behaviour I defined two object stores:

set(ExternalData_OBJECT_STORES
  ${CMAKE_BINARY_DIR}/ExternalData/Objects
  /mnt/shared/over/nfs/Objects
  )

The module will search each of these for the required files and download only if they aren’t found. Downloaded files will be put into the first of the stores. Oh, and it’s very important that the first store is given with an absolute path!

The store on the shared storage looks something like this:

/mnt/share/over/nfs/Objects
└── MD5
    ├── 94ed17f9b6c74a732fba7b243ab945ff
    └── a2036177b190fbee6e9e038b718f1c20

I can then drop a file MyInput.avi.md5 in my source tree with the md5 of the real file (e.g. a2036177b190fbee6e9e038b718f1c20) as the content. Once that is done I can follow the example found in the introduction of the reference documentation.

curl vs sftp

So far so good. This now works on-site, but for off-site use I need to fetch the needed files. The last section of the reference documentation is called Custom Fetch Scripts. It mentions that files are normally downloaded using file(DOWNLOAD). Neither there, nor in the documentation for file is there a mention of what is used under the hood to fetch the files. After asking on in #cmake I found out that it’s curl. While curl does handle SFTP I didn’t get it to work with my known_hosts file, nor with my SSH agent (both from OpenSSH). On the other hand it was rather easy to configure sftp to fetch a file from the internet-facing SSH server we have. Now I just had to hook it into CMake somehow.

Custom fetch script

As the section on “Custom Fetch Scripts” mention three things are needed:

  1. Specify the script via the ExternalDataCustomScript:// protocol.
  2. Tell CMake where it can find the fetch script.
  3. The fetch script itself.

The first two steps are done by providing a URL template and pointing to the script via a special variable:

set(ExternalData_URL_TEMPLATES "ExternalDataCustomScript://sftp/mnt/shared/over/nfs/Objects/%(algo)/%(hash)")
set(ExternalData_CUSTOM_SCRIPT_sftp ${CMAKE_SOURCE_DIR}/cmake/FetchFromSftp.cmake)

It took me a ridiculous amount of time to work out how to write a script that turns out to be rather short. This is an experience that seems to repeat itself when using CMake; it could say something about me, or something about CMake.

get_filename_component(FFS_ObjStoreDir ${ExternalData_CUSTOM_FILE} DIRECTORY)
get_filename_component(FFS_InputFilename ${ExternalData_CUSTOM_LOCATION} NAME)
get_filename_component(FFS_OutputFilename ${ExternalData_CUSTOM_FILE} NAME)

execute_process(COMMAND sftp sftp.company.com:/${ExternalData_CUSTOM_LOCATION}
  RESULT_VARIABLE FFS_SftpResult
  OUTPUT_QUIET
  ERROR_VARIABLE FFS_SftpErr
  )

if(FFS_SftpResult)
  set(ExternalData_CUSTOM_ERROR "Failed to fetch from SFTP - ${FFS_SftpErr}")
else(FFS_SftpResult)
  file(MAKE_DIRECTORY ${FFS_ObjStoreDir})
  file(RENAME ${FFS_InputFilename} ${FFS_ObjStoreDir}/${FFS_OutputFilename})
endif(FFS_SftpResult)

This script is run with cmake -P in the binary dir of the CMakeLists.txt where the test is defined, which means it’s oblivious about the project it’s part of. PROJECT_BINARY_DIR is empty and CMAKE_BINARY_DIR is the same as CMAKE_CURRENT_BINARY_DIR. This is the reason why the first store in ExternalData_OBJECT_STORES has to be an absolute path – it’s very difficult, if not impossible, to find the correct placement of the object store otherwise.

⟸ Qt5+D-Bus+CMake, a complete example Free, take 2 ⟹
Leave a comment