DataLad starter: retrieving Brain/MINDS datasets

Overview

Most of the Brain/MINDS public data is available as DataLad datasets hosted on the Brain/MINDS GIN server.

This document is a collection of code snippets useful for handling DataLad datasets, either with DataLad's command-line interface (CLI) or with its Python API.

Prerequisite

DataLad must be installed on your platform (it requires Python, Git, and git-annex).

Please refer to the installation section of the DataLad Handbook.
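
To make sure the installation works, a minimal check from Python is to import the package and print its version (the version you see will of course differ):

import datalad

# print the installed DataLad version to confirm the package is importable
print(datalad.__version__)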

1. Obtaining remote datasets

Cloning a dataset retrieves a local description of all the files it contains, but not their full content yet, because the content of annexed files is not downloaded right away.

First the simple and straightforward way (A), then a slightly more detailed way (B).

A. Clone a dataset in the current directory [simple version]

The local clone is created in a new directory, located within the current directory, which implicitly takes the dataset's name.

With the CLI:

datalad clone https://datasets.brainminds.jp/brainminds/BMA-2019

With the Python API:

import datalad.api as dl

dataset_url = 'https://datasets.brainminds.jp/brainminds/BMA-2019'

# make local copy of the dataset (but do not download data yet)
ds = dl.clone(source=dataset_url)
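
As a quick optional check, the Python API can also report what the fresh clone knows about its files before any content is downloaded. The sketch below assumes the clone created above sits in ./BMA-2019, and relies on status records exposing a has_content flag when annex='availability' is requested:

import datalad.api as dl

# assumes the clone created above, in ./BMA-2019
ds = dl.Dataset('BMA-2019')

# print each file and whether its content is already present locally;
# right after cloning, annexed files should typically show has_content == False
for item in ds.status(annex='availability'):
  if item.get('type') == 'file':
    print(item['path'], 'content present:', item.get('has_content'))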

B. Clone a dataset in a specific directory [detailed version]

The local clone is created in an explicitly specified directory, and any subdatasets are also cloned recursively.

With the CLI:

datalad install -r  \
  -s https://datasets.brainminds.jp/brainminds/BMA-2019  \
  /home/hackathon23/CopyOf-BMA-2019

With the Python API:

import os
import datalad.api as dl

dataset_url = 'https://datasets.brainminds.jp/brainminds/BMA-2019'

# parent directory where the dataset will be copied
base_dir = '/home/hackathon23'

dataset_name = dataset_url.rsplit('/', 1)[-1]
# working dir of the clone
dataset_wd = os.path.join(base_dir, 'CopyOf-' + dataset_name)

# make local copy of the dataset 
try:
  ds = dl.install(path=dataset_wd,
        source=dataset_url,
        recursive=True)

  # at this point the dataset is cloned locally,
  # but file content has not been downloaded yet

except Exception as ex:
  print("Failed to clone dataset", dataset_url, ":", ex)

2. Retrieving actual data

Selectively download the content of only those files that will actually be used locally.

In this example we assume the dataset has already been cloned, and that we only want to retrieve the content of NIfTI files.

With the CLI:

cd /home/hackathon23/CopyOf-BMA-2019

niftis=$(find . -name '*.nii.gz')

datalad get $niftis

With the Python API:

import os
import sys
import glob
import datalad.api as dl

dataset_wd='/home/hackathon23/CopyOf-BMA-2019'

# create dataset object 
ds = dl.Dataset(dataset_wd)

# check that the working dir contains a dataset clone
if not ds.is_installed():
  print('No dataset in', dataset_wd)
  sys.exit(1)

# get paths of NIfTI files
nifti_paths = glob.glob(
                os.path.join(dataset_wd, '**/*.nii.gz'),
                recursive=True)

# actually download the data
for p in nifti_paths:
  ds.get(p)
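
Note that get() also accepts a list of paths, so the loop above can be replaced by a single call, letting DataLad handle all transfers at once (continuing with the ds and nifti_paths variables defined above):

# single call instead of the per-file loop
ds.get(path=nifti_paths)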

All DataLad functionality can be accessed through its Python API. Here, the API allows us to extend the previous example: we check the actual size of the content beforehand and decide whether or not to download each file.

import sys
import datalad.api as dl

dataset_wd='/home/hackathon23/CopyOf-BMA-2019'

# create dataset object 
ds = dl.Dataset(dataset_wd)

# check that the working dir contains a dataset clone
if not ds.is_installed():
  print('No dataset in', dataset_wd)
  sys.exit(1)

# check dataset's status, including size of annexed content
st = ds.status(annex='all')

# we don't want to download files larger than 200 MB
size_limit = 200 * 1024 * 1024

# get paths of NIfTI files whose annexed content is below the size limit
nifti_paths = [e['path'] for e in st
               if e['path'].endswith('.nii.gz')
               and 'bytesize' in e
               and e['bytesize'] < size_limit]

# actually download the data
for p in nifti_paths:
  ds.get(p)

3. Discarding unneeded data

Once data has been successfully processed, it may be worth reclaiming the disk space, since datasets tend to be large!

All annexed content of a dataset can be dropped as follows.

With the CLI:

cd /home/hackathon23/CopyOf-BMA-2019
datalad drop . -r

or

datalad drop -d /home/hackathon23/CopyOf-BMA-2019 -r

With the Python API:

import datalad.api as dl

dataset_wd='/home/hackathon23/CopyOf-BMA-2019'

dl.drop(dataset=dataset_wd, recursive=True)
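
If you only want to free the space used by some files while keeping the clone itself, drop() also accepts explicit paths; to delete the whole local clone, metadata included, remove() can be used instead. The file path below is purely hypothetical, for illustration:

import datalad.api as dl

dataset_wd = '/home/hackathon23/CopyOf-BMA-2019'
ds = dl.Dataset(dataset_wd)

# drop the content of selected files only (hypothetical path, for illustration)
ds.drop(path='sub-01/anat/example.nii.gz')

# to delete the entire local clone, uncomment the line below
# dl.remove(dataset=dataset_wd)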

References