DataLad starter: retrieve Brain/MINDS datasets
Overview
Most of the Brain/MINDS public data is available as DataLad datasets hosted on the Brain/MINDS Gin server.
This document is a collection of code snippets useful for handling DataLad datasets with DataLad's CLI or with its Python API.
Prerequisite
DataLad must be installed on your platform (it requires Python, Git, and git-annex).
Please refer to the installation section of the DataLad Handbook.
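Once installed, a quick sanity check from Python (a minimal sketch; it simply prints the installed version) confirms that the API is importable:
import datalad
# if this import succeeds, the Python API is ready to use
print(datalad.__version__)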
1. Obtaining remote datasets
Getting a local description of all the files contained in the dataset, but not their full content yet (annexed files are not downloaded right away).
First the simple and straightforward way (A.), then a slightly more detailed way (B.).
A. Clone a dataset in the current directory [simple version]
The local clone is created in a new directory which implicitly takes the dataset's name, and is located within the current directory.
- With the CLI:
datalad clone https://datasets.brainminds.jp/brainminds/BMA-2019
- With Python:
import datalad.api as dl
dataset_url = 'https://datasets.brainminds.jp/brainminds/BMA-2019'
# make local copy of the dataset (but do not download data yet)
ds = dl.clone(source=dataset_url)
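As a quick check (a minimal sketch using standard attributes of DataLad's Dataset class), the returned object can report where the clone was created:
# the Dataset object returned by clone() knows its local path
print(ds.path)
# True once the clone exists on disk
print(ds.is_installed())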
B. Clone a dataset in a specific directory [detailed version]
The local clone is created in the explicitly specified directory, and any subdatasets are also cloned recursively.
- With the CLI:
datalad install -r \
-s https://datasets.brainminds.jp/brainminds/BMA-2019 \
/home/hackathon23/CopyOf-BMA-2019
- With Python:
import os
import datalad.api as dl
dataset_url = 'https://datasets.brainminds.jp/brainminds/BMA-2019'
# parent directory where the dataset will be copied
base_dir = '/home/hackathon23'
dataset_name = dataset_url.rsplit('/', 1)[-1]
# working dir of the clone
dataset_wd = os.path.join(base_dir, 'CopyOf-' + dataset_name)
# make local copy of the dataset
try:
    ds = dl.install(path=dataset_wd,
                    source=dataset_url,
                    recursive=True)
    # at this point the dataset is cloned locally,
    # but the files' content has not been downloaded yet...
except Exception as ex:
    print('Failed to clone dataset', dataset_url, ':', ex)
2. Retrieving actual data
Selectively downloading the content of files that will be used locally.
In this example we assume the dataset has already been cloned, and that we only want to retrieve the content of NIfTI files.
- With the CLI:
cd /home/hackathon23/CopyOf-BMA-2019
niftis=$(find . -name '*.nii.gz')
datalad get $niftis
- With Python:
import os
import sys
import glob
import datalad.api as dl
dataset_wd = '/home/hackathon23/CopyOf-BMA-2019'
# create dataset object
ds = dl.Dataset(dataset_wd)
# check that the working dir contains a dataset clone
if not ds.is_installed():
    print('No dataset in', dataset_wd)
    sys.exit(1)
# get the paths of the NIfTI files
nifti_paths = glob.glob(
    os.path.join(dataset_wd, '**/*.nii.gz'),
    recursive=True)
# actually download the data
for p in nifti_paths:
    ds.get(p)
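Note that get() also accepts a list of paths, so the loop above can be replaced by a single call:
# equivalent: fetch the content of all matching files in one call
ds.get(nifti_paths)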
All DataLad functionality is available through its Python API. Here it lets us extend the previous example: we check the actual content size beforehand and decide whether or not to download each file.
import sys
import datalad.api as dl
dataset_wd = '/home/hackathon23/CopyOf-BMA-2019'
# create dataset object
ds = dl.Dataset(dataset_wd)
# check that the working dir contains a dataset clone
if not ds.is_installed():
    print('No dataset in', dataset_wd)
    sys.exit(1)
# check the dataset's status, including the size of annexed content
st = ds.status(annex='all')
# we don't want to download files larger than 200 MB
size_limit = 200 * 1024 * 1024
# get the paths of the smaller NIfTI files
# ('bytesize' may be absent for non-annexed files, hence the .get())
nifti_paths = [e['path'] for e in st
               if e['path'].endswith('.nii.gz')
               and e.get('bytesize', 0) < size_limit]
# actually download the data
for p in nifti_paths:
    ds.get(p)
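Before settling on a limit, it can be useful to estimate how much data a full download would transfer; here is a small sketch based on the same status records (assuming 'bytesize' is reported for the annexed files):
# total size of the annexed NIfTI content reported by status()
total_bytes = sum(e.get('bytesize', 0)
                  for e in st if e['path'].endswith('.nii.gz'))
print('Total NIfTI content: %.1f MB' % (total_bytes / 2**20))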
3. Discarding unneeded data
Once the data has been successfully processed, it may be worth reclaiming the disk space, since datasets tend to be large!
All locally downloaded annexed content of a dataset can be dropped as follows (the files remain retrievable from the server).
- With the CLI:
cd /home/hackathon23/CopyOf-BMA-2019
datalad drop . -r
or
datalad drop -d /home/hackathon23/CopyOf-BMA-2019 -r
- With Python:
import datalad.api as dl
dataset_wd = '/home/hackathon23/CopyOf-BMA-2019'
dl.drop(dataset=dataset_wd, recursive=True)
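drop() also accepts individual paths, so disk space can be reclaimed selectively rather than for the whole dataset; below is a sketch mirroring the earlier glob example:
import os
import glob
import datalad.api as dl
dataset_wd = '/home/hackathon23/CopyOf-BMA-2019'
ds = dl.Dataset(dataset_wd)
# drop only the previously fetched NIfTI content, keep everything else
nifti_paths = glob.glob(
    os.path.join(dataset_wd, '**/*.nii.gz'),
    recursive=True)
ds.drop(nifti_paths)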
References
- DataLad installation doc: https://handbook.datalad.org/en/latest/intro/installation.html
- DataLad Python API:
- Brain/MINDS Gin server help page: https://datasets.brainminds.jp/G-Node/info/wiki