ReproMan

ReproMan aims to simplify the creation and management of computing environments in neuroimaging. While it concentrates on neuroimaging use cases, it is by no means limited to this field of science, and its tools will find utility in other fields as well.

Status

ReproMan is under rapid development. While the code base is still growing, the focus is increasingly shifting towards robust and safe operation with a sensible API. There has been no major public release yet, as organization and configuration are still subject to considerable reorganization and standardization.

See CONTRIBUTING.md if you are interested in internals and/or contributing to the project.

Installation

ReproMan requires Python 3 (>= 3.8).

Linux and OSX (Windows support is still TODO) - via pip

By default, installation via pip (pip install reproman) installs the core functionality of reproman, allowing for managing datasets etc. Additional installation schemes are available, so you can request an enhanced installation via pip install 'reproman[SCHEME]', where SCHEME could be

  • tests to also install the dependencies used by reproman's unit-test battery
  • full to install all possible dependencies, e.g. DataLad

Installation through pip does not provide some external dependencies (e.g. docker, singularity, etc.); for those, please refer to the next section.
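As a minimal sketch of the schemes above, the helper below builds the requirement string that pip expects. The function name reproman_requirement is hypothetical and exists only for illustration; tests and full are the two schemes listed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of ReproMan): print the pip requirement
# for an optional installation scheme such as "tests" or "full".
reproman_requirement() {
    scheme=${1:-}
    if [ -n "$scheme" ]; then
        printf 'reproman[%s]\n' "$scheme"
    else
        printf 'reproman\n'
    fi
}

# Show the commands that would be run; the quotes keep the shell from
# treating the square brackets as a glob pattern.
echo "pip install '$(reproman_requirement)'"
echo "pip install '$(reproman_requirement full)'"
```

To actually install, pass the string to pip, e.g. pip install "$(reproman_requirement full)".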

Debian-based systems

On Debian-based systems we recommend enabling NeuroDebian, from which we will soon provide recent releases of ReproMan. We will also provide backports of all necessary packages from that repository.

Dependencies

Python 3.8+ with header files, which may be needed to build some extensions that do not ship wheels. They are provided by python3-dev on Debian-based systems or python3-devel on Red Hat systems.
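One way to check whether the header files are already present is sketched below. It assumes python3 is on the PATH and looks for Python.h in the include directory reported by the standard sysconfig module; the function name have_python_headers is hypothetical.

```shell
#!/bin/sh
# Print "yes" if Python.h is present in the interpreter's include
# directory, "no" otherwise (also "no" if python3 itself is missing).
have_python_headers() {
    if python3 -c '
import os, sys, sysconfig
include_dir = sysconfig.get_paths()["include"]
sys.exit(0 if os.path.isfile(os.path.join(include_dir, "Python.h")) else 1)
' 2>/dev/null; then
        echo yes
    else
        echo no
    fi
}

echo "Python header files found: $(have_python_headers)"
```

If the answer is no and pip later fails while building an extension, installing the development package named above is the likely fix.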

Our setup.py and the corresponding packaging describe all necessary Python dependencies. On Debian-based systems we recommend enabling NeuroDebian, since we use it to provide backports of recent fixed external modules we depend upon. Additionally, if you would like to develop and run our test battery, see CONTRIBUTING.md regarding additional dependencies.

A typical workflow for reproman run

This example is heavily based on the "Typical workflow" example created for ///repronim/containers, which we refer you to for more about YODA principles etc. In this reproman example we will pursue exactly the same goal -- running MRIQC on a sample dataset -- but this time utilizing ReproMan's ability to run computation remotely. DataLad and ///repronim/containers will still be used for data and container logistics, while reproman will establish a small HTCondor cluster in the AWS cloud, run the analysis, and fetch the results.

Step 1: Create the HTCondor AWS EC2 cluster

If this is the first time you are using ReproMan to interact with AWS cloud services, you should first provide ReproMan with the secret credentials it needs to interact with AWS. To do so, add the following to its configuration file (~/.config/reproman/reproman.cfg on Linux, ~/Library/Application Support/reproman/reproman.cfg on OSX):

[aws]
access_key_id = ...
secret_access_key = ...

Replace each ... with your own value. If reproman fails to find this information, the error message Unable to locate credentials will appear.

Disclaimer/Warning: Never share or post those secrets publicly.
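Before creating any cloud resources, it can help to confirm that the configuration file really contains both keys. The sketch below is one way to do that; check_aws_config is a hypothetical helper, the path is the Linux location given above, and the check only confirms that each key has a non-empty value. It does not verify that the keys sit in the [aws] section or that AWS accepts them.

```shell
#!/bin/sh
# Hypothetical helper: report whether a ReproMan configuration file
# defines non-empty values for both AWS keys.
check_aws_config() {
    cfg=$1
    if [ ! -f "$cfg" ]; then
        echo "missing: $cfg"
        return 1
    fi
    for key in access_key_id secret_access_key; do
        # Match "key = value" where value has at least one non-space character.
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*[^[:space:]]" "$cfg"; then
            echo "no value for ${key} in $cfg"
            return 1
        fi
    done
    echo "ok: $cfg"
}

# Linux location of the configuration file; adjust the path on OSX.
check_aws_config "${HOME}/.config/reproman/reproman.cfg" \
    || echo "fix the configuration before running reproman create"
```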

Run the following (this needs to be done only once; it makes the resource available for reproman login or reproman run):

reproman create aws-hpc2 -t aws-condor -b size=2 -b instance_type=t2.medium

This creates a new ReproMan resource: 2 AWS EC2 instances with HTCondor installed (we use NITRC-CE instances).

Disclaimer/Warning: It is important to monitor your cloud resources in the cloud provider dashboard(s) to ensure there are no runaway instances etc., and so avoid incurring heavy costs for cloud services.

Step 2: Create analysis DataLad dataset and run computation on aws-hpc2

The following script is an exact replica of the one from ///repronim/containers, except that the datalad containers-run command, which fetches data locally and runs the computation locally and serially, is replaced with reproman run, which publishes the dataset (without data) to the remote resource, fetches the data, runs the computation via HTCondor in parallel across 2 nodes, and then fetches the results back:

#!/bin/sh
(  # so it could be just copy pasted or used as a script
PS4='> '; set -xeu  # to see what we are doing and exit upon error
# Work in some temporary directory
cd $(mktemp -d ${TMPDIR:-/tmp}/repro-XXXXXXX)
# Create a dataset to contain mriqc output
datalad create -d ds000003-qc -c text2git
cd ds000003-qc
# Install our containers collection:
datalad install -d . ///repronim/containers
# (optionally) Freeze container of interest to the specific version desired
# to facilitate reproducibility of some older results
datalad run -m "Downgrade/Freeze mriqc container version" \
    containers/scripts/freeze_versions bids-mriqc=0.16.0
# Install input data:
datalad install -d . -s https://github.com/ReproNim/ds000003-demo sourcedata
# Setup git to ignore workdir to be used by pipelines
echo "workdir/" > .gitignore && datalad save -m "Ignore workdir" .gitignore
# Execute desired preprocessing in parallel across two subjects
# on remote AWS EC2 cluster, creating a provenance record
# in git history containing all condor submission scripts and logs, and
# fetching them locally
reproman run -r aws-hpc2 \
   --sub condor --orc datalad-pair \
   --jp "container=containers/bids-mriqc" --bp subj=02,13 --follow \
   --input 'sourcedata/sub-{p[subj]}' \
   --output . \
   '{inputs}' . participant group -w workdir --participant_label '{p[subj]}'
)

The "ReproMan: Execute" section of the documentation provides more information on the underlying principles behind the reproman run command.
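To make the placeholders in the reproman run call above more concrete, the sketch below mimics in plain shell how the batch parameter subj=02,13 could expand into one command line per subject. It is an illustration only, under the assumption that each value of subj is substituted for '{p[subj]}' and that '{inputs}' becomes the matching input path. It is not how ReproMan implements the expansion, and expand_jobs is a hypothetical name.

```shell
#!/bin/sh
# Illustration only: print one container command line per subject label,
# mirroring the argument template passed to reproman run above.
expand_jobs() {
    # $1 is a comma-separated list of subject labels, e.g. "02,13"
    old_ifs=$IFS
    IFS=,
    for subj in $1; do
        inputs="sourcedata/sub-$subj"
        echo "$inputs . participant group -w workdir --participant_label $subj"
    done
    IFS=$old_ifs
}

expand_jobs 02,13
```

Under these assumptions the two printed lines correspond to the two jobs that run in parallel, one per subject.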

Step 3: Remove resource

Once everything is computed and fetched, and you are satisfied with the results, use reproman delete aws-hpc2 to terminate the remote cluster in AWS so that it does not incur unnecessary charges.

License

MIT/Expat

Disclaimer

ReproMan is in a beta stage -- the majority of the functionality is usable, but documentation and API enhancements are still a work in progress. Please do not be shy about filing an issue or a pull request. See CONTRIBUTING.md for guidance.