How to execute workflows with BPS at CC-IN2P3

The pipetask command is limited to multiprocessing on a single node, and therefore does not allow for sufficient parallelization when a large number of tasks must be executed. BPS (Batch Production Service) is the LSST framework dedicated to multi-node processing. It can be used with several Workflow Management Systems, such as Parsl, which provides parallelism of Python code on distributed computing resources. Using BPS together with its Parsl plugin, LSST pipelines can easily be executed at large scale on the CC-IN2P3 Slurm farm.

Submitting a BPS run

General setup

First, create the following directories that will be used to store all files related to the BPS run:

export BPS_RUNDIR=/sps/lsst/users/${USER}/bps
mkdir -p ${BPS_RUNDIR}/bps_parsl_logs
mkdir -p ${BPS_RUNDIR}/submit

Place your BPS configuration file in ${BPS_RUNDIR}/bps_config.yaml (see How to configure your BPS run). Then change to your BPS directory:

cd ${BPS_RUNDIR}

Source the environment:

source /pbs/throng/lsst/software/bps-parsl/prod/setup.sh

And submit your run:

bps_submit.sh --interactive bps_config.yaml

Hint

The --interactive option runs BPS interactively. Without it, BPS runs inside a Slurm job, which can be preferable for long runs, for example.
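For example, the same run can be submitted inside a Slurm job simply by omitting the option:

```shell
# Submit the BPS run as a Slurm job rather than interactively
bps_submit.sh bps_config.yaml
```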

More details on using the submission script can be obtained with bps_submit.sh --help.

How to configure your BPS run

The processing campaign to be launched and the configuration to use must be described in a YAML file. An example of such a file, with explanatory comments, is provided here: /pbs/throng/lsst/software/bps-parsl/prod/examples/ci_hsc_isr.yaml. In particular, the most important items to be modified are:

  • release: version of the stack to be used for the processing jobs, for instance w_2024_09 (see How to use the LSST Science Pipelines)

  • pipelineYaml: pipeline to be executed

  • project, campaign and payloadName: description of the processing, used for the creation of directories and collections

  • butlerConfig: the butler repository

  • inCollection: collection where input data can be found

  • dataQuery: input data selection

More details can be found in the BPS documentation.
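Putting the items above together, a minimal configuration could look like the following sketch. All values below are illustrative placeholders and must be adapted to your campaign; only the release value and the key names come from this page:

```yaml
# Illustrative sketch only: adapt every value to your own campaign.
release: w_2024_09                       # stack version for the processing jobs
pipelineYaml: /path/to/my_pipeline.yaml  # pipeline to execute (hypothetical path)
project: my_project                      # used to build directories and collections
campaign: my_campaign
payloadName: my_payload
butlerConfig: /path/to/butler.yaml       # butler repository (hypothetical path)
inCollection: my/input/collection        # input collection (hypothetical)
dataQuery: "instrument = 'HSC'"          # input data selection (hypothetical)
```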

Accessing information about the run

Once the run has been submitted, the BPS and Parsl logs are written in the ${BPS_RUNDIR}/bps_parsl_logs directory. The ${BPS_RUNDIR}/submit directory contains one subdirectory per run, holding the task logs. The usual Slurm commands can also be used to obtain information on the jobs. If needed, the ${BPS_RUNDIR}/parsl_runinfo directory contains more detailed information provided by Parsl.
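For instance, assuming the directories created in the general setup, the run subdirectories and the state of the associated jobs can be inspected with:

```shell
ls ${BPS_RUNDIR}/submit     # one subdirectory per run, containing the task logs
squeue -u $USER             # state of the Slurm jobs submitted for the run
```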

Advanced configuration

Task resources specification

CPU cores and memory requirements of each task can be specified in the BPS YAML file, either directly or by including a dedicated file:

includeConfigs:
  - requestMemory.yaml

The memory needed to run each task can be specified using the following syntax:

pipetask:
  task1:
    requestMemory: 4096
  task2:
    requestMemory: 8192

The Parsl plugin for BPS will then ensure that each task runs in a job that provides enough memory. Note that in the case of Clustering, the maximum memory of all tasks in the cluster is taken.

Hint

The CC-IN2P3 configuration for the BPS Parsl plugin currently does not support multicore tasks. It is generally advised to run multiple parallel single-core tasks.

Clustering

The BPS clustering feature allows several tasks to be gathered into a single pipetask command execution. This is particularly useful to reduce the number of tasks and increase scalability, especially when their execution time is short.

Clustering can be enabled in the BPS YAML file, either directly or by including a dedicated file:

includeConfigs:
  - clustering.yaml

The general syntax is the following:

clusterAlgorithm: lsst.ctrl.bps.quantum_clustering_funcs.dimension_clustering
cluster:
    clusterLabel1:
        pipetasks: label1, label2     # comma-separated list of labels
        dimensions: dim1, dim2        # comma-separated list of dimensions
        equalDimensions: dim1:dim1a   # e.g., visit:exposure

Multiple clusters can be defined in a single file. Similar tasks running on multiple datasets can be clustered together. For example, one can gather the isr tasks over all detectors for each visit:

clusterAlgorithm: lsst.ctrl.bps.quantum_clustering_funcs.dimension_clustering
cluster:
    visit_isr:
        pipetasks: isr
        dimensions: visit
        equalDimensions: visit:exposure

Different tasks can also be clustered together, especially when they depend on each other and are run sequentially in the workflow. For example, one can gather the isr task with its downstream tasks, for each detector:

clusterAlgorithm: lsst.ctrl.bps.quantum_clustering_funcs.dimension_clustering
cluster:
    visit_step1:
        pipetasks: isr,characterizeImage,calibrate,writePreSourceTable,transformPreSourceTable
        dimensions: visit,detector
        equalDimensions: visit:exposure

Hint

It is recommended to keep BPS runs below ~100,000 tasks. If needed, in addition to clustering, a campaign can be split into multiple smaller runs using the dataQuery.
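As a purely illustrative example, a campaign could be split into two smaller runs by restricting the selection in each run's dataQuery. The dimension names and values below are hypothetical and depend on your data:

```yaml
# Run 1 (hypothetical selection):
dataQuery: "instrument = 'HSC' AND exposure <= 1000"

# Run 2 (hypothetical selection):
dataQuery: "instrument = 'HSC' AND exposure > 1000"
```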

More details on clustering can be found in the dedicated section of the BPS documentation.

How to configure your Parsl jobs

By default, Parsl will launch four types of jobs: “small”, “medium”, “large” and “xlarge”. Each job type provides a different amount of memory, and each task will run in the job type closest to its memory requirement (see Task resources specification). If needed, these jobs can be configured in the site.ccin2p3 section of the BPS YAML file; see the ccin2p3 class documentation for more details.
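As a sketch only, such a section has the following shape. The option names are hypothetical placeholders; the real keys and their defaults must be taken from the ccin2p3 class documentation:

```yaml
site:
  ccin2p3:
    # Hypothetical option names, for illustration only: consult the
    # ccin2p3 class documentation for the actual keys and defaults.
    walltime: "12:00:00"
    small_memory: 4.0
```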

Hint

In order to optimize the usage of computing resources, it is important to submit jobs with memory allocations as close as possible to the tasks' requirements.

Example: ci_hsc processing

In this tutorial we will use the ci_hsc scripts to process HSC test data with the isr task using BPS. The required scripts are in the ci_hsc_gen3 repository, and the corresponding input set of data is provided by the testdata_ci_hsc repository.

Create a directory dedicated to this tutorial, and clone the required repositories:

export BPS_RUNDIR=/sps/lsst/users/${USER}/test_ci_hsc_bps
mkdir -p ${BPS_RUNDIR}/bps_parsl_logs
mkdir -p ${BPS_RUNDIR}/submit
cd ${BPS_RUNDIR}
git clone git@github.com:lsst/ci_hsc_gen3.git
git clone git@github.com:lsst/testdata_ci_hsc.git

After a successful clone, the testdata_ci_hsc directory should contain 2200 files and 1400 directories, for a total volume of 8 GB. Set up the lsst_distrib version v27.0.0 and the EUPS packages testdata_ci_hsc and ci_hsc_gen3 cloned above:

source /cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/v27.0.0/loadLSST.bash
setup lsst_distrib
setup --root testdata_ci_hsc
setup --keep --root ci_hsc_gen3

Prepare a PostgreSQL database schema for this butler repo:

psql --host=ccpglsstdev.in2p3.fr --port=6553 --username=$(whoami) --dbname=$(whoami)
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE SCHEMA ci_hsc_bps;

Create a butler seed file butler-seed.yaml with the following contents (replace <login> with your username):

registry:
   db: postgresql://ccpglsstdev.in2p3.fr:6553/<login>
   namespace: ci_hsc_bps
datastore:
   name: ci_hsc_bps

Create the butler repository and ingest data into it:

cd ${BPS_RUNDIR}/ci_hsc_gen3
scons --butler-config=${BPS_RUNDIR}/butler-seed.yaml ingest

The butler repository is created in ${BPS_RUNDIR}/ci_hsc_gen3/DATA, and ingestion should take a few minutes. Copy the BPS configuration file from /pbs/throng/lsst/software/bps-parsl/prod/examples/ci_hsc_isr.yaml, then set up the environment and launch bps_submit.sh:

cd ${BPS_RUNDIR}
cp /pbs/throng/lsst/software/bps-parsl/prod/examples/ci_hsc_isr.yaml .
source /pbs/throng/lsst/software/bps-parsl/prod/setup.sh
bps_submit.sh --interactive ci_hsc_isr.yaml

bps_submit.sh will launch BPS, which will generate a Quantum Graph and execute the corresponding tasks on the Slurm farm using Parsl. You may see the Slurm jobs submitted on your behalf via squeue -u $USER. Execution should complete after a few minutes or more, depending on the availability of computing resources. Messages produced by bps_submit.sh are of the form:

2024-10-11T11:46:16+02:00 bps_runner.sh: execution of workflow "/sps/lsst/users/login/test_ci_hsc_bps/ci_hsc_isr.yaml" took 00:02:22
2024-10-11T11:46:16+02:00 bps_runner.sh: exit code: 0

The log file of the bps_submit.sh execution is under the ${BPS_RUNDIR}/bps_parsl_logs directory. Details of the execution of pipetasks, including task logs, can be found under ${BPS_RUNDIR}/submit/run/ci_hsc, as specified in the workflow configuration file (submitPath entry).

After running this tutorial, the butler repository can be deleted, as well as the database registry.
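For example, assuming the paths and schema used above, the cleanup could be done as follows:

```shell
# Remove the butler repository created by the scons ingest step
rm -rf ${BPS_RUNDIR}/ci_hsc_gen3/DATA

# Drop the PostgreSQL schema that held the registry
psql --host=ccpglsstdev.in2p3.fr --port=6553 --username=$(whoami) --dbname=$(whoami) \
     -c "DROP SCHEMA ci_hsc_bps CASCADE;"
```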