How to create your own data butler

The LSST Science Pipelines include a module known as the Butler, which provides an abstracted data access interface. Other components of the pipelines use it to read and write data without having to know the details of file formats or locations. It consists of the Python module lsst.daf.butler and the butler command.

This tutorial gives you the information you need to start using the butler to store and access data.

Overview of a butler data repository

A butler repository includes two major components:

data store

a location where the datasets of your repository are physically located (e.g. a directory on a POSIX file system, an S3 bucket, etc.)

registry database

a database which records information about the contents of the data store.

Currently, you can use either a SQLite or a PostgreSQL database as a butler registry. We recommend a SQLite registry database for getting familiar with the butler because it is easier to use. However, it may not scale well for large production campaigns. For those, we recommend PostgreSQL, which requires a one-time setup phase.

Using a SQLite registry database

Step 1: Preparing your repository location

For this tutorial, your butler repository will be located under the directory /sps/lsst/users/$(whoami)/my-butler-repo, so you need to ensure that directory exists:

$ export REPO=/sps/lsst/users/$(whoami)/my-butler-repo

$ mkdir -p $REPO

Step 2: Creating an empty butler repository

In this step you use the butler command to create an empty butler repository with a SQLite registry database (the default). For this tutorial we use a specific weekly release of the LSST Science Pipelines, but you can use any reasonably recent release (see How to use the LSST Science Pipelines for details):

$ # Set up the desired release of the LSST Science Pipelines.
$ # In this example we use weekly 'w_2022_27' but you are encouraged to use
$ # a more recent release.

$ source /cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2022_27/loadLSST.bash
$ setup lsst_distrib

The command to initialize the butler repository is:

$ butler create $REPO

After successful execution of this command, you will find two files under the directory $REPO: butler.yaml, which contains the configuration of this repository, and gen3.sqlite3, which contains the initialized registry database.
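A quick way to confirm that the repository was initialized is to check for those two files. The Python sketch below does only that. The function name missing_repo_files is hypothetical and is not part of the butler; the expected file names are the ones listed above for a SQLite-backed repository.

```python
from pathlib import Path


def missing_repo_files(repo, expected=("butler.yaml", "gen3.sqlite3")):
    """Return the names in `expected` that are not regular files under `repo`.

    Hypothetical helper for illustration only; it inspects the file system
    and does not call the butler.
    """
    root = Path(repo)
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means both files are present. For example, missing_repo_files(os.environ["REPO"]) checks the directory exported in Step 1.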

Step 3: Populating your repository

Once the butler repository is created, the commands for populating it are the same regardless of whether you use PostgreSQL or SQLite to host the registry database. To populate your repository, run:

$ butler register-instrument $REPO 'lsst.obs.lsst.LsstCam'
$ butler ingest-raws --transfer symlink $REPO /path/to/raw/data

Using a PostgreSQL registry database

To use a PostgreSQL database for hosting the butler registry database you need a one-time process to set up your database environment.

Step 1: Getting your PostgreSQL account

For storing your butler registry databases, as a member of the LSST-France community you can use the PostgreSQL server

ccpglsstdev.in2p3.fr

To use this server you need an individual account. If you don’t have one yet, you can request one by contacting the help desk (see How to Get Help for details) and providing all relevant details. For your convenience, we have prepared the template below:

Hint

Adapt the text below for your particular case when requesting the creation of your individual PostgreSQL account via the help desk:

My name is <your name> and my account for both CC-IN2P3’s login and compute farms is <your account>. I am a member of the group lsst and would like to kindly ask you to create an account for my individual usage on the PostgreSQL database server ccpglsstdev.in2p3.fr.

As a result, an account for your individual usage will be created and you will be provided with your credentials: the PostgreSQL user name and password on that particular server. The PostgreSQL user name is typically the same as your UNIX user name (i.e. the result of executing the command whoami) and the password is specific to this server and is unrelated to the password you use to connect to the Login Farm.

In order for the butler to find your database credentials, you need to store them in a file. The default path of that file is $HOME/.lsst/db-auth.yaml. For instance, the contents of that file for the hypothetical user messier would be:

$ cat $HOME/.lsst/db-auth.yaml
- url: "postgresql://ccpglsstdev.in2p3.fr:6553"
  username: "messier"
  password: "very-secret-postgresql-password"

This file tells the butler that it must use the user name messier and the password very-secret-postgresql-password to connect to any database hosted on the server ccpglsstdev.in2p3.fr:6553.

Important

Your database credentials file must be protected, so make sure you do:

$ chmod 0600 $HOME/.lsst/db-auth.yaml
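If you prefer to create the credentials file from a script, the sketch below writes a single entry in the format shown above and restricts the file to its owner. The function name write_db_auth is hypothetical. The sketch assumes the values contain no double quotes, since it does no YAML escaping.

```python
import os
from pathlib import Path


def write_db_auth(path, url, username, password):
    """Write a one-entry credentials file readable and writable only by its owner.

    Hypothetical helper; assumes the values contain no double quotes.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    content = (
        f'- url: "{url}"\n'
        f'  username: "{username}"\n'
        f'  password: "{password}"\n'
    )
    # Request mode 0600 at creation time so a new file is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # Enforce the mode in case the file already existed with looser permissions.
    os.chmod(path, 0o600)
    return path
```

The final os.chmod call has the same effect as the chmod 0600 command above.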

Note

If you need to store your PostgreSQL credentials in another location, set the environment variable LSST_DB_AUTH to the path of the credentials file.
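The lookup order described here can be summarized in a few lines of Python. This is only an illustration of the rule stated in this tutorial (LSST_DB_AUTH if set, otherwise $HOME/.lsst/db-auth.yaml), not the butler's own implementation. The function name db_auth_path is hypothetical.

```python
import os
from pathlib import Path


def db_auth_path(environ=None):
    """Return the credentials file path implied by the given environment.

    Hypothetical helper mirroring the rule described in this tutorial.
    """
    env = os.environ if environ is None else environ
    override = env.get("LSST_DB_AUTH")
    if override:
        # An explicit location takes precedence over the default.
        return Path(override)
    home = env.get("HOME") or str(Path.home())
    return Path(home) / ".lsst" / "db-auth.yaml"
```

Calling db_auth_path() with no argument uses the current process environment, which is handy for checking which file a job will pick up.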

Step 2: Preparing your registry database

You can use your individual PostgreSQL database to host as many butler registry databases as you need. To do so, you must create one PostgreSQL schema for each butler registry database you intend to use. From a host in the Login Farm, connect to your database with the command:

$ psql --host=ccpglsstdev.in2p3.fr --port=6553 --username=$(whoami) --dbname=$(whoami)

You will be prompted for your PostgreSQL password. Once authenticated the psql command will show a prompt for you to type in the command below:

CREATE EXTENSION IF NOT EXISTS btree_gist;

This command activates the extension btree_gist in your individual database which is required for the butler. It only needs to be enabled once for your database account.

For each butler repository you must create a specific PostgreSQL schema with the command:

CREATE SCHEMA my_first_butler_repo;

This command creates a PostgreSQL schema named my_first_butler_repo which we will instruct the butler to use in the next step. Type \q at the psql command prompt to quit.

Important

To remove a schema you no longer need use the SQL command below:

-- This command deletes ALL the contents of the schema named 'my_first_butler_repo'
DROP SCHEMA my_first_butler_repo CASCADE;

⚠️ Please note that removing a schema deletes all its contents, so if you already have a butler registry in that schema it will be destroyed. Proceed with caution.

Step 3: Configuring your repository

Before you can create a new butler repository, you need to provide an initial configuration via a seed configuration file. In that file you tell the butler which registry database you want it to use for this particular repository.

In this tutorial, you will create a butler repository under the directory /sps/lsst/users/$(whoami)/my-butler-repo. Create that directory and populate the file butler-seed.yaml as shown below:

$ export REPO=/sps/lsst/users/$(whoami)/my-butler-repo

$ mkdir -p $REPO

$ cat $REPO/butler-seed.yaml
registry:
    db: "postgresql://ccpglsstdev.in2p3.fr:6553/messier"
    namespace: "my_first_butler_repo"

The value of the key db above specifies the endpoint to use. That endpoint includes the hostname and port number of the PostgreSQL server and the name of the database, which is identical to your user name (the result of executing the command whoami; messier in this example). The value of the key namespace must match the name of the PostgreSQL schema you created in Step 2 above. You can choose any schema name meaningful to you. A namespace can only contain the registry database of a single butler repository, but your individual database can host as many schemas, and therefore repositories, as you need.
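To make the relationship between the server, the database name and the schema explicit, the sketch below assembles the text of a seed configuration from its parts. The function name seed_config is hypothetical, and the layout simply reproduces the butler-seed.yaml example above.

```python
def seed_config(host, port, database, namespace):
    """Return the text of a seed configuration for a PostgreSQL registry.

    Hypothetical helper; `database` is your PostgreSQL database name and
    `namespace` is the schema created for this repository.
    """
    return (
        "registry:\n"
        f'    db: "postgresql://{host}:{port}/{database}"\n'
        f'    namespace: "{namespace}"\n'
    )
```

For the user messier, seed_config("ccpglsstdev.in2p3.fr", 6553, "messier", "my_first_butler_repo") returns exactly the file contents shown above, which you can then write to $REPO/butler-seed.yaml.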

Step 4: Creating an empty butler repository

In this step you use the butler command to create an empty butler repository. For this tutorial we use a specific weekly release of the LSST Science Pipelines, but you can use any reasonably recent release (see How to use the LSST Science Pipelines for details):

$ source /cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2022_27/loadLSST.bash
$ setup lsst_distrib

This butler repository will be located under the directory pointed to by the variable REPO initialized above:

$ butler create --seed-config $REPO/butler-seed.yaml --override $REPO

As a result, the butler command creates some files and directories under the directory $REPO. In particular, the file butler.yaml in that directory contains all the information the butler needs to use your newly created repository. You can use the option --outfile to specify an alternative name and location for that file (see the documentation of the butler create command for details).

Step 5: Populating your repository

Now you can start populating your butler repository. First, you register an instrument with the command:

$ butler register-instrument $REPO 'lsst.obs.lsst.LsstCam'

lsst.obs.lsst.LsstCam is the identifier of the instrument corresponding to the full Rubin-LSST focal plane. The LSST science pipelines include identifiers for other instruments such as lsst.obs.lsst.LsstComCam (for ComCam data) and lsst.obs.lsst.Latiss (for AuxTel LATISS data). All the instrument identifiers currently known to the pipelines can be found in this file.

To ingest raw images use the butler ingest-raws command as shown below:

$ butler ingest-raws --transfer symlink --processes 4 $REPO /path/to/raw/data
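If you drive these two steps from a Python script, it helps to build each command as an argument list rather than a shell string. The sketch below only assembles the two command lines used in this step. The function name populate_commands is hypothetical, and only options that appear in this tutorial are used.

```python
def populate_commands(repo, raw_dir, instrument="lsst.obs.lsst.LsstCam",
                      transfer="symlink", processes=4):
    """Return the argument lists for registering an instrument and ingesting raws.

    Hypothetical helper; it builds the commands but does not run them.
    """
    return [
        ["butler", "register-instrument", repo, instrument],
        ["butler", "ingest-raws", "--transfer", transfer,
         "--processes", str(processes), repo, raw_dir],
    ]
```

Each list can be passed to subprocess.run(argv, check=True) from an environment where the LSST Science Pipelines have been set up as in Step 4.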

Step 6 (optional): Sharing your butler repository with your collaborators

For working with your team, you may need to share a butler repository with one or more of your collaborators. Each member of your team must have their own account at CC-IN2P3 and their individual account in the PostgreSQL server ccpglsstdev.in2p3.fr (see Step 1: Getting your PostgreSQL account above).

In this section we present the one-time procedure you need to follow to authorize your collaborators to use your butler registry database. You must follow this procedure for each butler repository you want to share and for every collaborator you want to share it with.

If you want to authorize user laplace to access your butler repository named my_first_butler_repo, proceed as follows.

Connect to the database server:

# Connect to PostgreSQL server using your credentials
$ psql --host=ccpglsstdev.in2p3.fr --port=6553 --username=$(whoami) --dbname=$(whoami)

You will be prompted for your PostgreSQL password. Once authenticated the psql command will show a prompt for you to type in the SQL commands below, including the ; at the end of each line:

GRANT ALL PRIVILEGES ON SCHEMA my_first_butler_repo TO laplace;

GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA my_first_butler_repo TO laplace;
GRANT ALL PRIVILEGES ON ALL TABLES    IN SCHEMA my_first_butler_repo TO laplace;

These commands grant user laplace read and write privileges to your butler registry which is contained in the PostgreSQL schema named my_first_butler_repo. Type \q at the psql command prompt to quit.
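Since the procedure must be repeated for every collaborator, you may find it convenient to generate the statements and paste them at the psql prompt. The sketch below produces the same three GRANT statements for each user. The function name grant_statements is hypothetical, and the name check is a deliberately strict assumption (letters, digits and underscores only) to avoid pasting malformed SQL.

```python
import re

# Strict pattern for illustration: a letter or underscore, then letters,
# digits or underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def grant_statements(schema, users):
    """Return the GRANT statements that share `schema` with each of `users`.

    Hypothetical helper; it only builds SQL text and does not connect
    to the database.
    """
    for name in (schema, *users):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unexpected characters in name: {name!r}")
    statements = []
    for user in users:
        statements += [
            f"GRANT ALL PRIVILEGES ON SCHEMA {schema} TO {user};",
            f"GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA {schema} TO {user};",
            f"GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA {schema} TO {user};",
        ]
    return statements
```

For example, print("\n".join(grant_statements("my_first_butler_repo", ["laplace"]))) prints the statements shown above.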

Now, user laplace can access your butler repository. Please note that you have to share with laplace the YAML file which contains the description of your repository. That file is typically named butler.yaml and is typically located at the repository’s top directory, e.g. the location pointed to by the environment variable $REPO in Step 3: Configuring your repository above.

Important

User laplace must have their own database credentials file (e.g. ~laplace/.lsst/db-auth.yaml) as shown in Step 1: Getting your PostgreSQL account. The butler needs that information to authenticate to the database server.

⚠️ You must not share your PostgreSQL credentials with anyone: laplace will use their credentials for accessing your butler and you will continue using your own credentials.

Acknowledgements

An initial version of the material used for writing this tutorial was prepared by D. Boutigny.