How to create your own data butler
The LSST science pipelines include a module known as the Butler which provides an abstracted data access interface. It is used by other components of the pipelines to read and write data without having to know the details of file formats or locations. It is composed of a Python module lsst.daf.butler and the butler command.
In this tutorial you will find the information you need to start using the butler to store and access data.
Overview of a butler data repository
A butler repository includes two major components:
- data store: a location where the datasets of your repository are physically stored (e.g. a directory on a POSIX file system, an S3 bucket, etc.)
- registry database: a database which records information about the contents of the data store.
Currently, you can use either a SQLite database or a PostgreSQL database as a butler registry. We recommend a SQLite registry database for getting familiar with the butler because it is easier to set up. Note, however, that it may not scale well for large production campaigns. For those, we recommend PostgreSQL, which requires a one-time setup phase.
Using a SQLite registry database
Step 1: Preparing your repository location
For this tutorial, your butler repository will be located under the directory /sps/lsst/users/$(whoami)/my-butler-repo, so you need to ensure that directory exists:
$ export REPO=/sps/lsst/users/$(whoami)/my-butler-repo
$ mkdir -p $REPO
Step 2: Creating an empty butler repository
In this step you create an empty butler repository with a SQLite registry database (the default), using the butler command. For this tutorial we use a specific weekly release of the LSST science pipelines, but you can use any reasonably recent release (see How to use the LSST Science Pipelines for details):
$ # Set up the desired release of the LSST Science Pipelines.
$ # In this example we use weekly 'w_2022_27' but you are encouraged to use
$ # a more recent release.
$ source /cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2022_27/loadLSST.bash
$ setup lsst_distrib
The command to initialize the butler repository is:
$ butler create $REPO
After successful execution of this command, you will find under the directory $REPO a file named butler.yaml, which contains configuration information for this repository, and a file named gen3.sqlite3, which contains the initialized registry database.
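As a quick sanity check of the result described above, here is a minimal Python sketch that uses only the standard library. It verifies that both files exist and lists the tables of the SQLite registry, which it opens read-only. The function name inspect_sqlite_repo is hypothetical and not part of the butler, and the tables you see depend on the release of the pipelines you used.

```python
import sqlite3
from pathlib import Path


def inspect_sqlite_repo(repo):
    """Return (missing_files, table_names) for a SQLite-backed butler repository.

    Hypothetical helper for illustration; it is not part of lsst.daf.butler.
    """
    repo = Path(repo)
    expected = ("butler.yaml", "gen3.sqlite3")
    missing = [name for name in expected if not (repo / name).is_file()]
    tables = []
    if "gen3.sqlite3" not in missing:
        # Open the registry read-only so that inspecting it cannot modify it.
        uri = (repo / "gen3.sqlite3").resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        try:
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
            tables = [row[0] for row in rows]
        finally:
            conn.close()
    return missing, tables
```

Calling inspect_sqlite_repo with the value of $REPO should return an empty list of missing files and a non-empty list of tables for a freshly created repository.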
Step 3: Populating your repository
Once the butler repository is created, the commands for populating it are similar regardless of whether you use PostgreSQL or SQLite to host the registry database. To populate your repository, run:
$ butler register-instrument $REPO 'lsst.obs.lsst.LsstCam'
$ butler ingest-raws --transfer symlink $REPO /path/to/raw/data
Using a PostgreSQL registry database
To use a PostgreSQL database for hosting the butler registry database you need a one-time process to set up your database environment.
Step 1: Getting your PostgreSQL account
For storing your butler registry databases, as a member of the LSST-France community you can use the PostgreSQL server ccpglsstdev.in2p3.fr.
To use this server you need an individual account. If you don’t have one yet, you can request one by contacting the help desk (see How to Get Help for details) and providing all relevant details. For your convenience, we have prepared the template below:
Hint
Adapt the text below for your particular case when requesting the creation of your individual PostgreSQL account via the help desk:
My name is <your name> and my account for both CC-IN2P3’s login and compute farms is <your account>. I am a member of the group lsst and would like to kindly ask you to create an account for my individual usage on the PostgreSQL database server ccpglsstdev.in2p3.fr.
As a result, an account for your individual use will be created and you will be provided with your credentials: the PostgreSQL user name and password on that particular server. The PostgreSQL user name is typically the same as your UNIX user name (i.e. the result of executing the command whoami) and the password is specific to this server and is unrelated to the password you use to connect to the Login Farm.
In order for the butler to find your database credentials, you need to store them in a file. The default path of that file is $HOME/.lsst/db-auth.yaml. For instance, the contents of that file for the hypothetical user messier would be:
$ cat $HOME/.lsst/db-auth.yaml
- url: "postgresql://ccpglsstdev.in2p3.fr:6553"
  username: "messier"
  password: "very-secret-postgresql-password"
This file tells the butler that it must use the user name messier and password very-secret-postgresql-password for connecting to any database hosted on the server ccpglsstdev.in2p3.fr:6553.
Important
Your database credentials file must be protected, so make sure you do:
$ chmod 0600 $HOME/.lsst/db-auth.yaml
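If you prefer to create the credentials file from a script, the sketch below writes a single-entry file and gives it mode 0600 from the moment it is created, so it is never readable by others even briefly. It relies only on the Python standard library. The function name write_db_auth is hypothetical, and the layout it writes simply mirrors the example file shown above.

```python
import json
import os
from pathlib import Path


def write_db_auth(path, url, username, password):
    """Write a single-entry credentials file readable only by its owner.

    Hypothetical helper for illustration; it is not part of the butler.
    """
    path = Path(path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    # json.dumps yields double-quoted strings, which are also valid YAML scalars.
    lines = [
        f"- url: {json.dumps(url)}",
        f"  username: {json.dumps(username)}",
        f"  password: {json.dumps(password)}",
    ]
    # Create the file with mode 0600 instead of fixing its permissions afterwards.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        # Also tighten the mode of a file that already existed with looser permissions.
        os.fchmod(handle.fileno(), 0o600)
        handle.write("\n".join(lines) + "\n")
    return path
```

For the hypothetical user messier, the call would be write_db_auth("~/.lsst/db-auth.yaml", "postgresql://ccpglsstdev.in2p3.fr:6553", "messier", "<password>"). Note that it overwrites any existing file at that path.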
Note
If you need to store your PostgreSQL credentials in another location, you can set the environment variable LSST_DB_AUTH to the path of the credentials file.
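To double-check which credentials file applies to your session and whether it is adequately protected, a small script along the following lines may help. It is only a sketch of the rule stated in this section (use LSST_DB_AUTH if it is set, otherwise the default path). The names db_auth_path and is_private are hypothetical, and the butler's own lookup logic may differ in details.

```python
import os
import stat
from pathlib import Path

# Default location stated in this tutorial.
DEFAULT_DB_AUTH = "~/.lsst/db-auth.yaml"


def db_auth_path(environ=None):
    """Return the credentials path: LSST_DB_AUTH if set, else the default."""
    environ = os.environ if environ is None else environ
    return Path(environ.get("LSST_DB_AUTH") or DEFAULT_DB_AUTH).expanduser()


def is_private(path):
    """Return True when neither group nor others have any permission on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

A file with mode 0600 passes the is_private check, whereas one with mode 0644 does not.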
Step 2: Preparing your registry database
You can use your individual PostgreSQL database to host as many butler registry databases as you need. To do so, you must create one PostgreSQL schema for each butler registry database you intend to use. From any host in the Login Farm, connect to your database with the command:
$ psql --host=ccpglsstdev.in2p3.fr --port=6553 --username=$(whoami) --dbname=$(whoami)
You will be prompted for your PostgreSQL password. Once authenticated, the psql command will show a prompt where you type in the command below:
CREATE EXTENSION IF NOT EXISTS btree_gist;
This command activates the extension btree_gist in your individual database, which is required by the butler. It only needs to be enabled once for your database account.
For each butler repository you must create a specific PostgreSQL schema with the command:
CREATE SCHEMA my_first_butler_repo;
This command creates a PostgreSQL schema named my_first_butler_repo, which we will instruct the butler to use in the next step. Type \q at the psql command prompt to quit.
Important
To remove a schema you no longer need, use the SQL command below:
-- This command deletes ALL the contents of the schema named 'my_first_butler_repo'
DROP SCHEMA my_first_butler_repo CASCADE;
⚠️ Please note that removing a schema deletes all its contents, so if you already have a butler registry in that schema it will be destroyed. Proceed with caution.
Step 3: Configuring your repository
Before you can create a new butler repository, you need to provide an initial configuration via a seed configuration file. In that file you tell the butler which registry database you want it to use for this particular repository.
In this tutorial, you will create a butler repository under the directory /sps/lsst/users/$(whoami)/my-butler-repo. Create that directory and populate the file butler-seed.yaml as shown below:
$ export REPO=/sps/lsst/users/$(whoami)/my-butler-repo
$ mkdir -p $REPO
$ cat $REPO/butler-seed.yaml
registry:
  db: "postgresql://ccpglsstdev.in2p3.fr:6553/messier"
  namespace: "my_first_butler_repo"
The value of the key db above specifies the endpoint to use. That endpoint includes the hostname and port number of the PostgreSQL server and the name of the database, which is identical to the result of executing the command whoami (messier in this example). The value of the key namespace must match the name of the PostgreSQL schema you created in step 2 above. You can choose any name meaningful to you: a schema can only contain the registry database of a single butler repository, but your individual database may host as many schemas, and therefore repositories, as you need.
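If you create several repositories, you may want to generate the seed file rather than edit it by hand. The sketch below builds the same three-line configuration from the database name and the schema name. The function name seed_config and the validation rule are assumptions made for this illustration. The rule exists because PostgreSQL folds unquoted identifiers to lower case, so a schema created with CREATE SCHEMA My_Repo; is actually named my_repo, and restricting names to lower-case letters, digits and underscores avoids a mismatch with the namespace key.

```python
import re

# Lower-case letters, digits and underscores only (an assumption of this
# sketch), so the name is identical with or without quoting in PostgreSQL.
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def seed_config(database, schema, host="ccpglsstdev.in2p3.fr", port=6553):
    """Return the text of a butler seed configuration for a PostgreSQL registry.

    Hypothetical helper for illustration; it is not part of the butler.
    """
    if not _IDENTIFIER.fullmatch(schema):
        raise ValueError(f"unexpected schema name: {schema!r}")
    return (
        "registry:\n"
        f'  db: "postgresql://{host}:{port}/{database}"\n'
        f'  namespace: "{schema}"\n'
    )
```

For the hypothetical user messier, seed_config("messier", "my_first_butler_repo") returns exactly the contents of the butler-seed.yaml file shown above, which you can then write to $REPO/butler-seed.yaml.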
Step 4: Creating an empty butler repository
In this step you create an empty butler repository, using the butler command. For this tutorial we use a specific weekly release of the LSST science pipelines, but you can use any reasonably recent release (see How to use the LSST Science Pipelines for details):
$ source /cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2022_27/loadLSST.bash
$ setup lsst_distrib
This butler repository will be located under the directory pointed to by the variable REPO initialized above:
$ butler create --seed-config $REPO/butler-seed.yaml --override $REPO
As a result, the butler command creates some files and directories under the directory $REPO. In particular, the file butler.yaml therein contains all the information the butler needs to use your newly created repository. You can use the option --outfile to specify an alternative name and location for that file (see the documentation of the butler create command for details).
Step 5: Populating your repository
Now you can start populating your butler repository. First, you register an instrument with the command:
$ butler register-instrument $REPO 'lsst.obs.lsst.LsstCam'
lsst.obs.lsst.LsstCam is the identifier of the instrument corresponding to the full Rubin-LSST focal plane. The LSST science pipelines include identifiers for other instruments such as lsst.obs.lsst.LsstComCam (for ComCam data) and lsst.obs.lsst.Latiss (for AuxTel LATISS data). All the instrument identifiers currently known to the pipelines can be found in this file.
To ingest raw images use the butler ingest-raws command as shown below:
$ butler ingest-raws --transfer symlink --processes 4 $REPO /path/to/raw/data
Step 6 (optional): Sharing your butler repository with your collaborators
To work with your team, you may need to share a butler repository with one or more of your collaborators. Each member of your team must have their own account at CC-IN2P3 and their individual account on the PostgreSQL server ccpglsstdev.in2p3.fr (see Step 1: Getting your PostgreSQL account above).
In this section we present the one-time procedure you need to follow to authorize your collaborators to use your butler registry database. You must follow this procedure for each butler repository you want to share and for every collaborator you want to share it with.
If you want to authorize user laplace to access your butler repository, whose registry is in the schema named my_first_butler_repo, proceed as follows.
Connect to the database server:
# Connect to PostgreSQL server using your credentials
$ psql --host=ccpglsstdev.in2p3.fr --port=6553 --username=$(whoami) --dbname=$(whoami)
You will be prompted for your PostgreSQL password. Once authenticated, the psql command will show a prompt where you type in the SQL commands below, including the ; at the end of each line:
GRANT ALL PRIVILEGES ON SCHEMA my_first_butler_repo TO laplace;
GRANT ALL PRIVILEGES ON ALL SEQUENCES IN SCHEMA my_first_butler_repo TO laplace;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA my_first_butler_repo TO laplace;
These commands grant user laplace read and write privileges on your butler registry, which is contained in the PostgreSQL schema named my_first_butler_repo. Type \q at the psql command prompt to quit.
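Because this procedure must be repeated for every collaborator, it can be convenient to generate the statements and paste them at the psql prompt. The following sketch produces the three GRANT statements shown above for each user in a list. The function name grant_statements is hypothetical, and the sketch assumes that schema and user names consist only of lower-case letters, digits and underscores, so that they need no quoting in SQL; other names are rejected rather than quoted.

```python
import re

# Names that need no quoting in PostgreSQL (an assumption of this sketch).
_IDENTIFIER = re.compile(r"[a-z_][a-z0-9_]*")


def grant_statements(schema, users):
    """Return the GRANT statements giving each user access to one registry schema.

    Hypothetical helper for illustration; it only builds text and does not
    connect to any database.
    """
    users = list(users)
    for name in (schema, *users):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"name would need quoting: {name!r}")
    objects = (
        f"SCHEMA {schema}",
        f"ALL SEQUENCES IN SCHEMA {schema}",
        f"ALL TABLES IN SCHEMA {schema}",
    )
    return [
        f"GRANT ALL PRIVILEGES ON {obj} TO {user};"
        for user in users
        for obj in objects
    ]
```

For example, print("\n".join(grant_statements("my_first_butler_repo", ["laplace"]))) prints the three statements used in this step.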
Now, user laplace can access your butler repository. Please note that you have to share with laplace the YAML file which contains the description of your repository. That file is typically named butler.yaml and is typically located in the repository’s top directory, e.g. the location pointed to by the environment variable $REPO in Step 3: Configuring your repository above.
Important
User laplace must have their own database credentials file (e.g. ~laplace/.lsst/db-auth.yaml) as shown in Step 1: Getting your PostgreSQL account. The butler needs that information to authenticate to the database server.
⚠️ You must not share your PostgreSQL credentials with anyone: laplace will use their own credentials for accessing your butler and you will continue using yours.
Acknowledgements
An initial version of the material used for writing this tutorial was prepared by D. Boutigny.