Services and Environments¶
metalab includes an environment system for provisioning and managing infrastructure services -- such as PostgreSQL and Atlas -- across different deployment targets. Whether you are developing locally or running on an HPC cluster, the same configuration drives service lifecycle, connectivity, and teardown.
Overview¶
The environment system handles three concerns:
- Configuration -- .metalab.toml defines project-level settings, named environment profiles, and service declarations.
- Provisioning -- metalab services up starts the right services for the selected environment (subprocesses locally, SLURM jobs on a cluster).
- Connectivity -- metalab tunnel opens SSH tunnels so remote services appear on localhost.
Supported deployment targets:
| Target | Environment type | How services run |
|---|---|---|
| Local workstation | local | Subprocesses |
| SLURM / HPC | slurm | sbatch jobs on compute nodes |
| (future) | kubernetes, cloud | Pods, managed services |
Project Configuration (.metalab.toml)¶
Project configuration lives in .metalab.toml at the project root. metalab walks up from the current working directory to find it, so you can run commands from any subdirectory.
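The upward search can be pictured with a short sketch. This is not metalab's actual implementation; the function name find_project_config is hypothetical, and it assumes the search simply checks the starting directory and each parent in turn.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = ".metalab.toml"


def find_project_config(start: Optional[Path] = None) -> Optional[Path]:
    """Return the nearest .metalab.toml at or above `start`, or None.

    Hypothetical helper: illustrates the walk-up search only.
    """
    current = (start or Path.cwd()).resolve()
    # Check the starting directory first, then every ancestor up to the root.
    for directory in (current, *current.parents):
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None
```

Under this assumption, calling find_project_config() from my-project/src/models would return my-project/.metalab.toml if no closer config file exists.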
File format¶
The config uses TOML and has three top-level sections:
| Section | Purpose |
|---|---|
| [project] | Project name and default environment |
| [services.*] | Service declarations (Postgres, Atlas) -- app config only |
| [environments.*] | Named deployment profiles (identity, connectivity, resources) |
Configuration layers¶
metalab separates configuration into four layers, each with a single responsibility:
| Layer | Example section | What it contains | Who consumes it |
|---|---|---|---|
| Identity | [environments.slurm] | Gateway, user, file_root | Orchestrator, SSH tunnel |
| Services resources | [environments.slurm.services] | Partition, time, memory for the services node | Environment implementation |
| Executor resources | [environments.slurm.executor] | Partition, time, memory, GPUs for experiment jobs | resolve_executor() |
| App config | [services.postgres] | Ports, auth, database names | Orchestrator, service code |
The [services.*] section is backend-agnostic and portable across all environments. Scheduler-specific resources live under the environment profile as nested sub-tables, so they only appear for backends that need them.
Full example¶
[project]
name = "my-project"
default_env = "slurm"
# Backend-agnostic service config (ports, auth, databases)
[services.postgres]
auth_method = "scram-sha-256"
database = "metalab"
[services.atlas]
port = 8000
# Local environment -- no resource specs needed
[environments.local]
type = "local"
file_root = "./runs"
# SLURM environment -- identity and connectivity
[environments.slurm]
type = "slurm"
gateway = "hpc.university.edu"
user = "researcher"
file_root = "/shared/experiments"
# SLURM allocation for the services node (postgres + atlas co-located)
[environments.slurm.services]
partition = "cpu2019"
time = "7-00:00:00"
memory = "10G"
# SLURM allocation for experiment array jobs
[environments.slurm.executor]
partition = "gpu"
time = "1:00:00"
memory = "32G"
gpus = 1
modules = ["cuda/12.0"]
conda_env = "ml-env"
max_concurrent = 200
Local overrides (.metalab.local.toml)¶
Personal or sensitive values go in .metalab.local.toml, which sits next to .metalab.toml and should be gitignored. It uses the same structure and is deep-merged on top of the base config.
# .metalab.local.toml -- gitignored
[environments.slurm]
user = "jsmith"
ssh_key = "~/.ssh/cluster_key"
[environments.slurm.services]
partition = "my-lab-partition"
[environments.slurm.executor]
setup = ["source ~/my_setup.sh"]
mail_user = "jsmith@university.edu"
mail_type = "end,fail"
[services.postgres]
password = "my-secret-password"
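A deep merge means nested tables are combined key by key rather than replaced wholesale. The following is a minimal sketch of that behaviour, not metalab's own code; deep_merge is a hypothetical name, and the sample values are taken from the examples above.

```python
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict with `override` merged recursively into `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are tables: merge them instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys: the override wins.
            merged[key] = value
    return merged


base = {
    "environments": {
        "slurm": {
            "user": "researcher",
            "gateway": "hpc.university.edu",
            "services": {"partition": "cpu2019", "memory": "10G"},
        }
    }
}
local = {
    "environments": {
        "slurm": {
            "user": "jsmith",
            "services": {"partition": "my-lab-partition"},
        }
    }
}
merged = deep_merge(base, local)
```

Here the local file changes user and the services partition, while gateway and the services memory setting are inherited from the base config.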
Resolution order¶
When a config is resolved for a specific environment, values are merged in this order (last wins):
- Base [services] and [environments] sections
- Named environment profile (e.g., [environments.slurm])
- Local overrides from .metalab.local.toml
- Environment variables (e.g., METALAB_ENV)
- CLI flags (e.g., --env)
Environment Profiles¶
An environment profile is a named deployment target defined under [environments] in your config. Each profile specifies a type (the backend) and backend-specific settings.
Listing profiles¶
metalab env list
Inspecting a profile¶
metalab env show slurm
Environment: slurm
Type: slurm
File root: /shared/experiments
Config:
gateway: hpc.university.edu
user: researcher
services: {'partition': 'cpu2019', 'time': '7-00:00:00', 'memory': '10G'}
executor: {'partition': 'gpu', 'time': '1:00:00', 'memory': '32G', 'gpus': 1, ...}
Services:
postgres: {'auth_method': 'scram-sha-256', 'database': 'metalab'}
atlas: {'port': 8000}
Selecting an environment¶
The active environment is determined by (in priority order):
- --env <name> flag on any command
- METALAB_ENV environment variable
- default_env in [project]
# Explicit flag
metalab services up --env local
# Environment variable
export METALAB_ENV=slurm
metalab services up
# Falls back to default_env in .metalab.toml
metalab services up
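The priority order can be expressed as a small function. This is a sketch under the rules listed above, not the CLI's real code; select_environment and its parameters are hypothetical names.

```python
import os
from typing import Mapping, Optional


def select_environment(
    cli_env: Optional[str],
    project: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the active environment: --env flag, then METALAB_ENV, then default_env."""
    environ = os.environ if environ is None else environ
    if cli_env:
        return cli_env
    if environ.get("METALAB_ENV"):
        return environ["METALAB_ENV"]
    if project.get("default_env"):
        return project["default_env"]
    # Assumption: with nothing set, the caller should be told explicitly.
    raise ValueError("no environment selected: pass --env, set METALAB_ENV, or set default_env")
```

Passing the environment mapping as an argument keeps the function easy to test without touching the real process environment.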
Service Provisioning¶
The metalab services commands manage the full service lifecycle. The orchestrator reads your resolved config and provisions only the services you have declared.
metalab services up¶
Provisions services for the selected environment.
What happens:
- Checks for an existing service bundle -- if all services are still alive, reuses it.
- Creates the appropriate ServiceEnvironment (local subprocess manager or SLURM job submitter).
- Starts PostgreSQL if [services.postgres] is present in your config.
- Starts Atlas if [services.atlas] is present and PostgreSQL is configured. Atlas requires a PostgreSQL backend.
- Saves a bundle.json with connection details for all running services.
The --tunnel flag opens an SSH tunnel immediately after provisioning:
metalab services up --env slurm --tunnel
Three provisioning scenarios¶
| Scenario | Config needed | What starts |
|---|---|---|
| Postgres + Atlas | [services.postgres] + [services.atlas] | PostgreSQL + Atlas (Atlas queries PG, optionally serves files from file_root) |
| Postgres only | [services.postgres], no [services.atlas] | PostgreSQL only (experiment data indexed, no web UI) |
| Reuse existing | Same as above | Nothing new -- existing bundle is reused if healthy |
Note
Atlas requires a PostgreSQL backend. If you only have a file_root without [services.postgres], Atlas will not be provisioned.
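The dependency rule between the two services can be sketched as a planning function. The name services_to_start is hypothetical and the real orchestrator may structure this differently; the sketch only encodes the rules stated above.

```python
from typing import Any, Mapping


def services_to_start(services: Mapping[str, Any]) -> list[str]:
    """Return declared services in start order, honouring the Atlas -> PostgreSQL dependency."""
    plan: list[str] = []
    if "postgres" in services:
        plan.append("postgres")
    # Atlas is only provisioned when a PostgreSQL backend is also declared.
    if "atlas" in services and "postgres" in services:
        plan.append("atlas")
    return plan
```

With only [services.atlas] declared, the plan is empty, which matches the note above.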
metalab services status¶
Check health of running services:
metalab services status
Use --json for machine-readable output.
metalab services down¶
Stop all services and clean up:
metalab services down
Services are stopped in reverse dependency order (Atlas first, then PostgreSQL). On SLURM, jobs are cancelled with scancel. The bundle.json file is removed.
Store Discovery¶
When services are running, experiment configs can use store: "discover" to automatically locate the active store without hardcoding URIs.
How it works¶
- metalab services up provisions services and saves a bundle.json containing the store locator.
- When metalab encounters store: "discover", it walks up from the current directory looking for services/bundle.json.
- The store_locator field from the bundle is used as the store URI.
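The steps above can be sketched as follows. This is an illustration rather than metalab's implementation: discover_store_locator is a hypothetical name, and the sketch assumes the bundle is plain JSON with a top-level store_locator key.

```python
import json
from pathlib import Path
from typing import Optional


def discover_store_locator(start: Optional[Path] = None) -> Optional[str]:
    """Walk up from `start` to the nearest services/bundle.json and return its store_locator."""
    current = (start or Path.cwd()).resolve()
    for directory in (current, *current.parents):
        bundle_file = directory / "services" / "bundle.json"
        if bundle_file.is_file():
            bundle = json.loads(bundle_file.read_text())
            # Assumed layout: the locator is a top-level string field.
            return bundle.get("store_locator")
    return None
```

Returning None when no bundle is found leaves it to the caller to decide whether to fail or fall back to another store.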
Before and after¶
Without discovery, you must embed connection details:
metalab.run(
    exp,
    store="postgresql://researcher@cn001:5432/metalab?file_root=/shared/experiments",
)
With discovery:
metalab.run(
    exp,
    store="discover",
)
The locator is resolved at runtime from the running service bundle. This also works in experiment YAML configs:
store: "discover"
SSH Authentication and Tunneling¶
When services run on remote hosts (e.g., SLURM compute nodes), metalab establishes SSH tunnels to make them accessible on your local machine.
Default behavior¶
metalab uses your existing SSH configuration. If you can run ssh user@gateway without a password prompt, metalab tunnel works with zero additional config.
Under the hood, it spawns:
This leverages your ~/.ssh/config, SSH agent, and any keys already loaded.
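The exact command line is not reproduced here. As a rough sketch, assuming the tunnel is a standard OpenSSH local port forward (ssh -N -L), the argument list could be assembled like this. The function build_tunnel_command and its parameters are hypothetical; only the ssh options -N, -L, and -i are real.

```python
import os
from typing import Optional


def build_tunnel_command(
    gateway: str,
    user: str,
    remote_host: str,
    remote_port: int,
    local_port: int,
    ssh_key: Optional[str] = None,
) -> list[str]:
    """Build an ssh argv that forwards localhost:local_port to remote_host:remote_port."""
    # -N: run no remote command; -L: local port forwarding through the gateway.
    command = ["ssh", "-N", "-L", f"{local_port}:{remote_host}:{remote_port}"]
    if ssh_key:
        # Optional explicit identity file; "~" is expanded here because no shell is involved.
        command += ["-i", os.path.expanduser(ssh_key)]
    command.append(f"{user}@{gateway}")
    return command
```

For example, forwarding Atlas on compute node cn001 through hpc.university.edu yields ssh -N -L 8000:cn001:8000 researcher@hpc.university.edu, which could then be passed to subprocess.Popen.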
Setting up SSH key authentication¶
If you have not already configured key-based authentication for your cluster:
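- Generate a key (if you don't have one), for example with ssh-keygen -t ed25519.
- Copy it to the remote host, for example with ssh-copy-id user@gateway.
- Verify that ssh user@gateway no longer prompts for a password.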
On macOS, the Keychain-backed SSH agent handles passphrase caching automatically. On Linux, ensure ssh-agent is running and your key is added (ssh-add).
Explicit key overrides¶
If a specific environment requires a different key, set it in .metalab.local.toml:
# .metalab.local.toml
[environments.slurm]
ssh_key = "~/.ssh/special_cluster_key"
This adds -i ~/.ssh/special_cluster_key to the SSH command.
The metalab tunnel command¶
Open a tunnel to running services:
metalab tunnel
The tunnel reads bundle.json to determine the remote host and port, then forwards them to localhost. The process runs in the foreground until you press Ctrl+C.
Note
Always prefer key-based authentication over passwords. Password-based SSH is not supported by metalab tunnel.
Workflow Guides¶
Local Development¶
The simplest setup -- services run as local subprocesses.
metalab services up --env local
Atlas is immediately available at http://localhost:8000, querying data from PostgreSQL. When file_root is configured, Atlas can also serve artifact downloads and log content. No tunnel is needed.
SLURM / HPC with PostgreSQL¶
Full-featured setup with a Postgres query index.
# Provision PostgreSQL and Atlas on a compute node
metalab services up --env slurm
# Open an SSH tunnel to access Atlas locally
metalab tunnel
# Atlas is now available at http://localhost:8000
# Run experiments -- they write to the shared filesystem and PG index
# When finished, clean up
metalab services down --env slurm
Or provision and tunnel in one step:
metalab services up --env slurm --tunnel
Local Development -- Atlas reading from file_root¶
When running locally with both Postgres and Atlas, Atlas serves query results from Postgres and can also serve artifact/log content when file_root is configured in the environment profile.
Executor Configuration¶
Executors (which run experiment tasks) can be created from configuration dicts using the executor_from_config() factory. Executor defaults live in the [environments.*.executor] sub-table of your environment profile:
- TOML ([environments.slurm.executor]) defines infrastructure defaults -- partitions, walltime, memory.
- YAML (per-experiment) specifies only what varies per experiment -- worker counts, GPUs.
executor_from_config()¶
from metalab.executor.config import executor_from_config
# Create from type name and config dict
executor = executor_from_config("slurm", {
    "partition": "gpu",
    "time": "2:00:00",
    "memory": "16G",
    "gpus": 1,
})
# Local executor with multiple workers
executor = executor_from_config("local", {"workers": 4})
# Single-threaded (returns None, metalab runs in-process)
executor = executor_from_config("local", {"workers": 1})
Supported executor types¶
| Type | Config class | What it creates |
|---|---|---|
| local | LocalExecutorConfig | ProcessExecutor (or None for serial) |
| slurm | SlurmExecutorConfig | SlurmExecutor with job array support |
SLURM executor options¶
| Field | Default | Description |
|---|---|---|
| partition | "default" | SLURM partition |
| time | "1:00:00" | Walltime limit |
| cpus | 1 | CPUs per task |
| memory | "4G" | Memory per task |
| gpus | 0 | GPUs per task |
| max_concurrent | None | Max simultaneous array tasks |
| modules | [] | module load commands |
| conda_env | None | Conda environment to activate |
| setup | [] | Extra shell commands before execution |
Simplified experiment config¶
With executor_from_config, experiment YAML files become concise:
The executor inherits partition, memory, and other defaults from [environments.*.executor] in .metalab.toml, while the experiment only overrides what it needs.
How Configuration Flows into metalab.run()¶
When you call metalab.run(), multiple configuration layers merge transparently:
.metalab.toml shared infrastructure defaults
└─ .metalab.local.toml machine-specific overrides (gitignored)
└─ experiment config per-experiment overrides
└─ metalab.run() final resolution at runtime
Store resolution¶
The store argument accepts several forms:
| Value | What happens |
|---|---|
| None (default) | FileStore at ./experiments/{experiment_id}/ |
| "runs/" | FileStore at the given path |
| "postgresql://..." | PostgresStore (pass file_root for artifacts) |
| "discover" | Auto-detect from the nearest running service bundle |
| StoreConfig object | Used directly (auto-scoped to experiment) |
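The dispatch in this table can be sketched as a classifier over the store argument. This is illustrative only: describe_store is a hypothetical helper that returns a (kind, detail) tuple instead of constructing real store objects.

```python
from typing import Any, Optional, Tuple


def describe_store(store: Any, experiment_id: str) -> Tuple[str, Optional[Any]]:
    """Classify a `store` argument following the table above."""
    if store is None:
        # Default: a file store under ./experiments/<experiment_id>/
        return ("file", f"./experiments/{experiment_id}/")
    if isinstance(store, str):
        if store == "discover":
            return ("discover", None)
        if store.startswith("postgresql://"):
            return ("postgres", store)
        # Any other string is treated as a filesystem path.
        return ("file", store)
    # Anything else is assumed to be a StoreConfig-like object and is used directly.
    return ("config", store)
```

Checking for "discover" before treating a string as a path matters, since otherwise the keyword would be read as a relative directory name.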
Executor resolution¶
metalab.resolve_executor(platform, overrides) merges TOML defaults with per-experiment overrides:
- Auto-discovers .metalab.toml (walks up from cwd)
- Resolves the environment profile matching platform
- Reads the [environments.*.executor] sub-table as defaults
- Merges per-experiment overrides on top (overrides win)
- Creates the executor via the plugin registry
# [environments.slurm.executor] provides partition, time, memory, modules, conda_env...
# Your experiment only overrides what differs:
executor = metalab.resolve_executor("slurm", {"gpus": 1, "time": "4:00:00"})
If no .metalab.toml exists, only the overrides dict is used, so your code still works without a project config file.
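The merge step can be sketched in a few lines. This is not the body of metalab.resolve_executor(); resolve_executor_config is a hypothetical helper, and it assumes that platform is matched against the profile name and that the merge is a flat, key-by-key override.

```python
from typing import Any, Mapping, Optional


def resolve_executor_config(
    config: Optional[Mapping[str, Any]],
    platform: str,
    overrides: Mapping[str, Any],
) -> dict[str, Any]:
    """Merge [environments.<platform>.executor] defaults with per-experiment overrides."""
    defaults: Mapping[str, Any] = {}
    if config is not None:
        # Assumption: the profile is looked up by name.
        profile = config.get("environments", {}).get(platform, {})
        defaults = profile.get("executor", {})
    # Later keys win, so overrides replace defaults with the same name.
    return {**defaults, **overrides}


toml_config = {
    "environments": {
        "slurm": {
            "type": "slurm",
            "executor": {"partition": "gpu", "time": "1:00:00", "memory": "32G", "gpus": 1},
        }
    }
}
resolved = resolve_executor_config(toml_config, "slurm", {"gpus": 1, "time": "4:00:00"})
```

In this example resolved keeps partition and memory from the TOML defaults and takes time from the experiment. Passing None as the config returns the overrides unchanged.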
Inspecting the resolved config¶
Use the CLI to verify what metalab will use before running experiments:
metalab env list # List available environment profiles
metalab env show slurm # Show fully merged config for a profile
When metalab.run() executes, it logs the resolved store, executor, and run counts at INFO level so you can confirm what was picked up.
Teardown and Cleanup¶
metalab services down¶
Stops all services tracked in the service bundle:
- SLURM jobs are cancelled via scancel.
- Local processes receive SIGTERM, then SIGKILL if they don't exit within 5 seconds.
- The bundle.json file is removed.
Services are stopped in reverse dependency order (Atlas before PostgreSQL) so that dependents are torn down before the services they rely on.
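For local processes, the stop sequence and ordering can be sketched with the standard subprocess module. The helper names stop_process and teardown_order are hypothetical, and the sketch assumes services are tracked as Popen objects, which may not match metalab's internals.

```python
import subprocess


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask a process to exit, then force-kill it if it outlives the grace period."""
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        return proc.wait()


def teardown_order(start_order: list[str]) -> list[str]:
    """Stop services in the reverse of the order in which they were started."""
    return list(reversed(start_order))
```

If services were started as ["postgres", "atlas"], teardown_order returns ["atlas", "postgres"], so Atlas is stopped before the database it depends on.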
Orphan detection¶
If a previous session was not cleanly shut down, metalab services status will detect stale bundles. If the services referenced in the bundle are unreachable, metalab services up will discard the stale bundle and provision fresh services.
# Check for orphaned services
metalab services status --env slurm
# If services show as unreachable, re-provision
metalab services up --env slurm
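This guide does not say how reachability is determined. One plausible check, shown here purely as an assumption, is to attempt a TCP connection to each host and port recorded in the bundle; is_reachable is a hypothetical helper.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A bundle whose services all fail such a check would be a candidate for being treated as stale.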
Bundle location¶
The service bundle is stored at:
- {file_root}/services/bundle.json -- if file_root is set in the environment config.
- ~/.metalab/services/{env_name}/bundle.json -- otherwise.
The bundle file has 0o600 permissions since it may contain credentials.
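The location rule and the restrictive file mode can be sketched together. bundle_path and write_bundle are hypothetical helpers written for illustration; the actual bundle contents and writer are not shown in this guide.

```python
import json
import os
from pathlib import Path
from typing import Any, Mapping, Optional


def bundle_path(env_name: str, file_root: Optional[str] = None) -> Path:
    """Choose the bundle location following the two rules above."""
    if file_root:
        return Path(file_root) / "services" / "bundle.json"
    return Path.home() / ".metalab" / "services" / env_name / "bundle.json"


def write_bundle(path: Path, bundle: Mapping[str, Any]) -> None:
    """Write the bundle as JSON, readable and writable by the owner only."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with mode 0o600 from the start so it is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump(dict(bundle), handle, indent=2)
    # Re-apply the mode in case the file already existed with looser permissions.
    os.chmod(path, 0o600)
```

Creating the file with the final mode, rather than writing it and tightening permissions afterwards, avoids a short window in which credentials could be read by other users.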