Services and Environments¶

metalab includes an environment system for provisioning and managing infrastructure services -- such as PostgreSQL and Atlas -- across different deployment targets. Whether you are developing locally or running on an HPC cluster, the same configuration drives service lifecycle, connectivity, and teardown.

Overview¶

The environment system handles three concerns:

Configuration -- .metalab.toml defines project-level settings, named environment profiles, and service declarations.
Provisioning -- metalab services up starts the right services for the selected environment (subprocesses locally, SLURM jobs on a cluster).
Connectivity -- metalab tunnel opens SSH tunnels so remote services appear on localhost.

Supported deployment targets:

Target	Environment type	How services run
Local workstation	`local`	Subprocesses
SLURM / HPC	`slurm`	`sbatch` jobs on compute nodes
(future)	`kubernetes`, cloud	Pods, managed services

Project Configuration (`.metalab.toml`)¶

Project configuration lives in .metalab.toml at the project root. metalab walks up from the current working directory to find it, so you can run commands from any subdirectory.
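
The upward search can be pictured with a short Python sketch. This is an illustration of the lookup described above, not metalab's actual implementation; the function name find_project_config is hypothetical.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = ".metalab.toml"


def find_project_config(start: Optional[Path] = None) -> Optional[Path]:
    """Return the nearest .metalab.toml at or above `start`, or None."""
    current = (start or Path.cwd()).resolve()
    # Check the starting directory first, then each parent up to the root.
    for directory in (current, *current.parents):
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None
```

Calling find_project_config() from a nested subdirectory returns the path of the config at the project root, which is why commands behave the same anywhere inside the project tree.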

File format¶

The config uses TOML and has three top-level sections:

Section	Purpose
`[project]`	Project name and default environment
`[services.*]`	Service declarations (Postgres, Atlas) -- app config only
`[environments.*]`	Named deployment profiles (identity, connectivity, resources)

Configuration layers¶

metalab separates configuration into four layers, each with a single responsibility:

Layer	Example section	What it contains	Who consumes it
Identity	`[environments.slurm]`	Gateway, user, file_root	Orchestrator, SSH tunnel
Services resources	`[environments.slurm.services]`	Partition, time, memory for the services node	Environment implementation
Executor resources	`[environments.slurm.executor]`	Partition, time, memory, GPUs for experiment jobs	`resolve_executor()`
App config	`[services.postgres]`	Ports, auth, database names	Orchestrator, service code

The [services.*] section is backend-agnostic and portable across all environments. Scheduler-specific resources live under the environment profile as nested sub-tables, so they only appear for backends that need them.

Full example¶

[project]
name = "my-project"
default_env = "slurm"

# Backend-agnostic service config (ports, auth, databases)
[services.postgres]
auth_method = "scram-sha-256"
database = "metalab"

[services.atlas]
port = 8000

# Local environment -- no resource specs needed
[environments.local]
type = "local"
file_root = "./runs"

# SLURM environment -- identity and connectivity
[environments.slurm]
type = "slurm"
gateway = "hpc.university.edu"
user = "researcher"
file_root = "/shared/experiments"

# SLURM allocation for the services node (postgres + atlas co-located)
[environments.slurm.services]
partition = "cpu2019"
time = "7-00:00:00"
memory = "10G"

# SLURM allocation for experiment array jobs
[environments.slurm.executor]
partition = "gpu"
time = "1:00:00"
memory = "32G"
gpus = 1
modules = ["cuda/12.0"]
conda_env = "ml-env"
max_concurrent = 200

Local overrides (`.metalab.local.toml`)¶

Personal or sensitive values go in .metalab.local.toml, which sits next to .metalab.toml and should be gitignored. It uses the same structure and is deep-merged on top of the base config.

# .metalab.local.toml -- gitignored
[environments.slurm]
user = "jsmith"
ssh_key = "~/.ssh/cluster_key"

[environments.slurm.services]
partition = "my-lab-partition"

[environments.slurm.executor]
setup = ["source ~/my_setup.sh"]
mail_user = "jsmith@university.edu"
mail_type = "end,fail"

[services.postgres]
password = "my-secret-password"
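
A deep merge combines nested tables key by key instead of replacing whole sections. The sketch below shows the idea in Python under stated assumptions: the helper name deep_merge is hypothetical, and replacing lists and scalars wholesale is a guess, since the text above does not say how metalab treats them.

```python
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with `override` layered on top of `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are tables: merge them recursively.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced (an assumption of this sketch).
            merged[key] = value
    return merged


base = {
    "environments": {
        "slurm": {
            "type": "slurm",
            "user": "researcher",
            "services": {"partition": "cpu2019", "memory": "10G"},
        }
    }
}
local = {
    "environments": {
        "slurm": {
            "user": "jsmith",
            "services": {"partition": "my-lab-partition"},
        }
    }
}
merged = deep_merge(base, local)
```

In the merged result, user and the services partition come from the local file, while type and memory are kept from the base config.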

Resolution order¶

When a config is resolved for a specific environment, values are merged in this order (last wins):

Base [services] and [environments] sections
Named environment profile (e.g., [environments.slurm])
Local overrides from .metalab.local.toml
Environment variables (e.g., METALAB_ENV)
CLI flags (e.g., --env)

Environment Profiles¶

An environment profile is a named deployment target defined under [environments] in your config. Each profile specifies a type (the backend) and backend-specific settings.

Listing profiles¶

metalab env list

  local                local
  slurm                slurm *

  * = default (set via [project] default_env)

Inspecting a profile¶

metalab env show slurm

Environment: slurm
  Type: slurm
  File root: /shared/experiments
  Config:
    gateway: hpc.university.edu
    user: researcher
    services: {'partition': 'cpu2019', 'time': '7-00:00:00', 'memory': '10G'}
    executor: {'partition': 'gpu', 'time': '1:00:00', 'memory': '32G', 'gpus': 1, ...}
  Services:
    postgres: {'auth_method': 'scram-sha-256', 'database': 'metalab'}
    atlas: {'port': 8000}

Selecting an environment¶

The active environment is determined by (in priority order):

--env <name> flag on any command
METALAB_ENV environment variable
default_env in [project]

# Explicit flag
metalab services up --env local

# Environment variable
export METALAB_ENV=slurm
metalab services up

# Falls back to default_env in .metalab.toml
metalab services up
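
The priority order can be expressed as a small Python function. This is a sketch of the rule stated above, not metalab's code; select_environment is a hypothetical name, and raising an error when nothing is set is an assumption.

```python
from typing import Mapping, Optional


def select_environment(
    cli_env: Optional[str],
    environ: Mapping[str, str],
    default_env: Optional[str],
) -> str:
    """Pick the active environment: --env flag, then METALAB_ENV, then default_env."""
    if cli_env:
        return cli_env
    from_variable = environ.get("METALAB_ENV")
    if from_variable:
        return from_variable
    if default_env:
        return default_env
    # Behavior when nothing is configured is an assumption of this sketch.
    raise ValueError("no environment selected and no default_env configured")
```

For example, select_environment("local", {"METALAB_ENV": "slurm"}, "slurm") returns "local", because the explicit flag takes precedence over the environment variable.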

Service Provisioning¶

The metalab services commands manage the full service lifecycle. The orchestrator reads your resolved config and provisions only the services you have declared.

`metalab services up`¶

Provisions services for the selected environment.

metalab services up --env slurm

What happens:

Checks for an existing service bundle -- if all services are still alive, reuses it.
Creates the appropriate ServiceEnvironment (local subprocess manager or SLURM job submitter).
Starts PostgreSQL if [services.postgres] is present in your config.
Starts Atlas if [services.atlas] is present and PostgreSQL is configured. Atlas requires a PostgreSQL backend.
Saves a bundle.json with connection details for all running services.

The --tunnel flag opens an SSH tunnel immediately after provisioning:

metalab services up --env slurm --tunnel

Three provisioning scenarios¶

Scenario	Config needed	What starts
Postgres + Atlas	`[services.postgres]` + `[services.atlas]`	PostgreSQL + Atlas (Atlas queries PG, optionally serves files from `file_root`)
Postgres only	`[services.postgres]`, no `[services.atlas]`	PostgreSQL only (experiment data indexed, no web UI)
Reuse existing	Same as above	Nothing new -- existing bundle is reused if healthy

Note

Atlas requires a PostgreSQL backend. If you only have a file_root without [services.postgres], Atlas will not be provisioned.
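
The scenarios above reduce to a simple decision on which sections the resolved config declares. The following Python sketch illustrates that decision; plan_services is a hypothetical helper, and the config is modeled as a plain dict rather than metalab's own config object.

```python
from typing import Any, Dict, List


def plan_services(config: Dict[str, Any]) -> List[str]:
    """Return the services to start, in startup order."""
    services = config.get("services", {})
    plan: List[str] = []
    if "postgres" in services:
        plan.append("postgres")
        # Atlas is only provisioned when it has a PostgreSQL backend.
        if "atlas" in services:
            plan.append("atlas")
    return plan
```

A config declaring only [services.atlas] yields an empty plan, matching the note that Atlas is not provisioned without PostgreSQL.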

`metalab services status`¶

Check health of running services:

metalab services status --env slurm

  ✓ postgres: cn001:5432 (running)
  ✓ atlas: cn001:8000 (running)

Use --json for machine-readable output.

`metalab services down`¶

Stop all services and clean up:

metalab services down --env slurm

Services are stopped in reverse dependency order (Atlas first, then PostgreSQL). On SLURM, jobs are cancelled with scancel. The bundle.json file is removed.
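
Reverse dependency order is the startup order reversed. A minimal Python sketch using the standard library's graphlib (Python 3.9+) is shown below; teardown_order is a hypothetical name, and the dependency map is written by hand for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def teardown_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Map each service to the services it depends on; return the stop order."""
    # static_order() yields dependencies before their dependents (startup order).
    startup = list(TopologicalSorter(dependencies).static_order())
    return list(reversed(startup))


order = teardown_order({"postgres": set(), "atlas": {"postgres"}})
print(order)  # ['atlas', 'postgres']
```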

Store Discovery¶

When services are running, experiment configs can use store: "discover" to automatically locate the active store without hardcoding URIs.

How it works¶

metalab services up provisions services and saves a bundle.json containing the store locator.
When metalab encounters store: "discover", it walks up from the current directory looking for services/bundle.json.
The store_locator field from the bundle is used as the store URI.
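
The lookup steps above can be sketched in Python as follows. This is an illustration only: discover_store is a hypothetical name, and the bundle is assumed to be a JSON object whose only field used here is store_locator.

```python
import json
from pathlib import Path
from typing import Optional


def discover_store(start: Optional[Path] = None) -> Optional[str]:
    """Walk up from `start` looking for services/bundle.json; return its store locator."""
    current = (start or Path.cwd()).resolve()
    for directory in (current, *current.parents):
        bundle = directory / "services" / "bundle.json"
        if bundle.is_file():
            data = json.loads(bundle.read_text())
            return data["store_locator"]
    # No bundle found; how metalab reports this case is not covered here.
    return None
```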

Before and after¶

Without discovery, you must embed connection details:

metalab.run(
    exp,
    store="postgresql://researcher@cn001:5432/metalab?file_root=/shared/experiments",
)

With discovery:

metalab.run(exp, store="discover")

The locator is resolved at runtime from the running service bundle. This also works in experiment YAML configs:

store: discover

SSH Authentication and Tunneling¶

When services run on remote hosts (e.g., SLURM compute nodes), metalab establishes SSH tunnels to make them accessible on your local machine.

Default behavior¶

metalab uses your existing SSH configuration. If you can run ssh user@gateway without a password prompt, metalab tunnel works with zero additional config.

Under the hood, it spawns:

ssh -N -L local_port:127.0.0.1:remote_port [-J user@gateway] [user@]remote_host

This leverages your ~/.ssh/config, SSH agent, and any keys already loaded.

Setting up SSH key authentication¶

If you have not already configured key-based authentication for your cluster:

Generate a key (if you don't have one):

ssh-keygen -t ed25519 -C "your_email@example.com"

Copy it to the remote host:

ssh-copy-id researcher@hpc.university.edu

Verify (should not prompt for a password):

ssh researcher@hpc.university.edu hostname

On macOS, the Keychain-backed SSH agent handles passphrase caching automatically. On Linux, ensure ssh-agent is running and your key is added (ssh-add).

Explicit key overrides¶

If a specific environment requires a different key, set it in .metalab.local.toml:

[environments.slurm]
ssh_key = "~/.ssh/special_cluster_key"

This adds -i ~/.ssh/special_cluster_key to the SSH command.
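
To make the shape of the SSH invocation concrete, the sketch below assembles the argument list in Python. It follows the command template shown earlier; build_tunnel_command and its parameters are hypothetical, and the position of -i within the command is an assumption.

```python
import os
from typing import List, Optional


def build_tunnel_command(
    local_port: int,
    remote_host: str,
    remote_port: int,
    user: Optional[str] = None,
    gateway: Optional[str] = None,
    ssh_key: Optional[str] = None,
) -> List[str]:
    """Build an `ssh -N -L ...` argument list for a local port forward."""
    command = ["ssh", "-N", "-L", f"{local_port}:127.0.0.1:{remote_port}"]
    if ssh_key:
        # Expand "~" ourselves because no shell is involved when using argv lists.
        command += ["-i", os.path.expanduser(ssh_key)]
    if gateway:
        jump = f"{user}@{gateway}" if user else gateway
        command += ["-J", jump]
    command.append(f"{user}@{remote_host}" if user else remote_host)
    return command
```

The resulting list can be passed to subprocess.Popen without a shell, which avoids quoting problems in host names or key paths.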

The `metalab tunnel` command¶

Open a tunnel to running services:

metalab tunnel --env slurm

Tunnel established: http://127.0.0.1:8000
Press Ctrl+C to close.

The tunnel reads bundle.json to determine the remote host and port, then forwards them to localhost. The process runs in the foreground until you press Ctrl+C.

Note

Use key-based authentication. Password-based SSH is not supported by metalab tunnel.

Workflow Guides¶

Local Development¶

The simplest setup -- services run as local subprocesses.

metalab services up --env local

Services started (local):
  atlas: 127.0.0.1:8000

Atlas is immediately available at http://localhost:8000, querying data from PostgreSQL. When file_root is configured, Atlas can also serve artifact downloads and log content. No tunnel is needed.

SLURM / HPC with PostgreSQL¶

Full-featured setup with a Postgres query index.

# Provision PostgreSQL and Atlas on a compute node
metalab services up --env slurm

# Open an SSH tunnel to access Atlas locally
metalab tunnel

# Atlas is now available at http://localhost:8000
# Run experiments -- they write to the shared filesystem and PG index

# When finished, clean up
metalab services down --env slurm

Or provision and tunnel in one step:

metalab services up --env slurm --tunnel

Local Development -- Atlas reading from `file_root`¶

When running locally with both Postgres and Atlas, Atlas serves query results from Postgres and can also serve artifact/log content when file_root is configured in the environment profile.

metalab services up --env local
# Atlas at http://localhost:8000 with full artifact/log access

Executor Configuration¶

Executors (which run experiment tasks) can be created from configuration dicts using the executor_from_config() factory. Executor defaults live in the [environments.*.executor] sub-table of your environment profile:

TOML ([environments.slurm.executor]) defines infrastructure defaults -- partitions, walltime, memory.
YAML (per-experiment) specifies only what varies per experiment -- worker counts, GPUs.

`executor_from_config()`¶

from metalab.executor.config import executor_from_config

# Create from type name and config dict
executor = executor_from_config("slurm", {
    "partition": "gpu",
    "time": "2:00:00",
    "memory": "16G",
    "gpus": 1,
})

# Local executor with multiple workers
executor = executor_from_config("local", {"workers": 4})

# Single worker (returns None, so metalab runs serially in-process)
executor = executor_from_config("local", {"workers": 1})

Supported executor types¶

Type	Config class	What it creates
`local`	`LocalExecutorConfig`	`ProcessExecutor` (or `None` for serial)
`slurm`	`SlurmExecutorConfig`	`SlurmExecutor` with job array support

SLURM executor options¶

Field	Default	Description
`partition`	`"default"`	SLURM partition
`time`	`"1:00:00"`	Walltime limit
`cpus`	`1`	CPUs per task
`memory`	`"4G"`	Memory per task
`gpus`	`0`	GPUs per task
`max_concurrent`	`None`	Max simultaneous array tasks
`modules`	`[]`	`module load` commands
`conda_env`	`None`	Conda environment to activate
`setup`	`[]`	Extra shell commands before execution

Simplified experiment config¶

With executor_from_config, experiment YAML files become concise:

# experiment.yaml
executor:
  type: slurm
  gpus: 1
  time: "4:00:00"

store: discover

The executor inherits partition, memory, and other defaults from [environments.*.executor] in .metalab.toml, while the experiment only overrides what it needs.
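
The inheritance can be pictured as a dictionary merge in which the experiment's values win. The Python sketch below uses the values from the examples in this page; merge_executor_config is a hypothetical helper, not part of metalab's API.

```python
from typing import Any, Dict


def merge_executor_config(
    defaults: Dict[str, Any], overrides: Dict[str, Any]
) -> Dict[str, Any]:
    """Layer per-experiment overrides on top of the TOML executor defaults."""
    return {**defaults, **overrides}


# Defaults as declared under [environments.slurm.executor]
toml_defaults = {
    "partition": "gpu",
    "time": "1:00:00",
    "memory": "32G",
    "gpus": 1,
    "modules": ["cuda/12.0"],
    "conda_env": "ml-env",
    "max_concurrent": 200,
}
# Overrides from the experiment YAML (the `type` key selects the backend)
experiment_overrides = {"gpus": 1, "time": "4:00:00"}

resolved = merge_executor_config(toml_defaults, experiment_overrides)
```

Here resolved keeps the partition, memory, modules, and conda environment from the TOML file, while time becomes "4:00:00".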

How Configuration Flows into `metalab.run()`¶

When you call metalab.run(), multiple configuration layers merge transparently:

.metalab.toml               shared infrastructure defaults
  └─ .metalab.local.toml    machine-specific overrides (gitignored)
      └─ experiment config   per-experiment overrides
          └─ metalab.run()   final resolution at runtime

Store resolution¶

The store argument accepts several forms:

Value	What happens
`None` (default)	FileStore at `./experiments/{experiment_id}/`
`"runs/"`	FileStore at the given path
`"postgresql://..."`	PostgresStore (pass `file_root` for artifacts)
`"discover"`	Auto-detect from the nearest running service bundle
`StoreConfig` object	Used directly (auto-scoped to experiment)
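
As a rough illustration of how these forms could be told apart, the Python sketch below classifies a store argument. It is not metalab's dispatch code: classify_store and the returned labels are hypothetical, and any non-string object is treated as a StoreConfig for simplicity.

```python
from typing import Any


def classify_store(store: Any) -> str:
    """Return a label describing which row of the table a store value falls under."""
    if store is None:
        return "file-default"      # FileStore at ./experiments/{experiment_id}/
    if not isinstance(store, str):
        return "store-config"      # assumed to be a StoreConfig object
    if store == "discover":
        return "discover"          # resolved from the nearest service bundle
    if store.startswith("postgresql://"):
        return "postgres"          # PostgresStore
    return "file"                  # any other string is treated as a path
```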

Executor resolution¶

metalab.resolve_executor(platform, overrides) merges TOML defaults with per-experiment overrides:

Auto-discovers .metalab.toml (walks up from cwd)
Resolves the environment profile matching platform
Reads the [environments.*.executor] sub-table as defaults
Merges per-experiment overrides on top (overrides win)
Creates the executor via the plugin registry

# [environments.slurm.executor] provides partition, time, memory, modules, conda_env...
# Your experiment only overrides what differs:
executor = metalab.resolve_executor("slurm", {"gpus": 1, "time": "4:00:00"})

If no .metalab.toml exists, only the overrides dict is used, so your code still works without a project config file.

Inspecting the resolved config¶

Use the CLI to verify what metalab will use before running experiments:

metalab env list           # List available environment profiles
metalab env show slurm     # Show fully merged config for a profile

When metalab.run() executes, it logs the resolved store, executor, and run counts at INFO level so you can confirm what was picked up.

Teardown and Cleanup¶

`metalab services down`¶

Stops all services tracked in the service bundle:

SLURM jobs are cancelled via scancel.
Local processes receive SIGTERM, then SIGKILL if they don't exit within 5 seconds.
The bundle.json file is removed.

Services are stopped in reverse dependency order (Atlas before PostgreSQL) so that dependents are torn down before the services they rely on.
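
The terminate-then-kill pattern for local processes looks roughly like this in Python. It is a sketch of the behavior described above using the standard subprocess module; stop_process is a hypothetical name, and the 5-second grace period is taken from the text.

```python
import subprocess


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask a process to exit, then force it if it does not exit in time."""
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        return proc.wait()
```

The grace period gives a service such as PostgreSQL a chance to shut down cleanly before it is killed.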

Orphan detection¶

If a previous session was not cleanly shut down, metalab services status will detect stale bundles. If the services referenced in the bundle are unreachable, metalab services up will discard the stale bundle and provision fresh services.

# Check for orphaned services
metalab services status --env slurm

# If services show as unreachable, re-provision
metalab services up --env slurm
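
How metalab decides that a service is unreachable is not specified here. One simple approach, shown below as an assumption rather than the actual check, is to attempt a TCP connection to the host and port recorded in the bundle; is_reachable is a hypothetical helper.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A TCP connect only shows that something is listening on the port. A fuller health check would also speak the service's protocol.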

Bundle location¶

The service bundle is stored at:

{file_root}/services/bundle.json -- if file_root is set in the environment config.
~/.metalab/services/{env_name}/bundle.json -- otherwise.

The bundle file has 0o600 permissions since it may contain credentials.
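
The location rule and the restrictive permissions can be sketched in Python as follows. The helper names bundle_path and write_bundle are hypothetical, and the bundle contents are treated as an arbitrary JSON-serializable dict.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict, Optional


def bundle_path(env_name: str, file_root: Optional[str] = None) -> Path:
    """Resolve the bundle location from the environment's file_root, if any."""
    if file_root:
        return Path(file_root) / "services" / "bundle.json"
    return Path.home() / ".metalab" / "services" / env_name / "bundle.json"


def write_bundle(path: Path, bundle: Dict[str, Any]) -> None:
    """Write the bundle so that only the owner can read or modify it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with mode 0o600 from the start, before any data is written.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump(bundle, handle)
    # Enforce the mode as well in case the file already existed with looser bits.
    os.chmod(path, 0o600)
```

Creating the file with the restrictive mode, rather than writing it and tightening permissions afterwards, avoids a window in which credentials are readable by other users.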
