Architecture

An overview of how metalab is structured internally: the core abstractions, plugin system, orchestration flow, and design patterns that tie everything together.

Central Contract

Everything in metalab revolves around a single idea:

(ContextSpec, Params, SeedBundle) → RunRecord + Artifacts

Domain logic lives entirely in user-defined Operations. Everything else—execution, storage, services—is pluggable.

Layers at a Glance

Layer	Key Types	Purpose
Definition	`Experiment`, `ContextSpec`, `ParamSource`, `SeedPlan`	Declarative experiment specification
Execution	`Executor`, `RunPayload`, `RunHandle`	Where and how work runs (threads, processes, SLURM)
Operation	`OperationWrapper`, `Capture`, `Runtime`	User computation + injected plumbing
Storage	`Store`, `StoreConfig`, `ArtifactDescriptor`	Persisting run records, artifacts, and logs
Results	`Results`, `Run`, `DerivedMetricFn`	Querying and post-processing completed runs
Services	`ServicePlugin`, `ServiceEnvironment`, `ServiceOrchestrator`	Managing infrastructure (Postgres, Atlas)

Orchestration Flow

The entry point is metalab.run(experiment, store=..., executor=...). Here is what happens under the hood:

1. Resolve store

The store argument (a path string, URI, or StoreConfig) is parsed via parse_to_config(), scoped to the experiment, and connected:

"./experiments" → FileStoreConfig(root="./experiments")
                    → .scoped("my_exp:1.0")
                    → .connect() → FileStore
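The chain above can be sketched in a few lines. This is a minimal illustration, not metalab's implementation: the class bodies, the experiment_id field, and the subdirectory naming are assumptions made for the example, and URI parsing is left out.

```python
from dataclasses import dataclass, replace
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class FileStoreConfig:
    """Hypothetical stand-in: pure data, safe to pickle or serialize."""

    root: str
    experiment_id: Optional[str] = None

    def scoped(self, experiment_id: str) -> "FileStoreConfig":
        # Return a new config; the original stays unchanged.
        return replace(self, experiment_id=experiment_id)

    def connect(self) -> "FileStore":
        return FileStore(self)


class FileStore:
    """Hypothetical stateful instance created from a config."""

    def __init__(self, config: FileStoreConfig) -> None:
        self.config = config
        # Assumption: the experiment ID is mapped to a subdirectory name.
        subdir = (config.experiment_id or "").replace(":", "_")
        self.base = Path(config.root) / subdir


def parse_to_config(store):
    """Accept a path string or an existing config (URI handling omitted)."""
    if isinstance(store, FileStoreConfig):
        return store
    return FileStoreConfig(root=str(store))


store = parse_to_config("./experiments").scoped("my_exp:1.0").connect()
```

Because the config is frozen, scoped() has to return a copy, which keeps a shared base config safe to reuse across experiments.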

2. Resolve executor

The executor defaults to ThreadExecutor(max_workers=1). When a SLURM environment is configured, the executor is resolved from project config instead.

3. Generate payloads

The runner iterates over the Cartesian product of params × seeds and builds a RunPayload for each combination:

Resolves context (computes lazy file hashes for FilePath/DirPath)
Computes a deterministic run_id from sha256(experiment_id + context_fp + params_fp + seed_fp + code_hash)
Skips runs that already exist as SUCCESS in the store (resume / dedupe)
Writes an experiment manifest for Atlas
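The ID computation and the resume check can be sketched as follows. The helper names, the JSON-based fingerprinting, the "|" separator, and the payload dict shape are assumptions for illustration; only the sha256 over the five components comes from the description above.

```python
import hashlib
import itertools
import json


def fingerprint(obj) -> str:
    """Hash a JSON-serializable object, with sorted keys for stability."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def make_run_id(experiment_id, context_fp, params_fp, seed_fp, code_hash) -> str:
    # Assumption: join with a separator so adjacent fields cannot blur together.
    parts = [experiment_id, context_fp, params_fp, seed_fp, code_hash]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def generate_payloads(experiment_id, context, param_cases, seeds, code_hash, done):
    """Yield one payload per (params, seed) pair that is not already done."""
    context_fp = fingerprint(context)
    for params, seed in itertools.product(param_cases, seeds):
        run_id = make_run_id(
            experiment_id, context_fp, fingerprint(params), fingerprint(seed), code_hash
        )
        if run_id in done:  # resume / dedupe: already recorded as SUCCESS
            continue
        yield {"run_id": run_id, "params": params, "seed": seed}


cases = [{"lr": 0.1}, {"lr": 0.01}]
args = ("my_exp:1.0", {"data": "abc"}, cases, [0, 1], "c0de")
first = list(generate_payloads(*args, done=set()))
# Pretend the first run succeeded; a second pass should skip exactly that run.
second = list(generate_payloads(*args, done={first[0]["run_id"]}))
```

Since every input to the hash is deterministic, rerunning the same experiment reproduces the same IDs, which is what makes the skip check reliable.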

4. Submit

executor.submit(payloads, store, operation) → RunHandle

5. Execute (inside worker)

Each payload is executed in executor/core.py:

Create Runtime (logger, scratch directory, cancel token)
Create Capture (metric / artifact / log emission)
Write a RUNNING record to the store (crash resilience)
Call operation.run(context, params, seeds, runtime, capture)
On success — build final RunRecord, compute derived metrics, persist
On failure — build failed RunRecord, persist error
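The worker steps above might look roughly like this. It is a simplified sketch under stated assumptions: MemoryStore, the dict-based records, the "FAILED" status name, and the plain-function operation are all hypothetical, and derived metrics, logging, and cancellation are omitted.

```python
import tempfile
import traceback


class MemoryStore:
    """Hypothetical in-memory store keeping the latest record per run."""

    def __init__(self):
        self.records = {}

    def put_run_record(self, record):
        self.records[record["run_id"]] = record


class Capture:
    """Hypothetical metric collector handed to the operation."""

    def __init__(self):
        self.metrics = {}

    def metric(self, name, value):
        self.metrics[name] = value


def execute_payload(payload, store, operation):
    run_id = payload["run_id"]
    capture = Capture()
    # Persist RUNNING first so a crashed worker leaves a visible trace.
    store.put_run_record({"run_id": run_id, "status": "RUNNING"})
    with tempfile.TemporaryDirectory() as scratch:
        runtime = {"scratch_dir": scratch}
        try:
            operation(payload["context"], payload["params"], payload["seed"], runtime, capture)
        except Exception:
            record = {"run_id": run_id, "status": "FAILED", "error": traceback.format_exc()}
        else:
            record = {"run_id": run_id, "status": "SUCCESS", "metrics": dict(capture.metrics)}
    store.put_run_record(record)
    return record


def ok_op(context, params, seed, runtime, capture):
    capture.metric("loss", params["lr"] * 2)


def bad_op(context, params, seed, runtime, capture):
    raise ValueError("boom")


store = MemoryStore()
ok = execute_payload({"run_id": "a", "context": {}, "params": {"lr": 0.5}, "seed": 0}, store, ok_op)
bad = execute_payload({"run_id": "b", "context": {}, "params": {}, "seed": 0}, store, bad_op)
```

Writing the RUNNING record before calling the operation means a run that dies without reaching either branch is still distinguishable from one that never started.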

6. Collect results

handle.result() → Results

Blocks until all runs complete, then returns a queryable Results object.

Plugin System

metalab uses three independent plugin registries, all discovered via Python entry points in pyproject.toml and loaded lazily on first access.

Executor plugins (`metalab.executors`)

Entry point	Config class
`local`	`LocalExecutorConfig`
`slurm`	`SlurmExecutorConfig`

Each ExecutorConfig subclass provides:

create() → Executor — instantiate the executor
from_dict() / to_dict() — serialization
handle_class() — optional, for reconnection support

Store plugins (`metalab.stores`)

Entry point	Config class
`file`	`FileStoreConfig`
`postgresql`	`PostgresStoreConfig`

Each StoreConfig subclass provides:

connect() → Store — create a connected store instance
scoped(experiment_id) — return a new config scoped to an experiment
from_locator(info) — parse from a URI

Service plugins (`metalab.service_plugins`)

Entry point	Plugin class
`postgres`	`PostgresPlugin`
`atlas`	`AtlasPlugin`

Each ServicePlugin subclass dispatches via plan(spec, env_type) to environment-specific methods (plan_slurm, plan_local), returning platform-specific fragments the environment knows how to execute.
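One way such a dispatch could be written is a name-based lookup on the plugin. The getattr mechanism, the EchoPlugin class, and the fragment dict shapes below are assumptions for illustration; only the plan(spec, env_type) entry point and the plan_local / plan_slurm method names come from the text.

```python
class ServicePlugin:
    """Hypothetical base class: routes plan() to plan_<env_type>()."""

    def plan(self, spec, env_type):
        method = getattr(self, f"plan_{env_type}", None)
        if method is None:
            raise NotImplementedError(
                f"{type(self).__name__} does not support environment {env_type!r}"
            )
        return method(spec)


class EchoPlugin(ServicePlugin):
    """Toy plugin that returns a different fragment per environment."""

    def plan_local(self, spec):
        # Local environments get an argv list suitable for a subprocess.
        return {"kind": "subprocess", "argv": ["echo", spec["name"]]}

    def plan_slurm(self, spec):
        # SLURM environments get a line of shell script for a batch job.
        return {"kind": "bash", "script": f"echo {spec['name']}"}


plugin = EchoPlugin()
local_fragment = plugin.plan({"name": "db"}, "local")
slurm_fragment = plugin.plan({"name": "db"}, "slurm")
```

With this shape, supporting a new environment type only requires adding one plan_<type> method per plugin; the base class needs no changes.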

Environment registry

A fourth registry maps environment types to ServiceEnvironment implementations via import-time registration:

Type	Implementation
`local`	`LocalEnvironment`
`slurm`	`SlurmEnvironment`

Adding a plugin

All entry-point registries follow the same pattern: Config (pure data, serializable) → Factory method → Instance.

# pyproject.toml
[project.entry-points."metalab.stores"]
myscheme = "my_package.store:MyStoreConfig"

Once the entry point is installed, create_store("myscheme://...") discovers and loads it automatically.

Protocol-Based Design

All major abstractions are defined as typing.Protocol classes: structural subtyping, with no inheritance required. This keeps implementations fully decoupled from the core:

Executor — submit payloads, get handles
Store — persist and retrieve run records and artifacts
RunHandle — track running work, cancel, collect results
ParamSource — iterate parameter cases
Serializer — encode and decode artifacts
ServiceEnvironment — start, stop, inspect services
Connector — establish tunnels to remote services

Capability protocols

Not every store supports every feature. Rather than hasattr() checks, metalab uses @runtime_checkable capability protocols defined in metalab.store.capabilities:

Protocol	Meaning
`SupportsWorkingDirectory`	Has a local filesystem root
`SupportsArtifactOpen`	Can open artifacts for reading
`SupportsLogPath`	Can provide filesystem paths for streaming logs
`SupportsStructuredResults`	Supports inline structured data queries
`SupportsLogListing`	Can list and retrieve log files
`SupportsExperimentManifests`	Stores versioned experiment manifests

Code checks capabilities with isinstance():

if isinstance(store, SupportsWorkingDirectory):
    path = store.get_working_directory()
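A self-contained version of this check is sketched below. The protocol name and method match the snippet above, while DiskStore, RemoteStore, and working_directory_or_none are hypothetical stand-ins added for the example.

```python
from pathlib import Path
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsWorkingDirectory(Protocol):
    """Capability: the store has a local filesystem root."""

    def get_working_directory(self) -> Path: ...


class DiskStore:
    """Satisfies the protocol structurally; no inheritance involved."""

    def __init__(self, root):
        self.root = Path(root)

    def get_working_directory(self) -> Path:
        return self.root


class RemoteStore:
    """Has no local root, so it does not offer the capability."""


def working_directory_or_none(store):
    if isinstance(store, SupportsWorkingDirectory):
        return store.get_working_directory()
    return None


local_dir = working_directory_or_none(DiskStore("/tmp/runs"))
remote_dir = working_directory_or_none(RemoteStore())
```

Note that isinstance() against a runtime-checkable protocol only verifies that the named methods exist; it does not check their signatures or return types.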

Services Layer

Services like Postgres and Atlas are infrastructure that experiments depend on. They are managed through a layered stack:

CLI (metalab services up / down / status)
  └── ServiceOrchestrator          config-driven, service-agnostic
        ├── ServiceEnvironment     LocalEnvironment | SlurmEnvironment
        ├── ServicePlugin.plan()   returns platform-specific fragments
        └── ServiceBundle          persisted state (~/.metalab/services/)

The orchestrator reads project config to determine which services are needed
Each plugin produces platform-specific fragments (bash scripts, subprocess commands)
The environment composes fragments into jobs and manages their lifecycle
Service handles are persisted in a bundle on disk

The bundle's store_locator auto-wires the Postgres connection as the default store via a "discover" URI scheme, so experiments connect without hardcoded connection strings.

Configuration

Project configuration lives in .metalab.toml:

[project]
name = "myproject"
default_env = "local"

[environments.local]
type = "local"

[environments.slurm]
type = "slurm"
gateway = "cluster.example.com"
[environments.slurm.executor]
partition = "gpu"
time = "2:00:00"

[services.postgres]
database = "metalab"

[services.atlas]
port = 8000

ProjectConfig.resolve(env_name) returns a ResolvedConfig. It merges base sections, named profiles, and .metalab.local.toml overrides into a single flat config, which drives both the runner and the service orchestrator.
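The merge can be pictured as a recursive dict merge followed by selecting one environment. The sketch below works on already-parsed dicts that mirror the TOML above; the deep_merge and resolve helpers, the merge order, and the shape of the result are assumptions, and named profiles are not modeled.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where override wins, merging nested dicts key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve(project, local_overrides, env_name=None):
    """Apply local overrides, then flatten the chosen environment."""
    merged = deep_merge(project, local_overrides)
    env_name = env_name or merged["project"]["default_env"]
    env = merged["environments"][env_name]
    return {
        "name": merged["project"]["name"],
        "env_name": env_name,
        "env_type": env["type"],
        "executor": env.get("executor", {}),
        "services": merged.get("services", {}),
    }


project = {
    "project": {"name": "myproject", "default_env": "local"},
    "environments": {
        "local": {"type": "local"},
        "slurm": {
            "type": "slurm",
            "gateway": "cluster.example.com",
            "executor": {"partition": "gpu", "time": "2:00:00"},
        },
    },
    "services": {"postgres": {"database": "metalab"}, "atlas": {"port": 8000}},
}
# Stand-in for the contents of .metalab.local.toml.
local_overrides = {"environments": {"slurm": {"executor": {"partition": "debug"}}}}
resolved = resolve(project, local_overrides, "slurm")
```

Here the override changes only the partition, so the time limit from the base file survives the merge.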

Design Patterns

Fingerprint-based identity : run_id = sha256(experiment_id + context_fp + params_fp + seed_fp + code_hash) enables deterministic resume and deduplication. Derived metrics are excluded from fingerprints.

Config → Instance separation : Configs are pure frozen dataclasses, serializable across process boundaries. Instances are stateful connections created via .connect() or .create().

Lazy imports : Optional dependencies (submitit, psycopg, rich) are deferred via module-level __getattr__, so the core stays lightweight.
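The module-level __getattr__ hook (PEP 562) can be demonstrated on a throwaway module object. In a real package the hook would simply be a def __getattr__(name) at the top level of the module; the install_lazy_attrs helper, the lazy_demo module, and the use of fractions.Fraction as a stand-in for an optional dependency are assumptions made so the sketch runs anywhere.

```python
import importlib
import types


def install_lazy_attrs(module, lazy_attrs):
    """Attach a PEP 562 hook that imports attributes on first access."""

    def __getattr__(name):
        # Called only when normal attribute lookup on the module fails.
        try:
            module_name, attr = lazy_attrs[name]
        except KeyError:
            raise AttributeError(
                f"module {module.__name__!r} has no attribute {name!r}"
            ) from None
        value = getattr(importlib.import_module(module_name), attr)
        setattr(module, name, value)  # cache so the hook is not hit again
        return value

    module.__getattr__ = __getattr__


demo = types.ModuleType("lazy_demo")
install_lazy_attrs(demo, {"Fraction": ("fractions", "Fraction")})

cached_before = "Fraction" in vars(demo)  # nothing imported yet
half = demo.Fraction(1, 2)                # triggers the deferred import
cached_after = "Fraction" in vars(demo)   # now stored on the module
```

Caching the resolved attribute on the module means the import cost is paid once, and later accesses bypass the hook entirely.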

Signature inspection : The @operation decorator wraps user functions and only injects the arguments they declare (context, params, seeds, runtime, capture).
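A decorator of this kind can be built on inspect.signature. The following is a sketch of the idea rather than metalab's decorator: the keyword-only run wrapper and the rejection of unknown argument names are assumptions.

```python
import functools
import inspect

INJECTABLE = ("context", "params", "seeds", "runtime", "capture")


def operation(fn):
    """Wrap fn so it receives only the injectable arguments it declares."""
    declared = list(inspect.signature(fn).parameters)
    unknown = [name for name in declared if name not in INJECTABLE]
    if unknown:
        raise TypeError(f"{fn.__name__} declares unsupported arguments: {unknown}")

    @functools.wraps(fn)
    def run(**available):
        # Pass through just the subset of arguments the function asked for.
        return fn(**{name: available[name] for name in declared})

    return run


@operation
def train(params, seeds):
    return params["lr"] * seeds["base"]


result = train(
    context={}, params={"lr": 0.5}, seeds={"base": 4}, runtime=None, capture=None
)
```

The caller always supplies all five values, so a simple operation can ignore the plumbing it does not need without any boilerplate parameters.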

Index-addressed SLURM arrays : Instead of pickling per task, a single manifest is written. Each SLURM array task reconstructs its payload from SLURM_ARRAY_TASK_ID using O(1) index access into the parameter source.
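Index access into a parameter grid can be done by decoding the task ID as a mixed-radix number, with no need to enumerate earlier cases. The GridSource class, the manifest layout, and the params-major task ordering below are assumptions for illustration; SLURM_ARRAY_TASK_ID is the environment variable SLURM sets for each array task.

```python
import os


class GridSource:
    """Hypothetical parameter source over a Cartesian grid with direct indexing."""

    def __init__(self, grid):
        self.names = list(grid)
        self.values = [list(grid[name]) for name in self.names]

    def __len__(self):
        total = 1
        for axis in self.values:
            total *= len(axis)
        return total

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        picked = {}
        # Decode the index digit by digit, with the last axis varying fastest.
        for name, axis in zip(reversed(self.names), reversed(self.values)):
            index, position = divmod(index, len(axis))
            picked[name] = axis[position]
        return {name: picked[name] for name in self.names}


def payload_for_task(manifest, environ=os.environ):
    """Rebuild one task's payload from the shared manifest and its array index."""
    task_id = int(environ["SLURM_ARRAY_TASK_ID"])
    source = GridSource(manifest["grid"])
    seeds = manifest["seeds"]
    param_index, seed_index = divmod(task_id, len(seeds))
    return {"params": source[param_index], "seed": seeds[seed_index]}


manifest = {"grid": {"lr": [0.1, 0.01], "depth": [2, 4, 8]}, "seeds": [0, 1]}
payload = payload_for_task(manifest, environ={"SLURM_ARRAY_TASK_ID": "9"})
```

The lookup cost depends only on the number of grid axes, not on the number of cases, so every array task starts in the same time regardless of its position in the sweep.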