Architecture¶
An overview of how metalab is structured internally: the core abstractions, plugin system, orchestration flow, and design patterns that tie everything together.
Central Contract¶
Everything in metalab revolves around a single idea:
Domain logic lives entirely in user-defined Operations. Everything else—execution, storage, services—is pluggable.
Layers at a Glance¶
| Layer | Key Types | Purpose |
|---|---|---|
| Definition | Experiment, ContextSpec, ParamSource, SeedPlan |
Declarative experiment specification |
| Execution | Executor, RunPayload, RunHandle |
Where and how work runs (threads, processes, SLURM) |
| Operation | OperationWrapper, Capture, Runtime |
User computation + injected plumbing |
| Storage | Store, StoreConfig, ArtifactDescriptor |
Persisting run records, artifacts, and logs |
| Results | Results, Run, DerivedMetricFn |
Querying and post-processing completed runs |
| Services | ServicePlugin, ServiceEnvironment, ServiceOrchestrator |
Managing infrastructure (Postgres, Atlas) |
Orchestration Flow¶
The entry point is metalab.run(experiment, store=..., executor=...). Here is what happens under the hood:
1. Resolve store¶
The store argument (a path string, URI, or StoreConfig) is parsed via parse_to_config(), scoped to the experiment, and connected:
"./experiments" → FileStoreConfig(root="./experiments")
→ .scoped("my_exp:1.0")
→ .connect() → FileStore
2. Resolve executor¶
Defaults to ThreadExecutor(max_workers=1). When a SLURM environment is configured, the executor is resolved from project config instead.
3. Generate payloads¶
The runner iterates the cartesian product of params × seeds and builds a RunPayload for each combination:
- Resolves context (computes lazy file hashes for
FilePath/DirPath) - Computes a deterministic
run_idfromsha256(experiment_id + context_fp + params_fp + seed_fp + code_hash) - Skips runs that already exist as
SUCCESSin the store (resume / dedupe) - Writes an experiment manifest for Atlas
4. Submit¶
5. Execute (inside worker)¶
Each payload is executed in executor/core.py:
- Create
Runtime(logger, scratch directory, cancel token) - Create
Capture(metric / artifact / log emission) - Write a
RUNNINGrecord to the store (crash resilience) - Call
operation.run(context, params, seeds, runtime, capture) - On success — build final
RunRecord, compute derived metrics, persist - On failure — build failed
RunRecord, persist error
6. Collect results¶
Blocks until all runs complete, then returns a queryable Results object.
Plugin System¶
metalab uses three independent plugin registries, all discovered via Python entry points in pyproject.toml and loaded lazily on first access.
Executor plugins (metalab.executors)¶
| Entry point | Config class |
|---|---|
local |
LocalExecutorConfig |
slurm |
SlurmExecutorConfig |
Each ExecutorConfig subclass provides:
create() → Executor— instantiate the executorfrom_dict()/to_dict()— serializationhandle_class()— optional, for reconnection support
Store plugins (metalab.stores)¶
| Entry point | Config class |
|---|---|
file |
FileStoreConfig |
postgresql |
PostgresStoreConfig |
Each StoreConfig subclass provides:
connect() → Store— create a connected store instancescoped(experiment_id)— return a new config scoped to an experimentfrom_locator(info)— parse from a URI
Service plugins (metalab.service_plugins)¶
| Entry point | Plugin class |
|---|---|
postgres |
PostgresPlugin |
atlas |
AtlasPlugin |
Each ServicePlugin subclass dispatches via plan(spec, env_type) to environment-specific methods (plan_slurm, plan_local), returning platform-specific fragments the environment knows how to execute.
Environment registry¶
A fourth registry maps environment types to ServiceEnvironment implementations via import-time registration:
| Type | Implementation |
|---|---|
local |
LocalEnvironment |
slurm |
SlurmEnvironment |
Adding a plugin¶
All entry-point registries follow the same pattern: Config (pure data, serializable) → Factory method → Instance.
# pyproject.toml
[project.entry-points."metalab.stores"]
myscheme = "my_package.store:MyStoreConfig"
Once the entry point is installed, create_store("myscheme://...") discovers and loads it automatically.
Protocol-Based Design¶
All major abstractions are defined as [typing.Protocol][] classes—structural subtyping with no inheritance required. This keeps implementations fully decoupled from the core:
Executor— submit payloads, get handlesStore— persist and retrieve run records and artifactsRunHandle— track running work, cancel, collect resultsParamSource— iterate parameter casesSerializer— encode and decode artifactsServiceEnvironment— start, stop, inspect servicesConnector— establish tunnels to remote services
Capability protocols¶
Not every store supports every feature. Rather than hasattr() checks, metalab uses @runtime_checkable capability protocols defined in metalab.store.capabilities:
| Protocol | Meaning |
|---|---|
SupportsWorkingDirectory |
Has a local filesystem root |
SupportsArtifactOpen |
Can open artifacts for reading |
SupportsLogPath |
Can provide filesystem paths for streaming logs |
SupportsStructuredResults |
Supports inline structured data queries |
SupportsLogListing |
Can list and retrieve log files |
SupportsExperimentManifests |
Stores versioned experiment manifests |
Code checks capabilities with isinstance():
Services Layer¶
Services like Postgres and Atlas are infrastructure that experiments depend on. They are managed through a layered stack:
CLI (metalab services up / down / status)
└── ServiceOrchestrator config-driven, service-agnostic
├── ServiceEnvironment LocalEnvironment | SlurmEnvironment
├── ServicePlugin.plan() returns platform-specific fragments
└── ServiceBundle persisted state (~/.metalab/services/)
- The orchestrator reads project config to determine which services are needed
- Each plugin produces platform-specific fragments (bash scripts, subprocess commands)
- The environment composes fragments into jobs and manages their lifecycle
- Service handles are persisted in a bundle on disk
The bundle's store_locator auto-wires the Postgres connection as the default store via a "discover" URI scheme, so experiments connect without hardcoded connection strings.
Configuration¶
Project configuration lives in .metalab.toml:
[project]
name = "myproject"
default_env = "local"
[environments.local]
type = "local"
[environments.slurm]
type = "slurm"
gateway = "cluster.example.com"
[environments.slurm.executor]
partition = "gpu"
time = "2:00:00"
[services.postgres]
database = "metalab"
[services.atlas]
port = 8000
ProjectConfig.resolve(env_name) → ResolvedConfig merges base sections, named profiles, and .metalab.local.toml overrides into a single flat config that drives both the runner and the service orchestrator.
Design Patterns¶
Fingerprint-based identity
: run_id = sha256(experiment_id + context + params + seed + code) enables deterministic resume and deduplication. Derived metrics are excluded from fingerprints.
Config → Instance separation
: Configs are pure frozen dataclasses, serializable across process boundaries. Instances are stateful connections created via .connect() or .create().
Lazy imports
: Optional dependencies (submitit, psycopg, rich) are deferred via module-level __getattr__, so the core stays lightweight.
Signature inspection
: The @operation decorator wraps user functions and only injects the arguments they declare (context, params, seeds, runtime, capture).
Index-addressed SLURM arrays
: Instead of pickling per task, a single manifest is written. Each SLURM array task reconstructs its payload from SLURM_ARRAY_TASK_ID using O(1) index access into the parameter source.