PostgreSQL Artifact Registry
Overview
Model Management persists all model artifacts in a PostgreSQL-backed artifact registry that it owns. The schema is created and migrated automatically by Flyway on startup; the application requires a datasource to run.
The registry stores:
- the logical identity of each artifact (format-independent)
- immutable artifact versions, each with an authored primary format
- per-version, per-format representations — JSON Schema, raw XSD, core-json
- the XSD namespace lookup
- the reference graph between artifacts
API DTOs are projections built at the boundary from the stored artifact, not
ORM entities. This avoids duplicated truth: MappingDto.source is derived from
config.source, PipelineDto.mappingRefs from the pipeline nodes, etc. — the
artifact document remains canonical and the projections cannot drift from it.
Design principles
- CORE URNs are the stable public identity of every artifact — the format is never part of the identity. The same DataStructure may be authored as JSON Schema in one version and as XSD in another.
- Artifact payloads are canonical JSON documents, stored as jsonb.
- XSD payloads are stored as raw text; a JSON Schema representation is generated from them on read (and cached).
- Each version records its authored primary format; the format-specific content lives in artifact_representation (one row per format).
- Versioning is explicit, backend-owned and modelled in the database.
- References are stored relationally for fast dependency/impact queries, and the complete graph is kept — cycles included.
- Persistence access uses a thin JdbcClient repository layer; large dynamic JSON trees are stored as jsonb/text rather than decomposed into entity fields (no JPA/Hibernate).
Tables
The schema lives in db/migration/V1__artifact_registry.sql (Flyway).
artifact — logical identity
create table artifact (
    id uuid primary key,
    logical_urn text not null unique,
    artifact_type text not null, -- datastructure | dataset | mapping | pipeline | datasource | datasink
    name text not null,
    title text,
    description text,
    current_version text, -- the default read version
    created_at timestamptz not null,
    updated_at timestamptz not null
);
artifact_type is the functional CORE category. Identity is format-agnostic, so
the format is not stored here. Both JSON Schema and XSD DataStructures have
artifact_type = 'datastructure'; the format is recorded per version
(artifact_version.primary_format) and per representation (below). The
logical_urn always uses the datastructure segment; a :xsd: artifact-type
URN is rejected with HTTP 400.
Indexes: artifact_type, name, updated_at.
artifact_version — immutable version metadata
create table artifact_version (
    id uuid primary key,
    artifact_id uuid not null references artifact(id) on delete cascade,
    version text not null,
    primary_format text not null, -- jsonschema | core-json | xsd
    title text, -- versioned metadata (set on rename)
    description text,
    created_at timestamptz not null,
    created_by text,
    unique (artifact_id, version)
);
- A version records its authored (primary) format and points at its representations; the content lives in artifact_representation.
- primary_format drives DataStructureDto.format and the default of GET /schema (without a format parameter).
- title/description are versioned metadata set by a rename: a rename creates a new patch version with the new title, format-independently (so XSD versions are renamable too); older versions keep their own title.
- DataStructureDto.title is the version's title when set, else the URN name segment.
- artifact.current_version points at the latest version.
artifact_representation — per-version, per-format content
create table artifact_representation (
    id uuid primary key,
    version_id uuid not null references artifact_version(id) on delete cascade,
    format text not null, -- jsonschema | core-json | xsd
    content_type text,
    content_jsonb jsonb,
    content_text text,
    content_hash text,
    generation text not null default 'stored', -- stored | generated
    created_at timestamptz not null,
    unique (version_id, format)
);
- One row per format of a version. JSON content goes into content_jsonb; raw XSD goes into content_text.
- Each version has exactly one authored representation (generation = 'stored'), in its primary_format. A JSON Schema derived from an XSD may later be persisted with generation = 'generated'; a generated representation never counts as authored content.
- content_hash makes repeated imports of identical authored content idempotent (no spurious new version).
- unique (version_id, format) permits at most one representation per format per version; a different format in a new version is always allowed (v1 XSD, v2 JSON-only is valid).
- A GIN index on content_jsonb supports payload-level search.
artifact_reference — dependency edges
create table artifact_reference (
    id uuid primary key,
    from_version_id uuid not null references artifact_version(id) on delete cascade,
    target_urn text not null,
    target_artifact_id uuid references artifact(id) on delete set null,
    target_version_id uuid references artifact_version(id) on delete set null,
    reference_type text not null, -- schema-ref | dataset-ref | pipeline-node | xsd-import | mapping-source | mapping-target | datasource-ref | datasink-ref
    reference_name text,
    sort_order int,
    created_at timestamptz not null
);
- The complete reference graph is stored, including cycles.
- target_urn is stored verbatim as authored (a pinned …:1.0.0 or the …:latest token), never normalised to the logical form.
- target_artifact_id resolves the reference to the target artifact's logical identity; target_version_id resolves a pinned reference to its concrete version (null for latest, logical, or a not-yet-imported pin).
- Both are nullable / on delete set null, and are back-filled when a previously-missing target artifact or target version is later created.
xsd_namespace — namespace lookup
create table xsd_namespace (
    namespace text primary key,
    artifact_id uuid not null references artifact(id) on delete cascade,
    version_id uuid references artifact_version(id) on delete set null,
    updated_at timestamptz not null
);
A durable namespace→artifact index for XSD-backed DataStructures, so
xs:import resolution does not depend on a startup scan completing.
DTO mapping
DTOs are built from artifact plus the selected artifact_version.
| DTO | Stored | Derived on read |
|---|---|---|
| DataStructureDto | the version's representation (content_jsonb/content_text, content_type), artifact_version.primary_format and title | id (versioned URN), logicalId, version, title (version title metadata when set, else the URN name segment), format (= primary format), availableFormats (stored + derivable; an XSD version also lists jsonschema) |
| DataSetDto | manifest in content_jsonb (id/title/version/*Refs) | dataset-ref rows are a derived index of the manifest's *Refs |
| MappingDto | config = full document | source = config.source, target = config.target |
| PipelineDto | config = full document | dataSourceRefs/dataSinkRefs/mappingRefs from config.nodes[] by kind |
| DataSourceDto / DataSinkDto | config = full document | connectionType = config.connectionType |
| Envelope<T> | - | wraps a DTO with dependencies/dependents (graph), versions, and a compatibility marker (always "NONE") |
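The "derived on read" column can be illustrated with a small projection sketch. It is written in Python purely for brevity and is not taken from the service. The node shape (`kind`, `ref`) and the `kind` values are hypothetical, since the document does not specify how `config.nodes[]` is laid out.

```python
from typing import Any, Dict, List


def mapping_projection(config: Dict[str, Any]) -> Dict[str, Any]:
    """Build a MappingDto-like dict: source/target are read from config, never stored twice."""
    return {
        "config": config,
        "source": config.get("source"),
        "target": config.get("target"),
    }


def pipeline_projection(config: Dict[str, Any]) -> Dict[str, Any]:
    """Build a PipelineDto-like dict by grouping node refs by kind."""
    # ASSUMPTION: every node looks like {"kind": ..., "ref": ...}; both the
    # field names and the kind values below are made up for this sketch.
    field_by_kind = {
        "datasource": "dataSourceRefs",
        "datasink": "dataSinkRefs",
        "mapping": "mappingRefs",
    }
    dto: Dict[str, Any] = {"config": config}
    for field in field_by_kind.values():
        dto[field] = []  # type: List[str]
    for node in config.get("nodes", []):
        field = field_by_kind.get(node.get("kind"))
        if field and node.get("ref"):
            dto[field].append(node["ref"])
    return dto
```

Because both functions read only from the stored document, a projection computed this way cannot disagree with the canonical artifact.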
Reference extraction
Edges are derived from the artifact content when it is written and persisted as
artifact_reference rows:
| reference_type | Source |
|---|---|
| schema-ref | $ref values (CORE URNs) in a JSON Schema |
| xsd-import | xs:import/@namespace: a CORE-URN namespace links that DataStructure directly (any format); a classic XML namespace is resolved via xsd_namespace |
| mapping-source / mapping-target | config.source / config.target of a Mapping |
| pipeline-node | source/sink/mapping refs of config.nodes[] |
| dataset-ref, datasource-ref, datasink-ref | the DataSet manifest's *Refs arrays |
The DataSet manifest stays the canonical document; the relational rows are a derived index, written in the same transaction.
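As an illustration of the schema-ref row, the following sketch walks a JSON Schema document and collects `$ref` values that look like CORE URNs. It is an assumption-laden example rather than the service's extractor: the `urn:core:` prefix test, the de-duplication, and the way `sort_order` is assigned are all choices made for this sketch.

```python
from typing import Any, Dict, Iterator, List


def iter_core_refs(node: Any) -> Iterator[str]:
    """Yield every string $ref that starts with the CORE URN prefix, depth-first."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                # ASSUMPTION: CORE URNs are recognised by their prefix;
                # local pointers such as "#/$defs/x" are skipped.
                if value.startswith("urn:core:"):
                    yield value
            else:
                yield from iter_core_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_core_refs(item)


def schema_ref_edges(schema: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn the collected refs into artifact_reference-like rows."""
    unique: List[str] = []
    for urn in iter_core_refs(schema):
        if urn not in unique:  # keep first occurrence, preserve document order
            unique.append(urn)
    return [
        # target_urn is kept verbatim, pinned or :latest, as the table notes require
        {"target_urn": urn, "reference_type": "schema-ref", "sort_order": i}
        for i, urn in enumerate(unique)
    ]
```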
Versioning
Versioning is owned by Model Management: clients never choose the next version directly; they declare how it should be bumped.
- The logical URN identifies the artifact independent of version; the versioned URN identifies one immutable version.
- The first version of a new artifact is 1.0.0.
- An update creates a new artifact_version; the number is computed from current_version and the requested bump:
      1.2.3 + patch -> 1.2.4
      1.2.3 + minor -> 1.3.0
      1.2.3 + major -> 2.0.0
- Older versions are never overwritten; current_version is the default read.
- The bump is request metadata, not part of the payload. It is a query parameter on the update endpoints:
      PUT /api/v1/datastructures?id=urn:core:…&versionBump=minor
- A request targeting a versioned URN resolves the logical identity and bumps from current_version; the URN's version segment is ignored as the new number.
- A rename (PATCH) is a patch-version bump that records the new title/description as versioned metadata, format-independently, so XSD DataStructures are renamable too (the XSD content is unchanged); older versions keep their title.
- A read id (and a reference) may use the :latest token, resolving to the current (highest-SemVer) version. :latest is never a writable identity: a client-supplied :latest (like a legacy :xsd:) identity is rejected with 400.
There is no schema-compatibility calculation: the Envelope.compatibility field
is always "NONE" and there are no compatibility endpoints.
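The bump arithmetic and the "highest-SemVer" resolution of :latest can be sketched as follows. This is a minimal example that assumes plain MAJOR.MINOR.PATCH numbers without pre-release or build suffixes; the function names are illustrative.

```python
from typing import Iterable, Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers so versions compare numerically."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump(current: str, kind: str) -> str:
    """Compute the next version from the current one and the requested bump."""
    major, minor, patch = parse(current)
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    if kind == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown bump: {kind!r}")


def latest(versions: Iterable[str]) -> str:
    """Resolve :latest as the numerically highest version, not the lexically highest."""
    return max(versions, key=parse)
```

Comparing parsed tuples rather than strings matters: as text, "1.9.0" sorts after "1.10.0", which would resolve :latest to the wrong version.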
Write flow
Every artifact write runs in one transaction:
- Normalise the incoming ID to a logical CORE URN.
- Validate the payload with the existing validation services.
- Reject a :xsd: artifact-type URN with HTTP 400 (before any write).
- Upsert artifact (insert on first sight; first version 1.0.0).
- If the current version's authored representation in the same format has an identical content hash → no-op (idempotent). A different format is always a new version.
- Otherwise compute the next SemVer version and insert a new artifact_version with primary_format = the written format.
- Insert the version's authored artifact_representation (generation = 'stored').
- Replace the version's artifact_reference rows from the extracted references.
- Advance artifact.current_version.
- For XSDs, upsert the xsd_namespace row (scoped to the version).
A storage failure rolls the whole write back and surfaces as HTTP 502.
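The decision at the heart of the write flow (no-op on identical content, otherwise a new version) can be modelled in memory. The sketch below is not the service code: it omits validation, reference extraction, the namespace upsert and the transaction, hashes the UTF-8 bytes with SHA-256 as an assumption, and uses an invented `write` signature.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Version:
    number: str
    primary_format: str
    content_hash: str


@dataclass
class Artifact:
    versions: List[Version] = field(default_factory=list)
    current_version: Optional[str] = None


def next_version(current: str, bump: str) -> str:
    major, minor, patch = (int(p) for p in current.split("."))
    return {
        "major": f"{major + 1}.0.0",
        "minor": f"{major}.{minor + 1}.0",
        "patch": f"{major}.{minor}.{patch + 1}",
    }[bump]


def write(registry: Dict[str, Artifact], logical_urn: str, fmt: str,
          content: str, bump: str = "patch") -> Tuple[str, bool]:
    """Return (version, created). created is False for an idempotent no-op."""
    if ":xsd:" in logical_urn:
        raise ValueError("400: format must not be part of the identity")
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    artifact = registry.setdefault(logical_urn, Artifact())  # upsert
    current = next(
        (v for v in artifact.versions if v.number == artifact.current_version), None
    )
    # Same format and same hash as the current authored content: nothing to do.
    if current and current.primary_format == fmt and current.content_hash == digest:
        return current.number, False
    number = "1.0.0" if current is None else next_version(current.number, bump)
    artifact.versions.append(Version(number, fmt, digest))
    artifact.current_version = number  # advance the default read version
    return number, True
```

Note that a write in a different format never short-circuits, even if the text happened to hash identically, which mirrors the rule that a different format is always a new version.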
Read flow
- Versioned URN → load that artifact_version.
- Logical URN → load the version that artifact.current_version points at.
- Pick the representation:
  - GET /schema without a format returns the authored (primary) representation;
  - ?format=json-schema returns the JSON Schema (an XSD version is converted on read);
  - ?format=xsd returns the raw XSD (404 if the version has no XSD representation).
  - formatsOf(urn) reports the stored formats plus the derivable jsonschema for XSD versions.
- Convert to the API DTO (or raw JSON / raw XSD) at the boundary.
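The representation choice can be sketched as a pure function over the stored formats of one version. The function names and the `NotFound` exception are invented for this example, and the behaviour for a core-json version asked for `json-schema` is an assumption, since the document does not describe that case.

```python
from typing import Dict, List, Optional, Tuple


class NotFound(Exception):
    """Stands in for an HTTP 404 response."""


def pick_representation(stored: Dict[str, str], primary_format: str,
                        requested: Optional[str] = None) -> Tuple[str, str]:
    """Return (format, origin) where origin is 'stored' or 'converted'."""
    if requested is None:
        return primary_format, "stored"  # authored representation by default
    if requested == "xsd":
        if "xsd" in stored:
            return "xsd", "stored"
        raise NotFound("version has no XSD representation")
    if requested == "json-schema":
        if "jsonschema" in stored:
            return "jsonschema", "stored"
        if "xsd" in stored:
            return "jsonschema", "converted"  # derived from the XSD on read
        # ASSUMPTION: nothing to derive a JSON Schema from, so report 404.
        raise NotFound("no JSON Schema available for this version")
    raise ValueError(f"unsupported format: {requested!r}")


def formats_of(stored: Dict[str, str]) -> List[str]:
    """Stored formats, plus the derivable jsonschema for XSD versions."""
    formats = sorted(stored)
    if "xsd" in stored and "jsonschema" not in stored:
        formats.append("jsonschema")
    return formats
```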
Conditional requests (ETags)
The same content_hash that makes writes idempotent doubles as the HTTP ETag.
ArtifactRegistryClient.contentHash(urn, format) returns the hash of the stored
representation a read would serve (the requested format, or the authored format
when omitted); a derived representation has no stored hash and so no ETag.
The API layer turns this into the conditional-request contract — a strong ETag
"sha256:<hex>" on single-artifact reads, If-None-Match → 304, and an optional
If-Match → 412 on writes — see
Conditional requests (ETags).
Because the hash is over the serialised authored content it is byte-stable but not
canonical: two semantically equal JSON documents with different key order can hash
differently (acceptable for caching and lost-update detection).
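The "byte-stable but not canonical" property is easy to demonstrate. The sketch below assumes SHA-256 over the UTF-8 bytes of the serialised document and a hypothetical `if_none_match` helper; only the "sha256:<hex>" tag shape comes from the text above.

```python
import hashlib
import json


def etag_for(serialised: str) -> str:
    """Strong ETag in the quoted "sha256:<hex>" shape."""
    digest = hashlib.sha256(serialised.encode("utf-8")).hexdigest()
    return f'"sha256:{digest}"'


def if_none_match(header_value: str, current_etag: str) -> int:
    """Status for a conditional read: 304 when the client copy is current."""
    return 304 if header_value == current_etag else 200


# Two documents with the same meaning but different key order.
doc_a = json.dumps({"type": "object", "title": "Order"})
doc_b = json.dumps({"title": "Order", "type": "object"})

same_meaning = json.loads(doc_a) == json.loads(doc_b)  # equal as JSON values
same_tag = etag_for(doc_a) == etag_for(doc_b)          # but the bytes differ
```

Here `same_meaning` is true while `same_tag` is false, so the worst case is a cache miss or a new version on re-import, never a stale read.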
Stage workflow (M3)
artifact_version.state carries a per-version lifecycle state (default released).
ArtifactRegistryClient.transitionVersion moves a version along the fixed
draft → review → approved → released → deprecated → retired path in a single transaction,
appending an artifact_version_event audit row and — on release — advancing the artifact's
released_version pointer (read back by resolveReleased). Transitions are idempotent (a no-op
when already in the target state) and reject disallowed moves.
The workflow is opt-in (modelforge.workflow.enabled, default false). When it is off, every write is stored
as released, so current_version is always the released version (status quo); when it is on, writes start as
draft. See Stage workflow (M3).
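A transition function with the properties described above (fixed path, idempotent, audited, rejecting disallowed moves) might look like the sketch below. Which moves count as allowed is not spelled out here, so the example assumes that only a step to the immediately following state is permitted; the dict-based version and event shapes are likewise invented.

```python
from typing import Dict, List

PATH = ["draft", "review", "approved", "released", "deprecated", "retired"]


class DisallowedTransition(Exception):
    pass


def transition(version: Dict[str, str], target: str,
               events: List[Dict[str, str]]) -> bool:
    """Move a version to `target`. Returns False for an idempotent no-op."""
    if target not in PATH:
        raise ValueError(f"unknown state: {target!r}")
    current = version["state"]
    if current == target:
        return False  # already there: no state change, no audit row
    # ASSUMPTION: only the next state on the path is a legal move.
    if PATH.index(target) != PATH.index(current) + 1:
        raise DisallowedTransition(f"{current} -> {target}")
    version["state"] = target
    events.append({"from": current, "to": target})  # audit trail entry
    return True
```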
Dependency graph & cache
The durable source of dependency edges is artifact_reference. Two in-memory
indexes serve runtime queries and are rebuilt from PostgreSQL on startup
(on ApplicationReadyEvent, after Flyway has migrated):
- DependencyGraphService: a version-precise graph whose nodes are versioned URNs and whose edges are kept verbatim (pinned or :latest); rebuilt from each artifact's current version on startup. It serves dependencies, dependents, impact and transitive queries, reporting concrete versioned IDs (a logical or :latest query resolves to the current version), and tolerates cycles.
- ModelStore: the DataSet read cache, reloaded from the registry.
Both survive a restart because the data lives in PostgreSQL.
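Cycle tolerance in transitive queries usually comes down to a visited set. The sketch below shows one way to do it with a breadth-first walk; it assumes edges have already been resolved to concrete versioned IDs and uses made-up identifiers, so it illustrates the idea rather than DependencyGraphService itself.

```python
from collections import deque
from typing import Dict, List


def transitive(edges: Dict[str, List[str]], start: str) -> List[str]:
    """All nodes reachable from `start`, in breadth-first order, excluding `start`."""
    seen = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, []):
            if neighbour not in seen:  # the visited set is what breaks cycles
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


def invert(edges: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Flip dependency edges so the same walk answers 'who depends on me?'."""
    reverse: Dict[str, List[str]] = {}
    for source, targets in edges.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return reverse
```

Running `transitive` on the inverted graph gives the impact set of an artifact version, that is, everything that would be affected by changing it.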
Service boundary
The single seam between the domain services and storage is the
ArtifactRegistryClient interface. Exactly one implementation is wired:
- PostgresArtifactRegistryClient: active when a datasource is configured.
- NoopArtifactRegistryClient: degraded Null-Object when no datasource is configured (model-forge.registry.backend other than postgres, or no datasource): reads return empty, writes return HTTP 503.
Domain behaviour (validation, URN creation, reference extraction, graph rebuild, dereferenced/bundled views, DataSet resolution) stays in the services.
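The Null-Object arrangement can be sketched with an abstract client and two implementations. The class and method names below are simplified stand-ins (the in-memory client takes the place of the PostgreSQL one), and `select_client` is a hypothetical wiring helper that mirrors the activation rule described above.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class RegistryUnavailable(Exception):
    status = 503  # surfaced to the caller as HTTP 503


class RegistryClient(ABC):
    @abstractmethod
    def find(self, urn: str) -> Optional[str]: ...

    @abstractmethod
    def store(self, urn: str, content: str) -> None: ...


class InMemoryRegistryClient(RegistryClient):
    """Stand-in for the real backend in this sketch."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def find(self, urn: str) -> Optional[str]:
        return self._data.get(urn)

    def store(self, urn: str, content: str) -> None:
        self._data[urn] = content


class NoopRegistryClient(RegistryClient):
    """Degraded mode: reads are empty, writes are refused."""

    def find(self, urn: str) -> Optional[str]:
        return None

    def store(self, urn: str, content: str) -> None:
        raise RegistryUnavailable("no registry backend configured")


def select_client(backend: str, has_datasource: bool) -> RegistryClient:
    """Wire exactly one implementation, as the service boundary requires."""
    if backend == "postgres" and has_datasource:
        return InMemoryRegistryClient()
    return NoopRegistryClient()
```

Callers depend only on the interface, so the domain services need no branches for the degraded case.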
Implementation notes
- Access: a thin JdbcClient repository layer; jsonb is written with cast(:json as jsonb) and read back as text. No JPA/Hibernate.
- Migrations: plain-SQL Flyway migrations under db/migration.
- Search: scalar metadata (URN, name, title, description, type) first; a GIN index on content_jsonb is available for payload-level queries.
- Tests: Testcontainers-PostgreSQL integration tests (*IT, run in the postgres-it CI job) exercise store/fetch/list/version/delete for every type, backend SemVer, content-hash idempotency, XSD import + namespace lookup, the full cyclic graph, transactional rollback, DataSet resolution, the dependency queries, pipeline publication, format-agnostic identity (:xsd: rejection, formatsOf, a per-version format change), cross-format links via a CORE-URN xs:import namespace, and durability across a Model Management and PostgreSQL restart.