Version: V2-Next

PostgreSQL Artifact Registry

Overview

Model Management persists all model artifacts in a PostgreSQL-backed artifact registry that it owns. The schema is created and migrated automatically by Flyway on startup; the application requires a datasource to run.

The registry stores:

  • the logical identity of each artifact (format-independent)
  • immutable artifact versions, each with an authored primary format
  • per-version, per-format representations — JSON Schema, raw XSD, core-json
  • the XSD namespace lookup
  • the reference graph between artifacts

API DTOs are projections built at the boundary from the stored artifact, not ORM entities. This avoids duplicated truth: MappingDto.source is derived from config.source, PipelineDto.mappingRefs from the pipeline nodes, etc. — the artifact document remains canonical and the projections cannot drift from it.

Design principles

  • CORE URNs are the stable public identity of every artifact — the format is never part of the identity. The same DataStructure may be authored as JSON Schema in one version and as XSD in another.
  • Artifact payloads are canonical JSON documents, stored as jsonb.
  • XSD payloads are stored as raw text; a JSON Schema representation is generated from them on read (and cached).
  • Each version records its authored primary format; the format-specific content lives in artifact_representation (one row per format).
  • Versioning is explicit, backend-owned and modelled in the database.
  • References are stored relationally for fast dependency/impact queries, and the complete graph is kept — cycles included.
  • Persistence access uses a thin JdbcClient repository layer; large dynamic JSON trees are stored as jsonb/text rather than decomposed into entity fields (no JPA/Hibernate).

Tables

The schema lives in db/migration/V1__artifact_registry.sql (Flyway).

artifact — logical identity

create table artifact (
    id              uuid primary key,
    logical_urn     text not null unique,
    artifact_type   text not null, -- datastructure | dataset | mapping | pipeline | datasource | datasink
    name            text not null,
    title           text,
    description     text,
    current_version text, -- the default read version
    created_at      timestamptz not null,
    updated_at      timestamptz not null
);

artifact_type is the functional CORE category. Identity is format-agnostic, so the format is not stored here. Both JSON Schema and XSD DataStructures have artifact_type = 'datastructure'; the format is recorded per version (artifact_version.primary_format) and per representation (below). The logical_urn always uses the datastructure segment; a :xsd: artifact-type URN is rejected with HTTP 400.

Indexes: artifact_type, name, updated_at.

artifact_version — immutable version metadata

create table artifact_version (
    id             uuid primary key,
    artifact_id    uuid not null references artifact(id) on delete cascade,
    version        text not null,
    primary_format text not null, -- jsonschema | core-json | xsd
    title          text, -- versioned metadata (set on rename)
    description    text,
    created_at     timestamptz not null,
    created_by     text,
    unique (artifact_id, version)
);
  • A version records its authored (primary) format and points at its representations; the content lives in artifact_representation.
  • primary_format drives DataStructureDto.format and the default of GET /schema (without a format parameter).
  • title/description are versioned metadata set by a rename: a rename creates a new patch version with the new title, format-independently (so XSD versions are renamable too); older versions keep their own title. DataStructureDto.title is the version's title when set, else the URN name segment.
  • artifact.current_version points at the latest version.

artifact_representation — per-version, per-format content

create table artifact_representation (
    id            uuid primary key,
    version_id    uuid not null references artifact_version(id) on delete cascade,
    format        text not null, -- jsonschema | core-json | xsd
    content_type  text,
    content_jsonb jsonb,
    content_text  text,
    content_hash  text,
    generation    text not null default 'stored', -- stored | generated
    created_at    timestamptz not null,
    unique (version_id, format)
);
  • One row per format of a version. JSON content goes into content_jsonb; raw XSD goes into content_text.
  • Each version has exactly one authored representation (generation = 'stored'), in its primary_format. A JSON Schema derived from an XSD may later be persisted with generation = 'generated'; a generated representation never counts as authored content.
  • content_hash makes repeated imports of identical authored content idempotent (no spurious new version).
  • unique (version_id, format) permits at most one representation per format per version; a different format in a new version is always allowed (v1 XSD, v2 JSON-only is valid).
  • A GIN index on content_jsonb supports payload-level search.

artifact_reference — dependency edges

create table artifact_reference (
    id                 uuid primary key,
    from_version_id    uuid not null references artifact_version(id) on delete cascade,
    target_urn         text not null,
    target_artifact_id uuid references artifact(id) on delete set null,
    target_version_id  uuid references artifact_version(id) on delete set null,
    reference_type     text not null, -- schema-ref | dataset-ref | pipeline-node | xsd-import | mapping-source | mapping-target | datasource-ref | datasink-ref
    reference_name     text,
    sort_order         int,
    created_at         timestamptz not null
);
  • The complete reference graph is stored, including cycles.
  • target_urn is stored verbatim as authored — a pinned …:1.0.0 or the …:latest token — never normalised to the logical form.
  • target_artifact_id resolves the reference to the target artifact's logical identity; target_version_id resolves a pinned reference to its concrete version (null for latest, logical, or a not-yet-imported pin). Both are nullable / on delete set null, and are back-filled when a previously-missing target artifact or target version is later created.

xsd_namespace — namespace lookup

create table xsd_namespace (
    namespace   text primary key,
    artifact_id uuid not null references artifact(id) on delete cascade,
    version_id  uuid references artifact_version(id) on delete set null,
    updated_at  timestamptz not null
);

A durable namespace→artifact index for XSD-backed DataStructures, so xs:import resolution does not depend on a startup scan completing.

DTO mapping

DTOs are built from artifact plus the selected artifact_version.

| DTO | Stored | Derived on read |
|---|---|---|
| DataStructureDto | the version's representation (content_jsonb/content_text, content_type), artifact_version.primary_format and title | id (versioned URN), logicalId, version, title (version title metadata when set, else the URN name segment), format (= primary format), availableFormats (stored + derivable; an XSD version also lists jsonschema) |
| DataSetDto | manifest in content_jsonb (id/title/version/*Refs) | dataset-ref rows are a derived index of the manifest's *Refs |
| MappingDto | config = full document | source = config.source, target = config.target |
| PipelineDto | config = full document | dataSourceRefs/dataSinkRefs/mappingRefs from config.nodes[] by kind |
| DataSourceDto / DataSinkDto | config = full document | connectionType = config.connectionType |
| Envelope<T> | | wraps a DTO with dependencies/dependents (graph), versions, and a compatibility marker (always "NONE") |
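As an illustration of the "derived on read" column, the following is a minimal sketch of a boundary projection for a Mapping. The class name MappingDtoSketch, its fields and the project method are hypothetical and only mirror the rule that source and target are read from the stored config document rather than persisted a second time.

```java
import java.util.Map;

// Sketch only: MappingDtoSketch is a hypothetical stand-in, not the service's actual DTO class.
final class MappingDtoSketch {
    final String id;                    // versioned URN of the selected artifact_version
    final String source;                // derived: config.source
    final String target;                // derived: config.target
    final Map<String, Object> config;   // stored: the full canonical document

    private MappingDtoSketch(String id, Map<String, Object> config) {
        this.id = id;
        this.config = config;
        // The derived fields are read from the document every time the DTO is built,
        // so they cannot drift from the stored config.
        this.source = (String) config.get("source");
        this.target = (String) config.get("target");
    }

    /** Builds the projection from the stored document at the API boundary. */
    static MappingDtoSketch project(String versionedUrn, Map<String, Object> config) {
        return new MappingDtoSketch(versionedUrn, config);
    }
}
```

Because the constructor is the only place where source and target are assigned, there is no second copy of those values to keep in sync.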

Reference extraction

Edges are derived from the artifact content when it is written and persisted as artifact_reference rows:

| reference_type | Source |
|---|---|
| schema-ref | $ref values (CORE URNs) in a JSON Schema |
| xsd-import | xs:import/@namespace: a CORE-URN namespace links that DataStructure directly (any format); a classic XML namespace is resolved via xsd_namespace |
| mapping-source / mapping-target | config.source / config.target of a Mapping |
| pipeline-node | source/sink/mapping refs of config.nodes[] |
| dataset-ref, datasource-ref, datasink-ref | the DataSet manifest's *Refs arrays |

The DataSet manifest stays the canonical document; the relational rows are a derived index, written in the same transaction.
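The schema-ref case can be sketched as a recursive walk over an already-parsed JSON Schema. This is an assumption-laden illustration: the class SchemaRefSketch is hypothetical, the document is assumed to be parsed into nested Map and List values, and a CORE URN is recognised only by the urn:core: prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch only: hypothetical helper, not the service's reference extractor.
final class SchemaRefSketch {
    static final String CORE_URN_PREFIX = "urn:core:";

    /** Collects every "$ref" string value that looks like a CORE URN, in traversal order. */
    static List<String> extractSchemaRefs(Object schema) {
        List<String> refs = new ArrayList<>();
        collect(schema, refs);
        return refs;
    }

    private static void collect(Object node, List<String> refs) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) node).entrySet()) {
                Object value = entry.getValue();
                if ("$ref".equals(entry.getKey())
                        && value instanceof String
                        && ((String) value).startsWith(CORE_URN_PREFIX)) {
                    refs.add((String) value); // kept verbatim, whether pinned or :latest
                } else {
                    collect(value, refs);     // local refs such as "#/..." are skipped here
                }
            }
        } else if (node instanceof List) {
            for (Object item : (List<?>) node) {
                collect(item, refs);
            }
        }
        // Scalars carry no references.
    }
}
```

Each collected URN would become one artifact_reference row with reference_type = 'schema-ref' and the URN stored as target_urn.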

Versioning

Versioning is owned by Model Management — clients never choose the next version directly, they declare how it should be bumped.

  • The logical URN identifies the artifact independent of version; the versioned URN identifies one immutable version.

  • The first version of a new artifact is 1.0.0.

  • An update creates a new artifact_version; the number is computed from current_version and the requested bump:

    1.2.3 + patch -> 1.2.4
    1.2.3 + minor -> 1.3.0
    1.2.3 + major -> 2.0.0
  • Older versions are never overwritten; current_version is the version served by default on read.

  • The bump is request metadata, not part of the payload — a query parameter on the update endpoints:

    PUT /api/v1/datastructures?id=urn:core:…&versionBump=minor
  • A request targeting a versioned URN resolves the logical identity and bumps from current_version; the version segment of the URN is not used as the new version number.

  • A rename (PATCH) is a patch-version bump that records the new title/description as versioned metadata, format-independently — so XSD DataStructures are renamable too (the XSD content is unchanged); older versions keep their title.

  • A read id (and a reference) may use the :latest token, resolving to the current (highest-SemVer) version. :latest is never a writable identity — a client-supplied :latest (like a legacy :xsd:) identity is rejected with 400.
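The bump arithmetic described above can be sketched as follows. VersionBumpSketch and its members are hypothetical names, and the sketch assumes plain MAJOR.MINOR.PATCH version strings without pre-release or build suffixes.

```java
// Sketch only: hypothetical helper illustrating the versionBump rules.
final class VersionBumpSketch {
    enum Bump { PATCH, MINOR, MAJOR }

    static final String FIRST_VERSION = "1.0.0";

    /** Returns the next version; a null current version means the artifact is new. */
    static String next(String currentVersion, Bump bump) {
        if (currentVersion == null) {
            return FIRST_VERSION;
        }
        String[] parts = currentVersion.split("\\.");
        if (parts.length != 3) {
            throw new IllegalArgumentException("not a MAJOR.MINOR.PATCH version: " + currentVersion);
        }
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        int patch = Integer.parseInt(parts[2]);
        switch (bump) {
            case MAJOR:
                return (major + 1) + ".0.0";                     // resets minor and patch
            case MINOR:
                return major + "." + (minor + 1) + ".0";         // resets patch
            default:
                return major + "." + minor + "." + (patch + 1);  // PATCH
        }
    }
}
```

The input is always the artifact's current_version, never a version supplied by the client, which is what keeps version numbering backend-owned.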

There is no schema-compatibility calculation: the Envelope.compatibility field is always "NONE" and there are no compatibility endpoints.

Write flow

Every artifact write runs in one transaction:

  1. Normalise the incoming ID to a logical CORE URN.
  2. Validate the payload with the existing validation services.
  3. Reject a :xsd: artifact-type URN with HTTP 400 (before any write).
  4. Upsert artifact (insert on first sight; first version 1.0.0).
  5. If the current version's authored representation in the same format has an identical content hash → no-op (idempotent). A different format is always a new version.
  6. Otherwise compute the next SemVer version and insert a new artifact_version with primary_format = the written format.
  7. Insert the version's authored artifact_representation (generation = 'stored').
  8. Replace the version's artifact_reference rows from the extracted references.
  9. Advance artifact.current_version.
  10. For XSDs, upsert the xsd_namespace row (scoped to the version).

A storage failure rolls the whole write back and surfaces as HTTP 502.
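Steps 5 and 6 hinge on one decision: whether the incoming content repeats the current version's authored representation. A minimal sketch of that decision follows; WriteDecisionSketch and its parameters are hypothetical, and the hashes are assumed to have been computed already.

```java
// Sketch only: hypothetical helper covering the no-op versus new-version decision.
final class WriteDecisionSketch {
    enum Outcome { NO_OP, NEW_VERSION }

    /**
     * @param currentFormat  primary_format of the current version, or null if the artifact is new
     * @param currentHash    content_hash of the current version's authored representation
     * @param incomingFormat format of the content being written
     * @param incomingHash   hash of the content being written
     */
    static Outcome decide(String currentFormat, String currentHash,
                          String incomingFormat, String incomingHash) {
        boolean sameFormat = currentFormat != null && currentFormat.equals(incomingFormat);
        boolean sameContent = currentHash != null && currentHash.equals(incomingHash);
        // Only identical content in the same format is idempotent;
        // a different format always produces a new version.
        return (sameFormat && sameContent) ? Outcome.NO_OP : Outcome.NEW_VERSION;
    }
}
```

On NO_OP the remaining steps are skipped, so a repeated import leaves artifact_version, artifact_reference and current_version untouched.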

Read flow

  • Versioned URN → load that artifact_version.
  • Logical URN → load the version named by artifact.current_version.
  • Pick the representation: GET /schema without a format returns the authored (primary) representation; ?format=json-schema returns the JSON Schema (an XSD version is converted on read); ?format=xsd returns the raw XSD (404 if the version has no XSD representation). formatsOf(urn) reports the stored formats plus the derivable jsonschema for XSD versions.
  • Convert to the API DTO (or raw JSON / raw XSD) at the boundary.
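The representation choice can be sketched as a small decision function. ReadSelectionSketch is a hypothetical name, and two behaviours are assumptions rather than documented facts: a json-schema request on a version with neither a JSON Schema nor an XSD representation is treated as not found, and an unknown format value is rejected with an exception.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch only: hypothetical helper illustrating representation selection on read.
final class ReadSelectionSketch {
    enum Source { STORED, CONVERTED_FROM_XSD, NOT_FOUND }

    /**
     * @param storedFormats   formats with a row in artifact_representation for the version
     * @param requestedFormat value of the format query parameter, or null when omitted
     */
    static Source select(Set<String> storedFormats, String requestedFormat) {
        if (requestedFormat == null) {
            return Source.STORED; // the authored (primary) representation always exists
        }
        switch (requestedFormat) {
            case "xsd":
                return storedFormats.contains("xsd") ? Source.STORED : Source.NOT_FOUND;
            case "json-schema":
                if (storedFormats.contains("jsonschema")) {
                    return Source.STORED;
                }
                // An XSD version is converted on read; anything else is an assumption (not found).
                return storedFormats.contains("xsd") ? Source.CONVERTED_FROM_XSD : Source.NOT_FOUND;
            default:
                throw new IllegalArgumentException("unsupported format: " + requestedFormat);
        }
    }

    /** Stored formats plus the derivable jsonschema for versions that have an XSD. */
    static Set<String> formatsOf(Set<String> storedFormats) {
        Set<String> formats = new LinkedHashSet<>(storedFormats);
        if (storedFormats.contains("xsd")) {
            formats.add("jsonschema");
        }
        return formats;
    }
}
```

NOT_FOUND corresponds to the 404 described above for ?format=xsd on a version without an XSD representation.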

Conditional requests (ETags)

The same content_hash that makes writes idempotent doubles as the HTTP ETag. ArtifactRegistryClient.contentHash(urn, format) returns the hash of the stored representation a read would serve (the requested format, or the authored format when omitted); a derived representation has no stored hash and so no ETag.

The API layer turns this into the conditional-request contract: a strong ETag "sha256:<hex>" on single-artifact reads, If-None-Match → 304, and an optional If-Match → 412 on writes (see Conditional requests (ETags)). Because the hash is computed over the serialised authored content, it is byte-stable but not canonical: two semantically equal JSON documents with different key order can hash differently, which is acceptable for caching and lost-update detection.
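A minimal sketch of this contract, using only the JDK, is shown here. EtagSketch and its methods are hypothetical; the header comparison is simplified to a single exact value and ignores the wildcard and list forms of If-None-Match and If-Match.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: hypothetical helper illustrating the hash-as-ETag idea.
final class EtagSketch {
    /** Strong ETag of the form "sha256:<hex>" (quotes included) over the serialised bytes. */
    static String etagOf(String serialisedContent) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(serialisedContent.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return "\"sha256:" + hex + "\"";
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory on every Java platform
        }
    }

    /** Read: 304 when If-None-Match equals the current ETag, otherwise 200. */
    static int readStatus(String ifNoneMatch, String currentEtag) {
        return currentEtag != null && currentEtag.equals(ifNoneMatch) ? 304 : 200;
    }

    /** Write: If-Match is optional; when present and different from the current ETag, 412. */
    static int writePrecondition(String ifMatch, String currentEtag) {
        if (ifMatch == null) {
            return 200; // unconditional write
        }
        return ifMatch.equals(currentEtag) ? 200 : 412;
    }
}
```

Hashing {"a":1,"b":2} and {"b":2,"a":1} with etagOf yields two different ETags, which is the "byte-stable but not canonical" property in practice.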

Stage workflow (M3)

artifact_version.state carries a per-version lifecycle state (default released). ArtifactRegistryClient.transitionVersion moves a version along the fixed draft → review → approved → released → deprecated → retired path in a single transaction, appending an artifact_version_event audit row and — on release — advancing the artifact's released_version pointer (read back by resolveReleased). Transitions are idempotent (a no-op when already in the target state) and reject disallowed moves.

The workflow is opt-in (modelforge.workflow.enabled, default false). When it is off, every write is stored as released, so current_version is always the released version (status quo). When it is on, writes start as draft. See Stage workflow (M3).
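The lifecycle path can be sketched as an ordered list with a guarded transition. StageWorkflowSketch is a hypothetical name, and the rule that only the immediately following state is an allowed target is an assumption; the text above only states that the path is fixed and that disallowed moves are rejected.

```java
import java.util.List;

// Sketch only: hypothetical helper; "next state only" is an assumed transition rule.
final class StageWorkflowSketch {
    static final List<String> PATH =
            List.of("draft", "review", "approved", "released", "deprecated", "retired");

    /**
     * Returns true when the state changes and false when the version is already in the
     * target state (idempotent no-op). Throws on an unknown state or a disallowed move.
     */
    static boolean transition(String current, String target) {
        int from = PATH.indexOf(current);
        int to = PATH.indexOf(target);
        if (from < 0 || to < 0) {
            throw new IllegalArgumentException("unknown state: " + current + " or " + target);
        }
        if (from == to) {
            return false; // nothing to do, and no audit event would be needed
        }
        if (to != from + 1) {
            throw new IllegalStateException("disallowed transition: " + current + " -> " + target);
        }
        return true;
    }
}
```

In the registry, a successful change would additionally append an artifact_version_event row and, when the target is released, advance released_version in the same transaction.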

Dependency graph & cache

The durable source of dependency edges is artifact_reference. Two in-memory indexes serve runtime queries and are rebuilt from PostgreSQL on startup (on ApplicationReadyEvent, after Flyway has migrated):

  • DependencyGraphService — a version-precise graph whose nodes are versioned URNs and whose edges are kept verbatim (pinned or :latest); rebuilt from each artifact's current version on startup. It serves dependencies, dependents, impact and transitive queries — reporting concrete versioned IDs (a logical or :latest query resolves to the current version) — and tolerates cycles.
  • ModelStore — the DataSet read cache, reloaded from the registry.

Both are fully reconstructed after a restart because the underlying data lives in PostgreSQL.
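Cycle tolerance in a transitive query comes down to tracking visited nodes. The following breadth-first sketch shows the idea; TransitiveDependenciesSketch is a hypothetical name and the graph is modelled as a plain adjacency map keyed by versioned URN, which is an assumption about the in-memory shape.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: hypothetical helper, not the DependencyGraphService implementation.
final class TransitiveDependenciesSketch {
    /** Returns every node reachable from start (excluding start), in discovery order. */
    static Set<String> transitive(Map<String, List<String>> edges, String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            for (String next : edges.getOrDefault(node, List.of())) {
                // seen.add returns false for an already-visited node, so a cycle
                // cannot enqueue the same node twice and the loop terminates.
                if (!next.equals(start) && seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return seen;
    }
}
```

Running the same traversal over reversed edges gives the dependents (impact) direction.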

Service boundary

The single seam between the domain services and storage is the ArtifactRegistryClient interface. Exactly one of the following two implementations is wired at runtime:

  • PostgresArtifactRegistryClient: active when a datasource is configured.
  • NoopArtifactRegistryClient: a degraded Null Object used when model-forge.registry.backend is set to something other than postgres or no datasource is configured. Reads return empty, writes return HTTP 503.

Domain behaviour (validation, URN creation, reference extraction, graph rebuild, dereferenced/bundled views, DataSet resolution) stays in the services.
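The Null Object arrangement can be sketched as follows. Everything here is hypothetical: RegistrySketch is a two-method stand-in for the much larger ArtifactRegistryClient interface, and RegistryUnavailableSketch stands for whatever exception the API layer maps to HTTP 503.

```java
import java.util.Optional;

// Sketch only: simplified stand-in for the registry seam.
interface RegistrySketch {
    Optional<String> fetch(String urn);
    void store(String urn, String content);
}

// Hypothetical exception type; the API layer is assumed to translate it into HTTP 503.
final class RegistryUnavailableSketch extends RuntimeException {
    RegistryUnavailableSketch(String message) {
        super(message);
    }
}

// Degraded implementation: callers need no null checks or backend-specific branches.
final class NoopRegistrySketch implements RegistrySketch {
    @Override
    public Optional<String> fetch(String urn) {
        return Optional.empty(); // reads return empty
    }

    @Override
    public void store(String urn, String content) {
        throw new RegistryUnavailableSketch("no artifact registry backend configured");
    }
}
```

Because the services depend only on the interface, swapping the degraded implementation for the PostgreSQL one requires no change to domain code.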

Implementation notes

  • Access: a thin JdbcClient repository layer; jsonb is written with cast(:json as jsonb) and read back as text. No JPA/Hibernate.
  • Migrations: plain-SQL Flyway migrations under db/migration.
  • Search: scalar metadata (URN, name, title, description, type) first; a GIN index on content_jsonb is available for payload-level queries.
  • Tests: Testcontainers-PostgreSQL integration tests (*IT, run in the postgres-it CI job) exercise store/fetch/list/version/delete for every type, backend SemVer, content-hash idempotency, XSD import + namespace lookup, the full cyclic graph, transactional rollback, DataSet resolution, the dependency queries, pipeline publication, format-agnostic identity (:xsd: rejection, formatsOf, a per-version format change), cross-format links via a CORE-URN xs:import namespace, and durability across a Model Management and PostgreSQL restart.