
AD-003: Identity

Summary

The framework defines two complementary identity primitives. Entity IDs are 8-character base62 strings generated from crypto/rand via the core/id package: short, URL-safe, and dependency-free. Block identity (core/model.BlockIdentity) is content-addressable: a content hash plus a context hash that let blocks be deduplicated and change-detected without any allocated identifier. The block-addressed store (core/blockstore) keys overlays on the content hash. Platforms layered on the framework add their own file-scoped addressing on top of these primitives; that platform-layer scheme is summarized in its own section below for context, and its authoritative specification lives in the platform's own decision record.

Context

IDs appear everywhere: URLs, REST responses, CLI output, log lines, internal references. The framework and the platforms built on it need IDs for projects, workspaces, users, blocks, events, jobs, tokens, and more.

UUID v4 is globally unique for all practical purposes, but its 36-character canonical form (f47ac10b-58cc-4372-a567-0e02b2c3d479) is excessive where IDs appear in URLs, API responses, and CLI output. UUIDs are valid in URLs, but their length and hyphens make them awkward in REST paths and browser address bars.

A second problem is specific to blocks. The framework processes the same content from many sources, so an identifier derived from the content itself — rather than an allocated number — lets identical blocks deduplicate and lets edits be detected by hash comparison. Format readers, meanwhile, assign IDs from the source format (XLIFF tu1, tu2, etc.), but these are not unique across files. Reconciling format-native IDs with a stable store identifier is a concern for the persistence layer that consumes the framework, not for the framework itself.

Decision

Short base62 IDs

All entity IDs are generated by core/id.New():

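// core/id/id.go
package id

import "crypto/rand"

const base62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// New returns an 8-character base62 ID drawn from crypto/rand.
func New() string {
	var buf [8]byte
	if _, err := rand.Read(buf[:]); err != nil {
		panic("crypto/rand failed: " + err.Error())
	}
	out := make([]byte, 8)
	for i, b := range buf {
		// 256 is not a multiple of 62, so this modulo slightly favors the
		// first eight characters of the alphabet.
		out[i] = base62[int(b)%len(base62)]
	}
	return string(out)
}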

Properties:

  • 8 characters, base62 alphabet (0–9, A–Z, a–z)
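  • ~47.6 bits of entropy per ID (62^8 ≈ 2.2 × 10^14 possible values). By the birthday bound, the chance of any collision within a scope is about 0.002% at 100,000 IDs and about 0.2% at one million, so collisions only become a practical concern for scopes holding millions of entities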
  • crypto/rand for cryptographic randomness — no predictable IDs, no enumeration leaks
  • Zero external dependencies — the implementation is ten lines of Go and does not justify importing a UUID or nanoid library
  • URL-safe — no special characters, no percent-encoding needed

id.New() is used for every entity kind that needs a stable identifier: projects, workspaces, users, jobs, events, tokens, credentials, and more.
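One detail of New() worth noting: because 256 is not a multiple of 62, reducing a random byte modulo 62 maps bytes 248 through 255 back onto the first eight characters, so each of those is chosen with probability 5/256 rather than 4/256. The effect on entropy is small (still about 47.6 bits). If exact uniformity were ever required, rejection sampling is one way to get it. The following is a hypothetical variant for illustration, not the core/id implementation; newUnbiased and maxUnbiased are invented names.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

const base62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// maxUnbiased is the largest multiple of 62 that fits in a byte (4 * 62 = 248).
// Bytes at or above it are discarded so every character is equally likely.
const maxUnbiased = 256 - 256%len(base62)

// newUnbiased is a hypothetical variant of id.New that uses rejection
// sampling instead of a plain modulo. It is not part of core/id.
func newUnbiased() string {
	out := make([]byte, 0, 8)
	var buf [16]byte
	for len(out) < 8 {
		if _, err := rand.Read(buf[:]); err != nil {
			panic("crypto/rand failed: " + err.Error())
		}
		for _, b := range buf {
			// Accept only bytes below the cutoff, and stop at 8 characters.
			if int(b) < maxUnbiased && len(out) < 8 {
				out = append(out, base62[int(b)%len(base62)])
			}
		}
	}
	return string(out)
}

func main() {
	fmt.Println(newUnbiased())
}
```

The cost is a slightly longer function and an occasional extra read from crypto/rand, since about 3% of bytes are rejected.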

Content-addressable block identity

Blocks do not carry an allocated entity ID at the framework level. Instead, core/model.BlockIdentity derives identity from the block's own content:

// core/model/identity.go
type BlockIdentity struct {
	ContentHash string // SHA-256 of normalized source text
	ContextHash string // SHA-256 of name, type, and sorted properties
}

func ComputeIdentity(b *Block) *BlockIdentity

ComputeContentHash trims the source text and hashes the normalized result, so identical strings collapse to one identity and deduplicate across sources. ComputeContextHash folds in the block name, type, and deterministically sorted properties, so a structural change is detectable as a context-hash change even when the source text is unchanged.
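A minimal sketch of how the two hashes could be computed. It is an illustration under assumptions, not the core/model code: the Block fields, the helper names, the whitespace-only normalization, and the length-prefixed encoding are all invented here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Block is a hypothetical stand-in for core/model.Block; field names are assumed.
type Block struct {
	Name, Type, Source string
	Properties         map[string]string
}

// hashHex returns the hex SHA-256 of the given parts. Each part is
// length-prefixed so ("ab", "c") and ("a", "bc") hash differently.
func hashHex(parts ...string) string {
	h := sha256.New()
	for _, p := range parts {
		fmt.Fprintf(h, "%d:%s", len(p), p)
	}
	return hex.EncodeToString(h.Sum(nil))
}

// contentHash normalizes by trimming surrounding whitespace only (an assumption).
func contentHash(b *Block) string {
	return hashHex(strings.TrimSpace(b.Source))
}

// contextHash covers name, type, and properties in sorted key order, so map
// iteration order cannot change the result.
func contextHash(b *Block) string {
	keys := make([]string, 0, len(b.Properties))
	for k := range b.Properties {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := []string{b.Name, b.Type}
	for _, k := range keys {
		parts = append(parts, k, b.Properties[k])
	}
	return hashHex(parts...)
}

func main() {
	a := &Block{Name: "title", Type: "text", Source: "  Hello  "}
	b := &Block{Name: "title", Type: "text", Source: "Hello",
		Properties: map[string]string{"maxLength": "40"}}
	fmt.Println(contentHash(a) == contentHash(b)) // true: same text after trimming
	fmt.Println(contextHash(a) == contextHash(b)) // false: properties differ
}
```

Sorting the property keys before hashing is what makes the context hash deterministic, since Go map iteration order is randomized.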

The block-addressed store (core/blockstore) keys on this content hash. A Store is opened once per process and hands out Session transactions; blocks are written content-addressed and, once written, are immutable — tools append Overlay layers (targets, annotations, skeletons) keyed by (kind, blockHash) rather than rewriting blocks. This is the substrate kapi flows run against; see AD-008: Project Model.
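The write-once blocks and append-only overlays described above can be modeled in a few lines. This is a hypothetical in-memory sketch, not the core/blockstore API: memStore, PutBlock, and AppendOverlay are invented names, and Session transactions are left out.

```go
package main

import "fmt"

// overlayKey mirrors the (kind, blockHash) key described above.
type overlayKey struct {
	Kind      string
	BlockHash string
}

// memStore is a hypothetical in-memory stand-in for a block-addressed store.
type memStore struct {
	blocks   map[string]string       // content hash -> source text, written once
	overlays map[overlayKey][]string // append-only layers per (kind, block)
}

func newMemStore() *memStore {
	return &memStore{
		blocks:   map[string]string{},
		overlays: map[overlayKey][]string{},
	}
}

// PutBlock stores a block under its content hash. Writing a hash that already
// exists is a no-op, which keeps blocks immutable and deduplicated.
func (s *memStore) PutBlock(hash, source string) (created bool) {
	if _, ok := s.blocks[hash]; ok {
		return false
	}
	s.blocks[hash] = source
	return true
}

// AppendOverlay adds a layer for a block without touching the block itself.
func (s *memStore) AppendOverlay(kind, hash, payload string) {
	k := overlayKey{Kind: kind, BlockHash: hash}
	s.overlays[k] = append(s.overlays[k], payload)
}

func main() {
	s := newMemStore()
	fmt.Println(s.PutBlock("abc123", "Hello")) // first write creates the block
	fmt.Println(s.PutBlock("abc123", "Hello")) // second write is deduplicated
	s.AppendOverlay("target", "abc123", "Bonjour")
	s.AppendOverlay("annotation", "abc123", "reviewed")
	fmt.Println(len(s.blocks), len(s.overlays))
}
```

Because tools only ever append overlays, two tools working on the same block never contend over the block record itself.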

Platform-layer block addressing

Platform layer. The remainder of this section describes how a separately-licensed platform reconciles content identity with file-scoped format IDs in its persistent content store. It is not part of the Apache-2.0 framework; the types named here (StoredBlock, StoreBlocksForItem, the blocks SQL table) live under the platform's store package, not under core/. It is documented here for context. The authoritative reference is the platform's Content Store decision record.

A persistent store that ingests many files needs a stable, store-unique key for each block even though format readers assign IDs that repeat across files (twenty XLIFF files each containing tu1). The platform layer solves this with a two-level scheme that separates internal IDs (store-unique, allocated by id.New()) from source IDs (format-reader assigned, file-scoped):

Property | Internal ID | Source ID
Generator | id.New() (8-char base62) | Format reader (e.g., tu1)
Uniqueness | Per project | Per file within project
Stored in | blocks.id (primary key) | blocks.source_id
Used by | API responses, editor, internal refs | Export roundtrip matching

Ingestion. When blocks arrive from a format reader keyed by file path, the incoming reader ID is treated as a source ID. The store looks up whether the (scope, file, sourceID) triple has been seen before; if so it reuses the existing internal ID, otherwise it allocates a new one. Both IDs are persisted. (In the platform store this is StoreBlocksForItem, persisted in a blocks table whose internal ID is the primary key and whose (scope, item, source_id) triple is uniquely indexed.)

Re-save. When blocks are stored after translation, TM leverage, or editor edits, they already carry an internal ID from a previous read, so no source-ID remapping occurs.

Export. When writing translated files, the system re-parses the source to recover format-reader IDs, then matches them against stored source IDs to inject translations into the correct positions.
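The export match can be sketched as a lookup from re-parsed reader IDs to stored translations for one file. The types and the fallback to source text are assumptions of this illustration, not the platform's export code.

```go
package main

import "fmt"

// reparsed is a block as recovered from the source file at export time.
type reparsed struct{ SourceID, Source string }

// injectTargets pairs each re-parsed block with the stored translation for
// its source ID. The targets map holds one file's translations keyed by
// source ID. Blocks without a stored translation are reported in missing and
// fall back to their source text (an assumption of this sketch).
func injectTargets(blocks []reparsed, targets map[string]string) (out []string, missing []string) {
	for _, b := range blocks {
		t, ok := targets[b.SourceID]
		if !ok {
			missing = append(missing, b.SourceID)
			t = b.Source
		}
		out = append(out, t)
	}
	return out, missing
}

func main() {
	blocks := []reparsed{{"tu1", "Hello"}, {"tu2", "World"}}
	targets := map[string]string{"tu1": "Bonjour"}
	out, missing := injectTargets(blocks, targets)
	fmt.Println(out, missing)
}
```

The lookup is scoped to a single file, which is why the plain source ID is sufficient as the key at this step.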

Why two IDs

The fundamental property is that source IDs are file-scoped. Inside one XLIFF file, tu1, tu2, tu3 are fine — the XLIFF spec guarantees uniqueness within a single document. But a store that ingests twenty XLIFF files sees tu1 twenty times. Without disambiguation, a translation stored against tu1 would be ambiguous between the twenty candidates.

Internal IDs solve this with a single store-unique identifier per block, regardless of which file it came from. Downstream consumers — REST API callers, the editor, the TM writer — use the internal ID exclusively. The source ID only matters during export roundtrip, where the system must match re-parsed blocks back to stored translations by their original format-level identifier. This is the minimal structure that lets files with overlapping reader IDs coexist without three-part composite keys everywhere in the stack.

Consequences

  • URLs and API responses are short and readable (/projects/aB3xK9mL vs /projects/f47ac10b-58cc-4372-a567-0e02b2c3d479).
  • No external dependency for ID generation — crypto/rand from the standard library is the only requirement.
  • Content-addressable BlockIdentity deduplicates identical content across sources and detects edits by hash comparison, with no allocated block ID at the framework level.
  • Collision probability at 47.6 bits is low for the per-scope populations these IDs address (by the birthday bound, roughly a 0.2% chance of any collision among one million IDs in a scope), and random IDs avoid the enumeration leaks that sequential IDs produce.
  • A platform that persists blocks from many files reconciles file-scoped reader IDs with a stable internal ID outside the framework, keeping core/ free of store-specific addressing.