AD-003: Identity
Summary
The framework defines two complementary identity primitives. Entity IDs
are 8-character base62 strings generated from crypto/rand via the core/id
package — short, URL-safe, and dependency-free. Block identity
(core/model.BlockIdentity) is content-addressable: a content hash plus a
context hash that let blocks be deduplicated and change-detected without any
allocated identifier. The block-addressed store (core/blockstore) keys
overlays on the content hash. Platforms layered on the framework add their own
file-scoped addressing on top of these primitives; that platform-layer scheme
is described in its own section below and cross-linked, not specified here.
Context
IDs appear everywhere: URLs, REST responses, CLI output, log lines, internal references. The framework and the platforms built on it need IDs for projects, workspaces, users, blocks, events, jobs, tokens, and more.
UUID v4 is globally unique for all practical purposes, but 36 characters in
canonical form (f47ac10b-58cc-4372-a567-0e02b2c3d479) are excessive where
IDs appear in URLs, API responses, and CLI output. UUIDs are URL-safe but not
URL-friendly: their length and hyphenated groups make them awkward in REST
paths and browser address bars.
A second problem is specific to blocks. The framework processes the same
content from many sources, so an identifier derived from the content itself —
rather than an allocated number — lets identical blocks deduplicate and lets
edits be detected by hash comparison. Format readers, meanwhile, assign IDs
from the source format (XLIFF tu1, tu2, etc.), but these are not unique
across files. Reconciling format-native IDs with a stable store identifier is
a concern for the persistence layer that consumes the framework, not for the
framework itself.
Decision
Short base62 IDs
All entity IDs are generated by core/id.New():
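// core/id/id.go
package id

import "crypto/rand"

// base62 is the ID alphabet: digits, then upper-case, then lower-case letters.
const base62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// New returns an 8-character base62 ID drawn from crypto/rand.
// It panics if the system random source fails.
func New() string {
	var buf [8]byte
	if _, err := rand.Read(buf[:]); err != nil {
		panic("crypto/rand failed: " + err.Error())
	}
	out := make([]byte, 8)
	for i, b := range buf {
		// Map each random byte onto the alphabet. 256 is not a multiple of 62,
		// so the first 8 characters are very slightly more likely than the rest.
		out[i] = base62[int(b)%len(base62)]
	}
	return string(out)
}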
Properties:
- 8 characters, base62 alphabet (0–9, A–Z, a–z)
- ~47.6 bits of entropy per ID (62^8 ≈ 2.2 × 10^14 possible values; the chance of any collision within a scope stays below about 0.3% up to one million IDs)
- crypto/rand for cryptographic randomness: no predictable IDs, no enumeration leaks
- Zero external dependencies: the implementation is ten lines of Go and does not justify importing a UUID or nanoid library
- URL-safe — no special characters, no percent-encoding needed
id.New() is used for every entity kind that needs a stable identifier:
projects, workspaces, users, jobs, events, tokens, credentials, and more.
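The entropy and collision figures can be checked with the standard birthday approximation. This is a back-of-envelope estimate that assumes uniformly distributed IDs; the byte-modulo mapping in New() is very slightly non-uniform, which raises the collision figures by a few percent. Here n is the number of IDs issued in one scope and N is the size of the ID space.

```latex
\begin{align*}
H &= 8 \log_2 62 \approx 47.63 \text{ bits}, \qquad N = 62^8 \approx 2.18 \times 10^{14} \\
P(n) &\approx 1 - e^{-n(n-1)/(2N)} \approx \frac{n^2}{2N} \quad (n \ll \sqrt{N}) \\
P(10^5) &\approx 2.3 \times 10^{-5}, \qquad P(10^6) \approx 2.3 \times 10^{-3}, \qquad P(10^7) \approx 0.20
\end{align*}
```

On these numbers, a scope can hold up to about a million IDs with a collision chance of a fraction of a percent. A scope expected to approach ten million IDs would need a uniqueness check on insert.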
Content-addressable block identity
Blocks do not carry an allocated entity ID at the framework level. Instead,
core/model.BlockIdentity derives identity from the block's own content:
// core/model/identity.go
type BlockIdentity struct {
	ContentHash string // SHA-256 of normalized source text
	ContextHash string // SHA-256 of name, type, and sorted properties
}

// ComputeIdentity derives both hashes for a block.
func ComputeIdentity(b *Block) *BlockIdentity
ComputeContentHash trims the source text and hashes the normalized result, so
identical strings collapse to one identity and deduplicate across sources.
ComputeContextHash folds in the block name, type, and deterministically sorted
properties, so a structural change is detectable as a context-hash change even
when the source text is unchanged.
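The two hashes can be illustrated with a minimal sketch. It is not the framework's implementation: the Block fields, the use of strings.TrimSpace as the only normalization step, the NUL field separator, and the helper names contentHash and contextHash are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Block is a hypothetical stand-in for the framework's block type.
type Block struct {
	Name, Type, Source string
	Properties         map[string]string
}

// contentHash hashes the trimmed source text (assumed normalization).
func contentHash(b *Block) string {
	sum := sha256.Sum256([]byte(strings.TrimSpace(b.Source)))
	return hex.EncodeToString(sum[:])
}

// contextHash hashes name, type, and properties in sorted key order,
// so map iteration order cannot change the result.
func contextHash(b *Block) string {
	keys := make([]string, 0, len(b.Properties))
	for k := range b.Properties {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	// A NUL separator keeps adjacent fields from running together.
	fmt.Fprintf(h, "%s\x00%s\x00", b.Name, b.Type)
	for _, k := range keys {
		fmt.Fprintf(h, "%s\x00%s\x00", k, b.Properties[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := &Block{Name: "title", Type: "text", Source: "Hello"}
	b := &Block{Name: "heading", Type: "text", Source: "  Hello \n"}
	fmt.Println(contentHash(a) == contentHash(b)) // true: same trimmed text
	fmt.Println(contextHash(a) == contextHash(b)) // false: names differ
}
```

The two blocks share a content hash and so would deduplicate, while their different names give them different context hashes.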
The block-addressed store (core/blockstore) keys on this content hash. A
Store is opened once per process and hands out Session transactions;
blocks are written content-addressed and, once written, are immutable —
tools append Overlay layers (targets, annotations, skeletons) keyed by
(kind, blockHash) rather than rewriting blocks. This is the substrate kapi
flows run against; see AD-008: Project Model.
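The write-once block and appended-overlay behaviour described above can be sketched with an in-memory map. The type and method names (memStore, PutBlock, AppendOverlay) are hypothetical and are not the core/blockstore API. The sketch also assumes that several layers may accumulate under one (kind, blockHash) key.

```go
package main

import "fmt"

// overlayKey mirrors the (kind, blockHash) key described above.
type overlayKey struct {
	Kind      string
	BlockHash string
}

// memStore is a hypothetical in-memory stand-in; it is not core/blockstore.
type memStore struct {
	blocks   map[string]string       // content hash -> source text, write-once
	overlays map[overlayKey][]string // layers appended per (kind, hash)
}

func newMemStore() *memStore {
	return &memStore{
		blocks:   make(map[string]string),
		overlays: make(map[overlayKey][]string),
	}
}

// PutBlock stores a block once. Writing an existing hash again is a no-op,
// which is what makes blocks immutable and deduplicated in this sketch.
func (s *memStore) PutBlock(hash, text string) (written bool) {
	if _, exists := s.blocks[hash]; exists {
		return false
	}
	s.blocks[hash] = text
	return true
}

// AppendOverlay adds a layer next to the block without rewriting it.
func (s *memStore) AppendOverlay(kind, hash, payload string) {
	k := overlayKey{Kind: kind, BlockHash: hash}
	s.overlays[k] = append(s.overlays[k], payload)
}

func main() {
	s := newMemStore()
	fmt.Println(s.PutBlock("h1", "Hello")) // true
	fmt.Println(s.PutBlock("h1", "Hello")) // false: already stored
	s.AppendOverlay("target", "h1", "Bonjour")
	s.AppendOverlay("annotation", "h1", "reviewed")
	fmt.Println(len(s.overlays), s.blocks["h1"]) // 2 Hello
}
```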
Platform-layer block addressing
Platform layer. The remainder of this section describes how a separately-licensed platform reconciles content identity with file-scoped format IDs in its persistent content store. It is not part of the Apache-2.0 framework; the types named here (StoredBlock, StoreBlocksForItem, the blocks SQL table) live under the platform's store package, not under core/. It is documented here for context. The authoritative reference is the platform's Content Store decision record.
A persistent store that ingests many files needs a stable, store-unique key
for each block even though format readers assign IDs that repeat across files
(twenty XLIFF files each containing tu1). The platform layer solves this with
a two-level scheme that separates internal IDs (store-unique, allocated by
id.New()) from source IDs (format-reader assigned, file-scoped):
| Property | Internal ID | Source ID |
|---|---|---|
| Generator | id.New() (8-char base62) | Format reader (e.g., tu1) |
| Uniqueness | Per project | Per file within project |
| Stored in | blocks.id (primary key) | blocks.source_id |
| Used by | API responses, editor, internal refs | Export roundtrip matching |
Ingestion. When blocks arrive from a format reader keyed by file path, the
incoming reader ID is treated as a source ID. The store looks up whether the
(scope, file, sourceID) triple has been seen before; if so it reuses the
existing internal ID, otherwise it allocates a new one. Both IDs are
persisted. (In the platform store this is StoreBlocksForItem, persisted in a
blocks table whose internal ID is the primary key and whose
(scope, item, source_id) triple is uniquely indexed.)
Re-save. When blocks are stored after translation, TM leverage, or editor edits, they already carry an internal ID from a previous read, so no source-ID remapping occurs.
Export. When writing translated files, the system re-parses the source to recover format-reader IDs, then matches them against stored source IDs to inject translations into the correct positions.
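The export match can be sketched as a lookup from re-parsed source IDs into the stored translations for one file. The names parsedBlock and fillTargets are hypothetical, and falling back to the source text when no translation is stored is an assumption of this sketch, not documented platform behaviour.

```go
package main

import "fmt"

// parsedBlock is a hypothetical block as re-read from the source file.
type parsedBlock struct {
	SourceID string
	Source   string
}

// fillTargets pairs each re-parsed block with its stored translation by
// source ID. Blocks with no stored translation are reported in missing.
func fillTargets(parsed []parsedBlock, stored map[string]string) (targets, missing []string) {
	for _, p := range parsed {
		t, ok := stored[p.SourceID]
		if !ok {
			missing = append(missing, p.SourceID)
			t = p.Source // assumption: fall back to the source text
		}
		targets = append(targets, t)
	}
	return targets, missing
}

func main() {
	parsed := []parsedBlock{{"tu1", "Hello"}, {"tu2", "World"}}
	stored := map[string]string{"tu1": "Bonjour"} // one file's source_id -> target
	targets, missing := fillTargets(parsed, stored)
	fmt.Println(targets, missing) // [Bonjour World] [tu2]
}
```

The stored map is scoped to a single file, which is what makes the bare source ID an unambiguous key at this stage.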
Why two IDs
The fundamental property is that source IDs are file-scoped. Inside one
XLIFF file, tu1, tu2, tu3 are fine — the XLIFF spec guarantees
uniqueness within a single document. But a store that ingests twenty XLIFF
files sees tu1 twenty times. Without disambiguation, a translation stored
against tu1 would be ambiguous between the twenty candidates.
Internal IDs solve this with a single store-unique identifier per block, regardless of which file it came from. Downstream consumers — REST API callers, the editor, the TM writer — use the internal ID exclusively. The source ID only matters during export roundtrip, where the system must match re-parsed blocks back to stored translations by their original format-level identifier. This is the minimal structure that lets files with overlapping reader IDs coexist without three-part composite keys everywhere in the stack.
Consequences
- URLs and API responses are short and readable (/projects/aB3xK9mL vs /projects/f47ac10b-58cc-4372-a567-0e02b2c3d479).
- No external dependency for ID generation: crypto/rand from the standard library is the only requirement.
- Content-addressable BlockIdentity deduplicates identical content across sources and detects edits by hash comparison, with no allocated block ID at the framework level.
- Collision probability at 47.6 bits is negligible for the per-scope populations these IDs address (millions of entities before concern), and random IDs avoid the enumeration leaks that sequential IDs produce.
- A platform that persists blocks from many files reconciles file-scoped reader IDs with a stable internal ID outside the framework, keeping core/ free of store-specific addressing.
Related
- AD-001: Vision and Module Architecture
- AD-002: Content Model (BlockIdentity and content-hash identity)
- AD-005: Format System (where source IDs originate)
- AD-008: Project Model (the core/blockstore substrate that keys on the content hash)