Content-Addressing & Merkle Trees

Cloudillo implements a Merkle tree structure using content-addressed identifiers throughout its architecture. Every action, file, and data blob is identified by the cryptographic hash of its content, creating an immutable, verifiable chain of trust.

What is Content-Addressing?

Content-addressing means identifying data by what it is (its content) rather than where it is (its location). Instead of using arbitrary IDs or URLs, Cloudillo computes a cryptographic hash of the content itself and uses that hash as the identifier.

Benefits

✅ Immutable: Content cannot change without changing its identifier
✅ Tamper-Evident: Any modification is immediately detectable
✅ Deduplicatable: Identical content produces identical identifiers
✅ Verifiable: Anyone can recompute and verify hashes independently
✅ Cacheable: Content-addressed data can be cached forever
✅ Trustless: No need to trust storage providers; verify the hash instead

Hash Function

Cloudillo uses SHA-256 for all content-addressing:

Algorithm: SHA-256 (256-bit Secure Hash Algorithm)
Encoding: Base64url without padding (URL-safe)
Output: 43-character base64url string (encoding of the 32-byte digest)
Collision Resistance: Roughly 2^128 hash evaluations to find a collision (birthday bound)

compute_hash(prefix, data):
    hash = SHA256(data)
    encoded = base64url_encode(hash)  // URL-safe, no padding
    return "{prefix}1~{encoded}"

// Example (values truncated; the encoded hash after "~" is 43 characters):
compute_hash("b", blob_bytes) → "b1~abc123def456..."
compute_hash("f", descriptor)  → "f1~Qo2E3G8TJZ..."
compute_hash("a", jwt_token)   → "a1~8kR3mN9pQ2vL..."
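
The pseudocode above maps directly onto a few standard-library calls. The following is a minimal Python sketch, not Cloudillo's actual implementation; the `version` parameter and the sample input are assumptions added for illustration.

```python
import base64
import hashlib


def compute_hash(prefix: str, data: bytes, version: int = 1) -> str:
    """Return a versioned content-addressed ID such as 'b1~<43 chars>'."""
    digest = hashlib.sha256(data).digest()           # 32 raw bytes
    encoded = base64.urlsafe_b64encode(digest)       # 44 chars, ends with '='
    encoded = encoded.rstrip(b"=").decode("ascii")   # 43 chars, no padding
    return f"{prefix}{version}~{encoded}"


blob_id = compute_hash("b", b"example blob bytes")
print(blob_id, len(blob_id))  # the full ID is 46 characters: "b1~" plus 43
```

Because the function is deterministic, any party holding the same bytes derives the same identifier without coordination.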

Merkle Tree Structure

Cloudillo’s content-addressing creates a variable-depth Merkle tree where actions can reference other actions recursively. The example below shows a six-level hierarchy for a POST action with image attachments:

┌─────────────────────────────────────────────────────────┐
│ Level 6: Action ID (a1~8kR3mN9pQ2vL6xW...)              │
│   ↑ SHA-256 hash of ↓                                   │
├─────────────────────────────────────────────────────────┤
│ Level 5: Action Token (JWT)                             │
│   Header: {"alg":"ES384","typ":"JWT"}                   │
│   Payload: {                                            │
│     "iss": "alice.example.com",                         │
│     "t": "POST:IMG",                                    │
│     "c": "Amazing photo!",                              │
│     "a": ["f1~Qo2E3G8TJZ..."],                          │
│     "iat": 1738483100                                   │
│   }                                                     │
│   Signature: <ES384 signature>                          │
│   ↓ references ↓                                        │
├─────────────────────────────────────────────────────────┤
│ Level 4: File ID (f1~Qo2E3G8TJZ2HTGhVlrtTDBp...)        │
│   ↑ SHA-256 hash of ↓                                   │
├─────────────────────────────────────────────────────────┤
│ Level 3: File Descriptor (d1~...)                       │
│   "d1~tn:b1~abc:f=AVIF:s=4096:r=150x150,                │
│        sd:b1~def:f=AVIF:s=32768:r=640x480,              │
│        md:b1~ghi:f=AVIF:s=262144:r=1920x1080"           │
│   ↓ references ↓                                        │
├─────────────────────────────────────────────────────────┤
│ Level 2: Variant IDs                                    │
│   b1~abc123... (tn)                                     │
│   b1~def456... (sd)                                     │
│   b1~ghi789... (md)                                     │
│   ↑ SHA-256 hash of ↓                                   │
├─────────────────────────────────────────────────────────┤
│ Level 1: Blob Data (raw bytes)                          │
│   <AVIF encoded image data - tn variant>                │
│   <AVIF encoded image data - sd variant>                │
│   <AVIF encoded image data - md variant>                │
└─────────────────────────────────────────────────────────┘

Level 1: Blob Data

Raw bytes of actual content (images, videos, documents).

No identifier yet—just the binary data
This is what gets hashed to create variant IDs

Level 2: Variant IDs (b1~…)

Content-addressed blob identifiers computed as SHA256(blob_bytes).

Properties:

Identifies a single image/video variant
Changing even one byte changes the entire ID
Multiple actions can reference the same variant_id (deduplication)

Level 3: File Descriptor (d1~…)

Encoded string listing all available variants of a file.

Format:

d1~{variant}:{variant_id}:f={format}:s={size}:r={width}x{height},...

Example:

d1~tn:b1~abc123:f=AVIF:s=4096:r=150x150,sd:b1~def456:f=AVIF:s=32768:r=640x480

This descriptor says:

Thumbnail variant: b1~abc123, AVIF format, 4KB, 150×150px
Standard variant: b1~def456, AVIF format, 32KB, 640×480px
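
To make the format concrete, here is a hedged Python sketch of a parser for the descriptor layout shown above. The `Variant` record and the `parse_descriptor` helper are illustrative names, not part of a documented Cloudillo API. Splitting on `,` and `:` is safe here because base64url text contains neither character.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str      # e.g. "tn", "sd", "md"
    blob_id: str   # e.g. "b1~abc123"
    fmt: str       # e.g. "AVIF"
    size: int      # bytes
    width: int
    height: int


def parse_descriptor(descriptor: str) -> list[Variant]:
    """Parse 'd1~name:blob_id:f=..:s=..:r=WxH,...' into Variant records."""
    if not descriptor.startswith("d1~"):
        raise ValueError("unsupported descriptor version")
    variants = []
    for entry in descriptor[3:].split(","):
        name, blob_id, *fields = entry.strip().split(":")
        attrs = dict(field.split("=", 1) for field in fields)
        width, height = attrs["r"].split("x")
        variants.append(Variant(name, blob_id, attrs["f"], int(attrs["s"]),
                                int(width), int(height)))
    return variants


example = ("d1~tn:b1~abc123:f=AVIF:s=4096:r=150x150,"
           "sd:b1~def456:f=AVIF:s=32768:r=640x480")
for v in parse_descriptor(example):
    print(v.name, v.blob_id, v.fmt, v.size, f"{v.width}x{v.height}")
```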

Level 4: File ID (f1~…)

Content-addressed file identifier computed as SHA256(descriptor_string).

Properties:

Identifies the complete file with all its variants
Changing any variant changes the descriptor, which changes the file_id
Used in action token attachments
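
The dependency of the file ID on every variant can be sketched in Python as follows. This is an illustration under stated assumptions: the hash is taken over the UTF-8 bytes of the full descriptor string including its `d1~` prefix, variants are listed in a fixed order, and `content_id` and `build_descriptor` are hypothetical helper names.

```python
import base64
import hashlib


def content_id(prefix: str, data: bytes) -> str:
    digest = hashlib.sha256(data).digest()
    return f"{prefix}1~" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def build_descriptor(variants: list[tuple[str, bytes, str, int, int]]) -> str:
    """variants: (name, blob_bytes, format, width, height), in a fixed order."""
    entries = [
        f"{name}:{content_id('b', blob)}:f={fmt}:s={len(blob)}:r={w}x{h}"
        for name, blob, fmt, w, h in variants
    ]
    return "d1~" + ",".join(entries)


variants = [("tn", b"thumbnail bytes", "AVIF", 150, 150),
            ("sd", b"standard bytes", "AVIF", 640, 480)]
descriptor = build_descriptor(variants)
file_id = content_id("f", descriptor.encode("utf-8"))

# Changing one byte of one variant yields a different descriptor and file ID.
tampered = [("tn", b"thumbnail bytez", "AVIF", 150, 150), variants[1]]
tampered_id = content_id("f", build_descriptor(tampered).encode("utf-8"))
print(file_id != tampered_id)  # True
```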

Level 5: Action Token (JWT)

Cryptographically signed JSON Web Token representing a user action.

Structure:

{
  "header": {
    "alg": "ES384",
    "typ": "JWT"
  },
  "payload": {
    "iss": "alice.example.com",
    "k": "20250101",
    "t": "POST:IMG",
    "c": "Check out this amazing photo!",
    "a": ["f1~Qo2E3G8TJZ..."],
    "iat": 1738483100
  },
  "signature": "<ES384 signature with alice's private key>"
}

Encoding: {base64url(header)}.{base64url(payload)}.{base64url(signature)}
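
As a rough illustration of the compact encoding, the Python sketch below assembles and splits a three-segment token. JWT segments use base64url without padding, so the decoder restores the padding first. The signature here is a placeholder byte string, not a real ES384 signature, and `split_token` performs no verification; both helper names are illustrative.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode base64url text that has had its '=' padding stripped."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def split_token(token: str) -> tuple[dict, dict, bytes]:
    """Return (header, payload, signature bytes) without verifying anything."""
    header_b64, payload_b64, signature_b64 = token.split(".")
    return (json.loads(b64url_decode(header_b64)),
            json.loads(b64url_decode(payload_b64)),
            b64url_decode(signature_b64))


header = {"alg": "ES384", "typ": "JWT"}
payload = {"iss": "alice.example.com", "t": "POST:IMG", "iat": 1738483100}
token = ".".join([
    b64url_encode(json.dumps(header, separators=(",", ":")).encode("utf-8")),
    b64url_encode(json.dumps(payload, separators=(",", ":")).encode("utf-8")),
    b64url_encode(b"placeholder-signature"),  # NOT a real signature
])
decoded_header, decoded_payload, _ = split_token(token)
print(decoded_header["alg"], decoded_payload["iss"])
```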

Level 6: Action ID (a1~…)

Content-addressed action identifier computed as SHA256(complete_jwt_token).

Properties:

Identifies the complete action immutably
Changing any field (even whitespace) changes the action_id
Used as parent references in replies/reactions

Hash Versioning Scheme

All identifiers use a versioned prefix format for future-proofing:

{prefix}{version}~{base64_encoded_hash}
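
A reader of such identifiers can split them back into their parts and pick a hash algorithm by version. The Python sketch below assumes a single lowercase letter as the prefix and a decimal version number; the `HASHERS` table and `parse_id` helper are hypothetical. Note that a `d1~` descriptor is not a hash and would be rejected by this pattern.

```python
import re
from typing import NamedTuple

ID_PATTERN = re.compile(r"^([a-z])(\d+)~([A-Za-z0-9_-]+)$")


class ContentId(NamedTuple):
    kind: str      # "a", "f", or "b"
    version: int
    digest: str    # base64url text after the "~"


def parse_id(identifier: str) -> ContentId:
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a content-addressed ID: {identifier!r}")
    kind, version, digest = match.groups()
    return ContentId(kind, int(version), digest)


# version -> hashlib algorithm name; later versions would be added here
HASHERS = {1: "sha256"}

parsed = parse_id("b1~47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU")
print(parsed.kind, parsed.version, HASHERS[parsed.version])  # b 1 sha256
```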

Current Prefixes (Version 1)

Prefix	Resource Type	Hash Input	Example
`a1~`	Action	Entire JWT token	`a1~8kR3mN9pQ2vL6xW...`
`f1~`	File	File descriptor string	`f1~Qo2E3G8TJZ2HTGh...`
`d1~`	Descriptor	(not a hash, the encoded format itself)	`d1~tn:b1~abc:f=AVIF:...`
`b1~`	Blob	Blob bytes (raw data)	`b1~abc123def456ghi...`

Version Scheme

Version 1: SHA-256 with base64url encoding (no padding)
Future versions: Can upgrade to SHA-3, BLAKE3, etc.
Backward compatibility: Old content remains valid forever
Algorithm agility: Migrate to new algorithms without breaking existing references

Example upgrade path:

a1~...  (SHA-256)
a2~...  (SHA-3)
a3~...  (BLAKE3)

Merkle Tree Properties

Cloudillo’s content-addressing creates a Merkle tree with these properties:

Immutability

Once created, content cannot be modified without changing its identifier.

Example:

Original Post:
  content: "Hello World"
  action_id: a1~abc123...

Modified Post:
  content: "Hello Universe"
  action_id: a1~xyz789...  ← DIFFERENT ID!

Any attempt to modify content results in a completely new action with a different ID. The original remains unchanged.

Tamper-Evidence

Any modification anywhere in the tree is immediately detectable.

Example:

Post (a1~abc...)
  └─ Attachment: f1~Qo2...
      └─ Variants: b1~tn..., b1~sd..., b1~md...

If someone modifies the thumbnail image:
  ✗ New variant_id (b1~xyz...)
  ✗ Descriptor changes
  ✗ New file_id (f1~uvw...)
  ✗ Post attachment no longer matches
  ✗ Verification fails!

Deduplication

Identical content produces identical identifiers, enabling automatic deduplication.

Example:

Alice posts photo.jpg → file_id: f1~abc123...
Bob posts photo.jpg (same file) → file_id: f1~abc123... (same!)

Storage: Only one copy of the blob bytes needed
Bandwidth: Can skip downloading if already cached
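
Deduplication falls out of the storage key being derived from the bytes themselves. A minimal in-memory sketch in Python follows; `BlobStore` is a hypothetical class for illustration, not Cloudillo's storage layer.

```python
import base64
import hashlib


class BlobStore:
    """In-memory content-addressed store: the key is derived from the bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).digest()
        blob_id = "b1~" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
        self._blobs.setdefault(blob_id, data)   # no-op if already stored
        return blob_id

    def get(self, blob_id: str) -> bytes:
        return self._blobs[blob_id]

    def __len__(self) -> int:
        return len(self._blobs)


store = BlobStore()
photo = b"identical photo bytes"
alice_id = store.put(photo)
bob_id = store.put(photo)               # same bytes, same ID
print(alice_id == bob_id, len(store))   # True 1
```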

Verifiability

Anyone can independently verify the entire chain.

Process:

1. Download action token
2. Verify JWT signature (proves author identity)
3. Recompute action_id = SHA256(JWT)
4. Compare with claimed action_id
5. For each attachment:
   - Download file descriptor
   - Download all variant blobs
   - Verify each variant_id = SHA256(blob)
   - Recompute file_id = SHA256(descriptor)
   - Compare with attachment reference
✓ Complete verification

No trust required: Pure mathematics ensures integrity.
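
The attachment part of this process can be sketched in Python using only hashing. This is an illustration, not Cloudillo's verifier: signature verification is deliberately left out, the descriptor is assumed to be hashed as UTF-8 text, and `verify_attachment` with its in-memory `blobs` mapping is a hypothetical helper.

```python
import base64
import hashlib


def sha256_b64url(data: bytes) -> str:
    digest = hashlib.sha256(data).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_attachment(file_id: str, descriptor: str, blobs: dict[str, bytes]) -> bool:
    """Check file_id against the descriptor, then each variant ID against its blob."""
    if not descriptor.startswith("d1~"):
        return False
    if file_id != "f1~" + sha256_b64url(descriptor.encode("utf-8")):
        return False
    for entry in descriptor[3:].split(","):
        _name, blob_id, *_fields = entry.split(":")
        blob = blobs.get(blob_id)
        if blob is None or blob_id != "b1~" + sha256_b64url(blob):
            return False
    return True


# Build a consistent fixture, then tamper with the blob.
blob = b"thumbnail bytes"
blob_id = "b1~" + sha256_b64url(blob)
descriptor = f"d1~tn:{blob_id}:f=AVIF:s={len(blob)}:r=150x150"
file_id = "f1~" + sha256_b64url(descriptor.encode("utf-8"))

print(verify_attachment(file_id, descriptor, {blob_id: blob}))          # True
print(verify_attachment(file_id, descriptor, {blob_id: b"tampered"}))   # False
```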

Chain of Trust

Parent references create an immutable chain.

Example:

Post (a1~abc...)
  ↑ parent_id
Comment (a1~def...)
  ↑ parent_id
Reply (a1~ghi...)

Reply references Comment by content hash
Comment references Post by content hash
Cannot modify Post without breaking Comment reference
Cannot modify Comment without breaking Reply reference
Entire thread is cryptographically bound together
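
The binding between parent and child can be demonstrated with a toy model in Python. The tokens here are unsigned JSON byte strings standing in for real signed JWTs, and the `parent_id` field name is an assumption made for this example; only the hashing relationship is being illustrated.

```python
import base64
import hashlib
import json
from typing import Optional


def action_id(token: bytes) -> str:
    digest = hashlib.sha256(token).digest()
    return "a1~" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_token(content: str, parent_id: Optional[str] = None) -> bytes:
    """Stand-in for a signed JWT: deterministic JSON bytes, unsigned."""
    body = {"c": content, "parent_id": parent_id}
    return json.dumps(body, sort_keys=True).encode("utf-8")


post = make_token("Hello World")
comment = make_token("Nice post", parent_id=action_id(post))
reply = make_token("Agreed", parent_id=action_id(comment))

# An edited post hashes to a different ID, so the comment's stored
# reference still points at the original and not at the edited version.
edited_post = make_token("Hello Universe")
comment_parent = json.loads(comment)["parent_id"]
print(comment_parent == action_id(post))         # True
print(comment_parent == action_id(edited_post))  # False
```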

Proof of Authenticity

Cloudillo provides two complementary layers of proof:

Layer 1: Author Identity (Cryptographic Signatures)

Action tokens are signed with ES384 (ECDSA with P-384 curve and SHA-384 hash):

Action Token = Header.Payload.Signature

Signature = ECDSA_sign(SHA384(Header.Payload), alice_private_key)

Verification:

Fetch Alice’s public key from https://cl-o.alice.example.com/api/me/keys
Verify signature using public key
✓ Proves Alice created this action

Layer 2: Content Integrity (Content Hashes)

All content-addressed identifiers (a1~, f1~, b1~) are SHA-256 hashes:

action_id = SHA256(entire JWT token)
file_id = SHA256(descriptor string)
variant_id = SHA256(blob bytes)

Verification:

Download content
Recompute hash
Compare with claimed identifier
✓ Proves content hasn’t been tampered with

Combined Proof

Author + Content = Complete Authenticity

For a post with image attachment:

✓ Signature proves Alice authored the post
✓ Action hash proves post content is intact
✓ File hash proves descriptor is intact
✓ Variant hashes prove image bytes are intact
✓ Complete chain of authenticity established

No trusted intermediaries needed—pure cryptographic proof.

Verification Example

Scenario: Verify a LIKE action on a POST with image attachment.

verify_like_action(like_id):
    // 1. Verify LIKE action
    like_token = fetch_action_token(like_id)
    verify_signature(like_token, bob_public_key)
    verify like_id == SHA256(like_token)
    ✓ Bob authored this LIKE, content is intact

    // 2. Verify parent POST action
    post_id = like_token.parent_id
    post_token = fetch_action_token(post_id)
    verify_signature(post_token, alice_public_key)
    verify post_id == SHA256(post_token)
    ✓ Alice authored this POST, content is intact

    // 3. Verify file attachment
    file_id = post_token.attachments[0]
    descriptor = fetch_descriptor(file_id)
    verify file_id == SHA256(descriptor)
    ✓ File descriptor is intact

    // 4. Verify each variant blob
    variants = parse_descriptor(descriptor)
    for each variant:
        blob_data = fetch_blob(variant.blob_id)
        verify variant.blob_id == SHA256(blob_data)
        verify blob_data.size == variant.size
    ✓ All image variants are intact

    RESULT: Complete chain verified!

What this proves:

Bob signed the LIKE (authentication)
Alice signed the POST (authentication)
No content was tampered with at any level (integrity)
The LIKE references this specific POST (linkage)
The POST references this specific image file (linkage)
All image bytes are authentic (end-to-end verification)

DAG Structure

Cloudillo’s action system forms a Directed Acyclic Graph (DAG) with these properties:

Multiple Roots

Unlike a traditional tree with one root, Cloudillo has multiple independent threads:

Post 1 (a1~abc...)                Post 2 (a1~xyz...)
│                                  │
├─ Comment 1.1 (a1~def...)        ├─ Comment 2.1 (a1~uvw...)
│  │                               │
│  ├─ Reply 1.1.1 (a1~ghi...)     └─ Like 2.1 (a1~rst...)
│  │                                  (parent: a1~uvw...)
│  └─ Reply 1.1.2 (a1~jkl...)
│
├─ Comment 1.2 (a1~mno...)
│
└─ Like 1 (a1~pqr...)
   (parent: a1~abc...)

Each top-level post is a root node. Comments and reactions form child nodes.

Shared Attachments

Multiple actions can reference the same file:

Post 1 (a1~abc...)
  └─ Attachment: f1~photo123

Post 2 (a1~xyz...)
  └─ Attachment: f1~photo123  ← SAME FILE!

Repost (a1~uvw...)
  └─ Attachment: f1~photo123  ← SAME FILE!

Benefits:

Storage efficiency: Only one copy of blob data needed
Bandwidth efficiency: Download once, use everywhere
Consistency: Everyone sees exactly the same image
Verification: Single verification proves authenticity for all uses

Acyclic Property

The graph has no cycles (no circular references):

✓ Valid:
  Post → Comment → Reply (linear chain)
  Post → Comment1, Post → Comment2 (branching)

✗ Invalid:
  Post → Comment → Reply → Post (cycle!)

Cycles are prevented because:

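Parent references use content hashes
An action cannot reference an action that doesn’t exist yet
An action’s hash cannot be computed without knowing its full content, including its parent reference
A circular reference would require an action whose content contains a hash that depends on itself, which is computationally infeasible to construct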

Efficient Traversal

Forward traversal (find children):

-- Find all comments on a post
SELECT * FROM actions
WHERE parent_id = 'a1~abc123...'
AND type LIKE 'CMNT%';
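
The forward-traversal query can be tried end to end with Python's built-in sqlite3 module. The schema, index, and sample rows below are assumptions for illustration only; the IDs are shortened placeholders rather than real hashes, and the actual Cloudillo table layout may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE actions (
        action_id TEXT PRIMARY KEY,
        parent_id TEXT,
        type      TEXT NOT NULL
    )
""")
# An index on parent_id keeps child lookups from scanning the whole table.
conn.execute("CREATE INDEX idx_actions_parent ON actions (parent_id)")
conn.executemany(
    "INSERT INTO actions VALUES (?, ?, ?)",
    [
        ("a1~post", None, "POST:IMG"),
        ("a1~cmnt1", "a1~post", "CMNT"),
        ("a1~cmnt2", "a1~post", "CMNT"),
        ("a1~like", "a1~post", "LIKE"),
    ],
)

rows = conn.execute(
    "SELECT action_id FROM actions "
    "WHERE parent_id = ? AND type LIKE 'CMNT%' ORDER BY action_id",
    ("a1~post",),
).fetchall()
comment_ids = [row[0] for row in rows]
print(comment_ids)  # ['a1~cmnt1', 'a1~cmnt2']
```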

Backward traversal (find parents):

find_root(action_id):
    action = fetch_action(action_id)
    if action.parent_id exists:
        return find_root(action.parent_id)  // Recursive
    else:
        return action  // Found root
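
An iterative version of the same walk, sketched in Python, avoids deep recursion on long threads. The `PARENTS` mapping is a hypothetical in-memory index with placeholder IDs; the cycle check is purely defensive, guarding against a corrupted index rather than a valid action graph.

```python
from typing import Optional

# Hypothetical in-memory index: action_id -> parent_id (None for a root).
PARENTS: dict[str, Optional[str]] = {
    "a1~post": None,
    "a1~comment": "a1~post",
    "a1~reply": "a1~comment",
}


def find_root(action_id: str, parents: dict[str, Optional[str]]) -> str:
    """Follow parent references until an action with no parent is reached."""
    seen = set()
    current = action_id
    while parents[current] is not None:
        if current in seen:  # defensive: corrupted index
            raise ValueError("cycle detected in parent references")
        seen.add(current)
        current = parents[current]
    return current


print(find_root("a1~reply", PARENTS))  # a1~post
```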

Root ID optimization: To avoid repeated traversal, the root_id is computed once and stored in the database (see Actions: Root ID Handling).

Performance Implications

Caching Strategy

Content-addressed data is immutable, enabling aggressive caching:

GET /api/files/f1~Qo2E3G8TJZ...

Response Headers:
  Cache-Control: public, max-age=31536000, immutable
  Content-Type: image/avif

Benefits:

Browsers can cache responses for up to one year (max-age=31536000) without revalidating
CDNs can cache responses just as long
No cache invalidation needed
Reduces bandwidth for federated instances

Storage Deduplication

Identical content is stored only once:

Alice uploads photo.jpg → b1~abc123 (1MB stored)
Bob uploads photo.jpg → b1~abc123 (0MB stored, reuse!)
Carol uploads photo.jpg → b1~abc123 (0MB stored, reuse!)

Total storage: 1MB instead of 3MB

Automatic deduplication at multiple levels:

Variant level (same image blob)
File level (same set of variants)
Action level (no deduplication: signatures and timestamps make every action unique)

Security Considerations

Collision Resistance

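SHA-256 produces a 256-bit digest, which gives roughly 128-bit collision resistance and 256-bit preimage resistance:

Collision attack: ~2^128 hash evaluations (birthday bound)
Accidental collision between two given inputs: probability ~2^-256 (effectively zero)
Preimage attack: computationally infeasible (~2^256 evaluations)
Second preimage attack: computationally infeasible

In practice: No SHA-256 collision has been publicly demonstrated, and a brute-force search is far beyond any foreseeable computing capacity.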
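
As a rough worked estimate, the standard birthday-bound approximation (treating SHA-256 output as uniformly random, which is an idealizing assumption) gives the probability of any accidental collision among n stored identifiers:

```latex
P_{\text{collision}}(n) \approx \frac{n(n-1)}{2 \cdot 2^{256}} \approx \frac{n^{2}}{2^{257}},
\qquad
P_{\text{collision}}\!\left(2^{64}\right) \approx \frac{2^{128}}{2^{257}} = 2^{-129}
```

Even with 2^64 objects (about 1.8 × 10^19), the chance of a single accidental collision is around 2^-129.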

Tamper Detection

Any modification anywhere in the tree is immediately detectable:

Attack scenario: Attacker tries to modify an image in Alice’s post

1. Attacker modifies image blob
   → New variant_id (hash mismatch!)
2. Attacker updates descriptor with new variant_id
   → New file_id (hash mismatch!)
3. Attacker updates post attachment with new file_id
   → Breaks JWT signature (Alice didn't sign this!)
4. Attacker creates new JWT with new attachment
   → New action_id (different post!)

Result: Attacker cannot tamper without detection. They can only create NEW actions, not modify existing ones.

Trust Model

Cloudillo’s Merkle tree creates a trustless verification model:

What	Verification Method	Trust Required
Author identity	JWT signature (ES384)	DNS + Public key infrastructure
Content integrity	SHA-256 hash	None (pure mathematics)
Parent references	Content hashes	None (pure mathematics)
Attachment integrity	SHA-256 hash chain	None (pure mathematics)

Storage providers don’t need to be trusted: Even if a storage provider is malicious, they cannot:

Modify content without breaking hashes ✗
Forge signatures without private keys ✗
Create false parent references ✗
Tamper with attachments without detection ✗

Users verify everything cryptographically—no trust required.

Attack Resistance

Known attacks and mitigations:

Attack	Mitigation
Modify action content	Hash mismatch detected
Forge author signature	Signature verification fails
Swap file attachment	Hash mismatch detected
Modify parent reference	Breaks cryptographic chain
Replay old actions	Timestamp validation, deduplication
Storage provider tampering	Hash verification fails
