Blob Storage

Cloudillo’s blob storage uses content-addressed storage for immutable binary data (files, images, videos). Content addressing provides data integrity and deduplication, while automatic variant generation for images provides efficient delivery across different use cases.

Why Content-Addressed Storage?

Traditional file storage asks “where is this file?” Content-addressed storage asks “what is this file?” This simple shift enables powerful features:

Real-world analogy: Imagine a library where books are organized by their content fingerprint rather than shelf location. Two copies of the same book have the same fingerprint—you only need to store one. If someone claims to have “the original,” you can verify it instantly by checking the fingerprint.

Benefits you’ll notice:

  • Upload once, access anywhere: The same image uploaded by different users is stored only once
  • Verification without trust: Anyone can confirm a file hasn’t been modified
  • Efficient caching: Files can be cached forever—they never change
  • Automatic deduplication: Duplicate uploads add no storage, so the cost per stored file falls as the network grows

Benefits for developers:

  • Simple URLs: The file ID is derived from the file content, so it serves as a permanent reference
  • No cache invalidation: Content-addressed files are immutable
  • Built-in integrity: Hash verification catches corruption instantly

Content-Addressed Storage

Concept

Files are identified by the SHA-256 hash of their content, which makes identifiers:

  • Immutable: Content cannot change without changing the ID
  • Verifiable: Recipients can verify integrity
  • Deduplicable: Identical content gets the same ID
  • Tamper-evident: Any modification is immediately detectable

File Identifier Format

Cloudillo uses multiple identifier types in its content-addressing system:

{prefix}{version}~{base64url_hash}

Components:

  • {prefix}: Resource type indicator (a, f, b, d)
  • {version}: Hash algorithm version (currently 1 = SHA-256)
  • ~: Separator
  • {base64url_hash}: Base64url-encoded hash (43 characters, no padding)

Identifier Types

Prefix | Resource Type | Hash Input                              | Example
-------+---------------+-----------------------------------------+----------------------------
b1~    | Blob          | Blob bytes (raw image/video data)       | b1~abc123def456...
f1~    | File          | File descriptor string                  | f1~QoEYeG8TJZ2HTGh...
d2,    | Descriptor    | (not a hash, the encoded format itself) | d2,vis.tn:b1~abc:f=avif:...
a1~    | Action        | Complete JWT token                      | a1~8kR3mN9pQ2vL...

Important: d2, is not a content-addressed identifier—it’s the actual encoded descriptor string. The file ID (f1~) is the hash of this descriptor.

Examples

Blob ID:       b1~QoEYeG8TJZ2HTGhVlrtTDBpvBGOp6gfGhq4QmD6Z46w
File ID:       f1~m8Z35EIa3prvb3bhjsVjdg9SG98xd0bkoWomOHQAwCM
Descriptor:    d2,vis.tn:b1~xRAVuQtgBx_kLqZnoOSd5XqCK_aQolhq1XeXk73Zn8U:f=avif:s=1960:r=90x128
Action ID:     a1~8kR3mN9pQ2vL6xWpYzT4BjN5FqGxCmK9RsH2VwLnD8P

All file and blob IDs use SHA-256 content-addressing. See Content-Addressing & Merkle Trees for hash computation details.
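
The identifier format above can be checked mechanically. The following is a minimal Python sketch, not Cloudillo's implementation: the names `parse_id`, `ID_PATTERN`, and `KINDS` are hypothetical, and rejecting every version other than `1` is an assumption based on the component list above.

```python
import re

# Hash-based IDs: one-letter prefix, version digit, "~", then 43 base64url characters.
# Only the a/f/b prefixes are matched; "d2," descriptors are not hashes.
ID_PATTERN = re.compile(r"([afb])(\d)~([A-Za-z0-9_-]{43})")

KINDS = {"a": "action", "f": "file", "b": "blob"}


def parse_id(identifier: str) -> tuple[str, int, str]:
    """Split an identifier into (kind, version, hash); raise ValueError if malformed."""
    match = ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a content-addressed ID: {identifier!r}")
    prefix, version, digest = match.groups()
    if version != "1":
        # Assumption: version 1 (SHA-256) is the only hash version in use.
        raise ValueError(f"unsupported hash version: {version}")
    return KINDS[prefix], int(version), digest


print(parse_id("b1~QoEYeG8TJZ2HTGhVlrtTDBpvBGOp6gfGhq4QmD6Z46w")[0])  # blob
```

Passing a descriptor string (`d2,...`) raises `ValueError`, which mirrors the note above that a descriptor is not itself a content-addressed identifier.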

File Types

Cloudillo supports four file types, each handled by different adapters based on mutability and use case:

Type | Adapter     | Mutability | Description
-----+-------------+------------+------------------------------------------------------
BLOB | BlobAdapter | Immutable  | Binary content (images, videos, documents)
CRDT | CrdtAdapter | Mutable    | Collaborative documents (Yjs-based real-time editing)
RTDB | RtdbAdapter | Mutable    | Real-time database files for app state
FLDR | MetaAdapter | Mutable    | Folder/directory metadata

Why Different File Types?

Different collaboration scenarios require different storage strategies:

  • BLOB: When you upload a photo or video, it never changes—if you want a different version, you upload a new file. This immutability enables powerful caching and deduplication across the network.
  • CRDT: When editing a document with others in real-time (like Google Docs), changes from all participants must merge seamlessly. CRDTs (Conflict-free Replicated Data Types) make this possible.
  • RTDB: Apps need to store changing state (todo lists, game scores, form data). RTDB provides real-time synchronization with WebSocket subscriptions.
  • FLDR: Organizing files into folders requires mutable metadata without changing the files themselves.

File Type Selection

Files are created via different API endpoints based on their type:

Endpoint           | File Type | Use Case
-------------------+-----------+-------------------------------------------
/api/files/image/* | BLOB      | Image uploads with auto-variant generation
/api/files/raw/*   | BLOB      | Raw file uploads (no processing)
/api/files/crdt/*  | CRDT      | Collaborative document creation
/api/files/rtdb/*  | RTDB      | Real-time database file creation

Per-File Access Control

Each file has independent access control:

Field        | Values       | Description
-------------+--------------+-------------------------------
Visibility   | P/V/F/C/null | Who can discover this file
Access Level | R/W          | Read-only vs read-write access

See Access Control for detailed permission handling.

File Variants

Concept

Cloudillo automatically generates multiple variants of each uploaded image, each optimized for a different use case:

  • tn (thumbnail): Tiny preview (~128x96px)
  • sd (standard definition): Social media size (~640x480px)
  • md (medium definition): Web display (~1280x720px)
  • hd (high definition): Full screen (~1920x1080px)
  • xd (extra definition): Original/4K+ (~3840x2160px+)

File Descriptor Encoding

A file descriptor encodes all available variants in a compact format.

File Descriptor Format Specification

Format

d2,{class}.{variant}:{blob_id}:f={format}:s={size}:r={width}x{height}[:{optional}];...

Components

  • d2, - Descriptor prefix (version 2)
  • {class} - Media class:
    • vis - Visual (images: jpeg, png, webp, avif)
    • vid - Video (mp4/h264)
    • aud - Audio (opus, mp3)
    • doc - Documents (pdf)
    • raw - Original unprocessed file
  • {variant} - Quality tier: pf, tn, sd, md, hd, xd, or orig
  • {blob_id} - Content-addressed ID of the blob (b1~...)
  • f={format} - Format: avif, webp, jpeg, png, mp4, opus, pdf
  • s={size} - File size in bytes (integer, no separators)
  • r={width}x{height} - Resolution in pixels (width × height)
  • ; - Semicolon separator between variants (no spaces)

Optional Fields

For video, audio, and document files:

  • dur={seconds} - Duration in seconds (floating point, video/audio only)
  • br={kbps} - Bitrate in kbps (integer, video/audio only)
  • pg={count} - Page count (integer, documents only)

Example

d2,vis.tn:b1~abc123:f=avif:s=4096:r=150x150;vis.sd:b1~def456:f=avif:s=32768:r=640x480

This descriptor encodes two variants:

  • Thumbnail: AVIF format, 4096 bytes, 150×150 pixels, blob ID b1~abc123
  • Standard: AVIF format, 32768 bytes, 640×480 pixels, blob ID b1~def456

Video Example

d2,vis.tn:b1~abc:f=avif:s=4096:r=150x84;vid.sd:b1~def:f=mp4:s=5242880:r=720x404:dur=120.5:br=350;vid.hd:b1~ghi:f=mp4:s=20971520:r=1920x1080:dur=120.5:br=1400

This descriptor includes:

  • Thumbnail: AVIF image preview
  • SD Video: 720×404 MP4, 120.5 seconds, 350 kbps
  • HD Video: 1080p MP4, 120.5 seconds, 1400 kbps

Parsing Rules

  1. Check prefix: Verify descriptor starts with d2,
  2. Split by semicolon (;): Get individual variant entries
  3. For each variant, split by colon (:) to get components:
    • Component [0] = class.variant (vis.tn, vis.sd, vid.hd)
    • Component [1] = blob_id (b1~...)
    • Components [2..] = key=value pairs
  4. Parse key=value pairs:
    • f={format} → Format string
    • s={size} → Parse as u64 (bytes)
    • r={width}x{height} → Split by x, parse as u32 × u32
    • dur={seconds} → Parse as f64 (optional)
    • br={kbps} → Parse as u32 (optional)
    • pg={count} → Parse as u32 (optional)

Parsing logic: split by semicolons for variants, then by colons for fields, then parse key=value pairs.
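
The parsing rules above can be sketched in a few lines of Python. This is an illustrative sketch rather than Cloudillo's parser: the `Variant` dataclass and the `parse_descriptor` function are hypothetical names, and treating `r=` as optional (for audio or document entries) is an assumption, since the format specification lists it without brackets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    media_class: str                   # "vis", "vid", "aud", "doc", "raw"
    variant: str                       # "tn", "sd", "md", "hd", "xd", ...
    blob_id: str                       # "b1~..."
    format: str
    size: int                          # bytes
    width: Optional[int] = None
    height: Optional[int] = None
    duration: Optional[float] = None   # seconds
    bitrate: Optional[int] = None      # kbps
    pages: Optional[int] = None


def parse_descriptor(descriptor: str) -> list[Variant]:
    # Rule 1: check the prefix.
    if not descriptor.startswith("d2,"):
        raise ValueError("descriptor must start with 'd2,'")
    variants = []
    # Rule 2: one entry per semicolon-separated variant.
    for entry in descriptor[3:].split(";"):
        # Rule 3: class.variant, blob ID, then key=value pairs.
        parts = entry.split(":")
        media_class, _, variant = parts[0].partition(".")
        fields = dict(part.split("=", 1) for part in parts[2:])
        # Rule 4: convert each known key to its numeric type.
        width = height = None
        if "r" in fields:
            w, h = fields["r"].split("x")
            width, height = int(w), int(h)
        variants.append(Variant(
            media_class=media_class,
            variant=variant,
            blob_id=parts[1],
            format=fields["f"],
            size=int(fields["s"]),
            width=width,
            height=height,
            duration=float(fields["dur"]) if "dur" in fields else None,
            bitrate=int(fields["br"]) if "br" in fields else None,
            pages=int(fields["pg"]) if "pg" in fields else None,
        ))
    return variants
```

Splitting on `:` and `;` is safe here because base64url blob IDs never contain either character.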

Variant Size Classes - Exact Specifications

Cloudillo generates image variants at specific size targets to optimize bandwidth and storage:

Class | Name                | Target Resolution | Max Dimension | Use Case
------+---------------------+-------------------+---------------+------------------------------
tn    | Thumbnail           | ~150×150px        | 200px         | List views, previews, avatars
sd    | Standard Definition | ~640×480px        | 800px         | Mobile devices, low bandwidth
md    | Medium Definition   | ~1920×1080px      | 2000px        | Desktop viewing, full screen
hd    | High Definition     | ~3840×2160px      | 4000px        | 4K displays, high quality
xd    | Extra Definition    | Original size     | No limit      | Archival, original quality

Generation Rules

The set of generated variants depends on the image’s maximum dimension (the larger of width and height):

  • max_dim ≥ 3840px: tn, sd, md, hd, xd (all variants)
  • max_dim ≥ 1920px: tn, sd, md, hd
  • max_dim ≥ 1280px: tn, sd, md
  • max_dim < 1280px: tn, sd

Properties:

  • Each variant maintains the original aspect ratio
  • Uses Lanczos3 filter for high-quality downscaling
  • Maximum dimension constraint prevents oversizing
  • Smaller originals don’t get upscaled
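
The aspect-ratio and no-upscaling properties above amount to a "fit within a bounding dimension" calculation. The Python sketch below illustrates it; the function name `fit_within` and the rounding behaviour are assumptions, and the actual resizer may round differently.

```python
def fit_within(width: int, height: int, max_dim: int) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_dim, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height  # smaller originals are never upscaled
    scale = max_dim / longest
    # Assumption: round to the nearest pixel and never go below 1px.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(3024, 4032, 800))  # (600, 800)
print(fit_within(640, 480, 800))    # (640, 480)
```

For a 3024×4032 portrait photo, a bound of 800px gives 600×800, and an image that already fits is returned unchanged.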

Variant Selection

Clients request a specific variant:

GET /api/files/f1~Qo2E3G8TJZ...?variant=hd

Response: Returns HD variant if available, otherwise falls back to smaller variants.

Automatic Fallback

If the requested variant doesn’t exist, the server returns the best available:

  1. Try requested variant (e.g., hd)
  2. Fall back to next smaller (e.g., md)
  3. Continue until variant found
  4. If no smaller variant exists, return the smallest available variant

Fallback order: xd → hd → md → sd → tn
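
The fallback steps can be expressed as a small selection function. This Python sketch is an illustration only: `select_variant` and `FALLBACK_ORDER` are hypothetical names, the sketch considers size tiers only (not alternative encodings of the same tier), and it reads the last step as "if nothing at or below the requested tier exists, return the smallest variant that does".

```python
FALLBACK_ORDER = ["xd", "hd", "md", "sd", "tn"]  # largest to smallest


def select_variant(available: set[str], requested: str) -> str:
    """Return the requested variant, else the next smaller one, else the smallest available."""
    if not available:
        raise LookupError("file has no variants")
    start = FALLBACK_ORDER.index(requested)  # ValueError for an unknown tier name
    # Steps 1-3: walk from the requested tier towards smaller tiers.
    for name in FALLBACK_ORDER[start:]:
        if name in available:
            return name
    # Step 4: nothing at or below the requested size, so take the smallest that exists.
    for name in reversed(FALLBACK_ORDER):
        if name in available:
            return name
    raise LookupError("no known variant available")


print(select_variant({"tn", "sd", "md"}, "hd"))  # md
```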

Content-Addressing Flow

File storage uses a three-level content-addressing hierarchy:

Level 1: Blob Storage

Upload image → Save as blob → Compute SHA256 of blob bytes → Store blob with ID: b1~{hash}

blob_data = read_file("thumbnail.avif")
blob_id = compute_hash("b", blob_data)
// Result: "b1~abc123..." (thumbnail blob ID)

Example: b1~abc123... identifies the thumbnail AVIF blob

See Content-Addressing & Merkle Trees for hash computation details.
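
For orientation, here is one way `compute_hash` could look in Python. It is a sketch under stated assumptions: it hashes the raw bytes with no extra framing and hard-codes version `1`, following the identifier format described earlier. The referenced Merkle tree documentation remains the authority on the exact hash input.

```python
import base64
import hashlib


def compute_hash(prefix: str, data: bytes) -> str:
    """Return '{prefix}1~{base64url(SHA-256(data))}' with the '=' padding removed."""
    digest = hashlib.sha256(data).digest()                   # 32 raw bytes
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=")  # 43 characters
    return f"{prefix}1~{encoded.decode('ascii')}"


blob_id = compute_hash("b", b"example blob bytes")
print(len(blob_id))  # 46: "b1~" plus the 43-character hash
```

The same function covers file IDs when called with the `f` prefix and the descriptor string encoded as bytes.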

Level 2: Variant Collection

Generate all variants (tn, sd, md, hd) → Each variant gets its own blob ID (b1~...) → Collect all variant metadata → Create descriptor string encoding all variants

variants = [
    { class: "vis.tn", blob_id: "b1~abc123", format: "avif", size: 4096, width: 150, height: 150 },
    { class: "vis.sd", blob_id: "b1~def456", format: "avif", size: 32768, width: 640, height: 480 },
    { class: "vis.md", blob_id: "b1~ghi789", format: "avif", size: 262144, width: 1920, height: 1080 },
]

descriptor = build_descriptor(variants)
// Result: "d2,vis.tn:b1~abc123:f=avif:s=4096:r=150x150;vis.sd:b1~def456:f=avif:s=32768:r=640x480;vis.md:b1~ghi789:f=avif:s=262144:r=1920x1080"
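
A possible shape for `build_descriptor`, written in Python as a sketch: it covers only the mandatory image fields (`f`, `s`, `r`) and emits variants in the order given. Both choices are assumptions. Because the file ID is the hash of this string, whatever ordering and field set the real implementation uses must be reproduced exactly.

```python
def build_descriptor(variants: list[dict]) -> str:
    """Encode variant metadata as 'd2,' followed by semicolon-separated entries."""
    entries = []
    for v in variants:
        fields = [
            v["class"],                       # e.g. "vis.tn"
            v["blob_id"],                     # e.g. "b1~abc123"
            f"f={v['format']}",
            f"s={v['size']}",
            f"r={v['width']}x{v['height']}",
        ]
        entries.append(":".join(fields))
    return "d2," + ";".join(entries)


descriptor = build_descriptor([
    {"class": "vis.tn", "blob_id": "b1~abc123", "format": "avif",
     "size": 4096, "width": 150, "height": 150},
    {"class": "vis.sd", "blob_id": "b1~def456", "format": "avif",
     "size": 32768, "width": 640, "height": 480},
])
print(descriptor)
```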

Level 3: File Descriptor

Build descriptor → Compute SHA256 of descriptor string → Final file ID: f1~{hash} → This file ID goes into action attachments

descriptor = "d2,vis.tn:b1~abc:f=avif:s=4096:r=150x150;vis.sd:b1~def:f=avif:s=32768:r=640x480"
file_id = compute_hash("f", descriptor.as_bytes())
// Result: "f1~Qo2E3G8TJZ..." (file ID)

Example Complete Flow

1. User uploads photo.jpg (3MB, 3024x4032px)

2. System generates variants:
   vis.tn:  150x200px → 4KB   → b1~abc123
   vis.sd:  600x800px → 32KB  → b1~def456
   vis.md:  1440x1920px → 256KB → b1~ghi789
   vis.hd:  2880x3840px → 1MB → b1~jkl012

3. System builds descriptor:
   "d2,vis.tn:b1~abc123:f=avif:s=4096:r=150x200;
       vis.sd:b1~def456:f=avif:s=32768:r=600x800;
       vis.md:b1~ghi789:f=avif:s=262144:r=1440x1920;
       vis.hd:b1~jkl012:f=avif:s=1048576:r=2880x3840"

4. System hashes descriptor:
   file_id = f1~Qo2E3G8TJZ2... = SHA256(descriptor)

5. Action references file:
   POST action attachments = ["f1~Qo2E3G8TJZ2..."]

6. Anyone can verify:
   - Download all variants
   - Verify each blob_id = SHA256(blob)
   - Rebuild descriptor
   - Verify file_id = SHA256(descriptor)
   - Cryptographic proof established ✓
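
The verification in step 6 can be sketched in Python as follows. The helper names `content_id` and `verify_file` are hypothetical, the hash input is assumed to be the raw bytes, and the sketch checks the descriptor as received rather than rebuilding it from variant metadata.

```python
import base64
import hashlib


def content_id(prefix: str, data: bytes) -> str:
    """Assumed ID scheme: prefix, version 1, '~', unpadded base64url SHA-256."""
    digest = hashlib.sha256(data).digest()
    return f"{prefix}1~" + base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def verify_file(file_id: str, descriptor: str, blobs: dict[str, bytes]) -> bool:
    """Check the file ID against the descriptor and every blob against its ID."""
    if content_id("f", descriptor.encode("utf-8")) != file_id:
        return False
    for entry in descriptor.removeprefix("d2,").split(";"):
        blob_id = entry.split(":")[1]
        # A variant that was not downloaded cannot be verified; treat that as a failure.
        if blob_id not in blobs or content_id("b", blobs[blob_id]) != blob_id:
            return False
    return True
```

A caller passes the file ID taken from the action, the descriptor, and a mapping from blob ID to downloaded bytes. Any changed byte in a blob or in the descriptor makes the function return `False`.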

Integration with Action Merkle Tree

File attachments create an extended Merkle tree:

Action (a1~8kR...)
  ├─ Signed by user (ES384)
  ├─ Content-addressed (SHA256 of JWT)
  └─ Attachments: [f1~Qo2...]
       └─ File (f1~Qo2...)
            ├─ Content-addressed (SHA256 of descriptor)
            └─ Descriptor: "d2,vis.tn:b1~abc...;vis.sd:b1~def..."
                 ├─ Blob vis.tn (b1~abc...)
                 │   └─ Content-addressed (SHA256 of blob)
                 ├─ Blob vis.sd (b1~def...)
                 │   └─ Content-addressed (SHA256 of blob)
                 └─ Blob vis.md (b1~ghi...)
                     └─ Content-addressed (SHA256 of blob)

Benefits:

  • Entire tree is cryptographically verifiable
  • Cannot modify image without changing all parent hashes
  • Deduplication: same image = same file_id
  • Federation: remote instances can verify integrity

See Content-Addressing & Merkle Trees for how file content-addressing integrates with the action system.

Image Processing Pipeline

Upload Flow

When a client uploads an image:

  1. Client Request

POST /api/files/image/profile-picture.jpg
Authorization: Bearer <access_token>
Content-Type: image/jpeg
Content-Length: 2458624

<binary image data>

  2. Dimension Extraction

Extract image dimensions to determine which variants to generate:

img = load_image_from_memory(data)
(width, height) = img.dimensions()
max_dim = max(width, height)

if max_dim >= 3840:
    variants = ["tn", "sd", "md", "hd", "xd"]
elif max_dim >= 1920:
    variants = ["tn", "sd", "md", "hd"]
elif max_dim >= 1280:
    variants = ["tn", "sd", "md"]
else:
    variants = ["tn", "sd"]

  3. FileIdGeneratorTask

Create a task to generate the content-addressed ID:

task = FileIdGeneratorTask(
    tn_id,
    temp_file_path="/tmp/upload-abc123",
    original_filename="profile-picture.jpg"
)

file_id_task_id = scheduler.schedule(task)

  4. ImageResizerTask (Multiple)

For each variant, create a resize task:

for variant in variants:
    task = ImageResizerTask(
        tn_id,
        source_file_id=original_id,
        variant=variant,
        target_dimensions=get_variant_dimensions(variant),
        format="avif",  # Primary format
        quality=85,
        dependencies=[file_id_task_id]  # Wait for ID generation
    )

    scheduler.schedule(task)

  5. Hash Computation

FileIdGeneratorTask computes SHA256 hash:

file_id = compute_content_hash("f", file_contents)
# See merkle-tree.md for hash computation details

  6. Blob Storage

Store original in BlobAdapter:

blob_adapter.create_blob_stream(tn_id, file_id, file_stream)

  7. Variant Generation

Each ImageResizerTask runs in the worker pool because resizing is CPU-intensive:

# Execute in worker pool
img = load_image(source_path)

# Resize with Lanczos3 filter (high quality)
resized = img.resize(target_width, target_height, filter=Lanczos3)

# Encode to AVIF
buffer = encode_avif(resized, quality)

# Store variant
variant_id = compute_file_id(buffer)
blob_adapter.create_blob(tn_id, variant_id, buffer)

  8. Metadata Storage

Store file metadata with all variants:

file_metadata = FileMetadata(
    tn_id,
    file_id=descriptor_id,
    original_filename="profile-picture.jpg",
    mime_type="image/jpeg",
    size=original_size,
    variants=[
        Variant(name="tn", blob_id="b1~QoE...46w", format="avif",
                size=4096, width=200, height=200),
        Variant(name="sd", blob_id="b1~xyz...789", format="webp",
                size=32768, width=640, height=480),
        # ... more variants
    ],
    created_at=current_timestamp()
)

meta_adapter.create_file_metadata(tn_id, file_metadata)

  9. Response

Return descriptor ID to client:

{
  "file_id": "f1~QoE...46w",
  "descriptor": "d2,vis.tn:b1~QoE...46w:f=avif:s=4096:r=128x96;vis.sd:b1~xyz...789:f=avif:s=8192:r=640x480",
  "variants": [
    {"name": "vis.tn", "format": "avif", "size": 4096, "dimensions": "128x96"},
    {"name": "vis.sd", "format": "avif", "size": 8192, "dimensions": "640x480"}
  ],
  "processing": true
}

Complete Upload Flow Diagram

Client uploads image
  ↓
POST /api/files/image/filename.jpg
  ↓
Save to temp file
  ↓
Extract dimensions
  ↓
Determine variants to generate
  ↓
Create FileIdGeneratorTask
  ├─ Compute SHA256 hash
  ├─ Move to permanent storage (BlobAdapter)
  └─ Generate file_id
  ↓
Create ImageResizerTask (for each variant)
  ├─ Depends on FileIdGeneratorTask
  ├─ Load source image
  ├─ Resize with Lanczos3
  ├─ Encode to AVIF/WebP/JPEG
  ├─ Compute variant ID (SHA256)
  └─ Store in BlobAdapter
  ↓
Create file descriptor
  ├─ Collect all variant IDs
  ├─ Encode as descriptor
  └─ Store metadata in MetaAdapter
  ↓
Return descriptor ID to client

Download Flow

Client Request

GET /api/files/f1~...?variant=hd
Authorization: Bearer <access_token>

Server Processing

  1. Parse Descriptor

variants = parse_file_descriptor(file_id)
# Returns list of VariantInfo

  2. Select Best Variant

selected = select_best_variant(
    variants,
    requested_variant,   # "hd"
)

# Falls back if exact match not available:
# hd/avif → hd/webp → md/avif → md/webp → sd/avif → ...

  3. Stream from BlobAdapter

stream = blob_adapter.read_blob_stream(tn_id, selected.file_id)

# Set response headers
response.headers["Content-Type"] = f"image/{selected.format}"
response.headers["X-Cloudillo-Variant"] = selected.blob_id
response.headers["X-Cloudillo-Descriptor"] = descriptor
response.headers["Content-Length"] = selected.size

# Stream response
return stream_response(stream)

Response

HTTP/1.1 200 OK
Content-Type: image/avif
Content-Length: 8137
X-Cloudillo-Variant: b1~m8Z35EIa3prvb3bhjsVjdg9SG98xd0bkoWomOHQAwCM
X-Cloudillo-Descriptor: d2,vis.tn:b1~xRAVuQtgBx_kLqZnoOSd5XqCK_aQolhq1XeXk73Zn8U:f=avif:s=1960:r=90x128;vis.sd:b1~m8Z35EIa3prvb3bhjsVjdg9SG98xd0bkoWomOHQAwCM:f=avif:s=8137:r=256x364;vis.orig:b1~5gU72rRGiaogZuYhJy853pBd6PsqjPOjS__Kim9-qE0:f=avif:s=15012:r=256x364
Cache-Control: public, max-age=31536000, immutable

<binary image data>

Note: Content-addressed files are immutable, so responses can be cached for the full max-age (31536000 seconds, one year) without revalidation.

Metadata Structure

FileMetadata

Stored in MetaAdapter:

FileMetadata {
    tn_id: TnId
    file_id: String           # Descriptor ID
    original_filename: String
    mime_type: String
    size: u64                 # Original size
    width: Optional[u32]
    height: Optional[u32]
    variants: List[VariantInfo]
    created_at: i64
    owner: String             # Identity tag
    permissions: FilePermissions
}

VariantInfo {
    name: String              # "tn", "sd", "md", "hd", "xd"
    file_id: String           # Content-addressed ID
    format: String            # "avif", "webp", "jpeg", "png"
    size: u64                 # Bytes
    width: u32
    height: u32
}

FilePermissions {
    public_read: bool
    shared_with: List[String]  # Identity tags
}

File Presets

Concept

Presets define how files should be processed:

FilePreset:
    Image      # Auto-generate variants
    Video      # Future: transcode, thumbnails
    Document   # Future: preview generation
    Database   # RTDB database files
    Raw        # No processing, store as-is

Upload with Preset

POST /api/files/{preset}/{filename}

Examples:
POST /api/files/image/avatar.jpg      // Generate image variants
POST /api/files/raw/document.pdf      // Store as-is

Storage Organization

BlobAdapter Layout

{data_dir}/
├── blobs/
│   ├── {tn_id}/
│   │   ├── f1~QoE...46w           // Original file
│   │   ├── f1~xyz...789           // Variant 1
│   │   ├── f1~abc...123           // Variant 2
│   │   └── ...
│   └── {other_tn_id}/
│       └── ...

MetaAdapter (SQLite)

CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    tn_id INTEGER NOT NULL,
    file_id TEXT NOT NULL,
    original_filename TEXT,
    mime_type TEXT,
    size INTEGER,
    width INTEGER,
    height INTEGER,
    variants TEXT,  -- JSON array
    created_at INTEGER,
    owner TEXT,
    permissions TEXT,  -- JSON object
    UNIQUE(tn_id, file_id)
);

CREATE INDEX idx_files_owner ON files(owner);
CREATE INDEX idx_files_created ON files(created_at);

Performance Considerations

Worker Pool Usage

Image processing is CPU-intensive, so it runs in a worker pool with three priority levels:

# Priority levels
Priority.High   → User-facing operations (thumbnail)
Priority.Medium → Background tasks (other image variants)
Priority.Low    → Longer operations (video upload)

Parallel Processing

Multiple variants can be generated in parallel:

# Create all resize tasks at once
task_ids = []

for variant in ["tn", "sd", "md", "hd"]:
    task_id = scheduler.schedule(ImageResizerTask(
        variant=variant,
        # ...
    ))

    task_ids.append(task_id)

# Wait for all to complete
scheduler.wait_all(task_ids)
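
The same fan-out and wait-for-all pattern can be tried with Python's standard library. This sketch does not use Cloudillo's scheduler: `resize_variant` is a hypothetical stand-in that returns a label instead of image bytes, and a thread pool is used only to keep the example simple (real CPU-bound work in Python would normally go to a process pool).

```python
from concurrent.futures import ThreadPoolExecutor


def resize_variant(variant: str) -> str:
    """Hypothetical stand-in for the real resize work."""
    return f"resized:{variant}"


def generate_all(variants: list[str]) -> dict[str, str]:
    """Submit one job per variant, then block until every job has finished."""
    with ThreadPoolExecutor(max_workers=len(variants)) as pool:
        # Submit everything first so the jobs can run concurrently.
        futures = {name: pool.submit(resize_variant, name) for name in variants}
        # result() blocks per job; collecting all of them is the "wait for all" step.
        return {name: future.result() for name, future in futures.items()}


results = generate_all(["tn", "sd", "md", "hd"])
print(sorted(results))  # ['hd', 'md', 'sd', 'tn']
```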

Caching Strategy

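Content-addressed files are immutable, so responses carry a long-lived, immutable cache policy:

Cache-Control: public, max-age=31536000, immutable

  • Browsers can reuse the cached copy for the full max-age (one year) without revalidating
  • CDNs can cache the response for the same period
  • No cache invalidation is needed, because changed content gets a new ID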

See Also