Demonstration Data

Every aspect of the Clones platform is designed to generate high-quality training data for multimodal computer-use AI agents. From AI-generated tasks to expert demonstrations, we capture comprehensive data that helps train AI to understand and replicate human computer interactions.

Overview

Each recorded demonstration consists of multiple files that work together to provide a complete picture of the user's interaction with the computer during a specific task. The system captures user inputs, screen recording, accessibility tree snapshots, and training annotations.

File Structure

Each demonstration contains five core files:

  1. meta.json - Demonstration metadata and configuration

  2. input_log.jsonl - Detailed event log of all user interactions

  3. input_log_meta.json - Metadata for the input log file

  4. recording.mp4 - Video recording of the screen during the demonstration

  5. sft.json - Supervised fine-tuning annotations for AI training

Additional Files (Storage)

  • checksums.json - File integrity verification with SHA-256 hashes and overall checksum

  • manifest.json - Extended demonstration manifest with task and environment metadata


Schema Versioning

All demonstration files use semantic versioning to ensure backward compatibility:

  • major: Breaking changes (not backward compatible)

  • minor: New features (backward compatible)

  • patch: Bug fixes and minor changes


File Specifications

1. meta.json

Core metadata about the demonstration session, system configuration, and task details.

Key Fields:

  • id: Timestamp-based unique identifier (YYYYMMDD_HHMMSS) generated when recording starts

  • timestamp: ISO 8601 timestamp of when the demonstration was created (includes timezone)

  • duration_seconds: Total demonstration duration calculated from actual video file using FFprobe

  • status: Recording state - "recording" (in progress), "completed" (finished successfully), "failed" (error occurred)

  • reason: End condition - "done" (task completed), "fail" (task failed), or custom termination reason

  • title: Human-readable task name displayed to user

  • description: Detailed task instructions provided to the demonstrator

  • platform: Operating system - "macos" or "windows" (Linux not currently supported)

  • arch: System architecture - "aarch64" (Apple Silicon), "x86_64" (Intel/AMD)

  • version: Operating system version string

  • locale: System language and region setting (e.g., "fr-FR", "en-US")

  • keyboard_layout: Current keyboard layout identifier (e.g., "us-qwerty", "fr-azerty", "de-qwertz" or null if detection fails)

  • primary_monitor: Physical screen dimensions used for video recording (includes scale factor for high-DPI displays)

  • quest: Complete task definition including objectives, reward information, and source pool
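
A trimmed, illustrative example of these fields is sketched below as a typed TypeScript literal; every value is hypothetical, and the nested shapes (for example primary_monitor) are assumptions based on the descriptions above.

```ts
// Hypothetical meta.json contents, shown as a typed literal for readability.
// Field names follow the list above; all values are invented for illustration.
interface DemoMeta {
  id: string;
  timestamp: string;
  duration_seconds: number;
  status: "recording" | "completed" | "failed";
  reason: string;
  title: string;
  description: string;
  platform: "macos" | "windows";
  arch: "aarch64" | "x86_64";
  version: string;
  locale: string;
  keyboard_layout: string | null;
  primary_monitor: { width: number; height: number; scale_factor: number };
}

const exampleMeta: DemoMeta = {
  id: "20250114_153042",
  timestamp: "2025-01-14T15:30:42+01:00",
  duration_seconds: 184.6,
  status: "completed",
  reason: "done",
  title: "Rename a file in Finder",
  description: "Open Finder, locate report.pdf and rename it to report_final.pdf.",
  platform: "macos",
  arch: "aarch64",
  version: "14.5",
  locale: "en-US",
  keyboard_layout: "us-qwerty",
  primary_monitor: { width: 2880, height: 1800, scale_factor: 2.0 },
  // (quest object omitted for brevity)
};
```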

2. input_log.jsonl

Event log capturing all user interactions in JSON Lines format. Each line represents a single event.

Event Structure:

  • event: Type of interaction

  • data: Event-specific details

  • time: Relative timestamp in milliseconds since recording start

Event Types:

Mouse Events:

  • mousemove: Cursor position changes with dual coordinate system

  • mousedown/mouseup: Button press/release with normalized positions

  • mousewheel: Scroll events with delta values

Keyboard Events:

  • keydown/keyup: Key press/release with layout-aware detection

System Events:

  • ffmpeg_stderr/ffmpeg_stdout: Video encoding logs

  • axtree_interaction: UI accessibility tree snapshots
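
The sketch below illustrates what a few consecutive lines of input_log.jsonl might contain; the payload fields are simplified and the specific values are hypothetical.

```ts
// Hypothetical, simplified input_log.jsonl lines, typed for readability.
// Each object would be serialized as one JSON line on disk.
interface InputEvent {
  event: string;                  // interaction type
  data: Record<string, unknown>;  // event-specific details
  time: number;                   // ms since recording start
}

const sampleEvents: InputEvent[] = [
  { event: "mousemove", data: { x: 512, y: 384, raw_x: 1024, raw_y: 768 }, time: 1532 },
  { event: "mousedown", data: { button: "Left", x: 512, y: 384, raw_x: 1024, raw_y: 768 }, time: 1610 },
  { event: "mouseup",   data: { button: "Left", x: 512, y: 384, raw_x: 1024, raw_y: 768 }, time: 1695 },
  { event: "keydown",   data: { key: "KeyQ", actual_char: "a" }, time: 2310 },
  { event: "keyup",     data: { key: "KeyQ" }, time: 2392 },
];

// One line per event when written to disk:
const jsonlText = sampleEvents.map((e) => JSON.stringify(e)).join("\n");
```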

Mouse Coordinate System

The system provides dual coordinate tracking for comprehensive mouse event capture:

Normalized Coordinates (x, y):

  • Reference Frame: Relative to primary monitor (0,0 = top-left corner)

  • Units: Logical pixels (DPI-independent)

  • Bounds: Clamped to monitor boundaries [0, monitor_width] × [0, monitor_height]

  • Multi-Monitor: Always positive values, even with secondary displays

  • Use Case: Training models with consistent coordinate space

Raw Coordinates (raw_x, raw_y):

  • Reference Frame: System's global coordinate space

  • Units: Physical pixels (DPI-dependent)

  • Bounds: May be negative or exceed primary monitor bounds

  • Multi-Monitor: Reflects actual system coordinates across all displays

  • Use Case: Debugging, multi-monitor analysis, system-level automation

Processing in Quality Agent: The extraction pipeline (simple-extractor.ts) prioritizes normalized coordinates for model training:
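
The extractor's source is not reproduced here; a minimal sketch of that prioritization, assuming the x/y and raw_x/raw_y fields described above, might look like this:

```ts
// Sketch only: prefer normalized (x, y) coordinates and fall back to raw ones.
// Field names follow the coordinate description above; the helper is hypothetical.
interface MouseData { x?: number; y?: number; raw_x?: number; raw_y?: number }

function extractTrainingCoords(data: MouseData): { x: number; y: number } | null {
  if (data.x !== undefined && data.y !== undefined) {
    return { x: data.x, y: data.y };          // normalized, DPI-independent
  }
  if (data.raw_x !== undefined && data.raw_y !== undefined) {
    return { x: data.raw_x, y: data.raw_y };  // raw fallback when no normalized values exist
  }
  return null;                                // event carries no usable position
}
```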

Training Data Format: The message formatter (message-formatter.ts) outputs normalized coordinates for SFT:
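
As a hedged sketch (the real formatter may differ), a click event could be rendered into a Python-style action string of the kind used in the sft.json assistant turns:

```ts
// Sketch: render a click at normalized coordinates as a Python-style action string.
// The exact action syntax emitted by message-formatter.ts may differ.
function formatClickAction(x: number, y: number): string {
  return `click(x=${Math.round(x)}, y=${Math.round(y)})`;
}

// e.g. formatClickAction(512.4, 384.7) -> "click(x=512, y=385)"
```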

Coordinate Normalization Examples:
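
For instance, on a 2880x1800 Retina display with a 2.0 scale factor, the logical (normalized) space is 1440x900. The following is an illustrative sketch of the conversion and clamping; the recorder's exact formula is not shown here, and the assumption is that raw physical pixels are divided by the scale factor and clamped to the primary monitor.

```ts
// Illustrative normalization: convert raw physical coordinates to logical pixels
// and clamp them to the primary monitor. Dimensions and scale factor come from meta.json.
function normalize(
  rawX: number,
  rawY: number,
  monitor: { width: number; height: number; scale_factor: number }
) {
  const logicalW = monitor.width / monitor.scale_factor;   // 2880 / 2.0 = 1440
  const logicalH = monitor.height / monitor.scale_factor;  // 1800 / 2.0 = 900
  const clamp = (v: number, max: number) => Math.min(Math.max(v, 0), max);
  return {
    x: clamp(rawX / monitor.scale_factor, logicalW),
    y: clamp(rawY / monitor.scale_factor, logicalH),
  };
}

// A cursor on a secondary display to the left of the primary (raw_x = -400)
// clamps to x = 0 in the normalized space:
// normalize(-400, 900, { width: 2880, height: 1800, scale_factor: 2.0 }) -> { x: 0, y: 450 }
```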

Accessibility Tree Events (axtree_interaction)

Captures comprehensive UI structure snapshots triggered by significant user interactions. This is the core mechanism for training AI agents to understand application context and verify correct app usage during grading.
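
The shape below is a hypothetical axtree_interaction event; the field names are illustrative and real tree entries carry considerably more detail.

```ts
// Hypothetical shape of a single axtree_interaction event.
const axtreeEvent = {
  event: "axtree_interaction",
  time: 4210,
  data: {
    trigger: "mousedown",
    focused_app: "TextEdit",
    app_status: "ready",          // "ready" | "launching" | "unknown", see below
    tree: [
      { app: "TextEdit", role: "AXWindow", title: "Untitled", children: 42 },
      { app: "Finder",   role: "AXWindow", title: "Documents", children: 17 },
    ],
  },
};
```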

Triggering Logic: AXTree events are intelligently triggered only for meaningful interactions:

  • Mouse clicks (mousedown/mouseup)

  • Navigation keys (Tab, Arrow keys, Enter, Escape, Space)

  • Function keys (F1-F12)

  • Page navigation (PageUp, PageDown, Home, End)

  • Significant scroll events (delta > 1.0 to filter micro-movements)

  • 500ms debouncing prevents excessive captures from rapid interactions

Focus Detection & App Status: The system implements sophisticated focus tracking to determine application readiness:

App Status Determination: The app_status field is crucial for training and grading accuracy:

  • "ready": Application is fully loaded and functional

    • Focused app name matches an entry in the available apps tree

    • All UI elements are accessible and interactive

    • Window content is fully rendered

  • "launching": Application is starting but not yet functional

    • Focused app is detected but not yet in the accessibility tree

    • UI elements may be partially loaded or unresponsive

    • Transitional state during app startup

  • "unknown": Cannot determine application state

    • No focused app detected

    • Accessibility tree is empty or inaccessible

    • System is in an indeterminate state

Focus Detection Algorithm:

  1. Get Focused App: Uses platform-specific APIs (AX on macOS, UI Automation on Windows)

  2. Cross-Reference: Checks if focused app appears in the accessibility tree

  3. String Matching: Performs case-insensitive substring matching between focused app name and tree entries

  4. Status Assignment: Determines readiness based on presence and accessibility
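
A condensed sketch of steps 2-4 above (step 1's platform API calls are omitted, and the helper signature is an assumption):

```ts
// Sketch of the cross-reference and status assignment described above.
type AppStatus = "ready" | "launching" | "unknown";

function determineAppStatus(focusedApp: string | null, treeAppNames: string[]): AppStatus {
  if (!focusedApp || treeAppNames.length === 0) return "unknown"; // no focus or empty tree
  const needle = focusedApp.toLowerCase();
  const inTree = treeAppNames.some(
    (name) => name.toLowerCase().includes(needle) || needle.includes(name.toLowerCase())
  );
  return inTree ? "ready" : "launching"; // focused but not yet in the tree => launching
}
```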

Grading Integration: This data enables precise task validation:

  • Correct App Usage: Verifies the demonstrated app matches task requirements

  • Interaction Timing: Ensures actions occur when apps are ready, not during loading

  • UI Element Validation: Confirms interactions target appropriate interface elements

  • Focus Tracking: Detects context switches and multi-app workflows

Anti-Cheating Measures:

  • Clones App Filtering: Automatically excludes Clones desktop app from captures

  • System App Exclusion: Filters out Window Server, Dock, Spotlight, and other system processes

  • Focus Validation: Ensures demonstrated actions occur in the intended application context

Keyboard Layout Detection:

The system captures both physical key presses and their semantic meaning across different keyboard layouts:

Physical Key Detection:

  • Captures the actual physical key pressed (e.g., the key in the "A" position)

  • Records raw key codes independent of layout

  • Provides detection_method field indicating capture technique used

Layout-Aware Processing:

  • QWERTY: Standard US/UK layouts with Key format ("KeyA", "KeyB")

  • AZERTY: French layout where A/Q keys are swapped

  • QWERTZ: German layout where Y/Z keys are swapped

  • Dvorak: Alternative efficiency-focused layouts

Detection Methods:

  • "rdev_cross_platform": Cross-platform detection using rdev library

  • "multiinput_windows": Windows-specific hardware-level capture

  • Layout-dependent flag: Indicates whether the key output varies by keyboard layout

Key Event Pair Structure: Each key interaction generates two events with complementary information:

keydown Event (Character Production):
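
A hypothetical keydown event on an AZERTY layout; only key, actual_char, detection_method, and the layout-dependent flag are documented above, and all values are illustrative.

```ts
// Sketch of a keydown event: physical key plus the character it produced.
const keydownEvent = {
  event: "keydown",
  time: 2310,
  data: {
    key: "KeyQ",                          // physical key position (layout-independent)
    actual_char: "a",                     // character produced on an AZERTY layout
    detection_method: "rdev_cross_platform",
    layout_dependent: true,
  },
};
```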

keyup Event (Physical Release):
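
The matching keyup event, again as an illustrative sketch, carries only the physical release and its timing:

```ts
// Sketch of the matching keyup event: physical release only, used for key duration.
const keyupEvent = {
  event: "keyup",
  time: 2392,                             // 82 ms after the keydown above
  data: {
    key: "KeyQ",
    detection_method: "rdev_cross_platform",
  },
};
```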

Character Extraction Process: The actual_char field uses Unicode format parsing:
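
The parser itself is not shown in this document; assuming the capture layer reports produced characters as strings of the form Unicode("a"), a sketch of the extraction could be:

```ts
// Sketch only: pull the produced character out of a string of the assumed form
// 'Unicode("a")'; returns null when the key produced no printable character.
function extractActualChar(rawKeyRepr: string): string | null {
  const match = rawKeyRepr.match(/^Unicode\("(.+)"\)$/);
  return match ? match[1] : null;
}

// extractActualChar('Unicode("a")') -> "a"
// extractActualChar('Escape')       -> null
```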

Layout Demonstration (AZERTY keyboard):

  • Physical Key: KeyQ (Q key location on QWERTY)

  • Produced Character: "A" (because on AZERTY, the Q position produces A)

  • Training Benefit: Models learn both the physical action AND the semantic result

Important Distinctions:

  • key: Physical key location (hardware-level, layout-independent)

  • actual_char: Character that appears on screen (layout-dependent, semantic result)

  • keydown: Captures character production and typing intent

  • keyup: Captures physical key release timing (duration calculation)

This dual-event system allows precise reconstruction of both typing mechanics and semantic content across any keyboard layout.

Key Features:

  • Cross-platform compatibility: Handles Windows and macOS with unified event format

  • Layout-aware keyboard detection: Distinguishes between physical keys and produced characters

  • Coordinate normalization: Clamps mouse coordinates to screen bounds with raw coordinate preservation

  • Intelligent UI triggering: Only captures accessibility snapshots for significant interactions

  • Focus-aware grading: App status detection enables accurate task validation

  • Anti-tampering: Built-in filtering of system apps and recording software

3. input_log_meta.json

Metadata describing the input log file structure and content.

  • format: Currently always "jsonl" (JSON Lines); other formats such as Parquet or Arrow may be supported in the future

  • event_count: Total number of events in the log

  • timestamp_type: "relative" (milliseconds since recording start) or "absolute"
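
An illustrative input_log_meta.json (all values hypothetical):

```ts
// Hypothetical input_log_meta.json contents.
const inputLogMeta = {
  format: "jsonl",            // currently always JSON Lines
  event_count: 18342,         // number of lines in input_log.jsonl
  timestamp_type: "relative", // milliseconds since recording start
};
```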

4. recording.mp4

High-quality screen recording captured during the demonstration:

  • Codec: H.264 (libx264)

  • Container: MP4

  • Frame Rate: Variable (typically 30fps)

  • Resolution: Matches physical screen resolution (e.g., 2880x1800 for Retina displays)

  • Bitrate: Adaptive based on content

The recording captures:

  • All visual changes on the screen

  • Mouse cursor movements and clicks

  • Window focus changes

  • Application launches and interactions

5. sft.json

Supervised Fine-Tuning annotations that convert raw demonstrations into structured training data for AI models. This file contains the conversation format with user messages, screenshots, and AI responses.

Key Features:

  • Multi-turn conversations: Sequences of user instructions and AI responses

  • Screenshot integration: Base64-encoded images at key interaction points

  • Timestamp alignment: Synchronized with input_log.jsonl events

  • Training-ready format: Compatible with instruction-tuning frameworks
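
A hedged sketch of a single turn pair is shown below; the real sft.json schema may use different field names, and the screenshot data is truncated for brevity.

```ts
// Hypothetical sketch of one sft.json exchange; field names are assumptions.
const sftExcerpt = {
  messages: [
    {
      role: "user",
      content: "Rename report.pdf to report_final.pdf in Finder.",
      image: "data:image/png;base64,iVBORw0KGgo...", // screenshot at this point (truncated)
      timestamp_ms: 1532,                            // aligned with input_log.jsonl
    },
    {
      role: "assistant",
      content: "click(x=512, y=384)",                // Python-style action annotation
      timestamp_ms: 1610,
    },
  ],
};
```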


Storage and Distribution

File Integrity

checksums.json - Integrity Verification

Each demonstration includes a comprehensive integrity manifest that ensures data authenticity and detects tampering:

Key Fields:

  • demoHash: Content-addressable identifier generated from submission ID, user address, and timestamp

  • submissionId: Unique identifier linking to the demonstration submission in the database

  • userAddress: Ethereum address of the user who created the demonstration (lowercase)

  • timestamp: Unix timestamp (milliseconds) when the demonstration was stored

  • files: Array of file integrity records with SHA-256 hashes, sizes, and modification times

  • overallHash: SHA-256 hash computed from all individual file hashes for tamper detection
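
An illustrative checksums.json layout follows; the hash values are placeholders, not real digests, and the nested file-record shape is an assumption based on the field list above.

```ts
// Hypothetical checksums.json layout; hash values are placeholders.
const checksums = {
  demoHash: "<hash of submissionId + userAddress + timestamp>",
  submissionId: "sub_01HEXAMPLE",
  userAddress: "0x1234...abcd",          // lowercase Ethereum address
  timestamp: 1736867442000,              // Unix ms at storage time
  files: [
    { name: "meta.json",     sha256: "<sha-256>", size: 2048,     mtime: 1736867441000 },
    { name: "recording.mp4", sha256: "<sha-256>", size: 52428800, mtime: 1736867441000 },
  ],
  overallHash: "<sha-256 over all individual file hashes>",
};
```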

Integrity Verification Process:

  1. Individual File Verification: Each file's current SHA-256 is compared against stored hash

  2. Size Validation: File sizes are checked against recorded values

  3. Overall Hash Check: Combined hash of all file hashes is verified

  4. Accessibility Test: Ensures all files remain accessible in storage
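
A minimal Node.js-style sketch of steps 1 and 3, assuming overallHash is a SHA-256 over the concatenated per-file hashes (the exact combination rule and ordering may differ):

```ts
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Sketch: verify one file's digest and recompute the overall hash.
function sha256Hex(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

function verifyFile(path: string, expectedHash: string): boolean {
  return sha256Hex(readFileSync(path)) === expectedHash;
}

function recomputeOverallHash(fileHashes: string[]): string {
  // Assumption: individual hashes are concatenated in file order before hashing.
  return sha256Hex(fileHashes.join(""));
}
```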

manifest.json - Extended Metadata

Provides additional context and metadata beyond the core meta.json file:

Key Fields:

  • demonstration_id: Same as demoHash, provides content-addressable reference

  • task.type: Classification of task type ("computer_use", "web_navigation", etc.)

  • task.description: User-friendly task description for training context

  • task.url: Icon URL or reference URL for the primary application

  • environment.os: Operating system for platform-specific analysis

  • environment.browser: Application context ("desktop_app", browser name, etc.)

  • environment.screen_resolution: Display resolution for coordinate context

Use Cases:

  • Dataset Organization: Categorizing demonstrations by task type and environment

  • Quality Metrics: Tracking performance across different platforms and applications

  • Training Optimization: Filtering datasets by environment characteristics

  • Research Analysis: Understanding task distribution and completion patterns

API Access

⚠️ Access Control: Demonstration data requires authentication and authorization. Users can only access their own demonstration files through wallet signature verification.

Demonstrations are accessible via RESTful API endpoints:

  • List files: GET /api/v1/forge/demo-files/{submissionId} (requires session auth)

  • Download file: GET /api/v1/forge/demo-files/{submissionId}/{filename} (requires session auth)

  • Verify integrity: GET /api/v1/forge/demo-files/{submissionId}/verify (requires session auth)

Authentication Requirements:

  • Valid wallet session with matching user address

  • Case-insensitive address matching for submission ownership

  • Rate limiting applied to all endpoints

Available Files via API:

  • All core demonstration files (meta.json, input_log.jsonl, input_log_meta.json, recording.mp4, sft.json)

  • Integrity verification file (checksums.json)

  • Extended metadata (manifest.json)

  • Files listed with download URLs, sizes, and SHA-256 hashes for client-side verification

Response Formats:

  • JSON files: application/json

  • Video files: video/mp4 or text/plain (base64)

  • Input logs: application/x-ndjson
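
A client-side sketch of the list and verify calls is shown below; the base URL, session handling, and response shapes are assumptions.

```ts
// Sketch only: list a submission's files and request integrity verification.
// Authentication is assumed to ride on the wallet session cookie.
const BASE = "https://api.example.com"; // placeholder base URL

async function listDemoFiles(submissionId: string) {
  const res = await fetch(`${BASE}/api/v1/forge/demo-files/${submissionId}`, {
    credentials: "include", // send the wallet session cookie
  });
  if (!res.ok) throw new Error(`List failed: ${res.status}`);
  return res.json(); // expected: file names, download URLs, sizes, SHA-256 hashes
}

async function verifyDemo(submissionId: string) {
  const res = await fetch(`${BASE}/api/v1/forge/demo-files/${submissionId}/verify`, {
    credentials: "include",
  });
  if (!res.ok) throw new Error(`Verify failed: ${res.status}`);
  return res.json();
}
```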


Platform Specifics

Windows

  • Input capture: Multiinput library for hardware-level input capture of keyboards, mice, and joysticks

  • Screen recording: GDI capture via FFmpeg's gdigrab with the desktop as the capture input

  • Accessibility: Windows UI Automation (planned future implementation)

  • Keyboard layouts: KLID registry detection with caching for performance

macOS

  • Input capture: rdev library combined with Core Foundation for absolute mouse positioning

  • Screen recording: AVFoundation with automatic device detection

  • Accessibility: Native AX API with CGWindowList fallback for comprehensive UI tree capture

  • Permissions: Requires Accessibility permission in System Settings → Privacy & Security → Accessibility

  • Keyboard layouts: TIS (Text Input Source) API with intelligent detection and fallback to null on errors


Quality Metrics

The system tracks several quality indicators:

  • Completion rate: Whether the task was successfully completed

  • Interaction efficiency: Number of actions relative to task complexity

  • Error rate: Failed interactions or corrections during demonstration

  • Timestamp accuracy: Synchronization between events and video

  • File integrity: SHA-256 verification of all components


Use Cases

AI Training

  • Multimodal learning: Screenshots paired with interaction events

  • Instruction following: User commands with corresponding actions

  • Error correction: Learning from failed attempts and corrections

Research and Analysis

  • UI interaction patterns: Understanding how users navigate applications

  • Cross-platform behavior: Comparing interaction patterns across OS

  • Accessibility analysis: How users interact with different UI elements

Quality Assurance

  • Automated testing: Replay interactions for regression testing

  • User experience: Analyzing task completion efficiency

  • Application usability: Identifying common user difficulties


Technical Implementation

Recording Pipeline

  1. Initialization: Set up FFmpeg, input listeners, and accessibility monitoring

  2. Event capture: Parallel threads for input events, screen recording, and UI snapshots

  3. Synchronization: Relative timestamps ensure precise event alignment

  4. Finalization: Generate metadata, verify integrity, and package files

Event Processing

  • Debouncing: Prevents excessive UI snapshots (500ms minimum interval)

  • Filtering: Ignores system applications and focus changes to Clones app

  • Normalization: Coordinate clamping and cross-platform key mapping

Storage Architecture

  • Distributed storage: Object storage with content-addressable hashing

  • Deduplication: File-level deduplication based on SHA-256

  • Compression: Efficient storage of large video files and JSON data

Conclusion: Toward Comprehensive Computer Use Agent Training

Together, these seven files (the five core demonstration files plus the two storage files) define a complete demonstration that captures human-computer interaction in fine-grained detail. When multiple high-quality demonstrations are combined, they form comprehensive training datasets designed specifically for Computer Use Agent (CUA) development.

Dataset Formation and AI Training Pipeline

The ultimate objective of this demonstration data format is to enable the creation of robust training datasets through the following pipeline:

  1. Quality Assessment: Each demonstration is evaluated by the Clones Quality Agent service, which scores demonstrations based on task completion, interaction efficiency, and data integrity

  2. Dataset Curation: High-scoring demonstrations are aggregated into structured datasets, filtered by task complexity, application domains, and success metrics

  3. Model Training: These curated datasets are used for supervised fine-tuning of AI models, enabling them to learn multimodal computer interaction patterns

  4. Agent Deployment: Trained models power Computer Use Agents that can autonomously execute complex tasks on users' computers with human-level proficiency

Evaluation Framework for CUA Training Efficacy

We assess whether this data format provides sufficient structure and richness for effective multitasking Computer Use Agent training through six critical dimensions:

1. Data Integrity and Synchronization

  • Hash Consistency: SHA-256 verification ensures data authenticity and tamper detection

  • Temporal Alignment: Precise synchronization between input logs, video frames, and accessibility snapshots

  • Schema Versioning: Backward-compatible evolution prevents data corruption during dataset updates

2. Event Log Richness and Precision

  • Complete User Actions: Comprehensive capture of clicks, keystrokes, scrolls, and drag operations with millisecond precision

  • Coordinate Systems: Dual tracking (normalized/raw) ensures training consistency across display configurations

  • Layout Awareness: Physical key mapping combined with semantic character output supports universal keyboard compatibility

3. Task Context and Clarity

  • Objective Definition: Clear task descriptions with structured step-by-step objectives

  • Application Context: Detailed metadata about target applications, environments, and system configurations

  • Instruction Quality: Natural language prompts that establish clear expectations for agent behavior

4. Video-Event Correlation

  • Frame Synchronization: Video timestamps precisely aligned with input events for multimodal learning

  • Visual Validation: Screen recordings provide ground truth for verifying logged interactions

  • State Transitions: Visual evidence of UI changes corresponding to recorded user actions

5. SFT File Structure and Coherence

  • Conversation Flow: Logical sequence of user instructions and assistant actions in training format

  • Screenshot Integration: Base64-encoded images at critical interaction points for visual context

  • Focus Transitions: Accurate tracking of application switches and window management

  • Action Annotations: Python-style function calls that translate human actions into executable agent commands

6. Action Sequence Reproducibility

  • Deterministic Reconstruction: Sufficient detail to replay interactions with identical outcomes

  • State Consistency: AXTree snapshots provide UI state verification at key interaction points

  • Error Recovery: Documentation of failed actions and corrections for robust agent training

  • Multi-Application Workflows: Support for complex tasks spanning multiple applications and contexts

Training Dataset Characteristics

This format produces datasets with several key advantages for CUA development:

  • Multimodal Learning: Screenshots paired with precise interaction events enable visual-motor learning

  • Cross-Platform Generalization: Consistent data structure across Windows and macOS environments

  • Task Diversity: Support for web browsing, document creation, file management, and application-specific workflows

  • Quality Assurance: Built-in integrity verification and grading systems ensure high-quality training examples

  • Scalable Architecture: Content-addressable storage and API access support large-scale dataset management

This data format establishes a foundation for developing Computer Use Agents that can reliably execute complex tasks in real-world computing environments. Every demonstration captures the full context needed to train capable computer-use AI agents while remaining compatible across platforms and use cases.
