Demonstration Data
Every aspect of the Clones platform is designed to generate high-quality training data for multimodal computer-use AI agents. From AI-generated tasks to expert demonstrations, we capture comprehensive data that helps train AI to understand and replicate human computer interactions.
Overview
Each recorded demonstration consists of multiple files that work together to provide a complete picture of the user's interaction with the computer during a specific task. The system captures user inputs, screen recording, accessibility tree snapshots, and training annotations.
File Structure
Each demonstration contains five core files:
meta.json - Demonstration metadata and configuration
input_log.jsonl - Detailed event log of all user interactions
input_log_meta.json - Metadata for the input log file
recording.mp4 - Video recording of the screen during the demonstration
sft.json - Supervised fine-tuning annotations for AI training
Additional Files (Storage)
checksums.json - File integrity verification with SHA-256 hashes and an overall checksum
manifest.json - Extended demonstration manifest with task and environment metadata
Schema Versioning
All demonstration files use semantic versioning to ensure backward compatibility:
major: Breaking changes (not backward compatible)
minor: New features (backward compatible)
patch: Bug fixes and minor changes
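For illustration, a reader can treat a file as compatible when its schema shares the major version the reader was built against. The helper below is a minimal sketch of that rule; the function name and the way the version string is obtained are assumptions, not part of the format.

```ts
// Minimal sketch: a file is readable if its schema's major version matches
// the major version the reader targets (hypothetical helper).
function isBackwardCompatible(fileVersion: string, readerVersion: string): boolean {
  const majorOf = (v: string) => Number(v.split(".")[0]);
  return majorOf(fileVersion) === majorOf(readerVersion);
}

// Example: "1.3.2" is readable by a reader targeting "1.1.0"; "2.0.0" is not.
console.log(isBackwardCompatible("1.3.2", "1.1.0")); // true
console.log(isBackwardCompatible("2.0.0", "1.1.0")); // false
```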
File Specifications
1. meta.json
Core metadata about the demonstration session, system configuration, and task details.
Key Fields:
id: Timestamp-based unique identifier (YYYYMMDD_HHMMSS) generated when recording starts
timestamp: ISO 8601 timestamp of when the demonstration was created (includes timezone)
duration_seconds: Total demonstration duration calculated from the actual video file using FFprobe
status: Recording state - "recording" (in progress), "completed" (finished successfully), "failed" (error occurred)
reason: End condition - "done" (task completed), "fail" (task failed), or a custom termination reason
title: Human-readable task name displayed to the user
description: Detailed task instructions provided to the demonstrator
platform: Operating system - "macos" or "windows" (Linux not currently supported)
arch: System architecture - "aarch64" (Apple Silicon), "x86_64" (Intel/AMD)
version: Operating system version string
locale: System language and region setting (e.g., "fr-FR", "en-US")
keyboard_layout: Current keyboard layout identifier (e.g., "us-qwerty", "fr-azerty", "de-qwertz", or null if detection fails)
primary_monitor: Physical screen dimensions used for video recording (includes scale factor for high-DPI displays)
quest: Complete task definition including objectives, reward information, and source pool
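The sketch below shows how these fields fit together in a single object. All values, and the nested shapes of primary_monitor and quest, are hypothetical and shown only for orientation.

```ts
// Illustrative meta.json shape; values and nested shapes are hypothetical.
const meta = {
  id: "20240115_143022",                  // YYYYMMDD_HHMMSS, assigned when recording starts
  timestamp: "2024-01-15T14:30:22+01:00", // ISO 8601 with timezone
  duration_seconds: 187.4,                // measured from the video via FFprobe
  status: "completed",                    // "recording" | "completed" | "failed"
  reason: "done",                         // "done" | "fail" | custom termination reason
  title: "Create a budget spreadsheet",
  description: "Open the spreadsheet app and ...",
  platform: "macos",                      // "macos" | "windows"
  arch: "aarch64",
  version: "14.2.1",
  locale: "en-US",
  keyboard_layout: "us-qwerty",           // or null if detection fails
  primary_monitor: { width: 1440, height: 900, scale_factor: 2 }, // hypothetical shape
  quest: { /* objectives, reward information, source pool */ },
};
```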
2. input_log.jsonl
Event log capturing all user interactions in JSON Lines format. Each line represents a single event.
Event Structure:
event: Type of interaction
data: Event-specific details
time: Relative timestamp in milliseconds since recording start
Event Types:
Mouse Events:
mousemove: Cursor position changes with dual coordinate system
mousedown/mouseup: Button press/release with normalized positions
mousewheel: Scroll events with delta values
Keyboard Events:
keydown/keyup: Key press/release with layout-aware detection
System Events:
ffmpeg_stderr/ffmpeg_stdout: Video encoding logs
axtree_interaction: UI accessibility tree snapshots
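The following lines illustrate what a few parsed entries might look like. The event/data/time envelope matches the structure above, but the exact payload fields (button names, for example) are assumptions.

```ts
// Illustrative input_log.jsonl entries (one JSON object per line in the file).
// Payload details beyond the documented fields are hypothetical.
const exampleEvents = [
  { event: "mousemove", data: { x: 512, y: 384, raw_x: 1024, raw_y: 768 }, time: 1250 },
  { event: "mousedown", data: { button: "left", x: 512, y: 384 }, time: 1310 },
  { event: "keydown", data: { key: "KeyQ", actual_char: "a" }, time: 2105 },
  { event: "keyup", data: { key: "KeyQ" }, time: 2188 },
];
```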
Mouse Coordinate System
The system provides dual coordinate tracking for comprehensive mouse event capture:
Normalized Coordinates (x, y):
Reference Frame: Relative to primary monitor (0,0 = top-left corner)
Units: Logical pixels (DPI-independent)
Bounds: Clamped to monitor boundaries [0, monitor_width] × [0, monitor_height]
Multi-Monitor: Always positive values, even with secondary displays
Use Case: Training models with consistent coordinate space
Raw Coordinates (raw_x, raw_y):
Reference Frame: System's global coordinate space
Units: Physical pixels (DPI-dependent)
Bounds: May be negative or exceed primary monitor bounds
Multi-Monitor: Reflects actual system coordinates across all displays
Use Case: Debugging, multi-monitor analysis, system-level automation
Processing in Quality Agent: The extraction pipeline (simple-extractor.ts) prioritizes normalized coordinates for model training.
Training Data Format: The message formatter (message-formatter.ts) outputs normalized coordinates for SFT.
Coordinate Normalization Example:
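As an illustration of the rules above, the sketch below clamps positions to the primary monitor and converts physical to logical pixels. The helper name and monitor shape are assumptions, not the actual extractor code.

```ts
// Minimal sketch of the normalization described above (hypothetical helper).
// Raw coordinates are physical pixels in the global coordinate space;
// normalized coordinates are logical pixels clamped to the primary monitor.
interface Monitor { width: number; height: number; scale_factor: number }

function normalize(rawX: number, rawY: number, monitor: Monitor) {
  const clamp = (v: number, max: number) => Math.min(Math.max(v, 0), max);
  return {
    x: clamp(rawX / monitor.scale_factor, monitor.width),
    y: clamp(rawY / monitor.scale_factor, monitor.height),
    raw_x: rawX, // raw values are preserved alongside
    raw_y: rawY,
  };
}

// Example: on a 1440x900 logical display with scale factor 2, a raw position
// of (-120, 1900) clamps to (0, 900).
console.log(normalize(-120, 1900, { width: 1440, height: 900, scale_factor: 2 }));
```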
Accessibility Tree Events (axtree_interaction)
Captures comprehensive UI structure snapshots triggered by significant user interactions. This is the core mechanism for training AI agents to understand application context and verify correct app usage during grading.
Triggering Logic: AXTree events are triggered only for meaningful interactions, as sketched after the list below:
Mouse clicks (mousedown/mouseup)
Navigation keys (Tab, Arrow keys, Enter, Escape, Space)
Function keys (F1-F12)
Page navigation (PageUp, PageDown, Home, End)
Significant scroll events (delta > 1.0 to filter micro-movements)
500ms debouncing prevents excessive captures from rapid interactions
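A minimal sketch of this triggering policy follows; the helper name, key identifiers, and event payload shape are assumptions, while the thresholds come from the list above.

```ts
// Sketch of the triggering rules listed above (hypothetical helper).
const NAV_KEYS = new Set([
  "Tab", "ArrowUp", "ArrowDown", "ArrowLeft", "ArrowRight",
  "Enter", "Escape", "Space", "PageUp", "PageDown", "Home", "End",
]);
const DEBOUNCE_MS = 500;
let lastCaptureAt = -Infinity;

function shouldCaptureAxtree(event: string, data: any, time: number): boolean {
  const significant =
    event === "mousedown" || event === "mouseup" ||
    (event === "keydown" &&
      (NAV_KEYS.has(data.key) || /^F([1-9]|1[0-2])$/.test(data.key))) ||
    (event === "mousewheel" && Math.abs(data.delta) > 1.0); // filter micro-scrolls
  if (!significant || time - lastCaptureAt < DEBOUNCE_MS) return false;
  lastCaptureAt = time;
  return true;
}
```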
Focus Detection & App Status: The system implements sophisticated focus tracking to determine application readiness:
App Status Determination: The app_status field is crucial for training and grading accuracy:
"ready": Application is fully loaded and functional
  Focused app name matches an entry in the available apps tree
  All UI elements are accessible and interactive
  Window content is fully rendered
"launching": Application is starting but not yet functional
  Focused app is detected but not yet in the accessibility tree
  UI elements may be partially loaded or unresponsive
  Transitional state during app startup
"unknown": Cannot determine application state
  No focused app detected
  Accessibility tree is empty or inaccessible
  System is in an indeterminate state
Focus Detection Algorithm:
Get Focused App: Uses platform-specific APIs (AX on macOS, UI Automation on Windows)
Cross-Reference: Checks if focused app appears in the accessibility tree
String Matching: Performs case-insensitive substring matching between focused app name and tree entries
Status Assignment: Determines readiness based on presence and accessibility
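A sketch of this algorithm follows, assuming the snapshot exposes the focused app name and the list of app names present in the tree; the function and parameter names are hypothetical.

```ts
// Sketch of the status assignment described above (hypothetical names and shapes).
type AppStatus = "ready" | "launching" | "unknown";

function determineAppStatus(focusedApp: string | null, treeApps: string[]): AppStatus {
  // No focused app, or an empty/inaccessible tree, means the state is unknown.
  if (!focusedApp || treeApps.length === 0) return "unknown";
  const needle = focusedApp.toLowerCase();
  // Case-insensitive substring matching between the focused app and tree entries.
  const inTree = treeApps.some(
    (name) => name.toLowerCase().includes(needle) || needle.includes(name.toLowerCase()),
  );
  return inTree ? "ready" : "launching";
}

// Example: the focused app is reported but has not yet appeared in the
// accessibility tree, so the snapshot is tagged "launching".
console.log(determineAppStatus("Numbers", ["Finder", "Safari"])); // "launching"
```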
Grading Integration: This data enables precise task validation:
Correct App Usage: Verifies the demonstrated app matches task requirements
Interaction Timing: Ensures actions occur when apps are ready, not during loading
UI Element Validation: Confirms interactions target appropriate interface elements
Focus Tracking: Detects context switches and multi-app workflows
Anti-Cheating Measures:
Clones App Filtering: Automatically excludes Clones desktop app from captures
System App Exclusion: Filters out Window Server, Dock, Spotlight, and other system processes
Focus Validation: Ensures demonstrated actions occur in the intended application context
Keyboard Layout Detection:
The system captures both physical key presses and their semantic meaning across different keyboard layouts:
Physical Key Detection:
Captures the actual physical key pressed (e.g., the key in the "A" position)
Records raw key codes independent of layout
Provides a detection_method field indicating the capture technique used
Layout-Aware Processing:
QWERTY: Standard US/UK layouts with Key format ("KeyA", "KeyB")
AZERTY: French layout where A/Q keys are swapped
QWERTZ: German layout where Y/Z keys are swapped
Dvorak: Alternative efficiency-focused layout
Detection Methods:
"rdev_cross_platform": Cross-platform detection using the rdev library
"multiinput_windows": Windows-specific hardware-level capture
Layout-dependent flag: Indicates whether the key output varies by keyboard layout
Key Event Pair Structure: Each key interaction generates two events with complementary information:
keydown Event (Character Production): Records the physical key together with the character it produced, capturing typing intent.
keyup Event (Physical Release): Records when the key was released, enabling key-hold duration calculation.
Character Extraction Process: The actual_char field uses Unicode format parsing of the key event output.
Layout Demonstration (AZERTY keyboard):
Physical Key: KeyQ (the Q key location on QWERTY)
Produced Character: "A" (because on AZERTY, the Q position produces A)
Training Benefit: Models learn both the physical action AND the semantic result
Important Distinctions:
key: Physical key location (hardware-level, layout-independent)
actual_char: Character that appears on screen (layout-dependent, semantic result)
keydown: Captures character production and typing intent
keyup: Captures physical key release timing (duration calculation)
This dual-event system allows precise reconstruction of both typing mechanics and semantic content across any keyboard layout.
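For the AZERTY example above, a keydown/keyup pair might look roughly like this; field names other than key, actual_char, and detection_method (for instance the layout-dependent flag) are assumptions.

```ts
// Illustrative keydown/keyup pair for the AZERTY example above.
// The physical key is reported as KeyQ; the produced character is "a" (no Shift).
const keyPair = [
  {
    event: "keydown",
    data: {
      key: "KeyQ",                              // physical location (layout-independent)
      actual_char: "a",                         // semantic result (layout-dependent)
      detection_method: "rdev_cross_platform",
      layout_dependent: true,                   // hypothetical name for the layout flag
    },
    time: 4210,
  },
  {
    event: "keyup",
    data: { key: "KeyQ", detection_method: "rdev_cross_platform" },
    time: 4296,                                 // release time; hold duration = 86 ms
  },
];
```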
Key Features:
Cross-platform compatibility: Handles Windows and macOS with unified event format
Layout-aware keyboard detection: Distinguishes between physical keys and produced characters
Coordinate normalization: Clamps mouse coordinates to screen bounds with raw coordinate preservation
Intelligent UI triggering: Only captures accessibility snapshots for significant interactions
Focus-aware grading: App status detection enables accurate task validation
Anti-tampering: Built-in filtering of system apps and recording software
3. input_log_meta.json
Metadata describing the input log file structure and content.
format: Always "jsonl" (JSON Lines); other formats such as Parquet or Arrow may be supported later
event_count: Total number of events in the log
timestamp_type: "relative" (milliseconds since recording start) or "absolute"
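An illustrative example (the event count is a placeholder):

```ts
// Illustrative input_log_meta.json content; the count is hypothetical.
const inputLogMeta = {
  format: "jsonl",            // currently always JSON Lines
  event_count: 15423,         // total number of events in input_log.jsonl
  timestamp_type: "relative", // milliseconds since recording start
};
```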
4. recording.mp4
High-quality screen recording captured during the demonstration:
Codec: H.264 (libx264)
Container: MP4
Frame Rate: Variable (typically 30fps)
Resolution: Matches physical screen resolution (e.g., 2880x1800 for Retina displays)
Bitrate: Adaptive based on content
The recording captures:
All visual changes on the screen
Mouse cursor movements and clicks
Window focus changes
Application launches and interactions
5. sft.json
Supervised Fine-Tuning annotations that convert raw demonstrations into structured training data for AI models. This file contains the conversation format with user messages, screenshots, and AI responses.
Key Features:
Multi-turn conversations: Sequences of user instructions and AI responses
Screenshot integration: Base64-encoded images at key interaction points
Timestamp alignment: Synchronized with input_log.jsonl events
Training-ready format: Compatible with instruction-tuning frameworks
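The exact schema is not reproduced here, so the sketch below is only a rough illustration of the ideas above: a user instruction paired with a screenshot and a Python-style action annotation. All field names are assumptions.

```ts
// Rough, hypothetical sketch of a single SFT conversation turn.
const sftTurn = {
  messages: [
    { role: "user", content: "Create a budget spreadsheet", image: "<base64 screenshot>" },
    { role: "assistant", content: "click(x=512, y=384)" }, // Python-style action annotation
  ],
  timestamp: 1310, // aligned with the corresponding input_log.jsonl event
};
```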
Storage and Distribution
File Integrity
checksums.json - Integrity Verification
Each demonstration includes a comprehensive integrity manifest that ensures data authenticity and detects tampering:
Key Fields:
demoHash: Content-addressable identifier generated from submission ID, user address, and timestamp
submissionId: Unique identifier linking to the demonstration submission in the database
userAddress: Ethereum address of the user who created the demonstration (lowercase)
timestamp: Unix timestamp (milliseconds) when the demonstration was stored
files: Array of file integrity records with SHA-256 hashes, sizes, and modification times
overallHash: SHA-256 hash computed from all individual file hashes for tamper detection
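An illustrative record is sketched below; hash values are placeholders, and the field names inside the files entries are assumptions.

```ts
// Illustrative checksums.json content; hashes and sizes are placeholders.
const checksums = {
  demoHash: "3a7bd3e2360a3d29eea436fcfb7e44c7...",
  submissionId: "sub_abc123",
  userAddress: "0xabc0000000000000000000000000000000000def",
  timestamp: 1705328422000,
  files: [
    { name: "meta.json", sha256: "9f86d081...", size: 2048, mtime: 1705328420000 },
    { name: "recording.mp4", sha256: "60303ae2...", size: 104857600, mtime: 1705328421000 },
  ],
  overallHash: "fcde2b2e...", // SHA-256 computed over the individual file hashes
};
```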
Integrity Verification Process:
Individual File Verification: Each file's current SHA-256 is compared against stored hash
Size Validation: File sizes are checked against recorded values
Overall Hash Check: Combined hash of all file hashes is verified
Accessibility Test: Ensures all files remain accessible in storage
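The overall-hash check can be sketched with Node's crypto module as follows; the exact scheme for combining the per-file hashes is an assumption.

```ts
import { createHash } from "node:crypto";

// Sketch: recompute the overall hash from the stored per-file hashes and
// compare it with the recorded value. The concatenation scheme is an assumption.
function verifyOverallHash(fileHashes: string[], recordedOverallHash: string): boolean {
  const combined = createHash("sha256").update(fileHashes.join("")).digest("hex");
  return combined === recordedOverallHash;
}
```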
manifest.json - Extended Metadata
Provides additional context and metadata beyond the core meta.json file:
Key Fields:
demonstration_id: Same as demoHash, provides a content-addressable reference
task.type: Classification of task type ("computer_use", "web_navigation", etc.)
task.description: User-friendly task description for training context
task.url: Icon URL or reference URL for the primary application
environment.os: Operating system for platform-specific analysis
environment.browser: Application context ("desktop_app", browser name, etc.)
environment.screen_resolution: Display resolution for coordinate context
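An illustrative manifest, with placeholder values:

```ts
// Illustrative manifest.json content; all values are placeholders.
const manifest = {
  demonstration_id: "3a7bd3e2360a3d29eea436fcfb7e44c7...", // same value as demoHash
  task: {
    type: "computer_use",
    description: "Create a budget spreadsheet in the desktop app",
    url: "https://example.com/app-icon.png",
  },
  environment: {
    os: "macos",
    browser: "desktop_app",
    screen_resolution: "2880x1800",
  },
};
```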
Use Cases:
Dataset Organization: Categorizing demonstrations by task type and environment
Quality Metrics: Tracking performance across different platforms and applications
Training Optimization: Filtering datasets by environment characteristics
Research Analysis: Understanding task distribution and completion patterns
API Access
⚠️ Access Control: Demonstration data requires authentication and authorization. Users can only access their own demonstration files through wallet signature verification.
Demonstrations are accessible via RESTful API endpoints:
List files: GET /api/v1/forge/demo-files/{submissionId} (requires session auth)
Download file: GET /api/v1/forge/demo-files/{submissionId}/{filename} (requires session auth)
Verify integrity: GET /api/v1/forge/demo-files/{submissionId}/verify (requires session auth)
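A minimal client-side sketch of these calls, assuming an authenticated session cookie is already present; the host name and response shapes are assumptions.

```ts
// Sketch: list a demonstration's files, then run the integrity check.
// Assumes a valid session cookie; response shapes are not documented here.
const base = "https://api.example.com"; // hypothetical host
const submissionId = "sub_abc123";      // hypothetical submission

async function inspectDemo() {
  const files = await fetch(`${base}/api/v1/forge/demo-files/${submissionId}`, {
    credentials: "include",
  }).then((r) => r.json());

  const verification = await fetch(
    `${base}/api/v1/forge/demo-files/${submissionId}/verify`,
    { credentials: "include" },
  ).then((r) => r.json());

  console.log(files, verification);
}
```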
Authentication Requirements:
Valid wallet session with matching user address
Case-insensitive address matching for submission ownership
Rate limiting applied to all endpoints
Available Files via API:
All core demonstration files (meta.json, input_log.jsonl, input_log_meta.json, recording.mp4, sft.json)
Integrity verification file (checksums.json)
Extended metadata (manifest.json)
Files are listed with download URLs, sizes, and SHA-256 hashes for client-side verification
Response Formats:
JSON files: application/json
Video files: video/mp4 or text/plain (base64)
Input logs: application/x-ndjson
Platform Specifics
Windows
Input capture: Multiinput library for hardware-level input capture of keyboards, mice, and joysticks
Screen recording: GDI capture (gdigrab) with desktop input
Accessibility: Windows UI Automation (planned future implementation)
Keyboard layouts: KLID registry detection with caching for performance
macOS
Input capture: rdev library combined with Core Foundation for absolute mouse positioning
Screen recording: AVFoundation with automatic device detection
Accessibility: Native AX API with CGWindowList fallback for comprehensive UI tree capture
Permissions: Requires Accessibility permission in System Settings → Privacy & Security → Accessibility
Keyboard layouts: TIS (Text Input Source) API with intelligent detection and fallback to null on errors
Quality Metrics
The system tracks several quality indicators:
Completion rate: Whether the task was successfully completed
Interaction efficiency: Number of actions relative to task complexity
Error rate: Failed interactions or corrections during demonstration
Timestamp accuracy: Synchronization between events and video
File integrity: SHA-256 verification of all components
Use Cases
AI Training
Multimodal learning: Screenshots paired with interaction events
Instruction following: User commands with corresponding actions
Error correction: Learning from failed attempts and corrections
Research and Analysis
UI interaction patterns: Understanding how users navigate applications
Cross-platform behavior: Comparing interaction patterns across OS
Accessibility analysis: How users interact with different UI elements
Quality Assurance
Automated testing: Replay interactions for regression testing
User experience: Analyzing task completion efficiency
Application usability: Identifying common user difficulties
Technical Implementation
Recording Pipeline
Initialization: Set up FFmpeg, input listeners, and accessibility monitoring
Event capture: Parallel threads for input events, screen recording, and UI snapshots
Synchronization: Relative timestamps ensure precise event alignment
Finalization: Generate metadata, verify integrity, and package files
Event Processing
Debouncing: Prevents excessive UI snapshots (500ms minimum interval)
Filtering: Ignores system applications and focus changes to Clones app
Normalization: Coordinate clamping and cross-platform key mapping
Storage Architecture
Distributed storage: Object storage with content-addressable hashing
Deduplication: File-level deduplication based on SHA-256
Compression: Efficient storage of large video files and JSON data
Conclusion: Toward Comprehensive Computer Use Agent Training
Together, these seven files (the five core files plus the two storage files) define a complete demonstration that captures human-computer interaction in fine-grained detail. When multiple high-quality demonstrations are combined, they form comprehensive training datasets specifically designed for Computer Use Agent (CUA) development.
Dataset Formation and AI Training Pipeline
The ultimate objective of this demonstration data format is to enable the creation of robust training datasets through the following pipeline:
Quality Assessment: Each demonstration is evaluated by the Clones Quality Agent service, which scores demonstrations based on task completion, interaction efficiency, and data integrity
Dataset Curation: High-scoring demonstrations are aggregated into structured datasets, filtered by task complexity, application domains, and success metrics
Model Training: These curated datasets are used for supervised fine-tuning of AI models, enabling them to learn multimodal computer interaction patterns
Agent Deployment: Trained models power Computer Use Agents that can autonomously execute complex tasks on users' computers with human-level proficiency
Evaluation Framework for CUA Training Efficacy
We assess whether this data format provides sufficient structure and richness for effective multitasking Computer Use Agent training through six critical dimensions:
1. Data Integrity and Synchronization
Hash Consistency: SHA-256 verification ensures data authenticity and tamper detection
Temporal Alignment: Precise synchronization between input logs, video frames, and accessibility snapshots
Schema Versioning: Backward-compatible evolution prevents data corruption during dataset updates
2. Event Log Richness and Precision
Complete User Actions: Comprehensive capture of clicks, keystrokes, scrolls, and drag operations with millisecond precision
Coordinate Systems: Dual tracking (normalized/raw) ensures training consistency across display configurations
Layout Awareness: Physical key mapping combined with semantic character output supports universal keyboard compatibility
3. Task Context and Clarity
Objective Definition: Clear task descriptions with structured step-by-step objectives
Application Context: Detailed metadata about target applications, environments, and system configurations
Instruction Quality: Natural language prompts that establish clear expectations for agent behavior
4. Video-Event Correlation
Frame Synchronization: Video timestamps precisely aligned with input events for multimodal learning
Visual Validation: Screen recordings provide ground truth for verifying logged interactions
State Transitions: Visual evidence of UI changes corresponding to recorded user actions
5. SFT File Structure and Coherence
Conversation Flow: Logical sequence of user instructions and assistant actions in training format
Screenshot Integration: Base64-encoded images at critical interaction points for visual context
Focus Transitions: Accurate tracking of application switches and window management
Action Annotations: Python-style function calls that translate human actions into executable agent commands
6. Action Sequence Reproducibility
Deterministic Reconstruction: Sufficient detail to replay interactions with identical outcomes
State Consistency: AXTree snapshots provide UI state verification at key interaction points
Error Recovery: Documentation of failed actions and corrections for robust agent training
Multi-Application Workflows: Support for complex tasks spanning multiple applications and contexts
Training Dataset Characteristics
This format produces datasets with several key advantages for CUA development:
Multimodal Learning: Screenshots paired with precise interaction events enable visual-motor learning
Cross-Platform Generalization: Consistent data structure across Windows and macOS environments
Task Diversity: Support for web browsing, document creation, file management, and application-specific workflows
Quality Assurance: Built-in integrity verification and grading systems ensure high-quality training examples
Scalable Architecture: Content-addressable storage and API access support large-scale dataset management
This comprehensive data format establishes a foundation for developing Computer Use Agents that can reliably execute complex tasks in real-world computing environments, ensuring that every demonstration captures the full context needed to train capable computer-use AI agents while maintaining compatibility across different platforms and use cases.