Demonstration Data

Every aspect of the Clones platform is designed to generate high-quality training data for multimodal computer-use AI agents. From AI-generated tasks to expert demonstrations, we capture comprehensive data that helps train AI to understand and replicate human computer interactions.

Overview

Each recorded demonstration consists of multiple files that work together to provide a complete picture of the user's interaction with the computer during a specific task. The system captures user inputs, screen recording, accessibility tree snapshots, and training annotations.

File Structure

Each demonstration contains five core files:

  1. meta.json - Demonstration metadata and configuration

  2. input_log.jsonl - Detailed event log of all user interactions

  3. input_log_meta.json - Metadata for the input log file

  4. recording.mp4 - Video recording of the screen during the demonstration

  5. sft.json - Supervised fine-tuning annotations for AI training

Additional Files (Storage)

  • checksums.json - File integrity verification with SHA-256 hashes and overall checksum

  • manifest.json - Extended demonstration manifest with task and environment metadata


Schema Versioning

All demonstration files use semantic versioning to ensure backward compatibility:

  • major: Breaking changes (not backward compatible)

  • minor: New features (backward compatible)

  • patch: Bug fixes and minor changes


File Specifications

1. meta.json

Core metadata about the demonstration session, system configuration, and task details.

Key Fields:

  • id: Timestamp-based unique identifier (YYYYMMDD_HHMMSS) generated when recording starts

  • timestamp: ISO 8601 timestamp of when the demonstration was created (includes timezone)

  • duration_seconds: Total demonstration duration calculated from actual video file using FFprobe

  • status: Recording state - "recording" (in progress), "completed" (finished successfully), "failed" (error occurred)

  • reason: End condition - "done" (task completed), "fail" (task failed), or custom termination reason

  • title: Human-readable task name displayed to user

  • description: Detailed task instructions provided to the demonstrator

  • platform: Operating system - "macos" or "windows" (Linux not currently supported)

  • arch: System architecture - "aarch64" (Apple Silicon), "x86_64" (Intel/AMD)

  • version: Operating system version string

  • locale: System language and region setting (e.g., "fr-FR", "en-US")

  • keyboard_layout: Current keyboard layout identifier (e.g., "us-qwerty", "fr-azerty", "de-qwertz" or null if detection fails)

  • primary_monitor: Physical screen dimensions used for video recording (includes scale factor for high-DPI displays)

  • quest: Complete task definition including objectives, reward information, and source pool
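
A trimmed, illustrative example of these fields is sketched below as a typed TypeScript literal; every value is hypothetical, and the nested shapes (for example primary_monitor) are assumptions based on the descriptions above.

```ts
// Hypothetical meta.json contents, shown as a typed literal for readability.
// Field names follow the list above; all values are invented for illustration.
interface DemoMeta {
  id: string;
  timestamp: string;
  duration_seconds: number;
  status: "recording" | "completed" | "failed";
  reason: string;
  title: string;
  description: string;
  platform: "macos" | "windows";
  arch: "aarch64" | "x86_64";
  version: string;
  locale: string;
  keyboard_layout: string | null;
  primary_monitor: { width: number; height: number; scale_factor: number };
}

const exampleMeta: DemoMeta = {
  id: "20250114_153042",
  timestamp: "2025-01-14T15:30:42+01:00",
  duration_seconds: 184.6,
  status: "completed",
  reason: "done",
  title: "Rename a file in Finder",
  description: "Open Finder, locate report.pdf and rename it to report_final.pdf.",
  platform: "macos",
  arch: "aarch64",
  version: "14.5",
  locale: "en-US",
  keyboard_layout: "us-qwerty",
  primary_monitor: { width: 2880, height: 1800, scale_factor: 2.0 },
  // (quest object omitted for brevity)
};
```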

2. input_log.jsonl

Event log capturing all user interactions in JSON Lines format. Each line represents a single event.

Event Structure:

  • event: Type of interaction

  • data: Event-specific details

  • time: Relative timestamp in milliseconds since recording start

Event Types:

Mouse Events:

  • mousemove: Cursor position changes with dual coordinate system

  • mousedown/mouseup: Button press/release with normalized positions

  • mousewheel: Scroll events with delta values

Keyboard Events:

  • keydown/keyup: Key press/release with layout-aware detection

System Events:

  • ffmpeg_stderr/ffmpeg_stdout: Video encoding logs

  • axtree_interaction: UI accessibility tree snapshots
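
The sketch below illustrates what a few consecutive lines of input_log.jsonl might contain; the payload fields are simplified and the specific values are hypothetical.

```ts
// Hypothetical, simplified input_log.jsonl lines, typed for readability.
// Each object would be serialized as one JSON line on disk.
interface InputEvent {
  event: string;                  // interaction type
  data: Record<string, unknown>;  // event-specific details
  time: number;                   // ms since recording start
}

const sampleEvents: InputEvent[] = [
  { event: "mousemove", data: { x: 512, y: 384, raw_x: 1024, raw_y: 768 }, time: 1532 },
  { event: "mousedown", data: { button: "Left", x: 512, y: 384, raw_x: 1024, raw_y: 768 }, time: 1610 },
  { event: "mouseup",   data: { button: "Left", x: 512, y: 384, raw_x: 1024, raw_y: 768 }, time: 1695 },
  { event: "keydown",   data: { key: "KeyQ", actual_char: "a" }, time: 2310 },
  { event: "keyup",     data: { key: "KeyQ" }, time: 2392 },
];

// One line per event when written to disk:
const jsonlText = sampleEvents.map((e) => JSON.stringify(e)).join("\n");
```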

Mouse Coordinate System

The system provides dual coordinate tracking for comprehensive mouse event capture:

Normalized Coordinates (x, y):

  • Reference Frame: Relative to primary monitor (0,0 = top-left corner)

  • Units: Logical pixels (DPI-independent)

  • Bounds: Clamped to monitor boundaries [0, monitor_width] × [0, monitor_height]

  • Multi-Monitor: Always positive values, even with secondary displays

  • Use Case: Training models with consistent coordinate space

Raw Coordinates (raw_x, raw_y):

  • Reference Frame: System's global coordinate space

  • Units: Physical pixels (DPI-dependent)

  • Bounds: May be negative or exceed primary monitor bounds

  • Multi-Monitor: Reflects actual system coordinates across all displays

  • Use Case: Debugging, multi-monitor analysis, system-level automation

Processing in Quality Agent: The extraction pipeline (simple-extractor.ts) prioritizes normalized coordinates for model training:
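
The extractor's source is not reproduced here; a minimal sketch of that prioritization, assuming the x/y and raw_x/raw_y fields described above, might look like this:

```ts
// Sketch only: prefer normalized (x, y) coordinates and fall back to raw ones.
// Field names follow the coordinate description above; the helper is hypothetical.
interface MouseData { x?: number; y?: number; raw_x?: number; raw_y?: number }

function extractTrainingCoords(data: MouseData): { x: number; y: number } | null {
  if (data.x !== undefined && data.y !== undefined) {
    return { x: data.x, y: data.y };          // normalized, DPI-independent
  }
  if (data.raw_x !== undefined && data.raw_y !== undefined) {
    return { x: data.raw_x, y: data.raw_y };  // raw fallback when no normalized values exist
  }
  return null;                                // event carries no usable position
}
```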

Training Data Format: The message formatter (message-formatter.ts) outputs normalized coordinates for SFT:
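
As a hedged sketch (the real formatter may differ), a click event could be rendered into a Python-style action string of the kind used in the sft.json assistant turns:

```ts
// Sketch: render a click at normalized coordinates as a Python-style action string.
// The exact action syntax emitted by message-formatter.ts may differ.
function formatClickAction(x: number, y: number): string {
  return `click(x=${Math.round(x)}, y=${Math.round(y)})`;
}

// e.g. formatClickAction(512.4, 384.7) -> "click(x=512, y=385)"
```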

Coordinate Normalization Examples:
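
For instance, on a 2880x1800 Retina display with a 2.0 scale factor, the logical (normalized) space is 1440x900. The following is an illustrative sketch of the conversion and clamping; the recorder's exact formula is not shown here, and the assumption is that raw physical pixels are divided by the scale factor and clamped to the primary monitor.

```ts
// Illustrative normalization: convert raw physical coordinates to logical pixels
// and clamp them to the primary monitor. Dimensions and scale factor come from meta.json.
function normalize(
  rawX: number,
  rawY: number,
  monitor: { width: number; height: number; scale_factor: number }
) {
  const logicalW = monitor.width / monitor.scale_factor;   // 2880 / 2.0 = 1440
  const logicalH = monitor.height / monitor.scale_factor;  // 1800 / 2.0 = 900
  const clamp = (v: number, max: number) => Math.min(Math.max(v, 0), max);
  return {
    x: clamp(rawX / monitor.scale_factor, logicalW),
    y: clamp(rawY / monitor.scale_factor, logicalH),
  };
}

// A cursor on a secondary display to the left of the primary (raw_x = -400)
// clamps to x = 0 in the normalized space:
// normalize(-400, 900, { width: 2880, height: 1800, scale_factor: 2.0 }) -> { x: 0, y: 450 }
```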

Accessibility Tree Events (axtree_interaction)

Captures comprehensive UI structure snapshots triggered by significant user interactions. This is the core mechanism for training AI agents to understand application context and verify correct app usage during grading.
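
The shape below is a hypothetical axtree_interaction event; the field names are illustrative and real tree entries carry considerably more detail.

```ts
// Hypothetical shape of a single axtree_interaction event.
const axtreeEvent = {
  event: "axtree_interaction",
  time: 4210,
  data: {
    trigger: "mousedown",
    focused_app: "TextEdit",
    app_status: "ready",          // "ready" | "launching" | "unknown", see below
    tree: [
      { app: "TextEdit", role: "AXWindow", title: "Untitled", children: 42 },
      { app: "Finder",   role: "AXWindow", title: "Documents", children: 17 },
    ],
  },
};
```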

Triggering Logic: AXTree events are intelligently triggered only for meaningful interactions:

  • Mouse clicks (mousedown/mouseup)

  • Navigation keys (Tab, Arrow keys, Enter, Escape, Space)

  • Function keys (F1-F12)

  • Page navigation (PageUp, PageDown, Home, End)

  • Significant scroll events (delta > 1.0 to filter micro-movements)

  • 500ms debouncing prevents excessive captures from rapid interactions

Focus Detection & App Status: The system implements sophisticated focus tracking to determine application readiness:

App Status Determination: The app_status field is crucial for training and grading accuracy:

  • "ready": Application is fully loaded and functional

    • Focused app name matches an entry in the available apps tree

    • All UI elements are accessible and interactive

    • Window content is fully rendered

  • "launching": Application is starting but not yet functional

    • Focused app is detected but not yet in the accessibility tree

    • UI elements may be partially loaded or unresponsive

    • Transitional state during app startup

  • "unknown": Cannot determine application state

    • No focused app detected

    • Accessibility tree is empty or inaccessible

    • System is in an indeterminate state

Focus Detection Algorithm:

  1. Get Focused App: Uses platform-specific APIs (AX on macOS, UI Automation on Windows)

  2. Cross-Reference: Checks if focused app appears in the accessibility tree

  3. String Matching: Performs case-insensitive substring matching between focused app name and tree entries

  4. Status Assignment: Determines readiness based on presence and accessibility
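
A condensed sketch of steps 2-4 above (step 1's platform API calls are omitted, and the helper signature is an assumption):

```ts
// Sketch of the cross-reference and status assignment described above.
type AppStatus = "ready" | "launching" | "unknown";

function determineAppStatus(focusedApp: string | null, treeAppNames: string[]): AppStatus {
  if (!focusedApp || treeAppNames.length === 0) return "unknown"; // no focus or empty tree
  const needle = focusedApp.toLowerCase();
  const inTree = treeAppNames.some(
    (name) => name.toLowerCase().includes(needle) || needle.includes(name.toLowerCase())
  );
  return inTree ? "ready" : "launching"; // focused but not yet in the tree => launching
}
```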

Grading Integration: This data enables precise task validation:

  • Correct App Usage: Verifies the demonstrated app matches task requirements

  • Interaction Timing: Ensures actions occur when apps are ready, not during loading

  • UI Element Validation: Confirms interactions target appropriate interface elements

  • Focus Tracking: Detects context switches and multi-app workflows

Anti-Cheating Measures:

  • Clones App Filtering: Automatically excludes Clones desktop app from captures

  • System App Exclusion: Filters out Window Server, Dock, Spotlight, and other system processes

  • Focus Validation: Ensures demonstrated actions occur in the intended application context

Keyboard Layout Detection:

The system captures both physical key presses and their semantic meaning across different keyboard layouts:

Physical Key Detection:

  • Captures the actual physical key pressed (e.g., the key in the "A" position)

  • Records raw key codes independent of layout

  • Provides detection_method field indicating capture technique used

Layout-Aware Processing:

  • QWERTY: Standard US/UK layouts with Key format ("KeyA", "KeyB")

  • AZERTY: French layout where A/Q keys are swapped

  • QWERTZ: German layout where Y/Z keys are swapped

  • Dvorak: Alternative efficiency-focused layouts

Detection Methods:

  • "rdev_cross_platform": Cross-platform detection using rdev library

  • "multiinput_windows": Windows-specific hardware-level capture

  • Layout-dependent flag: Indicates whether the key output varies by keyboard layout

Key Event Pair Structure: Each key interaction generates two events with complementary information:

keydown Event (Character Production):
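
A hypothetical keydown event on an AZERTY layout; only key, actual_char, detection_method, and the layout-dependent flag are documented above, and all values are illustrative.

```ts
// Sketch of a keydown event: physical key plus the character it produced.
const keydownEvent = {
  event: "keydown",
  time: 2310,
  data: {
    key: "KeyQ",                          // physical key position (layout-independent)
    actual_char: "a",                     // character produced on an AZERTY layout
    detection_method: "rdev_cross_platform",
    layout_dependent: true,
  },
};
```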

keyup Event (Physical Release):
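
The matching keyup event, again as an illustrative sketch, carries only the physical release and its timing:

```ts
// Sketch of the matching keyup event: physical release only, used for key duration.
const keyupEvent = {
  event: "keyup",
  time: 2392,                             // 82 ms after the keydown above
  data: {
    key: "KeyQ",
    detection_method: "rdev_cross_platform",
  },
};
```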

Character Extraction Process: The actual_char field uses Unicode format parsing:
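
The parser itself is not shown in this document; assuming the capture layer reports produced characters as strings of the form Unicode("a"), a sketch of the extraction could be:

```ts
// Sketch only: pull the produced character out of a string of the assumed form
// 'Unicode("a")'; returns null when the key produced no printable character.
function extractActualChar(rawKeyRepr: string): string | null {
  const match = rawKeyRepr.match(/^Unicode\("(.+)"\)$/);
  return match ? match[1] : null;
}

// extractActualChar('Unicode("a")') -> "a"
// extractActualChar('Escape')       -> null
```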

Layout Demonstration (AZERTY keyboard):

  • Physical Key: KeyQ (Q key location on QWERTY)

  • Produced Character: "A" (because on AZERTY, the Q position produces A)

  • Training Benefit: Models learn both the physical action AND the semantic result

Important Distinctions:

  • key: Physical key location (hardware-level, layout-independent)

  • actual_char: Character that appears on screen (layout-dependent, semantic result)

  • keydown: Captures character production and typing intent

  • keyup: Captures physical key release timing (duration calculation)

This dual-event system allows precise reconstruction of both typing mechanics and semantic content across any keyboard layout.

Key Features:

  • Cross-platform compatibility: Handles Windows and macOS with unified event format

  • Layout-aware keyboard detection: Distinguishes between physical keys and produced characters

  • Coordinate normalization: Clamps mouse coordinates to screen bounds with raw coordinate preservation

  • Intelligent UI triggering: Only captures accessibility snapshots for significant interactions

  • Focus-aware grading: App status detection enables accurate task validation

  • Anti-tampering: Built-in filtering of system apps and recording software

3. input_log_meta.json

Metadata describing the input log file structure and content.

  • format: Currently always "jsonl" (JSON Lines); other formats such as Parquet or Arrow may be supported in the future

  • event_count: Total number of events in the log

  • timestamp_type: "relative" (milliseconds since recording start) or "absolute"
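
An illustrative input_log_meta.json (all values hypothetical):

```ts
// Hypothetical input_log_meta.json contents.
const inputLogMeta = {
  format: "jsonl",            // currently always JSON Lines
  event_count: 18342,         // number of lines in input_log.jsonl
  timestamp_type: "relative", // milliseconds since recording start
};
```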

4. recording.mp4

High-quality screen recording captured during the demonstration:

  • Codec: H.264 (libx264)

  • Container: MP4

  • Frame Rate: Variable (typically 30fps)

  • Resolution: Matches physical screen resolution (e.g., 2880x1800 for Retina displays)

  • Bitrate: Adaptive based on content

The recording captures:

  • All visual changes on the screen

  • Mouse cursor movements and clicks

  • Window focus changes

  • Application launches and interactions

5. sft.json

Supervised Fine-Tuning annotations that convert raw demonstrations into structured training data for AI models. This file contains the conversation format with user messages, screenshots, and AI responses.

Key Features:

  • Multi-turn conversations: Sequences of user instructions and AI responses

  • Screenshot integration: Base64-encoded images at key interaction points

  • Timestamp alignment: Synchronized with input_log.jsonl events

  • Training-ready format: Compatible with instruction-tuning frameworks
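
A hedged sketch of a single turn pair is shown below; the real sft.json schema may use different field names, and the screenshot data is truncated for brevity.

```ts
// Hypothetical sketch of one sft.json exchange; field names are assumptions.
const sftExcerpt = {
  messages: [
    {
      role: "user",
      content: "Rename report.pdf to report_final.pdf in Finder.",
      image: "data:image/png;base64,iVBORw0KGgo...", // screenshot at this point (truncated)
      timestamp_ms: 1532,                            // aligned with input_log.jsonl
    },
    {
      role: "assistant",
      content: "click(x=512, y=384)",                // Python-style action annotation
      timestamp_ms: 1610,
    },
  ],
};
```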


Storage and Distribution

File Integrity

checksums.json - Integrity Verification

Each demonstration includes a comprehensive integrity manifest that ensures data authenticity and detects tampering:

Key Fields:

  • demoHash: Content-addressable identifier generated from submission ID, user address, and timestamp

  • submissionId: Unique identifier linking to the demonstration submission in the database

  • userAddress: Ethereum address of the user who created the demonstration (lowercase)

  • timestamp: Unix timestamp (milliseconds) when the demonstration was stored

  • files: Array of file integrity records with SHA-256 hashes, sizes, and modification times

  • overallHash: SHA-256 hash computed from all individual file hashes for tamper detection
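
An illustrative checksums.json layout follows; the hash values are placeholders, not real digests, and the nested file-record shape is an assumption based on the field list above.

```ts
// Hypothetical checksums.json layout; hash values are placeholders.
const checksums = {
  demoHash: "<hash of submissionId + userAddress + timestamp>",
  submissionId: "sub_01HEXAMPLE",
  userAddress: "0x1234...abcd",          // lowercase Ethereum address
  timestamp: 1736867442000,              // Unix ms at storage time
  files: [
    { name: "meta.json",     sha256: "<sha-256>", size: 2048,     mtime: 1736867441000 },
    { name: "recording.mp4", sha256: "<sha-256>", size: 52428800, mtime: 1736867441000 },
  ],
  overallHash: "<sha-256 over all individual file hashes>",
};
```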

Integrity Verification Process:

  1. Individual File Verification: Each file's current SHA-256 is compared against stored hash

  2. Size Validation: File sizes are checked against recorded values

  3. Overall Hash Check: Combined hash of all file hashes is verified

  4. Accessibility Test: Ensures all files remain accessible in storage
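
A minimal Node.js-style sketch of steps 1 and 3, assuming overallHash is a SHA-256 over the concatenated per-file hashes (the exact combination rule and ordering may differ):

```ts
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Sketch: verify one file's digest and recompute the overall hash.
function sha256Hex(data: Uint8Array | string): string {
  return createHash("sha256").update(data).digest("hex");
}

function verifyFile(path: string, expectedHash: string): boolean {
  return sha256Hex(readFileSync(path)) === expectedHash;
}

function recomputeOverallHash(fileHashes: string[]): string {
  // Assumption: individual hashes are concatenated in file order before hashing.
  return sha256Hex(fileHashes.join(""));
}
```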

manifest.json - Extended Metadata

Provides additional context and metadata beyond the core meta.json file:

Key Fields:

  • demonstration_id: Same as demoHash, provides content-addressable reference

  • task.type: Classification of task type ("computer_use", "web_navigation", etc.)

  • task.description: User-friendly task description for training context

  • task.url: Icon URL or reference URL for the primary application

  • environment.os: Operating system for platform-specific analysis

  • environment.browser: Application context ("desktop_app", browser name, etc.)

  • environment.screen_resolution: Display resolution for coordinate context

Use Cases:

  • Dataset Organization: Categorizing demonstrations by task type and environment

  • Quality Metrics: Tracking performance across different platforms and applications

  • Training Optimization: Filtering datasets by environment characteristics

  • Research Analysis: Understanding task distribution and completion patterns

API Access

⚠️ Access Control: Demonstration data requires authentication and authorization. Users can only access their own demonstration files through wallet signature verification.

Demonstrations are accessible via RESTful API endpoints:

  • List files: GET /api/v1/forge/demo-files/{submissionId} (requires session auth)

  • Download file: GET /api/v1/forge/demo-files/{submissionId}/{filename} (requires session auth)

  • Verify integrity: GET /api/v1/forge/demo-files/{submissionId}/verify (requires session auth)

Authentication Requirements:

  • Valid wallet session with matching user address

  • Case-insensitive address matching for submission ownership

  • Rate limiting applied to all endpoints

Available Files via API:

  • All core demonstration files (meta.json, input_log.jsonl, input_log_meta.json, recording.mp4, sft.json)

  • Integrity verification file (checksums.json)

  • Extended metadata (manifest.json)

  • Files listed with download URLs, sizes, and SHA-256 hashes for client-side verification

Response Formats:

  • JSON files: application/json

  • Video files: video/mp4 or text/plain (base64)

  • Input logs: application/x-ndjson
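
A client-side sketch of the list and verify calls is shown below; the base URL, session handling, and response shapes are assumptions.

```ts
// Sketch only: list a submission's files and request integrity verification.
// Authentication is assumed to ride on the wallet session cookie.
const BASE = "https://api.example.com"; // placeholder base URL

async function listDemoFiles(submissionId: string) {
  const res = await fetch(`${BASE}/api/v1/forge/demo-files/${submissionId}`, {
    credentials: "include", // send the wallet session cookie
  });
  if (!res.ok) throw new Error(`List failed: ${res.status}`);
  return res.json(); // expected: file names, download URLs, sizes, SHA-256 hashes
}

async function verifyDemo(submissionId: string) {
  const res = await fetch(`${BASE}/api/v1/forge/demo-files/${submissionId}/verify`, {
    credentials: "include",
  });
  if (!res.ok) throw new Error(`Verify failed: ${res.status}`);
  return res.json();
}
```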


Platform Specifics

Windows

  • Input capture: Multiinput library for hardware-level input capture of keyboards, mice, and joysticks

  • Screen recording: GDI capture via FFmpeg's gdigrab with the desktop as the capture input

  • Accessibility: Windows UI Automation (planned future implementation)

  • Keyboard layouts: KLID registry detection with caching for performance

macOS

  • Input capture: rdev library combined with Core Foundation for absolute mouse positioning

  • Screen recording: AVFoundation with automatic device detection

  • Accessibility: Native AX API with CGWindowList fallback for comprehensive UI tree capture

  • Permissions: Requires Accessibility permission in System Settings → Privacy & Security → Accessibility

  • Keyboard layouts: TIS (Text Input Source) API with intelligent detection and fallback to null on errors


Quality Metrics

The system tracks several quality indicators:

  • Completion rate: Whether the task was successfully completed

  • Interaction efficiency: Number of actions relative to task complexity

  • Error rate: Failed interactions or corrections during demonstration

  • Timestamp accuracy: Synchronization between events and video

  • File integrity: SHA-256 verification of all components


Use Cases

AI Training

  • Multimodal learning: Screenshots paired with interaction events

  • Instruction following: User commands with corresponding actions

  • Error correction: Learning from failed attempts and corrections

Research and Analysis

  • UI interaction patterns: Understanding how users navigate applications

  • Cross-platform behavior: Comparing interaction patterns across OS

  • Accessibility analysis: How users interact with different UI elements

Quality Assurance

  • Automated testing: Replay interactions for regression testing

  • User experience: Analyzing task completion efficiency

  • Application usability: Identifying common user difficulties


Technical Implementation

Recording Pipeline

  1. Initialization: Set up FFmpeg, input listeners, and accessibility monitoring

  2. Event capture: Parallel threads for input events, screen recording, and UI snapshots

  3. Synchronization: Relative timestamps ensure precise event alignment

  4. Finalization: Generate metadata, verify integrity, and package files

Event Processing

  • Debouncing: Prevents excessive UI snapshots (500ms minimum interval)

  • Filtering: Ignores system applications and focus changes to Clones app

  • Normalization: Coordinate clamping and cross-platform key mapping

Storage Architecture

  • Distributed storage: Object storage with content-addressable hashing

  • Deduplication: File-level deduplication based on SHA-256

  • Compression: Efficient storage of large video files and JSON data

Conclusion: Toward Comprehensive Computer Use Agent Training

Together, these seven files (the five core demonstration files plus the two storage files) define a complete demonstration that captures human-computer interaction in fine-grained detail. When multiple high-quality demonstrations are combined, they form comprehensive training datasets designed specifically for Computer Use Agent (CUA) development.

Dataset Formation and AI Training Pipeline

The ultimate objective of this demonstration data format is to enable the creation of robust training datasets through the following pipeline:

  1. Quality Assessment: Each demonstration is evaluated by the Clones Quality Agent service, which scores demonstrations based on task completion, interaction efficiency, and data integrity

  2. Dataset Curation: High-scoring demonstrations are aggregated into structured datasets, filtered by task complexity, application domains, and success metrics

  3. Model Training: These curated datasets are used for supervised fine-tuning of AI models, enabling them to learn multimodal computer interaction patterns

  4. Agent Deployment: Trained models power Computer Use Agents that can autonomously execute complex tasks on users' computers with human-level proficiency

Evaluation Framework for CUA Training Efficacy

We assess whether this data format provides sufficient structure and richness for effective multitasking Computer Use Agent training through six critical dimensions:

1. Data Integrity and Synchronization

  • Hash Consistency: SHA-256 verification ensures data authenticity and tamper detection

  • Temporal Alignment: Precise synchronization between input logs, video frames, and accessibility snapshots

  • Schema Versioning: Backward-compatible evolution prevents data corruption during dataset updates

2. Event Log Richness and Precision

  • Complete User Actions: Comprehensive capture of clicks, keystrokes, scrolls, and drag operations with millisecond precision

  • Coordinate Systems: Dual tracking (normalized/raw) ensures training consistency across display configurations

  • Layout Awareness: Physical key mapping combined with semantic character output supports universal keyboard compatibility

3. Task Context and Clarity

  • Objective Definition: Clear task descriptions with structured step-by-step objectives

  • Application Context: Detailed metadata about target applications, environments, and system configurations

  • Instruction Quality: Natural language prompts that establish clear expectations for agent behavior

4. Video-Event Correlation

  • Frame Synchronization: Video timestamps precisely aligned with input events for multimodal learning

  • Visual Validation: Screen recordings provide ground truth for verifying logged interactions

  • State Transitions: Visual evidence of UI changes corresponding to recorded user actions

5. SFT File Structure and Coherence

  • Conversation Flow: Logical sequence of user instructions and assistant actions in training format

  • Screenshot Integration: Base64-encoded images at critical interaction points for visual context

  • Focus Transitions: Accurate tracking of application switches and window management

  • Action Annotations: Python-style function calls that translate human actions into executable agent commands

6. Action Sequence Reproducibility

  • Deterministic Reconstruction: Sufficient detail to replay interactions with identical outcomes

  • State Consistency: AXTree snapshots provide UI state verification at key interaction points

  • Error Recovery: Documentation of failed actions and corrections for robust agent training

  • Multi-Application Workflows: Support for complex tasks spanning multiple applications and contexts

Training Dataset Characteristics

This format produces datasets with several key advantages for CUA development:

  • Multimodal Learning: Screenshots paired with precise interaction events enable visual-motor learning

  • Cross-Platform Generalization: Consistent data structure across Windows and macOS environments

  • Task Diversity: Support for web browsing, document creation, file management, and application-specific workflows

  • Quality Assurance: Built-in integrity verification and grading systems ensure high-quality training examples

  • Scalable Architecture: Content-addressable storage and API access support large-scale dataset management

This data format establishes a foundation for developing Computer Use Agents that can reliably execute complex tasks in real-world computing environments. Every demonstration captures the full context needed to train capable computer-use AI agents while remaining compatible across platforms and use cases.
