VideoCsound

MSc Thesis: Generating Music from Video with Computer Vision

Overview

Traditional acoustic instruments offer a natural, symbiotic relationship between physical action and sonic response. In contrast, laptop-based electronic music often lacks this visual expressiveness, making it difficult for an audience to interpret the relationship between a performer's actions and the resulting sound.

VideoCsound is an open-source performance system designed to bridge this gap. Through computer vision, it tracks everyday objects and translates their high-level semantic data into real-time synthesis parameters.

Technical Implementation

VideoCsound utilizes a modular Python pipeline to synchronize high-speed object detection with real-time audio synthesis. OpenCV handles the video stream, while Ultralytics YOLO extracts bounding box coordinates and unique track IDs.

These data points are fed into the Csound Python API (ctcsound), which modulates synthesis parameters in real-time. Through OpenCV, the processed video is rendered with an overlay of the detected object boundaries, providing the performer with immediate visual feedback of the tracking state.
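One way this mapping step could look in practice is sketched below. The helper `box_to_controls` is hypothetical (the real system's channel names and ranges live in its YAML configuration), but the outputs are the kind of normalized values that could be pushed to Csound control channels via ctcsound's `setControlChannel`:

```python
def box_to_controls(x1, y1, x2, y2, frame_w, frame_h):
    """Illustrative mapping from a detection's bounding box to
    normalized (0..1) synthesis controls: horizontal centre, vertical
    centre (inverted so 'up' increases the value), and relative area.
    In the live system each value would be written to a named Csound
    bus, e.g. csound.setControlChannel("amp", amp)."""
    cx = (x1 + x2) / 2 / frame_w                          # horizontal centre
    cy = (y1 + y2) / 2 / frame_h                          # vertical centre
    area = ((x2 - x1) * (y2 - y1)) / (frame_w * frame_h)  # relative box size
    return cx, 1.0 - cy, min(area, 1.0)
```

Normalizing to 0..1 keeps the vision side agnostic about musical meaning; the Csound orchestra decides how each channel is scaled into pitch, amplitude, or filter ranges.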

Data Flow Architecture

To decouple the CV engine from creative assets, the system employs a template-based project structure. Initializing a performance generates a standardized directory with YAML configurations and boilerplate Csound orchestra (.orc) files. This allows users to define complex object-to-audio mappings, detection thresholds, and bus names in a centralized location without modifying the underlying Python source code. Consolidating project files into a single folder also allows for easy sharing and organization of different setups.
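A minimal sketch of what such a template-initialization step could look like follows; the file names and boilerplate contents here are illustrative, not the toolkit's actual templates:

```python
from pathlib import Path

# Hypothetical project template: a YAML mapping file plus a boilerplate
# Csound orchestra, mirroring the standardized directory described above.
TEMPLATE_FILES = {
    "config.yaml": (
        "model: yolo11n.pt\n"
        "confidence: 0.5\n"
        "mappings:\n"
        "  cup: {bus: amp1, instr: 1}\n"
    ),
    "orchestra.orc": (
        "instr 1\n"
        "  kamp chnget \"amp1\"\n"
        "  aout oscili kamp, 440\n"
        "  outs aout, aout\n"
        "endin\n"
    ),
}

def init_project(root):
    """Create a standardized project folder with boilerplate files,
    never overwriting files the user has already edited."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for name, body in TEMPLATE_FILES.items():
        target = root / name
        if not target.exists():
            target.write_text(body)
    return sorted(p.name for p in root.iterdir())
```

Because all creative assets live in one generated folder, sharing a setup is just sharing that directory.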

Language: Python / Csound
Libraries: Ultralytics / OpenCV
Model: YOLO11n / COCO

Key Outcomes

01

Marker-less Tangible Interface

Successfully replaced physical fiducial markers with YOLO11 object detection, enabling household items to act as musical controllers.

02

Performative Occlusion

Identified that the act of hiding/revealing objects creates a natural 'mute' gesture, providing intuitive structure for live performance.

03

Semantic Mapping Pipeline

Mapped real-time 3D bounding box coordinates to complex synthesis variables, creating a transparent link between motion and sound.

04

User-Friendly Workflow

Created a modular, template-based Python toolkit that allows other artists and researchers to experiment with data-driven sonification and marker-less audiovisual interaction.

Challenges & Solutions

Problem: Sonic Monotony

Stationary objects produced undifferentiated drones, since audio parameters were mapped strictly to changes in physical coordinates.

Solution:

Redesigned the Csound instruments with internal sample-and-hold amplitude modulation, creating musical interest independent of physical movement.
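The sample-and-hold idea can be illustrated in Python; this is a sketch of the principle, not the actual Csound instrument (in Csound itself an opcode such as `randh`, which holds each random value for a set period, provides the same stepped signal):

```python
import random

def sample_and_hold(rate_hz, sr, n_samples, seed=0):
    """Stepped random control signal: draw a random level, hold it for
    sr / rate_hz samples, then draw a new one. Scaling a carrier's
    amplitude by this signal yields audible variation even while the
    tracked object (and hence its coordinates) stays still."""
    rng = random.Random(seed)
    hold = max(1, int(sr / rate_hz))   # samples per held value
    signal, level = [], 0.0
    for i in range(n_samples):
        if i % hold == 0:              # time to sample a new level
            level = rng.random()
        signal.append(level)
    return signal
```

Driving amplitude this way decouples musical activity from tracking data, so a motionless object still produces an evolving sound.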

Problem: Real-Time Latency

Identified a consistent ~500ms delay between visual events and audio response, detracting from the sense of direct physical interactivity.

Solution:

Shifted to native Ultralytics video handling to implement 'video stride' (processing every Nth frame), reducing computational load.
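The stride idea itself is simple to state. Ultralytics exposes it as the `vid_stride` argument to its prediction and tracking calls; `strided_frames` below is just an illustrative stand-alone sketch of the effect:

```python
def strided_frames(frames, stride):
    """Keep every `stride`-th frame (indices 0, stride, 2*stride, ...)
    for detection and drop the rest, cutting per-second CV workload by
    roughly a factor of `stride` at the cost of temporal resolution."""
    return [frame for i, frame in enumerate(frames) if i % stride == 0]
```

The trade-off is coarser tracking: with a stride of 3 at 30 fps, detections update only 10 times per second, so stride must be tuned against the gestural speed of the performance.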

Future Work