VideoCsound
MSc Thesis: Generating Music from Video with Computer Vision
Overview
Traditional acoustic instruments offer an immediate, tangible relationship between physical action and sonic response. In contrast, laptop-based electronic music often lacks this visual expressiveness, making it difficult for an audience to interpret the relationship between a performer's actions and the resulting sound.
VideoCsound is an open-source performance system designed to bridge this gap. Through computer vision, it tracks everyday objects and translates high-level semantic data (object class, position, and track identity) into real-time synthesis parameters.
Technical Implementation
VideoCsound utilizes a modular Python pipeline to synchronize high-speed object detection with real-time audio synthesis. OpenCV handles the video stream, while Ultralytics YOLO extracts bounding box coordinates and unique track IDs.
These data points are fed into the Csound Python API (ctcsound), which modulates synthesis parameters in real time. Through OpenCV, the processed video is rendered with an overlay of the detected object boundaries, giving the performer immediate visual feedback on the tracking state.
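A minimal sketch of this loop is shown below. The model weights (yolo11n.pt), orchestra file name, instrument number, webcam index, and channel names such as "centroid_x" are illustrative placeholders, not the exact names used by the toolkit.

```python
import cv2
import ctcsound
from ultralytics import YOLO

model = YOLO("yolo11n.pt")                     # pretrained YOLO11 detector

cs = ctcsound.Csound()
cs.setOption("-odac")                          # real-time audio output
cs.compileOrc(open("project.orc").read())      # orchestra from the project folder
cs.readScore("i 1 0 3600")                     # keep instrument 1 running
cs.start()
perf = ctcsound.CsoundPerformanceThread(cs.csound())
perf.play()

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = model.track(frame, persist=True, verbose=False)
    for box in results[0].boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        # write the box centre to control buses read by the orchestra
        cs.setControlChannel("centroid_x", (x1 + x2) / 2)
        cs.setControlChannel("centroid_y", (y1 + y2) / 2)
    cv2.imshow("VideoCsound", results[0].plot())   # tracking overlay for the performer
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
perf.stop()
perf.join()
```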

To decouple the CV engine from creative assets, the system employs a template-based project structure. Initializing a performance generates a standardized directory with YAML configurations and boilerplate Csound orchestra (.orc) files. This allows users to define complex object-to-audio mappings, detection thresholds, and bus names in a centralized location without modifying the underlying Python source code. Consolidating project files into a single folder also allows for easy sharing and organization of different setups.
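For illustration, a generated project might be read back as shown below; the folder layout, file name, and keys are hypothetical stand-ins for the template's actual schema.

```python
import yaml

# Hypothetical project layout; the real template may use different
# file and key names.
with open("my_performance/project.yaml") as f:
    config = yaml.safe_load(f)

# A mapping entry could pair a detected class with a Csound bus, e.g.:
# mappings:
#   - object: cup
#     bus: cup_x          # control channel written by the tracker
#     instrument: 1
# detection:
#   confidence: 0.5       # minimum YOLO confidence threshold
for mapping in config.get("mappings", []):
    print(f"{mapping['object']} -> bus '{mapping['bus']}'")
```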
Key Outcomes
Marker-less Tangible Interface
Successfully replaced physical fiducial markers with YOLO11 object detection, enabling household items to act as musical controllers.
Performative Occlusion
Identified that the act of hiding and revealing objects creates a natural 'mute' gesture, providing intuitive structure for live performance (see the sketch after this list).
Semantic Mapping Pipeline
Mapped real-time bounding box coordinates to synthesis parameters, creating a transparent link between physical motion and sound.
User-Friendly Workflow
Created a modular, template-based Python toolkit that allows other artists and researchers to experiment with data-driven sonification and marker-less audiovisual interaction.
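As referenced under Performative Occlusion, tracking loss can be treated as a mute. The helper below is a rough sketch only: the per-object bus naming (amp_<track_id>) and the function itself are hypothetical, not part of the released toolkit.

```python
def update_mutes(cs, result, known_ids):
    """Silence previously tracked objects that are hidden in this frame.

    cs is a ctcsound.Csound instance, result is one Ultralytics result,
    and known_ids is a set maintained across frames by the caller.
    """
    visible = set()
    if result.boxes.id is not None:
        visible = {int(i) for i in result.boxes.id}
    for track_id in known_ids - visible:
        cs.setControlChannel(f"amp_{track_id}", 0.0)   # hidden object: mute
    for track_id in visible:
        cs.setControlChannel(f"amp_{track_id}", 1.0)   # visible object: sound on
    known_ids |= visible
```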
Challenges & Solutions
Problem: Sonic Monotony
Stationary objects produced undifferentiated drones, because the audio was driven solely by physical coordinates and therefore stopped evolving whenever an object stopped moving.
Solution:
Redesigned the Csound instruments with internal sample-and-hold amplitude modulation, creating musical interest independent of physical movement.
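A hedged sketch of that kind of internal modulation, compiled through ctcsound, is shown below. It uses Csound's randomh opcode as the sample-and-hold source; the sample rate, modulation rate, amplitude range, and fixed pitch are illustrative rather than the thesis instruments.

```python
import ctcsound

orc = """
sr     = 48000
ksmps  = 64
nchnls = 2
0dbfs  = 1

giSine ftgen 0, 0, 8192, 10, 1     ; sine wave table

instr 1
  kfreq  =       220               ; fixed pitch so the sketch runs stand-alone;
                                   ; VideoCsound would read this from a bus instead
  khold  randomh 0.2, 0.9, 6       ; sample-and-hold: new random level 6 times/sec
  khold  port    khold, 0.02       ; light smoothing to avoid clicks
  asig   poscil  khold, kfreq, giSine
         outs    asig, asig
endin
"""

cs = ctcsound.Csound()
cs.setOption("-odac")
cs.compileOrc(orc)
cs.readScore("i 1 0 20")            # play the instrument for 20 seconds
cs.start()
cs.perform()                        # run until the score ends
cs.cleanup()
```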
Problem: Real-Time Latency
Identified a consistent ~500 ms delay between visual events and audio response, detracting from the sense of direct physical interactivity.
Solution:
Shifted to native Ultralytics video handling to implement 'video stride' (processing every Nth frame), reducing the overall computational load.
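A sketch of that change using the Ultralytics vid_stride argument (the model file, stride value, and webcam index are illustrative):

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")

# Let Ultralytics read the camera directly: stream=True yields results one
# frame at a time, and vid_stride=2 runs inference on every 2nd frame,
# roughly halving the per-second detection workload.
for result in model.track(source=0, stream=True, vid_stride=2, verbose=False):
    if result.boxes.id is None:
        continue
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(int(box.id), (x1 + x2) / 2)   # would update the Csound buses instead
```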
Future Work
GUI Development
Transitioning from a CLI to a desktop application using Tkinter or PySide to make the framework accessible to non-technical musicians.
Nuanced Parameter Mapping
Deriving velocity and acceleration from the tracking data to map physical energy to sonic intensity, rather than position alone.
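As a rough illustration of the idea (none of this exists in the current toolkit), per-object speed could be estimated with finite differences over successive box centres:

```python
prev_centres = {}   # track_id -> (x, y) centre from the previous processed frame

def object_speed(track_id, centre, dt):
    """Estimate one object's speed in pixels per second."""
    speed = 0.0
    if track_id in prev_centres:
        px, py = prev_centres[track_id]
        speed = ((centre[0] - px) ** 2 + (centre[1] - py) ** 2) ** 0.5 / dt
    prev_centres[track_id] = centre
    return speed
```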
Expanded Applications
Exploring kinesthetic feedback for physical therapy and spatial audio alerts for industrial safety systems.