ken

STM32 Polyphonic Synth: Bare-Metal DSP

8-Voice Polyphonic Wavetable Synthesizer on Cortex-M4

Board: STM32F407G Discovery
Language: C11 / C++17
Frameworks: STM32F4xx HAL, CMSIS 5, STM32 USB Device Library
Build: CMake, CTest, GitHub Actions CI
Tools: STM32CubeMX, ST-Link GDB, DWT cycle profiler

Overview

I built a fully polyphonic synthesizer on bare hardware (no Linux, no RTOS, no audio framework) to understand the full stack from clock configuration to DSP algorithm. The STM32F407G Discovery was the target: an on-board CS43L22 codec, a Cortex-M4 FPU, and enough SRAM to be genuinely interesting without being trivially easy.

The result is a self-contained USB MIDI instrument built solo over 4 weeks: 8-voice polyphony, real-time wavetable morphing across 8 waveforms (sine, saw, square, Rhodes, clav, choir, acid, glass), a zero-delay feedback filter, hardware knobs, an OLED display, and per-voice LED metering.

Figure: STM32 synth hardware setup. STM32F407G Discovery board with SSD1306 OLED, potentiometer bank, and analog multiplexer.

Constraints

The system runs under hard real-time constraints with no safety net: no OS scheduler, no dynamic memory allocation on the audio path, and no tolerance for dropouts.

| Constraint | Requirement |
| --- | --- |
| Sample rate | 48 kHz, interrupt-driven; no dropout tolerance |
| Audio bit depth | 16-bit signed int via I2S to CS43L22 DAC |
| CPU | STM32F407VG @ 168 MHz, Cortex-M4 with FPV4-SP-D16 FPU |
| Memory | No dynamic allocation on the audio path; VoiceManager placed in CCMRAM |
| Polyphony | 8 simultaneous voices, no audible artifacts on voice steal |
| Buffer / Latency | 128-sample stereo circular buffer; 64-sample halves filled inside one 1.333 ms interrupt window |
| Scheduling | All timing is interrupt-driven; no OS |
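The 1.333 ms window follows directly from the buffer and clock figures in the table. As a minimal sketch (the constant names are hypothetical, not taken from the project's source), the budget can be derived at compile time:

```cpp
#include <cstdint>

// Figures taken from the constraints table; names are illustrative only.
constexpr std::uint32_t kCpuHz        = 168'000'000;  // core clock
constexpr std::uint32_t kSampleRateHz = 48'000;       // audio sample rate
constexpr std::uint32_t kBlockFrames  = 64;           // frames per DMA half-buffer

// Time available to render one half-buffer, in microseconds (~1333.3).
constexpr double kBlockWindowUs = 1e6 * kBlockFrames / kSampleRateHz;

// CPU cycles available inside that window.
constexpr std::uint64_t kBlockCycles =
    static_cast<std::uint64_t>(kCpuHz) * kBlockFrames / kSampleRateHz;

// Cycles available per output frame, shared by all voices.
constexpr std::uint32_t kCyclesPerFrame = kCpuHz / kSampleRateHz;

static_assert(kBlockCycles == 224'000, "64 frames at 48 kHz on a 168 MHz core");
static_assert(kCyclesPerFrame == 3'500, "168 MHz / 48 kHz");
```

Everything the audio interrupt does for one block, across all eight voices, has to fit inside those 224,000 cycles.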

Architecture

The system strictly separates hardware control from the synthesis engine. The audio path runs exclusively inside the DMA half-transfer and transfer-complete interrupts and never touches I2C, UART, or any other shared peripheral; this is the core guarantee that makes real-time stability possible.

Figure: Signal flow architecture, USB MIDI through DSP chain to DAC output. Audio data path: USB OTG through VoiceManager, SVF filter, DMA circular buffer, and I2S to the CS43L22 codec.

To keep audio processing deterministic, the audio callback takes no mutexes. Parameter updates rely on the fact that an aligned 32-bit store is single-copy atomic on the Cortex-M4, so a float written by the control loop can never be observed half-written by the audio ISR. This keeps the control loop and the audio ISR consistent without introducing priority inversion or jitter. MIDI input is decoupled through a lock-free circular buffer, so the synthesis engine never waits on the main loop.
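One way to make that atomicity assumption explicit in code is to wrap each shared parameter in std::atomic<float> with relaxed ordering. This is a sketch under assumptions: the struct, field, and function names below are hypothetical, and the project itself may use plain aligned floats instead.

```cpp
#include <atomic>

// Hypothetical parameter block shared by the control loop and the audio ISR.
struct SharedParams {
    std::atomic<float> cutoffHz{1000.0f};
    std::atomic<float> resonance{0.5f};
};

// Fails to compile on any target where a float store would need a lock.
static_assert(std::atomic<float>::is_always_lock_free,
              "float parameters must be lock-free for use inside the audio ISR");

// Control-loop side: publish a new value. Relaxed ordering is enough because
// each parameter is independent and only needs to be free of tearing.
inline void setCutoff(SharedParams& p, float hz) {
    p.cutoffHz.store(hz, std::memory_order_relaxed);
}

// Plain copy of the parameters for use while rendering one block.
struct ParamSnapshot {
    float cutoffHz;
    float resonance;
};

// ISR side: read every parameter once at the start of the block.
inline ParamSnapshot snapshot(const SharedParams& p) {
    return ParamSnapshot{p.cutoffHz.load(std::memory_order_relaxed),
                         p.resonance.load(std::memory_order_relaxed)};
}
```

Each value is individually untorn, but two parameters read in the same snapshot may come from different control-loop updates. Parameters that must change together need a different mechanism, such as a queue of update messages.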

User hardware interaction is driven by a dedicated TIM4 interrupt, which triggers ADC DMA scans of eight potentiometers through an analog multiplexer. A GPIO interrupt handles button debouncing for waveform selection. To prevent bus contention, the OLED and LED drivers update over I2C at a lower priority, fully decoupled from the time-critical audio path.

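Placing VoiceManager in CCMRAM keeps its accesses off the bus matrix that the DMA controller uses. On the STM32F407 the CCM is connected directly to the core's data bus and is not reachable by DMA, so voice-state reads and writes cannot contend with the audio DMA stream during the critical interrupt window.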
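With GCC, this kind of placement is usually expressed with a section attribute that the linker script maps to the CCM address range. The sketch below is illustrative: the section name .ccmram, the SYNTH_TARGET_STM32 macro, and the storage layout are all assumptions, not the project's actual definitions.

```cpp
#include <cstddef>

// SYNTH_TARGET_STM32 is a hypothetical build flag; on a desktop build the
// attribute is dropped so the same code compiles for host-side tests.
#if defined(SYNTH_TARGET_STM32)
#define CCMRAM_DATA __attribute__((section(".ccmram")))
#else
#define CCMRAM_DATA
#endif

// Stand-in for the real VoiceManager state (hypothetical layout, 8 KB).
struct VoiceManagerStorage {
    float voiceState[8][256];

    // Explicit reset: depending on the linker script and startup code, a
    // custom section may not be zero-initialized before main() runs.
    void reset() {
        for (auto& voice : voiceState)
            for (float& x : voice) x = 0.0f;
    }
};

// Lives in CCM on target; must never be handed to the DMA controller.
static VoiceManagerStorage g_voiceManager CCMRAM_DATA;

inline VoiceManagerStorage& voiceManager() { return g_voiceManager; }

constexpr std::size_t kCcmBytes = 64u * 1024u;  // CCM size on the STM32F407
static_assert(sizeof(VoiceManagerStorage) <= kCcmBytes,
              "voice state must fit in the 64 KB CCM region");
```

The audio DMA buffer itself has to stay in main SRAM, since the DMA controller cannot address CCM.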

Engineering Deep-Dive

Problem: Filter Selection Within the CPU Budget

The original target was a Moog Ladder filter. Profiling with DWT->CYCCNT showed that it costs 275 cycles/sample, and that is already the aggressively reduced version, in which per-stage tanh saturation is replaced with a single tanh on the combined input+feedback signal. A faithful Huovilainen implementation (5 tanh calls per oversample iteration, ×4 for oversampling) would cost roughly four times as much. Even the stripped-down version takes 838 µs per 8-voice block, 62.9% of the 1.333 ms window, before any oscillator or ADSR runs.

Solution:

I replaced it with a Zero-Delay Feedback State Variable Filter (ZDF SVF). Solving the algebraic loop analytically for each sample removes the unit delay that naive implementations insert into the feedback path, so the response stays close to the analog prototype, at 82 cycles/sample: a 3.35× reduction. At 8 voices, filter cost drops to 251 µs (18.8% of budget), making full polyphony viable.
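For reference, a widely used formulation of a zero-delay-feedback SVF is the trapezoidal-integrator form sketched below. It is not necessarily the exact code that shipped. The names are hypothetical, k is the damping term (k = 1/Q, so smaller k means more resonance), and the measurement helper assumes a 48 kHz sample rate.

```cpp
#include <cmath>

// Trapezoidal (zero-delay-feedback) state variable filter, lowpass output.
struct ZdfSvf {
    float a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;  // precomputed coefficients
    float ic1eq = 0.0f, ic2eq = 0.0f;       // integrator states

    void set(float cutoffHz, float sampleRateHz, float k) {
        // Prewarped integrator gain, so the cutoff lands on the target frequency.
        const float g = std::tan(3.14159265f * cutoffHz / sampleRateHz);
        a1 = 1.0f / (1.0f + g * (g + k));  // closed-form solution of the feedback loop
        a2 = g * a1;
        a3 = g * a2;
    }

    float lowpass(float v0) {
        const float v3 = v0 - ic2eq;
        const float v1 = a1 * ic1eq + a2 * v3;          // bandpass
        const float v2 = ic2eq + a2 * ic1eq + a3 * v3;  // lowpass
        ic1eq = 2.0f * v1 - ic1eq;                      // trapezoidal state update
        ic2eq = 2.0f * v2 - ic2eq;
        return v2;
    }
};

// Test helper: steady-state peak gain of the lowpass output for a unit sine.
inline float measureLowpassGain(float toneHz, float cutoffHz, float k) {
    constexpr double kFs = 48000.0;
    constexpr int kTotal = 9600, kWindow = 4800;  // measure after the transient
    ZdfSvf f;
    f.set(cutoffHz, static_cast<float>(kFs), k);
    float peak = 0.0f;
    for (int n = 0; n < kTotal; ++n) {
        const float x = static_cast<float>(
            std::sin(2.0 * 3.14159265358979 * toneHz * n / kFs));
        const float y = std::fabs(f.lowpass(x));
        if (n >= kTotal - kWindow && y > peak) peak = y;
    }
    return peak;
}
```

With k = √2 (a Butterworth response), measureLowpassGain(1000, 1000, √2) should come out close to 0.7071, which is the same −3 dB property a host-side test can check without hardware.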

Problem: Voice Stealing Without Audible Clicks

When all 8 voices are active, a new note must steal an existing one. A hard oscillator reset produces an immediate discontinuity in the audio signal, an audible click.

Solution:

A two-stage soft-kill: adsr.kill() calculates a ramp that fades the stolen voice to silence over 240 samples (~5 ms at 48 kHz). The oscillator stores the incoming note in a pending struct and waits for the ramp to reach zero before executing noteOn. Voice allocation follows a deterministic LRU priority: idle → released → oldest active. Only the oldest-active path triggers the kill ramp.
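The allocation order and the deferred note-on can be sketched as follows. This is an illustration under assumptions, not the project's code: all names are hypothetical, the "oldest released" tie-break is a guess, and a sample countdown is used so the ramp ends after exactly 240 samples regardless of float rounding.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

enum class VoiceState : std::uint8_t { Idle, Released, Active, Killing };

struct Voice {
    VoiceState state = VoiceState::Idle;
    std::uint32_t startedAt = 0;  // note-on timestamp; lower means older
    int note = -1;
    int pendingNote = -1;         // note waiting for the kill ramp to finish
    float level = 0.0f;           // current envelope output
    float killStep = 0.0f;
    int killRemaining = 0;
};

constexpr int kKillSamples = 240;  // 5 ms at 48 kHz

// Priority: first idle voice, else oldest released, else oldest active.
template <std::size_t N>
int pickVoice(const std::array<Voice, N>& voices) {
    int released = -1, active = -1;
    for (std::size_t i = 0; i < N; ++i) {
        const Voice& v = voices[i];
        const int idx = static_cast<int>(i);
        if (v.state == VoiceState::Idle) return idx;
        if (v.state == VoiceState::Released &&
            (released < 0 || v.startedAt < voices[released].startedAt)) released = idx;
        if (v.state == VoiceState::Active &&
            (active < 0 || v.startedAt < voices[active].startedAt)) active = idx;
    }
    return released >= 0 ? released : active;  // -1 if every voice is mid-kill
}

inline void startNote(Voice& v, int note, std::uint32_t now) {
    v.note = note;
    v.startedAt = now;
    v.pendingNote = -1;
    v.state = VoiceState::Active;
}

template <std::size_t N>
int noteOn(std::array<Voice, N>& voices, int note, std::uint32_t now) {
    const int i = pickVoice(voices);
    if (i < 0) return -1;
    Voice& v = voices[i];
    if (v.state == VoiceState::Active) {  // steal: ramp down first, start later
        v.pendingNote = note;
        v.killStep = v.level / kKillSamples;
        v.killRemaining = kKillSamples;
        v.state = VoiceState::Killing;
    } else {
        startNote(v, note, now);          // idle or released: start immediately
    }
    return i;
}

// Called once per sample for a voice that is being stolen.
inline void tickKill(Voice& v, std::uint32_t now) {
    if (v.state != VoiceState::Killing) return;
    v.level -= v.killStep;
    if (--v.killRemaining == 0) {
        v.level = 0.0f;                   // land on exact silence
        startNote(v, v.pendingNote, now);
    }
}
```

Because the stolen voice keeps its old note until the countdown ends, the oscillator never jumps phase or pitch while the signal is still audible.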

Problem: Custom USB MIDI Class Compliance

The STM32 HAL USB library provides generic device templates but no MIDI class, so the board would appear as an unknown device to any DAW.

Solution:

I extended the USB Device middleware with a custom MIDI class and wrote its descriptors by hand. The board now enumerates as a standard class-compliant MIDI device and is recognized plug-and-play by any DAW without drivers.
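Class compliance comes down to two things defined by the USB MIDI 1.0 class specification: the interface must advertise Audio class with the MIDIStreaming subclass, and data is exchanged as 4-byte event packets whose first byte carries a cable number and a Code Index Number (CIN). The decoder below is a minimal sketch covering only three message types. The type and function names are hypothetical and are not part of the STM32 USB Device Library.

```cpp
#include <cstdint>

// Interface descriptor values from the USB Audio / MIDI class specifications.
constexpr std::uint8_t kUsbClassAudio            = 0x01;  // bInterfaceClass
constexpr std::uint8_t kUsbSubclassMidiStreaming = 0x03;  // bInterfaceSubClass

struct MidiEvent {
    enum Type : std::uint8_t { None, NoteOff, NoteOn, ControlChange };
    Type type;
    std::uint8_t channel;  // 0..15
    std::uint8_t data1;    // note number or controller number
    std::uint8_t data2;    // velocity or controller value
};

// Decode one 4-byte USB-MIDI event packet:
//   byte 0: cable number (high nibble) | Code Index Number (low nibble)
//   bytes 1..3: the MIDI message itself
inline MidiEvent decodeUsbMidiPacket(const std::uint8_t* p) {
    const std::uint8_t cin = p[0] & 0x0F;
    MidiEvent e{MidiEvent::None,
                static_cast<std::uint8_t>(p[1] & 0x0F),
                static_cast<std::uint8_t>(p[2] & 0x7F),
                static_cast<std::uint8_t>(p[3] & 0x7F)};
    switch (cin) {
        case 0x8: e.type = MidiEvent::NoteOff; break;
        case 0x9:  // note-on with velocity 0 is treated as note-off
            e.type = (e.data2 == 0) ? MidiEvent::NoteOff : MidiEvent::NoteOn;
            break;
        case 0xB: e.type = MidiEvent::ControlChange; break;
        default:  break;  // other CINs are ignored in this sketch
    }
    return e;
}
```

Decoded events would then be pushed into the lock-free buffer that feeds the synthesis engine.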

Validation & Performance

CPU Profiling

8-voice polyphony runs at 26% CPU utilization, leaving 74% headroom in the 1.333 ms audio block window. Measured with DWT->CYCCNT reads before and after VoiceManager::process().

| Condition | Time | % of 1.333 ms window |
| --- | --- | --- |
| Baseline overhead (0 voices) | 20 µs | 1.5% |
| 8 voices, full polyphony | 347 µs | 26.0% |
| Per-voice cost | ~41 µs | ~3.1% |
| Headroom remaining | 986 µs | 74.0% |
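A measurement of this kind can be sketched with the CMSIS register definitions for the DWT cycle counter. The DWT and CoreDebug symbols are standard CMSIS names; the SYNTH_TARGET_STM32 guard and the helper names are assumptions so the conversion code also builds on a desktop.

```cpp
#include <cstdint>

#if defined(SYNTH_TARGET_STM32)  // hypothetical build flag for the target
#include "stm32f4xx.h"           // CMSIS device header: DWT, CoreDebug

inline void cycleCounterInit() {
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the trace block
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // start counting cycles
}

inline std::uint32_t cycleCounterRead() { return DWT->CYCCNT; }
#endif

constexpr double kCpuHz    = 168e6;
constexpr double kWindowUs = 1e6 * 64 / 48000;  // one 64-frame block at 48 kHz

// Unsigned subtraction stays correct across one wrap of the 32-bit counter.
constexpr std::uint32_t elapsedCycles(std::uint32_t start, std::uint32_t end) {
    return end - start;
}

constexpr double cyclesToUs(std::uint32_t cycles) {
    return cycles * 1e6 / kCpuHz;
}

constexpr double budgetPercent(std::uint32_t cycles) {
    return 100.0 * cyclesToUs(cycles) / kWindowUs;
}
```

On target, the counter is read immediately before and after VoiceManager::process() and the difference is passed to budgetPercent(). At 168 MHz, the 347 µs in the table corresponds to roughly 58,300 cycles.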

Filter Comparison

The Moog Ladder was profiled against the ZDF SVF under identical conditions. At 8 voices, the Moog consumes 62.9% of the entire audio block budget on filtering alone before a single oscillator or envelope runs. The SVF's 3.35× efficiency advantage is what makes 8-voice polyphony viable on this hardware.

| Filter | Cycles/sample | Cost at 8 voices | % of budget |
| --- | --- | --- | --- |
| ZDF SVF (shipped) | 82 | ~251 µs | 18.8% |
| Moog Ladder (rejected) | 275 | ~838 µs | 62.9% |
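The per-block costs in the table can be cross-checked from the per-sample cycle counts. The helper below is a hypothetical illustration that assumes a 64-frame block and a 168 MHz clock.

```cpp
// Cost in microseconds of running one filter per voice over one audio block.
constexpr double filterCostUs(double cyclesPerSample, int voices,
                              int framesPerBlock = 64, double cpuHz = 168e6) {
    return cyclesPerSample * voices * framesPerBlock * 1e6 / cpuHz;
}

// Share of the block window, in percent.
constexpr double filterBudgetPercent(double costUs,
                                     double windowUs = 1e6 * 64 / 48000) {
    return 100.0 * costUs / windowUs;
}
```

For the Moog Ladder, 275 × 8 × 64 = 140,800 cycles, or about 838 µs (62.9%). For the SVF, 82 × 8 × 64 = 41,984 cycles, or about 250 µs. That agrees with the ~251 µs in the table, with the small gap presumably due to the cycle count being rounded to a whole number.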

Algorithm Correctness

A host-side CTest suite in tests/ runs on any desktop compiler, no hardware required. It separates algorithm correctness from hardware bring-up and runs on every push via GitHub Actions CI.

| Test | Result |
| --- | --- |
| SVF −3 dB at cutoff | Measured ratio 0.7071 vs. theoretical 0.70711, < 0.001% error |
| SVF self-oscillation stability | Bounded over 10,000 samples at max resonance (k = 0.01) |
| Moog Ladder DC stability | Converges without drift at resonance = 0 |
| ADSR attack accuracy | Reaches 1.0 within ±10% of configured attack time |
| ADSR kill ramp duration | Reaches silence within 300 samples (contract: 240 = 5 ms @ 48 kHz) |
| ADSR release convergence | Forced to exact 0.0 → IDLE, no asymptotic fade |
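The release-convergence row refers to a property that is easy to get wrong: an exponential decay never reaches zero by itself, so the envelope has to snap to 0.0 once it falls below a floor. A minimal sketch of that behavior is shown below, along with the kind of helper a host-side test might use. The floor value, the coefficient, and all names are assumptions.

```cpp
// Hypothetical silence floor (-80 dB); the project's threshold may differ.
constexpr float kSilenceFloor = 1.0e-4f;

struct ReleaseStage {
    float level = 0.0f;
    float coeff = 0.999f;  // per-sample decay multiplier
    bool idle = true;

    void start(float fromLevel, float decayCoeff) {
        level = fromLevel;
        coeff = decayCoeff;
        idle = false;
    }

    float tick() {
        if (idle) return 0.0f;
        level *= coeff;               // exponential decay
        if (level < kSilenceFloor) {  // snap instead of fading forever
            level = 0.0f;
            idle = true;
        }
        return level;
    }
};

// Test helper: number of samples until the stage reports idle, or -1.
inline int samplesUntilIdle(ReleaseStage& r, int maxSamples) {
    int n = 0;
    while (!r.idle && n < maxSamples) {
        r.tick();
        ++n;
    }
    return r.idle ? n : -1;
}
```

A voice that reaches idle with an exact 0.0 output can be reclaimed by the allocator without any residual signal.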

Memory Utilization

| Region | Total | Used | Utilization |
| --- | --- | --- | --- |
| Flash | 1 MB | 212 KB | 20.7% |
| SRAM | 128 KB | 69 KB | 54.0% |
| CCMRAM | 64 KB | 34 KB | 52.6% |

Flash usage is dominated by wavetable data: 8 waveforms × 4096 samples × 4 bytes = 128 KB. CCMRAM is almost entirely the VoiceManager, a deliberate placement rather than a forced one.
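The 128 KB figure can be pinned down with a compile-time check. The sketch below also shows one plausible way to read from such a bank with a linear crossfade between adjacent tables. The float sample type, the morph scheme, and all names are assumptions rather than the project's actual implementation.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kTables   = 8;     // waveforms in the bank
constexpr std::size_t kTableLen = 4096;  // samples per waveform (power of two)

using Wavetable = std::array<float, kTableLen>;
using WaveBank  = std::array<Wavetable, kTables>;

static_assert(sizeof(float) == 4, "4-byte samples assumed");
static_assert(kTables * kTableLen * sizeof(float) == 128u * 1024u,
              "wavetable data should occupy exactly 128 KB");

// Read one sample with a linear crossfade between two adjacent tables.
// morph runs from 0 (first table) to kTables - 1 (last table).
inline float readMorphed(const WaveBank& bank, float morph, std::size_t index) {
    if (morph < 0.0f) morph = 0.0f;
    const float maxMorph = static_cast<float>(kTables - 1);
    if (morph > maxMorph) morph = maxMorph;

    std::size_t a = static_cast<std::size_t>(morph);  // lower table
    if (a > kTables - 2) a = kTables - 2;             // keep a + 1 in range
    const float frac = morph - static_cast<float>(a);

    const std::size_t i = index & (kTableLen - 1);    // wrap the phase index
    return bank[a][i] + frac * (bank[a + 1][i] - bank[a][i]);
}
```

On target, declaring the bank const typically lets the toolchain keep it in flash rather than copying it into SRAM.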

Reflection

Real-time audio programming has a single governing rule: the audio callback is sacred. No allocations, no blocking calls, no peripheral access. Every architectural decision in this project (CCMRAM placement, atomic float writes, the kill ramp) exists to protect that invariant.

The filter swap was the most instructive moment. Choosing the Moog Ladder based on "analog character" before measuring its CPU cost would have made 8-voice polyphony impossible. The limited speed of the MCU revealed how expensive the Huovilainen model is, and how much of a privilege it is to be able to run it in real time.

If I rebuilt this today, I'd replace the implicit Cortex-M4 atomicity of float writes with an explicit lock-free SPSC ring buffer for parameter updates: more portable, easier to audit, and not dependent on knowing the architecture's memory guarantees.
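A single-producer, single-consumer ring of that kind might look like the sketch below. The ParamUpdate message type and the class name are hypothetical. The capacity must be a power of two so the free-running indices can be masked.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical message: which parameter changed and its new value.
struct ParamUpdate {
    std::uint8_t id;
    float value;
};

template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "capacity must be a power of two");

public:
    // Producer side only (control loop). Returns false when the ring is full.
    bool push(const T& item) {
        const std::uint32_t head = head_.load(std::memory_order_relaxed);
        const std::uint32_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;
        buf_[head & (N - 1)] = item;
        head_.store(head + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer side only (audio ISR). Returns false when the ring is empty.
    bool pop(T& out) {
        const std::uint32_t tail = tail_.load(std::memory_order_relaxed);
        const std::uint32_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;
        out = buf_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::array<T, N> buf_{};
    std::atomic<std::uint32_t> head_{0};  // written only by the producer
    std::atomic<std::uint32_t> tail_{0};  // written only by the consumer
};
```

The audio ISR would drain the ring at the start of each block, so every parameter change takes effect on a block boundary and related parameters can be delivered together.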

Future Work