STM32 Polyphonic Synth: Bare-Metal DSP

8-Voice Polyphonic Wavetable Synthesizer on Cortex-M4

February 2026

Board: STM32F407G Discovery

Language: C11 / C++17

Frameworks: STM32F4xx HAL, CMSIS 5, STM32 USB Device Library

Build: CMake, CTest, GitHub Actions CI

Tools: STM32CubeMX, ST-Link GDB, DWT cycle profiler

Overview

I built a fully polyphonic synthesizer on bare hardware (no Linux, no RTOS, no audio framework) to understand the full stack from clock configuration to DSP algorithm. The STM32F407G Discovery was the target: an on-board CS43L22 codec, a Cortex-M4 FPU, and enough SRAM to be genuinely interesting without being trivially easy.

The result is a self-contained USB MIDI instrument built solo over 4 weeks: 8-voice polyphony, real-time wavetable morphing across 8 waveforms (sine, saw, square, Rhodes, clav, choir, acid, glass), a zero-delay feedback filter, hardware knobs, an OLED display, and per-voice LED metering.

Figure: STM32 synth hardware setup. STM32F407G Discovery board with SSD1306 OLED, potentiometer bank, and analog multiplexer.

Constraints

The system runs under hard real-time constraints with no safety net: no OS scheduler, no dynamic memory allocation on the audio path, and no tolerance for dropouts.

Constraint	Requirement
Sample rate	48 kHz, interrupt-driven; no dropout tolerance
Audio bit depth	16-bit signed int via I2S to CS43L22 DAC
CPU	STM32F407VG @ 168 MHz, Cortex-M4 with FPV4-SP-D16 FPU
Memory	No dynamic allocation on the audio path; VoiceManager placed in CCMRAM
Polyphony	8 simultaneous voices, no audible artifacts on voice steal
Buffer / Latency	128-sample stereo circular buffer; 64-sample halves filled inside one 1.333 ms interrupt window
Scheduling	All timing is interrupt-driven; no OS
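The timing figures in the table follow from three numbers: clock speed, sample rate, and block size. A minimal sketch of that arithmetic, with constant and function names that are illustrative rather than taken from the project source:

```cpp
#include <cstdint>

// Figures from the constraints table; the names are hypothetical.
constexpr std::uint32_t kCpuHz        = 168'000'000;
constexpr std::uint32_t kSampleRateHz = 48'000;
constexpr std::uint32_t kBlockFrames  = 64;  // one half of the circular buffer
constexpr std::uint32_t kVoices       = 8;

// Time available to fill one half-buffer, in microseconds.
constexpr double blockWindowUs() {
    return 1e6 * kBlockFrames / kSampleRateHz;
}

// CPU cycles available in one block window.
constexpr std::uint64_t cyclesPerBlock() {
    return static_cast<std::uint64_t>(kCpuHz) * kBlockFrames / kSampleRateHz;
}

// Cycles available per voice per sample if the entire window went to voices.
constexpr double cyclesPerVoiceSample() {
    return static_cast<double>(cyclesPerBlock()) / (kBlockFrames * kVoices);
}

static_assert(cyclesPerBlock() == 224'000, "64 frames at 48 kHz and 168 MHz");
```

The window works out to about 1333.3 µs, or 224,000 cycles, which leaves 437.5 cycles per voice per sample. Against that figure, a 275-cycle filter consumes about 62.9% of the budget on its own.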

Architecture

The system strictly separates hardware control from the synthesis engine. The audio path runs exclusively inside the DMA half-transfer interrupt and never touches I2C, UART, or any shared peripheral; this is the core guarantee that makes real-time stability possible.

To guarantee deterministic audio processing, I eliminated mutexes from the audio callback. Parameter updates rely on the fact that an aligned 32-bit store is a single, indivisible access on the Cortex-M4, so the audio ISR never observes a half-written float. This keeps the control loop and the audio ISR consistent without introducing priority inversion or jitter. MIDI input is decoupled via a lock-free circular buffer, allowing the synthesis engine to operate in total isolation from the main loop.
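A lock-free circular buffer with one producer and one consumer can be sketched as below. This is not the project's code: `MidiMsg`, `SpscRing`, and `Param` are hypothetical names, and which context pushes and which pops is an assumption. The sketch uses `std::atomic` so that the ordering guarantees are stated explicitly.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical 3-byte MIDI channel message.
struct MidiMsg { std::uint8_t status, data1, data2; };

// Single-producer / single-consumer ring. Indices run freely and are masked
// on access, so N must be a power of two.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N > 1 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    // Producer side only. Returns false when the ring is full.
    bool push(const T& v) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;
        buf_[head & (N - 1)] = v;
        head_.store(head + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    // Consumer side only. Returns false when the ring is empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t head = head_.load(std::memory_order_acquire);
        if (head == tail) return false;
        out = buf_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);  // free the slot
        return true;
    }
private:
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// Single-word parameter shared between the control loop and the audio ISR.
struct Param {
    std::atomic<float> value{0.0f};
    void  set(float v) { value.store(v, std::memory_order_relaxed); }
    float get() const  { return value.load(std::memory_order_relaxed); }
};
```

Neither side ever waits on the other: a full ring drops the new message and an empty ring returns immediately, so the consumer's worst-case time is bounded by the number of queued messages.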

User hardware interaction is managed via a dedicated TIM4 interrupt, which drives ADC DMA scanning across eight potentiometers via an analog multiplexer. A GPIO interrupt handles button debouncing for waveform selection. To prevent bus contention, the OLED and LED drivers update over I2C at a lower priority, fully decoupled from the time-critical audio path.

CCMRAM placement of VoiceManager eliminates bus contention with the DMA controller during the critical audio interrupt window.
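With a GCC-based toolchain, this kind of placement is typically expressed with a section attribute. A minimal sketch, assuming the linker script defines an output section named `.ccmram` mapped to the CCM region; the section name, the macro, and the stand-in struct are all assumptions, not taken from the project:

```cpp
#include <cstdint>

// Apply the section attribute only on ARM GCC/Clang builds so that the same
// source still compiles in host-side tests. ".ccmram" is an assumed name and
// must match a section defined in the linker script.
#if defined(__arm__) && defined(__GNUC__)
#define CCMRAM_DATA __attribute__((section(".ccmram")))
#else
#define CCMRAM_DATA
#endif

// Hypothetical stand-in for the real VoiceManager state.
struct VoiceManagerSketch {
    float phase[8];
    float level[8];
};

CCMRAM_DATA VoiceManagerSketch g_voiceManager;

// The F407 has 64 KB of CCM; fail the build early if the state outgrows it.
static_assert(sizeof(VoiceManagerSketch) <= 64 * 1024,
              "voice state must fit in CCMRAM");
```

Depending on the startup code, a custom section may not be zeroed or initialized at boot the way `.bss` and `.data` are, so explicitly initializing the object before the first audio interrupt is the safer assumption.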

Engineering Deep-Dive

Problem: Filter Selection Within the CPU Budget

The original target was a Moog Ladder filter. Profiling with DWT->CYCCNT showed it costs 275 cycles/sample, and that is already the aggressively reduced version: per-stage tanh saturation was replaced with a single tanh on the combined input+feedback signal. A faithful Huovilainen implementation (5 tanh calls per oversample iteration, ×4 for oversampling) would cost roughly four times as much. Even the stripped-down version costs 838 µs per 8-voice block (275 cycles × 64 samples × 8 voices at 168 MHz) before oscillators or ADSR run.

Solution:

I replaced it with a Zero-Delay Feedback State Variable Filter (ZDF SVF). Solving the algebraic loop analytically per sample eliminates the unit-delay approximation of naive implementations and gives analog-accurate phase response at 82 cycles/sample, a 3.35× reduction. At 8 voices, filter cost drops to 251 µs (18.8% of budget), making full polyphony viable.
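The project's filter code is not shown here, so the following is only a sketch of the technique. It uses a widely published trapezoidal-integrator form of the SVF, in which the feedback loop is solved ahead of time into three coefficients. The class name, method names, and the measurement helper are hypothetical. `k` is the damping term (1/Q), so a smaller `k` means more resonance.

```cpp
#include <cmath>

// Zero-delay feedback state variable filter, trapezoidal integration.
class ZdfSvf {
public:
    // cutoffHz: corner frequency; k: damping (1/Q).
    void set(float cutoffHz, float sampleRateHz, float k) {
        // Prewarped integrator gain, so the digital corner lands on cutoffHz.
        const float g = std::tan(3.14159265f * cutoffHz / sampleRateHz);
        a1_ = 1.0f / (1.0f + g * (g + k));  // closed-form loop solution
        a2_ = g * a1_;
        a3_ = g * a2_;
    }

    // Process one sample and return the low-pass output.
    float lowpass(float in) {
        const float v3 = in - s2_;
        const float v1 = a1_ * s1_ + a2_ * v3;        // band-pass node
        const float v2 = s2_ + a2_ * s1_ + a3_ * v3;  // low-pass node
        s1_ = 2.0f * v1 - s1_;                        // update integrator states
        s2_ = 2.0f * v2 - s2_;
        return v2;
    }

private:
    float a1_ = 1.0f, a2_ = 0.0f, a3_ = 0.0f;
    float s1_ = 0.0f, s2_ = 0.0f;
};

// Host-side helper: steady-state RMS gain for a sine at the cutoff frequency.
double gainAtCutoff(float cutoffHz, float sampleRateHz, float k) {
    ZdfSvf f;
    f.set(cutoffHz, sampleRateHz, k);
    const double w = 2.0 * 3.14159265358979323846 * cutoffHz / sampleRateHz;
    const int settle = 4800, measure = 4800;  // discard the initial transient
    double inSq = 0.0, outSq = 0.0;
    for (int n = 0; n < settle + measure; ++n) {
        const float x = static_cast<float>(std::sin(w * n));
        const float y = f.lowpass(x);
        if (n >= settle) {
            inSq  += static_cast<double>(x) * x;
            outSq += static_cast<double>(y) * y;
        }
    }
    return std::sqrt(outSq / inSq);
}
```

With `k` set to the square root of 2 (a Butterworth response), `gainAtCutoff(1000.0f, 48000.0f, 1.41421356f)` should come out at about 0.7071, which is the -3 dB point.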

Problem: Voice Stealing Without Audible Clicks

When all 8 voices are active, a new note must steal an existing one. A hard oscillator reset produces an immediate discontinuity in the audio signal, an audible click.

Solution:

A two-stage soft-kill: adsr.kill() calculates a ramp that fades the stolen voice to silence over 240 samples (~5 ms at 48 kHz). The oscillator stores the incoming note in a pending struct and waits for the ramp to reach zero before executing noteOn. Voice allocation follows a deterministic LRU priority: idle → released → oldest active. Only the oldest-active path triggers the kill ramp.
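The allocation order and the kill ramp can be sketched as follows. The type and function names are hypothetical, and two details are assumptions: a linear ramp shape, and choosing the oldest released voice when several are available.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

enum class VoiceState : std::uint8_t { Idle, Released, Active };

struct VoiceInfo {
    VoiceState state = VoiceState::Idle;
    std::uint32_t startedAt = 0;  // hypothetical note-on timestamp
};

// Pick a voice in priority order: idle, then released, then oldest active.
// needsKill is set only when an active voice has to be stolen.
template <std::size_t N>
std::size_t pickVoice(const std::array<VoiceInfo, N>& voices, bool& needsKill) {
    needsKill = false;
    std::size_t bestReleased = N, bestActive = N;  // N means "none found"
    for (std::size_t i = 0; i < N; ++i) {
        const VoiceInfo& v = voices[i];
        if (v.state == VoiceState::Idle) return i;
        std::size_t& best =
            (v.state == VoiceState::Released) ? bestReleased : bestActive;
        if (best == N || v.startedAt < voices[best].startedAt) best = i;
    }
    if (bestReleased != N) return bestReleased;
    needsKill = true;
    return bestActive;
}

// Linear fade from the current envelope level to exactly zero.
struct KillRamp {
    static constexpr int kSamples = 240;  // 5 ms at 48 kHz
    float level = 0.0f;
    float step = 0.0f;
    int remaining = 0;

    void start(float currentLevel) {
        level = currentLevel;
        step = currentLevel / kSamples;
        remaining = kSamples;
    }
    // Gain for the next sample; the final step lands on exactly 0.0.
    float next() {
        if (remaining == 0) return 0.0f;
        --remaining;
        level = (remaining == 0) ? 0.0f : level - step;
        return level;
    }
    bool done() const { return remaining == 0; }
};
```

In this arrangement the stolen voice keeps rendering its old note while `next()` scales it down, and the pending note is started only once `done()` returns true, so the oscillator reset happens at zero amplitude.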

Problem: Custom USB MIDI Class Compliance

The STM32 HAL USB library provides generic device templates but no MIDI class, so the board would appear as an unknown device to any DAW.

Solution:

I modified the STM32 USB Device Library middleware to implement custom MIDI descriptors from scratch. The board now enumerates as a standard class-compliant MIDI device and is recognized plug-and-play by any DAW without drivers.
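Once the device enumerates, the host sends MIDI data as 4-byte USB-MIDI event packets, as defined by the USB MIDI 1.0 class specification. A minimal decoding sketch for note messages follows. The struct and function names are hypothetical and the project's own parser is not shown in this write-up; the class codes and packet layout come from the specification.

```cpp
#include <cstdint>

// Interface class codes a class-compliant MIDI device reports in its
// descriptors (USB Audio class, MIDIStreaming subclass).
constexpr std::uint8_t kUsbClassAudio            = 0x01;
constexpr std::uint8_t kUsbSubclassMidiStreaming = 0x03;

// Hypothetical decoded note event.
struct NoteEvent {
    bool valid;  // false for anything other than Note On / Note Off
    bool on;
    std::uint8_t channel, note, velocity;
};

// Packet layout: byte 0 = cable number (high nibble) and Code Index Number
// (low nibble); bytes 1..3 = the MIDI message itself.
NoteEvent decodeUsbMidiPacket(const std::uint8_t p[4]) {
    NoteEvent ev{false, false, 0, 0, 0};
    const std::uint8_t cin = p[0] & 0x0F;
    if (cin != 0x8 && cin != 0x9) return ev;  // 0x8 = Note Off, 0x9 = Note On
    ev.valid    = true;
    ev.channel  = p[1] & 0x0F;
    ev.note     = p[2] & 0x7F;
    ev.velocity = p[3] & 0x7F;
    // By MIDI convention, a Note On with velocity 0 is treated as a Note Off.
    ev.on = (cin == 0x9) && (ev.velocity != 0);
    return ev;
}
```

Decoded events are small enough to be queued by value, which suits a fixed-size circular buffer between the USB side and the synthesis engine.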

Validation & Performance

CPU Profiling

8-voice polyphony runs at 26% CPU utilization, leaving 74% headroom in the 1.333 ms audio block window. Measured with DWT->CYCCNT reads before and after VoiceManager::process().

Condition	Time	% of 1.333 ms window
Baseline overhead (0 voices)	20 µs	1.5%
8 voices, full polyphony	347 µs	26.0%
Per-voice cost	~41 µs	~3.1%
Headroom remaining	986 µs	74.0%
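The measurement reduces to two reads of a free-running 32-bit cycle counter and a unit conversion. A sketch of the host-testable part, with hypothetical helper names; the target-side register accesses are shown only as comments, using the standard CMSIS symbols:

```cpp
#include <cstdint>

// Elapsed cycles between two reads of a free-running 32-bit counter.
// Unsigned subtraction stays correct across a single wraparound.
constexpr std::uint32_t elapsedCycles(std::uint32_t start, std::uint32_t end) {
    return end - start;
}

// Convert a cycle count to microseconds at the given core clock.
constexpr double cyclesToMicros(std::uint32_t cycles, std::uint32_t cpuHz) {
    return 1e6 * cycles / cpuHz;
}

// Share of the audio block window, in percent.
constexpr double percentOfWindow(double micros, double windowMicros) {
    return 100.0 * micros / windowMicros;
}

// On target, the counter is enabled once at startup:
//   CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
//   DWT->CYCCNT = 0;
//   DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
// and the two reads bracket the call being measured:
//   const uint32_t t0 = DWT->CYCCNT;
//   /* call VoiceManager::process() here */
//   const uint32_t cycles = elapsedCycles(t0, DWT->CYCCNT);
```

At 168 MHz, the 347 µs full-polyphony figure corresponds to 58,296 cycles, and `percentOfWindow(347.0, 1333.33)` gives about 26.0%.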

Filter Comparison

The Moog Ladder was profiled against the ZDF SVF under identical conditions. At 8 voices, the Moog consumes 62.9% of the entire audio block budget on filtering alone before a single oscillator or envelope runs. The SVF's 3.35× efficiency advantage is what makes 8-voice polyphony viable on this hardware.

Filter	Cycles/sample	Cost at 8 voices	% of budget
ZDF SVF (shipped)	82	~251 µs	18.8%
Moog Ladder (rejected)	275	~838 µs	62.9%

Algorithm Correctness

A host-side CTest suite in tests/ runs on any desktop compiler, no hardware required. It separates algorithm correctness from hardware bring-up and runs on every push via GitHub Actions CI.

Test	Result
SVF −3 dB at cutoff	Measured ratio 0.7071 vs. theoretical 0.70711, < 0.001% error
SVF self-oscillation stability	Bounded over 10,000 samples at max resonance (k = 0.01)
Moog Ladder DC stability	Converges without drift at resonance = 0
ADSR attack accuracy	Reaches 1.0 within ±10% of configured attack time
ADSR kill ramp duration	Reaches silence within 300 samples (contract: 240 = 5 ms @ 48 kHz)
ADSR release convergence	Forced to exact 0.0 → IDLE, no asymptotic fade

Memory Utilization

Region	Total	Used	Utilization
Flash	1 MB	212 KB	20.7%
SRAM	128 KB	69 KB	54.0%
CCMRAM	64 KB	34 KB	52.6%

Flash usage is dominated by wavetable data: 8 waveforms × 4096 samples × 4 bytes = 128 KB. CCMRAM usage is almost entirely the VoiceManager; that placement is deliberate, not forced.

Reflection

Real-time audio programming has a single governing rule: the audio callback is sacred. No allocations, no blocking calls, no peripheral access. Every architectural decision in this project (CCMRAM placement, atomic float writes, the kill ramp) exists to protect that invariant.

The filter swap was the most instructive moment. Had I committed to the Moog Ladder for its "analog character" without measuring its CPU cost, 8-voice polyphony would have been out of reach. The limited speed of the MCU showed how expensive the Huovilainen model is, and how much of a privilege it is to be able to run it in real time.

If I rebuilt this today, I'd replace the implicit Cortex-M4 atomicity of float writes with an explicit lock-free SPSC ring buffer for parameter updates: more portable, easier to audit, and not dependent on knowing the architecture's memory guarantees.

Future Work

Lock-Free Parameter Queue
Replace implicit Cortex-M4 atomic float writes with an explicit SPSC ring buffer, which is more portable and easier to reason about under code review.
Wavetable Anti-Aliasing
Implement BLIT or MinBLEP to suppress harmonic aliasing at high pitches, a known limitation of the current phase accumulator approach.
LFO Modulation Matrix
Add LFO-to-filter and LFO-to-amplitude routing with a configurable mod matrix, enabling filter sweeps and tremolo without hardware changes.
CMSIS-DSP SIMD Optimization
Port the ZDF SVF inner loop to CMSIS-DSP SIMD intrinsics for further CPU reduction, freeing headroom for additional voices or effects.
Custom PCB Shield
Consolidate the breadboarded multiplexer and potentiometer arrays into a unified PCB shield for the Discovery board.
SysEx Patch Storage
Add SysEx support for saving and recalling patches over MIDI, turning the synth into a fully programmable instrument.
