
Introducing Octoseq: A Workbench for Music Interpretation

Exploring how music can be interpreted over time, rather than simply reacted to.

What Octoseq is (and isn't)

Octoseq is an open source, experimental workbench for exploring how music can be interpreted over time, rather than simply reacted to.

It's not a finished product, and it's not a one-click generator. It's also not a live VJ tool.

Instead, it's a place to experiment with:

  • Analysing whole tracks, not just the present moment, using MIR (music information retrieval) algorithms
  • Extracting and crafting time-continuous signals that relate to perception (buildup, density, anticipation, release)
  • Treating those signals as inputs to other systems

My working hypothesis is that while no existing MIR algorithms truly capture human perceptual traits of music, many of them may be useful building blocks. With full-track context and some human guidance, it may be possible to assemble signals that align more closely with how music feels, not just how it measures.
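To make that concrete, here's a toy sketch of what assembling those building blocks might look like. None of this is Octoseq's API: rms and onsetDensity stand in for whatever per-frame features an analysis pass produces, and the weighting is invented purely for illustration.

```ts
// Toy sketch: combining two hypothetical per-frame MIR features (an RMS
// envelope and an onset-density curve) into a rough "buildup" signal.
type Signal = Float32Array; // one value per analysis frame

const normalise = (s: Signal): Signal => {
  const max = s.reduce((a, b) => Math.max(a, Math.abs(b)), 1e-9);
  return s.map((v) => v / max);
};

// Centred moving average; with full-track context, looking ahead is fine.
const smooth = (s: Signal, radius: number): Signal =>
  s.map((_, i) => {
    const lo = Math.max(0, i - radius);
    const hi = Math.min(s.length - 1, i + radius);
    let sum = 0;
    for (let j = lo; j <= hi; j++) sum += s[j];
    return sum / (hi - lo + 1);
  });

// Rising smoothed loudness, weighted by how busy the onsets are.
function buildup(rms: Signal, onsetDensity: Signal): Signal {
  const loud = smooth(normalise(rms), 50);
  const rise = loud.map((v, i) =>
    i === 0 ? 0 : Math.max(0, v - loud[i - 1]) * onsetDensity[i]
  );
  return normalise(smooth(rise, 25));
}
```

Whether a particular combination like this actually "feels" like buildup is exactly the kind of question the workbench is meant to help answer by ear and eye, rather than by formula.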

A scriptable visualiser is baked in, not necessarily as the end goal, but as a fast feedback loop, so you can see whether the signals you're extracting from a track actually "gel".

You can think of Octoseq as aiming to sit somewhere between traditional MIR tooling, audio-reactive visualisers, and generative video systems, but not quite belonging to any of those categories.

What you can do with Octoseq today

  • Load a single audio file, or a collection of stems
  • Split any audio file into frequency bands (a rough sketch of this step follows this list)
  • Apply MIR algorithms (e.g. onset detection) to full-spectrum audio and individual bands, then visualise and compare the results (2D spectra, 1D signals, or discrete events) across stems and bands
  • Derive custom signals and event streams from the above analysis
  • Script old-school, vaguely Milkdrop-ish 3D visuals + 2D effects using those signals and events
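The band-splitting step, in particular, is conceptually simple even though Octoseq runs its own analysis pipeline. Purely as an illustration (not Octoseq's internal implementation), here's roughly how a decoded track could be split into three bands with the standard Web Audio API:

```ts
// Illustration only: splitting a decoded track into low / mid / high bands
// using OfflineAudioContext and biquad filters from the standard Web Audio API.
async function splitIntoBands(buffer: AudioBuffer): Promise<AudioBuffer[]> {
  const bands: Array<[BiquadFilterType, number]> = [
    ["lowpass", 200],   // low band: roughly everything below 200 Hz
    ["bandpass", 1000], // mid band, centred around 1 kHz
    ["highpass", 4000], // high band: roughly everything above 4 kHz
  ];

  return Promise.all(
    bands.map(async ([type, frequency]) => {
      const ctx = new OfflineAudioContext(
        buffer.numberOfChannels, buffer.length, buffer.sampleRate);
      const src = ctx.createBufferSource();
      src.buffer = buffer;
      const filter = ctx.createBiquadFilter();
      filter.type = type;
      filter.frequency.value = frequency;
      src.connect(filter).connect(ctx.destination);
      src.start();
      return ctx.startRendering(); // one rendered AudioBuffer per band
    })
  );
}
```

Each band can then be run through the same MIR algorithms as the full-spectrum audio, which is what makes the per-band comparisons above possible.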

The scripting API includes:

  • Signal and event stream processing functions for interpolation, easing, smoothing, combining signals, emphasising peaks/troughs, etc.
  • Tempo-aware signal generators (sine, square, noise, etc.)
  • 3D scene building and mesh transforms
  • Frame feedback and post-processing pipelines for things like motion blur, bloom, etc.
  • The ability to bind all numeric API parameters to continuous signals (with peak-preserving sampling at the desired frame rate)
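For that last point, "peak-preserving sampling" means something like the following sketch (illustrative, not the exact implementation): when a dense analysis signal is resampled down to the render frame rate, each output frame takes the value of largest magnitude in its window rather than an average, so short spikes aren't smoothed away.

```ts
// Illustrative peak-preserving downsample: keep the largest-magnitude value
// in each output frame's window instead of averaging it away.
function peakPreservingResample(
  signal: Float32Array,
  srcRate: number, // samples per second of the analysis signal
  dstRate: number  // desired frames per second
): Float32Array {
  const outLen = Math.ceil((signal.length / srcRate) * dstRate);
  const out = new Float32Array(outLen);
  for (let i = 0; i < outLen; i++) {
    const start = Math.floor((i * srcRate) / dstRate);
    const end = Math.min(signal.length, Math.floor(((i + 1) * srcRate) / dstRate));
    let peak = signal[start] ?? 0;
    for (let j = start + 1; j < end; j++) {
      if (Math.abs(signal[j]) > Math.abs(peak)) peak = signal[j];
    }
    out[i] = peak;
  }
  return out;
}
```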

The scripting API is deliberately expressive rather than minimal. The goal is to make it easy to explore relationships between signals, events, and visuals without constantly dropping down into imperative update loops.
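To give a flavour of that without reproducing the actual API here, this self-contained sketch shows what tempo-aware generators and signal-bound parameters boil down to; the names, bpm, and parameter ranges are made up for illustration.

```ts
// Plain TypeScript, not Octoseq's API: a taste of tempo-aware signals and
// binding numeric parameters to them.

// A signal is just a function of time in seconds.
type Signal = (t: number) => number;

// Sine LFO locked to the track's tempo: one full cycle every `beats` beats.
const tempoSine = (bpm: number, beats: number): Signal => (t) =>
  Math.sin((2 * Math.PI * t * bpm) / (60 * beats));

// Remap a [-1, 1] signal into [lo, hi]: the kind of shaping helper the API offers.
const scale = (s: Signal, lo: number, hi: number): Signal => (t) =>
  lo + (hi - lo) * (0.5 + 0.5 * s(t));

// "Binding" a parameter to a signal just means sampling the signal each frame.
function renderFrame(t: number): void {
  const bpm = 128;
  const rotation = scale(tempoSine(bpm, 8), -0.2, 0.2)(t); // slow camera sway
  const bloom = scale(tempoSine(bpm, 1), 0.0, 0.6)(t);     // pulses once per beat
  console.log({ t, rotation, bloom });                     // stand-in for the renderer
}

for (let frame = 0; frame < 10; frame++) renderFrame(frame / 60);
```

The point is that once everything is "just a signal", relationships between analysis, events, and visuals can be expressed declaratively and tweaked quickly.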

Background

I've been interested in using computers as tools for artistic expression since seeing demoscene productions in the early 90s, which made me realise that computers could behave in inventive and weird ways.

Over the years I've come to accept that I'm not a particularly talented artist myself; however, I get a huge kick from building tools that enable other people to express themselves artistically.

Audio-synchronised visuals

One niche interest that's stuck with me is audio-synchronised visuals. The early 2000s were a golden era for this: there were vibrant ecosystems around Geiss, AVS, and later Milkdrop. There's been a continuing lineage of talented VJs and niche tools since then, but from the outside it felt to me like the space didn't advance at the same pace as hardware, programming techniques, or graphics capabilities.

(Sidenote: Although the peak of Milkdrop's popularity is behind us, there's still a widely-felt visceral joy that comes from watching audio-sync'ed visuals, which manifests today via the popularity of lip-sync / dance videos. Interestingly, as much as I respect dancers' athleticism, I personally don't get much joy from watching people dance. But there is an interesting parallel to explore in terms of how both dance and abstract music visuals aim to capture, interpret and amplify perceptual traits from music, like anticipation, tension, release… Definitely a topic to think about!)

A screenshot from a Milkdrop visual

I was pulled back to this interest when generative AI image models exploded in late 2022, and looped image-to-image processing made basic but highly controllable video gen possible. I built Parseq and much of Deforum Studio, including all of the audio-related parts. Deforum Studio's backend invokes a Milkdrop-like pipeline to condition AI video gen.

Deforum Studio's audio event discovery and mapping interface

That work scratched an itch, and I left the space alone for a while. But a few questions have been bubbling away at the back of my mind ever since.

Questions that kept nagging

I'm not fully across what VJs and audio-visual artists use today, but I've experimented with a few modern tools in the space. Those brief explorations left me curious about a few things:

What becomes possible if you remove the real-time constraint entirely? Most tools optimise for immediate reaction. But music is structured over minutes, not milliseconds. What changes if you allow full-track analysis, look-ahead, and carefully authored timelines?

Where is Music Information Retrieval actually up to today? MIR has advanced significantly in research settings, but most creative tools still use a handful of relatively simple, real-time features. If we don't need to operate live, can we use richer analysis that better aligns with how humans actually perceive music?

What if analysis doesn't produce a finished visual at all? Instead of jumping straight from raw audio to pixels, what if the output is a structured, interpretable representation that can condition other systems, including generative video models?

Are we skipping an important intermediate layer? It feels like many systems leap directly from sound to image, optimising for immediacy, while neglecting artistic interpretation, "choreography" authorship, and intent.

Coding agents to the rescue

With coding agents, it's now possible to few-shot complex features spanning multiple components in a large codebase. Suddenly, ideas that had been sitting half-formed in my mind felt tractable to experiment with (not necessarily to solve, but at least to probe). Octoseq is a first step in this exploration.

Architecture and design philosophy

The main components of Octoseq are:

TypeScript + WebGPU MIR library

I chose this stack primarily for approachability and inspectability. Running MIR in the browser makes it easier to visualise intermediate results, tweak parameters, and involve a human in the loop. WebGPU provides enough performance headroom for non-realtime analysis without requiring native code or a server round-trip.
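As a rough illustration of why that combination is workable (this is not Octoseq's analysis code), here's a minimal WebGPU compute pass that produces a per-frame RMS envelope for a decoded track; the function name and the 1024-sample frame size are arbitrary choices for the sketch.

```ts
// Minimal sketch: per-frame RMS envelope on the GPU via the standard WebGPU API.
const FRAME_SIZE = 1024;

const shaderSource = /* wgsl */ `
  const FRAME_SIZE: u32 = 1024u;
  @group(0) @binding(0) var<storage, read> samples: array<f32>;
  @group(0) @binding(1) var<storage, read_write> energy: array<f32>;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let frame = gid.x;
    if (frame >= arrayLength(&energy)) { return; }
    var sum = 0.0;
    for (var i = 0u; i < FRAME_SIZE; i = i + 1u) {
      let s = samples[frame * FRAME_SIZE + i];
      sum = sum + s * s;
    }
    energy[frame] = sqrt(sum / f32(FRAME_SIZE));
  }
`;

async function rmsEnvelope(samples: Float32Array): Promise<Float32Array> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU not available");
  const device = await adapter.requestDevice();
  const frameCount = Math.floor(samples.length / FRAME_SIZE);

  const sampleBuf = device.createBuffer({
    size: samples.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(sampleBuf, 0, samples);
  const energyBuf = device.createBuffer({
    size: frameCount * 4,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });
  const readBuf = device.createBuffer({
    size: frameCount * 4,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });

  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module: device.createShaderModule({ code: shaderSource }), entryPoint: "main" },
  });
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: sampleBuf } },
      { binding: 1, resource: { buffer: energyBuf } },
    ],
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(Math.ceil(frameCount / 64));
  pass.end();
  encoder.copyBufferToBuffer(energyBuf, 0, readBuf, 0, frameCount * 4);
  device.queue.submit([encoder.finish()]);

  await readBuf.mapAsync(GPUMapMode.READ);
  const result = new Float32Array(readBuf.getMappedRange().slice(0));
  readBuf.unmap();
  return result;
}
```

Intermediate results like this envelope land back in JavaScript as plain arrays, which is what makes them easy to plot, inspect, and tweak in the browser.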

Rust + WASM scripting and rendering pipeline

The renderer is written in Rust for determinism, performance, and portability. Crucially, it can run both headless (for offline / faster-than-realtime rendering and batch jobs) and in the browser via WASM (for preview). The offline/batch pipeline isn't finished or properly exposed yet, but it's crucial because I envisage a future where we might want to experiment with bulk-generating different visuals from the same set of perceptual traits.

Next.js web app

The web app ties everything together: loading audio, running analysis, inspecting signals, authoring events, and scripting visuals. It's intentionally a "lab" rather than a polished editor, a place to explore and question results.

The web app makes copious use of one of my favourite open source projects: Wavesurfer. I've included a Wavesurfer plugin using the new 8.0-beta API for signal analysis. Let me know if you find it useful.

Some lines in the sand

  • Human judgement first. Octoseq is not about one-click results. Human interpretation is first-class.
  • Determinism. Outputs must be deterministic; all noise/RNGs are seeded (example below).
  • Real-time is optional. It's OK for the web preview to drop frames. The "true" output will eventually be generated offline.
  • Transparency. Intermediate representations should be visible and inspectable.
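As a small example of the determinism point, a tiny seeded PRNG such as mulberry32 is all it takes to make "random" noise repeat exactly across runs and machines (illustrative; not necessarily the generator Octoseq uses):

```ts
// Seeded PRNG (mulberry32): same seed in, same "random" sequence out, every time.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const noise = mulberry32(42);
console.log(noise(), noise()); // identical output on every run
```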

Sound interesting?

At this stage, Octoseq is best thought of as a thinking tool. I don't know yet which ideas here will survive contact with reality, but this is my attempt to find out.

If any of this resonates, give it a go. I'll be sharing more as the project evolves.