encryptio.com

Impulse Response Extraction

One day, I got particularly interested in convolution reverb. Basically, the idea is that you somehow get a recording of what the room sounds like when you play an extremely short pop in it (that is, its impulse response). Then you take that impulse response and convolve it with the sound you want, and you get something extremely similar to what you would hear if you were actually there, playing the sound in the room.
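To make "convolve it with the sound you want" concrete, here is a minimal sketch in plain Python. The `convolve` function and the four-sample toy "room" are illustrative assumptions, not part of my tools; a real impulse response is seconds long and needs an FFT-based method to be practical.

```python
def convolve(signal, impulse_response):
    """Direct (time-domain) convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            # Every input sample triggers a scaled copy of the room's response.
            out[i + j] += s * h
    return out

# Toy "room": the direct sound plus one echo at half volume, 3 samples later.
room = [1.0, 0.0, 0.0, 0.5]
dry = [1.0, 2.0, 3.0]
wet = convolve(dry, room)
print(wet)
```

The output is the dry signal followed by a half-volume copy of itself three samples later, which is all a reverb is: a very large number of such copies.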

Initial Method

Unfortunately, the mathematics around extracting good quality impulse responses are difficult at best, so I started by using a very stupid method - I played a bunch of impulses and wrote programs to average all the recordings I got. This worked, but it was extremely low quality for the time it took. The details are not worth mentioning.

Exponential Sine Sweep

The best known method for getting an impulse response is to play a specific sound, an exponential sine sweep, record the result of playing it, and "deconvolve" the sine sweep recording back into an impulse response. This is done by dividing in the frequency domain or, nearly equivalently, by convolving (forward) with the time-reversed sine sweep that was played. The two differ only by an equalization curve, which is dealt with later.
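As a rough illustration of "dividing in the frequency domain", here is a sketch using a small hand-written FFT so it runs with nothing but the standard library. The names `fft` and `deconvolve`, the `eps` regularization term, and the toy signals are all assumptions made for this example; this is not how my convolute program is written.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def deconvolve(recorded, played, eps=1e-9):
    """Estimate h such that recorded is roughly played convolved with h."""
    n = 1
    while n < len(recorded) + len(played):
        n *= 2
    R = fft(list(recorded) + [0.0] * (n - len(recorded)))
    P = fft(list(played) + [0.0] * (n - len(played)))
    # R / P, written as R * conj(P) / |P|^2; eps avoids dividing by ~0.
    H = [r * p.conjugate() / (abs(p) ** 2 + eps) for r, p in zip(R, P)]
    return [v.real / n for v in fft(H, inverse=True)]

# Toy check: "played" passed through a room of [1, 0, 0, 0.5].
played = [1.0, 0.5, 0.25, 0.125]
recorded = [1.0, 0.5, 0.25, 0.625, 0.25, 0.125, 0.0625]
estimate = deconvolve(recorded, played)
print([round(v, 6) for v in estimate[:4]])
```

Multiplying by `conj(P)` is the frequency-domain form of convolving with the time-reversed played signal. Dropping the division by `|P|^2` leaves exactly that reversed-sweep convolution, along with whatever equalization `|P|^2` imposes.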

More intelligibly for the less signal-processing minded, you simply take each frequency and shift it in time so that it lines up with all the others. Some spectrograms:

I play this sound: (It's supposed to be a single line, but unfortunately, the methods that I use to create this sound are not exact.)

And in the room, I get this sound from the microphone: (This is a very noisy recording; you can probably do better without too much trouble. Note: the horizontal lines are part of the room noise, as is the continuous warmth in the lower portion. You can also see some lines at precisely integer multiples of the main frequency. This is nonlinear distortion caused by the speakers being driven slightly past the volume at which they respond well. This isn't much of an issue, as we see later.)

Then deconvolve it, and get this sound: (Note: The main sweep has been moved into a single spike in the middle, and the nonlinear distortions have all been moved into spikes before the main impulse, so we can easily cut them away. Also note that the general level of noise has decreased, due to the requirement that we lower the volume so that the impulse doesn't clip like mad.)

Actually doing it

I wrote my own software for this - I used ininit for creating the sine sweep, and my convolution program for convolution/deconvolution.

First, I created the sine sweep - here, T = 45 for a 45-second "stretched impulse."

shell$ cat sweep.lua
f1 = 20
f2 = 20000
T = 45

b = f2/f1
a = f1
freq = makefn(100, function () return b^(gettime()/T) * a end)
sweep = osc_sine(0.25, freq)

saver(sweep, "sweep.au")
run(T)
shell$ ininit sweep.lua
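
For readers without ininit, the same kind of sweep can be sketched in plain Python. This version uses the closed-form phase of an exponential sweep instead of a stepped frequency control, so it is an approximation of what sweep.lua produces, not a port of it. The function name `exp_sweep` is made up for this example; the 44100 Hz sample rate matches the 1984500 samples in the 45-second file used later.

```python
import math

def exp_sweep(f1=20.0, f2=20000.0, T=45.0, rate=44100, amp=0.25):
    """Exponential sine sweep from f1 to f2 Hz lasting T seconds."""
    k = math.log(f2 / f1)  # natural log of the frequency ratio
    out = []
    for i in range(int(T * rate)):
        t = i / rate
        # Phase is the integral of 2*pi*f(t), where f(t) = f1 * (f2/f1)**(t/T).
        phase = 2.0 * math.pi * f1 * T / k * (math.exp(k * t / T) - 1.0)
        out.append(amp * math.sin(phase))
    return out

# A one-second sweep keeps the demo fast; use T=45.0 for the real thing.
sweep = exp_sweep(T=1.0)
reversed_sweep = sweep[::-1]  # the time-reversed copy used for deconvolution
print(len(sweep))
```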

Then I played sweep.au with Audacity and recorded the result, and did the same again with a slightly different microphone placement for stereo separation. To create the impulse responses out of this, I first saved the recorded signals to recorded-left.wav and recorded-right.wav, and then reversed the sweep and saved it as revplayed.wav. Then I deconvolved it:

shell$ convolute recorded-left.wav revplayed.wav impulse-uncut-left.wav 0.00003
Reading 2426880 samples from recorded-left.wav...
Reading 1984500 samples from revplayed.wav...
fftlen is 4194304
doing 2 steps of size 2209794
maximum amplitude: 0.847453

Similar for the right channel.

There's a small gotcha with this filter's response - it's all lined up, but the spectrum is wrong. To understand why, think of this - the sine sweep is starting at a low frequency and going higher. It's taking more time per hertz at the start (low frequencies) than at the end (high frequencies.) Therefore, the low frequencies have been given more power than the high ones. The response has been given an unintended equalization. This is easily reverted by creating the proper correction filter and applying it.
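The "more time per hertz" argument can be checked numerically. The sketch below assumes the sweep follows f(t) = f1 * (f2/f1)^(t/T) with the values from sweep.lua; the helper names `time_at` and `seconds_per_hz` are made up for this example.

```python
import math

F1, F2, T = 20.0, 20000.0, 45.0  # values from sweep.lua

def time_at(freq):
    """Time in seconds at which the sweep reaches freq."""
    return T * math.log(freq / F1) / math.log(F2 / F1)

def seconds_per_hz(freq):
    """Derivative of time_at: how long the sweep dwells near freq."""
    return T / (freq * math.log(F2 / F1))

for f in (100.0, 200.0, 400.0, 800.0):
    octave_time = time_at(2.0 * f) - time_at(f)
    print(f, round(seconds_per_hz(f), 5), round(octave_time, 3))
```

Every octave takes the same amount of time, so the dwell time per hertz halves each time the frequency doubles. At constant amplitude, that means the energy delivered per hertz is proportional to 1/f.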

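With my sweep.lua script, the unintended equalization is a 1/f tilt: the response falls off at about 6 dB per octave (20 dB per decade), so the correction filter has to rise by the same amount. I wrote a simple Perl script to create data to be used by mkfilter.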

shell$ cat mk_unsweepfilter_data.pl
#!/usr/bin/perl
use warnings;
use strict;

my ($low, $high) = @ARGV;
die "Usage: $0 lowfreq highfreq\n" unless $low and $high;

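# Emit "frequency gain" pairs from $high down to $low in 1% steps.
# The gain is proportional to frequency (1.0 at $high), which undoes
# the tilt toward low frequencies left by the sweep.
my $at = $high;
while ( $at > $low ) {
    my $gain = $at / $high;
    print "$at $gain\n";
    $at *= 0.99;
}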

So I create the filter and apply it:

shell$ perl mk_unsweepfilter_data.pl 1 21000 > correction-filter-data
shell$ mkfilter -t custom -C correction-filter-data -l 10000 -o correction.wav
shell$ convolute impulse-uncut-left.wav  correction.wav impulse-uncut-corrected-left.wav  1
shell$ convolute impulse-uncut-right.wav correction.wav impulse-uncut-corrected-right.wav 1

The last step is to cut the filter. I opened up Audacity and cut impulse-uncut-corrected-left.wav and impulse-uncut-corrected-right.wav so that the first impulse starts 5 milliseconds into the file and the tail fades away after a few seconds. I saved this, and now I can apply it to any file I want to with convolute input.wav impulse-left.wav 0.01.
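I did the cut by hand, but it could be scripted along these lines. This sketch assumes the main impulse is the largest sample in the file, which should hold when the distortion spikes ahead of it are quieter. The function name `trim_impulse`, the linear fade over the second half, and the 3-second default length are arbitrary choices for illustration.

```python
def trim_impulse(samples, rate=44100, pre_ms=5.0, length_s=3.0):
    """Cut so the main peak sits pre_ms into the result, then fade out."""
    # Assumption: the loudest sample is the main impulse.
    peak = max(range(len(samples)), key=lambda i: abs(samples[i]))
    start = max(0, peak - int(rate * pre_ms / 1000.0))
    end = min(len(samples), start + int(rate * length_s))
    cut = list(samples[start:end])
    n = len(cut)
    fade_start = n // 2  # hypothetical choice: fade over the second half
    span = max(1, n - 1 - fade_start)
    for i in range(fade_start, n):
        cut[i] *= (n - 1 - i) / span  # linear ramp from 1 down to 0
    return cut

# Toy data at a 1000 Hz "sample rate": a small early spike, then the main impulse.
raw = [0.0] * 1000 + [0.1] + [0.0] * 500 + [1.0] + [0.5] * 2000
ir = trim_impulse(raw, rate=1000, pre_ms=5.0, length_s=1.0)
print(len(ir), ir[5], ir[-1])
```

Everything before the start point, including the early spike, is discarded, which is how the nonlinear distortion gets removed.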

Finally, I applied my system equalization response (see below) to the final impulse files to get rid of the microphone and speaker responses, leaving me with only the room impulse response.

Microphone and Speaker Response Removal

The impulse response you get by doing this is not just the response of the room, but the response of the particular setup you had at the time you recorded it. Most notably this includes your speakers, microphone, their relative placement, and any physical coupling between them. If you record it with crappy speakers and a crappy microphone, your impulse will have the equalization curves of both of them combined.

You can combat this by recording your particular pair of speaker(s) and microphone(s) in an anechoic chamber or an approximation (for example, a wide open outdoor area) and inverting the response in the frequency domain around the impulse. I wrote mkunfilter.pl specifically for this purpose. You can then convolve your normal room impulse responses with the resulting file to get rid of most of the speaker and microphone equalization, leaving you with, nominally, just the room's response.
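A bare-bones version of "inverting the response in the frequency domain" might look like the sketch below. It is not mkunfilter.pl. The names `dft` and `inverse_filter`, the regularization constant `eps`, and the half-length delay are assumptions for this example, and the naive O(n^2) transform is only sensible for short filters.

```python
import cmath

def dft(x, sign=-1):
    """Naive O(n^2) discrete Fourier transform (sign=+1 gives the unscaled inverse)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_filter(system_ir, n=64, eps=1e-6):
    """Return an n-tap filter that roughly undoes system_ir, delayed by n // 2."""
    padded = list(system_ir) + [0.0] * (n - len(system_ir))
    S = dft(padded)
    delay = n // 2
    inv = []
    for k, s in enumerate(S):
        # 1/S, regularized so near-silent bands are not boosted without limit.
        g = s.conjugate() / (abs(s) ** 2 + eps)
        # Linear phase shift: centers the result so it has room on both sides.
        inv.append(g * cmath.exp(-2j * cmath.pi * k * delay / n))
    return [v.real / n for v in dft(inv, sign=+1)]

# Toy "speaker plus microphone" response: the direct sound and one early reflection.
system = [1.0, 0.5]
unfilter = inverse_filter(system)
print(round(unfilter[32], 4), round(unfilter[33], 4), round(unfilter[34], 4))
```

Convolving `system` with `unfilter` gives something close to a single spike at sample 32. Real measurements have bands where the speaker or microphone produces almost nothing, and that is where the choice of `eps` matters.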

Note that if you do this, you also need to correct for the 6 dB/octave (20 dB/decade) equalization caused by the exponential sine sweep. It is possible to correct for both at once, but things get hairy real fast if you're not extremely careful.