Covoder - Audio Synthesiser for the ESP32 & PCM5102

An audio Digital Signal Processing (DSP) demonstrator for the Espressif Systems ESP32 Microcontroller and PCM5102 Digital to Analog Converter (2022)

A good pen pal of mine (the one hosting this web site, in fact) posted me Arduino hardware in a series of "care packs" to support development of a project he was working on. I was loosely assisting him as tester and critic. Among these bits and pieces was an Espressif Systems ESP32 microcontroller. I became bored during the winter of 2021 (June). The project name is a play on words from the COVID-19 pandemic. It was born from a Freudian skid in an email to the above-mentioned friend midway through the project. Before this, the project was simply called "Vocoder".

This application runs on a DOIT ESP32 development board, and outputs digital audio data to a PCM5102 packaged on a DIYMORE PCM5102 development board. It was developed using Arduino 1.8.19. This is my very first Arduino embedded application.

Audio from a dynamic microphone is fed into the audio input of this device. A simple LM386 audio amplifier circuit conditions this signal for the 0.2V to 3.3V voltage range (rest voltage is 1.9V) required by the ESP32's 12 bit unsigned analog to digital converter (ADC). Sampling is performed at 22050Hz. The ESP32 then processes this audio according to the selected digital effect (see below). Finally, it sends the data to the PCM5102 Digital to Analog Converter (DAC) IC via the ESP32's high speed I2S bus.

The digital effect (mode) is set by the user via a push button located on the enclosure. A flashing LED on the side of the enclosure indicates the mode when changed.

Click on the embedded audio below for a demonstration:

The default mode is simply an electric microphone. No processing is performed, except for converting the ESP32's ADC data from 12 bit unsigned to 16 bit signed data (the format required by the PCM5102 DAC). Note, Covoder is a block processing design. It reads the ESP32's ADC in 256 sample blocks, processes each block (buffer) of data, and sends it to the PCM5102 via the I2S bus.
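The conversion step can be sketched as below. This is only an illustration, not the firmware itself: it assumes the conditioned signal rests at mid-scale of the 12 bit range (2048), and the names adcToPcm16 and convertBlock are made up for the example. A real build might subtract a measured rest level instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Block length used throughout this write-up (64 blocks x 256 = 16384 samples).
constexpr std::size_t kBlockSize = 256;
// Assumption: the rest (silence) level sits at mid-scale of the 12 bit ADC.
constexpr int32_t kAdcMidScale = 2048;

// Convert one 12 bit unsigned ADC reading (0..4095) to a 16 bit signed sample.
inline int16_t adcToPcm16(uint16_t raw) {
    int32_t centred = static_cast<int32_t>(raw & 0x0FFF) - kAdcMidScale;  // -2048..2047
    return static_cast<int16_t>(centred * 16);                            // -32768..32752
}

// Convert a whole block: in[] holds ADC readings, out[] receives signed samples.
inline void convertBlock(const uint16_t* in, int16_t* out, std::size_t n = kBlockSize) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = adcToPcm16(in[i]);
    }
}
```

Multiplying by 16 (a 4 bit shift) spreads the 12 bit range across the full 16 bit range, so the DAC output level does not depend on the ADC resolution.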

The next mode is an electronic echo effect using a 16384 sample echo feedback buffer (a 64 element circular queue of 256 sample blocks). Effectively, the current sample data is mixed with the oldest entry in the echo buffer. Once mixed, the data is sent to the PCM5102 via the I2S bus. A copy of the same mixed data is then put into the echo buffer as the newest entry. This gives a 0.74 second echo effect (16384 samples at 22050Hz).
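A minimal sketch of such a feedback echo follows. The class name EchoQueue and the equal-weight mix are assumptions for illustration; the write-up does not state the actual mixing ratio.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kEchoBlock = 256;  // samples per block
constexpr std::size_t kEchoSlots = 64;   // blocks held in the circular queue

class EchoQueue {
public:
    // Mix the block with the oldest stored block, then store the mixed result
    // in that same slot as the newest entry (this is the feedback path).
    void process(int16_t* block) {
        assert(block != nullptr);
        std::array<int16_t, kEchoBlock>& slot = slots_[head_];  // oldest entry
        for (std::size_t i = 0; i < kEchoBlock; ++i) {
            // Assumed mix: a plain average, which cannot overflow int16_t.
            int32_t mixed = (static_cast<int32_t>(block[i]) + slot[i]) / 2;
            block[i] = static_cast<int16_t>(mixed);
            slot[i] = block[i];  // overwrite the oldest entry with the newest
        }
        head_ = (head_ + 1) % kEchoSlots;  // the next slot is now the oldest
    }

private:
    std::array<std::array<int16_t, kEchoBlock>, kEchoSlots> slots_{};  // starts silent
    std::size_t head_ = 0;
};

// Delay produced by the whole queue at a given sample rate.
constexpr double echoDelaySeconds(double sampleRateHz) {
    return static_cast<double>(kEchoBlock * kEchoSlots) / sampleRateHz;
}
```

Because the mixed output is what gets stored, each repeat is fed back and heard again 64 blocks later at half the level, which is what makes the echo trail off rather than repeat once.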

The next mode is the Reverb echo effect. This differs from the previous mode in that feedback is finite. This mode uses the same circular queue as the echo effect. However, this effect takes the current samples, plus past samples from three points in the queue, and mixes them to produce an output.
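One way to picture this is the sketch below. It is a guess at the structure rather than the firmware: it assumes the queue stores the unprocessed input (so the repeats stop once the last tap has passed), and the tap positions and the equal-weight mix are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kRvBlock = 256;  // samples per block
constexpr std::size_t kRvSlots = 64;   // blocks held in the circular queue
// Hypothetical tap positions, counted in blocks back from the current block.
constexpr std::array<std::size_t, 3> kRvTaps = {8, 24, 56};

class ReverbQueue {
public:
    void process(int16_t* block) {
        assert(block != nullptr);
        // Store the dry (unprocessed) input as the newest entry.
        std::array<int16_t, kRvBlock>& newest = slots_[head_];
        for (std::size_t i = 0; i < kRvBlock; ++i) {
            newest[i] = block[i];
        }
        // Mix the current samples with the samples at the three past points.
        for (std::size_t i = 0; i < kRvBlock; ++i) {
            int32_t sum = newest[i];
            for (std::size_t tap : kRvTaps) {
                sum += slots_[(head_ + kRvSlots - tap) % kRvSlots][i];
            }
            block[i] = static_cast<int16_t>(sum / 4);  // four sources, equal weight
        }
        head_ = (head_ + 1) % kRvSlots;
    }

private:
    std::array<std::array<int16_t, kRvBlock>, kRvSlots> slots_{};  // starts silent
    std::size_t head_ = 0;
};
```

With this arrangement a single click is heard four times in total (once directly and once per tap) and then never again, whereas the echo mode above keeps feeding its own output back into the queue.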

The last modes are three Channel Vocoder effects: sine wave, triangle wave, and square wave.

The Sine Wave Channel Vocoder converts the incoming buffer (which is presented in the time domain) into a frequency domain buffer using a Cooley-Tukey Fast Fourier Transform (FFT) algorithm. Once in the frequency domain, 30 points in the frequency domain buffer are used to construct 30 sine waves. The software defined oscillators that produce these sine waves are each given a frequency, a phase and an amplitude. The frequency of each sine wave is fixed by its location in the frequency domain buffer (i.e. the 30 points in the frequency domain buffer, and hence the frequencies, are set by me in the firmware). The phase and amplitude are taken from the values in the frequency domain buffer. These values are in rectangular complex form (e.g. 0.45+j1.34), so they are converted to polar form to get the amplitude and phase at these taps. Note, the action of generating these sine waves converts the signal back to the time domain. The 30 sine waves are then mixed (summed) together and normalised (i.e. divided by a constant) to get the cool vocoder result.
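The analysis and resynthesis steps can be sketched as follows. Two things here are swapped in for brevity and are not the firmware's method: each tap is evaluated with a direct single-bin DFT rather than a full Cooley-Tukey FFT, and the normalising constant is taken as 2/N so that amplitudes come out in the input's own units. The names Tap, dftBin, analyse and synthesise are made up for the example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

constexpr double kPi = 3.14159265358979323846;

struct Tap {
    std::size_t bin;   // location in the frequency domain buffer (sets the frequency)
    double amplitude;  // polar magnitude, scaled to input units
    double phase;      // polar angle in radians
};

// Evaluate one frequency domain point directly (a single-bin DFT).
// An FFT yields every bin at once; this slower form keeps the sketch short.
std::complex<double> dftBin(const std::vector<double>& x, std::size_t bin) {
    const double n = static_cast<double>(x.size());
    std::complex<double> acc(0.0, 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        double angle = -2.0 * kPi * static_cast<double>(bin) * static_cast<double>(i) / n;
        acc += x[i] * std::complex<double>(std::cos(angle), std::sin(angle));
    }
    return acc;
}

// Rectangular to polar conversion at each chosen tap.
std::vector<Tap> analyse(const std::vector<double>& x, const std::vector<std::size_t>& bins) {
    std::vector<Tap> taps;
    for (std::size_t bin : bins) {
        std::complex<double> v = dftBin(x, bin);
        taps.push_back({bin, 2.0 * std::abs(v) / static_cast<double>(x.size()), std::arg(v)});
    }
    return taps;
}

// One oscillator per tap, summed. A cosine with a phase offset is used here,
// which is a sine wave shifted by a quarter cycle.
std::vector<double> synthesise(const std::vector<Tap>& taps, std::size_t n) {
    std::vector<double> out(n, 0.0);
    for (const Tap& t : taps) {
        for (std::size_t i = 0; i < n; ++i) {
            double angle = 2.0 * kPi * static_cast<double>(t.bin) * static_cast<double>(i)
                           / static_cast<double>(n);
            out[i] += t.amplitude * std::cos(angle + t.phase);
        }
    }
    return out;
}
```

If the transform length equals the 256 sample block, bin k corresponds to k x 22050 / 256 Hz, or about 86Hz per bin, which is how choosing the 30 bins chooses the 30 oscillator frequencies.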

The triangle and square wave channel vocoder effects are the same as the sine wave vocoder above, except the software defined oscillators produce triangle and square waves instead of sine waves.
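The only part that changes between the three vocoder modes is the waveform each oscillator evaluates. A sketch of interchangeable waveform functions is given below; the function names and the choice to align the triangle and square waves with a sine (zero crossing at phase 0, positive first half cycle) are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>

constexpr double kTwoPi = 6.283185307179586;

// Position within the current cycle, wrapped into [0, 1).
inline double cyclePosition(double phaseRadians) {
    double cycles = phaseRadians / kTwoPi;
    return cycles - std::floor(cycles);
}

// Square wave: +1 for the first half cycle, -1 for the second.
inline double squareWave(double phaseRadians) {
    return cyclePosition(phaseRadians) < 0.5 ? 1.0 : -1.0;
}

// Triangle wave: 0 at the start, +1 at a quarter cycle, -1 at three quarters.
inline double triangleWave(double phaseRadians) {
    double c = cyclePosition(phaseRadians);
    if (c < 0.25) return 4.0 * c;
    if (c < 0.75) return 2.0 - 4.0 * c;
    return 4.0 * c - 4.0;
}

enum class Shape { Sine, Triangle, Square };

// Evaluate a unit-amplitude oscillator of the chosen shape at a given phase.
inline double oscillator(Shape shape, double phaseRadians) {
    switch (shape) {
        case Shape::Sine:     return std::sin(phaseRadians);
        case Shape::Triangle: return triangleWave(phaseRadians);
        case Shape::Square:   return squareWave(phaseRadians);
    }
    assert(false && "unknown oscillator shape");
    return 0.0;
}
```

Each tap's output is then its amplitude multiplied by oscillator(shape, angle + phase), so the frequency, phase and amplitude handling stays identical across the three modes.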

This cool effect was inspired by the 1970s rock band Electric Light Orchestra and Battlestar Galactica (the 1978 series) for their use of speech vocoders. The vocoder was invented in 1938 by Homer Dudley at Bell Labs. The word vocoder is a portmanteau of Voice Encoder. Its original purpose was to compress audio signals for transmission over limited bandwidth RF communication links. Vocoders are an example of transforming a signal from the time domain into the frequency domain, and then back into the time domain. In fact, from a digital transmission perspective, it's only necessary to transmit the phase and magnitude of each tap, as the receiver is able to reconstruct the signal from these. This phase/amplitude data could also be reduced in precision, as coarse (approximate) information would probably be adequate. In my example application, the audio signal could be compressed by 8.5 times (roughly 256 samples per block against 30 taps) if the phase and magnitude of each tap were transmitted in the same packet size as an individual 12 bit sample from the ESP32's ADC. I think this could be achievable, and give the same audio quality as this application. Below is Covoder with the cover off during integration.

A couple of points about Arduino and the ESP32. Embedded systems development has come a long way from writing firmware in machine code and "blowing" a removable EEPROM. The Arduino development environment makes it very quick and easy to develop applications. The concurrency model of Arduino is really good as well. It forces developers (particularly new developers) into thinking about concurrency from the very start. This is an important concept when developing applications on multitasking/context switching operating systems (referring to Arduino's FreeRTOS).

ESP32 is feature rich, however it is hacky and broken. I am not used to this, as microcontrollers like the very old Intel 8051s and Atmel AVR series are designed for commercial applications, and hence are very reliable and very well documented. My original concept was to use the ESP32's Bluetooth hardware to output audio to a stereo system. In fact, the Bluetooth service was the key reason that motivated me to write Covoder. However, this proved ill-fated, as the user space service routine that received audio data and sent it to the Bluetooth service was very CPU intensive (or rather, just a very inefficient way of doing it: basically two block copy operations, maybe three including one in kernel space). Unfortunately, I could not get Bluetooth and the vocoder code to run concurrently without the Bluetooth audio getting badly chopped up. I noticed one of the Espressif employees recommending to a user on their forum not to bother with ESP Bluetooth. Nobody got far with it, and it had a bad rep. So after a full winter's work getting nowhere with this, I finally abandoned the Bluetooth idea and used the PCM5102 DAC.

Secondly, the ESP32's ADC hardware does not have an analogue ground (AGND). This is (excuse the French) a bullsh!t design in my opinion, particularly when Espressif went to the bother of putting an ADC service on their chipset as well as a calibrating feature to improve accuracy of their ADC. All this trouble, and they omit a key feature that reduces noise. WTF Espressif ????

References:

Vocoder font is Data70 from fontsgeek.com.

Channel Vocoder concept from Wikipedia.

Fast Fourier Transform and window filtering concepts from: The Scientist and Engineer's Guide to Digital Signal Processing, 2nd Ed., by Steven W. Smith, 1997-1999.
