GUSLAR: A FRAMEWORK FOR SINGING VOICE PROCESSING

Abstract

GUSLAR is a framework for singing voice processing that is used in a karaoke application with automated voice correction. The application is intended to automatically bring the user's performance closer to that of a professional singer by applying voice effects such as pitch correction, artificial polyphony, time stretching, and others. The GUSLAR framework incorporates a complete processing workflow including analysis, morphing, and synthesis. The framework uses an original model of voiced speech that represents each harmonic as a multicomponent function and provides high-quality processing in conditions of partial glottalization.

Introduction

The crucial part of a voice morphing system is the underlying signal model, which interprets the signal in the parametric domain. The modeling framework GUSLAR [1] is designed specifically for singing voice processing; it can change the pitch and tempo of the signal and add artificial polyphony. Voiced speech is processed in the warped-time domain, where it is possible to use narrow-band filters and extract harmonic and subharmonic components. Owing to warped-time processing, GUSLAR is potentially beneficial for modeling various phonation phenomena such as glottalization, creaky voice, and diplophonic phonation. This may be valuable for singing voice processing, since these effects are typical in singing. The demo presented here processes the user's singing on the spot. The voice is recorded with an interactive karaoke application implemented on a smartphone. Underway...

Modeling of voiced and mixed sounds

For representing voiced and mixed sounds, GUSLAR uses a signal model similar to [2]. A harmonic model implies manipulating each harmonic of the signal separately. For singing voice processing, GUSLAR also tries to extract and process subharmonic components. To make the bands of the analysis filters narrow enough, we use very long analysis frames (up to 16 pitch periods, which corresponds to 35–320 ms for the pitch range 450–50 Hz). Such long windows can be used without frequency smoothing because time warping yields a signal with stable pitch. The idea is illustrated in figure 1.

Figure 1. Comparison: uniform time analysis and time-warping analysis
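The time-warping idea can be sketched in a few lines of Python. This is only an illustration under assumptions, not the GUSLAR implementation: the function names, the per-sample pitch track, and the use of linear interpolation are hypothetical choices made for brevity. The sketch resamples the signal on a uniform grid of accumulated pitch phase, so that every pitch period ends up with the same number of samples, and it also reproduces the frame durations quoted above.

```python
import math

def frame_duration_ms(f0_hz, periods=16):
    """Duration of an analysis frame that spans `periods` pitch periods."""
    return 1000.0 * periods / f0_hz

def accumulate_phase(f0_hz, fs):
    """Pitch phase (in periods) at each input sample, starting from zero."""
    phase, total = [], 0.0
    for f0 in f0_hz:
        phase.append(total)
        total += f0 / fs
    return phase

def warp_to_constant_pitch(signal, f0_hz, fs, samples_per_period):
    """Resample `signal` so that each pitch period has the same length.

    signal and f0_hz are equally long lists (at least two samples);
    f0_hz is an assumed per-sample pitch estimate in Hz.
    """
    phase = accumulate_phase(f0_hz, fs)
    n_out = int(phase[-1] * samples_per_period)
    warped = []
    j = 0  # index of the input interval that contains the target phase
    for m in range(n_out + 1):
        target = m / samples_per_period  # uniform grid on the phase axis
        while j + 1 < len(phase) - 1 and phase[j + 1] <= target:
            j += 1
        # Linear interpolation between neighbouring input samples.
        span = phase[j + 1] - phase[j]
        frac = (target - phase[j]) / span if span > 0 else 0.0
        warped.append(signal[j] + frac * (signal[j + 1] - signal[j]))
    return warped

# Usage: a one-second tone gliding from 100 Hz to 200 Hz at fs = 8000 Hz.
fs = 8000
f0 = [100.0 + 100.0 * i / fs for i in range(fs)]
glide = [math.sin(2 * math.pi * p) for p in accumulate_phase(f0, fs)]
warped = warp_to_constant_pitch(glide, f0, fs, samples_per_period=40)
```

After warping, the gliding tone becomes a sinusoid with a fixed period of 40 samples, which is what allows narrow-band filters to be applied over many pitch periods. A practical system would likely use a higher-order interpolator than the linear one assumed here.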

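The model considers each harmonic as a multicomponent periodic function and represents the voiced speech signal $s(n)$ as a sum over $K$ harmonics:

$$s(n) = \sum_{k=1}^{K} G_k(n)\, e_k(n), \qquad e_k(n) = \sum_{c=1}^{C} A_k^c(n) \cos\left(f_k^c n + \varphi_k^c(0)\right),$$

where

– $G_k(n)$ is a gain factor specified by the spectral envelope;
– $C$ is the number of sinusoidal components for each harmonic;
– $f_k^c$ is the frequency of the c-th component of the k-th harmonic;
– $\varphi_k^c(0)$ is the initial phase of the c-th component of the k-th harmonic;
– $e_k(n)$ is the excitation signal of the k-th harmonic;
– $A_k^c(n)$ are the component amplitudes, normalized so that each harmonic's excitation has unit energy: $\sum_{c=1}^{C} \left[A_k^c(n)\right]^2 = 1$.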
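As a rough illustration of this model, the following Python sketch synthesizes a short frame in which every harmonic is a gain-weighted sum of several sinusoidal components with normalized amplitudes. The function names, the dictionary layout, the use of constant parameters within the frame, and the example side component half a fundamental below each harmonic are all assumptions made for the illustration; they are not taken from the GUSLAR implementation.

```python
import math

def normalize_amplitudes(amps):
    """Scale component amplitudes so that their squares sum to one."""
    energy = sum(a * a for a in amps)
    return [a / math.sqrt(energy) for a in amps]

def excitation(n, amps, freqs, phases):
    """Excitation of one harmonic at sample n.

    freqs are in radians per sample, phases are initial phases in radians.
    """
    return sum(a * math.cos(f * n + p) for a, f, p in zip(amps, freqs, phases))

def synthesize(num_samples, harmonics):
    """Sum the gain-weighted excitations of all harmonics.

    harmonics: list of dicts with keys 'gain', 'amps', 'freqs', 'phases'
    (a hypothetical layout chosen for this sketch).
    """
    out = []
    for n in range(num_samples):
        out.append(sum(
            h["gain"] * excitation(n, h["amps"], h["freqs"], h["phases"])
            for h in harmonics
        ))
    return out

# Usage: three harmonics of a 200 Hz voice at fs = 16000 Hz, each with a
# main component and a weaker side component (assumed subharmonic-like).
fs = 16000
w0 = 2 * math.pi * 200 / fs  # fundamental frequency in radians per sample
harmonics = []
for k in (1, 2, 3):
    harmonics.append({
        "gain": 1.0 / k,
        "amps": normalize_amplitudes([1.0, 0.3]),
        "freqs": [k * w0, (k - 0.5) * w0],
        "phases": [0.0, 0.0],
    })
frame = synthesize(320, harmonics)
```

Because the amplitudes of each harmonic are normalized to unit energy, the overall level of a harmonic is controlled by its gain alone, while the relative amplitudes describe how the energy is split between its components.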

Application

The demo is implemented as an interactive internet service. Using a MATLAB implementation of GUSLAR, a remote server processes incoming sound files according to a given melody score. The general scheme of the voice correction system is presented in figure 2. A dedicated client application on a smartphone records the user's voice and communicates with the server. A typical demo session involves two steps.

Figure 2. Singing voice correction system

In the first step, the user sings while listening to the backing track in earphones and seeing the lyrics on the screen, as shown in figure 3. When the recording session is finished, the data are encoded and transmitted to the server. The second step is voice processing. The pitch contour is extracted from the user's singing, and the target contour is then generated using the melody of the song. Other model parameters are estimated from the signal, and morphing is applied. The synthesized signal is mixed with the backing track, and the result is encoded and returned to the user. The user can listen to the result on the demo smartphone, or alternatively the result can be sent to a specified e-mail address.

Figure 3. User interface for recording session
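The second step relies on a target pitch contour derived from the melody. How GUSLAR builds this contour is not described here, so the Python sketch below is only one plausible reading: it assumes the score is given as MIDI note numbers with durations, produces a step-wise contour at a fixed frame rate, and computes the per-frame pitch modification ratio. All function names and the score format are hypothetical.

```python
def midi_to_hz(note):
    """Equal-tempered frequency of a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def target_contour(score, frame_rate):
    """Step-wise target pitch in Hz, one value per analysis frame.

    score: list of (midi_note, duration_in_seconds) pairs (assumed format).
    frame_rate: number of pitch frames per second.
    """
    contour = []
    for note, duration in score:
        frames = int(round(duration * frame_rate))
        contour.extend([midi_to_hz(note)] * frames)
    return contour

def pitch_shift_ratios(sung_f0, target_f0):
    """Ratio by which each frame's pitch should be scaled.

    Frames with sung_f0 == 0 are treated as unvoiced and left unchanged.
    """
    return [t / s if s > 0 else 1.0 for s, t in zip(sung_f0, target_f0)]

# Usage: half a second of A4 followed by a quarter second of A3.
score = [(69, 0.5), (57, 0.25)]
target = target_contour(score, frame_rate=100)
sung = [430.0] * 50 + [0.0] * 5 + [225.0] * 20  # slightly off-key take
ratios = pitch_shift_ratios(sung, target)
```

A ratio above 1.0 means the sung pitch has to be raised to reach the melody, and a ratio below 1.0 means it has to be lowered. A real system would probably smooth note transitions and preserve some natural vibrato rather than apply a strictly step-wise target.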

References

[1] E. Azarov, M. Vashkevich, and A. Petrovsky, "GUSLAR: a framework for automated singing voice correction," Proc. IEEE ICASSP'14, Florence, Italy, May 2014.

[2] E. Azarov, M. Vashkevich, and A. Petrovsky, "Instantaneous harmonic representation of speech using multicomponent sinusoidal excitation," Proc. 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), Lyon, France, 25-29 August 2013.

Copyrights to these papers may be held by the publishers. The download files are preprints. All persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.