Transformation

pyvoimooo.pvoccombo(in, fs[, ps][, ps_max][, fw][, pe][, wm_gain][, efx_smile_alpha][, efx_smile_alpha_high_shelf][, efx_inteligibility_scaling][, efx_inteligibility_e0db][, efx_inteligibility_e0db_autobias][, mode][, specs][, ampenvs])

The Phase Vocoder with effect Combinations. This is the transformation engine we recommend using by default.

It uses a Pitch adaptive Overlap-Add (POLA) process.

Parameters

in (array<float>) – The input waveform to extract the frames from.
fs (int, Hz) – The sampling rate.
ps (float, optional) – Pitch scaling coefficient (def. 1.0). WARNING: This should not go above ps_max (see below). Expect artefacts otherwise.
ps_max (float, optional) – Maximum pitch scaling value. WARNING: The higher the value, the bigger the internal processing windows. Depending on the sound being processed, windows that are too big can produce an audible reverberation effect. (def. 2.0)
dpss (2D array<float>, optional) –
Time stamped pitch scaling coefficients. This has priority over ps.

It must have two dimensions, with shape [2,N], where the first row [0,:] holds the time instants [s] of the pitch scaling coefficients in the second row [1,:]. WARNING: This cannot be a list of arrays.

The time instants must be in ascending order.

The pitch scaling coefficient used at each frame is then linearly interpolated between the given neighboring values (constant extrapolation is used for any time instant outside of the given time range).
psmv_mean_cent (float, cent, optional) – Scale the mean pitch (def. none applied). The scaling is made on a scale in cents. This takes priority over ps and dpss. Please consider the warnings for ps.
psmv_var_coef (float, coefficient>=0, optional) – Scale the pitch variance (def. none applied). The scaling is made using a median value on a linear scale in cents. This takes priority over ps and dpss. Please consider the warnings for ps.
psmv_forcemean (float, Hz, optional) – Force the mean value for the variance scaling. See psmv_var_coef.
pst (float, Hz, optional) – Set pitch target (def. none). This takes priority over ps and dpss. Please consider warnings of ps.
fw (float, optional) – Frequency warping coefficient (def. 1.0). This warps the amplitude spectral envelope of the spectrum (push the smooth shapes higher (with fw>1.0), or lower (with fw<1.0)).
pe (bool, optional) – When a pitch scaling is applied, preserve the spectral envelope as is (def. true)
wm_gain (float, dB, optional) – Gain of the visual watermarking (def. -12dB)
efx_smile_alpha (float, optional) – Alpha parameter of the Smile effect. The bigger the value, the stronger the smile effect (def. 1.0, in [1.0,+inf) )
efx_smile_alpha_high_shelf (bool, optional) – Activate or deactivate the extra high-shelf of the Smile effect (def. true)
efx_inteligibility_scaling (float, optional) – Strength of the Intelligibility effect: 0.0 means no effect, 1.0 is maximum effect (def. 0.0, in [0.0,1.0] )
efx_inteligibility_e0db_autobias (float, dB, optional) – Bias for the automatic audio level correction of the Intelligibility effect (should not be changed) (def. +10)
efx_inteligibility_e0db (float, dB, optional) – Audio level correction of the Intelligibility effect (should not be changed) (def. not used, overwritten by efx_inteligibility_e0db_autobias)
efx_denoiser_gate_coef_autobias (float, dB, optional) – Bias for the automatic audio level of the Denoiser effect. The higher the value, the more denoised the sound. (def. -128.0)
efx_denoiser_gate_coef (float, dB, optional) – Audio level for the Denoiser effect (should not be changed) (def. not used, overwritten by efx_denoiser_gate_coef_autobias)
mode (string, optional) – onestep or analysis or synthesis or denoisenn (def. onestep)
denoisenn_modelpath (string, optional) – For the option mode='denoisenn' above. Path to the neural network model. It has to be the root filename, which is extended with .json and .norm. E.g. ../mymodelpath points to ../mymodelpath.json and ../mymodelpath.norm
specs (list<array<complex<float>>>) – The complex spectrum, at each frame (as provided by analysis mode), which can be modified.
ampenvs (list<array<float>>) – The amplitude envelope, at each frame (as provided by analysis mode), which can be modified.
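
As a sketch of the dpss layout described above, the following NumPy-only helpers build the [2,N] array and preview the coefficient that would apply at each frame time. The names make_dpss and preview_dpss are hypothetical and not part of pyvoimooo; np.interp is used only because it interpolates linearly and holds the end values outside the given range, which matches the behaviour described for dpss.

```python
import numpy as np

def make_dpss(times, coefs, ps_max=2.0):
    """Stack time instants [s] and pitch scaling coefficients into a [2,N] array.

    Hypothetical helper: it only checks the constraints stated in the
    parameter descriptions (ascending times, coefficients not above ps_max).
    """
    dpss = np.vstack([np.asarray(times, dtype=float),
                      np.asarray(coefs, dtype=float)])
    if np.any(np.diff(dpss[0]) < 0):
        raise ValueError("time instants must be in ascending order")
    if np.max(dpss[1]) > ps_max:
        raise ValueError("coefficients above ps_max are expected to cause artefacts")
    return dpss

def preview_dpss(dpss, frame_times):
    """Linear interpolation with constant extrapolation outside the time range."""
    return np.interp(frame_times, dpss[0], dpss[1])

dpss = make_dpss([0.5, 1.0, 2.0], [1.0, 1.5, 0.8])
frame_times = np.array([0.0, 0.75, 1.5, 3.0])
per_frame = preview_dpss(dpss, frame_times)
```

The resulting dpss is a single 2D array, not a list of arrays, so it can be passed as the dpss argument.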

Returns

syn (array<float>) - The synthesized waveform.
tts (array<float>,seconds) - The time instant of the center of the analysis window at each frame (needs mode='analysis').
f0ss (array<float>,Hz) - The f_0 values at each frame (needs mode='analysis').
f0confs (array<float>) - A confidence factor for f_0 (needs mode='analysis').
specs (list<array<complex<float>>>) - The complex spectrum at each frame, which can be modified (needs mode='analysis').
ampenvs (list<array<float>>) - The amplitude envelope at each frame, which can be modified (needs mode='analysis').

Examples:

syn, tts, f0s, f0confs, specs, envs = vmo.pvoccombo(wav, fs, mode='analysis')

An example of analysis and synthesis steps:

features = vmo.pvoccombo(wav, fs, mode='analysis')

tts = features[1]
specs = features[4]
ampenvs = features[5]

for fi in range(len(ampenvs)):
    dftlen = (len(ampenvs[fi])-1)*2 # assuming each envelope holds dftlen/2+1 bins
    fleft = int(2000*dftlen/float(fs))
    fright = int(6000*dftlen/float(fs))
    ampenvs[fi][fleft:fright] *= 0.125

syn = vmo.pvoccombo(wav, fs, mode='synthesis', specs=specs, ampenvs=ampenvs)
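
The band selection in the loop above relies on mapping a frequency in Hz to a bin index. A minimal sketch of that mapping follows, under the assumption that each ampenvs frame is a half spectrum of dftlen/2+1 bins running from 0 Hz to Nyquist; the helper name freq_to_bin is hypothetical.

```python
def freq_to_bin(freq_hz, nbins, fs):
    """Map a frequency [Hz] to a bin index of a half spectrum.

    Assumption: the frame holds nbins = dftlen/2+1 bins, so the last bin
    (index nbins-1) sits at the Nyquist frequency fs/2.
    """
    dftlen = (nbins - 1) * 2
    return int(freq_hz * dftlen / float(fs))

fs = 32000
nbins = 1025  # e.g. dftlen = 2048
fleft = freq_to_bin(2000, nbins, fs)
fright = freq_to_bin(6000, nbins, fs)
nyquist_bin = freq_to_bin(fs / 2, nbins, fs)
```

With this layout the Nyquist frequency lands on the last bin, which is a quick sanity check for the chosen dftlen.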

Complete example for scaling the f0 variance:

import sys
import os
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
#plt.ion()

os.environ["VOIMOOO_LICENSE_ID"]="THIS-IISS-INVA-LIDL-ICEN-SEID"

# Load Voimooo python wrapper
sys.path.append('.')
import pyvoimooo as vmo

# F0 analysis and scaling parameters

v_th = 0.1 # voicing threshold

expr_scale = 3. # 0 for flat, 1 for original, >1 for amplified expressivity

down_r = 2. # reduction on down pitch: 1 for no reduction, 2 for half-range reduction, ...
up_l = 2.5 # limit for the deviation of the scaled pitch [multiplier of the original f0],
p_off = 1. # constant static pitch shift of the average f0 [multiplier of the original f0]


# Input and output files

# wget http://apps-download.alta-voce.tech/data/db/Diversity/wav/07001104.fr.f.NCSE.F11n4.wav
input_file = '07001104.fr.f.NCSE.F11n4.wav' # path to the input file
out_file = input_file+'.pvoccombo_f0_scaling.wav'

wav, fs = vmo.readwav(input_file)
transf = wav

# Analysis
v_tss, v = vmo.voicing(transf, fs) #voicing analysis

# F0 analysis on voiced segments of the original signal
dum_syn, tts, f0s, dum_f0conf, dum_specs, dum_env = vmo.pvoccombo(transf, fs, mode='analysis')

voiced = np.interp(tts,v_tss,v)
voiced_filt = f0s*(voiced>v_th)

voiced_median = np.median(voiced_filt[voiced_filt>0])
voiced_std = np.std(voiced_filt[voiced_filt>0])

print('Original voiced median: ',voiced_median)
print('Original voiced std: ',voiced_std)

voiced_filt[voiced<=v_th] = voiced_median

fig = plt.figure(figsize=(20,5))

plt.subplot(121)
plt.plot(tts,f0s)

plt.subplot(122)
plt.plot(tts,voiced_filt)
plt.show()


# Processing: F0 scaling

#express_amp
voiced_median = voiced_median * p_off
pitch_scale = np.absolute((voiced_median + expr_scale*(f0s - voiced_median))/f0s)
pitch_scale[voiced<v_th] = 1

# correction for pitch up
scaled_std = np.std(pitch_scale * f0s)
arr_1 = pitch_scale * f0s # scaled f0 values
sc_2 = up_l * voiced_median # upper frequency limit
scaled_f0s = np.minimum(arr_1,sc_2)
pitch_scale = scaled_f0s / f0s

# correction for pitch down
down_val = pitch_scale[pitch_scale < 1.] # take pitch shift value between 0 and 1
down_val = ((down_val - 1.) / down_r) + 1. # shift start/end to 0 then divide and shift back
pitch_scale[pitch_scale < 1.] = down_val

plt.subplot(121)
plt.plot(tts,pitch_scale)

filt_pitch_scale = signal.medfilt(pitch_scale, kernel_size=25)

plt.subplot(122)
plt.plot(tts,filt_pitch_scale)
plt.show()

dpss = np.vstack([tts, filt_pitch_scale])

transf = vmo.pvoccombo(transf, fs, dpss=dpss, ps_max=3.0)

# F0 analysis on voiced segments of the transformed signal

v_tss, v = vmo.voicing(transf, fs) #voicing analysis of the transformed signal
dum_syn, tts, scaled_f0s, dum_f0conf, dum_specs, dum_env = vmo.pvoccombo(transf, fs, mode='analysis')

voiced = np.interp(tts,v_tss,v)
voiced_filt = scaled_f0s*(voiced>v_th)

voiced_median = np.median(voiced_filt[voiced_filt>0])
voiced_std = np.std(voiced_filt[voiced_filt>0])

print('Transformed voiced median: ',voiced_median)
print('Transformed voiced std: ',voiced_std)


fig = plt.figure(figsize=(20,5))

plt.subplot(121)
plt.plot(tts,scaled_f0s)

plt.subplot(122)
plt.plot(tts, f0s)
plt.show()

# Write out file
vmo.writewav(out_file, fs, transf)
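
The core of the example above is the scaling of f0 deviations around the median. The sketch below isolates that formula so it can be checked on synthetic values without the library. The function name variance_scale_factors is hypothetical, and treating f0 == 0 as unvoiced is a simplifying assumption (the example above uses a voicing threshold instead).

```python
import numpy as np

def variance_scale_factors(f0s, center, expr_scale):
    """Per-frame pitch scaling factors that stretch f0 deviations around `center`.

    expr_scale = 0 flattens the curve onto `center`, 1 leaves it unchanged,
    and values above 1 amplify the deviations.
    """
    f0s = np.asarray(f0s, dtype=float)
    factors = np.ones_like(f0s)
    voiced = f0s > 0  # assumption: zero marks unvoiced frames
    target = center + expr_scale * (f0s[voiced] - center)
    factors[voiced] = np.abs(target / f0s[voiced])
    return factors

f0s = np.array([0.0, 180.0, 200.0, 220.0, 0.0])
center = np.median(f0s[f0s > 0])

flat = variance_scale_factors(f0s, center, 0.0) * f0s
same = variance_scale_factors(f0s, center, 1.0)
wide = variance_scale_factors(f0s, center, 3.0) * f0s
```

Unvoiced frames keep a factor of 1, which also avoids dividing by zero where no f0 is available.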

Complete example for making the voice more “afraid”:

import sys
import os
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
#plt.ion()

os.environ["VOIMOOO_LICENSE_ID"]="THIS-IISS-INVA-LIDL-ICEN-SEID"

# Load Voimooo python wrapper
sys.path.append('.')
import pyvoimooo as vmo

# F0 analysis and scaling parameters

v_th = 0.1 # voicing threshold

mod_amp = 0.3

mod_freq = 8.5
mod_rnd = 0.2
rnd_freq = 10


# Input and output files

# wget http://apps-download.alta-voce.tech/data/db/Diversity/wav/07001104.fr.f.NCSE.F11n4.wav
input_file = '07001104.fr.f.NCSE.F11n4.wav' # path to the input file
out_file = input_file+'.pvoccombo_f0_scaling.wav'

wav, fs = vmo.readwav(input_file)
transf = wav
nsamp = len(wav)

# Analysis
v_tss, v = vmo.voicing(transf, fs) #voicing analysis


# F0 analysis on voiced segments of the original signal
dum_syn, tts, f0s, dum_f0conf, dum_specs, dum_env = vmo.pvoccombo(transf, fs, mode='analysis')

voiced = np.interp(tts,v_tss,v)
voiced_filt = f0s*(voiced>v_th)

# Modulator
rand_mod = np.interp(tts, np.linspace(0, nsamp/fs, num=rnd_freq*int(nsamp/fs)), mod_rnd*(np.random.random_sample(rnd_freq*int(nsamp/fs))-0.5))
mod = np.interp(tts,tts,mod_amp*np.sin((mod_freq+rand_mod)*(2*np.pi*tts)))


# Processing
pitch_scale=1+mod
pitch_scale[voiced<v_th] = 1

filt_pitch_scale = signal.medfilt(pitch_scale, kernel_size=25)

ps_max=np.max(filt_pitch_scale)

plt.plot(tts,pitch_scale)
plt.show()

dpss = np.vstack([tts, filt_pitch_scale]) # dpss must be a 2D array, not a list of arrays

transf = vmo.pvoccombo(transf, fs, dpss=dpss, ps_max=ps_max)


# Write out file

vmo.writewav(out_file, fs, transf)
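
The deterministic part of the modulator used above can be sketched on a synthetic time axis as follows. The function name tremor_pitch_scale is hypothetical, the random frequency jitter of the full example is left out, and the 5 ms frame spacing is only an illustrative assumption.

```python
import numpy as np

def tremor_pitch_scale(tts, mod_amp=0.3, mod_freq=8.5):
    """Sinusoidal pitch scaling curve oscillating around 1.0.

    tts are the frame times [s]; mod_amp is the modulation depth and
    mod_freq the modulation frequency [Hz].
    """
    tts = np.asarray(tts, dtype=float)
    return 1.0 + mod_amp * np.sin(2 * np.pi * mod_freq * tts)

tts = np.arange(0.0, 2.0, 0.005)       # assumed frame times, 5 ms apart
pitch_scale = tremor_pitch_scale(tts)
dpss = np.vstack([tts, pitch_scale])   # [2,N] array, as required for dpss
ps_max = float(np.max(pitch_scale))    # keep ps_max at or above the largest factor
```

Deriving ps_max from the curve itself keeps the internal windows no larger than the modulation actually requires.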

New in version 0.17.7.

pyvoimooo.pola_ana(in, fs[, timestep_seconds][, winlen_seconds][, frame_type])

The analysis step of a Pitch adaptive Overlap-Add (POLA) transformation. It extracts frames that can be modified and then merged using pola_syn().

It uses a Pitch adaptive Overlap-Add (POLA) process.

Parameters

in (array<float>) – The input waveform to extract the frames from.
fs (int, Hz) – The sampling rate.
timestep_seconds (float, optional, seconds) – The time step between analysis frames (def. 0.005s)
winlen_seconds (float, optional, seconds) – The analysis window length (def. 3 periods of the fundamental frequency)
frame_type (string, optional) – time or spec (def. time)

The length of the DFT is always the next power of two above the winlen.
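
As an illustration of the statement above, the sketch below derives a DFT length from a window length. Both helper names are hypothetical. The window-length-in-samples formula is borrowed from the pola_syn() example, and reading "above" as "greater than or equal to" is an assumption about the library's behaviour.

```python
def winlen_samples(winlen_seconds, fs):
    """Odd window length in samples (formula taken from the pola_syn() example)."""
    return 1 + 2 * int(0.5 * winlen_seconds * fs)

def next_pow2(n):
    """Smallest power of two that is greater than or equal to n."""
    p = 1
    while p < n:
        p *= 2
    return p

fs = 32000
winlen = winlen_samples(0.050, fs)
dftlen = next_pow2(winlen)   # 2048 for a 50 ms window at 32 kHz
nbins = dftlen // 2 + 1      # bins of a half spectrum, as in frame_type='spec'
```

This is the relation the pola_syn() example inverts when it recovers dftlen from the length of a spectral frame.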

Note

A good practice is to gather all the optional arguments into a dict() and pass it as argument to both pola_ana() and pola_syn() since they have to be common (see the example below).

Returns

tss (array<float>, seconds) - Time instants of analysis (the center time of each frame)
frames (list<array<float>>) - The frames
f0ss (array<float>,Hz) - The f_0 values at each frame.

See pola_syn() below for a complete example.

pyvoimooo.pola_syn(tss, frames, f0s, wavlen, fs, **kwargs)

The synthesis step of a Pitch adaptive Overlap-Add (POLA) transformation. It resynthesizes frames that have been extracted using pola_ana() and modified as desired.

It uses a Pitch adaptive Overlap-Add (POLA) process.

Parameters

tss (array<float>, second) – Times of analysis, as provided by pola_ana() (non-modified).
frames (list<array<float>>) – The frames, as provided by pola_ana() and modified as desired.
f0s (array<float>,Hz) – The f_0 values at each frame, as provided by pola_ana() (non-modified).
wavlen (int, number of samples) – The number of samples in the synthesized waveform.
fs (int, Hz) – The sampling rate.
kwargs – Extra arguments to choose the type of frame extraction, as given to pola_ana() (non-modified).

Returns

syn (array<float>) - The synthesized waveform.

Complete example:

import sys
import numpy as np
import matplotlib.pyplot as plt
plt.ion()

import pysndfile # Get it from pip

# Load Voimooo python wrapper
sys.path.append('.')
import pyvoimooo as vmo

# Read the source file
wav, fs, enc = pysndfile.sndio.read('../test/snd/eng-usa.f.arctic_a0487_32khz.wav')
wavts = np.arange(len(wav))/float(fs)

# Prepare a dict with the options that _have_ to be common between analysis and synthesis stages
opts = dict()
opts['frame_type'] = 'spec' # 'time'
opts['timestep_seconds'] = 0.010
opts['winlen_seconds'] = 0.050

# Run the analysis
tss, frames, f0s = vmo.pola_ana(wav, fs, **opts)

# Modify frames
dftlen = (len(frames[0])-1)*2
framesnew = list()
for fr in frames:
    winlen = 1+2*int(0.5*opts['winlen_seconds']*fs)
    fr = fr.astype('complex128')
    if 1:
        # Robot
        fr = np.abs(fr)
        fr = fr*np.exp(-2j*((winlen-1)/2)*np.pi*np.arange(dftlen/2+1)/float(dftlen))
    else:
        # Cepstral compensation
        rcc = np.fft.irfft(np.log(abs(fr)))
        rcc[int(dftlen/2):] = 0
        rcc[1:] *= 2
        rcc[1:2] = 0
        fr = np.exp(np.real(np.fft.rfft(rcc, dftlen)))*np.exp(1j*np.angle(fr)) # new amplitude, original phase
    framesnew.append(fr.astype('complex64'))

frames = framesnew # Just forget about the original frames

# Synthesize the result
syn = vmo.pola_syn(tss, frames, f0s, len(wav), fs, **opts)

# Write down the synthesis
pysndfile.sndio.write('eng-usa.f.arctic_a0487_32khz.pola.wav', syn, fs)

# Plot the features
plt.subplot(211)
plt.plot(wavts, wav, 'k')
plt.plot(tss, f0s/100.0, 'b')
plt.plot(wavts, syn, 'r')
plt.subplot(212)
vmax = 20*np.log10(np.max(np.abs(frames)))
vmin = vmax-70.0
plt.imshow(20*np.log10(np.abs(frames)).T, origin='lower', aspect='auto', interpolation='none', cmap='jet', extent=[0.0, 1.0, 0.0, fs/2], vmin=vmin, vmax=vmax)

# Keep the figure open until the window is closed
plt.show(block=True)
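
The "Robot" branch of the example above keeps the magnitude of each spectral frame and replaces its phase with a linear phase, which centres the frame's energy at the middle of the window. The sketch below reproduces that step on a synthetic frame using NumPy only. The function name robotize and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def robotize(frame_spec, winlen):
    """Keep the magnitude of a half spectrum and impose a linear phase.

    The linear phase corresponds to a delay of (winlen-1)/2 samples, so the
    time-domain frame becomes symmetric around the window centre.
    """
    nbins = len(frame_spec)
    dftlen = (nbins - 1) * 2
    delay = (winlen - 1) / 2.0
    k = np.arange(nbins)
    return np.abs(frame_spec) * np.exp(-2j * np.pi * delay * k / float(dftlen))

rng = np.random.default_rng(0)
winlen, dftlen = 33, 64                      # toy sizes, winlen odd
x = rng.standard_normal(winlen)              # synthetic frame
spec = np.fft.rfft(x, dftlen)
robot = np.fft.irfft(robotize(spec, winlen), dftlen)
center = (winlen - 1) // 2
```

Because every frame ends up with the same phase structure, the natural pitch variations are lost, which is what produces the robotic timbre.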

New in version 0.8.9.

pyvoimooo.pitch_scaling_snm(in, fs[, sps][, dpss][, spvs][, ep][, ses])

Pitch scale the given waveform using a sinusoidal model.

It uses a Pitch adaptive Overlap-Add (POLA) process.

Parameters

in (array<float>) – The input waveform to transform.
fs (int, Hz) – The sampling rate.
sps (float, optional) –
Static pitch scaling factor (def. 1.0).

E.g. If sps is 2.0, the pitch curve of the whole waveform will be twice as high as the original.
dpss (2D array<float>, optional) –
Time stamped pitch scaling coefficients.

It must have two dimensions, with shape [2,N], where the first row [0,:] holds the time instants [s] of the pitch scaling coefficients in the second row [1,:]. WARNING: This cannot be a list of arrays.

The time instants must be in ascending order.

The pitch scaling coefficient used at each frame is then linearly interpolated between the given neighboring values (constant extrapolation is used for any time instant outside of the given time range).
spvs (float, optional) –
Static pitch variance scaling factor (def. 1.0).

E.g. If spvs is 2.0, the variance of the pitch curve of the whole waveform will be twice as wide as in the original.
ep (boolean, optional) – Preserve the amplitude spectral envelope (def. true).
ses (float, optional) –
Static envelope scaling factor (def. 1.0).

E.g. If ses is 2.0, the envelope of the whole spectrum will be stretched by a factor of 2.

Returns

syn (array<float>) - The transformed waveform.
tss (array<float>) - The time stamps of the f0 values.
f0s (array<float>) - The f_0 values estimated during processing.

Example:

syn, tts, f0s = vmo.pitch_scaling_snm(wav, fs, sps=2.0)

syn, tts, f0s = vmo.pitch_scaling_snm(wav, fs, dpss=[[0.0, 1.5, 2.0], [1.0, 0.8, 1.5]])
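
Pitch scaling factors such as sps are frequency ratios, so musical intervals have to be converted before being passed in. The helpers below are a small sketch of the standard conversions (12 semitones or 1200 cents per octave); their names are hypothetical and they are not part of pyvoimooo.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for an interval in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)

def cents_to_ratio(cents):
    """Frequency ratio for an interval in cents (100 cents per semitone)."""
    return 2.0 ** (cents / 1200.0)

sps_octave_up = semitones_to_ratio(12)    # 2.0, one octave up
sps_fifth_down = semitones_to_ratio(-7)   # below 1.0, lowers the pitch
```

A value such as sps_octave_up could then be used as the sps argument, and a list of converted values as the second row of dpss.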

New in version 0.10.1.

pyvoimooo.freqwarp_pola(in, fs[, gfs][, static_freqs][, dynamic_times][, dynamic_freqs][, f0min][, f0max])

Warp the spectral envelope in frequency domain.

It uses a Pitch adaptive Overlap-Add (POLA) process.

Parameters

in (array<float>) – The input waveform to transform.
fs (int, Hz) – The sampling rate.
gfs (float, optional) –
Global frequency scaling parameter (def. 1.0).

E.g. If gfs is 2.0, the envelope bin at 1kHz will be warped/shifted to 2kHz.

This can be combined with the SFW and DFW options below.
static_freqs (2D array<float>, Hz, optional) –
Static Frequency Warping (SFW) parameters.

It should have two dimensions, with shape [2,N], where the first row [0,:] holds the input frequencies and the second row [1,:] holds the corresponding output frequencies.

The frequencies must be in ascending order.

The warping function follows the same principle as for gfs. The warped frequencies between two given points in static_freqs are linearly interpolated between the given neighbor values.

For a traditional use of frequency warping, two points should be part of static_freqs, one at zero frequency and one at Nyquist (please see the example below).

This option is exclusive with the dynamic_freqs option.
dynamic_times (array<float>, seconds, optional) –
Dynamic Frequency Warping (DFW) parameters time stamps.

Time stamps of dynamic_freqs values (see below).

This parameter has to be used jointly with the dynamic_freqs parameter. This option is exclusive with the static_freqs option.
dynamic_freqs (list< 2D array<float> >, Hz, optional) –
Dynamic Frequency Warping (DFW) parameters.

A list of 2D arrays as in static_freqs. The time stamps of each element of this list are in dynamic_times.

This parameter has to be used jointly with the dynamic_times parameter. This option is exclusive with the static_freqs option.
f0min (float, Hz, optional) – The minimal f_0 value.
f0max (float, Hz, optional) – The maximal f_0 value.

Returns

syn (array<float>) - The transformed waveform.
tss (array<float>) - The time stamps of the f0 values.
f0s (array<float>) - The f_0 values estimated during processing.

Examples:

syn, tts, f0s = vmo.freqwarp_pola(wav, fs, gfs=0.5, static_freqs=[[0,2000,fs/2],[0,4000,fs/2]])

syn, tss, f0s = vmo.freqwarp_pola(wav, fs, dynamic_times=[0.0, 2.0], dynamic_freqs=[[[0,3000,fs/2],[0,2500,fs/2]],[[0,2000,fs/2],[0,4000,fs/2]]])
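
The piecewise-linear warping function defined by static_freqs can be previewed without the library. The sketch below evaluates the input-to-output frequency mapping with np.interp; the function name warp_frequencies is hypothetical, and how the library applies this mapping to the envelope internally is not covered here.

```python
import numpy as np

def warp_frequencies(freqs_hz, static_freqs):
    """Map input frequencies to output frequencies through the [2,N] anchors.

    Row 0 holds the input frequencies, row 1 the corresponding output
    frequencies; values in between are linearly interpolated.
    """
    anchors = np.asarray(static_freqs, dtype=float)
    return np.interp(freqs_hz, anchors[0], anchors[1])

fs = 16000
static_freqs = [[0, 2000, fs / 2], [0, 4000, fs / 2]]
warped = warp_frequencies([1000.0, 2000.0, 5000.0, fs / 2], static_freqs)
```

Here the region below 2 kHz is stretched up to 4 kHz while the remaining band is compressed so that Nyquist maps onto itself, which is why anchor points at zero and at Nyquist are recommended.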

New in version 0.10.1.

pyvoimooo.smile(in, fs[, alpha][, alphas][, f0min][, f0max][, gender][, anchorfreqs][, shelf])

Transform a waveform using the SMILE algorithm.

It uses a Pitch adaptive Overlap-Add (POLA) process.

Parameters

in (array<float>) – The input waveform to transform.
fs (int, Hz) – The sampling rate.
alpha (float, optional) – The alpha value for scaling the effect (in [0.8,1.4], def. 1.0).
alphas (2D array<float>, optional) –
Time stamped alpha values.

It must have two dimensions, with shape [2,N], where the first row [0,:] holds the time instants [s] of the alpha values in the second row [1,:].

The time instants must be in ascending order.

The alpha value used at each frame is then linearly interpolated between the given neighbor values (constant extrapolation is used for any time instant outside of the given range in alphas).
f0min (float, Hz, optional) – The minimal f_0 value.
f0max (float, Hz, optional) – The maximal f_0 value.
gender (str, optional) – ‘male’ or ‘female’, specifies the gender (def. None, which uses an average value).
anchorfreqs (1D array<float>, optional) –
Custom attachment frequencies. A size 4 vector that defines: the two values where the frequencies are preserved (values at index 0 and 3), and the two values defining the interval of the frequencies that are warped (values at index 1 and 2).

Given FN the N-th formant frequency, set them to [F1/2, F2, F3, F5] in order to fall back on the alternative.

Set the first value to -1 in order to disable these custom values and re-use the internal values.
shelf (1D array<float>, optional) –
Custom shelf parameters (frequency [Hz], gain [dB]). A size 2 vector with the custom frequency [Hz] and gain [dB].

Set the custom frequency to -1 in order to disable these custom values and re-use the internal values.

Returns

syn (array<float>) - The transformed waveform.
tss (array<float>) - The time stamps of the f0 values.
f0s (array<float>) - The f_0 values estimated during processing.
envs_ori (list<array<float>>) - The spectral envelopes of the analysed file.
envs_new (list<array<float>>) - The transformed spectral envelopes applied to obtain the resulting file.

Examples:

syn, tts, f0s, envs_ori, envs_new = vmo.smile(wav, fs, alpha=1.2)

syn, tts, f0s, envs_ori, envs_new = vmo.smile(wav, fs, alphas=[[0.0, 1.5, 2.0], [1.0, 1.0, 1.2]])
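
A time-varying alphas array can be generated rather than typed by hand. The sketch below builds a ramp in the [2,N] layout described above. The helper name alpha_ramp is hypothetical, and clipping to [0.8,1.4] assumes that the range documented for alpha also applies to the values in alphas.

```python
import numpy as np

def alpha_ramp(duration_s, alpha_start, alpha_end, n_points=5, lo=0.8, hi=1.4):
    """Build a [2,N] array of time instants [s] and alpha values.

    The values ramp linearly from alpha_start to alpha_end and are clipped
    to [lo, hi] (assumed to be the valid range, as documented for alpha).
    """
    times = np.linspace(0.0, duration_s, n_points)
    values = np.clip(np.linspace(alpha_start, alpha_end, n_points), lo, hi)
    return np.vstack([times, values])

alphas = alpha_ramp(2.0, 1.0, 1.6)
```

The ramp starts neutral at 1.0 and saturates at the upper bound instead of exceeding it.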

New in version 0.10.1.

pyvoimooo.intelligibility(in, fs[, IOEC0dB][, scaling])

Transform a waveform using the Intelligibility algorithm.

It uses a Pitch adaptive Overlap-Add (POLA) process.

Parameters

in (array<float>) – The input waveform to transform.
fs (int, Hz) – The sampling rate.
IOEC0dB (float, dB, optional) – The IOEC0dB compression reference (def. -10.0 dB).
scaling (float, optional) – The scaling value for scaling the effect (in [0.0,1.0], def. 1.0).

Returns

syn (array<float>) - The transformed waveform.

Examples:

syn = vmo.intelligibility(wav, fs, IOEC0dB=-15.0, scaling=1.0)

New in version 0.16.6.

pyvoimooo.denoise(in, fs[, gate_coef_db][, gate_coef_auto_bias_db])

Denoise a waveform using spectral noise gate.

It uses a Pitch adaptive Overlap-Add (POLA) process.

Parameters

in (array<float>) – The input waveform to transform.
fs (int, Hz) – The sampling rate.
gate_coef_db (float, dB, optional) – The noise threshold (def. -46.0 dB).
gate_coef_auto_bias_db (float, dB, optional) – The threshold bias used with semi-automatic gate threshold (def. -12.0 dB).

Returns

syn (array<float>) - The transformed waveform.

Examples:

syn = vmo.denoise(wav, fs, gate_coef_db=-15.0)
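
For readers unfamiliar with the idea of a spectral noise gate, the sketch below shows the generic technique on a single spectral frame: bins whose level falls below a threshold are zeroed. This is not the library's implementation. The function name spectral_gate is hypothetical, and measuring the threshold relative to the frame's peak is an assumption made for this illustration only.

```python
import numpy as np

def spectral_gate(frame_spec, gate_db=-46.0):
    """Zero the bins whose level is below gate_db relative to the frame peak."""
    frame_spec = np.asarray(frame_spec)
    mag = np.abs(frame_spec)
    ref = np.max(mag)
    if ref == 0:
        return frame_spec.copy()
    level_db = 20 * np.log10(np.maximum(mag, 1e-12) / ref)
    return np.where(level_db >= gate_db, frame_spec, 0.0)

# Synthetic frame: a sinusoid exactly on bin 16 plus low-level noise
rng = np.random.default_rng(0)
n = 256
t = np.arange(n)
frame = np.sin(2 * np.pi * 16 * t / n) + 1e-4 * rng.standard_normal(n)
spec = np.fft.rfft(frame)
gated = spectral_gate(spec)
```

A higher threshold removes more low-level content, at the cost of also removing quiet parts of the wanted signal.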

New in version 0.17.8.

pyvoimooo.pitch_scaling_doubledelay(in, fs[, sps][, dpss][, spvs])

Deprecated since version 0.20.1: Please use pvoccombo() or pitch_scaling_snm().