Lab 6: Voice Recognition

The goal of this lab is to use your accumulated knowledge of signal and information processing to design a system for the recognition of a spoken digit.

• Click here to download the assignment.

• Due on Feb 28 by 5pm.

Acquiring and Processing Training Sets

To design a recognition system, we first need to create a training set with $N = 10$ recordings for each of the digits “one” and “two”. This can be done by instantiating the provided $\p{recordsound}$ class $10$ times for each digit.

import time
import numpy as np
from recordsound import recordsound  # provided class (recordsound.py)

T = 1            # duration of each recording in seconds
fs = 8000        # sampling frequency in Hz
num_recs = 10    # number of recordings per digit
digits = [1, 2]  # digits to record
digit_recs = []  # one matrix of recordings per digit

for digit in digits:
    # Each row of partial_recs stores one recording of T*fs samples
    partial_recs = np.zeros((num_recs, int(T*fs)))
    print('When prompted to speak, say ' + str(digit) + '. \n')
    for i in range(num_recs):
        time.sleep(2)  # short pause before each recording
        digit_recorder = recordsound(T, fs)
        spoken_digit = digit_recorder.solve().reshape(int(T*fs))
        partial_recs[i, :] = spoken_digit
    digit_recs.append(partial_recs)

Besides recording our voice, we also need to know the labels, that is, the digit associated with each sample in the training set. Here, the labels can be inferred from the position of the recordings in the list where they are saved: all recordings associated with the digit “one” are stored in an $N \times T f_s$ matrix that is the first element of the list. To avoid re-recording the digits, we can use $\p{numpy.save}$ to store the recordings locally, as in the snippet below.
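For instance, the recordings collected above can be saved under the file name that the next snippet loads. A minimal sketch; note that $\p{np.save}$ stacks the list into a single array:

# Stack the list of recording matrices into one array and save it locally
np.save("recorded_digits.npy", np.array(digit_recs))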

Now, all we need to do to finish Part 1 is to compute the (normalized) DFTs of the samples in our training set using the provided $\p{dft}$ class. Here, we use the same list structure described above.

import numpy as np
from dft import dft  # provided class (dft.py)

digit_recs = np.load("recorded_digits.npy")
digits = [1, 2]
num_recs, N = digit_recs[0].shape
fs = 8000
DFTs = []    # one matrix of DFT coefficients per digit
DFTs_c = []  # same, for the second (centered) output of dft.solve3

for digit_rec in digit_recs:
    DFTs_aux = np.zeros((num_recs, N), dtype=complex)
    DFTs_c_aux = np.zeros((num_recs, N), dtype=complex)
    for i in range(num_recs):
        rec_i = digit_rec[i, :]
        # Normalize the ith signal by its energy to compute a normalized DFT
        energy_rec_i = np.linalg.norm(rec_i)
        rec_i = rec_i / energy_rec_i
        DFT_rec_i = dft(rec_i, fs)
        [_, X, _, X_c] = DFT_rec_i.solve3()
        DFTs_aux[i, :] = X
        DFTs_c_aux[i, :] = X_c
    DFTs.append(DFTs_aux)
    DFTs_c.append(DFTs_c_aux)

np.save("spoken_digits_DFTs.npy", DFTs)
np.save("spoken_digits_DFTs_c.npy", DFTs_c)

Comparison with Average Spectrum

Now that we have training sets with recordings of the digits “one” and “two”, we can compute the average spectra of each of those training sets,

\begin{equation} \bar{Y} = \frac{1}{N} \sum_{i=1}^N |Y_i| \, \text{ and } \, \bar{Z} = \frac{1}{N} \sum_{i=1}^N |Z_i|. \end{equation}
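In NumPy, these averages reduce to one call per digit. A minimal sketch using the DFT matrices saved in Part 1:

# Average magnitude spectrum of each digit (rows are individual recordings)
DFTs = np.load("spoken_digits_DFTs.npy")
Y_bar = np.mean(np.abs(DFTs[0]), axis=0)  # digit "one"
Z_bar = np.mean(np.abs(DFTs[1]), axis=0)  # digit "two"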

Then, we can compare the spectrum of an unknown sample $X$ against the average spectra defined above. To do that, we compute the inner product between the DFT of the unknown sample and each of the average spectra, and assign to the sample the digit whose average spectrum yields the largest inner product. Note that here we define the inner product $p(X,Y)$ between the spectra of any two signals $X$ and $Y$ as the inner product between their absolute values, that is,

\begin{equation} p(X,Y) = |X|^T|Y| = \sum_{k} |X(k)|\cdot |Y(k)|. \end{equation}
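This definition translates directly into NumPy. A minimal sketch, where the helper name $\p{p}$ is ours and not part of the provided classes:

def p(X, Y):
    # Inner product between the magnitudes of two (possibly complex) spectra
    return np.inner(np.abs(X), np.abs(Y))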

We take absolute values because the spectra are complex-valued, and we want to compare magnitudes rather than phases. To implement this recognition system, we can use the following code snippet. Note that here we assume the training set has the same structure described above.

import numpy as np
from dft import dft  # provided class; print_matrix is assumed available from earlier labs

T = 1
fs = 8000
test_set = np.load("test_set.npy")
# Magnitudes of the training-set DFTs
training_set_DFTs = np.abs(np.load("spoken_digits_DFTs.npy"))

num_digits = len(training_set_DFTs)
_, N = training_set_DFTs[0].shape
average_spectra = np.zeros((num_digits, N))  # averages of magnitudes are real

for i in range(num_digits):
    average_spectra[i, :] = np.mean(training_set_DFTs[i], axis=0)

num_recs, N = test_set.shape
predicted_labels = np.zeros(num_recs)

for i in range(num_recs):
    rec_i = test_set[i, :]
    # Normalize by the energy of the recording
    energy_rec_i = np.linalg.norm(rec_i)
    rec_i = rec_i / energy_rec_i
    DFT_rec_i = dft(rec_i, fs)
    [_, X, _, _] = DFT_rec_i.solve3()

    inner_prods = np.zeros(num_digits)
    for j in range(num_digits):
        inner_prods[j] = np.inner(np.abs(X), average_spectra[j, :])
    predicted_labels[i] = np.argmax(inner_prods) + 1

print("Average spectrum comparison --- predicted labels: \n")
print_matrix(predicted_labels[:, None], nr_decimals=0)

To estimate the classification accuracy of this recognition system, we can record additional instances of the digits “one” and “two” and keep track of how many times the recognition system identifies the digit correctly. An estimate of the classification accuracy is then given by the ratio of the number of successes to the total number of attempts.
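For instance, if the true digits of the test recordings are stored in a hypothetical array $\p{true\_labels}$ (not part of the provided code), the estimate is one line:

# true_labels: hypothetical array with the digit actually spoken in each test recording
accuracy = np.mean(predicted_labels == true_labels)
print("Estimated classification accuracy: " + str(accuracy))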

Nearest Neighbor Comparison

Our second recognition system compares the spectrum of a test sample against each of the spectra stored in the training set and finds the element of the training set that resembles the test sample the most. That is, it computes the inner product $p(X, Y_i)$ between the unknown spectrum $X$ and each of the spectra $Y_i$ associated with the digit “one”, as well as the inner product $p(X, Z_i)$ between $X$ and each of the spectra $Z_i$ associated with the digit “two”, and assigns the digit of the spectrum with the largest inner product. To implement this second digit recognition system, we can use the code snippet below. Note that the code assumes the training set has the structure described above, and that the test set is an $N \times Tf_s$ matrix, with each row corresponding to one of the $N$ recordings stored in the test set.

import numpy as np
from dft import dft  # provided class; print_matrix is assumed available from earlier labs

if __name__ == '__main__':
    T = 1
    fs = 8000
    test_set = np.load("test_set.npy")

    training_set_DFTs = np.load("spoken_digits_DFTs.npy")
    num_digits = len(training_set_DFTs)

    num_recs, N = test_set.shape
    predicted_labels = np.zeros(num_recs)
    training_set_size, _ = training_set_DFTs[0].shape

    for i in range(num_recs):
        rec_i = test_set[i, :]
        # Normalize by the energy of the recording
        energy_rec_i = np.linalg.norm(rec_i)
        rec_i = rec_i / energy_rec_i
        DFT_rec_i = dft(rec_i, fs)
        [_, X, _, _] = DFT_rec_i.solve3()

        # inner_prods[j, k]: inner product with the kth training spectrum of digit j+1
        inner_prods = np.zeros((num_digits, training_set_size))
        for j in range(num_digits):
            for k in range(training_set_size):
                sample_dft = training_set_DFTs[j][k, :]
                inner_prods[j, k] = np.inner(np.abs(X), np.abs(sample_dft))
        # The row of the largest entry identifies the predicted digit
        max_position = np.unravel_index(np.argmax(inner_prods), inner_prods.shape)
        predicted_labels[i] = max_position[0] + 1

    print("Nearest neighbor comparison --- predicted labels: \n")
    print_matrix(predicted_labels[:, None], nr_decimals=0)

Assuming the test set has been recorded beforehand, we can load the recordings stored in the test set, as well as the DFTs stored in the training set (assuming we have also saved the DFTs, and not only the recordings). Then, for each recording in the test set, we compute the inner product between the magnitude of its spectrum and the magnitude of each spectrum stored in the training set. To keep track of the inner products, we create a matrix, from which we can find the largest inner product and infer the digit it is associated with.
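As a design note, the double loop over digits and training samples can be replaced by matrix-vector products, since all inner products against the training spectra of one digit amount to multiplying a matrix of magnitude spectra by the magnitude of $X$. A sketch under the same data layout as above:

# Equivalent vectorized computation of the inner-product matrix:
# row k of np.abs(training_set_DFTs[j]) is the kth magnitude spectrum of digit j+1
inner_prods = np.stack(
    [np.abs(training_set_DFTs[j]) @ np.abs(X) for j in range(num_digits)]
)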

Code links

The classes provided for Lab 6 can be found in the following folder: ESE224_Lab6_provided.zip. This folder contains the following two files:
  • $\p{dft.py}$: The class $\p{dft}$ implements the discrete Fourier transform in $3$ different ways.
  • $\p{recordsound.py}$: The class $\p{recordsound}$ records your voice and saves it to a $\p{.wav}$ file.
The code used to implement the recognition system can be downloaded from ESE224_LAB6.