How to get reports from audio files using speech recognition and NLP

  • 17 September 2021

Transform speech into knowledge with HuggingFace/Facebook AI and expert.ai


Photo by Volodymyr Hryshchenko on Unsplash


Over the years I’ve saved tons of podcasts, telling myself I would soon listen to them. This folder has now become an enormous, messy heap of audio files, and I often don’t even remember what each particular file is about. That’s why I wanted to create a program that analyzes audio files and produces a report on their content. I needed something that, with a simple click, would show me topics, main words, main sentences, etc. To achieve this, I used the Facebook AI/Hugging Face Wav2Vec 2.0 model in combination with expert.ai’s NL API. I uploaded the code here, hoping that it would be helpful to others as well.



This solution is broken down into three main steps:

  • Pre-processing stage (extension handling and resampling)
  • Speech to Text conversion
  • Text analysis and report generation

For the first step, I checked many options. Some were very practical (they did not require a subscription and were easy to implement), but the quality wasn’t impressive. Then I found Facebook AI Wav2Vec 2.0, a Speech to Text model available on HuggingFace, which proved reliable and provided good results. Thanks to this, I was able to avoid cloud subscriptions (which required a credit card and other requirements that made sharing my work more complicated than it needed to be). Even without any further fine-tuning, the pre-trained model I used (wav2vec2-base-960h) worked well. Have a look here if you want to go for additional fine-tuning.

With regard to the NLP/NLU (Natural Language Processing and Understanding) part, I used expert.ai’s NL API: it’s easy to implement, has a vast range of options, and is available with just a quick registration. The registration is necessary because your email and password become the system variables used to call expert.ai’s cloud service. Both Wav2Vec and the NL API are free to use – with some volume limitations.



As far as I’ve seen, all the Speech-To-Text modules only accept audio with a sampling rate of 16kHz. In addition, these modules are computationally very heavy (often the main reason to choose a cloud service), and the first time I tried to process a 2-minute audio sample my laptop couldn’t keep up. This led me to add one more step: “Speech-to-Text Chunking”. The idea is that, since I could not feed the model the entire file, I processed multiple (smaller) audio chunks, and then merged their transcriptions back into one text file. With this technique a few words might get cut off at chunk boundaries (this had no measurable effect on quality in my tests), but the advantage is that I can use the Speech-To-Text module on a regular laptop – and with good results! Last but not least, this model only accepts .wav files, so I added a conversion snippet to the script to use if the audio files have a different format/extension. In order to make this conversion work, you have to download ffmpeg.exe (here) and store the file in the same folder as the running script.
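The chunking idea is easy to sketch in isolation: cut the waveform into fixed-size blocks and transcribe each one. The snippet below is a minimal illustration of mine (the actual script streams the blocks from disk instead of slicing an in-memory array), assuming a 16kHz mono waveform:

```python
import numpy as np

def chunk_audio(audio, sr=16000, block_seconds=30):
    """Split a 1-D waveform into fixed-length blocks; the last may be shorter."""
    block = sr * block_seconds
    return [audio[i:i + block] for i in range(0, len(audio), block)]

# A 65-second clip at 16kHz yields three blocks: 30s, 30s and 5s.
blocks = chunk_audio(np.zeros(16000 * 65))
```

Transcribing each block and concatenating the results gives the merged transcript described above.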

With regard to the NL API, there’s a limit on the size of the text one can analyze in one request (10,000 bytes, approximately 10-12 minutes of speech, depending on how fast the speakers talk). The output of expert.ai’s NLU analysis has quite the range: from NER to POS tagging, from classification to sentiment analysis, PII, Writeprint and much more. However, since the text we send to the cloud service is completely lowercase and without any punctuation (a consequence of our Speech to Text step), some analyses may not show the full potential of this NLU technology. For this reason, I only query the service for topics, main lemmas (this list shows relevant nouns, in their base form), and main phrases (aka relevant sentences); these worked well in my tests. I’m confident that, as soon as Speech-to-Text technology improves and introduces capitalization and punctuation, this step will offer even more.
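Because of that 10,000-byte cap, a transcript from a long recording would need to be split before being sent to the service. The 30-second blocks in this script kept my transcripts well under the limit, but a small splitter like the following sketch (my own helper, not part of the original script) could handle longer texts by cutting on the newline boundaries the transcription step inserts:

```python
def split_for_nlapi(transcript, max_bytes=10000):
    """Split a transcript into pieces that each fit within max_bytes (UTF-8),
    cutting only at newline boundaries so no sentence is broken mid-way."""
    pieces, current = [], ""
    for line in transcript.splitlines():
        candidate = current + line + "\n"
        if len(candidate.encode("utf-8")) > max_bytes and current:
            pieces.append(current)
            current = line + "\n"
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces
```

Each piece can then be analyzed in its own request and the results merged.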



The figure below shows the workflow of the project.


The process starts in our original folder where all audio files are stored, carrying their original extension. The program sends those files to the “converted” folder, converting the non-.wav files (if any). Then it starts iterating through all the converted files. These are resampled at 16kHz and saved in the “resampled” folder.

After that, they’re sent to the Speech-to-Text module. Here the function cuts the audio into 30-second blocks – this parameter is customizable – and these blocks are sent one by one to the generate_transcript function, which returns the transcription (further details in the following sections).

Block by block, all the audio is transcribed and concatenated. Every two blocks I decided to insert a carriage return (expert.ai interprets this as the end of a sentence) in order to avoid ending up with a huge single line as a final transcript – which would have made the subsequent text analysis borderline absurd. At this point in the workflow, we have a meaningful textual document (though all lowercase, with bare minimum/simulated punctuation), so it is NLU time. The transcription is analyzed by expert.ai’s NL API services, whose output is then worked into a report (stored as a .txt file in the “audio_report” folder). In the end, we have a text file that shows the main topics the audio file presented, as well as relevant nouns and statements. It was fun to discover how many of my podcasts I don’t care about anymore, while others still pique my interest and can be prioritized.

So, simply put, first all files are converted (if necessary), and then they go, one at a time, through the cycle that takes care of resampling, transcription, NLU analysis, and report generation.

Let’s take a closer look at the code.



The first thing the script does is import all the necessary libraries and the model, and set the variables.

import librosa
import torch
import time
import datetime
from pathlib import Path
import subprocess
import os
import shutil
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from datetime import date
from expertai.nlapi.cloud.client import ExpertAiClient

# Import the Wav2Vec model and processor
model_name = "facebook/wav2vec2-base-960h"
print("Loading model: ", model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

path_base = "Audio files/" #Original speech/audio files folder
sr = 16000 #Sampling rate
block_length = 30 #Speech chunk size
language = "en"
expertai_account = "your_expert.ai_email" #Your email account
expertai_psw = "your_expert.ai_psw" #Your psw
os.environ["EAI_USERNAME"] = expertai_account
os.environ["EAI_PASSWORD"] = expertai_psw

#Folders and Path Creation
audio_report = "Reports" #This is the folder where your report will be stored
path_converted_audio = "converted_files/" #This is the temporary folder for converted audio files
resampled_folder = "resampled_files/" #This is the folder for the resampled audio files
Path(audio_report).mkdir(parents = True, exist_ok = True) #This creates the reports folder
Path(path_converted_audio).mkdir(parents = True, exist_ok = True) #This creates the folder for converted audio files
Path(resampled_folder).mkdir(parents = True, exist_ok = True) #This creates the folder for resampled audio files

#Conversion List
extension_to_convert = ['.mp3','.mp4','.m4a','.flac','.opus'] #List of the supported files types/extensions

There are plenty of Wav2Vec models on HuggingFace. I chose “base-960h” because it’s a good compromise between quality and weight. Write the path to your audio files in “path_base”. Leave the variable sr at 16000 (this is the sampling rate); you can also choose a different block length, depending on your CPU and RAM capabilities: I set it at 30 (the unit is seconds). Insert your expert.ai developer portal email and password in their respective variables. Then write the folder names you prefer for conversion, resampling and the final report; the program will create those paths for you (mkdir). You can extend the extension_to_convert list by adding more extensions, if necessary.



I begin by preprocessing the audio files. The aim is to get a folder containing only .wav files.

#Pre-processing function
def preprocessing(path_base, path_converted_audio):
    for file in os.listdir(path_base):
        filename, file_extension = os.path.splitext(file)
        print("\nFile name: " + file)
        if file_extension == ".wav":
            shutil.copy(path_base + file, path_converted_audio + file)
        elif file_extension in extension_to_convert:
            subprocess.run(['ffmpeg', '-i', path_base + file,
                            path_base + filename + ".wav"])
            shutil.move(path_base + filename + ".wav", path_converted_audio + filename + ".wav")
            print(file + " is converted into " + filename + ".wav")
        else:
            print("ERROR: Unsupported file type - " + file + " was not converted. Modify the pre-processing stage to convert *" + file_extension + " files.")


The pre-processing function iterates through the original folder where your audio files are stored. If a file has a “.wav” extension, it sends the file to the “path_converted_audio” folder; otherwise it converts the file to “.wav” first. Two things: 1) in order to make this conversion work, you must have ffmpeg.exe in the same folder as your running script; 2) if your file has an extension that is not in the “extension_to_convert” list, it will not be converted and the program moves on to the next iteration (it will give you a warning that the file has not been converted).

As the FOR cycle in the preprocessing function comes to an end, I have “path_converted_audio” filled with .wav files only. I am now ready to start the process that generates the text report. It is composed of three functions: resample, asr_transcript (and its nested generate_transcription function) and text_analysis.



#Resampling function
def resample(file, sr):
    print("\nResampling of " + file + " in progress")
    path = path_converted_audio + file
    audio, sr = librosa.load(path, sr=sr) #File load and resampling
    length = librosa.get_duration(y=audio, sr=sr) #File length
    print("File " + file + " is", datetime.timedelta(seconds=round(length, 0)), "sec. long")
    resampled_path = os.path.join(resampled_folder, file)
    sf.write(resampled_path, audio, sr) #Store the resampled file
    print(file + " was resampled to " + str(sr) + "Hz")
    return resampled_path, length

The resample function, as the name says, resamples the audio. It takes the file and the sampling rate as arguments. For my purposes I resample at 16kHz, but if you want to use it with other models that accept or need a different sampling rate, just change the “sr” variable in the variables section (or pass it directly to the function), and you’ll get your desired sampling rate conversion. Here librosa.load loads the file and resamples it, while librosa.get_duration returns the length information. Lastly, the function stores the resampled file in the resampled_folder and returns resampled_path and length.



Now I can pass the resampled audio to the asr_transcript function.

#Transcription function
def asr_transcript(processor, model, resampled_path, length, block_length):
    chunks = length // block_length
    if length % block_length != 0:
        chunks += 1
    transcript = ""
    # Split the speech into multiple 30-second chunks rather than loading the full audio file
    stream = librosa.stream(resampled_path, block_length=block_length, frame_length=16000, hop_length=16000)

    print('Every chunk is', block_length, 'sec. long')
    print("Number of chunks", int(chunks))
    for n, speech in enumerate(stream):
        print("Transcribing the chunk number " + str(n + 1))
        separator = ' '
        if n % 2 == 0:
            separator = '\n'
        transcript += generate_transcription(speech, processor, model) + separator
    print("Encoding complete. Total number of chunks: " + str(n + 1) + "\n")
    return transcript

#Speech to text function
def generate_transcription(speech, processor, model):
    if len(speech.shape) > 1:
        speech = speech[:, 0] + speech[:, 1]
    input_values = processor(speech, sampling_rate=sr, return_tensors="pt").input_values
    logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.decode(predicted_ids[0])
    return transcription.lower()


The asr_transcript function takes five arguments: processor and model have been imported at the beginning of the script, block_length has been set in the variables section (I assigned a value of 30, meaning 30 seconds), and resampled_path and length are returned by the previous function (resample). At the beginning of the function, I calculate how many chunks the audio consists of and instantiate “transcript” as an empty string. Then I apply librosa.stream to the file, which returns a generator of fixed-length buffers over which I iterate to produce blocks of audio.

I send each block to the generate_transcription function, the speech-to-text module proper, which takes the speech (the single block of audio I am iterating over), processor and model as arguments and returns the transcription. In these lines, the program converts the input into a PyTorch tensor, retrieves the logits (the prediction vector the model generates), takes the argmax (a function that returns the index of the maximum value) and then decodes it. The raw transcription is all capital letters. In the absence of real casing, an NLP service like expert.ai handles this ambiguity better if everything is lowercase, and therefore I apply that case conversion.
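The argmax step can be seen on a toy example: with a fake logits tensor of shape (batch, time, vocabulary), argmax over the last dimension yields one token id per time step (the numbers below are made up purely for illustration):

```python
import torch

# Fake logits: batch of 1, two time steps, a three-"token" vocabulary.
logits = torch.tensor([[[0.1, 2.0, 0.3],
                        [1.5, 0.2, 0.1]]])
# One predicted token id per time step, as in generate_transcription.
predicted_ids = torch.argmax(logits, dim=-1)
print(predicted_ids.tolist())  # → [[1, 0]]
```

The processor’s decode step then maps those ids back to characters.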

So, when I call the asr_transcript function it takes the audio, iterates over it providing each time a block of the audio to the generate_transcription function, which in turn transcribes it and then appends this transcription to the previous one (creating a new line every two blocks).

At this point, I’ve got the transcription of our original audio file. It’s time to analyze it.


Information Discovery time. Now that I have a transcript, I can query the NL API service and generate the final report.

#NLU Analysis
def text_analysis(transcript, language, audio_report, file, length):
    #Keyphrase extraction
    print("NLU analysis of " + file + " started.")
    client = ExpertAiClient()
    output = client.specific_resource_analysis(body={"document": {"text": transcript}},
                                               params={'language': language, 'resource': 'relevants'})

    today = date.today()
    report = f"REPORT\nFile name: {file}\nDate: {today}" \
             f"\nLength: {datetime.timedelta(seconds=round(length, 0))}" \
             f"\nFile stored at: {os.path.join(audio_report, file)}.txt"

    report += "\n\nMAIN LEMMAS:\n"
    for lemma in output.main_lemmas:
        report += lemma.value + "\n"
    report += "\nMAIN PHRASES:\n"
    for phrase in output.main_phrases:
        report += phrase.value + "\n"
    report += '\nMAIN TOPICS:\n'
    for topic in output.topics:
        if topic.winner:
            report += '#' + topic.label + '\n'

    #Write the report to a text file
    filepath = os.path.join(audio_report, file)
    with open(filepath + ".txt", "w") as text:
        text.write(report)
    print("\nReport stored at " + filepath + ".txt")
    return report


text_analysis takes five arguments: transcript (returned by the asr_transcript function), language and audio_report (already set in the variables section), file (the single file from the group I am iterating over) and length (returned by the resample function). I instantiate ExpertAiClient(), calling it simply “client”, and then send my request. It’s very simple, and it takes just one line of code: I specify the method (in my case “specific_resource_analysis”), and then pass “transcript” as the text, “language” as the language and “relevants” as the resource. This call is specific to my case, but with a slight modification you can query other types of analysis such as emotional traits, classification, NER, POS tagging, Writeprint, PII and much more. Once I get the response back, I iterate through it extracting main lemmas, main phrases and main topics, adding them to the report, which is written to a .txt file stored in the audio_report folder.
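For example, to the best of my knowledge switching from relevants to named-entity extraction only requires changing the params dictionary passed in the request (the resource values listed below come from the expert.ai NL API documentation as I recall it – verify against the docs before relying on them):

```python
# Same specific_resource_analysis call, different analysis: swap the resource.
params = {'language': 'en', 'resource': 'entities'}  # named entities (NER)
# Other documented resource values: 'disambiguation', 'relations',
# 'sentiment' and 'relevants' (the one this script uses).
```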

We’ve done all the steps necessary to get a report from an audio file. Finally, let’s look at the main function that executes all these other functions in the proper order.



speech_to_data is the function that drives the execution of the entire workflow. In other words, this is the one function we call to get a report out of an audio file.

def speech_to_data():
    preprocessing(path_base, path_converted_audio)
    for file in os.listdir(path_converted_audio):
        resampled_path, length = resample(file, sr)
        print("\nTranscribing ", file)
        transcript = asr_transcript(processor, model, resampled_path, length, block_length)
        report = text_analysis(transcript, language, audio_report, file, length)
    shutil.rmtree(path_converted_audio)

This function triggers the pre-processing function, which creates a folder with all converted files ready to be analyzed, and then iterates through every file: it resamples the file, transcribes it, analyzes the text and generates the report. The last line of code removes the now useless path_converted_audio folder.



I enjoyed writing this code. Thanks to open source, Facebook AI, HuggingFace and expert.ai, I’ve been able to get reports from audio files using just my home computer. The list of potential applications I see is endless, especially with the possibility of tailoring classification and data mining tasks to run on top of speech recognition, using tools like expert.ai Studio. With this in mind, the module can be customized to bring value to any speech recognition technology out there, and you can also shape the data collection tasks to your needs, and therefore process any type of speech to automate the collection of the information that matters to you.


GitHub code repo

expert.ai developer portal documentation

Github Wav2Vec

Hugging face model
