How to Turn ChatGPT into a Voice-Enabled Chatbot in 60 Lines of Code

I recently found an article about someone who built a Zapier integration between Alexa and ChatGPT. Although it was a cool project, I realized it wasn’t necessary to use a third-party platform or Home Speaker to interact with ChatGPT using voice. All the required functionality is natively available in Chrome.

Back in 2017, I worked on a product that used a voice-controlled interface, and I became very familiar with web speech recognition and speech synthesis APIs. Using these, alongside some tricks to work with ChatGPT’s interface, I was able to start talking to the AI and give it a voice!

In this article, I’ll walk you through how to turn ChatGPT into a voice-enabled chatbot using 60 lines of code. The script is short enough that anyone can copy and paste it.

Steps to Complete

To build this chatbot, we’ll complete the following three steps:

Step 1: Open the ChatGPT website in your Chrome browser.

Step 2: Open the developer console by pressing Ctrl+Shift+I or right-clicking anywhere on the page and selecting “Inspect”. Open the “Console” tab.

Step 3: Paste in the JavaScript code and start chatting! You can have a natural conversation without any wake words.

The Script

Create a new SpeechRecognition instance with the following configurations. Use any language you like, and learn more about this API here.

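// Chrome exposes the constructor under a webkit prefix.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = "en-US"; // language to recognize
recognition.continuous = true; // keep listening after the first result
recognition.maxAlternatives = 1; // only the best transcription per result
recognition.interimResults = false; // report final results only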
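
Speech recognition is not available in every browser, so it can help to check for the constructor before using it. The following is a minimal sketch and not part of the original script. The helper name getRecognitionConstructor is hypothetical, and it takes the global object as a parameter so that it can also be tried outside a browser.

```javascript
// Hypothetical helper: returns the speech recognition constructor exposed by
// the given global object, or null when none is available.
function getRecognitionConstructor(scope) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// In the browser console, globalThis is the same object as window.
const RecognitionCtor = getRecognitionConstructor(globalThis);
if (RecognitionCtor === null) {
  console.warn("Speech recognition is not supported in this environment.");
}
```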

The ChatGPT interface doesn’t use any descriptive class names or element IDs. To get the form input and the submit button elements, we can use the following structural selectors.

const formTextarea = document.querySelector("main form textarea");
const formSubmit = document.querySelector("main form button");
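
Because these selectors depend on the page structure, they are likely to stop matching if the ChatGPT markup changes. As a hedged sketch (the helper findRequired is hypothetical and not part of the original script), a small wrapper can fail with a clear message instead of leaving a null reference behind:

```javascript
// Hypothetical helper: looks up a selector on the given root and throws a
// descriptive error when nothing matches.
function findRequired(root, selector) {
  const element = root.querySelector(selector);
  if (!element) {
    throw new Error(`No element matches "${selector}"`);
  }
  return element;
}

// Possible usage in the browser console:
//   const formTextarea = findRequired(document, "main form textarea");
```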

Several global variables will allow us to manage the conversation. The isSpeaking variable is a boolean that helps prevent the recognizer from transcribing the synthesized audio. We’ll also store two interval IDs that track the state of the conversation.

let isSpeaking = false;
let intervalRcg, intervalUtr;

When you speak into the microphone, the recognition.onresult handler is triggered. We stop the recognition, then populate the chat input with the transcribed speech and submit it.

recognition.onresult = (event) => {
  recognition.stop();
  fillAndsubmitForm(event);
  setTimeout(pollResultStatus, 1000);
  setTimeout(startVoiceSynth, 1000);
};

The fillAndsubmitForm() function will fill the ChatGPT input form with your spoken words (transcribed input) and submit it to ChatGPT.

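function fillAndsubmitForm(event) {
  // Recognition is restarted for every turn, so the first result is the latest.
  const result = event.results[0][0].transcript;
  // Saying "stop" skips the submission for this turn.
  if (result === "stop") return;
  formTextarea.value = result;
  formSubmit.click();
}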
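
The exact-match check for "stop" can miss transcripts that arrive with different casing, surrounding whitespace, or trailing punctuation. The sketch below shows one way to normalize the transcript first. Both helper names (latestTranscript and isStopCommand) are hypothetical and not part of the original script, and the normalization rules are an assumption about what a recognizer might return.

```javascript
// Hypothetical helper: reads the best alternative of the most recent result.
// event.results is list-like, and each result holds one or more alternatives.
function latestTranscript(event) {
  const lastResult = event.results[event.results.length - 1];
  return lastResult[0].transcript;
}

// Hypothetical helper: treats " Stop" or "stop." as the stop command too.
function isStopCommand(transcript) {
  const normalized = transcript.trim().toLowerCase().replace(/[.!?]+$/, "");
  return normalized === "stop";
}
```

With these helpers, the check inside fillAndsubmitForm() could read isStopCommand(latestTranscript(event)).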

The onresult callback sets timeouts to trigger two functions after 1 second: pollResultStatus() and startVoiceSynth().

ChatGPT disables the submit button while the AI is responding. We can check it frequently to determine whether to restart the recognition stream.

function pollResultStatus() {
  intervalRcg = setInterval(() => {
    if (formSubmit.disabled || isSpeaking) return;
    recognition.start();
    clearInterval(intervalRcg);
  }, 500);
}
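
The condition checked on each tick can be pulled out into a small pure function, which makes it easier to reason about and extend. The sketch below is an illustration, not the original implementation: the name canResumeListening is hypothetical, and the extra synthSpeaking flag is an assumption meant to carry the value of speechSynthesis.speaking, so that listening does not resume while queued speech is still playing.

```javascript
// Hypothetical helper: decides whether it is safe to start listening again.
// submitDisabled: ChatGPT is still generating a response.
// isSpeaking:     the script is still feeding text to the synthesizer.
// synthSpeaking:  assumed to hold the value of speechSynthesis.speaking.
function canResumeListening({ submitDisabled, isSpeaking, synthSpeaking }) {
  return !submitDisabled && !isSpeaking && !synthSpeaking;
}

// Possible usage inside the polling callback:
//   if (!canResumeListening({
//     submitDisabled: formSubmit.disabled,
//     isSpeaking,
//     synthSpeaking: speechSynthesis.speaking,
//   })) return;
```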

The startVoiceSynth() function is responsible for converting the AI’s response into spoken words using the speechSynthesis API.

When called, the function sets the isSpeaking variable to true, indicating that the bot is speaking. An inner function named sayResult() synthesizes the bot’s most recent response and speaks it out loud.

function startVoiceSynth() {
  isSpeaking = true;
  function sayResult() {
    if (this.innerText === this.spokenText) {
      clearInterval(intervalUtr);
      isSpeaking = false;
      return;
    }
    speechSynthesis.speak(
      new SpeechSynthesisUtterance(
        this.innerText.slice((this.spokenText || "").length)
      )
    );
    this.spokenText = this.innerText;
  }
  intervalUtr = setInterval(
    sayResult.bind(document.querySelector(".result-streaming")),
    500
  );
}

It would be dull if the bot’s full response had to be received before any of it was spoken. Instead, sayResult() compares the text already spoken with the text received so far and synthesizes only the new part. Once the two match, the response has stopped growing, so the interval is cleared and isSpeaking is reset.
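
The incremental step can be seen in isolation with a small example. This is a sketch that mirrors the slice() call in sayResult(); the function name unspokenPart is hypothetical and used only for illustration.

```javascript
// Hypothetical helper: returns the part of fullText that has not been spoken
// yet, assuming spokenText is a prefix of fullText (the response only grows).
function unspokenPart(fullText, spokenText) {
  return fullText.slice((spokenText || "").length);
}

const firstChunk = unspokenPart("Hello", undefined); // "Hello"
const secondChunk = unspokenPart("Hello there", "Hello"); // " there"
const nothingNew = unspokenPart("Hello there", "Hello there"); // ""
```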

And last but not least…

recognition.start();

Conclusion

Once the above code is copied and pasted into the console, the script will start listening to your voice and transcribing what you say. When you’re done speaking, it will send your transcription to ChatGPT, wait for a response, and read that response aloud as it arrives.

In conclusion, turning ChatGPT into a voice-enabled chatbot is relatively simple. It can be done using the speech recognition and synthesis APIs available in Chrome. The code presented in this article is a basic script that can be expanded to accommodate more sophisticated use cases.

I put the full script up on my blog so that you can easily copy and paste it. Find it at walkthrough.ai/voice-enable-chat-gpt.

Have fun!
