Speech Recognition

Hello everyone, I am working on a project that I would like to use speech recognition in.
I have tried Microsoft Speech Recognition and it does not do well at all, even after training.
I am looking to you for suggestions.
The environment I will be using is Windows.
Hope to hear from you.

This is a companion discussion topic for the original entry at https://community.robotshop.com/robots/show/speech-recognition

android phone

I use an Android phone that runs ROS voice recognition, which uses Google voice recognition. This one works in a noisy environment.

It is also very good with children’s and women’s voices. While this runs on Linux, I’m sure you can find one that works with Windows. Another plus: the microphones on phones are much better than most desktop microphones.

Although you mention wanting

Although you mention wanting to keep it in Windows, I figure it’s worth mentioning that the Google AIY Voice kit is very cool and fairly inexpensive. It uses a Raspberry Pi 3B, a Pi hat that I think optimizes the audio input for speech recognition, a dual-mic array, and a speaker. The Pi OS that they have for it has Google’s speech recognition built in and it also has Python code that allows for easy capture of custom commands, and from each command you can do anything that you can do via Python, which is just about anything. It can be triggered via “OK Google”, or via a button/sensor input of your own concoction. I’m presently trying to figure out how/why I might incorporate it into a robot. Also, your custom commands work offline I think, which means you don’t necessarily need to have the Internet accessible by your Pi. If all else fails in your Windows attempts, you could get a Pi Python program talking serial over USB to your Windows computer, relaying instructions that it has recognized.
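For anyone who wants to try the relay idea, here is a minimal sketch of what the Windows side might look like. It assumes the Pi program writes one recognized command per line with a made-up `CMD:` prefix, and that the link shows up on Windows as `COM3` at 115200 baud; the prefix, port name, baud rate and the `ParseCommand` helper are all placeholders, not anything the AIY kit defines. On newer .NET versions `SerialPort` comes from the System.IO.Ports package.

```vbnet
Imports System
Imports System.IO.Ports

Module PiCommandRelay

    ' Hypothetical wire format: one recognized command per line, e.g. "CMD:turn left".
    ' Returns the command in lower case, or Nothing if the line is not a command.
    Function ParseCommand(line As String) As String
        If line Is Nothing Then Return Nothing
        Dim trimmed As String = line.Trim()
        If Not trimmed.StartsWith("CMD:", StringComparison.OrdinalIgnoreCase) Then Return Nothing
        Dim command As String = trimmed.Substring(4).Trim().ToLowerInvariant()
        Return If(command.Length = 0, Nothing, command)
    End Function

    Sub Main()
        ' "COM3" and 115200 are placeholders; use whatever port the Pi appears as.
        Using port As New SerialPort("COM3", 115200)
            port.NewLine = vbLf
            port.Open()
            Do
                ' ReadLine blocks until the Pi sends a complete line.
                Dim command As String = ParseCommand(port.ReadLine())
                If command IsNot Nothing Then
                    Console.WriteLine("Recognized on the Pi: " & command)
                End If
            Loop
        End Using
    End Sub

End Module
```

Keeping the parsing in its own function means it can be tried out without a Pi attached, and lines that do not start with the prefix (boot messages, debug output) are simply skipped.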

I take it back about offline

I take it back about offline recognition. I just tried it, doesn’t seem to work. I’ve read in a couple of places that it works, but I’m not convinced until I see it in person. So, I wouldn’t count on that just yet.

Microsoft Speech Recognition vs Google Speech Recognition API

Hi Jeff,


I see that you work with Windows. In my project I initially used Microsoft Speech Recognition for voice recognition. The tests were not satisfactory, but I also saw that the system could be trained with a set of valid words. I did not investigate further because I have other problems to solve first. Now I’m testing with Telegram + an Android phone, using the Google voice-to-text converter built into the phone, with very good results. Unfortunately, it seems that the Google API for PC is paid and cannot be used for free. Also, I think that, unlike Microsoft’s, it has to stop and listen for some time before it converts the speech to text.


I do not know if this helps you at all.


P.S. Sorry if my English level is not very good.

I emailed Nuance the people

I emailed Nuance, the people who make Dragon NaturallySpeaking, as they offer an SDK. After sending them 3 emails I finally got a response back. Get ready to take out a 2nd mortgage on your house to purchase the SDK.

The SDK is essentially us giving the API to integrate the Dragon speech recognition into your applications. There are two ways that it can be used, for front end transcription/Client SDK (ex. Closed captioning) and back end transcription/Server SDK (ex. Batch audio processing). Each product has a cost for the development license ($10,000 for the Server SDK and $5,000 for the Client SDK), as well as the Server SDK has a cost of $25,000 per server depending on the amount that you are transcribing and the Client SDK is a per user/ speaking pricing model where you will need to prepay for the amount of users that would potentially be using the application.

My recommendation.

If you work in a Windows environment, the most reasonable way to convert speech to text seems to be Microsoft Speech Recognition. It appears to work much better with a limited number of words, so you’ll have to define the list of words you want to work with. Other solutions such as Google or Azure give better results because they use neural networks, but they cost money.

Speech Recog

Hi Jeff,

This is my current plan for how to approach the problem on Eva (my next bot). Given the limitations I have found in testing different platforms, it’s the best I have been able to come up with so far.

It is based on a Windows device that listens continuously, coordinating with an Android device that listens only when triggered by the Windows device. The Windows device also runs the brain and a SQL Server DB, which will hold tables for a dictionary of words, phrases, thesaurus, commands, patterns, knowledge, history, etc.

1. I will use a LattePanda running Windows that is listening continuously. If the speech matches any pattern that can be recognized, it will be processed immediately through the brain and responded to without involving Android. As recognition on Windows is unreliable, this will not work most of the time. I think it could work for a variety of simple commands and short phrases…not sentences though.

2. If the speech doesn’t match, or the speech matches a set of patterns associated with listening like “Eva”, “Listen”, “Robot” or other equivalents of “Ok Google”, then the bot will enter a different listening mode. The Panda will speak some kind of acknowledgement like “Yes?” or a tone and send a command to the Android via Bluetooth to start listening. The Panda will not respond to what it hears during this window of time until it receives a command back from the Android to resume.

3. The Android phone will listen…but only when asked to by the Panda or triggered by touch (when used as a remote). The Android will decode the speech it hears through Google. This text will then be sent to the LattePanda via Bluetooth to be broken down into grammar, parts of speech, and all the verbal cognitive functions.

4. Once a response is determined, the LattePanda will speak it (listening will be suspended here as well to avoid feedback loops). The Windows prototype I have will shape the mouth based on visemes related to the phonemes in the speech.
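The numbered steps above can be read as a small state machine, which is one way to keep the two devices from listening or speaking at the same time. The sketch below is only an illustration under that assumption: the state and event names are invented here, and the Bluetooth messaging and the brain itself are left out.

```vbnet
Imports System

Module ListeningCoordinator

    ' Hypothetical state names; the plan does not name them.
    Enum BotState
        WindowsListening   ' open mic on the LattePanda
        AndroidListening   ' Panda ignores what it hears while the phone listens
        Speaking           ' all listening suspended to avoid feedback loops
    End Enum

    ' Hypothetical events, one per step of the plan.
    Enum BotEvent
        LocalPatternMatched   ' step 1: Windows recognized a known pattern
        WakeWordOrNoMatch     ' step 2: wake word heard, or nothing matched
        AndroidTextReceived   ' step 3: the phone sent back decoded text
        SpeechFinished        ' step 4: the spoken response has ended
    End Enum

    ' Returns the next state; events that do not apply leave the state unchanged.
    Function NextState(current As BotState, evt As BotEvent) As BotState
        Select Case current
            Case BotState.WindowsListening
                If evt = BotEvent.LocalPatternMatched Then Return BotState.Speaking
                If evt = BotEvent.WakeWordOrNoMatch Then Return BotState.AndroidListening
            Case BotState.AndroidListening
                If evt = BotEvent.AndroidTextReceived Then Return BotState.Speaking
            Case BotState.Speaking
                If evt = BotEvent.SpeechFinished Then Return BotState.WindowsListening
        End Select
        Return current
    End Function

    Sub Main()
        ' Walk through one wake-word conversation.
        Dim state As BotState = BotState.WindowsListening
        For Each evt As BotEvent In {BotEvent.WakeWordOrNoMatch, BotEvent.AndroidTextReceived, BotEvent.SpeechFinished}
            state = NextState(state, evt)
            Console.WriteLine(evt.ToString() & " -> " & state.ToString())
        Next
    End Sub

End Module
```

Because only one state is active at a time, anything the Panda hears while the phone is listening or while the bot is speaking is dropped rather than processed twice.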

I have had very good results from Google on Android in the past, especially if I ran the results through what I call a normalization process afterwards. The listening is not open mic though; it must be triggered by something. Anna and Ava were triggered by touch. Listening on Windows is open mic…so it recognizes speech continuously, just unreliably. The visemes/phonemes are advantages on Windows.
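The post does not say what the normalization process involves, so the following is only a guess at the general shape such a step could take: lower-case the recognized text, strip punctuation, collapse extra spaces, and expand a few contractions. The `Normalize` function and its replacement table are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module SpeechNormalizer

    ' Hypothetical substitutions; a real table would be much larger.
    Private ReadOnly Replacements As New Dictionary(Of String, String) From {
        {"what's", "what is"},
        {"i'm", "i am"},
        {"gonna", "going to"}
    }

    Function Normalize(text As String) As String
        If String.IsNullOrWhiteSpace(text) Then Return String.Empty
        ' Lower-case and turn curly apostrophes into plain ones so the table keys match.
        Dim cleaned As String = text.ToLowerInvariant().Replace(ChrW(&H2019), "'"c)
        ' Replace everything except letters, digits, apostrophes and spaces with a space.
        cleaned = Regex.Replace(cleaned, "[^a-z0-9' ]", " ")
        Dim words As New List(Of String)
        For Each word As String In cleaned.Split({" "c}, StringSplitOptions.RemoveEmptyEntries)
            Dim replacement As String = Nothing
            words.Add(If(Replacements.TryGetValue(word, replacement), replacement, word))
        Next
        ' Joining with single spaces also collapses repeated whitespace.
        Return String.Join(" ", words)
    End Function

    Sub Main()
        Console.WriteLine(Normalize("  What's   the TIME, Eva?  "))   ' prints: what is the time eva
    End Sub

End Module
```

Feeding the brain a canonical form like this means the pattern tables only need one spelling of each phrase.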

By melding the two platforms, and having them communicate through Bluetooth, I think the advantages of each can be used, and the result is free other than the hardware. Coordinating is not always easy though. The only part I haven’t prototyped yet is the coordination so that speech/listening doesn’t happen simultaneously on the 2 platforms. This is just a solvable software problem.

My previous bots used a USB connection for the phone. This plan uses Bluetooth instead. This raises the possibility that the phone doesn’t even need to be on the bot. In fact, in a noisy or crowded environment, it may be better to hold the phone and use it like a “voice remote” for higher reliability anyway. Currently, I don’t plan on using the phone as a face; I plan on using a screen plugged into the Panda. The Android phone might slip into a slot on the back where it could ride along, or be removed and carried when being used as a voice remote.

I hope this is not too confusing.  I wish I knew a way to do it all on the Panda…that would be a huge improvement if it could be as reliable as the phone with google speech-to-text combo.





A great project, very similar to mine. I did not know about LattePanda. It seems an interesting product. I use a mini PC + 2 Arduinos to control the neck and wheels.


I will follow your progress with interest.

re: AlfonzoCAL

Thanks.  I’d sure like to see your bot with the mini pc.

My project will really be two…a sharable brain running on AWS, and a bot running a local brain with a Panda.  This setup will allow memories and configuration to be done via the web.  Memories will upload/download as needed to the Panda.  This will allow general knowledge to grow and be maintained centrally for multiple bots.

Wow. No idea that stuff

Wow. No idea that stuff was that expensive. I use it when I have hand problems (used it for six weeks last year when I broke my hands in a skiing accident). It is very accurate. For that price, it should be, though!

Have you looked at Amazon? Yes, you have to upload audio, but they have .NET SDKs to manage it and you pay for what you use.


I quote from their site:

You can try Amazon Lex for free. From the date you get started with Amazon Lex, you can process up to 10,000 text requests and 5,000 speech requests per month for free for the first year.

How often are you really going to be out of Wifi range?  

Maybe use the onboard Microsoft speech to text to try to do initial processing and then when you are sure that someone is actually talking, send the request to Amazon.  It would be worth a try at the very least.
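A rough sketch of that gating idea, assuming the local recognizer hands back some text and a confidence between 0 and 1 (as `System.Speech` does through `e.Result.Confidence`). The `ShouldSendToCloud` name and the 0.3 cutoff are made up for illustration and would need tuning; the actual upload to Amazon is deliberately left out.

```vbnet
Imports System

Module CloudGate

    ' Hypothetical cutoff: below this, treat the audio as noise rather than speech.
    Const SpeechConfidenceFloor As Single = 0.3F

    ' Returns True when the local result looks like real speech and is worth uploading.
    Function ShouldSendToCloud(localText As String, localConfidence As Single) As Boolean
        If String.IsNullOrWhiteSpace(localText) Then Return False
        Return localConfidence >= SpeechConfidenceFloor
    End Function

    Sub Main()
        Console.WriteLine(ShouldSendToCloud("turn left", 0.62F))
        Console.WriteLine(ShouldSendToCloud("", 0.9F))
        Console.WriteLine(ShouldSendToCloud("uh", 0.1F))
    End Sub

End Module
```

Only audio that passes the gate would count against the monthly request allowance, which is the point of doing the first pass locally.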

Good luck.  Let us know how you make out.  



I really appreciate everyone

I really appreciate everyone who commented on this topic and want to say thank you.

I have some news that might help anyone who is looking for the same thing in speech recognition as I was.

After several failed attempts at getting the speech recognition to work well and be accurate, I have found a two-part approach.


Purchase a bluetooth headset.

Here is the one I purchased: https://www.jabra.com/bluetooth-headsets/jabra-bt2047

Using the headset took the share of words recognized up from 30% to around 60%.

This gave me hope that it could still work. I then did a couple of training sessions with the speech recognition and it got better, but it still was not at the level I wanted.


After doing some reading and then looking at everyone else’s failed attempts, I came up with an idea that raised the share of words recognized to just about 90%.

Below is the base code that changed it for me. The idea is to load all the words into a list and give that list to the recognizer as its grammar, so it only has to choose among those words. This works well and I am getting very good results. I look forward to you all trying this, and hopefully we can come up with an even better working voice recognition.


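Option Strict On

Imports System.Speech
Imports System.Speech.Recognition

Public Class Form1

    ' In-process recognizer; the project needs a reference to System.Speech.
    WithEvents Reco As New SpeechRecognitionEngine
    Dim Counter As Integer = 0

    Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
        TextBox1.WordWrap = True
        TextBox1.ScrollBars = ScrollBars.Vertical

        AddHandler Reco.SpeechRecognized, AddressOf RSR_SpeechRecognized

        ' Read one word or phrase per line from Words.txt next to the executable.
        Dim GrammarList As New List(Of String)
        Using SR As New IO.StreamReader(Application.StartupPath & "\Words.txt")
            Do Until SR.EndOfStream
                Dim Word As String = SR.ReadLine().Trim()
                If Word.Length > 0 Then
                    GrammarList.Add(Word)
                    Counter += 1
                End If
            Loop
        End Using

        ' Show how many words were loaded in the title bar.
        Me.Text = "Form1 " & Counter.ToString

        ' Restrict recognition to the loaded words.
        Dim ChoicesToUse As New Choices(GrammarList.ToArray())
        Dim GB As New GrammarBuilder(ChoicesToUse)
        Dim GrammarToUse As New Grammar(GB)

        ' Load the grammar and keep listening on the default microphone.
        Reco.LoadGrammar(GrammarToUse)
        Reco.SetInputToDefaultAudioDevice()
        Reco.RecognizeAsync(RecognizeMode.Multiple)
    End Sub

    ' Log each recognized word with its confidence and append it to the running text.
    Private Sub RSR_SpeechRecognized(ByVal sender As Object, ByVal e As SpeechRecognizedEventArgs)
        TextBox1.AppendText(e.Result.Text & " | " & e.Result.Confidence.ToString & vbCrLf)
        TextBox2.Text = TextBox2.Text & " " & e.Result.Text
    End Sub

End Class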