Hello everybody, thank you so much for joining us for this live webinar session brought to you by EETech and All About Circuits. Today we've got a good one for you: this is Enabling Voice as a User Interface on Microcontrollers. I'm Jason Montgomery and I'm going to be your moderator for this session. Let me briefly introduce our two speakers, and then we'll get into the presentation itself.

Today we are joined by Alex Liu and Suad Yousef. Alex is vice president of Cyberon Corporation, in charge of the embedded solution business unit. He has 25 years of experience in the speech technology industry across algorithms, applications, and business development. We are also joined by Suad Yousef, a business development and marketing manager responsible for technologies and embedded system solutions at Renesas. He has 20 years of experience in semiconductor markets across smart and efficient embedded systems, motor-driven applications, application-specific microcontroller development, embedded AI, and much more. Really excited to have both of them join us today.

I will also remind everyone in attendance that you have the chance to submit live questions; we will have a live Q&A session after the presentation you're about to see, so make sure to use that Q&A widget. Looking forward to a really great presentation, and with that, Suad, I believe I'm handing it off to you for the presentation.

Thank you very much, Jason, and hello everyone from my side too. Thank you all very much for attending today's webinar. We are going to show you some exciting news and slides on a dedicated, future-oriented approach within user interface technology, namely a voice-recognition, voice-command user interface based on and enabled by Renesas 32-bit Arm-core microcontrollers and the Cyberon solution.

First of all, let me give you a brief overview of the agenda. I will show you some market dynamics and the key design challenges within them; then Alex will give you a dedicated explanation of the Cyberon voice user interface solution itself; and last but not least I will have one slide with the summary and the call to action at the end of today's presentation.

Let me therefore begin with a brief overview of the market and its dynamics. We see the global voice recognition market growing rapidly; it is estimated to hit the 23 billion US dollar range by 2025, a CAGR of 17.2 percent. This is thanks to the rapid rise of smart speakers, which is fueling voice UI growth in all areas. The current worldwide situation in terms of COVID-19 is also accelerating the growth and adoption of voice-based technology. Worth mentioning too is the rapid IoT adoption across a wide range of applications and systems; psycholinguistic data analytics together with affective computing are important and notable drivers of voice technology adoption as well.

Now, when we say voice user interface and voice recognition technology, we have to look at the broad range of understanding and expectation behind the term itself. This is where we need to understand the segmentation of the technology and be able to distinguish the differences within it. Within the VUI arena there is a broad segmentation to consider, with two main streams, known as speech recognition and so-called conversational AI.
On this slide we see the two main streams within the segmentation, and also the capabilities of each. On one side is speech recognition, which is focused on understanding a few words, namely commands within the system; on the right-hand side we see conversational AI, which has to support human-like conversation, including NLP.

Furthermore, based on the applications and the requirements that have to be fulfilled, there are dedicated use cases in the market. On the left-hand side, for speech recognition, we see white goods, wearables, and industrial applications as example use cases and systems; on the right-hand side, for conversational AI, we already know smart speakers, smartphones, and IVR systems as some of the examples to consider.

Each set of applications and use cases comes with its own requirements and flexibility. When we talk about speech recognition, command recognition within the system, we think about IoT endpoints that can be served by MCUs and MPUs, and this is exactly the case we are going to see today in the presentation. On the other side, when we speak about conversational AI, where NLP is mandatory, we are talking about cloud-connected devices or IoT gateways as mandatory parts of the system, with MPUs, GPUs, and DSPs delivering a very high level of computational capability.

Now, within the ecosystem, we see that for conversational AI there is a mature ecosystem provided by the well-known vendors on the market, as depicted on the slide. When we come to speech recognition and command recognition, the ecosystems are still evolving, and this is where Cyberon, namely Alex, is going to present a very exciting set of slides explaining exactly the ecosystem that will enable you to implement speech recognition within MCU-based systems. So Alex, I'm handing over to you now and moving to the next slide. Thank you.

Yes, thank you so much. We see a lot of demand for voice commands on standalone devices. In our experience there are many fields, like smart speakers, automotive, wearable devices, storage devices, media players, and smart home control. These are often devices on very resource-limited platforms that would like a high-accuracy voice recognition interface instead of a connected service, because of security requirements and the need for hands-free operation.

However, we know that speech recognition is not something new, and for people deploying voice technology on resource-limited platforms there are some challenges. The major issues with voice technology on computing-resource-limited systems like MCU platforms are as follows. First, it is hard to find a qualified algorithm that can run efficiently on a low-resource platform. Traditional speech recognition is a complicated algorithm with a big acoustic model; to achieve good enough performance it usually has very high resource requirements, and it is not designed for a small system like an MCU platform.
Secondly, if the algorithm is downsized to run on MCU platforms, then due to the resource limitations it may in most cases only work in quiet or low-noise environments. That is not what users expect; real-life environments are usually full of various background sounds. To solve this problem, a speech pre-processing algorithm is needed to reduce the noise. However, pre-processing may bring distortion to the speech signal or make it incomplete, which unfortunately reduces the recognition rate, so developers have to put effort into carefully fine-tuning both algorithms to ensure performance. This can add uncertainty to product development.

Third, with the conventional approach you need to collect a lot of training data to create customized keywords for recognition; usually the training data are utterances from hundreds of people. It takes tremendous time, money, and human resources to do that, and the worst thing is you may miss the time to market. For small and medium-sized projects it is hard to bear the development cost, and for large-scale projects, even if they can bear the high cost for a specific market, it is still difficult to develop a product for the global market with multiple languages supported simultaneously.

In the following part of this session I will introduce the Cyberon solution and how we overcome these issues. Before that, let me take a few minutes to introduce our company. Cyberon is a professional software company located in Taiwan, established in the year 2000. We have been devoted to speech technology for more than 20 years. We now have 50 employees, 80 percent of whom are professional researchers and engineers with rich experience in this field. In the past three years, more than 85 million devices equipped with Cyberon technologies have shipped to the worldwide market; products include mobile phones, smart toys, consumer electronics, home appliances, IoT, wearable, and automotive devices. We provide several kinds of speech solutions for various platforms, from single keyword detection and command recognition with grammar on embedded systems, to both cloud and embedded solutions for continuous speech recognition, and dialogue systems integrated with natural language processing for specific domains. Besides voice recognition, we also provide a text-to-speech solution to support complete voice interaction interfaces.

For today's topic, the embedded system, one of our core competences is making complicated voice algorithms compact and efficient while still keeping high recognition performance. Just like our Chinese name, which means "to race with micro", we keep developing and optimizing our software and algorithms for resource-limited platforms and bring cost-effective solutions to partners and customers. I am very happy to work with Renesas to make our algorithm run on Renesas MCU platforms, and I hope our cooperation will bring benefits to people around the world.

Now I'm going to introduce our solution, Cyberon DSpotter, and how we overcome these problems in cooperation with Renesas. DSpotter is a local voice recognition engine running on the device, with a high accuracy rate and high noise robustness. It can work in always-on mode to detect the wake-up word, and also perform multiple-command recognition to achieve local command-and-control functions.
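To picture how such an always-on flow might look in application code, here is a minimal sketch of a wake-word-then-command loop. All of the engine types and functions (VoiceEngine, voice_engine_init, voice_engine_feed) and the audio driver are illustrative placeholders, not the actual DSpotter SDK API.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Placeholder engine API for illustration -- not the real DSpotter SDK. */
typedef struct VoiceEngine VoiceEngine;
VoiceEngine *voice_engine_init(const uint8_t *model, void *mem, size_t mem_size);
int voice_engine_feed(VoiceEngine *e, const int16_t *pcm, size_t n); /* >=0: command ID */

#define FRAME_SAMPLES 480              /* 30 ms of 16 kHz mono audio */
#define WAKE_WORD_ID  0

extern const uint8_t g_command_model[];       /* model data built with DSMT */
extern void mic_read(int16_t *buf, size_t n); /* board-specific audio driver */
extern void handle_command(int cmd_id);       /* application action dispatch */

static uint8_t s_engine_mem[50 * 1024];  /* ~50 KB working RAM, per the quoted figure */

void voice_task(void)
{
    int16_t frame[FRAME_SAMPLES];
    bool in_command_mode = false;
    VoiceEngine *eng = voice_engine_init(g_command_model,
                                         s_engine_mem, sizeof(s_engine_mem));

    for (;;) {
        mic_read(frame, FRAME_SAMPLES);                 /* always-on capture */
        int id = voice_engine_feed(eng, frame, FRAME_SAMPLES);
        if (id < 0)
            continue;                                   /* nothing detected yet */
        if (!in_command_mode) {
            if (id == WAKE_WORD_ID)
                in_command_mode = true;  /* wake word heard: listen for commands */
        } else {
            handle_command(id);          /* local command-and-control action */
            /* a real design would fall back to wake-word mode after a timeout */
        }
    }
}
```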
It is designed for embedded systems, so no network connectivity is required: there is no latency due to network traffic, users always get a real-time response, and there is no privacy concern either.

To minimize development cost, we take a phoneme-based approach to acoustic modeling. A phoneme is the basic element of a language, and every language has its own phoneme set. With this phoneme-based approach, for each language we at Cyberon study the language's phonetic structure, collect a large amount of training data from native speakers, and then train and build a base acoustic model at the phoneme level on our server side, in advance. After that, we provide a tool for developers to create customized commands easily. This tool is able to predict the pronunciation as a phoneme sequence from text input, so developers just need to type the text of the commands, check and confirm the pronunciation, and they get the recognition acoustic model almost immediately. For developers, there is no need to spend time and effort collecting speech data to train the command model.

The base model I just mentioned is language dependent. So far DSpotter supports more than 40 languages, covering most major languages of the world, and we maintain certain localized versions to enhance recognition for specific regions; for example, there are nine versions of English with different accents. We are still developing more languages, with five new ones on the roadmap for this year, so please let us know if you need any new language support and we will arrange the priority and roadmap according to customer requests.

DSpotter takes quite low resources and can be applied on many Renesas MCU products. For example, on Cortex-M4 platforms it takes about 16 MIPS, and after optimization with SIMD instructions it occupies only around 40 MHz of computing bandwidth. The flash size it needs is about 200 KB and the RAM size is 50 KB. Furthermore, the required resources are almost constant with respect to the number of commands: only very few additional resources are needed when more commands are added. Usually it is able to support almost 100 commands on Renesas Cortex-M4 MCU platforms.

As I said, we provide a friendly tool for developers to create customized commands. This tool is called DSMT, the DSpotter modeling tool. Besides creating specific command models, the tool also provides testing functions and parameter adjustment for optimization; I will introduce this tool in more detail later.

DSpotter's architecture separates the engine code from the model data, so the recognition command set can be switched easily and quickly, even without changing the application code; a minimal sketch of this follows below. This minimizes the effort needed to change commands for different languages, applications, or scenarios, especially for products that support multiple languages.

By the way, DSpotter has high noise robustness thanks to its state-of-the-art DNN-based modeling. For a well-designed keyword, the accuracy can reach over 90 percent even at a 6 dB SNR, without any speech pre-processing. Of course, speech pre-processing still helps, especially in heavy-noise environments and on devices with audio output, but since DSpotter has a wider tolerance for audio quality, it is much easier for developers to balance the pre-processing and recognition algorithms, and developers can save a lot of fine-tuning time.
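Since the engine code and the model data are decoupled, a language or command-set switch can reduce to repointing the engine at a different model image in flash. A minimal sketch of that idea, again with placeholder names rather than real SDK calls:

```c
#include <stdint.h>
#include <stddef.h>

/* Placeholder call for illustration -- not the real DSpotter SDK. */
typedef struct VoiceEngine VoiceEngine;
int voice_engine_set_model(VoiceEngine *e, const uint8_t *model, size_t size);

/* Two command-set models built with DSMT and stored in flash. The engine
 * binary never changes; only the model data it points at is swapped. */
extern const uint8_t g_model_english[];  extern const size_t g_model_english_size;
extern const uint8_t g_model_german[];   extern const size_t g_model_german_size;

typedef enum { LANG_ENGLISH, LANG_GERMAN } Language;

int select_language(VoiceEngine *eng, Language lang)
{
    switch (lang) {
    case LANG_ENGLISH:
        return voice_engine_set_model(eng, g_model_english, g_model_english_size);
    case LANG_GERMAN:
        return voice_engine_set_model(eng, g_model_german, g_model_german_size);
    }
    return -1;  /* unknown language */
}
```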
Besides these DSpotter features, Renesas offers a full line of MCU products with low cost, low power consumption, and high security features, which fully meet the requirements of IoT, home appliance, and wearable or hearable devices. I think these are our target markets with big potential.

On this page, on the left side is a picture of the Renesas development board for voice technology with the RA6M1 MCU. It is a full reference platform for developers to shorten development time: Renesas provides a complete sample project with voice recognition, voice activity detection, and the playback function. This platform shows that we can achieve high-quality voice recognition with Renesas at low cost and with low power consumption. By the way, a version with the RA4M2 MCU is coming soon.

On the right-hand side is a screenshot of DSMT, the DSpotter modeling tool. The DSpotter engine itself is a language-independent algorithm, while the language-dependent part is the acoustic model of the recognition commands. Developers can create their customized commands easily with this tool: just select the target language and input the text of each command. Again, there is no need to collect training data in advance, because the tool generates the desired command model from the pre-trained base model containing every phoneme of the language.

DSMT provides testing functions to verify the performance of the customized commands, and there are two ways to do the testing. One is the online test, which means we can test with a live audio stream, such as from the PC microphone or from a development board with UART or USB audio output. The GUI of the online test tool shows the real-time waveform and spectrogram of the audio input, so we can also check the audio recording quality if needed. The other testing function is the offline test, which takes WAV files as input and is able to process multiple WAV files in batch. With the offline test tool we can use a fixed set of benchmark data to compare the performance of different settings, which is useful and important when we do parameter adjustment for optimization. Using the offline test we can save a lot of testing time: on a PC it usually takes only about 15 minutes to run a 24-hour WAV file test.

Both testing functions show information about the recognized results, including the confidence score, timing, and speech energy. With the DSMT tool, developers can adjust parameter values according to this information and optimize the performance if they want. Before going to production, we usually recommend that developers collect testing data from 10 or 20 users on the target devices to do the offline test and optimization.

To sum up, the DSMT tool provides a friendly user interface, heavily reduces the effort needed to develop voice recognition products, and helps developers meet the time to market, as well as the global market, with more than 40 languages available.

Okay, next page. Besides voice recognition, we provide value-added software features on Renesas MCU platforms, including VAD, voice activity detection, and audio playback for voice responses. Regarding VAD: we know that power consumption is always a concern for many applications, especially battery-powered devices and recognition in always-on mode such as wake-up word detection. Besides the low-power-consumption hardware Renesas provides, we also try to reduce the power consumption further. One approach is to adopt VAD: the VAD software we provide detects human voice before doing recognition, so that only human voice is sent to the recognizer. In this way we save the recognition computation on non-voice audio input and thus reduce power consumption; by our simulation, this software may reduce power consumption by up to 50 percent.
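Cyberon's VAD itself is proprietary, but the gating idea can be illustrated with a crude short-term-energy stand-in: frames are forwarded to the recognizer only when the level suggests speech, so the heavy DNN computation (and the CPU) stays idle on silence. All names and the threshold here are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define FRAME_SAMPLES 480   /* 30 ms of 16 kHz mono audio */

extern void recognizer_feed(const int16_t *pcm, size_t n);  /* placeholder hook */

/* Crude energy-based voice gate -- a stand-in for the real VAD, shown only
 * to illustrate why gating saves recognition cycles on non-speech input. */
static bool frame_has_voice(const int16_t *pcm, size_t n)
{
    uint64_t energy = 0;
    for (size_t i = 0; i < n; i++)
        energy += (uint64_t)((int32_t)pcm[i] * pcm[i]);
    /* Threshold picked arbitrarily here; a real VAD also uses spectral cues. */
    return (energy / n) > 200000u;
}

void on_audio_frame(const int16_t *pcm)
{
    if (frame_has_voice(pcm, FRAME_SAMPLES))
        recognizer_feed(pcm, FRAME_SAMPLES);  /* run the expensive DNN path */
    /* otherwise skip recognition and let the MCU return to a low-power state */
}
```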
We also support an audio playback function on the MCU platform. The audio clips for voice responses can be compressed to only 1/15 of their size by a PC tool and then stored in the MCU's flash. When the device wants to give the user a voice response, such as a voice prompt or feedback on a recognized result, the corresponding audio data is decompressed on the MCU platform and played out in real time. For example, we can store five minutes of audio data in 640 KB of flash, so your product can provide plenty of voice responses to enhance the interaction with users.

Okay, next page. This page shows the development flow for voice commands on a Renesas MCU with Cyberon DSpotter; it is very easy and simple. First, determine the user commands for the product. We provide guidelines on how to design good keywords for recognition, covering command length, pronunciation similarity, syllable structure, and so on; good keyword design is important for recognition performance and the user experience, especially for the wake-up word. The second step is to input the text of the keywords into DSMT, after creating a new DSMT project and selecting the target languages; then check and confirm the pronunciations and save the project if they are okay. Now you have a command model. As mentioned, with this tool you can verify the performance by doing online or offline tests, and adjust parameters according to the test results for optimization if needed. The third step is to import the command model data into the Renesas MCU workspace and program the application control layer code for each command's corresponding action (a sketch of such a control layer follows below). The last step: download the image to the development board, start testing and debugging, and enjoy your work.
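For step three, the application control layer is often little more than a mapping from recognized command IDs to device actions. Here is a minimal sketch using the demo's command set; the IDs are assumed to follow the order the commands were entered in the DSMT project, and every handler name below is a hypothetical application function.

```c
/* Application control layer: map recognized command IDs to actions.
 * Hypothetical handlers; the ID order is an assumption for illustration. */
enum {
    CMD_PLAY_MUSIC, CMD_STOP_MUSIC, CMD_NEXT_SONG, CMD_PREVIOUS_SONG,
    CMD_VOLUME_LOUDER, CMD_VOLUME_LOWER, CMD_OPEN_CAMERA
};

extern void player_start(void);
extern void player_stop(void);
extern void player_next(void);
extern void player_prev(void);
extern void volume_step(int delta);
extern void camera_open(void);
extern void play_voice_response(int cmd_id);  /* compressed clip from flash */

void handle_command(int cmd_id)
{
    switch (cmd_id) {
    case CMD_PLAY_MUSIC:     player_start();   break;
    case CMD_STOP_MUSIC:     player_stop();    break;
    case CMD_NEXT_SONG:      player_next();    break;
    case CMD_PREVIOUS_SONG:  player_prev();    break;
    case CMD_VOLUME_LOUDER:  volume_step(+1);  break;
    case CMD_VOLUME_LOWER:   volume_step(-1);  break;
    case CMD_OPEN_CAMERA:    camera_open();    break;
    default:                 return;           /* unknown ID: ignore */
    }
    play_voice_response(cmd_id);  /* audible feedback for the recognized command */
}
```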
Okay, next page. Let us show you the whole process and a simple demo by video.

In the video: I would like to show you the DSpotter modeling tool and how developers can create their customized commands. We provide this tool, called DSMT, the DSpotter modeling tool, to create customized commands for the DSpotter engine; developers have to apply for an account to use it. After logging in, let's create a new project: give the project a name, then select a language from this list. Let me pick English, confirm the configuration, and set a folder for this project. Now we just need to input the text of the commands we want, for example the wake-up word "Hi Renesas". After inputting the text you can see the predicted pronunciation here, and you can use the play button to hear the pronunciation via TTS. If the pronunciation is correct, we can use it directly; if it is incorrect, we can modify it directly in this text area, and if you are not familiar with the phoneme symbols, here is a link to a reference page. After confirming the pronunciation we save the project, and we get the command models in the project folder. It's just that easy and convenient.

We can also use this tool to verify the performance. Here we have two testing functions: the online test and the offline test. The online test uses the microphone's streaming recording, while the offline test loads pre-recorded WAV files. Let's try the online test first: start recording, and here you can see the waveform of the microphone recording. When I say "Hi Renesas", the tool shows the detected keyword along with the confidence score and related information. Let me try again: "Hi Renesas". "Hi Renesas". "Hi Renesas". You can see it works well, and we did not have to collect any training data for this specific keyword in advance.

Next, let me show you the offline test tool. You can load a WAV file to test, or give a script file containing multiple WAV files and reference results, like this one. The tool performs recognition on these WAV files one by one and shows you the false rejection rate and the number of false acceptances, as well as the confidence score and related information for each result. We suggest developers collect testing data from target users and target devices to verify the model's performance and, if needed, adjust parameters according to the test results and score information to optimize performance.

Not just a wake-up word: we can create more command groups and add more commands with this tool. Let me insert a new group and add commands from a text file; now there are seven commands in this group. Again, we can do the online and offline tests to verify the performance of this group and tune the parameters according to the results. If everything is okay, just copy the command model to the MCU flash and control the switching of recognition groups at the application level.

Now I'm going to show you a real-time demo on the Renesas RA6M1 EVB. This demo includes the wake-up word and local commands, and integrates voice responses as well, so we can hear the voice feedback and see the text results in the console. At first it is in trigger mode, just waiting for the wake-up word; after detecting the wake-up word it enters command mode for further command recognition until it times out. "Hi Renesas." (Hello.) "Play music." (Play music.) "Previous song." (Previous song.) "Next song." (Next song.) "Volume louder." (Volume louder.) "Volume lower." (Volume lower.) "Stop music." (Stop music.) "Open camera." (Open camera.)

As I said, DSpotter has high noise robustness. Let me play some noise in the background and try the demo again. "Hi Renesas." "Open camera." (Open camera.) "Play music." (Play music.) "Volume louder." (Volume louder.) "Volume lower." (Volume lower.) "Previous song." (Previous song.) "Next song." (Next song.) "Stop music." (Stop music.)

To sum up, DSpotter is an offline voice recognition solution that provides a friendly tool for easy command customization, saving a lot of development time. It uses DNN-based modeling to achieve high noise robustness, so it is not very sensitive to background noise. It now supports more than 40 languages, including most major languages of the world, and the engine is written in C, so it can be ported to many platforms. It uses very low resources, enabling high-performance voice recognition on Renesas MCUs. Working with the excellent Renesas team, DSpotter is now available for the RA4M2, RA6M1, RA6M2, RA6M4, and RX651, and we look forward to more.

Okay, thank you. Now here is one of our internal benchmarking studies I'd like to share with you. We created a DSpotter model for the wake-up word of Amazon AVS, "Alexa", with the DSMT tool, with optimization, running on the RA6M1 development board. The test conditions include two distances, 1.5 meters and 3 meters, with three kinds of background noise (babble noise, music noise, and TV noise) at two noise levels (one of them 10 dB). Regarding the testing data, besides the Amazon standard utterances, we collected our own testing set from about 100 English native speakers.
You can see that in all test items the recognition results exceed Amazon's criteria. In most conditions the accuracy rate is higher than 95 percent, and even in the worst case the accuracy is still higher than 80 percent. As for the false acceptance test, there were only two false alarms over a 24-hour TV program. This study shows the excellent performance of DSpotter, and we have had similar experiences in other projects.

Okay, that is the end of my sharing; I hope you enjoyed it. Suad, would you please give us the summary of this session? Thank you.

Thank you very much, Alex, quite an impressive presentation. So, ladies and gentlemen, we have seen the value chain of an efficient, easy-to-use, easy-to-implement, and adaptable solution coming from Cyberon, as explained by Alex. What we have as a summary and call to action is the following. An edge-based voice user interface running on endpoints is definitely expected, as we have seen in the market dynamics: drivers like smart speaker adoption, but unfortunately also the COVID-19 situation, are setting the foundation for voice technology. Cyberon DSpotter definitely enables rapid voice user interface development using Renesas MCUs: you have speedy command customization, you have, as you have seen, high noise robustness within the system, together we support more than 40 global languages within the solution, and we definitely deliver the high portability and low resource requirements that must be fulfilled.

It is a very simplified engagement model: please talk to us at Renesas about the bundled voice solutions based on our MCUs. We welcome you to start the discussion with us and see how we can work together on your solution, providing you a future-oriented voice user interface. If you want to implement voice technology, let us show you how you can immediately start with the first step, namely prototype development, in no time at all, as you have seen in Alex's explanation and presentation of the Cyberon solution using the Renesas RA6M1 MCU. We are going to increase the lineup; we are working on it, as Alex has shown, and we will have solutions based on a couple of further devices from the RA family, so please stay tuned and check for updates. As I said, we are keen to welcome you if you have any questions or requests. Jason, I think we are ready now to move on to the question and answer session.

Great, Suad, thank you very much, and thank you Alex as well; to both of you, thanks for the outstanding presentation we just had. We've got lots of questions that have come in, and just as a reminder for everyone in attendance: if you have anything on your mind you'd like us to cover, now is a great time to submit those questions. In the meantime, gentlemen, let's start going through some of these.

First question: the DSpotter engine, is it just a software solution, or are there additional hardware parts that are also needed?

Oh, it's just software.

Got it, thank you very much, Alex. Suad, this question might be for you: can you talk about the licensing of DSpotter a bit?

Yes. Regarding the license of DSpotter: as we said, this is in cooperation with Cyberon, so there is actually one window, one responsible party in that case, and that is Renesas.
We can offer you the license-free DSpotter toolchain in case you need it to start your adaptation and development, so please get in contact with us; we will provide you all the necessary parts, including the license-free DSMT toolchain provided by Renesas itself.

Thank you very much, Suad. Is there any demo version of the DSpotter toolchain out there? Alex?

Yes: if you want to try our software tool DSMT, please contact Renesas to apply for an account.

Understood, thank you very much. What CPU power is necessary for DSpotter on a Cortex-M4 platform?

It takes about 40 MIPS, or 40 MHz of computing bandwidth.

All right, thank you very much. How many microphones are needed for this solution?

For DSpotter we just use one microphone input: one audio input, one mono channel. If you integrate signal pre-processing, the microphone count is decided by the pre-processing algorithm; we just use its mono output as the input to DSpotter.

Okay, thank you very much. Can you support multiple languages at the same time?

Yes. Because the DSpotter model is language dependent, it depends on the resources available on the platform; for example, on the RA6M1 we can run two languages concurrently. We also provide some bilingual recognition languages, like Chinese-English, and so far we also have Japanese-English and Korean-English, so it can support mixed-language command recognition, for example an "open Wi-Fi" command spoken in Chinese with the English word "Wi-Fi". We are working to add more bilingual support to the language list.

Got it, thank you very much. Does DSpotter support noise reduction?

DSpotter itself has high noise robustness, but we do not do noise reduction in our algorithm.

Got it. On the video demo, what was the MCU on the board?

It is the RA6M1, a Cortex-M4-core-based device running at 120 MHz and providing 512 KB of flash, for example. But nonetheless, as I said on the summary slide, we are working on increasing the device lineup: we will soon also have the RA4M2 device supporting the solution, providing one-language command detection and recognition.

Thank you, Suad, very much. This next question, let me know, gentlemen, if you need me to go back to a particular slide. I'm going to read out this question: C-O-N-F-I, 44. That looks like the confidence of the recognition, but what does the 44 in particular mean?

It is a normalized confidence score of the recognized result. If the confidence score is higher than zero, we consider it a qualified result; 44 is a pretty good, high score.

How high does that score usually go?

For most cases it is around 20 to 30.
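To make the score handling concrete, here is a minimal sketch combining the acceptance rule Alex just gave (a result qualifies when its confidence is above zero) with the per-command "reward" offset he describes in the next answer. Everything here, including the command count, is an illustrative assumption rather than SDK code.

```c
#include <stdbool.h>

#define NUM_COMMANDS 7   /* assumed size of the active command group */

typedef struct {
    int cmd_id;       /* which command was recognized */
    int confidence;   /* normalized score, e.g. the 44 seen in the demo console */
} RecogResult;

/* Per-command tuning offsets: a positive reward makes a command more
 * sensitive; a negative one suppresses false acceptances. Tuned offline
 * against the DSMT offline-test score distribution. */
static int g_reward[NUM_COMMANDS] = { 0 };

bool accept_result(const RecogResult *r)
{
    /* Qualified result: adjusted confidence above zero. */
    return (r->confidence + g_reward[r->cmd_id]) > 0;
}
```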
Okay, let's see, I think we might have another one or two questions here. Can you talk about how many commands this can support overall?

Yes, as I presented in some of the slides, it should be okay to support 100 commands on a Cortex-M4 platform of the Renesas RA series.

All right, thank you. And this looks like it might be our last question: can you talk about how to tune the voice modeling?

From the testing results we can see the score distribution and consider giving a reward to each command. For example, if you want to make a command more sensitive, we can add a reward to its confidence score; if it is too sensitive and causes too many false acceptances, we can give a negative reward to the score. We also provide guidelines on how to add what we call garbage words to reduce false acceptances. We have material on how to fine-tune the performance and can share it with everyone.

All right. Well, Alex, we've gotten just about to the top of the hour, and it looks like we've made it through the questions. So, Suad Yousef and Alex Liu, I want to thank you both very much for joining us here today. Thank you, gentlemen. Thank you very much. Thank you.

And just as a reminder to everyone: I've been Jason Montgomery with EETech and All About Circuits. If you go to allaboutcircuits.com you will find plenty of other live webinar sessions that have been done recently; I know we have plenty that we've done, especially with Renesas, and we have more coming up in the future. There you'll be able to find on-demand webinars that Renesas has put on and register for the ones they have coming up as well. I want to thank everyone in the audience for attending this live session today, and we look forward to seeing you all again at another one sometime soon. Thank you all very much.