
Taking Voice AI to the Next Level

AI ecosystem director Kate Kallot explores what new on-device tech means for the voice AI market


Today’s voice AI market is substantial and growing with every year—both in reach and innovation. In the U.S. alone more than 120 million households contain smart speaker devices, and more than half of us regularly call upon our smartphone voice assistants. Voice-based artificial intelligence devices are in our homes, pockets, cars, and offices, and we’re all getting more comfortable communicating vocally with computers rather than using a touchscreen, mouse, or keyboard.

But as smart as these devices seem, they’re less sophisticated than you think: much of their intelligence relies upon a connection to the internet. Voice processing is often performed on powerful, cloud-based servers, and the only voice AI that runs on the device itself is what we call ‘keyword spotting’.

Amazon just integrated Alexa Voice Services into its AWS IoT Core, making it easier and more cost-effective for developers to add built-in voice capabilities to small devices powered by Arm Cortex-M processors. However, devices in this category are currently still limited to keyword spotting: Alexa, for example, is only capable of listening out for her own name in order to trigger recording, which is then sent upstream to the cloud for processing.
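The wake-word gating described above can be sketched in a few lines of Python. This is a toy illustration, not Amazon's implementation: the function `gate_frames`, the `is_wake_word` callback, and the fixed `command_frames` window are all hypothetical names and simplifications, and real audio frames are replaced here by plain strings.

```python
def gate_frames(frames, is_wake_word, command_frames=2):
    """Return only the frames that would be sent to the cloud.

    Nothing leaves the device until the local detector fires; after that,
    the next `command_frames` frames are forwarded for cloud processing.
    (Hypothetical helper for illustration only.)
    """
    uploaded = []
    remaining = 0  # frames still to forward after the last wake word
    for frame in frames:
        if remaining > 0:
            uploaded.append(frame)
            remaining -= 1
        elif is_wake_word(frame):
            # The wake word itself stays on the device; it only opens the gate.
            remaining = command_frames
    return uploaded


# Stand-in "audio": each string represents one frame of speech.
stream = ["noise", "alexa", "play", "music", "noise", "chat"]
sent = gate_frames(stream, lambda f: f == "alexa", command_frames=2)
print(sent)
```

Everything outside the short window after the wake word stays on the device, which is why the local detector is the only on-device intelligence this design needs.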

The reason often given for this is that a device’s size is directly correlated to its ability: small devices such as smart speakers and smartphones don’t pack the power needed for complex AI processing. But that’s not exactly true, and a number of companies have successfully ported voice AI to operate fully on-device. Google, for example, now has the ability to run the Google Assistant almost entirely on the smartphone itself. Meanwhile, Snips, acquired this week by multi-room audio giant Sonos, has built a business model on the privacy benefits of keeping voice AI on the device. More on them later.

This opens up a whole host of exciting new use cases, over and above the smart speaker that’s still useful if the internet goes down. Imagine devices in emergency situations that can assist first responders via voice without the need to connect to a server, or voice AI devices that can operate in extreme conditions with limited connectivity, such as those used by astronauts on the International Space Station or by deep sea divers at the bottom of the ocean. This isn’t science fiction – it’s happening now. And in the next five years, I believe we’ll reach a point where 50% of voice AI workloads are completed on-device.

There’ll still be a need to go online for information such as news and weather, of course – but performing smart home tasks such as playing music, setting reminders and operating lighting and heating in your home will all be done locally. Not only does this offline intelligence make the technology more reliable, but it also ensures our information remains private — keeping what you say and do between you and your voice assistant. 

Taking voice technology into the future

The AI technical capabilities of Arm-based CPUs already enable powerful machine learning (ML) processing today: ‘TinyML’, as the industry has dubbed it, is a major expansion area in this realm, enabling better power management and scheduling, security auditing, object detection and keyword spotting.

And TinyML design will get further help as Arm continues to evolve and optimize its GPU and NPU technologies for AI. We’ve just announced Arm Ethos, a series of machine learning processors designed to tackle a broad spectrum of compute-intensive ML requirements.

For most applications, however, the CPU is set to remain the dominant AI platform – whether it’s handling the AI entirely or partnering with a co-processor for certain tasks. And thanks to continuous advancements in microarchitecture such as Arm Helium, future Arm Cortex-M processors will be able to manage challenging compute workloads. When combined with algorithm-shrinking techniques such as post-training quantization and pruning, we’re able to do a remarkable amount of AI processing on even the lowest-power Arm processors.
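As a rough illustration of the two model-shrinking techniques named above, here is a minimal sketch in plain Python. It assumes the simplest variants, symmetric 8-bit quantization with one shared scale and magnitude-based pruning, applied to a flat list of weights; the function names are made up for this example, and production toolchains are considerably more involved.

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: floats -> int8 plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    drop = set(smallest)
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [0.5, -1.27, 0.01, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(q, pruned)
```

Storing each weight as one signed byte instead of a 32-bit float cuts the memory footprint by roughly four times, and pruned zeros can be skipped or compressed, which is what makes such models practical on small microcontrollers.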

On-device intelligence Made Possible by Arm


The latest offering in our Made Possible series (where we explore innovations that Arm technology played a part in) features Snips—a company specializing in untethered voice AI in endpoint devices. Companies like Snips are running Natural Language Understanding (NLU) and real-time speech recognition on Arm IP ranging from Cortex-M to Cortex-A.

I recently spoke to Arm Innovator and Snips CTO Joseph Dureau about his experience using Arm’s low-power Cortex-M processors as a platform for this voice AI technology. He told me that Arm processors have been key to allowing Snips to achieve the end-to-end, private-by-design solution that he believes is the future of this technology.

For example, the Snips Spoken Language Understanding engine can run on a wide variety of hardware, with the tiniest solution running on Arm Cortex-M4 processors at 100MHz.

“It’s the lightest MCU platform we’ve been able to integrate on. It is quite prevalent in the small IoT space, along with the Cortex-M7 processor. Our solution, called Snips Commands, is able to identify a wake word and understand voice commands like ‘play’, ‘pause’, or ‘heavy wash’,” says Dureau.

With more powerful Arm processors, the Snips Flow solution can understand queries expressed in natural language, like “it’s dark in here”, “give me a recipe for pasta and zucchini”, or “play Aretha Franklin on the radio”. The minimum requirement for Snips Flow is a dual-core chip at 1.2GHz. For large vocabulary use cases, Snips typically requires a quad-core Cortex-A53 processor.

“On this kind of hardware, we can achieve cloud-level performance in voice AI, even on large vocabulary use cases, while keeping all the processing on the device. Last fall, we published a benchmark comparing Snips Flow running on a Raspberry Pi 3 to major cloud Speech APIs, for a music use case. The data revealed that Snips can achieve cloud-level accuracy on the device.”

If you’re interested in finding out more about how Snips uses Arm-powered AI, read this Q&A with CTO Joseph Dureau. Read more on how Arm enables artificial intelligence everywhere.
