Wake-Word Voice Recognition on STM32H7 with Edge Impulse

May 17, 2026

STM32H7 Voice Recognition with Edge Impulse and IAR — a Practical TinyML Guide

A field-tested TinyML guide. Build an always-on wake-word voice classifier for the STM32H7 with Edge Impulse Studio and IAR Embedded Workbench, then deploy it on a stock STM32H747I-DISCO alongside TouchGFX, FreeRTOS audio, SD logging, and a NORA-W106 Wi-Fi uplink to Google Cloud, without breaking any of them.

A wake word is a short phrase your device listens for around the clock without sending any audio to the cloud — think “Hey Siri”, “Alexa”, or “Hey Google”. The moment it hears the phrase, it knows you’re talking to it and starts capturing the actual command.

Deploying machine-learning models to the edge, frequently termed TinyML, promises local, low-latency, privacy-preserving inference directly on the target hardware.

Part I — Why this matters

1. Embedded ML has crossed the threshold — STM32H7 wake word in a day

AI is taking the world by storm, and the embedded edge is now firmly part of that wave. Between 2022 and 2024, and accelerating in 2025, microcontrollers like the STM32H7 became capable of running real neural networks in real time, offline, on a milliwatt-class power budget. Examples include wake-word detectors, image classifiers, gesture recognisers, and anomaly detectors.

Three forces driving the shift

First, hardware: Cortex-M55 with Helium SIMD, ARM Ethos-U micro-NPUs, ST’s Neural-ART on the STM32N6, and Alif Ensemble parts. Second, tooling: the MLOps ecosystem around TFLite for Microcontrollers is finally mature. Finally, demand: every consumer, industrial, medical, and automotive team is expected to ship “smart” features.

In parallel, Edge Impulse, founded in 2019, was acquired by Qualcomm in 2025 to anchor their “AI on the edge” strategy. As a result, the on-ramp for an embedded team building an STM32H7 wake word feature has collapsed from “hire a specialist for six months” to “an afternoon in a browser”.

Why ML on the edge matters

There are three reasons. First, latency: on-device inference can return in single-digit milliseconds for small models, vs. 100–500 ms for a cloud round-trip. Second, power: multi-year battery life needs the radio off. Third, privacy: audio and video that never leave the device make GDPR, HIPAA, and COPPA compliance far simpler. Together, fast + private + offline + low-power is fundamentally different from anything remote inference delivers.

2. Executive summary — STM32H7 wake word with IAR + Edge Impulse

You do not need to be an ML expert to add an STM32H7 wake word feature to a product anymore. Edge Impulse’s browser Studio handles dataset organisation, feature extraction, model architecture, training, quantisation, and inference-engine generation. Then it exports plain C++. IAR Embedded Workbench’s Project Connection feature ingests it with a single file-pick. As a result, an embedded engineer with zero ML background can have a working STM32H7 wake word compiling on real silicon inside a working day, and re-deploy retrained models in under five minutes.

Our STM32H7 wake word build

We built exactly this — an always-listening “Hey Noa” STM32H7 wake word recogniser on a stock STM32H747I-DISCO. It runs alongside TouchGFX, FreeRTOS audio, SD logging, and a Wi-Fi cloud uplink. Notably, we used only the free Edge Impulse tier and IAR. ML expertise contributed: none. Time from blank project to live on-chip detection: roughly two working days, mostly spent recording.

What the STM32H7 wake word costs on chip

The final footprint is small. The STM32H7 wake word stack uses ~63 KB of the H747’s 2 MB Flash (3 %), 137 KB of its 1 MB internal SRAM (13 %), and 25 % of one Cortex-M7 core while classifying every 500 ms (always-on). Meanwhile, the remaining ~97 % Flash, ~87 % SRAM, and three quarters of the CPU still run TouchGFX, FreeRTOS, audio, UART, and SD writes with zero regressions.

Why this STM32H7 wake word combination works

Each tool does what it’s best at. Studio’s resource estimator, EON Tuner search, and Live Classification view make iteration feel like normal embedded development. Crucially, IAR’s Project Connection was the deciding factor for re-iteration: ~20 re-exports during bring-up, each re-import 5 minutes instead of 45. That saved a working day. More importantly, the psychological cost of throwing away a variant dropped to near zero, so we tried more variants and converged on a better model.

Part II — Background

3. A glossary in four categories

Architecture: where the AI lives

Edge Computing. Processing where data is collected (MCU, sensor, gateway), not in a server farm.
TinyML. Subset: ML on ultra-low-power hardware (<1 mW, Cortex-M-class). Our system is textbook TinyML.
Cloud vs. Edge. Cloud has unlimited compute but high latency; edge has strict resource limits but instant offline execution. Most products use both — training in the cloud, inference at the edge.

Action: what the chip is doing

Training vs. Inference. Training is the heavy learning process. Inference is running the frozen model on new data. MCUs only do inference.
DSP / Feature Extraction. Raw sensor data is too messy for a tiny network. DSP turns it into clean features.
Impulse. Edge Impulse’s term for the whole pipeline: Raw Data → DSP → ML Model → Output.

Optimisation: making it fit

Quantisation. Float32 weights → int8. ~4× smaller, dramatically faster on MCUs.
Tensor Arena. Contiguous static RAM block where the NN framework stores intermediate calculations.
Pruning. Removing low-contribution connections from a trained network.

Real-world constraints

Latency. Time between sensor reading and decision.
Determinism. Predictable execution time and memory use, every call.
Bandwidth and Privacy. On-device processing avoids Wi-Fi streaming and keeps sensitive data local.

4. A two-minute machine-learning brief

A neural network for keyword spotting is a chain of matrix multiplications interleaved with non-linearities. Audio in as a feature matrix; a probability per label out. Each layer computes y = activation(W · x + b), with ReLU for hidden layers and softmax for the 3-element output [HEY NOA, Noise, unknown].
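To make the formula concrete, here is a minimal float sketch of one dense layer followed by softmax. The function names (`dense`, `softmax`, `argmax`) are illustrative assumptions, not part of the Edge Impulse SDK, and the real classifier also contains convolution and pooling layers that are omitted here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// One dense layer: y = activation(W * x + b).
// W is row-major with shape [b.size()][x.size()].
std::vector<float> dense(const std::vector<float> &W,
                         const std::vector<float> &b,
                         const std::vector<float> &x,
                         bool relu) {
    assert(W.size() == b.size() * x.size());
    std::vector<float> y(b.size());
    for (std::size_t o = 0; o < b.size(); ++o) {
        float acc = b[o];
        for (std::size_t i = 0; i < x.size(); ++i) {
            acc += W[o * x.size() + i] * x[i];
        }
        y[o] = (relu && acc < 0.0f) ? 0.0f : acc;  // ReLU clamps negatives to 0
    }
    return y;
}

// Softmax turns raw scores into probabilities that sum to 1.
// Subtracting the maximum first keeps exp() from overflowing.
std::vector<float> softmax(const std::vector<float> &logits) {
    assert(!logits.empty());
    const float peak = *std::max_element(logits.begin(), logits.end());
    std::vector<float> p(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(logits[i] - peak);
        sum += p[i];
    }
    for (float &v : p) v /= sum;
    return p;
}

// Index of the most probable label.
std::size_t argmax(const std::vector<float> &p) {
    return static_cast<std::size_t>(
        std::distance(p.begin(), std::max_element(p.begin(), p.end())));
}
```

With three output logits, `argmax(softmax(logits))` selects one of the three labels, and the probability at that index is the confidence that later gets compared against the detection threshold.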

Raw samples (16,000 per second) are too high-dimensional for a small network to consume directly. We pre-process with MFCC: short-time FFT → mel-filter bank → log → DCT. This collapses a one-second waveform into a ~50 × 13 matrix that the conv network processes in ~100 ms on a Cortex-M7.
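The shape of the feature matrix follows from simple framing arithmetic. The sketch below uses the textbook no-padding formula; the helper names are hypothetical, and Edge Impulse's own DSP block may round or pad slightly differently, so treat Studio's reported feature count as authoritative.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kSampleRateHz = 16000;

// Convert a duration in milliseconds to a sample count at 16 kHz.
constexpr std::size_t samples_from_ms(std::size_t ms) {
    return ms * kSampleRateHz / 1000;
}

// Number of frames when a window is cut into frames of `frame` samples,
// advancing by `stride` samples each time, with no padding at the end.
constexpr std::size_t frame_count(std::size_t window, std::size_t frame,
                                  std::size_t stride) {
    return (stride == 0 || window < frame) ? 0
                                           : (window - frame) / stride + 1;
}

// Total number of features handed to the network.
constexpr std::size_t feature_count(std::size_t frames, std::size_t coeffs) {
    return frames * coeffs;
}
```

For a 1-second window with 20 ms frames, a 20 ms stride gives 50 frames (650 features at 13 coefficients), which matches the ~50 × 13 figure. A 10 ms stride, as quoted in section 13, would give 99 frames, so it is worth checking which stride your impulse actually uses before sizing buffers.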

Quantisation maps weights and activations from float to int8: 4× smaller and ~3× faster on the M7, whose DSP-extension SIMD instructions work on packed integer operands. The small rounding error rarely changes the argmax. All of this turns into C: weights become const arrays in Flash, layer ops become TFLite Micro calls, and MFCC becomes a fixed DSP pipeline, all generated by Edge Impulse.
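As an illustration of what int8 quantisation does to a single value, here is a sketch of the affine scheme used by TensorFlow Lite, where real = scale × (q − zero_point). The struct and function names are assumptions for this example; in a real export the scale and zero point come from the trained model.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Per-tensor quantisation parameters (values here are illustrative).
struct QuantParams {
    float scale;
    std::int32_t zero_point;
};

// float -> int8: divide by scale, round, shift by zero point, saturate.
inline std::int8_t quantize(float real, QuantParams qp) {
    long q = std::lround(real / qp.scale) + qp.zero_point;
    if (q < -128) q = -128;
    if (q > 127) q = 127;
    return static_cast<std::int8_t>(q);
}

// int8 -> float: the inverse mapping, exact up to the rounding step.
inline float dequantize(std::int8_t q, QuantParams qp) {
    return qp.scale * static_cast<float>(static_cast<std::int32_t>(q) - qp.zero_point);
}
```

The round trip loses at most half a quantisation step (scale / 2) for values inside the representable range, which is why the winning label rarely changes.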

Part III — The Edge Impulse platform

5. What Edge Impulse is

Edge Impulse is a browser-based MLOps platform for ML on the edge. First, create a project at studio.edgeimpulse.com. Then work through a tab workflow: Data acquisition, Impulse design, DSP block, NN Classifier, Deployment. There is no Python, no TensorFlow install. Importantly, the choices Studio offers are the ones that matter for embedded.

Live resource feedback for STM32H7 wake word builds

Every screen continuously shows the resource cost of your choices on the target chip. For example, change MFCC frame_length and estimated RAM updates live. Likewise, switch float32→int8 and Flash drops 4×. Additionally, Live Classification records browser audio and classifies in real time — fastest possible “does this work on me, now” feedback.

Not just audio

The same workflow also covers object detection (FOMO, MobileNetV2-SSD), image classification, gesture recognition (IMU time-series), anomaly detection, and regression. In each case, you record labelled samples, design an impulse, train, and deploy — only the sensor type changes.

Silicon-agnostic

Furthermore, the same project deploys to a Cortex-M0+ with 32 KB RAM, our M7, a Cortex-A53 Linux box, ESP32, RISC-V, or x86. Notably, keyword-spotter models like ours routinely fit in under 32 KB RAM and 100 KB Flash after EON + int8.

6. The Studio workflow tab by tab

Dashboard — project name, target hardware, latency budget.
Data acquisition — record from laptop mic, upload .wav, connect a target device, or import public datasets.
Impulse design — chain a DSP block (MFCC, MFE, Spectrogram) to a learn block (Classifier, Anomaly, Object detection, Regression).
DSP block — visual heatmap per sample + 2-D PCA/t-SNE scatter.
NN Classifier — training cycles, augmentation, quantisation, architecture. Trains server-side in 1–5 min. Confusion matrix, loss curves, test accuracy.
Model testing — runs the trained model on the held-out 20 % with inline audio player.
Live classification — real-time predictions on browser mic.
EON Tuner — automated architecture search: 30–100+ DSP / network / quantisation combinations, Pareto-ranked. Note: EON Tuner searches for the architecture; EON Compiler (section 14) compiles one chosen model.
Versioning — snapshot dataset, impulse, and trained model.
Deployment — C++ library, Arduino, WebAssembly, vendor SDKs. Toggle EON + int8, click Build.

7. Getting started — the seven-step path

Sign up at studio.edgeimpulse.com (free tier covers this guide).
Create a project.
Collect or import data — laptop/phone mic, webcam, IMU, or .wav/.jpg/CSV upload.
Label your data as you record or batch-label imports. Data Explorer projects to 2-D to spot mis-labels.
Pre-process and train — processing blocks extract features, learning blocks train. No Python.
Run inference on a device. Export as WebAssembly, smartphone app, board binary, or C++ library (our path).
Go further with the Organization Hub — automated pipelines, collaboration, custom blocks.

8. Test before you deploy

You can fully test and iterate before flashing a single binary. Historically, embedded ML meant a slow loop: train, port, deploy, discover it’s bad, repeat. By contrast, Edge Impulse compresses that to seconds in the browser.

Five pre-deployment test surfaces for STM32H7 wake word

Model Testing — the held-out 20 %, with every misclassified clip playable inline. Use it to fix the dataset.
Live Classification — real-time predictions on a laptop mic. Notably, this catches what test accuracy doesn’t (95 % on a clean set can still trigger on kitchen music).
Smartphone testing via Studio Mobile Client (QR code) — the closest proxy to MCU conditions.
Resource estimator — every page shows RAM/Flash/inference for the selected target. For example, swap “STM32H747” for “Cortex-M0+” and the numbers update instantly. As a result, you can evaluate hardware fit before committing.
EON Tuner — dozens of variants, fully trained, Pareto-ranked in 30–90 min.

Consequently, every variant we did deploy to the STM32 worked on its first flash.

9. Edge Impulse vs. TensorFlow direct

TensorFlow is Google’s open-source ML framework. Specifically, the TFLite for Microcontrollers (TFLM) subset targets MCUs. Edge Impulse is a layer on top of TensorFlow: Studio runs TF on its servers, EON compiles the TFLite flatbuffer TF produced, and the runtime under EON is TFLM. In other words, we ARE using TF — Studio just hides it.

When TF direct wins

For teams that need maximum flexibility, TF wins on three counts. First, any architecture (latest research, custom ops). Second, no vendor lock-in. Third, integration with existing Jupyter / MLflow / CI pipelines.

When Edge Impulse wins

For everyone else, Edge Impulse wins on Studio workflow vs. weeks of Python plumbing, generated MFCC that matches training bit-exactly on-device, live resource estimation, EON Compiler’s 30–50 % RAM / 25–50 % Flash savings, EON Tuner’s automated search, and in-browser live-classification feedback.

Our pick for the STM32H7 wake word

For teams with strong ML expertise and unusual architectures, TF wins on flexibility. However, for everyone else — especially embedded teams prototyping their first ML feature with deadlines — Edge Impulse saves an order of magnitude in development time. Accordingly, we chose Edge Impulse and would do so again.

Part IV — Our project: design

10. What we built — the STM32H7 wake word pipeline end-to-end

A mono MEMS mic feeds a DFSDM peripheral that continuously decimates its PDM bitstream to 16 kHz PCM, filling a rolling 1-second buffer in SRAM1. A FreeRTOS idle hook runs an Edge Impulse classifier twice a second; when “HEY NOA” crosses the threshold, the system fires the same task notification a button press would, and the existing record → SD → UART → cloud STT → command-router pipeline takes over. End-to-end, no button. First detection: 96 % confidence, Google STT transcript, UI updated.

Why “Hey Noa”? Modelled on “Hey Siri” / “Hey Google” / “Alexa” — the “Hey + short name” cadence consumer assistants have trained billions of users to expect. Reusing that habit (instead of teaching a new one) plus the simple phonetics (hard /n/ + open vowels) give the MFCC + 1-D conv classifier a distinctive spectral fingerprint that survives noise well.

11. The hardware — STM32H747I-DISCO

Cortex-M7 at 400 MHz, 1 MB internal RAM across multiple SRAM regions, on-board MEMS mic on DFSDM. Compute fits comfortably: 154 ms DSP + 102 ms inference per pass leaves 75 % of CPU free for LCD, FreeRTOS, audio. RAM is tighter: MFCC scratch peaked at 17 KB (pool sized 56 KB for 3× margin), the rolling window takes 32 KB, and tensor_arena ~2.7 KB, roughly 91 KB together. With the 32 KB classifier snapshot and DMA ring from section 20, the buffers placed in D2 SRAM1 come to ~120 KB, inside the 128 KB bank.

12. Recording the dataset

The dataset grew iteratively, and that journey was the biggest lesson of the project. Our first attempt had ~20 samples per class. Test-set accuracy was only ~55 %, not far above the 33 % chance level for 3 classes. For example, random whistles tripped “hey_noa”, while clear wake words got classified as “unknown”.

Growing the STM32H7 wake word dataset in three rounds

Therefore, we grew the dataset in three rounds, each targeting a failure mode visible in Model Testing:

Round 1 — wake-word variation. ~20 → ~50 samples. Specifically: different intonations, distances (10 cm / 50 cm / 1 m), speakers, and pacing. Result: ~72 %.
Round 2 — noise diversity. ~20 → ~60 noise clips. These covered music, TV, kitchen sounds, doors, footsteps, breathing, and real-room silence. Previously, the model had overfit “studio quiet”. Result: ~83 %.
Round 3 — explicit “unknown” clips. ~20 → ~70 clips, including “hello”, “okay”, “play music”, and “alexa”, plus near-misses like “hey now” and “hey there”. Otherwise, the binary HEY-NOA-vs-NOISE choice forces the model to guess, and near-misses tip toward HEY NOA. Result: 89 %.

The rule of thumb

Final dataset: ~180 clips, ~60 per class (~4 minutes of audio). In short: aim for at least 100 samples per class (more than the ~60 we shipped with), across realistic conditions, with explicit negative examples. Conveniently, Studio’s Model Testing tells you which conditions you’re missing.

The 80/20 split

When you click Train, Studio automatically partitions 80 % training, 20 % test. Consequently, the held-out 20 % is your realistic field-accuracy estimate.

13. Impulse design — MFCC vs. MFE

For voice, the canonical combination is MFCC → 1D Conv. We chose a 1-second window with 500 ms increment (overlapping, two chances per utterance), 32 mel bands, 13 cepstral coefficients, frame length 0.02 s / stride 0.01 s. The classifier is a small 1D conv: two Conv1D → ReLU → MaxPool blocks, flatten, dense, 3-output softmax. ~5 K parameters, ~2 min training.

MFCC vs. MFE

Both share FFT → mel-filter bank → log. However, MFE stops there (log-mel energies), while MFCC adds a DCT that decorrelates and keeps the first 10–13 coefficients.

Property	MFE	MFCC
Per-frame vector	32 mel-band energies	13 cosine coefficients
Information	Full filter-bank spectrum	Decorrelated, lossy summary
RAM cost	Higher (2–3× MFCC)	Lower — best for small MCUs
Classic application	General audio classification	Speech, keyword spotting

We chose MFCC because keyword spotting is its textbook use case. EON Tuner MFE candidates scored ~1 % higher but cost ~2.5× the RAM. MFE would have won for a general audio classifier (doorbell vs. smoke alarm vs. baby cry) where the discarded spectral information matters.
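The only step that separates the two feature types is the final DCT. The sketch below applies an unnormalised textbook DCT-II to one frame of log-mel energies and keeps the first few coefficients. The function name is hypothetical, and Edge Impulse's MFCC block applies its own normalisation and options, so the numbers will not match Studio's output exactly.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// MFE frame (log-mel energies) -> MFCC frame (first num_coeffs DCT-II terms).
std::vector<float> mfcc_from_log_mel(const std::vector<float> &log_mel,
                                     std::size_t num_coeffs) {
    const std::size_t n = log_mel.size();
    assert(n > 0 && num_coeffs <= n);
    const double pi = 3.14159265358979323846;
    std::vector<float> out(num_coeffs, 0.0f);
    for (std::size_t k = 0; k < num_coeffs; ++k) {
        double acc = 0.0;
        for (std::size_t m = 0; m < n; ++m) {
            acc += log_mel[m] * std::cos(pi / static_cast<double>(n) *
                                         (static_cast<double>(m) + 0.5) *
                                         static_cast<double>(k));
        }
        out[k] = static_cast<float>(acc);
    }
    return out;
}
```

Going from 32 mel bands to 13 coefficients shrinks each frame by a factor of 32 / 13 ≈ 2.5, which lines up with the RAM gap reported for the MFE candidates.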

Part V — Optimisation and integration

14. EON Compiler — shrinking the STM32H7 wake word binary

By default, an Edge Impulse export uses the TFLite Micro interpreter, a generic runtime that loads weights from a flatbuffer and dispatches each layer through a virtual table. It works anywhere but carries ~30 KB of overhead.

EON (Edge Optimized Neural) Compiler replaces it with a model-specific C++ file: each layer is a direct function call with weights and shapes as compile-time constants. Same model with EON typically uses 30–50 % less RAM and 25–50 % less Flash with no accuracy change.

Combined with int8 quantisation, our RAM peak dropped from ~110 KB to ~30 KB, Flash from ~140 KB to ~63 KB, and inference from ~180 ms to ~110 ms; the remaining layers in section 16 bring that to ~17 KB, ~55 KB, and ~100 ms. Without those two clicks, the wake-word feature would not have fit our memory budget alongside TouchGFX and FreeRTOS.

15. From a trained network to C++ source

The “Build” button produces plain C++ with the entire neural network unrolled into compile-time constants. Specifically:

Takes the trained model (int8 if ticked) as a TFLite flatbuffer.
Walks the graph and generates a dedicated C++ function per layer, with shapes baked in as constexpr.
Emits every weight and bias as a const array literal — static const int8_t weights_layer_N[] = { 12, -47, ... }; thousands long.
Chains the per-layer functions through one contiguous arena.
Wraps it in a unique build ID (tflite_learn_985318_52_compiled.cpp) so re-exports never collide.

The result is one .cpp (sometimes plus .h) containing both the model’s shape and its trained knowledge as standard C++. There is no interpreter, no flatbuffer parser, no virtual functions. Consequently, IAR parses it like any other C++ and inlines small functions and MAC loops. In other words, the model IS the binary — no runtime load step.
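To give a feel for that style, here is a hand-written illustration of a layer with shapes and weights fixed at compile time. It is not actual EON Compiler output: the names, sizes, and weight values are made up, and the requantisation of the int32 accumulators back to int8 is left out for brevity.

```cpp
#include <cassert>
#include <cstdint>

// Shapes are compile-time constants, so the optimiser can unroll the loops.
constexpr int kIn = 4;
constexpr int kOut = 2;

// Weights and biases live in read-only data (Flash on the target).
static const std::int8_t kWeights[kOut * kIn] = {
    12, -47, 3, 8,   // output 0
    -5, 20,  7, -1,  // output 1
};
static const std::int32_t kBias[kOut] = {100, -50};

// Fully connected layer: int8 inputs and weights, int32 accumulation.
void dense_layer_0(const std::int8_t (&input)[kIn], std::int32_t (&acc)[kOut]) {
    for (int o = 0; o < kOut; ++o) {
        std::int32_t sum = kBias[o];
        for (int i = 0; i < kIn; ++i) {
            sum += static_cast<std::int32_t>(kWeights[o * kIn + i]) *
                   static_cast<std::int32_t>(input[i]);
        }
        acc[o] = sum;
    }
}
```

Because nothing here is looked up at run time, the compiler sees ordinary constant-bound loops over const arrays, which is what makes inlining and unrolling straightforward.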

Alongside, the package ships edge-impulse-sdk/, model-parameters/, and tflite-model/. All plain source.

16. The five-layer optimisation stack

The final size and speed is the product of five independent layers. Each one contributes a measurable shrink.

Layer 1 — EON Compiler

Replaces the TFLM interpreter with a model-specific C++ file. As a result, ~30 KB of interpreter overhead is gone. The net effect: 30–50 % less RAM, 25–50 % less Flash, identical accuracy.

Layer 2 — int8 quantisation

Every weight and activation goes from float32 to int8. Consequently, the model is 4× smaller and ~3× faster on the M7.

Layer 3 — CMSIS-NN intrinsics

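The SDK routes convolution / fully-connected / pooling through ARM’s hand-optimised CMSIS-NN kernels. On the Cortex-M7 these use the DSP extension: int8 operands are widened into packed 16-bit pairs, and one SMLAD instruction then performs two multiply-accumulates. ARM benchmarks: 4.6× speed-up. Automatic, no toggle.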

Layer 4 — IAR compiler optimisation

EI’s generated C++ is still subject to IAR’s optimiser. Picking High → Balanced (or Size) over Low/None gives inlining, loop unrolling, dead-code elimination, and LTO. For example, switching to High → Balanced on the EI folder shrank our model’s .text another ~12 %.

Layer 5 — Cortex-M7 microarchitecture

Finally, the chip itself: a 64-bit AXI bus to AXI SRAM, a 6-stage dual-issue pipeline with branch prediction, an FPU for the few floats EON keeps, and TCM regions that bypass the cache. Importantly, placing the EON pool in cached AXI SRAM rather than external SDRAM (~10× slower) turns the M7’s clock into actual inference throughput.

Cumulative effect on STM32H7 wake word footprint

The combined impact on our project, starting from the same trained model:

Stage	Flash	RAM peak	Inference
Reference (TFLM interpreter, float32)	~140 KB	~110 KB	~180 ms
+ EON Compiler	~95 KB	~75 KB	~150 ms
+ int8 quantisation	~63 KB	~30 KB	~110 ms
+ CMSIS-NN intrinsics	~63 KB	~17 KB	~102 ms
+ IAR High → Balanced	~55 KB	17 KB	~100 ms

The layers compound: from the default build to the fully optimised one, Flash drops ~60 % (140 → 55 KB), peak RAM ~85 % (110 → 17 KB), and inference time ~45 % (180 → 100 ms).

17. Wiring it into IAR — the Project Connection file

Figure: Studio’s Deployment screen. Target = “IAR Embedded Workbench Project Connection”, inference engine = EON Compiler.

How Project Connection imports the STM32H7 wake word SDK

Studio’s IAR Embedded Workbench Project Connection target generates an XML manifest (edge_impulse_gen.ipcf). It tells IAR which source files to compile, which include directories to add, and which preprocessor symbols to define. Then in IAR, choose Project → Add Project Connection… → pick the .ipcf. As a result, the IDE imports the entire EI SDK in one step. There’s no drag-and-drop of 500+ files. There’s no manual include-path editing. Moreover, re-importing replaces the previous file set atomically.

Why it mattered for our STM32H7 wake word: ~20 re-exports

Over the project, we re-exported the model roughly 20 times. Without Project Connection: each iteration is 30–45 min of IDE plumbing × 20 = 10–15 hours. With it: 3–5 min, deterministic. In short, roughly a working day saved. Furthermore, the psychological cost of throwing away a variant dropped to near zero. So we tried more variants and converged on a better model.

What we still did by hand

The .ipcf covers files, includes, and defines. However, it doesn’t cover project-wide choices. Therefore we still set four preprocessor defines:

FLATBUFFERS_USE_STD_OPTIONAL=0
EI_PORTING_IAR=1
EIDSP_SIGNAL_C_FN_POINTER=1
TF_LITE_STATIC_MEMORY

Additionally, two surgical edits to vendored code survive every re-export:

edge-impulse-sdk/porting/iar/ei_classifier_porting.cpp:48 — stm32f4xx_hal.h → stm32h7xx_hal.h.
edge-impulse-sdk/classifier/postprocessing/ei_postprocessing_common.h:56 — wrap the function-pointer comparison in (uintptr_t) casts.

Finally, strong-symbol overrides of ei_malloc / ei_free in our own file slice from a static 56 KB pool using a LIFO bump allocator. This gives heap-free runtime semantics with zero changes inside the EI SDK.
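A minimal sketch of such a pool is shown below. The `pool_*` names and the 8-byte alignment are assumptions for illustration; in the firmware, `ei_malloc` and `ei_free` would simply forward to functions like these. Frees that arrive in LIFO order rewind the pool immediately, and the pool resets once every block has been released.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

constexpr std::size_t kPoolBytes = 56 * 1024;  // pool size quoted in the text
constexpr std::size_t kAlign = 8;              // assumed alignment

alignas(8) static std::uint8_t g_pool[kPoolBytes];
static std::size_t g_top = 0;   // offset of the first free byte
static std::size_t g_peak = 0;  // high-water mark, useful for sizing the pool
static std::size_t g_live = 0;  // number of blocks not yet freed

// Stored in front of every block so pool_free can find its extent.
struct BlockHeader {
    std::size_t start;  // pool offset where this block begins
    std::size_t size;   // header + payload, rounded up to kAlign
};

static std::size_t round_up(std::size_t n) {
    return (n + kAlign - 1) & ~(kAlign - 1);
}

void *pool_alloc(std::size_t size) {
    if (size == 0 || size > kPoolBytes) return nullptr;
    const std::size_t need = sizeof(BlockHeader) + round_up(size);
    if (need > kPoolBytes - g_top) return nullptr;  // pool exhausted
    BlockHeader *h = new (&g_pool[g_top]) BlockHeader{g_top, need};
    g_top += need;
    ++g_live;
    if (g_top > g_peak) g_peak = g_top;
    return h + 1;  // payload starts right after the header
}

void pool_free(void *ptr) {
    if (ptr == nullptr) return;
    const BlockHeader *h = static_cast<const BlockHeader *>(ptr) - 1;
    assert(g_live > 0);
    --g_live;
    if (h->start + h->size == g_top) g_top = h->start;  // LIFO fast path
    if (g_live == 0) g_top = 0;  // everything released: reset the pool
}

std::size_t pool_used() { return g_top; }
std::size_t pool_peak() { return g_peak; }
```

A high-water mark like `pool_peak()` is one way to arrive at a figure such as the 17 KB peak and then choose the margin for the pool size. This sketch is not thread-safe, which is acceptable only if a single task ever calls into the classifier.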

18. Running the STM32H7 wake word continuously

We run the classifier from the FreeRTOS idle hook. It’s gated by a time check and a flag that skips inference while a recording or upload is in flight. The idle hook is “do this when nothing else is ready” — it never preempts time-critical work. DFSDM DMA runs forever, feeding a rolling 1-second PCM buffer the classifier snapshots before each pass.
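The rolling buffer and snapshot can be sketched as a plain ring buffer. The type and function names here are hypothetical, and the sketch ignores concurrency: on the target, the DMA callback that pushes samples and the idle hook that takes the snapshot must not interleave, for example by copying inside a short critical section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kWindowSamples = 16000;  // 1 s at 16 kHz

struct RollingWindow {
    std::int16_t data[kWindowSamples];
    std::size_t head;    // next slot to overwrite, i.e. the oldest sample
    std::size_t filled;  // valid samples, saturating at kWindowSamples
};

// Append n new PCM samples, overwriting the oldest ones.
void rolling_push(RollingWindow &w, const std::int16_t *src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        w.data[w.head] = src[i];
        w.head = (w.head + 1) % kWindowSamples;
        if (w.filled < kWindowSamples) ++w.filled;
    }
}

// Copy the window into dst in oldest-to-newest order.
// Returns false until a full second of audio has been collected.
bool rolling_snapshot(const RollingWindow &w, std::int16_t *dst) {
    if (w.filled < kWindowSamples) return false;
    const std::size_t tail = kWindowSamples - w.head;  // oldest part
    std::memcpy(dst, &w.data[w.head], tail * sizeof(std::int16_t));
    std::memcpy(dst + tail, &w.data[0], w.head * sizeof(std::int16_t));
    return true;
}
```

The snapshot is what the classifier reads, so the ring can keep filling while a pass is in progress.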

Live: the first time the continuous architecture ran, we said “Hey Noa play Beatles” without pressing anything. Classifier 96 %, recording fired, GCS upload, Google STT returned "play Beatles", LCD updated. End-to-end voice control on a single Cortex-M7, ~4 s from “Hey Noa” to “playing Beatles”. Adding a new wake word is a one-day job. No ML expertise needed.

19. Threshold tuning

One number governs the user experience: the confidence threshold above which HEY NOA counts as detected. The three class probabilities sum to ~1.0, so a HEY NOA score of, say, 0.60 trips a 0.50 threshold but not a 0.70 one.

Lower (0.50): Recall ↑, Precision ↓ (random sounds trigger).
Higher (0.85): Precision ↑, Recall ↓ (users must enunciate).

Measured. At 70 %: reliable triggering (live 90–98 %), ~zero false positives over 5-min idle. At 50 %: 9 triggers in 90 seconds, one a real command. Reverted.

The threshold doesn’t change the model — same probabilities come out. Instead, it changes the policy for converting probabilities to actions. A speaker that talks back to itself is more annoying than one that occasionally needs the word twice. So we erred on precision. Exposed as #define WAKEWORD_THRESHOLD_PCT — one-line change, no model rebuild.
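The policy layer around the classifier can be sketched in a few lines. The macro names and values are the runtime knobs listed in section 21; the `WakeGate` struct, the function names, and the exact cooldown semantics are assumptions for illustration rather than the project's actual code.

```cpp
#include <cassert>
#include <cstdint>

#define WAKEWORD_THRESHOLD_PCT 70
#define WAKEWORD_CLASSIFY_INTERVAL_MS 500
#define WAKEWORD_COOLDOWN_MS 1500

struct WakeGate {
    std::uint32_t last_classify_ms;
    std::uint32_t last_trigger_ms;
    bool has_triggered;
};

// Called from the idle hook: run a pass only when the interval has elapsed
// and no recording or upload is in flight. Unsigned subtraction keeps the
// comparison correct across tick-counter wrap-around.
bool wake_should_classify(WakeGate &g, std::uint32_t now_ms, bool pipeline_busy) {
    if (pipeline_busy) return false;
    if (now_ms - g.last_classify_ms < WAKEWORD_CLASSIFY_INTERVAL_MS) return false;
    g.last_classify_ms = now_ms;
    return true;
}

// Convert the HEY NOA probability into an action, with a cooldown so that
// overlapping windows do not fire twice for one utterance.
bool wake_detected(WakeGate &g, std::uint32_t now_ms, float p_hey_noa) {
    if (g.has_triggered && now_ms - g.last_trigger_ms < WAKEWORD_COOLDOWN_MS) {
        return false;
    }
    if (p_hey_noa * 100.0f < static_cast<float>(WAKEWORD_THRESHOLD_PCT)) {
        return false;
    }
    g.last_trigger_ms = now_ms;
    g.has_triggered = true;
    return true;
}
```

Keeping the policy outside the model is what makes threshold changes a recompile of one file rather than a retraining run.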

Part VI — Results and outlook

20. Final STM32H7 wake word resource budget

Measured on STM32H747I-DISCO at 400 MHz, EON Compiler + int8, verified against IAR .map and live RTT.

Resource	Cost	Where
Model weights + EON code	~63 KB	Internal Flash
TFLite Micro `tensor_arena`	2.7 KB	D1 AXI SRAM
MFCC scratch pool	56 KB allocated, 17 KB peak	D2 SRAM1
Rolling audio window (1 s @ 16 kHz int16)	32 KB	D2 SRAM1
Classifier input snapshot	32 KB	D2 SRAM1
DFSDM DMA ring	512 B	D2 SRAM1
Idle-task stack (bumped for inference)	14 KB	D2 SRAM2
Total RAM	~137 KB	D1 AXI SRAM + D2 SRAM1 + SRAM2
Total Flash	~63 KB	Internal Flash
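A budget like this can be enforced at compile time so that a later buffer change cannot silently overflow the bank. The sketch below takes the D2 SRAM1 rows from the table; the constant names are hypothetical, and the 1-second buffers are sized from the sample count (32,000 bytes), which is slightly less than the rounded 32 KB shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Buffers the table places in D2 SRAM1.
constexpr std::size_t kMfccPoolBytes = 56 * 1024;
constexpr std::size_t kWindowBytes = 16000 * sizeof(std::int16_t);  // 1 s @ 16 kHz
constexpr std::size_t kSnapshotBytes = kWindowBytes;
constexpr std::size_t kDmaRingBytes = 512;

constexpr std::size_t kSram1Capacity = 128 * 1024;
constexpr std::size_t kSram1Used =
    kMfccPoolBytes + kWindowBytes + kSnapshotBytes + kDmaRingBytes;

// The build fails here if someone grows a buffer past the bank size.
static_assert(kSram1Used <= kSram1Capacity, "D2 SRAM1 budget exceeded");

// Bytes still free in the bank.
constexpr std::size_t sram1_headroom() { return kSram1Capacity - kSram1Used; }
```

With these figures the headroom in the bank is about 9 KB, so any increase to the window length or pool size needs to be checked against it.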

Runtime metric	Value
MFCC pre-processing per inference	154 ms
NN inference (TFLM + EON)	102 ms
Total classify pass	256 ms
Classify cadence	every 500 ms
Wake-word confidence (clean)	96 %
End-to-end “Hey Noa” → LCD	~4 s (incl. 3 s capture + cloud STT)

A Cortex-M7 at 400 MHz still has 75 % of CPU free after this STM32H7 wake word classifier runs at 2 Hz. It adds ~3 % Flash and ~13 % SRAM. That leaves the rest for TouchGFX, FreeRTOS, audio, and UART. That headroom is what makes on-device ML practical on real products, not just demo boards.
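CPU load for a periodic job is busy time divided by period, which is worth recomputing whenever the cadence or the model changes. The sketch below is plain arithmetic with hypothetical function names. One caveat when applying it to the table: 256 ms of uninterrupted execution every 500 ms would be a ~51 % load, while a 25 % load corresponds to ~125 ms of execution per period. It therefore matters whether per-pass timings are wall-clock (including time spent preempted by higher-priority tasks) or pure execution time. The text does not say which, so treat both readings as assumptions and measure on your own build.

```cpp
#include <cassert>
#include <cmath>

// Fraction of one core consumed by a job that runs busy_ms every period_ms.
constexpr double cpu_duty_cycle(double busy_ms, double period_ms) {
    return busy_ms / period_ms;
}

// Execution time per period that a given load fraction allows.
constexpr double busy_budget_ms(double load_fraction, double period_ms) {
    return load_fraction * period_ms;
}
```

The same two functions also show what a slower cadence buys: doubling the classify interval halves the load for the same per-pass cost.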

21. STM32H7 wake word performance summary

Dataset: ~180 clips, ~4 minutes audio, 3 labels (~60 each), 3 iterative rounds, 80/20 train/test.

Accuracy: 89 % held-out test set; 96–98 % live confidence on clear “Hey Noa”; ~zero false positives at 70 % over multi-minute idle.

Optimisation journey (same trained model, different layers from section 16):

Stage	Flash	RAM peak	Inference	Verdict
TFLM interpreter, float32	~140 KB	~110 KB	~180 ms	Doesn’t fit
+ EON Compiler	~95 KB	~75 KB	~150 ms	Fits, no room for anything else
+ int8 quantisation	~63 KB	~30 KB	~110 ms	First viable build
+ CMSIS-NN intrinsics	~63 KB	~17 KB	~102 ms	Under cadence budget
+ IAR High → Balanced	~55 KB	17 KB	~100 ms	Production quality

Runtime knobs (no retraining): WAKEWORD_THRESHOLD_PCT=70, WAKEWORD_CLASSIFY_INTERVAL_MS=500, WAKEWORD_COOLDOWN_MS=1500, EI_PCM_GAIN=8.

22. What Edge Impulse delivered for our STM32H7 wake word

In short, the Edge Impulse + IAR stack gave us seven concrete things:

On-device “Hey Noa” STM32H7 wake word classifier at 96–98 % confidence, always-on hands-free.
End-to-end voice → action on a single Cortex-M7: DFSDM → MFCC + 1D-conv in idle hook → 3 s command capture → SD log → UART → Wi-Fi → Google STT → command router → YouTube search → JPEG thumbnail. ~4 seconds total.
Repeatable retraining. Add a new wake word, record, retrain, re-export, re-flash — sub-day. Notably, no Python and no hyper-parameter tuning.
Engineering headroom. ~13 % SRAM, ~3 % Flash, ~25 % of one core. Meanwhile, the rest still runs TouchGFX at 60 fps, six FreeRTOS tasks, audio at 16 kHz, UART at 921 600 baud, and SD writes — zero regressions.
Heap-free ML. A static-pool override of ei_malloc/ei_free keeps the entire SDK compliant with our no-dynamic-allocation rule.
Scale-down. The same impulse on a Cortex-M0+ fits under 32 KB RAM and 100 KB Flash.
Memory architecture. We put every DMA-touched audio buffer in D2 SRAM1 marked Non-Cacheable + Shareable (128 KB) — the DFSDM ring, MFCC pool, rolling window, and classifier snapshot. Otherwise the M7 cache would deliver stale samples. Meanwhile, tensor_arena lives in cached D1 AXI SRAM for speed, and the idle-task stack sits in D2 SRAM2, isolated from audio. For the full reasoning, see our previous post on Memory Architecture Considerations on STM32.

The big picture for STM32H7 wake word

Ultimately, the product listens, understands, and responds to natural-language voice commands using exclusively components we built or audited. There’s no vendor lock-in, no closed-box assistant, and no recurring cloud bill for always-on listening. (Cloud is only used for command transcription.) In summary: local low-power wake-word detection with cloud-grade transcription only when needed. That’s the architectural pattern Edge Impulse exists to enable. And it just worked.

23. Looking ahead — STM32N6, Cortex-M55

The H747 pre-dates the explicit AI-acceleration push reshaping the silicon roadmap. However, ML on the H7 still works through clever software (EON), tighter data types (int8), and hand-vectorised libraries (CMSIS-NN). Nevertheless, it leaves performance on the table.

Cortex-M55 and Helium

ARM’s 2020 Cortex-M55 is the first Cortex-M designed with ML as a first-class workload. Notably, its Helium (M-Profile Vector Extension) is 128-bit SIMD. One vector instruction does 8 × int16 MACs or 16 × int8 MACs. By contrast, the M7 needs 4–8 instructions for the same work. According to ARM benchmarks, Helium delivers 5–15× M7 ML throughput. Furthermore, it handles int4/int2 quantisation. Finally, the M55 adds TrustZone-M and integration with NPUs like ARM’s Ethos-U55.

STM32N6 — ST’s first AI-first MCU

Launched in 2024, the STM32N6 series is built for ML from the start:

Cortex-M55 up to 800 MHz (with Helium)
Neural-ART NPU — ST’s proprietary accelerator, ~600 GOPS at sub-1 W, on-die
Up to 4.2 MB embedded RAM, plus high-speed external memory interfaces
DSI display, MIPI-CSI camera, USB high-speed

As a result, tasks taking 100 ms on our M7 take single-digit ms on the N6’s NPU. Consequently, the same wake-word model would have so much slack that you could classify at 60 Hz instead of 2 Hz.

24. Recent ARM cores for neural networks

Beyond the M7 we used, ARM has shipped three newer Cortex-M cores designed specifically for NN workloads. Here they are, newest first.

Cortex-M cores (microcontroller-class)

Core	Year	Key NN feature	What it replaces
Cortex-M52	2023	Helium SIMD in an entry-level silicon footprint; brings NN throughput to cost-sensitive parts	M33 (no SIMD)
Cortex-M85	2022	Helium SIMD + higher IPC than M55; adds PACBTI memory-safety	High-end successor to M7/M33
Cortex-M55	2020	First Cortex-M with Helium (M-Profile Vector Extension) — 128-bit SIMD for NN MACs	M33 for NN workloads

Helium in one line. One vector instruction does 8 × int16 MACs (or 16 × int8 MACs). As a result, ARM’s benchmarks show 5–15× the NN throughput of an M7 at the same clock.

Ethos micro-NPUs — companion accelerators

These sit next to a Cortex-M on the same die. Then the CPU offloads entire NN layers to them.

NPU	Year	Throughput	Killer feature
Ethos-U85	2024	up to ~4 TOPS	Native transformer / attention support — first MCU-class NPU that can run LLM-style or vision-transformer models without falling back to CPU. Pairs with M85 or Cortex-A.
Ethos-U65	2021	up to ~1 TOPS	Works with Cortex-A as well as M (NXP’s i.MX 93 uses it).
Ethos-U55	2020	~0.5 TOPS	The classic micro-NPU. Paired with M55 in the original “embedded AI” reference designs.

25. Understanding GOPS — the unit you’ll see everywhere

NPU vendors quote throughput in GOPS (Giga Operations Per Second = 10⁹ ops/sec). One “op” usually means one multiply-accumulate (MAC) — the basic math of any NN layer:

accumulator += weight × input

Consequently, every convolution and matrix-multiply layer is millions of these. In effect, GOPS tells you how many MACs the silicon can sustain per second.

Vendor gotcha. Some count a MAC as 1 op. Others count it as 2 (multiply + add). ARM and ST use the 2-op convention. So “600 GOPS” on the STM32N6 NPU really means 300 G MACs/sec. Always check the fine print.
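The conversion is easy to get wrong in a spreadsheet, so here it is as a small sketch. The function names are hypothetical, and the latency figure is an ideal upper bound on speed that assumes the accelerator is fully utilised, which real layers rarely achieve.

```cpp
#include <cassert>
#include <cmath>

// Convert a quoted GOPS figure to giga-MACs per second.
// ops_per_mac is 1 or 2 depending on the vendor's counting convention.
constexpr double gmacs_per_second(double gops, int ops_per_mac) {
    return gops / ops_per_mac;
}

// Best-case latency in milliseconds for a model with model_macs MACs.
constexpr double ideal_latency_ms(double model_macs, double gops, int ops_per_mac) {
    return model_macs / (gmacs_per_second(gops, ops_per_mac) * 1e9) * 1e3;
}
```

For example, a 600 GOPS part under the 2-op convention sustains at most 300 G MACs/sec, so a hypothetical 3-million-MAC model has a theoretical floor of 0.01 ms per inference before memory traffic and scheduling overhead are counted.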

The scale ladder

Unit	Value	Typical hardware at this level
MOPS (Mega)	10⁶ ops/s	Cortex-M0+ doing pure C MACs
GOPS (Giga)	10⁹ ops/s	Cortex-M7 with CMSIS-NN, small MCU NPUs
TOPS (Tera)	10¹² ops/s	Mobile-class NPUs, dedicated edge AI chips
POPS (Peta)	10¹⁵ ops/s	Data-centre AI accelerators

1 TOPS = 1,000 GOPS = 1,000,000 MOPS.

26. Conclusion

Five years ago, adding voice recognition to an embedded product meant a six-month ML hire. Alternatively, you built a cloud-dependent architecture and paid the latency, privacy, and connectivity bill on every utterance. Today, with Edge Impulse Studio + IAR Embedded Workbench, that has changed completely. An embedded team with zero ML background can ship an always-on STM32H7 wake word feature in two working days. Total cost on chip: ~3 % of Flash, ~13 % of SRAM, and ~25 % of one Cortex-M7 core.

The combination works because each tool does its job. Studio handles every step that used to require a data scientist. IAR’s Project Connection makes re-iteration a 5-minute task instead of a 45-minute file-shuffle. EON Compiler + int8 + CMSIS-NN squeeze the model into resources the M7 already has. Finally, disciplined D2-SRAM placement with Non-Cacheable MPU regions keeps the DMA-fed audio path coherent.

Most importantly, the same workflow scales. The same impulse runs on a Cortex-M0+ for cents. Likewise, it runs on an STM32N6 with a 600-GOPS Neural-ART NPU for 60-Hz continuous classification. As a result, the ML half of the work carries over when the product roadmap moves.

The era when “embedded ML” required exotic hardware or a specialised team is over. The toolchain has caught up with the silicon. Consequently, every firmware team shipping a product in 2026 should be asking what local intelligence their device can credibly add. Our project shipped a wake word. Yours might ship a defect detector, a gesture interface, an anomaly alarm, or a private voice assistant. The on-ramp is now an afternoon in a browser — and the chip is already on your bench.