free trial: integrate NN processing in MCU & DSP with 2 lines of C code
- jimli44
- Mar 10
- 2 min read
Updated: Mar 31
Trying is believing. In this post, I want to let everyone try bringing my example NN processing into their own embedded application with just 2 lines of C code, without compromising on efficiency of course.

With the most popular embedded NN deployment framework being C++ only and inherited from a framework meant for much more powerful systems, there is a natural hurdle in front of embedded NN deployment and system integration. This hurdle has proven costly in efficiency, both in engineering effort and in CPU cycles. I would like to show a different possibility, one that not only provides superior efficiency, but also points in a direction where embedded ML might be able to play catch-up.
Trial information
The model: GRU (Gated Recurrent Unit) based general speech noise reduction, 119.2k parameters, model summary in appendix
In/output: a frame of 64 PCM samples, Q15 format, 16kHz sample rate, input max amplitude should be >-6dBFS
Processing precision: 16bit
Demo constraint: processing time limited to 120 sec
Target platform: ARM Cortex-M7 & Tensilica Fusion F1 DSP
Test environment: ST STM32H7A3 (arm-none-eabi-gcc 10.3)
Preparation
Download the CM7 version or Fusion_F1 version of the library zip file (provided for evaluation purposes only)
Unzip and place the .h file in the project include folder
Place the .a file in the library folder and add the library name “model_processing” to the linker. For the ARM toolchain, this is done with the “-L” & “-l” flags or in the Cube IDE project settings; a command-line example is shown below. Note that the “lib” prefix and “.a” suffix are added automatically.
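For example, a command-line link step for the CM7 build could look roughly like this (the object file names and library path are placeholders; only the “model_processing” library name comes from the steps above):

arm-none-eabi-gcc main.o audio.o \
    -L path/to/library/folder \
    -lmodel_processing \
    -o firmware.elf

Here “-lmodel_processing” is what makes the linker pick up libmodel_processing.a from the folder given with “-L”.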

The two lines of code
Include the header
#include "model_frame_proc.h"
then call the model to process a frame
model_frame_proc(p_data,      // ptr to data
                 p_temp_buf); // ptr to scratch buffer
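To make the integration concrete, here is a minimal sketch of how the call could sit in an audio path. The buffer types, the in-place behaviour and the audio_frame_ready() hook are my assumptions for illustration; the actual prototype is in model_frame_proc.h and may differ.

#include <stdint.h>
#include "model_frame_proc.h"

#define FRAME_LEN     64     /* samples per frame, Q15, 16kHz (from the trial info) */
#define SCRATCH_BYTES 2048   /* 2kB scratch buffer (from the efficiency numbers)    */

static int16_t frame[FRAME_LEN];        /* assumed in-place: noisy in, denoised out  */
static uint8_t scratch[SCRATCH_BYTES];  /* adjust the type to the header's prototype */

/* hypothetical hook, called by the audio driver when 64 new mic samples are ready */
void audio_frame_ready(const int16_t *mic_samples)
{
    for (int i = 0; i < FRAME_LEN; i++)
        frame[i] = mic_samples[i];

    model_frame_proc(frame,     // ptr to data
                     scratch);  // ptr to scratch buffer

    /* frame[] now holds the processed samples; hand them to the output path */
}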
Efficiency
Memory usage: 248kB code+data, 2kB scratch buffer
CPU usage:
STM32H7A3: 86MHz when processing a 16kHz stream in real time, equivalent to 344k cycles per inference (-Ofast compiler flag, D/I cache enabled)
Fusion F1: 34MHz when processing a 16kHz stream in real time, equivalent to 136k cycles per inference (-Ofast compiler flag, TCM RAM)
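As a quick sanity check on these numbers: at a 16kHz sample rate with 64-sample frames the model runs 16000 / 64 = 250 inferences per second, so

250 × 344k cycles ≈ 86M cycles/s → 86MHz on the STM32H7A3
250 × 136k cycles ≈ 34M cycles/s → 34MHz on the Fusion F1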
HOW
There is no magic:
The silky integration comes from the same concept as Rust: build all dependencies into the static library.
The efficiency is achieved by thoughtfully translating NN layers into the most basic operations understood by the compiler, roughly as sketched below.
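To illustrate what “most basic operations” means, here is a generic sketch of a dense layer in 16bit (Q15) fixed point; this is my own illustration, not the code inside the library:

#include <stdint.h>

/* clamp a wide intermediate value to the Q15 (int16_t) range */
static inline int16_t sat_q15(int64_t x)
{
    if (x >  32767) return  32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}

/* dense layer in Q15: out = saturate(W * in + b) */
void dense_q15(const int16_t *w,   /* weights, n_out x n_in, Q15 */
               const int16_t *b,   /* bias, Q15                  */
               const int16_t *in,  /* input vector, Q15          */
               int16_t *out,       /* output vector, Q15         */
               int n_in, int n_out)
{
    for (int o = 0; o < n_out; o++) {
        int64_t acc = (int64_t)b[o] << 15;             /* bias promoted to Q30 */
        for (int i = 0; i < n_in; i++)
            acc += (int32_t)w[o * n_in + i] * in[i];   /* Q15 * Q15 = Q30 */
        out[o] = sat_q15(acc >> 15);                   /* back to Q15 with saturation */
    }
}

Loops like this are plain C that the compiler can map directly onto the multiply-accumulate hardware of targets like the CM7 and Fusion F1, with no heavyweight runtime in between.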
I hope this shows a different perspective and can help drive embedded ML forward. If you think this could help your project or algorithm, get in touch.
Appendix:
Model summary
_________________________________________________________________
Layer (type)                         Output Shape       Param #
=================================================================
multiply (Multiply)                  multiple           0
conv1d (Conv1D)                      multiple           4096
dense (Dense)                        multiple           5200
gru (GRU)                            multiple           51264
conv1d_1 (Conv1D)                    multiple           8536
gru_1 (GRU)                          multiple           40800
dense_1 (Dense)                      multiple           5184
conv1d_transpose (Conv1DTranspose)   multiple           4096
=================================================================
Total params: 119,176
Trainable params: 119,176
Non-trainable params: 0