
free trial: integrate NN processing on MCU & DSP with 2 lines of C code

  • jimli44
  • Mar 10
  • 2 min read

Updated: Mar 31

Trying is believing. In this post, I make it possible for everyone to try bringing my example NN processing into their own embedded application with just 2 lines of C code, without compromising on efficiency, of course.

 


The most popular embedded NN deployment framework is C++ only and inherited from a framework meant for much more powerful systems, so there is a natural hurdle in front of embedded NN deployment and system integration. This hurdle has proven costly in both engineering effort and CPU cycles. I would like to show a different possibility, one that not only provides superior efficiency, but also points in a direction where embedded ML might be able to play catch-up.

 

Trial information

The model: GRU (Gated Recurrent Unit) based general speech noise reduction, 119.2k parameters, model summary in the appendix

Input/output: a frame of 64 PCM samples, Q15 format, 16 kHz sample rate, input max amplitude should be > -6 dBFS (a Q15 conversion sketch follows below)

Processing precision: 16-bit

Demo constraint: processing time limited to 120 seconds

Target platform: ARM Cortex-M7 & Tensilica Fusion F1 DSP

Test environment: ST STM32H7A3 (arm-none-eabi-gcc 10.3)
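
If your capture path produces float samples, a small helper like the sketch below gets them into Q15. The helper name and saturation behavior are my own illustration, not part of the trial library:

#include <stdint.h>

/* Minimal sketch (not part of the trial library): convert a float
   sample in [-1.0, 1.0) to Q15 with saturation. */
static inline int16_t float_to_q15(float s)
{
    float v = s * 32768.0f;
    if (v >  32767.0f) v =  32767.0f;
    if (v < -32768.0f) v = -32768.0f;
    return (int16_t)v;
}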


Preparation

  • Download the CM7 version or Fusion F1 version library zip file (provided for evaluation purposes only)

  • Unzip and place the .h file in the project include folder

  • Place the .a file in the library folder and add the library name “model_processing” to the linker. For the ARM toolchain, this is done with the “-L” & “-l” flags (see the sketch below) or in the Cube IDE project settings. Note that the “lib” prefix and “.a” suffix are added automatically.
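
For reference, a minimal sketch of the corresponding linker invocation on the command line; the object file and library path are placeholders:

arm-none-eabi-gcc main.o -L path/to/lib/folder -lmodel_processing -o app.elf

The linker searches the given path for libmodel_processing.a, which is why only the bare name is specified.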



 

The two lines of code

Include the header

#include "model_frame_proc.h"

then call the model to process a frame

model_frame_proc(p_data,      // ptr to data
                 p_temp_buf); // ptr to scratch buffer
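
For context, here is a minimal sketch of how the call might sit in a frame-by-frame audio loop. The post only shows the call itself, so the pointer types and the in-place behavior are my assumptions; the buffer sizes follow the trial information above (64 Q15 samples per frame, 2 kB scratch):

#include <stdint.h>
#include "model_frame_proc.h"

#define FRAME_LEN    64      /* 64 PCM samples per frame, Q15, 16 kHz */
#define SCRATCH_SIZE 2048    /* 2 kB scratch buffer (see Efficiency)  */

static uint8_t scratch[SCRATCH_SIZE];

/* Hypothetical driver hook, called once per 64-sample frame. */
void on_audio_frame(int16_t *pcm)
{
    /* Assumed: the frame is denoised in place. */
    model_frame_proc(pcm, scratch);
}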

 

Efficiency

Memory usage: 248 kB code + data, 2 kB scratch buffer

CPU usage:

STM32H7A3: 86 MHz when processing a 16 kHz stream in real time, equivalent to 344k cycles per inference (-Ofast compiler flag, D/I-cache enabled)

Fusion F1: 34 MHz when processing a 16 kHz stream in real time, equivalent to 136k cycles per inference (-Ofast compiler flag, TCM RAM)
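
These numbers are consistent with each other: at 16 kHz and 64 samples per frame, the model runs 16,000 / 64 = 250 inferences per second, so 250 × 344k cycles = 86 MHz on the STM32H7A3 and 250 × 136k cycles = 34 MHz on the Fusion F1.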

 

HOW

There is no magic:

  • The silky integration comes from the same concept as Rust: build all dependencies into the static library.

  • The efficiency is achieved by thoughtfully translating NN layers into the most basic operations understood by the compiler (a toy illustration follows).
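
To make the second point concrete, here is a toy illustration of what that lowering can look like. This is my sketch of the general technique, not the library's actual code: a Q15 dense layer reduced to a plain multiply-accumulate loop that any C compiler understands and can vectorize.

#include <stdint.h>

/* Toy illustration, not the library's actual code: a Q15 dense layer
   as a plain multiply-accumulate loop. */
void dense_q15(const int16_t *w,   /* [out_n][in_n] weights, Q15 */
               const int16_t *x,   /* [in_n] input, Q15          */
               const int32_t *b,   /* [out_n] bias, Q30          */
               int16_t *y,         /* [out_n] output, Q15        */
               int in_n, int out_n)
{
    for (int o = 0; o < out_n; o++) {
        int64_t acc = b[o];
        for (int i = 0; i < in_n; i++)
            acc += (int32_t)w[o * in_n + i] * x[i];  /* Q15 x Q15 -> Q30 */
        acc >>= 15;                                  /* back to Q15      */
        if (acc >  32767) acc =  32767;              /* saturate         */
        if (acc < -32768) acc = -32768;
        y[o] = (int16_t)acc;
    }
}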


I hope this shows a different perspective and helps drive embedded ML forward. If you think this could help your project or algorithm, get in touch.



Appendix:

Model summary
_______________________________________________
 Layer (type)          Output Shape     Param #
===============================================
 multiply (Multiply)     multiple        0
 conv1d (Conv1D)         multiple       4096
 dense (Dense)           multiple       5200
 gru (GRU)               multiple       51264
 conv1d_1 (Conv1D)       multiple       8536
 gru_1 (GRU)             multiple       40800
 dense_1 (Dense)         multiple       5184
 conv1d_transpose        multiple       4096
===============================================
Total params: 119,176
Trainable params: 119,176
Non-trainable params: 0

 
 
 


Author


Weiming Li

  • LinkedIn
