free trial: integrate NN processing in MCU & DSP with 2 lines of C code
- jimli44
- Mar 10
- 2 min read
Updated: Mar 31
Trying is believing. In this post, I want to let everyone try bringing my example NN processing into their own embedded application with just 2 lines of C code, without compromising on efficiency of course.

With the most popular embedded NN deployment framework being C++ only and inherited from a framework meant for much more powerful systems, there is a natural hurdle in front of embedded NN deployment and system integration. This hurdle has proven costly in efficiency, both in engineering effort and in CPU cycles. I would like to show a different possibility, one that not only provides superior efficiency, but also points in a direction where embedded ML might be able to play catch-up.
Trial information
The model: GRU (Gated Recurrent Unit) based general speech noise reduction, 119.2k parameters, model summary in appendix
In/output: a frame of 64 PCM samples, Q15 format, 16kHz sample rate, input max amplitude should be >-6dBFS
Processing precision: 16bit
Demo constraint: processing time limited to 120 sec
Target platform: ARM Cortex-M7 & Tensilica Fusion F1 DSP
Test environment: ST STM32H7A3 (arm-none-eabi-gcc 10.3)
Preparation
Download the CM7 version or Fusion_F1 version of the library zip file (provided for evaluation purposes only)
Unzip and place the .h file in the project include folder
Place the .a file in the library folder and add the library name “model_processing” to the linker. For the ARM toolchain, this is done with the “-L” & “-l” flags or in the Cube IDE project settings; a command-line example is shown below. Note that the “lib” prefix and “.a” suffix are added automatically.
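For example, a command-line link step for the CM7 build could look roughly like this (the object file names and library path are placeholders; only the “model_processing” library name comes from the steps above):

arm-none-eabi-gcc main.o audio.o \
    -L path/to/library/folder \
    -lmodel_processing \
    -o firmware.elf

Here “-lmodel_processing” is what makes the linker pick up libmodel_processing.a from the folder given with “-L”.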

The two lines of code
Include the header
#include "model_frame_proc.h"
then call the model to process a frame
model_frame_proc(p_data,      // ptr to data
                 p_temp_buf); // ptr to scratch buffer
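To make the integration concrete, here is a minimal sketch of how the call could sit in an audio path. The buffer types, the in-place behaviour and the audio_frame_ready() hook are my assumptions for illustration; the actual prototype is in model_frame_proc.h and may differ.

#include <stdint.h>
#include "model_frame_proc.h"

#define FRAME_LEN     64     /* samples per frame, Q15, 16kHz (from the trial info) */
#define SCRATCH_BYTES 2048   /* 2kB scratch buffer (from the efficiency numbers)    */

static int16_t frame[FRAME_LEN];        /* assumed in-place: noisy in, denoised out  */
static uint8_t scratch[SCRATCH_BYTES];  /* adjust the type to the header's prototype */

/* hypothetical hook, called by the audio driver when 64 new mic samples are ready */
void audio_frame_ready(const int16_t *mic_samples)
{
    for (int i = 0; i < FRAME_LEN; i++)
        frame[i] = mic_samples[i];

    model_frame_proc(frame,     // ptr to data
                     scratch);  // ptr to scratch buffer

    /* frame[] now holds the processed samples; hand them to the output path */
}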
Efficiency
Memory usage: 248kB code+data, 2kB scratch buffer
CPU usage:
STM32H7A3: 86MHz when processing a 16kHz stream in real time, equivalent to 344k cycles per inference (-Ofast compiler flag, D/I cache enabled)
Fusion F1: 34MHz when processing a 16kHz stream in real time, equivalent to 136k cycles per inference (-Ofast compiler flag, TCM RAM)
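As a quick sanity check on these numbers: at a 16kHz sample rate with 64-sample frames the model runs 16000 / 64 = 250 inferences per second, so

250 × 344k cycles ≈ 86M cycles/s → 86MHz on the STM32H7A3
250 × 136k cycles ≈ 34M cycles/s → 34MHz on the Fusion F1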
HOW
There is no magic:
The silky integration comes from the same concept as Rust: build all dependencies into the static library.
The efficiency is achieved by thoughtfully translating NN layers into the most basic operations understood by the compiler, roughly as sketched below.
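To illustrate what “most basic operations” means, here is a generic sketch of a dense layer in 16bit (Q15) fixed point; this is my own illustration, not the code inside the library:

#include <stdint.h>

/* clamp a wide intermediate value to the Q15 (int16_t) range */
static inline int16_t sat_q15(int64_t x)
{
    if (x >  32767) return  32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}

/* dense layer in Q15: out = saturate(W * in + b) */
void dense_q15(const int16_t *w,   /* weights, n_out x n_in, Q15 */
               const int16_t *b,   /* bias, Q15                  */
               const int16_t *in,  /* input vector, Q15          */
               int16_t *out,       /* output vector, Q15         */
               int n_in, int n_out)
{
    for (int o = 0; o < n_out; o++) {
        int64_t acc = (int64_t)b[o] << 15;             /* bias promoted to Q30 */
        for (int i = 0; i < n_in; i++)
            acc += (int32_t)w[o * n_in + i] * in[i];   /* Q15 * Q15 = Q30 */
        out[o] = sat_q15(acc >> 15);                   /* back to Q15 with saturation */
    }
}

Loops like this are plain C that the compiler can map directly onto the multiply-accumulate hardware of targets like the CM7 and Fusion F1, with no heavyweight runtime in between.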
I hope this shows a different perspective and can help drive embedded ML forward. If you think this could help your project or algorithm, get in touch.
Appendix:
Model summary
_________________________________________________________________
Layer (type)                         Output Shape       Param #
=================================================================
multiply (Multiply)                  multiple           0
conv1d (Conv1D)                      multiple           4096
dense (Dense)                        multiple           5200
gru (GRU)                            multiple           51264
conv1d_1 (Conv1D)                    multiple           8536
gru_1 (GRU)                          multiple           40800
dense_1 (Dense)                      multiple           5184
conv1d_transpose (Conv1DTranspose)   multiple           4096
=================================================================
Total params: 119,176
Trainable params: 119,176
Non-trainable params: 0