Adding Low-Power AI/ML Inference to Edge Devices

Artificial intelligence and machine learning (AI/ML) excel at recognizing complicated patterns and making fast decisions based on them. That is why so many companies are now building AI/ML into their products, and why many chip vendors offer the silicon needed to support it.

A company can choose between two options: AI capabilities embedded in a system-on-chip (SoC) or stand-alone hardware AI/ML accelerators. Both kinds of chips are gaining popularity in the market, the integrated ones in particular.

When it comes to choosing an AI/ML implementation technology, the range of options is enormous. You can run AI/ML models on un-augmented microprocessors or microcontrollers, but efficiency will be low. Most processor vendors offer ways to run AI/ML models on their processors through software libraries that support standard AI/ML tools.

Another option is to use AI/ML tools built specifically for un-augmented processor ISAs. One example is TensorFlow Lite for Microcontrollers, developed for microcontrollers and SoCs built around Arm Cortex-M processor cores. Written in C++, it has been ported to several other processor architectures.
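The device-side runtime runs as C++ firmware, but the model itself is usually prepared on a workstation in Python. A minimal sketch of that preparation step, assuming a small placeholder Keras model, uses TensorFlow's standard post-training integer quantization to produce the .tflite flatbuffer that the microcontroller runtime loads:

```python
# Sketch: preparing a model for TensorFlow Lite for Microcontrollers.
# The model and calibration data are placeholders; the converter API is standard TensorFlow.
import numpy as np
import tensorflow as tf

# A tiny stand-in model; in practice this is your trained network.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

def representative_data():
    # Calibration samples drive post-training quantization; random data here
    # is only a placeholder for real sensor inputs.
    for _ in range(100):
        yield [np.random.rand(1, 32).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)  # this flatbuffer is what the C++ micro runtime loads
```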

If you decide to work with ordinary processors that lack dedicated AI/ML hardware, keep in mind that they will be slow and inefficient at these workloads. To reach higher efficiency, vector or tensor hardware is a better fit. Many microcontroller vendors, including STMicroelectronics, Renesas, NXP, and XMOS, have added hardware support for AI/ML models to raise their processors' AI/ML efficiency.
Another approach is to add a DSP to the processor SoC and use it as an AI/ML coprocessor. That raises AI/ML performance, but a DSP has a relatively small number of multiplier/accumulators (MACs), so the gain is limited, as the rough estimate below illustrates.
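This is a purely illustrative calculation of why a small MAC count caps throughput; every figure below is an assumption, not a measured specification:

```python
# Back-of-the-envelope estimate of MAC-limited inference rate.
# All numbers are illustrative assumptions, not vendor specifications.
macs_available = 64            # hypothetical MAC units on a DSP coprocessor
clock_hz = 400e6               # hypothetical DSP clock
macs_per_inference = 50e6      # hypothetical model cost (e.g. a small CNN)

peak_macs_per_sec = macs_available * clock_hz
inferences_per_sec = peak_macs_per_sec / macs_per_inference
print(f"Theoretical ceiling: {inferences_per_sec:.0f} inferences/s")
# With only a few dozen MACs the ceiling is modest; dedicated NPUs raise it
# by providing thousands of MACs plus interconnect tuned for moving activations.
```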

If augmented processors still do not deliver enough performance or energy efficiency, other options exist. GPUs and FPGAs are among them, but their high power consumption means they are used mostly for training and inference in data centers rather than for inference at the edge.

The options also include purpose-built neural processing units (NPUs) and NPU IP, offered by more than 30 companies. These devices combine arrays of MACs with well-tuned networks for moving data between them, and they deliver higher performance and energy efficiency than microprocessors or microcontrollers with added AI/ML instructions. Nonetheless, adopting an NPU involves a learning curve.

Meet MemryX
MemryX, a newcomer to the market, is launching a new type of AI/ML accelerator. Its MX3 Edge AI Accelerator chip can run AI/ML models without access to external DRAM thanks to its “at-memory” architecture. The MemryX design is especially relevant for embedded edge devices, which must push data through AI/ML models in real time.

According to MemryX, the MX3’s dataflow architecture is optimized for AI/ML inference on continuous-stream data, the kind of information produced by video and security cameras and many other sensors. Because the MX3’s on-chip memories hold all the data the AI/ML model needs, and no computation is shuffled back and forth between the host processor and the MemryX chip, the chip receives only the input data and returns only the result. That arrangement suits embedded devices handling real-time data particularly well.
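The division of labor can be pictured with a small mock in Python. DataflowAccelerator below is a hypothetical stand-in, not the MemryX SDK; it only illustrates the pattern in which weights stay on the accelerator and the host exchanges nothing but inputs and results:

```python
# Conceptual sketch of the host/accelerator split described above.
# "DataflowAccelerator" and "model.dfp" are hypothetical placeholders.
import numpy as np

class DataflowAccelerator:
    def __init__(self, compiled_model_path: str):
        # On a real device, the compiled model's weights are loaded into
        # on-chip memory once; nothing is fetched from external DRAM later.
        self.model_path = compiled_model_path

    def infer(self, frame: np.ndarray) -> np.ndarray:
        # Placeholder: a real accelerator returns the model's output
        # (e.g. detection scores); here we just return a dummy result.
        return np.zeros(10, dtype=np.float32)

accel = DataflowAccelerator("model.dfp")        # hypothetical compiled model
for _ in range(3):                              # stand-in for a camera loop
    frame = np.random.rand(224, 224, 3).astype(np.float32)
    result = accel.infer(frame)                 # only input and result cross the link
```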

AI/ML chip vendors also offer something called a model zoo: a collection of AI/ML models adapted to the particular characteristics of the vendor’s AI architecture. The size of a vendor’s model zoo largely reflects the size of its software organization, because each model must be converted and retrained.

MemryX’s AI model zoo is unusual in that the MX3 can run trained AI models after only a one-click compilation. The company achieved this by validating hundreds of trained AI/ML models drawn from online repositories and from partners and customers. According to MemryX, the one-click compilation process yields hardware utilization rates of 50-80%.

The MX3 is not a stand-alone AI/ML device. It works alongside a host CPU, connected over either a PCIe or USB interface, which makes it easy to drop into a wide range of hardware designs, including existing ones.
Nearly every design has a spare port available, and because MX3 accelerators need no external memory, adding AI/ML model processing to a device’s hardware design is no harder than providing a port connection between the CPU and the MX3 accelerator.

The MemryX MX3 accelerator chip adds roughly 5 TFLOPS (trillion floating-point operations per second) of AI/ML processing. Internally, the device uses bfloat16 numbers for activations and a layer-by-layer choice of 4-, 8-, or 16-bit integers for weights. MX3 devices are designed to be daisy-chained, so thanks to the “at-memory” and dataflow architecture, processing capacity grows with the number of chips deployed.
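As a side note on the number format, bfloat16 is the upper 16 bits of an IEEE float32, so it keeps float32’s exponent range while giving up mantissa precision. The sketch below truncates (real converters may round) purely to illustrate the format:

```python
# Illustration of bfloat16: keep only the top 16 bits of a float32 pattern.
import struct

def to_bfloat16_bits(x: float) -> int:
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 16          # bfloat16 = upper half of the float32 bit pattern

def from_bfloat16_bits(b: int) -> float:
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]

x = 3.14159
y = from_bfloat16_bits(to_bfloat16_bits(x))
print(x, "->", y)              # precision drops, but the exponent range is preserved
```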

A two-device array of MX3 accelerators therefore delivers 10 TFLOPS, and a four-device array 20 TFLOPS. A single MX3 device draws about 1 watt and can run multiple AI/ML models simultaneously, provided they fit in the MX3’s on-chip weight memory. Swapping models takes no more than 10 milliseconds.
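Using only the per-device figures quoted above (about 5 TFLOPS and about 1 W each), the scaling works out as follows; this is simple arithmetic, not a benchmark:

```python
# Daisy-chain scaling estimate from the quoted per-device figures.
tflops_per_device = 5.0
watts_per_device = 1.0

for n in (1, 2, 4):
    print(f"{n} device(s): ~{n * tflops_per_device:.0f} TFLOPS, "
          f"~{n * watts_per_device:.0f} W")
# 1 device(s): ~5 TFLOPS, ~1 W
# 2 device(s): ~10 TFLOPS, ~2 W
# 4 device(s): ~20 TFLOPS, ~4 W
```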
Through the one-click compilation flow, MX3 development works with many widely used AI/ML frameworks, including PyTorch, ONNX, TensorFlow, TensorFlow Lite, and Keras. Dataflow-oriented accelerators like the MX3 can pair with microprocessors or microcontrollers of any architecture (Arm, x86, RISC-V, etc.) and with any operating system. These qualities make the MX3 accelerator easy for many design teams to adopt.
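As one example of feeding such a flow, a trained PyTorch model is commonly exported to ONNX, one of the interchange formats listed above, before being handed to a vendor compiler. The model and file names below are placeholders, and a reasonably recent torchvision is assumed:

```python
# Sketch: exporting a trained PyTorch model to ONNX for a downstream compiler.
import torch
import torchvision

model = torchvision.models.mobilenet_v2(weights=None).eval()  # placeholder network
dummy_input = torch.randn(1, 3, 224, 224)                     # shape the model expects

torch.onnx.export(
    model,
    dummy_input,
    "mobilenet_v2.onnx",       # file handed to the accelerator's compiler
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)
```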
