Multi-Processor System-on-Chip 1. Liliana Andrade
face detection for triggering an alert, accompanied by an image or video, on the owner’s smartphone;
– smart speakers with voice control, employing local speech recognition for a limited vocabulary of voice commands while relaying other speech data into the cloud for more advanced analysis;
– smart sensing devices used in agriculture to monitor and control, for example, soil quality, crop yield and livestock, while sporadically communicating data over cellular connections using, for example, NB-IoT protocols for low power consumption.
Many IoT edge devices are battery-operated and demand an optimized implementation in order to enable a long battery life. Therefore, we must target low power consumption for functions that need to be performed in software locally on the IoT edge device. This, in turn, requires programmable processors that are optimized for executing these software functions efficiently, which is the topic of this chapter.
1.2. Versatile processors for low-power IoT edge devices
1.2.1. Control processing, DSP and machine learning
Low-power IoT edge devices typically perform a range of different functions locally on the device. They run a local application that controls the device, its sensors and other interfaces, such as a communications interface to the network and a user interface. For this purpose, a processor must have capabilities for efficient processing of control code, including low branch overheads, efficient interrupt handling, timers, efficient integration with peripherals, support for real-time kernels, etc.
Furthermore, IoT edge devices typically perform some processing on the data acquired through their sensors. These can be sensors that monitor physical phenomena, such as thermometers, gyroscopes, accelerometers and magnetometers. Let us consider, for example, a personal health device or the smart sensing devices used in agriculture, mentioned above. Data rates for this type of sensor are typically low. Microphones are another type of sensor, with higher data rates. For example, a 16 kHz sample rate is often used for voice data. Even higher data rates can be observed in IoT edge devices that use a camera, such as a smart doorbell performing face detection. Data rates for image and video data can vary widely, depending on resolution and frame rate. Data rates of hundreds of MB/s are not unusual in high-end devices, but for more power-sensitive camera-based applications, much lower data rates can be observed.
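To make the spread in sensor data rates concrete, the following back-of-the-envelope sketch computes raw (uncompressed) data rates for a few cases. Only the 16 kHz voice sample rate comes from the text; the 16-bit sample width, the resolutions, the frame rates and the bytes-per-pixel values are illustrative assumptions, not figures from any particular device.

```python
def audio_rate_bytes_per_s(sample_rate_hz, bits_per_sample, channels=1):
    """Raw audio data rate in bytes per second."""
    return sample_rate_hz * bits_per_sample * channels // 8


def video_rate_bytes_per_s(width, height, bytes_per_pixel, frames_per_s):
    """Raw video data rate in bytes per second."""
    return int(width * height * bytes_per_pixel * frames_per_s)


# Voice: 16 kHz sample rate (from the text), 16-bit mono samples (assumed).
voice = audio_rate_bytes_per_s(16_000, 16)

# Power-sensitive camera (assumed): 320x240 greyscale, 5 frames per second.
low_power_cam = video_rate_bytes_per_s(320, 240, 1, 5)

# High-end camera (assumed): 3840x2160, YUV 4:2:0 (1.5 bytes/pixel), 60 fps.
high_end_cam = video_rate_bytes_per_s(3840, 2160, 1.5, 60)

for name, rate in [("voice", voice),
                   ("low-power camera", low_power_cam),
                   ("high-end camera", high_end_cam)]:
    print(f"{name}: {rate / 1e6:.3f} MB/s")
```

Under these assumptions, voice data stays in the tens of kB/s, the low-resolution camera in the hundreds of kB/s, and the high-end camera in the hundreds of MB/s, which matches the orders of magnitude discussed above.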
The processing of sensor data typically involves digital signal processing (DSP) with functions such as filtering (e.g. FIR, correlation, biquad), transforms (e.g. FFT, DCT), and vector and matrix operations. Voice data can be processed by various DSP functions, including noise reduction and echo cancellation. In addition, the IoT edge device can perform encoding and/or decoding of voice or audio data. For example, consider an audio playback function on the device.
Communicating data involves further DSP functions. For example, some key functions in an NB-IoT protocol stack involve FFT, auto- and cross-correlations, and complex multiplications and convolutions. Furthermore, trigonometric functions such as sine and cosine must be performed. In addition, such protocol stacks use convolutional codes, which are decoded with, for example, the Viterbi algorithm.
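As an illustration of why complex multiply-accumulate operations matter for such protocol functions, the following minimal sketch cross-correlates a received complex sequence against a known reference to find its position. The function name, the reference sequence and the noise-free received signal are hypothetical and chosen only for illustration; they are not taken from the NB-IoT specification.

```python
def cross_correlate(received, reference):
    """Return the correlation magnitude for every lag at which the
    reference fits entirely inside the received sequence."""
    n_lags = len(received) - len(reference) + 1
    magnitudes = []
    for lag in range(n_lags):
        acc = 0j
        for k, ref in enumerate(reference):
            # One complex multiply-accumulate per reference sample.
            acc += received[lag + k] * ref.conjugate()
        magnitudes.append(abs(acc))
    return magnitudes


# Hypothetical known sequence, embedded at offset 3 in the received data.
reference = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
received = [0j] * 3 + reference + [0j] * 2

peaks = cross_correlate(received, reference)
best_lag = max(range(len(peaks)), key=peaks.__getitem__)
print(best_lag, peaks[best_lag])
```

The inner loop performs one complex multiply-accumulate per sample, which is the operation that dedicated complex MAC instructions are intended to accelerate.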
We conclude that the efficient processing of sensor data on an IoT edge device requires processors equipped with DSP capabilities. The relevant DSP capabilities are:
– support for fixed-point data types and arithmetic, including fixed-point multiply-accumulate (MAC) instructions, wide accumulators, and efficient saturation and rounding;
– support for floating-point data types and instructions, including fused multiply-add instructions;
– advanced address generation for efficient memory access, including circular and bit-reversed addressing for DSP kernels such as FIR filters and FFTs;
– zero-overhead loops;
– support for complex data types and arithmetic, including complex multiply and MAC instructions;
– support for vector or SIMD processing to enable increased efficiency by exploiting data parallelism;
– efficient divide and square root operations;
– high load/store bandwidth, as DSP functions can be memory-access intensive.
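Several of the capabilities listed above can be modeled in a few lines of software. The sketch below is a behavioral model only, assuming the common Q15 fixed-point format (16-bit values with 15 fractional bits); the helper names are hypothetical and do not correspond to the instructions of any specific processor. It shows saturation and rounding, a multiply-accumulate into a wide accumulator, circular addressing in an FIR filter, and bit-reversed addressing as used in FFTs.

```python
Q15_MAX = 32767
Q15_MIN = -32768


def sat_q15(x):
    """Saturate an integer to the 16-bit Q15 range."""
    return max(Q15_MIN, min(Q15_MAX, x))


def to_q15(value):
    """Convert a real value in [-1, 1) to Q15, saturating at the limits."""
    return sat_q15(int(round(value * 32768)))


def mac_q15(acc, a, b):
    """Multiply two Q15 values and add the Q30 product to a wide
    accumulator (a Python int is unbounded, so it cannot overflow)."""
    return acc + a * b


def round_sat_q30_to_q15(acc):
    """Round a Q30 accumulator to the nearest Q15 value and saturate."""
    return sat_q15((acc + (1 << 14)) >> 15)


class CircularFir:
    """FIR filter whose delay line is addressed modulo its length."""

    def __init__(self, coeffs_q15):
        self.coeffs = list(coeffs_q15)
        self.delay = [0] * len(self.coeffs)
        self.head = 0  # position where the next sample is written

    def step(self, sample_q15):
        n = len(self.delay)
        self.delay[self.head] = sample_q15
        acc = 0
        idx = self.head
        for c in self.coeffs:
            acc = mac_q15(acc, c, self.delay[idx])
            idx = (idx - 1) % n  # circular addressing: wrap, do not shift
        self.head = (self.head + 1) % n
        return round_sat_q30_to_q15(acc)


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (FFT reordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


# Four-tap moving average applied to a constant input of 0.5.
fir = CircularFir([to_q15(0.25)] * 4)
outputs = [fir.step(to_q15(0.5)) for _ in range(4)]
print(outputs)
print([bit_reverse(i, 3) for i in range(8)])
```

In software, each modulo operation, saturation check and loop branch costs extra instructions. Hardware support for circular addressing, saturating arithmetic and zero-overhead loops removes this overhead from the inner loop.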
In addition to control processing and DSP, machine learning has recently emerged in various application areas as a technology for building IoT edge devices with advanced functionalities. Some illustrative examples are smart speakers, wearable activity trackers and smart doorbells. These devices apply machine learning technology that has been trained to recognize certain complex patterns (e.g. voice commands, human activity, faces) from data captured by one or more sensors (e.g. a microphone, a gyroscope, a camera). When such a pattern is recognized, the device can perform an appropriate action. For example, when the voice command “play music” is recognized, a smart speaker can initiate the playback of a song. In the following sections, we dig deeper into the requirements and processor capabilities for efficient machine learning in low-power IoT devices.
Integrated circuits for low-power IoT edge devices may use one or more processors for implementing the different types of processing. Multiple processors are required if a single processor cannot handle the complete software workload. A further reason for using multiple processors is that specialized processors can be used for the different types of processing. More specifically, different processors can be used for control processing, DSP and machine learning.
However, there are also good reasons to aim to reduce the number of processors. Lower cost is a key benefit, which is particularly relevant for low-cost IoT edge devices that are produced in high volumes. The use of fewer processors also reduces design complexity, as it simplifies the interconnect and memory subsystem required to integrate the processors. Furthermore, if multiple interacting functions are combined to be executed on a single processor, then this will limit data movement and reduce the software overhead for communication. An additional benefit for software developers is that a single tool chain can be used. To enable the flexible combination of functions, we need versatile processors that can efficiently execute different types of workloads, including control tasks, DSP and machine learning. Such processors are also referred to as DSP-enhanced RISC cores. They add a broad set of instructions for DSP and machine learning to a RISC core. If this is done well, for example, by sharing the register file and using unified functional units (e.g. a multiplier) for control processing, DSP and machine learning, the hardware overhead of these additions is small. Today, optimized DSP-enhanced RISC cores are available from IP vendors.
1.2.2. Configurability and extensibility
Integrated circuits for low-power IoT edge devices are often built using off-the-shelf processor IP that can be licensed from IP vendors. Since such licensable processors are multi-purpose by nature, to enable reuse across different customers and applications, they may not be optimal for efficiently implementing a specific set of application functions. However, some of these licensable processors offer support for customization by chip designers, in order to allow the processors to be tailored to the functions they need to perform for a specific application (Dutt and Choi 2003). More specifically, two mechanisms can be used to provide such customization capabilities:
– Configurability: the processor IP is delivered as a parameterized processor that can be configured by the chip designer for the targeted application. More specifically, unnecessary features can be deconfigured and optimal parameters can be selected for various architectural features. This may involve optimization of the compute capabilities, memory organization, external interfaces, etc. For example, the chip designer may configure the memory subsystem with closely coupled memories and/or caches. Configurability allows performance to be optimized for the application at hand, while reducing area and power consumption.
– Extensibility: the processor can be extended with custom instructions to enhance the performance for specific application functions. For the application at hand, the performance