Imagination’s META is a multi-threaded processor IP core targeted at complex SoC devices. By combining a common architecture for both RISC and DSP instructions with hardware multi-threading, it allows complex systems to be built around a single processor core where previously two or more different processors would have been needed to achieve similar performance. This has already proven beneficial in volume real-time systems for broadcast and multimedia CE devices, and in this article we seek to explain how Imagination’s innovative approach makes its META architecture uniquely flexible, efficient and suited to an extensive range of applications.
The META core is a high-performance, low-power, multi-threaded CPU core with digital signal processing (DSP) and floating point (FPU) capabilities for use in systems such as audio signal processing and digital wireless applications. META’s real-time architecture can execute multiple general purpose (GP), FPU and DSP tasks on the same core without any cross-task interference.
Its modular nature affords extensive customisation. Major features such as the number of threads, the caches and the DSP resources are configurable or optional, giving the SoC designer the opportunity to trade off performance, power and die size to develop the optimum solution in silicon.
META supports one to four independent hardware threads that typically work in parallel on independent activities. A high-performance implementation might have four threads, two DSP and two GP, whereas a lightweight implementation might have two threads, one GP and one DSP.
Unlike traditional processors that are underused due to multi-cycle memory latencies or waste time performing context switches in software, META supports multiple threads in hardware, with each thread being an instantiation of the processor. Hardware multi-threading allows META to switch contexts in response to rapid real-time events, without software overhead, which is essential for complex multifunctional real-time embedded systems.
Figure 1: Basic architecture and operation.
Hardware threads share the processor’s core resources such as register execution units (ALUs, multiplier, accumulator etc) and coprocessor ports, but have some discrete resources such as read/write ports.
Although the processing resources are shared, each execution unit holds a local register state, an execution pipeline and a program counter (PC) for each thread in order to accommodate multiple thread contexts. A separate control unit holds mode bits and control registers for each thread.
DSP RAM is a global resource shared between threads specifically for use by the extended DSP instruction set. It resides in the data execution units and is accessed like an extended register file, bypassing the main memory system. Multiple execution units, the data cache and the DSP RAMs can be used in parallel to perform complex DSP operations, giving VLIW-like instruction behaviour without the normal bandwidth overheads. The DSP capability of META alone exceeds that of many competing dedicated DSP cores, with support for both SIMD and table-driven VLIW instructions.
Thread switching in hardware
A fine-grained instruction scheduler switches between thread contexts on a cycle-by-cycle basis. This relies on all instructions completing within a known number of cycles.
To manage META’s multiple threads, the instruction scheduler extracts a list of required resources from the next pending instruction on each thread. Over fifty internal resource requirements are matched to resource availability via an interlocking process that yields a set of candidate instructions that could be issued on the next cycle. One instruction is then chosen from this set according to a variable-priority schedule.
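To make the two steps concrete, here is a minimal software sketch of the idea, not the hardware implementation. It assumes that resource requirements can be modelled as bitmasks and that priority is a single number per thread. The names Pending and pick_instruction, the four resource bits and the tie-break rule are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    thread: int    # hardware thread id
    needs: int     # bitmask of resources the next instruction requires
    priority: int  # higher value wins (hypothetical encoding)


def pick_instruction(pending, available):
    """Return the pending instruction to issue this cycle, or None."""
    # Interlock step: keep only instructions whose every required
    # resource bit is set in the availability mask.
    candidates = [p for p in pending if p.needs & ~available == 0]
    if not candidates:
        return None
    # Selection step: highest priority wins; ties go to the lowest
    # thread id (an assumed tie-break, not taken from the article).
    return max(candidates, key=lambda p: (p.priority, -p.thread))


# Hypothetical resource bits; the real core tracks over fifty.
ALU0, ALU1, MUL, DCACHE = 1, 2, 4, 8

pending = [
    Pending(thread=0, needs=ALU0 | MUL, priority=3),
    Pending(thread=1, needs=ALU1, priority=2),
    Pending(thread=2, needs=DCACHE, priority=1),
]

# ALU0 and MUL are busy, so thread 0 is interlocked despite its priority.
chosen = pick_instruction(pending, available=ALU1 | DCACHE)
print(chosen.thread)  # prints 1
```

In this sketch the priority is a fixed field. A variable-priority schedule would update it between cycles.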
Each thread can use different processor resources at the same time, or one thread can use all of the processor’s resources. This means key algorithms such as Fast Fourier Transforms are performed extremely quickly and mundane actions such as data movement are done with maximum efficiency.
As not every resource is needed by every instruction, instructions which do not have conflicting resource requirements can execute simultaneously, and often more than one thread can issue an instruction on each cycle. The practical performance improvement obtained depends on the particular instruction sequence, but throughput improvements from 50% to 100% relative to simple single-instruction multi-threading are not uncommon. As an illustration, a performance of 2 DMIPS/MHz can be achieved when running SMP Linux on four threads (-O2 optimised Dhrystone).
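The simultaneous issue described above can be pictured with a simple greedy model. This is only a sketch under stated assumptions: resources are modelled as bits in a mask, threads are visited in priority order, and there is no limit on issue width beyond resource conflicts. The function name issue_set and the four resource names are hypothetical.

```python
def issue_set(pending, available):
    """Greedily pick non-conflicting instructions for one cycle.

    pending:   list of (thread_id, needs_mask), already in priority order.
    available: bitmask of resources free at the start of the cycle.
    Returns the list of thread ids that issue this cycle.
    """
    issued = []
    free = available
    for thread_id, needs in pending:
        if needs & ~free == 0:   # every required resource is still free
            issued.append(thread_id)
            free &= ~needs       # claim those resources for this cycle
    return issued


# Hypothetical resource bits for illustration only.
ALU0, ALU1, MUL, DCACHE = 1, 2, 4, 8
ALL = ALU0 | ALU1 | MUL | DCACHE

cycle = [(0, ALU0 | MUL), (1, ALU1), (2, ALU0), (3, DCACHE)]
issued = issue_set(cycle, ALL)
print(issued)  # prints [0, 1, 3]; thread 2 conflicts with thread 0 on ALU0
```

Three of the four threads issue in the same cycle here because their resource needs do not overlap, which is the source of the throughput gain over issuing one instruction per cycle.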
Automatic MIPS Allocation (AMA™)
Many embedded applications in the communications and consumer space are required to perform to a minimum level in order to meet user expectations, e.g. video frame rate and audio quality with no dropped packets. These system-level considerations have now migrated down to the architecture level, and can be likened to the hub of an advanced communications network, with many different types of data stream (latency or bandwidth-critical) and peripherals requiring attention.
META’s patented AMA provides automatic resource management in hardware, ensuring that each thread of execution gets the MIPS it needs and has the required response time. AMA allows thread instruction issue rates and relative thread priorities to be controlled in a dynamic fashion based on rate control and priority control.
Rate control is concerned with the number of instructions a thread wishes to run over a given time period and the total load on the system, i.e. how to manage the system when it is very busy. During interrupt-level processing, a thread's rate is boosted to the maximum possible level to reduce interrupt latencies. Priority control is handled primarily through a static priority register setting and a deadline counter.
Figure 2 shows a realistic configuration in which we have used a combination of thread prioritisation and AMA to achieve the desired behaviour. Threads 0 and 1 have elevated priorities for real-time response. Thread 2 is controlled by AMA for an execution rate of about 40% instructions/clock, and thread 3 is a free-running background task.
Figure 2: AMA rate control.
Because of interference from the higher priority threads, thread 2 cannot always achieve the desired run rate and builds up a processing deficit. Whenever possible, for example immediately after thread 1’s activity burst, thread 2’s execution rate is increased to make up the deficit, and after catching up the execution rate returns to the set value. Over short periods the desired execution rate cannot be achieved due to total resource demand exceeding what is available, but over longer periods (the major time divisions) the average rate of 40% instructions/clock is maintained, providing guaranteed throughput for the critical task.
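The catch-up behaviour of thread 2 can be imitated with a simple credit counter, similar to a token bucket. This is an illustrative model only and should not be read as the AMA hardware algorithm. It assumes that the thread earns credit every clock at its target rate, spends credit when it issues, and keeps the unspent credit as its deficit. The function name simulate_rate_control and the blocking pattern are hypothetical.

```python
def simulate_rate_control(rate_num, rate_den, blocked):
    """Model one rate-controlled thread, one entry of `blocked` per clock.

    rate_num / rate_den: target instructions per clock (2/5 = 40%).
    blocked: booleans; True means higher-priority threads took the slot.
    Returns (issued, credit), where issued holds 1 or 0 per clock and
    credit is the remaining deficit in units of 1/rate_den instruction.
    """
    credit = 0
    issued = []
    for is_blocked in blocked:
        credit += rate_num                 # earn credit every clock
        if not is_blocked and credit >= rate_den:
            credit -= rate_den             # spend one instruction's worth
            issued.append(1)
        else:
            issued.append(0)
    return issued, credit


# Assumed scenario: 10 clocks of interference, then 40 clocks free.
blocked = [True] * 10 + [False] * 40
issued, credit = simulate_rate_control(2, 5, blocked)

print(sum(issued[:10]))    # prints 0: no progress during the burst
print(sum(issued[10:16]))  # prints 6: issues every clock while catching up
print(sum(issued))         # prints 20: 40% of 50 clocks overall
```

Once the credit built up during the burst has been spent, the model settles back to two issues in every five clocks, so the long-run average matches the target even though the short-run rate does not.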
META’s architecture has been developed to give designers the flexibility to realize the widest range of embedded SoC applications. It is available in three main product groups:
META HTP – Highest performance, multi-threaded; tuned for combined Linux and RTOS-based applications and DSP on a common datapath.
META MTP – High performance, multi-threaded but smaller footprint; designed for advanced RTOS-based or native embedded applications.
META LTP – Lightweight, ultra-small, single-threaded, 32-bit microcontroller.
Inside these main groups further selections can be made such as the number of threads (between 1 and 4) and which of those are GP or DSP capable. The GP threads use a highly optimized RISC-like instruction set, with 16-bit instruction sets available to minimize code size. Each DSP thread adds ALU resources and more registers to enable the processor to execute advanced DSP algorithms such as audio codecs and modems. The DSP capabilities are further configurable from a light implementation to full DSP with on-chip DSP RAMs, accumulators, hardware loops, read pipelines and templated VLIW instructions.
So hardware multi-threading allows the separation of real-time tasks having widely differing scheduling requirements into different software schedules. For example, a subsystem based on a complex multi-functional OS such as Linux can run on one thread while a real-time data-driven DSP task runs on another. The DSP task might be based on a simple synchronous IO driven scheduling strategy which would be completely independent of the interrupts and device drivers in the other subsystem. This approach avoids the problems which arise when trying to schedule tasks with disparate event rates and activity patterns under a common OS. Such difficulties are why many conventional systems use two or more processors to achieve the same result.
Figure 3: 4-threaded deployment in a DAB/Internet radio application.
Figure 3 shows a real-world implementation of a 4-threaded META HTP SoC where threads 0 and 2 are dedicated to real-time DSP tasks. Thread 1 runs the META Advanced Audio Framework and is fed from the other threads.
Figure 4: 2-threaded deployment in a DAB/Internet radio application.
Figure 4 shows a similar application using a simpler 2-threaded META MTP SoC where DAB demodulation is performed on independent coprocessors which feed thread 0. The Wi-Fi decode is handled by thread 1 and also feeds to thread 0.
The uniquely flexible and scalable META architecture has all the benefits of multi-processing, but with less silicon resource and development complexity, and is significantly lower cost than a multi-processor approach. When taken alongside Imagination's ENSIGMA UCC technology, it forms the building blocks for extremely capable communications SoCs. Having already established its presence in the DAB and broadcast arena, META is now successfully diversifying into new areas such as power management, Wi-Fi, and beyond, so the road ahead looks bright for META.
Jim Whittaker studied the Electrical and Information Sciences Tripos at Cambridge. He designed the multi-threaded META processor, developed Imagination's chip layout capability and led the hardware team that created the integrated digital radio processors used by PURE. Jim is VP of the META product line.