This techtalk article helps designers of an Application Processor that needs to support Data in Transit protection. The emphasis is on high-speed cryptographic data processing, as opposed to high-security operations or operations with high computational load but little data. We assume the system also has tasks other than cryptographic processing: many systems are designed with a different primary task in mind, and Data in Transit protection is added only after the fact or in a second revision of the product. The challenge we address here is not just doing high-speed crypto, but doing so while minimizing its impact and footprint on the rest of the system.
Cryptographic coprocessors have evolved to become more powerful and complex. To sketch the broad range of available solutions, we depict a 'timeline' of solutions below, with every step along the way adding sophistication. Systems today may not need the most comprehensive solution, but your system is bound to fit somewhere on this timeline.
Figure 1: A history of cryptographic acceleration
Software is the simplest form of cryptographic processing. It is the easiest to build and integrate, but it also has the highest impact on the system. To appreciate why cryptographic processing of data is so demanding in software, consider the following:
Efficient software copies data around as little as possible, yet with Data in Transit protection the processor is required to touch every byte of the packet data twice: once for encryption or decryption and once for authentication.
Crypto operations involve bit-level data manipulation for which most application processors have no native instructions.
Bus systems and external memory interfaces become heavily loaded, leaving little headroom for 'other' applications using these resources.
With all these resources in full swing, power consumption rises as well.
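To make the first point concrete, here is a minimal Python model of the two-pass cost. The XOR 'cipher' and additive 'checksum' are deliberately trivial stand-ins for a real cipher and MAC; the byte counts, not the toy crypto, are the point.

```python
def software_protect(packet: bytes, key: int):
    """Toy model of software Data in Transit protection.

    Pass 1 'encrypts' every byte (XOR stands in for a real cipher);
    pass 2 'authenticates' every byte (an additive checksum stands in
    for a real MAC). We count how many bytes the CPU must touch.
    """
    touched = 0
    # Pass 1: the cipher touches every payload byte once.
    ciphertext = bytes(b ^ key for b in packet)
    touched += len(packet)
    # Pass 2: integrity protection touches every byte again.
    mac = sum(ciphertext) & 0xFF
    touched += len(ciphertext)
    return ciphertext, mac, touched

pkt = bytes(range(100))
_, _, bytes_touched = software_protect(pkt, 0x5A)
assert bytes_touched == 2 * len(pkt)  # every byte crosses the CPU twice
```

A real stack typically adds further copies (socket buffers, header insertion), so two touches per byte is the floor, not the ceiling.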
Individual Crypto Engines with DMA support
The first step in cryptographic offloading is adding dedicated hardware to execute the crypto algorithms. By adding DMA capability to the crypto cores, the processor can spend its cycles on other tasks. This is relatively easy to support in software: it is a straightforward replacement of the crypto operation in software. Resource utilization on the rest of the system remains high, though:
Bus loading/cycle stealing still happens.
Every data byte processed still crosses the bus system three times: read (for encryption/decryption), write (the result), read (for hashing).
The processor's interaction with the crypto hardware is inherently synchronous, which keeps the hardware underutilized.
Often crypto processing can only be scheduled 'one block at a time' to allow the processor to read status and update key material in between data blocks.
If the crypto engine works on small blocks of data, a heavy performance penalty results: the high level of processor involvement leaves the crypto hardware idle much of the time. A lot of software (e.g. OpenSSL) uses this popular but inefficient acceleration model.
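The idle-time penalty of this synchronous, block-at-a-time model can be sketched with a simple timing budget. All numbers here (register-access cost, per-block engine time, three accesses per block) are illustrative assumptions, not measurements of any particular core.

```python
def one_block_at_a_time(num_blocks: int,
                        cpu_ns_per_register_access: int = 200,
                        engine_ns_per_block: int = 500) -> float:
    """Model of the synchronous look-aside flow: for every block the
    CPU programs the engine (assume 2 register accesses: key/descriptor
    plus a 'go'), then reads status/result (1 access) before the next
    block can start. Returns the fraction of time the engine is busy."""
    cpu_ns = num_blocks * 3 * cpu_ns_per_register_access
    engine_ns = num_blocks * engine_ns_per_block
    total_ns = cpu_ns + engine_ns  # fully serialized, no overlap
    return engine_ns / total_ns

util = one_block_at_a_time(1000)
assert util < 0.5  # under these assumptions the engine idles most of the time
```

Because the CPU overhead is per block, shrinking the block size drives utilization down further, which is exactly the small-block penalty described above.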
Protocol Transform engine
Offloading at the protocol level is more advanced. Instead of accelerating individual cipher or hash operations, the hardware takes care of a complete security protocol transformation in a single pass. Typically the hardware has DMA bus-master capability, allowing it to autonomously update state and data in system memory. Although this capability makes integration with software more complicated, it allows a huge efficiency increase for cryptographic acceleration. The processor can queue up multiple operations, allowing 'batch processing' and 'interrupt coalescing', which significantly improves the number of packets per second a system can handle. The crypto accelerator can work on multiple packets at a time, using its processing pipeline to 'hide' data access latencies by reading data for the next packet while processing the current one. Single-pass hash-and-encrypt operations keep the hash and cipher engines busy at the same time while decreasing the number of times the packet crosses the bus system. Together, these points allow the crypto hardware to reach nearly 100% utilization while still reducing the per-packet load on system resources.
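The batching and interrupt-coalescing behavior can be sketched as a descriptor ring that software fills and hardware drains. The class, its names, and the batch size of 32 are illustrative assumptions, not a real driver API.

```python
from collections import deque

class TransformEngineModel:
    """Toy model of a protocol transform engine fed by a descriptor
    ring. Software queues whole-packet operations; 'hardware' drains
    them and raises one coalesced interrupt per batch instead of one
    interrupt per packet."""

    def __init__(self, coalesce: int):
        self.ring = deque()
        self.coalesce = coalesce
        self.interrupts = 0

    def queue(self, descriptor):
        # Software side: enqueue and move on; no waiting per packet.
        self.ring.append(descriptor)

    def drain(self) -> int:
        # Hardware side: process everything queued, batching completions.
        done = 0
        while self.ring:
            self.ring.popleft()          # one packet transformed
            done += 1
            if done % self.coalesce == 0:
                self.interrupts += 1     # coalesced completion interrupt
        if done % self.coalesce:
            self.interrupts += 1         # final partial batch
        return done

engine = TransformEngineModel(coalesce=32)
for i in range(1000):
    engine.queue(i)
engine.drain()
assert engine.interrupts == 32  # ~32 interrupts for 1000 packets, not 1000
```

The per-packet interrupt and context-switch cost is what dominates small-packet workloads, so dividing it by the batch size is where much of the packets-per-second gain comes from.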
Parallelized protocol transform engines
For some systems, when even a protocol transform engine is not fast enough, the simple answer seems to be to 'just throw more hardware at it'. Reality is hardly ever that simple. In most encryption and message-integrity modes it is not possible to assign multiple cipher and hash cores to the same packet because of an internal feedback loop: the result of the current step is required as input to the next step. An exception is AES-GCM; this mode uses the AES algorithm in 'counter mode', which has no data feedback, so AES-GCM engines can be built to provide throughputs far beyond the limits of feedback-based modes.
Operating multiple transform engines in parallel is challenging because of the 'state information' that needs to be maintained per connection. This Security Association (SA) is required before a transform engine can start processing a packet, and it is updated after processing is done. Processing a packet using stale SA data causes IPsec or SSL processing to fail completely.
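The SA coherence requirement can be modeled as exclusive ownership: an engine must hold the SA from read-out to write-back. A sketch, using a lock and a sequence number to stand in for the full SA; the class and its names are illustrative, not a real hardware interface.

```python
import threading

class SecurityAssociation:
    """Per-connection state that an engine must read before, and write
    back after, each packet. A sequence number stands in for the full
    SA (keys, IVs, replay window)."""

    def __init__(self):
        self.seq = 0
        self._owner = threading.Lock()

    def process_packet(self) -> int:
        # Only one engine may own the SA at a time. If a second engine
        # read the SA before this update landed, it would process its
        # packet against stale state and break the stream.
        with self._owner:
            self.seq += 1
            return self.seq

def engine_worker(sa, out, n):
    for _ in range(n):
        out.append(sa.process_packet())

sa = SecurityAssociation()
results = []
engines = [threading.Thread(target=engine_worker, args=(sa, results, 250))
           for _ in range(4)]
for t in engines: t.start()
for t in engines: t.join()
# Serialized ownership yields exactly one update per packet, no gaps or dupes.
assert sorted(results) == list(range(1, 1001))
```

The practical consequence is that parallelism must come from spreading *different* SAs across engines; packets on the same SA are still a serial stream.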
Even the parallelized protocol transform engine isn't always sufficient to achieve the required data throughput and packets per second. Simply adding more crypto hardware doesn't always do the trick: other system bottlenecks may prevent the crypto hardware from reaching its potential. Examples of performance-limiting effects are:
Data Bandwidth limitations
Where originally packet data came in over an external interface (Ethernet, WiFi) and was stored in memory, the packet now needs to be read from memory, decrypted, and stored back in memory before it can be handed to the application (the same holds for outbound traffic). A gigabit interface that consumed 1 gigabit of internal bandwidth suddenly requires 3 gigabit of internal bandwidth. On top of that, for every packet processed the key material and tunnel state (SA) need to be read and updated by the crypto engine.
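This accounting can be written down directly. The 3x factor follows from the text above; the packet and SA sizes used here are illustrative assumptions to show that SA traffic adds a non-trivial further overhead.

```python
def lookaside_bus_gbps(line_rate_gbps: float,
                       avg_packet_bytes: int = 512,
                       sa_bytes: int = 128) -> float:
    """Internal bus bandwidth consumed per line-rate gigabit in a
    look-aside design: the packet crosses the bus three times (the
    interface writes it to memory, the crypto engine reads it and
    writes the result back), and every packet also costs an SA read
    plus an SA write-back."""
    payload_gbps = 3 * line_rate_gbps
    packets_per_s = line_rate_gbps * 1e9 / 8 / avg_packet_bytes
    sa_gbps = packets_per_s * 2 * sa_bytes * 8 / 1e9  # SA read + update
    return payload_gbps + sa_gbps

demand = lookaside_bus_gbps(1.0)
# Under these assumptions a 1 Gbit/s interface needs ~3.5 Gbit/s internally.
assert abs(demand - 3.5) < 1e-9
```

Note that the SA term scales with packet *rate*, not packet *size*, so smaller average packets make the overhead proportionally worse.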
Processor Bandwidth limitations
Assuming packet data movement is handled by DMA and cryptographic processing by a crypto accelerator, the processor still has to perform all packet-handling operations for every packet, regardless of its size. Every system therefore has an upper limit on the number of packets it can handle per second, especially when the processor bandwidth available for these tasks is limited. Most systems don't just move packets along; they actually need to act on them, and so would like to reserve the majority of their bandwidth for other tasks.
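The packets-per-second ceiling is a one-line budget. The clock rate, CPU share, and cycles-per-packet figures below are illustrative assumptions chosen to show the shape of the limit, not benchmarks.

```python
def max_packets_per_second(cpu_hz: float,
                           packet_share: float,
                           cycles_per_packet: int) -> float:
    """Upper bound on packet rate when per-packet handling (descriptor
    setup, completion handling, stack traversal) costs a fixed number
    of CPU cycles and only a fraction of the CPU is budgeted for it."""
    return cpu_hz * packet_share / cycles_per_packet

# A 1 GHz core giving 20% of its cycles to packet handling at
# 2000 cycles/packet tops out at 100k packets/s.
rate = max_packets_per_second(1e9, 0.20, 2000)
assert abs(rate - 100_000) < 1e-6

# The ceiling is independent of packet size, so throughput collapses
# for small packets: 100k pkt/s is ~1.2 Gbit/s at 1500-byte packets
# but only ~51 Mbit/s at 64-byte packets.
gbps_large = rate * 1500 * 8 / 1e9
gbps_small = rate * 64 * 8 / 1e9
```

This is why packets-per-second, not raw crypto throughput, is usually the binding constraint for look-aside designs.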
Inline Protocol Acceleration engines
The crypto acceleration architectures discussed so far operate in 'Look-Aside' mode: packet handling is done completely under software control. Only when the actual cryptographic operation needs to be performed does software 'look aside' to the cryptographic accelerator, which hands the packet back to the software after completing its task.
With Inline processing, software is no longer involved both before and after crypto acceleration: all cryptographic operations are performed on the packet before the software 'sees' it for the first time (or vice versa for outbound traffic). This is typically called 'Bump in the Stack'. Some systems targeting networking gateway applications take this concept one step further and allow a packet to travel from network interface to network interface entirely in hardware, without involving software at all ('Bump in the Wire'). Application Processors primarily benefit from the 'Bump in the Stack' model since they actually use (consume) the packet data they receive (and vice versa).
'Bump in the Stack' and 'Bump in the Wire' models present software integration challenges, but when properly integrated, major benefits can be achieved. From a data-plane point of view, Inline acceleration makes the system appear as a regular networking system. Performance becomes predictable since it no longer depends on processor activity or on bus or SDRAM utilization by other system tasks. The system can operate at the same performance level as without Data in Transit protection, assuming the crypto accelerator has 'line rate performance'. Most or all of the issues raised in the previous section go away when an inline crypto accelerator is deployed.
An on-chip crypto accelerator consumes significantly less power than a general-purpose application processor performing the same work. In addition, the application processor typically executes from off-chip SDRAM, so offloading the work lowers power consumption even more. The most significant power savings are achieved by dedicated protocol engines: Bump-in-the-Stack acceleration reduces power compared to a Look-Aside deployment, and Bump-in-the-Wire operation improves power consumption even further because packet data no longer has to enter SDRAM at all and the processor spends no cycles on packet processing.
In this techtalk we highlighted challenges for achieving high throughput data in transit protection for application processors. Different architectural models have been explained, showing the evolution of cryptographic offloading hardware and the effects the different architectures have on the hardware, software and performance of a system. It will be clear that packet engine design and integration is no longer primarily related to the ability to provide high 'raw crypto throughput'. The requirements the system poses on the crypto hardware to allow the system to tap the acceleration potential have become much more important.
Another trend is that the crypto accelerator is pulling in more functionality from the surrounding system. Virtualization support in the packet engine allows the software in the virtualization layer to become smaller. Bump-in-the-Stack and Bump-in-the-Wire operational modes pull OSI layer 2 and 3 functionality, plus parts of the packet forwarding function, into the packet engine hardware.
Packet engines are overcoming limitations imposed by legacy cryptographic modes and protocols that were not designed for the speeds offered by modern network technologies. While today this is perhaps of primary use to server deployments, the next generation of application processors will benefit from the lessons learned today.
AuthenTec's latest packet engines, the SafeXcel-IP-97 and SafeXcel-IP-197 IP core series, are built to support all of the presented optimization, acceleration and offloading mechanisms. These IP cores are supported by driver development kits as well as AuthenTec's QuickSec and Matrix toolkits.
AuthenTec is a leading provider of mobile and network security and helps protect individuals and organizations through secure networking, content and data protection, access control and strong fingerprint security on PCs and mobile devices. AuthenTec technology is deployed by the leading mobile device, networking and computing companies, content and service providers, and governments worldwide. AuthenTec's products and technologies provide security on hundreds of millions of devices. Top tier customers include Alcatel-Lucent, Cisco, Fujitsu, HBO, HP, Lenovo, LG, Motorola, Nokia, Orange, Samsung, Sky, and TI. Learn more at authentec.com.
AuthenTec offers silicon IP cores that provide efficient HW acceleration of the IPsec, SSL, TLS, DTLS, SRTP, MACsec and HDCP protocols, in Look-Aside, Bump-in-the-Stack and Bump-in-the-Wire architectures. Acceleration performance from a few hundred Mbps up to 40 and even 100 Gbps can be achieved in today's 90, 65, 45, 40 and 28nm designs.
Mr. van Loon is Solutions Architect for the Embedded Security Solutions group of AuthenTec, Inc. He has over 14 years of experience in embedded security, ranging from high speed cryptographic protocol acceleration to low power, high security embedded systems. Mr. van Loon holds an Electrical Engineering Degree from the Eindhoven University of Technology in the Netherlands.