ARM provides commentary on hot topics that matter to the wider microprocessor industry.

12 September 2008

Interpreting Benchmark Data

Finding Your Way Through the Benchmarking Maze

Benchmark data and implementation conditions both matter when choosing a high-performance microprocessor core, but true like-for-like comparisons are often difficult to achieve.

Consumers want higher-quality music, video and gaming in multimedia products. And businesses want converged products that deliver multiple office applications. To make all of this possible, your embedded designs need high-performance microprocessor cores.

The trouble is that specifying and choosing a core involves more design and project constraints than you might first expect. Performance matters, but so do cost, power and time to market, especially in consumer markets where being first to market with a competitively priced product is critical to success.

Is benchmark data all that it seems?
Designers typically look at published benchmark figures when choosing cores. Ideally, the data should give them firm technical information on which to base a rational selection and to inform their evaluation of system performance.

Two of the most established processor benchmarks are Dhrystone and EEMBC. Dhrystone, which has been in use for over 20 years, consists of a small amount of code operating on a small data set. Although just about every vendor quotes performance in Dhrystone MIPS, or ‘DMIPS’, the benchmark does not reflect how a processor handles real-world operations: it measures processing within the core rather than system performance, and it places a disproportionate emphasis on string operations. EEMBC, and other benchmarks such as BDTI and MediaBench, are built around real-world workloads and can give a more accurate indication of how a processor will actually perform.
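
To make clear what a quoted DMIPS figure encodes, here is a minimal sketch of the conversion from a raw Dhrystone score. The divisor 1757 is the conventional Dhrystones-per-second score of the VAX 11/780 reference machine, which defines 1 DMIPS. The function names and the sample core are illustrative assumptions, not measurements of any real processor.

```python
# Conventional reference: a VAX 11/780 scores 1757 Dhrystones per second,
# and that score is defined as 1 DMIPS.
VAX_DHRYSTONES_PER_SECOND = 1757


def dmips(dhrystones_per_second: float) -> float:
    """Convert a raw Dhrystone score into DMIPS."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


def dmips_per_mhz(dhrystones_per_second: float, clock_mhz: float) -> float:
    """Normalize DMIPS by clock frequency so cores at different clocks can be compared."""
    return dmips(dhrystones_per_second) / clock_mhz


if __name__ == "__main__":
    # Hypothetical core: 878,500 Dhrystones/s measured at 400 MHz.
    score = 878_500
    print(f"{dmips(score):.1f} DMIPS")                 # 500.0 DMIPS
    print(f"{dmips_per_mhz(score, 400):.2f} DMIPS/MHz")  # 1.25 DMIPS/MHz
```

Note that nothing in this arithmetic captures memory-system behaviour, which is one reason a DMIPS figure says little about system performance.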

The exact benchmark conditions play a major part in determining the quoted figures, so you should ask:

- Was the benchmark measured on the first loop, or after running through several times? After the first loop, the caches are loaded and the branch predictor is trained, so later loops run faster and a figure taken from them gives the illusion of faster execution.

- Has the compiler been optimized for the benchmark, or for the processor? Compilers that can take advantage of specific features in the processor can improve performance by 10 to 20%. Compilers that have been specially optimized for particular benchmarks can easily double the performance of the processor for that benchmark. While this makes the processor look good, it is misleading for designers who are trying to interpret the data either to choose between cores or understand whether the processor will meet their application’s needs.

- Which flags have been set for compilation? Has the compiler been set to optimize for code density or performance? Instructing the compiler to produce the fastest code can mean a larger code footprint that may in turn mean that larger memories are needed, which increases power dissipation and cost.

If the benchmark conditions are unpublished, not transparent or impossible to reproduce, it is very difficult to make like-for-like comparisons.
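
One way to keep the first-loop question transparent is to report the first run and the warmed-up runs separately. The sketch below does this on a host machine with Python's `time.perf_counter`; on an embedded target you would read a cycle counter instead. The function name and the returned keys are hypothetical, chosen only for illustration.

```python
import statistics
import time
from typing import Callable, Dict


def time_cold_and_warm(workload: Callable[[], object],
                       iterations: int = 10) -> Dict[str, float]:
    """Time `workload` repeatedly and report the first run separately.

    The first run starts with cold caches and an untrained branch predictor;
    later runs do not, so quoting only one of the two numbers hides part of
    the picture.
    """
    if iterations < 2:
        raise ValueError("need at least 2 iterations to separate cold from warm")

    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)

    return {
        "first_run_s": timings[0],
        # Median of the remaining runs is less sensitive to outliers than the mean.
        "warm_median_s": statistics.median(timings[1:]),
    }


if __name__ == "__main__":
    result = time_cold_and_warm(lambda: sorted(range(100_000, 0, -1)))
    print(result)
```

Publishing both figures, together with the compiler version and flags used, lets a reader reproduce the measurement rather than guess at it.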

Maximizing system performance
Improving general purpose computation in the core is one way to enhance performance, but there are other things you can do within the system to speed it up. 

Selecting the right memory architecture is key to maximizing system performance and optimizing cost. Deciding how to use SRAM and ROM, and how to size and partition the cache or tightly coupled memories (TCM), is fundamental to the system design. These choices depend upon the real-time constraints of the application and the flexibility of the processor.

If you use level-2 cache memory, it is possible to bring more data on-chip and closer to the CPU, which helps resolve the performance-limiting bandwidth constraints associated with accessing off-chip memory. Using level-2 cache can dramatically improve the performance of systems with large code or data structures, long memory latencies, or busy system buses. At the same time, considerable power savings are possible since many off-chip transactions are eliminated.
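
The effect of a level-2 cache can be estimated with the standard average memory access time (AMAT) model. The sketch below is a textbook approximation, not a description of any particular ARM memory system, and every latency and miss rate in the example is a made-up illustrative value.

```python
def amat_without_l2(l1_hit_cycles: float,
                    l1_miss_rate: float,
                    memory_cycles: float) -> float:
    """Average access time when every L1 miss goes straight off-chip."""
    return l1_hit_cycles + l1_miss_rate * memory_cycles


def amat_with_l2(l1_hit_cycles: float,
                 l1_miss_rate: float,
                 l2_hit_cycles: float,
                 l2_miss_rate: float,
                 memory_cycles: float) -> float:
    """Average access time when L1 misses are first looked up in an on-chip L2.

    `l2_miss_rate` is the local miss rate: the fraction of L2 lookups that
    still have to go off-chip.
    """
    l1_miss_penalty = l2_hit_cycles + l2_miss_rate * memory_cycles
    return l1_hit_cycles + l1_miss_rate * l1_miss_penalty


if __name__ == "__main__":
    # Illustrative numbers only: 1-cycle L1 hit, 5% L1 misses,
    # 10-cycle L2 hit, 20% of L2 lookups miss, 100-cycle off-chip access.
    print(f"{amat_without_l2(1, 0.05, 100):.1f} cycles")        # 6.0 cycles
    print(f"{amat_with_l2(1, 0.05, 10, 0.20, 100):.1f} cycles")  # 2.5 cycles
```

With these assumed numbers, off-chip accesses fall from 5% of all references to 1%, which is also where the power saving comes from.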

The system’s bus architecture also influences system performance, and determines how backplane and peripheral components will perform with the core. For example, the AMBA® 3 AXI™ bus system is appropriate for high-frequency and low-latency designs that maximize the use of interconnect resources, which enables very high data throughput.

Technologies and features that accelerate specific tasks such as executing Java or media algorithms further enhance processor performance. These are typically hardware ‘extensions’ that can perform the tasks more efficiently than the core processor itself, but the enhanced performance doesn’t come for free. Adding hardware will increase the area by perhaps 5 to 10%, and will probably affect the integer core pipeline, which will reduce the maximum frequency. There are also compatibility issues: special operating system and software support is required for the new instructions.
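
Whether an extension pays off depends on how much of the workload it accelerates and how much clock frequency it costs. The sketch below combines Amdahl's law with a frequency penalty; this is a simple model introduced here for illustration, not a method described by ARM, and the workload fractions, speedup factor and 5% penalty are assumed values.

```python
def net_speedup(accelerated_fraction: float,
                extension_speedup: float,
                frequency_penalty: float) -> float:
    """Overall speedup from a hardware extension that also lowers the clock.

    accelerated_fraction: share of baseline execution time the extension can
        speed up (0.0 to 1.0).
    extension_speedup: how many times faster that share runs.
    frequency_penalty: fractional loss in maximum clock frequency
        (0.05 means the core clocks 5% slower).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if extension_speedup <= 0.0:
        raise ValueError("extension_speedup must be positive")
    if not 0.0 <= frequency_penalty < 1.0:
        raise ValueError("frequency_penalty must be in [0, 1)")

    # Amdahl's law gives the cycle count relative to the baseline core.
    relative_cycles = (1.0 - accelerated_fraction) + accelerated_fraction / extension_speedup
    # A slower clock stretches every remaining cycle.
    relative_frequency = 1.0 - frequency_penalty
    return relative_frequency / relative_cycles


if __name__ == "__main__":
    # Hypothetical media-heavy workload: 40% of time accelerated 4x, 5% slower clock.
    print(f"{net_speedup(0.40, 4.0, 0.05):.2f}")  # 1.36
    # Hypothetical control-heavy workload: only 5% of time benefits.
    print(f"{net_speedup(0.05, 4.0, 0.05):.2f}")  # 0.99
```

In the second case the extension makes the system slightly slower overall, before even counting the extra area.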

High-speed implementation
When it comes to implementation, it’s important to understand how quoted performance data relates to the physical chip implementation, and what you can do to achieve the highest performance from your core.

Some of the questions to ask include:

- Validation. Are the power, performance and area figures validated post-layout and after floor planning? Do the figures take into account power rails, signal integrity, IR-drop and scan?

- Frequency. What are the process, technology, library and PVT conditions? Is the RAM compiled or custom-designed, and has its speed been taken into account?

- Area. Has the design been optimized for speed or area? Has the area overhead for MMU, RAM, power rails and signal integrity been taken into account? What kind of gate utilization can designers achieve ‘out of the box’?

- Power. Does the measurement use any low-power libraries or additional low-power techniques? What are the measurement conditions?

- Implementation. Was the design implemented through a standard ‘out-of-the-box’ ASIC methodology, or was custom optimization work performed? Custom implementation is manually intensive and can impact time to market, but in general it’s possible to achieve higher performance than standard ASIC implementation.
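
The questions above amount to recording the conditions behind each quoted figure and checking that two figures share them before comparing. A minimal sketch follows; the class, its fields and the sample values are hypothetical and cover only a subset of the questions.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ImplementationConditions:
    """Conditions under which a frequency, area or power figure was obtained."""
    process_node: str   # e.g. "65nm LP"
    library: str        # standard-cell library used
    pvt_corner: str     # e.g. "SS" or "TT"
    post_layout: bool   # True if validated after layout and floor planning
    flow: str           # "standard ASIC" or "custom"


def mismatched_conditions(a: ImplementationConditions,
                          b: ImplementationConditions) -> List[str]:
    """Return the names of the conditions that differ between two quoted figures."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


if __name__ == "__main__":
    core_a = ImplementationConditions("65nm LP", "lib-A", "SS", True, "standard ASIC")
    core_b = ImplementationConditions("65nm LP", "lib-A", "TT", False, "standard ASIC")
    # A non-empty list means the two figures are not directly comparable.
    print(mismatched_conditions(core_a, core_b))  # ['pvt_corner', 'post_layout']
```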

Improving performance through implementation
When quoting performance data, ARM uses realistic de-rating margins wherever possible. Using performance data without allowing for OCV (on-chip variation) can give an overly optimistic view of a core’s performance.
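
As a rough illustration of why de-rating matters, the sketch below applies a single timing margin to a nominal maximum frequency. It assumes, as a deliberate simplification, that every critical-path delay grows by the same fraction; real sign-off flows apply OCV margins per path, and the 10% figure is an invented example rather than an ARM value.

```python
def derated_fmax_mhz(nominal_fmax_mhz: float, ocv_derate: float) -> float:
    """Estimate the clock frequency left after applying a timing margin.

    Simplifying assumption: every critical-path delay grows by the fraction
    `ocv_derate`, so the minimum clock period grows by the same factor.
    """
    if nominal_fmax_mhz <= 0.0:
        raise ValueError("nominal_fmax_mhz must be positive")
    if ocv_derate < 0.0:
        raise ValueError("ocv_derate must not be negative")
    return nominal_fmax_mhz / (1.0 + ocv_derate)


if __name__ == "__main__":
    # Hypothetical core quoted at 500 MHz with no margin, then de-rated by 10%.
    print(f"{derated_fmax_mhz(500.0, 0.10):.1f} MHz")  # 454.5 MHz
```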

ARM has gathered much information from Partners about the best implementation techniques for enhancing operating frequency. Judicious use of some LVt (low threshold voltage) transistors allows for a significant increase in frequency of around 15%. Traditionally, ARM publishes benchmarks for slow-slow (SS) silicon speed, but designing for typical silicon (TT) can raise performance by 20% in low-power 65nm and 45nm technologies. Overdriving the voltage also improves clock frequency but, like many performance trade-offs, it comes at the expense of higher power consumption.
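
To see what combining such techniques might yield, the sketch below multiplies the individual gains together. That the gains are independent and compound multiplicatively is an assumption made here for illustration; the text gives only the individual figures of around 15% and 20%, and the function name is hypothetical.

```python
import math
from typing import Iterable


def combined_frequency_gain(gains: Iterable[float]) -> float:
    """Combine fractional frequency gains, assuming they are independent.

    Each gain is a fraction (0.15 means 15%). The result is the overall
    fractional gain if every technique scales the clock by its own factor.
    """
    return math.prod(1.0 + g for g in gains) - 1.0


if __name__ == "__main__":
    # LVt transistors (about 15%) plus designing for TT instead of SS (20%).
    print(f"{combined_frequency_gain([0.15, 0.20]):.0%}")  # 38%
```

Under that assumption the two techniques together give about 38%, slightly more than the 35% a simple sum would suggest.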

Choosing a high-performance processor requires a careful approach to interpreting benchmark data. Developers need to consider whether the benchmark conditions enable a fair comparison between different cores. When it comes to predicting performance accurately, the use of realistic de-rating information is essential. Combining performance-focused implementation enhancements can give a significant boost to clock frequencies.

ARM continues to invest heavily to deliver reliable and robust functional and implementation-based benchmarks, helping our Partners make the best choices for their high-performance applications.



