ARMulating cached cores (e.g. ARM940T)
Applies to: ARMulator/RVISS, Software Development Toolkit (SDT)

Description
ARMulator models of ARM's cached cores are provided in SDT and ADS, to allow easy benchmarking of code and comparisons between cores.

Executing instructions & counting cycles
Consider the instruction: LDR r0, [r1]

For an uncached processor (such as the ARM7TDMI) operating from perfect memory, the number of cycles to execute a particular instruction is predictable.

For a cached core this is not so: many factors can affect the time an instruction takes to execute. For example:

  • Is the instruction cached?
  • Is the address contained in r1 cached?
  • Is the write buffer draining?
  • If the processor has an MMU, has a TLB miss occurred?

The situation is made more complex because ARM's cached cores also support streaming - whereby during cache line fills, information is made available to the core at the same time as it is written into the cache.

Under the ADS ARMulator, it is possible to examine cache, TLB (Translation Look-aside Buffer) and write buffer information by enabling verbose statistics for the relevant ARMulator memory model. For non-ARM9 cores, this is done by editing the armul.cnf file in the ADS\bin directory and setting the variable Counters=True. It is recommended that you take a copy of this file before making any modifications.

For ARM9 cores, enabling verbose statistics requires an extra line to be added to the relevant memory model definitions. For example:

{ ARM920CacheMMU Counters=True ;; add this line 

(this is only possible with ADS, not SDT).

Note that cached processor models have their caches enabled by default, with the lower 128MB being cacheable. Memory speeds can be set as a fraction of the core speeds using the armul.cnf variable MCCFG. For example, setting this variable to 3 will set the memory clock to be one third that of the core clock.
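The MCCFG arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the division, not ARMulator code: the function name and the 150MHz core clock are hypothetical, and MCCFG is assumed to be a positive integer divisor.

```python
def memory_clock_hz(core_clock_hz: float, mccfg: int) -> float:
    """Memory clock implied by a core clock and an MCCFG divisor.

    Assumption: MCCFG is a positive integer, and the memory clock is
    simply the core clock divided by it.
    """
    if mccfg < 1:
        raise ValueError("MCCFG is assumed to be a positive integer divisor")
    return core_clock_hz / mccfg


if __name__ == "__main__":
    core_hz = 150e6  # hypothetical core clock, for illustration only
    for divisor in (1, 2, 3):
        mhz = memory_clock_hz(core_hz, divisor) / 1e6
        print(f"MCCFG={divisor}: memory clock {mhz:.1f} MHz")
```

With MCCFG=3 the memory clock comes out at one third of the core clock, as in the example above.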

Certain ARMulator models, such as the ARM920T, operate in Fastbus mode by default and require a coprocessor write to switch to synchronous mode. This can be achieved by setting bit 30 in coprocessor 15 register 1 (the sequence below also clears bit 31). For example:

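MRC p15, 0, r0, c1, c0, 0       ; read CP15 register 1 into r0
BIC r0, r0, #0xc0000000         ; clear bits 31 and 30
ORR r0, r0, #0x40000000         ; set bit 30
MCR p15, 0, r0, c1, c0, 0       ; write the result back to CP15 register 1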
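The effect of the BIC/ORR pair on the register value can be checked with a small Python model. This is a sketch, not ARMulator code. The bit names (iA for bit 31, nF for bit 30) and the decoding of the other clock modes are not given in this FAQ; they are an assumption based on the ARM920T control register description and should be confirmed against the Technical Reference Manual for your core.

```python
IA_BIT = 1 << 31  # assumed name: iA, asynchronous clock select
NF_BIT = 1 << 30  # assumed name: nF, notFastBus select


def select_synchronous(cp15_c1: int) -> int:
    """Mirror the BIC/ORR sequence on a 32-bit control register value."""
    value = cp15_c1 & ~(IA_BIT | NF_BIT) & 0xFFFFFFFF  # BIC r0, r0, #0xc0000000
    value |= NF_BIT                                    # ORR r0, r0, #0x40000000
    return value


def clock_mode(cp15_c1: int) -> str:
    """Decode bits 31:30 (assumed ARM920T encoding)."""
    ia = bool(cp15_c1 & IA_BIT)
    nf = bool(cp15_c1 & NF_BIT)
    if not ia and not nf:
        return "fastbus"
    if not ia and nf:
        return "synchronous"
    if ia and nf:
        return "asynchronous"
    return "reserved"


if __name__ == "__main__":
    before = 0x00000078  # hypothetical control register value, Fastbus mode
    after = select_synchronous(before)
    print(f"{before:#010x} ({clock_mode(before)}) -> {after:#010x} ({clock_mode(after)})")
```

Clearing both bits before setting bit 30 means the sequence gives the same result whatever mode the core was in beforehand.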

Why does the ARMulator show zero N-cycles?

When ARMulating cores with MMUs and AMBA interfaces (e.g. 740T, 920T, 940T), you will not see any N-cycles in $statistics or $memstats, even if your code contains branch instructions. The only cycle counts shown by the ARMulator for these cores are the two AMBA cycle types:

  • address-only ('A') cycle - an address is published (speculatively), but no data is transferred, and
  • sequential ('S') cycle - data is transferred from the current address.

ARMulator will always (correctly) show N-cycles=0 in its $statistics for these cores, because a non-sequential access is done with an A-cycle followed by an S-cycle ('merged I-S' cycle). Please refer to the Data Sheets for the AMBA interface description of cycle types for each core.

A-cycles are shown in $statistics under the heading 'I_Cycle' to correspond with the ARM7TDMI cycle labelling.
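The accounting described above can be illustrated with a simplified Python model. It is not how the ARMulator is implemented: it assumes word-sized accesses, treats an access as sequential only when its address is the previous address plus 4, and counts the first access as non-sequential. The function and dictionary key names are hypothetical.

```python
def count_amba_cycles(word_addresses):
    """Count A and S cycles for a list of word-access addresses.

    Simplified model: a sequential access costs one S-cycle; a
    non-sequential access costs one A-cycle followed by one S-cycle,
    so no N-cycles are ever recorded.
    """
    a_cycles = 0
    s_cycles = 0
    previous = None
    for address in word_addresses:
        sequential = previous is not None and address == previous + 4
        if not sequential:
            a_cycles += 1  # address-only cycle before the data transfer
        s_cycles += 1      # the data transfer itself
        previous = address
    # A-cycles are reported under the I_Cycle heading in $statistics.
    return {"N_Cycles": 0, "S_Cycles": s_cycles, "I_Cycles": a_cycles}


if __name__ == "__main__":
    # Three sequential fetches, then a branch to 0x9000 and one more fetch.
    trace = [0x8000, 0x8004, 0x8008, 0x9000, 0x9004]
    print(count_amba_cycles(trace))
```

In this model a branch shows up as an extra count under I_Cycles rather than as an N-cycle, which matches the behaviour described above.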

Why does the 940T seem slow when its cache is disabled?

When the ARM940T cache is disabled, each instruction fetch (a single read which misses in the cache) will typically cost 4 I cycles (depending upon clock mode), followed by an S cycle. In the worst case, all of the following steps are required:

  • 1 cycle cache miss: BCLK if fastbus mode, FCLK otherwise
  • 1 internal cycle: BCLK if fastbus mode, FCLK otherwise; used for some internal decoding
  • Synchronisation: none in fastbus mode, max 1/2 BCLK in synchronous mode, max 1 BCLK in asynchronous mode
  • Write buffer drain: the number of BCLK cycles depends on the AMBA interface and is system specific
  • 1 cycle address only: this takes longer than 1 cycle, but is factored into either the synchronisation period or the write buffer drain
  • 1 cycle word fetch: one BCLK cycle to perform the word fetch

You can enable or disable the ARM940T cache/PU with:
"UsePageTables=True" in armul.cnf to enable the cache/PU
"UsePageTables=False" in armul.cnf to disable the cache/PU

An important point to note is that for small sequential code examples, where the cache is empty/unused, any cached processor (not just ARM940T) will perform worse than one with no cache. Cached processors will only show performance benefits with code that contains loops and with memory that requires wait states.
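A rough cost model makes this point concrete. The Python sketch below is an illustration under stated assumptions, not a description of any real core's timing: it assumes a 4-word cache line, a hypothetical fixed overhead of 2 cycles per line fill, code that fits entirely in the cache, and a cache hit costing 1 cycle. Words fetched for the first time are treated as delivered during the line fill.

```python
def uncached_cycles(fetches: int, wait_states: int) -> int:
    """Every instruction fetch goes to memory."""
    return fetches * (1 + wait_states)


def cached_cycles(unique_words: int, fetches: int, wait_states: int,
                  line_words: int = 4, fill_overhead: int = 2) -> int:
    """Line fills for the first touch of each line, 1-cycle hits afterwards.

    line_words and fill_overhead are assumed, illustrative parameters.
    """
    lines = -(-unique_words // line_words)  # ceiling division
    fill_cost = lines * (fill_overhead + line_words * (1 + wait_states))
    hits = fetches - unique_words           # later fetches of cached words
    return fill_cost + hits


if __name__ == "__main__":
    words = 16
    cases = [
        ("straight-line, 2 wait states", words, 2),
        ("100-iteration loop, 2 wait states", words * 100, 2),
        ("100-iteration loop, 0 wait states", words * 100, 0),
    ]
    for label, fetches, ws in cases:
        print(f"{label}: uncached={uncached_cycles(fetches, ws)} "
              f"cached={cached_cycles(words, fetches, ws)}")
```

Under these assumptions the cached model is slower for straight-line code (the line fill overhead is never recovered) and for loops running from zero-wait-state memory, and only pulls ahead when the code loops and the memory has wait states.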





