Yeah, it's pretty frustrating not having inline ARM assembly available in Visual Studio. Can't you rewrite a small number of functions in assembly, cross-assemble them with armasm, and just link the resulting object files in VS2012?
Have you checked the disassembly of your release build? I'd hope that some embarrassingly parallel operations, like large memcpys, get vectorised by the compiler. I assume you've already taken care of the obvious C optimisations for an interpreter. For me these were:
Using a tree of function pointers organised similarly to the real CPU's decode (such that the hardware decode time is roughly proportional to the call depth of the interpreter for a given instruction).
Passing just the op code into the function tree as a constant. The addressing mode functions return pointers to members of the CPU context struct.
Using pointers that can be leveraged directly by the target platform's addressing modes.
Not using objects where they weren't needed (sacrifice maintainability for speed/reduced overhead).
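To make the list above a bit more concrete, here is a rough sketch of the function-pointer tree idea in plain C. It is not my actual 68k core: the opcode encoding, the CpuContext layout, and every name in it (cpu_step, ea_data_reg, group_arith and so on) are made up for illustration. The top nibble of the opcode picks an entry in the first table, and a group entry indexes a second table, so the call depth follows the decode depth. The addressing-mode helper returns a pointer straight into the context struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical CPU context: eight data registers plus an illegal-op flag. */
typedef struct {
    uint32_t d[8];
    int illegal;
} CpuContext;

/* Every node in the tree has the same signature and receives the raw opcode. */
typedef void (*OpHandler)(CpuContext *cpu, uint16_t opcode);

/* Addressing-mode helper: returns a pointer to a member of the context. */
static uint32_t *ea_data_reg(CpuContext *cpu, unsigned reg)
{
    return &cpu->d[reg & 7u];
}

/* Made-up encoding: bits 11..9 select the destination, bits 2..0 the source. */
static void op_move(CpuContext *cpu, uint16_t opcode)
{
    *ea_data_reg(cpu, (opcode >> 9) & 7u) = *ea_data_reg(cpu, opcode & 7u);
}

static void op_add(CpuContext *cpu, uint16_t opcode)
{
    *ea_data_reg(cpu, (opcode >> 9) & 7u) += *ea_data_reg(cpu, opcode & 7u);
}

static void op_sub(CpuContext *cpu, uint16_t opcode)
{
    *ea_data_reg(cpu, (opcode >> 9) & 7u) -= *ea_data_reg(cpu, opcode & 7u);
}

static void op_illegal(CpuContext *cpu, uint16_t opcode)
{
    (void)opcode;
    cpu->illegal = 1;
}

/* Second level of the tree: bit 8 chooses between add and sub. */
static const OpHandler arith_group[2] = { op_add, op_sub };

static void group_arith(CpuContext *cpu, uint16_t opcode)
{
    arith_group[(opcode >> 8) & 1u](cpu, opcode);
}

/* First level of the tree: indexed by the top nibble; unset slots are NULL. */
static const OpHandler top_level[16] = {
    [0x1] = op_move,
    [0xD] = group_arith,
};

/* Decode and execute one opcode by walking the tree. */
void cpu_step(CpuContext *cpu, uint16_t opcode)
{
    assert(cpu != NULL);
    OpHandler handler = top_level[opcode >> 12];
    if (handler == NULL) {
        handler = op_illegal;
    }
    handler(cpu, opcode);
}
```

In a real core the tables would be far bigger and probably generated, and you might keep the context in a global rather than passing a pointer around. The point is just that leaf handlers only do pointer dereferences on the context, which the compiler can usually turn into simple register-plus-offset loads and stores.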
I have been toying with writing a Megadrive emulator for WP8. I wrote a nice little interpreter for the 68k in plain C, while the VDP is a class (haven't bothered with the audio yet). But I decided to wait for SDL to be ported to WP; I just can't stand all this DirectX boilerplate, and DirectXTK seems half-arsed in comparison to SDL 2.