Profiling and Optimization

Now it's time to see how fast our unoptimized code is running.

We'll use the other two hardware timers GBA offers to measure the time it takes for us to complete the update routine. We'll can put the profiling code in main.c (also some code to print the values).

volatile int profile;
volatile int profile_peak;

void profileStart()
{
	// disable timers
	REG_TM2CNT = 0;
	REG_TM3CNT = 0;
	
	// enable timer3 as slave, enable timer2 as counter
	REG_TM3CNT_H = TIMER_START | TIMER_COUNT;
	REG_TM2CNT_H = TIMER_START;

	// (this gives us a 32-bit counter)
}

void profileEnd()
{
	// disable timers
	REG_TM2CNT_H = 0;
	REG_TM3CNT_H = 0;
	
	// delay a few cycles
	REG_TM3CNT_H = 0;
	REG_TM3CNT_H = 0;
	REG_TM3CNT_H = 0;
	
	// combine the two counter values into a cycle count
	profile = REG_TM2CNT_L | ((REG_TM3CNT_L)<<16);

	if( profile > profile_peak )
		profile_peak = profile;
}

int main( void )
{
	// initialize interrupt handler
	irqInit();
	irqEnable( IRQ_VBLANK );

	MOD_Setup();
	MOD_Play( (void*)SAC09_MOD );

	consoleInit( 0, 4, 0, NULL, 0, 15 );
	BG_COLORS[0] = RGB8( 0, 0, 255 );
	BG_COLORS[241] = RGB8( 255, 255, 255 );
	SetMode( MODE_0 | BG0_ON );

	iprintf( "\n Another MOD Player..." );
	
	while(1) {
		int a;
		a = profile;
		iprintf( "\x1b[3;0H Prof: %8i", a );
		a = profile_peak;
		iprintf( "\x1b[4;0H Peak: %8i", a );
		VBlankIntrWait();
	}
	return 0;
}

Then in our MOD_Routine:

/*****************************************************************
 * MOD_Routine
 *
 * Work-routine. Called every time half of the wave buffer
 * gets processed.
 *****************************************************************/
.thumb_func
MOD_Routine:
	
	push	{r4-r5}				@ preserve registers
	push	{lr}
	
	bl	profileStart

	ldr	r4,=bufferSlice			@ if slice == 1 then reset DMA!
	ldrb	r5, [r4]
	cmp	r5, #0
	beq	1f
	bl	MOD_ResetDMA
	mov	r0, #0				@ toggle slice
	b	2f
1:
	mov	r0, #1				@ toggle slice
2:
	strb	r0, [r4]

	bl	MOD_Frame			@ do MOD playback routine
	
	lsl	r5, #6				@ r5 = slice * 64
	ldr	r0,=MOD_WaveBufferL		@ call mixing function
	add	r0, r5				@ use left wavebuffer as target
	ldr	r1,=MOD_VoicesLeft		@ left voices as source
	ldr	r4,=MOD_MixChunk
	bl	_call_via_r4

	ldr	r0,=MOD_WaveBufferR		@ generate right output
	add	r0, r5
	ldr	r1,=MOD_VoicesRight
	bl	_call_via_r4

	bl	profileEnd
	
	pop	{r3-r5}				@ return arm/thumb
	bx	r3

Hm... it says ~38K cycles, that doesn't make any sense. Let's make a percentage out of that.

In MOD_Routine we are outputting 64 samples. At 16KHz, 64 samples will take 65536 CPU cycles to play. So to get a percentage we do ((Cycles*100) / 65536). Let's also slow down the refresh rate so we can read the numbers (redraw every 10 frames).

int main( void )
{
	// initialize interrupt handler
	irqInit();
	irqEnable( IRQ_VBLANK );
	
	MOD_Setup();
	MOD_Play( (void*)SAC09_MOD );
	
	consoleInit( 0, 4, 0, NULL, 0, 15 );
	BG_COLORS[0] = RGB8( 0, 0, 255 );
	BG_COLORS[241] = RGB8( 255, 255, 255 );
	SetMode( MODE_0 | BG0_ON );
	
	iprintf( "\n Another MOD Player..." );

	int c;
	c = 0;
	
	while(1) {
		if( c == 0 )
		{
			int a;
			a = profile;
			iprintf( "\x1b[3;0H Prof: %8i (%3i%%)", a, (a * 100) / 65536 );
			a = profile_peak;
			iprintf( "\x1b[4;0H Peak: %8i (%3i%%)", a, (a * 100) / 65536 );
			c = 10;
		}
		else
		{
			c--;
		}
		VBlankIntrWait();
	}
	return 0;
}

Okay, we are using 57% of the CPU's time... heh :)

BTW, if you don't have hardware to test on, no$gba gives very accurate results... except for one thing, you need a GBA BIOS file to get accurate BIOS function results (division). I'm using no$gba + bios dump to get these results. I can't share my BIOS copy because it's copyrighted, you'll have to get it yourself.

Anyway, first thing we can do to speed things up is use "3,1" ROM cartridge timing. When the GBA boots up, the waitstate controls are set to "4,2" timing, which is a bit slower. Normal cartridges support 3,1 timing. I got this value straight from GBATEK.

REG_WAITCNT = 0x4317; // WS0/ROM=3,1 clks; SRAM=8 clks; WS2/EEPROM: 8,8 clks; prefetch enabled;

Okay, now the CPU load dropped to 37%!

Another thing we can do to drop the load is put CPU intense parts in IWRAM. I suspect a huge boost after this. First of all, ARM code is really really slow when it's ran from ROM. This is because ARM opcodes are 32-bit, so it needs 2 fetches from a 16-bit memory system (like ROM). Second of all, IWRAM has *zero* waitstates, so opcode fetches will be even faster. We'll put the software mixing code in IWRAM (MOD_MixChunk).

.section ".iwram", "ax", %progbits
.arm
.align

/***********************************************************
 * MOD_MixChunk( target, voices )
 *
 * Mix a chunk of audio data and write it to .
 ***********************************************************/
MOD_MixChunk:
	...

Now it says 16%! We're almost there :). This is the end of the easy optimizations though :).

Previous: Many MOD EffectsContentsNext: Optimizing the Sound Mixer