Wednesday 7 December 2016

Part 15: Emulating the loading screen

Foreword

In the previous post we implemented tape emulation.

So what are we going to do in this post?

Well, up to now we have been using the KERNAL ROM and BASIC ROM as the baseline for evolving our emulator.

For the rest of these posts, I will be using a tape image of the game Dan Dare as the next baseline for evolving our Android emulator.

When the game Dan Dare loads from tape, it shows some visual effects that include flashing borders and a splash screen.

So, as a next step we will try to emulate these flashing borders and the splash screen. There will, however, be a bit of work ahead of us to get to this point.

So, let us start!

Introduction to scanline rendering

The loader achieves the flashing border effect by changing the border color many times while the frame is being drawn to the screen line by line.

So, in order to get the flashing border effect in our emulator, we will need to implement scan line rendering. Currently this is not the case. We execute a frame's worth of instruction cycles and after that we render the whole screen in one shot. With this type of rendering you will always see a border in a single solid color.

How do we implement scan line rendering? With this kind of rendering you don't leave rendering a frame for the last moment when you have executed a whole frame's worth of cycles. Rather, you render each time you have executed a line's worth of cycles. On a C64 a line's worth of cycles is 63 clock cycles.
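To see why sampling once per line matters, here is a minimal sketch. It is not emulator code: the "loader" is a made-up routine that bumps the border color every 500 cycles, the frame height of 312 lines is an assumption (PAL), and all sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CYCLES_PER_LINE 63   /* from the text above */
#define LINES_PER_FRAME 312  /* assumption: PAL frame height */

/* Hypothetical stand-in for the border color register. */
static uint8_t sketch_border_color;

/* Hypothetical "loader": bumps the border color every 500 cycles. */
static void sketch_cpu_cycle(long cycle) {
  if (cycle % 500 == 0)
    sketch_border_color = (uint8_t)((sketch_border_color + 1) & 0x0f);
}

/* Runs one frame and samples the border color at the end of every line.
   Returns how many lines differ in color from the line before them. */
int sketch_run_frame(uint8_t line_colors[LINES_PER_FRAME]) {
  long cycle;
  int line = 0;
  int changes = 0;
  assert(line_colors != NULL);
  sketch_border_color = 0;
  for (cycle = 1; cycle <= (long)CYCLES_PER_LINE * LINES_PER_FRAME; cycle++) {
    sketch_cpu_cycle(cycle);
    if (cycle % CYCLES_PER_LINE == 0) {
      /* A line's worth of cycles has passed: "render" this line. */
      line_colors[line] = sketch_border_color;
      if (line > 0 && line_colors[line] != line_colors[line - 1])
        changes++;
      line++;
    }
  }
  return changes;
}
```

A renderer that only looks at the border color once, at the end of the frame, would paint every line in the final color. Sampling per line keeps each color change, which is exactly what produces the stripes.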

Interestingly enough, in my series on writing a JavaScript C64 emulator, we used a rendering scheme that was more granular than scan lines. In that emulator we actually rendered pixels after each 6502 instruction execution! This type of rendering, by the way, is called cycle based rendering.

Cycle based rendering is more complicated than scan line based rendering in many respects. In fact, I am not sure whether an Android mobile device has enough horsepower to deal with cycle based rendering in real time.

So, with our Android C64 emulator I will stick with scan line based rendering.

Creating a new timer

In effect, we want a rendering action to kick off each time about a scan line's worth of instruction cycles has passed.

From the previous section we know that each scan line is worth 63 6502 cycles on the C64.

So, we need to schedule a task that gets executed every 63 cycles.

We already have a timer mechanism in place that we currently use for the CIA timers and for tape emulation. So, it should be fairly easy for us to add another timer to the equation.

First we need to create a file called video.c. Within this file we define the following:

void video_line_expired(struct timer_struct *tdev) {
  tdev->remainingCycles = 63;
}

struct timer_struct getVideoInstance() {
  struct timer_struct myVideo;
  myVideo.expiredevent = &video_line_expired;
  myVideo.remainingCycles = 63;
  myVideo.started = 1;
  return myVideo;
}


We have defined a function that will return a timer_struct instance for video rendering. This timer will expire after 63 clock cycles. When the expiry event is invoked (i.e. the function video_line_expired), remainingCycles will be reset to 63 clock cycles.
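For readers who have not followed the earlier posts, the following is a rough sketch of how such a countdown timer list can work. It is not the emulator's actual timer code: sketch_timer, sketch_add_timer and sketch_tick are hypothetical names, the struct only mirrors the three fields used above, and carrying the overshoot into the next period is my own assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down version of a timer entry. */
struct sketch_timer {
  int remainingCycles;
  int started;
  void (*expiredevent)(struct sketch_timer *t);
};

#define SKETCH_MAX_TIMERS 8
static struct sketch_timer *sketch_timers[SKETCH_MAX_TIMERS];
static int sketch_timer_count = 0;

/* Registers a timer; returns 0 on success, -1 if the list is full. */
int sketch_add_timer(struct sketch_timer *t) {
  assert(t != NULL);
  if (sketch_timer_count >= SKETCH_MAX_TIMERS)
    return -1;
  sketch_timers[sketch_timer_count++] = t;
  return 0;
}

/* Called after every instruction with the cycles that instruction used. */
void sketch_tick(int cycles) {
  int i;
  for (i = 0; i < sketch_timer_count; i++) {
    struct sketch_timer *t = sketch_timers[i];
    if (!t->started)
      continue;
    t->remainingCycles -= cycles;
    if (t->remainingCycles <= 0) {
      int overshoot = -t->remainingCycles;
      t->expiredevent(t);               /* handler reloads remainingCycles */
      t->remainingCycles -= overshoot;  /* keep the average period exact */
    }
  }
}

/* Example handler: counts scan lines, reloading 63 cycles each time. */
int sketch_lines_seen = 0;
void sketch_line_expired(struct sketch_timer *t) {
  t->remainingCycles = 63;
  sketch_lines_seen++;
}
```

Since 6502 instructions take between 2 and 7 cycles, a timer will rarely land exactly on zero, which is why the sketch carries the overshoot forward instead of dropping it.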

Finally, we need to ensure that our new timer gets added to the list of timers that get processed after each instruction execution:

void
Java_com_johan_emulator_engine_Emu6502_memoryInit(JNIEnv* pEnv, jobject pObj)
{
  timerA = getTimerInstanceA();
  add_timer_to_list(&timerA);
  timerB = getTimerInstanceB();
  add_timer_to_list(&timerB);
  tape_timer = getTapeInstance();
  add_timer_to_list(&tape_timer);
  video_timer = getVideoInstance();
  add_timer_to_list(&video_timer);
}

Defining the Color Palette

In my JavaScript C64 emulator series I defined a 16-element array containing the C64 color palette as RGB values.

We can use this array declaration as is in our Android Emulator, with some minor syntax tweaks:

jchar colors_RGB_888[16][3] = {
  {0, 0, 0},
  {255, 255, 255},
  {136, 0, 0},
  {170, 255, 238},
  {204, 68, 204},
  {0, 204, 85},
  {0, 0, 170},
  {238, 238, 119},
  {221, 136, 85},
  {102, 68, 0},
  {255, 119, 119},
  {51, 51, 51},
  {119, 119, 119},
  {170, 255, 102},
  {0, 136, 255},
  {187, 187, 187}
};

A thing to keep in mind, however, is that the bitmap we are writing to is not in RGB_888 format but in RGB_565, which uses 5 bits for the red channel, 6 bits for the green channel and another 5 bits for the blue channel. Our bitmap therefore uses two bytes per pixel instead of the three bytes per pixel of our palette.

We could change our 16 color palette declaration by hand to RGB_565. However, this is prone to errors. Instead I am going to do the conversion in code at start up and populate a new array:

...
jchar colors_RGB_565[16]; 
...
void initialise_video() {
  int i;
  for (i=0; i < 16; i++) {
    int red = colors_RGB_888[i][0] >> 3;
    int green = colors_RGB_888[i][1] >> 2;
    int blue = colors_RGB_888[i][2] >> 3;
    colors_RGB_565[i] =  (red << 11) | (green << 5) | (blue << 0);
  }
}
...

One should now just ensure that initialise_video() gets called as part of the initialisation process.
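To make the bit packing concrete, here is the same conversion as a small standalone function, with one palette entry worked through by hand. The function name sketch_rgb888_to_565 is made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Packs three 8-bit channels into one RGB_565 value: rrrrrggg gggbbbbb.
   The low 3 bits of red and blue and the low 2 bits of green are dropped. */
uint16_t sketch_rgb888_to_565(int red, int green, int blue) {
  assert(red >= 0 && red <= 255);
  assert(green >= 0 && green <= 255);
  assert(blue >= 0 && blue <= 255);
  return (uint16_t)(((red >> 3) << 11) | ((green >> 2) << 5) | (blue >> 3));
}
```

Take palette entry 3, {170, 255, 238}: red becomes 170 >> 3 = 21, green becomes 255 >> 2 = 63 and blue becomes 238 >> 3 = 29. Packed together this gives (21 << 11) | (63 << 5) | 29 = 0xAFFD.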

Defining Process Flow

Let us define the general process flow when we process a scan line.

This is more or less defined by the following bits of code within video.c:

...
int line_count = 0;
...
extern int frameFinished;
...
void video_line_expired(struct timer_struct *tdev) {
  tdev->remainingCycles = 63;
  processLine();
  line_count++;
  if (line_count > 310) {
    line_count = 0;
    frameFinished = 1;
  }
}

All processing for the current line will be performed within the processLine() function, which we will cover later.

We also keep track of the number of lines we have already processed within the frame. As soon as we reach line 311 we wrap back to zero.

In effect, we also need to keep track of the number of lines so that we know when we are finished with the current frame.

Previously the runBatch method within cpu.c decided when the current frame was finished by looping through exactly 20000 cycles. This needs to change, since we need to sync runBatch to the exact moment when video.c has finished rendering the frame. If we don't do that, we might end up writing frames to the screen that contain some residue of the previous frame.

The frameFinished variable will help us with this syncing. As you can see, frameFinished is declared with extern, which means that the variable is actually defined in another file. In this case, I have defined it within cpu.c.

Our modified runBatch method will look as follows:

...
int frameFinished = 0;
...
int runBatch(int address) {
  //remainingCycles = 20000;
  frameFinished = 0;
  int lastResult = 0;
  while ((!frameFinished) && (lastResult == 0)) {
    lastResult = step();
    if (lastResult != 0)
      break;
    if ((address > 0) && (pc == address)) {
      lastResult = -1;
      break;
    }
    processAlarms();
    
  }

  return lastResult;
}


Drawing a scan line

Let us now discuss the process of drawing a scanline. This is performed within the function processLine():

static inline void processLine() {
  if (line_count > 299)
    return;

  updatelineCharPos();
  fillColor(24, memory_read(0xd020) & 0xf);
  int screenEnabled = (memory_read(0xd011) & 0x10) ? 1 : 0;
  if (screenLineRegion && screenEnabled) {
    drawScreenLine();
  } else {
    fillColor(320, memory_read(0xd020) & 0xf);
  }
  fillColor(24, memory_read(0xd020) & 0xf);
}


Firstly, you will see that if the line number is bigger than 299 we exit the processLine function altogether. This is because only the first 300 lines of the frame are displayed; the remaining lines fall within vertical blanking and are not drawn at all.

However, we still need to account for these lines to get us closer to true emulation speed.

The updatelineCharPos() function ensures that we always have an up-to-date pointer to the beginning of the character row within screen character memory that we are currently busy with.

The process of drawing a line is basically as follows:

  1. Draw left border
  2. Draw a 320 pixel line of main screen
  3. Draw right border

Step 2 can also be drawn entirely in the border color in the following cases:

  • We are currently in the top or bottom border area.
  • The screen is currently disabled, that is, bit 4 of location $D011 is set to zero.

We use the fillColor method to fill a line segment with the border color. The method looks like this:

inline void fillColor(int count, int colorEntryNumber) {
  int currentPos;
  for (currentPos = 0; currentPos < count; currentPos++) {
    g_buffer[posInBuffer] = colors_RGB_565[colorEntryNumber];
    posInBuffer++;
  }
}
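Putting the three segments together, each visible line takes up 24 + 320 + 24 = 368 pixels in the buffer, and 300 visible lines give 368 * 300 = 110400 pixels per frame. The sketch below shows that layout in isolation. The sketch_ names are hypothetical stand-ins for g_buffer and posInBuffer, and the solid "screen" color merely replaces the real character drawing.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BORDER_WIDTH 24
#define SKETCH_SCREEN_WIDTH 320
#define SKETCH_LINE_WIDTH \
  (SKETCH_BORDER_WIDTH + SKETCH_SCREEN_WIDTH + SKETCH_BORDER_WIDTH) /* 368 */
#define SKETCH_VISIBLE_LINES 300
#define SKETCH_BUFFER_SIZE (SKETCH_LINE_WIDTH * SKETCH_VISIBLE_LINES)

/* Hypothetical stand-ins for g_buffer and posInBuffer. */
uint16_t sketch_buffer[SKETCH_BUFFER_SIZE];
int sketch_pos = 0;

/* Writes count pixels of one color at the current buffer position. */
static void sketch_fill(int count, uint16_t color) {
  int i;
  assert(sketch_pos + count <= SKETCH_BUFFER_SIZE);
  for (i = 0; i < count; i++)
    sketch_buffer[sketch_pos++] = color;
}

/* Draws one line: left border, middle section, right border. The middle
   falls back to the border color outside the screen region. */
void sketch_draw_line(uint16_t border, uint16_t screen, int in_screen_region) {
  sketch_fill(SKETCH_BORDER_WIDTH, border);
  sketch_fill(SKETCH_SCREEN_WIDTH, in_screen_region ? screen : border);
  sketch_fill(SKETCH_BORDER_WIDTH, border);
}
```

Because the write position simply keeps advancing, pixel x of line y ends up at index y * 368 + x, provided the position is reset to zero at the start of every frame.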

Drawing of the main screen line is performed by drawScreenLine():

static inline void drawScreenLine() {
  int i;
  for (i = 0; i < 40; i++) {
    jchar charcode = memory_read(1024 + i + posInCharMem);
    int bitmapDataRow = charRom[(charcode << 3) | (line_in_visible & 7)];
    int j;
    int foregroundColor = memory_read(0xd800 + i + posInCharMem) & 0xf;
    int backgroundColor = memory_read(0xd021) & 0xf;
    for (j = 0; j < 8; j++) {
      int pixelSet = bitmapDataRow & 0x80;
      if (pixelSet) {
        g_buffer[posInBuffer] = colors_RGB_565[foregroundColor];
      } else {
        g_buffer[posInBuffer] = colors_RGB_565[backgroundColor];
      }
      posInBuffer++;
      bitmapDataRow = bitmapDataRow << 1;
    }
  }
}

We basically have a main loop in which we loop through the characters in screen memory for the current character row. We fetch the character code for each character and look up an 8-pixel line of bitmap data from the character ROM for that character.

We then draw each pixel of the 8-pixel line in the foreground color if it is set, otherwise in the background color.
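The two central steps, finding the right byte in the character ROM and shifting out its bits, can be tried in isolation. This is only a sketch with made-up function names, and the bit pattern in the example below is arbitrary rather than a real glyph.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each character occupies 8 consecutive bytes in the character ROM, one
   byte per pixel row, so the offset is charcode * 8 + row. */
int sketch_char_rom_offset(int charcode, int row) {
  assert(charcode >= 0 && charcode <= 255);
  return (charcode << 3) | (row & 7);
}

/* Expands one byte of bitmap data into 8 pixels, leftmost pixel first.
   Bit 7 is tested and the byte is shifted left for every pixel. */
void sketch_decode_row(uint8_t row_bits, uint16_t fg, uint16_t bg,
                       uint16_t out[8]) {
  int j;
  assert(out != NULL);
  for (j = 0; j < 8; j++) {
    out[j] = (row_bits & 0x80) ? fg : bg;
    row_bits = (uint8_t)(row_bits << 1);
  }
}
```

For example, the byte 0xA5 (binary 10100101) decodes to foreground, background, foreground, background, background, foreground, background, foreground.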

Testing and Debugging

When I took the code changes done in this post for a test drive a couple of bugs surfaced.

Since I am basing this series of blog posts on my previous series on a JavaScript emulator, all these bugs look kind of familiar :-) We can therefore leverage past learnings and avoid some painful debugging exercises.

In the process of loading the game from the tape image, I experienced two major bugs. Both of these bugs were related to C64 bank switching not being implemented.

So, let us quickly spend some time implementing bank switching. I am going to implement bank switching for KERNAL ROM, BASIC ROM and the IO area.

As you might know, bank switching is controlled via the lower three bits of memory location 1. The combinations that enable the various banking configurations are not very intuitive, let alone writing understandable emulation code for them.

To simplify our world a bit, we can create a lookup table accepting the lower three bits of memory location 1 as a parameter and returning a flag byte. This flag byte will then contain a bit each for BASIC ROM, KERNAL ROM, IO and CHARROM. Each bit indicates whether the corresponding region is visible.

We add the following code to memory.c:

#define BASIC_VISIBLE 1
#define KERNAL_VISIBLE 2
#define CHAR_ROM_VISIBLE 4
#define IO_VISIBLE 8

int bank_visibility[8] =
{
  0,//000
  CHAR_ROM_VISIBLE,//001
  CHAR_ROM_VISIBLE | KERNAL_VISIBLE,//010
  BASIC_VISIBLE | KERNAL_VISIBLE | CHAR_ROM_VISIBLE,//011
  0,//100
  IO_VISIBLE,//101
  IO_VISIBLE | KERNAL_VISIBLE,//110
  BASIC_VISIBLE | KERNAL_VISIBLE | IO_VISIBLE//111
};
...
inline int kernalROMEnabled() {
  int bankBits = mainMem[1] & 7;
  return (bank_visibility[bankBits] & KERNAL_VISIBLE) ? 1 : 0;
}

inline int basicROMEnabled() {
  int bankBits = mainMem[1] & 7;
  return (bank_visibility[bankBits] & BASIC_VISIBLE) ? 1 : 0;
}

inline int IOEnabled() {
  int bankBits = mainMem[1] & 7;
  return (bank_visibility[bankBits] & IO_VISIBLE) ? 1 : 0;
}


We now have some helper methods that tell us which banks are enabled.

We now change memory_read and memory_write as follows:
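To get a feel for the table, here is a small sketch that answers the question "what does the CPU see at this address for a given value of location 1?". It repeats the table so that it compiles on its own. The enum and sketch_region_at are hypothetical and not part of memory.c, and unlike the emulator at this stage the sketch also reports the character ROM.

```c
#include <assert.h>

#define BASIC_VISIBLE 1
#define KERNAL_VISIBLE 2
#define CHAR_ROM_VISIBLE 4
#define IO_VISIBLE 8

/* Same table as above, indexed by the lower three bits of location 1. */
static const int sketch_bank_visibility[8] = {
  0,
  CHAR_ROM_VISIBLE,
  CHAR_ROM_VISIBLE | KERNAL_VISIBLE,
  BASIC_VISIBLE | KERNAL_VISIBLE | CHAR_ROM_VISIBLE,
  0,
  IO_VISIBLE,
  IO_VISIBLE | KERNAL_VISIBLE,
  BASIC_VISIBLE | KERNAL_VISIBLE | IO_VISIBLE
};

enum sketch_region {
  SKETCH_RAM, SKETCH_BASIC_ROM, SKETCH_KERNAL_ROM, SKETCH_IO, SKETCH_CHAR_ROM
};

/* Returns what the CPU sees at address for a given value of location 1. */
enum sketch_region sketch_region_at(int address, int port_value) {
  int flags = sketch_bank_visibility[port_value & 7];
  assert(address >= 0 && address < 0x10000);
  if (address >= 0xa000 && address < 0xc000 && (flags & BASIC_VISIBLE))
    return SKETCH_BASIC_ROM;
  if (address >= 0xe000 && (flags & KERNAL_VISIBLE))
    return SKETCH_KERNAL_ROM;
  if (address >= 0xd000 && address < 0xe000) {
    if (flags & IO_VISIBLE)
      return SKETCH_IO;
    if (flags & CHAR_ROM_VISIBLE)
      return SKETCH_CHAR_ROM;
  }
  return SKETCH_RAM;
}
```

With the power-on value $37 (lower bits 111) everything except the character ROM is mapped in. A value of $35 (101) leaves only the IO area visible, and $34 (100) exposes RAM across the whole address space.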

...
jchar IOUnclaimed[4096];
...
jchar memory_read(int address) {
  if ((address >=0xa000) && (address < 0xc000) && basicROMEnabled())
    return basicROM[address & 0x1fff];
  else if ((address >=0xe000) && (address < 0x10000) && kernalROMEnabled())
    return kernalROM[address & 0x1fff];
  else if (address == 1)
    return read_port_1();
  else if ((address >=0xd000) && (address < 0xe000) && IOEnabled()) {
    if ((address >=0xdc00) && (address < 0xdc10))
      return cia1_read(address);
    else
      return IOUnclaimed[address & 0xfff];
  }
  else
    return mainMem[address];
}

void memory_write(int address, jchar value) {
  //if (((address >= 0xa000) && (address < 0xc000)) |
  //     ((address >= 0xe000) && (address < 0x10000)))
  //  return;

  if (address == 1)
    write_port_1(value);
  else if ((address >=0xd000) && (address < 0xe000) && IOEnabled()) {
    if ((address >=0xdc00) && (address < 0xdc10))
      cia1_write(address, value);
    else
      IOUnclaimed[address & 0xfff] = value;
  }
  else
    mainMem[address] = value;
}

We have defined the array IOUnclaimed for reads and writes when the IO region is enabled.

With all these changes applied, let us look at some screenshots of the emulator loading the game:






The last screen appears a bit garbled because we haven't implemented the full VIC-II memory model yet. We will tackle this in the next post.


Performance considerations

With all the scan line rendering code that we wrote, I was curious to know what performance penalty this functionality caused. This is quite a general concern when developing software for a mobile device because of resource limitations.

I got hold of the code we developed in the post Part 12: Emulating the keyboard and used it to do some baseline benchmarks.

In that post we still rendered the frame only after executing a frame's worth of CPU cycles. So I measured the time separately for running runBatch() and for running populateFrame.

On average the measurements showed between 1 and 2 milliseconds for running runBatch, and less than a millisecond for running populateFrame.

With this baseline in mind, I ran a benchmark for the code developed in this post. The rendering functionality developed in this post is woven in between executing CPU code, so it is not feasible to get two separate timings for CPU execution and rendering per frame. So, for this post I only retrieved a single average timing.

This timing value was a bit of a disappointment. The average time per frame, which includes CPU execution and rendering, was between 6 and 7 milliseconds.
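I won't go into the details of how I took these timings, but if you would like to measure something similar yourself, one possible approach is sketched below. It assumes the POSIX function clock_gettime() is available, as it is in the Android NDK. The names sketch_elapsed_ms, sketch_time_frame_ms and sketch_frame are made up, and sketch_frame is only a stand-in for one frame of emulation.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Difference between two timestamps in milliseconds. */
double sketch_elapsed_ms(const struct timespec *start,
                         const struct timespec *end) {
  return (end->tv_sec - start->tv_sec) * 1000.0 +
         (end->tv_nsec - start->tv_nsec) / 1000000.0;
}

/* Times a single call of frame_fn using the monotonic clock, which is
   not affected by changes to the wall clock. */
double sketch_time_frame_ms(void (*frame_fn)(void)) {
  struct timespec start, end;
  assert(frame_fn != NULL);
  clock_gettime(CLOCK_MONOTONIC, &start);
  frame_fn();
  clock_gettime(CLOCK_MONOTONIC, &end);
  return sketch_elapsed_ms(&start, &end);
}

/* Hypothetical stand-in for one frame of emulation, e.g. a wrapper
   around a call to runBatch(). */
void sketch_frame(void) {
  volatile int i;
  for (i = 0; i < 1000; i++) {
  }
}
```

A single measurement tends to be noisy on a phone, so averaging over a few hundred frames gives a more trustworthy number.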

I thought of ways to reduce this time, and all that I could think of for the moment was to combine the memory reads issued by the rendering code into a few big batches.

This batching happens within the drawScreenLine method of video.c:

static inline void drawScreenLine() {
  int i;
  int batchCharMem[40];
  int batchColorMem[40];
  int backgroundColor = memory_unclaimed_io_read(0xd021) & 0xf;
  memory_read_batch(batchCharMem, 1024 + posInCharMem, 40);
  memory_read_batch_io_unclaimed(batchColorMem, 0xd800 + posInCharMem, 40);
  for (i = 0; i < 40; i++) {
    jchar charcode = batchCharMem[i];//memory_read(1024 + i + posInCharMem);
    int bitmapDataRow = charRom[(charcode << 3) | (line_in_visible & 7)];
    int j;
    int foregroundColor = batchColorMem[i] & 0xf;//memory_read(0xd800 + i + posInCharMem) & 0xf;

    for (j = 0; j < 8; j++) {
      int pixelSet = bitmapDataRow & 0x80;
      if (pixelSet) {
        g_buffer[posInBuffer] = colors_RGB_565[foregroundColor];
      } else {
        g_buffer[posInBuffer] = colors_RGB_565[backgroundColor];
      }
      posInBuffer++;
      bitmapDataRow = bitmapDataRow << 1;
    }
  }
}

The implementation of the two batch methods looks as follows:

void memory_read_batch(int *batch, int address, int count) {
  int i;
  for (i = 0; i < count; i++) {
    batch[i] = mainMem[address + i];
  }
}

void memory_read_batch_io_unclaimed(int *batch, int address, int count) {
  int i;
  address = address & 0xfff;
  for (i = 0; i < count; i++) {
    batch[i] = IOUnclaimed[address + i];
  }
}

Implementing the above code shaved about a millisecond off the total time per frame.

In Summary

In this post we have implemented color graphics and scan line based rendering.

We have also implemented bank switching of the C64.

In the next post, we will see how far we can get with implementing the other graphic modes of the C64 in order to get the game Dan Dare in a playable state in our emulator.

Till next time!

