Friday, November 10, 2017

Robo-Ninja Climb

Well, out of the blue, I decided that I was missing out by not entering something into this year's NesDev competition, so a couple weeks ago I decided to see how quickly I could churn out a simple-but-fun game.

My idea is an endless-climber type of thing, based on the Robo-Ninja character. You just bounce back and forth scaling walls while dodging obstacles.

It's been fun, trying to churn this out as fast as possible.  To be as efficient as I can, I reused a number of routines from my primary nes project, and then have been coding everything in a mix of C and assembly, switching back and forth at whim, using whichever seems easier for any given part of the game.

I just finished my 7th day of development, and it's almost a playable game -- I just need to make collisions with the obstacles actually work, and tweak the controls a bit, and it will be time for some preliminary testing: the standard "is it actually fun" sort of thing. We'll see.

There's a lot more I'd love to say about some of the technical challenges of writing this game, but with trying to finish it as quickly as possible, I end up not wanting to waste time writing about it.

Like last year, I don't have a lot of intentions of winning the competition, but I couldn't pass up a chance to at least enter something.

Tuesday, October 17, 2017

Metatile Designer

For the NES game, I'm building my levels out of 32x32 pixel metatiles.  What is a metatile?  The NES background tiles are 8x8 pixels, but that's too small for a lot of purposes.  Loads of games were built out of bigger blocks (16x16 is very common, which is what Mario or Zelda used, making the standard "block" size you see in many NES games).

The smaller your blocks, or metatiles, the more ROM space a level takes up, but the more flexibility you have in laying out your levels.  For this game, I'm planning to use 32x32 blocks.  (which I've built the whole engine around, so if that changes for some reason, I'm in for a lot of work)

Here's a picture of Super Mario Bros.  By this time your eyes are glazing over
and you don't even remember why I was talking about Super Mario Bros anyway.
Warning: the next paragraph is even worse.

The reason I chose 32x32 is because of how the NES handles palettes. In memory, there's a byte for each 8x8 pixel space on the screen, to indicate which tile should be drawn there. But each tile can have one of 4 palettes, and that byte doesn't include any palette information. There's another block of video memory where you assign palettes to tiles.  BUT, there's a few limitations.  You can only assign one palette for every 16x16 pixel block (group of 4 tiles), which is one reason it's so common to have blocks that size in games.  AND 4 of those groups (a 32x32 pixel block) shares a byte in memory (2 bits for each 16x16 block).  So in my case, each 32x32 metatile can map to a byte of palette memory, which simplifies writing palettes. 

So between the ROM savings for large levels (64 bytes to represent a single screen) and the ease of writing palettes (also called attribute tables), I picked 32x32 metatiles.

Now the problem is that I need a better tool for designing my metatiles.  There are a few tools out there (NES Screen tool by Shiru, Sumez's editor, etc).  But for one reason or another, none of them quite did want I wanted.  So it was time to start tool building, which is a sometimes-miserable, sometimes-nice distraction in the middle of a project.

In the past I've tried a few different technologies for quick-and-dirty gui tools.  It used to be Borland Delphi.  Then C#.  Then the web with either raw javascript, or Angular.  This time I wanted something cross-platform but I hate the web, so I decided to try javafx, which I've heard is decent.

I considered forcing myself to use Kotlin or Clojure for the sake of learning another language, but once again, the pragmatism of getting something done quickly set in, so I cranked it out in java.  Java 8's lambdas makes gui programming a little less unpleasant, so it didn't take long to have my metatile designer working but ugly.

Now it's time to actually try to put together some decent-looking metatiles and start building a cool demo level.

Monday, October 16, 2017

Quad-Games and PRGE

Well, I have 3 games (of hopefully 4) of my "Quad-Games" collection done: Quad-Joust, Quad-Tank, and Bomb-Defense.  Thanks to some friends that came over for an evening of testing (thanks guys!) we found a bunch of bugs and potential new features.

I've spent the past couple weeks hammering out bugs, and re-testing repeatedly, mostly with my family.  Now PRGE is almost here, so I'll have to call it done for now.  I'm pretty excited to have it out to demo, and see what people think when they play it.

Bomb Defense is the preferred game for my wife and daughter, who prefer co-op play.

A few other updates:

  • Some bugs have been reported in Atari Anguna, so I've been working feverishly to fix and test them before PRGE, so new carts sold there can include the bug fix
  • I've also had a request for a PAL-compatible version of Anguna, so I've been working on a PAL-60 version that will play in whatever places used PAL TVs.
  • I've finally gotten back to work on the NES game.  The scrolling engine seems to finally be working, so I was pretty quickly able to add a main character that can jump around and collide semi-properly with the environment.  More on that soon!

Wednesday, September 20, 2017

Atari QuadGames

Well, the first prototype adapter is finished, Quad-Joust is undergoing testing (ie making my kids come down to the basement and play it with me), and I'm almost done with the 2nd game to be included on the QuadGames multicart: Quad-Combat (or Quad-Tank).  Basically, just a 4-player variant of the original tank/combat game.

Using the same basic code layout as Joust was a great starting point, and made it easy to the game 99% of the way there.  The problem, though, was time.  Most of the time during gameplay, it worked great. But under a few worst-case conditions, I ran out of processing time, and the screen would roll.   It didn't help that there were really 8 objects that had to be dealt with every frame -- 4 tanks and 4 bullets.  (as opposed to just the 4 birds from joust)  Reading over the code a few times, there wasn't anything grossly inefficient.

Which meant it was time for optimizing.

Because I wrote this all in batari basic to save programmer time, I wasn't sure how efficient the compiled assembly was.  Step one was to just read through the assembly that bB generates. Which turned out to be not too bad.  Very little that was egregious.  But a lot of places where I could do a tiny bit better by hand.

So I tonight I ended up rewriting about half the game into assembly. Having the compiled basic code to work off of made it pretty fast, but it was easy to find places where it was being dumb. For example, I had code like:

temp1 = player0x + 4
temp1 = temp1 / 4

which naively got transformed into:

lda player0x     ;3
clc              ;2
adc #4           ;2
sta temp1        ;3
lda temp1        ;3
lsr              ;2
lsr              ;2
sta temp1        ;3
                 ;total: 20 clock cycles

Not terrible, but a bit wasteful. Knowing the state of the carry flag going in, I was able to rewrite it

lda player0x     ;3

adc #4           ;2
lsr              ;2
lsr              ;2
sta temp1        ;3
                 ;total: 12 clock cycles

8 clock cycles isn't much, but doing this sort of thing in a handful of places throughout the game ended up making it work!

Now it's time to test some more....

Don't look too closely, turns out I'm REALLY bad at making holes in plastic boxes.

Thursday, September 7, 2017

4-player adapter progress

The past few weeks have been a fun balance between working on my NES game (still slogging through getting my scrolling engine right!) and the Atari 4-player adapter.

I'm too tired tonight to go into too much detail, but the adapter is moving along great!  The way it works is this:

  • The Atari allows setting 4 of each joystick port's pins to either input or output mode. Normally you want them on input mode, because they are the pins that read the 4 directions of the joystick. But they planned well for future-proofing the console, and set it up so that you can control them as outputs as well.
  • So to allow 4 controllers, I use a really simple multiplexing scheme -- the first joystick port is used in normal input mode for reading all 4 controllers, and the second joystick port is used in output mode to select which controller to read at any given time.
So with just a few multiplexer ICs and some pull-up resistors, we have a working adapter!

Well, mostly-working. So far, the left, right, and up directions work. I still need to finish wiring up the down and fire buttons!

Right now my desk is just a giant mess of wires waiting to be soldered.

I was a bit nervous at first that the Atari wouldn't let me write to the output pins fast enough to switch to each joystick during a frame, or that the input might be latched once in a frame and not updated as I switched it.  But neither of those were the case, and so far, it seems to be working perfectly!

Next step: finish wiring the last pins, and put this thing into some sort of sturdy box. Then hopefully get it all cleaned up to possibly show off at PRGE!

Friday, August 18, 2017

Batari Basic and Fixed Point

When I'm too tired/frustrated to work on my NES game, I've been playing around with ideas for another hobby project that I've head for awhile: a 4-player atari adapter.

I'll get to the details of the hardware of the adapter at some other time, but right now I want to talk about a thing called Batari Basic.

Batari Basic was created by a talented guy (Fred "batari" Quimby) as an easy way to make games for the Atari 2600. It's a really-slimmed-down version of BASIC with a bunch of special keywords and such specifically for Atari.   The idea is that you write simple game logic in a very limited set of basic commands, which can easily be compiled to assembly language.  Then, for the display kernel, which is where you normally really have to work to optimize things and get assembly timings perfect, you just select and use one of a handful of pre-written kernels.

The end result is super-easy-to-write games that often have a very similar look (due to using the same kernel).

Blocky playfield with blank lines between chunks.
Nice-but-generic-looking scoreboard at the bottom.
Two sprites and not much else on the screen?
Looks like a standard Batari Basic game to me

The ease of using Batari Basic to make an Atari game has led to quite a few really interesting or cool games being made, but also boatloads of bland shovelware.  I wasn't remotely interested in using it for Anguna, as it removed most of what makes Atari programming interesting to me -- a challenge of working on the metal, understanding what the system can do, and trying to push the system a little (which admittedly, I didn't do much of)

BUT, for my four player adapter project, I'm going to need a sample game or two to sell with the thing.  But I really don't feel like taking the hours to fine tune a display kernel for something like this.  I never expected this, but I've become a Batari Basic programmer.

It helps that there's now a kernel that uses an on-the-cartridge-arm-co-processor (named DPC+, modeled after David Crane's co-processor that he invented for Pitfall 2), that allows multiple sprites, and selectively flickers them if 3 are overlapping.  (Similarly to what I did with Anguna, but that took me AGES to get right. Here, it's built into the kernel).

Batari Basic's DPC+ kernel makes it REALLY EASY to make a multi-sprite game like this.
So based on that, I'm going to use Batari Basic to churn out at least one four player game to go with this adapter.  I'm starting with something that's a modified version of Joust.  But for Joust to work, you need to use fixed point math, so that you can have smooth levels of acceleration and velocity.

Batari Basic supports fixed point math.  But only sort-of.  You can add and subtract fixed point numbers (with certain limitations).  BUT you can't really do comparisons with them.  It just doesn't work right.  Simple stuff like "if foo > 0" doesn't really work right.

Which means I'm doing a lot of hacky workarounds, dropping to assembly and such, to manage my fixed point math.

That said, it's still WAY faster to make a game this way.  Just a few hours of work, and I have a mostly-working joust game.  With four players.  Although I still haven't gotten the hardware in the mail to make the actual adapter, so I ended up hacking the Stella emulator to add emulated support for my controller.  But that's a different post for a different day.

Monday, July 24, 2017

Optimization and C

Starting this next game in C, I've known there will be places where C isn't fast enough, and I'd have to drop to assembly to optimize things.  Particularly because the 6502 processor really isn't a suitable target for C.  There's just too large a gap between how C structures things and how the 6502 needs things to be done.

Well, this week I decided to see how long my background updating routine takes.  The easiest way to see how long a long-running routine takes on the NES is to play with the color/bw register.  You can set a frame to render in B&W, and when the routine finishes, change the register to color.  The screen will switch partway through the frame, and you can see by where the color starts how much of your frame's computation time you've used.

In this example picture, you can see that the routine being checked runs from the beginning of rendering, to somewhere around 10% of the screen height.  So it's taking up somewhere in the general order of 10% of the total processing time available.

Well, my background updating routine (which renders a new slice of background just offscreen to prepare for scrolling) was taking somewhere around 40% of my total time.  It was horrible.  After playing around, it turns out that this loop was the culprit:

    while (temp > 0) {  //temp is just a counter of how many times to do this
        cj = *tempPtr2;  //get the current metatile into cj
        slicePtr = (u8*) metatile_ptr[ci]; //figure out which slice array to use
        ci += 4;
        if (ci == 16) {    //if we've finished a tile, go to the next one down
            ci = 0;
            tempPtr2 += yIncrement;

It's not important to get into the details of exactly what this loop is doing, but a few things ended up being problematic:

1.  I look up slicePtr each time through the loop, although if you pay attention, it turns out there are only 4 different values it can be.  Pulling those out of the loop gained me about 5%.  

2. More importantly, and this is where C starts to fall apart, getting a value by index into an array can be super-slow if the array isn't contant. (ie if it's a non-constant pointer pointing at an array of data).  This is because the 6502 only allows indirect indexed addressing from zero page.  And what exactly does that nonsense mean?  The 6502 has a single special page of memory, the "zero page" that's, well, special.  To do a pointer-based lookup, you first have to copy the pointer to the zero page, then do an index from that.  So for a single slicePtr[cj] lookup, it's something like:
lda slicePtr
sta tempPtr
lda slicePtr+1
sta tempPtr+1
ldy cj       
lda (slicePtr),y

That's 21 clock cycles, if I haven't forgotten all my instruction timings in the months since making an Atari game.

So to improve this, I allocated space on the zero page for 4 pointers, and not only pulled them out of the loop (like I was talking about in step 1), but dropped to inline assembly, and saved them on the zero page once, so I wouldn't have to jump through those hoops every time.   This ended up being a huge savings in time.

3. By this point, I figured I had optimized it that far, I might as well go further, and unroll the loop a little, and do most of the computation in inline assembly:

    tempPtrA = (u8*)metatile_ptr[ci];
    ci += 4;
    tempPtrB = (u8*)metatile_ptr[ci];
    ci += 4;
    tempPtrC = (u8*)metatile_ptr[ci];
    ci += 4;
    tempPtrD = (u8*)metatile_ptr[ci];

    while (temp >= 4) {  //temp is just a counter of how many times to do this
        cj = *tempPtr2;  //get the current metatile into cj
        __asm__("ldx %v", vram_buffer_current);
        __asm__("ldy %v", cj);
        __asm__("lda (%v),y", tempPtrA);
        __asm__("sta %v,x", vram_buffer);

        __asm__("ldy %v", cj);
        __asm__("lda (%v),y", tempPtrB);
        __asm__("sta %v,x", vram_buffer);

        __asm__("ldy %v", cj);
        __asm__("lda (%v),y", tempPtrC);
        __asm__("sta %v,x", vram_buffer);

        __asm__("ldy %v", cj);
        __asm__("lda (%v),y", tempPtrD);
        __asm__("sta %v,x", vram_buffer);

        __asm__("stx %v", vram_buffer_current);

        tempPtr2 += yIncrement;
        temp = temp - 4;

It's certainly a bit uglier, but it went from the whole routine being somewhere around 40% of my frame time, to about 10%.   I'll take that optimization.

Edit: And if you're wondering why my variable names are so awful, that's once again an artifact of the 6502 way of doing things.  C's method of allocating local variables on the stack isn't a particularly good fit for the 6502, so it's much faster to allocate a handful of generic global variables on the zero page that can be reused all over the place.  So most of these oddly named variables are common globals that are used for all sorts of things. (ie ci and cj are common index variable that I use in all sorts of loops)

Robo-Ninja Climb

Well, out of the blue, I decided that I was missing out by not entering something into this year's NesDev competition , so a couple week...