HACKING ON THE GNUBOY SOURCE TREE


  BASIC INFO

In preparation for the first release, I'm putting together a simple
document to aid anyone interested in playing around with or improving
the gnuboy source. First of all, before working on anything, you
should know my policies as maintainer. I'm happy to accept contributed
code, but there are a few guidelines:

* Obviously, all code must be able to be distributed under the GNU
GPL. This means that your terms of use for the code must be equivalent
to or weaker than those of the GPL. Public domain and MIT-style
licenses are perfectly fine for new code that doesn't incorporate
existing parts of gnuboy, e.g. libraries, but anything derived from or
built upon the GPL'd code can only be distributed under GPL. When in
doubt, read COPYING.

* Please stick to a coding and naming convention similar to the
existing code. I can reformat contributions if I need to when
integrating them, but it makes it much easier if that's already done
by the coder. In particular, indentation is a single tab (char 9), and
all symbols are all lowercase, except for macros, which are all
uppercase.

* All code must be completely deterministic and consistent across all
platforms. This results in the following two rules...

* No floating point code whatsoever. Use fixed point or better yet
exact analytical integer methods as opposed to any approximation.

* No threads. Emulation with threads is a poor approximation if done
sloppily, and it's slow anyway even if done right since things must be
kept synchronous. Also, threads are not portable. Just say no to
threads.

* All non-portable code belongs in the sys/ or asm/ trees. #ifdef
should be avoided except for general conditionally-compiled code, as
opposed to little special cases for one particular cpu or operating
system. (i.e. #ifdef USE_ASM is ok, #ifdef __i386__ is NOT!)

* That goes for *nix code too. gnuboy is written in ANSI C, and I'm
not going to go adding K&R function declarations or #ifdef's to make
sure the standard library is functional. If your system is THAT
broken, fix the system, don't "fix" the emulator.

* Please no feature-creep. If something can be done through an
external utility or front-end, or through clever use of the rc
subsystem, don't add extra code to the main program.

* On that note, the modules in the sys/ tree serve the singular
purpose of implementing calls necessary to get input and display
graphics (and eventually sound). Unlike in poorly-designed emulators,
they are not there to give every different target platform its own gui
and different set of key bindings.

* Furthermore, the main loop is not in the platform-specific code, and
it will never be. Windows people, put your code that would normally go
in a message loop in ev_refresh and/or sys_sleep!

* Commented code is welcome but not required.

* I prefer asm in AT&T syntax (the style used by *nix assemblers and
likewise DJGPP) as opposed to Intel/NASM/etc style. If you really must
use a different style, I can convert it, but I don't want to add extra
dependencies on nonstandard assemblers to the build process. Also,
portable C versions of all code should be available.

* Have fun with it. If my demands stifle your creativity, feel free to
fork your own projects. I can always adapt and merge code later if
your rogue ideas are good enough. :)

OK, enough of that. Now for the fun part...


  THE SOURCE TREE STRUCTURE

[documentation]
README - general information related to using gnuboy
INSTALL - compiling and installation instructions
HACKING - this file, obviously
COPYING - the gnu gpl, grants freedom under condition of preserving it

[build files]
Version - doubles as a C and makefile include, identifies version number
Rules - generic build rules to be included by makefiles
Makefile.* - system-specific makefiles
configure* - script for generating *nix makefiles

[non-portable code]
sys/*/* - hardware and software platform-specific code
asm/*/* - optimized asm versions of some code, not used yet
asm/*/asm.h - header specifying which functions are replaced by asm
asm/i386/asmnames.h - #defines to fix _ prefix brain damage on DOS/Windows

[main emulator stuff]
main.c - entry point, event handler...basically a mess
loader.c - handles file io for rom and ram
emu.c - another mess, basically the frame loop that calls cpu_emulate
debug.c - currently just cpu trace, eventually interactive debugging
hw.c - interrupt generation, gamepad state, dma, etc.
mem.c - memory mapper, read and write operations
fastmem.h - short static functions that will inline for fast memory io
regs.h - macros for accessing hardware registers
save.c - savestate handling

[cpu subsystem]
cpu.c - main cpu emulation
cpuregs.h - macros for cpu registers and flags
cpucore.h - data tables for cpu emulation
asm/i386/cpu.s - entire cpu core, rewritten in asm

[graphics subsystem]
fb.h - abstract framebuffer definition, extern from platform-specifics
lcd.c - main control of refresh procedure
lcd.h - vram, palette, and internal structures for refresh
asm/i386/lcd.s - asm versions of a few critical functions
lcdc.c - lcdc phase transitioning

[input subsystem]
input.h - internal keycode definitions, etc.
keytables.c - translations between key names and internal keycodes
events.c - event queue

[resource/config subsystem]
rc.h - structure defs
rccmds.c - command parser/processor
rcvars.c - variable exports and command to set rcvars
rckeys.c - keybindings

[misc code]
path.c - path searching
split.c - general purpose code to split strings into argv-style arrays


  OVERVIEW OF PROGRAM FLOW

The initial entry point is main() in main.c, which will process the command
line, call the system/video initialization routines, load the
rom/sram, and pass control to the main loop in emu.c. Note that the
system-specific main() hook has been removed since it is not needed.

There have been significant changes to gnuboy's main loop since the
original 0.8.0 release. The former state.c is no more, and the new
code that takes its place, in lcdc.c, is now called from the cpu loop,
which although slightly unfortunate for performance reasons, is
necessary to handle some strange special cases.

Still, unlike some emulators, gnuboy's main loop is not the cpu
emulation loop. Instead, a main loop in emu.c which handles video
refresh, polling events, sleeping between frames, etc. calls
cpu_emulate, passing it an ideal number of cycles to run. The actual
number of cycles for which the cpu runs will vary slightly depending
on the length of the final instruction processed, but it should never
be more than 8 or 9 beyond the ideal cycle count passed, and the
actual number will be returned to the calling function in case it
needs this information. The cpu code now takes care of all timer and
lcdc events in its main loop, so the caller no longer needs to be
aware of such things.
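
To make the ideal-versus-actual cycle count business concrete, here is
a rough sketch of one way a caller could use the returned value. This
is NOT the code in emu.c. The stub cpu, its fixed 6-cycle instruction
length, and the names cpu_emulate_stub and run_frames are all made up
for illustration, and deducting the overshoot from the next request is
just one possible policy.

```c
/* Stub standing in for the real cpu_emulate. It pretends every
 * instruction takes insn_len cycles, runs until the requested count
 * is reached or passed, and returns the count actually run. */
static int cpu_emulate_stub(int cycles)
{
	const int insn_len = 6;	/* made-up instruction length */
	int done = 0;
	while (done < cycles) done += insn_len;
	return done;
}

/* Run n frames of frame_cycles each. The overshoot of one call is
 * deducted from the next request so the error never accumulates. */
static long run_frames(int n, int frame_cycles)
{
	long total = 0;
	int debt = 0;
	while (n-- > 0) {
		int want = frame_cycles - debt;
		int ran = cpu_emulate_stub(want);
		debt = ran - want;
		total += ran;
		/* video refresh, sleeping and event polling would go here */
	}
	return total;
}
```

With this policy the total stays within one instruction length of the
ideal total, no matter how many frames are run.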

Note that all cycle counts are measured in CGB double speed MACHINE
cycles (2**21 Hz), NOT hardware clock cycles (2**23 Hz). This is
necessary because the cpu speed can be switched between single and
double speed during a single call to cpu_emulate.  When running in
single speed or DMG mode, all instruction lengths are doubled.
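
In code, the unit convention above might look something like the
following sketch. The names are hypothetical, and the frame length
(154 lines of 456 dot clocks) comes from general Game Boy hardware
documentation rather than from anything in this file.

```c
/* The emulator's time unit: one machine cycle at CGB double speed. */
#define UNITS_PER_SEC	(1L << 21)

/* One full frame is 154 lines of 456 dot clocks at 2**22 Hz, which
 * comes to half as many units at 2**21 Hz. */
#define UNITS_PER_FRAME	(154L * 456L / 2)

/* Length of an instruction in emulator units, given its nominal
 * length in machine cycles and the current speed mode. */
static int insn_units(int machine_cycles, int double_speed)
{
	return double_speed ? machine_cycles : machine_cycles * 2;
}
```

So a 4-cycle instruction costs 4 units in double speed mode and 8
units in single speed or DMG mode.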

As for the LCDC state, things are much simpler now. No more huge
glorious state table, no more P/Q/R, just a couple simple functions.
Aside from the number of cycles left before the next state change, all
the state information fits nicely in the locations the Game Boy itself
provides for it -- the LCDC, STAT, and LY registers.

If the special cases for the last line of VBLANK look strange to you,
good. There's some weird stuff going on here. According to documents
I've found, LY changes from 153 to 0 early in the last line, then
remains at 0 until the end of the first visible scanline. I don't
recall finding any roms that rely on this behavior, but I implemented
it anyway.

That covers the basics. As for flow of execution, here's a simplified
call tree that covers most of the significant function calls taking
place in normal operation:

  main                                                  main.c
   |_ sys_init                                          sys/
   |_ vid_init                                          sys/
   |_ loader_init                                       loader.c
   |_ emu_reset                                         emu.c
   \_ emu_run                                           emu.c
       |_ cpu_emulate                                   cpu.c
       |   |_ div_advance                               cpu.c *
       |   |_ timer_advance                             cpu.c *
       |   |_ lcdc_advance                              cpu.c *
       |   |   \_ lcdc_trans                            lcdc.c
       |   |       |_ lcd_refreshline                   lcd.c
       |   |       |_ stat_change                       lcdc.c
       |   |       |   \_ lcd_begin                     lcd.c
       |   |       \_ stat_trigger                      lcdc.c
       |   \_ sound_advance                             cpu.c *
       |_ vid_end                                       sys/
       |_ sys_elapsed                                   sys/
       |_ sys_sleep                                     sys/
       |_ vid_begin                                     sys/
       \_ doevents                                      main.c

  (* included in cpu.c so they can inline; also in cpu.s)


  MEMORY READ/WRITE MAP

Whenever possible, gnuboy avoids emulating memory reads and writes
with a function call. To this end, two pointer tables are kept -- one
for reading, the other for writing. They are indexed by bits 12-15 of
the address in Game Boy memory space, and yield a base pointer from
which the whole address can be used as an offset to access Game Boy
memory with no function calls whatsoever. For regions that cannot be
accessed without function calls, the pointer in the table is NULL.

For example, reading from address addr can be accomplished by testing
to make sure mbc.rmap[addr>>12] is not NULL, then simply reading
mbc.rmap[addr>>12][addr].
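
Here is a minimal sketch of the read side, to show why the whole
address can serve as the offset: the base pointer stored in the table
is biased by the start of the region it maps. Only the name mbc.rmap
comes from the real code. The cart array, map_bank, readb, and the
placeholder slow path mem_read_slow are assumptions made up for this
example.

```c
#include <stddef.h>

typedef unsigned char byte;

#define BANK_SIZE 0x4000

/* Hypothetical stand-in for the real mapper state; only the name
 * rmap comes from the text above. */
static struct {
	byte *rmap[0x10];
} mbc;

static byte cart[4 * BANK_SIZE];	/* a made-up 4-bank rom image */

/* Placeholder slow path for regions whose table entry is NULL. */
static byte mem_read_slow(int addr)
{
	(void)addr;
	return 0xFF;
}

/* Map rom bank n (n >= 1) at 0x4000-0x7FFF. The base is biased by
 * -0x4000 so the whole address can be used as the offset. */
static void map_bank(int n)
{
	int i;
	byte *base = cart + n * BANK_SIZE - 0x4000;
	for (i = 0x4; i <= 0x7; i++)
		mbc.rmap[i] = base;
}

/* Fast path when the table has a pointer, function call otherwise. */
static byte readb(int addr)
{
	byte *p = mbc.rmap[(addr >> 12) & 0xF];
	return p ? p[addr] : mem_read_slow(addr);
}
```

After map_bank(2), readb(0x4123) fetches cart[2*BANK_SIZE + 0x123]
without a function call, while a read from an unmapped region such as
0xFF00 falls through to the slow path.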

And for the disbelievers in this optimization, here are some numbers
to compare. First, FFL2 with memory tables disabled:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 28.69      0.57     0.57                             refresh_2
 13.17      0.84     0.26  4307863     0.06     0.06  mem_read
 11.63      1.07     0.23                             cpu_emulate

Now, with memory tables enabled:

 38.86      0.66     0.66                             refresh_2
  8.42      0.80     0.14   156380     0.91     0.91  spr_enum
  6.76      0.91     0.11   483134     0.24     1.31  lcdc_trans
  6.16      1.02     0.10                             cpu_emulate
     .
     .
     .
  0.59      1.61     0.01   216497     0.05     0.05  mem_read

As you can see, not only does mem_read take up (proportionally) 1/20
as much time, since it is rarely called, but the main cpu loop in
cpu_emulate also runs considerably faster with all the function call
overhead and cache misses avoided.

These tests were performed on a K6-2/450 with the assembly cores
enabled; your mileage may vary. Regardless, however, I think it's clear
that using the address mapping tables is quite a worthwhile
optimization.


  LCD RENDERING CORE DESIGN

The LCD core presently used in gnuboy is very much a high-level one,
performing the task of rasterizing scanlines as many independent steps
rather than one big loop, as is often seen in other emulators and the
original gnuboy LCD core. In some ways, this is a bit of a tradeoff --
there's a good deal of overhead in rebuilding the tile pattern cache
for roms that change their tile patterns frequently, such as full
motion video demos. Even still, I consider the method we're presently
using far superior to generating the output display directly from the
gameboy tiledata -- in the vast majority of roms, tiles are changed so
infrequently that the overhead is irrelevant. Even if the tiles are
changed rapidly, the only chance for overhead beyond what would be
present in a monolithic rendering loop lies in (host cpu) cache misses
and the possibility that we might (tile pattern) cache a tile that has
changed but that will never actually be used, or that will only be
used in one orientation (horizontally and vertically flipped versions
of all tiles are cached as well). Such tile caching issues could be
addressed in the long term if they cause a problem, but I don't see it
hurting performance too significantly at the present. As for host cpu
cache miss issues, I find that putting multiple data decoding and
rendering steps together in a single loop harms performance much more
significantly than building a 256k (pattern) cache table, on account
of interfering with branch prediction, register allocation, and so on.

Well, with those justifications given, let's proceed to the steps
involved in rendering a scanline:

updatepatpix() - updates tile pattern cache.

tilebuf() - reads gb tile memory according to its complicated tile
addressing system which can be changed via the LCDC register, and
outputs nice linear arrays of the actual tile indices used in the
background and window on the present line.

Before continuing, let me explain the output format used by the
following functions. There is a byte array scan.buf, accessible by
macro as BUF, which is the output buffer for the line. The structure
of this array is simple: it is composed of 6 bpp gameboy color
numbers, where the bits 0-1 are the color number from the tile, bits
2-4 are the (cgb or dmg) palette index, and bit 5 is 0 for background
or window, 1 for sprite.
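
Spelled out as code, the bit layout just described looks like this.
The macro and function names are mine, invented for the sketch; the
real code may well pick the fields apart differently.

```c
typedef unsigned char byte;

/* Bit layout as described above; the names are hypothetical. */
#define PIX_COLOR(p)	((p) & 3)		/* bits 0-1: color number */
#define PIX_PAL(p)	(((p) >> 2) & 7)	/* bits 2-4: palette index */
#define PIX_IS_SPR(p)	(((p) >> 5) & 1)	/* bit 5: 1 for sprite */

/* Build one 6 bpp output value from its three fields. */
static byte pix_pack(int color, int pal, int is_sprite)
{
	return (byte)((color & 3) | ((pal & 7) << 2) | ((is_sprite & 1) << 5));
}
```

For instance, color 3 of sprite palette 5 packs to 0x37.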

What is the justification for using a strange format like this, rather
than raw host color numbers for output? Well, believe it or not, it
improves performance. It's already necessary to have the gameboy color
numbers available for use in sprite priority. And, when running in
mono gb mode, building this output data is VERY fast -- it's just a
matter of doing 64 bit copies from the tile pattern cache to the
output buffer.

Furthermore, using a unified output format like this eliminates the
need to have separate rendering functions for each host color depth or
mode. We just call a one-line function to apply a palette to the
output buffer as we copy it to the video display, and we're done. And,
if you're not convinced about performance, just do some profiling.
You'll see that the vast majority of the graphics time is spent in the
one-line copy function (render_[124] depending on bytes per pixel),
even when using the fast asm versions of those routines. That is to
say, any overhead in the following functions is for all intents and
purposes irrelevant to performance. With that said, here they are:

bg_scan() - expands the background layer to the output buffer.

wnd_scan() - expands the window layer.

spr_scan() - expands the sprites. Note that this requires spr_enum()
to have been called already to build a list of which sprites are
visible on the current scanline and sort them by priority.
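
As for the one-line copy function mentioned earlier (render_[124]),
the idea amounts to roughly the following. This is a sketch only: the
_sketch names, the argument order, and the 64-entry palette tables
are my assumptions, not the actual signatures in lcd.c.

```c
typedef unsigned char byte;
typedef unsigned short un16;

/* One byte per host pixel: look each 6 bpp value up in a 64-entry
 * table of host pixel values while copying the line. */
static void render_1_sketch(byte *dest, const byte *src, const byte *pal, int cnt)
{
	while (cnt-- > 0) *dest++ = pal[*src++];
}

/* Two bytes per host pixel: same loop, wider palette entries. */
static void render_2_sketch(un16 *dest, const byte *src, const un16 *pal, int cnt)
{
	while (cnt-- > 0) *dest++ = pal[*src++];
}
```

The scan functions never need to know the host color depth; only the
table and the width of the destination change.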

It should be noted that the background and window functions also have
color counterparts, which are considerably slower due to merging of
palette data. At this point, they're staying down around 8% time
according to the profiler, so I don't see a major need to rewrite them
anytime soon. It should be considered, however, that a different
intermediate format could be used for gbc, or that asm versions of
these two routines could be written, in the long term.

Finally, some notes on palettes. You may be wondering why the 6 bpp
intermediate output can't be used directly on 256-color display
targets. After all, that would give a huge performance boost. The
problem, however, is that the gameboy palette can change midscreen,
whereas none of the presently targeted host systems can handle such a
thing, much less do it portably. For color roms, using our own
internal color mappings in addition to the host system palette is
essential. For details on how this is accomplished, read palette.c.

Now, in the long term, it MAY be possible to use the 6 bpp color
"almost" directly for mono roms. Note that I say almost. The idea is
this. Using the color number as an index into a table is slow. It
takes an extra read and causes various pipeline stalls depending on
the host cpu architecture. But, since there are relatively few
possible mono palettes, it may actually be possible to set up the host
palette in a clever way so as to cover all the possibilities, then use
some fancy arithmetic or bit-twiddling to convert without a lookup
table -- and this could presumably be done 4 pixels at a time with
32bit operations. This area remains to be explored, but if it works,
it might end up being the last hurdle to getting realtime emulation
working on very low-end systems like i486.


  SOUND

Rather than processing sound after every few instructions (and thus
killing the cache coherency), we update sound in big chunks. Yet this
in no way affects precise sound timing, because sound_mix is always
called before reading or writing a sound register, and at the end of
each frame.

The main sound module interfaces with the system-specific code through
one structure, pcm, and a few functions: pcm_init, pcm_close, and
pcm_submit. While the first two should be obvious, pcm_submit needs
some explaining. Whenever realtime sound output is operational,
pcm_submit is responsible for timing, and should not return until it
has successfully processed all the data in its input buffer (pcm.buf).
On *nix sound devices, this typically means just waiting for the write
syscall to return, but on systems such as DOS where low level IO must
be handled in the program, pcm_submit needs to delay until the current
position in the DMA buffer has advanced sufficiently to make space for
the new samples, then copy them.

For special sound output implementations like write-to-file or the
dummy sound device, pcm_submit should write the data immediately and
return 0, indicating to the caller that other methods must be used for
timing. On real sound devices that are presently functional,
pcm_submit should return 1, regardless of whether it buffered or
actually wrote the sound data.
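
The return value convention can be sketched like so, for the dummy
device case. Only pcm.buf and pcm_submit are named in the text above;
the len and pos members, the _dummy suffix, frame_sync, and fake_sleep
are assumptions invented to keep the example self-contained.

```c
typedef unsigned char byte;

/* Assumed shape of the pcm structure; only the buf member is named
 * in the text above, the rest is guesswork for the sketch. */
static struct {
	int len;	/* capacity of buf, in samples */
	int pos;	/* samples waiting in buf */
	byte *buf;
} pcm;

/* Dummy device: throw the data away at once and return 0 so the
 * caller knows it has to do its own timing. */
static int pcm_submit_dummy(void)
{
	pcm.pos = 0;
	return 0;
}

static int slept;	/* counts fallback delays, for the sketch only */

static void fake_sleep(void)
{
	slept++;
}

/* Caller side: delay by timer only when the sound device did not
 * already pace us. */
static void frame_sync(int (*submit)(void))
{
	if (!submit()) fake_sleep();
}
```

A real device's pcm_submit would block until the data is consumed and
return 1, in which case frame_sync above would skip the timer delay.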

And yes, for unices without OSS, we hope to add piped audio output
soon. Perhaps Sun audio device and a few others as well.


  OPTIMIZED ASSEMBLY CODE

A lot can be said on this matter. Nothing has been said yet.


  INTERACTIVE DEBUGGER

Apologies, there is no interactive debugger in gnuboy at present. I'm
still working out the design for it. In the long run, it should be
integrated with the rc subsystem, kinda like a cross between gdb and
Quake's ever-famous console. Whether it will require a terminal device
or support the graphical display remains to be determined.

In the meantime, you can use the debug trace code already
implemented. Just "set trace 1" from your gnuboy.rc or the command
line. Read debug.c for info on how to interpret the output, which is
condensed as much as possible and not quite self-explanatory.


  PORTING

On all systems on which it is available, the gnu compiler should
probably be used. Writing code specific to non-free compilers makes it
impossible for free software users to actively contribute. On the
other hand, compiler-specific code should always be kept to a minimum,
to make porting to or from non-gnu compilers easier.

Porting to new cpu architectures should not be necessary. Just make
sure you unset IS_LITTLE_ENDIAN in the makefiles to enable the big
endian default if the target system is big endian. If you do have
problems building on certain cpus, however, let us know. Eventually,
we will also want asm cpu and graphics code for popular host cpus, but
this can wait, since the c code should be sufficiently fast on most
platforms.
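
If you're wondering why the endian setting matters at all, consider a
16 bit register pair that is also accessed as two separate bytes. The
sketch below shows the general technique and a cheap sanity check for
porters. The names LO, HI, union reg, and endian_ok are hypothetical,
and I'm assuming unsigned short is 16 bits wide.

```c
typedef unsigned char byte;
typedef unsigned short word;	/* assumed to be 16 bits wide */

/* No IS_LITTLE_ENDIAN means big endian, as described above. */
#ifdef IS_LITTLE_ENDIAN
#define LO 0
#define HI 1
#else
#define LO 1
#define HI 0
#endif

/* A 16 bit register pair that can also be accessed as two bytes. */
union reg {
	byte b[2];
	word w;
};

/* Returns 1 if the compile-time setting matches the host. */
static int endian_ok(void)
{
	union reg r;
	r.w = 0x1234;
	return r.b[LO] == 0x34 && r.b[HI] == 0x12;
}
```

If endian_ok() returns 0 on your target, the makefile setting is wrong
for that system.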

The bulk of porting efforts will probably be spent on adding support
for new operating systems, and on systems with multiple video (or
sound, once that's implemented) architectures, new interfaces for
those. In general, the operating system interface code goes in a
directory under sys/ named for the os (e.g. sys/nix/ for *nix
systems), and display interfaces likewise go in their respective
directories under sys/ (e.g. sys/x11/ for the x window system
interface).

For guidelines in writing new system and display interface modules, I
recommend reading the files in the sys/dos/, sys/svga/, and sys/nix/
directories. These are some of the simpler versions (aside from the
tricky dos keyboard handling), as opposed to all the mess needed for
x11 support.

Also, please be aware that the existing system and display interface
modules are somewhat primitive; they are designed to be as quick and
sloppy as possible while still functioning properly. Eventually they
will be greatly improved.

Finally, remember your obligations under the GNU GPL. If you produce
any binaries that are compiled strictly from the source you received,
and you intend to release those, you *must* also release the exact
sources you used to produce those binaries. This is not pseudo-free
software like Snes9x where binaries usually appear before the latest
source, and where the source only compiles on one or two platforms;
this is true free software, and the source to all binaries always
needs to be available at the same time or sooner than the
corresponding binaries, if binaries are to be released at all. This of
course applies to all releases, not just new ports, but from
experience I find that ports people usually need the most reminding.


  EPILOGUE

That's it for now. More info will eventually follow. Happy hacking!