I mentioned in an earlier post that I would post about getting timing routines into my PLASMA test code as well as the C code. Needless to say, it took way more head-banging than expected, plus a late night, to get it in and working. But, that was all me. I failed to RTFM and then tried to figure out why it wasn’t working.

But, let’s back up a bit to give credit where credit is due. I stole/borrowed/adapted the clock routines I’m using in my code (both for PLASMA and C). I finally found a post on comp.sys.apple2.programmer from Bill Buckels that had the code in raw opcode format, which was then memcpy()’d into the right location in memory (in this case $0260) and called from inline assembly with a JSR to the right spot.  Brilliant!

But, with my lack of understanding of how PLASMA is laid out, I figured I had better do something a little more portable.  I tried several ways of converting it to inline assembly on my own.  One was taking the disassembly from the monitor and converting the raw memory locations to logical offsets, which involved using Virtual ][, printing the ML from the monitor to the virtual printer, saving as PDF and copy/pasting from there.  That was a nightmare, as the output in the PDF is not sequenced how I would have expected:

Screen Shot 2016-04-11 at 11.27.14 AM

Thanks to a tip from David Schmidt, I took a look at the code in ADTPro for the clock routines.  That’s all in assembly with logical offsets!  Woohoo!  I converted it into the assembly style that PLASMA wants and gave it a shot.  No go.  Time to figure out why.

At this point, I wish I had taken some screenshots of what I was doing as it would be nice to have.  I’ll be better about that in the future.

I had my PLASMA code print out the memory location ($4047, I think it was) for the function that has the inline assembly in it, then went into the monitor and took a look.  If you look at the code in the picture above, you can see that the first STA instruction is $7e after the start of the routine ($260).  You’d expect to see the STA of this new code to be $7e past $4047, right?  Nope!  It was at $10B2. Well, there’s your problem.  I could get the offsets to be right in that code if I used “--setpc 16401” on the call to the ACME assembler, but then the entrance location to my PLASMA code was off and nothing would run.

After hours of digging around and trying various things, I decided I needed to reach out to see if I had hit a bug (unlikely) or if I was doing something wrong (very likely).  After posting to comp.sys.apple2.programmer, David Schmenk got me straightened out.

Here is where the RTFM failure part comes in.  Here is a section from the PLASMA readme about Native Assembly Functions:

Lastly, PLASMA modules are re-locatable, but labels inside assembly functions don’t get flagged for fix-ups. The assembly code must use all relative branches and only accessing data/code at a fixed address.

Then I set off on a “damn fool idealistic crusade” to implement the code in C (and then PLASMA) directly.  I tried.  Boy, did I try. But, apparently my reading of the assembly and translating it into something else was failing miserably. I tend to do that: wanting to do things the “right” or “best” way instead of the “working way”.  Sometimes, it’s best to just use the “working way”, especially since I only wanted it for some performance testing.

Back to using the raw code and memcpy()’ing it in.  That was working fine, except my loop from 1 to 10 in my test program ran way more than 10 times.  I realize now, this was RTFM failure #2:

Data passed in on the PLASMA evaluation stack is readily accessed with the X register and the zero page address of the ESTK. The X register must be properly saved, incremented, and/or decremented to remain consistent with the rest of PLASMA. Parameters are popped off the evaluation stack with INX, and the return value is pushed with DEX.

David to the rescue again.  I added the code to save/restore X and the DEX, and it was good to go!

Here is the code for the timers. It’s basically a simple stopwatch with one lap timer included. Start the timer then you can ask for the elapsed time. You can do a lap reset to get individual times while the main timer is unaffected.


import cmdsys
    predef memcpy

const nscdata = $303

byte timer_year, timer_month, timer_date, timer_day, timer_hour, timer_minute, timer_second, timer_hundredth
byte lap_year, lap_month, lap_date, lap_day, lap_hour, lap_minute, lap_second, lap_hundredth
byte tmp_year, tmp_month, tmp_date, tmp_day, tmp_hour, tmp_minute, tmp_second, tmp_hundredth

byte nsccode[] = $a9,$00,$8d,$de,$02,$a9,$03,$09,$c0,$8d,$1f,$03,$8d,$22,$03,$8d,$31,$03,$8d,$3f,$03,$a9,$03,$8d,$df,$02,$d0,$16,$00,$00,$00,$00
byte           = $00,$00,$2f,$00,$00,$2f,$00,$00,$20,$00,$00,$3a,$00,$00,$3a,$00,$00,$8d,$20,$0b,$03,$a2,$07,$bd,$03,$03,$dd,$e0,$02,$90,$0f,$dd
byte           = $e8,$02,$b0,$0a,$ca,$10,$f0,$ce,$df,$02,$d0,$e6,$18,$60,$ee,$de,$02,$ad,$de,$02,$c9,$08,$90,$af,$d0,$1d,$a9,$c0,$a0,$15,$8d,$1b
byte           = $03,$8c,$1a,$03,$a0,$07,$8d,$1f,$03,$8c,$1e,$03,$88,$8d,$6f,$03,$8c,$6e,$03,$a9,$c8,$d0,$95,$a9,$4c,$8d,$16,$03,$38,$60,$00,$00
byte           = $00,$01,$01,$01,$00,$00,$00,$00,$64,$0d,$20,$38,$98,$3c,$3c,$64,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00,$00
byte           = $18,$90,$09,$00,$00,$00,$00,$00,$00,$00,$00,$38,$08,$78,$a9,$00,$8d,$04,$03,$8d,$80,$02,$ad,$a3,$03,$ad,$ff,$cf,$48,$8d,$00,$c3
byte           = $ad,$04,$c3,$a2,$08,$bd,$bf,$03,$38,$6a,$48,$a9,$00,$2a,$a8,$b9,$00,$c3,$68,$4a,$d0,$f4,$ca,$d0,$ec,$a2,$08,$a0,$08,$ad,$04,$c3
byte           = $6a,$66,$42,$88,$d0,$f7,$a5,$42,$9d,$7f,$02,$4a,$4a,$4a,$4a,$a8,$a5,$42,$c0,$00,$f0,$08,$29,$0f,$18,$69,$0a,$88,$d0,$fb,$9d,$02
byte           = $03,$ca,$d0,$d7,$ad,$80,$02,$8d,$83,$02,$68,$30,$03,$8d,$ff,$cf,$a0,$11,$a2,$06,$bd,$c7,$03,$99,$80,$02,$bd,$80,$02,$48,$29,$0f
byte           = $09,$30,$88,$99,$80,$02,$68,$4a,$4a,$4a,$4a,$d0,$0c,$e0,$01,$f0,$04,$e0,$04,$d0,$04,$a9,$20,$d0,$02,$09,$30,$88,$99,$80,$02,$88
byte           = $ca,$d0,$d1,$28,$b0,$19,$20,$be,$de,$20,$e3,$df,$20,$6c,$dd,$85,$85,$84,$86,$a9,$80,$a0,$02,$a2,$8d,$20,$e9,$e3,$20,$9a,$da,$60
byte           = $5c,$a3,$3a,$c5,$5c,$a3,$3a,$c5,$2f,$2f,$20,$3a,$3a,$8d

asm _initnsc
        jsr $0260

asm _readnsc
        jsr $030B

export def loadnsccode
    memcpy($0260, @nsccode, $16e);

export def initnsc

export def gettime(timedata)
    memcpy(timedata, nscdata, 8)

export def timer_start
    memcpy(@lap_year, @timer_year, 8)

export def timer_elapsed
    word d, h, m, s, hd
    d = tmp_date - timer_date; h = tmp_hour - timer_hour; m = tmp_minute - timer_minute; s = tmp_second - timer_second; hd = tmp_hundredth - timer_hundredth;

    return (((d*24+h)*60+m)*60+s)*100+hd

export def timer_lap_reset

export def timer_lap_elapsed
    word d, h, m, s, hd
    d = tmp_date - lap_date; h = tmp_hour - lap_hour; m = tmp_minute - lap_minute; s = tmp_second - lap_second; hd = tmp_hundredth - lap_hundredth;

    return (((d*24+h)*60+m)*60+s)*100+hd


C Code

Adapted from a post by Bill Buckels

#include <stdio.h>
#include <string.h>
#include <conio.h>
#include "realtime.h"

#define READ_TIME_ADDR 0x260
#define READ_TIME_LEN  366

/* The READ.TIME program Version 1.4 (C) Copyright Craig Peterson 1991 */
char _read_time[READ_TIME_LEN] = {

struct nsctm timer, lap, tmp;

#pragma optimize (push,off)
void initnsc(void)

    char *brunptr = (char *)READ_TIME_ADDR;

    /* bload read.clock to $260 */

	asm("JSR $260"); /* call init clock */

#pragma optimize (pop)

/* read the current date time and time from the NSC */
#pragma optimize (push,off)
void gettime(struct nsctm *output)
	asm("JSR $30B"); /* call read clock */

    memcpy(output, (char *)0x303, 8);
#pragma optimize (pop)

void timer_start()
    memcpy(&lap, &timer, 8);

int timer_elapsed()
    int d, h, m, s, hd;
    d = tmp.date - timer.date; h = tmp.hour - timer.hour; m = tmp.minute - timer.minute; s = tmp.second - timer.second; hd = tmp.hundredth - timer.hundredth;

    return (((d*24+h)*60+m)*60+s)*100+hd;

void timer_lap_reset()

int timer_lap_elapsed()
    int d, h, m, s, hd;
    d = tmp.date - lap.date; h = tmp.hour - lap.hour; m = tmp.minute - lap.minute; s = tmp.second - lap.second; hd = tmp.hundredth - lap.hundredth;

    return (((d*24+h)*60+m)*60+s)*100+hd;

Again, strikingly similar, but that is what I was after. Comparing apples to apples (pun intended!)
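
As a standalone illustration of the elapsed-time arithmetic used in both versions, here is a minimal sketch in C. The struct layout is an assumption based on the field names in the PLASMA listing (the real `realtime.h` is not shown here), and `elapsed_cs` is a hypothetical name. It uses `long` rather than `int` because a 16-bit `int`, as on cc65, overflows once the result passes 32767 centiseconds (a bit over five minutes).

```c
/* Hypothetical mirror of the 8 bytes copied from the clock buffer.
   Field names follow the post; the exact layout is an assumption. */
struct nsctm {
    unsigned char year, month, date, day;
    unsigned char hour, minute, second, hundredth;
};

/* Elapsed time from start to end in centiseconds.
   Each field difference may be negative; the nested multiply-add
   folds the borrows together.  A month rollover between the two
   readings is not handled. */
long elapsed_cs(const struct nsctm *start, const struct nsctm *end)
{
    long d  = (long)end->date      - start->date;
    long h  = (long)end->hour      - start->hour;
    long m  = (long)end->minute    - start->minute;
    long s  = (long)end->second    - start->second;
    long hd = (long)end->hundredth - start->hundredth;

    return (((d * 24 + h) * 60 + m) * 60 + s) * 100 + hd;
}
```

For example, a start reading of 10:59:59.90 and an end reading of 11:00:00.10 gives 20 centiseconds, even though three of the individual field differences are negative.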

Next I’m going to take a look at some more comparisons. Things I want to look at (some based on suggestions) are timings for different routines, cycles for different operations and size comparisons.

I wanted to get some timings for PLASMA vs C for a few operations. I’m sticking with my “moving monster” theme and tracked the time for doing two different operations.

  1. Drawing a frame of the monster (100 times), which involves
    • Flipping HGR pages
    • Getting the page address, getting Y address (lookup),  the byte for X (lookup) and adding them together
    • Getting frame for X (lookup) and calculating the offset to get to the correct frame
    • Memcpy() the data to memory
  2. Do a simple no op for loop from 1 to 500 (100 times)

I fully admit that this is not an exhaustive test, but I just wanted to get an idea of how they compare. Again, this is not a “C is faster/better” post. PLASMA is impressive tech regardless of the times. It’s just out of my pure curiosity.

I added pretty much identical timing routines on the PLASMA and C side (after much time spent banging my head), but I’ll post on that later.


Note: Times are in cs (centiseconds, i.e. 100ths)

100 Frames

C: 147 cs
PLASMA: 228 cs (155%)

Loop 500

C: 530 cs
PLASMA: 1368 cs (258%)
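
The percentages above are the PLASMA time expressed as a percentage of the C time. A small sketch of that arithmetic, with `percent_of` as a made-up helper name:

```c
/* Time t as a percentage of a baseline time, rounded to the nearest
   whole percent using integer math only. */
unsigned long percent_of(unsigned long t, unsigned long base)
{
    return (t * 100UL + base / 2) / base;
}
```

With the numbers from this post, `percent_of(228, 147)` gives 155 and `percent_of(1368, 530)` gives 258.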


Because, I like videos.



While playing around with PLASMA and working on some timing routines (more on that later), I found I needed to expand my build chain to be able to include multiple PLASMA modules on one disk when booting.

I also didn’t like having to specify an environment variable to set the source file for simple builds. For building a single .pla file and running it, I wanted something easier. This new makefile satisfies both of those requirements.

I did end up moving away from generating the # style files that I think CiderPress wants. Mainly because I’m using AppleCommander to build my disk images. I decided to use .mod (PLASMA “module” was the inspiration) as the intermediary file extension.



.PRECIOUS: %.dsk

	-rm -f *.a *.mod

%.run: %.dsk
	osascript plasma_run.scpt `pwd` $*

%.dsk: %.mod $(patsubst %,%.mod,$(EXTRA))
	cp template.dsk $@
	java -jar AppleCommander.jar -d $@ $*
	java -jar AppleCommander.jar -p $@ $* $(DSKTYPE) 0x$(ADDR) < $*.mod
	-if [ ! -z "$(EXTRA)" ]; then \
		for o in "$(EXTRA)"; \
		do \
			java -jar AppleCommander.jar -d $@ $$o ;\
			java -jar AppleCommander.jar -p $@ $$o $(DSKTYPE) 0x$(ADDR) < $$o.mod ;\
		done ;\
	fi

%.mod: %.a
	acme --setpc 4094 -o $@ $?

%.a: %.pla
	plasm -AM < $? > $@


This can be used in a few different ways. The simplest is to just run make, passing in the name of your .pla file with “.pla” replaced with “.dsk” to build the disk image, or “.run” to build the disk image and boot it in Virtual ][.

You can technically even run make and use “.a” and get the .a file out of PLASMA. It’s all generic so any of the intermediaries will work. Use “.mod” to get the compiled binary file, you can then use that with whatever tool you’d want to put it on a disk.

To have it build and include additional PLASMA modules, set the EXTRA environment variable to the list of the files to include, without the .pla extension.

Note: Besides the .dsk (which is marked as .PRECIOUS) all intermediaries are removed.


Make .dsk
% ls -l hello.pla
-rw-r--r--  1 mfinger  staff  65 Apr  8 21:42 hello.pla
% make hello.dsk
plasm -AM < hello.pla > hello.a
acme --setpc 4094 -o hello.mod hello.a
cp template.dsk hello.dsk
java -jar AppleCommander.jar -d hello.dsk hello
hello: No match.
java -jar AppleCommander.jar -p hello.dsk hello rel 0x1000 < hello.mod
if [ ! -z "" ]; then \
		for o in ""; \
		do \
			java -jar AppleCommander.jar -d hello.dsk $o ;\
			java -jar AppleCommander.jar -p hello.dsk $o rel 0x1000 < $o.mod ;\
		done ;\
rm hello.mod hello.a
% java -jar AppleCommander.jar -ll hello.dsk

  PRODOS  Destroy Read Rename Write SYS  035 09/19/2007 05/06/1993 17,128 $0000 0002 0008 Sapling Changed 0 4
  CMD  Destroy Read Rename Write SYS  010 04/01/2016 04/01/2016 4,141 A=$2000 0002 0029 Sapling Changed 0 0
  HELLO  Destroy Read Rename Write REL  001 04/08/2016 04/08/2016 55 $2000 0002 0037 Seedling Changed 0 0
  PLASMA.SYSTEM  Destroy Read Rename Write SYS  007 04/01/2016 04/01/2016 2,901 A=$2000 0002 002F Sapling Changed 0 0
ProDOS format; 112,640 bytes free; 30,720 bytes used.

Make .run
% make hello.run
plasm -AM < hello.pla > hello.a
acme --setpc 4094 -o hello.mod hello.a
cp template.dsk hello.dsk
java -jar AppleCommander.jar -d hello.dsk hello
hello: No match.
java -jar AppleCommander.jar -p hello.dsk hello rel 0x1000 < hello.mod
if [ ! -z "" ]; then \
		for o in ""; \
		do \
			java -jar AppleCommander.jar -d hello.dsk $o ;\
			java -jar AppleCommander.jar -p hello.dsk $o rel 0x1000 < $o.mod ;\
		done ;\
osascript plasma_run.scpt `pwd` hello
rm hello.mod hello.a

Including EXTRA
% ls -l timer.pla test.pla
-rw-r--r--  1 mfinger  staff   598 Apr  8 21:32 test.pla
-rw-r--r--  1 mfinger  staff  3710 Apr  8 13:26 timer.pla
% EXTRA=timer make test.dsk
plasm -AM < test.pla > test.a
acme --setpc 4094 -o test.mod test.a
plasm -AM < timer.pla > timer.a
acme --setpc 4094 -o timer.mod timer.a
cp template.dsk test.dsk
java -jar AppleCommander.jar -d test.dsk test
test: No match.
java -jar AppleCommander.jar -p test.dsk test rel 0x1000 < test.mod
if [ ! -z "timer" ]; then \
		for o in "timer"; \
		do \
			java -jar AppleCommander.jar -d test.dsk $o ;\
			java -jar AppleCommander.jar -p test.dsk $o rel 0x1000 < $o.mod ;\
		done ;\
timer: No match.
rm test.mod test.a timer.mod timer.a
% java -jar AppleCommander.jar -ll test.dsk

  PRODOS  Destroy Read Rename Write SYS  035 09/19/2007 05/06/1993 17,128 $0000 0002 0008 Sapling Changed 0 4
  CMD  Destroy Read Rename Write SYS  010 04/01/2016 04/01/2016 4,141 A=$2000 0002 0029 Sapling Changed 0 0
  TEST  Destroy Read Rename Write REL  001 04/08/2016 04/08/2016 423 $2000 0002 0037 Seedling Changed 0 0
  TIMER  Destroy Read Rename Write REL  003 04/08/2016 04/08/2016 927 $2000 0002 0039 Sapling Changed 0 0
  PLASMA.SYSTEM  Destroy Read Rename Write SYS  007 04/01/2016 04/01/2016 2,901 A=$2000 0002 002F Sapling Changed 0 0
ProDOS format; 111,104 bytes free; 32,256 bytes used.


Here is a video showing it using the “.dsk” and “.run” versions:

I wanted to compare PLASMA with CC65 on several different points. At this point, with my limited experience with PLASMA, I’ll just start with:

  • Ease of understanding/similarity
  • Speed

I took my “moving monster” test program and rewrote it using PLASMA to compare it to how I had it written in C.  Having read that PLASMA took some inspiration for its structure from modern languages, I was pleasantly surprised at how similar the code for each is and how easy the port was. It actually helped me improve my C code a bit as well.

C code

// Put image on screen
void putImage(imageData *image, char page, char x, char y) {
    char b, f, r;
    // Convert X to byte offset
    b = xToByte[x];
    // Convert X to needed shift frame
    f = xToFrame[x] * image->height*image->width;
    // Draw frame line by line
    for (r = 0; r < image->height; r++) {
        memcpy((char *)(hgrpage[page] + yToAddr[y + r] + b), image->data + f + (r * image->width), image->width);
    }
}

int main() {
    int x = 0;
    int count = 0;
    // Clear both Hi-Res pages (Bad: Clearing holes too!)
    memset((char *)0x2000, 0, 0x2000);
    memset((char *)0x4000, 0, 0x2000);
    // Activate graphics
    POKE(-16304, 0);
    // Full screen graphics
    // Hi-Res graphics
    // Put initial image on non-displayed page so when we flip it's there
    putImage(&image, !page, 0, 30);
    // Move across the screen by 2
    for(x=2; x <= 200; x+=2) {
        // Flip page
        page = !page;
        POKE(showpage[page], 0);
        // Draw new image on non-displayed page
        putImage(&image, !page, x, 30);
        // Pause

    // Go back to page 0 (1)
    POKE(showpage[0], 0);

    // Text mode
    POKE(-16303, 0);

PLASMA code



// Put image on screen
def putImage(imgdata, imgheight, imgwidth, page, x, y)
    byte b, f, r

    // Convert X to byte offset
    b = xToByte[x]

    // Convert X to needed shift frame
    f = xToFrame[x] * imgwidth * imgheight

    // Draw frame line by line
    for r = 0 to imgheight-1
        memcpy(hgrpage[page] + yToAddr[y + r] + b, imgdata + f + (r * imgwidth), imgwidth)
    next
end

// Clear both Hi-Res pages (Bad: Clearing holes too!)
memset(hgr1, 0, $2000)
memset(hgr2, 0, $2000)

// Activate graphics

// Full screen graphics

// Hi-Res graphics

// Put initial image on non-displayed page so when we flip it's there
putImage(@data, height, width, (!page&$01), 0, 30)

// Move across screen by 2
for x = 2 to 200 step 2

    // Flip page
    page = (!page&$01)

    // Draw new image on non-displayed page
    putImage(@data, height, width, (!page&$01), x, 30)

    // Pause
    for count = 1 to 500

// Go back to page 0 (1)

// Text mode

As you can see, they are very similar. Should be an easy move over for people familiar with C/Java and languages of that ilk. Very impressive.

Next I took a look at performance. When I originally started looking at comparing performance, I was shocked at the speed difference between the two (which I’ll show shortly). That was before I realized that I was wrong about PLASMA.

I was thinking that PLASMA was more of a “pre-assembler” or “pre-compiler” that took high level structures and generated 6502 assembly for the corresponding code. It actually produces byte-code that is then run under the PLASMA VM. This can be sped up by writing raw assembly for routines that need more power. Silly me.
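
To make the byte-code idea concrete, here is a toy stack-machine interpreter in C. This is only an illustration of the general technique: the opcodes and the `run` function are invented for this sketch and have nothing to do with PLASMA’s actual instruction set or VM. The point is that every operation pays for a fetch and a dispatch before any real work happens, which is where the speed gap comes from.

```c
/* Invented opcodes for a toy stack machine (not PLASMA's). */
enum { OP_PUSH, OP_ADD, OP_HALT };

/* Interpret a byte-code program and return the value on top of the
   stack when OP_HALT is reached, or -1 on an unknown opcode.
   No bounds checking: this is a sketch, not a real VM. */
int run(const unsigned char *code)
{
    int stack[16];
    int sp = 0;

    for (;;) {
        switch (*code++) {                 /* fetch and dispatch */
        case OP_PUSH:
            stack[sp++] = *code++;         /* operand follows the opcode */
            break;
        case OP_ADD:
            sp--;
            stack[sp - 1] += stack[sp];    /* pop two, push the sum */
            break;
        case OP_HALT:
            return stack[sp - 1];
        default:
            return -1;
        }
    }
}
```

Running the program `OP_PUSH 2, OP_PUSH 3, OP_ADD, OP_HALT` returns 5. Compiled C would do the same addition in a handful of 6502 instructions with no dispatch loop at all.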

Now, I don’t consider that a bad thing for the same reason I don’t consider it a bad thing for Java vs C. It’s just a different approach and both have their merits.

C Performance

PLASMA Performance

As you can see in the above videos, without some native assembly to do some of the heavy lifting where needed, the C compiled code runs much faster than the PLASMA code. With a byte-code VM, that is to be expected.

Again, I want to reiterate, this is not a bash on PLASMA at all. On the contrary, even with the little I’ve worked with it I’m very impressed with it and it’s an amazing piece of engineering. Especially doing a byte-code/VM on an 8-bit platform. Well done, well done.

I’m working on getting some timing routines in both the C side and the PLASMA side that will read from the No-Slot Clock, since it gives hundredths of seconds resolution. Then I’ll publish some exact numbers comparing the two. Again, not as a “C is faster/better” but just to show some of the trade-offs.

I decided as part of my efforts to get back into programming on my Apple ][s that I’d also explore other newer technologies that are available on the development side.

Thanks to a recent issue of Juiced.GS (Vol 21, Issue 1), I thought I’d try out PLASMA (Proto Language AsSeMbler for Apple) from David Schmenk. It (like it says) is a proto-assembly language that has a lot of features of modern languages normally not available in assembly.  I’ve not dug into the language much beyond reading the article (“Programming with PLASMA: Developing a chat client”) in Juiced.GS and reading through some of the sample code, but it does look very interesting.

But, thanks to the great work on the Xcode build pipeline for C[AC]65 that I mentioned in an earlier post, I’m spoiled in having a quick build pipeline.  Write code, click build, watch it run.  So, I figured by “standing on the shoulders of giants” I’d put together a proof of concept way to do something similar with PLASMA.

Requirements are simple:  Write code, run a build, watch it run.

Digging into the work Quinn Dunki posted about here, I took the AppleScript code and the makefile and adapted them to work for what I needed.  I did it outside of Xcode for this case for a couple of reasons.  First is that Xcode won’t really understand PLASMA code in a way that is beneficial (no completion or highlighting) and second is that I don’t really like Xcode very much.  So, vi and make it is.  Makes me all nostalgic.

Here is my adapted AppleScript code (really only changed a – to a +):

-- Stolen/Adapted from: Blondihacks Makefile script for Virtual ][ (http://www.quinndunki.com/blondihacks)
-- Boots the disk image for the program and runs it inside PLASMA

on run argv
	set TARGETPATH to item 1 of argv
	set PGM to item 2 of argv

	tell application "Virtual ]["

		tell front machine
			eject device "S6D1"
			insert TARGETPATH & "/" & PGM & ".dsk" into device "S6D1"
			delay 0.5
			delay 0.5
			type line "+" & PGM
		end tell
	end tell
end run

Here is my makefile:

PGM?=$(shell basename $(SRC) .pla)



all: disk

run: disk
	osascript plasma_run.scpt `pwd` $(PGM)

disk: $(PGM).dsk

	-rm -f $(OBJ) $(PGM).a $(PGM).dsk

	-rm -f *.a *\#*

$(PGM).dsk: $(OBJ)
	cp template.dsk $(PGM).dsk
	java -jar AppleCommander.jar -d $(PGM).dsk $(PGM)
	java -jar AppleCommander.jar -p $(PGM).dsk $(PGM) $(DSKTYPE) 0x$(ADDR) < $(OBJ)

%\#$(TYPE)$(ADDR): %.a
	acme --setpc 4094 -o $@ $?

$(PGM).a: $(SRC)
	$(PLASM) -AM < $? > $@

Again, this may be too limited at the moment as I don’t have a deep understanding of PLASMA and project structure, linking, etc.  But, for this case simply set the SRC environment variable to point to your PLASMA code and run make.

Here is an example (Note: I’ve tweaked the makefile a bit since the video):

Now it’s time to start writing some of my own code and experiment with the language.

I’ve been using Go (golang.org) for the last several months and really like the language, which I can go into at another time.

Lately, one of the processes that I’ve written seems to get into a state where the CPU usage of the process is extremely high even though the process is basically idle:

top - 13:06:53 up 152 days,  4:04,  1 user,  load average: 11.99, 11.30, 11.25
Tasks: 348 total,   1 running, 347 sleeping,   0 stopped,   0 zombie
%Cpu(s): 48.4 us,  2.4 sy,  0.0 ni, 49.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  32900140 total, 32371752 used,   528388 free,       44 buffers
KiB Swap: 33509372 total,  2151948 used, 31357424 free. 22511692 cached Mem

16115 mfinger   20   0 2637812 1.105g   5972 S 616.7  3.5   3883:11 xxxxxx
16134 mfinger   20   0 2504232 728572   6128 S 610.1  2.2   2909:37 xxxxxx

After looking around, I remembered that Go has profiling built in. I added a few lines to my code, namely:

import _ "net/http/pprof"


go func() {
       log.Println(http.ListenAndServe(":6060", nil))
}()

Then I ran the profile tool built into Go:

% go tool pprof -png http://host:6060/debug/pprof/profile > cpu.png
Fetching profile from http://host:6060/debug/pprof/profile
Please wait... (30s)
Saved profile in /Users/Mfinger/pprof/pprof.host:6060.samples.cpu.008.pb.gz

Let’s look at the results:


Let’s look at the heap, as well:

% go tool pprof -png  http://host:6060/debug/pprof/heap > heap.png
Fetching profile from http://host:6060/debug/pprof/heap
Saved profile in /Users/Mfinger/pprof/pprof.host:6060.inuse_objects.inuse_space.006.pb.gz

Very nice. Now to try to fix the issue.

Reading about odd/even frames and bytes was pretty, well, confusing at the beginning.  Took me a few times to get through it and experiment, but I figured it out.

Firstly, there are two choices:

  1. Move one pixel at a time, which really turns into only moving every other cycle.
  2. Move two pixels at a time, which actually looks okay.

There really is no way to move 1 pixel and not change colors which, looking back (again), makes total sense.

Secondly, I figured out the odd/even frames and bytes logic.

In the book, he generates frames at bit shift of 0, 2, 4, 6 as frames 0, 1,  2, and 3 then generates frames at bit shift offset of 1, 3, and 5 as frames 4, 5 and 6.  Which means show even offset frames in even bytes and odd offset frames in odd bytes.  The tricky part is right at the middle of the frames at frame 3.

If we plot this out over 14 shifts (enough to get through both an even and odd byte)

  • X = 0/1 we show frame 0 (even byte, even offset frame)
  • X = 2/3 we show frame 1 (even byte, even offset frame)
  • X = 4/5 we show frame 2 (even byte, even offset frame)
  • X = 6/7 we show frame 3
    • Except 7 is in the odd byte, but if we move it to the first odd frame (frame 4) then we show frame 3 for 1 cycle and show frame 4 for 3 cycles.
    • And we can’t put an even offset frame in an odd byte or the color will change.
    • The fix is that for X = 7, we actually just do X = 6 again.  Put frame 3 in the even byte.
  • X = 8/9 we show frame 4 (odd byte, odd offset frame)
  • X = 10/11 we show frame 5 (odd byte, odd offset frame)
  • X = 12/13 we show frame 6 (odd byte, odd offset frame)

Not sure that made it any clearer, maybe some code will.  I have a lookup table that you index into with your X value and you get back byte # and frame #.  Unlike the book, I generate the frames in bit-shift order to keep even/odd consistent between byte, offset and frame #.

char xToByteFrame[280][2] = {
{ 0, 0 },
{ 0, 0 },
{ 0, 2 },
{ 0, 2 },
{ 0, 4 },
{ 0, 4 },
{ 0, 6 },
{ 0, 6 },
{ 1, 1 },
{ 1, 1 },
{ 1, 3 },
{ 1, 3 },
{ 1, 5 },
{ 1, 5 },

Notice there are 8 entries that update byte offset 0 and 6 that update byte offset 1.  The second { 0 , 6 } handles the fix for X = 7.

Screen Shot 2016-03-18 at 11.58.30 PM

You can see that we have 2 at each X offset.  This is a move from 0-13 moving by 1 pixel.  I put each frame below the previous one for comparison.
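
Rather than typing all 280 entries by hand, the lookup table can be generated. The sketch below assumes the 14-entry pattern shown earlier simply repeats for every even/odd byte pair across the row (only the first 14 entries appear in this post), and `buildXToByteFrame` is a hypothetical helper name.

```c
#define HGR_WIDTH 280   /* pixels per hi-res row */

/* [x][0] = byte # within the row, [x][1] = frame # (bit-shift order) */
unsigned char xToByteFrame[HGR_WIDTH][2];

/* Fill the table 14 columns (one even byte plus one odd byte) at a time.
   Columns 0-7 of each group go in the even byte with even frames
   0,0,2,2,4,4,6,6 (column 7 repeats frame 6, which is the X = 7 fix).
   Columns 8-13 go in the odd byte with odd frames 1,1,3,3,5,5. */
void buildXToByteFrame(void)
{
    int x;

    for (x = 0; x < HGR_WIDTH; x++) {
        int pair = x / 14;   /* which even/odd byte pair */
        int r = x % 14;      /* column within the pair   */

        if (r < 8) {
            xToByteFrame[x][0] = (unsigned char)(pair * 2);
            xToByteFrame[x][1] = (unsigned char)(r & ~1);
        } else {
            xToByteFrame[x][0] = (unsigned char)(pair * 2 + 1);
            xToByteFrame[x][1] = (unsigned char)(((r - 8) & ~1) + 1);
        }
    }
}
```

After calling `buildXToByteFrame()`, the first 14 rows match the hand-written table, and the last column (X = 279) lands in byte 39.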

Here is the final product of the little monster man moving across the screen.  I opted to move 2 pixels at a time. I could have halved the delay between moves and moved by 1, but why copy unneeded data around?

Here is my main() code.

Screen Shot 2016-03-19 at 12.17.04 AM

putImage() takes care of figuring out which frame needs to be displayed based on the X value passed in.

More progress in the right direction.

Apparently, moving a (mostly) white object is as easy as I thought it was.  My tool generated the 7 needed frames and they progressed nicely across the screen.  The only tricky part is that I need to turn an X value into two different values:

  1. Byte # in row
  2. Frame # to display

This was pretty easy (or so I thought, more on that below).  Take X, divide it by 7 and round down to get the byte # in the row.  Take X mod 7 (i.e. the remainder) and you get the bit offset within the byte, which corresponds to the frame.  I’m worried that math is too much work to do on every movement, so I generated a lookup table for X to Byte/Bit, but it’s 2 bytes for each column, so that’s another 560 bytes for lookup tables.  Remember we’re working with memory on the order of 32-48k.  Together with the 384-byte line address table, that’s a total of 944 bytes for lookups, almost a whole K.  I’ll need to do some testing to figure out which is better; for now, lookup table it is.
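
For reference, the divide/mod version described above looks like this in C. The function name `xToByteBit` is made up for the sketch; on a 6502 both operations turn into software division routines, which is why the lookup table is tempting.

```c
/* Split a screen X coordinate (0-279) into the byte within the row
   and the bit offset within that byte (7 pixels per byte). */
void xToByteBit(int x, unsigned char *byteInRow, unsigned char *bitOffset)
{
    *byteInRow = (unsigned char)(x / 7);   /* rounds down for x >= 0 */
    *bitOffset = (unsigned char)(x % 7);   /* which of the 7 frames  */
}
```

For example, X = 10 gives byte 1, bit offset 3, and X = 279 gives byte 39, bit offset 6.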

We’re good, right?  For non-white objects, not so much:

Reading further in the book (yes, I end up working ahead when perhaps I shouldn’t), looks like I need to handle odd/even frames for odd/even bytes differently.  Oh, the fun never ends.

Which, when I think about it, makes total sense. Here is the first frame as a bitmap:

Screen Shot 2016-03-18 at 9.07.22 PM

And here is the second:

Screen Shot 2016-03-18 at 9.07.38 PM

The first one is all on green pixels (the G at the bottom) and the second is all on violet pixels (the V). I could just move two bits at a time, then the second (displayed) frame would be:

Screen Shot 2016-03-18 at 9.19.12 PM

So, it’s back to green like I want.  That seems cheap like a 2-bit suit (Ok, I had to).  But, it does feel like cheating.  Maybe that is what I’ll need to do and what games do and we don’t know it, but I don’t think so.
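
The reason a one-pixel move changes the color while a two-pixel move keeps it comes down to column parity. Here is a small sketch of that rule for a single lit pixel with dark neighbors; the enum and function names are made up for illustration, and it ignores the case where adjacent lit pixels merge into white.

```c
enum hgr_color { HGR_VIOLET, HGR_GREEN, HGR_BLUE, HGR_ORANGE };

/* Color of an isolated lit pixel at screen column x.
   highBit is the palette bit (bit 7) of the byte holding the pixel:
   clear selects violet/green, set selects blue/orange.
   Even columns get the first color of the pair, odd columns the second. */
enum hgr_color isolatedPixelColor(int x, int highBit)
{
    int odd = x & 1;

    if (highBit)
        return odd ? HGR_ORANGE : HGR_BLUE;
    return odd ? HGR_GREEN : HGR_VIOLET;
}
```

Moving from column x to x + 1 flips the parity and therefore the color; moving to x + 2 keeps the parity, which is why stepping by two pixels stays green.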

Time to read more and see. Looks like the parts I’ve moved through so far are the “easy parts”.  Figures.


(This post was repurposed and embellished from my Facebook post).

Well, more baby steps. Did a bunch of reading in the arcade book about bitmap images and rendering. Learned a lot.

Realized I need 7 shift frames for each bitmap I want to move around, since 1 byte in hi-res is 7 pixels.  That is because calculating the image shift on the fly would be complex and costly.  Remember, with the 7 bits and a high bit for color control (g/v vs b/o) you can’t simply shift each byte by one bit and be done.


That’s because the high bit needs to be left alone so it doesn’t change color, so the 7th bit needs to move to the low bit of the next byte (if shifting right), etc.  Not to mention doing this EVERY time.
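
Here is a minimal sketch of that one-pixel shift for a single row of bitmap data. It is not the code from my tool (that isn’t shown here); `shiftRowRight` is a hypothetical name, and it assumes the usual hi-res layout where bit 0 is the leftmost pixel of a byte and bit 7 is the color bit.

```c
/* Shift one row of hi-res bitmap data right by one pixel on screen.
   In each byte, bits 0-6 are pixels (bit 0 leftmost) and bit 7 is the
   color bit, which must stay put.  The pixel pushed out of bit 6
   becomes bit 0 of the next byte.  The pixel pushed out of the last
   byte is dropped, so the row needs a spare byte on the right. */
void shiftRowRight(unsigned char *row, int width)
{
    unsigned char carry = 0;
    int i;

    for (i = 0; i < width; i++) {
        unsigned char palette = row[i] & 0x80;    /* keep color bit   */
        unsigned char pixels  = row[i] & 0x7F;    /* the 7 pixel bits */
        unsigned char out     = (pixels >> 6) & 1;

        row[i] = (unsigned char)(palette | ((pixels << 1) & 0x7F) | carry);
        carry = out;
    }
}
```

Calling this six times on a copy of the base image, saving the result after each call, would produce the 7 shift frames. Doing that for every row on every move is exactly the cost the pre-generated frames avoid.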

Time for graph paper? Umm, no… If necessity is the mother of invention, then laziness is the father. I wrote a tool to draw my bitmaps and added code to generate each of the 7 shifts.  Needs some cleaning up, but it works great.


  • Drawing the bitmap (duh)
  • Buttons to manually shift left/right
  • Button to clear
  • Generates output data in textarea boxes. (C array for main frame, JSON for loading/unloading, C array for all frames).  These are updated in real-time.


The video shows my bitmap tool, the build pipeline in Xcode and the results running.

Proof is in the pudding they say. So, without further ado, the pudding.

Moving something around the screen needs to be fast, and drawing each line pixel by pixel isn’t going to cut it.  This is where bitmap graphics comes in.  Basically, drawing out the pictures pixel by pixel beforehand, keeping them in memory, then copying them to the right spot on the screen for each frame of the movement.

We can do this with block moves of data, but we can only move blocks of sequential data.  Each line (even if they were in sequential order on the screen) is at a different memory location.  Byte 2 on line 2 is not sequential with byte 2 on line 3.  So, we can only block move one line of the image at a time.  To complicate things, we need to figure out where in memory each line starts since they are not in order.

We can do this a couple of ways.  Compute the start address of the line, which will take a lot of instructions including division and multiplication (remember we’re on an 8-bit machine here at 1 MHz), or use lookup tables.  A lookup table is basically a list of 192 addresses (the number of lines in Hi-Res) in line order that we can index into with our Y coordinate.  It takes up 384 bytes of memory, but saves us a bunch of time.

Here is the formula from the book:

Screen Shot 2016-03-18 at 6.54.53 PM

Needless to say, I went through and wrote something to dump out all the line addresses and generated the lookup table.  You’ll notice “SN” above referring to Hi-Res page 1 or 2.  The Apple ][ screen has two pages, only one of which can be visible at a time.  To avoid flickering images when moving them, it’s common (though not trivial) to erase/redraw on the page not being displayed, then switch which page is visible, then repeat.
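
A sketch of how such a table can be generated. This uses the commonly documented hi-res row interleave written as offsets from the page base ($2000 or $4000), so it may be arranged differently from the book’s formula in the screenshot. The names `yToAddr` and `buildYToAddr` are just for illustration.

```c
#define HGR_LINES 192   /* rows on the hi-res screen */

/* Offset of the first byte of each row from the start of the page.
   192 entries of 2 bytes each = 384 bytes. */
unsigned short yToAddr[HGR_LINES];

/* Rows are interleaved: the low 3 bits of y step by $400, the next
   3 bits step by $80, and the top 2 bits step by $28 (40 bytes). */
void buildYToAddr(void)
{
    int y;

    for (y = 0; y < HGR_LINES; y++) {
        yToAddr[y] = (unsigned short)((y % 8) * 0x400
                                    + ((y / 8) % 8) * 0x80
                                    + (y / 64) * 0x28);
    }
}
```

After `buildYToAddr()`, row 1 is at offset $400, row 8 at $80 and row 64 at $28, which shows why consecutive screen rows are nowhere near each other in memory. Adding the page base ($2000 for page 1, $4000 for page 2) gives the real address.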

Then I took a bitmap example from the book and wanted to get it moving across the screen.  Since I’m learning here and wanted to do it the right way, I decided to also do the page flipping.

You can see my first attempt:

It’s not the cleanest, but not too bad.  At least it moves and I can tell the pages are flipping (the garbage at the bottom).

But, notice that the movement is jerky.  That’s because I’m moving it a whole byte (7 pixels) each frame instead of moving individual pixels.  That’s going to take some more work, but at least we have movement.  Baby steps.