March 2020 – RETRO Innovations

CoCo DMA: Missing Without a Trace

Radio Shack’s introduction of the Color Computer 3 introduced extensive changes to the Color Computer design. Gone were the 6847 VDG and 6883 SAM, functionally replaced and enhanced with the TCC1014 Advanced Color Video Chip (ACVC) which most folks reference as the GIME. Other additions included an additional 64kB of RAM with an option to upgrade all the way to 512kB, support for an additional joystick fire button, and composite and RGB video output options. Almost every addition brought new capabilities or an expansion of capabilities to the Color Computer…

Interestingly, for a machine design that I understand focused on parts and cost reduction within TANDY, 1 part was added to the CPU’s data path along the memory bus, an addition that renders our previous work unusable, an addition that breaks the very idea of sharing the bus: the 74LS245 octal bus transceiver. Ironically, the ‘245 transceiver was designed to allow bus sharing. As the name suggests, it contains 8 logic elements, each able to pass one signal from the ‘A’ pin to the ‘B’ pin or vice versa, depending on the state of the “direction” pin. More importantly, when an “enable” pin is de-activated, the logic element effectively “removes” the signal from the pin on the IC. In the industry, the pin is said to be in high impedance mode or “Hi-Z” mode. Such pins are commonly called “tri-state” pins, as they can either output 0, 1, or Hi-Z. The mechanics vary, but the term implies the logic element effectively places such a large resistance in between the signal and the pin that the rest of the circuit can essential ignore or will have no awareness of the signal. The 8 elements, Hi-Z operation, and direction function fit the demands of an 8-bit CPU bidirectional data bus perfectly. We can’t blame the IC itself for this issue, so why does the inclusion of this IC in the Color Computer 3 design render bus sharing unusable?

Let’s peruse the CoCo3 schematic, located on page 103 (or 127 for PAL) of the TANDY Color Computer 3 Service Manual. A snippet is included below in Figure 2. The 74LS245 connects directly to the MC6809E CPU data bus, with the DIR line connected to the CPU R/W line. During normal operation, the CPU drives the ‘B’ lines during a write action, and the ‘245 reads those lines and drives the ‘A’ lines with the same value. On a read activity, the ‘245 reads the data bus ‘A’ lines and drives the ‘B’ lines with that value, which the CPU then reads on its data pins and processes.

So far, so good. And, all works fine when we perform a DMA-based read action. After halting the CPU, The DMA engine places an address on the bus, and raises the R/W line to read data from memory, just like the CPU. Doing its part, the ‘245 transceiver pulls the data from the ‘A’ pins and places it on the ‘B’ pins, which are connected to the CPU. It’s a nice gesture, though unneeded, since the CPU isn’t running.

Writing, though, is where the issues arrive. Let’s run through the scenario. After halting the CPU, the DMA engine places an address on the address bus, places some data on the data bus, and pulls the R/W line low to signal a memory write. The ‘245, assuming the CPU is trying to write some data to memory, grabs the values on the ‘B’ pins and places the results on the ‘A’ pins, which are connected directly to the shared data bus. The first problem: The CPU is halted, so it is not outputting any data on the CPU data pins. The ‘245 is effectively pulling residual charge from the tri-stated CPU data bus, cleaning it up and driving the data bus. The second problem: Someone else is already driving the bus: the DMA engine. At the very least, the data bus represents an interdeterminate composite of the expected value from the DMA engine and the unknown value from the CPU data pins. More importantly (and more concerning), one output might be attempting to drive a data line low while the other tries to drive it high. In some sense, this is like connecting a battery’s positive post to its negative post. While the amperage involved is much lower, it’s no less damaging to the ICs.

Focus attention on pin 19 of the 74LS245, the “enable” pin, labeled “G”. (I have not yet been able to determine why “G” was used, maybe “gate”. Other versions of the IC use OE, which means “output enable”). Regardless of the name, the bar over the name indicates that it is “active low”. A logic zero (0) on the pin will enable the function. Ground the pin, and the data passes from A to B or B to A. Connect the pin to a “one” (1) signal and all transceiver channels switch into Hi-Z mode. Notice in Figure 2 that the enable line is connected to a symbol that just ends. That symbol is called “chassis ground”, representing a wire connected to a large sheet of metal (the “chassis” of old electronic equipment was made of metal). I won’t go into the specifics of various grounding options (signal ground, earth ground, and chassis ground), but we can all agree this pin is permanently tied to zero (0). Thus, the transceiver is permanently enabled. Good for the CPU, bad for DMA.

On previous Color Computer designs, the 74LS245 was not even included. I don’t know exactly why the functionality was added to the CoCo3, but we can theorize. While dedicated bus transceivers like the ‘245 clean up marginal signals, it’s highly doubtful that signals originating from the MC6809E showed any integrity loss. Bus transceivers like the 74LS245 can also act as “buffers”, protecting (in this case) the CPU from dangerous voltages or skipes on the data bus. Again, this seems dubious, since dangerous voltages on the address bus or other control lines could just as easily damage the CPU, and no buffers are placed on those lines. However, bus transceivers are also used to increase the “drive” capability of a signal. Think of the bus transceiver like an 8 channel amplifier. Each IC connected to the bus uses a bit of power, which must be supplied by the IC driving the bus. ICs thus often specify their drive characteristics in terms of Transistor to Transistor Logic (TTL) “load”, a current requirement of 1.6 milliampere (mA). A standard TTL IC input comprises one (1) TTL load, and a TTL output can typically drive ten (10) TTL loads (16mA). In electronics, this is called the “fanout” rating. Most ICs provide their fanout capability, and the MC6809E specifies a fanout capability of 4 LSTTL loads. A LSTTL (Low Power Schottky TTL), as the name suggests, requires less power than standard TTL. As such, an LSTTL load is equivalent to ¼ TTL load, or 0.4mA. Even if we simplify and consider everything on the CoCo3 data bus as requiring only 1 LSTTL load (0.4mA) each, the CoCo3 attaches 6 items besides the cartridge port and CPU to the databus, which exceeds the 4 LSTTL limit. For comparison, the CoCo1 attaches 5 items, again not including the CPU and the cartridge port. While the MC6809 can probably drive more than 4 LSTTL loads, and CMOS devices like the GIME probably consume less than 1 LSTTL of load, the ‘245 was most likely placed into the design to address this load issue. Now, some will ask, “But, what about the need to boost the drive on the address lines”? It turns out that the address lines are not connected to as many devices (excepting the cartridge port, most address lines only connect to GIME and the ROM, with A0,A1, and A5 signals connecting to 2 more devices).

We’ve determined the bus transceiver is designed to allow bus sharing, and our theory suggests it was a useful addition to the CoCo3 design. So, how should we proceed? As the title of this article suggests, the problem lies in a missing wire in the schematic. Such lines, called printed circuit board (PCB) traces or just traces, connect all of the IC pins in the CoCo3 design. As noted previously, we need one connected to the “enable” pin of the bus transceiver, one that will be zero (0) when the transceiver should be working, and one (1) when the transceiver should be turned off.

It turns out that the MC6809E CPU “BA” (Bus Available) line appears to offer the correct operation and polarity. When zero (0), the CPU is utilizing the bus, and while one (1), the bus is “available” for others. Tying the “enable” line to the CPU BA line should allow all existing functionality to continue while enabling CoCo3 DMA operation. That’s the good news. The bad news: adding this trace in some fashion requires hardware modification. Take heart, though, there are a few different ways to proceed.

For reference, let’s look at an NTSC CoCo3 PCB in Figure 4, showing the 74LS245 bus transceiver. You will notice the transceiver sits in between the CPU and the cartridge port (the PAL PCB is slightly different):

The “enable” pin on the 74LS245 has been denoted with the red arrow, as has the BA pin on the MC6809E (this particular CoCo3 has had the CPU removed and a socket installed).

NOTE: While testing suggests this modification will not adversely affect any current CoCo3 usage, more testing is needed to conclusively prove the point. The below instructions are designed for those who want to modify and confirm operation. Make these modifications at your own risk.

Option 1:

This option requires the least amount of work, but also looks the least pleasing. Simply snip the 74LS245 pin 19 at the spot where it goes into the PCB, bend it up off the PCB, and solder a small wire from it to pin 6 of the CPU. To return to stock, simply remove the wire, bend the pin down, and resolder to the stub in the PCB. Alternatively, desolder and replace the 74LS245.

Option 2:

This option looks neater, but does require cutting a trace.

For this option, remove the PCB from the case, remove the little retaining clips on the shield under the board, and locate pin 19 of the 74LS245. Cut the trace going to that pin, and then solder a wire from pin 19 to pin 6 of the CPU on the bottom of the PCB. To return to stock, unsolder the wire and create a small solder bridge over the cut trace.

Option 3:

This option requires the most work, but it is easiest to convert back to stock.

Desolder and socket the 74LS245. Build and populate the DMEnabler PCB shown below, installing the ‘245 into the adapter PCB and then installing the adapter into the CoCo3. Attach a test grabber lead to the BA pin on the adapter PCB and clip the test lead to pin 6 of the CPU. (Enterprising PCB designers could also create a dual-header PCB that fits in both the CPU and the 74LS245 sockets, removing the need for the test lead)

With one of these alterations, normal CoCo3 operation should continue as before, and DMA functionality will be enabled. As teased in last week’s article, with this modification in place, our test application now transfers data from the CoCo3 to external RAM and back:

Figure 6: CoCo3 Transfer to External RAM

While running in FAST mode (poke &hffd9,0), we can see the double read of $ff67/68 trigger the DMA transfer, which then pulls data from $4000-$4003 at ~2MBps.

Figure 7: External RAM transfer to CoCo3

Going the other way, the two cycle $ff67/68 read triggers the engine to push our data to $5000-$5003, again at ~2MBps.

Given that the addition of a single PCB trace into the CoCo3 circuit board would have added no cost and that the inclusion of that trace appears to maintain full compatibility with earlier CoCo systems while also fully enable bus sharing, one can only assume TANDY did not foresee DMA or other bus sharing activities being needed on the cartridge bus or they did not care to support such functions. Alas, requiring a hardware modification to support CoCo3 bus sharing severely limits adoption of this technique. At this point, folks socketing their CPU to upgrade to the Hitachi 63C09 CPU could also start socketing the 74LS245 or otherwise asking this modification be performed, but the majority of CoCo3 units remain unable to utilize this DMA technique.

Still, one cannot help but wonder… What if there were a way to coax an unmodified CoCo3 into sharing the bus? 🙂

CoCo DMA: Invisible RAM

Now that we’ve learned a bit about how the 6809E wants to be treated during a direct memory access (DMA) request, we can put all we’ve learned into action. But, before we launch into the details, let’s address some questions that arose during previous article discussions:

Some asked why the logic doesn’t just watch for M6809E signals BA and BS to both equal 1 (this condition indicates that the CPU is halted), commencing DMA actions at that point. This is my fault. As I started the project effort, I didn’t make it clear that I am trying to determine what DMA capabilities exist at the external side of the TANDY Color Computer expansion port (game cartridge port). I did comment about the expansion port initially, but I should have called that out in more detail. It’s worthwhile to also answer how one enables DMA when inside the computer, but that effort requires modifying the internals of the machine (at least plugging and unplugging ICs, which might not be socketed), and thus trades design complexity for end user complexity. The lowest entry point for DMA exploration (as with most things) is the expansion port, so I’ve concentrated my efforts there. But, for the record, we now know that an internal DMA engine need only pull HALT low, watch for BA=BS=1, and then transfer data as desired.

Others asked if halting the CPU and performing data transfers can really be called “DMA”. Instead of just answering the question by articulating the literal definition of DMA, I believe the question speaks more about how the term “DMA” has evolved over the years. In the beginning, DMA actions made no promises about CPU activity. The act of offloading memory access from the CPU and the act of running the CPU while that alternative access occurred were completely different efforts. In fact, Motorola manufactured and sold a DMA controller IC (the 6844 DMAC) that performed DMA actions by stopping the CPU (in various ways). That said, in today’s IBM PC-based world, where the memory bus, the peripheral bus, and the CPU bus are all separated and tied together with “bridge” ICs, it’s expected that a DMA action will not impact CPU operations. That’s the question being asked. I still believe the answer is “yes”, not only because of the literal definition but also that period-correct implementations would have stopped the CPU. But, I will agree that performing data transfers while not impacting CPU performance would be ideal. The conversation did bring up some neat ideas on how one might share the memory bus, not stop the CPU and perform DMA activities, which we can consider in later installments.

Concerning our primary objective, we now know how to safely stop the CPU and gain control of the bus, and we also know how to manipulate the bus to transfer data from one place to another. But, transferring a single piece of data is boring, so let’s add some more value. On the Color Computer 1 and 2, a fully expanded machine would include 64kB of RAM. Expanding RAM beyond 64kB typically takes 1 of 2 paths:

Internal RAM expansion. While CoCo systems were being manufactured, some vendors offered internal memory expansion options that “paged” RAM in 32kB banks. The RAM was easy to access, but the granularity was poor (if you needed 1 byte from another bank, you had to swap out 32kB of code and/or data to access it). Internal expansion solutions also had to contend with motherboard layout differences and lack of socketed ICs.
External RAM expansion using the cartridge port. Since the entire address and data bus resides on the expansion port, we can place additional on the bus there. However, due to technical reasons, only 32kB of internal memory can be protected from external memory writes, and I believe external memory cannot be seen by the video subsystem. The currently produced MOOH expansion memory and SD card interface by Tormod Volden represents this category.

There may be value in supporting a third option: DMA memory expansion. Place a large amount of memory external to the machine and enable DMA actions to swap that external memory with the data inside the machine.

Pros:

1 byte granularity. If you need to map in 1 data value, DMA expansion can support that. No 16kB or 32kB banking granularity.
No restrictions on memory location. Map in new data anywhere internal RAM exists.
No restrictions for video. Map in new data anywhere and the MC6847 VDG can see it.

Cons

Slower than banked memory. Each mapped byte takes ~1uS to map.
Still limited by internal RAM size. If the CoCo has 4kB of RAM, one can only map in 4kB of data at a time.

Still, in spite of the potential drawbacks, we should implement the idea, if only to use as a stepping stone to more expansive capabilities.

We’ll implement this invisible memory idea by adding some IO registers to our test logic:

3 bytes to hold the memory location in external memory we wish to transfer to/from
2 bytes to hold the memory location in internal memory we wish to transfer from/to
2 bytes to hold the length of memory to transfer
1 byte to configure what type of transfer we want (and some other flags), like:
- Transfer from internal memory to external?
- Transfer from external memory to internal?
- Use a constant address for internal memory (don’t increment internal address after each transfer)?
- Use a constant address for external memory (don’t increment external address after each transfer)?
- Enable DMA transfers?

As presented during the last discussion, let’s use the 2 byte “length” parameter as our trigger to start a DMA transfer. This allows the programmer to make most efficient use of code, by leveraging the 16 bit length register as both information sharing and process initiation.

Let’s get into the Verilog code now:

always @(negedge e_cpu or negedge _reset_cpu)
begin
   if(!_reset_cpu)
   begin
      flag_write <= 0;
      flag_mem_hold <= 0;
      flag_sys_hold <= 0;
      flag_active <= 0;
   end
   else if(ce_ctrl & !r_w_cpu)
   begin
      flag_write <= data_cpu[0];
      flag_mem_hold <= data_cpu[5];
      flag_sys_hold <= data_cpu[6];
      flag_active <= data_cpu[7];
   end     
end

flag_write = Are we doing a read from internal memory or a write to internal memory?
flag_mem_hold = Do not increment the external memory address after every transfer
flag_sys_hold = Do not increment the internal memory address after every transfer
flag_active = Enable DMA engine

This code just sets the operational flags from the various data bits, or resets them during a reset activity.

always @(posedge e_cpu or negedge _reset_cpu)
begin
   if(!_reset_cpu)
      begin
         flag_halt <= 0;
         flag_knock <= 0;
         flag_run <= 0;
      end
   else if(!flag_dma & ce_knock)
      begin
         flag_halt <= 1;
         flag_knock <= 1;
      end
   else if(!flag_dma & ce_knock2 & flag_knock)
      begin
         flag_knock <= 0;
         flag_run <= 1;
      end
   else if(!flag_dma & !ce_knock2 && flag_knock)
      begin
         flag_knock <= 0;
         flag_halt <= 0;
      end
   else if(flag_dma && (!len))
      begin
         flag_halt <= 0;
         flag_knock <= 0;
         flag_run <= 0;
      end
end

flag_knock = access to $ff67
flag_run = access to $ff68 after an immediately preceding access to $ff67

This represents a finite state machine with 3 states (IDLE, KNOCK, RUN). I will later optimize this to use a 2 bit “state” value instead of 3 binary bits. Still, I think you can see the sequence. On reset, reset the flags. If $ff67, set flag_knock. If we next see $ff68 (ce_knock), set run flag. The additional check for flag_dma is simply to prevent retriggering a DMA activity while performing a DMA activity that accesses $ff67:68 😊.

always @(negedge e_cpu)
begin
   flag_dma <= flag_active & flag_run;
end

We quantize DMA actions to begin and end on the falling edge of E, and DMA can only occur if the criteria is met and the DMA engine is enabled.

always @(*)
begin
   if(e_cpu & r_w_cpu & ce_addre)
      data_cpu_out = address_mem_out[23:16];
   else if(e_cpu & r_w_cpu & ce_addrh)
      data_cpu_out = address_mem_out[15:8];
   else if(e_cpu & r_w_cpu & ce_addrl)
      data_cpu_out = address_mem_out[7:0];
   else if(e_cpu & r_w_cpu & ce_addrh_sys)
      data_cpu_out = address_sys[15:8];
   else if(e_cpu & r_w_cpu & ce_addrl_sys)
      data_cpu_out = address_sys[7:0];
   else if(e_cpu & r_w_cpu & ce_lenh)
      data_cpu_out = len[15:8];
   else if(e_cpu & r_w_cpu & ce_lenl)
      data_cpu_out = len[7:0];
   else if(flag_dma & !r_w_cpu)
      data_cpu_out = data_mem;
   else
      data_cpu_out = 8'bz;
end

This code looks complicated, but it’s just allowing the developer to read various values. The only condition of interest is the one where flag_dma is active and R/W is 0 (a write to internal memory. In this case, we want to bridge the external memory data bus to the internal memory databus.

always @(*)
begin
   if(!flag_write & flag_dma)
      data_mem_out = data_cpu;
   else
      data_mem_out = 8'bz;
end

Conversely, if we are transferring data from internal memory to external, bridge the CoCo data bus to the external memory data bus.

always @(negedge e_cpu or negedge _reset_cpu)
begin
   if(!_reset_cpu)
      address_mem_out <= 0;
   else if(ce_addre & !r_w_cpu)
      address_mem_out[23:16] <= data_cpu;
   else if(ce_addrh & !r_w_cpu)
      address_mem_out[15:8] <= data_cpu;
   else if(ce_addrl & !r_w_cpu)
      address_mem_out[7:0] <= data_cpu;
   else if(flag_dma & !flag_mem_hold)
      address_mem_out <= address_mem_out + 1;
end

Again, the Verilog looks complicated but is not. We’re simply storing the various pieces of the starting memory address via writes from the CoCo, in 8 bit chunks. During a DMA activity without the address being held, we increment during each falling edge of E.

always @(negedge e_cpu or negedge _reset_cpu)
begin
   if(!_reset_cpu)
      address_sys <= 0;
   else if(ce_addrh_sys & !r_w_cpu)
      address_sys[15:8] <= data_cpu;
   else if(ce_addrl_sys & !r_w_cpu)
      address_sys[7:0] <= data_cpu;
   else if(flag_dma & !flag_sys_hold)
      address_sys <= address_sys + 1;
end

Same story here. We store the internal starting memory address in registers in 8 bit chunks, and we increment the counter by 1 during each cycle if the internal memory address is not being locked into position.

always @(negedge e_cpu or negedge _reset_cpu)
begin
   if(!_reset_cpu)
      len <= 0;
   else if(ce_lenh & !r_w_cpu)
      len[15:8] <= data_cpu;
   else if(ce_lenl & !r_w_cpu)
      len[7:0] <= data_cpu;
   else if(flag_run)
      len <= len - 1;
end

Finally, perform the same action for the length, storing in 8 bit chunks and incrementing while the DMA activity is occurring.

always @(*)
begin
   if(flag_dma)
      begin
         address_cpu_out = address_sys;
         r_w_cpu_out = !flag_write;
      end
   else
      begin
         address_cpu_out = 16'bz;
         r_w_cpu_out = 'bz;
      end
end

Normally, we don’t mess with the address bus, preferring to read it for information. But, during a DMA cycle, we need to place an address on the CoCo bus.

assign _halt = (flag_active & flag_halt ? 0 : 'bz);
assign _ce_ram =!flag_dma;
assign _we_mem =!(flag_dma & !flag_write);

One would think we could use flag_run to configure HALT, but flag_run only goes active after both trigger accesses have happened. Thus, using flag_halt lets us start the HALT condition after the first trigger (access to $ff67) if the DMA engine is active. External memory is selected only during DMA cycles, while the !WE signal to external memory is only enabled if we are reading from internal memory (this signal is overqualified, in that the external memory won’t be active unless flag_dma is active, which means this signal could be simplified to assign _we_mem = flag_write;

After compiling and downloading the firmware into our test cartridge (which contains 512kB of static RAM), let’s see what we can do. We’ll enable the DMA engine (128 => $ff69), set the external address to $000000, internal address to $4000, and length to $0004:

I’d like to interrupt for a second and express appreciation to David Wood (jbevren on IRC and Discord) for pointing me to the VNC server capabilities on my HP logic analyzer. Starting VNC allows me to capture better screenshots and remotely control the logic analyzer.

In this case, we performed a read from $ff67:68, since the program was written primarily in BASIC with a small EXEC to perform LDD $ff67:rts. We can see the $ff67:68 reads, the HALT line going low after the $ff67 access, the wait for $ff68 access, and then 4 reads from internal memory. Before issuing the DMA transfer, I populated $4000-$400a with the ascending values 0-10, and we see them pulled across the data bus in the trace above.

Now, let’s got the other way, transferring those external values back into internal memory at $5000 by turning the DMA engine on, switching transfer direction (129 => $ff69), setting the external address to $000000, setting the internal address to $5000, and keeping the length at $0004:

Again, we see the trigger condition $ff67:68, the HALT condition, and the transfer of our 4 data values from external memory to internal locations, and we see the address incrementing with each transfer. Additional tests show “pinning” an address works as well, which can be useful in the following situations:

setting memory to a constant value at 1byte/us by pinning the source address at a location that contains that value
Dumping data to the Orchestra 90 or other CoCoSDC at 1MB/s (Yes, the DMA action works even if both the DMA device and the peripheral are external to the CoCo!)

This invisible memory implementation nicely illustrates the DMA capability available on the TANDY Color Computer 1 and 2. Perhaps this knowledge triggers a fellow enthusiast to develop software that will play to the advantages of this memory expansion option. To that end, all resources associated with this effort have been placed under the Creative Commons Share-Alike 4.0 license and uploaded to a GitHub project repository:

https://github.com/go4retro/PhantomRAM/

If interest warrants, PhantomRAM can be turned into a technology solution, though that is outside the scope of this research effort.

Next time, we turn our attention to the 3^rd system in the Color Computer lineup and the challenges it poses concerning DMA functionality. Until next time, I present the following two screen shots (hint: check the timestamps 😊)

CoCo DMA: “Fighting on the bus”

Picking up from last time, we were able to successfully place a byte into the internal CoCo memory from the cartridge port without the use of the CPU by utilizing a direct memory transfer procedure. However, after the transfer, BASIC programs would stop with errors at times, machine language programs would simply lock up, and the IO registers of the cartridge device would be corrupted. Clearly, our initial implementation has issues. Since education and understanding drive this effort, we need to dig deeper into the actual bus activity. For that, we must turn to the digital logic designer’s tool of choice: the logic analyzer

For anyone who has even a passing interest in digital circuitry, I (and so many others) strongly recommend obtaining a 10-20MHz dual channel oscilloscope. A multimeter may be the first tool purchased, but an economical dual channel scope should be next on the list. That said, while oscilloscopes are great to see transients and strange signals (like NTSC or audio), they don’t handle digital logic investigation as well. Enter the logic analyzer. Instead of trying to replicate the shape of a signal like a scope, the LA simply detects whether a signal is 0 or 1 (typically using TTL voltage levels as a reference, where 0 = 0-0.8V, and 1= 2.1V-5V). It performs this limited action across many channels, as opposed to the 2 or 4 of a scope. After a scope, I strongly recommend obtaining a small 8-channel USB-based analyzer. They are inexpensive, easy to use, and 8 channels supports simple parallel testing and a plethora of serial protocol testing (RS232, SPI, I2C, etc.). That said, debugging a single board computer or large interface card can take a long time with 8 (or even 16 or 34) analyzer channels. Thus, I also keep a larger professional grade logic analyzer (it used to be the only option, before the USB analyzer options came on the market). Found on eBay from test equipment manufacturers like Tektronix and HP/Agilent/Keysight at reasonable pricing, these units support dozens and sometimes hundreds of analysis channels at frequencies far beyond what the 1980’s computer enthusiast will regularly see.

HP/Agilent/Keysight 16702A Logic Analysis System

I’ve recently upgraded my bench analyzer from a 1980’s era HP 1650b (great unit, cheap to buy and own, now passed onto a fellow hardware designer to continue its usefulness) to a 2000’s era HP modular system (HP 16702a). This project gives me my first opportunity to learn how this unit works. I first connect the individual analyzer channels to the various 6809E signals using a 40-pin 3M Test Clip. I highly recommend adding this to your toolbox for signal inspection (the units are expensive if purchased new, but they never wear out, and eBay has more reasonable pricing), as they simplify moving the signal investigation to a new IC or new system. Since we want to investigate all of the bus activity, I place address lines 0-15 under test, as well as data lines 0-7, R/W (read/write), HALT, and the clock signals (E and Q). As we want to know what happens after the access of $ff61, we set a trigger on access to that memory location.

To more accurately pinpoint the issue, we’ll use a hastily (and terribly, I’ll add) written 6809 assembly application to exercise the test device and which exhibits the crashing behavior. A snippet is included below:

…
lda LOC	* grab initial data at $4000
jsr CONVERT	* convert binary to hexadecimal and return in D
std SCREEN	* store at $0400
lda DMA	* execute the DMA action ($ff61)
lda DATA	* grab data at $ff60 (the external IO data location)
jsr CONVERT	* convert binary to hexadecimal and return in D
std SCREEN+2	* store at $0402
lda LOC	* grab data at $4000 (the internal IO data location)
jsr CONVERT	* convert binary to hexadecimal and return in D
std SCREEN+4	* store at $0404
…

Since the application originates at $0e00, the lda DATA instruction after our $ff61 access resides at $0e4c. When we run the test program, the logic analyzer triggers and the program crashes. But, the evidence surfaces. On the logic analyzer, we see the following:

Address	Data	R/W	HALT	Notes
0e49	b6	1	1
0e4a	ff	1	1
0e4b	61	1	1
ffff	27	1	1	Dead Cycle
ff61	27	1	0	Nothing at $ff61, but bus rests at $27
4000	2d	0	1	Our data write ($2d = 45)
				We sample on falling E, so HALT=1
0e4d	ff	1	1	Why did we jump to $0e4d?

Folks probably start to see what is going on, but it pays to be sure. The Verilog is modified to not activate the address bus or R/W line, hold HALT low forever, and the test is repeated. Here is the result:

The problem begins to show itself. Even though the device has pulled HALT low before the end of the instruction execution, the CPU reads and executes one more instruction before releasing the bus. We can tell because of the state of the BS and BA lines. The 6809 datasheet notes that BS=BA=1 signifies a HALT condition in the CPU. Further attempts to pull HALT low earlier in the $ff61 read cycle make no difference.

Darren Atkinson (of CoCoSDC design fame) emailed after hearing about the project effort, asking about HALT line triggering. I initially misread his email as inquiring whether I had pulled HALT low early enough in the instruction cycle. But, Darren responded again and pointed to a key portion of the datasheet I had misinterpreted:

6809E Datasheet HALT Condition Specifics

Hint: It’s the text at the top of the page. I noticed the “2^nd to Last Cycle of Current Instruction” notation, but interpreted it to be illustrating to the reader that activating the HALT line before the last cycle of an instruction would not cause an immediate reaction. From my days working with the 6502, I knew that 8-bit processors are not designed to maintain an intermediate instruction state for any length of time. The CPU assumes that once an instruction is started, it must complete before anything else will be processed. While this doesn’t seem to be a concern for the HALT condition (just simply stop the processor, wait for HALT to become inactive, and continue on), it makes sense that HALT would use the same sense logic and internal handling logic as interrupts, and stopping the CPU in mid-instruction to handle an interrupt would require saving an intermediate instruction state. Thus, for that reason, the simpler 8-bit CPUs just don’t do that. Once an instruction opcode is fetched, the CPU will fully execute the current instruction before handling any event. If the datasheet showed the HALT line going low on the last cycle, it could imply that execution would stop at the end of the current instruction cycle. Thus, I thought this text was simply reinforcing the text describing the HALT pin: “A low level on this input pin will cause the MPU to stop running at the end of the present instruction and remain halted indefinitely without loss of data”.

However, my assessment was plain wrong. As Darren pointed out, the “2^nd to last instruction” text carries crucial significance. As well, it’s the key to the problem we are experiencing. The Verilog is activating HALT on the last cycle of the current instruction, which is actually too late. The CPU moves ahead to read and process the next instruction, only then latching and acting on the HALT condition. I wish this prerequisite had been noted somewhere in the datasheet, but I checked the 6809E, the 6809, the 63C09E and the 63C09 datasheets online and in my possession and found no mention of the constraint. That said, a Facebook commenter also pointed out this requirement, so perhaps everyone in CoCo land knows this tidbit of information.

Now that we know the issue, how do we solve it? I first added a NOP to the code, thinking that doing so would allow the next real instruction to be read correct (the lda $ff60). However, before testing, I quickly realized that would not work. Since the CPU is still reading an opcode, it would interpret any data on the databus during that cycle as the next opcode, potentially altering the program anyway. I next added Verilog code to wait an arbitrary 8 cycles after HALT activation before accessing memory:

The read from $ff61 happens, and HALT is activated
A NOP is read
The next instruction is read (the second cycle of a NOP instruction appears to read and discard the next opcode)
Then the CPU is halted (as shown by the BS=BA=1 condition
The Verilog waits a few more cycles (shown as $ffff:$27 read cycles)
Then the data is transferred to 4000 (in this test, the written value was $27) while HALT is deactivated.
The CPU takes a cycle to acknowledge the deactivation
The CPU takes another cycle to prepare for startup
Then, the $b6 opcode (of lda $ff60 instruction) is read

Success! BASIC test applications begin running to completion with no unexpected errors, and the test machine language application no longer crashes the machine, instead it runs to completion and illustrates that all 256 values can be transferred from outside the machine to internal memory locations.

That said, adding dead cycles to the DMA engine is far from ideal. In fact, after Darren noted that a full register stack-up could take 20+ cycles on a 63C09, I quickly found the dead cycle wait idea untenable. Clearly, we need to find a better way to trigger a DMA transfer. I quickly sketched out a few requirements and some preferences:

The action must pull HALT low prior to the second to last instruction cycle
The condition must not occur over two consecutive opcodes (interrupt could occur in between)
The condition must not require scanning the databus for specific opcode/operand sequences (i.e. watch the databus for $b6,$ff,$61)
1. Doing so requires constantly activating the SLENB line when installed in an MPI, since the data bus is hidden from MPI slots unless an IO access is made or SLENB is activated. And, SLENB being activated on arbitrary memory locations would cause other problems
2. There’s no guarantee such a sequence would never occur in any other way.
Ideally, the “trigger” action should be an instruction that would be otherwise needed (so as to not waste any time performing some action JUST to start the transfer)
If a memory address trigger, the address should be constant

I just as quickly decided that CPU instructions that perform 2 memory accesses in a single instruction would ideally suit these constraints. HALT could be triggered to go low on the first memory access, and if the second memory access did not occur 1 or 2 cycles later, the HALT condition would become inactive and the system would not perform the transfer. If the condition was met, the system initiate a transfer after the second memory access. I first gravitated to INC $ff61, which performs a ®ead-(M)odify-(W)rite action (reading the memory location, adding 1 to the value, and writing it back to memory). Since INC $ff61 performs a read cycle, then takes an internal cycle to perform arithmetic (ALU) operations, and finally writes the new value back, the Verilog state engine needed to detect the initial access, wait a cycle, and then detect the second access. That’s not hard to support, but it’s wasteful. Supporting preference #4 proves the larger challenge. Rarely would someone need to increment a value in the DMA engine register set. While I was debating options, fellow enthusiast and Nitros9 developer L. Curtis Boyle suggested LDD, which performs 2 consecutive memory accesses in 2 consecutive cycles. While LDD may not be of great use, STD (store D) would often be used to place a 16 bit value in the registers, as would other 16-bit memory store operations. Since it also simplifies the Verilog, we will utilize it as the DMA transfer start mechanism. In essence:

Once the lower IO register is accessed, bring HALT low and set flags
In the immediate next cycle:
- If the next higher IO address is accessed in the next cycle, continue to hold HALT low and prepare for a DMA transfer
- If not, release the HALT line and do not prepare for a DMA transfer

If the first condition is met but not the second, the CPU will likely notice the HALT condition and stall for 2 cycles after the current instruction, but no harm will arise. To prevent accidental DMA activity while setting register values, we should also add a bit in the eventual control register to enable or disable DMA transfers.

More testing is needed to understand other issues, if any, with assuming a transfer can begin immediately following an LDD/STD type instruction. Since Darren Atkinson already broached the subject of interrupt handling, he performed a test that activated NMI and HALT at the same time, attempting to understand which behavior took precedence. Darren utilized the following circuit and program snippet:

By organizing the stack right above screen memory and driving NMI and HALT low with an SCS access to $ff40, NMI precedence would show as characters on screen (stack values). When executed, no such values appeared, which strongly suggests HALT takes precedence over interrupts.

Now that we have verified the ability to safely transfer data from outside the CoCo to internal memory without disrupting program execution and a way to reliably and atomically trigger such a transfer, we will put all of the pieces together into action. Many thanks to those following along on this journey so far, and a special thanks to Mr. Atkinson for his insights.