I finished a debugging session yesterday where I had to hunt down a troublesome bug in the chibiArduino stack. It was first brought to my attention last week by a user of the chibiArduino stack who couldn't figure out why it would hang sometimes. He was running four devices with each device broadcasting every 0.1 seconds. It unwittingly created a an interesting stress test in the form of a broadcast storm with an average packet being sent and received by each node every 25 msec. This exposed a problem that I hadn't encountered before, which is strange since I feel like I've used the stack quite a bit.
Before I begin, I'd like to announce that there's an update to the chibiArduino stack with a bug fix in it. You can get it by going here. The rest of this post is long and somehow turned into a story about the hunt for this bug so I'm making the announcement at the beginning for those not interested in hunts for nasty bugs :)
Although it's true that I've used the stack a lot, I personally haven't had many use cases for a node that's simultaneously transmitting and receiving. That's probably why I haven't encountered this issue, but there are quite a few use cases where this could exist, ie: in a repeater mode, router hopping, or request/response scenarios. My use cases generally veer towards the data sink pattern, where all data flows in one direction towards an aggregator (i.e. sensor network), or data source pattern where data flows outward from a source to multiple receivers (i.e. actuator network).
That's why I'm glad that this stack is open source and used by people other than myself. The fact that nasty bugs lie dormant in all software is freaky so the more usage and transparency you have in software, the easier it is to flush out these types of problems and achieve high levels of stability. That's why open source kicks butt. There's a reason the majority of software that runs the internet is open source as well as rock solid. In any case, the bug that was found was a particularly nasty one. There's nothing that strikes as much fear into the heart of embedded developers as an intermittent bug. No matter how much experience they have, it always feels like going to a poorly skilled dentist with shaky hands and extremely sharp tools. Luckily, I was given a lot of details about the bug including detailed instructions on how to reproduce it.
I often get asked what the difference is between an experienced embedded developer and an inexperienced one. I guess there are quite a few differences, but the main one that stands out in my mind is debugging. The Arduino is great in that it opens the world of embedded to all types of people and makes embedded development accessible to almost everyone. But the dark side of this is that people get exposed to the nastiness of embedded systems debugging as well. Embedded system veterans have all types of war stories about nasty bugs they've had to do battle with. You're not only struggling with potential software issues, you're also up against hardware issues and sometimes the laws of physics. To fix the really nasty bugs, it requires a deep understanding of hardware, software, and sometimes, electromagnetics. Intermittent problems are probably the worst class of embedded systems bugs. These are the kind that result from signal integrity issues, intermittent noise, timing issues, buffer overflows, and wild pointers, among others. These classes of bugs seemingly defy logic and often have symptoms that seem nowhere related to the real problem that's occurring. In this case, the symptom of the problem was that one of the four systems would hang after some random amount of time.
The first step I take when tackling an intermittent bug, or any bug for that matter, is to reproduce the issue. In some cases, this is an extremely painful part, especially if the bug only occurs once every few days. I've had some experiences where I've parked a logic analyzer or an oscilloscope on a PCB with special trigger options just to catch an occurrence of an intermittent bug. Luckily, I was able to reproduce the bug fairly quickly since I had detailed instructions on it. Unfortunately, it took a while for the hang to occur so my next step is usually to try and accelerate the bug occurrence. This is especially important in statistical bugs, where an event would happen once in say every 1 billion events. In cases like that, I try to increase the event speed or event density so I can force the bug occurrence as quickly as possible.
In this case, I just increased the broadcast transmission frequency which resulted in an increased frequency of bug occurrences. After that, I wanted to locate exactly where the code was hanging. I set up an output pin and set it high or low to indicate when the code successfully passed through certain lines. It's similar to debug print statements but when debugging intermittent issues in embeddeds systems, print statements often alter timing and thus alter the bug. This could lead to chasing down false leads since the system integrity is no longer intact. Toggling an I/O is generally much faster and has less of an impact on system timing.
There's also a common complaint that the Arduino platform doesn't support AVR debuggers. I'm fairly neutral on this, but I feel like a debugger is usually a crutch. When chasing after serious bugs like timing issues, a debugger is too intrusive to the system to be used at all. In most cases, I think a debugger is best used for reverse engineering code. In some cases, if you're lucky enough to have stack traces and breakpoints on memory accesses, then it could be used for more advanced debugging like wild pointers, but then it won't actually tell you where the wild pointer is occurring. I think debuggers are good to have and I'd probably use one if it were available, but I don't feel like I'm missing out on much. Ooops, back to the story....
Once I have the toggling set up, I then start from the broadest point, i.e. the sketch and isolate the problem as much as possible by drilling deeper and deeper into the code. This is why it's good to be able to force problems quickly. When you're able to do that, you have the luxury of debugging iteratively by moving the I/O toggle points and slowly isolating the location of the bug.
I continued to isolate the bug and also test different theories I had about the problem. One theory was that a buffer overflow was somehow occurring in the stack, although I thought I had enough checks to make sure this didn't happen. Still, it was possible due to issues with atomicity. Atomicity is whether or not a statement is divisible. One common mistake is to assume a C statement is atomic, and this mistake is usually made by people that have never programmed in assembly. When C gets compiled, the C code gets broken down into a series of machine instructions. Each machine instruction occurs sequentially. For example, a register read would usually require multiple machine instructions to execute. A frequent mistake is to have a single line of C code to read a register and assume this happense atomically. In reality, that line of code gets compiled into multiple machine instructions and its possible that after the instruction to read the register, an event occurs to alter that register. This results in invalid data and if that invalid data is acted upon, it creates an intermittent bug. The usual safeguard against this is to make the register read atomic which usually means shutting off interrupts. In my case, one theory was that I didn't make all writes to the buffer atomic so there could be a case where I check to make sure there's room in the buffer for more data but due to atomicity, an event occurs to invalidate that. After careful inspection and head scratching, I couldn't find anywhere in the code where this could happen.
Eventually I isolated it to the transmit routine where I would transmit a frame, wait for an interrupt to set a flag signaling the transmission is finished, and then go on with the code. For some reason, the flag wasn't being set. Logically, I couldn't think of any reason this would occur and I had to dive deep into both the chip datasheet and the code to analyze the problem. Eventually, it occured to me that after a transmission, the radio waits for an ack. After the ack, it issues an interrupt to signal the transmission has ended and the radio state machine switches over to receive mode automatically. When I service the interrupt, I check the mode to make sure its in transmit mode when the interrupt occurs. Normally there's plenty of time to service the interrupt in transmit mode before it automatically switches over to receive mode. However if something happened to delay servicing the interrupt, then it's possible that the interrupt could be serviced after the hardware state machine switched over to receive mode, in which case, the flag indicating the transmission was accepted never gets set. It just so happens that print statements in the Arduino library use an interrupt service routine and since the AVR doesn't allow nested interrupts, any interupt must wait until the current interrupt is finished before being serviced. This means that a print statement could delay servicing the transmit interrupt long enough for the state machine to change. And further testing showed that the hang would disappear if nothing was being printed.
To summarize, the bug was a race condition between the hardware state machine and the AVR print statements. If a print statement occurred right after data was transmitted and blocked the radio interrupt service routine long enough for the radio state machine to transition back to receive mode, then the flag would not get set and the system would hang.
Whew! Now that the bug was isolated, then I just needed to make a fix. The fix was fairly simple. I had an if statement in the interrupt service routine to check that the hardware state machine was in tx mode. I removed the conditional so the flag would be set in either tx or rx mode. In rx mode, it would normally just get ignored except in the special case where the race condition occurred. In that particular condition, transitioning to receive mode itself indicates that the transmission is finished. Hence, no impact. The timing issue no longer exists and the race condition is gone. The painful part was hunting the bug down.
This post ended up a lot longer than I expected. I originally just wanted to talk about the bug and wasn't really meaning to go into much depth about hunting the bug down. But since I did, hopefully someone finds it useful in their bug hunts :)