Tuesday, March 3, 2015

But it works on my machine!!

I've heard this phrase numerous times while testing and communicating an issue/bug to a developer: "But it works on my machine!". For some of them, it's the first thing they'd say, sometimes even before I finish describing the exact sequence of events. And you'd think that I would have learned to handle this situation gracefully by now, but I still have to resist the urge to smack them and drag them to my desk or wherever the tests ran, and show them the error.

Well...I wouldn't be writing this just for that. I recently faced this issue myself. I've been working on creating an MQTT keyword library for Robot Framework. This library provides keywords to publish/subscribe to an MQTT broker. Source code is here: https://github.com/randomsync/robotframework-mqttlibrary

One of the keywords in this library is 'unsubscribe'. It lets a durable client (one that subscribed with clean session set to false) unsubscribe from a topic so that it doesn't receive any further messages published to the broker on that topic. If the client disconnects without unsubscribing, the subscription remains valid and the broker will deliver any messages it received in the meantime when the client next reconnects.
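For anyone not familiar with MQTT durable sessions, here's roughly what that looks like using the underlying paho-mqtt client directly (the broker address, client id and topic below are just placeholders, not what the library or its tests actually use):

    import paho.mqtt.client as mqtt

    # clean_session=False makes this a durable client: the broker remembers its
    # subscriptions and queues QoS 1 messages for it while it is offline.
    client = mqtt.Client(client_id="durable-client", clean_session=False)
    client.connect("localhost", 1883)
    client.loop_start()

    client.subscribe("some/topic", qos=1)

    # If the client disconnects now WITHOUT unsubscribing, messages published to
    # some/topic keep piling up at the broker and get delivered on the next
    # connect. Unsubscribing first tells the broker to stop doing that.
    client.unsubscribe("some/topic")

    client.loop_stop()
    client.disconnect()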

A test for this keyword goes like this (the underlying MQTT flow is sketched after the steps):
Step 1. Connect, Subscribe and Unsubscribe from a topic with a durable client (Client A)
Step 2. Publish messages to the topic with a different client (Client B)
Step 3. Connect as Client A, Subscribe and ensure that messages published by Client B are NOT received.
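The test itself is written as Robot Framework keywords, but the underlying MQTT flow is roughly the following (sketched with paho-mqtt directly; the broker, topic and client ids are placeholders, and the crude sleeps are only there to keep the sketch short):

    import time
    import paho.mqtt.client as mqtt

    BROKER = "localhost"        # placeholder; the real tests also run against iot.eclipse.org
    TOPIC = "test/unsubscribe"  # placeholder topic
    received = []

    def on_message(client, userdata, msg):
        received.append(msg.payload)

    # Step 1: durable Client A connects, subscribes, then unsubscribes and disconnects
    client_a = mqtt.Client(client_id="client-a", clean_session=False)
    client_a.on_message = on_message
    client_a.connect(BROKER)
    client_a.loop_start()
    client_a.subscribe(TOPIC, qos=1)
    client_a.unsubscribe(TOPIC)   # disconnecting right after this is where the trouble starts
    client_a.loop_stop()
    client_a.disconnect()

    # Step 2: Client B publishes to the same topic while Client A is away
    client_b = mqtt.Client(client_id="client-b")
    client_b.connect(BROKER)
    client_b.loop_start()
    client_b.publish(TOPIC, "client A should never see this", qos=1)
    time.sleep(1)                 # crude wait for the publish to go out
    client_b.loop_stop()
    client_b.disconnect()

    # Step 3: Client A reconnects and subscribes again; nothing should be waiting for it
    client_a.connect(BROKER)
    client_a.loop_start()
    client_a.subscribe(TOPIC, qos=1)
    time.sleep(2)                 # crude wait for any queued deliveries
    client_a.loop_stop()
    client_a.disconnect()

    assert not received, "broker delivered messages after unsubscribe"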

I wrote the test using Robot Framework and it worked on my Mac. To run these tests, I'm using a local mosquitto broker and also a public broker provided by the Eclipse project at http://iot.eclipse.org. Running the tests from my Mac against both the local broker and the Eclipse broker verified that after unsubscribing and reconnecting, no messages were delivered. I pushed the change.

I also have the project set up to build on travis-ci.org: https://travis-ci.org/randomsync/robotframework-mqttlibrary. To my dismay, that test failed on travis-ci. WTF? "But it works on my machine!!"

Typically, unless there's something obvious that you overlooked, the only way to tackle these kinds of issues is by process of elimination. You account for the differences between the local and remote environments and determine whether any one of them, or a combination, might be the culprit. Of course, in these kinds of scenarios, it helps if the local machines you build on are as similar to the build/deploy servers as possible. (At Amazon, all engineers are given a RHEL VM instance to develop on, which is what is used for production deployments as well.)

In my case, the differences were:
Local environment: Mac, Python 2.7.6, pip 1.5.6
Travis build instance: Ubuntu 12.04, Python 2.7.9, pip 6.0.7

Other dependencies were installed through pip and *should* be the same:
paho-mqtt: 1.1
robotframework: 2.8.7

The target server iot.eclipse.org is running mosquitto version 1.3.1; locally, I have version 1.3.5.

So the first thing I could eliminate easily was the broker. I ran the tests from my machine using iot.eclipse.org as the target and they passed. Still, I went through the release notes for the mosquitto server to see if there were any changes between 1.3.1 and 1.3.5 that might provide a clue.

The next thing I looked into was re-creating locally the VM instance Travis uses, so I could debug better; not having access to any logs or to the machine where the tests fail is a major hindrance. I found some helpful articles [1] [2] [3]. There's also an option to upload the build artifacts to S3, as described here.

At the time, I didn't get a chance to try any of these. Ideally, as I mentioned before, you should have an easily accessible build environment that is as close to production as possible, so, long term, having a local instance similar to what travis-ci uses will help in debugging build issues. In this case, I found that the tests also failed on a local Windows machine, which made debugging easier.

One of the things I had a hunch about right from the start was that I was not waiting long enough for the 'unsubscribe' to complete. What if I send a 'disconnect' so quickly that the broker hasn't even finished processing the 'unsubscribe' packet? I was able to confirm this by adding a 1-second sleep after unsubscribing on the Windows machine. With that in place, the tests passed.
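In terms of the earlier sketch, the diagnostic was roughly this (same placeholder broker and topic; the sleep is only a probe, not a fix):

    import time
    import paho.mqtt.client as mqtt

    client_a = mqtt.Client(client_id="client-a", clean_session=False)
    client_a.connect("localhost", 1883)
    client_a.loop_start()
    client_a.subscribe("test/unsubscribe", qos=1)

    client_a.unsubscribe("test/unsubscribe")
    time.sleep(1)        # diagnostic only: give the broker time to process the UNSUBSCRIBE
    client_a.loop_stop()
    client_a.disconnect()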

Obviously, adding sleeps is not the correct fix. The Paho client's documentation suggests using one of the 'loop*' functions: http://eclipse.org/paho/clients/python/docs/#network-loop. These allow you to wait and confirm that a message was sent or received. I had overlooked them before, but I went ahead and added them to the connect and subscribe functions (I still need to do that for publish and disconnect) and was able to verify that the unsubscribe test worked without the sleep.
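Here's a sketch of that approach for the unsubscribe case (this is the idea, not the exact code in the library): remember the message id returned by unsubscribe() and run the network loop until the on_unsubscribe callback reports that id back, which means the broker has acknowledged the UNSUBSCRIBE.

    import paho.mqtt.client as mqtt

    unsubscribed = set()

    def on_unsubscribe(client, userdata, mid):
        # fires when the broker acknowledges the UNSUBSCRIBE with an UNSUBACK
        unsubscribed.add(mid)

    client = mqtt.Client(client_id="client-a", clean_session=False)
    client.on_unsubscribe = on_unsubscribe
    client.connect("localhost", 1883)
    # (a fuller version would also run the loop here until on_connect fires, and
    #  do the same for subscribe/publish -- that's the change described above)

    result, mid = client.unsubscribe("test/unsubscribe")

    # run the network loop until this unsubscribe has been acknowledged,
    # instead of sleeping; a real implementation would also enforce a timeout
    while mid not in unsubscribed:
        client.loop(timeout=1.0)

    client.disconnect()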

Conclusion:
  1. Inconsistent test failures are the bane of test automation: they undermine the value it provides. Follow these best practices:
    1. Design robust automated tests. DO NOT add an automated test if it's not 100% reliable. I would much rather have 1 reliable test than 10 unreliable tests.
    2. Have a build environment available locally that is very similar to (if not the same as) the one used by your CI hosts
  2. But, just because a test is failing inconsistently doesn't always mean it's a test issue. It can be a bug in the code, as seen above. It definitely helps if the test automation engineers know how the application is implemented and can look at and understand the code. Sometimes, just looking at the code gives you ideas on what kind of edge conditions to test for. Sometimes, you just get lucky and find an issue which may have been overlooked.
  3. I still don't know why the tests pass on a Mac and consistently fail on a Windows/Ubuntu (Travis) machine. The Python version is different, but I didn't get to evaluate that. Could there be a difference in how network packets are sent/received in whatever libraries the two versions of Python are using? There's also a slight chance of a bug in some client/broker implementation if the tests fail inconsistently.
    Next steps:
    • Set up a virtualenv on the Mac so I can use different versions of Python
    • Set up a local image like the one used by travis-ci