Tuesday, January 12, 2016

Using Raspberry Pi to digitize the "Weeks Since Friday Deploy" counter

At Payoff, we use GitHub Flow for our software development and release process. Among other development best practices, this means we consider our master branch always deployable to production, and we release to production several times a week, sometimes several times a day.

One additional rule we follow is to not deploy to production on Fridays, except for emergency deployments. Inevitably, things go wrong during production deployments, and we do not want our engineers to spend Fridays or weekends trying to debug a production deployment issue. We also make use of dark deployments and feature flags to further ease the production release process.

To make everyone on the team aware of how often we break the "No Friday deployment" rule due to emergency deployments, we put up a small poster that tracks the number of weeks since our last Friday deployment. It is visible throughout our engineering area. Obviously, the higher this number, the longer we have been able to adhere to the rule. Any Friday deployment to production brings this number back to 0, which is what seems to have happened below:

Weeks Since Friday Deployment Counter

This number was initially updated manually on sticky notes. But since we use Bamboo as our continuous integration tool, we have access to its REST API to query the dates of production deployments. With this API and a little bit of code on a Raspberry Pi, there was no reason not to automate it and display the number on a digital counter.


First off, the display we use is an 8x8 LED matrix, available here with a display driver. It's big and bright enough to be visible across our engineering room and can be daisy-chained with other matrices. There is also a convenient Python module available to drive the display. After a bit of soldering and hooking it up to the Raspberry Pi's GPIO ports, here's what the initial setup looked like:

Raspberry Pi with MAX7219 Dot Matrix Module

And here's a snippet of python code to query Bamboo's REST API to get the weeks since a production deployment was done on a Friday:

import datetime
import sys

import requests

BAMBOO_URL = "https://<bamboo host>/rest/api/latest"
DASHBOARD_URL = BAMBOO_URL + "/deploy/dashboard"
RESULTS_URL = BAMBOO_URL + "/deploy/environment/{}/results"

def get_weeks_since_friday_deploy(user, passwd):
    weeks_since_friday_deploy = sys.maxint
    r = requests.get(DASHBOARD_URL, auth=(user, passwd))
    prod_env_ids = {}
    # first get all production environments' ids, which will
    # be used to get deployments
    for project in r.json():
        name = project['deploymentProject']['name']
        envs = project['environmentStatuses']
        for env in envs:
            if env['environment']['name'] == 'Production':
                if 'deploymentResult' not in env:
                    prod_env_ids[name] = env['environment']['id']
    for key, value in prod_env_ids.iteritems():
        # no need to continue if it's already 0
        if weeks_since_friday_deploy == 0:
            break
        r = requests.get(RESULTS_URL.format(value), auth=(user, passwd))
        deployments = r.json()['results']
        for dep in deployments:
            # Bamboo returns millisecond timestamps
            start = datetime.datetime.fromtimestamp(dep['startedDate'] / 1e3)
            end = datetime.datetime.fromtimestamp(dep['finishedDate'] / 1e3)
            # isoweekday() == 5 means the deployment touched a Friday
            if start.isoweekday() == 5 or end.isoweekday() == 5:
                weeks = (datetime.datetime.now() - start).days / 7
                if weeks < weeks_since_friday_deploy:
                    weeks_since_friday_deploy = weeks
                weeks = (datetime.datetime.now() - end).days / 7
                if weeks < weeks_since_friday_deploy:
                    weeks_since_friday_deploy = weeks
                # it is possible to get a negative number in an edge case,
                # if the script runs very close to the time of deployment
                if weeks_since_friday_deploy < 0:
                    weeks_since_friday_deploy = 0
            # once it's 0, no need to continue further
            if weeks_since_friday_deploy == 0:
                break
    return weeks_since_friday_deploy

Using the code above and the Python library to drive the display (https://github.com/rm-hull/max7219), it is easy to display the number on the matrix and update it automatically. The script is set up to run as a cron job every 5 minutes on Fridays.
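The crontab entry for that schedule might look like the following (the script path is hypothetical); the day-of-week field set to 5 restricts it to Fridays:

```shell
# m     h    dom  mon  dow  command
*/5     *    *    *    5    /usr/bin/python /home/pi/deploy_counter.py
```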

Weeks Since Friday Deployment Counter (Automated)


Next step: we haven't gotten there yet, but if the number goes above 9, another matrix can be daisy-chained to form a cascaded display.

Thursday, November 12, 2015

What can you do with code?

I recently started mentoring a local high school's FRC team. Even though the challenge hasn't been announced yet, the team has started putting back together last year's robot just to get in the rhythm. We are also trying to recruit more students for the software team, since those who programmed last year's robot will be graduating this year.

So I was tasked with getting these students familiar with the code. These are students who have had just a little introduction to programming, either through previous involvement in a robotics team or through an intro-level programming course. I myself would have to spend some time with the code and the API to understand how it all works before I could guide them. I couldn't get my hands on the code before our first meeting, so instead I thought of showing them some other real-life code and its applications. To make it fun, I showed them each snippet first and had them try to guess what the application was.

Here are the 4 snippets of code and the applications (the slides are below as well):

1. I am a big fan of FPS games, and I thought the students must have played some games of that kind, so it would be a good way to get them excited. I included the Doom 3 source code, as explained at http://fabiensanglard.net/doom3/index.php.

2. I had to include the code that started the OSS revolution, so I included the starting point of the Linux kernel.

3. At this point, I didn't want the students to get overwhelmed and think that code can only be written by teams of very talented programmers over many years. So I included some code from the project that won the Astro Pi contest (http://astro-pi.org/competition/winners/). The code was written by students just like them, and I explained what it does and that it would be sent to the International Space Station on an upcoming launch.

4. Lastly, I wanted to include something that would be fun to show that code doesn't always have to have world changing implications. I searched for some cool Raspberry Pi projects and found this: http://www.scottmadethis.net/interactive/beetbox/.

In the end, I told them that it would be great fun to work on this project as a team. In the last slide, I asked them to not think "What can you do with code?", but "What will you do with code?".


Friday, October 16, 2015

Test Automation Frameworks

On the test engineering team, one of the choices we often have to make is picking the best automation tool for a particular project. In terms of automation frameworks, one size does not fit all. Each tool/framework has its own strengths, which make it suitable for specific types of projects. In this blog post, I'd like to introduce (in no particular order) the various frameworks we use for test automation and the reasons each was chosen for its particular application.

Robot Framework

Robot Framework is a generic test automation/execution framework. It's generic in the sense that it doesn't provide any real automation capabilities itself, but has pluggable underlying libraries that cover most automation needs: a Selenium library for browser automation, an Appium library for mobile, and a Requests library for testing HTTP web services. Robot Framework's strengths are its ease of use, flexibility, and extensibility. You can create your tests in easy-to-read text (or Markdown or HTML) format, and you can use the lower-level keywords provided by the libraries to create higher-level keywords that better describe your application. This is particularly beneficial for test engineers because it lowers the automation learning curve for other team members, like product managers and developers. Test engineers can provide the keywords in such a way that writing tests becomes just a matter of picking the right keywords and supplying the right arguments.
We use Robot Framework for some of our front-end Rails projects. These projects mostly involve design and text changes and are tested using the Selenium library. Once the features of these applications are described appropriately in keywords, it's easy to create or update tests. For example, here is a test that opens our home page, clicks a link, and validates that the "Apply Now" button is present.

*** Test Cases ***
| Validate About Page
| | [Documentation]             | Validate content in 'About Payoff' page
| | [Tags]                      | about
| | Open Browser To Home Page
| | Click Element               | link=About Payoff
| | Title Should Be             | About Us \| Next-Generation Financial Services \| Payoff
| | Page should contain element | id=btn-apply-now-get-started
| | [Teardown]                  | Close Browser

Example Robot Framework Test Case
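`Open Browser To Home Page` in the test above is itself a higher-level, user-defined keyword. Its definition might look something like this (the variable names and the extra keyword are illustrative, not our actual setup):

```
*** Keywords ***
| Open Browser To Home Page
| | Open Browser             | ${HOME PAGE URL} | ${BROWSER}
| | Maximize Browser Window
```

This is the layering mentioned above: the Selenium library's low-level keywords (`Open Browser`, `Maximize Browser Window`) are composed into a keyword named after what the application does.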


Capybara

For testing our loan UI application we use Capybara, a web automation framework. We use it with Selenium WebDriver to run the tests in a real browser, but it can also be used with headless drivers like capybara-webkit or Poltergeist. While Robot Framework works great for validating simpler UI applications, we need a more robust programming environment for testing the loan application because of the multiple steps and conditionals involved. We also have an external data source for this application, and it gets tedious to write data-driven tests with multiple steps in Robot Framework.
Capybara has a nice DSL to write expressive UI tests. Another advantage of writing tests in Capybara is that we can use our internally developed gems (or external gems, if needed) to enhance the automated tests. For example, we use feature flags to disable/enable certain functionality in our applications and to test them, we might use these gems to turn the feature flags off or on, run the tests and then toggle them back to their original state.
Here's a snippet of a typical feature test written in Capybara:

feature 'Loan application flow' do
  scenario 'complete and submit an application for approval' do
    # homepage
    visit '/'
    # start application
    # loan amount
    expect(page).to have_content('How much do you owe on your credit cards?')
    fill_in('loan-amount', :with => '10000')
    # FICO score
    expect(page).to have_content('What\'s your FICO Score?')
    select('Excellent (720+)', :from => 'credit_score_bracket')
    # name
    expect(page).to have_content('What\'s your name?')
  end
end

Feature Test in Capybara


Karma

Karma is a JavaScript test runner for unit testing the JavaScript in web applications; the tests themselves can be written in frameworks like Jasmine. It is primarily used by our front-end developers. Karma integrates with our development workflow: the tests are included as part of the Rails package and run every time a build runs in our CI environment.


Airborne

Airborne is a gem that makes it easy to write web service API tests using RSpec. The reason for using this gem as opposed to others is to make the tests and validations as descriptive as possible. We also have some APIs with a lot of endpoints or intermediate states to test, so it is especially important to make it easy to add new tests.
Here's a snippet of hypothetical tests for httpbin.org:

describe 'httpbin api service' do
  before(:all) do
    @url = 'http://httpbin.org'
  end

  it 'get returns original url in json response' do
    get @url + '/get'
    expect_json(url: 'http://httpbin.org/get')
    expect_json_sizes(args: 0)
  end

  it 'get with url params returns the params in json response' do
    get @url + '/get?spec=true'
    expect_json_sizes(args: 1)
    expect_json('args', spec: 'true')
  end

  it 'post returns data in json response' do
    post @url + '/post', { :spec => 'true' }
    expect_json('json', spec: 'true')
  end
end

API Tests using Airborne


TestNG

TestNG is a Java-based test framework that provides the capabilities to write automated tests in Java. For example, browser automation tests can be written using Selenium WebDriver's Java bindings, and tests for REST-based services using libraries like rest-assured. We use TestNG for testing some of our internal REST services, specifically because TestNG makes it easy to use an external data source such as a CSV or Excel file via DataProviders. It also integrates nicely with build tools like Maven and CI tools like Jenkins and Bamboo.
Overall, TestNG works well for automation projects in Java, but to leverage our internal toolset and Ruby expertise, we are using RSpec + Capybara or Airborne for most of our automation.

Galen Framework

We have recently started investing some time in Galen Framework for testing the UI/UX of our front-end applications. This framework promises to solve the widespread problem of testing the layout of web applications on multiple display resolutions. So far, the framework looks promising and we are continuing to automate the layout testing of our marketing websites. We plan on partnering with our UX designers so that they can use this tool to test the layout as they make changes to the applications.
Galen Framework uses Selenium as the underlying automation tool, and the layout specs are written in its own spec language. In the layout spec below, we verify some elements on our marketing home page in 2 different layouts (desktop and mobile) with specific display resolutions:

@objects
    how-payoff-works        id hdr-how-payoff-works
    rates-fees              id hdr-rates-and-fees
    about-payoff            id hdr-about-payoff

= Verify Marketing Landing page Elements =
    @on desktop
        how-payoff-works:
            text matches "How Payoff Works"
            width ~131px
            height ~70px
            aligned horizontally all rates-fees 2px
    @on mobile
        how-payoff-works:
            text matches "How Payoff Works"
            width 345 to 360 px
            height ~40px
            aligned vertically all rates-fees 2px
        menu-button:
            text matches "Menu"
            width ~81px
            height ~40px
            aligned horizontally all login-mobile 2px

Galen Framework Layout Spec

Galen Framework can handle differences between mobile and desktop screens, such as cases where an element is visible in one and not the other. For example, the menu-button above is not visible at desktop screen resolutions, so it can be specified to be absent in the desktop layout.
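That could be expressed with a hypothetical addition to the desktop section of the spec above, using Galen's `absent` check:

```
    @on desktop
        menu-button:
            absent
```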
Interaction with the UI can be automated using Java (or JavaScript), and as mentioned earlier, Galen uses Selenium WebDriver for that. Here's an example that uses the above spec to validate the layout:

test("Desktop Landing Page Elements", function(){
    var driver = session.get("driver");
    checkLayout(driver, "./specs/marketingPage.gspec", "desktop");
});

Galen Framework Test


Appium

Appium is a mobile automation framework that allows writing tests using a common API for both iOS and Android. Since it uses the WebDriver protocol, tests are written much like browser tests on a desktop using WebDriver. Currently we don't have a use case for native app testing, so we are content with using Appium to test our websites in mobile browsers. Once we do, we will explore other tools/frameworks like UIAutomator for Android, UI Automation for iOS, or Calabash for both.

Appendix: Selenium WebDriver

I couldn't end this post on test automation frameworks without talking more about Selenium WebDriver, the underlying tool for browser automation. There was a time when Mercury's WinRunner and, later, QuickTest Pro were the tools of choice for UI automation. At that point, the term Continuous Integration hadn't yet been popularized by Martin Fowler, and the idea of test automation at most companies was to have some QA engineers write scripts on the side and run them manually, to make repeated, time-consuming tests execute faster. QuickTest Pro was a commercial tool and cost quite a bit, AND it only supported Internet Explorer on Windows. Nevertheless, it was good at what it did, which was to reduce a bit of the manual overhead of repeated regression testing. It also integrated well with Mercury's TestDirector and Quality Center for test management, so you could envision tying together requirements gathering, test planning, test execution, and defect reporting within one tool.
But as IE lost ground to Firefox and Chrome, mobile browsing grew more popular, and HP acquired Mercury, QuickTest Pro started falling behind. More and more teams wanted to adopt agile methodologies, and HP's tools were heavyweight and didn't fit an agile environment. Around that time, Selenium started gaining popularity. It was open source, so test engineers could easily prototype an automation solution with it and convince their teams to start using it. Support for testing on different browsers was a huge bonus, as was its easy integration with continuous integration tools. The open source community behind it kept improving it, with support from companies like Mozilla, Google, and ThoughtWorks. The ability to write automated tests in any of the popular languages, and Selenium's integration with a number of frameworks like TestNG, JUnit, Capybara, Cucumber, and Watir, added to its appeal.
Paul Hammant wrote a blog post with his thoughts on why QTP was no longer relevant. He has some graphs showing the popularity of Selenium vs. QTP on the job site indeed.com. Those are a bit old (2011), so I pulled up a similar graph myself, and it is very telling:

Job Trends from Indeed.com - Relative Growth (UFT vs QTP vs quicktest vs Selenium)

Appendix: Apache JMeter

While this post is about functional automation, I want to briefly mention Apache JMeter, which is the tool we use for load/performance testing. In a future post, I'll describe how we use it and how it integrates in our continuous deployment process.

Thursday, June 25, 2015

What Erno Rubik can teach you about hiring

I recently came across this video interview of Erno Rubik, the inventor of the Rubik's Cube, possibly the most popular puzzle/toy ever. The interview has fascinating insights into what led to the creation of the cube.

Apart from everything else he has to say, there are 2 things that really struck me. One, he says he's a "very ordinary man". Considering that he built something that has captured the imagination of, and had such a positive influence on, millions of people, it's truly inspiring to see how humble he is. Second, he says that if there's anything special about him, it's that he loves what he does.

Thinking about it, those are the 2 most important qualities we look for when interviewing people for our team. One, that they are humble. No matter your accomplishments or skills, if you go around beating your chest about them, you're not going to be able to work cohesively in a team. We will not hire so-called rockstars who can deliver 10 times more than a normal engineer but have trouble holding a conversation without berating or criticizing someone.

The 2nd quality - to love what you do - is just as important. It is what drives people to pursue mastery of a skill regardless of the end result. It is what makes you continuously improve yourself whether you're successful in the short term or not. And it is what makes good team players and helps deliver successful products.

In essence, there is a special talent in celebrating your skills and accomplishments in a humble way, and Erno Rubik embodies it. Someone with that talent, and with a love for their work, is always a pleasure to know and work with.

Tuesday, March 3, 2015

But it works on my machine!!

I've heard this phrase numerous times while testing and communicating an issue/bug to a developer: "But it works on my machine!". For some of them, it's the first thing they say, sometimes even before I finish describing the exact sequence of events. You'd think I would have learned to handle this situation gracefully by now, but I still have to resist the urge to smack them, drag them to my desk (or wherever the tests ran), and show them the error.

Well...I wouldn't be writing this just for that. I recently faced this issue myself. I've been working on creating an MQTT keyword library for Robot Framework. This library provides keywords to publish/subscribe to an MQTT broker. Source code is here: https://github.com/randomsync/robotframework-mqttlibrary

One of the keywords provided by this library is 'Unsubscribe'. It lets a durable client (one that subscribed with clean session set to false) unsubscribe from a topic so that it doesn't receive any further messages published to the broker on that topic. If the client disconnects without unsubscribing, the subscription remains valid, and the broker will deliver any messages received in the meantime when the client next reconnects.

A test for this keyword is:
Step 1. Connect, Subscribe and Unsubscribe from a topic with a durable client (Client A)
Step 2. Publish messages to the topic with a different client (Client B)
Step 3. Connect as Client A, Subscribe and ensure that messages published by Client B are NOT received.

I wrote the test using Robot Framework, and it worked on my Mac. To run these tests, I use a local mosquitto broker as well as the public broker provided by the Eclipse project at http://iot.eclipse.org. Running the tests from my Mac against both brokers verified that after unsubscribing and reconnecting, no messages were delivered. I pushed the change.

I also have the project set up to build on travis-ci.org: https://travis-ci.org/randomsync/robotframework-mqttlibrary. To my dismay, that test failed on travis-ci. WTF? "But it works on my machine!!"

Typically, unless there's something obvious that you overlooked, the only way to tackle these kinds of issues is the process of elimination. We try to account for the differences between the local and remote environments and determine whether any one, or a combination, of those differences might be the culprit. Of course, in these kinds of scenarios, it helps if the local machines you build on are as similar to the build/deploy servers as possible. (At Amazon, all engineers are given a RHEL VM instance to develop on, which is also what is used for production deployments.)

In my case, differences were:
Local environment: Mac, Python 2.7.6, pip 1.5.6,
Travis build instance: Ubuntu 12.04, Python 2.7.9, pip 6.0.7

Other dependencies were installed through pip and *should* be the same:
paho-mqtt: 1.1
robotframework: 2.8.7

Target server iot.eclipse.org is running mosquitto version 1.3.1 and locally, I have version 1.3.5 running.

So the first thing I could easily eliminate was the broker. I ran the tests from my machine using iot.eclipse.org as the target, and they passed. Still, I went through the release notes for the mosquitto server to see if any changes between 1.3.1 and 1.3.5 might provide a clue.

The next thing I looked into was somehow re-creating locally the VM instance Travis uses, so I could debug better, because not having access to any logs or to the machine where the tests fail is a major hindrance. I found some helpful articles [1] [2] [3]. There's also an option to upload the build artifacts to S3, as described here.

At the time, I didn't get a chance to try any of these. Ideally, as I mentioned before, you should have an easily accessible build environment that is as close to production as possible, so in the long term a local instance similar to what travis-ci uses will help in debugging build issues. In this case, I found that the tests also failed on a local Windows machine, which made debugging easier.

One of the things I had a hunch about right from the start was that I was not waiting long enough for the 'unsubscribe' to complete. What if I sent a 'disconnect' before the broker even finished processing the 'unsubscribe' packet? I was able to confirm this by adding a 1-second sleep after unsubscribing on the Windows machine. After adding that, the tests passed.

Obviously, adding sleeps is not the correct fix. The Paho client's documentation suggests using one of the 'loop*' functions: http://eclipse.org/paho/clients/python/docs/#network-loop. These allow you to wait and confirm that a message was actually sent or received. I had overlooked them before, but I went ahead and added them to the connect and subscribe functions (I still need to do the same for publish and disconnect) and was able to verify that the unsubscribe test worked without the sleep.
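As a sketch of the general pattern (not the library's actual code), the wait can be done by running paho's background network loop and blocking on an event that the on_unsubscribe callback sets when the broker's UNSUBACK arrives. The broker host and topic here are placeholders, and the callback signature assumes paho-mqtt 1.x:

```python
import threading

import paho.mqtt.client as mqtt

unsuback = threading.Event()

def on_unsubscribe(client, userdata, mid):
    # called when the broker acknowledges the unsubscribe with an UNSUBACK
    unsuback.set()

client = mqtt.Client(client_id="client-a", clean_session=False)
client.on_unsubscribe = on_unsubscribe
client.connect("test.mosquitto.org")  # placeholder broker
client.loop_start()                   # network loop runs in a background thread

client.unsubscribe("payoff/test/topic")
# wait for the broker's acknowledgment instead of a blind sleep
if not unsuback.wait(timeout=5):
    raise RuntimeError("UNSUBACK not received in time")
client.disconnect()
client.loop_stop()
```

The key difference from the sleep-based version is that the client only proceeds to disconnect once the broker has confirmed the unsubscribe, so the test is no longer timing-dependent.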

  1. Inconsistent test failures are the bane of test automation; they undermine the value it provides. Follow these best practices:
    1. Design robust automated tests. DO NOT add an automated test if it's not 100% reliable. I would much rather have 1 reliable test than 10 unreliable tests.
    2. Have a build environment available locally which is very similar (if not the same) as the one used by your CI hosts
  2. But, just because a test is failing inconsistently doesn't always mean it's a test issue. It can be a bug in the code, as seen above. It definitely helps if the test automation engineers know how the application is implemented and can look at and understand the code. Sometimes, just looking at the code gives you ideas on what kind of edge conditions to test for. Sometimes, you just get lucky and find an issue which may have been overlooked.
  3. I still don't know why the tests pass on a Mac and consistently fail on Windows and Ubuntu (Travis) machines. The Python versions are different, but I didn't get to evaluate that. Could there be a difference in how network packets are sent/received in whatever libraries the 2 versions of Python use? There's also a slight chance of a bug in some client/broker implementation if the tests fail inconsistently.
    Next steps:
    • Set up a virtualenv on my Mac so I can use different versions of Python
    • Set up a local image like the one used by travis-ci

Wednesday, February 29, 2012

How I’m NOT going to delete my Windows profile folder again

So I deleted the complete profile folder under C:\Documents and Settings\<my user id> on my work machine. Everything under it was gone, except for a few read-only folders that survived. My Documents, Application Data, Desktop, Local Settings, Favorites… gone. My Outlook archives, my OneNote notes, and a few years' worth of accumulated documents… all done for. Plus all my Java projects, since I had configured git under that folder and all my git projects were there.
How did I do that? Well… I was creating a new Java project in Eclipse to demo the use of excel-testng. But since I didn’t want to use the default workspace location, I specified the location where I keep my git projects, which happens to be the profile folder. Usually, when I use the default location, Eclipse creates a sub-folder under it with the same name as the project. But when I didn’t use the default location, it didn’t create a sub-folder with the project name. It created the project right under that location!

Not having realized that, and since I wanted to remove the project (I can’t remember why), I proceeded to delete it.

Now, to be fair to myself, going back to the moment just before I clicked OK: I had never before needed to NOT delete a project's contents from disk. The whole idea of deleting a project from the Eclipse workspace had always been to get rid of it, completely. And that’s what I did. At the time, I found it a little surprising that it was taking longer than usual. But it was 5pm on a Friday evening, and things usually get a little strange around that time, so I was ready for anything. Just not for what happened next.
It probably took me less than a few seconds to realize what had happened, since I had an Explorer window open on that folder and saw a blank white screen where my profile folder contents had been. I’m not going to mention the words that came out of my mouth next, or the next few hours I spent trying, unsuccessfully, to restore the contents and recreate the git projects.
There’s a happy ending to this: I’d had my laptop replaced just a week earlier and had a backup of most of that folder. I’ll most probably never find out if there was anything I was NOT able to recover, and that’s a good thing!

Thursday, February 23, 2012

excel-testng: driving TestNG tests through MS Excel

Over the last few days, I have been working on a small project in my spare time. It's called excel-testng, and it provides a way to drive TestNG tests through MS Excel. The code repository is here: http://code.google.com/p/excel-testng/ and the jars can be downloaded from here. I also put together a small automation project using Selenium WebDriver to demonstrate its use here: https://github.com/randomsync/excel-testng-demo.
During functional testing projects, after we create tests for an application, we need to review them with the rest of the team (developers, analysts, project managers, etc.). We may use a test management tool to document the test cases and then export them into an easily distributable format like MS Excel or PDF. Sometimes, we may also create the tests directly in Excel, specifying the test name, description, parameters, and other data in the spreadsheets. And then, finally, we may automate some (or all) of the tests.
So after the tests are automated, when executing them we need to specify which tests to run and the test data (parameters). In Quality Center, this involves creating a test set (similar to a test suite) and then adding tests to it. If they are automated in QTP, which integrates with QC, they can be executed by running the test set. But we have started using Selenium WebDriver for its cross-browser capabilities, and that means we need to specify the tests in a format that can drive the Selenium tests. We use TestNG as the framework for test execution, assertions, and reporting.
Extending TestNG
TestNG is a great framework for test execution, and its input can take the form of an XML file that specifies which tests to run, where to find the test classes/methods, test parameters, and a whole lot of other options that give you fine-grained control over each test execution. However, for our UI functional tests, we wanted to be able to specify the tests and test executions in an easily distributable format (like MS Excel), as mentioned above. The good thing is that TestNG can be extended so that its input is created and executed programmatically. This way, the test specifications in Excel files can be parsed and driven through TestNG, and this is exactly what excel-testng does. It externalizes all the Excel parsing and TestNG XmlSuite creation so that the focus can be on creating the test classes and methods. Once that is done and the Excel specifications are created, all that needs to be provided is a main method that does this:
// this can be a single file or a directory, in which case all
// spreadsheets in that directory are parsed
ExcelTestNGRunner runner = new ExcelTestNGRunner("input.xls");
runner.run(); // run the tests

With the ExcelTestNGRunner class, you just specify the location of the Excel file(s) that make up the test specifications and then call the run() method, which parses the Excel files into XmlSuites using the default parser (an instance of ExcelSuiteParser), creates a TestNG object if one isn't already created, and runs the suites. After that, TestNG takes care of test execution and reporting.
Excel Test Specification
ExcelTestNGRunner parses each worksheet in the Excel file(s) into a separate suite using the included parser (ExcelSuiteParser). Each suite can have suite-level parameters specified in the worksheet, along with the test specifications that define which tests will be run, their names, descriptions, parameters, and test classes. Here's what a test suite in Excel looks like (a demo spreadsheet can also be downloaded from here and used as a starting point):

Specifying TestNG tests in Excel File
The top few rows in the worksheet provide the suite information. ExcelSuiteParser looks for the string "Suite Name" and retrieves the name of the suite from next cell in the same row. Similarly, it looks for string "Suite Parameters" and retrieves suite parameters from the next cell. "Suite Configuration" is not currently used. You can customize the location of these values by providing your own map to the parser. See "Customizing Input" for more details.

To retrieve the tests to be executed, the parser looks for the first row containing "Id" in the first column. This is the header row; each row below it is a separate test. The header row and tests must have columns specifying Id, Test Name, Test Parameters, and Test Configuration (which specifies the classes containing the test methods). If "Id" is left blank, the test will not be added to the suite. Finally, the parser turns each row under the header row into a TestNG XmlTest and adds it to the suite. The test specifications are provided as:
  • Test Name: generated by concatenating Id & Test Name
  • Test Parameters: retrieved from "Test Parameters" column and need to be provided in valid properties (<key>=<value> etc.) format. You can also specify functions as parameter values and then add the logic to parse and evaluate the functions in your test classes (maybe a Base Test class)
  • Test Classes: specified under the "Test Configuration" column as the classes property. Currently, you can specify only a single test class, of which all @Test-annotated methods will be executed as part of this test execution. I'm going to add the ability to select test methods in later releases.
Customizing Input
If your test cases are specified in Excel but in different format, there are 2 levels of customizations you can do with ExcelTestNGRunner on how to parse the input spreadsheet(s):
  1. Custom Parser Map (currently not implemented): You can use the in-built parser but specify your own parser map, which tells the parser where it can find the suite and test data
  2. Custom Parser: You can create your own parser by implementing IExcelFileParser interface. You need to parse the spreadsheet file and return a list of TestNG XmlSuites.
ExcelTestNGRunner also provides helper methods to customize the TestNG object it uses to execute tests. For example, you can specify custom listeners using the addTestNGListener() method. If you need more control, you can create your own TestNG object and pass it to ExcelTestNGRunner. Please see the javadocs for more details.
Putting it together
You can see the project at https://github.com/randomsync/excel-testng-demo for a complete working demo of using excel-testng to parse input test specifications in Excel. It uses Selenium WebDriver to automate testing of basic Google search functionality.

Update (3/12/2012): I'm now using Google Code to host this project because of its provision for easily hosting the downloadable jars, javadocs, and wiki. I'll keep it synced with GitHub, but you can find the project documentation and downloads there.