Fostering Credibility in Customer Tests.
We had a guest in a recent iteration planning meeting. He is our main customer representative's boss's boss. His attendance is most welcome, and sadly rare.
At one point during the meeting, I described an Acceptance Test, talking the whole group through it and then executing it. We got green. All seemed right with the world. Until I said something like "... proving that this functionality is working as specified," to which our guest replied, "if you say so".
"If you say so".
This broke my heart. Think about that. These "Acceptance Tests" that are so core to our process, that we rely on so heavily, that we believe in so deeply, in his view have little to do with Acceptance. To him, they are part of our (the developers') toolset and don't serve the business in any other way.
Now, don't get the wrong idea. We have a lot of conversation with our customers, and the primary tool that facilitates these conversations is FitNesse. However, we have failed to get our liaison to sing the praises of FitNesse higher up the chain.
And there's an even deeper problem.
Early on in this project, when we were introducing agile development to this organization for the first time, we had an opportunity (a responsibility, one might argue) to make believers out of them. We told them a story that says "work with us to specify your system, document that work in these Acceptance Tests, and believe in them. They will always tell you the truth about the health of your system."
We used the word Acceptance, but we never really defined what that means. We just took it for granted that they understood that, in order for them to accept the software we delivered, it had to pass these tests. We did tell them the Acceptance Tests would reflect the state of the system as a whole. And that if the entire FitNesse suite turned green, then everything that had been specified was working as specified.
So where did our Acceptance Tests lose their meaning to the customer - lose their credibility?
In an early iteration planning meeting, we showed them one test in which the story said that the background color of a particular GUI widget should turn blue given a certain state, and gray otherwise. We didn't have a good way to demonstrate the Blue/Gray thing in FitNesse, but we wanted 100% automated Acceptance Tests, so we did what anybody might do in this position. We fudged.
We wrote code in a fixture that output the text "Blue" if the system was in that state, "Gray" if not.
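To make the fudge concrete, the fixture had roughly this shape (the names below are hypothetical, and in our case the state came from the domain model rather than from an input column, but the point is the same):
{{{
import fit.ColumnFixture;

// A sketch of the fudged fixture (hypothetical names). The color is computed
// right here in the fixture, so a green cell proves the fixture's mapping
// from state to color text, not the actual background of the GUI widget.
public class WidgetColorFixture extends ColumnFixture {
    public boolean alertState;           // input column: the state the story describes

    public String backgroundColor() {    // expected-value column: "Blue" or "Gray"
        return alertState ? "Blue" : "Gray";
    }
}
}}}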
Anybody smell anything funny here? We did, but we rationalized it. It was a good way to "express what was going on in the system". And THAT, my friends, is where the credibility began to wane.
THE FITNESSE TEST WAS A LIE.
The test did not prove that the system met the spec at all. It proved that the Fixture met an "expression of the story".
The (albeit unintended) deception continues.
In our exploration for balance between unit and integration tests, we decided that, for our purposes, we would set up all of our FitNesse pages so that they could be run pointing either to a real database or to a fake in-memory database. This helped us (the developers) feel confident that the business logic was nicely segregated from the persistence logic, and that the business logic would work regardless of the persistence implementation. I think we tried to explain that to the customers as well, but I don't think they ever really bought the idea that proving the middle tier of the system works amounts to proving the heart of the system works.
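In rough outline (with hypothetical names, and only the in-memory side shown), the arrangement looked something like this:
{{{
import java.util.HashMap;
import java.util.Map;

// Sketch of the seam we relied on: business logic sees only a repository
// interface, and the test configuration decides whether that interface is
// backed by the real database or by an in-memory fake. All names hypothetical.

class Order {
    private final String id;
    Order(String id) { this.id = id; }
    String getId() { return id; }
}

interface OrderRepository {
    void save(Order order);
    Order findById(String id);
}

// The fake the developers liked: fast, isolated, nothing to connect to.
class InMemoryOrderRepository implements OrderRepository {
    private final Map<String, Order> orders = new HashMap<String, Order>();
    public void save(Order order) { orders.put(order.getId(), order); }
    public Order findById(String id) { return orders.get(id); }
}

class RepositoryFactory {
    // In the real project, a setting on a FitNesse configuration page
    // (modeled here as a hypothetical system property) chose between a
    // JDBC-backed implementation and the in-memory fake.
    static OrderRepository create() {
        String mode = System.getProperty("persistence.mode", "in-memory");
        if ("real".equals(mode)) {
            throw new IllegalStateException("JDBC-backed repository omitted from this sketch");
        }
        return new InMemoryOrderRepository();
    }
}
}}}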
And there's a more subtle, almost sinister counterpart to this as well. The Acceptance Tests that we claimed would verify the health of the system MIGHT be running against the database, or MIGHT be running against a fake database. They could find out which (and change it themselves) if they remembered where the configuration page was.
There were times when we accidentally left the public server to which they have access in "in memory" mode. Think about that. How is the customer supposed to have confidence in ... belief in these tests if they don't even know whether they're running against the database (read: really testing what they view as the system) or not? And with that, the credibility of the tests as the last word on system health diminished even further.
There are other examples, but these two show the symptoms of a subtle, yet rich problem. We technical folk have a responsibility to help our customers see the value in the tests, to feel ownership of them. There are a lot of factors that play into this, but credibility of the tests, fostered by simplicity, consistency and, above all, honesty (from the TESTS), is among the most important, if not the single most important.
There's a solution to one of the issues you raise: have whatever sets the environment (fake or real database, etc) put a note in the summary mapping, and then use the Summary fixture.
A somewhat better solution would be to put that kind of configuration information up front: make the first table a reporter of configuration information.
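For concreteness, such a reporter could be backed by something as small as this (hypothetical names; it assumes the same switch the setup code reads):
{{{
import fit.ColumnFixture;

// Sketch of a configuration-reporting fixture (hypothetical names). A table
// backed by this sits at the top of every page; if the expected cell says
// "real database" and the suite was left in in-memory mode, the first thing
// anyone sees on the page is red.
public class TestConfigurationFixture extends ColumnFixture {
    public String persistenceMode() {
        // Reads whatever switch the setup code uses to pick the persistence layer.
        String mode = System.getProperty("persistence.mode", "in-memory");
        return "real".equals(mode) ? "real database" : "in-memory fake";
    }
}
}}}
The page's first table would then name the fixture, carry a single persistenceMode() column, and expect "real database" in its one data cell.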
It does, however, point out another issue: to change the test configuration you've got to change something in the test itself (if only an included page). This leads, as you've observed above, to the possibility of forgetting to change it back. This type of "oops" is something that the Lean people would say is ripe for mistake-proofing.
John Roth
John - I would like the reporter/summary ideas in a scenario in which the customer wanted to be able to use either a fake database or a real database. In this case, the fake database was there to make developers happy. The customer never asked for it, and I don't think they ever really bought into it. So for the developers, that facility (the ability to run against a fake database) should be there, but I think it should be in another hierarchy, on another server, etc. - somewhere completely out of the customer's field of vision.
One of the motivations of having Avignon move in the direction of real Internet Explorer interaction in the acceptance tests was precisely this issue of buy-in. Many FitNesse tests I have seen are developed by developers for the purpose of separating business logic from presentation logic. While that separation is good, the resulting test is no longer a customer test - it is a developer test.
There is something to be said for seeing a real browser window pop up with links being clicked (with corresponding clicking sounds) and real visual display of content, even if the tests are automated and the screens are only briefly seen. Nothing gives a warm fuzzy quite like engaging the senses.
Joeseph said "the resulting test is no longer a customer test - it is a developer test."
While I appreciate that this might be true for some customers, FIT excels at business logic specifications that result from conversations with the customer. In my experience, this is very satisfying to customers, is more succinct than more workflow-oriented tests, and is easier to maintain (because business rules, though they do change, change less often than the UI).
The issues I'm describing above are more to do with whether the FIT tests are actually doing what they appear to be doing.
I'm thinking that the automated builds could spit out each of these specific configurations. Only the one using the real database is your "release candidate". The others are for developers.
The rule I'm suggesting is: No code *or configuration* gets changed between testing and release. If you change it, then you've got to test it again.
I can also see some appeal in actually driving everything through IE. I would suggest that if the tech for that is ready - and you're sure your users are 100% IE, which they often are - there should be a way to use FIT data to drive it. Although if it's not already using FIT-style data, then that may be... well... nontrivial.
If I wanted to only show the customer "real" tests, I'd do something with TestRunner and put the output on a different wiki; not on the FitNesse wiki.
The reasons for using fake databases, etc., are performance and stability. If you can get the performance and stability during the day, great! If not, a display wiki for last night's tests might well be a good way to go. It could also be organized for that specific function: displaying what the system is supposed to do, and how well it's doing it.
John Roth
I wasn't clear about this before, I think, so I wanted to restate my comments more clearly.
It's a good article, and I agree. On the specific point of the database configuration: I'm just distinguishing between the developer version of the acceptance tests and the user version of the acceptance tests. The one is fast, which we need; the other is real, which users need. The difference, as I see it, is pretty much a config difference. But config can break things, so we better show the user the tests of the config we're releasing. Thus, I recommend each being separate builds with separately scheduled tests in the automated build.
I'm not that attached to the IE thing, though I do see a theoretical advantage in "realness".
(What I'm actually doing is trying to find parallels for my "smart client" systems anyway. So if I have a different take or a misunderstanding, that's probably why. I'm exploring how to get FIT-type testing done on a Windows app.)
FitNesse is like a metaphor you use to describe your system. It's not perfect if you really get down into the details, but that's not its purpose. The purpose is to quickly describe something complex. FitNesse is the only tool I've seen that can express a complicated system in terms of executable business scenarios that make sense to customers.
Sure, in the end they have to "drive" the thing for real using its UI, browser or otherwise, but the business complexity often isn't in the UI. It also takes 100 times longer to develop a functional UI that would be useful to test with, and even then you don't have any automated tests.
Chris - I agree that FitNesse pages function as metaphors for the system, but I think they serve more than just that purpose. The fact that you can execute the tests means that they also provide some measure of the health of the system. Whether you position it that way with your customers or not, they still respond to the green bar much the way developers respond to the green bar provided by xUnit. If these executable specs become confusing, or turn out to be downright misleading, there's a problem.
You know that frustration you feel when you realize that an xUnit test in your system isn't really testing what you think it's testing? Something got refactored at one point and the passing test is no longer relevant, but nobody noticed it at the time. Then you change something in the test, expecting it to fail, and it still passes, and you realize that it's not really testing anything at all. For me, this creates a lack of confidence in my system. What else is wrong with it? What other tests are lying to me? I think customers feel the same way about the FitNesse tests.
David - I see your point. A lack of confidence is created each time we find something in the system that is lying to us, be it comments, design documents, or tests. It seems to me that development techniques like using self-describing names for objects and messages, unit testing, and now FitNesse testing are used to reduce the possibility that what we read is a lie. But, like metaphors, they are imperfect; we will always find something wrong with them.
Good article. It is normal and healthy for customers and testers to be suspicious of developers. It is possible to create high-level FitNesse tests that seem divorced from the underlying application. Our testers are used to testing at the socket message and log file level. When they see a pretty green page that says it processed 5000 transactions, they have good reason to doubt this, because they know how hard it is to do such tests. What has helped is to provide an audit trail in the test. Basically, it is a list of all the low-level actions and tests that occurred. It also includes all of the log messages produced by the system. This has had a bunch of benefits: it gives testers confidence that something is actually happening, it lets them diagnose problems more easily ("Oh, the DB was off line, let me fix that"), and it lets them suggest better tests ("Hey, could the test check that the DB is on line?").
Once testers are happy they can suppress the audit trail.
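To sketch the audit-trail idea (all names hypothetical): the low-level actions record what they actually did, a table at the bottom of the page renders the result, and a single switch suppresses the detail once it has served its purpose.
{{{
import java.util.ArrayList;
import java.util.List;

// Hypothetical audit-trail collector. Low-level fixture actions call record()
// with what they actually did (messages sent, log lines seen, rows written);
// a reporting fixture can render asText() at the bottom of the page, and
// suppress() turns the detail off once testers trust the suite.
public class AuditTrail {
    private static final List<String> entries = new ArrayList<String>();
    private static boolean enabled = true;

    public static void record(String entry) {
        if (enabled) {
            entries.add(entry);
        }
    }

    public static void suppress() {
        enabled = false;
        entries.clear();
    }

    public static String asText() {
        StringBuilder text = new StringBuilder();
        for (String entry : entries) {
            text.append(entry).append('\n');
        }
        return text.toString();
    }
}
}}}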
I have no experience with real customers. Sigh.
I think that there is a point to seeing value, but I think that the answer is almost never "more advocacy" or "stronger advocacy". Credibility is seldom improved by advocating more forcefully or more subtly. Ultimately, it has to do with trust, which comes from transparency and experience. If the system really does work in reality as it does in the tests, then the real customer who uses it can tell. Customers who don't really use it will have to trust someone who tells them. I understand that the automated AT doesn't really look like the system is running, and could be faked or wrong. That it isn't is an article of faith.
But yes, it's hard.
A long time ago, I was part of a team that was 'fired up' about using use cases. We brought in our customers and showed them how we could draw the ovals and the little stick men. We showed the customers how we could review the diagrams with them and how confident they could be that we understood the requirements. Well, they couldn't have cared less about use cases. In retrospect, user stories may have been a better fit, maybe not.
The thing I took away from it, though, is that people across organizations see software differently than we do. To a user, the application is the GUI; to a customer, it's often the GUI plus everything they hear from developers and users. To a manager above that, it's often an abstraction: how much the software costs, any lingering problems that impact the organization, and the impact of the software itself (positive or negative). The hard thing is that we often have to communicate with people no matter where they are in that continuum, and it's tough because they all see it differently. Not all of them are going to be excited, and I guess that's okay as long as the people who need to be excited are, and as long as everything works out okay.