I had a fierce discussion with an engineer, which started from reviewing her automated test case. It's written in Robot Framework, and one criterion of a good-quality test case is descriptive naming: for keyword names, test case names, etc.
* Discussion 0 : Background
Well, the case title was not descriptive, and I asked the usual question: "what does this test case test?" The functionality it tests is:
- The Process Under Test (PUT) sends a message to another process when it has restarted and just started up
- The target process should send an acknowledgement message back to the PUT
- If the PUT receives the acknowledgement message, all is fine and nothing further happens <— the functionality to be verified
- Additional: if the PUT doesn't receive it, it resends the message up to 5 times in total; after that, it writes an error log for the record
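The protocol above can be sketched in a few lines (a minimal Python sketch; the socket usage, the `b"ACK"` payload, and the timeout value are my own assumptions, not the product's actual code):

```python
import logging
import socket

MAX_RETRIES = 5        # resend at most 5 times in total, as described above
ACK_TIMEOUT_S = 1.0    # assumed per-attempt wait for the acknowledgement

def send_with_ack(sock, target, payload):
    """Send payload to the target process, then wait for its acknowledgement.

    Returns True as soon as an acknowledgement arrives; after MAX_RETRIES
    failed attempts it writes an error log and returns False.
    """
    sock.settimeout(ACK_TIMEOUT_S)
    for attempt in range(1, MAX_RETRIES + 1):
        sock.sendto(payload, target)
        try:
            data, addr = sock.recvfrom(1024)
            if addr == target and data == b"ACK":   # assumed ack format
                return True
        except socket.timeout:
            pass  # no acknowledgement this round; fall through and resend
    logging.error("no acknowledgement after %d attempts", MAX_RETRIES)
    return False
```

The point of the whole test discussion lies exactly here: "no error log" only means the `logging.error` branch was never reached, which can happen just as well when no message was sent at all.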
But those test steps didn't reflect the purpose: the test just reset the unit (the PUT restarts when the host restarts), then checked that there was no error log. My opinion is that those 2 steps can't prove the functionality actually happened, so you can't say "no error log" is the result of "message – acknowledgement"; if no message was sent out at all, or some other problem occurred, there might be no error log either. You must prove the connection between "message – acknowledgement" and "no error log".
Well, then the discussion (or argument, or even a debate) started, and moved on to software testability, software design, system architecture, organization, etc. Finally, after 1.5 hours, I was too agitated to catch my breath, so we went for lunch.
* Discussion 1 : Testability
I suggested that she could start a monitoring process on the serial port (which connects to the unit), then restart the unit, so the corresponding messages could be captured on the serial port. This was a minor misunderstanding; she soon explained that the PUT sends out the message right after restarting, so if we want to monitor messages on this unit, we would have to do it while the unit is restarting, which is practically impossible.
Naturally I asked "why not monitor the target process?", so we could judge by whether it received the message and sent out the acknowledgement. The answer was:
“the unit the target process resides on was developed at another site, by another team, from another subcontracting company, Aricent.”
“we don’t know how to operate that unit, and there is no documentation. We asked them if we could check their functional test cases; they said they have them in QC, but we could not find them. After asking several times, they finally admitted that they actually haven’t written any test cases.”
What a ****!!
I proposed that they add some debug logs in their code as a workaround, to indicate that the PUT sent the message and received the acknowledgement, but they have to use the production compiled build for testing. Then I suggested they could apply conditional compilation; she thought that would add some additional effort. Still, we reached some level of agreement that the workaround may work; it's just a matter of cost.
* Discussion 2 : Design
I was too curious not to ask the question "what exactly is the functionality you're testing with this case?". She said they already had one test case which restarts the PUT and verifies similar functionality, yet a bug was still reported regarding the PUT during unit restart. It seems an abnormal PUT restart is different from a restart triggered by a unit restart.
Clarifying this question led us toward the proper granularity of the testing point, and from there to clarifying the preconditions, setup, execution steps, expected results, and cleanup. That is the design of testing.
Well, I hadn't understood why they needed to implement this functionality at all; now I got the answer: "for a bug fix". Then I questioned: why this solution for this fix? Where is your requirement documentation?
They are on a legacy product; it's fairly mature now, with not much new feature development, mainly maintenance work. Due to those technology transfer processes, there are no truly experienced developers anymore, and nobody is really clear about the design of this part of the code. The team actually created the requirement themselves, so they chose this design solution.
I was quite opinionated: I don't like introducing new things when they are not really necessary, especially if you can solve the problem within the current context. The mechanism was meant to ensure the target process has received the message from the PUT, so they added the demand for an acknowledgement. In the scenario she described, the target process sends some data to the PUT via a certain message, each message containing a single data entry.
I insisted on reusing that same message, with 0 data entries, as the acknowledgement message. Their reasons (or excuses) were that the message fields were already frozen, and changing the definition would require a lot of changes, since they didn't have a field for "entry count" before.
* Discussion 3 : System Architecture Choice
I could not convince them, nor they me. But I was still not satisfied. Then I moved on to question "why do you need to ensure the target process received the message?", since the messaging mechanism in the existing system is reliable, so there is no need to ensure delivery from the sender's side. The reason is that the target process now lives on a new unit, which is not part of the original system (hardware-wise) and communicates with the processes in the old system via UDP, which is unreliable. Because of this, they need to handle it.
The system has a so-called message-bus component (module, process), and a dedicated component for transport-layer issues. Of course I doubted this design: why can't those components encapsulate the underlying UDP transport as usual, and expose the same messaging interface to the application processes (e.g. the PUT)?
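What I had in mind is roughly the following sketch (all class and method names here are hypothetical, not the real message-bus API): the transport component hides the ack/retry handling, and application processes keep one and the same send interface:

```python
from abc import ABC, abstractmethod

class Transport(ABC):
    """What application processes see: one send interface, delivery ensured."""
    @abstractmethod
    def send(self, dest: str, payload: bytes) -> None: ...

class LocalTransport(Transport):
    """Old in-system path: already reliable, nothing extra to do."""
    def __init__(self, deliver):
        self._deliver = deliver
    def send(self, dest: str, payload: bytes) -> None:
        self._deliver(dest, payload)

class ReliableUdpTransport(Transport):
    """New path to the external unit: ack/retry hidden inside the component."""
    def __init__(self, udp_send, wait_ack, max_retries: int = 5):
        self._udp_send = udp_send
        self._wait_ack = wait_ack          # returns True when an ack arrives
        self._max_retries = max_retries
    def send(self, dest: str, payload: bytes) -> None:
        for _ in range(self._max_retries):
            self._udp_send(dest, payload)
            if self._wait_ack():
                return
        raise IOError(f"no ack from {dest} after {self._max_retries} attempts")
```

The PUT would then just call `transport.send(...)`; whether retries happen underneath is the message bus's business, and the ack/retry logic lives in exactly one place instead of in every application process.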
Another team member joined the discussion and shared the historical discussion on this. They had held the same opinion as me and challenged the requirement, but were told it was a decision: just do it. I'm fine with upper-level decisions if they can convince me, or if the deciders really understand the consequences and accept them. But if not, I'll be really challenging and noisy. They couldn't tell me whether the chief architect had provided a reasonable explanation; they didn't remember. They had also tried to find out whether any other team faced the same problem and had a solution worth referencing, but it seems they were only the second one, and the solutions were pretty much the same.
The answer sounds unacceptable to me. It seems obvious that, compared to solution 1 (solving the UDP problem within a single component), solution 2 (each process solving the problem by itself) will introduce much, much more duplicated code and algorithms. Granted, I don't know the code well enough to tell whether solution 1 would be more costly at run time, since it verifies that each UDP packet reached its target.
* Discussion 4 : Organizational Dysfunction
I have a huge problem with this failure to carry out one's duty. The problem here is that the system architect didn't explain why he/she wanted this specific design solution, and obviously people are not satisfied; it may make our code more complex and, furthermore, make maintenance harder. Explaining this is the duty of the system architect.
I already have a huge problem with the Software Design activity in stage-phased models. How can they prove that the solution is the right one to fulfill the customer requirements? And how can they prove it's the best solution among the candidates? The SDLC doesn't answer this; Waterfall doesn't, nor does CMM/CMMI.
Add the problem of communicating design solutions to implementation engineers, and the dysfunction is reinforced. The architects didn't want to, or didn't know how to, explain the reasons behind the design solution, or they didn't describe it clearly enough; they set the design in stone, then passed it to the implementation team as deliverables. The implementation teams implement it and confirm it by verifying the expected system behaviors. At the same time, they may not understand whether it has side-effects beyond the expected part, or whether it is more than what is needed just to meet the expectation (in other words: over-design).
Agile provides the right answer: the "Simplicity" principle. We encourage emergent design, just-enough design, design as simple as possible, etc. The system-level architecture and design are reflected in the division of classes, in terms of component (class, module) responsibilities. From the top down, we communicate based on requirements, or "what we need", instead of a direct solution to an unknown problem. Let the people who implement the requirement decide the solution.
Together with Test Automation and Continuous Integration, you turn the shared understanding of the design of working code into working tests. Those nonstop quality safeguards continuously give you feedback on the consistency of the current software functionality.
* Thoughts : Design-Infected Organization, or Vice Versa
The thought process used in designing software has a huge impact on how the organization is designed; or, seen the other way around, the software design is largely inherited from the structure of the organization.
If we design software in one way, then it is very natural to organize people in a similar way, because the most convenient and cost-efficient approach is to put people with the most duties in common together.
Once there is an organizational structure, it reinforces the software design in reverse, because people want to do their jobs better, better according to their role responsibilities, and those responsibilities are exactly what was formalized during the initial software design.
She came to me today and updated me with some information regarding the problem.
"We could test it this way: we check and save the information on the unit before the restart; then we restart; after the unit has started up, we check the information again, and it should have recovered to the same state as before the restart."
I agreed with this idea, since the feature / functionality is actually a redundancy mechanism: when the master node restarts, it asks the tributary nodes to send the previous information, so it can recover to its original state. There is no 2N redundancy, because in some cases there is only one unit, with no spares.
Then we elaborated a bit more.
- At the unit-test level, they could cover the message-exchange logic, both receiving and sending.
- At the overall feature level (black box), they could verify that the information is recovered after restart, since the feature itself doesn't necessarily depend on the detailed design implementation.
- As for the unreliable UDP transfer, it's not such a big deal: at both the unit-test and feature levels, they'll verify whether the target process can receive messages, and whether the PUT's information can be recovered. The underlying transfer method is not the feature's business.