Today we held the review meeting for the current sprint. In this sprint we mainly implemented two parts: active supervision and non-critical process recovery.
In cooperation with the starter, these functions are provided to the whole system. Any process can register for active supervision, either for itself or for another process, and the registered process is then supervised. Two types of supervision are offered: polled supervision and checkpoint supervision. In polled supervision, the supervisor takes the active role: it sends a supervision message to the process and expects an acknowledgement message in return, which indicates that the process is alive and operating well. Checkpoint supervision is like a boss telling you "report your status to me once every period": the process sends a notification message to the supervisor at a fixed interval.
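To make the difference between the two types concrete, here is a minimal sketch in Python. It is not our real implementation: the class names, the `send_ping` callable, and the injectable `now` clock are all hypothetical and exist only for illustration. I also assume that a checkpoint counts as missed once more than one interval has passed since the last report.

```python
import time


class PolledSupervision:
    """Supervisor is active: it pings the process and expects an acknowledgement."""

    def __init__(self, send_ping, ack_timeout=1.0):
        # send_ping is a hypothetical callable that sends the supervision
        # message and returns True if the acknowledgement arrived in time.
        self.send_ping = send_ping
        self.ack_timeout = ack_timeout

    def check(self):
        """Return True if the process acknowledged the supervision message."""
        return bool(self.send_ping(self.ack_timeout))


class CheckpointSupervision:
    """Process is active: it reports periodically, the supervisor only watches the clock."""

    def __init__(self, interval, now=time.monotonic):
        self.interval = interval      # expected reporting period in seconds
        self.now = now                # injectable clock (assumption, eases testing)
        self.last_report = now()

    def report(self):
        """Called when the notification message from the process arrives."""
        self.last_report = self.now()

    def check(self):
        """Return True if the last report is no older than one interval."""
        return self.now() - self.last_report <= self.interval
```

In both cases a failed `check()` is what would be reported to recovery as a process failure. The only real difference is who takes the initiative: the supervisor in the polled case, the supervised process in the checkpoint case.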
The other important part of this sprint is non-critical process recovery, which tries to recover a failed process, in other words, to restart it. When active supervision, passive supervision, or the starter reports a process failure, recovery is notified. Recovery then tries to take action to revive the process.
Several requirements, limitations, and restrictions are defined to describe these two features. Some of them are interesting.
One requirement is that if the recovery process itself malfunctions, the unit should be restarted. The recovery process is the only one that tries to recover the unit (the system) from malfunctions, so if it has died or cannot work well, nothing can be recovered any more, and the unit may not work normally. The solution we selected is to restart the unit; after a fresh restart, the unit should work properly.
Another restriction is that the unit is restarted if more than 10 processes are recovered within 10 minutes. In that case the unit cannot provide its intended functionality, since so many processes have malfunctioned and are under recovery.
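The "more than 10 in 10 minutes" rule can be sketched as a sliding-window counter. Again, this is only an illustration under my own assumptions: the `RecoveryLimiter` name and its parameters are hypothetical, every recovery event is counted (even if the same process is recovered twice), and a recovery that is exactly 10 minutes old is treated as outside the window.

```python
import time
from collections import deque


class RecoveryLimiter:
    """Tracks recent recoveries and tells when the whole unit should restart."""

    def __init__(self, max_recoveries=10, window_seconds=600.0, now=time.monotonic):
        self.max_recoveries = max_recoveries
        self.window_seconds = window_seconds
        self.now = now                 # injectable clock (assumption, eases testing)
        self.timestamps = deque()      # times of recoveries still inside the window

    def record_recovery(self):
        """Record one recovery; return True if the unit should be restarted."""
        t = self.now()
        self.timestamps.append(t)
        # Forget recoveries that have fallen out of the 10-minute window.
        while self.timestamps and t - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        # "More than 10" means the 11th recovery inside the window triggers it.
        return len(self.timestamps) > self.max_recoveries
```

Recovery would call `record_recovery()` each time it restarts a process and ask for a unit restart as soon as the call returns `True`.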
These two features are considered important and have a higher priority in the implementation schedule. You can think of them as core features of this system.
I’d like to suggest comparing this with our society. A country, as a society, has a system, and it definitely needs a supervisor. A powerful system should not run without supervision: if its powers get out of control, trouble and chaos follow. So some person or department must be appointed as supervisor or observer, with the task of checking the social system at a reasonable interval. Of course, this interval should be tuned to a feasible value: large enough to avoid wasting society’s resources, small enough to catch all possible faults.
Recovery functionality is also needed to save the failed parts of the social system. Sometimes this means preparing candidates in advance, to make sure that at least one alternative is working. I think that is why the concept of redundancy is so important in the telecommunication industry: we cannot afford to lose core functionality. Recovery does its best to keep the whole system working normally and provides resistance to small malfunctions.
I now understand why supervision and recovery are important to a mature system.