Failship Mechanism in RTOS

HANUMAGOUDA

Hi All,
Can anyone explain, or send me notes on, the failship mechanism in an RTOS?
Please help ASAP.
Thanking you in advance.

Regards,
Hanumagouda Patil
 

Do you mean failsafe?
Not sure there is one single failsafe mechanism. I imagine there are many methods to achieve reliability in a design (if that is what you mean),
but failsafe in what respect?
 

Hi,
Sorry for the spelling mistake; I meant fail-safe mechanism. I want to know what a fail-safe mechanism is, in software and in hardware, with respect to an RTOS.
Please reply ASAP.

Regards,
Hanumagouda Patil
 

There are different ways, but they are not specific to an RTOS; they are common engineering practices. An example is using a watchdog timer to restart the system rather than leave it stuck in a hung state.

In hardware it is common to have redundant elements so that a single failure can be accommodated. Some industries (space, comms, possibly also aircraft and medical equipment) will have more than one device, perhaps with a "heartbeat" so the devices can check on each other. In space systems there is even more redundant hardware, to accommodate more than one failure (check NASA documents).

Reliability can be improved through testing, a good specification, reduced software complexity, and avoidance of time-dependent (i.e. race-condition) errors. Note that an RTOS feature set may sometimes lend itself to race conditions through developer mistakes, unless it offers good process-synchronization primitives for developers to use (some OSs offer only the bare minimum). Higher-layer code or "middleware" is often used to provide a framework with reliable methods to run processes and share data.

Maybe you can find a case study on a medical system or a NASA-developed system; I'm sure it would be easy to find, and it may help you further.
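As a rough illustration of the watchdog idea, here is a minimal sketch in C. The hardware hooks (wdt_enable, wdt_kick) and the self-check functions are hypothetical placeholders; a real MCU or RTOS would provide its own watchdog driver. The key point is that the watchdog is only refreshed after the application proves it is still healthy, so a hang or a failed check leads to a reset instead of a stuck system.

Code:
#include <stdbool.h>

/* Hypothetical hardware/driver hooks -- replace with your MCU's watchdog API. */
void wdt_enable(unsigned timeout_ms);   /* start the hardware watchdog            */
void wdt_kick(void);                    /* refresh it before the timeout expires  */

/* Hypothetical application self-checks. */
bool sensors_responding(void);
bool control_task_alive(void);

void control_loop(void)
{
    wdt_enable(500);                    /* reset the MCU if not kicked within 500 ms */

    for (;;) {
        /* ... normal real-time work here ... */

        /* Refresh the watchdog only when the system looks healthy; if the
         * loop hangs or a check fails, the watchdog resets the system
         * rather than leaving it stuck. */
        if (sensors_responding() && control_task_alive())
            wdt_kick();
    }
}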

EDIT: I seem to remember some medical scanner or other device that had an occasional fault and gave the patient a lethal dose :( I'm sure there are dozens of interesting case studies on reliability.
EDIT2: Also, there are RTOS process scheduling schemes that have their own quirks/flaws, such as priority inversion, where a higher-priority process cannot run because it is waiting for data from a low-priority process that itself cannot run because the OS has pre-empted it (typically in favour of some medium-priority work). All sorts of stuff in the history of OSs... These could be considered to impact reliability too, I suppose.
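The usual guard against that priority-inversion scenario is a mutex with priority inheritance (or a priority ceiling). Most RTOSes offer something like this; as a concrete illustration, here is a sketch of the attribute setup using the POSIX threads API (many RTOSes expose a POSIX layer). Only the lock initialisation is shown, not a complete program.

Code:
#include <pthread.h>

pthread_mutex_t shared_lock;

int init_shared_lock(void)
{
    pthread_mutexattr_t attr;
    int rc;

    pthread_mutexattr_init(&attr);
    /* When a high-priority task blocks on this mutex, the low-priority
     * holder temporarily inherits the higher priority, so medium-priority
     * tasks cannot starve it (the classic priority-inversion scenario). */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);

    rc = pthread_mutex_init(&shared_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}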
 
There are some good general principles; the complexity depends on how far you want to go. Here is what I can think of off the top of my head:

* Failure Detection - This is fundamental. You have to be able to detect a failure in order to respond to it. Even the most basic failsafes have *some* form of feedback present. An important thing to keep in mind is that you can't necessarily rely on the system itself to report a failure, as it may not be able to do so in a failed state (e.g. a watchdog system monitoring the critical system for failures is a good approach; see the heartbeat sketch after this list).
* Transparency - Not always possible but of course in an ideal world the end user wouldn't even realize a failure had occurred (e.g. a seamless transition from mains power to a backup generator).
* Notification - Notification and logging are very valuable; if a failure does occur you'd ideally want to be able to find and fix the cause, no matter how effective the failsafe.
* Redundancy - An obviously important principle. Redundancy in critical / failure-prone components provides a backup when one fails. For example a RAID 6 array of hard drives, or a second PC running the same software that kicks in if the primary controller fails. Off-site data backup is another example.
* Automation - Failures can happen at any time, so failsafes must kick in without active intervention, e.g. automatic fire sprinklers.
* Isolation - A failsafe that is isolated from the system it is protecting is less likely to be affected by the cause of the system failure. E.g. watchdog software on a PC would not help in, say, a power failure. Automatic fire sprinklers are a good example here, with a plumbing system that is separate from the main plumbing and designed to withstand a fire (also, water reserves in buildings are a good example of redundancy in a fire protection system).
* Speed (related to transparency) - In an ideal world there is no downtime between system failure and recovery.
* Recovery / Effectiveness - The failsafe, obviously, needs to work. At minimum it should restore critical system functions or prevent a disaster. Ideally it should restore the system to a fully functional state. Depending on the application, the system may need to be designed such that it can process requests missed during a failure (e.g. external queuing systems varying in complexity).
* Persistence (Data and State) - Comes into play especially with software. In some applications, critical data and state should ideally not be lost during a failure, e.g. a server that stores state on a hard drive. A post office is an OK example of a system with persistence built in: if, say, the employees go on strike, the mail is still waiting in trucks, storage rooms, and drop boxes to be sent; it is not lost. Or your web browser stores info about open tabs on the hard drive, deleting it on a proper shutdown, so in the event of a crash and improper shutdown, the next time you start the browser your tabs are restored.
* Robustness - Mostly applies to the system being protected. If e.g. data corruption is the result of (or cause of) a failure, the system itself should not get stuck on broken information after recovery (e.g. pulling the plug on your machine during a BIOS firmware flash is a good example of a problem in this area - a redundant backup BIOS chip is a possible system level failsafe option).
* Mitigation / Damage Control - If a system cannot be restored to functionality, a failsafe could take actions to prevent other dependent systems from also failing, or prevent a more serious disaster from occurring.
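Here is the heartbeat sketch promised under Failure Detection: a rough illustration in C of one task proving it is alive so that another (ideally on separate hardware) can detect when it isn't. The rtos_sleep_ms() and failover_to_backup() calls are hypothetical stand-ins for whatever your RTOS and system design actually provide.

Code:
#include <stdint.h>

/* Hypothetical RTOS / system hooks. */
void rtos_sleep_ms(unsigned ms);
void failover_to_backup(void);       /* reset, switch to a redundant unit, raise an alarm, ... */

static volatile uint32_t heartbeat;  /* written by the worker, read by the monitor */

void worker_task(void)
{
    for (;;) {
        /* ... do the critical work ... */
        heartbeat++;                 /* "I'm still alive" */
        rtos_sleep_ms(10);
    }
}

void monitor_task(void)
{
    uint32_t last = heartbeat;

    for (;;) {
        rtos_sleep_ms(100);          /* several worker periods */
        if (heartbeat == last)       /* no progress since the last check */
            failover_to_backup();
        last = heartbeat;
    }
}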

These all intertwine. There are other important principles too.

In software, many times the software itself can be designed with the failsafe in mind, or in ways that make recovery easier. For example, an SMTP server constructed in two parts: one that receives messages to send and e.g. drops them onto the hard drive, and another that reads the files and sends the messages, deleting the files afterwards (maybe not the best design for a speedy mail server, but this is just an example). If there is a network outage, or the power fails, or the server crashes and some watchdog process restarts it, then when the system comes back online it can resume where it left off and data loss is hopefully minimal. Think queues, persistence, state machines, etc.
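A toy version of that two-part design might look like the sketch below (plain C with POSIX directory calls; deliver_message() is a hypothetical stand-in for the actual SMTP delivery, and the spool directory name is made up). Messages are persisted as soon as they are accepted and only deleted after successful delivery, so a crash or power loss between the two steps loses nothing.

Code:
#include <dirent.h>
#include <stdio.h>

#define SPOOL_DIR "spool"

/* Hypothetical: actually deliver the message stored at 'path'; return 0 on success. */
int deliver_message(const char *path);

/* Part 1: accept a message and persist it before acknowledging it. */
int accept_message(const char *id, const char *body)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof path, SPOOL_DIR "/%s.msg", id);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fputs(body, f);
    fclose(f);              /* a real server would also fsync here */
    return 0;
}

/* Part 2: scan the spool, send, and delete only after success.
 * Safe to restart at any time: unsent files are simply picked up again. */
void drain_spool(void)
{
    DIR *d = opendir(SPOOL_DIR);
    struct dirent *e;

    if (!d)
        return;
    while ((e = readdir(d)) != NULL) {
        char path[256];

        if (e->d_name[0] == '.')
            continue;
        snprintf(path, sizeof path, SPOOL_DIR "/%s", e->d_name);
        if (deliver_message(path) == 0)
            remove(path);   /* delete only after confirmed delivery */
    }
    closedir(d);
}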

It really depends on your requirements and what you are trying to protect.

It also helps to assess and specify likely modes of failure, so that you can test and document your failsafe. However, be careful: things most certainly fail in ways you didn't think of. So it also helps to approach things not just from a failure-mode point of view but also from the critical-functionality end. E.g. in the medical-device example that gave the patient a lethal dose on failure, the most important thing in the end is that the device doesn't kill the patient; so instead of concentrating only on "if the power fails there is a backup", also concentrate on the end effect, "if the dosage given exceeds a limit for any reason, cut the line". That will help make your system more robust.
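That "cut the line" idea can be as small as an independent limit check. Here is a minimal sketch in C; read_delivered_dose(), cut_supply() and the limit value are all hypothetical placeholders for whatever sensor, actuator and safety limit your application actually has.

Code:
/* Hypothetical hardware hooks for an independent dose-limit interlock. */
double read_delivered_dose(void);    /* measured by a sensor independent of the main controller */
void   cut_supply(void);             /* hardware action that stops delivery                      */

#define DOSE_LIMIT 10.0              /* application-specific safe limit (placeholder value)      */

/* Called periodically, ideally from a simple, separately verified task
 * (or separate hardware) that does not depend on the main control logic. */
void dose_interlock(void)
{
    if (read_delivered_dose() > DOSE_LIMIT)
        cut_supply();                /* cut the line regardless of why the limit was exceeded */
}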

It is a whole field in itself and as much as I want to keep typing here, I do have to get back to work. :lol:

As sky_123 points out with examples, NASA, the military, aviation, medicine, and also nuclear power are good fields to find case studies in, as they tend to have super-critical systems and take failsafes to the extreme. I'll leave you with an interesting point quoted from https://www.airliners.net/aviation-forums/tech_ops/read.main/255972/:

PITIngres in Reply #2 said:
The Shuttle flight control computers are like this. Four of the five computers run the same software, one runs a different program written independently.

Software "glitches" might be deterministic logic errors, non-deterministic timing errors, or requirements (specifications) errors. There are lots of ways of classifying this stuff, I'm just taking the big picture. A fault due to a simple logic error or a spec error will most likely recur when you drop to an identical backup. A timing error or multi-thread interaction error may not.

Hope that helps!


P.S. I kind of like the concept of a "failship" :lol:
 
