=== Failure Mitigation ===
We all know how hard it is to build a robot. You program it, it doesn't work as planned, then you go back to the code and wiring to figure out what went wrong. Maybe a wire came loose, a battery went dead, or a special case in your code caused an infinite loop. Whatever happened, you're able to 'reset' and try again and again until it works. Not so with space robots! A launch is expensive in both time and money. Once you release that balloon, if something goes wrong, tough luck. You can't fix a thing after it's launched - it either works or it doesn't. There is little room for trial and error, so you need to plan for everything and close off every possible route to failure. So as you're building the system, ask yourself these three questions for every single sub-system and component:
1) How likely is it to fail?
2) What steps can I take to reduce the chance of it failing?
3) If it fails, does the entire mission fail?

For #1, you should generally assume the component will fail, especially the more complex stuff. But don't guess! Run the code on the hardware during a simulated flight. Make it as realistic as possible, and run it for longer than you expect the entire flight to last (several hours). In a later section I will go over testing methods.

For #2, be as professional as you can when building your robot. DO NOT use quick-and-dirty methods like duct tape and hot glue; they will fail. The robot needs to be sturdy enough that you can violently shake it with your hands and nothing will wiggle loose (the jet stream is equally violent). Next, implement automated failure detection and mitigation systems. Your microcontroller should have software that detects when a sensor isn't connected, isn't powered, isn't calibrated properly, etc., and then warns you before launch, by blinking LEDs for example, so you can fix it. Label all your wires and firmly secure them. Failing to connect a wire, connecting it in the wrong place, and having a wire come loose are very common modes of failure. Without labels my circuit would be a huge, confusing wiring mess; the labels save me from insanity. Use a label maker for better looking labels.

For #3, this is where backup systems come in. Simply assume that all critical components will fail. Then what? Our flights typically had 2 or 3 of everything, from GPS to cameras to transmitters. Expensive? Perhaps, but it's cheaper to buy two and keep both than to buy one and lose it - in which case you'll need to buy another anyway. The better way to save money is to have a backup system, and possibly two backup systems. And don't do something stupid like using the same battery for both the primary and the backup system - they should be kept completely independent of each other.
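As a rough illustration of the automated pre-launch check described for #2, here is a minimal sketch in Python. Real flight code would more likely be C on the microcontroller; this is only the logic. Every name in it (`SensorCheck`, `preflight`, `blink_code`) and the idea of "sensor N blinks N times" are assumptions made for the example, not something from an actual flight system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class SensorCheck:
    """One pre-launch check; every field here is a hypothetical name."""
    name: str
    read: Callable[[], Optional[float]]  # returns None when nothing answers
    valid_range: Tuple[float, float]     # plausible readings on the ground


def preflight(checks: List[SensorCheck]) -> Dict[str, str]:
    """Return a map of sensor name -> problem; empty means all clear."""
    problems: Dict[str, str] = {}
    for check in checks:
        try:
            value = check.read()
        except Exception as exc:  # a driver error counts as a failure too
            problems[check.name] = f"read error: {exc}"
            continue
        if value is None:
            problems[check.name] = "no response (unplugged or unpowered?)"
            continue
        low, high = check.valid_range
        if not low <= value <= high:
            problems[check.name] = f"reading {value} outside {low}..{high}"
    return problems


def blink_code(problems: Dict[str, str], order: List[str]) -> List[int]:
    """Translate problems into LED blink counts: sensor N blinks N times."""
    return [i + 1 for i, name in enumerate(order) if name in problems]
```

The range check is a crude stand-in for "isn't calibrated properly": a sensor that answers but reports something implausible for a package sitting on the launch table is treated the same as one that doesn't answer at all.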
Testing
failure type: timeout
Another example would be data logging. What happens when the memory runs out? Does it restart at 0 and overwrite the earlier data? Does your software crash? Your expected flight might be just two hours, and memory might take four hours to run out, but what if it takes several hours or days to recover the package? Set your device up before you go to bed, and check it again in the morning to see whether it failed.
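One possible answer to the "what happens when memory runs out" question is to stop logging rather than wrap around, so the flight data survives a long wait on the ground. The sketch below assumes that policy; `BoundedLog` is a hypothetical name, and a real logger would write to a card or flash chip instead of a Python list.

```python
class BoundedLog:
    """Fixed-capacity record store that never overwrites old records."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.records: list = []
        self.dropped = 0  # how many records arrived after the log filled up

    def append(self, record: str) -> bool:
        """Store a record; return False (and count it) once the log is full."""
        if len(self.records) >= self.capacity:
            self.dropped += 1
            return False
        self.records.append(record)
        return True
```

The opposite policy, a ring buffer that keeps only the newest records, is just as easy to write. Whichever you pick, the overnight test is what tells you that the code really behaves that way.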
failure type: unexpected power reset
This is actually very easy to test - simply disconnect your battery for a few seconds and see whether it recovers intelligently. Do it while your MCU is doing important things. Does your memory card get corrupted entirely?
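To make "intelligently recovers" a little more concrete, here is a sketch of one common pattern: save the mission state to a temporary file, then rename it over the old one, so a power cut leaves either the old state or the new one but never half of each. The function names and the JSON format are assumptions for illustration. The rename is atomic on typical POSIX filesystems; whether your memory card's filesystem gives the same guarantee is exactly what the battery-pull test should tell you.

```python
import json
import os


def save_state(path: str, state: dict) -> None:
    """Write state so a power cut leaves either the old or the new file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())      # push the bytes to the storage device
    os.replace(tmp_path, path)    # atomic rename on POSIX filesystems


def load_state(path: str, default: dict) -> dict:
    """Return the last saved state, or the default if missing or unreadable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError):
        return dict(default)
```

On boot, the flight program would call `load_state` first and carry on from whatever phase it finds, instead of starting the mission over from zero.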
failure type: non-responsive component
Again, this is very easy to test. Simply disconnect the device and see what happens. Then write a program that detects the resulting bad data and deals with it appropriately. Do this with every single device and component you have. If you want to be a little fancier, expect that a more complex device like a GPS will once in a blue moon give corrupted data. How would your MCU handle that data? More importantly, is your GPS configured in aviation mode? Many receivers stop reporting a position above a certain altitude unless they are set to an airborne or flight mode, so check your receiver's documentation.
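For the corrupted GPS data case, most hobby receivers output NMEA 0183 sentences, which end in a two-digit hex checksum: the XOR of every character between the `$` and the `*`. A minimal sketch of rejecting damaged sentences follows, assuming your GPS speaks NMEA; `parse_nmea` is a hypothetical helper, not part of any library.

```python
from typing import Optional


def nmea_checksum(payload: str) -> int:
    """XOR of every character between '$' and '*' (the NMEA 0183 rule)."""
    value = 0
    for ch in payload:
        value ^= ord(ch)
    return value


def parse_nmea(sentence: str) -> Optional[list]:
    """Return the comma-separated fields, or None if the sentence is bad."""
    sentence = sentence.strip()
    if not sentence.startswith("$") or "*" not in sentence:
        return None
    payload, _, checksum_text = sentence[1:].rpartition("*")
    try:
        expected = int(checksum_text, 16)
    except ValueError:
        return None
    if len(checksum_text) != 2 or nmea_checksum(payload) != expected:
        return None
    return payload.split(",")
```

A checksum only catches garbled bytes. A fully disconnected GPS sends nothing at all, so the calling code also needs a timeout, and a sentence that passes the checksum can still hold an implausible position.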
failure type: it's too freakin' cold
Your freezer is unlikely to get colder than -20 °C, so if you need to test at even colder temperatures, you'll need to make a test chamber. Buy a large cooler, put your package inside, and then pack it with dry ice (frozen CO2; -78.5 °C, -109.3 °F). If you have the option, and are really concerned about temperatures, keep the batteries warm in your pocket until just before launch. Or perhaps mount your voltage regulator directly on the battery as a sort of mini-heater.
failure type: too much heat
Be careful with too much insulation - we once had a camera fail before we even launched because it overheated with all the insulation we used! The best way to test this would be to place the package outside in the sun on a hot summer day, point solar deflectors at it, and then measure the temperature over several hours. Or put it in a large cooler with a heater inside, and ramp up the temperature to ~120 °F over an hour or two to see what happens. If it gets hot inside your car on a bright summer day, that may work too. Some vacuum chambers also have an adjustable heater you can use - perfect for simulating the heat condition. To be even more realistic, do heating -> cooling -> heating -> cooling -> heating in that order, for 30 min each, as would happen with a real mission.
failure type: poor quality GPS at landing site
Obviously you want to make your package easy to see, using bright orange colors for both the package and the parachute. You may want a bright strobing light beacon for night missions. Some groups have put loud sirens on their balloon, as they don't weigh or cost that much and don't consume too much power. I see two issues with this, however. First, the loud noise drowns out all audio from your video cameras. The second issue is as described in an email I received: "Some have reported that the capsules with audio beacons were destroyed on the ground out of fear/annoyance." You can find easy-to-use alarms at Mallory Sonalert and Adept Rocketry. I'd recommend controlling the alarm with a microcontroller, only activating it at low altitudes. It should also have an 'alarm off' button on the outside of the package. You can also wire up a piezo buzzer to be pulsed by a microcontroller. Use an amplifying microphone so you can hear the alarm from farther away, and binoculars to confirm visually.
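The "only at low altitudes, with an off button" rule for the siren can be sketched as a single decision function. The 1000 m cutoff, the parameter names, and the choice to sound the alarm when altitude data is missing are all assumptions for illustration; pick values that suit your own flight.

```python
def beacon_should_sound(altitude_m, descending, muted, ceiling_m=1000.0):
    """Decide whether the recovery siren should be on right now.

    altitude_m: latest altitude estimate, or None if the sensor gave nothing.
    descending: True once descent has started (stays True after landing).
    muted: True after someone presses the external 'alarm off' button.
    ceiling_m: hypothetical cutoff; the alarm stays silent above it.
    """
    if muted:
        return False
    if not descending:
        return False  # silent on the launch table and during ascent
    if altitude_m is None:
        return True   # no altitude data: fail towards being findable
    return altitude_m <= ceiling_m
```

Keeping the siren off during ascent and at high altitude also preserves most of the camera audio, which addresses the first of the two issues above.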
The KISS Philosophy
Failure Stories
Story #1: "The second incident of the launch occurred when the Asimov II passed 14,000 feet [from a planned 90,000 feet]. Its onboard GPS hiccupped. This hiccup fooled the onboard electronics, placing the near spacecraft into descent mode. Now that it was in descent mode, the near spacecraft stopped performing experiments. The near spacecraft continued to transmit telemetry, but only engineering data and no science data."

The lesson of this story is clear. There was a single point of failure that shut down the entire system. When you program and wire up your device, take time to consider what would happen if any sensor or device failed. Would your system gracefully recover, or go into full melt-down? Thinking 'oh, that component will never fail' is asking for trouble . . . instead, have a mindset of 'freak accident hypothetical'.

Story #2: This is a story I was told by my group when I first joined them. It was, I believe, their first ever launch. Long story short, the transmission antenna wasn't firmly secured and it literally snapped off during flight. Without GPS coordinates, they were not able to recover the package. About a month later a farmer found this strange, suspicious looking capsule in his field. There was a label on it that said 'call this number if found', but he wasn't able to reach anyone. So he put it on his desk for another two months. Then one day he saw a space program on the Discovery Channel, which I assume was about high altitude balloons, and it inspired him to call a friend who was an engineer. The engineer came over, saw the phone number, called it and managed to reach the team. Turns out the farmer has difficulty using a telephone . . . Anyway, the software was not written to account for a brown-out situation as the batteries died, and as it continually reset it overwrote all the data, thereby corrupting it.

If you have a story of your own, please email us to have it added.
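One way to guard against the kind of single GPS hiccup described in Story #1 is to require several falling readings in a row before switching to descent mode. This is only a sketch of that idea, not a reconstruction of what the Asimov II software did; `DescentDetector`, the default of five readings, and the 10 m drop threshold are all hypothetical.

```python
class DescentDetector:
    """Declare descent only after several consecutive falling readings."""

    def __init__(self, required=5, min_drop_m=10.0):
        self.required = required      # hypothetical: readings in a row
        self.min_drop_m = min_drop_m  # hypothetical: drop that counts as falling
        self.last_altitude = None
        self.streak = 0
        self.descending = False

    def update(self, altitude_m):
        """Feed one altitude reading (or None); return the current mode flag."""
        if altitude_m is None:
            return self.descending    # missing data never changes the mode
        if self.last_altitude is not None:
            if self.last_altitude - altitude_m >= self.min_drop_m:
                self.streak += 1
            else:
                self.streak = 0       # any non-falling reading resets the count
        self.last_altitude = altitude_m
        if self.streak >= self.required:
            self.descending = True
        return self.descending
```

With this in place, one absurd reading followed by a normal one resets the count, so the mode does not change. A second, independent altitude source such as a pressure sensor would be a stronger defense still, in keeping with the backup-systems advice above.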
=== Ultimate Failure Testing Checklist ===
1) drive it around for hours test
2) let it sit there until batteries die out test
3) unplug stuff and see what fails test
4) introduce bad sensor data test
5) shake/drop test
6) freezer test, vacuum test
7) plot sensor data into meaningful information
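Items 3 and 4 on the checklist can also be rehearsed in software before you start pulling real wires. The sketch below wraps any sensor-reading function so that it occasionally returns nothing or returns nonsense. The name `with_faults`, the probabilities, and the particular garbage values are assumptions chosen for illustration.

```python
import random


def with_faults(read, dropout=0.1, garbage=0.1, rng=None):
    """Wrap a sensor read function so it sometimes fails on purpose.

    dropout: chance of returning None, as if the sensor were unplugged.
    garbage: chance of returning an absurd value, as if data were corrupted.
    """
    rng = rng or random.Random()

    def faulty_read():
        roll = rng.random()
        if roll < dropout:
            return None
        if roll < dropout + garbage:
            return rng.choice([-1e9, 1e9, float("nan")])
        return read()

    return faulty_read
```

Run your main loop against the wrapped sensors for a few hours and watch whether it ever crashes, hangs, or changes flight mode when it shouldn't. This complements the physical unplug test; it doesn't replace it.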
Society of Robots copyright 2005-2014