Society of Robots

    High Altitude Balloon Tutorial
    === Failure Mitigation ===


    "Anything that can go wrong, will go wrong." - Murphy's Law

    Failure Mitigation
    Failure mitigation should be your #1 thought when planning a high altitude balloon mission.

    We all know how hard it is to build a robot. You program it, it doesn't work as planned, then you go back to the code and wiring to figure out what went wrong. Maybe a wire came loose, a battery went dead, or a special case in your code caused an infinite loop. Whatever happened, you're able to 'reset' and try again and again until it works.

    Not so with space robots!

    A launch is expensive in both time and money. Once you release that balloon, if something goes wrong, tough luck. You can't fix a thing after it's launched - it either works or it doesn't. There is little room for trial and error, so you need to plan for everything and close off every possible route to failure.

    As such, as you're building the system, for every single sub-system/component you should ask yourself these three questions:

      1) What's the likelihood of this component failing?
      2) What steps can I take to reduce the chance of it failing?
      3) If it fails, does the entire mission fail?

    For #1, you should generally assume the component will fail. Especially the more complex stuff. But don't guess! Run the code on the hardware during a simulated flight. Make it as realistic as possible, and for longer than you expect the entire flight to run (several hours). In a later section I will go over testing methods.

    For #2, be as professional as you can when building your robot. DO NOT use makeshift methods like duct tape and hot glue; they will fail. It needs to be sturdy enough that you can violently shake it with your hands and nothing will wiggle loose (the jet stream is equally violent). Next, implement automated failure detection and mitigation systems. Your microcontroller should have software to detect when a sensor isn't connected, isn't powered, isn't calibrated properly, etc., and then send you a warning before launch by blinking LEDs and such so you can fix it.
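
    A minimal sketch of such a pre-launch self-check, written in Python for readability (flight firmware would more likely be C). The sensor names and valid ranges are hypothetical placeholders, not values from our flights.

```python
# Hypothetical plausible ranges for each sensor; adjust to your own hardware.
VALID_RANGES = {
    "temperature_c": (-90.0, 60.0),
    "humidity_pct": (0.0, 100.0),
    "battery_v": (6.0, 9.0),
}

def preflight_check(readings):
    """Return a list of problems; an empty list means nothing was detected.

    `readings` maps sensor name -> latest value, or None if the sensor
    did not answer at all (unplugged, unpowered).
    """
    problems = []
    for name, (low, high) in VALID_RANGES.items():
        value = readings.get(name)
        if value is None:
            problems.append(f"{name}: no response")
        elif not (low <= value <= high):
            problems.append(f"{name}: {value} outside {low}..{high}")
    return problems

ok = preflight_check({"temperature_c": 21.5, "humidity_pct": 40.0, "battery_v": 8.1})
bad = preflight_check({"temperature_c": 21.5, "humidity_pct": None, "battery_v": 3.2})
```

    On real hardware, a non-empty problem list is what would drive the warning LED before launch.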

    Label all your wires and firmly secure them. Failing to connect a wire, connecting it in the wrong place, and a wire coming loose are very common modes of failure. The image below shows the huge, confusing wiring mess my circuit would be without the labels saving me from insanity. Use a label maker for better-looking labels.

    (image: labels prevent mistakes)

    For #3, this is where backup systems come in. Simply assume that all critical components will fail. Then what? Our flights typically had 2 or 3 of everything, from GPS to cameras to transmitters. Expensive? Perhaps, but it's cheaper to buy two and keep both than to buy one and lose it - in which case you'll need to buy another anyway. The better way to save money is to have a backup system, and possibly two backup systems. And don't do something stupid like use the same battery for both the primary and the backup system - they should be kept completely independent of each other.
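
    To see why independent backups matter, here is a rough back-of-the-envelope sketch in Python. It assumes each copy fails independently with the same probability; the 10% figure is purely illustrative, not a measured failure rate.

```python
def all_copies_fail_probability(p_fail, copies):
    """Chance that every copy of a component fails, assuming independent failures."""
    return p_fail ** copies

# Illustrative only: a component with a 10% chance of failing during the flight.
single = all_copies_fail_probability(0.10, 1)   # one GPS
double = all_copies_fail_probability(0.10, 2)   # primary plus one backup
triple = all_copies_fail_probability(0.10, 3)   # primary plus two backups
```

    The multiplication only holds when the copies share nothing. Two systems on one battery fail together, so the shared battery puts you back at the single-copy odds.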

    A typical home-made robot can be field tested using an iterative process of test, reprogram, upload new software, test, etc. Not so with high altitude balloons. Instead, you need to artificially recreate the conditions your equipment will face as realistically as possible, and see how it reacts.

    failure type: timeout

      One type of failure comes from the simple passage of time. If you run your code for 5 minutes and don't see a problem, you might be tempted to say it works fine. But what if there is a bug that shows up about once an hour? Most homemade robots don't run for more than 15 to 20 minutes before you get bored or the batteries die, so the bug may never show. Maybe something strange occasionally happens, then you restart and the problem magically goes away, so you don't investigate further. Given that a space balloon mission is typically ~4 hours long, what is the probability of a once-in-an-hour bug occurring?
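
      One way to put a number on that question is to assume the bug strikes at random at an average rate of once per hour (a Poisson process, which is a modeling assumption, not something measured). A small Python sketch:

```python
import math

def prob_at_least_once(rate_per_hour, hours):
    """Probability of one or more occurrences of a randomly timed event."""
    return 1.0 - math.exp(-rate_per_hour * hours)

bench_test = prob_at_least_once(1.0, 5 / 60)   # a 5 minute bench test
flight = prob_at_least_once(1.0, 4.0)          # a ~4 hour mission
```

      Under that assumption the 5 minute test catches the bug only about 8% of the time, while the 4 hour flight hits it with about 98% probability.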

      Another example would be data logging. What happens when the memory runs out? Does it restart at 0 and overwrite the earlier data? Does your software crash? Your expected flight might just be two hours, and it might take 4 hours before memory will run out, but what if it takes several hours/days to recover the package? Set your device up before you go to bed, and check it again in the morning to see if it failed or not.
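
      One possible policy for a full memory, sketched in Python: stop accepting new records and count what was dropped, so the flight data written first is never overwritten. The class name and capacity are hypothetical.

```python
class BoundedLog:
    """Fixed-capacity log that refuses new records when full instead of wrapping."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []
        self.dropped = 0  # records rejected after the log filled up

    def append(self, record):
        """Store the record and return True, or return False if the log is full."""
        if len(self.records) >= self.capacity:
            self.dropped += 1
            return False
        self.records.append(record)
        return True

log = BoundedLog(capacity=3)
results = [log.append(i) for i in range(5)]
```

      Keeping the oldest data is a design choice: if recovery takes days, the hours on the ground are worth far less than the flight itself.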

    failure type: unexpected power reset

      Perhaps it got really cold and the batteries temporarily failed. Perhaps the jet stream thrashed it so much that a power wire jostled loose temporarily. Perhaps something overheated and temporarily shut down. Or perhaps it took you a week to find the device - long after batteries have died. How would your device respond to that?

      This is actually very easy to test - simply disconnect your battery for a few seconds and see if it intelligently recovers or not. Do it while your mcu is doing important things. Does your memory card get corrupted entirely?
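
      A sketch of one way to make logging survive a reset, in Python: start a new file on every boot so earlier data is never overwritten, and force each record to storage as it is written. The file naming scheme is an assumption, and a temporary directory stands in for the memory card.

```python
import os
import tempfile

def next_log_path(directory):
    """Pick a log file name that does not exist yet, e.g. log_0003.csv."""
    n = 0
    while os.path.exists(os.path.join(directory, f"log_{n:04d}.csv")):
        n += 1
    return os.path.join(directory, f"log_{n:04d}.csv")

def write_record(path, line):
    """Append one line and push it to storage before returning."""
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())

with tempfile.TemporaryDirectory() as card:  # stands in for the memory card
    first_boot = next_log_path(card)
    write_record(first_boot, "t=0,alt=120")
    second_boot = next_log_path(card)        # after a simulated power reset
    write_record(second_boot, "t=0,alt=15000")
    with open(first_boot) as f:
        first_boot_data = f.read()
```

      Whether a given memory card and filesystem survive power loss mid-write still has to be checked on the real hardware with the pull-the-battery test.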

    failure type: non-responsive component

      Suppose your mcu is connected to a humidity sensor. And let's say that sensor breaks. Or that you forgot to plug it in. Will your mcu get caught in an infinite loop? Will it happily log obviously bad data? Or will it 'know' that something is wrong, and warn you before you launch the balloon?

      Again, this is very easy to test. Simply disconnect the device and see what happens. Then write a program that detects that bad data and deals with it appropriately.
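
      To avoid the infinite loop, every wait on a sensor needs a way out. A minimal Python sketch, where `poll` is a hypothetical function that returns a reading, or None when the sensor has nothing ready:

```python
import time

def read_with_timeout(poll, timeout_s=0.5, interval_s=0.01):
    """Poll a sensor until it returns a value or the timeout expires.

    Returns None on timeout instead of waiting forever.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        value = poll()
        if value is not None:
            return value
        time.sleep(interval_s)
    return None

def dead_sensor():
    return None   # simulates an unplugged sensor

def live_sensor():
    return 42.0   # simulates a working sensor

missing = read_with_timeout(dead_sensor, timeout_s=0.05)
present = read_with_timeout(live_sensor)
```

      The caller can then log the missing reading as missing, rather than hanging or recording garbage.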

      Do this with every single device and component you have.

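      If you want to be a little fancier, expect that a more complex device like a GPS will, once in a blue moon, give corrupted data. How would your mcu handle that data? More importantly, is your GPS configured in aviation mode? Many consumer receivers stop reporting a position above roughly 18 km (about 60,000 ft) unless they are set to an airborne or flight mode.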
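
      If your GPS outputs standard NMEA sentences, each one ends in a checksum (the XOR of every character between '$' and '*'), which gives a cheap first filter for corrupted data. A Python sketch, assuming NMEA output; the sample sentence is only an illustration:

```python
from functools import reduce

def nmea_checksum(payload):
    """XOR of every character between '$' and '*', as two uppercase hex digits."""
    return f"{reduce(lambda acc, ch: acc ^ ord(ch), payload, 0):02X}"

def is_valid_nmea(sentence):
    """True only if the sentence is framed as $payload*HH and the checksum matches."""
    sentence = sentence.strip()
    if not sentence.startswith("$") or "*" not in sentence:
        return False
    payload, _, received = sentence[1:].rpartition("*")
    return len(received) == 2 and received.upper() == nmea_checksum(payload)

payload = "GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,"
good = f"${payload}*{nmea_checksum(payload)}"
corrupted = good.replace("545.4", "945.4")  # one damaged character in the altitude field
```

      A passing checksum only means the sentence arrived intact. The receiver can still report a wrong position, so plausibility checks on the values are still needed.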

    failure type: it's too freakin' cold

      The coldest point of the flight will be during re-entry through the jet stream, likely around -70C. Study the datasheets of all your components and batteries to see what their temperature tolerances are. Typically stuff will start to fail at -20C, and for some parts 0C is the limit. Find the coldest freezer you have, put your device in it for an hour with everything on, and see what happens. This is also a great time to test data logging of temperature sensors.

      Your freezer is unlikely to get colder than -20C, so if you need to test at even colder temperatures, you'll need to make a test chamber. Buy a large cooler, put your package inside, and then pack it with dry ice (frozen CO2; -78.5 C, -109.3 F).

      If you have the option, and are really concerned about temperatures, keep the batteries warm in your pocket until just before launch. Or, perhaps mount your voltage regulator directly on the battery as a sort of mini-heater.

    failure type: too much heat

      If there is no air in space, how will your overheating electronics cool down?

      Be careful with too much insulation - we once had a camera fail before we even launched because it overheated with all the insulation we used! The best way to test this would be to place the package outside in the sun on a hot summer day, point solar reflectors at it, and then measure the temperature over several hours.

      Or put it in a large cooler with a heater inside, and ramp up the temperature to ~120F over an hour or two to see what happens. The inside of your car on a bright summer day may work too, if it gets hot enough.

      Some vacuum chambers also have an adjustable heater you can use - perfect for simulating the heat condition.

      To be even more realistic, do heating -> cooling -> heating -> cooling -> heating in that order, for 30 min each, as would happen with a real mission.

    failure type: poor quality GPS at landing site

      GPS isn't always accurate. Depending on the environment and the situation, you could get readings that are off by a hundred meters or more. Sure, if your balloon lands in a big empty field that's not a problem. But if it lands in rough terrain, with lots of trees and stream-cut ditches, it might take a while to locate and recover.

      Obviously you want to make your package easy to see, using bright orange colors for both the package and the parachute. You may want a bright strobing light beacon for night missions.

      Some groups have put loud sirens on their balloons, as they don't weigh or cost much and don't consume too much power. I see two issues with this, however. First, the loud noise drowns out all audio from your video cameras. The second issue is described in an email I received:

      "Some have reported that the capsules with audio beacons were destroyed on the ground out of fear/annoyance."

      You can find easy to use alarms at Mallory Sonalert and Adept Rocketry. I'd recommend controlling it with a microcontroller, only activating the alarm at low altitudes. It should also have an 'alarm off' button on the outside of the package. You can also wire up a piezo-buzzer to be pulsed by a microcontroller.
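
      A sketch of that alarm logic in Python. The altitude thresholds are hypothetical and are taken to be heights above the launch site; the alarm arms only after the flight has climbed, so it stays quiet on the launch pad.

```python
class AlarmController:
    """Sound the siren only after the flight has gone up and come back down low."""

    def __init__(self, arm_above_m=3000.0, sound_below_m=1000.0):
        self.arm_above_m = arm_above_m      # hypothetical arming altitude
        self.sound_below_m = sound_below_m  # hypothetical activation altitude
        self.armed = False
        self.silenced = False

    def press_off_button(self):
        """Called when the external 'alarm off' button is pressed."""
        self.silenced = True

    def update(self, altitude_m):
        """Feed the latest altitude; returns True if the siren should be on."""
        if altitude_m > self.arm_above_m:
            self.armed = True
        return self.armed and not self.silenced and altitude_m < self.sound_below_m

ctl = AlarmController()
states = [ctl.update(a) for a in [200, 5000, 30000, 5000, 800]]
ctl.press_off_button()
after_button = ctl.update(300)
```

      Since this depends on altitude readings, a bad GPS or pressure value could trigger or suppress the alarm, so the input deserves the same sanity checks as any other sensor.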

      Use an amplifying microphone so you can hear the alarm from farther away. And binoculars to visually confirm.

    The KISS Philosophy
    Keep It Simple, Stupid. This engineering philosophy states that the more complicated your design is, the longer it will take to make, the more expensive it will be, and the more likely it is to fail. Stick to a basic main design, and resist the temptation of 'feature creep'.

    Failure Stories
    A few stories on what can go wrong . . .

    Story #1:

    "The second incident of the launch occurred when the Asimov II passed 14,000 feet [from a planned 90,000 feet]. Its onboard GPS hiccupped. This hiccup fooled the onboard electronics, placing the near spacecraft into descent mode. Now that it was in descent mode, the near spacecraft stopped performing experiments. The near spacecraft continued to transmit telemetry, but only engineering data and no science data."
    -L. Paul Verhage, Designing Your Own Program of Near Space Exploration

    The lesson of the above story is clear. There was a single point of failure that shut down the entire system. When you program and wire up your device, take time to consider what would happen if any sensor or device failed. Would your system gracefully recover, or go into full melt-down? Thinking 'oh, that component will never fail' is asking for trouble . . . instead, have a mindset of 'freak accident hypothetical'.
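
    One way to keep a single GPS hiccup from flipping the flight mode, as in Story #1, is to require several falling readings in a row before declaring descent. This Python sketch is only an illustration with hypothetical thresholds; it is not the logic used on that flight.

```python
def detect_descent(altitudes_m, required_consecutive=5, min_drop_m=10.0):
    """Return the index at which descent is confirmed, or None.

    Descent is declared only after `required_consecutive` samples in a row
    each fall at least `min_drop_m` below the previous sample, so one bad
    GPS fix cannot change the flight mode.
    """
    streak = 0
    for i in range(1, len(altitudes_m)):
        if altitudes_m[i] <= altitudes_m[i - 1] - min_drop_m:
            streak += 1
            if streak >= required_consecutive:
                return i
        else:
            streak = 0
    return None

climb_with_glitch = [1000, 1100, 1200, 0, 1400, 1500, 1600]
real_descent = [30000, 29900, 29800, 29700, 29600, 29500]
glitch_result = detect_descent(climb_with_glitch)
descent_result = detect_descent(real_descent)
```

    A second, independent altitude source such as a pressure sensor would guard against a GPS that fails for longer than a few samples.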

    Story #2:

    This is a story I was told by my group when I first joined them. It was, I believe, their first ever launch. Long story short, the transmission antenna wasn't firmly secured and it literally snapped off during flight. Without GPS coordinates, they were not able to recover the package. About a month later a farmer found this strange, suspicious-looking capsule in his field. There was a label on it that said 'call this number if found', but he wasn't able to reach anyone. So he put it on his desk for another two months. Then one day he saw a space program on the Discovery Channel, which I assume was about high altitude balloons, and it inspired him to call a friend who was an engineer. The engineer came over, saw the phone number, called it and managed to reach the team. Turns out the farmer has difficulty using a telephone . . . Anyway, the software was not written to account for a brown-out situation as the batteries died, and as it continually reset it overwrote all the data, thereby corrupting it.

    If you have a story of your own, please email us to have it added.

    === Ultimate Failure Testing Checklist ===
    Below is a short list of the most important tests your balloon package must pass before a real mission. Always test hardware before software, because changes to hardware can affect software. Take good notes and clearly date each of your experiments - it may seem like boring, pointless work, but I find my notes really help in debugging complex systems.

    1) drive it around for hours test
    Pack it all up, turn all systems on, and put it into the back seat of your car. Drive it around for a few hours and have your friends attempt to track you down from the transmissions. Review recorded data at the end of the drive to verify everything looks reasonable.

    2) let it sit there until batteries die out test
    Pack it all up, turn all systems on, and let it sit there until you are confident all batteries are dead. This test verifies battery life. It also mimics what happens when you lose the package for days or weeks, when brown-out failures can corrupt data. Review recorded data at the end of the test.

    3) unplug stuff and see what fails test
    Turn all systems on, then randomly disconnect/reconnect each sensor for a set period of time. Verify recorded data at the end of test. This test verifies that the entire system will not have a complete failure if and when a single sensor has an issue (loose wire, etc).

    4) introduce bad sensor data test
    In software, after reading sensors, overwrite data from these sensors with bad data and see what happens. Some sensors, such as GPS, could temporarily spew out bad data. Does the entire recorded data get corrupted, or does your system quickly recover? What if GPS reports the wrong altitude?
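
    A sketch of how bad data could be injected on a fixed schedule, in Python. The wrapper and the stand-in sensor function are hypothetical; the idea is to wrap your real read function so that every Nth call returns garbage.

```python
def with_fault_injection(read_sensor, bad_value, every_n):
    """Wrap a sensor-read function so every `every_n`-th call returns bad data."""
    calls = 0

    def wrapped():
        nonlocal calls
        calls += 1
        if calls % every_n == 0:
            return bad_value
        return read_sensor()

    return wrapped

def read_altitude_m():
    return 1500.0  # stand-in for a real GPS read

faulty_read = with_fault_injection(read_altitude_m, bad_value=-99999.0, every_n=3)
samples = [faulty_read() for _ in range(6)]
```

    Run the rest of your flight software against the wrapped function and check that the log, the flight mode, and any alarms behave sensibly around each injected value.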

    5) shake/drop test
    Pack it all up, turn all systems on, and violently shake your box of electronics. Then, throw your package into the air and let it drop onto a hard surface. Your space package must be designed to survive a fall from 100k feet, so what are you afraid of?

    6) freezer test, vacuum test
    Pack it all up, turn all systems on, and let it sit in your freezer for an hour. Verify data afterwards. If you have access to a vacuum chamber, do it in there too.

    7) plot sensor data into meaningful information
    Pretend your mission is completed. Using data from the previous tests, plot it all out into nice looking charts. Inspect the charts to verify sensors and spot 'blips' that could help you find software bugs.
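
    Alongside the charts, a simple script can point out suspicious samples for you. This Python sketch flags any value that sits far from the median of its neighbors; the window size, threshold, and sample data are hypothetical.

```python
import statistics

def find_blips(samples, window=5, threshold=10.0):
    """Return indices whose value differs from the local median by more than `threshold`."""
    half = window // 2
    blips = []
    for i, value in enumerate(samples):
        neighbors = samples[max(0, i - half): i + half + 1]
        if abs(value - statistics.median(neighbors)) > threshold:
            blips.append(i)
    return blips

temps_c = [20.1, 19.8, 19.5, 85.0, 19.0, 18.7, 18.5]
blip_indices = find_blips(temps_c)
```

    A flagged sample is a place to look, not a verdict: it may be a software bug, a loose wire, or a real event.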

Society of Robots copyright 2005-2014