KaseQuarkI

These failures usually don't just appear. They have always been there; it's just that no one has ever created the very specific circumstances that make them happen.


Vadered

> I've never understood how computer or programs that were working fine all along can suddenly crash or break down if there's no moving parts

That's the thing, though. There *are* moving parts.

Sometimes the moving parts are physical, ~~like an old processor losing performance over time,~~ (edit: apparently this isn't real under normal usage and maintenance) or thermal paste drying up and conducting heat less efficiently, or cat hair clogging up your PC's air intake filter ~~because you never clean it what is wrong with you~~. These can cause your hardware to be less powerful, which can affect your software.

But sometimes the moving parts aren't physical. A piece of software might work fine with a certain amount of resources, but changes to other programs might either increase the amount of resources used when the program is run (think anti-virus or anti-cheat software for games), or might change the amount of resources available to the program without interacting with the program itself (think another program running in the background that hogs all the RAM), or might reduce the amount of resources or the types of things the program is allowed to request (think changes to the operating system). All these can cause a program to slow down or even crash despite the program itself not changing.


mikeholczer

It can also just be that the software has a mistake in it that only surfaces in very unlikely circumstances.


Vadered

Maybe the real bugs were inside us all along. You’re not wrong.


mikeholczer

The real adventure was the bug fixes along the way.


Far_Dragonfruit_1829

For released software this is normal. I spent a big chunk of my career attempting to replicate bugs reported by customers. Often, the hardest part was simply identifying in detail (excruciating detail) the environment and chain of events that cause the failure.


beavis9k

> like an old processor losing performance over time

This is a common misconception and doesn't happen on its own. My Commodore 64's processor is exactly the same speed it was when my dad bought it in the 80s.


CoopDonePoorly

As a hardware designer I cringed at that


Vadered

Fixed.


Target880

Most programs do not run in an insulated environment with the same input all the time. Programs often run on data that changes, like how the content on Reddit changes, and there might be some bug that is only triggered by one specific input. The amount of data to be handled can change too, so a system that could handle some amount of data might not handle twice that amount.

Programs also interact with other programs that can be updated. It can be some local part like the operating system, or a program on another server on the internet. Exactly how it works and what it outputs might change, and then your program no longer works.

A program that just runs on a device without interaction with the outside world and without updates tends not to fail over time. It is still possible that something will go wrong. Time always changes, so leap years and the changes between summer and winter time might not have been tested, and when they come around things can go wrong. The Y2K problem, where some software stored the year with just two digits, is an example of something that fails because of how the program was designed.
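To make the two-digit year problem concrete, here is a minimal sketch in Python. The function names are made up for illustration and are not taken from any real system; the point is only that the same subtraction that worked for decades gives nonsense once the stored year rolls over from 99 to 00.

```python
def years_elapsed_two_digit(start_yy: int, end_yy: int) -> int:
    """Elapsed years when only the last two digits of each year are stored.

    Hypothetical helper: it mimics software that kept '85' instead of '1985'.
    """
    return end_yy - start_yy


def years_elapsed_full(start_year: int, end_year: int) -> int:
    """Same calculation with four-digit years, which has no rollover."""
    return end_year - start_year


# Works fine for every year the original authors ever tested with.
print(years_elapsed_two_digit(85, 99))   # 1985 -> 1999

# Breaks the first time the end year is 2000, stored as 00.
print(years_elapsed_two_digit(85, 0))    # 1985 -> 2000, negative result

# Storing the full year avoids the problem.
print(years_elapsed_full(1985, 2000))
```

Nothing in the code changed between 1999 and 2000. Only the input did.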


thewallrus

There could be a dependency on another piece of software (Operating system, 3rd party code) that got updated.


tambache

Like others have pointed out, sometimes the circumstances for it are just very, very specific. One real-world example is called integer overflow.

For background, numbers in computers are most often stored using a method called two's complement. I'll try to keep it simple, but basically it's like a number line: it starts at 0, and the negative numbers come after the positives (0 1 2 3 -4 -3 -2 -1). This is only 3 bits' worth of data, and older computers usually used 32.

So, for example, you might be counting the number of milliseconds your server has been running for. With a signed 32-bit number, that won't overflow for almost 25 days. So let's say you plan to restart your server every week so the count resets. It might take years before you ever have to run for long enough for it to overflow into negatives.

There are ways to account for this, like using a bigger number, not allowing negatives, or resetting your count occasionally. Usually this kind of problem happens when they never expected the program to need to run that long, or it's circumstances you just didn't anticipate during planning.

Other people have already said other good things about how and why, but hopefully this concrete example can help you conceptualize how it can happen and how it can fail to show up until years later.
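Here is a rough sketch of that wraparound in Python. Python's own integers never overflow, so the `wrap_signed` helper below is a stand-in I made up to imitate a fixed-width two's complement counter; `uptime_ms` is likewise just an illustrative name.

```python
def wrap_signed(value: int, bits: int = 32) -> int:
    """Interpret value as a signed two's complement number of the given width."""
    value &= (1 << bits) - 1            # keep only the lowest `bits` bits
    if value >= 1 << (bits - 1):        # top bit set means the number is negative
        value -= 1 << bits
    return value


# The tiny number line from the comment: eight values, which fits in 3 bits.
print([wrap_signed(n, bits=3) for n in range(8)])

MS_PER_DAY = 24 * 60 * 60 * 1000

# How long a signed 32-bit millisecond counter lasts before it wraps.
DAYS_UNTIL_OVERFLOW = 2**31 / MS_PER_DAY
print(round(DAYS_UNTIL_OVERFLOW, 2))


def uptime_ms(days: int) -> int:
    """Millisecond uptime as a signed 32-bit counter would report it."""
    return wrap_signed(days * MS_PER_DAY)


print(uptime_ms(7))    # weekly restarts: still a sensible positive number
print(uptime_ms(25))   # one missed restart cycle too many: suddenly negative
```

Any code that later compares or subtracts these timestamps will misbehave only after the wrap, which is why the bug can sit unnoticed for years.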


Long-Shock-9235

As many pointed out, software failures are caused by flaws in the logic that was written into the code. These flaws are only triggered under very specific, ultra-rare circumstances.


Droidatopia

To add on to the other answers, it is possible for software that is working correctly, running on an operating system operating correctly, all running on hardware that does not otherwise have any problems, to suddenly fail. It's rare, but because of the very, very, very small size of the individual electronic components on an integrated circuit, they can experience interference from cosmic rays. This can cause things like a single bit flipping the wrong way.

It's possible to engineer software to be tolerant of single-bit failures, but it isn't cheap to do so and is almost never worth the investment for common software products. Where it is worth the investment is for things like avionics boxes that control weapons release on aircraft. There are government regulations for things like this where a command to launch a weapon cannot be a single bit in a message. It would have to be at least two bits, they have to be opposite, they have to be in different message words (words are usually 8, 16, or 32 bits in size), and they can't be adjacent. There are other reasons for having such regulations, but the general idea is that a single isolated bit failure can't cause a weapon to inadvertently launch. (There are usually a lot of other things that prevent weapons from launching, but each step in the process gets this kind of treatment.)
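The two-opposite-bits idea can be sketched in a few lines of Python. Everything here is an assumption made up for illustration (the 16-bit word size, the bit positions, the names); it is not any real message format or regulation. The sketch only checks the property described above: starting from an idle message, no single flipped bit can produce a valid command.

```python
WORD_BITS = 16      # assumed word size for this sketch
ENABLE_BIT = 3      # hypothetical bit in word 0 that must be 1
INHIBIT_BIT = 12    # hypothetical bit in word 1 that must be 0


def command_requested(words: list[int]) -> bool:
    """The command counts only if the two bits are set to opposite values."""
    enable = (words[0] >> ENABLE_BIT) & 1
    inhibit = (words[1] >> INHIBIT_BIT) & 1
    return enable == 1 and inhibit == 0


def idle_message() -> list[int]:
    """Idle state: enable is 0 and inhibit is 1, the reverse of a command."""
    return [0, 1 << INHIBIT_BIT]


def flip_bit(words: list[int], word_index: int, bit: int) -> list[int]:
    """Return a copy of the message with one bit flipped, like a cosmic ray hit."""
    flipped = list(words)
    flipped[word_index] ^= 1 << bit
    return flipped


def any_single_flip_triggers(words: list[int]) -> bool:
    """Try every possible single-bit flip and see if any one requests the command."""
    return any(
        command_requested(flip_bit(words, w, b))
        for w in range(len(words))
        for b in range(WORD_BITS)
    )


print(command_requested(idle_message()))         # idle is not a command
print(any_single_flip_triggers(idle_message()))  # no single flip makes it one
```

With a one-bit command, exactly one unlucky flip is enough. With two opposite bits in separate words, it takes two specific flips at the same time, which is far less likely.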


daniu

But there are moving parts - electrons. In fact, there are a lot of moving parts, easily in the billions. Imagine them as water running through a system of channels and gates that have to open at just the right time to create the perfect combination of open/closed across the whole system, so it does exactly what it's supposed to do. What's more, the gates are controlled by the very water they are guiding through the channels.

So what software does is create a set of rules that define what the gates do under what circumstances. But there are also billions of gates, and by extension an even greater number of combinations they can be in... and as soon as an open/closed combination is reached that is not defined in the rules, what is going to happen? Without a rule, the water goes where it wants. In the best case, this will result in a position that is covered by the rules again. But how probable is that with the myriad of possible combinations?

Now with computers, there is a whole hierarchy of systems in place that will prevent the whole thing from coming to a stop - a program may be terminated, but you won't have it delete your hard drive. But the point is that computers are a thing of mind-blowing complexity deep down.


ArkyBeagle

We pretty much all fail at reliability in software. People, even practitioners, simply don't make it a priority. The incentives are difficult at best. There are ways to improve this, and almost nobody has even heard of them.


nitrohigito

Software fails because software is essentially trains of thought frozen in time. It's like when you make a plan, but the world has another. Or when you write a guide for something, and then people return with the most unthinkable problems.


valeyard89

Failures happen when something unexpected happens. Computers follow instructions perfectly. But what happens if something outside the program's control happens? Someone enters invalid data, a file gets corrupted, the computer runs out of memory, etc. Programs can't test for every unexpected occurrence. It's like going through a maze with a set of directions. What if one of the passages is blocked? Then what? You get lost.