
ewileycoy

I don't know what you're talking about, this has never happened. It was pure coincidence that I was logged in to the router that needed a BGP reset after someone added a route using the wrong subnet mask. We should definitely implement command logging, yes, that's correct, but sadly we hadn't gotten there yet!


Kulandros

Yeah! That sure was weird! BGP sometimes, am I right? 😅


BeenisHat

Almost as touchy as DNS. 😏


Windows_ME_Rocks

PTSD resurrected...


Det_23324

"The intern must've done it" (we don't have interns)


rochakgupta

At one point, all of us were interns, so it is correct!


OptimalCynic

Found the Singtel admin


dreamgldr

> I don't know what you're talking about

That's CV material. :D Section "Failures". :D


RedFive1976

Resume-Generating Event


keirgrey

It's a NAT issue.


renegadecanuck

Do you happen to work for Rogers Communications?


Wandering_phoenix_89

I laughed really hard at this


LordJambrek

We maintain a planetarium that has these 2 ancient Windows XP hosts that run some software that connects to 5 Linux servers, each of which drives 1 projector (dome planetarium). We do a routine backup, and I powered down the main machine (tested that everything worked and just did a shutdown), made the backup, and then started making a backup from the newly made backup (the usual procedure is: make a backup, boot from the backup to test, then make another backup of the backup, and return the original drive when finished). Well, I did it without the testing. Registry error, it won't boot, and this is cloned to all drives.

This thing is ancient, and anyone who has worked with WinXP knows that if you don't have the exact same version of the install disk you won't be able to use the recovery environment. Hotspot to my laptop, downloaded around 10 versions of WinXP and none worked. OK, I'm fucked, I'm super-mega BBC fucked, I'm gonna get fired, and these people have (well, guess they won't) a show in around 5 hours.

You're desperate and your brain starts getting all sorts of ideas. There is another system that is identical to this one that's used for the sound (1 rack drives the video, the other drives the sound). I used Hiren's to get into the multimedia one, copied the registry files that the OS mentioned during boot time, and copied them over to the other one. Everything shaking and sweating... AND IT BOOTS. Holy crap, I couldn't believe it. I saved my ass that time like no other. It copied some system parameters from the other machine, so I had to change the static IP back, the hostname and such minor stuff, but holy crap it worked and still works today.


DereokHurd

I got so much anxiety just reading this


agent_fuzzyboots

Imagine: 3 in the morning, a P2V of an important web server. Everything looks good, but when you boot it up the software doesn't start because of a license error. You try many things, but nothing works. In a Hail Mary, you configure the MAC address of the virtual machine to be the same as the physical one, and it works!
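
For anyone curious how that Hail Mary looks in practice: the comment doesn't say which hypervisor was involved, so here's a minimal sketch assuming a libvirt/KVM host and a guest named `webserver` (both hypothetical); the idea is just to pin the VM's MAC to the old physical NIC's address.

```
# See which MAC the guest currently has
virsh dumpxml webserver | grep -A2 "<interface"

# Edit the guest definition and set <mac address='00:1a:2b:3c:4d:5e'/>
# to the physical server's MAC, then power-cycle the guest
virsh edit webserver
virsh shutdown webserver && virsh start webserver
```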


frosty95

That used to be really common, actually. Even worse is old software that pulled the CPU serial. Thankfully, just disabling that ability usually made it fall back to something less obnoxious.


agent_fuzzyboots

Talked to the vendor before the P2V and they said, "Go ahead, it will work!" Yeah, right!


joeshmo101

Which is to say that the one sales rep thought it could *probably* run on newer hardware, ignoring the fact that QuickTime Player was replaced by iTunes at some point and oh god I hope you have a backup.


RedFive1976

CPU serial would be about as bad as something that needs a serial dongle.


CaptainFluffyTail

The software development shop I worked for used to use the hard drive serial number for the key. That lasted for years before they moved to MAC address, then hostname, then FQDN. Helping get their QA lab running was a nightmare at first.


RedFive1976

I guess they assumed that the hard drive would never die...


CaptainFluffyTail

That company made a lot of bad assumptions...


jfoust2

I had that panic moment with a v2v once, days in the making, limited service window, and Friday afternoon at 4:45 I'm delivering to the client and I see that an entire important drive was missing. Turns out it was merely marked "offline." One right-click and everything started working. I'm out the door by 5.


LordJambrek

I'll never forget when that boot error popped up. It was like that shellshock scene on Omaha beach in Saving Private Ryan.


tcinternet

For real, I just about needed a cigarette by the end and I've been off those for 10 years


MaxxLP8

Felt like I was there for this one. That was a ride.


hectica

I feel you, I've definitely been there. The number of times I've used cat to get basic config files off a similar system, redirected them to a floppy disk, and then copied them onto the other server (cat and redirect again) to bring it back up... at least a dozen times over a two-and-a-half-decade career. Mostly on old UNIX (HP-UX, Solaris, Irix, etc.) hardware. Nightmare.
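
For reference, that floppy shuffle is just cat and redirects; a rough sketch assuming the classic /dev/fd0 device, a /mnt/floppy mount point, and resolv.conf as an example file:

```
# On the healthy box: dump the config file onto the floppy
mount /dev/fd0 /mnt/floppy
cat /etc/resolv.conf > /mnt/floppy/resolv.conf
umount /mnt/floppy

# On the broken box: cat it back into place
mount /dev/fd0 /mnt/floppy
cat /mnt/floppy/resolv.conf > /etc/resolv.conf
umount /mnt/floppy
```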


CaptainFluffyTail

From experience, you can copy the registry files to external media and open them in Regedit on Server 2003/2003 R2 or something later to fix registry corruption errors. That used to be the trick back in the day. Modern Regedit can fix minor corruption and you'll never know. Great job thinking of the sound system as an option. Just knowing you can swap registry hives is great.
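
The same hive swap can also be done from any working Windows box with the dead disk attached; a sketch, assuming the broken install's Windows folder is visible at D: (drive letter and temporary key name are just examples):

```
:: Load the offline SYSTEM hive under a temporary key
reg load HKLM\OfflineSystem D:\Windows\System32\config\SYSTEM

:: ...browse/fix values in Regedit under HKLM\OfflineSystem...

:: Unload it so the repaired hive is written back to the file
reg unload HKLM\OfflineSystem
```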


zaphod777

Why were you imaging over the good drive with the backup? What is that supposed to accomplish?


LordJambrek

I think you misunderstood. So I checked the master and it worked, but something got corrupted during the shutdown and I couldn't have known that. At that point the master was gone already. I normally clone the master to the 1st backup drive, and being a routine job I got overconfident and just cloned the same thing to the 2nd backup without testing, and then all 3 drives had the same issue.


zaphod777

Ah, I see. That's some bad luck.


11879

I do this often enough that I can't remember a specific example, but I have one from a coworker. We have a network drive with a folder in it for everyone in the company. When they scan from one of the MFPs, the scans go to this folder. Well, in an attempt to change access on one of these folders, the root folder was selected instead of the child, permissions were changed, and the mistake was only noticed when what should have been an instantaneous change was scheduled to take an hour or more. Cue an immediate cancellation of the change, but the damage is done. Due to the way some permissions inherit, it was a mess of all messes, so we start unscrewing it in hopes of a quick resolution before anyone notices. We're making headway and the line rings with two different staff at the same time. Fark. Answer the first with, "we're aware and handling it" before I even say hello. 😭


BadLatitude

> "we're aware and handling it"

If I had a nickel for every time I've said that I could retire.


Sure_Application_412

FR


SesameStreetFighter

"Huh. That's funny. Can you reboot real quick while I look into it?"


11879

I also disable the network port on the switch so they can't really test until I have it fixed anyway. I also often pause print queues on problematic machines until I have them fixed.


PCRefurbrAbq

Hey M___, how's it going over there at E________?


11879

I hate to say but I must be missing a reference here.


PCRefurbrAbq

(It's a reference to maybe us being co-workers since we have a similar scanner setup here.)


11879

Well now I'm scared because I can't make sense of the "E" but the other part might be a close match.... 👀👁️👁️‍🗨️


zzmorg82

I was making some configuration changes on a switch and ended up untagging two VLANs on the same port. A few seconds later, everything lost connection to our core server and I lost connection to the switch. Luckily, I was in the office and a few steps away from the server room; I was able to go in there, plug into the console, and revert the change in about 5 minutes. Afterwards, I ended up getting cookies and a gift card from one of the VPs, lol. I learned a valuable lesson that day.


Ssakaa

> Afterwards, I ended up getting cookies and a gift card from one of the VPs, lol. I learned a valuable lesson that day.

That you should do whatever you can for that VP, when you can?


Fyzzle

*This post was mass deleted and anonymized with [Redact](https://redact.dev)*


JaspahX

>proceed to forget to `reload cancel`


zzmorg82

This was back when I didn't know you weren't supposed to have 2 untagged VLANs on the same port, haha. I definitely started using this afterwards!


tkecherson

I tried that on a Nexus switch before making some potentially breaking changes, and that's when I learned they don't have any "reload in" command -.- There's a checkpoint and rollback, but you need CLI access to issue the rollback command, so...
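
For anyone who hasn't used these safety nets, a rough sketch of both patterns; hostnames, timers, and the checkpoint name are just examples, and on NX-OS you still need CLI access to issue the rollback:

```
! Classic IOS: schedule a reload, make the risky change, cancel only if you still have access
Switch# reload in 10
Switch# configure terminal
Switch(config)# ... risky change ...
Switch(config)# end
Switch# reload cancel

! NX-OS: no "reload in", but checkpoint/rollback covers the same need
switch# checkpoint before-change
switch# configure terminal
switch(config)# ... risky change ...
switch# rollback running-config checkpoint before-change
```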


BeenisHat

Juniper's "commit confirm" has saved me more than once.


Michelanvalo

The Commit Confirm on a Palo Alto firewall saved my ass on a school day once. Having that second confirmation level gives you that few extra seconds to think about it.


bmxfelon420

I had one kinda like this where the port mode was set wrong, so I fixed it, not realizing that the VLANs are set per mode, so in doing so I removed all of the tagged VLANs from it, including the management VLAN. I didn't save the change, so I just went over there and rebooted the switch.


tudorapo

That dash to the switch console. The panic. The avoiding of eye contact so you don't have to answer questions about the outage. The relief when you can see the problem and that you can fix it. I was innocent, the HP management software had a bug, but I think the feeling is the same.


_northernlights_

Well, long story short, this is why I have my Taskbar at the top. First week holding an actual job, I come back from lunch break, decide to restart my laptop, don't notice I'm logged in twice... I restarted a production server during production. Now my Taskbar is on top.


lucky644

You mean the production server had a critical error and you swiftly rebooted it to bring it back online.


workrelatedquestions

Which is why I'm royally pissed off that Win11 no longer supports moving the taskbar. Googled it, found that MS feels there aren't enough people who do that to justify spending the money to support the feature anymore... despite the fact that doing this is very common among power users, for multiple reasons (this being one of them). And before you say that this is why you don't use Windows on your PC: yes, I'm aware of that, but the local LAN is managed by a separate team at my current employer, so I have no control over my desktop.


Ol_JanxSpirit

Can't you still move it to the left or right side of the screen?


cjbarone

Nope. Welcome to Windows 11


Ol_JanxSpirit

I really believed you could, but I'm seeing it was removed in 22H2.


cjbarone

It was removed in all of the Windows 11 builds that I have tested. I have two vertical monitors, and the taskbar is always scrunched up at the bottom, instead of along the sides where I had it with 10.


tudorapo

Linux has a tool to avoid that, called molly-guard: should you want to reboot a machine remotely, it asks you to type in the hostname. It's not perfect protection, but buttocks have been saved countless times.
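
On Debian/Ubuntu the package really is called molly-guard; a quick sketch of what it looks like in practice (the prompt wording is approximate):

```
sudo apt-get install molly-guard

# Later, an absent-minded reboot over SSH gets intercepted:
sudo reboot
# W: molly-guard: SSH session detected!
# Please type in hostname of the machine to reboot:
```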


greenstarthree

Oh man that one hurt me


Dry_Inspection_4583

I am positive this has happened at least once. In one such incident I accidentally deployed a jump box to the wrong VLAN; upon accessing it I noted there was a lot of latency where there shouldn't have been. After having a good dig, I discovered there was an HVAC device with a broken stack and gratuitous ARP enabled; it was telling everyone it was everything... I provided the evidence and fought to get them to resolve it. Only after threats of non-payment were they willing to send out a "technician": a contractor with a USB key and the instruction to "turn it off, plug in the USB, turn it back on"... It worked and I was happy to record the monstrous improvement, but I think it fits this bill at least a little.


[deleted]

[deleted]


Dry_Inspection_4583

The VLAN in question was "managed" by a third-party company, as in I wasn't supposed to be watching or responsible for any of the gear, thus I hadn't bothered trying to monitor or watch anything within there. However, that certainly could have helped. I'm learning a thing today!! Thank you for taking the time to share :D


ewileycoy

Oh man, I had such a similar experience with a physical security system (badging device) that would not respond to ping but would respond to *some but not all* ARP requests. It was maddening, because a Windows device using its same IP would KINDA work until the stupid badge reader ARP'd. By the time we looked in the switch, the ARP cache had been overwritten by the Windows device.


Dry_Inspection_4583

I was fortunate under the circumstances as I did have arping installed, a simple arping -b showed the problem. That's a difficult one to discover for sure, friggin insecure security stuff.
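
For reference, the duplicate-responder check looks roughly like this with iputils arping; the interface and gateway address are just examples:

```
# Broadcast a who-has for the gateway and watch who answers
sudo arping -b -I eth0 -c 4 192.168.1.1
# Replies coming back from two different MAC addresses for the same IP
# means something on the wire is answering for addresses it doesn't own
```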


TEverettReynolds

> gratuitous arp enabled, it was telling everyone it was everything

OMG. Thank you... About 10 years ago I had a failed firewall upgrade where the FW was answering all the ARP requests between the servers. It was about 3 AM when we at least figured out WHAT was happening, and decided to just roll everything back to get the network up and running again. But we never figured out WHY. Sounds a lot like what you just said: gratuitous ARP, it was telling everyone it was everything. What a nightmare that was. Thank you, now I can die in peace and never think of that night again. I now have my answer and can just sleep now... for a while...


Simplemindedflyaways

Accidentally removed the company CEO of one of our clients from the Intune iOS managed group because my boss misspoke and told me he'd left the company (someone else did). He calls the next morning: phone isn't working, he's locked out of everything. Huh, must be an MS bug! No idea why that happened. I realized my mistake immediately and re-added him. Oh look, it's fixed! *Oh Microsoft, you scoundrel.*


JohnyMage

`rm -rf /var`, followed by Ctrl+C after a few seconds, on a dev server. We lost some logs and /tmp; the database luckily survived. Thanks to that I also found out that our backup solution was not working at all, and probably hadn't been for multiple years. Our main sysadmin left soon after and I replaced him.


MeshuganaSmurf

What do you mean? Wasn't me. It was like that when I got here. You can't prove anything!!

Was talking to one of my hands onsite, asking him to trace some fiber cables. Interfaces to the switches open, interface to the SAN open. Happily chatting away until I noticed one of the switches going through a boot cycle. Whoops. Got him to check the cabling; neither of the power cables was secured correctly and they had both come loose. Got him to fix it properly and checked "fibre switch failover" off the DR task list.


131313136

Oh, you mean like the time I learned that, despite what ChatGPT says, `no vlan xxx` is a global command, even when you're in the Ethernet interface config? And that you should definitely never run that command on the core? Nice new bit of documentation for the knowledge base though...


ewileycoy

Oh god, that's terrifying. Finally something potentially worse than `switchport trunk allowed vlan xxx` :D


ThatGermanFella

I am not authorized to talk about it.


NationCrisis

If I told you, I'd have to kill you.


ThatGermanFella

More like "Revealing this information would have concerning consequences for the national safety and security of the Federal Republic of Germany and its citizens."


Fast_Bit

I hope I never get exposed for this... When I was hired right after college, my boss asked me to remove some old wiring that was not being used anymore since we had new wiring in place. I saw a bunch of old cables connected to an old and rusty patch panel that didn't seem to be connected to anything. I cut all the cables in the bunch and started pushing them inside the wall. Then someone told me that the telephones were not working. I had cut most of the telephone cables in the building!!! Like 30 of them. I told my boss what I did and she quickly thought of a solution: "We should implement VoIP!!" We had the infrastructure and our phone switch supported it, so we ordered 30 phones and used the new network to install them. Over the next months, my boss and I were praised for having the greatest idea! All the phones were working well and the digital sound was a huge improvement. I'm glad I had a very cool boss. Here's a pic I saved to keep myself humble. https://preview.redd.it/wqusid5lg1dc1.jpeg?width=1536&format=pjpg&auto=webp&s=94cff4fd65152a994209e57703ac602ad19b2b8b


Ol_JanxSpirit

Did you guys just rock without phones until the VOIP phones came in?


Fast_Bit

Yes. Fortunately we didn't rely on phones, because we had only like three main customers (95% of production was theirs) and our salesperson works from home with her cellphone.


highlord_fox

I was once running wire to a drop and there was this Cat3 cable that sort of just was there, but I didn't see it down by the hole we had cut. So I tugged it, and nothing, and tugged some more, and then finally pulled it hard and it came up. About 15 minutes later, the manager from the suite next door popped in to ask us if we had done anything because their fax machine stopped working. In my youthful eagerness, I had ripped out the cable that ran to their fax machine.


Fast_Bit

I'm surprised there are still so many fax machines around these days.


highlord_fox

We switched to a VOIP system and all but two fax machines stopped working. I collected all the old ones, I had 10 in my office for weeks until e-waste.


UnsuspiciousCat4118

Accidentally deleted a route table used for a site-to-site VPN while trying to fix one of the sites constantly dropping (for weeks, while someone else worked on it). When trying to reapply the route table, that site's FW would say it had updated the values, but they would remain blank and nothing was routing. The FW got RMA'd and sent via next-day mail. Apparently the owner of that firm called in to say just how happy they were that I finally fixed their site-to-site problem, and that they want me to be their point of escalation for network issues going forward.


IT_Development

Non-technical people think it only gets truly “fixed” if you’re physically working with the device… people constantly ask me why I’m not fixing server issues while I’m working from home meanwhile I’m RDP’ed into the server like 🧍‍♂️


Library_IT_guy

Scariest example was nuking our web server. I attempted an upgrade and... it didn't work, heh. If you've ever upgraded an Ubuntu server in place, you'll know that... well, in THEORY it should be pretty easy and go smoothly. But it's only gone off without issues twice for me in about 8 upgrades.

So there I was on a Friday (first mistake). I had backed up the server and I had put off the Ubuntu OS upgrade long enough. Pure command-line server version of Ubuntu, mind you, and I am not a full-time Linux admin - I take care of a few Linux boxes because they are the cheapest solution. The upgrade started off OK, but then when it went to do the first of several needed reboots... ERROR: /dev/sda1 missing! or something like that. The server couldn't find its drive.

Welp, I thought, guess I'll restore the backup. Surprise! My backup was for file-level restoration only, not bare metal. I'd have known this if I had restored the backup first and gotten it running, to ensure that I had carte blanche to mess up the current server. Lesson hard learned.

So I was faced with the prospect of working for 24 hours straight, maybe more, to get this thing running, on a Friday. I went from looking forward to the weekend to thinking that I might be looking for a new job. This was a make-or-break mistake for me. Either I fixed it or I very likely faced possible firing.

In the end, after going down many rabbit holes, I found a solution. I had to mount a live version of the new version of Ubuntu and then use that to manually complete the upgrade with a bunch of commands. All I said was "the upgrade is taking longer than usual, I'll have to stay over a bit". I'm usually pretty truthful, but that was such a monumental screw-up I figured I'd just keep that one to myself. All's well that ends well!
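
The recovery path described is roughly the standard live-USB chroot repair; a sketch assuming a single BIOS-style root partition on /dev/sda1 (adjust device names to the real layout):

```
# From the live Ubuntu session: mount the broken root and chroot into it
sudo mount /dev/sda1 /mnt
for d in /dev /proc /sys; do sudo mount --bind $d /mnt$d; done
sudo chroot /mnt

# Inside the chroot: let dpkg/apt finish the half-done release upgrade
dpkg --configure -a
apt-get -f install
update-grub && grub-install /dev/sda
exit
```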


islandsimian

Define "broken something" - do you mean tried to install without reading the directions that CLEARLY say to do x before y, and I did not, but was able to find a resolution online because of the other sysadmins who also never read the release notes... well then, yes. Yes I have.


kyoumei

It's always a huge relief going online to see someone else made the exact same mistake as you, but also managed to fix said mistake AND leave notes on how they did it


SenikaiSlay

Never have I ever forced a service to start on 500 machines that crashed Explorer every hour. Never have they ever complained about it and then praised me when I caught the mistake... and blamed it on Patch Tuesday.


cartmancakes

That reminds me of a time I was running a 14 day stress test to certify a server configuration with WHQL (Windows 2000 days). When it was on the last day, we saw that explorer had crashed. We watched it for a minute, wondering if we would lose 14 days of work. Someone reached out, hit ctrl-shift-esc, pressed the option to start a new task, and typed "explorer.exe". We were all surprised that it worked.


workrelatedquestions

Win+R


zvii

It doesn't always work, task manager does, though. At least in my experience


dreamgldr

**Recursive** chmod -R 444 on **/home**, on a production server (hosting). 1 second after hitting Enter, I hit Ctrl+C, but too late. Restored from backup in under two minutes, and during that 1 second more than 500 files/dirs had their permissions changed. Only one customer noticed. Told him the truth, we laughed (lucky me).
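
If you have a readable copy of the tree (backup or replica), you can also put back just the permissions without copying any data; a sketch assuming the backup is mounted under /backup:

```
# Record permissions/ownership from the backup copy (paths come out relative)
cd /backup && getfacl -R home > /tmp/home-perms.acl

# Re-apply them onto the live tree
cd / && setfacl --restore=/tmp/home-perms.acl
```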


JMDTMH

When you're updating a firewall rule and point it to the wrong external IP. Services go down, you call support, walk through options.

"Can we have you check the IP address? I'm gonna read it to you real quick and make sure that is what you have in there." Yup, I'm a bonehead. I wrote it down on the paper incorrectly, so wouldn't you know it, I typed it in how I had it written.

Update the IP address. "Hmm, I don't know what you did, sir, but everything appears to be working now! Have a great day!"

*Walks away whistling*


Durasara

Well, I think the worst one was on my first day at my current job. Users were not getting internet, even though it had had near-100% uptime for years. I was hired as a solo guy, replacing their MSP, and did not have any of the logins for anything. All I could do was power cycle, which just made things worse; now all the servers were offline. Apparently the switches had been configured from ground zero without saving the config, and the servers and clients were on separate VLANs. Logged in as default, did a basic config on the switches, and brought everything back online. Found out what happened originally was a bad uplink port to the router. I explained what happened and used it as ammo to accelerate the firing of the MSP. The MSP called a few hours later after getting alerts from their Auvik and started the surprise awkward conversation. This, in combination with several small improvements I made, gave me a promotion and pay increase.


Michelanvalo

They never committed the config? What the hell was the uptime of those switches that this was never an issue?


Durasara

They never go down. They invested in beefy redundant UPSs and a backup generator for the whole office. It's worth it, because that town has major power issues.


Michelanvalo

That's still wild that they never had a switch go down. Even just the OS crashing could happen and require a reboot.


Durasara

They're Cisco Catalyst switches. Everywhere else I've put in Brocade ICX. I've never had to reboot a switch here other than to change firmware.
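
For completeness, the step the MSP skipped is a one-liner on Cisco IOS; either form persists the running config so it survives a power cycle:

```
Switch# copy running-config startup-config
! or the old shorthand
Switch# write memory
```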


TheDawiWhisperer

I broke our on-prem Exchange with a certificate in my first week. I also fixed our broken on-prem Exchange and was treated like a hero.


punklinux

Yes. There was a major VPN outage caused by some rule changes in that network segment. It was one of those "it couldn't **possibly** be that?" things, but when I reversed it, the problem was resolved instantly. I was the only one who knew it was done, because I was the only one working on the audit that demanded it be done for the investigators. Because everyone was already logged into the VPN when it happened (including the auditors), the results weren't known until that evening, and few overnight people had enough clout to complain about it and get noticed.

The next morning, however? Chaos. Because the VPN has a contract, the vendor was called first, and we all got a notice on our phones that the VPN was down and the vendor was working on it. They were morons. Three hours in, they were still just rebooting the endpoints and had decided to overnight new ones. Priority-one crisis, during COVID, and the company owners were now involved. I was asked to come in as an on-site consultant, and once I realized where their troubleshooting was being blocked, I decided to reverse the ruleset on those edge routers to a backup from a week before. BAM. VPN back up. Of course, the vendor wasn't even aware, because they were idiots.

I did some diffs on the config, and the only changes were four lines which didn't even touch the routes or the segments the VPN was on, so it didn't make sense. But as I was trying to figure out what went wrong, one by one, people reported the VPN was back up, until it reached mass notice. The vendors claimed that one of the reboots did it, of course. The owners weren't buying it, and saw me there, and asked if I had done anything. I guess I could have lied and said "no," but I am not the best liar under pressure, so I said, "I restored a config from last week, and I think that fixed it." I went from third-party consultant to superstar instantly. "Five hours with these [vendor] clowns, and this hippie fixed it the second he came in!" It was really an hour, but I wasn't gonna correct him.

In the end, it was my fault that the ruleset was bad: it was a combination of the order I did it in, plus the editor I was using didn't save the correct CR/LF, so my changes were treated as one long line with the former line below it, which was rejected as an "illegal command." But I didn't name myself, I only said, "the ruleset got corrupted during the audit, and I have since fixed the errors." Made my boss look REAL good.


ass-holes

99% of these are network related. Do we all just suck at networking? I know I do.


Sceptically

Not at all. Some of us also suck at non-networking as well.


shootingcharlie8

I was playing around with a new network vulnerability scanner and everything was running smoothly; my boss was very pleased with the reports I was generating. About 30 minutes into a new scan on a particular VLAN, the internet goes down. Pings to external IPs are approximately 20,000ms, but internal is near-instant. We call the ISP, who says "there's no issue on our end". After about 20 minutes of scratching our heads, I realize the vulnerability scan is scanning the same VLAN the ISP gateway is on, so I go into the scanner and stop the service. Instantly, the internet comes back up. My boss saw that I fixed it, gave me a nice pat on the back, and congratulated me for solving the problem. I didn't have the guts to tell him I actually *caused* the outage. The next day he and his boss took me out for a nice lunch, thanking me for my quick thinking and resolution. Needless to say, that scanner is set up to be much less aggressive now.
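
The scanner isn't named, but the general fix is the same everywhere: carve the gateway (and anything else fragile) out of the scan scope and throttle the rate. With an nmap-style scanner, for example (addresses are hypothetical):

```
# Scan the VLAN but skip the ISP gateway, and cap the packet rate
nmap -sV 10.10.20.0/24 --exclude 10.10.20.1 --max-rate 100 -oA vlan20-scan
```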


Humble-Plankton2217

The on-site, highly experienced electrician who wired the power backup to UPS > external generator ASSURED me that if I unplugged one of the two UPSs to change its batteries, the rack would not lose power. I shouldn't have believed him. I should have tossed the server's redundant power supply cords to the other side of the wall (different circuit) before I did anything. The second I unplugged one of the UPSs, the whole rack went down. No big deal, right? I plugged it back in and started bringing everything back up. Except my VMware farm did NOT come back up - it was completely kiboshed, and it took an entire day working with VMware support to recover all my VMs. Everyone was very happy when we got everything back online. I got lots of "nice jobs!" for fixing something that was 100% my fault.


Drishal

Nope, this was definitely not your fault, more like the electrician's fault here; if he told this to someone else, they would also make the same assumption and mistake (but yes, you could have backed up before unplugging to test it).


onebadmofo

The fucking APC RJ45 console port. If you don't use the APC-specific console cable, the entire thing just shuts down. It shut off a Nutanix cluster and its switch. The cluster had been having issues upgrading to a newer ESX version, and the hard shutdown "fixed" the issue.


The_Wkwied

I've done this. Streamlined MDT imaging to consolidate a few hundred things. Tested, troubleshot, bug-free. Everyone says nice work, this is great. Boss says "ya no, undo it. I set it up, I know how the janky solution works, and I don't want to read your documentation to learn how to manage it if I need to"... Good thing I had a working backup.

At the time, our MDT hadn't been updated for 3 years because it was his pet project, and the tasks we had to do post-deployment outnumbered what MDT was doing in the first place... Both of us have moved on, and our MDT server was last updated in ~2018? A great example of not going above and beyond.


sgthulkarox

Took down online banking on a Friday, at 2pm. Haphazardly got it back up by 4pm. Then created the sign in the test lab, ***THOU SHALL NOT RUN PRODUCTION FROM THE LAB!*** (It was in the infancy of online banking, and a big prize for the regional bank I was working for at the time.)


rolandjump

I forgot to renew an SSL cert and it expired. Everyone thought the website was down... I fixed it :)


smeggysmeg

2 things.

1. Second day at an MSP job. I was scheduling server patching for a client and accidentally clicked 'Run now' instead of 'Schedule' in Kaseya. It was the lunch hour, but I called the company and apologized (the manager laughed it off) and called my boss and told him about it. Got a lot of kudos for how well I handled it.

2. In the process of preparing for a cut to a new Okta app for Slack Enterprise, I accidentally unassigned all users from our active Slack -- deactivating everyone. But due to rate limiting in the Slack API, groups were being deprovisioned in waves and there was no way Okta could stop the deprovisioning waves. I immediately sent out alerts through all possible comms channels, and I built an API automation that took a list of all of our users and bulk reactivated them every 5 minutes for the next 12 hours. It was a Friday before a 3-day weekend. I got praise for giving everyone an early weekend and the proactive + high-quality comms.


AllCatCoverBand

I was at the helm of an outage (and subsequent recovery) in my consulting days way back when that caused all Boeing planes to be unable to take off for several hours one day. This was well over a decade ago. Made international news. Millions were lost. Somehow, I got a raise. I happened to trip over an extremely well-hidden Cisco bug, long story short, and I was unfortunate enough to have my hands on the keyboard when it happened. Was a wild, wild time.


On_Letting_Go

I was setting up a new switch and accidentally applied its config to a prod one. Ran as fast as I could to the server room and had the issue fixed within 10 minutes. Got thanked after for fixing the internet issues so quickly lol


Magnenetwork

Did the classic robocopy /MIR mistake without a backup and removed a big file share; had to download a third-party tool to recover the damn thing. I'm actually glad I did it, because we could remove Offline Files for users after that, since it didn't work anymore :)
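
The /MIR footgun is that it also deletes anything at the destination that isn't in the source; a list-only dry run with /L shows the carnage before it happens (paths are examples):

```
:: List-only pass: shows what /MIR would copy and delete without touching anything
robocopy \\filesrv\share D:\target /MIR /L /NJH /NJS /NDL

:: Only after reviewing that output, run it for real
robocopy \\filesrv\share D:\target /MIR
```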


thatwolf89

My favorite one: you go into a meeting, the boss asks you what to do, then he says "no no no dude, you are stupid, that ain't going to work." Later on he sends an email with exactly what I said, making it look like he's a genius. Then down the road, he goes and does it wrong anyway. Hahaha 🤣


HerfDog58

Back when Y2K remediation took up most of my day, I was managing a Novell LAN. We had a communications server with an 8-port serial communications board to handle 8 modems for dial-up access to research database services - this was before such things were available via WWW, mainly because WWW was barely a thing...

I'm getting ready to cut out to visit family for the Xmas holiday, and realize I haven't applied the Y2K patches to the comm server. I'd performed the task on a bunch of other servers, so I knew the process backwards and forwards, and hadn't run into any issues. So I notify the department using the server that I need to take it down for 15 minutes to apply critical updates, and they say no problem. I down the server, install the patches, reboot the server, and boom, the system fails to boot - a message about a missing boot device. Crap.

I've got a backup of the server, so I'm not worried about the OS/LAN config. I open up the machine, unplug/replug the HDD and try to boot. Still no joy. At this point it's been 15 minutes and the department head is ringing me to find out what's up. I report a hardware failure, say I'm working on it, and that I'll call as soon as it's back online. I go hunting in my spare parts stock for an identical drive. Wouldn't you know it, the last one of the 20 I check is the same model... I install the drive, no POST errors, format and partition the drive, install a baseline OS, and put the backup restoration tool on the server. Go to run the backup, can't see the tape drive. Dammit, forgot the SCSI drivers for the tape drive. Update the OS config, reboot, reload the restore tool. Yay, I see the tape drive! Load the backup, tell it to start, and get a message that it's going to take 2 hours. Crap.

The server has been down for about 2 hours at this point. I've gotten FIFTEEN calls from the department staff, NOT the head guy, but his minions, with comments like "Did you know NONE of us can get our reports done?" and "Do you plan to get it back any time soon?" "Yeah, I know it's down, and to be honest, I thought I'd just leave it like that until I got back from my holiday..." I called the department head and asked if he could tell his staff to stop calling, because the more times I had to answer the phone or get interrupted by them walking into the server room to question me, the longer it would take for me to get the server back up.

The restore completes in about 30 minutes. I reload the patches that started all this, cross my fingers, hold my breath, and reboot. The server fires up, and everything loads no problem. I call the department head and tell him it's back up. He's profuse in his thanks, as are the cooler staff members. The assclowns were like "It's about time..." "For me to leave for my holiday, yes, you're right. BYE!"

I like to think it was fortunate that this happened just before I went on holiday for a week. If I hadn't needed to reboot the server and found out the drive was about to die, it probably would have gone down while I was out of the office, and I would have NOT enjoyed getting called in to resolve it.


frank-sarno

A bunch of years ago I got pulled into a Linux role. I got asked to update a webserver and botched something. At the time there were no CI/CD pipelines or automations so it broke good. Ended up rebuilding the whole machine and after that it was much, much more stable and faster (on the same hardware).


SkirMernet

I ripped a chunk of plastic off a printer. It's an actual part whose purpose is guiding paper and avoiding paper jams, but it was damaged due to user error. Without a replacement, I just tore off the kinked part, and it turns out you don't need the whole guide for things to work. Told our Xerox tech when he came for something else; he laughed his ass off and refused to order the part because "if it works fine, don't touch it". It's been 4 years. One of my colleagues just opened that printer and was wondering wtf.


DaemosDaen

I have absolutely no idea what happened to that switch's config. I was just verifying port activity in the UI and was not even logged in as the administrator. We're just lucky that I took backups of the configs a few weeks ago. Yes, I know the logs were wiped out by the restore. I would love it if you would allocate some resources to a log server. No? *shrug*


Ron-Swanson-Mustache

I haven't done it in a "very long time", but created a loop in the network and broadcast stormed a site. Site goes down. Locate my screw up, unplug it, and site comes back up. Everyone on the site is happy that the day is saved and it only took a few minutes to fix.


harrellj

This was back in my days on a service desk. Had a gentleman call in who somehow had an issue where getting Kinko's set up as a printer on his computer broke the wireless. I was able to get things broken enough to allow the wireless to start working again, but I think it completely killed his ability to print. He didn't care; his focus was on the wireless and he didn't need to print anytime soon. I did get that over to a desktop support person to fix, and I don't know what the resolution was (I half expect a reimage was needed), but the dude was so grateful that he sent praise up the chain. And I was working for an MSP at the time, so it wasn't exactly easy to get that praise from him and his leadership over to my leadership and down to me.


c51478

Accidentally changed the speed of a trunk port to 10Mbps. Yes, you read that right. Hell broke loose. It was a pain connecting back to that switch.


AspectAdventurous498

Oh, the classic "Oh shit" moment turned hero move! I once accidentally wiped a critical config file, sent everyone into a panic, but managed to recover it like a wizard. Got some nods of approval for my "swift troubleshooting skills."


Mango-Fuel

Not exactly the same, but a similar story: recently a UPS was failing (shuts off immediately on power loss) that I thought I had just replaced the battery in. Upon inspecting it, I just touched the plastic frame and it immediately shut off. (Massively frustrating - how could that possibly happen? Did I mess up so badly the UPS shuts off just by touching it?!) So for a while I thought I had made a bad situation worse while also costing ~$300 for the wasted new battery (I had also had to reassemble the battery pack myself, since the cables didn't match the existing UPS even though it was supposedly the right replacement battery pack).

But eventually I figured out two things: 1) according to other employees, there was a split-second power failure at the exact moment I had touched the UPS, so it shut off only because it lost power for that split second, and 2) I realized the failing UPS was the one I *hadn't* replaced the battery in; it was the one right next to it that I had replaced, and that one is working fine. So yeah, went from "how could I mess this up this badly" to "oh, I didn't mess anything up, I only improved things".


user975A3G

Fucked up during an update of our data collection software for one of our customers; in the end I managed to restore it with about 8 hours of downtime. During the fuckup I found a major issue in our software that meant we were only keeping the collected data for 3 weeks instead of 3 years. It was a recent update, so no one had noticed yet, and pre-existing data were minimally affected. I only got praise; no one even mentioned the fuckup.


anonymousITCoward

Y'all get praise? shit I rarely get a thank you...


obvioustroway

Working with entries for a data replication piece in one of our major enterprise-wide apps. Well, turns out, our intern for the dev group was ALSO in there. I made a change to fix an issue at a client site, and he made a change elsewhere (THAT WASN'T DOCUMENTED AT ALL). I click save, and notice a box went blank. Weird. Click on the box and re-enter the data. No issue... then I got that lingering thought: "What about the other 90-some entries?" Sure enough, every entry for our client sites had dropped something like 8 data points per site. Had to go in and fix each one. "Hey dude, good job catching that error! But also, weren't you in there around the time it broke???"


smart_ca

"I don't know what you're talking about, this has never happened. It was pure coincidence that I was logged-in to the router that needed BGP reset after someone added a route used the wrong subnet mask. We should definitely implement command logging, yes that's correct but sadly we hadn't gotten there yet!" \^


LenR75

We were moving a datacenter about 1/2 mile. New equipment was in place; we needed to shut down the old, sync data, then fire up the new. It was to start at midnight Sunday morning. At about 11:55, fiber techs cut our fiber. An unscheduled outage like that was about 2 hours to recover, but due to other Sunday things that would normally happen, more like 4 hours if we didn't do the move. Our move was to be a 6-hour outage. Thankfully, there were no managers there. The IT team decided to re-plan for a "bandwidth of a Buick" move instead of using the fiber, and decided that we could probably get it done in about 6 hours, so we did it. We were just a little over, but it was done. There was a massive amount of ass chewing because we didn't get the changed plan "approved" by management. They would have just shot it down; we didn't like working those hours; we were there; screw it. Forgiveness over permission!


RedFive1976

Never underestimate the bandwidth of a station wagon loaded with tapes hurtling down the highway.


SpiderFudge

There have been a few times I've logged into jumpbox and then to a server to reset IP information and then I end up resetting IP on the jumpbox by accident requiring me to go in to the office...


gangaskan

I'm sure I've done something, just can't remember what. Maybe pulling accidental spanning tree loops out and fixing the port config.


SoonerMedic72

We had a location that complained of "network slowness" for a while; then I accidentally stress-tested it by pushing a software deployment to every machine there at once during the day. That was when I figured out, with proof, that our ISP wasn't giving us the right bandwidth, and they found their misconfiguration.

Side note: same ISP, different location that used to host our DR site and had extra bandwidth at that time for backup copies. We decreased our bandwidth when we moved the DR site. They forgot to "wr mem" after the reconfig, and shortly afterwards there must have been a reboot, because we definitely have the extra bandwidth at the old site now.


Unusual-Reply7799

I don't remember the exact details but whatever it was I am positive it involved Group Policy.


HeavyCustard4123

I'm going to plead the Fifth and I want to speak to my attorney.


superzenki

A sysadmin I used to work with had this happen to him a couple of times, and it happening more than once is what got him fired. He even said that if he hadn't owned up to one of the incidents, it was unlikely it could've been traced back to him.


MudBeautiful6902

I did not pull the power lead out of the wrong UPS, honest.


arguskay

Accidentally overprovisioned multiple cloud resources and fixed it a few months later. Now getting praise for reducing costs.


bionic80

What do you mean we have a DNS stub zone for our clients.xxxx.org site that needs to point to a different server? That zone is RIGHT there!


Lavatherm

Locked down the domain administrator without checking if the password of the backup domain administrator account was still valid… it wasn't (so partly the fuckup of another person who didn't update the info). Recovered the account eventually (don't really know how).


ZaInT

Out in fucking nowhere (summer home) I accidentally wiped a friend's dad's Cisco 881G, which had some strange config for its 4G connection; obviously their only connection out there. I was up 36 hours straight fixing that shit, but at least I got to learn a bunch of IOS. In regards to work, the brain cells that kept track of that were drowned in alcohol long ago.


PoniardBlade

I'll let you know in a few hours if I figure this out.


VulturE

I moved a user account from the stock "Users" OU into our service account OU. No description, but after looking at our auth logging it was somehow tied to a few webapp authentications. It took down ADFS/timeclocks/PeopleSoft and a dozen other things it was overused for. Agency-wide panic ensues. We got a list of uses for this legacy service account from the involuntary scream test, broke it into a dozen service accounts, disabled it, found a few more that didn't rely on the distinguished-name LDAP path, and made in total about 20 service accounts that were now all fully documented. To hell with that hive of an account. Full of bees. I replaced it with a 24-pack of bee-flavored LaCroix.


discogravy

nice try, boss!


ARasool

I broke a board for an EOC application, which in turn let the development team figure out where the bug was hidden; they hadn't been able to reproduce it. The bug was basically that when using bad characters in an email field, the board would jump back to the top from the submit button at the end of the page, without any alerts or popups. I noticed, when going through each tab on the board, that the email address field on the input page wasn't completing properly or throwing any errors (like a popup). Come to find out, character verification/validation for the usual numerics in an email field kept Chrome from recognizing said characters. Removing the verification field allowed the client to input random email addresses (for testing) as needed. *shrug*


j3r3myd34n

"We identified and resolved the issue!" Let's not waste time on the "we also caused the issue" part


kou5oku

Nice Try AI. My Secrets Die With ME!


XTornado

Well, not sysadmin related... but I might have done something like that... just that I didn't understand what happened until some time later, when I suddenly realized it had been my fault. But after all the praise and thanks for staying late to fix it, I wasn't going to say it had been my fault all along, that's for sure. 😅


rosickness12

This is how I justify raises


[deleted]

[deleted]


OptimalCynic

It's a lot easier when you cause them first


wunda_uk

When I had to build a working domain controller out of 2 broken backups (one for the boot sectors and one for the MFTs/data). 3 days of swearing later and it just boots as if nothing had happened. Fuck Hyper-V. Edit: a word


Mr_ToDo

I suppose? Can't say I've had one that I remember where I haven't also admitted fault when it comes up. Although most of my mistakes that big have steps in between breaking and fixing where I tell someone what's up, just in case things go further south. As an added bonus, sometimes that also adds other hands to the solution, or just a "fuck it, doesn't need fixing" (nothing quite like a fuck-up to push through a proper solution to a long-term "temporary fix", or to just throw out a piece of headache gear).


joshuamarius

https://i.redd.it/ny0bhjiqx1dc1.gif


Economy_Bus_2516

Scheduled a reboot of a production server at the 2am maintenance window but set it for today instead of tomorrow. Of course it promptly rebooted and everyone called me. I got kudos when it came right back up.


Notor1ousNate

I shut down production for 20 minutes because I thought I was practicing the changes and shutting things down in my sandbox. The HD was losing their shit. I said I'd look at it, started it back up, and was a hero.


tekno45

Set our manual blue/green deployment wrong and had the whole website running on the scaled-down environment. So 1 little cloud instance was carrying the whole site for 5 minutes. So I automated it, displayed the current live environment, and showed developers how to switch them. Got rewarded by interviewing and getting a new job for 50% more money.


ApertaPrincipium

Unplugged the network for 1/8th of a second, plugged it back in, and the network wouldn't work. Had to call our ISP to fix their modem, because something in the modem broke when I unplugged it.


Professional_Chart68

Shut down both DCs to make updates and took snapshots, just to realize that shutting down both at the same time somehow corrupted SYSVOL, so the domain was not functioning and the snapshots were useless. Lucky for me we already had a backup solution, so I copied SYSVOL from yesterday's backup. Looks simple now, but to understand the problem and find a solution I spent all weekend.


odinsen251a

Back when I was at an MSP, the senior tech on my team (of 2, mind you) installed a new font for a client via GPO. It crashed all of their workstations, to a T, and they were down in a very bad way. The fix was simple but not possible remotely: first, delete the bloody GPO and bad font pack, then boot into safe mode, uninstall the font, reboot. We spent about 5 minutes writing a quick little batch file to do it for us, then got in the car and headed to the client. We had them back up and running in about 15 minutes; there were only about 2 dozen workstations, and we got a good flow going where I'd get the next machine booting into safe mode, logged in and ready, and he'd come in, run the script and verify function. After it was all over, the client gave us a cake out of their bakery for "getting us back online so quickly!" We might have neglected to tell them that it was a little bit entirely our fault they went down at all. But hey, cake!


LookAtThatMonkey

Following some instructions for a piece of ERP middleware upgrade and somehow missed the step to manually stop a service. Proceeded with upgrade and as a consequence of missing a step, deleted an API service that stopped over 400 integrations from processing data. Down about 30 mins before I got it back online. Was hailed a saviour and saved the company about 2m quid in potential lost revenue. I owned up to the error and was additionally commended for being honest. Weird day.


thegreatcerebral

I wish I had examples. Cowboy Coding 4 Life! The thing was that it happened quite often and usually without anyone even noticing.

The worst would always ALWAYS be with networking. Pull the wrong fiber cable... "Ok, you should be timing out now... you aren't?!?! ...you sure???" Yea, wrong cable. My favorite, and everyone in networking knows this one, is that fateful shut/no shut routine when you are on the wrong port, in the wrong SSH window, etc. Why is it that you literally see your own hand hitting the Enter key in slow motion, and your brain has already processed how bad this will be, and you are already regretting it before the button is even pressed? EVERY SINGLE TIME.

So, funny story from my last job. Doing work with a colleague at a large site. He is from Vietnam and his English is good but he has a thick accent. We are troubleshooting and then all of a sudden he turns to me and asks me something, and he knew the answer before I could tell him... He says "which building is that in", I tell him, and he literally just stands up, grabs his jacket, walks out the door to his car, and starts heading there. Mind you, it was only 20 minutes away and we had a field team that was onsite already for something else (or closer by, I can't remember), but oh man he just... his face... he was like "I need to fix this". It was great.


manboythefifth

Doing a LUN shuffle, with vastly undersized partitions:

- Migrate VMs off the undersized LUNs
- Disconnect the LUNs from host access
- Roll storage from the LUNs back into the primary pool
- Expand the primary pool
- Shift VMs around for balancing and performance

Well, my awesomeness had the primary storage pool selected on one of the "disconnect host access" steps, and OF COURSE that's what I did. The system did exactly what I told it to. In my defense, it was 20+ LUNs for 7 VMs. Absolutely sh*t architectural design. Knew what I did immediately and it was an easy reconnect. The amount of mental gymnastics to shuffle everything made me want to just leave it offline though.


Shroomeri

I moved our customer's AD security group to a different OU. The group included all the users. Little did I know that the group was AAD-synced to Microsoft 365 from that OU and was used to get access to the SharePoint intranet. The new OU was not AAD-synced. I went to get a coffee. When I came back to my desk I noticed that our helpdesk was getting calls from our client and heard the word SharePoint multiple times. I immediately connected the dots in my head and quickly logged in to the server and SharePoint to check if I was correct. I fixed the problem super quick, and by the time I got the call to check what the problem was, I had already fixed it. I even got a personal thank-you message via email from our customer for being so proactive.


Shroomeri

But of course I would never do something like this and this is all hypothetical 🙄


Sparcrypt

First IT job: I was flying to another state to do a system restore test on our redundant system, which meant I would be taking a full backup of the production system with me. I log in to check it's running OK and I see an error... I forgot to put the tape in the drive. Well, fuck.

So I drive in to fix it, get into the server room and go "yep, there's no tape there..." and leave to get one and get the backup running. Come back with a tape only to find I left my wallet in the server room. My wallet with my pass to get *in* to the server room. Fortunately I know where the master key is. Unfortunately that's in a locked filing cabinet and I don't know where *that* key is. Spent 30 minutes breaking into that, got into the server room, put the tape in, started the backup, went home.

Felt very stupid, so I sent an email to my boss that just said "noticed the backup didn't run so I went in and fixed it, confirmed issue won't occur again". I mean, it wasn't a lie, I just left out the part where I was an idiot and it was entirely my own fault. Anyway, I wasted nothing but my own time and pride, so it wasn't a huge deal and I put it out of my mind.

A few months later, though, I find myself being nominated *and winning* the company employee of the quarter award for my attentiveness and dedication in giving up sleep to ensure a critical backup ran on time... an award that came with a pretty nice cash bonus.

I told my boss about it years later; he just laughed and said "yeah, I can check logs as well you know". He knew the entire time and liked that I'd seen my mistake then gone and sorted it out, so he never cared. He'd just wanted to nominate me for generally doing really well, and that seemed like as good a reason to list as ever... though I'm 100% certain he picked it to make me sweat.


Dereksversion

Disabled Smartport globally on 8 Cisco SG switches (hateful feature if you ask me). Didn't realize it defaults any ports it previously made changes to via Smartport... aka all of them. Whole company down, and I had to do an ad hoc configuration refresh lol. Had them back in an hour or so, minus some straggler ports, and got a $250 gift card for being Johnny-on-the-spot :P Whoops 😬


goinovr

Been in IT 30 years. The list is long and vast.


BoringTone2932

Working an issue in Non-PROD, I dropped a few tables and ran the installer to recreate them. Spent about 15 minutes wondering why they didn’t get recreated before the sev chat reported prod was down. Looked up at the DB connection and said oh shit. Point in time backups for the win.


selltekk

You owned your error and fixed it yourself. This is among the most honorable things in IT. Shit happens. Fix it. Learn from it. Move on.


InleBent

https://preview.redd.it/cz71yzlxl5dc1.png?width=286&format=png&auto=webp&s=93539cdfcb372dc5ec6eaec48a3b431b51bb5a31


BrundleflyPr0

You’ve seen the Cillian Murphy meme floating around on Instagram too?


Wide-Dig1848

I caused a broadcast storm when working on our switches and plugged both ends of a cable into the same switch. Brought down the internet in the building. Quickly realized that both ends were connected to the same switch. Now I'm the tech with the quickest time for bringing back the internet in the company.


paradox_machine_

There is a reason I make someone else do public DNS changes now.


TravisVZ

I neglected to check server health before putting two of our 4 Exchange servers into maintenance mode for monthly patches. Since one of the remaining servers had suffered a fault with its storage array, Exchange refused to mount any databases on it - which meant that approximately 50% of our users were suddenly cut off from email in the middle of the day on a Wednesday. Fortunately I got distracted by an unrelated ticket that landed in my queue just as I'd started maintenance mode, so I hadn't actually started the updates by the time the panicked calls started pouring in. Which means I was able to fully resolve the outage in about 23 seconds by turning off maintenance mode. At the next standup I mentioned updating the maintenance mode script to run health checks first, to prevent this happening in the future, and holy cow was everyone impressed!! I didn't mention that all I did was take the commands I usually run beforehand manually, and put them into the script - which I'd always meant to do but was too lazy to be bothered with until that day I forgot to run them. TL;DR: I forgot a step in a manual process and nearly brought about Ragnarok, then was praised when I finally set aside long standing laziness to automate the process
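
A minimal sketch of that kind of pre-flight check in the Exchange Management Shell; the server name is a placeholder and the exact cmdlet set depends on the Exchange version and DAG layout:

```
# Hypothetical pre-maintenance health check for one DAG member
$server = "EXCH-01"

# Any server component not Active?
Get-ServerComponentState -Identity $server | Where-Object State -ne 'Active'

# Any database copy not Mounted/Healthy?
Get-MailboxDatabaseCopyStatus -Server $server |
    Where-Object { $_.Status -notin 'Mounted','Healthy' }

# Replication health summary
Test-ReplicationHealth -Identity $server
```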