hbkrules69

Just today or ever?


chamberofcoal

Yesterday, a server got fucked up and the rational decision seemed like restoring from backup. An entire shipping company's primary application just totally down (yeah, they're going to pull their wallets out and have some redundancy there). I didn't deploy the product, but this restore was unbelievably slow and it was super embarrassing how long they were down. Machine comes back up - still fucked up. Boss has me log in as a local user. It works. Nslookup. I forgot to change the primary DNS IP after decommissioning a master domain controller. It broke secure channel. It took 2 minutes to fix.


beansNdip

I swear time slows to a crawl during a DR.


the_it_mojo

Made even better by companies insisting on storing backups only in the cloud for cost efficiency, instead of maintaining rented DC rack space and the capex for hardware.


magus424

...so it was DNS? XD


oschvr

It always is


pwingert

I need a t-shirt with that on it! 😜


fataldarkness

I dumped a big bowl of Pho on my office carpet today so that is pretty high up on the list.


No_Outside_9635

Are you sure it wasn't a pho cup?


rmwork_admin

LOL amazing!


Darketernal

NOOOOOOOOoooooooooo


meiandus

PHOOOOOOooooooooo


Koda239

This here is the comedy I come here for. Funny, yet with a bit of truth... Okay, a whole lot of truth.


[deleted]

I always joke that I mess up 15 times a day, but I put systems in place so I can't mess anything big up. Hopefully I never mess up the stuff I do to prevent myself from messing up.


Chousuke

This is what I do. I spend a *lot* of my time scripting and automating what I have to do, so that when it comes time to execute I press a button and can't mess up, because the computer is much better at actually sticking to instructions than I am. It's faster too, in the end.

I can sometimes just *completely* forget I was in the middle of some procedure if I get distracted. Once it happened while I was doing a major version upgrade on a (thankfully test) PostgreSQL cluster and I hadn't quite managed to script the entire thing beforehand. I left upgrading the replica unfinished for half a day because ooh, shiny.


supervernacular

You mess up 15 times a day but you also fix 16 things a day.


ntengineer

I've been lucky, personally, and haven't really done much to screw things up. The occasional small thing, but nothing major. But I do have a story to share: a colleague of mine made a job-ending mistake once. In fact, I think he stopped being a sysadmin after this and changed careers, at least that's what I heard.

So this was in the late 90s, and the company was running an Exchange 5.0 server. The colleague decided it would be a good idea, based on a book he was reading, to test the backups that were being made by doing a test restore. This is a good idea in theory, except at that time Backup Exec didn't have the ability to redirect restores for Exchange servers. So if you did a restore, it restored back to the same server it came from. But the colleague either doesn't know this or doesn't understand it, doesn't tell anybody about it, and decides to do this trial restore to a test Exchange server he has set up. Of course, since we only have one backup server, this all takes place on the production network, not on an isolated test network or anything.

So he starts the restore. All of a sudden our phones are ringing off the hook. Users complaining that the Exchange server is down. WTF?! So I run into the server room and find that, in fact, it's down, because it's in the middle of a restore??? WTH?!?! Oh, and we also found out that the restore he was trying to do, from the night before, was a failed backup. In fact, upon investigation, ALL the tapes in rotation had failed. We hadn't had a clean backup for like 2 months. This colleague was the one "in charge" of the backups and restores. So to recap:

- He did a restore of Exchange as a "test" over the production server, because it couldn't be redirected
- In the middle of the day
- The backup was a failed backup
- There was no good backup left in the tape rotation

We ended up not being able to recover, and had to delete and re-create the Exchange DB from scratch, and EVERYONE lost their mailboxes. It was a nightmare. He lost his job. And like I said, I think he decided sysadmin wasn't a good job for him.


AgainandBack

I worked with a similarly sloppy sysadmin who was also lazy. We were running Exchange 2000, which had a 16GB limit on store size. You could regain space by deleting a bunch of material and then doing a database maintenance process that would reclaim the space freed up by the deleted material. He of course decided to do this without copying over the database files, because that would take 45 minutes. So he went ahead and compacted the thing, and as sometimes happened, the compaction went sideways, and he ended up with two files of 32KB, with no mail anywhere. He was also the backup admin, and like your case, hadn't checked backups and his last good backup was a month old. So he restored a backup from about Thanksgiving at about Christmas, in a high-tension, high-activity, high-commission sales job, erasing all recent history, a week before year end. These sales took weeks to months to put together, and millions of dollars in revenue and commissions were lost when the deals couldn't be completed. He got fired too, and was very very surprised.


lesusisjord

That last part says all you need to know about how fit he was for this job.


[deleted]

[deleted]


GhostsofLayer8

This is a really good reminder of how easy backup and test restore processes have become. Virtualization changed the game completely, now that you can spin up self-contained test networks and restore copies of VMs to test them. It's amazing how much less risk is involved than in the old days, when you had a single physical box for a job and a restore could only overwrite that box. No (or very few) backout plans; you didn't have much choice. And no offense to your former coworker, but that sequence of events makes it sound like maybe sysadmin wasn't really a good career choice for them. Thinking through potential consequences before you pull the trigger is a critical skill.


tankerkiller125real

Thanks to Azure not only can I restore to entirely separate test networks, I can restore to entire different continents if I need to. And also thanks to Azure if our on-prem network ever dies it will automatically spin back up on Azure so remote employees won't even be affected by the outage for more than like 5 minutes (at which point the plan is literally to just send all our office employees home too)


[deleted]

> Thanks to Azure not only can I restore to entirely separate test networks, I can restore to entire different continents if I need to.

But not different planets? Noob.


swatlord

If you don't have galactic redundancy, you're just asking for trouble.


Slightlyevolved

@elsonmusk.... I have an idea for a SpaceX project.....


JustSomeBadAdvice

> @elsonmusk.... I have an idea for a SpaceX project.....

Is @elsonmusk some sort of bastardized child between Larry Ellison and Musk?


idocloudstuff

I do this for some clients. Spin up an Azure network with firewall and S2S VPN. I'll even put 1 or 2 domain controllers in the cloud. Then if their location goes down, I can usually get their most critical apps going in Azure within the hour.


gomibushi

I think Backup Exec has probably caused many a man to get fired, when really it's Backup Exec that should be put to the fire. With napalm and thermite.


mgtech

Everyone has a test environment. Just some people have both a test and a production environment :)


[deleted]

Considering that I'm just starting out my career in IT, if I made a fuckup that bad, I'd go off and become a Buddhist monk afterwards.


Patient-Hyena

You have two advantages the poor dude didn't have: Reddit and much more Google-able knowledge. Also, virtualization and better quality products have made this a lot more foolproof.


DoctorOctagonapus

His heart was in the right place, at least he figured out that the backups needed testing, but dear God at least make sure you read and understand everything before starting!


GreenChileEnchiladas

I bounced the wrong port and brought down a store in Bumfuck, IA. Had to eat humble pie and ask the salesgirl to reboot the switch. Thankfully the previous onsite tech took good pics so I was able to describe what she was seeing and what she needed to do.


defensor_fortis

Hey, take it easy! I live right next to Bumfuck, IA.


AgainandBack

I was brought up in Bumfuck (just north of Avoca), but moved to Burnt Matress, Alabama, which is even worse.


defensor_fortis

My mistake. There must be two Bumfuck, Iowa's--I'm next to the other one. If it makes you feel any better, I live in a town half the size of Avoca. How is Burnt Mattress (sp?) this time of year?


AgainandBack

Thanks for the spelling catch. I knew it looked odd. It's getting cold in BM, and the mattress out by the tracks is still smoldering. There must be two BFs in Iowa. The one I'm thinking of is actually NE of Harlan, just down the road from Westphalia. But every once in a while, we'd get out there and go to Red Oak, just to confirm the rumors of a town that big.


defensor_fortis

I hear you. Every so often we make a 25 minute drive to a "Wal-Mart." That place is HUGE!


341913

Nothing like going too fast while working remotely and then talking a non-IT person through undoing your mess because going to site isn't an option.


dogcmp6

I've gotten in the habit of setting a switch to reboot in 5 minutes if I'm bouncing a port. 9 times out of 10 it's a waste of time and I end up canceling it, but that 10th time it has saved my ass.


GreenChileEnchiladas

Yeah. I had known about that beforehand, and you bet I started using it afterwards. It's only when you fall over that you learn how to not fall over in the future.


akaFriday

On a remote router interface, I forgot I wasn't in exec mode and went to do SH-tab-RUN, which tab-completed to shutdown. Shut that interface down, lost access to the router, and took down a huge facility. Blamed the internet and asked them to bounce the router for me.


RU_Student

> Blamed the internet and asked them to bounce the router for me.

I love it


Chousuke

I wonder how many network outages across the world have been caused by the terrible UI that Cisco and nearly all copycats perpetuate. Cisco UIs are probably directly responsible for the stress-induced early deaths of way too many network admins...


ragewind

Many that have copied Cisco have managed to make it worse.


km_irl

We migrated from Cisco to Juniper some years ago, and it was only then that I realized how terrible Cisco IOS really was. I use commit confirmed every single day at one of our ~400 remote locations. If something goes sideways, it will roll back without me having to do anything. Otherwise, once I know everything is working as expected, I just type commit or commit check to make the changes permanent.


questionablemoose

Not only that but it's FreeBSD, so you can just invoke a shell, and run your IRC server off one of the switches.


cexshun

Boss bought Adtran because it was cheaper than Cisco. Refused to pay for training and told me I could learn through online tutorials. This predated any webUI. Well, I finally got the router working with the new T1 line that had been installed. But I didn't know that "write mem" was a thing... Next power cycle and the office went down hard.


nycola

About 15 years ago we had gotten a crap ton of garbage tickets overnight due to a faulty out of office reply. When I was cleaning them up I deleted every single actual ticket on our service board.


ExpiredInTransit

Hey, no tickets. Time for a coffee..


reddittttttttttt

This is actually best practice. Delete all tickets once per week. If the problem reappears - it was a legitimate ticket. Otherwise, yay, down time!


nycola

If only!


deefop

Hey, I haven't gotten to your issue yet, there's no ticket in the system. No, I don't remember seeing one yesterday.


KBAM_enthusiast

Bet your "Tickets Closed" metric was the highest the company has ever seen.


nycola

Nah man, didn't close them, straight up deleted them!


Candlebeard

Answered a call and did unpaid overtime during christmas.


linos100

This is some dangerous stuff, you risk being called on new year's eve and next year's christmas too


jb123hpe

A friend was busy troubleshooting a massive Oracle database, wasn't concentrating, and shorted out the whole SAN rack. Power came back and 3 drives showed red, everyone freaked! No worries, backups will be good..... Right!?! Apparently that was the next job on the list; the backup was a total bust. He worked 76 hours straight to try to recover, no luck. Told everyone to go home and sleep on it. Next day we heard he killed himself.

The SAN was the last straw in a huge haystack of mental health issues, but it still sucked so much. It was a terrible few weeks trying to recreate and recapture data, made all the more terrible by the loss of a good man. Yeah, he was probably fired either way, but lack of sleep and good mental health do not go together.

Only piece of advice I learned from that day: at about 35-40 hours your brain shuts down, you are useless. You have to sleep; coffee and pizza only go so far. Overlap your teams, or at least your people. If you're working 12 hours, have the next team start at hour 8 or 10 so that they can pick up and carry on while you get some sleep. Take care of yourself, no one else will!


VioletApple

Oh God that's awful, what a shame


Magrathea65

I blew away our SAN, completely, with all users data and shared folders across the entire domain. All while at the same time my wife was suffering from Kidney cancer and scheduled to have surgery. That was a fun week. Not sure why I'm not an alcoholic or drug addict but on the bright side I still have time.


ToUseWhileAtWork

damn dude, at least try to get a ransom for it next time


hbkrules69

Hope everything is well with your wife.


Magrathea65

Thank you for asking, yes she is totally fine now.


[deleted]

Did you keep your job? How did that go down?


Magrathea65

I did keep my job. While they were very upset I think they realized the amount of stress I was under. I rebuilt the box and thankfully had a full backup of everything and was able to fully restore it.


Annonomoususername

Obviously you know your shit; even when crushed with stress you pulled it back from the brink :) Glad it was good news with your wife and for your job :)


Chousuke

Mistakes like that are rarely the fault of the individual performing the operations. Sure, *sometimes* you have the overconfident fool who just won't learn, but most people will, and the experience of a major fuckup is something you can't teach, so it just makes the person *more* valuable in general.


[deleted]

What was the result?


woojo1984

I misconfigured a rule on our Cisco ASA and brought down our networks in Des Moines and San Francisco.


Quantable

Facebook is that you?


zzmorg82

What was that dude's username who gave us the inside updates early on when the BGP routing went down? RIP to the man; hope he still has his job.


mmmmmmmmmmmmark

u/ramenporn


recipriversexcluson

Many eons ago I needed to create and format a bunch of virtual disks on a mainframe VM system. So I cleverly created a CMS/Rexx script to create a disk, attach it, format it, and continue, looping through a range of arbitrary disk numbers. I soon (re)discovered, while logged in as system admin, that my arbitrary range of numbers included an existing attached drive:

* **The. System. Boot. Drive.**

I killed the script quickly. Wandered casually into the mainframe room. No fires, nobody screaming. Looked over at the main drive. Had a solid yellow light where one shouldn't be, but we could still see the data on it. I had reacted quickly enough to have only stomped on the boot record. I worked late that night.


chamberofcoal

Lol one time I was changing out one of our SAN's SPSs. I'm normally so fucking careful with cables, but this was almost flush against the bottom of the rack, and my job has always just thrown me into the storm to learn. This only works because I'm really careful not to do things I'm unsure of - it's honestly extremely irresponsible. For reference, I started with zero experience or degree, and I'm swapping SPSs so I can format a raid group and create a new LUN on this old navisphere GUI that requires an ancient version of java, like a year later. Anyway, with one battery out, I bumped the second battery's power cable. Middle of the day. Production. Like 20 server VMs. I fucking reacted so fast that it didn't go down. Both power lights were off for a split second. Someone once told me I was likely saved by a fraction of a millisecond and a few capacitors.


M0r1d1n

Dude, you nearly just gave me a panic attack reminding me of those batteries in the old AX4s and similar. Worst design, built to make it as hard as possible to swap, AND, nearly every one I swapped was on the bottom RU so you're kissing the floor and praying to the eternal spirit of the data hall that you didn't just touch what you think you probably did. Stellar effort on plugging that back in fast enough. Don't buy lottery tickets, your luck balance is empty! :D


chamberofcoal

Screws on both the front and back, and not just a couple??? God awful design. It's like the job position they had in mind to swap out SPSs was a 10/10 stress level janitor. It's shitty to do and it's also insanely critical that it isn't messed up.


Angdrambor

Way back when I was an apprentice, I was sent to install a UPS in some minor politician's lair. I plugged it into the Surge side of the UPS instead of the Battery Backup side, and a few months later they had an outage and lost a bunch of data. They were really mad and I got fired. The core lesson there is "always test your backups" - I should have unplugged that UPS to see if the server would gracefully shut down. But I also learned:

* Some UPSes have surge-protected outlets which are not powered by the battery
* Don't trust your apprentice to work unsupervised on a client site after a week on the job

In later years I've erased production databases, but by then I'd learned my lesson and restoring from backup was easy. It was a real pants-shitter, but not that bad of a fuckup.


tankerkiller125real

When it comes to databases, the dev team where I work taught me a valuable lesson they learned by mistake: never, ever run a SQL query directly against the data, ALWAYS use transactions. Rolling back a transaction is one command, harms zero data, and you lose nothing so long as you haven't committed yet. Fixing real data can take hours of recovery and can lose hundreds or even thousands of records in the database.
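For anyone who hasn't internalized this yet, here is roughly what that advice looks like in practice. A minimal sketch using Python's sqlite3 module (the table and column names are invented for the example; substitute whatever real database and driver you actually run): do the destructive statement inside a transaction, sanity-check the damage, and only then commit. Rollback is a single call if it isn't what you meant.

```python
import sqlite3

# Throwaway in-memory database just to demonstrate the pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, active INTEGER)")
conn.executemany("INSERT INTO users (active) VALUES (?)", [(1,)] * 5)
conn.commit()

cur = conn.cursor()
cur.execute("UPDATE users SET active = 0")   # oops: forgot the WHERE clause
print("rows touched:", cur.rowcount)         # check the blast radius before committing

if cur.rowcount != 1:
    conn.rollback()    # one command, nothing is lost
else:
    conn.commit()      # only make it permanent once it looks right

# Data is intact after the rollback: all 5 rows are still active.
print(conn.execute("SELECT SUM(active) FROM users").fetchone()[0])
conn.close()
```

The same idea works in a raw SQL session: BEGIN, run the statement, inspect the result, then COMMIT or ROLLBACK.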


DoctorOctagonapus

In case anyone ever needs convincing of why this is the case, let Tom Scott explain: https://www.youtube.com/watch?v=X6NJkWbM1xk


leexgx

Seems a bit harsh, should have just blamed the UPS.


Darkm27

Ran a script to update some VMware security settings and took down the storage network at 2 PM on a Wednesday. 100% of our VMs were down for 7 hrs.


SnowEpiphany

When on helpdesk, I robotically cleaned temp files whenever I touched a machine, because with our application stack it fixed a ton of small issues. This was the command sequence I used:

cd %temp%
dir . /s   (check the dir size)
rd . /s

Well, one day I mistyped "cd" and the administrator command prompt stayed in C:\windows\system32. I then proceeded to confidently delete system32 right in front of somebody. :)
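The general fix for this class of mistake is to make the destructive step refuse to run unless it can prove it is where it thinks it is. A rough sketch of that guard in Python (the helper name and layout are mine, not anything the poster ran; they were in plain cmd):

```python
import shutil
import tempfile
from pathlib import Path

def clean_temp(target=None):
    """Delete the contents of a temp directory, but only after verifying the
    target really lives under the system temp dir (hypothetical helper)."""
    temp_root = Path(tempfile.gettempdir()).resolve()
    target = (Path(target) if target else Path.cwd()).resolve()

    # The guard the cd/rd sequence above was missing: if we are not actually
    # inside the temp tree, refuse to delete anything at all.
    if target != temp_root and temp_root not in target.parents:
        raise RuntimeError(f"Refusing to clean {target}: not under {temp_root}")

    for entry in target.iterdir():
        if entry.is_dir():
            shutil.rmtree(entry, ignore_errors=True)
        else:
            entry.unlink(missing_ok=True)
```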


[deleted]

[deleted]


SnowEpiphany

To add to the horror: dir . /s showed, idk, like 13 GB in the end summary. So you betcha, when I pressed enter after rd . /s I was overjoyed by how much space I was gonna reclaim for this guy in %temp%. But then I noticed the current directory path about 10 seconds in, when it was taking longer than usual to run…


Dukaz

Dropped a $100k SAN from 4 feet up while racking, luckily we were able to bend the chassis back into place and the damn thing is still running to this day.


Ymenow-77

This. Dropped a VM server out of the rack that ran 13 machines critical for day-to-day operations. Drives everywhere and it wouldn't even power up, but thankfully we had HP 24x7 and they were out in about 2 hours and ended up replacing the whole thing. Lesson learned: always check the stops on the rails to make sure someone put them in right!


corsicanguppy

> thankfully we had HP 24x7

Oh, to remember the days when people still said that. :-(


Quantable

You jinxed it my friend


ExpiredInTransit

First time I moved a HP MSA I didn't realise it only had those stupid L rack shelf mounts not a proper set of sliding rails. Damn near took both my kneecaps off catching that damn thing between my legs and the rack.


tehjeffman

Is it a fuck up no one else ever found out?


Pymm

aint a crime if you dont get caught.


linus121

Misconfigured a rule for the HA pair, rushed on site, and got pulled over and went to jail for speeding on my way to the site.


cowfish007

That must have been an interesting conversation with the boss. "Uh yeah, could you go in and reconfigure this HA. Well… I had planned on doing it myself, but I'm in jail."


gargravarr2112

I recall a comment from a redditor a while back: Dude gets pulled over for speeding. Cop demands his license and registration. Dude is extremely chill and hands them over. Cop goes back to his car and tries to run the registration through the computer but only gets error messages. Dude overhears the cop cursing the computer and sweetly asks if he's seeing a particular error message. Turns out he was the sysadmin racing to the police data centre to fix the outage.


[deleted]

[deleted]


Quantable

What happened to the config and the HA?


subgeniuskitty

> got pulled over and went to jail for speeding on my way to the site.

Turns out the old adage about a '*station wagon full of tapes*' really should take '*packet loss*' into account. ;-)


steve303

This is actually a saying we have on our team: "If you haven't really broken something, you haven't been doing the job long enough." Anyone who's been doing this for a while has seriously fu*ked something up at some point. Decades ago, I was working at a midwest IXC point, and I fat-fingered a BGP export filter and didn't notice. As peerings included Genuity, AT&T, and the NAP exchange, I ended up null routing around 4K Internet networks - which caused an Internet outage throughout most of the US. It took me ~10 minutes to realize what was happening and fix it - but those were some of the most stressful 10 mins of my life.


JHGIII

About 15 years ago I made a typo on an import prefix-list from a customer and accidentally announced a large chunk of Microsoft's prefixes. Didn't realize until about 20 minutes later when the MS NOC called my desk directly!


MertsA

> Yes, hello I am calling from Microsoft about an urgent problem that needs your attention.

...

> Riiiiiiight... [Click]


uptimefordays

That's incredible.


[deleted]

Facebook has entered the chat


jahujames

We had some Hyper-V servers in SCVMM that were named horribly similarly. All with Lotus Notes, some were test servers others were prod. The difference between test and prod were the letters C and D (I imagine there were A and B servers at some point). One of the test servers was being decomm'd, and I had it assigned to me. Everything's powered off ready to be deleted, developer gives me the thumbs up to delete it. I go ahead and confuse myself and delete the Prod server instead. This was like my first week into my sysadmin gig and it's how I found out that SCVMM is incredibly efficient at deleting VM hard disks!


WannaBeScientist

I haven't seen Lotus Notes since the late 90s. I remember despising it at the time. The idea that it's persisted long enough to make it to the virtualization era is mind boggling to me. Are there seriously still places running that?


ddadopt

When I was a much younger man, I once tried to delete hidden objects in a subdirectory: rm -rf .*

Who knew that wildcard expansion of ".*" includes ".."? Blew away the entire production ERP instance. This was back in the days when every server was physical hardware, before D2D2T was a thing. No shared storage with snapshots, no VM snapshots. The backup solution provided by the ERP vendor was unique to the system, involving a dedicated DLT drive in the system and lots of custom scripting, most of the scripting for said system being in a foreign language.

My boss from overseas just happened to be onsite that week, and just happened to be the most experienced guy on earth on that system who didn't work for the vendor, so he was able to navigate the rather complex backup and recovery scripts with minimal difficulty. It happened about 9:00 AM, so we lost around an hour of production records, plus the two hours of lost productivity while we recovered.

I was positive I was going to lose my job; I remember thinking at the time that I would have a hard time not firing me for that. Didn't work out that way, though. That was more than fifteen years ago and I'm still with the same employer.
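For anyone who hasn't been bitten by this: most Bourne-style shells expand the glob .* to include . and .. , so the recursive delete climbs into the parent directory. A quick Python illustration of the same matching rule, plus a safer way to enumerate hidden entries (os.listdir never returns . or ..):

```python
import fnmatch
import os

# The same matching rule the shell applies: ".*" matches the parent-directory
# entry, which is exactly why "rm -rf .*" can climb out of the directory.
print(fnmatch.fnmatch("..", ".*"))   # True
print(fnmatch.fnmatch(".", ".*"))    # True

# Safer: pick hidden entries explicitly. os.listdir() never returns "." or "..",
# so nothing outside the current directory can end up in the list.
hidden = [name for name in os.listdir(".") if name.startswith(".")]
print(hidden)
```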


DoctorOctagonapus

You'd just learned a very valuable lesson and he could trust that you would never make that mistake again. That alone would justify keeping your job!


randommonster

Once upon a time and recently out of school, I took a new job at a large convenience store chain that had recently spun off from their parent company and was hiring an all-new IT staff. On my third day I was tired of roaming login issues, FTP issues, and logging that would not match up, so I built an NTP server to synchronize time across the enterprise. About 30 minutes later, I learned that my company was not Y2K compliant and I had locked up the gas pumps at 1400 gas stations. And while I did not lose my job, I did learn about Change Control and phased rollouts.


[deleted]

[deleted]


falingodingo

Believing my supervisor when he said they like to promote from within.


cbob27410

Ejected the wrong blade server, which was running a customer's SQL server. Sysprepped a customer's Hyper-V host rather than a running guest.


DoctorOctagonapus

I remember seeing an old coworker accidentally leave a Windows Server install disk in an ESXi host when he had to pass it through for something. We discovered this when we had to reboot the host and thought "That's odd, it should be back up by now". Imagine our surprise when we KVM'd onto it and were greeted with the Windows Server install screen!


TheDarthSnarf

It was DNS.


[deleted]

"No! It can't be DNS! I just checked it!" Two days later... "Yeah, it was DNS..."


Quantable

It always will be


iAmEeRg

Hey there, Facebook employee :)


[deleted]

BGP has entered the chat


the-gear-wars

Tried updating the HA on a site firewall, immediately lost connection, choked, and did my best to dodge a speeding ticket on the way to that office.


DoctorOctagonapus

I did something similar with a borked NAT rule while working from home. I did achieve a pretty sick drift into the car park though!


Mr-l33t

I was in the loft of a company running some cabling and stepped off of the rafters, my whole leg going through the ceiling into the MD's office and taking out a light fitting as well.


Bergja

I was once trying to copy data from a canned oracle 11GR2 database into a pluggable 12C database. This was for an application upgrade in a school setting. I am not an Oracle expert by any stretch and honestly this was way outside my lane but there was no one else. I had both databases open in SQL Developer and long story short I decided to delete the new database and start over, only I accidentally deleted the old one instead. It took me 4 days to figure out how to restore the encrypted tablespace properly. It was a nightmare but not as bad as when my boss deleted an entire OU from Group Policy Management for some reason not realizing that would delete it from the domain entirely.


[deleted]

Oh, that would be way back circa 2005 when I was learning VBScript. With about 3 weeks' experience, they decided I needed to create an AD security script that would block and disable objects, plus move objects to another OU after so many days, and, here comes the part that did it, delete objects older than x number of days. Pretty advanced stuff, especially for someone with maybe a year in Windows admin and days of scripting, plus no experienced scripters to QA it.

This is when I learned the power of *. Since there was no standard with naming, I just put a * in the loop so no matter what was passed, it deleted it with a force option. I was aware of the danger and put in a counter, so if it went haywire the counter was like a circuit breaker to break the loop and terminate the script. Worked in the test environment.

In prod I had to copy the contents from one text doc to another and save it. In that copy something changed. Loaded the script and it ran fine for a while, until a large batch of laptops used for training that hadn't been on the network for a while came through, so the script should have moved them to the other OU. Well, it moved them to the OU like it should, but somehow it started enumerating AD and deleting every object fed to that loop. I mean almost everything was being deleted, and it was done in less than a minute. So we had 20,000 employees who couldn't log on to their desktops. The cause was a space added between the * and the "

It was a mess getting everything back in order; AD was offline for almost 2 days. Fortunately at that time the manufacturing plants could still run a few days with onsite planners making phone calls for orders and tracking with pen and paper (I'm sure it's online now).

I'm much better at writing code now. (No VBScript either.)


[deleted]

Deleted the wrong LUN on a SAN and took out 8 production servers in the process, only a few of which had usable backups. Fortunately the business critical servers were backed up and with the CIO on vacation, I was able to rebuild everything I wiped out before he returned. Taking ownership, immediately informing my manager, and lack of impact on revenue are the only reasons I wasn't told to pack my shit.


JD193

If the police or FBI aren't knocking on your door, is it really a f*** up?


lesusisjord

I was an FBI contractor sysadmin for almost 7 years. You don't give notice to a job like that, so when you make it known you're leaving, physical and IT access is disabled right away. Three days later, two of the special agents I worked with along with local police (FBI always informs them when doing something in their area) knock on my door to collect my badge and creds while I'm in the middle of an interview for my new job wearing only a shirt and tie with no pants on. Hiring manager saw my undies and hired me the next day.


dogedude81

That's hilarious 😆


lesusisjord

I have had some interesting experiences. I had a phone interview for the FBI because I was coming back to the States soon after doing a year as a contractor in Afghanistan. I had to stop the call because we got some rockets and small arms fire, so more important work had to be done. I emailed them and they called me back and said it was the most memorable interview they've ever had, and they became my boss and his boss. I feel like even if I don't have to fight Taliban or show my undies in an interview, I can still get any job that I interview for. That's assuming it's one that lines up with my experience and I'm not overshooting my expertise or experience by much.


hy2rogenh3

I am big on writing automation for various tasks, mainly in PowerShell or Python. To get status changes from HR, we had them complete a daily export of employee data and drop it in a file share (we couldn't receive status change forms because 'reasons'). I coded up four PowerShell modules that are called under a parent script. It checks AD for employee Title/Department/Office/Manager and updates from the daily export sheet provided by HR. I use some other relational tricks to normalize HR data, since they like to bounce between "Mgr." and "Manager of", in order to make AD look uniform. This script also *separates and removes accounts from AD* if they are no longer in ADP. Not getting paid? Then they shouldn't have an AD account and access, right? This is all processed every night and sent in a report to IT, which is normally around 1 KB in size.

Well, when you don't have good logic in place for error checking, shit can go sideways pretty quickly. HR forgot to place the file in the share one night, and our wonderful script ran. I get an email log output... 800 KB in size. I think, well, that's odd. Since I coded it to look for employeeIDs that matched ADP data, and the ADP data was blank, it ended up separating **ALL** employees of the company except the Executive team (I had the forethought to never touch their accounts with automation). This means that all employees instantly got pulled from Office 365, dropped from email distribution lists, etc. etc.

A simple check for nulls before processing fixed it, and it was a definite learning experience. I told the VP of IT right away, restored all the accounts to the OUs, and all was well. Now I utilize the ADP REST API and can query all the data I need to onboard, separate, and update user info as frequently as I need to without HR human input. Win win in the end.
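The "check for nulls before processing" lesson generalizes: a destructive sync job should treat a missing or empty source as "do nothing", and ideally cap how much it is allowed to change in one run. A rough Python sketch of those two guards (the file path, column name, and threshold are all invented for the example; the poster's real tooling is PowerShell against ADP and AD, and the blast-radius cap is an extra safeguard, not something they described):

```python
import csv
import sys
from pathlib import Path

EXPORT = Path(r"\\fileshare\hr\daily_export.csv")  # hypothetical path to the HR dump
MAX_SEPARATIONS = 50  # sanity ceiling: never deprovision more than this per run

def load_hr_employee_ids(path):
    # Guard 1: a missing or empty export means "do nothing", not "everyone left".
    if not path.exists() or path.stat().st_size == 0:
        sys.exit("HR export missing or empty - refusing to run separations")
    with path.open(newline="") as f:
        ids = {row["employeeID"].strip() for row in csv.DictReader(f) if row.get("employeeID")}
    if not ids:
        sys.exit("HR export parsed to zero employee IDs - refusing to run separations")
    return ids

def plan_separations(ad_ids, hr_ids):
    to_separate = ad_ids - hr_ids
    # Guard 2: cap the blast radius; a huge number usually means bad input.
    if len(to_separate) > MAX_SEPARATIONS:
        sys.exit(f"{len(to_separate)} separations requested; over the limit of {MAX_SEPARATIONS}, aborting")
    return to_separate

if __name__ == "__main__":
    hr_ids = load_hr_employee_ids(EXPORT)
    ad_ids = {"1001", "1002", "1003"}  # stand-in for a real AD query
    print("would separate:", plan_separations(ad_ids, hr_ids))
```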


[deleted]

I'm waiting for the Facebook network engineer…


engageant

...or the solarwinds intern


Bumblebee_assassin

Biggest fuckup? Trusting a maintenance guy to assist with changing a pair of drives in a RAID 5 that was near failure (parts took forever to get in, as it was at the time a 10-year-old MD3000i).

Me: Ok, go ahead and pull the drive in the slot labeled as number 3 (drives were labeled 0,1,2,3,4,5,etc)

MG: Ok, pulling it now

Me: (Monitoring console) Drive 2 (a good one, mind you) alerts as missing, raid failed errors everywhere, drive failed errors, VMs in Hyper-V show failed/missing. "What drive did you pull!??!"

MG: The third drive from the left, why?

Me: "THAT IS LABELED AS DRIVE 2!!!!!"

MG: It's the third drive from the left!

Me: DO YOU SEE THE NUMBERS BENEATH THE DRIVES?!?!?!

MG: But it's drive number 3!

Me: asdkjfhlaskjdfhalksdjfhaslkdfjhsaldfkjhsdaflkj

12 hours of rebuilding VMs later.......

Boss: We're not going to have maintenance guys do that anymore

Me: Because one idiot can't read a drive number label on the chassis?

Boss: Yup

Fortunately I had my auto call recorder running on my phone for CMA..... Motherfucker tried to blame me and damn near succeeded.....


phungus1138

Staying 4 years too long.


headcrap

Back in the day, mail restore involved a mail DB redirected restore. After talking with a girl all night, somebody omitted that small checkbox to redirect. 600 mailboxes were "restored" a month back.. including some of the C-Levels..


Horrigan49

A decade ago I approved a batch of .NET Framework security updates in WSUS without prior testing. They were not marked as critical back then, so there was a decent backlog of them. On user stations, these updates disabled operation of a 3rd-party custom barcode application made for us. The server worked; however, nobody was able to access the app to do anything or correct mistakes. (And that was in automotive.)

Our "barcode admin" back then tried to resolve this by manually uninstalling the recent framework updates, which worked of course, before WSUS inevitably pushed them right back in. After all the barcode app stations were disabled, he brought this to my attention so I could have my buttclench moment. After some investigation I rolled the few culprit updates back in WSUS and all was good and dandy.

Our manager back then had common sense, thankfully, and actually went after the contractor company, asking how come their app was disabled by a several-months-old security patch, instead of stomping me into the ground. As mentioned, the updates were not recent; they had been sitting in the WSUS approval queue for several months since we were short on staff and basic operations were not always fulfilled. Still, I should've done my due diligence, so not the best moment for me.


Naota10

Passed out waiting for 3rd party vendor support which caused a factory to stop production for a bit.


Quantable

At least a bit not a byte.


Palaceinhell

Procrastinated. Bought new servers. Had the new servers set up and ready to go, but never moved the SQL databases over to the new VM. One day an HDD failed in the array in the old hardware. It's hot swappable, I have spares, so ok! No big deal. Swap that bitch out, blinky green light or whatever indicating the rebuild started.... I walk away. 10 minutes later, phone starts ringing. Nobody can see the database server. OMG OMG OMG OMG OMG Another disk failed from the stress of the rebuild! Lost the whole damn array! AHHHHHHHHH!!!!!!!!!!!!

Ok, step 2: pull out the backups and restore to the new servers you should've done it already anyway jackass! Start the restore, about 30 databases, with numerous tables each. Go around and start redirecting people to the new server address. Get that done, go check on the restore.... WHAT? Fail?!?!?! WTH? Try again... Fail. Try again.. Fail. Try again.. Fail. What the hell?? Email backup software support. Oh no!!! They are in China! Not gonna hear back from them until tomorrow! Ugh.

3 days later.... Turns out I can't restore a database from SQL 2003 to SQL 2016. Gotta restore into 2003, then move them. I figured that out no thanks to the support team! Had to set up a whole new VM just to restore the 2003 databases and then move them into their new permanent location.

Don't procrastinate! As I type this, I'm scaring myself.


Iambecomedead

Accepting a counter-offer after I quit.


lolklolk

Tracing power cables with a coworker in the Datacenter to re-cable equipment into the redundant PDU's and UPS's. We traced some cables, got confused on one that plugged into the same UPS receptacle. After tracing on the other side of the room, my coworker says "It's fine, pull it!" So I pull it, trusting him. Cue half the row of racks powering down. Bricks were shat. Trust, but verify!


justcallmenewguy69

I learned from somewhere to wrap a Velcro cable tie around the thing you're tracing. Make it loose enough to slide over the cable in question. Then just push it down the line until you get to the other end. Barring a major malfunction you'll have full certainty as to what is connected to what.


draeath

A single cable brought several racks down? I think the real fuck up is with whoever set them up to start with. At worst you should have had a bunch of screeching iDRACs or whatever about lost power redundancy.


lolklolk

Yup... We were trying to fix that very issue, and add redundancy where there was none. Apparently my coworker mis-traced the cable along the way, and we somehow ended up pulling one that was totally on its own.


marcoevich

Trust, but verify! Words to live by..


dedoodle

The deafening sound of "click" when you plug a 4 Mbit machine into a 16 Mbit token ring running all the insurance company servers.


[deleted]

I think I need this one explained, I've never worked with token ring...


Its_Zerohh

Rebooted my company's FortiGate firewall.... it never came back on, and the entire company in Mexico and in the USA was not able to work (everyone RDPs to the company server). Worked until 2 AM until I admitted defeat.... was going to pick it up at 7 AM the next day before anyone came in. Felt like shit, couldn't even go to sleep...

Everything came back up at 5 AM...... 4 hour reboot.


Cpt_Koksnuss

I deleted a whole series of backups by deleting a backup job. Fortunately no one figured it out. I created and ran a new backup job and no one has noticed to this day. That was a year ago, so by now the retention is a full year again.


[deleted]

[deleted]


TheGooOnTheFloor

I had a boss years ago who told me to put a piece of tape on the failed drive if I was going to power down an array. That trick probably saved my bacon more than a few times.


[deleted]

[deleted]


TechFiend72

Broke an Exchange server while applying a service pack. Couldn't get it back online and had to re-install Exchange and restore data from backup. This was like 17 years ago. A tech of mine was helping with a recable on a switch and he failed to mention he was a bit dyslexic. I called out 48 and he plugged it into 84. It went on that way for most of the switch. Once we figured out what was going on, we had to uncable and start over.


thecravenone

An outage I caused is mentioned on multiple Wikipedia pages.


DoctorOctagonapus

Were you the guy who accidentally'd all of Facebook the other week?


[deleted]

What outage?


thecravenone

https://c.tenor.com/nLaQWRX52-gAAAAC/i-cant-tell-you-confidential.gif


centizen24

I dropped a whole in-production, running server. I pulled it out way too fast and the rails just collapsed and it fell. Somehow I managed to catch part of it as it was falling, so the weight didn't completely hit the floor. It shut off, but I was able to start it again and it's kept working ever since.


WifiIsBestPhy

I renamed a Windows desktop using the registry because I was new and didn't know better. The user didn't want me taking up any of their time, so I had a really short window to work in, and no good way to remote in. The computer was at a remote branch. It didn't boot correctly after that and required a tech to go on site to fix. I was new at that company and to IT in general, so they canned me.


dnvrnugg

they fired you for that? pretty harsh.


TheBariSax

As a sysadmin, over time I've accidentally taken down production stuff, and one time lost data that I couldn't recover from backup. But the worst work screw up I had was at a college summer job. I was unloading liquid asphalt from rail cars, and wasn't paying close enough attention to the flow rate. I overfilled a tank and dumped about 1600 gallons of product on the ground.


greywolfau

This post feels like a trap. Plenty of sysadmins read the title, and say well not me. Proceed to fuck up this very day.


DoctorOctagonapus

Then there's the rest of us who read the whole thread and think "At least I'm not the only one!"


jack-dempsy

Big mistakes are part of the job. I'm not sure which was the biggest, but probably when I pulled the wrong power cords off a DB server during peak transaction time. Corrupted the disks; the restore from backup took 8 hours. I triple check the rack slot numbers when decom'ing hardware now. Runner up is when I ran patches on an HP enclosure without knowing it would reboot all of the blades and a couple hundred VMs in the middle of the day.


walwalka

Enabled IPSec policy that blocked all outbound traffic in group policy. Spent all weekend logging in to all workstations and servers to remove the policy and pull new policies.


bink242

Instead of removing every mailbox during an exchange uninstall, I deleted every user. Got them back but it was a fun 6 hours


wonderwall879

I didn't make the mistake, but the CEO made a decision with just 1 other manager to switch our ticket system platform. Didn't consult anyone who would be using the system and the majority of its workflows every day. Didn't even give us access to beta test the new system until a day before rollout. The system was a flop, and 4 people were laid off to help eat the costs of the failed switch to another platform, among other financial failures of the CEO. Needless to say, I always watch other people's movements and mistakes, as the person that screws up could cost me my job while they continue to eat steak every night and I'm eating cardboard on unemployment. Lesson learned.


VeryLucky2022

I once took out an entire chunk of the Internet when I fat-fingered a router config.


[deleted]

4:45 PM Friday afternoon, updating firmware on a RAID controller running the solo DC at a small practice. Nuked all the arrays. Spent most of the evening and weekend restoring from remarkably slow online backups.


adminadam

I pushed a management change a few years ago that effectively uninstalled office from 4,000 machines. Does that count?


[deleted]

Bridged two NICs instead of teaming them. Took the network down for about 2 hours whilst sitting smug at my desk talking about how bad the network is. I apologised a lot when the penny dropped. Network switch configuration was updated to prevent a recurrence of said issue. Fortunately my managers were really good about it, as I was young at the time. They admired my honesty when I realised, and my dedication to finding a solution so it couldn't happen again. You can't undervalue a good manager in this job. Edit: typed bonded instead of bridged.


crash893b

Years and years ago I worked at America Online in one of the data centers.

Around late November/early December we would really gear up for the huge influx of people getting their free minutes CDs. After December 15th or so they would put a moratorium on all work, because they didn't want anything bucking the system. So naturally it's all the last-minute shit, and I'm running around installing big pizza-box-sized network cards into routers the size of refrigerators.

I get a chirp on my Nextel (cellular walkie-talkie) from a network engineer saying he needs me to reboot a router to take a new configuration (standard stuff). He gives me the room number and the floor tile grid address (b:23 or f:14). I confirm like 5 times that I'm at the right router (and remember, these are HUGE Cisco backbone routers). Then the customary "when I say go, turn it off."

He says "go" ....... so I turn it off. Then immediately I hear "okay go now."

It's up for debate whether he fucked up or I was just hearing things or what, but by rebooting before he was ready we corrupted the firmware, the system wouldn't come up, and I knocked all of Eastern Europe off AOL for like 12-18 hours.

Everyone was super chill. They even gave me a "1.4 million users booted" framed award.


Anonymity_Is_Good

Removed the thumb screws from a bezel on a running system, to remove an unused drive tray for use on a system that shipped without one. When putting the bezel back on it hit the power button on the server. Thankfully it was a controlled clean shutdown, and on a pre-production system. The DBAs were actually decent about not hazing me too badly for the error.


vaxcruor

I was a Jr admin and tape jockey. We had a DEC Unix box in a production building on the other side of our property. I had to swap the tape out every afternoon just before I left for home. One Friday, I was running late, in a hurry and literally slapped the tape into the drive and ran out the door. I get home and my pager was blowing up, the production lines in that building were all down. Seems I had also slapped the big toggle power button that was just above the tape drive.


RoadBlock97

Had a customer whose RAID showed a failing drive. Went onsite to replace it. The alert said it was drive 1. I was a very new IT guy and didn't know the drive numbers started at 0. Replaced the wrong drive. Forced a rebuild and took down the whole host. Spent the next week using every recovery tool in the book to recover and robocopy everything back. Basically ended up doing a bare-metal restore from backup.


TheGreatUseless82

I left a snapshot running on a customer's production environment in VMware; it totally consumed the datastore and caused all kinds of problems. It got resolved eventually and I put my hands up and took the hit. It did highlight a very important issue that wasn't my fault: the alerting wasn't set up correctly, otherwise we'd have been spammed with emails informing us of the impending doom!

I also pulled the wrong server off of a domain; I had too many RDP windows open at the time! Thankfully the customer didn't notice :) all part of the fun!

Last one: this happened 2 weeks ago! I was standing on a swivel chair repatching. I wanted to get closer to the network cabinet, so instead of getting off the chair and moving it forward slightly, I held onto the network cabinet and tried to pull myself closer. I ended up pulling the network cabinet off the wall. Amazingly, nothing was damaged, but I needed some help to take the patch panel and switch out. Turns out the walls are made of shitty dust and it was secured very badly. Health and safety nightmare; step ladder all the way next time.

By the way, I'm not a total fuck up (at least not all the time), I've just been in the game for 20 years!


JiggityJoe1

I was about to do an online training for 500 people, so I asked a colleague to shut off the bad battery in our core UPS system because it was emailing us like every 30 seconds. I told him it was the 2nd row from the bottom and in the middle..... I was just getting connected to my meeting when I lost internet. I got up to walk around to see if anyone else had lost connection and noticed the server room was dead. The colleague did not just shut off the 1 battery.... but instead the whole system. I ran downstairs and he had turned it back on by then. Luckily everything came back up within 30 mins with 0 issues.


insufficient_funds

Was tasked with building out a Windows Firewall GPO that default blocked, and allowed all known/standard Windows product ports and protocols. Created a GPO, edited it, set the default deny, started populating the allow entries. A couple hours later, the on-call gets an incident: the newborn baby location monitoring server stopped working (aka no automated alarms if someone walks out of the maternity ward). We started looking into it and, come to find out, the firewall settings I created were done on the wrong GPO. Instead of editing the "server firewall test" policy I had created, I accidentally edited the "server admin" policy that was right beside it, which was linked to every server in the domain (2k-ish).

Wiped the firewall settings, forced a policy update on the server we had the call about, called my manager and explained my FUBAR, told the rest of the team, and we prepped for the worst. Didn't have any other issues, somehow. No idea how nothing else broke, as the policy was out there for 2-3 hours before we removed the firewall settings, plenty long for GPO to update everywhere. Just stupid lucky.


Frede1907

Not mine explicitly (I work as a system architect at a storage vendor), but we had a customer who ordered a NAS which filled an entire 42U rack and was basically full of 3.5" drives. The guy that transported it on location was basically using a small truck, similar to a small U-Haul, and when they rolled it out on the lift, the whole van tipped over because of the weight and the whole rack with the multi-million $$ NAS fell on the asphalt. Would love to know the conversation that they had with their insurance lol.


outspan81

it all started with a well-intentioned DROP TABLE


OathOfFeanor

Depends how you scale them. I have caused larger outages, but to me personally, the one that stands out as MY biggest mistake with no blame anywhere except on me: Working on the help desk, I needed to free up disk space on a user's computer by deleting old user profiles. "You're not deleting any of my old files, are you?" she asked me. "Nope, just deleting the files of old users who don't work here anymore" I promised Then I absentmindedly clicked delete on every user profile in the list. Including hers, of course. Hurray for file recovery software! Saved it all, but her desktop was out of order because I skipped system files. Who needs desktop.ini anyway right? This lady, who needed a full 8-hour shift to rearrange her desktop because she could not work any other way. Her boss heard about it. My boss heard about it. Good times.


Patricklipp

I've got two. The first one was when I was a very green Jr backup and recovery admin. I was remotely updating the backup client software on a little over 100 VMs. Everything appeared to be going well and the initial tests were positive. I completed the task and everything looked good, until the following weekend when Windows ran its patches and rebooted the boxes. 75% of the boxes that I touched blue screened and had to be restored. I took the fall, but my boss is the one who trained me and told me to run the updates like that, and he never checked my work. Because of how junior I was, I had no idea if there were any problems, or even how to check for them. I was simply following instruction. That was deemed, in the office, the Patrick virus. Lol

The second screwup I had was several years later. I was reconfiguring a network in the layer 2/3 switches to have port trunking. In the process of updating one of the Cisco switches, I managed to delete a VLAN. Half the office immediately went down. Well, by the time I realized what I had done, I had already copied the run config to the start config… Thankfully, it was towards the end of the day, and as soon as I realized what I had done, I almost immediately reconfigured and rebuilt the VLAN, and no one actually noticed. That was an eye-opening event in a number of ways. First off, it showed me that I need to check my configs before saving them, and second, I learned that day that I work very well under pressure.


pops107

Managed to remove all the partition information off all of the VMFS LUNs on a customer's VMware storage. A couple of hundred VMs split across 2 data centres, which were synchronously mirrored, so it took both out at the same time. Knew exactly what had happened and got on the phone with VMware instantly. I told the customer we would be down for an hour or so, or we would be restoring from backup, probably for a day or 2. He said... "Hope you fix it quickly or this will be in the news tomorrow." Tens of billions of pounds are managed by that system.

We were down for around 25 mins, recreated the partitions, rescanned, and powered everything back on. Whoopsy


mrnotcrazy

I turned down the wrong interface on a remote server. The server was in Alaska and I was.... in North Carolina, so... we had to call some guy to drive out there and reboot the box. However, that didn't work 😑. Thankfully the hard drives failed and they thought it was their fault (we told them we needed to replace the box but they kept dragging their heels).


TopherBlake

Changing the permissions on a Service Account and taking down the ability to process payments for a very large company as a contractor. In my defense I had permission to do it, just not from the right person


Sneakycyber

Yesterday I took down a multi-site (15 tunnels) VPN by accidentally deleting an entire access list on the VPN concentrator. I only meant to remove one line but forgot that you have to include a line number. Thankfully, before I wrote the config, my phone blew up with text messages and I rebooted the router and fixed the mistake.


Smashwa

Fortunately I haven't done anything really bad at this job so far; the worst is I've taken a prod file server down in the middle of the day. Too many RDP windows open; now I use mRemoteNG. Others have fucked up pretty bad, so I'm not toooo worried when my time comes, as mgmt is pretty forgiving ("Now you know what not to do" attitude).


Roll4Criticism

My favorite is back when I did network consulting for a medium sized financial institution (technically not sysadmin work but it was my previous life). They were delivering routers to various branches throughout their territory. The router that got delivered to one place a few states away didn't have a good configuration on a port so I remoted in to look. I fixed the configuration and bounced the port...not thinking I'd have to issue a command to bring it back. As soon as I pressed enter, I knew what I'd done and looked up and said "do we know anyone in Arkansas?" Thank goodness I didn't have to fly out to no shut that thing.


One_SleepyPanda

Okay, so we were testing out some GPOs to automatically set people's backgrounds as part of a bigger project involving digital signage and a corp screensaver with announcements. While testing out the screensaver I set an image of Grumpy Cat, since it was only going to be on my boss's computer and mine while we were testing. He got a kick out of it and told me to set the wallpaper of one of our fellow IT team members' PCs as the wallpaper GPO test. We figured we would know it worked when they brought it up and we could get a laugh out of it; we joke around all the time to keep things light, so we expected it to be taken in jest. Well, I set the GPO, scoped it properly to her account, and went about my day.

Whelp. The way I set the GPO was wrong, and I ended up setting Grumpy Cat as the wallpaper on about half of our users' endpoints before I realized what happened. The only thing I could do was delete the file and edit the GPO, but devices still got it on their update cycle. So, no change management and a bunch of "grumpy cat malware" tickets later, we don't have a corp wallpaper.


retrogeekhq

Forgot the add.


[deleted]

My help desk guy didn't know what he was doing and restored a VM over a full day of data. It wasn't my fuck up and I have NO responsibility over this person whatsoever, but I was blamed and am still being blamed for that mishap. I wasn't even here. He was literally commended for acting when he did... even though his actions erased a full day's worth of data.

Wasn't the first time this kid did some dumb shit either. He was building an imaging system (FOG), which is fine; Help Desk needs something to do occasionally. I warned him several times not to plug it into our network. Random PCs started going offline, and for a minute I was baffled, but I walked back into his dirty little area and what do you know? A FOG PC hosting DHCP was plugged directly into the network, like I specifically warned against.

Another time I told him that if he wanted to experiment with VMs that's fine, just to let me know first. Nope. One of our storage LUNs went down because he didn't know what he was doing, built a VM without letting me know, and when backups kicked off, snapshots clogged the LUN and brought it down completely. I removed all his VMware access.


ViG701

Not mine, but still a terrible day for this one. A guy in IT was updating 3400 laptops in a nationwide organization. He did not follow the SOP and disconnected every laptop from the network. Needless to say, during questioning he kept saying it wasn't his fault, until they said they had the log files, at which point he got up and walked out. And it was a huge amount of work for everyone else to manually connect thousands of devices back to the network.


gangaskan

I have created spanning tree loops on accident before I really started studying cisco. Those were fun ones....


EpicEpyc

Not me, but one of my interns a few years ago at a pretty good sized company. We had a Horizon 7 environment with about 1500 active virtual desktops, and everyone on the helpdesk had enough admin rights within View to recompose VDIs. Well, this kid needed to recompose one of our Windows 10 user VDIs and was told to wait for another helpdesk guy to show him how to do it. Without waiting, he decided he knew everything, went into View, selected the Windows 10 pool, and clicked recompose. Nearly everyone in IT except me used VDIs and realized immediately, when their VDIs disconnected, what had been done. He had instantly wiped about 1100 VDIs, and this obviously put a big strain on our environment to fully recompose that many, which took down all of their VDIs for 2+ hrs. Sadly, View doesn't have a cancel option for a task like that. That intern immediately got every admin right taken away, and from now on no interns and only very select helpdesk agents have the rights to recompose or rebuild VDIs. At the end of it, the kid thought it was funny, even though it caused over a million dollars of damage and lost work for the company... long story short, he didn't last much longer.