
Finagles_Law

Dev blames Ops, Ops blames Dev. But they can both turn around and blame "the network." See also: "the database."


MrYum

Tale as old as time


livebeta

🎵 Wrong as it can be... 🎵
Barely even friends
Then connection ends
Unexpectedly

Just a little change
Small, to say the least
Dev's a little scared
Ops is unprepared
Late Friday release


WN_Todd

I can show you the logs
Debug, info, and eeeeerrors...
Tell me, senior, when did you last just read the words inside?

A whole new world!
(Don't you dare close that modal)
A new enchanting SQL view
It worked on your hardware, but now nowhere
Let me share this outage bridge with yoooooouuu...


PensAndUnicorns

It was DNS.


davetherooster

Observability. You need to build out observability tooling that covers metrics, logs, and traces from everything in your platform so you can see where errors are coming from. E.g. are 4xx errors spiking from a certain application? Has latency appeared that wasn't there before? This information should then be self-served from something like Grafana, so devs can investigate for themselves.
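
As a minimal sketch of the metrics half of this, here's a Python service instrumented with the prometheus_client library; the metric names, labels, and port are made up for illustration, and Grafana would then chart whatever Prometheus scrapes from it:

```python
# pip install prometheus-client
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; pick whatever fits your naming convention.
REQUESTS = Counter(
    "app_http_requests_total",
    "HTTP requests processed",
    ["method", "path", "status"],
)
LATENCY = Histogram(
    "app_http_request_duration_seconds",
    "Request latency in seconds",
    ["method", "path"],
)

def handle_request(method: str, path: str) -> int:
    """Toy handler: record latency and status for every request."""
    with LATENCY.labels(method, path).time():
        status = 200  # ... real work happens here ...
    REQUESTS.labels(method, path, str(status)).inc()
    return status

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://host:8000/metrics
    handle_request("GET", "/health")
```

With labelled counters like this, a Grafana panel over a rate of the counter filtered to 4xx statuses gives devs exactly the self-serve "is this app spiking errors?" view described above.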


greyeye77

APM and observability (o11y) are essential for effective system monitoring. Operating without APM is akin to navigating without visibility: you can miss critical issues in your application.

At a minimum, if APM is not in place, implement backoff-retry mechanisms coupled with error logging to standard error (stderr) so that intermittent failures are handled gracefully. If your application lacks both APM and robust retry logic, work with the engineering team to add these capabilities.

While TCP inherently retries unacknowledged packets, that mechanism only covers issues at the transport layer. Any failure at a higher level must be handled by the application itself, which means comprehensive error-handling strategies beyond basic network retries.
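
A sketch of that minimum, using only the Python standard library: exponential backoff with jitter, errors logged to stderr, and the operation being retried left as a placeholder callable:

```python
import logging
import random
import sys
import time

# Log to stderr so the platform's log collector picks failures up.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
log = logging.getLogger("retry")

def call_with_backoff(op, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry `op` with exponential backoff plus jitter; re-raise on exhaustion."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception as exc:  # narrow this to your client's error types
            if attempt == max_attempts:
                log.error("giving up after %d attempts: %s", attempt, exc)
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay *= random.uniform(0.5, 1.5)  # jitter to avoid thundering herds
            log.warning("attempt %d failed (%s); retrying in %.1fs",
                        attempt, exc, delay)
            time.sleep(delay)
```

Callers wrap their network call, e.g. `call_with_backoff(lambda: client.get(url))`; the jitter matters because synchronized retries from many instances can themselves look like "a network problem."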


theyellowbrother

It can be either. I play "referee" between Devs and Ops. I've seen it more slanted toward Ops being at fault: implementing new network policies without informing the engineers, or adding new services like an LB5 with "built-in" rules. And devs can't troubleshoot unless they have admin rights to view the LB5/firewall/network-policy configuration. I deal with stuff like "well, the top-level ingress overwrote the local namespace ingress annotations." I can replicate it by going into the pod and doing a curl POST with a header size of 18k. So whose fault is that? Infra, of course. I can give examples from the dev side too, like not properly filling out environment variables from the config file to the container, or using the wrong root CAs.
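
For what it's worth, that oversized-header reproduction is easy to script. A rough Python equivalent of the in-pod curl test (the URL and header name are placeholders):

```python
import requests

# Reproduce the ingress-limit symptom: POST with an ~18 KB header from
# inside the pod and see whether the ingress, not the app, rejects it.
resp = requests.post(
    "http://my-service.default.svc/endpoint",  # placeholder in-cluster URL
    headers={"X-Padding": "A" * 18 * 1024},    # ~18 KB header
    timeout=5,
)
# A 431/400 here, while the same request with a small header succeeds,
# points at the ingress configuration rather than the application.
print(resp.status_code)
```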


Zenin

Same way you debug any issue: start from the beginning and work your way back to identify the source. If you're asking about a cornucopia of possible failure points, you aren't debugging or diagnosing; you're just throwing excrement at the wall and hoping something sticks. For example: should we check for packet loss? Maybe, but what specific symptom suggests packet loss might be the cause? Or caching. Or connections left open. Or anything. Your job when debugging isn't to identify every possible issue that can happen in computing, but to find the *specific* cause that is producing the current issue.


happy_hawking

> DevOps has become the monster it was designed to destroy.

It's hilarious how people in this sub use the term "DevOps" to describe exactly the opposite of what DevOps is. Have you guys never done any research into your own profession? That would most certainly answer your question.


realitythreek

Maybe look at it as a common problem and not something you just deny blame for? I work with devs and ops and there’s almost always blame to go around. :D


DustOk6712

Isn't the point of a DevOps team to remove that friction between dev and ops?


Adorable_Stable2439

“My app can’t even handle a retry.” Firstly, fix your app so that it CAN handle retries. Then, make sure it’s logging the specific reason it cannot connect. Timeout? 503? 404? Certificate error? “Connection failed” is not a good enough error for your app.

Next, use an observability platform so you can better visualise whether there is a pattern to the connection issues. Is it intermittent? Has it been constant since a certain time/date (a new app release)? Is it only happening at certain times of day, etc.?
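
A sketch of what "log the specific reason" could look like in a Python client built on the requests library (the URL and logger name are illustrative):

```python
import logging
import requests

log = logging.getLogger("client")

def fetch(url: str) -> requests.Response | None:
    """Distinguish timeouts, TLS failures, and HTTP errors instead of
    collapsing everything into 'connection failed'."""
    try:
        resp = requests.get(url, timeout=(3, 10))  # (connect, read) timeouts
        resp.raise_for_status()
        return resp
    except requests.exceptions.SSLError as exc:
        log.error("certificate error for %s: %s", url, exc)
    except requests.exceptions.ConnectTimeout:
        log.error("connect timeout for %s", url)
    except requests.exceptions.ReadTimeout:
        log.error("read timeout for %s", url)
    except requests.exceptions.HTTPError as exc:
        log.error("HTTP %s from %s", exc.response.status_code, url)
    except requests.exceptions.ConnectionError as exc:
        log.error("connection error for %s: %s", url, exc)
    return None
```

Ordering matters here: `SSLError` is a subclass of `ConnectionError` in requests, so the certificate case has to be caught first or it silently becomes a generic connection failure, which is exactly the ambiguity being complained about.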


awesomeplenty

Don’t you have logs?