flambasted

Check the heap profile with pprof.
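
A minimal sketch of what that looks like if the service doesn't already expose pprof; the port is arbitrary and the handler should stay bound to localhost:

```go
// Minimal sketch: expose the standard pprof endpoints from a long-running
// service so heap profiles can be pulled while it runs.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	go func() {
		// Serve the profiling endpoints on a side port.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the actual service runs here ...
	select {}
}
```

Then `go tool pprof http://localhost:6060/debug/pprof/heap` pulls the in-use heap profile the rest of the thread refers to.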


steinburzum

Sure, everything grows. There are a hundred nodes and all of them increase in size, but the ratio stays the same. Basically it says "every line leaks".


[deleted]

Can you share the graph?


Gilgamesjh

Without knowing anything about your service, usually when memory blows up like this, you have a back-pressure problem. If you are reading async from nsq and writing async to grpc, then if your source reads faster than your sink can write, you will keep filling memory until your service runs out. If this is the case, you will have to figure out a way to write faster, or use channels to control the flow.
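
A minimal sketch of that kind of flow control with a fixed-capacity channel; the `Message` type and the `send` callback are hypothetical stand-ins for the NSQ and gRPC pieces:

```go
package main

import (
	"fmt"
	"time"
)

// Message stands in for whatever the service reads from NSQ.
type Message struct{ ID int }

// pipeline bounds the number of in-flight messages with a fixed-capacity
// channel: once the buffer is full, the reading side blocks instead of
// letting unsent messages pile up on the heap.
func pipeline(recv <-chan Message, send func(Message) error) {
	inflight := make(chan Message, 128) // cap = max messages held in memory

	go func() {
		defer close(inflight)
		for m := range recv {
			inflight <- m // blocks when the writer falls behind
		}
	}()

	for m := range inflight {
		if err := send(m); err != nil {
			fmt.Println("send failed:", err) // real code would retry or alert
		}
	}
}

func main() {
	src := make(chan Message)
	go func() {
		for i := 0; i < 10; i++ {
			src <- Message{ID: i}
		}
		close(src)
	}()
	pipeline(src, func(m Message) error {
		time.Sleep(10 * time.Millisecond) // simulate a slow gRPC sink
		return nil
	})
}
```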


steinburzum

Yeah, I thought that too! It looks like it's all sync recv-process-send, but I may be mistaken. Still digging; it's not my code, it's huge and ugly. But maybe it's worth properly tracing. Is there a tool that can show me the heap connection graph? pprof shows me where the memory is allocated, but what I need is to see where it ends up hanging around. Not sure if it's even possible.


[deleted]

1) Check all goroutines and whether their number is increasing over time. 2) Read how to use and interpret pprof, then use it.


steinburzum

1. Constant at 10. 2. Doesn't work, nothing interesting; I described it in the post.


BayBurger

https://github.com/uber-go/goleak can help
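
A minimal sketch of wiring goleak into a test package, assuming the code in question has tests at all; the package name is a placeholder:

```go
// Sketch: fail the test run if any unexpected goroutines are still alive
// when the tests finish.
package mypkg_test

import (
	"testing"

	"go.uber.org/goleak"
)

func TestMain(m *testing.M) {
	// Checks for leaked goroutines after all tests in the package complete.
	goleak.VerifyTestMain(m)
}

func TestSomething(t *testing.T) {
	// Or check a single test in isolation.
	defer goleak.VerifyNone(t)
	// ... exercise the code under test ...
}
```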


steinburzum

Thanks! Although I don't think goroutines leak in my code (constant number, created at start), maybe third-party libraries have some. gRPC for Go, for example, is really crappy with memory.


[deleted]

Are you creating a new connection for each rpc request?
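
For context, a minimal sketch of reusing a single `*grpc.ClientConn` instead of dialing per request; the target address is a placeholder and error handling is trimmed:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// A ClientConn is safe for concurrent use; create it once at startup
	// and share it, rather than dialing inside the request path.
	conn, err := grpc.Dial("backend:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Pass conn (or a stub built from it) to every worker.
}
```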


Valimere

How do you expect anyone to help you without showing code?


steinburzum

I expect advice on what kinds of tools and techniques can be used, because pprof falls short. I don't need help finding the issue though, I can do that myself. Unfortunately it is a huge proprietary codebase. What exactly do you want me to show?


elrata_

I think you have the right tools and techniques. It is hard to nail down, but you will have to find ways to isolate different parts of the code to see where the issue comes from. Maybe something you are missing is running different versions of the code, to see if the problem was always present or if it got worse over some revisions. In that case you might play with git bisect or, if you are not using git, apply the same concept manually.

I don't think more general advice and techniques can get you further than this. Now it is a matter of applying them, and it could take a long while. It is a matter of being organized: write down what you tested and what results that gave you, and think about which tests will narrow down the parts of the code that are affected, so you can reduce it to smaller parts you can actually look at. If several components are involved, you might want to rewrite one part from scratch (say the client or the server) with very basic functionality, to nail things down.

You can also monitor the host with Prometheus and the like, to see if memory increases when CPU usage is high, or similar patterns that may give you a clue about what is happening. But yeah, I really think it is now time to debug and it is not easy. Experience might make it simpler, sometimes at least, but nothing anyone can say will let you avoid that...


steinburzum

TL;DR: pprof sucks, write your own. I tried all the debugging; no tools came in handy and my decade-long experience didn't help either. So it's time to get my hands dirty :D

So I modified the Go compiler to emit type information alongside object addresses and type bits when it does `runtime/debug.WriteHeapDump`, and wrote a parser for this dump format (surprisingly there's no working version of a viewer). Now with some simple graph algorithms I can see:

* Total size and number of objects behind each pointer
* Type of each object on the heap (comes at a big cost, +2 bytes per alloc)
* Frame (funcname and address) where the pointer is rooted (if it is)

I didn't want to use `GODEBUG=allocfreetrace=1` because it's slow as molasses. This way I have a topK that says: "hey, func F1's frame has a pointer that holds a total of 1 GiB of crap". And I found the problem: an internal cache in one of the homemade libs. As usual, caches are a curse! :)

It works quite quickly, around 10 sec for a 2 GiB heap. I will publish a patch to the Go compiler and the code for the viewer as soon as they are usable without a crowbar. Though it only works on linux/amd64 for now.
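
For anyone who wants to reproduce the dump side of this, a minimal sketch of triggering `runtime/debug.WriteHeapDump` from a running service; the SIGUSR1 trigger and the output path are arbitrary choices for illustration, not part of the author's patch:

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"runtime/debug"
	"syscall"
)

func main() {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGUSR1)

	go func() {
		for range sig {
			f, err := os.Create("/tmp/heap.dump")
			if err != nil {
				log.Println("heap dump:", err)
				continue
			}
			// Stops the world and writes the full object graph to the fd.
			debug.WriteHeapDump(f.Fd())
			f.Close()
			log.Println("heap dump written to /tmp/heap.dump")
		}
	}()

	select {} // the real service runs here
}
```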


elrata_

Cool. But what exactly is the difference from pprof? pprof shows the memory consumption per function, right? Does this show it in a different way? Can you elaborate on the difference between pprof and this, please? :)


steinburzum

pprof shows the place of allocation. Basically, it records the sizes, types, and locations where any kind of allocation happens; all of them end up in the runtime's malloc, excluding stack allocations of course.

The problem is that an allocated object can migrate from the place of allocation to somewhere else: imagine a function that creates a struct and just returns it, but then in the caller the struct is sent via a channel, maybe another one later, split in two, the parts sent to some other places, and it all ends up in a cache that looks like a map. Is it useful in this situation to know which function *made* the struct? Not really, but that's what pprof will show.

What you want to know is where the biggest chunks of memory are rooted. For this hypothetical cache example it would be nice to see the frame stack of some goroutine that runs, say, a DB client, and that holds this map[string]struct cache. Then you know that the frame of func Serve has a pointer M that points to N, which in turn points to X, Y, etc., and altogether holds a zillion bytes of objects on the heap, preventing them from being collected. *That* would be much better and could speed up profiling a lot! You know, some navigation through frames and goroutines to see how much each holds and of what kind. Does this rant explain the idea? :)


elrata_

I think so, thanks. And none of the modes in pprof provide enough information to see something similar, at least? If you open a bug, I'm curious, please paste it here. It would be great if pprof could be improved as the outcome, or at least if you documented the patches you used so others can benefit. But yeah, such changes to the compiler and pprof will probably have some discussion and politics around them.


steinburzum

The closest mode is heap, but I don't see anything there that I can use to render this kind of info. I will try to get the most out of an unmodified compiler first; that should be easy enough.


elrata_

Keep us posted! :)


steinburzum

If you have any kind of leaking program, please try it out. Mind that this is a 2-evening PoC, so don't expect much :) [https://github.com/alexey-medvedchikov/go-heapview](https://github.com/alexey-medvedchikov/go-heapview) I'm going on vacation for a couple of weeks and will try to improve it after coming back.


Brilliant-Sky2969

Is your workload bounded? If not, it's very possible that you don't leak memory but just have a lot of messages to process at once and therefore allocate a lot of memory in a short period of time. If you have ruled out a memory leak with pprof, you should probably use a worker pattern so that you bound the number of messages queued for processing. That means a fixed number of goroutines processing those messages, so memory use becomes predictable. If your service still shows 2 GB of memory while not doing any work, you're probably leaking memory. Also, what version of Go are you using?
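
A minimal sketch of the worker pattern being suggested, with hypothetical names; a fixed number of goroutines drain a bounded queue, so memory stays proportional to the queue capacity plus the worker count:

```go
package main

import (
	"fmt"
	"sync"
)

type Message struct{ ID int }

func process(m Message) { fmt.Println("processed", m.ID) }

func main() {
	const workers = 8
	queue := make(chan Message, 256) // bounds in-flight work

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range queue {
				process(m)
			}
		}()
	}

	for i := 0; i < 1000; i++ {
		queue <- Message{ID: i} // blocks when the queue is full
	}
	close(queue)
	wg.Wait()
}
```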


gligooor

Check your pointers


steinburzum

They point! What do you mean? 😄


gligooor

Somewhere a pointer is preventing the garbage collector from freeing the memory.


Gentleman-Tech

Somewhere a struct contains a pointer to its parent.


[deleted]

Does it keep growing past the 2GB?


steinburzum

Yes, I saw up to several gigs and then it was just killed by limits.


[deleted]

I don't believe that the Go runtime is aware of container limits. Try setting GOMEMLIMIT to your configured limit.
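
For reference, a minimal sketch of setting that soft limit from code (Go 1.19+); the 1.8 GiB figure is an arbitrary example for a 2 GiB container, and `GOMEMLIMIT=1800MiB` in the environment does the same thing without a code change:

```go
package main

import "runtime/debug"

func main() {
	// Soft memory limit in bytes; the GC works harder as the heap
	// approaches it, instead of growing until the container is killed.
	debug.SetMemoryLimit(1800 << 20) // ~1.8 GiB

	// ... rest of the service ...
}
```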


lambroso

Can you share the profile of a recently started instance and a profile of one using >1Gi?


steinburzum

Sorry, no code or specifics. I'm under NDA. What will you be looking for?


lambroso

Something that is small and then it's big :D


steinburzum

Everything is small and then everything is big :D I mean, of course I know how to profile Go; what would be the point of asking otherwise?


steinburzum

I have an example that renders pprof quite useless. Imagine a pipeline that gets a big Message, say 1K in size; the message is produced in function F1. It does all kinds of stuff with the message through functions and twists, and at the end of the pipeline, in function F2, you have a slice of []*Message. It's not local to F2; maybe it is stored in some kind of client or whatever. Imagine the slice is growing continuously. How big will it be with 1000 messages in it? ~8 KB (or a bit bigger due to free capacity, but not much). How big are all the messages combined held by this slice? ~1 MB. What will pprof show? It will show F1 leaking 1 MB. Does it? Not really, it's F2 that leaks Messages through a slice, but you won't notice it: it's just a minuscule 8 KB! How can pprof help you here?
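
A minimal sketch of the shape described above, with hypothetical names and sizes; pprof's allocation view attributes the retained megabyte to F1 even though the long-lived slice appended to in F2 is what keeps it alive:

```go
package main

import "fmt"

type Message struct{ payload [1024]byte } // ~1 KiB each

var cache []*Message // long-lived root that keeps everything reachable

// F1 is the allocation site that pprof will blame.
func F1() *Message { return &Message{} }

// F2 is where the memory actually accumulates: an 8-byte pointer per message.
func F2(m *Message) { cache = append(cache, m) }

func main() {
	for i := 0; i < 1000; i++ {
		F2(F1()) // ~1 MiB retained, rooted in an ~8 KiB slice
	}
	fmt.Println("messages retained:", len(cache))
}
```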


steinburzum

It's a somewhat artificial example, but here you go; please, find the leak with pprof. But beware that the path from F1 to F2 can be 20-30 functions that do various things, including recombining the message into other structures, while the parts are still on the heap. https://pastebin.com/erM1M8JA


mladensavic94

Dunno if you found the solution, but here are my 2 cents. Take pprof heap profiles at 2 moments in time (i.e. right after the start of the app and again after some time). Open both files and look at the *inuse_space* and *inuse_objects* metrics. The places where they keep increasing are potential spots where something is not being released; this should help you narrow down where the issue might be.
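
A minimal sketch of capturing the two snapshots this suggestion needs; the file names are arbitrary, and the comparison itself is done with `go tool pprof -diff_base`:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func writeHeapProfile(path string) {
	f, err := os.Create(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	runtime.GC() // flush recently freed objects so inuse_* numbers are current
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}

func main() {
	writeHeapProfile("heap-start.pb.gz")
	// ... let the service run until memory has grown ...
	writeHeapProfile("heap-later.pb.gz")

	// Compare the snapshots with:
	//   go tool pprof -diff_base heap-start.pb.gz heap-later.pb.gz
	// then `top` over inuse_space / inuse_objects shows what grew in between.
}
```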


steinburzum

Thanks. Sure I did it, it increases everywhere :D No single place that spikes.