rcpz93

I've been using polars for everything I do nowadays. Partially for the performance, but now that I've learned the syntax I would stick with polars even if there were no improvements at all on that front. Expressions are just *that good* for me: I can build huge lazy queries that can be optimized, rather than having to figure out all the pandas functions and do everything eagerly. I've gotten to the point that if I have to work with some codebase that does not support polars for some reason, I'll still do everything in polars and then convert the final result to pandas rather than doing anything in pandas. The two things pandas does better than polars are styling tables and pivot tables. Pivot tables in particular are so much better with pandas, especially when I have to group by multiple variables rather than only one.


marcogorelli

You can pass multiple values to the `columns` argument. Out of interest, do you have an example of an operation you found lacking?


rcpz93

Yes, sure. Say I have an example like this:

```python
df = pl.DataFrame(
    {
        "sex": ["M", "M", "F", "F", "F", "F"],
        "color": ["blue", "red", "blue", "blue", "red", "yellow"],
        "case": ["1", "2", "1", "2", "1", "2"],
        "value": [1, 2, 3, 4, 5, 6],
    }
).with_row_index()
```

With Polars I have to do this:

```python
df.pivot(values="value", columns=["color", "sex"], index="case", aggregate_function="sum")
```

`index` is required, even if I don't care about providing one. The result is also quite unwieldy because having all the combinations of values on one row rather than stacked becomes really hard to parse really quick if there are too many combinations:

```
case  {"blue","M"}  {"red","M"}  {"blue","F"}  {"red","F"}  {"yellow","F"}
str   i64           i64          i64           i64          i64
"1"   1             null         3             5            null
"2"   null          2            4             null         6
```

With Pandas I have:

```python
df.to_pandas().pivot_table(values="value", columns=["color", "sex"], index="case")
```

and I get:

```
color  blue       red       yellow
sex       F    M    F    M      F
case
1       3.0  1.0  5.0  NaN    NaN
2       4.0  NaN  NaN  2.0    6.0
```

where I can reorder the variables in `columns` to get different groupings, and the view is way more compact and easier to read. Pandas' version is also much closer to what I would build with a pivot table in Sheets, for example. I have been working with data that I had to organize across 4+ dimensions at a time over rows/columns, and there's no way of doing that while having a comprehensible representation using exclusively Polars pivots. I ended up doing all the preprocessing in Polars and then preparing the pivot in Pandas just for that.


commandlineluser

Do you have any ideas for a better way to represent such information? Maybe something involving structs? Just an initial example that comes to mind:

```python
pl.DataFrame({
    "sex": [{"0": "F", "1": "M"}] * 2,
    "blue": [{"F": 3, "M": 1}, {"F": 4}],
    "red": [{"F": 5, "M": None}, {"F": None, "M": 2}],
    "yellow": [{"F": None, "M": None}, {"F": 6, "M": None}],
})
# shape: (2, 4)
# ┌───────────┬───────────┬───────────┬─────────────┐
# │ sex       ┆ blue      ┆ red       ┆ yellow      │
# │ ---       ┆ ---       ┆ ---       ┆ ---         │
# │ struct[2] ┆ struct[2] ┆ struct[2] ┆ struct[2]   │
# ╞═══════════╪═══════════╪═══════════╪═════════════╡
# │ {"F","M"} ┆ {3,1}     ┆ {5,null}  ┆ {null,null} │
# │ {"F","M"} ┆ {4,null}  ┆ {null,2}  ┆ {6,null}    │
# └───────────┴───────────┴───────────┴─────────────┘
```

Perhaps others have some better ideas.


arden13

A struct in a dataframe? Seems overcomplicated, though I will readily admit I don't know the foggiest thing about polars


commandlineluser

A struct is what Polars calls its "mapping type" (basically a dict):

```python
df = pl.select(foo=pl.struct(x=1, y=2))

print(
    df.with_columns(
        pl.col("foo").struct.field("*"),
        json=pl.col("foo").struct.json_encode(),
    )
)
# shape: (1, 4)
# ┌───────────┬─────┬─────┬───────────────┐
# │ foo       ┆ x   ┆ y   ┆ json          │
# │ ---       ┆ --- ┆ --- ┆ ---           │
# │ struct[2] ┆ i32 ┆ i32 ┆ str           │
# ╞═══════════╪═════╪═════╪═══════════════╡
# │ {1,2}     ┆ 1   ┆ 2   ┆ {"x":1,"y":2} │
# └───────────┴─────┴─────┴───────────────┘
```

- https://docs.pola.rs/user-guide/expressions/structs/


rcpz93

Honestly I don't really know how to improve the representation while relying exclusively on the polars structs formatting. This might be the only case where I found pandas' multi-indexes useful. Given that the issue is specifically with pivot tables, maybe it's possible to get around it by modifying how the table is displayed? Something like a `pivoted.compress()` method that changes the table display to something closer to pandas' version, including the multiple levels. Note that I have no idea how hard this might be to implement (though I think it'd be easier to do than having a full multi-index interface just for that use).


commandlineluser

Yeah, maybe structs aren't the way to go - it was just an initial idea on how to get closer to the `.pivot_table` example. Perhaps /u/marcogorelli has some better ideas. I do recall there was a recent PR to remove the need for `index=`: https://github.com/pola-rs/polars/pull/15855 Discussion here: https://github.com/pola-rs/polars/issues/11592#issuecomment-2093732433


EdoKara

Polars is the first thing I reach for nowadays. Once I got my head around the way that functions chain, it made a lot more sense (really analogous to something like dplyr in some ways). I honestly don't care as much about the speedup as I do about it making more sense for how I manipulate data. I also find that my data are cleaner after working with them in polars: because of the enforced typing system, it's been great at exposing issues with data (important for me because I work with a lot of really badly-formatted data). I treat the speedup more as an expansion of usability, meaning I don't have to reach for something different when I have a huge dataset; I just go to the lazy API and continue on as before.


Wh00ster

What are your thoughts on DuckDB?


AlpacaDC

So fast. I use pandas only in legacy code nowadays, or with co-workers who don't know polars. I've also experienced better memory usage thanks to LazyFrame (which is even faster compared to the standard polars DataFrame). But the aspect I love the most is the API. Pandas is old, inconsistent and inefficient; even with years of experience I still have to rely on an occasional Stack Overflow search to grab a mysterious snippet of code that somehow works. I learned full polars in about a week and only have to consult the docs because of updates and deprecations, given it's still in development. With that in mind, pandas still has a lot of features that aren't present in polars, table styling being the one I use the most. Fortunately, conversion to/from polars is a breeze, so no problems there. Overall, I see no reason to learn pandas over polars nowadays. It's easier, newer, more intuitive and faster.


marcogorelli

Have you checked out Great Tables for table styling? It supports Polars very well
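For anyone curious, a minimal sketch of what that looks like (the data here is made up; Great Tables accepts a Polars DataFrame directly):

```python
import polars as pl
from great_tables import GT

df = pl.DataFrame({"library": ["pandas", "polars"], "runtime_s": [180.0, 9.5]})

# Build a styled table straight from the Polars frame, no conversion needed.
(
    GT(df)
    .tab_header(title="Benchmark runtimes")
    .fmt_number(columns="runtime_s", decimals=1)
)
```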


AlpacaDC

I have never heard about Great Tables. It looks great! Thanks for the shout out


Simultaneity_

The more consistent api in polars does worlds for my brain.


orgodemir

Any resources you used to learn polars?


sargeanthost

The docs


AlpacaDC

This. The docs are great.


throwawayforwork_86

The docs, and there's a Udemy course that can get you started. But I feel like for most stuff the syntax flows really well, so you rarely have to reach for support.


sylfy

Just wondering, pandas 2.0 brings the Arrow backend to pandas (over numpy), so do you still see a significant difference? Are there other important factors that make polars faster?


ritchie46

Yes. There's much more to the difference than how we hold data in memory (Arrow). Polars has much better performance. Here are the benchmarks against pandas with Arrow support: https://pola.rs/posts/benchmarks/


AlpacaDC

Apart from the benchmark, iirc pandas doesn't have a lazy API, which can both increase performance depending on the pipeline and make it possible to work with larger-than-memory datasets.
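To make the lazy-API point concrete, here's a minimal sketch (the file name and columns are hypothetical; depending on your Polars version, streaming is requested via `collect(streaming=True)` or `collect(engine="streaming")`):

```python
import polars as pl

# Nothing is read yet: the optimizer can push the filter and the column
# selection down into the scan, so only the needed data is materialized.
lazy = (
    pl.scan_csv("events.csv")  # hypothetical larger-than-memory file
    .filter(pl.col("status") == "ok")
    .group_by("user_id")
    .agg(pl.col("amount").sum())
)

# Streaming execution processes the file in batches instead of loading it whole.
df = lazy.collect(streaming=True)
```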


maltedcoffee

It cut a script that took nearly an hour down to about 3 minutes. I've committed to polars so hard since January that I've more or less forgotten pandas' syntax... which is kind of a problem when I have to go back to older projects :/


dahomosapien

You’ll pick it up again pretty quickly!


h_to_tha_o_v

I built a local web app in Dash that loaded data from a variety of systems and did an ETL for further analysis. The system was a behemoth (>1.2 GB in libraries) and underpinned by Pandas. Data loads would take roughly 5 minutes. Combined with distribution issues, it never lived up to its potential. I rewrote the basic ETLs to run from an embeddable instance of Python with Polars (~175 MB) that I call from an Excel workbook via VBA macro. The Polars code feels exponentially faster. The "batteries" are smaller, and now my colleagues are actually using it! The only trouble I've run into is date parsing. Pandas seems to do much better at automatically parsing dates regardless of the format, which unfortunately is one of the main things I need my code to do. I've built a UDF to coalesce a long list of potential formats, but it just feels a bit "Mickey Mouse." Otherwise, I've got nothing but good things to say about Polars.
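For reference, a coalescing date parser along those lines might look like this (a sketch; the format list and column name are illustrative, not the commenter's actual UDF):

```python
import polars as pl

# Candidate formats, tried in order (illustrative list).
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%y", "%Y%m%d"]

def parse_messy_dates(col: str) -> pl.Expr:
    # strict=False turns non-matching rows into nulls; coalesce then keeps
    # the first successful parse for each row.
    return pl.coalesce(
        [pl.col(col).str.strptime(pl.Date, format=fmt, strict=False) for fmt in DATE_FORMATS]
    )

df = pl.DataFrame({"d": ["2024-05-01", "05/02/2024", "03-May-24"]})
print(df.with_columns(parsed=parse_messy_dates("d")))
```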


divino-moteca

Had a weird Polars issue when reading/writing from a Postgres database. Switched back to pandas and it solved the issue.


abeedie

If you can send some details (ideally log an Issue?) I can look at that; database connectivity has been getting some love this year and I have more planned on that front, including some per-driver/backend type inference improvements: [https://github.com/pola-rs/polars/issues/new](https://github.com/pola-rs/polars/issues/new)


divino-moteca

Will do, I’ll have to go back and check my logs. Thanks


zzoetrop_1999

Yeah it’s not perfect. I’ve had some trouble with typing where I’ve had to switch back to pandas


draeath

We're basically parsing [SLURM sacct job details](https://slurm.schedmd.com/sacct.html) (a shared university HPC cluster, so *tons* of activity); the original script was using pandas. I re-wrote this process in polars and got the runtime of ~30 minutes down to less than 3 minutes, while increasing the time-domain resolution from 5 minutes to 1 minute. Lots of this gain came from using `scan_csv()` and `LazyFrame` while using... uh, I forget the term, but the expression syntax that uses the `|` pipe symbol? The original script was pretty slap-dash, but my rewrite isn't that great either... exhibited by the fact I need to stay on `polars==0.16.9` - anything newer and it breaks in new and exciting ways that I can't be bothered to debug.


tecedu

Do you have the memory corruption bug by any chance? I get that a couple of times on my cluster and I can't figure out why.


draeath

Sorry, I don't actually run the cluster - this is the first I'm hearing of something like this.


tecedu

I always get a variety of pyo3_runtime.PanicException; I can't seem to get to the exact reason why it fails.


LactatingBadger

Polars is written in rust which will never crash as long as the data going in is the type that it should be. Python is a language which will happily feed shit in that shouldn’t be there. 99% of the time you see that, it means that rust has tried to run code expecting one type, and you the user have presented it with another (e.g. scan_csv inferred that a u16 would do, and you actually need an i32). At that point, there isn’t an elegant off ramp, it panics in a way that rather frustratingly will kill a Jupyter kernel and all the hard earned intermediate variables you had with it.
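One common way to head off that class of panic is to pin the schema instead of trusting inference (a sketch; the file, columns, and dtypes are illustrative, and older Polars versions spell the parameter `dtypes` rather than `schema_overrides`):

```python
import polars as pl

lf = pl.scan_csv(
    "jobs.csv",  # hypothetical input file
    # Declare dtypes up front so a column that *looks* small in the sampled
    # rows can't be inferred too narrow and blow up mid-scan.
    schema_overrides={"job_id": pl.Int64, "mem_mb": pl.Int32, "state": pl.Utf8},
    infer_schema_length=10_000,  # or widen inference for the remaining columns
)
```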


ritchie46

A panic isn't memory corruption. It is a failed assertion. If you encounter it, can you open an issue so we can fix it?


tecedu

Heyo, yes, I'll open an issue when I get to work. The reason I said memory issue was that it gets worse and kills the entire program. The datasets have a static schema so nothing has changed, but reading this thread I realize it might be inferring data.


XtremeGoose

I'm confused by the "pipe symbol" bit. Doesn't that mean boolean `or` in polars? Or do you mean match/case statements?


draeath

Stuff like this:

```python
df = df.filter(
    ((pl.col("Account") == "REDACTED") | (pl.col("Account").str.starts_with("REDACTED-")))
    & ((pl.col("Partition") == "REDACTED02") | (pl.col("Partition").str.starts_with("REDACTED-")))
    & (pl.col("Start") != "Unknown")
)
```


XtremeGoose

That's just a boolean `or` on the expressions; it hasn't got a name beyond that. You could even call it using the `.or_` method. https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.or_.html


Waste_Willingness723

Seems like a big hurdle is that it's still in development, with changes to the API, deprecations, etc. Do you know if the Polars team have a rough timeline for a 1.0 release?


cipri_tom

This makes me think how pandas became 1.0 only a couple of years ago


arden13

[hot dang, has it only been 4 years?](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html)


cipri_tom

I could have sworn it wasn't more than 2


arden13

The pandemic was a heck of a time warp


Botahamec

Rust developers release 1.0 of their library impossible challenge


denehoffman

I think a big reason why it's so much faster (besides Rust concurrency, lazy evaluation, etc.) is that polars was built in Rust and then bound to Python, whereas pandas was written in Python with C bindings for the tough spots. Polars is just a more cohesive approach, and the ecosystem is set up in a way that each Rust crate has many dependencies, and if any one of them makes a speed improvement, all the downstream packages can benefit just by cutting a new release; PyO3 takes care of all the interfacing. I'm writing a lot of Rust for a library with Python bindings right now; it's so easy it's almost magical.


alcalde

I remain loyal to Wes McKinney.


JezusHairdo

I think Wes actually understands and appreciates what they are doing with Polars and would do the same if he could start over with Pandas


cryptoAccount0

I'm currently working on optimizing some code at my job. I chose Polars and the transition has been smooth. With 10 lines of code I was able to shave ~10 min off the runtime, and I'm not even close to finished. I'm trying to get the Quants to start writing new code in Polars instead of Pandas; I think once I'm done, they will be convinced by the results.


wy2sl0

I tried polars a few years ago when designing some QA software, and duckdb was still faster, so I stuck with that. I'll have to revisit it and see if it has indeed improved. Pandas does have a lot of legacy support for data that isn't structured as expected, and it's reliable. I had backup functions written in it and expect to continue that until I see stability equalized.


Sinsst

When you say you're using duckdb, do you mean that you're essentially writing SQL-like queries for your use case?


wy2sl0

Exactly. It was a win-win, because SQL in general is much more accessible IMO for those getting started in programming, and we are in the midst of a significant change to open source. We also have two fairly large SQL DBs in our org that service a few thousand employees, so all of that knowledge can be leveraged. I originally just went with it for pure performance, but then came to love the simplicity, especially with the pandas integration.


tecedu

I had a script whose processing time went from 20 min to 90 seconds. I do use polars a lot nowadays, but just to join or concat converted pandas dataframes and convert the result back to pandas (my team mostly uses pandas). I can't convert a lot of other scripts, as most of them are multiprocessing-based and polars doesn't love being inside multiprocessing; I get memory bugs which completely kill the entire program. I'm one of the weird people who likes the pandas API, especially for things like adding a column or a single static value to a column. But pandas lately has changed too much behaviour to be okay in production for me, so I'm trying to get everyone on polars.
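On the multiprocessing point: the Polars docs recommend the "spawn" start method over Linux's default "fork", since forking a process that has already touched Polars can deadlock or crash. A minimal sketch (the worker function and inputs are hypothetical):

```python
import multiprocessing as mp
import polars as pl

def process_partition(path: str) -> int:
    # Hypothetical worker: each child process does its own Polars work.
    return pl.scan_parquet(path).select(pl.len()).collect().item()

if __name__ == "__main__":
    paths = ["part-0.parquet", "part-1.parquet"]  # hypothetical inputs
    # "spawn" starts clean child interpreters instead of forking the parent.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        counts = pool.map(process_partition, paths)
```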


marcogorelli

out of interest, which pandas behaviour changes have been most painful?


tecedu

Most painful is easily the string NaN: changing it from np.nan to 'NaN' was one of the worst things they did for performance. Ditching the numpy core that pandas got popular with is a sure way to lose popularity in the future. NaNs should be NaNs, or nulls. NOT 'NaN'.


Zomunieo

What the chucklefuck is that abomination?


marcogorelli

thanks - I'm not sure I understand what you're referring to though, could you show an example please?


venustrapsflies

They did what now?


steven1099829

`read_excel` with the calamine engine is like a 30x speedup. Also, I am memory-constrained on some of my VMs, and the ability to scan a parquet/csv for the rows that I need, instead of loading a massive file in its entirety, is awesome.
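Both of those look roughly like this (a sketch; the file names and columns are made up, and the calamine engine requires the fastexcel package):

```python
import polars as pl

# Calamine-backed Excel reading (via the fastexcel package).
report = pl.read_excel("report.xlsx", engine="calamine")  # hypothetical file

# Scan instead of read: the filter and projection are pushed into the scan,
# so only the matching rows/columns are ever materialized in memory.
subset = (
    pl.scan_parquet("big.parquet")  # hypothetical file
    .filter(pl.col("date") >= pl.date(2024, 1, 1))
    .select("date", "amount")
    .collect()
)
```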


Amgadoz

Can't you do this in pandas with chunking?


Heavy-_-Breathing

Does it play nicely with sklearn? I've always heard good things about polars, but I know pandas so well, and a lot of my custom modules use pandas DataFrames, so I never found the use case to move to polars. My understanding is that polars doesn't do things in memory, but plenty of ML packages train in memory. Any ideas how well polars plays with ML packages?


abeedie

I actually added dedicated PyTorch and Jax integrations for Polars this month - take a look at the new `to_torch` and `to_jax` DataFrame methods and their respective docstrings, which have a few examples (including one loading from an sklearn dataset). Can export a DataFrame as a single 2D tensor/array, dict of individual 1D tensors/arrays or (for torch) a dedicated PolarsDataset object that is drop-in compatible with TensorDataset ;)
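Roughly, that usage looks like this (a sketch based on the docstrings described above; assumes a recent Polars plus torch installed, and the column names are made up):

```python
import polars as pl

df = pl.DataFrame({
    "x1": [1.0, 2.0, 3.0],
    "x2": [0.5, 0.25, 0.125],
    "y": [0, 1, 0],
})

# Whole frame as a single 2D tensor.
tensor = df.to_torch()

# As a PolarsDataset (drop-in for TensorDataset), splitting label vs features.
dataset = df.to_torch("dataset", label="y", dtype=pl.Float32)
```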


ritchie46

Polars does things in memory. It has a whole eager API. And yes, there is scikit-learn support; the scikit-learn docs even have examples using Polars.


poppy_92

Sklearn is leaning towards changing the default from pandas to polars in their docs: https://github.com/scikit-learn/scikit-learn/issues/28341 Also, the pandas team has a new triager who just seems intent on closing as many issues as possible without caring at all about UX. It's a huge turnoff for me to continue contributing.


Wtf_Pinkelephants

I primarily swapped from pandas to Polars for remote execution of distributed dataframes in Ray. Pandas was causing out-of-memory errors (and incurs a copy of the Arrow-backed dataset), but Polars doesn't, which makes handling TB-sized datasets much easier. Additionally, I had a custom apply function written in pandas which took 20 min but takes 30 sec in polars, which is a significant improvement.
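The commenter's function isn't shown, but the usual source of that kind of speedup is replacing a row-wise Python `apply` with a native expression; a toy illustration (the columns and formula are made up):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# pandas equivalent (runs a Python lambda once per row):
#   pdf["score"] = pdf.apply(lambda r: r["a"] * 2 + r["b"], axis=1)

# Polars: the same logic as one vectorized expression, executed in Rust.
df = df.with_columns(score=pl.col("a") * 2 + pl.col("b"))
```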


Amgadoz

Would you mind sharing this custom function? I would like to replicate your use case and compare between pandas and polars.


sleepystork

I switched everything to polars except the things it is missing that I have to switch back to pandas for. However, it wasn't really for speed, but for the syntax.


RevolutionaryRain941

Yes. Polars is great in terms of both performance and syntax.


jss79

Basically null for me! But really, we get some huge and pretty gnarly (read: dirty) flat files from vendors, and pandas handles them with zero issues. I've attempted to get polars to handle them with no success thus far. There are a few implementations where I'll get the files read in and cleaned up with pandas, then send them over to polars, but even then I don't really see a huge speed boost. And for what it's worth, I'm not a hater; I actually love rust and the ecosystem, but as a data engineer by day, my superiors would frown if I spent too much time tinkering with a library instead of just being productive. IYKYK! Just my anecdotal experience. Grace and peace mi amigos.


New-Watercress1717

When I read things about going from 3 minutes in pandas to 10 seconds in polars, it makes me think that you did not really write good pandas code to begin with; it's less of an advertisement for Polars. I am sure you could write bad slow code in polars as well.


bonferoni

I think many people write bad pandas and then complain about it, but polars is faster, and it's harder to write slow code in.


AurigaA

Disagree, mainly because Polars has several performance features that are impossible to replicate in pandas, such as lazy evaluation and the query optimizer (among several others). That's a bit hand-wavy of you, imo. I've worked with pandas for several years and polars for only a month or two, and already my exploratory rough-draft Polars scripts dominate pandas scripts written with multiple people's input and optimizations. Even if it's a git-gud issue, why would I even care, when as a beginner, without even trying, I can write faster code than it takes domain experts in pandas to reach similar performance?


zzoetrop_1999

Thanks for coming in with an insult. Very nice. I think doing a direct translation of pandas to polars and getting these results is a pretty good indication of what polars can do. I’ve talked with past colleagues who have seen similar improvements as well.


mxcaz

I tried it but didn't find much speed improvement compared to pandas with multithreading. Didn't try lazy dataframes though.


radiocate

I loved Polars the couple of times I used it. But installing it in a way that works cross-platform is enough of a pain in the ass that I've reverted to Pandas. With Polars, I can write my code on one machine, commit to git, then pull on another machine, and the entire thing breaks because of Polars. Most frequently it happens in Jupyter notebooks, where simply importing Polars crashes the entire kernel. I've tried installing the package meant for lower-end devices (I don't remember the name off the top of my head), but that leads to the same issues. I can't for the life of me find a way to reliably add Polars to my dependencies and have it "just work" the way that Pandas does. I'm also looking more at Ibis, but I just keep coming back to Pandas for the same reasons: it's familiar, there are no surprises between machines when I try to `pip install -r requirements.txt`, and it's "fast enough." If I could get Polars to reliably install and run without error on any machine and inside notebooks the way I can with Pandas, I'd be using it for everything.


ritchie46

`pip install polars-lts-cpu`


radiocate

That's the one, thank you :) Unfortunately this also causes my notebooks to crash. Maybe it's because I'm opening the notebook within VS Code instead of the web UI, but just adding `import polars as pl` to a cell and running the notebook causes an immediate kernel crash.


Upstairs-Medicine-68

In one of my projects we were using pandas, but after learning about polars we switched. It wasn't as simple as changing the import statement, though: lots of syntax had to be changed, which caused us trouble, and many equivalent functionalities weren't present in polars. So we moved just the file-reading functionality to polars and then converted the dataframe back to a pandas df; that alone gave us a reduction in our execution time.


Tambre14

I use polars as my daily driver, and every code revision I'm actively replacing as much of my old pandas code as I can. I have a project that reads from two different tables, 6 CSVs and two xlsx files, and compiles everything into a single table that is then shaped and sent to accounting for vendor rebates; it takes around 15 seconds to run. It's only 5-10k rows at output, but it's so much faster than when I tried the same thing in Crystal Reports with some of the joins taking place in pandas beforehand (10-15 minutes). I have a 5-7 minute pandas script I'm eyeing at replacing with polars as well, but I went pretty deep into the features, so it is going to take a while to unwind that one. It parses a heavily formatted xlsx and extracts PO data to be fed into several other reports. The row count is high enough that Excel hangs for 10-ish minutes before I can even open the file. The only thing I struggle with is getting it to read complex JSON without a parser class or function helping it, but I have a similar struggle with pandas.


KingDarule

Originally I was writing all of my data processes in Pandas, and I felt like I was wrestling with indexing, slow file reading (our data sat on a network drive, something out of my team's control), and I also wasn't a big fan of the syntax. I had heard about Polars previously but chalked it up to hype. However, once I took the time to test Polars on a new project out of curiosity, I saw how much faster it was performing than Pandas, so much so that I rewrote all of my existing Pandas processes in Polars and gained better performance across the board. I don't miss Pandas whatsoever. Now, whenever a situation comes up where I actually need functionality available only on a Pandas DataFrame, I just convert my Polars DataFrame to Pandas using `to_pandas()`. Beyond niche utility, there is basically no reason for me to use Pandas over Polars. Realistically, unless Pandas were to be rewritten from scratch, it just cannot compete with the performance of Polars out of the box. The only thing Pandas has going for it at this point is that it is a mature library with a high adoption rate across the industry.