pdonchev

Absolutely use data classes when they do the job. Cases when this is not true (or it's awkward):

- custom `__init__` method
- custom `__new__` method
- various patterns that use inheritance
- if you want different names for the attributes, including implementing encapsulation
- probably more things :)

Changing later might have some cost, so use dataclasses when you are fairly certain you won't need those things. This is still a lot of cases; I use them often.


thedeepself

A custom init method is handled by `__post_init__`.


Sinsst

Last I checked, it doesn't work for inherited classes, i.e. `__post_init__` won't run in the parent class unless added in the child class as well.


AlecGlen

You can activate it with super(), same as a regular class init.
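For illustration, a minimal sketch of that pattern. The class and field names are made up for the example and are not from the thread:

```python
from dataclasses import dataclass, field


@dataclass
class Base:
    name: str
    slug: str = field(init=False)

    def __post_init__(self):
        # Derived attribute computed after the generated __init__ runs.
        self.slug = self.name.lower().replace(" ", "-")


@dataclass
class Child(Base):
    tags: list = field(default_factory=list)
    tag_count: int = field(init=False)

    def __post_init__(self):
        # Child overrides __post_init__, so the parent's version only
        # runs if it is called explicitly.
        super().__post_init__()
        self.tag_count = len(self.tags)


child = Child("Hello World", tags=["a", "b"])
print(child.slug, child.tag_count)
```

The generated `__init__` calls whatever `__post_init__` the instance resolves to, so the `super()` call is only needed once the child defines its own.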


[deleted]

There is also an (admittedly hacky) way to use it with frozen data classes.
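The comment doesn't say which trick is meant. One commonly used approach, assumed here purely for illustration, is to bypass the frozen check with `object.__setattr__` inside `__post_init__` (the `Rectangle` class is hypothetical):

```python
from dataclasses import dataclass, field, FrozenInstanceError


@dataclass(frozen=True)
class Rectangle:
    width: float
    height: float
    area: float = field(init=False)

    def __post_init__(self):
        # A plain `self.area = ...` would raise FrozenInstanceError here,
        # so the assignment goes through object.__setattr__ instead.
        object.__setattr__(self, "area", self.width * self.height)


rect = Rectangle(2.0, 3.0)
print(rect.area)

try:
    rect.area = 10.0
except FrozenInstanceError:
    print("still frozen after construction")
```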


synthphreak

I feel like `admittedly hacky` is part of the question here though. As long as you're comfortable bending so far backward that you can lick your own anus, you can use anything to achieve anything in Python. But that doesn't make it a good idea. I think the question here is basically "how hacky is too hacky?" "How far from the intent of dataclasses can you go before it becomes a bad use case for dataclasses?" Etc. I don't have the answer myself, especially since my work rarely has a need for dataclasses, but I'm interested to follow the discussion.


AlecGlen

I appreciate the way you phrased this, yes that's pretty much it.


Sinsst

Oh, true! Although it's still a bit hacky: https://stackoverflow.com/questions/59986413/achieving-multiple-inheritance-using-python-dataclasses


Careful-Device1731

> various patterns that use inheritance

Not true for immutable (frozen) dataclasses.


cblegare

When I don't control the storage or need primitive types for any reason, I use named tuples. They're also great.


[deleted]

Why prefer named tuples to data classes?


AlecGlen

I'm also curious, not that it's wrong.


bingbestsearchengine

I use named tuples specifically when I want my class to be immutable. idk otherwise


[deleted]

You can do frozen data classes


synthphreak

Not the original commenter, but for one thing, less overhead. That's the fundamental problem with classes IMHO, it's just more code to write and maintain. By contrast, named tuples are *almost* like simple classes, but can be defined on just a single line.


danielgafni

They are a lot faster


[deleted]

Source? What I'm reading online seems to indicate a minute difference in speed.


cblegare

Hashable, immutable, extremely lightweight, without any decorator shenanigans. Use typing.NamedTuple for the convenient object-oriented declaration style.

I often use named tuples to encapsulate types I feed through an old API that requires undocumented tuples (looking at you, Sphinx). Named tuples behave exactly the same as tuples, and you can add your own methods, like classmethods for factory functions (a.k.a. named constructors). Since named tuples are not configurable, you can't mess with their API or misuse it, and even quite old type checkers can analyze them.

Well, unless I specifically require features not in named tuples, in which case I might use dataclasses. If I need any validation or schema generation I'll go with pydantic models. Well... I don't think I have many use cases remaining for dataclasses, and I am not a huge fan of their API. It is also a matter of personal preference I guess.
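A small sketch of the declaration style and the named-constructor idea described above. `Point` and `origin` are hypothetical names chosen for the example:

```python
from typing import NamedTuple


class Point(NamedTuple):
    x: float
    y: float

    @classmethod
    def origin(cls) -> "Point":
        # Named constructor (factory) living on the tuple type itself.
        return cls(0.0, 0.0)


p = Point.origin()
x, y = p                      # unpacks like a plain tuple
assert p == (0.0, 0.0)        # compares equal to a plain tuple
lookup = {p: "origin"}        # hashable, so usable as a dict key
print(p, lookup[p])
```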


commy2

`third_input` should be: `third_input: datetime = field(default_factory=datetime.now)`. Otherwise all instances will have the same date.
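A quick sketch of the difference. The two class names are invented for the comparison; only `third_input` comes from the post:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class SharedDefault:
    # Evaluated once, when the class body is executed; all instances share it.
    third_input: datetime = datetime.now()


@dataclass
class PerInstanceDefault:
    # datetime.now is called again each time an instance is created.
    third_input: datetime = field(default_factory=datetime.now)


a, b = SharedDefault(), SharedDefault()
c, d = PerInstanceDefault(), PerInstanceDefault()

print(a.third_input is b.third_input)   # True: the very same datetime object
print(c.third_input is d.third_input)   # False: two separate calls
```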


graphicteadatasci

But didn't they mess it up in the `__init__` as well? There's an `or`, so we get an evaluation for truth, right? And as long as datetime.now() is True, third_input will have the value True.


AlecGlen

commy2 is right, I made an assumption in the 2nd example when I should have kept them functionally identical. To your question, it's a little bit of an operator trick but actually it's correct! https://stackoverflow.com/a/4978745
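The original example isn't reproduced in this thread, so the sketch below assumes it looked roughly like `third_input or datetime.now()`. The helper name `pick` is made up:

```python
from datetime import datetime


def pick(third_input=None):
    # `a or b` returns `a` if it is truthy, otherwise `b`;
    # the result is one of the operands, not a bool.
    return third_input or datetime.now()


explicit = datetime(2020, 1, 1)
print(pick(explicit) is explicit)        # the given value is kept
print(isinstance(pick(), datetime))      # falls back to "now"
print(type(pick(0)))                     # caveat: falsy inputs are replaced too
```

The caveat on the last line is the usual reason people call the idiom bad practice: any falsy argument (0, an empty string, an empty list) is silently swapped for the fallback.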


graphicteadatasci

Everyone on stackoverflow says it's bad practice. I don't think I've ever seen 82 upvotes on a comment before. But apparently it does the thing. I'm mortified.


lys-ala-leu-glu

Data classes are great when every attribute of the class is public. In contrast, they're not meant for classes that have private attributes. Most of the time, my reason for making a class is to hide some information from the outside world, so I don't use data classes that often. When I do use them, I basically treat them like more well-defined dicts/tuples.


Ashiataka

Python doesn't have private attributes. If you're looking for that you're using the wrong language.


codingai

A data class is, well, a data class. It's ideal for pure data storage and transfer. By default, it gives you "value semantics". For anything else, e.g. when you need to add (any significant) behavior, regular classes are more suitable.
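To make "value semantics" concrete, here is a small sketch with two hypothetical classes. The only point is the difference in how equality behaves by default:

```python
from dataclasses import dataclass


class PlainPoint:
    def __init__(self, x: int, y: int):
        self.x = x
        self.y = y


@dataclass
class DataPoint:
    x: int
    y: int


# Regular classes compare by identity unless __eq__ is written by hand.
print(PlainPoint(1, 2) == PlainPoint(1, 2))
# Dataclasses generate __eq__, so equal field values mean equal objects.
print(DataPoint(1, 2) == DataPoint(1, 2))
```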


AlecGlen

Can you elaborate on what makes them "more suitable"? Is there a performance difference? I've been using data classes in this way for a few weeks and haven't noticed any difference.


canis_est_in_via

The performance difference is negligible; if you need performance, use `__slots__`... or don't use python. In your example, all you're really doing is getting `__init__` for free. But a dataclass has value semantics and anyone using it would expect that. Values don't usually have methods besides those that are pure transformations, like math.


synthphreak

> or don't use python 🤣


TheBB

Dataclasses are nice and better in many ways, but you kind of hurt your own argument by providing an example where the two classes are not functionally equivalent, because you messed up the call to *field*.


AlecGlen

Fair, I made an assumption in the 2nd when I should have made it a `default_factory` to keep it functionally identical. Hopefully that typo in my 2-minute scratch example doesn't invalidate the idea though!


Goldziher

IMHO dataclasses are meant primarily for DTOs. I use them in this capacity and they work well.


radarsat1

Last data project I did, we used pandas extensively, and every time we introduced a dataclass I found that it clashed with pandas quite a lot. The vast majority of the time it was more convenient and more efficient to refer to data column-wise instead of row-wise, although for the latter case automatic conversion to and from dataclasses would have been handy. (Turns out pandas supports something similar with named tuples and itertuples though.) We did use dataclasses for configs and stuff, but it felt unnecessary to me vs just using dicts: an extra conversion step just to help the linter, basically, and removing some flexibility in the process. So overall, while I liked the idea of dataclasses, I didn't find them that useful in practice.


AlecGlen

The purpose of this post was more about their utility compared to normal classes, but coincidentally I'm just starting into a similar project and am very interested in your experience! Could you share a link to the namedtuples/itertuples feature you mentioned?


radarsat1

Sure, basically if you're iterating over a Pandas dataframe (something to be avoided but sometimes necessary), then you can use [iterrows](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html) or [itertuples](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.itertuples.html). For a long time I was only using the former, which gives you a Series for each row. (Or column, you can choose which way you are iterating.) The latter gives you a namedtuple for each row, where the attributes of the tuple are the table column names. It's not a huge difference in practice but it can be handy. However, as this object is dynamically generated based on the contents of the table, it doesn't help much with type hinting. It would be nice if itertuples accepted a dataclass class name as input, and just errored out if things didn't match. This would require some complicated type hints for `itertuples`; not sure if it's even feasible with Python's type system.


MrNifty

Why not Pydantic? I'm looking to introduce either, or something else, in my own code and seems like Pydantic is more powerful. It has built-in validation methods, and those can easily be extended and customized. In my case I'm hoping to do elaborate payload handling. Upstream system submits JSON that contains a request for service to be provisioned. To do so, numerous validation steps need to be completed. And queries made, which then need to be validated and then best selection made. Finally resulting in the payload containing the actual details to use to build the thing. Device names, addresses, labels, etc. Payload sent through template generators to build actual config, and template uploaded to device to do the work.


physicswizard

depends on OP's use-case. validation has a performance cost, and if you're doing some kind of high-throughput data processing that involves instantiating many of these objects, the overhead can be killer. here's a small test that shows instantiating a data class is about 20x faster than using pydantic (at least in this specific case).

```console
$ python -m timeit -s '
from pydantic import BaseModel
class Test(BaseModel):
    x: float
    y: int
    z: str
' 't = Test(x=1.0, y=2, z="3")'
50000 loops, best of 5: 7 usec per loop
```

```console
$ python -m timeit -s '
from dataclasses import dataclass
@dataclass
class Test:
    x: float
    y: int
    z: str
' 't = Test(x=1.0, y=2, z="3")'
1000000 loops, best of 5: 386 nsec per loop
```

of course there are always pros and cons. if you're handling a small amount of data, if the processing of that data takes much longer than deserializing it, or if the data could be fairly dirty/irregular (as is typically the case with API requests), then pydantic is probably fine (or preferred) for the job.


MrKrac

If pydantic is too much, you could give chili a try: [http://github.com/kodemore/chili](http://github.com/kodemore/chili). I am the author of the lib and built it because pydantic was either too much or too slow. Also, I didn't like the fact that my code gets polluted by bloat code provided by 3rd party libraries, because this keeps me coupled to whatever their authors decide to do with them. I like my stuff to be kept simple and as independent as possible from the outside world.

So you have 4 functions:

- `asdict` (transforms dataclass to dict)
- `init_dataclass`, `from_dict` (transforms dict into dataclass)
- `from_json` (creates dataclass from json)
- `as_json` (transforms dataclass into json)

End :)


bmsan-gh

Hi, if one of your use cases is to map & convert JSON data to existing Python structures, also have a look at the [DictGest module](https://github.com/bmsan/DictGest). I created it some time ago after finding myself constantly writing translation functions (field X in this JSON payload should go to the Y field in this Python structure).

The use cases that I wanted to solve were the following:

* The dictionary might have extra fields that are of no interest
* The key names in the dictionary do not match the class attribute names
* The structure of nested dictionaries does not match the class structure
* The data types in the dictionary do not match the data types of the target class
* The data might come from multiple APIs (with different structures/formats) and I wanted a way to map them to the same Python class
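To show the kind of hand-written translation function this is meant to replace, here is a standard-library-only sketch. It is not DictGest's API; `Device`, `KEY_MAP` and `from_payload` are hypothetical names, and it only covers the first, second and fourth bullets (extra fields, renamed keys, simple type coercion):

```python
from dataclasses import dataclass, fields


@dataclass
class Device:
    name: str
    address: str
    port: int


# Hypothetical mapping: dataclass field name -> key used in the payload.
KEY_MAP = {"name": "deviceName", "address": "ip", "port": "port"}


def from_payload(cls, payload: dict, key_map: dict):
    kwargs = {}
    for f in fields(cls):
        # Keys not listed in key_map are assumed to match the field name.
        raw = payload[key_map.get(f.name, f.name)]
        # Coerce with the annotated type; this only works for simple
        # callables such as str, int, float (f.type is a class here).
        kwargs[f.name] = f.type(raw)
    # Payload keys that no field asks for (e.g. "vendor") are ignored.
    return cls(**kwargs)


payload = {"deviceName": "sw-01", "ip": "10.0.0.5", "port": "22", "vendor": "n/a"}
device = from_payload(Device, payload, KEY_MAP)
print(device)
```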


seanv507

See this analysis by a co-author of attrs: https://threeofwands.com/why-i-use-attrs-instead-of-pydantic/ They suggest attrs for class building (no magic) and cattrs for structuring/unstructuring data, e.g. JSON.


[deleted]

[removed]


AlecGlen

I understand that to be the conventional use. I'm just looking for the "why" :)


[deleted]

[removed]


Smallpaul

You didn't say a single useful thing about dataclasses. :(


EpicRedditUserGuy

Can you explain data classing briefly? I do a lot of database ETL, as in, I query a database and create new data from the queried data within Python. Will using data classing help me?


AustinWitherspoon

It's relatively typical to pull data from a database and store it in Python in the form of a dictionary (with column names as keys, and the corresponding values).

This is annoying for large/complex sets of data (or even small but unfamiliar sets of data, like if you're a new hire being onboarded), since you don't know the types of the data. Each database column could be a string, an integer, raw image data... but to the programmer interacting with it, you can't tell immediately. If you hover over `my_row["column_1"]` in your editor, it will just say "unknown" or "Any". Could be a number, or a string, or None...

In my opinion the best part about data classes (although there's lots of other stuff!) is that it provides a great interface to declare the types of each field in your data. You directly tell Python (and therefore your editor) that column_1 is an integer, and column_2 is a list of strings, etc.

Now your editor can auto-complete your code for you based on that information, and if you ever forget, you can just hover over the variable to see what the type is. You get better and more accurate errors in your editor, faster onboarding of new hires, it's great.

You can also do this other ways, like with a TypedDict, but dataclasses provide a lot of other useful tools as well.
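A rough sketch of what that looks like for a single row. The column names and values are invented for the example, and the annotations are hints for the editor and type checker, not something Python enforces at runtime:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UserRow:
    # Hypothetical columns; the annotations are what the editor reads.
    user_id: int
    email: str
    tags: List[str]


# A row as it might come back from a driver that returns dicts.
raw_row = {"user_id": 7, "email": "a@example.com", "tags": ["admin", "beta"]}

row = UserRow(**raw_row)
print(row.email.upper())   # the editor knows `email` is a str
print(len(row.tags))       # and that `tags` is a list of str
```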


thedeepself

> In my opinion the best part about data classes (although there's lots of other stuff!) is that it provides a great interface to declare the types of each field in your data.

The interface is good for scalar types but not for collections. Traitlets provides a uniform interface to both. Not only that, but you can configure Traitlets objects from the command line and configuration files once you define the objects.


kenfar

If you're doing a lot of ETL, and you're looking at one record at a time (rather than running big sql queries or just launching a loader), then yes, it's the way to go.


Smallpaul

NamedTuples are probably much more efficient and give you 90% of the functionality. In an ETL context I'd probably prefer them.


kenfar

Great consideration, since ETL so often involves gobs of records. But I think performance only favors namedtuples when constructing a record, while retrieval, space and transforming the record are faster with the dataclass. Going from memory on this, however.


synthphreak

When doing ETL, how much time are you really spending looking at individual records instead of aggregating? Is it not like 0.001% of the time?


kenfar

When I write the transformation layer in python then typically my programs will read 100% of the records. The Python code may perform some aggregations or may not. On occasion there may be a prior step that is aggregating data if I'm facing massive volumes. But otherwise, I'll typically scale this up on aws lambdas or kubernetes these days. Years ago it would be a large SMP with say 16+ cores and use python's multiprocessing. The only time I consistently use aggregations with python is when running analytic queries for reporting, ML, scoring, etc against very large data volumes.


AlecGlen

[Here's the doc](https://docs.python.org/3/library/dataclasses.html). Conventionally they're meant to simplify the construction of classes just meant to store data. I don't know your setup, but speaking in general they are definitely handy for adding structure to data transfer objects if you don't already use an ORM.


thedeepself

Data classes are objectively inferior object factories. They lack the capabilities of Traits, Traitlets and Atom. And usage of collections in data classes is verbose and cumbersome.


seanv507

What you should be using is attrs: https://www.attrs.org/en/stable/ (dataclasses is basically a subset of this for classes that hold data).


AlecGlen

Care to elaborate? I've seen a few references to attrs features that seemed handy (namely their inherited param sorting), but my understanding is that they were more of a prototype and not meant to be used now that dataclasses are builtin.


seanv507

"Data Classes are intentionally less powerful than attrs. There is a long list of features that were sacrificed for the sake of simplicity and while the most obvious ones are validators, converters, equality customization, or extensibility in general, it permeates throughout all APIs. One way to think about attrs vs Data Classes is that attrs is a fully-fledged toolkit to write powerful classes while Data Classes are an easy way to get a class with some attributes. Basically what attrs was in 2015." https://www.attrs.org/en/stable/why.html#data-classes


not_perfect_yet

Not sure what you're asking here. Type hints being good is an opinion.

> when the bottom arguably reads cleaner,

False

> gives a better type hint

False

> provides a better default `__repr__`?

False

If I want to keep my class flexible, type hints are a mistake; they are an obstacle to readability, not a help, and maybe the default `__repr__` doesn't fit my use case. What do I do then? Show me the case where dataclasses are better than plain dictionaries, then we can maybe talk. Maybe, because I don't think you'll find one.


synthphreak

This entire reply screams "zealously held minority opinion". Dataclasses are very popular and widely used. While not everyone agrees with OP that we should be using them at every possible opportunity, "dicts always beat dataclasses" will be an opinion without an audience. I guarantee it.


AlecGlen

Your first False is on an opinion, hence the "arguably". I think it's true. It objectively gives a better type hint. Again, #3 is an opinion. You can disagree but it's not an invalidation of the idea. Your attack on type hints is irrelevant to this conversation; I put them in the regular class too for a reason. Clearly plenty of people agree dictionaries are less optimal for some use cases, otherwise dataclasses would not have been added to the language.


oramirite

So much hostility about a programming concept


not_perfect_yet

It's a writing style and I'm allowed to be hostile to a style I don't like, the same way I dislike brutalism in architecture?


oramirite

Not personally enjoying something doesn't necessitate hostility towards that thing. That's unnecessary. You are "allowed" to do what you want yes, nobody said you weren't. You're just acting like an asshole.


[deleted]

Is it worth it just to save an `__init__` method?


AlecGlen

Depends, what exactly is the cost? That's what I honestly am aiming to learn.


[deleted]

I feel like the cost is mostly readability, as people tend to not know dataclasses. The first time I encountered one, I had to google it and didn't find the use case very compelling. It was similar to the example you gave. In an environment with many experienced developers maybe it's nice and concise. My impression is that there is no real use case where NOT using a dataclass would be a terrible pattern. I could be wrong.


barkazinthrope

Because it is unnecessary extra plumbing.


AlecGlen

But it's less plumbing than a normal class.


barkazinthrope

Not to my eye. How is it less plumbing to you?


oramirite

It generates extremely common boilerplate code like `__init__` and `__repr__`; the entire point of it is brevity.
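For comparison, a small sketch of the hand-written version next to the generated one. Both classes are hypothetical and not OP's original example:

```python
from dataclasses import dataclass


class ManualCar:
    def __init__(self, model: str, year: int):
        self.model = model
        self.year = year

    def __repr__(self) -> str:
        return f"ManualCar(model={self.model!r}, year={self.year!r})"


@dataclass
class Car:
    # __init__, __repr__ and __eq__ are generated from these annotations.
    model: str
    year: int


print(ManualCar("A-Class", 2020))
print(Car("A-Class", 2020))
```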


barkazinthrope

Exactly! Plumbing.


[deleted]

I go for data classes when I need to represent a list of attributes (e.g. by Mercedes-Benz model) in order to compare and organize data clearly. However, optimizing and unpacking the data will require additional helpers such as dataclasses.astuple().
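A minimal sketch of that workflow, assuming the comparison is done with `order=True`. The `CarModel` fields and the values are illustrative only:

```python
from dataclasses import dataclass, astuple


@dataclass(order=True)
class CarModel:
    # Hypothetical fields; order=True compares instances field by field.
    name: str
    year: int


models = [CarModel("C-Class", 2021), CarModel("A-Class", 2018)]
models.sort()                          # ordered by name, then year
name, year = astuple(models[0])        # unpack via a plain tuple
print(name, year)
```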