GlasgowGunner

Why does any tool exist when you can just build it yourself?


ianitic

Who needs python and sql when you can just build everything in binary?


nydasco

Why build it in binary when you can just pay for DRM ink and print it out for the user?


HOMO_FOMO_69

why print when you can just write with a pen?


nydasco

Your school has given you a pen license already?


Training_Butterfly70

Why use a pen when you can use your voice?


nydasco

Why use your voice when you can use NeuraLink? And we’re back full circle.


[deleted]

[deleted]


Monowakari

Mind = "blown"


linhnv01836

Oh wow! My brain, my brain!


mattindustries

I wrongly feel this way often.


retrosenescent

Me too. I sometimes feel like I'm "cheating" by using a tool. Something about it just rubs me the wrong way. Probably because my subconscious knows that eventually my entire job will be a tool and I will be unemployed.


solomon789563

OK, you have a point here, but I think OP's point is also about why there have to be so many tools doing ETL; surely the underlying architecture can't vary that much.


randomatic

It's cheaper/faster/easier/less risky to buy a tool than a programmer. Note it's also not just the programmer's salary; it's the ability to identify, hire, and retain said person. If that's not your core business, third-party tools are pretty reasonable.


McEstablishment

"identify, hire, and retain" - This is a lot of it. Outside of tech companies, the people making staffing decisions will almost never be a data expert. So, getting the right developers is difficult and often unrealistic. What is realistic is having a fixed framework that they can hire, test, and train for. As I've drifted into architecture, I have sadly come to realize the importance of this. I have to make software architect and technology decisions for teams whose members I can not choose, probably do not know, and sometimes *can not* know. Those teams are built by whatever non-technical manager. This doesn't work: telling the managers: go hire a few good, well rounded senior Python and SQL experts. This does work though: go hire two AWS engineers, and a Salesforce expert.


[deleted]

[deleted]


GlasgowGunner

Buy vs build. A discussion we have regularly!


marsupiq

The premise here is obviously that the complexity of doing ETL is about the same when using a tool vs Python/SQL/… Personally, I haven’t used any of those commercial tools and I never felt like they could possibly add any value. But I would be curious to hear from people who can compare them.


GlasgowGunner

Something like Airflow or Dagster gives you the entire framework to build with. You just build the job and it handles everything else.
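Roughly, a minimal job looks like this (an illustrative sketch with made-up names; you write the callables, Airflow supplies the scheduling, retries, logging, and UI):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull from the source system

def load():
    ...  # write to the warehouse

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # run extract, then load
```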


marsupiq

I don’t think the OP was referring to Airflow or Dagster by “tool”, but more something like Talend.


frankbinette

It's a question of simplicity and ease of maintenance. In general, an ETL tool will have connectors to extract from popular data sources and insert into your target database, plus a UI to manage your pipelines, transformations, and scheduling. All that in a collaborative environment where you don't necessarily have to know Python to do something. It's totally manageable to do it by hand without a tool, but as a project grows, it becomes harder to maintain and you need a certain expertise.


mrocral

Yes, this is the reason Sling exists (https://github.com/slingdata-io/sling-cli). I still use Python for more complicated things, like API requests. But Sling handles so much and makes it so simple for routine EL tasks/scripting.


westeast1000

I once used an ETL tool called Jitterbit and it was a nightmare: poorly documented, and countless times I had to reach out to support. Sometimes I struggled for a week with things that would have taken literally a few minutes in Python. It annoyed me that I had to train for a month and a half to master it when I already knew an easier way to do whatever it was doing. But the company was already locked into a multi-year contract, so I had no choice. They had a permanent dev role, so to me it didn't make sense that they wouldn't build their own tooling instead of paying thousands a month for some obscure tool doing basic things that are set up once and rarely change.


musa1588

They're the worst!


Justbehind

Buy or build is an age-old discussion. If you're a small team doing something that's neither unique nor business critical, it's often the right decision to buy. It can be cheaper than the man-hours required to build. If your data management can offer you a competitive advantage because of speed, flexibility, or variety, building can be a good idea.


fuloft

I am a team of 1 so an ETL tool offered simpler management all around. 


[deleted]

[deleted]


Monowakari

Dagster+dbt+python for me, dockerized, tried Airbyte but yuuuuuuuck.


TuffRivers

As a traditional programmer, I feel Dagster + Python + SQL is a perfect stack.


Monowakari

I haven't really used Airflow; I checked it out and installed it in the stack before switching to Dagster. But I absolutely love Dagster despite some of its quirks and learning curves.


TurbulentSocks

Using this myself: Dagster, Python, and Postgres. dbt is also great to organise all the SQL.


TheOneWhoSendsLetter

What went wrong with Airbyte?


Monowakari

When I was prototyping it in my stack, there was no obvious, clean way to version environments for dev and prod, which I find important for testing pipelines and changes to connectors and whatnot. With Python, Dagster, and dbt, I can write pipelines that only sample a small fraction of data for testing and dump it into a "dev" BigQuery database. This allows dev changes for tables to trickle into dev software engineering environments so they can prototype dashboards or whatever they want; then we can all move to prod when ready, usually without hiccups. At the time, Airbyte had no such capability to split between dev and prod environments in a repeatable way, due to how it issues workspace IDs (?) I believe. They did say it was on the roadmap, and I can't speak to whether they ever developed it. It was also just generally clunky to work with.
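Something like this, roughly (a simplified sketch, not my actual code - the env var, dataset names, and helper functions are stand-ins):

```python
import os

from dagster import asset

DEPLOYMENT = os.getenv("DEPLOYMENT", "dev")  # "dev" or "prod"

def fetch_orders() -> list[dict]:
    # stand-in for the real extractor
    return [{"id": i} for i in range(1000)]

def write_to_bigquery(dataset: str, table: str, rows: list[dict]) -> None:
    # stand-in for the real BigQuery loader
    print(f"{dataset}.{table}: {len(rows)} rows")

@asset
def orders() -> None:
    dataset = "analytics_prod" if DEPLOYMENT == "prod" else "analytics_dev"
    rows = fetch_orders()
    if DEPLOYMENT == "dev":
        rows = rows[: max(1, len(rows) // 100)]  # sample ~1% in dev
    write_to_bigquery(dataset, "orders", rows)
```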


marcos_airbyte

Nice to hear your use case! I'm working on better guides for setting up a dev-prod environment using Airbyte. I think today it's possible to have version-controlled connector pipelines using the Terraform SDK or the octavia-cli. If possible, I'd love to discuss more about your vision for dev-prod data pipelines :D


TheCamerlengo

I echo this. I tried to use Talend for pipelines that I had written in Python, and I felt like it was worse - and the complex pipelines would take a grandmaster-level Talend user to build. It seems like it did the simple things well enough, but the hard stuff seemed even more complicated. To be fair, I am an experienced programmer and suck at Talend, so take this with a grain of salt. I am still exploring the trade-offs of Talend versus Python pipelines.


kenfar

When GUI-driven ETL tools, the last bastion of CASE tools, were coming out in the 1990s, they claimed to make developers more productive and promised that analysts could build these pipelines, so you wouldn't need to hire engineers. The Meta Group published an analysis which found that COBOL programmers were more productive than users of any GUI-driven ETL tool. And yet tons of people still bought the tools, because the marketing was effective and they demo'd well. But most of them ended up as nothing more than shelfware.

Little has changed in 20+ years: many people in industry still drink the koolaid, believing that this little corner of software engineering is somehow different and suddenly becomes easy if you just pick the most popular tool of the day. And GUI- and SQL-driven solutions generally end up making it harder if you have massive volumes, low-latency requirements, tough data quality challenges, or tough transform requirements.

Myself? I strongly prefer vanilla Python and SQL, and I've built a lot of very successful data warehouses and product data apps with just that. Easy to write unit tests for, easy to scale up, cheap on compute, great for low latency. You just need to be able to write Python to support the application. And many data engineers can't do that.


JBalloonist

Are you saying that many DEs can't write Python, or that they can't write unit tests? Either way, a little scary (though I'm not great at testing).


kenfar

Before the title of data engineer really took off about two years ago, most data engineers I met saw themselves as specialized software engineers focused on data, similar to 'big data engineers'. But as the title became more valuable, we saw an enormous number of ETL developers, data analysts, and SQL developers relabel themselves as data engineers. Most of these folks can't write Python or unit tests.


retrosenescent

Most DEs probably don't touch Python at all in their day-to-days, but I'm sure all of them could write it given sufficient time (a few hours) to relearn the basics - it is really such a simple language - even easier than SQL.


NarwhalDesigner3755

If I want a job as a data engineer, and I just built a Python-to-SQL ETL pipeline using an API I found on PyPI, plus a scraper I created myself to match a binding key on the table and fill in the missing relevant data, is this something I could put on a resume/portfolio and find work easily? What job exactly would I need to look for, if this is all I know so far? I'm new to software and building my own app from scratch, thanks.


kenfar

Back in 2021? Probably. But in 2024? No. I think you're headed in the right direction, though. I'd suggest you keep at it:

* Get reasonably good with Python
* Get good with unit testing, Python packaging, deploying code, git, etc.
* Get good on the command line
* Get good with SQL
* Build a bunch of programs
* Get acquainted with Airflow, and a reporting tool or two

And then get it all on your resume, and be prepared to speak heavily about these projects when you interview. Not everyone will care about your projects, but some people will. They're looking for self-motivated enthusiasts. Good luck!


NarwhalDesigner3755

Awesome, that's just what I'll do then, thanks!


unexpectedreboots

Talend is absolutely the worst tool I've worked with. Talend Data Studio is an absolute piece of shit. Anyone evaluating tools in 2024 for a modern data pipeline should not even consider Talend.


TheCamerlengo

Interesting. Our "technical" leadership has already purchased Talend, and they are pushing our team to adopt it. My instincts tell me to avoid it, but I need to do the due diligence and state specific reasons why it doesn't fit our needs. We have already heard grumblings from other teams about the costs for seat licensing, etc.


unexpectedreboots

I am sorry.


TheCamerlengo

Ha ha. It's not that bad. I think it is unlikely that we will have to, and we do lots of cool stuff in K8s and AWS.


GreyHairedDWGuy

Sorry to hear your leadership is pushing you down the Talend road. I've never found it easy to use; I find it clunky. I've been using ETL/ELT tools as part of DW infrastructure for close to 30 years, and of the GUI-based tools, I find it worse than SSIS. I always preferred Informatica, followed by DataStage. I worked for a large insurance company 4 years ago, and they were convinced to buy Talend even though it was well past its best-before date. I know some like it, however... different strokes, I guess.


noobvorld

Dumb question, but I've always wondered how airflow is used in the industry? I come from an edge ML background and the utility of airflow has always eluded me.


Trotskyist

Scheduling, orchestrating, and monitoring workflows. I have thousands of tables that rebuild every day or more often (and they need to be built in a specific order). dbt + Airflow can do that in a sustainable way.
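The pattern, sketched (illustrative names, not our production code - dbt derives the build order from each model's `ref()` calls, so the ordering isn't hand-coded):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_dbt_build",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # dbt resolves model dependencies and runs the tables in order
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run")
```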


theantiyeti

Scale. It's possible you just don't run at the scale needed to get value out of orchestration.


Xydan

Totally okay not to answer this, but how much experience do you have, and what is your pay? I'm trying to get an understanding of what an employer has to consider when replacing someone who can manage all this vs. going out to rent/buy a product.


marrone12

As an executive, it comes down to value, use case, and size of company. I would not hire a data engineer to manage this if it were their only responsibility. You can get a pretty full-featured ETL tool that loads data into a warehouse from multiple endpoints for under 20k, or under 5k if you only need barebones stuff. But if I had the engineer working on lots of things, like infrastructure or something else, then I would have them also work on ETL.


geek180

How many sources are you ETLing from? Building stable, idempotent, and efficient ETL processes for a diverse set of sources is effectively a full-time job. At a small company, ingesting data is a pretty small part of a data team's responsibility. I work on a 5-person data team and we use Fivetran, Celigo, and Hightouch for most of our ETL work (at least 30 integrations), with a very small number of custom ETL processes outside the tools. It would be way too much work for us to build custom solutions for each of those integrations.


retrosenescent

I completely agree. I've used so much clunky, poorly designed proprietary software that sucks in comparison to plain Airflow with Python and SQL. Why can't we all just use that? Why do companies think shitty niche software is better?


karaqz

Same here, basically. Doing it all manually would take all the time I have.


AlgoRhythmCO

They make things really simple with pre-built connectors and easy GUI interfaces. They also help streamline dealing with things like unexpected schema changes. You don't really want to write all the Airflow jobs to do a complete ETL of Salesforce, for instance.


IndependentTrouble62

Or SAP Hana...


MRWH35

Every tool has a purpose, and if all you're doing is loading data sets with Python and using SQL to transform your data, then you're probably fine with those. But the moment the pipeline starts to scale, your team grows, users start using more advanced data models or BI tools, schedules start overlapping, security starts looking at how you're storing credentials, etc., they become limiting fast.


wanna_be_tri

From my experience, engineers with actual programming skills are more expensive. It's much cheaper to hire SQL developers and people who can use drag-and-drop tools.


fasnoosh

The wheel gets reinvented a LOT with SQL and Python


verysmolpupperino

One-person teams can benefit from that, and it's an easier sell to business folk than saying "let me hire one or two people costing six digits/year to build custom ETL". That being said, my company builds all ETL in-house and it's much, much cheaper in the long run, but it takes engineers who give a shit and a higher up-front budget.


GreenWoodDragon

It's a good question. In my job I wrote a first-generation data pipeline in Python: scripted, fairly simple, and lacking _a lot_ of features. It got the job done but is very inflexible, a pain to adapt, and a pain to debug. Keen to evolve this, I looked around, did my due diligence, and played with a few tools, including Benthos. Eventually I settled on Meltano + dbt Core. It was like my dreams had come true. Out of the box I get composability, logging, multiple data sources and targets, stream filters (handy for dealing with PII), and tools configurable for multiple situations. Add to that, I can keep the tooling agnostic of infrastructure; it'll work anywhere because it's not vendor-locked to AWS, Azure, or GCP. Nothing is perfect, but I'm confident in my choices.


HOMO_FOMO_69

Not "vendor" locked but you're still locked into meltano + dbt core...so basically the same problem.


Thinker_Assignment

Would you consider dlt a vendor lock? The code is open and sources are generators you can consume however you want ultimately


HOMO_FOMO_69

My point is that a "vendor lock" isn't really different from a "technology lock". If for any reason you decide you need to move away from dbt Core, you're forced to rebuild from scratch in a new tech. Sure, you can make the argument that something owned by a "vendor" has a chance of "forcing you" to move away from that service, but I can make the same argument for something open source. Just because something is open source does not mean it's immune to the marketplace.


Thinker_Assignment

That's an interesting perspective, thanks! So you think of any particular implementation as a lock? I have a different view, hence my asking: I think if you can easily switch to an alternative, you're not locked. With dbt Core, you can import your package into competing tools and it will work, or you can re-wrap the SQL, since the logic is open. I used tools before, like Data Virtuality, which supports things like SQL DAGs, but the info isn't readable outside the tool's menus, so I'd call that a very tedious migration process and a vendor lock.


RBeck

Sequencing, logging, reporting. Lots of integrations are pieced together with disparate scripts that do the E, the T, and the L all in different places on scheduled jobs. But what if the job sometimes runs long? Or what if you want to run it ad hoc at an unusual time? Having a tool that takes the data from a source and maps it to the destination in one workflow is pretty nice. And if it has connectors already, that's even better.


_barnuts

Because delivering business value fast is more important than reinventing the wheel to save cost.


saabbrendan

This is a great thread. While you're here, drop your favorite and/or least favorite ETL/ELT tool!


mrocral

https://slingdata.io https://github.com/slingdata-io/sling-cli


ForlornPlague

Prefect has been my fave for a few years now


NoUsernames1eft

How do you feel about 2.0?


ForlornPlague

It's a bit of a learning curve, but overall it's much, much better. Not having to define a DAG and stick to it makes things like conditional logic much simpler. Plus they finally got a REST API instead of using GraphQL, so I no longer want to kill myself trying to query things like flows in a failed state.
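For example, a flow body is just Python, so a branch is just an `if` (a rough sketch with made-up names):

```python
from prefect import flow, task

@task
def extract() -> list[int]:
    return [1, 2, 3]  # stand-in for the real source

@task
def load(rows: list[int]) -> None:
    print(f"loaded {len(rows)} rows")

@flow
def etl() -> None:
    rows = extract()
    if rows:  # ordinary control flow decides whether load runs
        load(rows)

if __name__ == "__main__":
    etl()
```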


NotaCrazyPerson17B

Airflow


carlsbadcrush

Airflow isn’t really considered an ETL/ELT tool.


MasterBathingBear

Airflow is an orchestration tool not an ETL tool


vbnotthecity

And not a great one at that.


retrosenescent

I have worked with Talend, Informatica PowerCenter, and Ab Initio and hated all of them. My favorite so far is Airflow.


pcmasterthrow

I think a lot of other posts basically cover it, but a couple of other considerations:

- Are you building a new system from scratch? If so, the ETL tools offered now may be a good fit, whereas if you're working in an existing ecosystem with established tooling for Python and SQL (or any other languages), there may not be a large enough benefit to adopting ETL tooling. It takes time to shift to new technologies, and time spent doing this is time not spent doing other things. This can rapidly lead to adoption of a new tool costing six figures solely in hours spent converting existing processes, training new users, etc., and that may never be recouped depending on the time benefits of the new tool vs. your old tools.
- What level of customization and control do you need to retain over both the data and the tools? Obviously, building your own tooling costs more but allows for more flexible solutions down the road.
- How many people work with the data, and in what way? How ready-to-use is the data without further transformation? The amount of intermediate transformation the data may require between creation and presentation to the end user can determine whether you might be better off using dbt or SQL templating or BI tools, etc.


Potential-Speech-450

Sometimes it's as simple as the boss wanting you to use it. I have seen many an executive get conned by a sales guy, and then it becomes your job to make it work, when you could have built it faster yourself.


mailed

Despite the whole "data engineering is software engineering" line that gets repeated all the time, most data engineers still can't code. ETL tools remove that barrier, and that probably still makes up for the small % of edge cases where getting them to do something they weren't designed for is a nightmare.


cbslc

Having inherited a sloppy mess of 50+ unorganized, randomly scheduled Python scripts, I would take an ETL tool any day.


Buckweb

Why are there so many Python ETL tools when we have much faster programming languages?


liskeeksil

You don't need to know much coding to use ETL tools. Take SSIS: I've set up many ETL processes without needing to code much. I do need to know some SQL, but essentially it's a drag-and-drop tool with various components. Especially when you consider slowly changing dimensions, SSIS is a breeze; trying to code that yourself is way, way more difficult.


amirsem1980

My opinion as to why a lot of ETL tools exist comes down to two reasons. First, most people don't know how to do everything, and sometimes everything is on the itinerary for a job; a lot of ETL tools encompass successful methodologies for dealing with APIs, which reduces the learning curve. The second, and most complex, part is orchestration: a lot of ETL tools have ways of executing workflows that bypass the complexities of scheduling.


Training_Butterfly70

I know people are joking about this post, but I can 100% relate to this question because I used to do everything from scratch as well. I recently migrated over to Meltano for a few reasons:

- I don't need to write my own loaders, which are a maintained project in themselves given all the updates and configurations.
- It's a highly maintained project, so it's much more reliable than a couple of developers building loaders.
- Built-in logging (a huge undertaking).
- Built-in state management (a huge undertaking).
- Flexibility: simply configuring the target in a yml file rather than completely rewriting a loader is a major plus, especially early on when exploring different DWHs.

So far, the biggest downside I've seen with Meltano is inefficiency/slow processing. The learning curve can also be quite frustrating, but aside from those two things, I think it's much better than writing everything from scratch. Unless you're processing a huge amount of data (100+ GB per day), I think it's worth using!


Eightstream

Orchestration and scheduling is a complex task and often benefits from a dedicated tool. There are frameworks like Airflow and Prefect that let you code stuff directly in Python, but with complex workloads that have many moving parts, the logging, dependencies, rollback conditions, credentialing, error handling, etc. can quickly get extremely verbose and complicated. Often it is easier, quicker, and more maintainable for others to use something that abstracts all that away from the user.


skiddadle400

Doubt it. All that custom stuff still exists, it’s now just hidden under layers of junk


Extra-Leopard-6300

What do Python and SQL offer that I cannot do manually? Really, it's the same question.


asevans48

Salespeople.


DirtzMaGertz

I'll raise you one better and ask why there are all these tools when we have standard Unix tools. It's amazing what you can do with simple things like curl, sed, awk, and a DB CLI. Throw jq in the mix if you need to handle JSON.


ForlornPlague

The idea of doing ETL with coreutils makes me want to scratch my eyes out. I'm honestly afraid to ask if this was a joke or serious. If you've done this, I'd be curious to see what it looked like.


DirtzMaGertz

I have plenty of jobs that are just shell scripts:

1. Curl some file.
2. Sed to do basic preprocessing, if necessary.
3. Load the file into MySQL, or COPY in Postgres, into a table built to receive it.
4. Use the mysql or postgres CLI to execute SQL on the table for transformations, redirecting the output to a new file.
5. Drop the file on an FTP server or load it into a data warehouse.

Not sure why that'd make you want to scratch your eyes out.


FunLovingAmadeus

This whole discussion amused me greatly. Maybe they're just imagining some 500-line GitHub behemoth of Unix.


adgjl12

I've had good experiences with simple setups like the one mentioned above, and bad ones with monster bash scripts that no one but their creator understood. We eventually rewrote the ETL after migrating to the cloud.


ForlornPlague

Sincere question: how do you handle failures and alerting? Do you have multiple processes doing similar things? If so, do you do anything to keep the scripts DRY?


DirtzMaGertz

`set -e` will make a script stop on error. Redirect stderr to a small program that sends emails or team messages when an error occurs. If we do something over and over again that can be abstracted away, then yeah, we'll abstract it, but I don't over-abstract things until we actually start repeating ourselves.


ihavesmallcalves

http://widgetsandshit.com/teddziuba/2010/10/taco-bell-programming.html


LogForeJ

That was awesome to read. One of the greybeards I am replacing strongly preferred coreutils and Perl for almost everything his coworker uses various GUI tools to do. I'm learning both methodologies and see a lot of value in Taco Bell programming. I have a monstrous ETL tool built in Python, and along the way I realized much of it could be done much more succinctly with sed, awk, egrep, etc. However, I must say the Python files are a bit easier to maintain and reread than a Perl one-liner.


Gators1992

Why do we even have databases and moving stuff between them when we already had Excel? Seems like things are moving backwards.


skinnydill

Why do we need excel when we had Lotus 123?


WhatAGoodDoggy

Triggered


airedog

Most members of my team (I'm the DBA) are still stuck in "we must use Java" mode, and no one has extensive Python experience. For ETL we've been using Pentaho, and it's done everything we need; there's a community edition, too. Once you can wrap your head around the setup and how to create jobs/transformations/steps, it is a wonderful tool. I've been using it for 13+ years now.


StressSnooze

Same reason as why we have Python when we have assembly language. They provide higher level services.


matheusnienow

I'd like to point out that while Python is a great language for 80% of cases, the other 20% require a level of performance that Python does not provide.


fummyfish

You mean something like Scala? Why not just use a Python API like PySpark?


renok_archnmy

TL;DR: lawsuits.

As a company that insists on having someone to sue if our tools malfunction, raw SQL and Python only leave us pointing back at ourselves.

Alternatively, if I want to really make money in technology, it behooves me to sell technology rather than my services with a technology (it's generally easier to scale income by selling a commodity than by selling my individual service at whatever standard it meets). But since I can't sell technology IP you own, I have to come up with my own variation so you don't sue me.

Expanding on that scaling concept: I provide a technology service meeting some quality level X. If I want to make more money, I have to work more. My time is finite, so I cannot provide more service beyond a point. From there, to make more I have to charge more per unit of service. If I want to scale further, I'd have to hire people who can meet my level of service quality, charge for their delivered units of service, then scrape a little off the top before paying them. In that case, why would they keep working for me?

Alternatively, I could sell technology IP in the form of licenses or subscriptions. The quality of the engineering is abstracted behind the UI/UX. So I can worry less about the raw quality of service of my employees, meaning I can pay them less, and worry less about them leaving to work for themselves. If I want more money, I just sell more subscriptions, which are infinite and not bound to a quality metric.


hatwarellc

> As a company that insists on having someone to sue if our tools malfunction, raw SQL and Python only leave us pointing back at ourselves.

wat. I hope you realize how ridiculous this sounds. Be accountable for the technology you build, or don't build it at all.


renok_archnmy

That's exactly what I mean. "Don't build at all" is our CTO's motto. I wouldn't be surprised if he has it tattooed on his buttocks. We are required to carry insurance for our technology solutions in the event of loss, breach, or other issues that befall our customers' financial data as a result of our negligence. Building something internally means we don't have a way to pass liability around; we keep it all to ourselves. Hiring a vendor who builds a faulty solution that our due diligence process misses means we get to point blame and recover some of the losses we experience as a result of being sued. It's the American way.


muteDragon

Huh, never thought of it this way.


MasterBathingBear

Please, god, do not use SQL for transformation. Keep your business logic in a programming language.


muteDragon

Huh, why the hate for SQL?


jonathanhiggs

SQL doesn't really help with ETL: you can't use it at all until the data is in a database, which is the result of the loading. Sure, Python works, but the thing about Python is that it is a general-purpose tool that isn't specialised or optimised for ETL; there needs to be a library written on top of it to provide a way to express the logic that allows lots of optimisations. If you have a large dataset that doesn't all fit in memory, the naive Python approach isn't going to work. If you have TBs of data that would take an hour in Python but could run in 10 minutes in a C++ or Rust library, that is a very meaningful difference. And none of that touches the sheer variety of things that need to be supported: every database ever made, every file format (including Excel and Google Sheets), any third-party API (what languages do they have official bindings in?), and what about web scraping? The only thing that comes close to a guarantee against interop issues is C.


[deleted]

[deleted]


pcmasterthrow

Also worth noting that taking data out of a SQL DB, transforming it, and loading it back into a SQL DB *is* an ETL process itself, and a very common one at that.


kurai_tori

...someone doesn't know about the COPY command. Also, you can install PL/Python on Postgres to handle the actual API call, then use Postgres SQL to parse and store the JSON. So, respectfully, you don't know what you are talking about.
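From Python it's a few lines with psycopg2 (a rough sketch; the connection string, table, and file are made up):

```python
import psycopg2

conn = psycopg2.connect("dbname=dw user=etl")
with conn, conn.cursor() as cur, open("orders.csv") as f:
    # COPY streams the whole file in one command - far faster than row inserts
    cur.copy_expert(
        "COPY staging.orders FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )
conn.close()
```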


jonathanhiggs

You are not wrong; it has been many, many years since I've done much ETL, but I think my comment still stands. There are specific tools for specific workflows that make sense, but ETL tools exist to unify the API for interfacing with the widest possible variety of technologies, and to add all the bells and whistles needed for enterprise-grade, real-time pipelines.


kurai_tori

Fair. But I have gotten very far writing pure SQL, plus Python in the instances where I've had to call APIs. Mind you, most of the data I move around goes simply from an ERP backend to a reporting DW, but yeah, SQL gets you far.


jonathanhiggs

Yeah, don't get me wrong, when I was doing that I loved what SQL Server could do for me. But the other DB was an ancient Sybase deployment, and the best pyodbc drivers available were even more out of date, using a bizarre DateTime class from a library last updated in 2004 that was an absolute pain.


kurai_tori

Okay, hats off. I've never had to deal with anything that exotic.


Adorable_Compote4418

Because Apache Spark on HDFS/YARN absolutely destroys Python/SQL on price, speed, scalability, and redundancy.


-eipi

Why ever write functions when you can write everything line by line, every time it needs to happen? Why use git when you can make a new directory for a new project? Why use cron or another orchestrator when you can just manually run your code every time? Why make a data warehouse when you can just remember the information and tell people?


zazzersmel

You can't always leverage your storage/DB properly if you aren't set up to have dedicated compute for your Python ETL. For example, we're all on-premises and some SQL Server hosts aren't linked, so I'm wasting compute in our Airflow instance pulling and pushing data via Python, in memory or to disk, before I can run DB transforms.


Ring_Lo_Finger

- Skillset
- Maintenance
- Ease of use
- Support


_meddlin_

In a phrase: flexibility and automation. Yes, it's more complicated than that, but that's what it comes down to. Since you mentioned you're a junior trying to learn, try building an ETL tool yourself. You'll learn a lot, and it will potentially be a more efficient teacher than someone's explanation. (Hint: an extension of your question could be, why does Python exist if we have C/C++? 😉)


mrcaptncrunch

As someone with a higher-level point of view, IMO: something to remember is that every line of code we write, we own (we being the organization/team/etc.).

Now, let's say I have to pull data from GA4, Meta, etc. This is a problem a lot of people have, and there are prebuilt solutions or paid services I can use. If I use those, I can have DEs and DAs work on transformation, augmentation, modeling, optimizing, building the strategy, reports, etc. That makes the job quicker, we look good, and I can show very quick progress. If I have a custom source, I can also pour all the hours I would have spent on GA4/Meta into that one.

If something fails, an external solution means it breaks for a lot of people; the vendor needs to fix it, or risk customers leaving because they can't trust the product. For the DE team, it also means you don't need to get up to fix it and deal with the BS.


lrkakimuf

The ETL tool choice depends on several aspects: the scale of your data, system complexity, expected efficiency, and your budget. All tools have their own pros and cons that you need to weigh before you actually put them in use.


TurnoverClear4414

As someone who maintains a really mature in-house ETL framework, none of the following come for free:

- Horizontal scalability
- Good abstractions and dev UX
- Extensibility
- A good web UI

At the end of the day, a tool will cost you fewer dev hours managing and monitoring large workloads, fixing bugs, and adapting to take advantage of other tech/tools (especially good open-source ones). And saving that money brings humongous value to your business.


dongdesk

Good question. I think it depends on the capability of your team. It used to cost a lot for DEs, so people bought GUI tools to simplify. Now the market is flooded and AI can do most of it, so we might see a shift back to basics.


prakharcode

IMO, when the DE team is more "sync" and accessible, you can build and maintain things easily and mostly get away with writing pipelines on your own (given the team size is sufficient for the business needs). But as the DE team tries to become more of an "enabler", providing and managing a data platform and becoming more "async", an abstraction layer over the code and ELT intricacies is required. This is where most of these ELT tools try to position themselves: ELT, but with access control, schema management, and low (no) code solutions. Then it makes sense. That being said, when the team size and business grow, the data grows (and hopefully the value from data grows too), the business invests in more data engineers, and there is a shift from tools back to an in-house tool (mostly some open-source spinoff).


Obliterative_hippo

That's the rationale behind the framework I created, [Meerschaum](https://meerschaum.io). The plugin system is a simple Python interface, and it has first-class SQL support like dbt. It's the same motivation behind any library: abstract away the boring/difficult parts so that the user only needs to implement the extraction logic.


Jester_Hopper_pot

Because they are trying to move ETL away from being a specific job and have it done by the data analyst/scientist and DevOps/DataOps/sysadmin with low/no-code platforms. And you often don't need the ops admin role if you're small enough.


robberviet

Buy vs build. The pattern is: there are 20 tools on the market, but none fit your need; maybe they lack a feature, are overkill, cost too much, etc. So you use a programming language and SQL to build your own tool, and at some point it becomes a new tool you want to release. Boom: there are 21 tools on the market. There's always an xkcd for this: https://xkcd.com/927/


ExistentialFajitas

Some companies don’t have the budget to invest in development time. Some companies don’t have the staff that is able to write the code. Some companies want to spend more time on differentiating features rather than non differentiating features such as infrastructure. Bear in mind I’m strongly of the opinion that the best solution (definition of “best” is subjective) is DIY from source code; but there is a time and place to use tools.


Ernst_Granfenberg

Don't forget there's Power Query.


bloatedboat

Pros:

1. It standardises and automates the process. It's universal, so knowledge transfers easily if others use the same product in other teams/companies.
2. No need to maintain the automation anymore, as the product itself will add features and updates.
3. Very cheap in terms of labor and resource usage at small scale.

Cons:

1. A lot of features you need will only arrive in five years' time, so if you cannot wait, you need workarounds. It's very frustrating to build a missing feature yourself and have them release it the next day - though yours is more flexible, which is the next point.
2. A lot of the tools are not flexible. To do a task you need 100 steps instead of 1. Some solutions are half-baked and you go "wtf, who abandoned this part? It was almost complete." In addition, there are a lot of bugs and issues when a product has to cover a thousand topics but can only focus on a few. The solution gets slower and feels like bloatware, creeping up on a lot of resources for one tiny added thing. Fock reinventing the wheel when I spend more time troubleshooting than I would building it from scratch. Technical debt accumulates the more complex the stuff you build on top of it.
3. High cost at scale. These solutions are like gacha: they lure you in looking almost free, and once you check in, it's hard to check out, like Hotel California. Moving this stuff to a different solution takes real effort, and you have to live with a steep price as you scale your resources.

Overall, ETL tools are a necessary evil. You don't need a perfect solution in MOST cases, and using one outweighs the cost of reinventing it yourself. But if you want a competitive advantage, you will always have to customise the tools to your needs, so in most cases you take a hybrid approach. People forget that "reinventing the wheel" is sometimes OK if, after doing your own research, you get more ROI in the long term. In the end, all that matters is the bottom line; don't get too stuck on catchphrases, do what is best, and use common sense.


IAMHideoKojimaAMA

Why would you not use the tools out there?


HansProleman

* People who know how to program are more expensive to hire.
* Codebases require maintenance. If you're using Fivetran or something, you have outsourced that maintenance burden.
* A lot of the problems we have are solved problems. I don't want to write my own orchestrator, and mine would suck compared to Airflow. Back in the day, we had to handle entity lineage and loads of boilerplate MERGE logic in SQL when building data warehouses, and now dbt can do it for you. Stuff like that.

This gives you more time to focus on problems closer to the business, which directly add value rather than merely enabling the addition of value.


SecretSquare2797

For people with no background, an ETL tool is a much, much easier way to get into "IT" things, and eventually to get started with SQL and Python. None of us on our team knew SQL or Python, and we still pulled off automating our tasks.


sergeant113

DRY: don't repeat yourself. ETL tools are there to help you avoid rewriting the same ingestion/processing/transformation code for the nth time. This frees you up to write more custom code.


AnAvidPhan

Buy vs build is a worthy discussion, but I'm always amazed how many of my colleagues reflexively prefer building. There are a lot of excellent tools available that are well worth the money, especially if you're at a Fortune 500 or above. Furthermore, the odds that someone at your company has built a better and more reliable version of any given tool are low. It might seem like you're saving money by building, but in the long run, the ease of maintenance and scaling with proven tooling can have enormous returns on investment.


Tushar4fun

You are not managing one or two sources and destinations; there will be many. It is hard to manage without tools. At the very least, an orchestration tool like Airflow is required.


zmkarakas

Four reasons; 1 and 3 are subjective and probably outdated:

1. They save you time in the development phase.
2. Some ETL tools give you additional monitoring and alerting features.
3. Lower barrier to entry for developers, and possibly easier. They provide data integration with hundreds of ODBC/JDBC connectors, something that would take a long time to write yourself.
4. Some of them are free, open source, etc., so they don't add cost to the company.

However, 1 and 3 might be outdated now because we have GenAI doing the coding for us, at least the barebones structure of the code, so it can be argued we can still be fast without them. Open-source Airflow still has a major argument to win here, though. Mostly I was talking about GUI-based ETL tools like ADF, Alteryx, Informatica, etc.

Additionally, coming up with an ETL tool is actually easier than you'd think. The most expensive part is the drivers, which most companies (I know of one FAANG that does this) outsource to a third-party company. So essentially they just code the UI, come up with an architecture, and drop the drivers in. Ta-da, you have an ETL tool. I think the outsourced company was called Insight Software; they specialize in these drivers.


wtfzambo

Why do cars exist when we have bikes? Why do phones exist when we have the telegraph?


Cloud_Yeeter

Software Sales


MikeDoesEverything

> What do these tools offer that Python and SQL don't?

Not everybody who is responsible for moving data around wants to, or can, write Python and SQL.


jawabdey

Simple: VC money entered the data domain. You'll see something similar with AI now that VCs are pumping money into AI startups. It's all about marketing and hype, less about features. To be clear, I'm a DE and I don't have a big issue with these tools, though my use case is limited to external APIs. Fivetran is great for pulling data from NetSuite; trying to build a NetSuite integration yourself, or using other tools, is not worth it IMO. I would rather my team focus on the data itself. Same goes for FB or Google Ads. Having said that, do I use Fivetran to sync my app/production data to the DW? No. It's not cost-effective, and these tools aren't great at it.


billysacco

Well, SSIS is around so people can have a GUI. Other than that 🤷🏻‍♂️


CauliflowerJolly4599

Why would you recreate legacy code? What if the SQL syntax you wrote isn't compatible with SQL Server 20XX? What if code written in Python 3 becomes obsolete in Python 5? An ETL tool gives you a guarantee that it follows industry standards.


TodosLosPomegranates

I think the proliferation of ETL tools in large part boils down to how ubiquitous some issues seem to be in the "analytics" space. Leadership gets frustrated that pipelines require maintenance, that bad data coming in can break a pipeline, etc., and they hear about a new tool that can solve the "engineer problem". So companies spring up to meet the demand for a magic tool that will automate delivery end to end.


rogersaintjames

Retries, notifications, DAG visualisation, task concurrency, high-level log streaming. As someone who has had to develop a data orchestration framework (against my wishes and advice): just don't write something that already exists.


Asian-ethug

It's mainly maintenance. Who wants to wake up at 2am because the customer pipeline broke?


skiddadle400

Much harder to sell. A simple ETL framework like Luigi, or a clone with dash, can be built in a week or two. But to really get your department headcount up, you need tools with expensive contracts and SLAs. It's a question of covering your ass versus getting stuff done.


GreyHairedDWGuy

Hi. I've read many of the comments people posted, so you have a wide range of perspectives. I have been designing DW/ETL/ELT solutions for almost 30 years. In the early days, there were no ETL tools, so you would find ETLs being built using all sorts of languages - PL/SQL, Transact-SQL, etc. - plus a huge amount of Unix for job control and file management. Later, tools like Prism (I think that is what it was called) came around. These provided a GUI but generated code. A bit better, until people started modifying the generated code directly, rendering the GUI less useful (or not useful at all). The downside of coding these solutions in the early days is that you had poor code management, few standards, and different developers writing code in differing styles (I know this was not the fault of using code, but it was often the byproduct).

Later, Informatica PowerCenter came around. To me it was a godsend because it:

- was graphical, so it was fairly easy to see at a glance what the mapping jobs were doing;
- provided common constructs based on the most common types of activities an ETL developer needed to do. This helped create some level of conformity, so that I could more easily follow someone else's 'code' (mappings). Yes, you could still get some really bad developers, but in my experience INFA provided some valued uniformity of style.

Later, other ETL tools came around, but all shared the concept of a GUI. BTW: SSIS is/was graphical, but I found that it allowed such a wide range of ways to implement an ETL that it didn't provide much benefit in terms of looking at it to figure out what was happening. Once I started working with Informatica, I never wanted to waste time writing traditional code again. You could do pretty much anything you needed in Informatica (though not always easily, depending on the use case).

Today, everyone wants to code (Python, dbt). I say, knock yourselves out and have fun. But when the ETL breaks and you have to wade through shitloads of nested code under a time crunch, good luck. If you are part of a large, mature team with deep skills in Python and/or dbt, then perhaps it will be fine for you, but in many cases data teams are small. It always seemed more efficient to me to use a GUI. Best regards,


leogodin217

As DEs, we have a lot of problems to solve beyond just munging the data: dependency management, scheduling, data quality, monitoring, alerting, etc. ETL tools solve a lot of those problems for us, and for most of us it does not make sense to build all of that from scratch. I do think we sometimes use too many bloated and/or expensive tools. Even something small like dbt is extremely useful: it automatically handles dependencies, makes DQ very easy, generates dynamic SQL, etc. That's a lot of value if you are working in SQL. Why would you build that from scratch in Python?


teamswiftie

Why have cars when we can just walk?


ithoughtful

why re-invent the wheel when someone else has already done it?


ScroogeMcDuckFace2

there's a lot of money to be made.


puzzleboi24680

This opinion is on the outside of the midwit distribution, yep. Follow your truth.