
Creating Simple ETLs with Python

  • Writer: JP
  • Apr 2
  • 4 min read


ETL Made Simple


ETL stands for Extract, Transform, Load. It is a fundamental process in data engineering: collecting information from different sources, standardizing it, and storing it in a suitable destination such as a database or data warehouse.


The purpose of this post is to teach you how to create ETLs in a simple, practical way, especially for those looking to understand the fundamentals. To do that, we'll build a short but very clear example of an ETL in Python. We'll get into the details later; for now, let's dive into the theory.


The Three ETL Steps


  1. Extract


    Collects data from various sources: in the Big Data world, there are endless possibilities for extracting data to gain insights. But what does that really mean? Ultimately, a Data Engineer's job is to make decision-making possible through the collection of data. Of course, they don't do this alone: there's an entire data pipeline, both architectural and business-related, where Data Scientists and Data Analysts work together on a data platform to turn collected data into something valuable.


    But where can you extract data from to build an ETL?

    From the internet using a crawler, from scattered files in the company's repositories, from databases, from APIs, or even from the data lake itself, which serves as a central data repository.


  2. Transform

    The transformation step involves taking the extracted data and enriching or cleaning it, for example by removing duplicates or discarding unnecessary information.


  3. Load

    After extracting and transforming the data, it needs to be loaded into a destination so it can be consumed. The consumption of this data and the decisions that will be made based on it add value to all the previous work.

    This loading can be done into a new file (a CSV file, for instance), into a database, or, most commonly in Big Data scenarios, into a data warehouse. A minimal skeleton of these three steps is sketched just below.
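
Before we get to the full example, here's a minimal skeleton of how these three steps usually fit together in Python. The function names, the sample records, and the CSV destination are illustrative choices, not a fixed convention:

import pandas as pd


def extract() -> list[dict]:
    # Collect raw records from any source: an API, files, a database...
    return [
        {"id": 1, "name": " Alice "},
        {"id": 1, "name": " Alice "},  # duplicate on purpose
    ]


def transform(records: list[dict]) -> pd.DataFrame:
    # Clean and enrich: drop duplicates and trim stray whitespace
    df = pd.DataFrame(records).drop_duplicates()
    df["name"] = df["name"].str.strip()
    return df


def load(df: pd.DataFrame) -> None:
    # Persist to a destination; a CSV file keeps the example simple
    df.to_csv("output.csv", index=False)


if __name__ == "__main__":
    load(transform(extract()))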



Why Is ETL Important?


  • Improves data quality: One of the ETL process’s roles is to ensure data quality after extraction. You can apply any necessary cleaning and formatting during the transformation phase.

  • Facilitates analysis: Data is separated from the production environment, making it easier to consume and improving performance.

  • Automates processes: The entire extract, transform, and load process can be automated to run at specific times during the day. This allows easy access to data without manual intervention.

  • Reduces workload on transactional databases: In many companies, strategic areas consume data directly from production databases, such as the company’s main system database, to generate reports. Depending on the volume, this can affect database performance. That’s why ETLs are created to isolate this data consumption and move it to a more appropriate environment, like a data warehouse.


Popular ETL Tools


  • Talend: Open-source solution with various connectors.

  • Apache NiFi: Ideal for real-time data flows.

  • Pentaho Data Integration (PDI): Commonly used for complex ETLs.

  • AWS Glue: Amazon’s managed ETL service for the cloud.

  • Google Dataflow: Focused on scalable data processing.



ETL Examples with Python


The goal of this example is to walk you through a step-by-step guide on how to create an ETL using Python. Keep in mind that ETL is not tied to any specific programming language, but Python and SQL are commonly used due to the many advantages they offer (which we won't cover in detail in this post).

In this example, we’ll extract data from a public API at https://jsonplaceholder.typicode.com/, which provides a JSON response.


This API returns 100 fictional records related to posts, with JSON objects like the following:


{
  "userId": 1,
  "id": 4,
  "title": "eum et est occaecati",
  "body": "ullam et saepe reiciendis voluptatem adipisci\nsit amet autem assumenda provident rerum culpa\nquis hic commodi nesciunt rem tenetur doloremque ipsam iure\nquis sunt voluptatem rerum illo velit"
}
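
You can check the response yourself with a couple of lines (assuming the requests library is installed, e.g. via pip install requests):

import requests

# Fetch the 100 fictional posts and inspect the first one
posts = requests.get("https://jsonplaceholder.typicode.com/posts").json()
print(len(posts))         # 100 records
print(posts[0]["title"])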

After extracting the data, we will perform a transformation step aimed at enriching the data so that it can be loaded into a CSV file. Once these steps are completed, we’ll have a simple and functional ETL! 🎉


Let’s look at the code below:
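
A minimal version of that script could look like this; the renamed columns (user_id, post_id, content) and the exact regex are illustrative choices:

import requests
import pandas as pd

# Extract: request the posts from the public API and parse the JSON response
response = requests.get("https://jsonplaceholder.typicode.com/posts")
data = response.json()

# Keep the records in memory as a pandas DataFrame
df = pd.DataFrame(data)

# Transform: rename columns to friendlier names...
df = df.rename(columns={"userId": "user_id", "id": "post_id", "body": "content"})

# ...and normalize the content field, collapsing line breaks and
# repeated spaces into single spaces with a regex
df["content"] = df["content"].str.replace(r"\s+", " ", regex=True).str.strip()

# Load: save the result to a CSV file for later analysis
df.to_csv("posts.csv", index=False)
print(f"{len(df)} records loaded into posts.csv")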



Understanding the ETL


Notice that this is a simple Python script divided into three steps.

We start by importing the necessary libraries such as pandas and requests. The latter is responsible for calling the external API, while pandas is used for data manipulation.


After importing the libraries, the data extraction begins: a request is made to the external API https://jsonplaceholder.typicode.com/posts, and the 100 records it returns are parsed from the JSON response.


Next, a DataFrame is created. If you want to learn more about DataFrames, check out this link: Working with Schemas in Spark Dataframes using PySpark. The idea is to leverage pandas' computational power to keep and process the data in memory.


In the transformation step, we rename some columns and normalize certain values; in the content field, for example, we collapse extra spaces and line breaks using a regex.
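
On a single string, that normalization looks like this (the sample text is just an excerpt from the API response):

import re

raw = "ullam et saepe reiciendis voluptatem adipisci\nsit amet autem"
print(re.sub(r"\s+", " ", raw).strip())
# ullam et saepe reiciendis voluptatem adipisci sit amet autem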


After transforming the data, we move on to the load step, where the data is saved to a CSV file so it can be analyzed later.
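
To double-check the load step, you can read the file back (assuming the script above ran in the same directory):

import pandas as pd

# Read the CSV produced by the load step and peek at the result
df = pd.read_csv("posts.csv")
print(df.head())
print(df.columns.tolist())  # ['user_id', 'post_id', 'title', 'content']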


Conclusion


At first, the term ETL might sound intimidating, but it's actually much simpler than it seems. There are many ways to build an ETL: using Python, SQL, or off-the-shelf tools that automate the entire process. You just need to evaluate and choose the best approach for your context.


See y'all




