- First steps with Delta Lake
What's Delta Lake? Delta Lake is an open-source project that manages the storage layer in your data lake. In practice, it's an abstraction on top of Apache Spark, reusing the same mechanisms while offering extra features such as ACID transaction support. Keeping data integrity in a data pipeline is a critical task in the face of high read and write concurrency. Delta Lake provides audit history, data versioning and supports DML operations such as deletes, updates and merges.

For this tutorial, we're going to simulate a data pipeline locally, focusing on Delta Lake's advantages. First, we'll load a Spark DataFrame from a JSON file, create a temporary view and then a Delta table on which we'll perform some Delta operations. We'll use Java as the programming language and Maven as the dependency manager, along with Spark and Hive to keep our data catalog.

Maven

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>3.0.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.0.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.12</artifactId>
  <version>3.0.1</version>
</dependency>
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.12</artifactId>
  <version>0.8.0</version>
</dependency>
```

The code will be developed in short snippets for a better understanding.

Setting Spark with Delta and Hive

```java
String val_ext = "io.delta.sql.DeltaSparkSessionExtension";
String val_ctl = "org.apache.spark.sql.delta.catalog.DeltaCatalog";

SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("app");
sparkConf.setMaster("local[1]");
sparkConf.set("spark.sql.extensions", val_ext);
sparkConf.set("spark.sql.catalog.spark_catalog", val_ctl);

SparkSession sparkSession = SparkSession.builder()
    .config(sparkConf)
    .enableHiveSupport()
    .getOrCreate();
```

Understanding the code above

- We define two variables, val_ext and val_ctl, and assign their values to the keys spark.sql.extensions and spark.sql.catalog.spark_catalog. These are necessary for configuring Delta together with Spark.
- We name the Spark application app.
- Since we are not running Spark on a cluster, the master is configured to run locally with local[1].
- Spark supports Hive; in this case we enable it with enableHiveSupport().

Data Ingest

Let's work with a Spark DataFrame as the data source. We load a DataFrame from a JSON file.

order.json file

```json
{"id":1, "date_order": "2021-01-23", "customer": "Jerry", "product": "BigMac", "unit": 1, "price": 8.00}
{"id":2, "date_order": "2021-01-22", "customer": "Olivia", "product": "Cheese Burguer", "unit": 3, "price": 21.60}
{"id":3, "date_order": "2021-01-21", "customer": "Monica", "product": "Quarter", "unit": 2, "price": 12.40}
{"id":4, "date_order": "2021-01-23", "customer": "Monica", "product": "McDouble", "unit": 2, "price": 13.00}
{"id":5, "date_order": "2021-01-23", "customer": "Suzie", "product": "Double Cheese", "unit": 2, "price": 12.00}
{"id":6, "date_order": "2021-01-25", "customer": "Liv", "product": "Hamburger", "unit": 1, "price": 2.00}
{"id":7, "date_order": "2021-01-25", "customer": "Paul", "product": "McChicken", "unit": 1, "price": 2.40}
```

Creating a Dataframe

```java
Dataset<Row> df = sparkSession.read().json("datasource/");
df.createOrReplaceGlobalTempView("order_view");
```

Understanding the code above

In the previous section, we create a DataFrame from the JSON file inside the datasource/ directory. Create this directory so that your project structure is easier to follow, and then create the order.json file with the content shown earlier. Finally, we create a temporary view that will help us in the next steps.

Creating a Delta Table

Let's create the Delta table from an SQL script. The creation itself is simple, but notice that we use types that differ from those used in a relational database table.
For example, we use STRING instead of VARCHAR, and so on. We are partitioning the table by the date_order field. This field was chosen as the partition because we expect many different dates; that way, queries can use this field as a filter, aiming at better performance. Finally, we define the table as a Delta table through the USING DELTA clause.

```java
String statement =
    "CREATE OR REPLACE TABLE orders (" +
    "id STRING, " +
    "date_order STRING," +
    "customer STRING," +
    "product STRING," +
    "unit INTEGER," +
    "price DOUBLE) " +
    "USING DELTA " +
    "PARTITIONED BY (date_order) ";

sparkSession.sql(statement);
```

Understanding the code above

In the previous section we define a Delta table called orders and then execute its creation.

DML Operations

Delta supports Delete, Update and Insert operations, as well as Merge.

Using Merge together with Insert and Update

In this step, we are going to execute a Merge, which makes it possible to control the flow of inserting and updating data through a table, DataFrame or view. Merge works from row matches, which will become clearer in the next section.

```java
String mergeStatement =
    "Merge into orders " +
    "using global_temp.order_view as orders_view " +
    "ON orders.id = orders_view.id " +
    "WHEN MATCHED THEN " +
    "UPDATE SET orders.product = orders_view.product," +
    "orders.price = orders_view.price " +
    "WHEN NOT MATCHED THEN INSERT * ";

sparkSession.sql(mergeStatement);
```

Understanding the code above

In the snippet above we're executing the Merge operation against the order_view view created in the previous steps. The condition orders.id = orders_view.id drives the matches: if the condition is true, that is, MATCHED, the data is updated; otherwise, NOT MATCHED, the data is inserted. In the case above the data will be inserted, because until now there was no data in the orders table. Run the command below to view the inserted data.

```java
sparkSession.sql("select * from orders").show();
```

Update the datasource/order.json file by changing the product and price fields and run all the snippets again. You will see that all records are updated.

Update operation

It is possible to run an Update without using Merge; just run the command below:

```java
String updateStatement =
    "update orders " +
    "set product = 'Milk-Shake' " +
    "where id = 2";

sparkSession.sql(updateStatement);
```

Delete operation

```java
String deleteStatement = "delete from orders where id = 2";
sparkSession.sql(deleteStatement);
```

In addition to running the Delete command on its own, it is also possible to use it within a Merge.

Understanding the Delta Lake Transaction Log (DeltaLog)

In addition to supporting ACID transactions, Delta generates JSON files that serve as a way to audit and maintain the history of each transaction, covering both DDL and DML commands. With this mechanism it is even possible to go back to a specific state of the table if necessary.

For each executed transaction, a JSON file is created inside the _delta_log folder. The initial file will always be 00000000000000000000.json, containing the transaction commits. In our scenario, this first file contains the commits for creating the orders table.

For a better view, go to the local folder that was probably created in the root directory of your project, called spark-warehouse. This folder was created by Hive to hold the resources created from the JSON and Parquet files. Inside it you will find a folder structure as shown below:

Note that the files are created in ascending order from each executed transaction.
Access each JSON file and you will see each executed transaction through the operation field, in addition to other information.

00000000000000000000.json -> "operation":"CREATE OR REPLACE TABLE"
00000000000000000001.json -> "operation":"MERGE"
00000000000000000002.json -> "operation":"UPDATE"
00000000000000000003.json -> "operation":"DELETE"

Also note that the Parquet files were generated partitioned into folders by the date_order field.

Hope you enjoyed!
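Complementing the DeltaLog walkthrough above, here is a minimal sketch, not part of the original post, of how one of those versions could be read back using Delta's time travel and the versionAsOf option. The version number and the warehouse path are assumptions based on the layout described above.

```java
// Read the orders table as it was at a given commit version recorded in _delta_log.
// Version 1 is assumed here to be the state right after the MERGE shown above.
Dataset<Row> ordersAtVersion1 = sparkSession.read()
    .format("delta")
    .option("versionAsOf", 1)
    .load("spark-warehouse/orders"); // managed table location created by Hive (assumption)

ordersAtVersion1.show();
```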
- Using Comparator.comparing to sort Java Stream
Introduction

Sorting data is a common task in many software development projects. When working with collections of objects in Java, a powerful and flexible approach to sorting is to use Comparator.comparing in conjunction with Streams. In this post, we are going to show that using Comparator.comparing to sort a Java Stream can make sorting elegant and efficient.

What is Comparator.comparing?

Comparator.comparing is a feature introduced in Java 8 as part of the java.util.Comparator interface. It is a static method that allows you to specify a key extractor function (sort key) to compare objects. This function is used to extract a value from an object, and that value is then compared during sorting.

Flexibility in sorting with Comparator.comparing

One of the main advantages of Comparator.comparing is its flexibility. With it, we can sort by different fields of an object, allowing the creation of complex sorting logic in a simple and concise way. Notice in the code below (a sketch of it appears at the end of this post) that we simply pass Comparator.comparing as an argument to the sorted() method, which in turn receives the city field through a method reference (People::getCity), performing the sort by this field.

Output
Monica
John
Mary
Anthony
Seth

Multi-criteria ordering

Often it is necessary to sort based on multiple criteria. This is easily achieved with Comparator.comparing by simply chaining several comparing methods, each specifying a different criterion. Java will carry out the ordering according to the specified sequence. For example, we can sort the same list by city and then by name: Comparator.comparing(People::getCity).thenComparing(People::getName).

Ascending and descending sort

Another important advantage of Comparator.comparing is the ability to sort in both ascending and descending order. To do this, just chain the reversed() method as in the code below:

Output
Seth
Mary
John
Anthony
Monica

Efficiency and simplicity

By using Comparator.comparing in conjunction with Streams, sorting becomes more efficient and elegant. The combination of these features allows you to write clean code that is easy to read and maintain. Furthermore, Java internally optimizes sorting using efficient algorithms, resulting in satisfactory performance even for large datasets.

Final conclusion

Comparator.comparing is a powerful tool for sorting Streams in Java. Its flexibility, ascending and descending sorting capabilities, support for multiple criteria, and efficient execution make it a valuable choice for any Java developer. By taking advantage of it, we can write more concise, less verbose and efficient code, facilitating the manipulation of objects in a Stream. Hope you enjoyed!
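The original code listings were not preserved on this page, so below is a minimal sketch of the kind of code the post describes, assuming a simple People class with getName() and getCity() accessors; the names and cities are illustrative only.

```java
import java.util.Comparator;
import java.util.List;

public class SortExample {

    // Simple model class assumed by the examples above.
    static class People {
        private final String name;
        private final String city;

        People(String name, String city) { this.name = name; this.city = city; }
        String getName() { return name; }
        String getCity() { return city; }
    }

    public static void main(String[] args) {
        List<People> people = List.of(
                new People("John", "New York"),
                new People("Mary", "Chicago"),
                new People("Monica", "Austin"));

        // Ascending sort by city.
        people.stream()
                .sorted(Comparator.comparing(People::getCity))
                .forEach(p -> System.out.println(p.getName()));

        // Multi-criteria: city first, then name; reversed() flips the whole ordering.
        people.stream()
                .sorted(Comparator.comparing(People::getCity)
                        .thenComparing(People::getName)
                        .reversed())
                .forEach(p -> System.out.println(p.getName()));
    }
}
```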
- Applying Change Data Feed for auditing on Delta tables
What is the Change Data Feed?

Change Data Feed is a Delta Lake feature, available as of version 2.0.0, that allows tracking changes at row level in Delta tables: DML operations (Merge, Delete or Update), data versions and the timestamp of when each change happened. The process maps Merge, Delete and Update operations, maintaining the history of changes at row level; in other words, every event a record goes through is registered by Delta, through the Change Data Feed, as a kind of audit. Of course, it can be used for many different use cases; the possibilities are extensive.

How it works in practice

Applying the Change Data Feed to Delta tables is an interesting way to handle row-level records, and in this post we will show how it works. We will perform some operations to explore the power of the Change Data Feed. We will work with the following Dataset:

Creating the Spark Session and configuring some Delta parameters

From now on, we'll present the code in chunks for easier understanding. In the code below we create the method responsible for maintaining the Spark session and configuring some parameters for Delta to work.

Loading the Dataset

Let's load the Dataset and create a temporary view to be used in our pipeline later.

Creating the Delta Table

Now we will create the Delta table, already configuring the Change Data Feed in the table properties; all the metadata is based on the previously presented Dataset. Note that we use the property delta.enableChangeDataFeed = true to activate the Change Data Feed.

Performing a Data Merge

Now we'll perform a simple Merge operation so that the Change Data Feed can register it as a change in our table. Note that the Merge uses our previously created global_temp.raw_product view to upsert the data.

Auditing the table

Now that the Merge has been executed, let's read our table to understand what happened and how the Change Data Feed works. Notice that we pass the following parameters:
1. readChangeFeed, which is required for using the Change Data Feed.
2. startingVersion, the parameter responsible for restricting from which version the changes will be displayed.

Result after execution:

See that, in addition to the columns defined when creating the table, we have 3 new columns managed by the Change Data Feed:
1. _change_type: column containing values according to each operation performed, such as insert, update_preimage, update_postimage and delete.
2. _commit_version: the change version.
3. _commit_timestamp: timestamp representing the change date.

In the result above, the upsert turned into a simple insert, as the data didn't meet any of the conditions for an update.

Deleting a record

In this step we will do a simple delete on a table record, just to validate how the Change Data Feed behaves.

Auditing the table (again)

Note below that, after deleting the record with id 6, we now have a new record registered as delete in the table, with its version incremented to 2. Another point is that the original record was kept, but with the old version.

Updating a record

As a last test, we will update a record to understand the behavior of the Change Data Feed once more.

Auditing the table (last time)

After running a simple update on a record, notice that 2 new values have been added/updated in the _change_type column.
The update_postimage value is the value after the update was performed. This time, the old record keeps the same version as the new one in the _commit_version column, because this same record appears in the _change_type column as update_preimage, that is, the value before the change.

Conclusion

The Change Data Feed is a great resource for understanding the behavior of your data pipeline and also a way to audit records in order to better understand the operations performed on them. According to the Delta team itself, it is a feature that, when kept enabled, does not generate any significant overhead. It's a feature that can be fully adopted in your data strategy, as it has several benefits, as shown in this post.

Repository: GitHub

Hope you enjoyed!
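The post's original code listings are not reproduced on this page, so here is a minimal sketch, an illustration rather than the author's code, of the two most distinctive pieces: creating a Delta table with the Change Data Feed enabled and reading the change feed back. It assumes an existing SparkSession configured with the Delta extensions (as in the Delta Lake post earlier on this page); the table and column names are assumptions.

```java
// Create a Delta table with the Change Data Feed enabled (table and columns are illustrative).
sparkSession.sql(
    "CREATE TABLE IF NOT EXISTS product (id STRING, name STRING, price DOUBLE) " +
    "USING DELTA " +
    "TBLPROPERTIES (delta.enableChangeDataFeed = true)");

// Read every change registered since version 0: _change_type, _commit_version
// and _commit_timestamp are added automatically by the Change Data Feed.
Dataset<Row> changes = sparkSession.read()
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("product");

changes.show();
```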
- Understanding Java Record Class in 2 minutes
Introduction

Released as a preview in Java 14 and made final in Java 16 through JEP 395, the Record Class is an alternative way of working with classes in Java. Record Classes are a very interesting approach designed to eliminate the verbosity of creating a class and its components, such as:

- Canonical constructors
- Public access methods
- equals and hashCode implementations
- toString implementation

Using Record Classes, it is no longer necessary to declare the items above, helping the developer stay focused on other tasks. Let's understand it better in practice.

Let's create a Java class called User and add some fields and methods. Note that for a simple class with 4 fields, we create a constructor, public access methods, implement the equals and hashCode methods and, finally, the toString method. It works well, but we could avoid the complexity and write less verbose code. In that case, we can use a Record Class instead of the User class above.

User Record Class

The difference between a Record and a traditional Java class is remarkable. Note that it wasn't necessary to declare the fields, create the access methods or implement any other method. When a Record Class is created, the public access methods are generated implicitly, and the implementations of equals, hashCode and toString are also created automatically, so it is not necessary to implement them explicitly. Finally, the reference fields, or components, are created as private final with the same names.

Output

Disadvantages

A Record Class behaves like a common Java class, but the difference is that you can't work with inheritance. You can't extend another class, only implement one or more interfaces. Another point is that it's not possible to declare additional non-static instance fields beyond the record components.

Final conclusion

Record Classes are a great approach for anyone looking for less verbose code or who needs agility in implementing models. Despite the limitation of not being able to extend other classes, it's a limitation that doesn't affect its use in general.

Hope you enjoyed!
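The post's code listings did not survive on this page, so here is a minimal sketch of the comparison it describes; the field names and values are illustrative, not the ones from the original example.

```java
public class RecordExample {

    // Equivalent of the verbose User class described above: the canonical
    // constructor, accessors, equals, hashCode and toString are generated implicitly.
    record User(String name, String email, String country, Integer age) {}

    public static void main(String[] args) {
        User user = new User("Jerry", "jerry@email.com", "USA", 35);
        System.out.println(user.name());  // generated accessor (no getName() prefix)
        System.out.println(user);         // generated toString()
    }
}
```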
- Getting started with Java Reflection in 2 minutes
Introduction

Java Reflection is a powerful API that allows a Java program to examine and manipulate information about its own classes at runtime. With Reflection, you can get information about a class's fields, methods, and constructors, and access and modify those elements even if they're private. In this post we're going to write some Java code exploring some of the facilities of Reflection and when to apply it in your projects.

Bank Class

We'll create a simple class called Bank, with some fields, methods and constructors to be explored using Reflection.

Accessing the fields of the Bank class

With the Bank class created, let's use Reflection to list all fields of the class through the getDeclaredFields method of the Class class. Note that through the static method Class.forName, we pass a string with the name of the class we want to explore via Reflection as a parameter.

Output
Field name: code
Field type: class java.lang.Integer
************
Field name: nameOfBank
Field type: class java.lang.String
************
Field name: amountOfDepositedMoney
Field type: class java.lang.Double
************
Field name: totalOfCustomers
Field type: class java.lang.Integer
************

Accessing the methods of the Bank class

Through the getDeclaredMethods method, we can retrieve all methods of the Bank class.

Output
Method name: doDeposit
Method type: class java.lang.String
************
Method name: doWithDraw
Method type: class java.lang.String
************
Method name: getReceipt
Method type: class java.lang.String
************

Creating objects

To create objects with Reflection, it is necessary to create them through a constructor; that is, we must first invoke a constructor to create the object. The detail is that, to retrieve this constructor, we must pay attention to the types of the parameters that make up the constructor and the order in which they are declared. This makes it flexible to retrieve different constructors with different numbers and types of parameters in a class.

Notice below that it is necessary to create an array of type Class, assigning the different types according to the composition of the constructor that we will use to create our object. In this scenario, we invoke the method cls.getConstructor(argType), passing the previously created array as an argument. This way, we obtain a constructor object that will be used to create our object. Finally, we create a new array of type Object, assigning the values that will compose our object following the order defined in the constructor, and then invoke the method constructor.newInstance(argumentsValue), passing that array as a parameter and getting back the object we want to create.

Output
Bank{code=1, nameOfBank='Bank of America', amountOfDepositedMoney=1.5, totalOfCustomers=2500}

Invoking methods

Invoking a method through Reflection is quite simple, as shown in the code below (a sketch follows right after this section). Note that it is necessary to pass to cls.getMethod("doDeposit", argumentsType) the explicit name of the method, in this case "doDeposit", and as the second parameter an array representing the type used in the parameter of the method doDeposit(double amount), in this case a parameter of type double. Finally, we invoke method.invoke, passing as the first parameter the object referencing the class, in this case an object of type Bank, and as the second parameter the value that will be passed to the method.
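The original code listings were lost in this page's extraction; the sketch below reconstructs the method-invocation example from the description above, assuming a Bank class whose doDeposit(double) returns the confirmation message shown in the output.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

public class ReflectionExample {

    public static class Bank {
        private Integer code;
        private String nameOfBank;
        private Double amountOfDepositedMoney;
        private Integer totalOfCustomers;

        public Bank(Integer code, String nameOfBank, Double amountOfDepositedMoney, Integer totalOfCustomers) {
            this.code = code;
            this.nameOfBank = nameOfBank;
            this.amountOfDepositedMoney = amountOfDepositedMoney;
            this.totalOfCustomers = totalOfCustomers;
        }

        public String doDeposit(double amount) {
            return amount + " of money has been deposited";
        }
    }

    public static void main(String[] args) throws Exception {
        // Nested classes use the binary name Outer$Inner with Class.forName.
        Class<?> cls = Class.forName("ReflectionExample$Bank");

        // Retrieve the constructor by the types of its parameters, in declaration order.
        Class<?>[] argType = {Integer.class, String.class, Double.class, Integer.class};
        Constructor<?> constructor = cls.getConstructor(argType);

        // Create the object by passing the argument values in the same order.
        Object[] argumentsValue = {1, "Bank of America", 1.5, 2500};
        Object bank = constructor.newInstance(argumentsValue);

        // Retrieve and invoke doDeposit(double) on the created object.
        Class<?>[] argumentsType = {double.class};
        Method method = cls.getMethod("doDeposit", argumentsType);
        System.out.println(method.invoke(bank, 145.85));
    }
}
```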
Output
145.85 of money has been deposited

Conclusion

Using Reflection is a good strategy when you need flexibility in exploring different classes and their methods without the need to instantiate objects. Normally, Reflection is used in specific components of an architecture, but nothing prevents it from being used in other scenarios. From the examples shown above, you can imagine countless scenarios for its application and the advantages of its use. Hope you enjoyed!
- Tutorial: Apache Airflow for beginners
Intro Airflow has been one of the main orchestration tools on the market and much talked about in the Modern Data Stack world, as it is a tool capable of orchestrating data workloads through ETLs or ELTs. But in fact, Airflow is not just about that, it can be applied in several cases of day-to-day use of a Data or Software Engineer. In this Apache Airflow for Beginners Tutorial, we will introduce Airflow in the simplest way, without the need to know or create ETLs. But what is Airflow actually? Apache Airflow is a widely used workflow orchestration platform for scheduling, monitoring, and managing data pipelines. It has several components that work together to provide its functionalities. Airflow components DAG The DAG (Directed Acyclic Graph) is the main component and workflow representation in Airflow. It is composed of tasks (tasks) and dependencies between them. Tasks are defined as operators (operators), such as PythonOperator, BashOperator, SQLOperator and others. The DAG defines the task execution order and dependency relationships. Webserver The Webserver component provides a web interface for interacting with Airflow. It allows you to view, manage and monitor your workflows, tasks, DAGs and logs. The Webserver also allows user authentication and role-based access control. Scheduler The Scheduler is responsible for scheduling the execution of tasks according to the DAG definition. It periodically checks for pending tasks to run and allocates available resources to perform the tasks at the appropriate time. The Scheduler also handles crash recovery and scheduling task retries. Executor The Executor is responsible for executing the tasks defined in the DAGs. There are different types of executors available in Airflow such as LocalExecutor, CeleryExecutor, KubernetesExecutor and etc. Each executor has its own settings and execution behaviors. Metadatabase Metadatabase is a database where Airflow stores metadata about tasks, DAGs, executions, schedules, among others. It is used to track the status of tasks, record execution history, and provide information for workflow monitoring and visualization. It is possible to use several other databases to record the history such as MySQL, Postgres and among others. Workers Workers are the execution nodes in a distributed environment. They receive tasks assigned by the Scheduler and execute them. Workers can be scaled horizontally to handle larger data pipelines or to spread the workload across multiple resources. Plugins Plugins are Airflow extensions that allow you to add new features and functionality to the system. They can include new operators, hooks, sensors, connections to external systems, and more. Plugins provide a way to customize and extend Airflow's capabilities to meet the specific needs of a workflow. Operators Operators are basically the composition of a DAG. Understand an operator as a block of code with its own responsibility. Because Airflow is an orchestrator and executes a workflow, we can have different tasks to be performed, such as accessing an API, sending an email, accessing a table in a database and performing an operation, executing a Python code or even a Bash command. For each of the above tasks, we must use an operator. Next, we will discuss some of the main operators: BashOperator BashOperator allows you to run Bash commands or scripts directly on the operating system where Airflow is running. 
It is useful for tasks that involve running shell scripts, utilities, or any action that can be performed in the terminal. In short, when we need to open our system's terminal and execute some command to manipulate files or something related to the system itself, but within a DAG, this is the operator to be used. PythonOperator The PythonOperator allows you to run Python functions as tasks in Airflow. You can write your own custom Python functions and use the PythonOperator to call those functions as part of your workflow. DummyOperator The DummyOperator is a "dummy" task that takes no action. It is useful for creating complex dependencies and workflows without having to perform any real action. Sensor Sensors are used to wait for some external event to occur before continuing the workflow, it can work as a listener. For example, the HttpSensor, which is a type of Sensor, can validate if an external API is active, if so, the flow continues to run. It's not an HTTP operator that should return something, but a type of listener. HttpOperator Unlike a Sensor, the HttpOperator is used to perform HTTP requests such as GET, POST, PUT, DELETE end etc. In this case, it allows you to interact more fully with internal or external APIs. SqlOperator SqlOperator is the operator responsible for performing DML and DDL operations in a database, that is, from data manipulations such as SELECTS, INSERTS, UPDATES and so on. Executors Executors are responsible for executing the tasks defined in a workflow (DAG). They manage the allocation and execution of tasks at runtime, ensuring that each task runs efficiently and reliably. Airflow offers different types of executors, each with different characteristics and functionalities, allowing you to choose the most suitable one for your specific needs. Below, we’ll cover some of the top performers: LocalExecutor LocalExecutor is the default executor in Apache Airflow. It is designed to be used in development and test environments where scalability isn't a concern. LocalExecutor runs tasks on separate threads within the same Airflow process. This approach is simple and efficient for smaller pipelines or single-node runs. CeleryExecutor If you need an executor for distributed and high-scale environments, CeleryExecutor is an excellent choice. It uses Celery, a queued task library, to distribute tasks across separate execution nodes. This approach makes Airflow well-suited for running pipelines on clusters of servers, allowing you to scale horizontally on demand. KubernetesExecutor For environments that use Kubernetes as their container orchestration platform, KubernetesExecutor is a natural choice. It leverages Kubernetes' orchestration capability to run tasks in separate pods, which can result in better resource isolation and easier task execution in containers. DaskExecutor If your workflow requires parallel and distributed processing, DaskExecutor might be the right choice. It uses the Dask library to perform parallel computing on a cluster of resources. This approach is ideal for tasks that can be divided into independent sub-tasks, allowing better use of available resources. Programming language Airflow supports Python as programming language. To be honest, it's not a limiter for those who don't know the language well. In practice, the process of creating DAGs is standard, which can change according to your needs, it will deal with different types of operators, whether or not you can use Python. 
Hands-on

Setting up the environment

For this tutorial we will use Docker, which will help us provision our environment without the need to install Airflow directly. If you don't have Docker installed, I recommend following the instructions in this link and, after installing it, coming back to follow the tutorial.

Downloading the project

To make it easier, clone the project from the following repository and follow the steps to deploy Airflow.

Steps to deploy

With Docker installed and the project downloaded according to the previous item, access the directory where the project is located, open the terminal and run the following command:

docker-compose up

The above command will start the Docker containers with the services of Airflow itself, Postgres and others. If you're curious about how these services are mapped, open the project's docker-compose.yaml file and you'll find more details there. After executing the command and once the containers have started, access the following address in your browser: http://localhost:8080/

A screen like the one below will open; just type airflow for both the username and the password to access the Airflow UI.

Creating a DAG

Creating a simple Hello World

For this tutorial, we will create a simple DAG where the classic "Hello World" will be printed. In the project you downloaded, go to the /dags folder and create a Python file called hello_world.py (a minimal sketch of this file is shown below).

The code is a simple example of a DAG written in Python. We start by importing some functions, including the DAG itself, datetime-related functions and the Python operator. Next, we create a Python function called print_hello that prints "Hello World" to the console. This function will be called by the DAG later on.

The declaration of a DAG starts with the syntax with DAG(..), passing some arguments such as:
- dag_id: the DAG identifier in the Airflow context.
- start_date: the defined date is only a point of reference, not necessarily the date of the beginning of the execution nor of the creation of the DAG. Usually the executions are carried out at a later date than the one defined in this parameter, and it is important when we need to calculate executions between this start and what is defined in the schedule_interval parameter.
- schedule_interval: in this parameter we define the periodicity at which the DAG will be executed. It is possible to define different kinds of executions through CRON expressions or through predefined strings such as @daily, @hourly, @once, @weekly and so on. In this example, the flow will run only once.
- catchup: this parameter controls retroactive executions; if set to True, Airflow will execute the retroactive period from the date defined in start_date until the current date. In the previous example we set it to False because there is no need for retroactive execution.

After filling in the arguments, we create the hello_task inside the DAG itself using the PythonOperator, which provides ways to execute Python functions within a DAG. Note that we declare an identifier through task_id and, in the python_callable argument, which is native to the PythonOperator, we pass the print_hello function created earlier. Finally, we reference the hello_task. This way, the DAG will understand that this is the task to be performed.
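The hello_world.py listing itself was lost on this page, so here is a minimal sketch of a DAG matching the walkthrough above; the dag_id, start_date and schedule values follow the description but are otherwise assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def print_hello():
    # Function called by the task below.
    print("Hello World")


with DAG(
    dag_id="hello_world",
    start_date=datetime(2023, 1, 1),   # reference date, not necessarily the first run
    schedule_interval="@once",         # run the flow only once, as in the walkthrough
    catchup=False,                     # no retroactive (backfill) executions
) as dag:

    hello_task = PythonOperator(
        task_id="hello_operator",
        python_callable=print_hello,
    )

    hello_task
```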
If you have already deployed it, the DAG will appear in Airflow shortly, ready to be executed, as shown in the image below:

After the DAG is created, activate and execute it by clicking on Trigger DAG as shown in the image above. Click on the hello_operator task (center) and a window will open as shown in the image below:

Click the Log button to see more execution details:

Note how simple it is to create a DAG; just think about the different possibilities and applicability scenarios. In the next tutorials, we'll build examples that are a bit more complex, exploring several other scenarios.

Conclusion

Based on the simple example shown, Airflow presents a flexible and simple approach to controlling automated flows, from creating DAGs to navigating its web component. As mentioned at the beginning, its use is not limited to the orchestration of ETLs; it can be applied to any task that needs to control flows with dependencies between their components, in a scalable context or not.

GitHub Repository

Hope you enjoyed!
- Creating Asynchronous Java Code with Future
Intro

Java's Future is one of several ways to work with the language asynchronously, providing a multi-threaded context in which it is possible to execute tasks in parallel without blocking the process. In the example below, we will simulate sending fictitious emails where, even while the sending is in progress, the process will not be blocked; that is, it will not be necessary to wait for the sending to finish for other functionalities or mechanisms to keep operating.

EmailService class

Understanding the EmailService class

The class above represents the sending of emails in a fictitious way; the idea of using the loop to simulate the sending is precisely to delay the process itself. Finally, at the end of the sending, the method sendEmailBatch(int numberOfEmailsToBeSent) returns a String containing a message marking the end of the process.

EmailServiceAsync class

Understanding the EmailServiceAsync class

The EmailServiceAsync class represents the asynchronous mechanism itself. In it we have the method sendEmailBatchAsync(int numberOfEmailsToBeSent), which is responsible for making the process of sending the dummy emails asynchronous. The asynchronous process is managed through an ExecutorService instance, which facilitates the management of asynchronous tasks assigned to a pool of threads. In this case, the call to the sendEmailBatch(int numberOfEmailsToBeSent) method boils down to a task that will be assigned to the single thread defined in Executors.newFixedThreadPool(1). Finally, the method returns a Future, which is literally a promise that the task will be completed at some point, representing an asynchronous process.

EmailServiceAsyncRun class

Understanding the EmailServiceAsyncRun class

It is in this class that we test the asynchronous process using Future. Let's recap: in the EmailService class, we created a method called sendEmailBatch(int numberOfEmailsToBeSent) in which we simulate, through the for loop, the sending of dummy emails and print a sending message that we'll use to observe the concurrency. In the EmailServiceAsync class, the sendEmailBatchAsync(int numberOfEmailsToBeSent) method creates an ExecutorService instance that will manage the tasks together with the thread pool, which in this case has just one thread, defined in Executors.newFixedThreadPool(1), and returns a Future.

Now, in the EmailServiceAsyncRun class, this is where we actually test the process. Let's go through it in parts:
- We instantiate an object of type EmailServiceAsync.
- We create an object of type Future and assign it the return of the emailAsync.sendEmailBatchAsync(500) method. The idea of the 500 argument is just to control the iterations of the for loop, delaying the end of the process. We could even use Thread.sleep() as an alternative and set a delay time, which would also work fine.
- Note that we use the futureReturn.isDone() method to control the while loop; this method allows the process not to be blocked while the email flow is executed. In this case, any work that you want to run concurrently with the sending can be placed inside the while, such as a flow updating customer tables or any other process.
- Then, using the futureReturn.get() method, we print the result of sending the emails.
- Finally, we finish the executorService and its tasks through the executorService.shutdown() method.

Running the process

Notice clearly that there are two distinct processes running: the process of sending emails ("Sending email Nº 498..") and the process of updating a customer table. Finally, the run finishes when the message "A total of 500 emails has been sent" is printed.

Working with blocking processes

Future is also widely used in cases where we need to block a process: the current thread will be blocked until the task being executed by the Future ends. To do so, simply invoke the futureReturn.get() method directly, without any iteration control as used in the previous example. An important point is that this type of approach can waste resources due to the blocking of the current thread.

Conclusion

The use of Future is very promising when we need to add asynchronous processing to our code in the simplest way, or even to block processes. It's a lean API with some limitations, but it works well for many scenarios.

Hope you enjoyed!
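The three classes discussed above did not survive on this page, so here is a compact sketch of the overall flow, an approximation based on the description rather than the original listing.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class EmailServiceAsyncRun {

    // Simulates the (slow) sending of a batch of emails.
    static class EmailService {
        public String sendEmailBatch(int numberOfEmailsToBeSent) {
            for (int i = 1; i <= numberOfEmailsToBeSent; i++) {
                System.out.println("Sending email Nº " + i + "..");
            }
            return "A total of " + numberOfEmailsToBeSent + " emails has been sent";
        }
    }

    // Wraps the sending in a task submitted to a single-thread pool, returning a Future.
    static class EmailServiceAsync {
        private final ExecutorService executorService = Executors.newFixedThreadPool(1);

        public Future<String> sendEmailBatchAsync(int numberOfEmailsToBeSent) {
            return executorService.submit(() -> new EmailService().sendEmailBatch(numberOfEmailsToBeSent));
        }

        public void shutdown() {
            executorService.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        EmailServiceAsync emailAsync = new EmailServiceAsync();
        Future<String> futureReturn = emailAsync.sendEmailBatchAsync(500);

        // While the emails are being sent, other work (e.g. updating a customer table) keeps running.
        while (!futureReturn.isDone()) {
            System.out.println("Updating customer table...");
        }

        System.out.println(futureReturn.get());
        emailAsync.shutdown();
    }
}
```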
- Accessing APIs and Extracting Data with Airflow
Intro

Airflow provides different ways of working with automated flows, and one of them is the possibility of accessing external APIs using HTTP operators and extracting the necessary data.

Hands-on

In this tutorial we will create a DAG which will access an external API and extract the data directly to a local file. If this is your first time using Airflow, I recommend accessing this link to understand more about Airflow and how to set up an environment.

Creating the DAG

For this tutorial, we will create a DAG that triggers every hour (schedule_interval="0 * * * *") and accesses an external API, extracting some data directly to a local JSON file. In this scenario we use the SimpleHttpOperator, which provides an API capable of executing requests to external APIs.

Note that we use two operators within the same DAG. The SimpleHttpOperator provides ways of accessing external APIs: through the method field we define the HTTP method (GET, POST, PUT, DELETE), the endpoint field specifies the endpoint of the API, which in this case is products, and finally the http_conn_id parameter receives the identifier of the connection that will be defined next through the Airflow UI. As shown below, access the menu Admin > Connections, fill in the data as shown in the image below and then save.

As for the PythonOperator, we only use it to execute a Python function called _write_response, using XComs: through the task_id of the task that performed the extraction, it is possible to retrieve the result of the response and use it in any part of the code. In this scenario we use the result retrieved from the API to write the file. XCom is a communication mechanism between different tasks that makes Airflow very flexible; tasks can often be executed on different machines and, with XComs, communication and information exchange between tasks is possible.

Finally, we define the execution of the tasks and their dependencies using the >> operator, which basically defines the order of execution between the tasks. In our case, the API access and extraction must be performed before writing to the file: extract_data >> write_response. A sketch of a DAG along these lines appears at the end of this post.

After executing the DAG, it is possible to access the file generated with the result of the extraction. Just access one of the workers via the terminal (in this case there will be only one). Run the following command to list the containers:

docker ps

A listing similar to the one below will be displayed. Notice that one of the lines in the NAMES column refers to the worker, in this case coffee_and_tips_airflow-worker_1. Still in the terminal, type the following command to access the Airflow directory where the extract_data.json file is located:

docker exec -it coffee_and_tips_airflow-worker_1 /bin/bash

That's it, now just open the file and check the content.

Conclusion

Once again we saw the power of Airflow for automated processes that require easy access to and integration with external APIs using few lines of code. In this example, we explored the use of XComs, which aims to make the exchange of messages between tasks that can be executed on different machines in a distributed environment more flexible.

Hope you enjoyed!
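Since the DAG code referenced above isn't reproduced on this page, here is a minimal sketch of what it could look like. The dag_id, the connection id (created through the Airflow UI) and the output path are placeholders, not values from the original post.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.http.operators.http import SimpleHttpOperator


def _write_response(ti):
    # Pull the API response pushed to XCom by the extract_data task.
    response = ti.xcom_pull(task_ids="extract_data")
    with open("extract_data.json", "w") as file:
        file.write(response)


with DAG(
    dag_id="api_extraction",            # illustrative dag_id
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 * * * *",      # trigger every hour
    catchup=False,
) as dag:

    extract_data = SimpleHttpOperator(
        task_id="extract_data",
        http_conn_id="api_connection",  # placeholder: the connection id created in the Airflow UI
        endpoint="products",
        method="GET",
        log_response=True,
    )

    write_response = PythonOperator(
        task_id="write_response",
        python_callable=_write_response,
    )

    extract_data >> write_response
```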
- Quick guide about Apache Kafka: Powering Event-Driven architecture
Introduction In today's data-driven world, the ability to efficiently process and analyze vast amounts of data in real-time has become a game-changer for businesses and organizations of all sizes. From e-commerce platforms and social media to financial institutions and IoT devices, the demand for handling data streams at scale is ever-increasing. This is where Apache Kafka steps in as a pivotal tool in the world of event-driven architecture. Imagine a technology that can seamlessly connect, process, and deliver data between countless systems and applications in real-time. Apache Kafka, often referred to as a distributed streaming platform, is precisely that technology. It's the unsung hero behind the scenes, enabling real-time data flow and providing a foundation for a multitude of modern data-driven applications. In this quick guide about Apache Kafka, we'll take a deep dive into Apache Kafka, unraveling its core concepts, architecture, and use cases. Whether you're new to Kafka or looking to deepen your understanding, this guide will serve as your compass on a journey through the exciting world of real-time data streaming. We'll explore the fundamental principles of Kafka, share real-world examples of its applications, and provide practical insights for setting up your own Kafka environment. So, let's embark on this adventure and discover how Apache Kafka is revolutionizing the way we handle data in the 21st century. Key Concepts of Kafka 1. Topics What Are Kafka Topics? In Kafka, a topic is a logical channel or category for data. It acts as a named conduit for records, allowing producers to write data to specific topics and consumers to read from them. Think of topics as a way to categorize and segregate data streams. For example, in an e-commerce platform, you might have topics like "OrderUpdates," "InventoryChanges," and "CustomerFeedback," each dedicated to a specific type of data. Partitioning within Topics One of the powerful features of Kafka topics is partitioning. When a topic is divided into partitions, it enhances Kafka's ability to handle large volumes of data and distribute the load across multiple brokers. Partitions are the unit of parallelism in Kafka, and they provide fault tolerance, scalability, and parallel processing capabilities. Each partition is ordered and immutable, and records within a partition are assigned a unique offset, which is a numeric identifier representing the position of a record within the partition. This offset is used by consumers to keep track of the data they have consumed, allowing them to resume from where they left off in case of failure or when processing real-time data. Data organization Topics provide a structured way to organize data. They are particularly useful when dealing with multiple data sources and data types. Topics works as a storage within Kafka context where data sent by producers is organized into topics and partitions. Publish-Subscribe Model Kafka topics implement a publish-subscribe model, where producers publish data to a topic, and consumers subscribe to topics of interest to receive the data. An analogy that we can do is when we subscribe to a newsletter to receive some news or articles. When some news is posted, you as a subscriber will receive it. Scalability Topics can be split into partitions, allowing Kafka to distribute data across multiple brokers for scalability and parallel processing. Data Retention Each topic can have its own data retention policy, defining how long data remains in the topic. 
This makes it easier to manage data volume, freeing up space when appropriate.

2. Producers

In Kafka, a producer is a crucial component responsible for sending data to Kafka topics. Think of producers as information originators: applications or systems that generate and publish records to specific topics within the Kafka cluster. These records could represent anything from user events on a website to system logs or financial transactions. Producers are the source of truth for data in Kafka. They generate records and push them to designated topics for further processing. They also decide which topic each message will be sent to, based on the nature of the data. This ensures that data is appropriately categorized within the Kafka ecosystem.

Data Type
Producers usually send messages in JSON format, which makes it easier to transfer the data into storage.

Acknowledgment Handling
Producers can handle acknowledgments from the Kafka broker, ensuring that data is successfully received and persisted. This acknowledgment mechanism contributes to data reliability.

Sending data to specific partitions
Producers can send messages directly to a specific partition within a topic.

3. Consumers

Consumers are important components in the Kafka context; they are responsible for consuming data and making it available to downstream applications. Basically, consumers subscribe to Kafka topics, and any data produced there will be received by the consumers, representing the pub/sub approach.

Subscribing to Topics
Consumers actively subscribe to Kafka topics, indicating their interest in specific streams of data. This subscription model enables consumers to receive relevant information aligned with their use case.

Data Processing
Consumers continually receive new data from topics, and each consumer is responsible for processing this data according to its needs. A microservice that works as a consumer, for example, can consume data from a topic responsible for storing application logs and perform any processing before delivering it to the user or to other third-party applications.

Integration between apps
As mentioned previously, Kafka enables applications to easily integrate their services across varied topics and consumers. One of the most common use cases is integration between applications. In the past, applications needed to connect to other applications' databases to access their data, which created vulnerabilities and violated the principle of separated responsibilities between applications. Technologies like Kafka make it possible to integrate different services using the pub/sub pattern, where different consumers, represented by applications, can access the same topics and process this data in real time without needing to access third-party databases or any other data source, avoiding security risks and adding agility to the data delivery process.

4. Brokers

Brokers are fundamental pieces of Kafka's architecture; they are responsible for mediating and managing the exchange of messages between producers and consumers. Brokers manage the storage of the data produced by producers and guarantee reliable transmission of data within a Kafka cluster. In practice, brokers play a transparent role within a Kafka cluster, but below I will highlight some of their responsibilities that make all the difference to the functioning of Kafka.

Data reception
Brokers are responsible for receiving the data; they function as an entry point or proxy for the data produced and then manage all the storage so that it can be consumed by any consumer.
Fault tolerance Like all data architecture, we need to think about fault tolerance. In the context of Kafka, Brokers are responsible for ensuring that even in the event of failures, data is durable and maintains high availability. Brokers are responsible for managing the partitions within the topics capable of replicating the data, predicting any failure and reducing the possibility of data loss. Data replication As mentioned in the previous item, data replication is a way to reduce data loss in cases of failure. Data replication is done from multiple replicas of partitions stored in different Brokers, this allows that even if one Broker fails, there is data replicated in several others. Responsible for managing partitions We mentioned a recent article about partitions within topics but we did not mention who manages them. Partitions are managed by a Broker that works by coordinating reading and writing to that partition and also distributing data loading across the cluster. In short, Brokers perform orchestration work within a Kafka cluster, managing the reading and writing done by producers and consumers, ensuring that message exchanges are carried out and that there will be no loss of data in the event of failures in some of its components through data replication also managed by them. Conclusion Apache Kafka stands as a versatile and powerful solution, addressing the complex demands of modern data-driven environments. Its scalable, fault-tolerant, and real-time capabilities make it an integral part of architectures handling large-scale, dynamic data streams. Kafka has been adopted by different companies and business sectors such as Linkedin, where Kafka was developed by the way, Netflix, Uber, Airbnb, Wallmart, Goldman Sachs, Twitter and more.
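The guide above is conceptual; as an illustration of the publish side of the pub/sub model it describes, here is a minimal Java producer sketch using the standard Kafka client. The broker address, topic name and payload are assumptions for the example only.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderUpdatesProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; in a real cluster this lists one or more brokers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // acks=all asks the broker to acknowledge only after the write is fully replicated.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // JSON payload sent to the "OrderUpdates" topic mentioned earlier in the guide.
            String payload = "{\"orderId\": 1, \"status\": \"SHIPPED\"}";
            producer.send(new ProducerRecord<>("OrderUpdates", "order-1", payload));
        }
    }
}
```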
- Differences between Future and CompletableFuture
Introduction In the realm of asynchronous and concurrent programming in Java, Future and CompletableFuture serve as essential tools for managing and executing asynchronous tasks. Both constructs offer ways to represent the result of an asynchronous computation, but they differ significantly in terms of functionality, flexibility, and ease of use. Understanding the distinctions between Future and CompletableFuture is crucial for Java developers aiming to design robust and efficient asynchronous systems. At its core, a Future represents the result of an asynchronous computation that may or may not be complete. It allows developers to submit tasks for asynchronous execution and obtain a handle to retrieve the result at a later point. While Future provides a basic mechanism for asynchronous programming, its capabilities are somewhat limited in terms of composability, exception handling, and asynchronous workflow management. On the other hand, CompletableFuture introduces a more advanced and versatile approach to asynchronous programming in Java. It extends the capabilities of Future by offering a fluent API for composing, combining, and handling asynchronous tasks with greater flexibility and control. CompletableFuture empowers developers to construct complex asynchronous workflows, handle exceptions gracefully, and coordinate the execution of multiple tasks seamlessly. In this article, we will dive deeper into the differences between Future and CompletableFuture, exploring their respective features, use cases, and best practices. By understanding the distinct advantages and trade-offs of each construct, developers can make informed decisions when designing asynchronous systems and leveraging concurrency in Java applications. Let's embark on a journey to explore the nuances of Future and CompletableFuture in the Java ecosystem. Use Cases for Future Parallel Processing: Use Future to parallelize independent tasks across multiple threads and gather results asynchronously. For example, processing multiple files concurrently. Asynchronous IO: When performing IO operations that are blocking, such as reading from a file or making network requests, you can use Future to perform these operations in separate threads and continue with other tasks while waiting for IO completion. Task Execution and Coordination: Use Future to execute tasks asynchronously and coordinate their completion. For example, in a web server, handle multiple requests concurrently using futures for each request processing. Timeout Handling: You can set timeouts for Future tasks to avoid waiting indefinitely for completion. This is useful when dealing with resources with unpredictable response times. Use Cases for CompletableFuture Async/Await Pattern: CompletableFuture supports a fluent API for chaining asynchronous operations, allowing you to express complex asynchronous workflows in a clear and concise manner, similar to the async/await pattern in other programming languages. Combining Results: Use CompletableFuture to combine the results of multiple asynchronous tasks, either by waiting for all tasks to complete (allOf) or by combining the results of two tasks (thenCombine, thenCompose). Exception Handling: CompletableFuture provides robust exception handling mechanisms, allowing you to handle exceptions thrown during asynchronous computations gracefully using methods like exceptionally or handle. 
Dependency Graphs: You can build complex dependency graphs of asynchronous tasks using CompletableFuture, where the completion of one task triggers the execution of another, allowing for fine-grained control over the execution flow.

Non-blocking Callbacks: CompletableFuture allows you to attach callbacks that are executed upon completion of the future, enabling non-blocking handling of results or errors.

Completing a Future Manually: Unlike Future, you can complete a CompletableFuture manually using methods like complete, completeExceptionally, or cancel. This feature can be useful in scenarios where you want to provide a result or handle exceptional cases explicitly.

Examples

Creation and Completion

Future code example of creation and completion.

```java
ExecutorService executor = Executors.newSingleThreadExecutor();
Future<Integer> future = executor.submit(() -> {
    Thread.sleep(2000);
    return 10;
});
```

CompletableFuture code example of creation and completion.

```java
CompletableFuture<Integer> completableFuture = CompletableFuture.supplyAsync(() -> {
    try {
        Thread.sleep(2000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    return 10;
});
```

With CompletableFuture, the supplyAsync method allows for asynchronous execution without the need for an external executor service, as shown in the first example.

Chaining Actions

A plain Future offers no API for chaining actions; to transform its result you have to block with get() (or submit a follow-up task yourself):

```java
Future<Integer> future = executor.submit(() -> 10);
String result = "Result: " + future.get();
```

Now, an example of how to chain actions using CompletableFuture.

```java
CompletableFuture<Integer> completableFuture = CompletableFuture.supplyAsync(() -> 10);
CompletableFuture<String> result = completableFuture.thenApply(i -> "Result: " + i);
```

CompletableFuture offers a fluent API (thenApply, thenCompose, etc.) to chain actions, making it easier to express asynchronous workflows.

Exception Handling

Handling an exception using Future.

```java
Future<Integer> future = executor.submit(() -> {
    throw new RuntimeException("Exception occurred");
});
```

Handling an exception using CompletableFuture.

```java
CompletableFuture<Integer> completableFuture = CompletableFuture.supplyAsync(() -> {
    throw new RuntimeException("Exception occurred");
});
```

CompletableFuture allows for more flexible exception handling using methods like exceptionally or handle.

Waiting for Completion

```java
// Future
Integer result = future.get();

// CompletableFuture
Integer result = completableFuture.get();
```

Both Future and CompletableFuture provide the get() method to wait for the completion of the computation and retrieve the result.

Combining Multiple CompletableFutures

```java
CompletableFuture<Integer> future1 = CompletableFuture.supplyAsync(() -> 10);
CompletableFuture<Integer> future2 = CompletableFuture.supplyAsync(() -> 20);

CompletableFuture<Integer> combinedFuture = future1.thenCombine(future2, (x, y) -> x + y);
```

CompletableFuture provides methods like thenCombine, thenCompose, and allOf to perform combinations or compositions of multiple asynchronous tasks.
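To make the exception-handling comparison above concrete, here is a small sketch, with illustrative values only, of recovering from a failure with exceptionally:

```java
// The supplier always fails; exceptionally supplies a fallback value instead of
// letting join()/get() surface the exception.
Supplier<Integer> failingTask = () -> { throw new RuntimeException("Exception occurred"); };

Integer result = CompletableFuture.supplyAsync(failingTask)
        .exceptionally(ex -> -1)   // fallback used because the task failed
        .join();

System.out.println(result);        // prints -1
```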
On the other hand, CompletableFuture emerges as a powerful and flexible alternative, extending the functionalities of Future with a fluent API for composing, combining, and handling asynchronous tasks with greater control and elegance. The choice between Future and CompletableFuture hinges on the specific requirements and complexities of the task at hand. For simple asynchronous operations or when working within the confines of existing codebases, Future may suffice. However, in scenarios that demand more sophisticated asynchronous workflows, exception handling, or task coordination, CompletableFuture offers a compelling solution with its rich feature set and intuitive API.
- How to create a serverless app with AWS SAM
For this post, I will teach you how to create a serverless app with AWS SAM. AWS SAM (Serverless Application Model) is an extension of AWS CloudFormation, specifically designed for serverless application development and deployment, famous serverless like AWS Lambda, API Gateway, DynamoDB, among other AWS features. Level of abstraction AWS SAM is an application-level tool primarily focused on building and deploying serverless applications on AWS. It provides higher level abstractions to facilitate the development and deployment of serverless applications, with a focus on the AWS services needed to support this type of architecture, i.e. the whole focus is on AWS and not another cloud. AWS SAM has a whole way to generate the project's code locally and makes it possible to generate tests, Build and Deploy through SAM CLI. How to install AWS SAM Go to this link and follow the steps according to each operating system. How to create a serverless project After installing, through a terminal, manage your project locally by generating the necessary files to then deploy the application. First, go to the folder where you want to generate your serverless resource and then open the terminal. Type the following command in the terminal to start the SAM: sam init After typing, a prompt will appear with some options for you to fill in your project information. Above we have 2 options to generate our initial template, let's type 1 to generate the option 1 - AWS Quick Start Templates. After typing, a new list will be shown with some template options. Note that each option boils down to a resource such as Lambda, Dynamo table and even a simple API using API Gateway. For this scenario, let's create a Dynamo table, in this case, type the option 13 and press enter. After typing, some questions will be asked, just type y to proceed until a new screen about the project information is offered as below. Type the name of the project you want and press enter. In our case I typed the following name for the project dynamo-table-using-aws-sam as in the image below. After typing the project name, the template and files containing the base code will be available and ready for deployment. Access the folder and see that a file called template.yaml has been created containing information about the resources that will be created. It's very similar to a CloudFormation template, but shorter. Open the file and notice that several helper resources have been mapped into the template, such as Dynamo itself, a Lambda and an API Gateway. Were also created, some base codes related to Lambda and some unit tests that allow local invocations. How to deploy Now that our template and base code has been generated, it's time to create the Dynamo table in AWS, just follow the next steps. Access the terminal again and type the following command: sam deploy --guided After executing this command, the following options will be shown in the terminal prompt for completion: For the Stack Name field, enter a value that will be the identifier of that stack which will be used by CloudFormation to create the necessary resources. When in doubt, follow what was typed as per the image above, in this case dynamo-stack. After filling in all the fields, a summary of what will be created will be presented as shown in the image below: Finally, one more last question will be asked about the desire to deploy, just type y to confirm. 
After confirming the operation, the progress of creating the resources will be displayed in the terminal until the process finishes. Once the deploy is done, take another look at the resources that were created. Now just access the AWS console and check the table created in DynamoDB.

Deleting Resources

If necessary, you can delete the resources through the SAM CLI; just run the command below:

sam delete --stack-name dynamo-stack

The dynamo-stack argument refers to the identifier we typed earlier in the Stack Name field, remember? Use the same value to delete the entire stack that was created. After typing the command above, just confirm the next steps.

That's how simple it is to create a serverless resource with AWS SAM. There are advantages and disadvantages, and it all depends on your strategy. Hope you enjoyed it!
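As mentioned earlier in this post, here is a rough idea of the kind of handler that can sit behind the generated template. This is a hedged, minimal sketch in Java, not the code SAM actually generates (the quick start templates may use a different language and layout); the class name CreateItemHandler and the TABLE_NAME environment variable are assumptions for illustration, and the aws-lambda-java-core and DynamoDB SDK v2 dependencies are assumed to be on the classpath.

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

import java.util.Map;
import java.util.UUID;

// Hypothetical handler: writes one item to the DynamoDB table created by the stack.
public class CreateItemHandler implements RequestHandler<Map<String, String>, String> {

    private final DynamoDbClient dynamoDb = DynamoDbClient.create();
    // Assumed to be passed to the function as an environment variable by the template.
    private final String tableName = System.getenv("TABLE_NAME");

    @Override
    public String handleRequest(Map<String, String> input, Context context) {
        String id = UUID.randomUUID().toString();

        PutItemRequest request = PutItemRequest.builder()
                .tableName(tableName)
                .item(Map.of(
                        "id", AttributeValue.builder().s(id).build(),
                        "payload", AttributeValue.builder().s(String.valueOf(input)).build()))
                .build();

        dynamoDb.putItem(request);
        return "Created item " + id;
    }
}

A handler along these lines can be exercised locally with the generated unit tests or with sam local invoke before running sam deploy.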
- Understanding the different Amazon S3 Storage Classes
What are Amazon S3 Storage Classes? Amazon S3 (Simple Storage Service) provides a strategic way to organize objects in different tiers, each with its own particularities, which we will detail below. The storage classes are characterized by offering different levels of durability, availability, performance and cost. To get the best cost-benefit, you need to understand which strategy fits the objects you are storing. Next, we'll go through each class, describing its advantages and disadvantages.

S3 Standard

The S3 Standard storage class is the default and most widely used option in Amazon S3. It is designed to provide high durability, availability and performance for frequently accessed objects.

Advantages

S3 Standard is the most common class for storing objects that are accessed frequently; it offers low latency, which makes it suitable for a range of use cases where dynamic access to objects is essential. Another advantage is its durability of 99.999999999%, which means the chances of objects being corrupted or even lost are very low. As for availability, this class provides an SLA of 99.99%, which means the objects are highly available for access.

Disadvantages

S3 Standard has some disadvantages compared to other classes. One of them is the high storage cost for rarely accessed objects. That's why it's important to define lifecycle policies to deal with infrequently accessed objects (a hedged code sketch at the end of this post shows how a storage class can be set at upload time and how a lifecycle transition can be defined). In this case, the S3 Standard-Infrequent Access class would be more appropriate; we will talk about it shortly. Another disadvantage relates to newly created objects: even though low latency is one of this class's main characteristics, newly created objects may not be immediately available in all regions, and it may take time for them to become available in some regions, causing higher latency.

S3 Intelligent-Tiering

The S3 Intelligent-Tiering storage class provides a mechanism that automatically moves objects, based on their usage pattern, to more suitable tiers in search of lower storage costs.

Advantages

The concept itself says a lot about the advantages of S3 Intelligent-Tiering. This class is capable of managing objects based on their usage pattern: objects that are rarely accessed are moved to more suitable tiers, aiming at lower storage costs. S3 Intelligent-Tiering automatically monitors and moves objects to the most suitable tiers according to the usage pattern, and this generally involves three tiers: one optimized for frequently accessed objects; one optimized for infrequently accessed objects, which according to AWS yields savings of up to 40%; and a last tier aimed at objects that are rarely accessed, yielding storage savings of around 68%. Another advantage is that there is no charge for data access when using S3 Intelligent-Tiering; only storage and transfer are charged.

Disadvantages

There can be an increase in latency for objects accessed for the first time. When objects are moved to more suitable tiers, there is a possibility of higher latency for those rarely accessed objects.

S3 Standard-Infrequent Access (S3 Standard-IA)

A suitable class for storing objects that are accessed less frequently but need to be available for quick access with low latency. It is a typical class for storing long-term data.
Advantages

The storage cost is lower compared to the S3 Standard class, while maintaining the same durability characteristics. Regarding availability, it has the same characteristics as the S3 Intelligent-Tiering class, with a 99.9% SLA. It also allows fast access to data by offering a high throughput rate. Unlike classes such as S3 Standard and S3 Intelligent-Tiering, a minimum monthly storage charge applies.

Disadvantages

Data access is charged per gigabyte retrieved. So, depending on the access frequency and the volume accessed, it might be better to keep the data in a tier such as S3 Standard; everything will depend on your strategy.

S3 One Zone-Infrequent Access (S3 One Zone-IA)

An ideal storage class for objects that are accessed infrequently and only need to be available in a single Availability Zone. AWS itself suggests this class for secondary backup copies.

Advantages

The cost is lower compared to other storage classes, as the data is stored in only one zone, making it a low-cost option.

Disadvantages

Unlike other storage classes, where objects are stored in at least 3 Availability Zones (AZs), S3 One Zone-Infrequent Access keeps data in only one zone, meaning there is no redundancy across zones. So there is a possibility of data loss if that zone fails.

S3 Glacier Instant Retrieval

S3 Glacier Instant Retrieval is part of the Glacier family, which provides low-cost storage for rarely accessed objects. It's an ideal storage class for archiving data that still needs immediate access.

Advantages

Low storage costs. It has the same availability as the S3 Intelligent-Tiering and S3 Standard-IA classes. It provides redundancy, which means the data is replicated across at least 3 Availability Zones (AZs).

Disadvantages

Although it offers immediate data retrieval while maintaining the same throughput as classes like S3 Standard and S3 Standard-IA, the cost becomes high when the data needs to be retrieved frequently over a short period.

S3 Glacier Flexible Retrieval

S3 Glacier Flexible Retrieval is the former storage class known simply as S3 Glacier. Like the other classes in the Glacier family, it is designed to store long-lived objects. This class is ideal for objects that are accessed once or twice a year and can be retrieved asynchronously, without immediate access.

Advantages

This class is ideal for keeping objects that don't require immediate retrieval, which is a cost advantage. For data such as backups, where retrieval is very rare, this class does not charge retrieval costs, on the assumption that the frequency of access to this data is very close to zero.

Disadvantages

Retrieval time can be slow in some scenarios. By the nature of the class, S3 Glacier Flexible Retrieval may fall short when immediate access to data is required.

S3 Glacier Deep Archive

The lowest-cost storage class among the Glacier family. It is ideal for storing data that is accessed once or twice a year. AWS suggests this class for scenarios where data must be kept for 8 to 10 years to comply with regulations related to compliance or any other kind of long-term data retention rules.

Advantages

The lowest cost among classes in the same segment, with 99.99% availability. The class is available in at least 3 Availability Zones (AZs) and is ideal for data requiring long retention periods.

Disadvantages

Long retrieval times. If you need quick data retrieval, this SLA may not meet your expectations.
In addition, since this class assumes the data is rarely accessed, retrieval costs can be higher depending on how frequently the data is accessed. Well, that's it. I hope you enjoyed it!
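As referenced in the S3 Standard section above, here is a hedged sketch, using the AWS SDK for Java v2, of two common ways to control the storage class: setting it explicitly at upload time, and letting a lifecycle rule transition objects to S3 Standard-IA after 30 days. The bucket name, key and prefix are placeholders, and the software.amazon.awssdk:s3 dependency plus valid credentials are assumed.

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.BucketLifecycleConfiguration;
import software.amazon.awssdk.services.s3.model.ExpirationStatus;
import software.amazon.awssdk.services.s3.model.LifecycleRule;
import software.amazon.awssdk.services.s3.model.LifecycleRuleFilter;
import software.amazon.awssdk.services.s3.model.PutBucketLifecycleConfigurationRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.StorageClass;
import software.amazon.awssdk.services.s3.model.Transition;
import software.amazon.awssdk.services.s3.model.TransitionStorageClass;

public class StorageClassExamples {

    public static void main(String[] args) {
        S3Client s3 = S3Client.create();

        // 1) Choose the storage class explicitly when uploading an object.
        s3.putObject(PutObjectRequest.builder()
                        .bucket("my-example-bucket")            // placeholder bucket name
                        .key("reports/2021/orders.json")        // placeholder key
                        .storageClass(StorageClass.STANDARD_IA) // could also be INTELLIGENT_TIERING, GLACIER, etc.
                        .build(),
                RequestBody.fromString("{\"id\":1}"));

        // 2) Or let a lifecycle rule move objects under a prefix to S3 Standard-IA after 30 days.
        LifecycleRule rule = LifecycleRule.builder()
                .id("move-reports-to-ia")
                .status(ExpirationStatus.ENABLED)
                .filter(LifecycleRuleFilter.builder().prefix("reports/").build())
                .transitions(Transition.builder()
                        .days(30)
                        .storageClass(TransitionStorageClass.STANDARD_IA)
                        .build())
                .build();

        s3.putBucketLifecycleConfiguration(PutBucketLifecycleConfigurationRequest.builder()
                .bucket("my-example-bucket")
                .lifecycleConfiguration(BucketLifecycleConfiguration.builder().rules(rule).build())
                .build());
    }
}

Setting the class at write time suits data whose access pattern is known up front, while lifecycle rules (or S3 Intelligent-Tiering) are the better fit when the pattern changes over time.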