- Understanding AWS Lambda in 3 minutes
AWS Lambda is a compute service with no servers to manage, also known as serverless computing. Lambda lets you build backend applications using different programming languages such as Java, Python, Node.js, .NET, Ruby, Go and more. The best part about Lambda is that you don't need to worry about server instances to deploy and run your code. There are none of the capacity-provisioning responsibilities that usually come with EC2 instances, which makes Lambda a cheaper alternative for composing architectures.

How it works
Lambdas are used to compose architectures, with each function responsible for a specific workload. For example, you can use a Lambda to listen for files arriving in an S3 bucket and process them for normalization, or you can use EventBridge (CloudWatch Events) to create schedules through a cron expression that trigger the Lambda to run a workload and then shut the process down. As shown in the image below, there are several AWS Lambda integrations you can use to invoke Lambdas in a variety of scenarios.

Limitations
A Lambda can run for up to 15 minutes, so if you want to try it out, be careful with workloads that take longer than that.

Integrations
As mentioned earlier, AWS Lambda supports various service integrations that can be used as triggers. If you want to listen for objects created in an S3 bucket, you can use S3 as a trigger. If you need to process notifications from Amazon Simple Notification Service (SNS), you can set SNS as the trigger and the Lambda will receive every notification for processing. Note that there are many different scenarios that Lambda can solve efficiently. Here you can see the complete list of integrated services.

Prices
AWS has specific pricing policies for each service. Lambdas are basically billed by the number of requests and by code execution time. For more details, see here.

Use Cases
Here are some examples where using Lambda can be an interesting option.
Data processing: imagine you must normalize unstructured files into a semi-structured format to be read by another process. In this case, a Lambda can listen to an S3 bucket looking for new objects to transform.
Security: a Lambda that refreshes an application's user tokens.
Data transformation: you can use Kinesis/Firehose as a trigger so that the Lambda listens to each event, transforms it, and sends it back to Kinesis to deliver the data to S3.

Benefits
Price: pay only for requests and code runtime.
Serverless: no application server to manage.
Integrated: Lambda provides integration with AWS services.
Programming languages: you can use the main programming languages.
Scaling and concurrency: you can control concurrency and scale the number of executions up to your account limit.

Books to study and read
If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it.

Well, that's it, I hope you enjoyed it!
- Converting Parquet table to Delta Table
In this post we're going to create examples of how to convert a Parquet table to a Delta table. First, we'll create a Parquet table from scratch through a Spark DataFrame and then convert it to a Delta table. Using a Delta table has some benefits compared to a Parquet table: Delta lets you restore versions of your table through the time travel feature, provides ACID support and more.

Creating a Parquet table
First of all, let's create a Parquet table to be converted to a Delta table later. I prefer to create the Parquet table from scratch to bring a better understanding. The following code will be executed once, just to create the Parquet table. We're going to use a Spark DataFrame that will be loaded from a JSON file containing semi-structured records.

public static void main(String[] args) {
    SparkConf conf = new SparkConf();
    conf.setAppName("spark-delta-table");
    conf.setMaster("local[1]");

    SparkSession session = SparkSession.builder()
            .config(conf)
            .getOrCreate();

    Dataset<Row> dataFrame = session.read().json("product.json");
    dataFrame.write().format("parquet").save("table/product");
}

In the example above, we start by creating a SparkSession object to create and manage a Spark DataFrame loaded from the product.json file. After loading, the DataFrame writes a table in Parquet format to the table/product directory.

JSON content
The file product.json contains the semi-structured records below.

{"id":1, "name":"rice", "price":12.0, "qty": 2}
{"id":2, "name":"beans", "price":7.50, "qty": 5}
{"id":3, "name":"coke", "price":5.50, "qty": 2}
{"id":4, "name":"juice", "price":3.80, "qty": 1}
{"id":5, "name":"meat", "price":1.50, "qty": 1}
{"id":6, "name":"ice-cream", "price":6.0, "qty": 2}
{"id":7, "name":"potato", "price":3.70, "qty": 10}
{"id":8, "name":"apple", "price":5.60, "qty": 5}

After running the code above, Parquet files will be generated in the table/product directory, containing the files shown below.

Converting the Parquet table to a Delta table
Now that we have a Parquet table already created, we can easily convert it to a Delta table. Let's do it.

public static void main(String[] args) {
    SparkConf conf = new SparkConf();
    conf.setAppName("spark-delta-table");
    conf.setMaster("local[1]");

    SparkSession session = SparkSession.builder()
            .config(conf)
            .getOrCreate();

    DeltaTable.convertToDelta(session, "parquet.`table/product`");
}

The DeltaTable.convertToDelta method is responsible for converting the Parquet table into a Delta table. Note that we had to pass the SparkSession as a parameter and also specify the path of the Parquet table using the "parquet.`<path>`" format. You can see the result after execution in the picture below.

After the conversion runs, Delta creates the famous _delta_log directory containing commit info and checkpoint files.

Books to study and read
If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it.

Well, that's it, I hope you enjoyed it.
- Understanding Delta Lake Time Travel in 2 minutes
Delta Lake provides a way to version data for operations like merge, update and delete. This makes the data life cycle inside Delta Lake transparent. For each operation, a version is incremented, and if you have a table with multiple operations, different versions of the table will be created. Delta Lake offers a mechanism to navigate over these different versions called Time Travel. It's a temporary way to access data from the past. In this post we're going to use this feature to see different versions of a table.

Below we have a Delta table called people whose versions were all generated through write operations using append mode.

Current version
When we perform a simple read on the table, the current version is always the most recent one. So, for this scenario, the current version is 2 (two). Note that we don't need to specify which version we want to use because we're not using Time Travel yet.

session.read().format("delta").load("table/people")
        .orderBy("id").show();

Nothing changes at the moment, so let's keep going to the next steps.

Working with Time Travel
Here is where we start working with Time Travel. For the next steps, we'll perform reads on the people table specifying different versions to understand how Time Travel works.

Reading the Delta table - Version 0 (zero)
Now we're going to work with different versions, starting from version 0 (zero). Let's read the table again, but now adding a new parameter; take a look at the code below.

session.read().format("delta")
        .option("versionAsOf", 0)
        .load("table/people")
        .orderBy("id").show();

Notice that we added a new parameter called versionAsOf. This parameter allows us to configure which version number we want to restore temporarily for a table. For this scenario we configured the read for version zero (0) of the Delta table. This was the first version generated by Delta Lake after a write operation.

Reading the Delta table - Version 1 (one)
For this last step we're using version one (1). Note that the data from the previous version has been kept because append mode was used.

session.read().format("delta")
        .option("versionAsOf", 1)
        .load("table/people")
        .orderBy("id").show();

Delta Lake has a lot of benefits, and Time Travel gives us flexibility in a Big Data architecture. For more details, I recommend seeing the Delta Lake docs.

Books to study and read
If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it.

Well, that's it, I hope you enjoyed it.
- Tutorial: Creating AWS Lambda with Terraform
In this post, we're going to create an AWS Lambda with Terraform, using Java as the runtime. But first of all, have you heard about Lambda? I recommend reading this post about Lambda. And about Terraform? There's another post where I show the first steps using Terraform, just click here.

The idea of this post is to create an AWS Lambda that will be triggered by CloudWatch Events through an automated schedule using a cron or rate expression. Usually we can create any AWS resource using the console, but here we're going to use Terraform as an IaC (Infrastructure as Code) tool that will create every resource necessary to run our AWS Lambda. As the runtime, we chose Java, so it's important that you understand at least a little about Maven. Remember that you can run Lambda using different languages as the runtime, such as Java, Python, .NET, Node.js and more. Even though it's a Java project, the most important part of this post is to understand Lambda and how you can provision it through Terraform.

Intro
Terraform will be responsible for creating all the resources for this post, such as the Lambda, roles, policies, CloudWatch Events and the S3 bucket where we're going to keep the JAR file of our application. Our Lambda will be invoked by CloudWatch Events every 3 minutes, running a simple Java method that prints a message. This is going to be a simple example that you can reuse in your projects.

You can note in the image above that we're using S3 to store our deployment package, a JAR file in this case. It's an AWS recommendation to upload larger deployment packages directly to S3 instead of keeping them on Lambda itself; S3 has better support for uploading large files without worrying about storage. Don't worry about uploading files manually, Terraform will also be responsible for doing that during the build phase.

Creating the project
For this post we're going to use Java as the language and Maven as the dependency manager. Therefore, it is necessary to generate a Maven project that will create our project structure. If you don't know how to generate a Maven project, I recommend reading this post where I show how to do it.

Project structure
After generating the Maven project, we're going to create the same files and packages shown on the side, except pom.xml, which was created by the Maven generator. It's a characteristic of Maven projects to generate this folder structure, such as src/main/java/. Within the java/ folder, create a package called coffee.tips.lambda and create a Java class named Handler.java within this same package.

Updating pom.xml
For this post, add the dependencies and build section below.

Creating a Handler
A handler is basically the Lambda controller. Lambda always looks for a handler to start its process; to summarize, it's the first code to be invoked. For the handler below, we created a basic handler just to log messages when invoked by CloudWatch Events. Note that we implemented the RequestHandler interface, which allows receiving a Map object as a parameter. For this example, though, we won't explore the data from this parameter.
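The handler snippet itself did not survive this page's formatting, so here is a minimal sketch of what it could look like. The package and class names follow the ones mentioned above; the logging call and return value are assumptions, and it presumes the aws-lambda-java-core dependency is on the classpath.

package coffee.tips.lambda;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

import java.util.Map;

public class Handler implements RequestHandler<Map<String, Object>, String> {

    @Override
    public String handleRequest(Map<String, Object> event, Context context) {
        // Just log the raw event sent by CloudWatch Events; we don't explore its data here.
        context.getLogger().log("Lambda invoked by CloudWatch Events: " + event);
        return "done";
    }
}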
Understanding the Terraform files
Now we're going to understand how the resources will be created using Terraform.

vars.tf
The vars.tf file is where we declare the variables. Variables provide flexibility when we need to work with different resources.

vars.tfvars
Now we need to set the values of these variables. So, let's create a folder called /development inside the terraform folder. After creating the folder, create a file called vars.tfvars as in the side image and paste the content below.

Note that for the bucket field you must specify the name of your own bucket. Bucket names must be unique.

main.tf
In this file we just declare the provider. The provider is the cloud service we're going to use to create our resources. In this case, we're using AWS as the provider, and Terraform will download the packages needed to create the resources. Note that for the region field we're using the var keyword to get the region value already declared in the vars.tfvars file.

s3.tf
This file is where we declare resources related to S3. In this case, we only created the S3 bucket, but if you want to create more S3-related resources such as policies, roles, S3 notifications and so on, you can declare them here. It's a way to separate things by resource. Note again that we're using the var keyword for the bucket variable declared in the vars.tf file.

lambda.tf
Finally, our last Terraform file. In this file we declare the resources related to the Lambda and the Lambda itself. I think it's worth explaining some details about the file above, so let's do it.
1. We declared 2 aws_iam_policy_document data sources that describe what actions the resources assigned to these policies can perform.
2. The aws_iam_role resource provides the IAM role and will control some of the Lambda's actions.
3. The aws_iam_role_policy provides an IAM role inline policy and will register the previous role and the policies related to aws_iam_policy_document.aws_iam_policy_coffee_tips_aws_lambda_iam_policy_document.
4. We declared the aws_s3_object resource because we want to store our JAR file on S3. So, during the deploy phase, Terraform will get the JAR file created in the target folder and upload it to S3.
depends_on: Terraform must create the referenced resource before the current one.
bucket: the name of the bucket where the JAR file will be stored.
key: the JAR's name.
source: the source file's location.
etag: triggers updates when the value changes.
5. aws_lambda_function is the resource responsible for creating the Lambda, and we need to fill in some fields such as:
function_name: the Lambda's name.
role: the Lambda role declared in the previous steps that provides access to AWS services and resources.
handler: in this field you need to pass the path of the main class.
source_code_hash: this field is responsible for triggering Lambda updates.
s3_bucket: the name of the bucket where the JAR file generated during deploy will be stored.
s3_key: the JAR's name.
runtime: here you pass one of the Lambda-supported programming languages. For this example, java11.
timeout: the Lambda's execution timeout.
6. aws_cloudwatch_event_rule is the rule related to the CloudWatch event execution. In this case, we can set the cron through the schedule_expression field to define when the Lambda will run.
7. aws_cloudwatch_event_target is the resource responsible for triggering the Lambda using CloudWatch Events.
8. aws_lambda_permission allows invocations from CloudWatch.

Packaging
Now that you're familiar with Lambda and Terraform, let's package our project via Maven before creating the Lambda. The idea is to create a JAR file that will be used for the Lambda executions and stored on S3. For this example, we're going to package it locally. Remember that in a production environment we could use continuous integration tools such as Jenkins, Drone or even GitHub Actions to automate this process.

First, open the terminal, make sure you're in the root project directory and run the following Maven command:

mvn clean install -U

Besides packaging the project, this command will download and install the dependencies declared in the pom.xml file.
After running the above command, a JAR file will be generated within the target/ folder, which is also created.

Running Terraform
Well, we're almost there. Now, let's provision our Lambda via Terraform by running a few Terraform commands. Inside the terraform folder, run the following commands in the terminal:

terraform init

The above command initializes Terraform, downloading the Terraform libraries and validating the Terraform files. Next, let's run the plan command to check which resources will be created.

terraform plan -var-file=development/vars.tfvars

After running it, you'll see logs similar to the ones on the console. Finally, we can apply the changes to create the resources through the following command:

terraform apply -var-file=development/vars.tfvars

After running it, you must confirm that you want to perform the actions by typing "yes". Now the provisioning has been completed!

Lambda Running
Go to the AWS console to see the Lambda execution. Access the Monitor tab, then the Logs tab inside the Monitor section, and see the messages below that will be printed every 2 minutes.

Destroy
AWS billing charges will happen if you don't destroy these resources, so I recommend destroying them to avoid unnecessary charges. To do so, run the command below.

terraform destroy -var-file=development/vars.tfvars

Remember that you need to confirm this operation, cool?

Conclusion
In this post, we created an AWS Lambda provisioned by Terraform. Lambda is an AWS service that we can use for different use cases, making it easier to compose a software architecture. We could also note that Terraform brings flexibility in creating resources for different cloud services and is easy to adopt in software projects.

Github repository

Books to study and read
If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it.

Setup recommendations
If you're interested in knowing the setup I use to develop my tutorials, here it is:
Notebook Dell Inspiron 15 15.6
Monitor LG Ultrawide 29WL500-29

Well, that's it, I hope you enjoyed it!
- Creating a Java code using Builder pattern
If you're using a language that supports object orientation in your project, there are probably some lines of code using the Builder pattern. If not, this post will help you understand it.

What's the Builder Pattern?
The Builder Pattern belongs to an area of Software Engineering called Design Patterns; the idea behind a pattern is to solve common problems in your project by following best practices. The Builder Pattern is very useful when we need a better solution for the object-creation part of our project. Sometimes we need to instantiate an object with a lot of parameters, and this can become a problem if you pass a wrong parameter value. Things like this happen all the time and result in bugs, and you will need to find out where the issue is and maybe refactor the code to improve it.

Let's write some lines of code to see how the Builder Pattern works and when to apply it. The code below is an example of a traditional class with a constructor used to load the values when the object is instantiated.

public class PersonalInfo {

    private final String firstName;
    private final String lastName;
    private final Date birthDate;
    private final String address;
    private final String city;
    private final String zipCode;
    private final String state;
    private final int population;

    public PersonalInfo(String firstName, String lastName, Date birthDate, String address,
                        String city, String zipCode, String state, int population) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.birthDate = birthDate;
        this.address = address;
        this.city = city;
        this.zipCode = zipCode;
        this.state = state;
        this.population = population;
    }
}

And now we can instantiate the object, simulating the client code.

PersonalInfo personalInfo = new PersonalInfo("Mônica", "Avelar", new Date(),
        "23 Market Street", "San Francisco", "94016", "CA", 800000);

If you notice, to instantiate the object we have to pass all the values related to each property of our class, and there's a big chance of passing a wrong value. Another disadvantage of this approach is that it doesn't scale well: in this example we have only a few properties, but tomorrow we may add more, and the disadvantage becomes clearer, as the sketch below illustrates.
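To make the risk concrete, here is a small, hypothetical illustration (not from the original post): because several constructor parameters share the same type, swapping two of them still compiles and only shows up later as a bug.

// Hypothetical example: address/city and zipCode/state are accidentally swapped.
// The compiler cannot catch it because all four parameters are plain Strings.
PersonalInfo wrongInfo = new PersonalInfo("Mônica", "Avelar", new Date(),
        "San Francisco", "23 Market Street",  // address and city swapped
        "CA", "94016", 800000);               // zipCode and state swapped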
Working with the Builder Pattern
Let's rewrite the code above using the Builder Pattern and see the differences.

public class PersonalInfo {

    private final String firstName;
    private final String lastName;
    private final Date birthDate;
    private final String address;
    private final String city;
    private final String zipCode;
    private final String state;
    private final int population;

    public static class Builder {

        private String firstName;
        private String lastName;
        private Date birthDate;
        private String address;
        private String city;
        private String zipCode;
        private String state;
        private int population;

        public Builder firstName(String value) {
            firstName = value;
            return this;
        }

        public Builder lastName(String value) {
            lastName = value;
            return this;
        }

        public Builder birthDate(Date value) {
            birthDate = value;
            return this;
        }

        public Builder address(String value) {
            address = value;
            return this;
        }

        public Builder city(String value) {
            city = value;
            return this;
        }

        public Builder zipCode(String value) {
            zipCode = value;
            return this;
        }

        public Builder state(String value) {
            state = value;
            return this;
        }

        public Builder population(int value) {
            population = value;
            return this;
        }

        public PersonalInfo build() {
            return new PersonalInfo(this);
        }
    }

    public PersonalInfo(Builder builder) {
        firstName = builder.firstName;
        lastName = builder.lastName;
        birthDate = builder.birthDate;
        address = builder.address;
        city = builder.city;
        zipCode = builder.zipCode;
        state = builder.state;
        population = builder.population;
    }
}

If you compare both versions, you will conclude that the first one is smaller and easier to understand than the second one, and I agree. The advantage of the Builder becomes clear in the next example, when we create an object using it.

Simulating the client code using the Builder Pattern

PersonalInfo personalInfo = new PersonalInfo.Builder()
        .firstName("Mônica")
        .lastName("Avelar")
        .birthDate(new Date())
        .address("23 Market Street")
        .city("San Francisco")
        .zipCode("94016")
        .state("CA")
        .population(80000)
        .build();

This last example of object creation using the Builder Pattern results in organized code that follows best practices and is easy to read. Another advantage of the Builder is that we can identify each property before passing its value. To be honest, I've been using the Builder Pattern in my projects and I strongly recommend you do the same in your next projects. There's an easier way to implement the Builder pattern in projects nowadays, and I'll write a post about it, see you soon!

Books to study and read
If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s): Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software is a book that, through Java examples, shows you the patterns that matter, when to use them and why, how to apply them to your own designs, and the object-oriented design principles on which they're based. Design Patterns com Java: Projeto Orientado a Objetos Guiado por Padrões (Portuguese version) is a book that shows the concepts and fundamentals of Design Patterns and how to apply them in different contexts using the Java language.
- Running Spring Boot with ActiveMQ
Before understanding ActiveMQ, we have to think about common problems in applications that need to scale and integrate their services better. Today the flow of transmitted information is infinitely greater than 10 years ago, and it is almost impossible to measure the scalability an application must support.

Use case
To understand it better, let's imagine that you were hired to design an architecture for an e-commerce site that will sell tickets for NFL games. As always, you have little time to think about the architecture. The first idea is simple and quick, and the result is the drawing below.

Thinking about the number of accesses and requests per second, do you think this is a resilient architecture? Does the database scale? Does the database support concurrent access? And if the database goes down for some reason, will the purchase be lost? We can improve this architecture a little more, making it a bit more professional and resilient. Let's go.

Let's understand this last drawing. Now, when a purchase order is placed, the orders are sent to a message server (broker). The idea of the broker is basically to be a service capable of holding messages, usually plain text or text in JSON format. In this drawing, we can say that the queue holds customer data, the number of tickets, amounts, and so on. Finally, there is an application that does all the management of orders/purchases. This application reads/removes messages from the broker and can perform validations before writing to the database.

Now, let's assume that one of the requirements is that for each sale the customer must receive an invoice for the ticket. As the new architecture is well decoupled, it's easier to "plug in" a new application to do this job. So you came up with a new design, as follows: now the application that manages the purchases, in addition to recording the sale after retrieving the messages from Queue 1, also sends a message to Queue 2, which will hold the customers' invoices. A new application that manages invoices retrieves these messages and records them in a database specific to the financial area.

But what are the benefits of this new architecture? The architecture is more resilient, asynchronous and fault-tolerant. If one of the applications fails for some reason, the message returns to the queue until the application is reestablished. And finally, it makes it easier to integrate new applications.

What about ActiveMQ? What does it have to do with all this?
ActiveMQ is the service that provides the messaging server. In the drawings, it would be the message servers (brokers). To understand it even better, let's create a practical example of how to configure and use ActiveMQ with Spring Boot and JMS.

Creating the project
To create this Spring Boot project we're going to use Spring Initializr to generate our project faster. So, access https://start.spring.io to create it and choose the dependencies. Fill in the fields and select the 2 dependencies (ActiveMQ 5 and Spring Web) as shown in the image. Generate the file and import it into your project.

Pom file
Below is the pom.xml that was created by Spring Initializr.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.4.2</version>
    </parent>
    <groupId>com.spring.active.mq</groupId>
    <artifactId>spring-boot-active-mq</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>spring-boot-active-mq</name>
    <description>Demo project for Spring Boot</description>
    <properties>
        <java.version>1.8</java.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-activemq</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>

Installing ActiveMQ
Let's download ActiveMQ to make the process more transparent. There is also the possibility of using the version embedded in Spring Boot, but this time we'll present it in the more traditional way. For this example, we're going to use the classic version of ActiveMQ.
Download ActiveMQ here: https://activemq.apache.org/components/classic/download/
Steps to install it here: https://activemq.apache.org/getting-started
After installation, start the server according to the documentation.

application.properties file
In the Spring Boot application you created, fill in the application.properties file:

spring.activemq.broker-url=tcp://127.0.0.1:61616
spring.activemq.user=admin
spring.activemq.password=admin

The first line sets the message server URL. The second and following lines are the authentication data.

Ticket class

public class Ticket {

    private String name;
    private Double price;
    private int quantity;

    public Ticket() {}

    public Ticket(String name, Double price, int quantity) {
        this.name = name;
        this.price = price;
        this.quantity = quantity;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public Double getPrice() {
        return price;
    }

    public void setPrice(Double price) {
        this.price = price;
    }

    public int getQuantity() {
        return quantity;
    }

    public void setQuantity(int quantity) {
        this.quantity = quantity;
    }

    @Override
    public String toString() {
        return String.format("Compra de ingresso -> " +
                "Name=%s, Price=%s, Quantity=%s}",
                getName(), getPrice(), getQuantity());
    }
}

In the SpringBootActiveMqApplication class previously created by the generator, make the following change.

@SpringBootApplication
@EnableJms
public class SpringBootActiveMqApplication {

    public static void main(String[] args) {
        SpringApplication.run(SpringBootActiveMqApplication.class, args);
    }

    @Bean
    public JmsListenerContainerFactory<?> defaultFactory(
            ConnectionFactory connectionFactory,
            DefaultJmsListenerContainerFactoryConfigurer configurer) {
        DefaultJmsListenerContainerFactory factory = new DefaultJmsListenerContainerFactory();
        configurer.configure(factory, connectionFactory);
        return factory;
    }

    @Bean
    public MessageConverter jacksonJmsMessageConverter() {
        MappingJackson2MessageConverter converter = new MappingJackson2MessageConverter();
        converter.setTargetType(MessageType.TEXT);
        converter.setTypeIdPropertyName("_type");
        return converter;
    }
}

The @EnableJms annotation is the mechanism responsible for enabling JMS. The defaultFactory method configures and registers the factory used to connect to the queues through JMS. Finally, the jacksonJmsMessageConverter method converts the messages from JSON to the type passed to the JmsTemplate that we will see soon. All of these methods use the @Bean annotation; methods annotated with @Bean are managed by the Spring container.

TicketController class
In the TicketController class, we create a method called buyTicket that will be responsible for sending messages to the queue called compra_queue (purchase_queue) through a POST request.
In this method we're using a JmsTemplate object that allows objects to be converted and sent to the queue using JMS.

package com.spring.active.mq.springbootactivemq.Controller;

import com.spring.active.mq.springbootactivemq.pojo.Ticket;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.MediaType;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class TicketController {

    @Autowired
    private JmsTemplate jmsTemplate;

    @PostMapping(value = "/buy", consumes = MediaType.APPLICATION_JSON_VALUE)
    public void buyTicket(@RequestBody Ticket ticket) {
        jmsTemplate.convertAndSend("compra_queue",
                new Ticket(ticket.getName(), ticket.getPrice(), ticket.getQuantity()));
    }
}

EventListener class
The EventListener class is a sort of "listener". The @JmsListener annotation defines this listener characteristic, and in this same annotation it is possible to configure the name of the queue that will be "listened to" by the method. In short, all messages sent to the compra_queue (purchase_queue) queue will be received by this method.

package com.spring.active.mq.springbootactivemq.listener;

import com.spring.active.mq.springbootactivemq.pojo.Ticket;
import org.springframework.jms.annotation.JmsListener;
import org.springframework.stereotype.Component;

@Component
public class EventListener {

    @JmsListener(destination = "compra_queue", containerFactory = "defaultFactory")
    public void receiveMessage(Ticket ticket) {
        System.out.println("Mensagem da fila:" + ticket);
    }
}

Accessing the broker service - ActiveMQ
After starting the service according to the documentation, access the service console through a browser: http://127.0.0.1:8161/

Creating the queue
To create the queue, click on the Queues option in the top red bar, as shown in the image below. In the Queue Name field, type the queue's name as shown in the image above and click on the Create button. That's it, the queue was created!

Starting the application
Via terminal, access your project directory and run the Maven command below, or launch it via your IDE:

mvn spring-boot:run

Sending messages
We will use Postman to send messages. If you don't have Postman installed, download it from this link: https://www.postman.com/downloads/
After installation, open Postman and fill in the fields as shown in the image below.

Json content
{"name":"Joao","price":2.0,"quantity":4}
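If you prefer the command line over Postman, an equivalent request could look like the one below; this assumes the application is running locally on Spring Boot's default port 8080.

curl -X POST http://localhost:8080/buy \
     -H "Content-Type: application/json" \
     -d '{"name":"Joao","price":2.0,"quantity":4}'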
After clicking Send in Postman (or issuing the request above), check the application's console and you will be able to view the message that was sent and delivered through the queue. Access the ActiveMQ console again and you will be able to see the log of the message that was sent to the queue. The Number of Consumers column is the number of consumers of the queue, which in this case is just 1. The Messages Enqueued column shows the number of messages that were sent and, finally, the Messages Dequeued column is the number of messages that were removed from the queue.

Here is a Spring Boot project with an ActiveMQ repository: https://github.com/jpjavagit/jms-active-mq. It's worth checking out!

Books to study and read
If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s):
Spring Microservices in Action is a book that covers the principles of microservices using Spring, Spring Boot applications using Spring Cloud, resiliency, how to deploy, and real-life examples of good development practices.
Spring MVC Beginner's Guide is a book covering fundamental Spring concepts such as architecture, request flows, Bean validation, how to handle exception flows, using REST and Ajax, testing and much more. This book is an excellent choice for anyone wanting to learn more about the fundamentals of Spring. Spring is a Java framework containing different projects, Spring MVC being one of them. By acquiring a good Spring MVC foundation, you will be able to tackle challenges using any Spring Framework project.
Learn Microservices with Spring Boot: A Practical Approach to RESTful Services using RabbitMQ, Eureka, Ribbon, Zuul and Cucumber is a book that covers the main features of the Spring ecosystem using Spring Boot, such as creating microservices, event-based architecture, using RabbitMQ as a messaging feature, creating RESTful services and much more. This book is an excellent choice for anyone who wants to learn more about Spring Boot and its features.

Well, that's it, I hope you enjoyed it!
- Generating a Maven project without IDE in 2 minutes
What's Maven?
It's common to hear about Maven, especially in Java projects, but don't confuse Maven with Java, okay? Let me explain what Maven is and its use case. Maven is a popular build automation tool primarily used for Java projects. It provides a structured way to manage project dependencies, build processes, and releases. Maven uses a declarative approach to project management, where you define your project's specifications and dependencies in an XML file called pom.xml (Project Object Model). Maven helps simplify the build process by managing the dependencies of your project, downloading the required libraries from repositories, and providing a standardized way to build and package your application. It can also generate project documentation, run tests, and perform other tasks related to building and managing Java projects. To summarize, Maven provides a powerful toolset for building, managing, and releasing Java applications, and it is widely used in the Java development community.

Generating a Maven project without an IDE
Usually engineers generate a Maven project through an IDE, but there are easier ways to do the same without IDE support. If you haven't installed Maven yet, I recommend installing it before we start. You can download Maven here and, after downloading, follow the installation steps here. First of all, to be sure you've installed Maven, open the terminal and run the command below:

mvn -version

A message similar to the one below will be displayed in the terminal. Now, let's get started generating your Maven project.

1° Step: Open the terminal again and run the command below.

mvn archetype:generate -DgroupId=com.coffeeantips.maven.app -DartifactId=coffeeantips-maven-app -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.0 -DinteractiveMode=false

2° Step: After running the command above, a folder called coffeeantips-maven-app/ is created. Change into this directory and we'll see the following structure of folders and files.

Understanding the command parameters
archetype:generate: generates a new project from an archetype or updates the current project.
-DgroupId: specifies the package where the folders and project files will be generated.
-DartifactId: the project's or artifact's name.
-DarchetypeArtifactId: Maven provides a list of archetypes, which you can check here. For this example, we're using an archetype that generates a sample Maven project.
-DarchetypeVersion: the archetype's version.
-DinteractiveMode: defines whether Maven will interact with the user asking for input.

Books to study and read
If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s): Maven: The Definitive Guide. Written by Maven creator Jason Van Zyl and his team at Sonatype, Maven: The Definitive Guide clearly explains how this tool can bring order to your software development projects. In this book you'll learn about the POM and project relationships, the build lifecycle, plugins, project website generation, advanced site generation, reporting, properties, build profiles, the Maven repository and more.

Well, that's it, I hope you enjoyed it!
- Getting started using Terraform on AWS
Terraform is an IaC (Infrastructure as Code) tool that makes it possible to provision infrastructure in cloud services. Instead of manually creating resources in the cloud, Terraform facilitates the creation and control of these services through state management and a few lines of code. Terraform has its own language and can be used independently of other languages, isolating the business layer from the infrastructure layer. For this tutorial, we will create an S3 bucket and an SQS queue through Terraform on AWS.

Terraform installation
For installation, download the installer from this link according to your operating system.

AWS provider
We'll use AWS as the provider. When we select AWS as the provider, Terraform downloads the packages that enable the creation of AWS-specific resources. To follow the next steps, we expect that you already know about AWS credentials and that your user already has the necessary permissions to create resources on AWS.

Authentication
As we are using AWS as the provider, we need to configure Terraform to authenticate before creating the resources. There are a few ways to authenticate. For this tutorial, I chose to use one of the AWS mechanisms that allows you to keep credentials in a file in the $HOME/.aws folder and use it as a single authentication source. To create this folder with the credentials, we need to install the AWS CLI; access this link and follow the installation steps. This mechanism avoids using credentials directly in the code, so if you need to run a command or an SDK that connects to AWS locally, these credentials will be loaded from this file.

Credentials settings
After installing the AWS CLI, open the terminal and run the following command:

aws configure

In the terminal, fill in the fields using your user's credentials. After filling them in, 2 text files will be created in the $HOME/.aws directory:
config: containing the profiles; in this case the default profile was created.
credentials: containing the credentials themselves.

Let's change the files to suit this tutorial. Change the config file as below:

[profile staging]
output = json
region = us-east-1

[default]
output = json
region = us-east-1

In this case, we have 2 profiles configured: the default and the staging profile. Change the credentials file as below, replacing the placeholders with your own credentials:

[staging]
aws_access_key_id = [Access key ID]
aws_secret_access_key = [Secret access key]

[default]
aws_access_key_id = [Access key ID]
aws_secret_access_key = [Secret access key]

Creating Terraform files
After all these configurations, we will actually start working with Terraform. For this we need to create some base files that will help us create resources on AWS.
1º Step: In the root directory of your project, create a folder called terraform/
2º Step: Inside the terraform/ folder, create the files main.tf and vars.tf
3º Step: Create another folder called staging inside terraform/
4º Step: Inside the terraform/staging/ folder, create the file vars.tfvars
Okay, now we have the folder structure that we will use for the next steps.

Setting up the Terraform files
Let's start by declaring the variables in the vars.tf file.

vars.tf
This file is where we're going to create the variables used by the resources, bringing better flexibility to our code. We can create variables with a default value or simply empty, in which case they will be filled in according to the execution environment, which will be explained later.
variable "region" {
  default = "us-east-1"
  type    = "string"
}

variable "environment" {
}

We created two variables:
region: a string variable whose default value is the AWS region in which we are going to create the resources, in this case us-east-1.
environment: a variable that will represent the execution environment.

staging/vars.tfvars
In this file we define the value of the environment variable, which was previously created with no default value.

environment = "staging"

This strategy is useful when we have more than one environment. For example, if we had a production environment, we could have created another vars.tfvars file in a folder called production/. Now we can choose in which environment we will run Terraform. We'll understand this part when we run it later.

main.tf
Here is where we'll declare resources such as the S3 bucket and the SQS queue to be created on AWS. Let's understand the file in parts. In this first part we're declaring AWS as the provider and setting the region using the variable already created, through interpolation ${..}.

Provider

provider "aws" {
  region = "${var.region}"
}

Creating the S3 bucket
To create a resource via Terraform, we always start with the resource keyword, then the resource type, and finally an identifier:

resource "name_resource" "identifier" {}

In this snippet we're creating an S3 bucket called bucket.blog.data; remember that bucket names must be unique. The acl field defines the bucket restrictions, in this case private. The tags field is used to provide extra information to the resource, in this case the value of the environment variable.

resource "aws_s3_bucket" "s3_bucket" {
  bucket = "bucket.blog.data"
  acl    = "private"

  tags = {
    Environment = "${var.environment}"
  }
}

Creating the SQS queue
Now we'll create an SQS queue called sqs-posts. Note that the resource creation follows the same rules we described earlier. For this scenario, we set the delay_seconds field, which defines the delay time for a message to be delivered. More details here.

resource "aws_sqs_queue" "sqs-blog" {
  name          = "sqs-posts"
  delay_seconds = 90

  tags = {
    Environment = "${var.environment}"
  }
}

Running Terraform
1º Step: Initialize Terraform. Open the terminal and, inside the terraform/ directory, run the command:

terraform init

A console message is shown after running the command.

2º Step: In Terraform you can create workspaces. These are runtime environments that Terraform provides, bringing flexibility when it's necessary to run in more than one environment. Once Terraform is initialized, a default workspace is created. Run the command below to see which workspace you're running in:

terraform workspace list

For this tutorial we will simulate a development environment. Remember that we created a folder called /staging? Let's start using this folder as the development environment. For that, let's create a workspace in Terraform called staging as well. If we had a production environment, a production workspace could also be created.

terraform workspace new "staging"

Done, a new workspace called staging was created!

3º Step: In this step, we're going to list all existing resources or those that will be created; in this case, the latter.

terraform plan -var-file=staging/vars.tfvars

The plan parameter makes it possible to visualize the resources that will be created or updated; it is a good option for understanding the behavior before the resources are definitively created.
The second parameter, -var-file, makes it possible to choose a specific path containing the values of the variables to be used for the given execution environment, in this case the /staging/vars.tfvars file containing values related to the staging environment. If there were a production workspace, the execution would be the same, just pointing to a different folder, got it? Console messages are shown after running the last command with the plan parameter. Looking at the console, note that the resources declared earlier will be created:
aws_s3_bucket.s3_bucket
aws_sqs_queue.sqs-blog

4º Step: In this step, we are going to actually create the resources.

terraform apply -var-file=staging/vars.tfvars

Just replace the plan parameter with apply, and a confirmation message will be shown in the console. To confirm the resource creation, just type yes. That's it, the S3 bucket and the SQS queue were created! Now you can check them right in the AWS console.

Select workspace
If you need to change workspaces, run the command below, selecting the workspace you want to use:

terraform workspace select "[workspace]"

Destroying resources
This part of the tutorial requires a lot of attention. The next command makes it possible to remove all the resources that were created, without having to remove them one by one and avoiding unexpected surprises with AWS billing.

terraform destroy -var-file=staging/vars.tfvars

Type yes if you want to delete all the created resources. I don't recommend using this command in a production environment, but for this tutorial it's useful. So don't forget to delete everything, and AWS won't charge you in the future.

Conclusion
Terraform makes it possible to create infrastructure very simply through decoupled code. For this tutorial we used AWS as the provider, but it is also possible to use Google Cloud, Azure and other cloud services.

Books to study and read
If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s):
Terraform: Up & Running: Writing Infrastructure as Code is a book focused on how to use Terraform and its benefits. The author makes comparisons with several other IaC (Infrastructure as Code) tools such as Ansible and CloudFormation (AWS's native IaC) and, especially, shows how to create and provision different resources for multiple cloud services. Currently, Terraform is the most used tool in software projects for creating and managing resources in cloud services such as AWS, Azure, Google Cloud and many others. If you want to be a well-rounded engineer or work in the DevOps area, I strongly recommend learning about the topic.
AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS (Amazon Web Services) is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it.

Well, that's it, I hope you enjoyed it!
- How to generate random Data using Datafaker lib
Sometimes in our projects we have to fill Java objects for unit tests, or even create a database dump with random data to test a specific feature, and so on. We need to be creative, trying to come up with names, street names, cities or documents. There's an interesting and helpful Java library called Datafaker that allows us to create random data with a large number of providers. Providers are objects based on a context; for example, if you want to generate data about a Person object, there's a specific provider for this context that will generate a name, last name and so on. If you need to create a unit test that needs address data, you'll find it. In this post we'll create some examples using Maven, but the library also provides support for Gradle projects.

Maven

<dependency>
    <groupId>net.datafaker</groupId>
    <artifactId>datafaker</artifactId>
    <version>1.1.0</version>
</dependency>

Generating Random Data
Let's create a simple Java class that contains some properties like name, last name, address, favorite music genre and food.

public class RandomPerson {

    public String firstName;
    public String lastName;
    public String favoriteMusicGenre;
    public String favoriteFood;
    public String streetAddress;
    public String city;
    public String country;

    @Override
    public String toString() {
        return "firstName=" + firstName + "\n" +
               "lastName=" + lastName + "\n" +
               "favoriteMusicGenre=" + favoriteMusicGenre + "\n" +
               "favoriteFood=" + favoriteFood + "\n" +
               "streetAddress=" + streetAddress + "\n" +
               "city=" + city + "\n" +
               "country=" + country;
    }

    static void print(RandomPerson randomPerson) {
        System.out.println(randomPerson);
    }
}

In the next step we'll fill an object using the providers mentioned in the first section. First of all, we create an object called randomData that represents the Faker class. This class contains all the providers used in the example below.

public static void main(String[] args) {
    Faker randomData = new Faker();

    RandomPerson randomPerson = new RandomPerson();
    randomPerson.firstName = randomData.name().firstName();
    randomPerson.lastName = randomData.name().lastName();
    randomPerson.favoriteMusicGenre = randomData.music().genre();
    randomPerson.favoriteFood = randomData.food().dish();
    randomPerson.streetAddress = randomData.address().streetAddress();
    randomPerson.city = randomData.address().city();
    randomPerson.country = randomData.address().country();

    print(randomPerson);
}

After the execution, we can see results like these in the console:

Result
firstName=Dorthy
lastName=Jones
favoriteMusicGenre=Electronic
favoriteFood=Cauliflower Penne
streetAddress=7411 Darin Gateway
city=Gutkowskifort
country=Greece

Every execution will produce a new result because the providers are random. Another interesting feature is that we can set the Locale when instantiating the object.

Faker randomData = new Faker(Locale.JAPANESE);

See the results based on Locale.JAPANESE:

Result
firstName=航
lastName=横山
favoriteMusicGenre=Non Music
favoriteFood=French Fries with Sausages
streetAddress=418 美桜Square
city=南斉藤区
country=Togo
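Since the post mentions unit tests as the main use case, here is a minimal, hypothetical sketch of how the library could be used inside a test. The test class, field names and assertions are assumptions (they are not part of the original post), and it presumes JUnit 5 is on the classpath alongside the RandomPerson class above.

import net.datafaker.Faker;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertNotNull;

class RandomPersonTest {

    private final Faker faker = new Faker();

    @Test
    void shouldFillPersonWithRandomData() {
        RandomPerson person = new RandomPerson();
        person.firstName = faker.name().firstName();
        person.city = faker.address().city();

        // The generated values change on every run, so we only assert they are present.
        assertNotNull(person.firstName);
        assertFalse(person.city.isEmpty());
    }
}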
Books to study and read
If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s):
Unit Testing Principles, Practices, and Patterns: Effective Testing Styles, Patterns, and Reliable Automation for Unit Testing, Mocking, and Integration Testing with Examples in C# is a book that covers unit testing principles, patterns and practices, and teaches you to design and write tests that target key areas of your code, including the domain model. In this clearly written guide, you learn to develop professional-quality tests and test suites and to integrate testing throughout the application life cycle.
Mastering Unit Testing Using Mockito and JUnit is a book that covers JUnit practices using one of the most famous testing libraries, Mockito. This book teaches how to create and maintain automated unit tests using advanced features of JUnit with the Mockito framework, continuous integration practices (the famous CI) using market tools like Jenkins, along with one of the largest dependency managers in Java projects, Maven. For those who are starting out in this world, it's an excellent choice.

Isn't it a cool library!? See you!
- Working with Schemas in Spark Dataframes using PySpark
What's a schema in the DataFrame context? A schema is metadata that allows you to work with standardized data. Well, that was my definition of schemas, but we can also understand a schema as a structure that represents a data context or a business model. Spark enables the use of schemas with DataFrames, and I believe that is a good way to keep data quality and reliability; we can also use these points to understand the data and connect it to the business.

But if you know a little more about DataFrames, working with a schema isn't a rule. Spark provides a feature to infer a schema when none is defined and reach the same result, but depending on the data source the inference may not work as we expect. In this post we're going to create a simple DataFrame example that reads a CSV file without a schema, and another one using a defined schema. Through the examples we'll be able to see the advantages and disadvantages. Let's get to work!

CSV file content

"type","country","engines","first_flight","number_built"
"Airbus A220","Canada",2,2013-03-02,179
"Airbus A320","France",2,1986-06-10,10066
"Airbus A330","France",2,1992-01-02,1521
"Boeing 737","USA",2,1967-08-03,10636
"Boeing 747","USA",4,1969-12-12,1562
"Boeing 767","USA",2,1981-03-22,1219

If you look at the content above, you'll notice we have different data types: string, numeric and date columns. The content above is represented by airliners.csv in the code.

Writing a DataFrame without a schema

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("schema-app") \
        .getOrCreate()

    air_liners_df = spark.read \
        .option("header", "true") \
        .format("csv") \
        .load("airliners.csv")

    air_liners_df.show()
    air_liners_df.printSchema()

DataFrame/printSchema result
It seems to have worked fine, but if you look closely, you'll realize that in the schema structure there are some field types that don't match their values, for example fields like number_built, engines and first_flight. They aren't string types, right? We can try to fix it by adding a parameter called "inferSchema" and setting it to "true".

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("schema-app") \
        .getOrCreate()

    air_liners_df = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .format("csv") \
        .load("airliners.csv")

    air_liners_df.show()
    air_liners_df.printSchema()

DataFrame/printSchema result
Even when inferring the schema, the first_flight field is kept as a string type. Let's try using a DataFrame with a defined schema to see if this detail gets fixed.

Writing a DataFrame with a schema
Now it's possible to see the differences between the two approaches. We're adding an object that represents the schema. This schema describes the content of the CSV file; note that we have to describe the column name and type.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, IntegerType, DateType, StructField

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("schema-app") \
        .getOrCreate()

    struct_schema = StructType([
        StructField("type", StringType()),
        StructField("country", StringType()),
        StructField("engines", IntegerType()),
        StructField("first_flight", DateType()),
        StructField("number_built", IntegerType())
    ])

    air_liners_df = spark.read \
        .option("header", "true") \
        .format("csv") \
        .schema(struct_schema) \
        .load("airliners.csv")

    air_liners_df.show()
    air_liners_df.printSchema()

DataFrame/printSchema result
After we defined the schema, all the field types match their values. This shows how important it is to use schemas with DataFrames. Now it's possible to manipulate the data according to its type with no concerns.

Books to study and read
If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s):
Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and its main features. Reading this book you will learn about DataFrames and Spark SQL through practical examples. The author dives into Spark's low-level APIs and RDDs, and also covers how Spark runs on a cluster and how to debug and monitor Spark cluster applications. The practical examples are in Scala and Python.
Beginning Apache Spark 3: With DataFrame, Spark SQL, Structured Streaming, and Spark Machine Learning Library covers the new version of Spark and explores its main features, such as DataFrame usage, Spark SQL, which lets you use SQL to manipulate data, and Structured Streaming to process data in real time. This book contains practical examples and code snippets to make the reading easier.
High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark is a book that explores best practices using Spark and the Scala language to handle large-scale data applications: techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over SQL's RDD data structure, examples of using the Spark MLlib and Spark ML machine learning libraries, and more.
Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming covers the basic concepts of Python through interactive examples and best practices.
Learning Scala: Practical Functional Programming for the JVM is an excellent book that covers Scala through examples and exercises. Reading this book you will learn about the core data types, literals, values and variables; building classes that compose one or more traits for full reusability; creating new functionality by mixing them in at instantiation; and more. Scala is one of the main languages in Big Data projects around the world, with huge usage in big tech companies like Twitter, and it's also Spark's core language.

Cool? I hope you enjoyed it!
- How to save costs on S3 when running a Data Lake
Cloud services provide useful resources to scale your business faster, but we can't always measure cloud costs when starting a business from scratch, and even for an established business, cost is always part of the strategy for any company that wants to provide a better service. My teammates and I have been working on an event-based Data platform that processes 350 million events every day. We provide data to client applications and to business teams so they can make decisions, and it is always a challenge to deal with the massive data traffic while maintaining this data and saving money on storage at the same time. Storage is expensive, but there are strategies to save money. In this post I'll describe some of the strategies we've adopted to save costs on S3 (Simple Storage Service), and I hope it helps.

Strategies

Strategy #1 Amazon S3 storage classes

Amazon S3 lets you manage files through lifecycle settings, where you can define rules to move files to different storage classes depending on the file's age and access frequency. This strategy can save your company a lot of money.

Working with storage classes enables us to save costs. By default, data is stored in the S3 Standard storage class. This storage type has benefits for storage and data access, but we realized that after data was transformed into the Silver layer, the data in the Bronze layer wasn't accessed very often, so it was entirely possible to move it to a cheaper storage class. We decided to move it, using lifecycle settings, to the S3 Intelligent-Tiering storage class. This storage class was a perfect fit for our context, because we could save on storage costs and, even if we needed to access those files for some reason, we would still pay a fair cost. We're working towards a better scenario in which we also set a lifecycle rule on the Silver layer to move files that haven't been accessed for a period of time to a cheaper storage class, but at the moment we still need to access historical files frequently. If you check the AWS documentation you'll note that there are even cheaper storage classes, but you and your team should analyse each case, because the cheaper it is to store data, the more expensive it becomes to access it. So be careful: try to understand the storage and data-access patterns in your Data Lake architecture before choosing the storage class that best fits your business. A minimal sketch of such a lifecycle rule is shown after Strategy #2 below.

Strategy #2 Partitioning Data

Apache Spark is the most famous framework for processing large amounts of data and has been adopted by data teams around the world. During data transformation with Spark you can configure a Dataframe to partition the data by a specific column. This approach is very useful for making SQL queries perform better. Note that partitioning has no direct relation to S3 itself, but it avoids full scans of S3 objects. A full scan means that, for a single SQL query, the engine may load gigabytes or even terabytes of data. This can be very expensive for your company, because you can easily be charged depending on the amount of data loaded. So partitioning data plays an important role when we need to save costs; see the sketch below.
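To illustrate Strategy #1, here is a minimal sketch of a lifecycle rule created with boto3 that transitions objects under a Bronze prefix to Intelligent-Tiering after 30 days. The bucket name, prefix and 30-day threshold are hypothetical; adjust them to your own Data Lake layout (the same rule can also be created through the S3 console or infrastructure-as-code tools).

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix: move Bronze-layer objects older than
# 30 days to the S3 Intelligent-Tiering storage class.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "bronze-to-intelligent-tiering",
                "Filter": {"Prefix": "bronze/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ]
            }
        ]
    }
)

And for Strategy #2, a minimal PySpark sketch that writes a Dataframe partitioned by a date column, so that queries filtering on that column read only the matching partitions instead of scanning the whole table. The paths and the event_date column are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("partitioning-example") \
    .getOrCreate()

# Hypothetical Bronze-layer input
events_df = spark.read.json("s3a://my-data-lake-bucket/bronze/events/")

# Writing partitioned by event_date creates one folder per date value,
# so queries filtering on event_date avoid a full scan of the objects.
events_df.write \
    .partitionBy("event_date") \
    .mode("overwrite") \
    .format("parquet") \
    .save("s3a://my-data-lake-bucket/silver/events/")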
Strategy #3 Delta Lake vacuum

Delta Lake has an interesting feature called vacuum, a mechanism to remove unused files from storage. Teams usually adopt this strategy after restoring table versions, because some files remain on disk but are no longer managed by Delta Lake. For example, in the image below we have 5 versions of a Delta table and their partitions.

Suppose we need to restore to version 1 because we found some inconsistent data after that version. After the restore command, Delta points its management to version 1 as the current version, but the parquet files related to the other versions remain there unused. We can remove these parquet files by running the vacuum command, as shown below. Note that the parquet files related to the versions after 1 were removed, releasing space in the storage. For more details, I strongly recommend the Delta Lake documentation.

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s):

AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS, or Amazon Web Services, is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it.

Well, that's it, I hope you enjoyed it!
- Differences between FAILFAST, PERMISSIVE and DROPMALFORMED modes in Dataframes
There are some differences between them, and we're going to explore them in this post. The mode parameter is a way to handle corrupted records; depending on the mode, it allows validating Dataframes and keeping the data consistent. In this post we'll create Dataframes with PySpark and compare the differences between these three modes:

PERMISSIVE
DROPMALFORMED
FAILFAST

CSV file content

The content below simulates some corrupted records: there are String values in the engines column, which we'll define as an Integer type in the schema.

"type","country","city","engines","first_flight","number_built"
"Airbus A220","Canada","Calgary",2,2013-03-02,179
"Airbus A220","Canada","Calgary","two",2013-03-02,179
"Airbus A220","Canada","Calgary",2,2013-03-02,179
"Airbus A320","France","Lyon","two",1986-06-10,10066
"Airbus A330","France","Lyon","two",1992-01-02,1521
"Boeing 737","USA","New York","two",1967-08-03,10636
"Boeing 737","USA","New York","two",1967-08-03,10636
"Boeing 737","USA","New York",2,1967-08-03,10636
"Airbus A220","Canada","Calgary",2,2013-03-02,179

Let's start by creating a simple Dataframe that loads data from a CSV file with the content above; let's suppose it comes from a file called airplanes.csv. To model the content, we're also creating a schema that allows us to validate the data.

Creating a Dataframe using PERMISSIVE mode

The PERMISSIVE mode sets field values to null when corrupted records are detected. By default, if you don't specify the mode parameter, Spark uses the PERMISSIVE value.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("spark-app") \
        .getOrCreate()

    schema = StructType([
        StructField("TYPE", StringType()),
        StructField("COUNTRY", StringType()),
        StructField("CITY", StringType()),
        StructField("ENGINES", IntegerType()),
        StructField("FIRST_FLIGHT", StringType()),
        StructField("NUMBER_BUILT", IntegerType())
    ])

    read_df = spark.read \
        .option("header", "true") \
        .option("mode", "PERMISSIVE") \
        .format("csv") \
        .schema(schema) \
        .load("airplanes.csv")

    read_df.show(10)

Result of PERMISSIVE mode
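One detail worth knowing about PERMISSIVE mode: Spark can also keep the raw text of each malformed record if you add an extra string column to the schema and point the columnNameOfCorruptRecord option at it. Below is a minimal sketch of that idea, reusing the same schema and airplanes.csv file from the example above; the _corrupt_record column name is just the conventional default.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("spark-app") \
        .getOrCreate()

    # Same schema as before, plus an extra string column that receives
    # the raw text of any record that fails to parse.
    schema = StructType([
        StructField("TYPE", StringType()),
        StructField("COUNTRY", StringType()),
        StructField("CITY", StringType()),
        StructField("ENGINES", IntegerType()),
        StructField("FIRST_FLIGHT", StringType()),
        StructField("NUMBER_BUILT", IntegerType()),
        StructField("_corrupt_record", StringType())
    ])

    read_df = spark.read \
        .option("header", "true") \
        .option("mode", "PERMISSIVE") \
        .option("columnNameOfCorruptRecord", "_corrupt_record") \
        .format("csv") \
        .schema(schema) \
        .load("airplanes.csv")

    # Rows that parsed cleanly have _corrupt_record as null; malformed rows
    # keep their original line there for later inspection.
    read_df.show(10, truncate=False)

This can be handy when you want PERMISSIVE's tolerance but still need to audit which records were rejected.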
Creating a Dataframe using DROPMALFORMED mode

The DROPMALFORMED mode ignores corrupted records, meaning that if you choose this mode, the corrupted records won't be listed.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("spark-app") \
        .getOrCreate()

    schema = StructType([
        StructField("TYPE", StringType()),
        StructField("COUNTRY", StringType()),
        StructField("CITY", StringType()),
        StructField("ENGINES", IntegerType()),
        StructField("FIRST_FLIGHT", StringType()),
        StructField("NUMBER_BUILT", IntegerType())
    ])

    read_df = spark.read \
        .option("header", "true") \
        .option("mode", "DROPMALFORMED") \
        .format("csv") \
        .schema(schema) \
        .load("airplanes.csv")

    read_df.show(10)

Result of DROPMALFORMED mode

After execution it's possible to see that the corrupted records are no longer available in the Dataframe.

Creating a Dataframe using FAILFAST mode

Unlike DROPMALFORMED and PERMISSIVE, FAILFAST throws an exception when it detects corrupted records.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("spark-app") \
        .getOrCreate()

    schema = StructType([
        StructField("TYPE", StringType()),
        StructField("COUNTRY", StringType()),
        StructField("CITY", StringType()),
        StructField("ENGINES", IntegerType()),
        StructField("FIRST_FLIGHT", StringType()),
        StructField("NUMBER_BUILT", IntegerType())
    ])

    read_df = spark.read \
        .option("header", "true") \
        .option("mode", "FAILFAST") \
        .format("csv") \
        .schema(schema) \
        .load("airplanes.csv")

    read_df.show(10)

Result of FAILFAST mode

ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.

Books to study and read

If you want to learn more and reach a high level of knowledge, I strongly recommend reading the following book(s):

Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and its main features. Reading this book you will understand DataFrames and Spark SQL through practical examples. The author also dives into Spark's low-level APIs and RDDs, and covers how Spark runs on a cluster and how to debug and monitor Spark applications. The practical examples are in Scala and Python.

Beginning Apache Spark 3: With DataFrame, Spark SQL, Structured Streaming, and Spark Machine Learning Library explores the main features of the new version of Spark, such as Dataframes, Spark SQL (which lets you use SQL to manipulate data) and Structured Streaming for processing data in real time. The book contains practical examples and code snippets to make the reading easier.

High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark explores best practices for using Spark and Scala to handle large-scale data applications, techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over SQL's RDD data structure, examples of the Spark MLlib and Spark ML machine learning libraries, and more.

Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming covers the basic concepts of Python through interactive examples and best practices.

Learning Scala: Practical Functional Programming for the JVM is an excellent book that covers Scala through examples and exercises. Reading this book you will learn about the core data types, literals, values and variables; building classes that compose one or more traits for full reusability; creating new functionality by mixing traits in at instantiation; and more. Scala is one of the main languages in Big Data projects around the world, with heavy usage in big tech companies like Twitter, and it is also Spark's core language.

Cool? I hope you enjoyed it!