- Differences between External and Internal tables in Hive
There are two ways to create tables in Hive, and in this post we'll show the differences, advantages and disadvantages of each.

Internal Table
Internal tables are also known as Managed tables, and we'll understand why in the following sections. Now, let's create an internal table using SQL in the Hive context and look at its advantages and disadvantages.

create table coffee_and_tips_table (name string, age int, address string)
stored as textfile;

Advantages
To be honest, I wouldn't really call it an advantage, but internal tables are managed by Hive.

Disadvantages
Internal tables can't access remote storage services, for example clouds such as Amazon AWS, Microsoft Azure and Google Cloud.
Dropping an internal table drops all the data, including metadata and partitions.

External Table
External tables have some interesting features compared to internal tables, and they are a good and recommended approach when we need to create tables. In the script below you can see the difference from the internal table creation in the last section: we just added the reserved word external to the script.

create external table coffee_and_tips_external (name string, age int, address string)
stored as textfile;

Advantages
The data and metadata won't be lost if you drop the table.
External tables can be accessed and managed by external processes.
External tables allow using a remote storage service as the source location (see the sketch at the end of this post).

Disadvantages
Again, I wouldn't call it a disadvantage, but if you need to change the schema or drop a table, you'll probably need to run a command to repair the table, as shown below.

msck repair table <table_name>

Depending on the volume, this operation may take some time to complete.

To check a table's type, run the command below and look at the table_type field in the result.

hive> describe formatted <table_name>

Books to study and read
If you want to learn more and reach an advanced level of knowledge, I strongly recommend reading the following book(s):

Programming Hive This comprehensive guide introduces you to Apache Hive, Hadoop's data warehouse infrastructure. You'll quickly learn how to use Hive's SQL dialect, HiveQL, to summarize, query, and analyze large datasets stored in Hadoop's distributed filesystem.

Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and its main features. Reading this book you will understand DataFrames and Spark SQL through practical examples. The author dives into Spark's low-level APIs and RDDs, and also covers how Spark runs on a cluster and how to debug and monitor Spark cluster applications. The practical examples are in Scala and Python.

Beginning Apache Spark 3: With Dataframe, Spark SQL, Structured Streaming, and Spark Machine Library covers the new version of Spark and explores its main features, such as Dataframes, Spark SQL, which lets you use SQL to manipulate data, and Structured Streaming for processing data in real time. The book contains practical examples and code snippets to make reading easier.

High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark explores best practices using Spark and the Scala language to handle large-scale data applications, techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over SQL's RDD data structure, examples of using the Spark MLlib and Spark ML machine learning libraries, and more.

Cool? I hope you enjoyed it!
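Sketch: external table with a remote location
To illustrate the remote storage point above, here is a minimal sketch of an external table whose data lives in an object store. The bucket and prefix are hypothetical, and reading from S3 assumes the cluster is configured with the proper S3 filesystem connector (as on Amazon EMR, for instance).

create external table coffee_and_tips_external (name string, age int, address string)
stored as textfile
location 's3://my-bucket/coffee_and_tips/';   -- hypothetical bucket/prefix

-- after files or partitions are added outside Hive, sync the metastore
msck repair table coffee_and_tips_external;

-- the table_type field in the output shows EXTERNAL_TABLE
describe formatted coffee_and_tips_external;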
- First steps with DBT - Data Build Tool
DBT has been adopted by a lot of companies in the data area, and I believe we can extract good insights about it in this post. This is going to be a practical post showing how DBT works, and I hope you guys enjoy it.

What's DBT?
DBT means Data Build Tool and enables teams to transform data already loaded into their warehouse with simple select statements. DBT does the T in ELT processes; in other words, it doesn't extract or load data, but it is useful for transforming it.

Step 1: Creating a DBT Project
We'll assume that DBT is already installed, but if not, I recommend following this link. With DBT installed, you can create a new project using the CLI, or you can clone this project from the DBT Github repository. For this post we're going to use CLI mode to create our project and also to complete the next steps.

To create a new project, run the command below.

dbt init

After running this command, you need to type the project's name and which warehouse or database you're going to use, as in the image below. For this post, we're going to use the postgres adapter. It's very important that you have a postgres database already installed, or you can spin up a postgres image using docker. About adapters, DBT supports several of them and you can check the list here.

I created a table structure and loaded it with data simulating a video platform called wetube, and we're going to use it to understand how DBT works. Follow the structure:

Step 2: Structure and more about DBT
After running the dbt init command to create the project, the structure of folders and files below will be created. I won't talk about every directory of the project, but I'd like to focus on two of them.

Sources
Sources are basically the data already loaded into your warehouse. In the DBT process, sources have the same meaning as raw data. There are no folders representing source data in this project, but you need to know this term because we're going to set up tables already created as sources in the next sections.

Seeds
Seeds are an interesting and useful mechanism to load static data into your warehouse through CSV files. If you want to load this data, you need to create a CSV file in this directory and run the command below.

dbt seed

For each field in the CSV file, DBT will infer its type and create a table in the warehouse or database.

Models
DBT works with a model paradigm: the main idea is that you create models through transformations, using SQL statements based on source tables or existing models. Every SQL file located in your model folder will create a model in your warehouse when the command below runs.

dbt run

Remember that a model can be created from a source or from another model; don't worry about this, I'll show you more details about it.

Step 3: Setting up the database connection
With the project created, we need to set up our database connection, and in this post we're going to use postgres as the database. After initializing the project, a bunch of files are created, and one of them is called profiles.yml.

The profiles.yml file is responsible for controlling the different profiles for different database connections, such as dev and production environments. If you've noticed, we can't see this file in the image above, because it is created outside the project to avoid exposing sensitive credentials. You can find this file in the ~/.dbt/ directory. We have one profile named dbt_blog and a target called dev; by default the target refers to dev with the database connection settings. A sketch of what this file can look like is shown below.
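For reference, a profiles.yml for the postgres adapter usually follows the structure below. The host, credentials, database and schema here are placeholders for a local setup, not values from this post.

dbt_blog:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost      # placeholder connection settings
      user: postgres
      pass: postgres
      port: 5432
      dbname: wetube       # assumed database name for this example
      schema: public
      threads: 1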
Also, it's possible to create one or more profiles and targets, which enables working with different environments. Another important detail is that the dbt_blog profile should be specified in the dbt_project.yml file as the default profile. In the next sections, we'll discuss what the dbt_project.yml file is and how it works.

Step 4: Creating the dbt_project.yml file
Every DBT project has a dbt_project.yml file, where you can set up information like the project name, directories, profiles and materialization type.

name: 'dbt_blog'
version: '1.0.0'
config-version: 2

profile: 'dbt_blog'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"

models:
  dbt_blog:
    # Config indicated by + and applies to all files under models/example/
    mart:
      +materialized: table

Note that the profile field was set to the same profile specified in the profiles.yml file. Another important detail is the materialized field: here it was set to "table", but the default is "view". The materialized field allows you to create models as a table or a view on each run. There are other types of materialization, but we won't discuss them here; I recommend the dbt docs.

Step 5: Creating our first model

Creating the first files
Let's change things a little: we're going to create a sub-folder in the models directory called mart, and inside this folder we're going to create our .SQL files plus another important file that we haven't discussed yet, called schema.yml.

Creating the schema file
Schema files are used to map sources and to document models, such as the model's name, columns and more. Now you can create a file called schema.yml and fill it with the information below.

version: 2

sources:
  - name: wetube
    tables:
      - name: account
      - name: city
      - name: state
      - name: channel
      - name: channel_subs
      - name: video
      - name: video_like
      - name: user_address

models:
  - name: number_of_subs_by_channel
    description: "Number of subscribers by channel"
    columns:
      - name: id_channel
        description: "Channel's ID"
        tests:
          - not_null
      - name: channel
        description: "Channel's Name"
        tests:
          - not_null
      - name: num_of_subs
        description: "Number of Subs"
        tests:
          - not_null

sources: In the sources field you include the tables from your warehouse or database that are going to be used in model creation.
models: In the models field you include the model's name, columns and their descriptions.

Creating a model
This is where we create the SQL scripts that will result in our first model. For the first model, we're going to write a SQL statement for a model that shows the number of subscribers per channel. Create a file called number_of_subs_by_channel.sql and fill it with the script below.

with source_channel as (
    select * from {{ source('wetube', 'channel') }}
),

source_channel_subs as (
    select * from {{ source('wetube','channel_subs') }}
),

number_of_subs_by_channel as (
    select source_channel.id_channel,
           source_channel.name,
           count(source_channel_subs.id_subscriber) num_subs
    from source_channel_subs
    inner join source_channel using (id_channel)
    group by 1, 2
)

select * from number_of_subs_by_channel

Understanding model creation
Note that the query is split into common table expressions (CTEs), which makes the code easier to understand. DBT lets us use the Jinja template syntax {{ }}, bringing more flexibility to our code.
Using the source keyword inside the Jinja template means we're referring to source tables. To reference a model, you use the ref keyword instead. The last SELECT statement, based on the source tables, generates the model that will be created as a table in the database.

Running our first model
Run the command below to create our first model.

dbt run

Output:

Creating another model
Imagine that we need to create a model containing account information and its channels. Let's get back to the schema.yml file to describe this new model.

  - name: account_information
    description: "Model containing account information and its channels"
    columns:
      - name: id_account
        description: "Account ID"
        tests:
          - not_null
      - name: first_name
        description: "First name of user's account"
        tests:
          - not_null
      - name: last_name
        description: "Last name of user's account"
        tests:
          - not_null
      - name: email
        description: "Account's email"
        tests:
          - not_null
      - name: city_name
        description: "City's name"
        tests:
          - not_null
      - name: state_name
        description: "State's name"
        tests:
          - not_null
      - name: id_channel
        description: "Channel's ID"
        tests:
          - not_null
      - name: channel_name
        description: "Channel's name"
        tests:
          - not_null
      - name: channel_creation
        description: "Channel's creation date"
        tests:
          - not_null

Now, let's create a new SQL file, name it account_information.sql and add the script below:

with source_channel as (
    select * from {{ source('wetube', 'channel') }}
),

source_city as (
    select * from {{ source('wetube','city') }}
),

source_state as (
    select * from {{ source('wetube','state') }}
),

source_user_address as (
    select * from {{ source('wetube','user_address') }}
),

source_account as (
    select * from {{ source('wetube','account') }}
),

account_info as (
    select account.id_user as id_account,
           account.first_name,
           account.last_name,
           account.email,
           city.name as city_name,
           state.name as state_name,
           channel.id_channel,
           channel.name as channel,
           channel.creation_date as channel_creation
    from source_account account
    inner join source_channel channel on (channel.id_account = account.id_user)
    inner join source_user_address user_address using (id_user)
    inner join source_state state using (id_state)
    inner join source_city city using (id_city)
)

select * from account_info

Creating our last model
For our last model, we're going to create a model showing how many likes each video has. Let's change schema.yml again to describe and document our final model.
  - name: total_likes_by_video
    description: "Model containing the total of likes by video"
    columns:
      - name: id_channel
        description: "Channel's ID"
        tests:
          - not_null
      - name: channel
        description: "Channel's name"
        tests:
          - not_null
      - name: id_video
        description: "Video's ID"
        tests:
          - not_null
      - name: title
        description: "Video's title"
        tests:
          - not_null
      - name: total_likes
        description: "Total of likes"
        tests:
          - not_null

Create a file called total_likes_by_video.sql and add the code below:

with source_video as (
    select * from {{ source('wetube','video') }}
),

source_video_like as (
    select * from {{ source('wetube','video_like') }}
),

source_account_info as (
    select * from {{ ref('account_information') }}
),

source_total_like_by_video as (
    select source_account_info.id_channel,
           source_account_info.channel,
           source_video.id_video,
           source_video.title,
           count(*) as total_likes
    from source_video_like
    inner join source_video using (id_video)
    inner join source_account_info using (id_channel)
    group by source_account_info.id_channel,
             source_account_info.channel,
             source_video.id_video,
             source_video.title
    order by total_likes desc
)

select * from source_total_like_by_video

Running DBT again
After creating our files, let's run them again to create the models.

dbt run

Output

The models were created in the database and you can run select statements directly in your database to check them.

Model: account_information
Model: number_of_subs_by_channel
Model: total_likes_by_video

Step 6: DBT Docs

Documentation
After generating our models, we're going to generate docs based on them. DBT generates complete documentation about the models and sources and their columns, which you can also browse through a web page.

Generating docs

dbt docs generate

Running docs on a webserver
After the docs are generated, you can run the command below to start a webserver on port 8080 and see the documentation locally.

dbt docs serve

Lineage
Another nice detail about the documentation is that, through the Lineage graph, you can see the models and their dependencies.

Github code
You can check out this code on our Github page. Since schema.yml also declares not_null tests for the models' columns, see the short note at the end of this post on how to run them.

Cool? I hope you guys enjoyed it!
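As an optional extra step, the not_null tests declared in schema.yml can be executed with the command below; dbt test runs every test defined for your sources and models and reports any failures.

dbt test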
- Overview about AWS SNS - Simple Notification Service
SNS (Simple Notification Service) provides a notification service based on the Pub/Sub strategy. It's a way of publishing messages to one or more subscribers through endpoints. Is that confusing? Let's dig a little deeper into this topic.

Usually the term Pub/Sub is related to event-driven architectures. In this architecture, publishing messages is done through notifications to one or more already-known destinations, providing an asynchronous approach. For a destination to become known, there must be a way to signal that it is a candidate to receive messages from the source.

But how do subscriptions work? In the SNS context, each subscriber can be associated with one or more SNS Topics. Thus, each message published to a Topic will be delivered to one or more subscribers. We can compare it to receiving push notifications from the apps installed on our smartphones: it's the same idea. After installing an app we become a subscriber of that service, and each interaction from that application can be done through notifications or published messages. The example above demonstrates a possible use case where SNS could be applied. In the next sections we'll discuss the details for a better understanding.

SNS basically provides two main elements, Publishers and Subscribers. Both of them work together and are exposed through the AWS console and APIs.

1. Topics/Publishers
Topics are logical endpoints that work as an interface between Publisher and Subscriber. Basically, a Topic delivers the messages published by the publisher to its subscribers. There are two types of Topics, Fifo and Standard:

Fifo: The Fifo type guarantees message ordering (First in/First out), has a limit of up to 300 published messages per second, prevents message duplication and supports only the SQS protocol as a subscriber.
Standard: It does not guarantee message ordering, and all of the supported delivery protocols can subscribe to a standard topic, such as SQS, Lambda, HTTP, SMS, EMAIL and mobile app endpoints.

Limits
AWS allows creating up to 100,000 topics per account.

2. Subscribers
A subscription is a way to connect an endpoint to a specific Topic. Each subscription must be associated with a Topic to receive notifications from that Topic. Examples of endpoints:

AWS SQS
HTTP
HTTPS
AWS Kinesis Data Firehose
E-mail
SMS
AWS Lambda

These endpoints are examples of delivery or transport formats for receiving notifications from a Topic through a subscription.

Subscription limits
AWS allows up to 10 million subscriptions per Topic.

3. Message size limit
SNS messages can contain up to 256 KB of text data, unlike SMS messages, which can contain up to 140 bytes of text data.

4. Message types
SNS supports different types of messages, such as text, XML, JSON and more.

5. Differences between SNS and SQS
Sometimes people get confused about the differences, but don't worry, I can explain them. SNS and SQS are different services, but they can be combined. SQS is an AWS queue service that retains messages sent from different clients and contexts, but at the same time a queue can work as a subscriber to a Topic. Thus, an SQS queue subscribed to a Topic will start receiving notifications, becoming an asynchronous integration. A rough CLI sketch of this integration is shown below.
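As a rough illustration, the AWS CLI can wire this integration together. The topic and queue names and the account ID below are placeholders, and in practice the queue also needs an access policy allowing SNS to send messages to it, which is omitted here.

# create a standard topic and a queue (names are illustrative)
aws sns create-topic --name orders-topic
aws sqs create-queue --queue-name orders-queue

# subscribe the queue to the topic using both ARNs
aws sns subscribe \
    --topic-arn arn:aws:sns:us-east-1:123456789012:orders-topic \
    --protocol sqs \
    --notification-endpoint arn:aws:sqs:us-east-1:123456789012:orders-queue

# publish a message; every subscriber of the topic receives a copy
aws sns publish \
    --topic-arn arn:aws:sns:us-east-1:123456789012:orders-topic \
    --message "new order created"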
Look at the image above: we're simulating a scenario where we have three SQS queues subscribed to three Topics. SQS 1 is a subscriber to Topics 1 and 2, so SQS 1 will receive notifications/messages from both. SQS 2 is a subscriber to Topics 2 and 3 and will automatically receive messages from both of them. And for the last case, we have SQS 3 as a subscriber to Topic 3 only, so it will receive messages only from Topic 3. For more details, I recommend reading this doc.

Books to study and read
If you want to learn more and reach an advanced level of knowledge, I strongly recommend reading the following book(s):

AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS, or Amazon Web Services, is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it.

Well that's it, I hope you enjoyed it!
- How to create S3 notification events using SQS via Terraform
S3 (Simple Storage Service) makes it possible to notify, through events, when an action occurs within a Bucket or in a specific folder. In other words, it works as a listener: whenever an action takes place on a source, an event notification is sent to a destination.

What would those actions be? Any action that takes place within an S3 Bucket, such as creating objects or folders, removing files, restoring files and more.

Destinations
For each event notification configuration, there must be a destination, to which information about each action will be sent. For example: a new file has been created in a specific folder, so information about that file is sent, such as the creation date, file size, event type, file name and more. Remember that in this process the content of the file itself is not sent, okay?

There are 3 types of destinations:
Lambda
SNS
SQS

Understanding how it works
In this post we are going to create an event notification setting in an S3 Bucket, simulate an action and understand the final behavior. We could create this setting via the console, but for good-practice reasons we'll use Terraform as our IaC tool. For those who aren't very familiar with Terraform, follow this tutorial on Getting Started using Terraform on AWS.

In the next steps, we will create a flow simulating the image below: we'll configure an S3 Bucket so that, for every file created within the files/ folder, a notification event is sent to an SQS queue.

Creating the Terraform files
Create a folder called terraform/ in your project; from now on, all .tf files will be created inside it.

Now, create a file called vars.tf, where we're going to store the variables that will be used, and paste the content below into this file.

variable "region" {
  default = "us-east-1"
  type    = string
}

variable "bucket" {
  type = string
}

Create a file called provider.tf, where we will add the provider settings, which will be AWS. This means Terraform will use AWS as the cloud to create the resources and will download the required plugins on startup. Copy the code below to the file.

provider "aws" {
  region = "${var.region}"
}

Create a file called s3.tf, where we'll add the settings for creating a new S3 Bucket that will be used in this tutorial.

resource "aws_s3_bucket" "s3_bucket_notification" {
  bucket = "${var.bucket}"
}

Now, create a file called sqs.tf, where we'll add the settings for creating an SQS queue and some permissions, according to the code below:

resource "aws_sqs_queue" "s3-notifications-sqs" {
  name = "s3-notifications-sqs"

  policy = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": "*",
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:*:*:s3-notifications-sqs",
      "Condition": {
        "ArnEquals": { "aws:SourceArn": "${aws_s3_bucket.s3_bucket_notification.arn}" }
      }
    }
  ]
}
POLICY
}

Understanding the code above
In the code above, we're creating an SQS queue and adding some policy settings. In more detail:
The SQS name will be s3-notifications-sqs, the value set in the name field.
In the policy field, we define a policy that allows S3 to send notification messages to SQS. Notice that we're referencing the S3 Bucket via ARN in the snippet ${aws_s3_bucket.s3_bucket_notification.arn}.

For the last file, let's create the settings that allow sending event notifications from the S3 Bucket to the SQS queue.
Therefore, create the s3_notification.tf file and add the code below:

resource "aws_s3_bucket_notification" "s3_notification" {
  bucket = aws_s3_bucket.s3_bucket_notification.id

  queue {
    events        = ["s3:ObjectCreated:*"]
    queue_arn     = aws_sqs_queue.s3-notifications-sqs.arn
    filter_prefix = "files/"
  }
}

Understanding the code above
In the code above, we are creating a resource called aws_s3_bucket_notification, which is responsible for enabling notifications from an S3 Bucket.
In the bucket field, we refer to the S3 bucket defined in the s3.tf file.
The queue block contains some settings such as:
events: The event type of the notification. In this case, ObjectCreated events: notifications will be sent only for created objects, not for deleted ones. It helps to restrict notifications to certain types of events.
queue_arn: Refers to the SQS queue defined in the sqs.tf file.
filter_prefix: This field defines the folder where we want notifications to be triggered. In the code, we set the files/ folder as the trigger location for created files.

Summarizing: for every file created within the files/ folder, a notification will be sent to the SQS queue defined in the queue_arn field.

Running Terraform

Init Terraform

terraform init

Running Plan
The plan makes it possible to verify which resources will be created; in this case it is necessary to pass the value of the bucket variable for its creation in S3.

terraform plan -var "bucket=<type the bucket name>"

Running the Apply command
In this step, the creation of the resources will be applied. Remember to pass the name of the bucket you want to create in the bucket variable, and the bucket name must be unique.

terraform apply -var "bucket=<type the bucket name>"

Simulating an event notification
After running the previous steps and creating the resources, we will manually upload a file to the files/ folder of the bucket that was created. Via the console, access the Bucket created in S3 and create a folder called files. Inside it, upload any file.

Uploading the file
After uploading the file to the files/ folder, access the created SQS queue. You'll see some available messages. Usually 3 messages will be available in the queue: after creating the S3 event notification settings a test message is sent, the second one happens when we create the folder, and the last one is the message related to the file upload. That's it, we have an event notification created!

References:
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/s3_bucket_notification
https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-event-notifications.html
Coffee and tips Github repository: https://github.com/coffeeandtips/s3-bucket-notification-with-sqs-and-terraform

Books to study and read
If you want to learn more and reach an advanced level of knowledge, I strongly recommend reading the following book(s):

Terraform: Up & Running: Writing Infrastructure as Code is a book focused on how to use Terraform and its benefits. The author makes comparisons with several other IaC (Infrastructure as Code) tools, such as Ansible and CloudFormation (the IaC tool native to AWS), and especially shows how to create and provision different resources for multiple cloud services. Currently, Terraform is the most used tool in software projects for creating and managing resources in cloud services such as AWS, Azure, Google Cloud and many others. If you want to be a complete engineer, I strongly recommend learning about it.
AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS, or Amazon Web Services, is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it.

Well that's it, I hope you enjoyed it!
- Understanding AWS S3 in 1 minute
Amazon Web Services (AWS) is one of the most traditional and resourceful cloud computing providers on the market, and one of its most used resources is S3.

What's AWS S3?
AWS S3 is an abbreviation for Simple Storage Service. It is a resource managed by AWS itself, that is, we don't have to worry about managing the infrastructure. It provides an object repository that allows you to store objects of different volumes, keep backups, host websites, perform data analysis, manage Data Lakes and more. It also provides several integrations with other AWS services that use it as a base repository.

S3 Structure
AWS S3 is divided into different Buckets containing a folder structure based on the customer's needs.

What's a Bucket?
An S3 Bucket is a kind of container where all objects will be stored and organized. A very important detail is that the Bucket name must be unique.

Architecture
In the image below we have a simpler view of the S3 architecture, with buckets and folder directories.

Integrations
The AWS ecosystem makes it possible to integrate most of its own and third-party tools, and S3 is one of the most integrated resources. Some examples of integrated services:
Athena
Glue
Kinesis Firehose
Lambda
Delta Lake
RDS
Others

SDK
AWS provides SDKs for different programming languages that make it possible to manipulate objects in S3, such as creating Buckets and folders, uploading and downloading files and much more. A short sketch using the Java SDK is shown at the end of this post.

Books to study and read
If you want to learn more and reach an advanced level of knowledge, I strongly recommend reading the following book(s):

AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS, or Amazon Web Services, is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it.

Well that's it, I hope you enjoyed it!
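Sketch: uploading an object with the Java SDK
Here's the sketch mentioned above: a minimal example using the AWS SDK for Java v2 to upload an object to a bucket. The bucket name, key and region are placeholders, and credentials are assumed to be available through the default credentials provider chain.

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class S3UploadExample {

    public static void main(String[] args) {
        // Build an S3 client for a given region
        S3Client s3 = S3Client.builder()
                .region(Region.US_EAST_1)
                .build();

        // Upload a small text object to a hypothetical bucket
        PutObjectRequest request = PutObjectRequest.builder()
                .bucket("coffee-and-tips-example-bucket") // placeholder bucket name
                .key("files/hello.txt")
                .build();

        s3.putObject(request, RequestBody.fromString("hello from the SDK"));

        s3.close();
    }
}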
- How to read CSV file with Apache Spark
Apache Spark works very well for reading several file formats for data extraction. In this post we'll create an example of reading a CSV file using Spark, Java and Maven.

Maven

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.1.0</version>
</dependency>

CSV Content
Let's suppose the file below is named movies.csv.

title;year;rating
The Shawshank Redemption;1994;9.3
The Godfather;1972;9.2
The Dark Knight;2008;9.0
The Lord of the Rings: The Return of the King ;2003;8.9
Pulp Fiction;1994;8.9
Fight Club;1999;8.8
Star Wars: Episode V - The Empire Strikes Back;1980;8.7
Goodfellas;1990;8.7
Star Wars;1977;8.6

Creating a SparkSession

SparkConf sparkConf = new SparkConf();
sparkConf.setMaster("local[*]");
sparkConf.setAppName("app");

SparkSession sparkSession = SparkSession.builder()
        .config(sparkConf)
        .getOrCreate();

Running the Read

Dataset<Row> ds = sparkSession.read()
        .format("CSV")
        .option("sep", ";")
        .option("inferSchema", "true")
        .option("header", "true")
        .load("movies.csv");

ds.select("title", "year", "rating").show();

Result

Understanding some parameters
.option("sep", ";"): Defines the delimiter used for reading the file; in this case the delimiter is a semicolon (;).
.option("inferSchema", "true"): The inferSchema parameter makes it possible to scan the file(s) in order to infer (guess) the data type of each field.
.option("header", "true"): Enabling the header parameter makes it possible to use the field names defined in the file header.
.load("movies.csv"): movies.csv is the name of the file to be read.

Books to study and read
If you want to learn more and reach an advanced level of knowledge, I strongly recommend reading the following book(s):

Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and its main features. Reading this book you will understand DataFrames and Spark SQL through practical examples. The author dives into Spark's low-level APIs and RDDs, and also covers how Spark runs on a cluster and how to debug and monitor Spark cluster applications. The practical examples are in Scala and Python.

Beginning Apache Spark 3: With Dataframe, Spark SQL, Structured Streaming, and Spark Machine Library covers the new version of Spark and explores its main features, such as Dataframes, Spark SQL, which lets you use SQL to manipulate data, and Structured Streaming for processing data in real time. The book contains practical examples and code snippets to make reading easier.

High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark explores best practices using Spark and the Scala language to handle large-scale data applications, techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over SQL's RDD data structure, examples of using the Spark MLlib and Spark ML machine learning libraries, and more.

Maven: The Definitive Guide Written by Maven creator Jason Van Zyl and his team at Sonatype, Maven: The Definitive Guide clearly explains how this tool can bring order to your software development projects. In this book you'll learn about the POM and project relationships, the build lifecycle, plugins, project website generation, advanced site generation, reporting, properties, build profiles, the Maven repository and more.

Well that's it, I hope you enjoyed it!
- Accessing and Modifying Terraform State
Before starting to talk about access to states, it is necessary to explain what states are.

What's Terraform State?
Terraform State is the way Terraform manages the infrastructure, configurations and resources it has created, in order to keep a mapping of what already exists and to control the update and creation of new resources. A basic example is when we create an S3 Bucket, an EC2 instance or an SQS queue via Terraform. All these resources are mapped in the state and are managed by Terraform.

State locations

Local
By default, Terraform stores the state locally in the terraform.tfstate file. Using the state locally can work well for a specific scenario where there is no need to share the state between teams.

Remote
Unlike local state, when we have teams sharing the same resources, using the state remotely becomes essential. Terraform provides support for sharing the state remotely. We won't go into detail on how to configure it, but it's possible to keep the state in Amazon S3, Azure Blob Storage, Google Cloud Storage, Alibaba Cloud OSS and other cloud services; a minimal sketch of an S3 backend configuration is shown at the end of this post.

The State is represented by the terraform.tfstate file in JSON format. Here is an example of an S3 Bucket mapped in the State:

{
  "version": 4,
  "terraform_version": "0.12.3",
  "serial": 3,
  "lineage": "853d8b-4ee1-c1e4-e61e-e10",
  "outputs": {},
  "resources": [
    {
      "mode": "managed",
      "type": "aws_s3_bucket",
      "name": "s3_bucket_xpto",
      "provider": "provider.aws",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "acceleration_status": "",
            "acl": "private",
            "arn": "arn:aws:s3:::bucket.xpto",
            "bucket": "bucket.xpto",
            "bucket_domain_name": "bucket.xpto",
            "bucket_prefix": null,
            "bucket_regional_domain_name": "bucket.xpto",
            "cors_rule": [],
            "force_destroy": false,
            "grant": [],
            "hosted_zone_id": "Z3NHGSIKTF",
            "id": "bucket.xpto",
            "lifecycle_rule": [],
            "logging": [],
            "object_lock_configuration": [],
            "policy": null,
            "region": "us-east-1",
            "replication_configuration": [],
            "request_payer": "BucketOwner",
            "server_side_encryption_configuration": [],
            "tags": {
              "Environment": "development"
            },
            "versioning": [
              {
                "enabled": false,
                "mfa_delete": false
              }
            ],
            "website": [],
            "website_domain": null,
            "website_endpoint": null
          },
          "private": "Ud4JbhV=="
        }
      ]
    }
  ]
}

Accessing and updating the State
Although the State is stored in a JSON file, it is not recommended to change it directly. Terraform provides the terraform state commands, executed via CLI, so that small modifications can be made. Through the CLI, we can execute commands to manipulate the State, as follows:

terraform state [options] [args]

Sub-commands:
list    List the resources in the state
mv      Move an item in the state
pull    Extract the current state and list the result on stdout
push    Update a remote state from a local state file
rm      Remove an instance from the state
show    Show state resources

1. Listing State resources
Command: terraform state list

The command above makes it possible to list the resources being managed by the State.

Example:
$ terraform state list
aws_s3_bucket.s3_bucket
aws_sqs_queue.sqs-xpto

In the example above, we have as a result an S3 Bucket and an SQS queue that were created via Terraform and are being managed by the State.
2. Viewing a resource and its attributes
Command: terraform state show [options] RESOURCE_ADDRESS

The command above makes it possible to show a specific resource and its attributes in detail.

Example:
$ terraform state show aws_sqs_queue.sqs-xpto
# aws_sqs_queue.sqs-xpto:
resource "aws_sqs_queue" "sqs-xpto" {
    arn                               = "arn:aws:sqs:sqs-xpto"
    content_based_deduplication       = false
    delay_seconds                     = 90
    fifo_queue                        = false
    id                                = "https://sqs-xpto"
    kms_data_key_reuse_period_seconds = 300
    max_message_size                  = 262144
    message_retention_seconds         = 345600
    name                              = "sqs-xpto"
    receive_wait_time_seconds         = 10
    tags                              = {
        "Environment" = "staging"
    }
    visibility_timeout_seconds        = 30
}

3. Removing resources from the State
Command: terraform state rm [options] RESOURCE_ADDRESS

The command above removes one or more items from the State. Unlike the terraform destroy command, it does not remove the remote objects created in the cloud, only their entries in the State.

Example:
$ terraform state rm aws_sqs_queue.sqs-xpto

Books to study and read
If you want to learn more and reach an advanced level of knowledge, I strongly recommend reading the following book(s):

Terraform: Up & Running: Writing Infrastructure as Code is a book focused on how to use Terraform and its benefits. The author makes comparisons with several other IaC (Infrastructure as Code) tools, such as Ansible and CloudFormation (the IaC tool native to AWS), and especially shows how to create and provision different resources for multiple cloud services. Currently, Terraform is the most used tool in software projects for creating and managing resources in cloud services such as AWS, Azure, Google Cloud and many others. If you want to be a complete engineer or work in the DevOps area, I strongly recommend learning about the topic.

AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS, or Amazon Web Services, is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend studying it.

Well that's it, I hope you enjoyed it!
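Sketch: remote state with an S3 backend
Here's the sketch mentioned in the Remote section: a minimal S3 backend configuration. The bucket, key and table names are illustrative, not values from this post.

terraform {
  backend "s3" {
    bucket = "my-terraform-state-bucket"   # illustrative bucket name
    key    = "global/terraform.tfstate"
    region = "us-east-1"

    # optional: a DynamoDB table can be used for state locking
    # dynamodb_table = "terraform-locks"
  }
}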
- Java: Streams API - findFirst()
Java 8 Streams introduced different methods for handling collections. One of these methods is findFirst(), which returns the first element of a Stream wrapped in an Optional instance.

Example

Output:
Item: Monica Souza

Using filter

Output:
Item: Andre Silva

Note that it returned the first name with the last name Silva from the collection.

Not using Streams
If we use the traditional way, without Streams, the code would look like this, filtering by the last name "Silva". In this case we depend on a break statement to end the execution. A reconstructed sketch of the Streams example described above is shown at the end of this post.

Books to study and read
If you want to learn more and reach an advanced level of knowledge, I strongly recommend reading the following book(s):

Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software is a book that, through Java examples, shows you the patterns that matter, when to use them and why, how to apply them to your own designs, and the object-oriented design principles on which they're based.

Head First Java is a complete learning experience in Java and object-oriented programming. With this book, you'll learn the Java language with a unique method that goes beyond how-to manuals and helps you become a great programmer. Through puzzles, mysteries, and soul-searching interviews with famous Java objects, you'll quickly get up to speed on Java's fundamentals and advanced topics including lambdas, streams, generics, threading, networking, and the dreaded desktop GUI.

Well that's it, I hope you enjoyed it!
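Sketch: findFirst() example
Here's the reconstructed sketch mentioned above. The original example code was published as images, so the list of names below is an assumption chosen only to be consistent with the outputs described in the post.

import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class FindFirstExample {

    public static void main(String[] args) {
        // Assumed collection, consistent with the outputs shown in the post
        List<String> names = Arrays.asList("Monica Souza", "Andre Silva", "Bruna Silva");

        // findFirst() returns the first element of the Stream wrapped in an Optional
        Optional<String> first = names.stream().findFirst();
        first.ifPresent(item -> System.out.println("Item: " + item)); // Item: Monica Souza

        // Combined with filter(), it returns the first element matching the predicate
        Optional<String> firstSilva = names.stream()
                .filter(name -> name.endsWith("Silva"))
                .findFirst();
        firstSilva.ifPresent(item -> System.out.println("Item: " + item)); // Item: Andre Silva
    }
}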
- Java Streams API - Filter
Since Java 8, released in 2014, dozens of new features have been added, including improvements in the JVM and functions to make the developer's life easier. Among these features are lambda expressions, which were the starting point for Java's entry into the world of functional programming, improvements in the date API, and no longer needing to create implementations of existing interfaces thanks to default methods. And another piece of news is the Streams API, the focus of this post.

The Streams API is a new approach to working with Collections, making the code cleaner and smarter. It works with on-demand data processing and provides dozens of features to manipulate Collections, reducing code and simplifying development in a kind of pipeline that will be explained later.

To understand it better, let's create a simple code example with a list of objects, and we'll create some conditions in order to extract a new list with the desired values. In this first example we will not use the Streams API. Let's create a class representing the City entity, which will have the following attributes: name, state and population. And finally a method called listCities that loads a list of objects of type City.

Filtering
In the next code, we will iterate through the list and, within the iteration, check which cities have a population greater than 20, then add them to a secondary list. Up to this point, there are technically no problems with that code, but it could be cleaner. Now, applying the Stream API, here's a new example. Notice the difference: we invoke the listCities() method, and because it returns a Collection, we can invoke the stream() method. From the call of the stream() method, the pipeline starts.

More examples of Filter

Example 1:
Output:
City: Hollywood /State: CA /Population: 30
City: Venice /State: CA /Population: 10

Example 2:
Output:
City: Venice /State: CA /Population: 10

A reconstructed sketch of these filter examples is shown at the end of this post.

How does a pipeline work?
Following the previous example, the pipeline is a sequential process that differentiates between intermediate and final operations. In the example, the Stream is created from a data source (the list of objects of type City) that works on demand; the filter method is an intermediate operation, that is, it processes the data until the collect method is invoked, resulting in a final operation.

Books to study and read
If you want to learn more and reach an advanced level of knowledge, I strongly recommend reading the following book(s):

Head First Design Patterns: Building Extensible and Maintainable Object-Oriented Software is a book that, through Java examples, shows you the patterns that matter, when to use them and why, how to apply them to your own designs, and the object-oriented design principles on which they're based.

Head First Java is a complete learning experience in Java and object-oriented programming. With this book, you'll learn the Java language with a unique method that goes beyond how-to manuals and helps you become a great programmer. Through puzzles, mysteries, and soul-searching interviews with famous Java objects, you'll quickly get up to speed on Java's fundamentals and advanced topics including lambdas, streams, generics, threading, networking, and the dreaded desktop GUI.

Well that's it, I hope you enjoyed it!
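Sketch: filter examples
Here's the reconstructed sketch mentioned above. The original code was published as images, so the City data set and the exact predicates below are assumptions chosen to be consistent with the outputs described in the post.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FilterExample {

    // Simple entity described in the post: name, state and population
    static class City {
        final String name;
        final String state;
        final long population;

        City(String name, String state, long population) {
            this.name = name;
            this.state = state;
            this.population = population;
        }

        @Override
        public String toString() {
            return "City: " + name + " /State: " + state + " /Population: " + population;
        }
    }

    static List<City> listCities() {
        return Arrays.asList(
                new City("Hollywood", "CA", 30),
                new City("Venice", "CA", 10),
                new City("Dallas", "TX", 45));
    }

    public static void main(String[] args) {
        // Traditional approach: iterate and collect cities with population greater than 20
        List<City> bigCities = new ArrayList<>();
        for (City city : listCities()) {
            if (city.population > 20) {
                bigCities.add(city);
            }
        }
        System.out.println("Cities with population > 20: " + bigCities);

        // Stream pipeline: filter is an intermediate operation, collect is the final one
        List<City> bigCitiesWithStream = listCities().stream()
                .filter(city -> city.population > 20)
                .collect(Collectors.toList());
        System.out.println("Same result with Streams: " + bigCitiesWithStream);

        // Example 1 (assumed predicate): cities located in CA
        listCities().stream()
                .filter(city -> city.state.equals("CA"))
                .forEach(System.out::println);

        // Example 2 (assumed predicate): CA cities with population below 20
        listCities().stream()
                .filter(city -> city.state.equals("CA"))
                .filter(city -> city.population < 20)
                .forEach(System.out::println);
    }
}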
- How to create Mutation tests with Pitest
Pitest is a tool that provides state-of-the-art mutation testing. It was written to run on Java applications or on applications that run on top of a JVM.

How does it work?
Pitest works by generating mutant code that makes changes to the bytecode of the source code, changing logic, removing and even changing method return values. Thus, it's possible to evaluate the code in a more complete way, looking for failures and making the code more reliable.

Basic example of mutation
See the code below:

if(name == ""){
    return "Empty name";
}else{
    return name;
}

Running Pitest on the code above, it is able to modify the bytecode, changing the code above to:

if(name != ""){
    return name;
}else{
    return "Empty name";
}

Pitest helps you identify possible points of failure and guides you to create tests based on those points.

Hands-on time

Maven

<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.pitest</groupId>
    <artifactId>pitest</artifactId>
    <version>1.6.2</version>
</dependency>

Let's create a simple class representing a bank account called BankAccount and, to keep it simple, we'll implement the logic inside the entity class itself. In this class, we have a method called userValid that checks if the user is valid.

package com.bank.entity;

public class BankAccount {

    public String bankName;
    public String user;
    public Double balance;

    public BankAccount(String bankName, String user, Double balance){
        this.bankName = bankName;
        this.user = user;
        this.balance = balance;
    }

    public Boolean userValid(BankAccount bankAccount){
        if(bankAccount.user != ""){
            return true;
        }else{
            return false;
        }
    }
}

Without creating any test classes, run this command via Maven:

mvn org.pitest:pitest-maven:mutationCoverage

See that a report was generated in the console with the mutations created based on the conditions that must be tested. Looking at the console above, we have 3 mutations, or points of failure, which means we need to eliminate these mutations from our code.

Pitest provides another way to inspect these mutations, through a report generated as an HTML page. You can access these generated reports in the /target/pit-reports/ folder. In the image below we can see a report that summarizes the coverage percentage. Note that we have 0% coverage. The BankAccount.java.html file shows details about the uncovered parts of the code. Note that in the image above there is a list of Active mutators, which are mutations based on the code we created.

Removing these mutations
Based on the reports, let's create some tests to reach 100% coverage. First of all, we need to create a test class. For this example, we're going to create a test class called BankAccountTest.java. Here there's an attention point: you'll have to create this class with the same package name used for BankAccount.java. Thus, create a package called com.bank.entity inside src/test/java and, finally, BankAccountTest.java.

package com.bank.entity;

import org.junit.Assert;
import org.junit.Test;

public class BankAccountTest {

    @Test
    public void userValid_NegateConditionalsMutator_Test(){
        BankAccount bankAccount = new BankAccount("Chase", "Jonas", 2000.00);

        Boolean actual = bankAccount.userValid(bankAccount);
        Boolean expected = true;

        Assert.assertEquals(expected, actual);
    }

    @Test
    public void userValid_BooleanFalseReturnValsMutator_Test(){
        BankAccount bankAccount = new BankAccount("Chase", "", 2000.00);

        Boolean actual = bankAccount.userValid(bankAccount);
        Boolean expected = false;

        Assert.assertEquals(expected, actual);
    }
}

We have created only 2 test methods for our coverage:

1. userValid_NegateConditionalsMutator_Test(): Covers filled-in usernames. The constructor receives the "Jonas" argument representing the username, so that the method's condition is validated.
It finally validates that the method's condition returns TRUE. For this case, we have already eliminated two mutations according to the statistics report:

BooleanTrueReturnValsMutator
NegateConditionalsMutator

2. userValid_BooleanFalseReturnValsMutator_Test(): Validates the case where the method returns the Boolean value FALSE.

Rerun the Maven command below:

mvn org.pitest:pitest-maven:mutationCoverage

Application console
Note that every generated mutation was eliminated, reaching 100% coverage in the tests.

References: https://pitest.org

Books to study and read
If you want to learn more and reach an advanced level of knowledge, I strongly recommend reading the following book(s):

Unit Testing Principles, Practices, and Patterns: Effective Testing Styles, Patterns, and Reliable Automation for Unit Testing, Mocking, and Integration Testing with Examples in C# teaches you to design and write tests that target key areas of your code, including the domain model. In this clearly written guide, you learn to develop professional-quality tests and test suites and integrate testing throughout the application life cycle.

Mastering Unit Testing Using Mockito and JUnit is a book that covers JUnit practices using one of the most famous mocking libraries, Mockito. This book teaches how to create and maintain automated unit tests using advanced features of JUnit with the Mockito framework, continuous integration practices (the famous CI) using market tools like Jenkins, along with one of the largest dependency managers in Java projects, Maven. For those who are starting out in this world, it is an excellent choice.

Well that's it, I hope you enjoyed it!
- First steps with CloudFormation
There are different ways to create resources in AWS: you can create an S3 Bucket, an SQS queue, an RDS instance and many other resources manually. But when dealing with infrastructure and its management, creating resources manually becomes unsustainable. Another way is to use IaC (Infrastructure as Code) tools, which allow you to create, manage and provision resources in the cloud with less effort. On AWS, we can use CloudFormation to help us create the resources we want to use.

How does it work?
We start from a template in JSON or YAML format and then upload this file to CloudFormation on AWS. Very simple. To better understand this process, let's create an S3 Bucket and an SQS queue through CloudFormation, following what was described earlier, using a template. There are two ways to create a template: you can use a JSON or a YAML file. In this example we will use a template in YAML format.

Creating the S3 Bucket template

Resources:
  S3Bucket:
    Type: 'AWS::S3::Bucket'
    DeletionPolicy: Retain
    Properties:
      BucketName: blog.data
      AccessControl: Private
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: "AES256"

For the template above, we used some essential parameters for creating the Bucket; the complete list can be consulted in the AWS documentation. Next, let's briefly understand what each parameter means:

S3Bucket: an identifier given to the resource; always create an identifier that makes sense in its context
Type: the resource type
DeletionPolicy: There are three options:
Delete: If the CloudFormation stack is deleted, all related resources will be deleted. Be very careful and understand the risks before using this option.
Retain: Using this option, you guarantee that when deleting a stack, the related resources will be kept.
Snapshot: Option used for resources that support snapshots, for example:
AWS::EC2::Volume
AWS::ElastiCache::CacheCluster
AWS::ElastiCache::ReplicationGroup
AWS::Neptune::DBCluster
AWS::RDS::DBCluster
AWS::RDS::DBInstance
AWS::Redshift::Cluster

In Properties, we define the characteristics of the Bucket:
BucketName: The bucket name. Remember that the bucket name must be unique and must follow some naming standards according to the documentation.
AccessControl: The access control for the Bucket; there are different access options, as follows:
Private
PublicRead
PublicReadWrite
AuthenticatedRead
LogDeliveryWrite
BucketOwnerRead
BucketOwnerFullControl
AwsExecRead
BucketEncryption: These are the encryption settings for the Bucket's objects; in this case we use the AES256 algorithm.

Uploading and creating the resource
1. In the AWS console, go to CloudFormation
2. Click the Create Stack button
3. Select Template is ready as the prerequisite
4. In the Specify template section, select Upload a template file, select the created file by clicking Choose file, and finally click the Next button. A new page will open for filling in the name of the stack.
5. Click Next and do the same for the next pages.
6. Finally, the resource will be created. This may take a few minutes depending on the resource.

Notice that two buckets were created:
blog.data: Created via CloudFormation
cf-templates-1nwl4b3ve439n-us-east-1: Bucket created automatically when uploading the file at the beginning of the process.

If you prefer the terminal to the console, the same template can also be deployed with the AWS CLI, as shown in the sketch below.
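A minimal sketch of deploying the same template from the terminal; the stack name and file name below are illustrative.

# create the stack from the local template file
aws cloudformation create-stack \
    --stack-name s3-bucket-stack \
    --template-body file://s3_bucket.yaml

# follow the provisioning status
aws cloudformation describe-stacks --stack-name s3-bucket-stack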
Creating the SQS template

Resources:
  SQS:
    Type: 'AWS::SQS::Queue'
    Properties:
      QueueName: sqs-blog.fifo
      ContentBasedDeduplication: true
      DelaySeconds: 120
      FifoQueue: true
      MessageRetentionPeriod: 3600

Understanding the template:
SQS: the resource identifier
Type: the resource type
QueueName: The SQS queue name. An important detail is the .fifo suffix, required if the queue is of the Fifo type.
ContentBasedDeduplication: Ensures non-duplication of messages; works only for Fifo-type queues.
DelaySeconds: Delay time for each message (in seconds).
FifoQueue: How the queue manages the arrival and departure of messages (First-in/First-out).
MessageRetentionPeriod: The period of time messages will be held in the queue (in seconds).

SQS queue created

Conclusion
CloudFormation is an AWS-exclusive tool for resource creation, i.e. if your architecture is built or maintained on the AWS cloud, CloudFormation is a great choice. If you need to maintain flexibility across clouds, such as the ability to use Google Cloud, Terraform may be a better option as an IaC tool.

Well that's it, I hope you enjoyed it!