5 de fev. de 20233 min para ler

Differences between FAILFAST, PERMISSIVE and DROPMALFORED modes in Dataframes

There's a bit differences between them and we're going to find out in this post. The parameter mode is a way to handle with corrupted records and depending of the mode, allows validating Dataframes and keeping data consistent.

In this post we'll create a Dataframe with PySpark and comparing the differences between these three types of mode:

PERMISSIVE
DROPMALFORMED
FAILFAST

CSV file content

This content below simulates some corrupted records. There are String types for the engines column that we'll define as an Integer type in the schema.

"type","country","city","engines","first_flight","number_built"
"Airbus A220","Canada","Calgary",2,2013-03-02,179
"Airbus A220","Canada","Calgary","two",2013-03-02,179
"Airbus A220","Canada","Calgary",2,2013-03-02,179
"Airbus A320","France","Lyon","two",1986-06-10,10066
"Airbus A330","France","Lyon","two",1992-01-02,1521
"Boeing 737","USA","New York","two",1967-08-03,10636
"Boeing 737","USA","New York","two",1967-08-03,10636
"Boeing 737","USA","New York",2,1967-08-03,10636
"Airbus A220","Canada","Calgary",2,2013-03-02,179

Let's start creating a simple Dataframe that will load data from a CSV file with the content above, let's supposed that the content above it's from a file called airplanes.csv. To modeling the content, we're also creating a schema that will allows us to Data validate.

Creating a Dataframe using PERMISSIVE mode

The PERMISSIVE mode sets to null field values when corrupted records are detected. By default, if you don't specify the parameter mode, Spark sets the PERMISSIVE value.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == "__main__":

    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("spark-app") \
        .getOrCreate()

    schema = StructType([
        StructField("TYPE", StringType()),
        StructField("COUNTRY", StringType()),
        StructField("CITY", StringType()),
        StructField("ENGINES", IntegerType()),
        StructField("FIRST_FLIGHT", StringType()),
        StructField("NUMBER_BUILT", IntegerType())
    ])

    read_df = spark.read \
        .option("header", "true") \
        .option("mode", "PERMISSIVE") \
        .format("csv") \
        .schema(schema) \
        .load("airplanes.csv")

    read_df.show(10)

Result of PERMISSIVE mode

Creating a Dataframe using DROPMALFORMED mode

The DROPMALFORMED mode ignores corrupted records. The meaning that, if you choose this type of mode, the corrupted records won't be list.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == "__main__":

    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("spark-app") \
        .getOrCreate()

    schema = StructType([
        StructField("TYPE", StringType()),
        StructField("COUNTRY", StringType()),
        StructField("CITY", StringType()),
        StructField("ENGINES", IntegerType()),
        StructField("FIRST_FLIGHT", StringType()),
        StructField("NUMBER_BUILT", IntegerType())
    ])

    read_df = spark.read \
        .option("header", "true") \
        .option("mode", "DROPMALFORMED") \
        .format("csv") \
        .schema(schema) \
        .load("airplanes.csv")

    read_df.show(10)

Result of DROPMALFORMED mode

After execution it's possible to realize that the corrupted records aren't available at Dataframe.

Creating a Dataframe using FAILFAST mode

Different of DROPMALFORMED and PERMISSIVE mode, FAILFAST throws an exception when detects corrupted records.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

if __name__ == "__main__":

    spark = SparkSession.builder \
        .master("local[1]") \
        .appName("spark-app") \
        .getOrCreate()

    schema = StructType([
        StructField("TYPE", StringType()),
        StructField("COUNTRY", StringType()),
        StructField("CITY", StringType()),
        StructField("ENGINES", IntegerType()),
        StructField("FIRST_FLIGHT", StringType()),
        StructField("NUMBER_BUILT", IntegerType())
    ])

    read_df = spark.read \
        .option("header", "true") \
        .option("mode", "FAILFAST") \
        .format("csv") \
        .schema(schema) \
        .load("airplanes.csv")

    read_df.show(10)

Result of FAILFAST mode

ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.

Books to study and read

If you want to learn more about and reach a high level of knowledge, I strongly recommend reading the following book(s):

Spark: The Definitive Guide: Big Data Processing Made Simple is a complete reference for those who want to learn Spark and about the main Spark's feature. Reading this book you will understand about DataFrames, Spark SQL through practical examples. The author dives into Spark low-level APIs, RDDs and also about how Spark runs on a cluster and how to debug and monitor Spark clusters applications. The practical examples are in Scala and Python.

Beginning Apache Spark 3: With Dataframe, Spark SQL, Structured Streaming, and Spark Machine Library with the new version of Spark, this book explores the main Spark's features like Dataframes usage, Spark SQL that you can uses SQL to manipulate data and Structured Streaming to process data in real time. This book contains practical examples and code snippets to facilitate the reading.

High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark is a book that explores best practices using Spark and Scala language to handle large-scale data applications, techniques for getting the most out of standard RDD transformations, how Spark SQL's new interfaces improve performance over SQL's RDD data structure, examples of Spark MLlib and Spark ML machine learning libraries usage and more.

Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming covers the basic concepts of Python through interactive examples and best practices.

Learning Scala: Practical Functional Programming for the Jvm is an excellent book that covers Scala through examples and exercises. Reading this bool you will learn about the core data types, literals, values and variables. Building classes that compose one or more traits for full reusability, create new functionality by mixing them in at instantiation and more. Scala is one the main languages in Big Data projects around the world with a huge usage in big tech companies like Twitter and also the Spark's core language.

Cool? I hope you enjoyed it!

Coffee and Tips Newsletter

Differences between FAILFAST, PERMISSIVE and DROPMALFORED modes in Dataframes