In this post, we're going to walk through examples of how to convert a Parquet table into a Delta table. First, we'll create a Parquet table from scratch through a Spark DataFrame, and then we'll convert it to a Delta table.
Using a Delta table has some benefits compared to a plain Parquet table: Delta lets you restore previous versions of your table through the time travel feature, provides ACID transaction support, and more.
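As a quick taste of what time travel looks like, here's a minimal sketch in Java, assuming session is a SparkSession already configured with the Delta Lake dependency and that a Delta table exists at table/product (we'll create one later in this post):
// Time travel: read the table as it looked at version 0
Dataset<Row> firstVersion = session.read()
        .format("delta")
        .option("versionAsOf", 0)
        .load("table/product");
firstVersion.show();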
Creating a Parquet table
First of all, let's create a Parquet table that will be converted to a Delta table later. I prefer creating the Parquet table from scratch to give a better understanding of the process.
The following code only needs to be executed once, just to create the Parquet table. We're going to use a Spark DataFrame loaded from a JSON file containing semi-structured records.
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public static void main(String[] args){
    SparkConf conf = new SparkConf();
    conf.setAppName("spark-delta-table");
    conf.setMaster("local[1]");
    SparkSession session = SparkSession.builder()
            .config(conf)
            .getOrCreate();
    // Load the semi-structured JSON records into a DataFrame
    Dataset<Row> dataFrame = session.read().json("product.json");
    // Write the DataFrame out as a Parquet table
    dataFrame.write().format("parquet").save("table/product");
}
In the example above, we start by creating a SparkSession object, which we use to load the content of the product.json file into a Spark DataFrame.
After the load, the DataFrame is written as a table in Parquet format to the table/product directory.
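Note that save() fails if the table/product directory already exists, because Spark's default save mode is ErrorIfExists. If you want the example to be re-runnable, a save mode can be passed explicitly; a minimal sketch:
import org.apache.spark.sql.SaveMode;

// Overwrite any existing data at table/product instead of failing on re-runs
dataFrame.write()
        .format("parquet")
        .mode(SaveMode.Overwrite)
        .save("table/product");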
JSON content
Here is the content of the product.json file, which contains the semi-structured records:
{"id":1, "name":"rice", "price":12.0, "qty": 2}
{"id":2, "name":"beans", "price":7.50, "qty": 5}
{"id":3, "name":"coke", "price":5.50, "qty": 2}
{"id":4, "name":"juice", "price":3.80, "qty": 1}
{"id":5, "name":"meat", "price":1.50, "qty": 1}
{"id":6, "name":"ice-cream", "price":6.0, "qty": 2}
{"id":7, "name":"potato", "price":3.70, "qty": 10}
{"id":8, "name":"apple", "price":5.60, "qty": 5}
After running the code above, Parquet files will be generated in the table/product directory.
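To check that the write worked, you can read the Parquet table back and print its rows; a quick sketch, reusing the session and imports from the example above:
// Read the Parquet table back and print its rows to verify the write
Dataset<Row> products = session.read().parquet("table/product");
products.show();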
Converting a Parquet table to a Delta table
Now that we have a Parquet table created, we can easily convert it to a Delta table. Let's do this.
import io.delta.tables.DeltaTable;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public static void main(String[] args){
    SparkConf conf = new SparkConf();
    conf.setAppName("spark-delta-table");
    conf.setMaster("local[1]");
    SparkSession session = SparkSession.builder()
            .config(conf)
            // Recent Delta Lake releases expect these two settings on the session
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
            .getOrCreate();
    // Convert the existing Parquet table in place to a Delta table
    DeltaTable.convertToDelta(session, "parquet.`table/product`");
}
The DeltaTable.convertToDelta method is responsible for converting the Parquet table into a Delta table. Note that we have to pass the SparkSession as a parameter, and we also need to specify the path of the Parquet table using the format "parquet.`<path>`".
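To confirm the conversion, you can load the result through the Delta Lake API and inspect its commit history; a minimal sketch, reusing the session and DeltaTable import from the example above:
// Load the converted table via the Delta API and show its commit history
DeltaTable deltaTable = DeltaTable.forPath(session, "table/product");
deltaTable.history().show();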
After the conversion runs, Delta creates the famous _delta_log directory inside table/product, containing commit info and checkpoint files.
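From this point on, the table can be read with the delta format instead of parquet; a short sketch, assuming the session and imports from the earlier examples:
// The converted table can now be read as a Delta table
Dataset<Row> deltaProducts = session.read().format("delta").load("table/product");
deltaProducts.show();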
Books to study and read
If you want to learn more and reach a deeper level of knowledge, I strongly recommend reading the following book(s):
AWS Cookbook is a practical guide containing 70 recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS, or Amazon Web Services, is the most widely used cloud service in the world today; if you want to understand more about the subject and be well positioned in the market, I strongly recommend this book.
Well that's it, I hope you enjoyed it.