Partition, Optimize and ZORDER Delta Tables in Azure Databricks

In our previous blog on getting started with Azure Databricks, we looked at Databricks tables.  In this blog, we will look at a type of Databricks table called Delta table and best practices around storing data in Delta tables.

1. Delta Lake

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Databricks Delta table is a table that has a Delta Lake as the data source similar to how we had a CSV file as a data source for the table in the previous blog.

2. Table which is not partitioned

When we create a delta table and insert records into it, Databricks loads the data into multiple small files.  You can see the multiple files created for the table “business.inventory” below

3. Partitioned table

Partitioning involves putting different rows into different tables.  E.g., if we have an address table with addresses in the US, the addresses might be stored in 50 different tables corresponding to the 50 states in the US.  A view with a union might be created over all of them to provide a complete view of all addresses.

Sample code to create a table partitioned by date column is given below:

CREATE TABLE events (
  date DATE,
  eventId STRING,
  eventType STRING,
  data STRING)
USING delta
PARTITIONED BY (date) 

The table “business.sales” given below is partitioned by InvoiceDate.  You can see that there is a folder created for each InvoiceDate and within the folders, there are multiple files that store the data for this table.

This partitioning will be useful when we have queries selecting records from this table with InvoiveDate in WHERE clause. 

E.g.:
SELECT SLSDA_ID, RcdType, DistId
FROM business.sales
WHERE InvoiceDate = ‘2013-01-01’

In total there are 40,545 files for this table which you can see from below screenshot

4. OPTIMIZE

SMALL FILE PROBLEM

Historical and new data is often written in very small files and directories.  This data may be spread across a data center or even across the world (that is, not co-located).  The result is that a query on this data may be very slow due to

  • network latency
  • volume of file metadata

The solution is to compact many small files into one larger file.

OPTIMIZE command invokes the bin-packing (Compaction) algorithm to coalesce small files into larger ones.  Small files are compacted together into new larger files up to 1GB.

You can see below that the OPTIMIZE command has removed the 40,545 files and instead of them added 2378 files.  Also, observe that after Optimization the size of the table has decreased from 1.49 GB to 1.08 GB

5. Optimize table which is not partitioned

Optimize will compact the small files for tables that are not partitioned too.

business.finance_transactions_silver table is not partitioned and is currently having 64 files with total size 858 MB

Running the Optimize command coalesces the 64 files to 1 file

Note that “partitionsOptimized” is 1 in this case.  Previously for the partitioned table “partitionsOptimized was 2509.  OPTIMIZE command coalesces the small files within a partition only.  If the table is not partitioned, the whole table is considered as a single xpartition.

6. ZORDER
  • Data Skipping is a performance optimization that aims at speeding up queries that contain filters (WHERE clauses).
    As new data is inserted into a Databricks Delta table, file-level min/max statistics are collected for all columns (including nested ones) of supported types. Then, when there’s a lookup query against the table, Databricks Delta first consults these statistics to determine which files can safely be skipped.  This is done automatically and no specific commands are required to be run for this.
  • Z-Ordering is a technique to co-locate related information in the same set of files.
    Z-Ordering maps multidimensional data to one dimension while preserving the locality of the data points.

Given a column that you want to perform ZORDER on, say OrderColumn, Delta

  • Takes existing parquet files within a partition.
  • Maps the rows within the parquet files according to OrderColumn using the Z-order curve algorithm.
  • In the case of only one column, the mapping above becomes a linear sort
  • Rewrites the sorted data into new parquet files.

Note: We cannot use the table partition column also as a ZORDER column.

Syntax for ZORDER is

OPTIMIZE tablename
ZORDER BY (OrderColumn) 
7. Best practices

a. PARTITION BY

  • Partition the table by a column which is used in the WHERE clause or ON clause (join).  The most commonly used partition column is the date.
  • Use columns with low cardinality.  If the cardinality of a column will be very high, do not use that column for partitioning. For example, if you partition by a column userId and if there can be 1M distinct user IDs, then that is a bad partitioning strategy.
  • Amount of data in each partition: You can partition by a column if you expect data in that partition to be at least 1 GB.  Partitioning is not required for smaller tables.
  • PARTITION BY is done on a single column only

b. OPTIMIZE

  • OPTIMIZE is required for all tables to which we write data continuously on a daily basis.
  • OPTIMIZE is not required for tables that have static data/reference data which are rarely updated.
  • There is a cost associated with OPTIMIZE (Running Optimize command for sales took 6.64 minutes).  We should run it more often (daily) if we want better end-user query performance.  We should run it less often if we want to optimize costs.

c. ZORDER BY

  • If we expect a column to be commonly used in query predicates and if that column has high cardinality (that is, a large number of distinct values), then use ZORDER BY.
  • We can specify multiple columns for ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each additional column.
8. References:

Share this:

CloudIQ is a leading Cloud Consulting and Solutions firm that helps businesses solve today’s problems and plan the enterprise of tomorrow by integrating intelligent cloud solutions. We help you leverage the technologies that make your people more productive, your infrastructure more intelligent, and your business more profitable. 

US

626 120th Ave NE, B102, Bellevue,

WA, 98005.

 sales@cloudiqtech.com

INDIA

Chennai One IT SEZ,

Module No:5-C, Phase ll, 2nd Floor, North Block, Pallavaram-Thoraipakkam 200 ft road, Thoraipakkam, Chennai – 600097


© 2020 CloudIQ Technologies. All rights reserved.

Get in touch

Please contact us using the form below

USA

626 120th Ave NE, B102, Bellevue, WA, 98005.

+1 (206) 203-4151

sales@cloudiqtech.com

INDIA

Chennai One IT SEZ,

Module No:5-C, Phase ll, 2nd Floor, North Block, Pallavaram-Thoraipakkam 200 ft road, Thoraipakkam, Chennai – 600097

+91-044-48651163