Apache Iceberg

Apache Iceberg is an open source table format designed for huge analytic tables.

A table format is a collection of metadata describing how data is organized in a table, including its column structure and data types. Iceberg is not a standalone service that needs to be monitored: it is a library loaded by the application that interacts with the data, allowing it to manage tables in this format.

Iceberg offers flexibility and scalability, making it a preferred choice for companies looking to enhance their data lake. It facilitates the transition to a data lakehouse, a combination of a data lake and a data warehouse. This architecture enables efficient data management and analysis.

Table formats are becoming the new standard in the big data industry. Iceberg was created in 2017 by Netflix and donated to the Apache Software Foundation in 2018, becoming open source. It finished its incubating phase in May 2020, when the Iceberg project graduated to become a top-level Apache project. Today many other companies use it, such as Apple, Airbnb, and LinkedIn. It offers high performance as well as schema evolution and consistent concurrent writes, so multiple writers can safely work with the same tables at the same time. It works with multiple engines such as Spark, Hive, Flink, and Trino.

Benefits of using Iceberg

  1. ACID compliance: Brings real consistency to the data lake and isolates readers from writers.
  2. Rollback and time travel: Iceberg captures changes as transactions, recorded in the form of snapshots. A snapshot is a representation of the state of a table at a given point in time. It is possible to query the dataset at different snapshots, and to roll back and correct any discovered problems quickly by resetting tables to a known good state (see the sketch after this list).
  3. Fast scan planning: Facilitated by its metadata and filtering mechanisms, Iceberg's scan planning enables fast query execution without relying on a distributed SQL engine. This results in lower-latency queries and direct access to data from any client.
  4. Hidden partitioning: Iceberg allows you to specify the column of the table you want to partition on without explicitly creating a new column. Instead, Iceberg maintains the relationship between the partition and its source column, which eliminates the need for an additional column just to store the partition value.
  5. Schema evolution: Iceberg supports in-place table evolution: there is no need to rewrite table data or migrate to a new table. You can evolve a table schema just like in SQL (add, drop, rename…).
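To make these benefits concrete, here is a minimal PySpark sketch. The catalog name demo, the warehouse path, the db.events table, and the runtime version are illustrative assumptions, not anything mandated by Iceberg; adapt them to your own setup.

```python
from pyspark.sql import SparkSession

# Hypothetical setup: a local Hadoop catalog named "demo".
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")

# Hidden partitioning: partition on days(ts) without materializing a day column.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id BIGINT,
        ts TIMESTAMP,
        payload STRING
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

# Schema evolution: a metadata-only change, no data files are rewritten.
spark.sql("ALTER TABLE demo.db.events ADD COLUMN source STRING")

# Time travel: every commit creates a snapshot; this list is empty
# until the first write.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
# spark.sql("SELECT * FROM demo.db.events VERSION AS OF <snapshot_id>")

# Rollback: reset the table to a known good snapshot.
# spark.sql("CALL demo.system.rollback_to_snapshot('db.events', <snapshot_id>)")
```

Note that queries filtering on ts automatically prune partitions here, even though no day column ever appears in the schema.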

How Iceberg works

The architecture

The Iceberg table’s architecture consists of three layers:

  1. The Iceberg catalog: The catalog is the main entry point where services go to find the location of the current metadata pointer, which identifies where to read or write data for a given table. This is where references and pointers exist for each table, identifying that table's current metadata file.
  2. The metadata layer: This layer consists of three components: the metadata file, the manifest list, and the manifest file (each can be inspected with the sketch after this list).
    1. The metadata file works at the table level. It includes information about a table's schema, partition information, snapshots, and the current snapshot. Each metadata file can have one or more corresponding manifest lists.
    2. The manifest list works at the snapshot level. It defines a snapshot of a table by listing its manifest files, along with metadata about each one.
    3. Manifest files work at the data level. They track data files and record details about each one, including statistics.
  3. The data layer: Each manifest file tracks a subset of data files. Data files store the raw data, and details are kept about each file, such as partition membership, record count, and the lower and upper bounds of its columns.
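Each of these layers can be inspected directly through Iceberg's metadata tables. A short sketch, reusing the SparkSession and the hypothetical demo.db.events table from the earlier example:

```python
# Reuses the SparkSession "spark" and the hypothetical demo.db.events
# table from the earlier sketch.

# Snapshot level: one row per snapshot, i.e. per manifest list.
spark.sql("SELECT snapshot_id, operation, committed_at "
          "FROM demo.db.events.snapshots").show()

# Manifest level: the manifest files referenced by the current manifest list.
spark.sql("SELECT path, added_data_files_count "
          "FROM demo.db.events.manifests").show()

# Data level: the data files tracked by the manifests, with per-file stats.
spark.sql("SELECT file_path, record_count, file_size_in_bytes "
          "FROM demo.db.events.files").show()
```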

Iceberg structure (from: https://iceberg.apache.org/spec/)

Step by step

When writing data with Iceberg, the process runs through these layers in reverse order, from the data layer up to the catalog (see the sketch after this list):

  • Write the data files first
  • Iceberg records these data files in a manifest file
  • The manifest files are linked back to a manifest list, which represents the first snapshot
  • This first snapshot, or manifest list, is linked to the metadata file
  • Finally, the metadata file is linked to our catalog
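Concretely, a single INSERT performs all of these steps in one atomic commit. A sketch, again assuming the hypothetical demo.db.events table from the earlier setup:

```python
# Reuses the SparkSession "spark" and demo.db.events from the earlier sketch.

# One atomic commit: data files are written first, a manifest file tracks
# them, a manifest list (the first snapshot) references the manifest, a new
# metadata file records the snapshot, and the catalog pointer is swapped last.
spark.sql("""
    INSERT INTO demo.db.events
    VALUES (1, TIMESTAMP '2024-01-01 10:00:00', 'signup', 'web')
""")

# A single "append" snapshot is now visible.
spark.sql("SELECT snapshot_id, operation FROM demo.db.events.snapshots").show()
```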

When we update several records and add new ones, here is the process (see the sketch after the list):

  • Rewrite the affected data files (data files are immutable) and/or add new data files
  • A new manifest file is created and linked to the new data files
  • A new manifest list is created. It points to this new manifest file, as well as to the existing manifest files that are still relevant
  • The metadata file is updated to contain both snapshot 0 (pointing to the first manifest list) and snapshot 1 (pointing to the new manifest list)
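An UPDATE statement illustrates this: under Iceberg's default copy-on-write mode it rewrites the affected data files and commits a second snapshot, while snapshot 0 remains available for time travel. A sketch with the same hypothetical table:

```python
# Reuses the SparkSession "spark" and demo.db.events from the earlier sketch.

# Copy-on-write: the affected data files are rewritten and snapshot 1 is
# committed; snapshot 0 is kept for time travel.
spark.sql("UPDATE demo.db.events SET payload = 'login' WHERE id = 1")

# Both snapshots are listed; parent_id links snapshot 1 back to snapshot 0.
spark.sql("SELECT snapshot_id, parent_id, operation "
          "FROM demo.db.events.snapshots").show()
```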