Skip to content

Row Filter Syntax

In addition to the primary Expression DSL, PyIceberg provides a string-based statement interface for filtering rows in Iceberg tables. This guide explains the syntax and provides examples for supported operations.

The row filter syntax is designed to be similar to SQL WHERE clauses. Here are the basic components:

Column References

Columns can be referenced using either unquoted or quoted identifiers:

column_name
"column.name"

Literals

The following literal types are supported:

  • Strings: 'hello world'
  • Numbers: 42, -42, 3.14
  • Booleans: true, false (case insensitive)

Comparison Operations

Basic Comparisons

column = 42
column != 42
column > 42
column >= 42
column < 42
column <= 42

Note

The == operator is an alias for = and <> is an alias for !=

String Comparisons

column = 'hello'
column != 'world'

NULL Checks

Check for NULL values using the IS NULL and IS NOT NULL operators:

column IS NULL
column IS NOT NULL

NaN Checks

For floating-point columns, you can check for NaN values:

column IS NAN
column IS NOT NAN

IN and NOT IN

Check if a value is in a set of values:

column IN ('a', 'b', 'c')
column NOT IN (1, 2, 3)

LIKE Operations

The LIKE operator supports pattern matching with a wildcard % at the end of the string:

column LIKE 'prefix%'
column NOT LIKE 'prefix%'

Important

The % wildcard is only supported at the end of the pattern. Using it in the middle or beginning of the pattern will raise an error.

BETWEEN

The BETWEEN operator filters a numeric value against an inclusive range, e.g. a between 1 and 2 is equivalent to a >= 1 and a <= 2.

column BETWEEN 1 AND 2
column BETWEEN 1.0 AND 2.0

Logical Operations

Combine multiple conditions using logical operators:

column1 = 42 AND column2 = 'hello'
column1 > 0 OR column2 IS NULL
NOT (column1 = 42)

Tip

Parentheses can be used to group logical operations for clarity:

(column1 = 42 AND column2 = 'hello') OR column3 IS NULL

Complete Examples

Here are some complete examples showing how to combine different operations:

-- Complex filter with multiple conditions
status = 'active' AND age > 18 AND NOT (country IN ('US', 'CA'))

-- Filter with string pattern matching
name LIKE 'John%' AND age >= 21

-- Filter with NULL checks and numeric comparisons
price IS NOT NULL AND price > 100 AND quantity > 0

-- Filter with multiple logical operations
(status = 'pending' OR status = 'processing') AND NOT (priority = 'low')

Common Pitfalls

  1. String Quoting: Always use single quotes for string literals. Double quotes are reserved for column identifiers.
-- Correct
name = 'John'

-- Incorrect
name = "John"
  1. Wildcard Usage: The % wildcard in LIKE patterns can only appear at the end.
-- Correct
name LIKE 'John%'

-- Incorrect (will raise an error)
name LIKE '%John%'
  1. Case Sensitivity: Boolean literals (true/false) are case insensitive, but string comparisons are case sensitive.
-- All valid
is_active = true
is_active = TRUE
is_active = True

-- Case sensitive
status = 'Active'  -- Will not match 'active'

Best Practices

  1. For complex use cases, use the primary Expression DSL
  2. When using multiple conditions, consider the order of operations (NOT > AND > OR)
  3. For string comparisons, be consistent with case usage