YamlJdbc Driver

Introduction

The YamlJdbc driver allows reading YAML files via JDBC. Using SQL queries, data from a YAML file can be extracted and processed just like from a database table.

Sequences (lists) in a YAML document are addressed as tables for querying. Three components are used in the query:

  • a dummy connection 'jdbc:yaml:' is used as the database
  • the schema is a YAML file (in the FROM clause)
  • the table is a path expression with dot notation (in the FROM clause) that leads to YAML sequences treated as rows
  • the columns are keys within the mapping elements (dictionary/map) in the sequence

The following YAML document can be read, for example, using the driver:

people:
  - id: 10
    profession: scientist
    fname: Henry
    lname: Cavendish
  - id: 21
    profession: scientist
    fname: Isaac
    lname: Newton
  - id: 33
    profession: scientist
    fname: Doctor
    lname: Seltsam

An SQL query over the YamlJdbc driver would look like this:

SELECT fname
    , lname
    , profession
FROM "yamljdbc/src/test/resources/de/softquadrat/jdbc/yaml/people.yaml"."people"
COLUMNS(fname, lname, profession)

Here, the filename "yamljdbc/src/test/resources/de/softquadrat/jdbc/yaml/people.yaml" is specified as the schema and "people" is a path expression that leads to the "people" sequence in the document.

Within the "people" sequence, the keys "fname", "lname" and "profession" of the mapping elements are addressed.

The result looks like this:

fname  lname     profession
------ --------- ----------
Henry  Cavendish scientist
Isaac  Newton    scientist
Doctor Seltsam   scientist

Aliases

As usual, column aliases can be used to rename the columns:

SELECT fname AS firstname
    , lname AS lastname
    , profession AS profession
FROM "yamljdbc/src/test/resources/de/softquadrat/jdbc/yaml/people.yaml"."people"
COLUMNS(fname, lname, profession)

This produces:

firstname lastname profession
------- --------- ---------
Henry   Cavendish scientist
Isaac   Newton    scientist
Doctor   Seltsam   scientist

Pseudo-Columns

In addition to the data columns from the YAML document, the YamlJdbc driver generates additional pseudo-columns:

  • rownumber
  • $id
  • $parentid
SELECT rownumber AS rownum
    , "$id" AS row_id
    , fname
    , lname
    , profession
FROM "yamljdbc/src/test/resources/de/softquadrat/jdbc/yaml/people.yaml"."people"
COLUMNS(fname, lname, profession)

The pseudo-column "rownumber" returns a sequential number for each record in the result. The pseudo-column "$id" returns a sequential number for the element in the document (starting at 0). The result of the example query looks like this:

rownum row_id fname  lname     profession
------ --------- ------ --------- ----------
1      0         Henry  Cavendish scientist
2      1         Isaac  Newton    scientist
3      2         Doctor Seltsam   scientist

Note: There is also the pseudo-column "$parentid". It returns the path of the parent element; more on this in the next section.

Nested Structures and Hierarchies

An element in a YAML document can contain further nested structures, as in the following example:

people:
  - id: 10
    profession: scientist
    fname: Henry
    lname: Cavendish
    research:
      - area: Chemistry
      - area: Density of the Earth
      - area: Electricity
  - id: 21
    profession: scientist
    fname: Isaac
    lname: Newton
    research:
      - area: Calculus
      - area: Optics
      - area: Gravity

Nested Structures - Two Options

Nested structures can be read in two ways:

Option 1. Dot Notation for Direct Navigation

For simple nested structures, dot notation can be used:

company:
  name: Acme Corp
  employees:
    - name: John Doe
      department: IT
      salary: 75000
    - name: Jane Smith
      department: HR
      salary: 65000
SELECT name, department, salary
FROM "employees.yaml"."company.employees"
COLUMNS(name, department, salary)

The dot notation "company.employees" navigates through the nested structure directly to the "employees" sequence.

This produces:

name       department salary
---------- ---------- ------
John Doe   IT         75000
Jane Smith HR         65000

This is the simplest approach, but the reference to the parent elements is missing.

Option 2. Comma-Separated Hierarchy Navigation

The YamlJdbc driver also supports comma-separated navigation, similar to JsonJdbc and XmlJdbc. This is particularly useful for nested lists:

SELECT area
FROM "yamljdbc/src/test/resources/de/softquadrat/jdbc/yaml/researcher.yaml"."people,research"
COLUMNS(area)

Here "people" navigates to the first hierarchy level (list of persons) and "research" (after the comma) to the second level relative to each found "people" element.

This returns all "research" entries of all persons:

area
--------------------
Chemistry
Density of the Earth
Electricity
Calculus
Optics
Gravity

With the pseudo-column "$parentid" you can see the relationship:

SELECT "$parentid", area
FROM "yamljdbc/src/test/resources/de/softquadrat/jdbc/yaml/researcher.yaml"."people,research"
COLUMNS(area)

This produces:

$parentid area
--------- --------------------
0         Chemistry
0         Density of the Earth
0         Electricity
1         Calculus
1         Optics
1         Gravity

The "$parentid" shows which "people" element the "research" entries belong to (0 = first element, 1 = second element, etc.).

Correlation with Parent Elements

The pseudo-column "$parentid" enables the correlation between nested elements and their parent elements.

Best practice for nested hierarchies:

  1. First query - Read the parent elements and store the "$id" pseudo-column as primary key:
    SELECT "$id", fname, lname
    FROM "researcher.yaml"."people"
    COLUMNS(fname, lname)
    

This produces:

$id fname  lname
--- ------ ---------
0   Henry  Cavendish
1   Isaac  Newton
  1. Second query - Read the child elements with comma-separated navigation:
    SELECT "$parentid", area
    FROM "researcher.yaml"."people,research"
    COLUMNS(area)
    

This produces:

$parentid area
--------- --------------------
0         Chemistry
0         Density of the Earth
0         Electricity
1         Calculus
1         Optics
1         Gravity
  1. Joining - The "$parentid" from the second query corresponds to the "$id" from the first query, so you can correctly join the records in your database.

Difference Between Dot Notation and Comma Separation

Dot notation (company.employees):

  • Navigates through the structure to a specific nested sequence
  • Ideal for direct paths to the desired data
  • No hierarchy information available

Comma separation (people,research):

  • Navigates through multiple hierarchy levels
  • Provides hierarchy information via "$parentid"
  • Each comma starts a new level relative to the previous one

Example of the difference:

data:
  level1:
    - item: A
      level2:
        - value: 1
        - value: 2
-- Dot notation: Direct to level2 (if you don't need the hierarchy)
FROM "file.yaml"."data.level1.level2"

-- Comma separation: Preserves hierarchy information
FROM "file.yaml"."data.level1,level2"

System Tables

The YamlJdbc driver supports directories as system tables. This allows searching for files in a directory in a loop query of a datasqill module, which can then be loaded.

The following query uses the "files()" function as a system table and its columns "directory" and "filename".

SELECT directory, filename
    FROM "yamljdbc/src/test/resources/de/softquadrat/jdbc/yaml".files()
    WHERE filename LIKE '%.yaml'
    ORDER BY filename

As a result, the searched files in the used directory are returned:

directory                                           filename
--------------------------------------------------- --------------
yamljdbc/src/test/resources/de/softquadrat/jdbc/yaml books.yaml
yamljdbc/src/test/resources/de/softquadrat/jdbc/yaml employees.yaml
yamljdbc/src/test/resources/de/softquadrat/jdbc/yaml people.yaml
yamljdbc/src/test/resources/de/softquadrat/jdbc/yaml researcher.yaml

Notes

Data Types

The YamlJdbc driver returns all results as text. Conversion to other data types must be performed in downstream processing steps.

YAML supports various data types (strings, numbers, booleans, null), but the driver converts all values to strings for the JDBC output.

Empty Result

If a query returns no records, it should be verified whether the path expression in the FROM clause actually navigates to a YAML sequence. If this is not the case, the YamlJdbc driver ignores the query and returns no results.

Null Values

The YamlJdbc driver returns the content of selected columns in the SELECT clause. If the addressed key does not exist, a null value is returned.

items:
  - id: 1
    name: Test1
  - id: 2
    # name missing
SELECT id
    , name
    , description
FROM "test.yaml"."items"
COLUMNS(id, name, description)
id name  description
-- ----- -----------
1  Test1 null
2  null  null

YAML Anchors and Aliases

The YamlJdbc driver currently does not support YAML anchors (&) and aliases (*). These are resolved by the YAML parser (SnakeYAML) before the data is processed.

Complex Structures

When a value itself is a complex structure (nested mapping or sequence), it is returned as a string representation:

items:
  - id: 1
    name: Test
    config:
      setting1: value1
      setting2: value2
SELECT id, name, config
FROM "test.yaml"."items"
COLUMNS(id, name, config)

The "config" field is returned as a string representation of the nested object (e.g. {setting1=value1, setting2=value2}).

YAML Formats

The YamlJdbc driver supports various YAML formats:

Sequence at root level:

- id: 1
  name: Item 1
- id: 2
  name: Item 2

Usage: FROM "file.yaml"."/"

Named sequence:

items:
  - id: 1
    name: Item 1
  - id: 2
    name: Item 2

Usage: FROM "file.yaml"."items"

Single mapping as sequence: The driver also treats a single mapping element as a single-row sequence:

person:
  id: 1
  name: John

Usage: FROM "file.yaml"."person"

This returns a single row.

Technical Details

Implementation

The YamlJdbc driver uses:

  • SnakeYAML (Version 1.28) for YAML parsing
  • In-memory processing for fast access
  • Recursive navigation for nested structures

Performance Optimization

For optimal performance with large YAML files:

  • Use the most specific path expressions possible in the FROM clause
  • Limit the number of columns in the COLUMNS clause to what is necessary
  • Use WHERE clauses for filtering when possible
  • Note that the entire YAML document is loaded into memory

Practical Guidelines for Document Sizes

Since YAML files are fully loaded into memory:

  • < 10 MB: No problem, very fast processing
  • 10 MB - 50 MB: Works well, normal performance
  • 50 MB - 200 MB: Works, but monitor available heap memory
  • > 200 MB: May lead to memory issues; consider alternatives (e.g. splitting the file)

Note: YAML is optimized for smaller configuration files. For very large data volumes, formats like JSON or CSV are more efficient.

Character Sets

The YamlJdbc driver uses UTF-8 as the default character set for reading YAML files. This is the recommended standard for YAML documents.

Examples

Example 1: Book List

books:
  - id: 1
    category: fiction
    title: The Great Gatsby
    author: F. Scott Fitzgerald
    year: 1925
    price: 10.99
  - id: 2
    category: fiction
    title: To Kill a Mockingbird
    author: Harper Lee
    year: 1960
    price: 12.99
SELECT category, title, author, year, price
FROM "books.yaml"."books"
COLUMNS(category, title, author, year, price)
WHERE category = 'fiction'

Example 2: Nested Employee Structure

company:
  name: Acme Corp
  departments:
    - name: IT
      employees:
        - name: John Doe
          salary: 75000
        - name: Jane Smith
          salary: 80000
    - name: HR
      employees:
        - name: Bob Johnson
          salary: 65000
-- All employees with department info
SELECT "$parentid" AS dept_id, name, salary
FROM "company.yaml"."company.departments,employees"
COLUMNS(name, salary)

Example 3: Configuration Data

servers:
  - hostname: web01
    ip: 192.168.1.10
    role: webserver
    active: true
  - hostname: db01
    ip: 192.168.1.20
    role: database
    active: true
  - hostname: backup01
    ip: 192.168.1.30
    role: backup
    active: false
SELECT hostname, ip, role
FROM "servers.yaml"."servers"
COLUMNS(hostname, ip, role, active)
WHERE active = 'true'

Summary

The YamlJdbc driver offers a simple and intuitive way to read YAML files via SQL queries:

  • ✅ Simple dot notation for navigation
  • ✅ Comma-separated hierarchy navigation as with JSON and XML
  • ✅ System tables for file listings
  • ✅ Pseudo-columns for hierarchy information
  • ✅ Fully compatible with datasqill modules
  • ✅ Same syntax patterns as JsonJdbc and XmlJdbc

The driver is particularly suitable for:

  • Configuration files
  • Test data
  • Small to medium-sized datasets (< 50 MB)
  • Structured data with clear hierarchies