XmlJdbc Driver

Introduction

The XmlJdbc driver allows reading XML files via JDBC. Using SQL queries, data from an XML file can be extracted and processed just like from a database table.

Elements in an XML document are addressed as tables for querying. Three components are used in the query:

  • a dummy connection 'jdbc:xml:' is used as the database
  • the schema is an XML file (in the FROM clause)
  • the table is an XPath expression (in the FROM clause) that leads to one or more XML elements treated as rows
  • the columns are attributes (@attributeName) or child elements (elementName) within the row element

The following XML document can be read, for example, using the driver:

<?xml version="1.0" encoding="UTF-8"?>
<people>
    <person id="10" profession="scientist">
        <fname>Henry</fname>
        <lname>Cavendish</lname>
    </person>
    <person id="21" profession="scientist">
        <fname>Isaac</fname>
        <lname>Newton</lname>
    </person>
    <person id="33" profession="scientist">
        <fname>Doctor</fname>
        <lname>Seltsam</lname>
    </person>
</people>

An SQL query over the XmlJdbc driver would look like this:

SELECT "@id"
    , fname
    , lname
    , "@profession"
FROM "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml/people.xml"."/people/person"
COLUMNS("@id", fname, lname, "@profession")

Here, the filename "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml/people.xml" is specified as the schema and "/people/person" is a path expression that leads to the "person" elements in the document.

Within the "person" elements, "@id" and "@profession" address the attributes and "fname" and "lname" the child elements.

The result looks like this:

@id fname  lname     @profession
--- ------ --------- -----------
10  Henry  Cavendish scientist
21  Isaac  Newton    scientist
33  Doctor Seltsam   scientist

Aliases

As usual, column aliases can be used to rename the columns:

SELECT "@id" AS id
    , fname AS firstname
    , lname AS lastname
    , "@profession" AS profession
FROM "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml/people.xml"."/people/person"
COLUMNS("@id", fname, lname, "@profession")

This produces:

id firstname lastname profession
-- ------- --------- ---------
10 Henry   Cavendish scientist
21 Isaac   Newton    scientist
33 Doctor  Seltsam   scientist

Pseudo-Columns

In addition to the data columns from the XML document, the XmlJdbc driver generates additional pseudo-columns:

  • rownumber
  • $id
  • $parentid
SELECT rownumber AS rownum
    , "$id" AS row_id
    , "@id" AS id
    , fname
    , lname
    , "@profession"
FROM "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml/people.xml"."/people/person"
COLUMNS("@id", fname, lname, "@profession")

The pseudo-column "rownumber" returns a sequential number for each record in the result. The pseudo-column "$id" returns a sequential number for the element in the document. The result of the example query looks like this:

rownum row_id id fname  lname     @profession
------ --------- -- ------ --------- -----------
1      0         10 Henry  Cavendish scientist
2      1         21 Isaac  Newton    scientist
3      2         33 Doctor Seltsam   scientist

Note: There is also the pseudo-column "$parentid". It returns the path of the parent element; more on this in the next section.

Nested Elements and Hierarchies

An element in an XML document can contain further sub-elements, as in the following example:

<?xml version="1.0" encoding="UTF-8"?>
<people>
    <person id="10" profession="scientist">
        <fname>Henry</fname>
        <lname>Cavendish</lname>
        <research>
            <area>Chemistry</area>
        </research>
        <research>
            <area>Density of the Earth</area>
        </research>
        <research>
            <area>Electricity</area>
        </research>
    </person>
    <person id="21" profession="scientist">
        <fname>Isaac</fname>
        <lname>Newton</lname>
        <research>
            <area>Calculus</area>
        </research>
        <research>
            <area>Optics</area>
        </research>
        <research>
            <area>Gravity</area>
        </research>
    </person>
    <person id="33" profession="scientist">
        <fname>Doctor</fname>
        <lname>Seltsam</lname>
    </person>
</people>

Nested Elements

Nested elements can be read in two ways:

Option 1. Full Path

All elements can be addressed directly via the full path:

SELECT area
FROM "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml/researcher.xml"."/people/person/research"
COLUMNS(area)

The query returns all "research" elements:

area
--------------------
Chemistry
Density of the Earth
Electricity
Calculus
Optics
Gravity

This is the simplest approach, but the reference to the parent elements is missing.

Option 2. Comma-Separated Hierarchy Navigation

The XmlJdbc driver also supports comma-separated navigation, similar to JsonJdbc or YamlJdbc:

SELECT "$parentid", area
FROM "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml/researcher.xml"."/people/person,research"
COLUMNS(area)

Here "/people/person" navigates to the first hierarchy level and "research" (without leading slash!) to the second level relative to the found "person" elements.

This produces:

$parentid area
--------- --------------------
0         Chemistry
0         Density of the Earth
0         Electricity
1         Calculus
1         Optics
1         Gravity

The "$parentid" shows which "person" element the "research" elements belong to (0 = first person element, 1 = second person element, etc.).

Correlation with Parent Elements

The pseudo-column "$parentid" enables the correlation between nested elements and their parent elements.

Best practice for nested hierarchies:

  1. First query - Read the parent elements and store the "$id" pseudo-column as primary key:
    SELECT "$id", "@id" AS person_id, fname, lname
    FROM "researcher.xml"."/people/person"
    COLUMNS("@id", fname, lname)
    

This produces:

$id person_id fname  lname
--- --------- ------ --------
0   10        Henry  Cavendish
1   21        Isaac  Newton
2   33        Doctor Seltsam
  1. Second query - Read the child elements with comma-separated navigation:
    SELECT "$parentid", area
    FROM "researcher.xml"."/people/person,research"
    COLUMNS(area)
    

This produces:

$parentid area
--------- --------------------
0         Chemistry
0         Density of the Earth
0         Electricity
1         Calculus
1         Optics
1         Gravity
  1. Joining - The "$parentid" from the second query corresponds to the "$id" from the first query, so you can correctly join the records in your database.

Note on Relative Paths in Hierarchies

With comma-separated navigation:

  • The first path starts with "/" and is absolute from the root element
  • Subsequent paths are relative to the found elements of the previous level and start without "/"

Example:

-- Correct:
FROM "file.xml"."/root/element,subelement,subsubelement"
                  ↑ absolute      ↑ relative  ↑ relative

-- Incorrect:
FROM "file.xml"."/root/element,/subelement,/subsubelement"
                                ↑ not relative!

Attributes and Elements

The XmlJdbc driver distinguishes between XML attributes and XML elements:

  • Attributes are referenced with leading "@": @attributeName
  • Child elements are referenced directly by their name: elementName
  • Attributes of child elements are referenced with elementName/@attributeName

Example with mixed attributes and elements:

<?xml version="1.0" encoding="UTF-8"?>
<books>
    <book id="1" category="fiction">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <year>1925</year>
        <price currency="EUR">10.99</price>
    </book>
    <book id="2" category="programming">
        <title>Clean Code</title>
        <author>Robert C. Martin</author>
        <year>2008</year>
        <price currency="USD">45.00</price>
    </book>
</books>
SELECT "@id" AS id
    , "@category" AS kategorie
    , title
    , author
    , year
    , price
    , "price/@currency" AS currency
FROM "books.xml"."/books/book"
COLUMNS("@id", "@category", title, author, year, price, "price/@currency")

This produces:

id category    title            author               year price currency
-- ------------ ---------------- -------------------- ---- ----- --------
1  fiction      The Great Gatsby F. Scott Fitzgerald 1925 10.99 EUR
2  programming  Clean Code       Robert C. Martin    2008 45.00 USD

System Tables

The XmlJdbc driver supports directories as system tables. This allows searching for files in a directory in a loop query of a datasqill module, which can then be loaded.

The following query uses the "files()" function as a system table and its columns "directory" and "filename".

SELECT directory, filename
    FROM "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml".files()
    WHERE filename = 'books.xml' OR filename = 'employees.xml'
    ORDER BY filename

As a result, the searched files in the used directory are returned:

directory                                          filename
-------------------------------------------------- -------------
xmljdbc/src/test/resources/de/softquadrat/jdbc/xml books.xml
xmljdbc/src/test/resources/de/softquadrat/jdbc/xml employees.xml

Notes

Data Types

The XmlJdbc driver returns all results as text. Conversion to other data types must be performed in downstream processing steps.

Empty Result

If a query returns no records, it should be verified whether the path expression in the FROM clause actually navigates to XML elements. If this is not the case, the XmlJdbc driver ignores the query and returns no results.

Null Values

The XmlJdbc driver returns the content of selected columns in the SELECT clause. If the addressed attribute or element does not exist, a null value is returned.

<test>
    <item id="1">
        <name>Test1</name>
    </item>
    <item id="2">
        <!-- name missing -->
    </item>
</test>
SELECT "@id"
    , name
    , description
FROM "test.xml"."/test/item"
COLUMNS("@id", name, description)
@id name  description
--- ----- -----------
1   Test1 null
2   null  null

Reading Element Contents

If the element in the SELECT clause is not empty, its text content is returned as the result. If an element has both text content and child elements, only the direct text content (without the child elements) is returned.

<test>
    <item id="1">
        <name>Test1</name>
        <details>
            <created>17.3.2022</created>
            <description>only a test</description>
        </details>
    </item>
    <item id="2">
        <name>Test2</name>
        <tags>
            <tag>important</tag>
            <tag>urgent</tag>
        </tags>
    </item>
</test>
SELECT "@id"
    , name
FROM "test.xml"."/test/item"
COLUMNS("@id", name)
@id name
--- -----
1   Test1
2   Test2

XPath Limitations

The XmlJdbc driver uses a simplified XPath syntax for navigation in the XML document. Full XPath expressions with predicates, axes, or functions are not supported.

Supported:

  • Simple paths like /root/element/subelement
  • Attribute references with @attributeName
  • Child element references with elementName
  • Attributes of child elements with elementName/@attributeName

Not supported:

  • XPath predicates like /root/element[@id='1']
  • XPath axes like //element or parent::*
  • XPath functions like count() or text()

Hybrid Processing: StAX and DOM

The XmlJdbc driver uses a hybrid architecture for optimal performance and flexibility:

StAX for Simple Paths

For simple, non-hierarchical queries (without comma separation), StAX (Streaming API for XML) is used:

-- Uses StAX
FROM "file.xml"."/people/person"

Advantages:

  • Very memory efficient: The XML document does not need to be fully loaded into memory
  • Performant: Sequential reading without DOM construction
  • Suitable for very large files: XML files of several GB can also be processed

DOM for Hierarchical Paths

For hierarchical queries with comma-separated navigation, DOM (Document Object Model) is used:

-- Uses DOM
FROM "file.xml"."/people/person,research"

Note:

  • The XML document is fully loaded into memory
  • For very large XML files with deep hierarchies, this can be memory-intensive
  • On the other hand, navigation through nested structures is simpler and more flexible

Practical guidelines for document sizes:

  • < 100 MB: DOM is perfectly fine, no performance issues
  • 100 MB - 500 MB: DOM usually still works well; monitor available heap memory
  • 500 MB - 1 GB: DOM becomes critical; prefer StAX when possible
  • > 1 GB: StAX is strongly recommended for simple paths; use DOM only when hierarchical navigation is absolutely necessary

Recommendation: Use simple paths without comma separation whenever possible to benefit from StAX memory efficiency. For most XML files in practice (typically < 100 MB), the difference does not matter much.

Character Sets

The XmlJdbc driver automatically recognizes the character set declaration in the XML header (<?xml version="1.0" encoding="UTF-8"?>). UTF-8 is the default character set when no explicit declaration is present.

Namespaces

The XmlJdbc driver currently does not support XML namespaces. Elements with namespace prefixes are treated as regular element names.

Summary

The XmlJdbc driver offers a simple and intuitive way to read XML files via SQL queries:

  • ✅ Simple path notation for navigation
  • ✅ Comma-separated hierarchy navigation as with JSON and YAML
  • ✅ System tables for file listings
  • ✅ Pseudo-columns for hierarchy information
  • ✅ Fully compatible with datasqill modules
  • ✅ Same syntax patterns as JsonJdbc and YamlJdbc