The XmlJdbc driver allows reading XML files via JDBC. Using SQL queries, data from an XML file can be extracted and processed just like from a database table.
Elements in an XML document are addressed as tables for querying. Three components are used in the query:
The following XML document can be read, for example, using the driver:
<?xml version="1.0" encoding="UTF-8"?>
<people>
<person id="10" profession="scientist">
<fname>Henry</fname>
<lname>Cavendish</lname>
</person>
<person id="21" profession="scientist">
<fname>Isaac</fname>
<lname>Newton</lname>
</person>
<person id="33" profession="scientist">
<fname>Doctor</fname>
<lname>Seltsam</lname>
</person>
</people>
An SQL query over the XmlJdbc driver would look like this:
SELECT "@id"
, fname
, lname
, "@profession"
FROM "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml/people.xml"."/people/person"
COLUMNS("@id", fname, lname, "@profession")
Here, the filename "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml/people.xml" is specified as the schema and "/people/person" is a path expression that leads to the "person" elements in the document.
Within the "person" elements, "@id" and "@profession" address the attributes and "fname" and "lname" the child elements.
The result looks like this:
@id fname lname @profession
--- ------ --------- -----------
10 Henry Cavendish scientist
21 Isaac Newton scientist
33 Doctor Seltsam scientist
As usual, column aliases can be used to rename the columns:
SELECT "@id" AS id
, fname AS firstname
, lname AS lastname
, "@profession" AS profession
FROM "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml/people.xml"."/people/person"
COLUMNS("@id", fname, lname, "@profession")
This produces:
id firstname lastname profession
-- ------- --------- ---------
10 Henry Cavendish scientist
21 Isaac Newton scientist
33 Doctor Seltsam scientist
In addition to the data columns from the XML document, the XmlJdbc driver generates additional pseudo-columns:
SELECT rownumber AS rownum
, "$id" AS row_id
, "@id" AS id
, fname
, lname
, "@profession"
FROM "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml/people.xml"."/people/person"
COLUMNS("@id", fname, lname, "@profession")
The pseudo-column "rownumber" returns a sequential number for each record in the result. The pseudo-column "$id" returns a sequential number for the element in the document. The result of the example query looks like this:
rownum row_id id fname lname @profession
------ --------- -- ------ --------- -----------
1 0 10 Henry Cavendish scientist
2 1 21 Isaac Newton scientist
3 2 33 Doctor Seltsam scientist
Note: There is also the pseudo-column "$parentid". It returns the path of the parent element; more on this in the next section.
An element in an XML document can contain further sub-elements, as in the following example:
<?xml version="1.0" encoding="UTF-8"?>
<people>
<person id="10" profession="scientist">
<fname>Henry</fname>
<lname>Cavendish</lname>
<research>
<area>Chemistry</area>
</research>
<research>
<area>Density of the Earth</area>
</research>
<research>
<area>Electricity</area>
</research>
</person>
<person id="21" profession="scientist">
<fname>Isaac</fname>
<lname>Newton</lname>
<research>
<area>Calculus</area>
</research>
<research>
<area>Optics</area>
</research>
<research>
<area>Gravity</area>
</research>
</person>
<person id="33" profession="scientist">
<fname>Doctor</fname>
<lname>Seltsam</lname>
</person>
</people>
Nested elements can be read in two ways:
All elements can be addressed directly via the full path:
SELECT area
FROM "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml/researcher.xml"."/people/person/research"
COLUMNS(area)
The query returns all "research" elements:
area
--------------------
Chemistry
Density of the Earth
Electricity
Calculus
Optics
Gravity
This is the simplest approach, but the reference to the parent elements is missing.
The XmlJdbc driver also supports comma-separated navigation, similar to JsonJdbc or YamlJdbc:
SELECT "$parentid", area
FROM "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml/researcher.xml"."/people/person,research"
COLUMNS(area)
Here "/people/person" navigates to the first hierarchy level and "research" (without leading slash!) to the second level relative to the found "person" elements.
This produces:
$parentid area
--------- --------------------
0 Chemistry
0 Density of the Earth
0 Electricity
1 Calculus
1 Optics
1 Gravity
The "$parentid" shows which "person" element the "research" elements belong to (0 = first person element, 1 = second person element, etc.).
The pseudo-column "$parentid" enables the correlation between nested elements and their parent elements.
Best practice for nested hierarchies:
SELECT "$id", "@id" AS person_id, fname, lname
FROM "researcher.xml"."/people/person"
COLUMNS("@id", fname, lname)
This produces:
$id person_id fname lname
--- --------- ------ --------
0 10 Henry Cavendish
1 21 Isaac Newton
2 33 Doctor Seltsam
SELECT "$parentid", area
FROM "researcher.xml"."/people/person,research"
COLUMNS(area)
This produces:
$parentid area
--------- --------------------
0 Chemistry
0 Density of the Earth
0 Electricity
1 Calculus
1 Optics
1 Gravity
With comma-separated navigation:
Example:
-- Correct:
FROM "file.xml"."/root/element,subelement,subsubelement"
↑ absolute ↑ relative ↑ relative
-- Incorrect:
FROM "file.xml"."/root/element,/subelement,/subsubelement"
↑ not relative!
The XmlJdbc driver distinguishes between XML attributes and XML elements:
@attributeNameelementNameelementName/@attributeNameExample with mixed attributes and elements:
<?xml version="1.0" encoding="UTF-8"?>
<books>
<book id="1" category="fiction">
<title>The Great Gatsby</title>
<author>F. Scott Fitzgerald</author>
<year>1925</year>
<price currency="EUR">10.99</price>
</book>
<book id="2" category="programming">
<title>Clean Code</title>
<author>Robert C. Martin</author>
<year>2008</year>
<price currency="USD">45.00</price>
</book>
</books>
SELECT "@id" AS id
, "@category" AS kategorie
, title
, author
, year
, price
, "price/@currency" AS currency
FROM "books.xml"."/books/book"
COLUMNS("@id", "@category", title, author, year, price, "price/@currency")
This produces:
id category title author year price currency
-- ------------ ---------------- -------------------- ---- ----- --------
1 fiction The Great Gatsby F. Scott Fitzgerald 1925 10.99 EUR
2 programming Clean Code Robert C. Martin 2008 45.00 USD
The XmlJdbc driver supports directories as system tables. This allows searching for files in a directory in a loop query of a datasqill module, which can then be loaded.
The following query uses the "files()" function as a system table and its columns "directory" and "filename".
SELECT directory, filename
FROM "xmljdbc/src/test/resources/de/softquadrat/jdbc/xml".files()
WHERE filename = 'books.xml' OR filename = 'employees.xml'
ORDER BY filename
As a result, the searched files in the used directory are returned:
directory filename
-------------------------------------------------- -------------
xmljdbc/src/test/resources/de/softquadrat/jdbc/xml books.xml
xmljdbc/src/test/resources/de/softquadrat/jdbc/xml employees.xml
The XmlJdbc driver returns all results as text. Conversion to other data types must be performed in downstream processing steps.
If a query returns no records, it should be verified whether the path expression in the FROM clause actually navigates to XML elements. If this is not the case, the XmlJdbc driver ignores the query and returns no results.
The XmlJdbc driver returns the content of selected columns in the SELECT clause. If the addressed attribute or element does not exist, a null value is returned.
<test>
<item id="1">
<name>Test1</name>
</item>
<item id="2">
<!-- name missing -->
</item>
</test>
SELECT "@id"
, name
, description
FROM "test.xml"."/test/item"
COLUMNS("@id", name, description)
@id name description
--- ----- -----------
1 Test1 null
2 null null
If the element in the SELECT clause is not empty, its text content is returned as the result. If an element has both text content and child elements, only the direct text content (without the child elements) is returned.
<test>
<item id="1">
<name>Test1</name>
<details>
<created>17.3.2022</created>
<description>only a test</description>
</details>
</item>
<item id="2">
<name>Test2</name>
<tags>
<tag>important</tag>
<tag>urgent</tag>
</tags>
</item>
</test>
SELECT "@id"
, name
FROM "test.xml"."/test/item"
COLUMNS("@id", name)
@id name
--- -----
1 Test1
2 Test2
The XmlJdbc driver uses a simplified XPath syntax for navigation in the XML document. Full XPath expressions with predicates, axes, or functions are not supported.
Supported:
/root/element/subelement@attributeNameelementNameelementName/@attributeNameNot supported:
/root/element[@id='1']//element or parent::*count() or text()The XmlJdbc driver uses a hybrid architecture for optimal performance and flexibility:
For simple, non-hierarchical queries (without comma separation), StAX (Streaming API for XML) is used:
-- Uses StAX
FROM "file.xml"."/people/person"
Advantages:
For hierarchical queries with comma-separated navigation, DOM (Document Object Model) is used:
-- Uses DOM
FROM "file.xml"."/people/person,research"
Note:
Practical guidelines for document sizes:
Recommendation: Use simple paths without comma separation whenever possible to benefit from StAX memory efficiency. For most XML files in practice (typically < 100 MB), the difference does not matter much.
The XmlJdbc driver automatically recognizes the character set declaration in the XML header (<?xml version="1.0" encoding="UTF-8"?>). UTF-8 is the default character set when no explicit declaration is present.
The XmlJdbc driver currently does not support XML namespaces. Elements with namespace prefixes are treated as regular element names.
The XmlJdbc driver offers a simple and intuitive way to read XML files via SQL queries: