CsvJdbc Driver

Introduction

The CsvJdbc driver enables reading CSV files via JDBC.

SQL queries can be executed that select and return data from the file. The syntax is composed as follows:

  • a dummy connection 'jdbc:csv:' is used as the database
  • a schema is not used
  • a CSV file is specified as the table in the FROM clause
  • columns are selected in the SELECT clause

The following CSV file can be read, for example, using the driver:

TrackId|Name|AlbumId|MediaTypeId|GenreId|Composer|Milliseconds|Bytes|UnitPrice
1|For Those About To Rock (We Salute You)|1|1|1|Angus Young, Malcolm Young, Brian Johnson|343719|11170334|0,99
2|Balls to the Wall|2|2|1||342562|5510424|0,99
3|Fast As a Shark|3|2|1|F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman|230619|3990994|0,99
4|Restless and Wild|3|2|1|F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman|252051|4331779|0,99
5|Princess of the Dawn|3|2|1|Deaffy & R.A. Smith-Diesel|375418|6290521|0,99
6|Put The Finger On You|1|1|1|Angus Young, Malcolm Young, Brian Johnson|205662|6713451|0,99
7|Let's Get It Up|1|1|1|Angus Young, Malcolm Young, Brian Johnson|233926|7636561|0,99

An SQL query over the driver would look like this:

SELECT x,y,z
    FROM "chinook.Track.csv" COLUMNS (x,y,z) SEPARATED BY '|'

This produces the following result set:

x y z
- - - 
TrackId Name AlbumId 
1 For Those About To Rock (We Salute You) 1 
2 Balls to the Wall 2 
3 Fast As a Shark 3 
4 Restless and Wild 3 
5 Princess of the Dawn 3 
6 Put The Finger On You 1 
7 Let's Get It Up 1

Columns are selected sequentially from the file. The column names specified in the COLUMN clause are assigned to them.

Headers

If the CSV file has headers (as in our example), they can be used to identify column names.

The HEADLINE clause is used for this:

SELECT trackid,name,composer
    FROM "chinook.Track.csv" COLUMNS(trackid,name,composer) HEADLINE 1 SEPARATED BY '|'

The line number with the headers (after HEADLINE) has the following effects:

  1. By specifying the HEADLINE clause, the headers contained in it are not included in the result set.
  2. The header row does not have to be the first row of a CSV file (all preceding rows before the header row are then ignored).
  3. Not all columns are needed in ascending order anymore; instead, columns can be selectively chosen by name.
trackid name composer
------- ---- -------- 
1       For Those About To Rock (We Salute You) Angus Young, Malcolm Young, Brian Johnson 
2       Balls to the Wall  
3       Fast As a Shark F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman 
4       Restless and Wild F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman 
5       Princess of the Dawn Deaffy & R.A. Smith-Diesel 
6       Put The Finger On You Angus Young, Malcolm Young, Brian Johnson 
7       Let's Get It Up Angus Young, Malcolm Young, Brian Johnson

Aliases

As usual, column aliases can be used to give columns different names. This is helpful when additional SQL operators are applied to the column values:

SELECT trackid,BIGINT(milliseconds)/1000/60 minutes,name,composer
    FROM "chinook.Track.csv" COLUMNS(trackid,name,composer,milliseconds) HEADLINE 1 SEPARATED BY '|'

The result is:

trackid minutes name composer
------- ------- ---- -------- 
1       5       For Those About To Rock (We Salute You) Angus Young, Malcolm Young, Brian Johnson 
2       5       Balls to the Wall  
3       3       Fast As a Shark F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman 
4       4       Restless and Wild F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman 
5       6       Princess of the Dawn Deaffy & R.A. Smith-Diesel 
6       3       Put The Finger On You Angus Young, Malcolm Young, Brian Johnson 
7       3       Let's Get It Up Angus Young, Malcolm Young, Brian Johnson

Pseudo-Columns

In addition to the data columns from the CSV document, the CsvJdbc driver generates an additional pseudo-column "rownumber". This can be used to generate sequential numbers:

SELECT rownumber,BIGINT(milliseconds)/1000/60 minutes,name,composer
    FROM "chinook.Track.csv" COLUMNS(trackid,name,composer,milliseconds) HEADLINE 1 SEPARATED BY '|'
rownumber minutes name composer
--------- ------- ---- -------- 
1         5       For Those About To Rock (We Salute You) Angus Young, Malcolm Young, Brian Johnson 
2         5       Balls to the Wall  
3         3       Fast As a Shark F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman 
4         4       Restless and Wild F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman 
5         6       Princess of the Dawn Deaffy & R.A. Smith-Diesel 
6         3       Put The Finger On You Angus Young, Malcolm Young, Brian Johnson 
7         3       Let's Get It Up Angus Young, Malcolm Young, Brian Johnson

System Tables

The CsvJdbc driver supports directories as system tables. This allows searching for files in a directory in a loop query of a datasqill module, which can then be loaded.

The following query uses the "files()" function as a system table and its columns "directory" and "filename".

SELECT directory, filename
    FROM "csvjdbc/src/test/resources/de/softquadrat/jdbc/csv".files()
    WHERE filename = 'chinook.xlsx' or filename = 'ExcelValidate.json'
;

As a result, the searched files are returned:

directory filename
--------- -------- 
csvjdbc/src/test/resources/de/softquadrat/jdbc/csv CsvValidate.json 
csvjdbc/src/test/resources/de/softquadrat/jdbc/csv x.csv

Notes

Data Types

The CsvJdbc driver returns all results as text. Conversion to numbers or date fields must be performed in a subsequent processing step.