Upsert

The module generates a database-specific Upsert (Update else Insert) statement based on the SQL query entered by the developer and executes the effective statement. The target table is reconciled with the SQL query via defined keys. Rows with keys that exist in both the target table and the SQL query are set to the values of the SQL query (Update functionality). Rows with keys that appear in the SQL query but not in the target table are inserted into the target table (Insert functionality).

Name Meaning
Module Upsert
Module Class DsModTemplateSQL
Type Java
Purpose Based on a defined primary key, the data of the specified SQL query should be integrated into the target table (Update else Insert = Upsert)
Transformation Code SQL query
Sources Source tables in a database
Targets Target table in the same database

Description

Since not all databases have a unified SQL MERGE command, a statement is generated depending on the database involved that performs the upsert functionality.

/ The SQL query and the target table are not identical after execution, since rows that are not in the source query were not deleted from the target table. /

The following example shows how to build an Upsert statement with the help of the module. First, the SQL query used:

SELECT n.n_nationkey AS nation_key
     , n.n_name nation_name
     , r.r_regionkey AS region_key
     , r.r_name region_name
  FROM stage.region r
     , stage.nation n
 WHERE 1 = 1
   AND n.n_regionkey = r.r_regionkey

The primary key "nation_key" is specified (see Attributes).

If the target schema is "data_mart" and the target table is "d_nation", the module generates the following effective SQL for Postgres:

INSERT INTO data_mart.d_nation AS tgt (
       nation_key
     , nation_name
     , region_key
     , region_name
     )
SELECT nation_key
     , nation_name
     , region_key
     , region_name
  FROM (
SELECT n.n_nationkey AS nation_key
     , n.n_name nation_name
     , r.r_regionkey AS region_key
     , r.r_name region_name
  FROM stage.region r
     , stage.nation n
 WHERE 1 = 1
   AND n.n_regionkey = r.r_regionkey
       ) src
 ON CONFLICT (nation_key)
   DO UPDATE
      SET nation_name = excluded.nation_name
        , region_key = excluded.region_key
        , region_name = excluded.region_name
    WHERE NOT (excluded.nation_name IS NOT DISTINCT FROM tgt.nation_name
           AND excluded.region_key IS NOT DISTINCT FROM tgt.region_key
           AND excluded.region_name IS NOT DISTINCT FROM tgt.region_name
          )

As can be seen in the example, the column names in the source query must be specified as aliases so that they match the column names of the target table. This is not necessary of course when the query column already has the same name as the corresponding column in the target.

In the case of an update, datasqill checks whether something has changed. This check must take into account that columns that are NULL in both source and target are considered equal. In Postgres, the operator "IS NOT DISTINCT FROM" was used in this example to recognize that the two columns are identical.

For an Oracle target database, the generated SQL looks like this:

MERGE INTO data_mart.d_nation tgt
USING (
SELECT n.n_nationkey AS nation_key
     , n.n_name nation_name
     , r.r_regionkey AS region_key
     , r.r_name region_name
  FROM stage.region r
     , stage.nation n
 WHERE 1 = 1
   AND n.n_regionkey = r.r_regionkey
      ) src
   ON (src.nation_key = tgt.nation_key
      )
 WHEN MATCHED THEN
   UPDATE
      SET tgt.nation_name = src.nation_name
        , tgt.region_key = src.region_key
        , tgt.region_name = src.region_name
    WHERE NOT (DECODE(src.nation_name, tgt.nation_name, 1, 0) = 1
           AND DECODE(src.region_key, tgt.region_key, 1, 0) = 1
           AND DECODE(src.region_name, tgt.region_name, 1, 0) = 1
          )
 WHEN NOT MATCHED THEN
   INSERT (
          nation_key
        , nation_name
        , region_key
        , region_name
        )
   VALUES (
          src.nation_key
        , src.nation_name
        , src.region_key
        , src.region_name
        )

You can see that both the underlying DML command differs (INSERT or MERGE) and the procedure for checking whether the data columns have changed.

Data Sources

The sources of the module are tables that are read via the query defined by the datasqill developer. For these tables, the datasqill runtime user must have read rights. All source tables used in the SQL query must be connected to the module input in the datasqill GUI in the graphical data model.

Data Target

The target of the module is a single table. For this, the datasqill runtime user must have write rights. The target table must be connected to the module output in the datasqill GUI in the graphical data model.

Attributes

In the delivered state, the module offers only one attribute in the GUI for the datasqill developer:

Name Type Meaning
Join Key Columns Required Column List The text "RL:" followed by a comma-separated list of columns to be used for the join condition. By default it says "RL:${tgt_pk_list}". The variable tgt_pk_list is replaced by the list of primary key columns of the target table

Variations

When installing the module, additional attributes can be configured

Upsert with Loop

The module can be extended so that the "Loop Query" attribute is activated (see configuring module attributes in the appendix):

Name Type Meaning
Loop Query SQL Loop query to use the results in the main query.

Then you have an additional attribute in which you can enter a query. When a loop query is captured, the module executes the main query for each record found. The column results can be used as variables.

Since the main query is executed per result row of the loop query, the result set of the loop query should be kept small. Otherwise you lose the performance of set operations and execution behaves procedurally.

The loop query can also be used very effectively as an IF query. E.g.:

SELECT DISTINCT 1
  FROM <source-table>
 WHERE <last-change-date> >= <today>-7

You get one row if something has changed in the last 7 days, otherwise none. Thus the main query is executed once or not at all.

In the main query you can use the column names of the loop query as variables. Note that the variables are each inserted as text (string).

See examples at the Insert module.

Technical Columns

You can define technical columns that allow you to see afterwards when which rows were inserted or modified:

Name Column Definition Meaning Value
ins_date ta_inserted_ts Inserted at Not changed on update. Set to current timestamp (current_date) when inserting a row
ins_by ta_inserted_by Inserted by Not changed on update. Set to current modifier (current_by) when inserting a row
upd_date ta_modified_ts Modified at Filled with the current timestamp (current_date)
upd_by ta_modified_by Modified by Filled with the current modifier (current_by)
del_date ta_deletion_ts Deleted at Filled with NULL
del_by ta_deleted_by Deleted by Filled with NULL
del_ind ta_is_deleted Deletion indicator Filled with 'N'
current_date ${get_current_datetime()} This value defines the current timestamp In delivered state the current timestamp in date-time (without milliseconds)
current_by ${request_id} This value defines the current modifier In delivered state the Request ID of the datasqill action execution

The column names used (under Column Definition) can be changed so that e.g. ta_inserted_ts is no longer called but insert_date. See also configuring module attributes in the appendix.

If the target table contains some or several of these technical columns, they are filled according to the table above. If you change the target table "data_mart.d_nation" in the very first example and add the technical columns ta_inserted_ts, ta_inserted_by, ta_modified_ts, ta_modified_by, ta_deletion_ts, ta_deletion_by and ta_is_deleted, the following is generated for Exasol:

MERGE INTO data_mart.d_nation tgt
USING (
SELECT n.n_nationkey AS nation_key
     , n.n_name nation_name
     , r.r_regionkey AS region_key
     , r.r_name region_name
  FROM stage.region r
     , stage.nation n
 WHERE 1 = 1
   AND n.n_regionkey = r.r_regionkey
      ) src
   ON (src.nation_key = tgt.nation_key
      )
 WHEN MATCHED THEN
   UPDATE
      SET tgt.nation_name = src.nation_name
        , tgt.region_key = src.region_key
        , tgt.region_name = src.region_name
        , tgt.ta_modified_by = 1804
        , tgt.ta_modified_ts = DATE_TRUNC('second', CURRENT_TIMESTAMP)
        , tgt.ta_is_deleted = 'N'
        , tgt.ta_deletion_by = NULL
        , tgt.ta_deletion_ts = NULL
    WHERE NOT (DECODE(src.nation_name, tgt.nation_name, 1, 0) = 1
           AND DECODE(src.region_key, tgt.region_key, 1, 0) = 1
           AND DECODE(src.region_name, tgt.region_name, 1, 0) = 1
           AND tgt.ta_is_deleted = 'N'
           AND tgt.ta_deletion_by IS NULL
           AND tgt.ta_deletion_ts IS NULL
          )
 WHEN NOT MATCHED THEN
   INSERT (
          nation_key
        , nation_name
        , region_key
        , region_name
        , ta_modified_by
        , ta_modified_ts
        , ta_inserted_by
        , ta_inserted_ts
        , ta_is_deleted
        , ta_deletion_by
        , ta_deletion_ts
        )
   VALUES (
          src.nation_key
        , src.nation_name
        , src.region_key
        , src.region_name
        , 1804
        , DATE_TRUNC('second', CURRENT_TIMESTAMP)
        , 1804
        , DATE_TRUNC('second', CURRENT_TIMESTAMP)
        , 'N'
        , NULL
        , NULL
        )