Using a wrapper script at the Unix level, we can perform the following tasks:
Capture the technical metadata
Capture load failures
Capture the details of the last load
Compare incremental and full loads
Sqoop vs. BigQuery:
We can create a wrapper that generates a log with the technical metadata below, which supports a restart mechanism when the script fails.
Source: file name, source file path, source file count, source file record count
Target: cluster information (connection details), target database, target table name, target record count
Execution start timestamp, execution end timestamp
Wrapper script:
It writes the elements listed above to a log file to support restartability, or it loads the same details into a control table in the database.
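As a minimal sketch of the log-file variant, the Python below writes one JSON line per run and reads back the most recent one, which is what a restart check would consult. The function names and the record layout are illustrative assumptions, not part of any existing script.

```python
import json


def build_load_record(source, target, started_at, ended_at):
    """Combine source, target, and timing details into one log record.

    `source` and `target` are plain dicts (hypothetical layout), for example
    {"filename": ..., "path": ..., "record_count": ...}.
    """
    return {
        "source": source,
        "target": target,
        "execution_start": started_at.isoformat(),
        "execution_end": ended_at.isoformat(),
    }


def append_record(log_path, record):
    """Append the record as one JSON line so each run stays on its own line."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def last_record(log_path):
    """Return the most recent record, or None if the log is missing or empty."""
    try:
        with open(log_path, encoding="utf-8") as handle:
            lines = [line for line in handle if line.strip()]
    except FileNotFoundError:
        return None
    return json.loads(lines[-1]) if lines else None
```

The same record could be inserted into a control table instead of a file; the fields stay the same and only the write step changes.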
We can write another wrapper script to handle errors and exceptions.
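One possible shape for such an error-handling wrapper is sketched below in Python. It runs a single load step as a child process and returns the outcome instead of letting the failure propagate, so the caller can log it. The function name `run_step` and the result layout are assumptions made for illustration.

```python
import subprocess


def run_step(command):
    """Run one load step and report whether it succeeded, without raising.

    `command` is a list such as ["sqoop", "import", ...]; the contents are
    up to the caller.
    """
    try:
        completed = subprocess.run(command, capture_output=True, text=True)
    except OSError as exc:
        # Raised when the executable cannot be started (e.g. it does not exist).
        return {"ok": False, "exit_code": None, "error": str(exc)}
    return {
        "ok": completed.returncode == 0,
        "exit_code": completed.returncode,
        "error": completed.stderr.strip(),
    }
```

The returned dictionary can be written to the same log or control table as the technical metadata, so the failure and the last load details are kept together.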
Approaches for Historical Load (for Hive and gsutil):
gsutil: gsutil supports resumable uploads. With this option, a failed transfer can be restarted from the point of failure rather than from the beginning.
This makes it the preferred approach for historical loads. Please see the attached document for details.
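A wrapper can take advantage of this by rerunning the same `gsutil cp` command after a failure. The Python sketch below assumes that gsutil is installed and that rerunning an identical command resumes an interrupted resumable upload; the attached document should be checked for the exact conditions. The function name, the retry count, and the `runner` parameter (added only so the logic can be exercised without a real upload) are illustrative assumptions.

```python
import subprocess
import time


def copy_with_retries(local_path, gcs_url, attempts=3, wait_seconds=0,
                      runner=subprocess.run):
    """Rerun the same `gsutil cp` command until it succeeds or attempts run out.

    Returns the number of attempts it took. Raises RuntimeError if every
    attempt fails.
    """
    command = ["gsutil", "cp", local_path, gcs_url]
    for attempt in range(1, attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(wait_seconds)  # pause before repeating the same command
    raise RuntimeError(f"upload failed after {attempts} attempts: {local_path}")
```

Keeping the command identical across attempts is the point of the design: the wrapper does not track the failure point itself and leaves that to the tool.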
Sqoop: We need to take the row count in the target and build the restart wrapper around a comparison of the source row count with the count of rows already loaded into the target.
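The comparison itself can be very small. The Python sketch below takes the two counts as plain integers and reports whether the load is complete and how many rows are still missing. How the counts are obtained (for example, from count queries on each side) is left out, and the function name and return layout are assumptions for illustration.

```python
def restart_plan(source_count, target_count):
    """Compare source and target row counts and describe what is left to load."""
    if source_count < 0 or target_count < 0:
        raise ValueError("row counts cannot be negative")
    if target_count > source_count:
        # More rows in the target than in the source suggests duplicates or
        # a wrong table, so this is flagged rather than treated as complete.
        raise ValueError("target has more rows than source")
    remaining = source_count - target_count
    return {
        "complete": remaining == 0,
        "rows_loaded": target_count,
        "rows_remaining": remaining,
    }
```

Note that a count tells the wrapper how much is missing, not which rows. Resuming partway assumes the rows are read in a stable order; without that, rerunning the whole load is the safer choice.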
Approaches for Incremental Load (for Hive):
gsutil is not the preferred option for incremental loads.
The approaches below can be automated with another wrapper script:
1) Approach based on Timestamp
2) Approach based on record count
3) Staging-layer approach: we load the incremental data into staging tables and, in the next step, load it into the target table. If there is a failure, we truncate only the staging table and then load the target table again.
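The timestamp and staging approaches can be combined, as the Python sketch below illustrates. It uses an in-memory-capable SQLite connection purely as a stand-in for Hive so that the example is runnable; the table names (`source`, `staging`, `target`), the column names, and the function names are all hypothetical.

```python
import sqlite3


def stage_increment(conn, last_loaded_ts):
    """Clear staging, then fill it with source rows newer than last_loaded_ts.

    Timestamps are ISO-8601 strings here, so string comparison matches
    chronological order.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM staging")  # drop leftovers from a failed run
        conn.execute(
            "INSERT INTO staging (id, updated_at) "
            "SELECT id, updated_at FROM source WHERE updated_at > ?",
            (last_loaded_ts,),
        )
    return conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]


def publish_staging(conn):
    """Copy everything currently in staging into the target table."""
    with conn:
        conn.execute(
            "INSERT INTO target (id, updated_at) "
            "SELECT id, updated_at FROM staging"
        )
    return conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
```

Because `stage_increment` always clears staging first, a failed run can be repeated without touching the target table; only `publish_staging` writes to it.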