ETL in computing terminology stands for Extract, Transform and Load, a process used mostly in data warehousing projects. An ETL framework involves the following three steps:
1. Extract: Read the data from a data source, which could be a database or a file dump from another system
2. Transform: Massage the data into an appropriate form. This may mean trimming down the data or aggregating data from multiple data sources
3. Load: Write the data to another data source, such as a database, or generate a flat file
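To make the three steps concrete, here is a minimal sketch in Python, assuming a CSV file as the source and a SQLite database as the target (the `sales` table, column names and cleanup rules are illustrative, not part of any real system):

```python
import csv
import sqlite3

def extract(csv_path):
    # Extract: read raw records from a flat-file data source.
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: massage each record into the target shape
    # (trim whitespace, normalize casing, convert types).
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, conn):
    # Load: write the transformed records to the target database.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (name, amount) VALUES (:name, :amount)", rows
    )
    conn.commit()
```

A real framework would generalize each step behind an interface, but the shape of the pipeline is always this: `load(transform(extract(source)), target)`.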
The first part of the process involves reading data from the various data sources. The data itself can be in different formats; databases and flat files are the most common. In some cases the data sources may also include non-relational sources.
The next step involves applying various rules to the data set to prepare it for the Load step. Some data sets may need very little or no transformation, while others may need very complex transformations to meet the business requirements. Some of the common operations needed here are:
- Filtering the data set for a subset of records
- Generating new values based on existing columns (using pre-defined formulas)
- Splitting of data set into different tables
- Aggregating data from various data sets
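The four operations above can be sketched in a few lines of Python. The order records, statuses and regions here are made-up examples, not taken from any particular system:

```python
from collections import defaultdict

def transform_orders(rows):
    # Filter: keep only a subset of records.
    completed = [r for r in rows if r["status"] == "completed"]

    # Generate new values from existing columns via a pre-defined formula.
    for r in completed:
        r["total"] = r["quantity"] * r["unit_price"]

    # Split: route records into separate target tables, here keyed by region.
    by_region = defaultdict(list)
    for r in completed:
        by_region[r["region"]].append(r)

    # Aggregate: total revenue per region across the split data sets.
    revenue = {region: sum(r["total"] for r in rs)
               for region, rs in by_region.items()}
    return by_region, revenue
```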
The load phase writes the data to the target. What this phase does depends on the business needs: sometimes the load uploads a fresh data set or appends to it on an incremental basis, while in other cases it must update an existing data set.
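The two load styles can be sketched against a SQLite target (the `customers` table is hypothetical, and the upsert syntax assumes SQLite 3.24 or later):

```python
import sqlite3

def full_load(conn, rows):
    # Full refresh: replace the entire target table with a fresh data set.
    conn.execute("DELETE FROM customers")
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (:id, :name)", rows
    )
    conn.commit()

def incremental_load(conn, rows):
    # Incremental: insert new records, update existing ones (an upsert).
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (:id, :name) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        rows,
    )
    conn.commit()
```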
A typical ETL cycle consists of the following steps:
1. Cycle Initiation: This is the very first step of the cycle, where you collect all the reference data and validate that the settings provided are correct. This is the initialization phase; if there are errors during initialization, the ETL process fails.
2. Extract: In this step, you read the data from the data source
3. Validate: Here the data is validated against a pre-defined business ruleset
4. Transform: Apply any transformation rules
5. Stage: This can be categorized as a subset of the Transform step. A business requirement may need us to load the data into a temporary space, for example when we need to aggregate data from more than one data source. In that case, we use a temp database to hold the data sets before applying the transformation.
6. Load: Load the data into the final data source.
7. Cleanup: Clean up any temp files / databases.
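The seven steps above can be sketched as a single driver routine. This is only a skeleton under assumed interfaces (the components are passed in as callables; real frameworks would wire them up differently), but it shows the key structural point: cleanup runs in a `finally` block so temp artifacts are removed even when a step fails:

```python
def run_cycle(config, reader, validator, transformer, loader, cleanup):
    # 1. Cycle initiation: validate the provided settings before any work.
    if not config.get("source") or config.get("target") is None:
        raise ValueError("invalid ETL configuration")
    try:
        data = reader(config["source"])           # 2. Extract
        data = [r for r in data if validator(r)]  # 3. Validate against rules
        staged = [transformer(r) for r in data]   # 4. Transform (+ 5. Stage)
        loader(config["target"], staged)          # 6. Load into the target
    finally:
        cleanup()                                 # 7. Clean up temp files/dbs
```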
Some of the common challenges are:
- An ETL process involves considerable complexity, and significant problems can come up with an incorrectly designed solution
- Data sets in production can be vastly different from what the developers of the system use, which can lead to huge performance bottlenecks
- These solutions tend to grow horizontally, adding more data sources to extract from or load into. The solution should be designed so that such data sources can be added with minimal effort
Performance is the biggest challenge that any ETL solution has to struggle with. Most often the slowest part of the ETL process is the load phase, where we have to take care of the various database structures, the integrity of the records, and indexes. The transform phase can also lead to performance bottlenecks if extensive data transformations are required.
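One common mitigation for a slow load phase is to batch inserts inside a single transaction instead of committing row by row. A minimal sketch against SQLite (the `events` table is hypothetical):

```python
import sqlite3

def batched_load(conn, rows, batch_size=1000):
    # Batching inside one transaction avoids the per-row commit overhead
    # that often makes the load phase the slowest step of the pipeline.
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        cur.executemany(
            "INSERT INTO events (id, payload) VALUES (?, ?)",
            rows[start:start + batch_size],
        )
    conn.commit()
```

On a real database engine the same idea shows up as bulk-load utilities, and it is also common to drop or disable indexes before a large load and rebuild them afterwards.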
Layered Architecture Design
Core Layer: The primary layer, which holds all the business logic, i.e. the core Extract, Transform and Load processing
Job Management Layer: This layer takes care of scheduling jobs, managing queues, and other operational activities such as activating tasks, raising alerts etc.
Auditing and Error Handling: A layer dedicated to the auditing process, logging entries to log files or a database, and providing error-handling support
Utilities: A common layer providing shared functionality across the other layers
The Core Layer is the most important layer and holds most of the logic. As a good practice, this layer should be divided into sections coordinated by a common piece of processing logic. Some common components of this layer are:
1. Controller: These hold the processing logic that coordinates the entire ETL lifecycle. They hold the details of the various utilities and invoke them as needed.
2. Readers: These hold the logic to read data from data sources such as databases and flat files. Their responsibility should be to load the data set and make it available for the next phase.
3. Transformer: These components hold the logic for applying transformations to the data. Transformations can be business validations, mappings or other logic.
4. Mappers: These hold the mappings for a transformation. The controller should be aware of the mappings to be applied to the loaded data. In most cases, the framework should expose interfaces that let its consumers define mappings, and the framework (via the controller) should consume those mappings.
5. Validators: If validations need to be applied, these components should be defined individually. Again, in most cases the framework should make these available as interfaces, with concrete implementations provided by the consumers of the framework.
6. Loaders: These hold logic to load the data into a data source like databases and flat files.
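One way to express these components is as abstract interfaces that consumers of the framework implement, with the controller coordinating them. This is a sketch of one possible design, not a prescription; the class and method names are illustrative:

```python
from abc import ABC, abstractmethod

class Reader(ABC):
    @abstractmethod
    def read(self):
        """Load records from a data source (database, flat file, ...)."""

class Transformer(ABC):
    @abstractmethod
    def transform(self, record):
        """Apply mappings and validations to a single record."""

class Loader(ABC):
    @abstractmethod
    def load(self, records):
        """Write records to the target data source."""

class Controller:
    # Coordinates the ETL lifecycle by invoking the components in order.
    def __init__(self, reader, transformer, loader):
        self.reader, self.transformer, self.loader = reader, transformer, loader

    def run(self):
        records = self.reader.read()
        self.loader.load([self.transformer.transform(r) for r in records])
```

Consumers then plug in concrete readers, transformers and loaders, and the controller stays unchanged as new data sources are added.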
I am not a subject-matter expert on ETL, but I hope this helps.