What is ETL Setup/Performance Tuning? - Definition from Trenovision


ETL performance tuning: The data load using ETL in a typical data warehouse is usually a very time-consuming process. A data warehousing system whose ETL process finishes only just on time is going to have problems, because jobs often fail to start on time due to factors beyond the control of the data warehousing team. As a result, it is always a good idea for the data warehousing group to tune the ETL process as much as possible.
 


ETL Setup/Performance Tuning

 

  • The first step of ETL performance tuning is to identify performance bottlenecks. Bottlenecks can occur in the source or target database, in a mapping or transformation, or in the system. The strategy is to identify a performance bottleneck, eliminate it, and then identify the next one, until you are satisfied with the performance.
  • Run test sessions
  • Study performance details and thread statistics
  • Monitor system performance
  • Eliminate source and target database issues
  • Eliminate mapping issues
  • Eliminate transformation issues



Mapping Issues

Mapping-level optimization may take time to implement, but it can significantly boost session performance. In general, you optimize a mapping by reducing the number of transformations and deleting unnecessary links between them. Configure the mapping with the fewest transformations and expressions that do the most work possible. Here are some simple steps (a SQL sketch follows the list):

  • Optimize flat file sources
  • Configure single-pass reading
  • Optimize simple pass-through mappings
  • Optimize filters
  • Optimize data type conversions
  • Optimize expressions
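
As one illustration of optimizing filters and data type conversions, both can be pushed into the extraction query so that the mapping handles fewer rows and simpler ports. A minimal sketch, assuming a hypothetical orders source table:

    -- Hypothetical source table and columns: filter and convert at
    -- extraction time instead of inside the mapping.
    SELECT order_id,
           CAST(order_amount AS DECIMAL(12,2)) AS order_amount,
           order_date
    FROM   orders
    WHERE  order_status = 'SHIPPED';  -- rows filtered before the mapping sees them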



Transformation Issues

You can optimize the following types of transformations in a mapping:

  • Aggregator Transformation
  • Joiner Transformation
  • Lookup Transformation
  • Sequence Generator Transformation
  • Source Qualifier Transformation

Aggregator Transformation

Here are some guidelines to optimize the performance of an Aggregator transformation (a sketch follows the list):

  • Group by simple columns
  • Use sorted input
  • Use incremental aggregation
  • Filter data before you aggregate it
  • Limit port connections
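
For the sorted-input and early-filter guidelines, the source database can do the work before the Aggregator sees a row. A minimal sketch, reusing the hypothetical orders table from earlier:

    -- Pre-filter, pre-aggregate, and sort in the extraction query so the
    -- Aggregator receives sorted, already-reduced input.
    SELECT   customer_id,
             SUM(order_amount) AS total_amount,
             COUNT(*)          AS order_count
    FROM     orders
    WHERE    order_status = 'SHIPPED'   -- filter before aggregating
    GROUP BY customer_id
    ORDER BY customer_id;               -- sorted input for downstream steps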



Joiner Transformation

Joiner transformations can slow performance because they need additional space at run time. Some guidelines:

  • Designate as the master source the source with fewer duplicate key values
  • Designate as the master source the source with fewer rows
  • Perform joins in a database when possible (see the sketch below)



Lookup Transformation

  • If the lookup table is on the same database as the source table in your mapping and caching is not feasible, join the tables in the source database rather than using a Lookup transformation (see the sketch below)
  • Use the optimal database driver
  • Cache lookup tables
  • Optimize the lookup condition
  • Index the lookup table
  • Optimize multiple lookups
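
Two of these guidelines sketched in SQL, with hypothetical names: an index on the lookup-condition column, and an outer join that replaces an uncached lookup.

    -- Index the columns used in the lookup condition.
    CREATE INDEX idx_customers_id ON customers (customer_id);

    -- If caching is not feasible and both tables share a database,
    -- an outer join in the extraction query replaces the lookup.
    SELECT o.order_id,
           o.order_amount,
           c.customer_segment
    FROM   orders o
    LEFT JOIN customers c
           ON c.customer_id = o.customer_id;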



Sequence Generator Transformation

  • The Number of Cached Values property determines how many values the Integration Service caches at one time. Make sure the number of cached values is not too small; consider configuring it to a value greater than 1,000.

Source Qualifier Transformation

  • Use the Select Distinct option for the Source Qualifier transformation if you want the Integration Service to select unique values from a source. Select Distinct filters unnecessary data earlier in the data flow, which can improve performance (a sketch follows).
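
The generated query corresponds to a plain SELECT DISTINCT, shown here with hypothetical names:

    -- Let the source database remove duplicates before the data flow.
    SELECT DISTINCT customer_id,
                    customer_name
    FROM   customers;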

Identifying Target Bottlenecks

The most common performance bottleneck occurs when the Informatica Server writes to a target database. Target bottlenecks can be identified by configuring the session to write to a flat file target. If session performance increases significantly when writing to a flat file, there is a target bottleneck.
 

Identifying Source Bottlenecks

Performance bottlenecks can occur when the Informatica Server reads from a source database. If the session reads from a relational source, use the following methods:

  • Add a Filter transformation after the Source Qualifier and set the filter condition to false so that no data is processed past the Filter transformation. If the time it takes to run the new session remains about the same, there is a source bottleneck.
  • In a test mapping, remove all the transformations; if performance is similar, there is a source bottleneck.
  • Source bottlenecks can also be identified by executing the read query directly against the source database (see the sketch below). Measure the query execution time and the time it takes for the query to return the first row.
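
Running the read query by hand is straightforward in a SQL*Plus session; the query itself is hypothetical, and SET TIMING ON reports the elapsed time per statement:

    -- Time the session's read query directly against the source database.
    SET TIMING ON
    SELECT order_id, order_amount, order_date
    FROM   orders
    WHERE  order_date >= DATE '2020-01-01';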

Identifying Mapping Bottlenecks

  • Mapping bottlenecks can be identified by using a Filter transformation in the mapping. If there is no source bottleneck, add a Filter transformation before each target definition and set the filter condition to false so that no data is loaded into the target tables. If the time it takes to run the new session is about the same as the original session, there is a mapping bottleneck.

Identifying a Session Bottleneck

  • Use Collect Performance Details. Any value other than zero in the readfromdisk and writetodisk counters for Aggregator, Joiner, or Rank transformations indicates a session bottleneck. Low (0-20%) BufferInput_efficiency and BufferOutput_efficiency counter values also indicate a session bottleneck.

Identifying a System Bottleneck

  • If there is no source, target, mapping, or session bottleneck, there may be a system bottleneck.
  • Use system tools to monitor CPU usage, memory usage, and paging.
  • On Windows: Task Manager.
  • On UNIX: system tools like sar and iostat. For example, sar -u reports user CPU usage, idle time, and I/O wait time.



Query optimization

  • Understand how the database is executing the query (see the EXPLAIN PLAN sketch below)
  • Retrieve as little data as possible
  • Store intermediate results
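
For the first point, most databases can show the optimizer's plan. An Oracle sketch, reusing the hypothetical orders table (EXPLAIN PLAN and DBMS_XPLAN.DISPLAY are standard Oracle features):

    -- Capture and display the execution plan for a query.
    EXPLAIN PLAN FOR
      SELECT   customer_id, SUM(order_amount)
      FROM     orders
      GROUP BY customer_id;

    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);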

Below are several specific query optimization strategies; SQL sketches of a few of them follow the list.

  • Use Index
    Using an index is the first strategy one should use to speed up a query. In fact, this strategy is so important that index optimization warrants its own discussion.
  • Aggregate Table
    Pre-populate tables at higher levels of summarization so that less data needs to be parsed.
  • Vertical Partitioning
    Partition the table by columns. This strategy decreases the amount of data a SQL query needs to process.
  • Horizontal Partitioning
    Partition the table by data value, most often time. This strategy decreases the amount of data a SQL query needs to process.
  • Denormalization
    Denormalization combines multiple tables into a single table. This speeds up query performance because fewer table joins are needed.
  • Server Tuning
    Each server has its own parameters, and tuning server parameters so that the server can take full advantage of the hardware resources can significantly speed up query performance.
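
Sketches of three of these strategies in SQL, all with hypothetical names (the partitioning syntax shown is Oracle's):

    -- Use Index: index the columns your WHERE clauses filter on.
    CREATE INDEX idx_orders_date ON orders (order_date);

    -- Aggregate Table: pre-populate a summary table at a higher level.
    CREATE TABLE orders_daily_agg AS
    SELECT   order_date,
             SUM(order_amount) AS total_amount,
             COUNT(*)          AS order_count
    FROM     orders
    GROUP BY order_date;

    -- Horizontal Partitioning: split the table by time ranges.
    CREATE TABLE orders_part (
      order_id     NUMBER,
      order_date   DATE,
      order_amount NUMBER(12,2)
    )
    PARTITION BY RANGE (order_date) (
      PARTITION p2019 VALUES LESS THAN (DATE '2020-01-01'),
      PARTITION p2020 VALUES LESS THAN (DATE '2021-01-01')
    );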



ETL

  • Incremental load: Since the ETL is designed to load the DWH, and loading normally happens during non-business hours, check the ETL's incremental loading to evaluate performance (a sketch follows this list):
    • Stage load
    • Data load volume
    • Time check vs. data
  • Check General Logs
    • Reference Table Manager – Look into this log if you have issues with RTM operation
    • Preview – This is DTM log for preview operation
    • Model Repository Service – This logs all MRS operations
    • Profile – This is DTM logs for profile runs
    • Data Integration Service – This logs all DIS operations
    • Deployed Mapping – This is DTM log for running deployed mappings
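
A minimal incremental-extraction sketch, assuming a hypothetical control table that records when each load finished:

    -- Pull only the rows changed since the last successful load.
    SELECT order_id, order_amount, last_update_ts
    FROM   orders
    WHERE  last_update_ts > (SELECT MAX(load_end_ts)
                             FROM   etl_load_control);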

Target loading(Data warehouse loading)

  • When ETL sessions write into data warehouse tables, you can perform the following tasks to increase performance:
    • Increase Checkpoint Intervals
    • Use Bulk Loading
    • Minimize Deadlocks
    • Increase database network packet size
    • Optimize Target Database

Increase Checkpoint Intervals

Integration Service performance slows each time the service waits for the database to perform a checkpoint. To increase performance, consider increasing the database checkpoint interval. When you increase the checkpoint interval, you increase the likelihood that the database performs checkpoints only as necessary, when the size of the database log file reaches its limit.
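
How the interval is set depends on the database. As one hedged example, Oracle exposes a LOG_CHECKPOINT_INTERVAL parameter (measured in redo blocks); the value below is illustrative only, so confirm an appropriate setting with your DBA:

    -- Oracle sketch: raise the checkpoint interval so checkpoints
    -- occur less often during the load (value is illustrative).
    ALTER SYSTEM SET LOG_CHECKPOINT_INTERVAL = 100000 SCOPE = BOTH;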

Use Bulk Load

When bulk loading is enabled, the Integration Service bypasses the database log, which speeds performance. Without writing to the database log, however, the target database cannot perform rollback, so you may not be able to perform recovery. When you use bulk loading, weigh the importance of improved session performance against the ability to recover an incomplete session.
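
At the database level this corresponds to a direct-path, minimally logged load. An Oracle sketch with hypothetical table names (the APPEND hint requests a direct-path insert):

    -- Direct-path insert: bypasses the conventional, fully logged
    -- insert path, trading recoverability for load speed.
    ALTER TABLE fact_sales NOLOGGING;

    INSERT /*+ APPEND */ INTO fact_sales
    SELECT * FROM stg_sales;

    COMMIT;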

Minimize Deadlocks

To improve session performance, you can increase the number of target connection groups the Integration Service uses to write to the targets in a session. To use a different target connection group for each target in a session, use a different database connection name for each target instance. You can specify the same connection information for each connection name.

Increase Database Network Packet Size

  • Increase the network packet size to allow larger packets of data to cross the network at one time. How you increase it depends on the database you use:
  • Oracle: you can increase the database server network packet size in the listener.ora and tnsnames.ora files (a sketch follows).
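
In Oracle Net, the packet size is governed by the session data unit (SDU). A hedged tnsnames.ora sketch with hypothetical host and service names; the value should be matched on the server side in listener.ora:

    # tnsnames.ora sketch: raise the session data unit (SDU) size.
    DWH =
      (DESCRIPTION =
        (SDU = 32767)
        (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost)(PORT = 1521))
        (CONNECT_DATA = (SERVICE_NAME = dwh))
      )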

Optimize Target Database

  • If the target database is Oracle, you can optimize it by checking the storage clause, space allocation, and rollback or undo segments.
  • When you write to an Oracle database, check the storage clause for the database objects. The database should also store table and index data in separate tablespaces, preferably on different disks (a sketch follows).
  • When you write to an Oracle database, it uses rollback or undo segments during loads. Make sure you have appropriate tablespaces, and that the rollback or undo segments have appropriate storage clauses.
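
A minimal sketch of the tablespace separation, with hypothetical tablespace and table names:

    -- Keep table data and index data in separate tablespaces,
    -- ideally on different disks.
    CREATE TABLE fact_orders (
      order_id   NUMBER,
      order_date DATE,
      amount     NUMBER(12,2)
    ) TABLESPACE dwh_data;

    CREATE INDEX idx_fact_orders_date
      ON fact_orders (order_date)
      TABLESPACE dwh_index;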

Performance in terms of Data retrieval

System Limitations

  • Each system has finite capacity
    • volume of data can be measured easily
    • elapsed time is fairly easily captured
    • how fast can data be delivered?
    • how fast can data be written?
    • is the network a bottleneck?
  • Need baseline measures
    • taken under controlled conditions
    • standard data set
    • known amount of data
    • statistically significant amount of data
    • reproducible test(s)
  • Concerns on disk delivery
    • Local disk can deliver data at a finite rate
    • The rate depends mainly on rotational speed
    • 4,000 rpm
    • 5,400 rpm
    • 7,200 rpm
    • 10,000 rpm
    • Exacerbated by contention and fragmentation (especially on Windows NT)

Keep Jobs Simple

  • Don’t do unnecessary reference lookups, validations
  • Don’t process more rows than necessary
    • perform selection at extraction time
    • perform aggregation early in job
  • Don’t process more columns than necessary
    • don’t select (extract) columns you don’t need
    • just because it’s in the table definition doesn’t mean you have to use it
  • Parallel Processing
    • Job design can take advantage of parallel processing capabilities
    • Independent paths through job are run as separate processes
    • Independent jobs can run simultaneously

Sorting

  • Sorted data substantially improves the performance of the Aggregator stage
  • Prefer to sort at extraction time
    • ORDER BY clause in SELECT
    • perform UNIX sort in before-stage routine
  • Can use Sort stage
    • not as fast
    • memory hog

Limiting Processing

  • Select only needed rows from source
    • change data capture
    • selection criteria
    • extraction through views/stored procedures (see the sketch after this list)
  • Select/process only needed columns
  • Pre-sort data to be aggregated
  • Avoid unnecessary repeated execution of functions and traversal of data
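
A sketch of extraction through a view, with hypothetical names: the view applies the selection criteria and exposes only the needed columns, so the ETL job never reads the rest.

    -- The view limits both rows and columns before extraction.
    CREATE VIEW v_orders_recent AS
    SELECT order_id, customer_id, order_amount
    FROM   orders
    WHERE  order_date >= DATE '2020-01-01';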

 

Performance Tuning: Other Areas to Look Into

  • Optimized use of the tool
  • Better memory management
  • Better resource management
  • Reduction in processing time

Approach for Performance Tuning

  • Parallel Processing
  • Partitioning
  • Distributed Processing
  • Memory Management

Performance Tuning – Parallel Processing

  • Pipeline Parallelism
  • Parallel Load and Extract

 

  • When to Use
    • DSS Environment
    • Complex queries on very large tables
  • Supported Systems
    • SMP, clusters, MPPs
    • Sufficient I/O bandwidth
    • Underutilized CPU (up to 30%)
    • Sufficient memory to support extra processes (sorts, hashing, I/O buffers)

Performance Tuning – Partitioning

  • Data Partitioning
    • Range
    • Hash
    • Round-Robin
  • Continuous ETL
    • Large data volume
    • Windowing technique
  • Set Processing
    • Saves time for background processing
    • Processing many records at a time
    • Re-usable Caching

Performance Tuning – Distributed Processing

  • Hub-and-spoke architecture to n-tier architecture
    • Reduce network traffic
    • Load distribution
    • Load delegation
    • All types of sources
  • Distributed architecture

 


Performance Tuning in Various ETL Tools

 

Performance Tuning – AbInitio

  • Defining multiple nodes to achieve parallel processing.
  • Data is split into smaller partitions to achieve parallel processing.
  • Different partition and collection techniques to improve performance of the job.
  • Buffering techniques to improve performance.
  • Data stored as datasets helps jobs execute faster, because datasets are stored in native format. This also helps the restart capabilities of the job.
  • Built-in sorting and aggregation capabilities to achieve optimized performance.
  • Native calls to source and target avoid performance problems due to ODBC.
  • The monitor facility helps in identifying congestion points and in further tuning of the jobs.

Performance Tuning – AbInitio

  • Partitioning techniques
  • Memory management
  • Optimized usage of components
  • Reducing Phases
  • Using lookups which are in-memory Hash tables
  • Loading data in direct/parallel mode using native database load utilities.

Performance Tuning – SAS

  • Parallel processing for sorting and aggregations through SAS Enterprise ETL Server (released with SAS 9).
  • SAS Intelligence Storage (released with SAS 9) offers secure, high-performance data storage for large, multi-user communities using multiple business intelligence and analytic applications. It provides fast access to small subsets (summarized datasets) of very large tables.
  • The SAS SPDS server (based on RAID technology) offers performance-improvement techniques like partitioning and parallel processing, indexing, and SQL planner optimizations.
  • Base SAS provides system options like MEMSIZE, SORTSIZE, BUFSIZE, BUFNO, CATCACHE, and COMPRESS that can be used to monitor and optimize job performance.


