Big Data & ETL's Evolution
The need for Extract, Transform, Load (ETL) tools is ever-present as long as data is being consumed. ETL tools have traditionally been used to batch-process and transform data into the format required by the data warehouse. Transformations have grown more complex due to the enormous growth in unstructured data.
At a high level, the Hadoop Big Data ecosystem consists of:
- Structured data : Highly organized; typically stored in table structures.
- Unstructured data : Not stored in any organized form. E.g. data from social media, smartphones, sensors, images, emails, etc.
- Hadoop / HDFS (Hadoop Distributed File System) : Framework for the storage and processing of extremely large data sets; HDFS breaks data into blocks and stores them across the participating node servers.
- MapReduce : Software framework for processing vast data sets in parallel across multiple cluster nodes, in two phases: Map tasks, which filter and sort the input, and Reduce tasks, which aggregate the intermediate results.
- Spark : Data analytics engine that operates on distributed data sources such as HDFS.
- Pig & Hive : Both reduce the complexity of writing MapReduce programs by offering higher-level scripting (Pig Latin) and SQL-like (HiveQL) languages.
- Sqoop : Migrates data between Hadoop and relational databases.
(Note: some of the above components are optional.)
Fig 1. Hadoop Eco System
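The Map and Reduce phases described above can be sketched in pure Python. This is an illustrative simulation, not Hadoop code: the function names are invented, and a real job would run the same logic distributed across cluster nodes (e.g. via Hadoop Streaming).

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map task: emit a (word, 1) pair for every word in the input split."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group intermediate pairs by key, as the framework would."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [count for _, count in group]

def reduce_phase(grouped):
    """Reduce task: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped}

# Each input line stands in for a split processed by a separate Map task.
lines = ["big data needs big tools", "etl tools transform data"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'big': 2, 'data': 2, 'etl': 1, 'needs': 1, 'tools': 2, 'transform': 1}
```

In a real cluster the shuffle step is what moves data over the network between the Map and Reduce nodes; here it is a local sort-and-group.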
Given the growth and significance of unstructured data, there has been an increasing need for the major ETL players to provide solution options for transforming unstructured data for use in analytics. Most ETL tools in the market are marching steadily down that path. Here are some of the ETL tool offerings with respect to Big Data.
Oracle - ODI:
Oracle's Big Data approach is to let clients incorporate Big Data into their current data architecture, delivering more value to the business and to prospective analytical reporting while supporting other Big Data needs. ODI is a key tool for Oracle in this pursuit. The new Big Data wizard in ODI 12.2.1.1.0 supports many new Hadoop technologies.
Fig 2. Oracle Data Integrator
ODI's ELT approach does not require a middle-tier engine to support Big Data components, whereas typical ETL tools require intermediate servers to convert mappings into programming languages such as C++ for execution. ODI instead leverages its hallmark feature of pushing processing down to the underlying database, and its ability to generate native code yields tremendous processing efficiency.
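The essence of ELT pushdown is that the transformation is expressed as set-based SQL and executed inside the target engine, so no rows pass through a middle-tier server. A minimal sketch of the idea, using Python's built-in sqlite3 as a stand-in for the target database (table and column names are invented for illustration):

```python
import sqlite3

# In-memory SQLite database standing in for the target engine (a warehouse or
# Hadoop SQL layer in the ODI case).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (region TEXT, amount REAL);
    INSERT INTO src_orders VALUES ('east', 10.0), ('east', 5.0), ('west', 7.5);
    CREATE TABLE tgt_sales (region TEXT, total REAL);
""")

# ELT style: the whole transformation is one SQL statement pushed down to the
# engine that already holds the data; nothing is streamed out for processing.
pushdown_sql = """
    INSERT INTO tgt_sales (region, total)
    SELECT region, SUM(amount) FROM src_orders GROUP BY region
"""
conn.execute(pushdown_sql)
conn.commit()

print(dict(conn.execute("SELECT region, total FROM tgt_sales")))
# {'east': 15.0, 'west': 7.5}
```

A tool like ODI generates statements of this shape in the dialect of the underlying engine, rather than executing the aggregation itself.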
Cognos:
IBM has introduced a new suite, BigInsights, for Big Data and analytical reporting. Big SQL enables Cognos to configure Hadoop as a data source: it can access Hive, HBase and Spark concurrently through a single database connection.
Business analysts and executives get visually enhanced Big Data reports from the Cognos presentation service, a valuable aid for understanding Big Data. With BigInsights and Big SQL, IBM provides tools that enable Hadoop operations, including the ability to exchange components with the existing infrastructure and functionality of Cognos.
DataStage:
IBM's DataStage platform has engineered easy integration of heterogeneous data, including Big Data at rest (data that is stored and then analyzed, e.g. conventional data warehousing) and Big Data in motion (dynamic data in real-time or operational-intelligence architectures, e.g. trading, fraud detection, etc.).
Newer versions of DataStage include components such as Big Data File stages to read and write files in HDFS, Hive stages, and stages that automatically generate MapReduce programs.
Talend Studio for Data Integration:
The Talend Data Fabric solution delivers high-scale, in-memory fast data processing. It leverages Hadoop's parallel environment to generate native Spark and MapReduce code.
Since Talend Open Studio is an open source solution, it can be downloaded at no cost, but support is provided only for the subscription products, which add functionality such as a shared repository, versioning and dashboards.
PowerCenter Informatica:
Informatica Corp. launched Informatica Big Data Edition (BDE), which can be used for ETL in a Hadoop environment alongside an RDBMS. BDE is available in versions 9.6 and later.
BDE runs in two modes: Native mode for normal PowerCenter ETL, and Hive mode to additionally support Big Data. Mappings pushed to Hive are executed in the Hadoop cluster using Hadoop's parallelism (via its MapReduce capability).
SQL Server Integration Services (SSIS):
Microsoft's Visual Studio 2015 tooling contains new SQL Server Integration Services (SSIS) tasks. These provide ETL options on Apache Hadoop: Sqoop for data import/export, Hive for SQL queries, the MapReduce distributed programming infrastructure, and ODBC drivers to connect to your data in HDFS from tools like Excel and SQL Server.
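The Sqoop import/export piece mentioned above is driven entirely by command-line arguments. The sketch below only assembles such a command as a string; the JDBC URL, table and HDFS directory are placeholder values, and no live cluster is invoked.

```python
import shlex

def sqoop_import_cmd(jdbc_url, table, target_dir, num_mappers=4):
    """Assemble a 'sqoop import' command that copies an RDBMS table into HDFS.

    Sqoop splits the table so that each of the num_mappers map tasks imports
    a slice of the rows in parallel.
    """
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,       # JDBC URL of the source database
        "--table", table,            # source table to import
        "--target-dir", target_dir,  # HDFS directory for the output files
        "--num-mappers", str(num_mappers),
    ]
    return " ".join(shlex.quote(a) for a in args)

# Placeholder connection details, for illustration only.
cmd = sqoop_import_cmd("jdbc:mysql://db.example.com/sales", "orders", "/user/etl/orders")
print(cmd)
```

The same flags work whether Sqoop is launched from a shell, an Oozie workflow, or an SSIS Hadoop task.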
JaspersoftETL:
Jaspersoft amended its OEM agreement with Talend to use native connectors to Apache Hadoop Big Data environments in Jaspersoft ETL. The integration of Talend into the Jaspersoft BI Suite also supports all Big Data use cases.
Talend supports major Big Data platforms including Amazon EMR, Apache Hadoop (HBase, HDFS, and Hive), Cassandra, Cloudera, etc. For robust performance and reliability, the Big Data Edition adds high-availability and load-balancing features for critical reporting and analysis requirements.
List of ETL Big Data Solutions Vendor-wise:

| Tool | Big Data | Big Data in Cloud |
|------|----------|-------------------|
| ODI | ODI for Big Data | Oracle Data Integrator Cloud Service |
| Cognos | BigInsights Suite | IBM BigInsights on Cloud |
| DataStage | Native Big Data file stages | IBM Bluemix - IBM InfoSphere DataStage on Cloud |
| Informatica | Informatica Big Data Edition (BDE) | Informatica Big Data Edition (BDE) |
| SSIS | SQL Server Data Tools for Visual Studio 2015 | Azure Data Factory |
| Talend | Talend Big Data Integration platform | Talend Integration Cloud |
| Jaspersoft | Talend native connectors | Amazon Redshift |
- Xavier Philip