In today's era of advanced analytics, data from relational databases alone is not enough to gain proper insight into a business. There is a growing need to integrate data from various sources and analyze it together to produce more meaningful, innovative ideas and better decisions.
Below are the steps to enrich Oracle Database data with data that lives in a distributed file system (e.g. HDFS) and is accessed through Hadoop jobs, for integrated reporting:
1. A table function launches a DBMS_SCHEDULER job that manages the Hadoop MapReduce job
2. Hadoop's results are redirected into the Oracle query for blending
3. A bash script is launched asynchronously by DBMS_SCHEDULER, which in turn registers the Hadoop job
4. The Oracle table function and the Hadoop mappers collaborate using Oracle's Advanced Queuing (AQ) feature
• A common queue is used: the Hadoop mappers enqueue data and the Oracle table function dequeues it
• Oracle Advanced Queuing supplies the data to the Oracle database for consumption by various BI tools
5. One slave node at a time submits the external job
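The Oracle-side setup behind these steps can be sketched in PL/SQL. The queue name, queue table, payload type, and script path below are assumptions for illustration, not an exact configuration:

```sql
-- Sketch: create the shared queue the Hadoop mappers will write to.
-- 'hadoop_q', 'hadoop_q_tab', and the JMS text payload are illustrative choices.
BEGIN
  DBMS_AQADM.CREATE_QUEUE_TABLE(
    queue_table        => 'hadoop_q_tab',
    queue_payload_type => 'SYS.AQ$_JMS_TEXT_MESSAGE');
  DBMS_AQADM.CREATE_QUEUE(
    queue_name  => 'hadoop_q',
    queue_table => 'hadoop_q_tab');
  DBMS_AQADM.START_QUEUE(queue_name => 'hadoop_q');
END;
/

-- Asynchronous scheduler job that runs the bash launcher script,
-- which in turn registers and submits the Hadoop MapReduce job.
BEGIN
  DBMS_SCHEDULER.CREATE_JOB(
    job_name   => 'launch_hadoop_mapper',
    job_type   => 'EXECUTABLE',
    job_action => '/home/oracle/scripts/launch_mapper.sh',  -- assumed path
    enabled    => TRUE);
END;
/
```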
- Create a table function that acts as the query coordinator (QC) and also takes a processing role
- Invoking the table function makes DBMS_SCHEDULER start an asynchronous job that runs a bash script; the script in turn starts a synchronous launcher in the Hadoop cluster for the mapper processes
- The mapper processes transform the data and write it into a queue. To improve performance, rather than each mapper writing rows one by one, we can choose to write in batch mode
- In the next step, the dequeue process, Oracle Database supports parallel invocation of the table function to fetch data from the Advanced Queue. The table function leverages both the Oracle data and the data from the queue, integrating them into a single dataset for the user
- Once the Hadoop mapper processes are started, a job monitor process keeps an eye on the launcher bash script. The bash script finishes execution once the mapper processes have finished
- The job monitor also keeps track of the remaining data elements in the queue while the table function keeps processing the data
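The dequeue side described above might look like the following sketch. The object types, queue name, and cursor shape are assumptions; the function is declared PIPELINED and PARALLEL_ENABLE so Oracle can run multiple copies that each drain the queue alongside the Oracle-side data:

```sql
CREATE OR REPLACE TYPE row_t AS OBJECT (id NUMBER, payload VARCHAR2(4000));
/
CREATE OR REPLACE TYPE row_tab_t AS TABLE OF row_t;
/

CREATE OR REPLACE FUNCTION merge_hadoop_results(src SYS_REFCURSOR)
  RETURN row_tab_t PIPELINED
  PARALLEL_ENABLE (PARTITION src BY ANY) IS
  dq_opts DBMS_AQ.DEQUEUE_OPTIONS_T;
  props   DBMS_AQ.MESSAGE_PROPERTIES_T;
  msg     SYS.AQ$_JMS_TEXT_MESSAGE;
  msgid   RAW(16);
  l_id    NUMBER;
  l_val   VARCHAR2(4000);
BEGIN
  -- 1) Pipe the rows coming from the Oracle-side cursor.
  LOOP
    FETCH src INTO l_id, l_val;
    EXIT WHEN src%NOTFOUND;
    PIPE ROW (row_t(l_id, l_val));
  END LOOP;
  -- 2) Drain Hadoop's results from the queue. A non-transactional
  -- dequeue lets the function be called from a plain SELECT.
  dq_opts.visibility := DBMS_AQ.IMMEDIATE;
  dq_opts.wait       := 5;  -- seconds to wait for more messages (assumed)
  LOOP
    BEGIN
      DBMS_AQ.DEQUEUE('hadoop_q', dq_opts, props, msg, msgid);
      PIPE ROW (row_t(NULL, msg.text_vc));
    EXCEPTION
      WHEN OTHERS THEN EXIT;  -- timeout: queue is drained, stop
    END;
  END LOOP;
  RETURN;
END;
/
```

A query such as `SELECT * FROM TABLE(merge_hadoop_results(CURSOR(SELECT id, val FROM oracle_src)))` would then return the blended dataset.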
Pros of Using Oracle Table Function and Advanced Queuing Facility:
- Oracle Advanced Queuing provides load balancing by enabling parallel execution of the table function while the Hadoop mapper jobs also run in parallel, with a different degree of parallelism and without any intervention from Oracle's query coordinator
- This method eliminates the need for external data storage when accessing Hadoop's results
- Data from both the Oracle table and the Oracle queue (Hadoop's results) are used by the table function to produce a combined result for the user
Cons of Using Oracle Table Function and Advanced Queuing Facility:
- Table functions and Oracle Advanced Queuing should be used only when the timing of the data to be processed needs to be synchronized
Recommendation:
- For better queue performance, data can be inserted in batch mode from the individual mapper jobs
- The delivery mechanism can be changed to a pipeline or a relational table for optimized performance
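The batch-mode recommendation maps to AQ's array interface. In the actual setup the mappers would enqueue through AQ's Java/JMS API; this PL/SQL sketch shows the same idea, with the queue name and batch size as assumptions:

```sql
-- Sketch: enqueue a batch of messages in one call instead of one at a time.
DECLARE
  TYPE payload_arr_t IS TABLE OF SYS.AQ$_JMS_TEXT_MESSAGE;
  eq_opts  DBMS_AQ.ENQUEUE_OPTIONS_T;
  props    DBMS_AQ.MESSAGE_PROPERTIES_ARRAY_T :=
             DBMS_AQ.MESSAGE_PROPERTIES_ARRAY_T();
  payloads payload_arr_t := payload_arr_t();
  msgids   DBMS_AQ.MSGID_ARRAY_T;
  sent     PLS_INTEGER;
BEGIN
  FOR i IN 1 .. 100 LOOP  -- batch of 100 rows (size is an assumption)
    payloads.EXTEND;
    payloads(i) := SYS.AQ$_JMS_TEXT_MESSAGE.construct;
    payloads(i).set_text('mapper output row ' || i);
    props.EXTEND;
    props(i) := DBMS_AQ.MESSAGE_PROPERTIES_T();
  END LOOP;
  sent := DBMS_AQ.ENQUEUE_ARRAY(
            queue_name               => 'hadoop_q',
            enqueue_options          => eq_opts,
            array_size               => payloads.COUNT,
            message_properties_array => props,
            payload_array            => payloads,
            msgid_array              => msgids);
  COMMIT;
END;
/
```

One array enqueue replaces a hundred single enqueues, which is the main source of the performance gain the recommendation refers to.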