INTRODUCTION
There are many kinds of databases today: SQL databases, NoSQL databases, open source databases, columnar databases, MPP databases and more. Oracle, a leader in the relational database space, is often compared with them. So let us look at some basic differences between Oracle, MPP databases (Teradata, Vertica, Greenplum, Redshift, Netezza etc.) and Hadoop for various analytics workflows.
ARCHITECTURE
Oracle, Teradata, Vertica, Greenplum, PostgreSQL, Redshift and Netezza are all relational databases. However, Teradata, Vertica, Greenplum, Redshift and Netezza are massively parallel processing (MPP) databases, which have parallelism built into each component of their architecture. They have a shared-nothing architecture and no single point of failure. Oracle Database, on the other hand, has a shared-everything architecture. Even Exadata (Oracle's engineered system, or database appliance, aimed at analytics or OLAP) is based on the existing Oracle engine, which means any machine can access any data. This is fundamentally different from Teradata, as shown in the diagram below. MPP databases are thus able to break a query into a number of database operations that are performed in parallel, which improves query performance.
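The way a shared-nothing system splits one query into parallel operations can be sketched in a few lines of Python. This is only an illustration, not how any of the products above is implemented: the node count, the function names and the sample sales rows are all hypothetical, and Python threads are used here to show the structure of the work rather than to gain real speed.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4  # hypothetical number of worker nodes


def partition_rows(rows, num_nodes):
    """Assign each row to exactly one node by hashing its distribution key."""
    nodes = [[] for _ in range(num_nodes)]
    for region, amount in rows:
        node_id = zlib.crc32(region.encode("utf-8")) % num_nodes
        nodes[node_id].append((region, amount))
    return nodes


def local_sum(node_rows):
    """Work done by one node: aggregate only the rows that node owns."""
    totals = defaultdict(int)
    for region, amount in node_rows:
        totals[region] += amount
    return dict(totals)


def parallel_sum_by_region(rows, num_nodes=NUM_NODES):
    """Coordinator: fan the query out to every node, then merge the partial results."""
    nodes = partition_rows(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_sum, nodes))
    merged = defaultdict(int)
    for partial in partials:
        for region, total in partial.items():
            merged[region] += total
    return dict(merged)


# Equivalent in spirit to: SELECT region, SUM(amount) FROM sales GROUP BY region
sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
result = parallel_sum_by_region(sales)
print(result)
```

Because each node touches only its own partition, adding nodes spreads both the data and the work. In a shared-everything design every worker can instead reach all of the data.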
USER INTERFACE
Oracle and most MPP databases expose a SQL interface, while Hadoop workloads are written as MapReduce programs (typically in Java) or as Spark jobs, which run on the JVM and can be written in Scala, Java or Python. The Apache Hive project, however, aims to provide a SQL-like interface on top of MapReduce programs.
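To make the difference between the two interfaces concrete, here is a small sketch that answers the same question twice: once declaratively in SQL and once in the map, shuffle and reduce style. It uses Python's built-in sqlite3 module purely as a stand-in for a SQL database, and plain Python functions as a stand-in for a MapReduce job. The table name, the function names and the sample events are assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict

events = ["click", "view", "click", "view", "view"]  # hypothetical sample data

# SQL style: declare the result you want and let the engine plan the work.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_type TEXT)")
conn.executemany("INSERT INTO events VALUES (?)", [(e,) for e in events])
sql_counts = dict(
    conn.execute(
        "SELECT event_type, COUNT(*) FROM events GROUP BY event_type"
    ).fetchall()
)
conn.close()


# MapReduce style: the programmer writes each phase explicitly.
def map_phase(records):
    """Emit a (key, 1) pair for every input record."""
    for event_type in records:
        yield event_type, 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


mr_counts = reduce_phase(shuffle_phase(map_phase(events)))
print(sql_counts, mr_counts)
```

Both approaches produce the same counts. The SQL version is a single statement, which is the gap a SQL layer over MapReduce is meant to close.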
INFRASTRUCTURE
Another difference between these systems is that most MPP databases such as Teradata, as well as Oracle Exadata, run on proprietary hardware or appliances, while Hadoop runs on commodity hardware.
SCALABILITY
Oracle Exadata and most MPP databases scale vertically on proprietary hardware, while Hadoop scales horizontally on commodity machines, which results in a very cost-effective model, especially for large data storage.
STORAGE
Many MPP databases (for example Vertica and Redshift) use columnar storage, while Oracle uses row-wise storage, which for analytic queries is less efficient than columnar storage in both disk space usage and performance. However, Oracle Exadata offers Hybrid Columnar Compression (HCC), which groups a set of rows into a compression unit and stores their values column by column within it. Compression is achieved by storing repeating values only once. Analytic performance on Oracle Exadata is therefore considerably better than on a row-wise Oracle database. Hadoop, on the other hand, stores data in HDFS, a distributed file system.
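The idea of storing repeating values only once is easiest to see on a single column. The sketch below pivots a few rows into columns and applies simple dictionary encoding to one of them. It is a generic illustration of why column-wise layouts compress well, not Oracle's actual HCC algorithm, and the sample rows and function names are hypothetical.

```python
# Hypothetical rows: (order_id, region, status)
rows = [
    (1, "east", "shipped"),
    (2, "east", "shipped"),
    (3, "west", "shipped"),
    (4, "east", "pending"),
]


def to_columns(rows):
    """Pivot row-wise tuples into one list per column."""
    return [list(column) for column in zip(*rows)]


def dictionary_encode(column):
    """Store each distinct value once and replace the column with small integer codes."""
    dictionary = []
    index_of = {}
    codes = []
    for value in column:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]


order_ids, regions, statuses = to_columns(rows)
region_dict, region_codes = dictionary_encode(regions)
print(region_dict, region_codes)
```

In a row-wise layout the repeated region values are interleaved with other fields, so runs like this are harder to exploit. Grouping a column's values together is what makes the repetition visible to the compressor.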
USE CASES
Oracle is often the database of choice for analytics where Oracle ERP systems are deployed. Oracle Exadata can meet OLAP workflow and DSS requirements and has many Advanced Analytics options. More details can be found in Oracle's Machine Learning and Advanced Analytics 12.2c and Oracle Data Miner 4.2 New Features.
Teradata is the database of choice for pure OLAP workflows, thanks to its massively parallel processing capabilities, especially when data volumes are high. Teradata is also the preferred choice for low-latency analytics where an RDBMS is still required. However, it is losing market share, as migrating off Teradata is a priority for most cost-conscious CEOs because of its prohibitive year-on-year expense. Another reason for migrating off Teradata is the adoption of new-generation data analytics architectures that support unstructured data.
The above sets the stage for Hadoop, with its support for big data, whether structured or unstructured. It provides a platform for data streaming and analytics over large amounts of data coming from IoT sensors, social platforms, weather feeds or spatial sources. It is based on open source technologies and uses commodity hardware, which is another attraction for many companies moving from a data warehouse to a data lake ecosystem.
CONCLUSION
Thus, it is important to consider the use case a database is designed to serve before deciding on the best fit for your big data ecosystem. Making a decision solely on the amount of data (petabytes or terabytes) to be stored might not lead to the right choice. Other factors that can influence the decision include your overall IT landscape or preferred infrastructure platform, developer skills, cost and future requirements, all of which are specific to each organization. So although new-age databases are opening up new opportunities for data storage and usage, the traditional RDBMS will most likely not go away in the near future.
REFERENCES
https://docs.oracle.com/cd/E11882_01/server.112/e17157/architectures.htm#HAOVW215
https://downloads.teradata.com/blog/carrie/2015/08/teradata-basics-parallelism