This blog covers the architecture of Hadoop and the advantages and disadvantages of Hadoop.
------------------------------------------------------------------------------------------------------------------------------------------
Let's first understand: what is Hadoop?
Hadoop is an open-source framework for storing and processing large data sets across clusters of computers, which may be located in different geographical locations.
Now let's understand: why Hadoop?
The problem with traditional database management systems is that they can process only structured data and can handle only small amounts of data (gigabytes). Hadoop can handle structured, unstructured, and semi-structured data, and it can process large amounts of data at high speed through parallel processing.
The architecture of Hadoop has two main components:
1. Hadoop Distributed File System (HDFS) - for storing data
2. MapReduce - for processing data
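To get a feel for the MapReduce side, below is the classic word-count job written against the Hadoop Java MapReduce API. It is a minimal sketch: the class names are our own, and the input and output HDFS paths are supplied as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Paths are placeholders; pass real HDFS paths as arguments
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You would package this into a jar and run it with something like `hadoop jar wordcount.jar WordCount /input /output` (the jar name and paths are placeholders).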
The Name Node is the master node, which performs tasks like memory management and process management. It is the single point of failure in a Hadoop cluster. The Secondary Name Node takes a backup of the Name Node's namespace and periodically merges the edits log into the FSImage file. Data Nodes are the slave nodes, which store the data blocks and perform the computations.
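Every namespace operation a client performs is served by the Name Node from its in-memory metadata. As a small illustration, the directory listing below is answered entirely by the Name Node and never touches a Data Node; the path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDirectory {
    public static void main(String[] args) throws Exception {
        // The client asks the Name Node for namespace metadata;
        // no Data Node is contacted for a pure metadata operation
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/user/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
    }
}
```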
When a client submits a job to the Name Node, it divides the files into chunks and distributes the chunks to Data Nodes for processing. Each chunk is replicated 3 times and stored on three different Data Nodes. If one node goes down, the Name Node identifies a Data Node which holds a replica of the chunk and continues execution there. This process makes Hadoop a fault-tolerant system.
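The replication factor can be inspected (or overridden for new files) through the same Java client. A minimal sketch, assuming a reachable cluster; the file path is a placeholder, and 3 is the usual default for dfs.replication:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // default is already 3 on most clusters
        FileSystem fs = FileSystem.get(conf);
        // The Name Node tracks which Data Nodes hold each replica;
        // getReplication() reports the factor recorded for this file
        short rep = fs.getFileStatus(new Path("/user/data/input.txt"))
                      .getReplication();
        System.out.println("Replication factor: " + rep);
    }
}
```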
Now let's discuss the Limitations of Hadoop
1. Handling small files: If you want to process a large number of small files, the Name Node needs to keep the HDFS metadata of each file in memory. This becomes an overhead for the Name Node, which is why Hadoop is not recommended for handling large numbers of small files; the rough estimate below shows how quickly this adds up.
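A back-of-the-envelope estimate, using the commonly cited rule of thumb of roughly 150 bytes of Name Node heap per namespace object (the exact figure varies by Hadoop version):

```java
public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        // Rule of thumb: ~150 bytes of Name Node heap per namespace
        // object (file, directory, or block); figure is approximate
        final long BYTES_PER_OBJECT = 150;
        long files = 10_000_000;   // ten million small files
        long objectsPerFile = 2;   // one file entry plus one block each
        long heapBytes = files * objectsPerFile * BYTES_PER_OBJECT;
        System.out.printf("Estimated Name Node heap: %.1f GB%n",
                heapBytes / 1e9);  // ~3.0 GB for metadata alone
    }
}
```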
2. Processing speed: To process large datasets, MapReduce follows the map-and-reduce mechanism. During this process, the intermediate results of the map phase are written to disk and the output of each job is written back to HDFS, which increases the number of I/O operations. Thus, the processing speed decreases; the sketch below shows where the extra I/O comes from in a multi-stage pipeline.
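To make the extra I/O concrete, here is a hedged sketch of a two-stage pipeline that reuses the WordCount classes from earlier (the stages are stand-ins for any two dependent jobs, and the paths are placeholders). Plain MapReduce has to materialize the intermediate result in HDFS between the two jobs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStagePipeline {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path("/data/input");          // placeholder
    Path intermediate = new Path("/data/stage1");  // placeholder
    Path output = new Path("/data/output");        // placeholder

    // Stage 1: reuses the WordCount classes shown earlier
    Job stage1 = Job.getInstance(conf, "stage 1");
    stage1.setJarByClass(TwoStagePipeline.class);
    stage1.setMapperClass(WordCount.TokenizerMapper.class);
    stage1.setReducerClass(WordCount.IntSumReducer.class);
    stage1.setOutputKeyClass(Text.class);
    stage1.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(stage1, input);
    FileOutputFormat.setOutputPath(stage1, intermediate); // written to HDFS
    if (!stage1.waitForCompletion(true)) System.exit(1);

    // Stage 2: must re-read the intermediate result from HDFS,
    // which is the extra disk and network I/O this limitation refers to
    Job stage2 = Job.getInstance(conf, "stage 2");
    stage2.setJarByClass(TwoStagePipeline.class);
    stage2.setMapperClass(WordCount.TokenizerMapper.class);
    stage2.setReducerClass(WordCount.IntSumReducer.class);
    stage2.setOutputKeyClass(Text.class);
    stage2.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(stage2, intermediate);   // re-read from HDFS
    FileOutputFormat.setOutputPath(stage2, output);
    System.exit(stage2.waitForCompletion(true) ? 0 : 1);
  }
}
```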
3. Not able to handle real-time stream data: Hadoop can process large batches of files very efficiently, but when it comes to real-time stream processing, Hadoop fails to handle the data as it arrives.
4. Not easy to code: Developers need to write code for each operation they want to perform on the data (even the simple word count shown earlier takes dozens of lines of Java), which makes it very difficult for them to work with.
5. Security: Hadoop does not provide proper authentication for accessing the cluster, and it does not record any information about who has accessed the cluster or what data a user has viewed. Security is the biggest drawback when it comes to Hadoop.
6. Easy to hack: Since Hadoop is written in Java, one of the most widely known languages, it is easier for cyber criminals to find ways to exploit the system.
7. Caching: There is no caching mechanism in Hadoop for keeping intermediate results in memory for further use. As a result, performance is diminished.
8. Lines of code: Hadoop has around 120,000 lines of code; the larger the codebase, the harder it is to debug and the longer it takes to execute.
9. Unpredictability: In Hadoop, we can't guarantee the time of completion of a job.