In today's world, businesses need to make decisions instantly, based on the data provided by business analysts, to stay on top. Business analysts must now process and analyze all types of data (structured, semi-structured, and unstructured) in a short span of time, which is not possible with traditional data warehouse concepts alone; to achieve this, we need to move to big data. Once we have decided to move to big data, which architecture is best to implement?
The two widely used big data processing architectures are:
Lambda architecture
Kappa architecture
Lambda Architecture:
The Lambda architecture ensures that both batch and real-time data are taken into consideration for analysis. Under this architecture, the full history of the data is maintained for any future analysis. The Lambda architecture is designed around the following properties:
Latency time
Throughput
Fault tolerance
Latency time:
Latency is the time between the generation of a piece of data and its availability in the reporting layer for analytics.
Throughput:
The large volume of data is broken into small blocks that are processed in parallel, increasing the throughput.
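As a rough local sketch of this idea (all names here are hypothetical, and a real batch system such as MapReduce distributes the blocks across cluster nodes rather than local threads), splitting a dataset into blocks and processing them in parallel might look like:

```python
from concurrent.futures import ThreadPoolExecutor

def process_block(block):
    # Hypothetical per-block work: here, just count the records.
    return len(block)

def process_in_blocks(records, block_size=1000):
    # Split the full dataset into fixed-size blocks and hand the
    # blocks to a pool of workers so they are processed in parallel.
    blocks = [records[i:i + block_size]
              for i in range(0, len(records), block_size)]
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(process_block, blocks))
    # Combine the partial results into the final answer.
    return sum(results)
```

Because each block is independent, adding more workers (or more nodes, in a cluster) raises the overall throughput without changing the per-record logic.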
Fault tolerance:
Fault tolerance is the ability of the system to continue processing even when some component fails.
Architecture diagram:
The three major components of the Lambda architecture are:
Batch layer
Speed layer
Serving layer
Batch layer:
In the batch layer, the complete dataset is stored in HDFS as an immutable store, which means data can only be appended, never updated or deleted. Versioning of the data under this append-only logic is achieved using timestamps. From this master dataset, MapReduce jobs precompute batch views as per the business requirement; ad-hoc querying is also possible on the batch views. The batch views are precomputed so that queries can be answered with low latency.
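A toy sketch of the append-only, timestamp-versioned storage idea follows (the class and method names are illustrative, not an HDFS API): nothing is ever overwritten, and the "current" value of a record is simply its newest version.

```python
import time

class AppendOnlyStore:
    """Minimal sketch of an immutable, append-only dataset: records are
    never updated or deleted; every write appends a new timestamped
    version, so the full history remains available for reprocessing."""

    def __init__(self):
        self._log = []

    def append(self, key, value, ts=None):
        # Append a new version of the record; older versions stay in the log.
        self._log.append((ts if ts is not None else time.time(), key, value))

    def latest(self, key):
        # The newest timestamp for a key wins, emulating an "update"
        # without ever mutating the underlying data.
        versions = [(ts, v) for ts, k, v in self._log if k == key]
        return max(versions)[1] if versions else None
```

An "update" is therefore just another append, and a bad code deploy can always be corrected later by recomputing from the retained history.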
Apache Hadoop is used in the batch layer processing.
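The precompute step can be sketched as an in-process MapReduce, assuming hypothetical click-stream records with an `event` field; a real Hadoop job would run the map and reduce phases across a cluster rather than in one process:

```python
from collections import defaultdict

def map_phase(record):
    # Hypothetical mapper: emit (event_type, 1) for each raw event.
    yield record["event"], 1

def reduce_phase(key, values):
    # Reducer: sum the counts emitted for one key.
    return key, sum(values)

def build_batch_view(records):
    # Shuffle: group the mapper output by key, then reduce each group
    # to precompute the batch view queried by the serving layer.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())
```

Because the view is precomputed over the whole master dataset, serving-layer queries read a small aggregate instead of scanning all history.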
Speed layer:
The records that arrive during the latency window of batch processing are captured and stored in the speed layer dataset. The speed layer dataset is delete-and-create: once a batch layer run completes, the speed layer dataset is deleted and recreated with only the newest data. From the speed layer dataset, real-time views are created using incremental algorithms.
Apache Storm or SQLstream can be used for speed layer processing.
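A minimal sketch of the speed layer's incremental update and its delete-and-create lifecycle, using a plain dictionary as a stand-in for the real-time view store (both function names are hypothetical):

```python
def update_realtime_view(view, record):
    # Incrementally fold one new record into the real-time view;
    # no history is reprocessed, which keeps latency low.
    view[record["event"]] = view.get(record["event"], 0) + 1
    return view

def reset_realtime_view():
    # Delete-and-create: once a batch run has absorbed the records
    # covered by this view, the view is dropped and rebuilt empty.
    return {}
```

The real-time view only ever covers the records the batch layer has not yet processed, which is why it can stay small and cheap to update.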
Serving layer:
The serving layer is where users query for output. Once a query is fired, the batch view and real-time view outputs are combined, and a near real-time result is returned. Druid can be used in the serving layer to handle both batch and speed layer views.
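The merge the serving layer performs at query time can be sketched as follows, assuming both views are simple key-to-count maps (a simplification of what a store such as Druid would hold):

```python
def serve_query(batch_view, realtime_view):
    # Combine the precomputed batch view with the incremental
    # real-time view to answer a query with near real-time results.
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged
```

Because the two views cover disjoint time ranges (history vs. the latest latency window), adding the counts gives a complete, up-to-date answer.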
Kappa Architecture:
Kappa architecture is essentially a Lambda architecture with the batch layer removed. It is designed so that the speed layer alone is capable of handling both real-time and historical (batch) data.
Architecture diagram:
Apache Kafka's replayable, log-based streaming makes it a natural fit for the Kappa architecture. In Kappa architecture there is only one codebase to maintain for stream processing. A full reprocessing job must be run at the first deployment as well as after any code change: the complete retained log is replayed through the stream-processing code. Once reprocessing of the full data is done and the result is presented in the serving layer, the streaming job resumes, fetching the latest records and presenting them in the serving layer. The streaming job can be built using Apache Storm, Apache Samza, or Spark Streaming. The serving layer can use a NoSQL database, Apache Drill, etc.
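The replay-then-resume flow can be sketched like this (function names are hypothetical, and a plain list stands in for a retained Kafka topic); the key point is that the same processing function handles both the replayed history and the live records:

```python
def process(view, record):
    # Hypothetical stream-processing step: count events by type.
    # This single code path serves both replay and live traffic.
    view[record] = view.get(record, 0) + 1

def reprocess(log, step):
    # On first deployment, or after any code change, replay the full
    # retained log (e.g. a Kafka topic, from offset 0) through the
    # stream-processing code to rebuild the serving-layer view.
    view = {}
    for record in log:
        step(view, record)
    return view
```

After `reprocess` finishes, new records are folded into the same view with the same `process` function, which is the point of Kappa: one codebase for both historical and real-time data.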
Which architecture to go for?
If the processing logic for the batch and real-time paths is the same, we can go for the Kappa architecture.
If the batch layer and speed layer require different processing logic, we can go for the Lambda architecture.