Recently I have been using the Apache data processing frameworks (Hadoop/HDFS/Yarn/Spark/Kafka/ZooKeeper) a lot, and I designed a comprehensive project that combines all of them to analyze the crime distribution in LA. I would like to write a brief summary here before moving on to other technologies.
I think the most fundamental and earliest framework is Hadoop. Hadoop is an open-source implementation inspired by Google's papers on the Google File System and MapReduce. The Hadoop Distributed File System (HDFS) stores large files across multiple machines (a cluster), and Hadoop MapReduce parallelizes `map` and `reduce` tasks on each machine to handle data-intensive jobs. Yarn was introduced in Hadoop 2 to manage and schedule computational resources. I haven't looked into Hadoop 3 yet, but I know one new feature is that we can use GPUs on each worker.
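To make the `map`/`reduce` idea concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets you write the mapper and reducer as plain Python scripts that read from stdin. The file names are illustrative, and the streaming jar path you pass to `hadoop jar` depends on your installation.

```python
#!/usr/bin/env python3
# mapper.py: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: sum the counts for each word.
# Hadoop sorts the mapper output by key before it reaches the reducer,
# so identical words arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

You would submit this with the Hadoop Streaming jar, passing `-mapper mapper.py -reducer reducer.py` together with the input and output HDFS paths.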
Spark is another framework for parallelizing data processing on a cluster. The core abstraction of Spark is the Resilient Distributed Dataset (RDD). You can operate on an RDD as if it were a single collection in memory, but Spark actually partitions the data and stores the partitions on different machines, while providing an interface to manipulate the distributed collection as one. Also, Spark evaluates the operations (transformations) lazily.
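Here is a minimal PySpark sketch of that idea; the input file name and the column index used as the key are illustrative, not the actual schema of the LA crime data.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDSketch")

# textFile returns an RDD whose partitions are spread across the cluster
# (or across local cores in this local example)
lines = sc.textFile("crime_records.csv")

# Transformations are lazy: nothing is computed by these calls,
# they only build up a lineage of operations
area_counts = (lines
               .map(lambda line: line.split(","))
               .filter(lambda fields: len(fields) > 2)
               .map(lambda fields: (fields[1], 1))       # assume column 1 is an area name
               .reduceByKey(lambda a, b: a + b))

# Only an action such as take() or collect() triggers actual execution
print(area_counts.take(10))

sc.stop()
```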
To do stream analysis, we might have multiple data producers and multiple data consumers. Kafka provides a channel to collect and buffer streaming data from different sources. Kafka also separates the data into different topics and uses multiple machines (brokers) to collect and distribute the data. ZooKeeper is used to keep track of the state of the brokers (and, in older Kafka versions, the consumers' offsets) and to coordinate the cluster.
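As a small sketch of how a producer and a consumer talk to a topic, here is an example using the `kafka-python` package; the broker address, topic name, and message payload are assumptions for illustration.

```python
from kafka import KafkaProducer, KafkaConsumer

# Assumes a broker running on localhost:9092; the topic name is illustrative
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("crime-events", b'{"area": "Hollywood", "code": 510}')
producer.flush()

consumer = KafkaConsumer(
    "crime-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # start from the beginning of the topic
    consumer_timeout_ms=5000,       # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```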
Spark can handle streaming data with different strategies. The first one is the Spark DStream (discretized stream): we set a time interval and process all the streaming data that arrives within that interval as one batch. The second is windowed streaming: we still pick a batch interval, but we aggregate the data from multiple consecutive intervals (a window). The last one, Structured Streaming, is newer and still in development; it assumes the streaming data has a consistent schema, so the stream can be treated as a table that grows row by row.
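The following is a minimal sketch of the first two strategies with the DStream API, assuming a text stream on `localhost:9999` (e.g. fed by `nc -lk 9999`); the key extraction and the interval lengths are illustrative.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamSketch")
ssc = StreamingContext(sc, 10)              # DStream: every 10 seconds becomes one micro-batch
ssc.checkpoint("/tmp/stream-checkpoint")    # checkpointing is recommended for windowed state

lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.map(lambda line: (line.split(",")[0], 1))

# Windowed streaming: count keys over the last 60 seconds, recomputed every 20 seconds
windowed = pairs.reduceByKeyAndWindow(lambda a, b: a + b, None, 60, 20)
windowed.pprint()

ssc.start()
ssc.awaitTermination()
```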
The most time-consuming part might be the configuration. For most problems, you can search the error message on Google to get a solution. But I want to mention two common errors that are hard to debug:
1. When I tried to connect Yarn and Spark, the job got stuck at the initialization step and no log was produced. Eventually, I found that Yarn was trying to allocate a large piece of memory on my workers. However, my workers are virtual machines, and even with enough memory the allocation took forever. You might want to reduce the memory usage like this:
<!-- yarn-site.xml -->
<!-- Total memory (MB) the NodeManager can hand out to containers on this worker -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1536</value>
</property>
<!-- Smallest container (MB) the scheduler will allocate -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<!-- Number of virtual cores available for containers on this worker -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>1</value>
</property>
2. I used the openjdk-8 version of the Java Runtime Environment. It works with Hadoop but fails for some unknown reason. You might have to add the following, which disables Yarn's physical and virtual memory checks so that containers are not killed for exceeding the limits:
<!-- yarn-site.xml -->
<!-- Do not kill containers that exceed the physical memory limit -->
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<!-- Do not kill containers that exceed the virtual memory limit -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>