Summary -
In this topic, we covered the following -
Apache Flume is an intermediary tool/service that collects data from external data sources and sends it to a centralized data store. The data is typically log data generated by log servers.
Flume collects the log data using agents running on the log servers. Data collectors then gather the data from these agents, and as a final step the collected data is aggregated and pushed to a centralized store such as HDFS or HBase.
One approach is to tag all the data with source information and push it all down the same pipe. Another approach is to keep the different data sets logically isolated from each other the entire time and avoid post-processing.
Flume can implement both approaches. It enables the latter, lower-latency approach by introducing the concept of grouping nodes into flows.
Let's discuss each part of the diagram above; a minimal configuration tying these components together follows the list below.
- A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes.
- A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).
- A Flume source consumes events delivered to it by an external source like a web server.
- The external source sends events to Flume in a format that is recognized by the target Flume source.
- The channel is a passive store that keeps the event until it’s consumed by a Flume sink.
- The sink removes the event from the channel and puts it into an external repository like HDFS or forwards it to the Flume source of the next Flume agent (next hop) in the flow.
- The source and sink within the given agent run asynchronously with the events staged in the channel.
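As a concrete illustration, here is a minimal sketch of an agent configuration that wires these components together. The agent name a1 and the netcat/logger component types are illustrative choices from the standard Flume component set; any supported source, channel, and sink could be substituted.

    # Name the components of agent a1
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: read newline-terminated text from a TCP port
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Channel: passive store that buffers events between source and sink
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000

    # Sink: write events to the log (useful for testing)
    a1.sinks.k1.type = logger

    # Wire the source and the sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

A configuration file like this is passed to the agent at start-up, for example: bin/flume-ng agent --conf conf --conf-file example.conf --name a1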
Multi-hop flow -
A Flume topology can contain multiple agents, and an event may travel through more than one agent before reaching its final destination. This is known as a multi-hop flow.
The diagram below shows a multi-hop flow.
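As a sketch of how a multi-hop flow is configured, the Avro sink of one agent can point at the Avro source of the next agent; the sink of the previous hop and the source of the next hop must use matching (Avro) types. The agent names a1/a2, the host name collector-host, and port 4545 are illustrative.

    # Agent a1 (first hop): forward events over Avro RPC
    a1.sinks = k1
    a1.sinks.k1.type = avro
    a1.sinks.k1.hostname = collector-host
    a1.sinks.k1.port = 4545
    a1.sinks.k1.channel = c1

    # Agent a2 (next hop): receive events from a1
    a2.sources = r1
    a2.sources.r1.type = avro
    a2.sources.r1.bind = 0.0.0.0
    a2.sources.r1.port = 4545
    a2.sources.r1.channels = c1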
Fan-in flow -
Data transferred from many sources into one destination is known as a fan-in flow. A common scenario is a large number of agents on log-producing servers all sending their events to a small number of consolidation agents attached to the storage subsystem.
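As a sketch, a fan-in flow can be configured by pointing the Avro sinks of several first-tier agents at the Avro source of a single consolidation agent. The agent names, host name, and port below are illustrative, and channel wiring is omitted for brevity.

    # First-tier agents web1 and web2 forward to the same consolidator
    web1.sinks.k1.type = avro
    web1.sinks.k1.hostname = consolidator-host
    web1.sinks.k1.port = 4545

    web2.sinks.k1.type = avro
    web2.sinks.k1.hostname = consolidator-host
    web2.sinks.k1.port = 4545

    # Consolidation agent: one Avro source receives from all tiers
    collector.sources.r1.type = avro
    collector.sources.r1.bind = 0.0.0.0
    collector.sources.r1.port = 4545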
Fan-out flow -
Data delivered from one source to multiple channels (and on to multiple destinations) is called a fan-out flow. There are two modes of fan-out: replicating and multiplexing.
In replicating mode, the event is sent to all the configured channels. In multiplexing mode, the event is sent only to a subset of qualifying channels. To fan out the flow, one needs to specify the list of channels for a source and the policy for fanning it out.
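Here is a sketch of both modes, assuming a source r1 feeding two channels c1 and c2, and an event header named state used for the multiplexing routing decision:

    # Replicating fan-out (the default): every event goes to both channels
    a1.sources = r1
    a1.channels = c1 c2
    a1.sources.r1.channels = c1 c2
    a1.sources.r1.selector.type = replicating

    # Multiplexing fan-out (alternative): route by the 'state' header value
    # a1.sources.r1.selector.type = multiplexing
    # a1.sources.r1.selector.header = state
    # a1.sources.r1.selector.mapping.CZ = c1
    # a1.sources.r1.selector.mapping.US = c2
    # a1.sources.r1.selector.default = c1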
Recoverability -
Apache Flume can recover events when a transmission failure occurs. The events are staged in a channel on each agent, and the channel manages recovery from failure.
Flume supports a durable file channel, which is backed by the local file system, and a memory channel, which simply stores the events in an in-memory queue.
The in-memory queue is faster than the file channel, but any events still left in the memory channel when an agent process dies cannot be recovered.
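As a sketch, switching an agent from the memory channel to the durable file channel only changes the channel definition; the directory paths below are illustrative:

    # Durable file channel: staged events survive an agent restart
    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /var/flume/checkpoint
    a1.channels.c1.dataDirs = /var/flume/data

    # Memory channel (faster, but not recoverable after a crash)
    # a1.channels.c1.type = memory
    # a1.channels.c1.capacity = 1000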