DAG in Spark and Usage

DAG - Directed Acyclic Graph

A DAG is a set of vertices and edges.
Vertices represent RDDs.
Edges represent the operations to be applied to those RDDs.
A DAG is a finite directed graph with no directed cycles: it has a finite number of vertices and edges, and following the edges never leads back to a vertex already visited.
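
To make the definition concrete, here is a minimal sketch in plain Python (not Spark's API). It stores a graph as a dictionary of edges and uses Kahn's algorithm to check that there is no directed cycle. The RDD names and the function name `topological_order` are made up for illustration.

```python
from collections import deque


def topological_order(edges):
    """Return the vertices in dependency order, or raise ValueError on a cycle.

    `edges` maps each vertex to the list of vertices derived from it.
    """
    # Collect every vertex, including those that only appear as targets.
    vertices = set(edges)
    for targets in edges.values():
        vertices.update(targets)

    # Count incoming edges for each vertex.
    in_degree = {v: 0 for v in vertices}
    for targets in edges.values():
        for t in targets:
            in_degree[t] += 1

    # Kahn's algorithm: repeatedly emit vertices with no unprocessed parents.
    ready = deque(sorted(v for v in vertices if in_degree[v] == 0))
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for t in edges.get(v, []):
            in_degree[t] -= 1
            if in_degree[t] == 0:
                ready.append(t)

    # Vertices left over can only be part of a directed cycle.
    if len(order) != len(vertices):
        raise ValueError("graph contains a directed cycle")
    return order


# Hypothetical RDD names; each edge stands for the operation in the comment.
rdd_graph = {
    "lines": ["words"],    # flatMap
    "words": ["pairs"],    # map
    "pairs": ["counts"],   # reduceByKey
}
order = topological_order(rdd_graph)
print(order)
```

A graph in which two vertices point at each other would raise `ValueError`, which is exactly the situation a DAG rules out.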

Each job in a Spark application is converted into a DAG.
The DAG shows the complete task, i.e. the transformation(s) and the action.
The DAG shows the different stages of a Spark job.
The logical DAG of operations is created implicitly by the Spark program; in other words, user code defines a DAG of RDDs.
For each action, Spark creates a DAG and submits it to the DAG scheduler.
The DAG scheduler divides the operators into stages of tasks. (This is the final result of the DAG scheduler.)
The DAG scheduler pipelines operators together where possible.
e.g. many map operations can be scheduled in a single stage.
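
The stage-splitting idea can be sketched with a toy model in plain Python. This is not the real DAG scheduler: it assumes a linear chain of operation names and a hand-written set `WIDE_OPS` of operations that need a shuffle, both of which are simplifications made up for this example.

```python
# Assumption for this sketch: these operations always require a shuffle.
# (In real Spark this depends on partitioning, e.g. a join of
# co-partitioned RDDs may not shuffle.)
WIDE_OPS = {"reduceByKey", "groupByKey", "join", "repartition"}


def split_into_stages(operations):
    """Group a linear chain of operation names into stages.

    Narrow operations are appended to the current stage (pipelined);
    a wide operation opens a new stage at the shuffle boundary.
    """
    stages = []
    for op in operations:
        if not stages or op in WIDE_OPS:
            stages.append([])
        stages[-1].append(op)
    return stages


job = ["textFile", "flatMap", "map", "filter", "reduceByKey", "mapValues"]
stages = split_into_stages(job)
for number, stage in enumerate(stages):
    print(number, stage)
```

Here the three narrow operations after `textFile` land in the same stage, while `reduceByKey` starts a second one.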


The DAG helps in achieving fault tolerance.
A lost RDD can be recovered by recomputing it from its parent RDDs.
MapReduce (MR) has two levels, map and reduce, whereas a DAG can have multiple levels; because of this, executing SQL queries is more flexible.

The DAG model is an alternative to MapReduce in Hadoop; it is a programming style for distributed systems.

DAG execution is typically faster than MapReduce because intermediate results between pipelined operators are kept in memory rather than written to disk.

An action forces the translation of the RDD graph into an execution plan.
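
The effect of an action can be illustrated with a small toy class in plain Python. `LazyDataset` is a hypothetical stand-in for an RDD, not part of Spark: transformations only record what should happen, and `collect` plays the role of the action that finally runs the recorded plan.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, plan=()):
        self._source = source   # the input data
        self._plan = plan       # recorded transformations, oldest first

    def map(self, fn):
        # Transformation: return a new dataset with a longer plan.
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: again, nothing is computed here.
        return LazyDataset(self._source, self._plan + (("filter", pred),))

    def collect(self):
        """Action: only now is the recorded plan executed."""
        data = iter(self._source)
        for kind, fn in self._plan:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []


def double(x):
    calls.append(x)   # record every invocation so laziness is observable
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: no transformation has run
result = ds.collect()              # the action triggers execution
print(calls_before_action, result)
```

Until `collect` is called, `double` is never invoked; the chain of `map` and `filter` calls has only built up a plan.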


DAG vs Lineage


Q1) What is a lineage graph?

When a new RDD is derived from an existing RDD using a transformation, Spark keeps track of all the dependencies between these RDDs; this record is called the lineage graph. In case of data loss, the lineage graph is used to rebuild the data.
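
A rough sketch of lineage-based recovery, again as a toy model in plain Python rather than Spark's implementation. The class `ToyRDD`, its `cache` dictionary, and the dataset names are assumptions made for this example: each dataset remembers its parent and the function that derives it, so a missing partition can be rebuilt by replaying those steps.

```python
class ToyRDD:
    """Toy stand-in for an RDD that remembers how it was derived."""

    def __init__(self, name, parent=None, fn=None, partitions=None):
        self.name = name
        self.parent = parent    # lineage: the dataset this one came from
        self.fn = fn            # lineage: the function applied to the parent
        # Materialized partitions, keyed by partition index.
        self.cache = dict(enumerate(partitions or []))

    def map(self, name, fn):
        """Derive a new dataset; nothing is computed yet."""
        return ToyRDD(name, parent=self, fn=fn)

    def lineage(self):
        """Names from this dataset back to the source."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names

    def compute(self, index):
        """Return partition `index`, rebuilding it from the parent if missing."""
        if index not in self.cache:
            if self.parent is None:
                raise KeyError(f"source partition {index} is unavailable")
            parent_data = self.parent.compute(index)
            self.cache[index] = [self.fn(x) for x in parent_data]
        return self.cache[index]


source = ToyRDD("source", partitions=[[1, 2], [3, 4]])
squared = source.map("squared", lambda x: x * x)
plus_one = squared.map("plus_one", lambda x: x + 1)

original = plus_one.compute(1)    # computes partition 1 along the chain
# Simulate losing the computed partitions (e.g. a failed executor).
del plus_one.cache[1]
del squared.cache[1]
recovered = plus_one.compute(1)   # rebuilt by replaying the lineage
print(plus_one.lineage(), original, recovered)
```

Unlike this toy, Spark does not keep every intermediate partition unless asked to persist it, but the recovery principle is the same: recompute from the recorded dependencies.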

Q2) Is there any difference between a lineage graph and a directed acyclic graph?

Yes, the DAG and the lineage graph are different.

The DAG shows the different stages of a Spark job, whereas the lineage graph records only the dependencies between RDDs.

Q3) Are they the same, and can they be used interchangeably?

No, they cannot be used interchangeably, because they work differently. The lineage graph deals with RDDs, so it applies only up to the transformations, whereas the DAG shows the complete task, i.e. transformations + action.


From <http://support.edureka.co/support/solutions/articles/4000083868--directed-acyclic-graph-dag-vs-lineage-graph-in-spark>




