DAG in Spark and Usage
DAG - Directed Acyclic Graph
A DAG is a set of vertices and edges.
Vertices represent RDDs.
Edges represent the operations to be applied to the RDDs.
A DAG is a finite graph with no directed cycles: it has a finite number of vertices and edges, and following the edges never leads back to a vertex already visited.
Each Spark application is converted into one or more DAGs, one for each action it runs.
The DAG shows the complete task, i.e. the transformation(s) and the action.
The DAG shows the different stages of a Spark job.
The logical DAG of operations is created implicitly by the Spark program; in other words, user code defines the DAG of RDDs.
For each action, Spark creates a DAG and submits it to the DAG scheduler.
The DAG scheduler divides the operators into stages of tasks (this is the final result of the DAG scheduler).
The DAG scheduler pipelines operators together where it can.
E.g. many map operations can be scheduled in a single stage.
The DAG helps in achieving fault tolerance.
A lost RDD can be recovered by recomputing it from the RDDs it depends on.
MapReduce (MR) has two levels, map and reduce; a DAG can have multiple levels, which makes executing SQL queries more flexible.
DAG execution is an alternative to MapReduce in Hadoop; it is a programming style for distributed systems.
DAG execution is generally faster because intermediate results can be kept in memory instead of being written to disk between steps.
An action forces the translation of the RDD into an execution plan.
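The points above about vertices, edges and stages can be sketched with a small toy model. This is plain Python, not Spark's API or its actual scheduler: the `Node` class, the `wide` flag and `split_into_stages` are hypothetical names made up for illustration. The sketch assumes a simple linear chain of operations and assumes that a stage boundary falls wherever an operation needs a shuffle (marked here with `wide=True`), while operations without a shuffle are pipelined into the same stage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A vertex in the toy DAG: one RDD-like dataset (hypothetical class)."""
    name: str                        # name of the dataset (vertex)
    op: str                          # operation (edge) that produced it
    parent: Optional["Node"] = None  # the dataset it was derived from
    wide: bool = False               # assumption: True means the op needs a shuffle


def split_into_stages(last: Node) -> List[List[str]]:
    """Walk back from the final node, then cut a new stage at every wide op."""
    chain = []
    node = last
    while node is not None:
        chain.append(node)
        node = node.parent
    chain.reverse()  # source first

    stages: List[List[str]] = []
    current: List[str] = []
    for n in chain:
        if n.wide and current:
            stages.append(current)  # close the stage before the shuffle
            current = []
        current.append(n.op)        # ops without a shuffle share a stage
    if current:
        stages.append(current)
    return stages


# A word-count-like chain; the operation names mirror common RDD operations.
source = Node("lines", "read")
words = Node("words", "flatMap", parent=source)
pairs = Node("pairs", "map", parent=words)
counts = Node("counts", "reduceByKey", parent=pairs, wide=True)
result = Node("result", "map", parent=counts)

stages = split_into_stages(result)
print(stages)
```

In this toy run the first three operations land in one stage and the shuffle operation starts a second one. A real Spark DAG can branch and join, which this linear sketch deliberately ignores.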
From
<http://support.edureka.co/support/solutions/articles/4000083868--directed-acyclic-graph-dag-vs-lineage-graph-in-spark>
DAG vs Lineage
Q1) Then what is a lineage graph?
When a new RDD is derived from an existing RDD using a transformation, Spark keeps track of all the dependencies between these RDDs in what is called the lineage graph. In case of data loss, this lineage graph is used to rebuild the data.
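A minimal sketch of how lineage makes rebuilding possible, again as a plain Python toy rather than Spark code. `ToyRDD`, `lose` and `compute` are hypothetical names; the assumption is that each derived dataset remembers only its parent and the function that produced it, so lost results can be recomputed by replaying that chain from the source.

```python
class ToyRDD:
    """Toy dataset that records its parent and deriving function (its lineage)."""

    def __init__(self, parent=None, fn=None, data=None):
        self.parent = parent  # dataset this one was derived from
        self.fn = fn          # function that derives our rows from the parent's rows
        self._data = data     # materialised rows; None means "not available"

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def lose(self):
        """Simulate losing the materialised rows of a derived dataset."""
        if self.parent is not None:
            self._data = None

    def compute(self):
        if self._data is None:
            # Rebuild from the lineage: recompute the parent, then reapply fn.
            self._data = self.fn(self.parent.compute())
        return self._data


base = ToyRDD(data=[1, 2, 3, 4, 5, 6])     # source data, assumed durable
evens = base.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

first = list(squares.compute())

# Simulate data loss on both derived datasets, then rebuild via lineage.
squares.lose()
evens.lose()
rebuilt = squares.compute()
print(first, rebuilt)
```

The rebuilt rows match the original ones because the transformations are deterministic and the source data is still available. In real Spark, an RDD's lineage can be inspected with `toDebugString`.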
Q2) Is there any difference between a lineage graph and a directed acyclic graph?
Yes, the DAG and the lineage graph are different. The DAG shows the different stages of a Spark job.
Q3) Are they the same, and can they be used interchangeably?
No, they cannot be used interchangeably, because they work differently. The lineage graph deals with RDDs, so it applies only up to the transformations, whereas the DAG shows the complete task, i.e. transformations + action.
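The distinction can be illustrated with one more plain Python toy (not Spark's API; `LazyDataset` and its methods are hypothetical). The assumption is that transformations only extend the recorded lineage and run nothing, while the action assembles the complete plan, i.e. transformations plus the action, and executes it.

```python
class LazyDataset:
    """Toy dataset: transformations are recorded, only the action runs them."""

    def __init__(self, data, lineage=()):
        self._data = data
        self.lineage = tuple(lineage)  # (name, function) pairs, transformations only

    def map(self, f):
        step = ("map", lambda rows: [f(r) for r in rows])
        return LazyDataset(self._data, self.lineage + (step,))

    def filter(self, pred):
        step = ("filter", lambda rows: [r for r in rows if pred(r)])
        return LazyDataset(self._data, self.lineage + (step,))

    def collect(self):
        """The action: build the full plan (transformations + action) and run it."""
        plan = [name for name, _ in self.lineage] + ["collect"]
        rows = list(self._data)
        for _, fn in self.lineage:
            rows = fn(rows)
        return plan, rows


calls = []  # records when the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3]).filter(lambda x: x > 1).map(double)

lineage_names = [name for name, _ in ds.lineage]
calls_before_action = len(calls)  # nothing has executed yet

plan, rows = ds.collect()
print(lineage_names, plan, rows)
```

Here the lineage holds only the two transformations, and the plan produced by the action contains them plus the action itself.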