DataFrame vs Dataset
| Feature | DataFrame | Dataset (RDD + DF) |
|---|---|---|
| Definition | Distributed collection of rows (data) organized into named columns. | Extension of the DataFrame API: a distributed collection of strongly typed domain-specific objects. |
| Regeneration of the domain-specific object | Can’t regenerate the original domain-specific object. | Can regenerate the original domain-specific object from the JVM object. |
| Static typing / compile-time safety | Accessing a column that doesn’t exist in the table gives no compile-time error; the attribute error is only detected at runtime. | Gives an error at compile time. |
| Individual attribute of an object | To access an individual attribute, the entire object needs to be serialized. | No need to serialize the entire object to access an individual attribute. |
| Programming language support | Python, R, Scala and Java | Scala and Java |
| Optimization | Catalyst optimizer | Catalyst optimizer + Tungsten encoder. The encoder maps the domain-specific type T to Spark’s internal type system. |
| High-level domain-specific language methods | | E.g. sum, avg, join, select, groupBy |
| Caching | | Uses a more optimal layout |
| Aggregation | | Faster at performing aggregations over many data sets |
| Memory usage | Uses off-heap memory for serialization, which reduces overhead compared to RDD | Uses an encoder to convert between JVM objects and the tabular form |
| Operation on serialized data | | |
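The rows on compile-time safety, regeneration of the domain object, and encoders can be illustrated with a small Scala sketch. It assumes a local Spark session in classic (non-Connect) mode, where a query is analyzed as soon as it is built. The `Person` case class, the `DfVsDsSketch` object, and the sample values are hypothetical names made up for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.avg
import scala.util.Try

// Hypothetical domain type, used only for this illustration.
case class Person(name: String, age: Long)

object DfVsDsSketch {
  // DataFrame: column names are plain strings, so the misspelled "agee"
  // compiles and is only rejected when Spark analyzes the query at runtime.
  def typoFailsAtRuntime(df: DataFrame): Boolean =
    Try(df.select("agee")).isFailure

  // Dataset: fields are checked by the Scala compiler.
  // Writing p.agee here would be a compile error, not a runtime error.
  def adults(ds: Dataset[Person]): Dataset[Person] =
    ds.filter((p: Person) => p.age >= 21)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("df-vs-ds-sketch")
      .master("local[*]") // assumption: run locally for the demo
      .getOrCreate()
    import spark.implicits._ // brings the implicit encoders into scope

    val ds: Dataset[Person] = Seq(Person("Ann", 34L), Person("Bob", 19L)).toDS()
    val df: DataFrame = ds.toDF()

    println(s"Typo caught only at runtime: ${typoFailsAtRuntime(df)}")
    adults(ds).show()

    // Collecting a DataFrame yields generic Row values,
    // while collecting a Dataset yields the original Person objects.
    val rows = df.collect()   // Array[Row]
    val people = ds.collect() // Array[Person]
    println(s"${rows.length} rows, first person: ${people.head}")

    // A DataFrame becomes typed again only via an explicit conversion,
    // which relies on the implicit Encoder[Person].
    val typedAgain: Dataset[Person] = df.as[Person]
    typedAgain.show()

    // The encoder maps the domain type to Spark's internal schema.
    println(Encoders.product[Person].schema.treeString)

    // High-level DSL methods such as groupBy and avg.
    df.groupBy("name").agg(avg("age")).show()

    spark.stop()
  }
}
```

The explicit parameter type in `filter` is a defensive choice to avoid overload ambiguity; on recent Spark and Scala versions `ds.filter(_.age >= 21)` should work as well.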