DataFrame vs Dataset


| Feature | DataFrame | Dataset (RDD + DataFrame) |
| --- | --- | --- |
| Definition | Distributed collection of rows organized into named columns. | Extension of the DataFrame API: a distributed collection of strongly typed domain-specific objects. |
| Regeneration of the domain-specific object | Cannot regenerate the original domain-specific object. | Can regenerate the original domain-specific JVM object. |
| Static typing / compile-time safety | Accessing a column that does not exist in the table does not produce a compile-time error; the missing attribute is only detected at runtime. | Produces an error at compile time. |
| Individual attribute of an object | To access an individual attribute, the entire object must be deserialized. | An individual attribute can be accessed without deserializing the entire object. |
| Programming language support | Python, R, Scala and Java. | Scala and Java. |
| Optimization | Catalyst optimizer. | Catalyst optimizer + Tungsten encoders. An encoder maps the domain-specific type T to Spark's internal type system. |
| High-level domain-specific language methods | | E.g. sum, avg, join, select, groupBy. |
| Caching | | Uses a more optimal layout. |
| Aggregation | | Faster aggregation over large volumes of data. |
| Memory usage | Uses off-heap memory for serialization, which reduces overhead compared to an RDD. | Uses encoders to convert between JVM objects and the tabular form. |
| Operation on serialized data | | |
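
The typing and encoder differences are easier to see in code. The following is a minimal sketch, not taken from the original post: the `Employee` case class, its sample rows, the misspelled column `salry`, and the local `SparkSession` settings are all assumptions made up for illustration. It assumes Spark 2.x or later with the Scala API on the classpath.

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, Dataset, Encoders, SparkSession}

// Hypothetical domain type, used only for this illustration.
case class Employee(name: String, dept: String, salary: Double)

object DataFrameVsDataset {
  def main(args: Array[String]): Unit = {
    // Local session for experimenting; the app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("df-vs-ds")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A strongly typed Dataset of domain objects.
    val ds: Dataset[Employee] = Seq(
      Employee("Asha", "eng", 120.0),
      Employee("Ravi", "eng", 100.0),
      Employee("Mei", "ops", 90.0)
    ).toDS()

    // The same data as an untyped DataFrame (in Scala, DataFrame = Dataset[Row]).
    val df: DataFrame = ds.toDF()

    // DataFrame: a misspelled column name compiles, and only fails
    // when Spark analyzes the query at runtime.
    try {
      df.select("salry").show()
    } catch {
      case e: AnalysisException =>
        println(s"Detected at runtime: ${e.getMessage}")
    }

    // Dataset: the same mistake is rejected by the Scala compiler.
    // ds.map(_.salry)   // does not compile: salry is not a member of Employee
    val raised: Dataset[Double] = ds.map(_.salary * 1.1)
    raised.show()

    // Regenerating domain objects: the encoder converts rows back to Employee.
    val employees: Array[Employee] = df.as[Employee].collect()
    employees.foreach(println)

    // The encoder maps the domain type to Spark's internal schema.
    Encoders.product[Employee].schema.printTreeString()

    // High-level DSL methods such as groupBy and avg.
    df.groupBy("dept").avg("salary").show()

    spark.stop()
  }
}
```

Uncommenting the `ds.map(_.salry)` line should make the build fail, which is the compile-time safety the table refers to, whereas the `df.select("salry")` call only fails once the program is running.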

