DataFrame vs Dataset
| Feature | DataFrame | Dataset (RDD + DataFrame) |
|---|---|---|
| Definition | Distributed collection of rows organized into named columns. | Extension of the DataFrame API: a distributed collection of strongly typed, domain-specific objects. |
| Regeneration of domain-specific object | Cannot regenerate the original domain-specific object. | Can regenerate the original domain-specific object as a JVM object. |
| Static typing / compile-time safety | Accessing a column that does not exist in the table gives no compile-time error; the attribute error is detected at runtime. | Gives an error at compile time. |
| Individual attribute of an object | Columns are read by name from the binary row format; there is no domain object to deserialize. | No need to deserialize the entire object to access an individual attribute. |
| Programming language support | Python, R, Scala, and Java. | Scala and Java. |
| Optimization | Catalyst optimizer. | Catalyst optimizer + Tungsten encoder. The encoder maps the domain-specific type T to Spark's internal type system. |
| High-level domain-specific language methods |  | E.g. sum, avg, join, select, groupBy. |
| Caching |  | Uses a more optimal layout. |
| Aggregation |  | Faster when aggregating over many data sets. |
| Memory usage | Uses off-heap memory for serialization (reduces overhead compared to RDD). | Uses an encoder to convert between JVM objects and the tabular form. |
| Operation on serialized data |  |  |
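
The static-typing, regeneration, and encoder rows are easier to see in code. The following is a minimal Scala sketch, assuming Spark SQL is on the classpath and a local master is acceptable; `Person`, `DataFrameVsDataset`, and `failsAtRuntime` are hypothetical names made up for this illustration.

```scala
import org.apache.spark.sql.{AnalysisException, DataFrame, Dataset, Encoders, SparkSession}

// Hypothetical domain type, used only for this illustration.
case class Person(name: String, age: Int)

object DataFrameVsDataset {
  // Returns true when selecting `column` from `df` is rejected by Spark's analyzer.
  // The column name is a plain string, so the compiler cannot check it.
  def failsAtRuntime(df: DataFrame, column: String): Boolean =
    try { df.select(column); false }
    catch { case _: AnalysisException => true }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("df-vs-ds")
      .master("local[*]") // assumption: run locally
      .getOrCreate()
    import spark.implicits._

    // Dataset[Person]: the encoder for Person is derived implicitly.
    val ds: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

    // DataFrame is an alias for Dataset[Row]; the Person type is no longer visible.
    val df: DataFrame = ds.toDF()

    // Typed access: misspelling the field (for example `_.agee`) would not compile.
    val adults: Dataset[Person] = ds.filter(_.age >= 21)

    // Untyped access: the misspelled column is only caught when the code runs.
    println(s"select(agee) fails at runtime: ${failsAtRuntime(df, "agee")}")

    // Regeneration: a DataFrame with a matching schema can be viewed as Person again.
    val people: Array[Person] = df.as[Person].collect()

    // The encoder maps Person to Spark's internal schema.
    Encoders.product[Person].schema.printTreeString()

    println(adults.collect().toList)
    println(people.toList)
    spark.stop()
  }
}
```

Note that the compile-time check applies to typed operations such as `filter` with a lambda. String-based column access on a Dataset (for example `ds.select("agee")`) is still resolved at runtime, just as it is on a DataFrame.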