DataFrame vs Dataset

	DataFrame	Dataset (RDD + DataFrame)
Definition	Distributed collection of rows organized into named columns.	Extension of the DataFrame API: a distributed collection of strongly typed domain-specific objects.
Regeneration of domain-specific object	Cannot regenerate the original domain-specific object	Can regenerate the original domain-specific object from the JVM object
Static typing/Compile-time safety	Accessing a column that does not exist in the table gives no compile-time error; the attribute error is detected only at runtime	Gives an error at compile time
Individual attribute of an object	To access an individual attribute, the entire object must be deserialized	No need to deserialize the entire object to access an individual attribute
Programming language support	Python, R, Scala, and Java	Scala and Java
Optimization	Catalyst optimizer	Catalyst optimizer + Tungsten. An Encoder maps the domain-specific type T to Spark's internal type system.
High-level domain-specific language methods		E.g. sum, avg, join, select, groupBy, etc.
Caching		Uses a more optimal layout
Aggregation		Faster aggregation over large volumes of data
Memory usage	Uses off-heap memory for serialization (reduces overhead compared to RDD)	Uses an encoder to convert between JVM objects and the tabular form
Operation on serialized data
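
The typing and regeneration rows above can be sketched in Scala as follows. This is a minimal illustration, not a definitive implementation: the `Person` case class, the sample values, the misspelled `nmae` field, and the `local[*]` master are all assumptions made for the example, and it assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical domain type, used only for this illustration.
case class Person(name: String, age: Long)

object DataFrameVsDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameVsDataset")
      .master("local[*]") // assumption: running locally
      .getOrCreate()
    import spark.implicits._ // brings Encoder[Person], toDF/toDS and $"..." into scope

    val people = Seq(Person("Ann", 31L), Person("Bob", 17L))

    // DataFrame: rows organized into named columns; names are resolved at runtime.
    val df: DataFrame = people.toDF()
    df.select("name").show()
    // df.select("nmae") would compile, but fail at runtime with an AnalysisException.

    // Dataset: strongly typed; field access is checked by the compiler.
    val ds: Dataset[Person] = people.toDS()
    val adults: Dataset[Person] = ds.filter(person => person.age >= 18)
    // ds.filter(person => person.nmae == "Ann") would not compile:
    // Person has no member named nmae.
    adults.show()

    // With an Encoder in scope, a DataFrame can be viewed as a typed Dataset
    // and the domain objects recovered on the driver.
    val recovered: Array[Person] = df.as[Person].collect()
    println(recovered.mkString(", "))

    // High-level DSL methods such as groupBy and avg are available on a Dataset too.
    ds.groupBy($"name").avg("age").show()

    spark.stop()
  }
}
```

The case class is declared at the top level so that `spark.implicits._` can derive its Encoder. This is the mapping from the type T to Spark's internal type system mentioned in the Optimization row.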
