map vs flatMap in Spark

map:
- Returns a single element per input element, based on the function/custom business logic/algorithm.
- The return type is a single element.
- When the function itself returns a collection, map returns an RDD of those collections (see the example below).
- A map operation only.
- Returns the elements produced by the function as-is.

flatMap:
- Returns zero or more elements per input element, based on the function/custom business logic/algorithm.
- The function returns an iterator of values, but Spark does not return an RDD of iterators; it returns an RDD consisting of the elements from all the iterators.
- Returns an RDD of individual elements.
- Equivalent to a map followed by a flatten.
- Returns the elements of the iterators returned by the function, flattened into a single RDD.

map example:

val list = (1 to 5).toList
// list: List[Int] = List(1, 2, 3, 4, 5)

list.map(_.to(3))
// List[scala.collection.immutable.Range.Inclusive] = List(Range(1, 2, 3), Range(2, 3), Range(3), Range(), Range())

flatMap example:

val list = (1 to 5).toList
// list: List[Int] = List(1, 2, 3, 4, 5)

list.flatMap(_.to(3))
// List[Int] = List(1, 2, 3, 2, 3, 3)
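The "map followed by flatten" equivalence can be sketched with plain Scala collections, which follow the same semantics as Spark's RDD API:

```scala
object FlatMapEquivalence extends App {
  val list = (1 to 5).toList

  // flatMap in one step: each element expands to the range n..3
  // (empty for 4 and 5, since 4.to(3) and 5.to(3) are empty ranges)
  val flat = list.flatMap(_.to(3))

  // the same result via map (one Range per element) followed by flatten
  val mappedThenFlattened = list.map(_.to(3)).flatten

  println(flat)                // List(1, 2, 3, 2, 3, 3)
  println(mappedThenFlattened) // List(1, 2, 3, 2, 3, 3)
  assert(flat == mappedThenFlattened)
}
```

Running this confirms both expressions yield the same flattened List[Int], whereas map alone would leave a List of Ranges.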



Another good example:

val myse = Seq("India", "China")
// myse: Seq[String] = List(India, China)

val myseqrdd = sc.parallelize(myse)
// myseqrdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:26

myseqrdd.flatMap(_.toUpperCase)
// res2: org.apache.spark.rdd.RDD[Char] = MapPartitionsRDD[1] at flatMap at <console>:26

myseqrdd.flatMap(_.toUpperCase).collect
// res3: Array[Char] = Array(I, N, D, I, A, C, H, I, N, A)

myseqrdd.map(_.toUpperCase).collect
// res5: Array[String] = Array(INDIA, CHINA)

Because a String is itself a collection of Chars, flatMap flattens each uppercased name into its individual characters, giving an RDD[Char], while map keeps one String per input element.
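If the goal is words rather than characters, the usual pattern is to have the function return a collection explicitly, e.g. by splitting on whitespace. A plain-Scala sketch (the RDD version behaves the same, assuming a running SparkContext; the sample strings here are illustrative):

```scala
object FlatMapWords extends App {
  val lines = Seq("India China", "USA")

  // map keeps one output element per input line: a collection of Arrays
  val arrays = lines.map(_.split(" "))
  println(arrays.map(_.toList)) // List(List(India, China), List(USA))

  // flatMap flattens the per-line arrays into a single collection of words
  val words = lines.flatMap(_.split(" "))
  println(words)                // List(India, China, USA)
}
```

This is the classic word-count preprocessing step: flatMap(_.split(" ")) turns an RDD of lines into an RDD of words in one transformation.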

