map vs flatMap in Spark
map:
- Returns a single element for each input element, based on the supplied function / custom business logic.
- The return type of the function is a single element, so the result is an RDD of whatever the function returns; here that means an RDD of lists (see the example below).
- A map operation only: the elements of the RDD are transformed one-for-one.

flatMap:
- Returns zero or more elements for each input element, based on the supplied function / custom business logic.
- The function returns an iterator of values, but flatMap does not return an RDD of iterators; it returns a single RDD consisting of the elements from all the iterators.
- Equivalent to a map followed by a flatten (see the sketch after the examples), so the result is an RDD of individual elements.

Example (map):
val list = (1 to 5).toList
list: List[Int] = List(1, 2, 3, 4, 5)
list.map(_.to(3))
List[scala.collection.immutable.Range.Inclusive] = List(Range(1, 2, 3), Range(2, 3), Range(3), Range(), Range())

Example (flatMap):
list.flatMap(_.to(3))
List[Int] = List(1, 2, 3, 2, 3, 3)
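To make the "map followed by flatten" equivalence concrete, here is a minimal sketch reusing the same list from the examples above; the output is shown as the Scala REPL would print it:

list.map(_.to(3)).flatten
List[Int] = List(1, 2, 3, 2, 3, 3)

The result matches list.flatMap(_.to(3)) exactly: map produces the intermediate collections, and flatten merges their elements into one list.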
Another good example:
val myse = Seq("India","China")
myse: Seq[String] = List(India, China)
val myseqrdd = sc.parallelize(myse)
myseqrdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:26
myseqrdd.flatMap(_.toUpperCase)
res2: org.apache.spark.rdd.RDD[Char] = MapPartitionsRDD[1] at flatMap at <console>:26
myseqrdd.flatMap(_.toUpperCase).collect
res3: Array[Char] = Array(I, N, D, I, A, C, H, I, N, A)
myseqrdd.map(_.toUpperCase).collect
res5: Array[String] = Array(INDIA, CHINA)
Because a String is itself a collection of Chars, flatMap flattens each upper-cased string into its individual characters (an RDD[Char]), while map keeps one upper-cased String per input element (an RDD[String]). Note also that flatMap on its own only defines the transformation; nothing is computed until an action such as collect is called.
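The same contrast shows up in the classic line-splitting pattern. The sketch below is illustrative only: it assumes the same spark-shell session (with its built-in SparkContext sc), and sentences is a name introduced here just for the example. Output is shown as the shell would print it:

val sentences = sc.parallelize(Seq("India is big", "China is big"))
sentences.map(_.split(" ")).collect
Array[Array[String]] = Array(Array(India, is, big), Array(China, is, big))
sentences.flatMap(_.split(" ")).collect
Array[String] = Array(India, is, big, China, is, big)

map yields one array of words per line, whereas flatMap flattens all the words into a single RDD of strings.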