
Limitations of Spark

1. Not true real-time processing; Spark is a near-real-time (micro-batch) engine.
2. Expensive, because computation is done in memory.
3. Higher latency and lower throughput compared to Flink.
4. Small-file problem, for example with S3. A zipped file can only be decompressed when the complete file is present on a single core, so files are unzipped one after another, which takes a lot of time, and efficient processing afterwards needs heavy shuffling of the data.
5. Streaming window criteria are based on time only, not on the number of records (see the sketch after this list).
6. No file management system of its own; it relies on HDFS, S3 or other storage.
7. Comparatively few algorithms in MLlib.
8. Iterative processing: data is iterated over in batches, with each iteration scheduled and run separately.
9. Back pressure has to be handled manually.
10. Jobs require manual optimization and tuning.
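To make the time-only window limitation concrete, here is a minimal Structured Streaming sketch, assuming a spark-shell style session where spark is predefined; it uses the built-in rate test source (the timestamp and value columns come from that source), and the window is declared as a duration, since there is no built-in count-based window.

    import org.apache.spark.sql.functions.{window, col}

    // Built-in "rate" test source emits `timestamp` and `value` columns.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()

    // The window is declared as a time duration (10-minute tumbling window);
    // there is no built-in way to window by "every N records".
    val counts = events
      .groupBy(window(col("timestamp"), "10 minutes"))
      .count()

    val query = counts.writeStream
      .format("console")
      .outputMode("complete")
      .start()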

When to use RDD?

1. You want to precisely instruct Spark how to perform a query, i.e. control the low-level operations.
2. You can forgo the code optimization, efficient space utilization and performance benefits available with DataFrames and Datasets.
3. The data is unstructured, such as media streams or streams of text.
4. You do not want to impose a schema while processing, or to access attributes by name or column.
5. You want to manipulate the data with functional programming constructs rather than domain-specific expressions (see the sketch after this list).
6. An existing dependent third-party package is written using RDDs.
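As an illustration of points 3 to 5, a minimal RDD sketch over unstructured text, assuming a spark-shell style session where sc is predefined and a hypothetical input file logs.txt; no schema is imposed and the manipulation is purely functional.

    // No schema is imposed; each element is just a line of text.
    val lines = sc.textFile("logs.txt")   // hypothetical input path

    // Functional constructs instead of domain-specific (DataFrame) expressions.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)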

Out Of Memory (OOM) in Spark - Typical causes and resolutions

Spark OOM can occur at the driver, at an executor, or at the node manager. The error seen is java.lang.OutOfMemoryError, which happens when the memory required by an operation exceeds the memory available to it.

1. Driver OOM - possible causes:
a. collect(). For example, the program contains val data = inDF.collect(). The collect operation gathers the data from all executors and sends it to the driver, which tries to merge everything into a single object; that object may be too big to fit into driver memory. This can be solved in two ways:
- Set a proper limit using spark.driver.maxResultSize.
- Repartition before saving the result to the output file, e.g. data.repartition(1).write..., so that a dedicated executor performs the merge instead of the driver.
b. Broadcast join: the broadcast table is materialized at the driver and then broadcast to the executors, so a large table can exhaust driver memory.
c. Low driver memory configuration.
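A minimal sketch of the collect-related mitigations above, assuming a spark-shell style session where spark is predefined; the input and output paths are placeholders and the 2g limit is illustrative.

    // Submitted with: --conf spark.driver.maxResultSize=2g  (illustrative limit)
    // so oversized results fail the job instead of crashing the driver.

    val inDF = spark.read.parquet("/path/to/input")   // placeholder input

    // Risky for large results: every partition is pulled into the driver JVM.
    // val data = inDF.collect()

    // Safer: keep the data on the executors and write it out directly;
    // repartition(1) still yields a single output file when one is required.
    inDF.repartition(1).write.mode("overwrite").parquet("/path/to/output")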

Best Practices

1. When reading a large file, define the schema explicitly instead of having Spark infer it. This gives the following benefits (see the sketch after this list):
- Relieves Spark of the onus of inferring the schema.
- Prevents Spark from creating a separate job just to read a large portion of the file to ascertain the schema, which for a large file can be expensive and time-consuming.
- Early detection of errors from schema mismatches.
2. The other option is to set the samplingRatio read option to a small value such as 0.001, so that the schema is inferred from only a tiny sample of rows rather than from a large scan of the file.
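A minimal sketch of both practices, assuming a spark-shell style session where spark is predefined; the file path and column names are illustrative, and samplingRatio is supported by the JSON reader and, in newer Spark versions, the CSV reader.

    import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType, DoubleType}

    // Option 1: declare the schema up front; Spark never scans the file to infer types.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("amount", DoubleType, nullable = true)
    ))

    val dfExplicit = spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/path/to/large_file.csv")

    // Option 2: still infer, but only from a small sample of rows.
    val dfSampled = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("samplingRatio", "0.001")
      .csv("/path/to/large_file.csv")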

RDD Interface

RDD is the basic abstraction in Spark. Internally, an RDD is characterized by:
1. A list of dependencies on parent RDDs
2. A set of partitions
3. A compute function
4. A Partitioner, for key-value RDDs
5. A list of preferred locations
Items 4 and 5 are optional. An RDD achieves resilience through its dependencies: the lineage allows lost partitions to be recomputed. A sketch of inspecting these properties follows.
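A small sketch that inspects these properties on a key-value RDD, assuming a spark-shell style session where sc is predefined; the numbers and partition count are arbitrary.

    val base = sc.parallelize(1 to 1000, 4)                    // 4 partitions
    val keyed = base.map(x => (x % 10, x)).reduceByKey(_ + _)  // wide (shuffle) dependency

    println(keyed.toDebugString)       // lineage: the chain of dependencies
    println(keyed.partitions.length)   // partitions
    println(keyed.partitioner)         // typically Some(HashPartitioner(...)) for this key-value RDD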

Spark Persistence (Caching)

Cached RDDs should be modest in size so that they fit entirely in memory, but estimating that size up front is challenging and often unclear. A caching strategy that caches blocks both in memory and on disk is therefore preferred: cached blocks that get evicted from memory are written to disk, and reading them back from disk is relatively fast compared to re-evaluating the RDD. Internals: caching is performed at the block level. Each RDD consists of multiple blocks, and each block is cached independently of the others, on the node that generated that particular block. Every executor in Spark has an associated Block Manager that is used to cache RDD blocks. The memory available to the Block Manager is given by the storage memory fraction (spark.memory.storageFraction), i.e. the share of Spark's unified memory pool that is reserved for storage. The Block Manager manages these cached blocks on each executor.
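A minimal sketch of the memory-and-disk strategy, assuming a spark-shell style session where spark is predefined; the input path and column name are placeholders.

    import org.apache.spark.storage.StorageLevel

    val df = spark.read.parquet("/path/to/input")    // placeholder path

    // Blocks that do not fit in storage memory are spilled to local disk
    // instead of being dropped and recomputed later.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    df.count()                                       // first action materializes the cache
    df.filter("some_column IS NOT NULL").count()     // reuses the cached blocks

    df.unpersist()

Note that for DataFrames and Datasets, cache() already defaults to the MEMORY_AND_DISK storage level; persist() is used here only to make the choice explicit.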

Python vs Scala in Spark

Nativity: PySpark - wraps the Spark APIs that are provided, but not every feature is exposed. Scala - Spark is developed in Scala, so things are more natural; it is the de-facto interface.
Learning and use: PySpark - comparatively easy to learn and use. Scala - complex to learn but easy to use.
Complexity: PySpark - less complex compared to Scala. Scala - more complex when compared to PySpark; a lot of internal manipulations and conversions happen.
Conciseness: PySpark - more imperative in style. Scala - more concise; fewer lines of code allow faster development, testing and deployment.
Performance: PySpark - slower than Scala, since internal conversions are required. Scala - commonly cited as up to 10 times faster than PySpark.
Effective for: PySpark - smaller ad-hoc experiments. Scala - production and engineering applications.
Scalability: PySpark - not very scalable.
Refactoring: PySpark - not good; more bugs get introduced.