
Limitations of Spark

1. Not true real-time processing; Spark is a near-real-time (micro-batch) processing engine.
2. Expensive, because in-memory computation requires a large amount of RAM.
3. Higher latency and lower throughput compared to Flink.
4. Small-file problem, for example with S3: a zipped file can only be decompressed when the complete file sits on one core, so files are unzipped sequentially, which takes a lot of time, and efficient processing afterwards needs heavy shuffling of data.
5. Windowing criteria are based on time only, not on the number of records (see the sketch after this list).
6. No file management system of its own; it depends on external storage such as HDFS or S3.
7. Relatively few algorithms in MLlib.
8. Inefficient iterative processing.
9. No automatic handling of back pressure.
10. Requires manual optimization.
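The snippet below is a minimal sketch (not from the original post) of limitation 5: Structured Streaming windows are expressed as time durations, with no built-in "every N records" window. The socket source on localhost:9999, the generated timestamp column, and local mode are all assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, current_timestamp, window}

object TimeWindowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TimeWindowSketch")
      .master("local[*]")          // assumption: local mode for illustration
      .getOrCreate()

    // Placeholder streaming source: lines of text from a local socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .withColumn("ts", current_timestamp())

    // The window is defined purely in time units (10-minute window sliding
    // every 5 minutes); Spark offers no count-based window here.
    val counts = lines
      .groupBy(window(col("ts"), "10 minutes", "5 minutes"))
      .count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```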

When to use RDD?

1. You want to instruct Spark precisely how to run a query, i.e. control the low-level operations yourself.
2. You can forgo the code optimization, efficient space utilization, and performance benefits available with DataFrames and Datasets.
3. The data is unstructured, such as media streams or streams of text.
4. You do not want to impose a schema while processing, or to access attributes by name or column.
5. You want to manipulate the data with functional programming constructs rather than domain-specific expressions (see the sketch after this list).
6. An existing third-party package you depend on is written using RDDs.
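As a rough illustration of points 1, 3, 4, and 5, here is a minimal word-count sketch on unstructured text using the RDD API: no schema is imposed, and each transformation (flatMap, map, reduceByKey) is an explicit low-level instruction written with functional constructs rather than a declarative expression the optimizer rewrites. The file path and local-mode setup are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object RddWordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddWordCountSketch")
      .master("local[*]")          // assumption: local mode for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Unstructured input: plain lines of text, no columns, no schema.
    val lines = sc.textFile("data/logs.txt")   // hypothetical path

    val wordCounts = lines
      .flatMap(_.split("\\s+"))               // functional constructs, not SQL
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)                     // we choose the shuffle step ourselves
      .sortBy { case (_, n) => -n }

    wordCounts.take(10).foreach(println)
    spark.stop()
  }
}
```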