Posts

Best Practices

 1. When inferring the schema of a large file, define the schema explicitly instead. This gives the following benefits:
    - Relieves Spark of the onus of inferring the schema.
    - Prevents Spark from creating a separate job just to read a large portion of the file to ascertain the schema, which for a large file can be expensive and time-consuming.
    - Allows early detection of schema mismatches.
 2. The other way is to set the "samplingRatio" option to a small value such as 0.001, so the schema is inferred from only a tiny sample of rows (column names still come from the header) instead of a scan of the whole file. A sketch of both approaches follows this list.
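
A minimal Scala sketch of both approaches. The file path, column names, and types are hypothetical; note that Spark's DataFrameReader spells the option "samplingRatio":

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

object SchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SchemaExample").master("local[*]").getOrCreate()

    // Option 1: explicit schema -- Spark skips the inference job entirely,
    // and mismatched data is caught early instead of being silently mistyped.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("amount", DoubleType, nullable = true)
    ))
    val dfExplicit = spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/data/large_file.csv")   // hypothetical path

    // Option 2: infer the schema, but only from a 0.1% sample of rows.
    val dfSampled = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("samplingRatio", "0.001")
      .csv("/data/large_file.csv")

    dfExplicit.printSchema()
    dfSampled.printSchema()
    spark.stop()
  }
}
```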

RDD Interface

RDD is the basic abstraction in Spark. Internally, an RDD is characterized by:
 1. Dependencies
 2. Partitions
 3. A compute function
 4. A partitioner (for key-value RDDs)
 5. A list of preferred locations
Items 4 and 5 are optional. An RDD achieves resilience through its dependencies: if a partition is lost, Spark recomputes it from the lineage recorded there. All five components are visible on the public RDD API, as the sketch below shows.
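
A hedged sketch that surfaces each of the five components; the data and partition counts are illustrative only:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RddAnatomy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddAnatomy").setMaster("local[*]"))

    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)
    val reduced = pairs.reduceByKey(new HashPartitioner(4), _ + _)

    // 1. Dependencies: the lineage used to recompute lost partitions.
    println(reduced.dependencies)          // ShuffleDependency
    // 2. Partitions: the units of parallelism.
    println(reduced.partitions.length)     // 4
    // 3. Compute: runs per partition when an action is triggered.
    reduced.collect().foreach(println)
    // 4. Partitioner (optional; key-value RDDs only).
    println(reduced.partitioner)           // Some(HashPartitioner)
    // 5. Preferred locations (optional; empty for an in-memory collection).
    println(reduced.preferredLocations(reduced.partitions.head))

    sc.stop()
  }
}
```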

Spark Persistence (Caching)

Cached RDDs should be of modest size so that they fit entirely in memory, but identifying that size ahead of time is challenging. A caching strategy that keeps blocks in both memory and disk is therefore preferred: cached blocks that are evicted from memory are written to disk, and reading them back from disk is usually much faster than re-evaluating the RDD.

Internals: Caching is performed at the block level. Each RDD consists of multiple blocks, and each block is cached independently of the others, on the node that generated that particular block. Each executor in Spark has an associated Block Manager that caches RDD blocks. The memory available to the Block Manager is governed by the storage memory fraction (spark.memory.storageFraction), which gives the share of Spark's unified memory pool that is set aside for storage. The block manager manages ...
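
A short sketch of the memory-and-disk strategy described above; the input path and transformation are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CachingExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val expensive = sc.textFile("/data/events.log")   // hypothetical input
      .map(_.toUpperCase)                             // stand-in for costly work

    // MEMORY_AND_DISK: blocks evicted from memory spill to disk instead of
    // being dropped, so they are re-read from disk rather than recomputed.
    expensive.persist(StorageLevel.MEMORY_AND_DISK)

    println(expensive.count())   // first action materializes and caches blocks
    println(expensive.count())   // second action reads the cached blocks

    spark.stop()
  }
}
```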

Python vs Scala in Spark

| Aspect | PySpark | Scala |
|---|---|---|
| Nativity w.r.t. Spark APIs | Provided, but not for all features | Spark is developed in Scala, so things are more natural; the de-facto interface |
| Learning and use | Comparatively easy to learn and use | Complex to learn, easy to use |
| Complexity | Less compared to Scala | More complex compared to PySpark; a lot of internal manipulations and conversions happen |
| Conciseness | More imperative in style | More concise; fewer lines of code allow faster development, testing and deployment |
| Performance | Less than Scala; internal conversions are required | Up to 10 times faster than PySpark |
| Effective for | Smaller ad-hoc experiments | Production and engineering applications |
| Scalability | Not very scalable | |
| Refactoring | Not good; more bugs get... | |

Kafka Architecture - Draft

What is Kafka
- A stream-processing engine, written in Scala and Java.
- Aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
- Used for real-time data pipelines and streaming apps.
- Built on top of the ZooKeeper synchronization service.

Kafka among the popular messaging queues in the market
- Scalable and fault tolerant, with high read/write throughput and no single point of failure.
- Durable: messages are persisted on disk and replicated within the cluster to prevent data loss.
- Supports compression and data retention; compression saves storage and improves processing performance.
- Handles hundreds of megabytes of reads and writes per second from thousands of clients.
- No master/slave architecture: the same role is assigned to all nodes in the cluster, so there is no single point of failure.
- Guarantees ordering of messages: provides a total order of messages within a partition.
- Kafka is 5 times bett...
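
A minimal producer sketch in Scala against the kafka-clients Java API, assuming org.apache.kafka:kafka-clients is on the classpath; the broker address, topic, key, and payload are hypothetical. Keying records illustrates the per-partition ordering guarantee noted above:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    // Compression saves storage and network bandwidth, as noted above.
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip")
    // acks=all: wait for in-sync replicas, trading latency for durability.
    props.put(ProducerConfig.ACKS_CONFIG, "all")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Records sharing a key land in the same partition, where Kafka's
      // total-order guarantee holds.
      producer.send(new ProducerRecord("events", "user-42", "clicked_checkout"))
    } finally {
      producer.close()
    }
  }
}
```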