Python vs Scala in Spark

| Aspect | PySpark | Scala |
|---|---|---|
| Nativity w.r.t. Spark | APIs are provided, but not for all features | Spark is developed in Scala, so things are more natural; it is the de facto interface |
| Learning and use | Comparatively easy to learn and use | Harder to learn; moderately easy to use |
| Complexity | Less complex than Scala | More complex than PySpark; a lot of internal manipulation and conversion happens |
| Conciseness | More imperative in style | More concise; fewer lines of code allow faster development, testing, and deployment |
| Performance | Slower than Scala; internal conversions between the Python worker and the JVM are required | Often cited as up to 10x faster than PySpark, especially for RDD- and UDF-heavy workloads |
| Effective for | Smaller, ad hoc experiments | Production and engineering applications |
| Scalability | Limited | Scalable |
| Refactoring | Harder; more bugs get introduced | Hassle-free and much easier to refactor |
| Data-science tools (ML/NLP/deep learning) | Lots of Python-based libraries | Fewer libraries; MLlib is available in Scala |
| Streaming support | Less mature | Best pick for streaming |
| Visualization tools | Many already exist | Very few compared to Python |
| Versioning/compatibility dependencies | Python 2 or Python 3 | Latest Scala version |
| Community | Larger community than Scala's | Community is still evolving |
| Internals | Python is supported by serializing/deserializing data between the Python worker process and the main Spark JVM process | Native serialization/deserialization; well suited for distributed environments |
| Execution mechanism | Interpreted | Compiled |
| Type safety | Less than Scala; dynamically typed | More than PySpark; statically typed |
| Concurrency | Does not support true multithreading (the GIL) | Powerful concurrency with the Akka actor model |
| New/latest feature availability | Depends on third parties; may not be available immediately | Immediately available |
| Overall opinion | Easy to learn and use, more libraries for visualization, and a more mature community | Concise, scalable, native to Spark, easy to refactor, fast, and moderately easy to use |
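The "Internals" row is worth making concrete: every value that crosses the JVM-to-Python boundary must be serialized on one side and deserialized on the other, which is overhead that Scala code never pays. The sketch below is only a model of that extra hop using `pickle`; real PySpark batches rows and uses its own serializers (and Apache Arrow for pandas UDFs), and `python_worker` is a hypothetical stand-in for the worker process.

```python
import pickle

def python_worker(serialized_batch: bytes) -> bytes:
    """Model of a PySpark Python worker: deserialize rows coming from the
    JVM, run the user's Python logic, serialize the results going back."""
    rows = pickle.loads(serialized_batch)   # deserialize on entry
    results = [r * 2 for r in rows]         # the actual user logic
    return pickle.dumps(results)            # serialize on exit

# The "JVM" side serializes a batch of rows, sends it across the process
# boundary, and deserializes the results that come back.
batch = pickle.dumps([1, 2, 3])
out = pickle.loads(python_worker(batch))
assert out == [2, 4, 6]
```

The round trip succeeds, but both `dumps`/`loads` pairs are pure overhead relative to running the same `r * 2` directly inside the JVM, which is the core of the performance gap in the table above.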
The preferable approach for data scientists is a hybrid one, i.e. using both Python and Scala.

Bottom line: Scala is faster and moderately easy to use, while Python is slower but very easy to use.

The performance issues in Python can be addressed by:
1. Using vectorized (pandas) UDFs in Python (https://databricks.com/session/vectorized-udf-scalable-analysis-with-python-and-pyspark)
2. Converting UDFs to SQL built-in functions wherever possible
3. In the worst case, implementing the specific step (in the data factory) in a Scala notebook
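Point 1 works because a plain Python UDF is invoked once per row, while a vectorized (pandas) UDF, declared with `pandas_udf` in `pyspark.sql.functions`, receives whole columns as pandas Series and can use vectorized operations. The sketch below models that difference with plain pandas (no Spark cluster required); the `add_tax_*` functions are hypothetical examples, not part of any API.

```python
import pandas as pd

def add_tax_scalar(price: float) -> float:
    # Called once per value: this is the shape of a row-at-a-time Python UDF,
    # paying Python-function-call overhead for every row.
    return price * 1.08

def add_tax_vectorized(prices: pd.Series) -> pd.Series:
    # Operates on the whole column at once: this is the shape of a pandas
    # (vectorized) UDF, where the multiply runs in optimized C code.
    return prices * 1.08

prices = pd.Series([10.0, 20.0, 30.0])
row_at_a_time = prices.apply(add_tax_scalar)   # per-row Python calls
vectorized = add_tax_vectorized(prices)        # one vectorized operation

# Same results, very different per-row cost on large columns.
assert row_at_a_time.equals(vectorized)
```

For point 2, the same logic needs no UDF at all: `df.withColumn("taxed", col("price") * 1.08)` keeps the computation entirely inside Spark's optimized execution engine, which is why converting UDFs to SQL expressions is usually the first thing to try.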