Python vs Scala in Spark


| | PySpark | Scala |
|---|---|---|
| Native to Spark? | APIs are provided, but not for all features | Spark is developed in Scala, so everything feels natural; it is the de facto interface |
| Learning and use | Comparatively easy to learn and use | Harder to learn, but easy to use |
| Complexity | Less complex than Scala | More complex than PySpark; many implicit conversions and manipulations happen under the hood |
| Conciseness | More imperative in style | More concise; fewer lines of code allow faster development, testing, and deployment |
| Performance | Slower than Scala, since internal conversions between Python and the JVM are required | Often cited as up to 10x faster than PySpark |
| Effective for? | Smaller, ad-hoc experiments | Production and engineering applications |
| Scalability | Limited | Scalable |
| Refactoring | Harder; refactoring tends to introduce bugs | Hassle-free and much easier to refactor |
| Data-science tools (ML/NLP/deep learning) | Lots of libraries based on Python | Fewer libraries; MLlib is available in Scala |
| Streaming support | Less mature | Best pick for streaming |
| Visualization tools | A lot of them already exist | Very few compared to Python |
| Versioning/compatibility dependencies | Python 2 or Python 3 | Latest Scala version |
| Community | Larger community compared to Scala's | Community is still evolving |
| Internals | Python is supported by serializing/deserializing data between the Python worker processes and the main Spark JVM process | Native serialization/deserialization; well suited for distributed environments |
| Execution mechanism | Interpreted | Compiled |
| Type safety | Less than Scala: dynamically typed | More than PySpark: statically typed |
| Concurrency | Does not support true multithreading (due to the GIL) | Powerful concurrency with the Akka actor model |
| New/latest feature availability | Depends on third parties; may not be available immediately | Immediately available |
| Overall opinion | Easy to learn and use, more libraries for visualization, and a much more mature community | Concise, scalable, native to Spark, easy to refactor, fast, and moderately easy to use |
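The "Internals" row is the root of most of PySpark's overhead: data crossing between the JVM and the Python worker processes must be serialized and deserialized (by default with pickle), a cost Scala code never pays. A minimal stdlib-only sketch of that round-trip (the row shape and count here are illustrative, not Spark's actual wire format):

```python
import pickle
import time

# Illustrative dataset: 100k small rows, as dicts.
rows = [{"id": i, "value": i * 2.5} for i in range(100_000)]

start = time.perf_counter()
blob = pickle.dumps(rows)       # analogous to the JVM -> Python worker boundary
restored = pickle.loads(blob)   # analogous to the Python worker deserializing
elapsed = time.perf_counter() - start

# The data survives the round-trip unchanged, but CPU time and memory
# were spent purely on moving it across the process boundary.
assert restored == rows
print(f"round-trip of {len(rows)} rows: {elapsed:.3f}s, {len(blob)} bytes")
```

Scala/Java code operating on the same data inside the JVM skips this boundary entirely, which is where much of the performance gap in the table comes from.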



For data scientists, a hybrid approach (using both Python and Scala) is preferable.
Bottom line: Scala is faster and moderately easy to use, while Python is slower but very easy to use.


PySpark's performance issues can be addressed by:

1. Converting UDFs to SQL built-in functions wherever possible.
2. In the worst case, implementing the specific step (in Data Factory) in a Scala notebook.
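Converting UDFs to SQL built-ins helps because a Python UDF forces the interpreter to touch every row (with serialization at the boundary), while a built-in SQL function runs in optimized JVM code. A toy stdlib-only analogy of the same shift, with no Spark required (the function names here are made up for illustration):

```python
# "UDF style": a Python-level loop, one interpreter step per element --
# analogous to a Python UDF being invoked row by row.
def udf_style_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

values = list(range(1_000_000))

# "Built-in style": sum() runs in C, the way a Spark SQL built-in
# function runs inside the JVM without crossing into Python per row.
builtin_style = sum(values)

assert udf_style_sum(values) == builtin_style
print(builtin_style)  # -> 499999500000
```

Timing the two versions (e.g. with `timeit`) shows the built-in winning by a wide margin; in Spark the gap is larger still, because the per-row path also pays serialization costs.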

