Python vs Scala in Spark

| Aspect | PySpark | Scala |
|---|---|---|
| Nativity w.r.t. Spark | APIs are provided, but not for all features | Spark is developed in Scala, so things are more natural; it is the de facto interface |
| Learning and use | Comparatively easy to learn and use | Harder to learn; moderately easy to use |
| Complexity | Less complex than Scala | More complex than PySpark; a lot of internal manipulation and conversion happens |
| Conciseness | More imperative in style | More concise; fewer lines of code allow faster development, testing, and deployment |
| Performance | Slower than Scala; internal conversions between the Python worker and the JVM are required | Often cited as up to 10x faster than PySpark, especially for RDD- and UDF-heavy workloads |
| Effective for | Smaller, ad hoc experiments | Production and engineering applications |
| Scalability | Limited | Scalable |
| Refactoring | Harder; more bugs get introduced | Hassle-free and much easier to refactor |
| Data-science tools (ML/NLP/deep learning) | Lots of Python-based libraries | Fewer libraries; MLlib is available in Scala |
| Streaming support | Less mature | Best pick for streaming |
| Visualization tools | Many already exist | Very few compared to Python |
| Versioning/compatibility dependencies | Python 2 or Python 3 | Latest Scala version |
| Community | Larger community than Scala's | Community is still evolving |
| Internals | Python is supported by serializing/deserializing data between the Python worker process and the main Spark JVM process | Native serialization/deserialization; well suited for distributed environments |
| Execution mechanism | Interpreted | Compiled |
| Type safety | Less than Scala; dynamically typed | More than PySpark; statically typed |
| Concurrency | Does not support true multithreading (the GIL) | Powerful concurrency with the Akka actor model |
| New/latest feature availability | Depends on third parties; may not be available immediately | Immediately available |
| Overall opinion | Easy to learn and use, more libraries for visualization, and a more mature community | Concise, scalable, native to Spark, easy to refactor, fast, and moderately easy to use |
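The "Internals" row is worth making concrete: every value that crosses the JVM-to-Python boundary must be serialized on one side and deserialized on the other, which is overhead that Scala code never pays. The sketch below is only a model of that extra hop using `pickle`; real PySpark batches rows and uses its own serializers (and Apache Arrow for pandas UDFs), and `python_worker` is a hypothetical stand-in for the worker process.

```python
import pickle

def python_worker(serialized_batch: bytes) -> bytes:
    """Model of a PySpark Python worker: deserialize rows coming from the
    JVM, run the user's Python logic, serialize the results going back."""
    rows = pickle.loads(serialized_batch)   # deserialize on entry
    results = [r * 2 for r in rows]         # the actual user logic
    return pickle.dumps(results)            # serialize on exit

# The "JVM" side serializes a batch of rows, sends it across the process
# boundary, and deserializes the results that come back.
batch = pickle.dumps([1, 2, 3])
out = pickle.loads(python_worker(batch))
assert out == [2, 4, 6]
```

The round trip succeeds, but both `dumps`/`loads` pairs are pure overhead relative to running the same `r * 2` directly inside the JVM, which is the core of the performance gap in the table above.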
The preferable approach for data scientists is a hybrid one, i.e. using both Python and Scala.

Bottom line: Scala is faster and moderately easy to use, while Python is slower but very easy to use.

The performance issues in Python can be addressed by:
1. Using vectorized (pandas) UDFs in Python (https://databricks.com/session/vectorized-udf-scalable-analysis-with-python-and-pyspark)
2. Converting UDFs to SQL built-in functions wherever possible
3. In the worst case, implementing the specific step (in the data factory) in a Scala notebook
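Point 1 works because a plain Python UDF is invoked once per row, while a vectorized (pandas) UDF, declared with `pandas_udf` in `pyspark.sql.functions`, receives whole columns as pandas Series and can use vectorized operations. The sketch below models that difference with plain pandas (no Spark cluster required); the `add_tax_*` functions are hypothetical examples, not part of any API.

```python
import pandas as pd

def add_tax_scalar(price: float) -> float:
    # Called once per value: this is the shape of a row-at-a-time Python UDF,
    # paying Python-function-call overhead for every row.
    return price * 1.08

def add_tax_vectorized(prices: pd.Series) -> pd.Series:
    # Operates on the whole column at once: this is the shape of a pandas
    # (vectorized) UDF, where the multiply runs in optimized C code.
    return prices * 1.08

prices = pd.Series([10.0, 20.0, 30.0])
row_at_a_time = prices.apply(add_tax_scalar)   # per-row Python calls
vectorized = add_tax_vectorized(prices)        # one vectorized operation

# Same results, very different per-row cost on large columns.
assert row_at_a_time.equals(vectorized)
```

For point 2, the same logic needs no UDF at all: `df.withColumn("taxed", col("price") * 1.08)` keeps the computation entirely inside Spark's optimized execution engine, which is why converting UDFs to SQL expressions is usually the first thing to try.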