Tuesday, September 8, 2020

Scala vs. Python for Apache Spark

Apache Spark is a great choice for cluster computing. It includes language APIs for Scala, Java, Python, and R, along with libraries for SQL, streaming, machine learning, and graph processing. With this wide range of functionality, many developers decide to move into distributed application development with Apache Spark.

The first thing to decide is where to run it. For most, this is an easy call: run it on Apache Hadoop YARN. After that simple decision, developers face the question of which language to use with Apache Spark. If every developer is free to pick their own language, or if multiple languages must be supported, the result is a proliferation of code and tooling. The R API is not rich enough to carry a whole project, and for most companies Java means too much boilerplate and an unwieldy interface. That narrows the choice to Python or Scala, so let's take a closer look at both languages and decide which one fits best.

Scala has a key advantage: it is the language the Apache Spark platform itself is written in. Scala runs on the JVM and is as powerful as Java, yet the code tends to look cleaner. Thanks to the JVM, an application written in Scala can grow to an impressive size. For many applications this is important, but Apache Spark already handles distribution through Akka and YARN, so it is not essential here: you set a few parameters, and your Spark application runs distributed regardless of the language you work in. In this situation, then, the advantage is hardly decisive.
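To make that concrete, here is a minimal PySpark sketch, assuming a YARN cluster; the executor count, memory values, and input path are hypothetical examples, and in practice these settings are just as often passed to spark-submit instead of being hard-coded:

```python
from pyspark.sql import SparkSession

# The distribution details live entirely in these settings; the
# executor count and memory below are hypothetical example values.
spark = (
    SparkSession.builder
    .appName("word-count")
    .master("yarn")                            # run distributed on YARN
    .config("spark.executor.instances", "4")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)

# The job itself is ordinary Spark code; it runs unchanged whether
# the master is "local[*]" on a laptop or "yarn" on a cluster.
counts = (
    spark.read.text("hdfs:///data/input.txt")  # hypothetical path
    .rdd.flatMap(lambda row: row.value.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
print(counts.take(10))
spark.stop()
```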

Python is one of the key languages in the world of Spark applications. It is also extremely easy to learn; many schools even teach it to children. There is a huge amount of sample code, books, articles, libraries, and documentation, which makes it easy to dive into Python.

PySpark is the standard tool for working with Spark from Python. Since Apache Zeppelin actively supports PySpark, and Jupyter and IBM DSX use Python as their primary language, you have many tools for developing and testing code, sending queries, and visualizing data. Knowledge of Python is becoming a must for Data Scientists, Data Engineers, and application developers who work with streaming, and Python integrates well with Apache NiFi.
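As a sketch of that notebook workflow, assuming a hypothetical events.json dataset, a Zeppelin or Jupyter cell might send a SQL query through PySpark and pull the small aggregated result into pandas for plotting:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notebook-demo").getOrCreate()

# Register a DataFrame as a temporary view so it can be queried in SQL.
# The path and column names are hypothetical placeholders.
events = spark.read.json("hdfs:///data/events.json")
events.createOrReplaceTempView("events")

daily = spark.sql("""
    SELECT date, COUNT(*) AS n_events
    FROM events
    GROUP BY date
    ORDER BY date
""")

# toPandas() collects the aggregated (hence small) result to the driver,
# where any Python plotting library can take over.
pdf = daily.toPandas()
pdf.plot(x="date", y="n_events")
```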

A key benefit of Python is its many libraries for machine learning, natural language processing, and deep learning. You also don't need to compile your code or worry about complex JVM packaging: packages are typically installed with Anaconda or pip, and Apache Hadoop and Apache Spark clusters managed with tools like Apache Ambari usually ship with Python and common libraries preinstalled.

There is a wide variety of great libraries for Python: NLTK, TensorFlow, Apache MXNet, TextBlob, spaCy, and NumPy.
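These libraries plug directly into PySpark. As a minimal sketch, assuming TextBlob has been pip-installed on the executors and using a hypothetical "review" column, a sentiment score can be applied with an ordinary Python UDF:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from textblob import TextBlob  # installed via pip/Anaconda, no JVM packaging

spark = SparkSession.builder.appName("udf-demo").getOrCreate()

# Wrap an ordinary Python function as a Spark UDF.
@udf(returnType=FloatType())
def sentiment(text):
    # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive).
    return float(TextBlob(text).sentiment.polarity) if text else None

# Hypothetical sample data with a single "review" column.
df = spark.createDataFrame(
    [("the service was excellent",), ("this is terrible",)],
    ["review"],
)
df.withColumn("polarity", sentiment("review")).show()
```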

Python virtues

· PySpark is featured in many examples and tutorials.

· Most libraries ship a Python API first and foremost.

· Python is a mature language.

· Python is widely used and continues to grow in popularity.

· Deep learning libraries support Python.

· Easy to use.

Disadvantages of Python

· Sometimes you need Python 2, sometimes Python 3.

· Not as fast as Scala (though Cython can mitigate this; see the vectorized-UDF sketch after this list).

· Some libraries are difficult to build.
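On the speed point, Cython is one route; inside PySpark itself, a common mitigation is the vectorized pandas UDF, which ships data between the JVM and Python in Arrow batches instead of making a Python call per row. A minimal sketch, assuming Spark 3.0+ with PyArrow installed (the unit conversion is just an illustrative example):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

# A vectorized UDF receives whole pandas Series (Arrow batches), so the
# per-row Python call overhead of a plain UDF largely disappears.
@pandas_udf(DoubleType())
def fahrenheit(celsius: pd.Series) -> pd.Series:
    return celsius * 9.0 / 5.0 + 32.0

df = spark.range(0, 100).withColumnRenamed("id", "celsius")
df.withColumn("fahrenheit", fahrenheit("celsius")).show(5)
```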

Scala virtues

· Runs on the JVM.

· Full featured IDEs and unit testing.

· Excellent serialization formats.

· Can use Java libraries.

· Fast.

· Akka.

· Spark Shell.

· Advanced streaming capabilities.

Disadvantages of Scala

· Not as widespread.

· Smaller knowledge base.

· Many Java developers are not ready to switch to it.

· Must be compiled to work with Apache Spark.
