Deep dive into Spark || Part-1

aditya goel
13 min read · Mar 1, 2024

Question → What’s so special about the Spark ?

Answer → If you have a really massive data set, one that could represent anything (web logs, genomics data, and so on), Spark can slice up that data, distribute the processing amongst a huge cluster of computers, and take a data-analysis problem that is just too big to run on one machine and divide and conquer it by splitting it up amongst multiple machines.

Question → How does Spark do the distributed processing ?

Answer → The way that Spark scales is that it can run on top of a cluster manager.

  • The actual Spark scripts are just everyday scripts written in Python, Java or Scala, and they behave just like any other script.
  • The driver program is what calls those scripts; it runs on your desktop or on one master node of your cluster.
  • Under the hood, when you run it, Spark knows how to take the work and farm it out to different computers in your cluster, or even to different CPU cores on the same machine.

Question → How does Spark take help of a Cluster Manager ?

Answer → Spark can actually run on top of different cluster managers. It has its own built-in cluster manager that you can use by default, but if you have access to a Hadoop cluster, there is a component of Hadoop called YARN that Spark can also run on top of, to distribute work amongst a huge Hadoop cluster if you have one available.

  • Divide & Conquer → Spark will split up the work and create multiple executors per machine; ideally you want one per CPU core.
  • Coordination → Spark does all the coordination through the cluster manager, while the driver program itself farms out the work and distributes it to the different nodes.
  • Fault-Tolerance → If one of your executors goes down, Spark can recover without stopping your entire job and making you start it all over.

So the beauty of Spark is that it scales out to entire clusters of computers and gives you horizontal scalability.
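
As a rough sketch (the script name my_job.py is hypothetical), the choice of cluster manager is simply the master URL you hand to spark-submit : "local[*]" uses every core of your one machine, while "yarn" lets Hadoop's YARN farm the executors out across the cluster :-

spark-submit --master "local[*]" my_job.py
spark-submit --master yarn my_job.py

The same choice can also be made inside the script itself via SparkConf().setMaster(...), as we'll see in the histogram example later in this post.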

Question → Why should we use Spark as compared to MapReduce ?

Answer → MapReduce has been around a lot longer, and the ecosystem and tools surrounding it are more mature at this point.

1.) Reason of Spark’s popularity →

  • Speed → One of the main reasons is that Spark is really fast : on the Apache website they claim that Spark can run workloads up to 100 times faster than Hadoop MapReduce in memory (in some cases), and around 10 times faster on disk.
  • Simplicity → What would take many lines of code and several different functions in a MapReduce job can often be done in just one line, and much more quickly and efficiently, using Spark.

2.) How is Spark so speedy → It achieves that performance by using what it calls a directed acyclic graph (DAG) engine : because transformations are lazy, Spark can look at the entire chain of operations you have asked for and work out an optimal execution plan before any work actually runs.
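
For instance, assuming an sc SparkContext is already available (for example from the pyspark shell we set up later in this post), a hedged sketch of that laziness looks like :-

rdd = sc.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)           # recorded in the lineage, nothing executes yet
filtered = doubled.filter(lambda x: x > 5)   # still just building the graph
print(filtered.toDebugString().decode())     # shows the lineage (the DAG) Spark has planned
print(filtered.collect())                    # only this action triggers actual execution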

Question → Who all uses Spark ?

Answer → A lot of well-known companies use Spark, although many big companies are fairly secretive about what they're doing internally :-

Question → What are various components of Spark ?

Answer → Spark is made up of many components :-

1.) Spark Core → It provides us with the Resilient Distributed Dataset (RDD) objects, and we can use those to distribute the loading and processing of data, i.e. the mapping and reducing of large data sets.

2.) Spark Streaming → It gives you the ability to analyze real-time data streams, for example a set of web logs coming in from a fleet of web servers, where the analysis needs to be continually updated.

3.) Spark SQL → It lets you run Spark on top of a Hive context, deal with structured data within Spark, and actually run SQL queries on top of it.

4.) MLlib → It's a series of machine learning algorithms, and if you're going to be doing any sort of machine learning or data mining with Spark, it contains a lot of useful tools to simplify common operations. For example, if you need to compute a Pearson correlation or to know the statistical properties of your dataset, MLlib makes that very easy to do (see the sketch after this list).

5.) GraphX → If you have a graph, for example a social graph of people that are friends with each other, or citations between documents and scholarly articles, or things of that nature, GraphX can help you make sense of those sorts of networks and give you high-level information about the properties of those graphs. All of this runs on top of Spark Core, so Spark has a lot to offer, and it's expanding all the time.
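
As a quick taste of the MLlib point above, here is a hedged sketch (the data is made up, and an sc SparkContext is assumed) of computing a Pearson correlation in one line :-

from pyspark.mllib.stat import Statistics

series_x = sc.parallelize([1.0, 2.0, 3.0, 4.0])
series_y = sc.parallelize([2.0, 4.0, 6.0, 8.0])
print(Statistics.corr(series_x, series_y, method="pearson"))   # 1.0 for these perfectly correlated series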

Question → What’s an RDD ?

Answer → The RDD, the Resilient Distributed Dataset object in Spark, is the core object that everything in Spark revolves around, even for the libraries built on top of Spark, like Spark SQL or MLlib.

  • Abstraction → So, fundamentally it's a data set, an abstraction for a giant set of data.
  • The beauty is that RDD's are both distributed and resilient → They can be spread out across an entire cluster of computers (which may or may not be running locally), and they can handle the failure of specific executor nodes in your cluster automatically, keeping the job going even if one node shuts down and redistributing the work as needed when that occurs. That's what the Spark cluster manager does for you.

Important conclusion → RDD represents a really big data set and you can use the RDD object to transform that data set from one set of data to another, or to perform actions on that data set to actually get the results you want from it.

Question → How do you install the Apache-Spark ?

Answer → Let’s follow along following steps to install Apache-Spark :-

Step #1.) Install Homebrew if you don’t have it already by entering this from a terminal prompt:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Step #2.) Enter following command to install Apache-Spark in terminal prompt :-

brew install apache-spark

Note: In order to see the installation directory run the following command in terminal prompt :-

brew info apache-spark

Step #3.) Launch pyspark and verify whether Spark was installed properly :-
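
For instance, a quick sanity check inside the pyspark shell could look like this (the exact version string will depend on your installation) :-

>>> sc.version                           # the shell creates the SparkContext "sc" for you automatically
'3.x.x'
>>> sc.parallelize([1, 2, 3]).count()    # a tiny end-to-end job to confirm everything works
3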

Question → What’s happening under the hood in Step #3 above ?

Answer → Here, following things are happening :-

Step #1.) Pyspark gives us an interactive shell for Spark, which automatically provides an "sc" object that we can use to run operations on.

Step #2.) Once you have that Spark Context, as shown above, you can use it for creating RDD's.

Note:- The "sc" object stands for Spark Context; it is sort of the starting point that you have from a programming standpoint, and it gives you the methods you need to actually create an RDD.

Question → In Apache-Spark, what are the other methods by which you can create an RDD ?

Answer → There are various approaches to create RDDs :-

Approach #1.) We can use the "sc.parallelize" function to take a hard-coded set of data and make an RDD out of it, OR we can use "sc.textFile" to create an RDD object from the Spark Context; either way, we can then use that RDD object going forward and process it.
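
A hedged example of both creation methods (the file path below is made up) :-

nums = sc.parallelize([1, 2, 3, 4, 5])          # RDD built from a hard-coded Python list
lines = sc.textFile("file:///tmp/sample.txt")   # RDD with one entry per line of the file

print(nums.count())
print(lines.first())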

Approach #2.) We can create an RDD from an S3 or HDFS URI, as well as from a file URI, to load up data from a cluster or a distributed file system, or from a simple file sitting on the same machine as your driver script.
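
For example (the URIs below are hypothetical), the scheme in the path tells Spark where the data actually lives :-

hdfs_lines = sc.textFile("hdfs:///data/weblogs/part-00000")        # a file sitting on an HDFS cluster
s3_lines = sc.textFile("s3a://my-bucket/weblogs/2024-03-01.log")   # needs the Hadoop AWS connector configured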

Approach #3.) We can also create RDD’s from Hive, if you have a Hive context object that’s already been connected to an existing Hive repository.

Note:- Hive is basically another thing that runs on top of Hadoop, used for data warehousing.
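
A hedged sketch, assuming an existing Hive metastore and a made-up table name, using the older HiveContext API (newer code would build a SparkSession with enableHiveSupport()) :-

from pyspark.sql import HiveContext

hive_ctx = HiveContext(sc)
ratings_df = hive_ctx.sql("SELECT * FROM ratings")   # a DataFrame backed by the Hive table
ratings_rdd = ratings_df.rdd                         # its .rdd attribute exposes the rows as an RDD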

Approach #4.) We can also create RDD's from things like JDBC. You can also use popular NoSQL databases like Cassandra, and there are interfaces for things like HBase and Elasticsearch, with more being added all the time.
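
A hedged JDBC sketch (the connection details are made up, and you would also need the matching JDBC driver jar on Spark's classpath) :-

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://dbhost:5432/moviedb")
           .option("dbtable", "ratings")
           .option("user", "reader")
           .option("password", "secret")
           .load())
jdbc_rdd = jdbc_df.rdd                               # again, the rows are available as an RDD if needed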

Conclusion → Basically any data format that you can access from Python or from Java, you can access through Spark as well.

Question → In Apache-Spark, show an example of creating an RDD and performing a simple transformation on it ?

Answer → Here is the code snippet to do a simple transformation in Spark :-
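
Based on the line-by-line walkthrough that follows, the snippet is essentially :-

rdd = sc.parallelize([1, 2, 3, 4])       # CodeLine #1
result = rdd.map(lambda x: x * x)        # CodeLine #2
print(result.collect())                  # CodeLine #3 : prints [1, 4, 9, 16]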

CodeLine #1.) rdd = sc.parallelize([1, 2, 3, 4]): This creates an RDD (Resilient Distributed Dataset) named rdd by parallelizing the Python list [1, 2, 3, 4]. Parallelization distributes the elements of the list across the nodes in the Spark cluster, creating a distributed dataset.

CodeLine #2.) result = rdd.map(lambda x: x * x) : This applies a map transformation to the RDD rdd.

  • The map transformation takes a function (in this case, a lambda function lambda x: x * x) and applies it to each element of the RDD independently, producing a new RDD with the transformed elements. In this specific lambda function, each element x is squared (x * x).
  • The map transformation has been applied to the RDD named rdd, resulting in a new RDD named result. However, this statement does not trigger any computation; it simply represents the RDD transformation lineage and at this stage, merely a computation plan is created, which indicates that each element in the RDD should be squared.

CodeLine #3.) result.collect(): This triggers an action called collect on the RDD result.

  • At this line, we trigger the execution of the computation plan by calling the collect() action on the result RDD.
  • At this point, Spark evaluates the transformation (map) on each element of the RDD rdd. Specifically, it squares each element, resulting in the elements [1, 4, 9, 16]. These squared elements are then collected and returned as a list. This is when the square operation actually happens. In this case, it returns [1, 4, 9, 16], which are the squared values of [1, 2, 3, 4].

Question → In Apache-Spark, what are other transformation methods possible ?

Answer → The basic operations you can do on RDD’s are as follows. This is not a complete list but these are some of the most common operations you can do on an RDD :-
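
A few hedged examples of the most common transformations (map, flatMap, filter, distinct, union); all of them are lazy, so nothing actually runs until an action is called :-

nums = sc.parallelize([1, 2, 2, 3, 4])
squared = nums.map(lambda x: x * x)                                        # one output element per input element
evens = nums.filter(lambda x: x % 2 == 0)                                  # keep only elements matching the predicate
uniques = nums.distinct()                                                  # remove duplicate values
words = sc.parallelize(["to be", "or not"]).flatMap(lambda s: s.split())   # can emit many elements per input element
combined = nums.union(sc.parallelize([5, 6]))                              # merge two RDDs together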

Question → In Apache-Spark, what are various action methods possible ?

Answer → The basic actions you can do on RDD’s are as follows. This is not a complete list but these are some of the most common actions you can do on an RDD :-
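
A few hedged examples of the most common actions; unlike transformations, each of these actually triggers a job and brings results back to the driver :-

nums = sc.parallelize([1, 2, 2, 3])
print(nums.collect())                      # [1, 2, 2, 3] : the whole RDD as a Python list
print(nums.count())                        # 4
print(nums.countByValue())                 # {1: 1, 2: 2, 3: 1} as a plain Python mapping
print(nums.take(2))                        # [1, 2]
print(nums.reduce(lambda a, b: a + b))     # 8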

Question → In Apache-Spark, show a real histogram example ?

Answer → Let’s consider the example of MoviesData taken from here.

Step #1.) From the aforementioned link, let’s download the movies data, which shows the ratings that each movie has received. Our data has around 100K entries :-

Step #2.) Now, we shall write the code to compute how many movies have got a 5-star rating, a 4-star rating, and so on (the full script is shown below and then explained line by line) :-
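
Putting together the fragments that are walked through below, the whole script looks roughly like this (the file path is the author's local path and will differ on your machine) :-

from pyspark import SparkConf, SparkContext
import collections

conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)

lines = sc.textFile("/Users/adityagoel/Documents/SPARK-LEARN/ml-100k/u.data")
ratings = lines.map(lambda x: x.split()[2])
result = ratings.countByValue()

sortedResults = collections.OrderedDict(sorted(result.items()))
for key, value in sortedResults.items():
    print("%s %i" % (key, value))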

Step #3.) Output looks as follows :-

  • There are around ~34K ratings which are 4-star ratings.
  • Here, spark-submit is just a wrapper script around Spark that submits our Python script to it; Spark then works out all the magic to distribute the work and actually feed it to the Spark framework itself (see the command sketch just below).
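
For reference, submitting the job from a terminal looks roughly like this (the script's filename is an assumption) :-

spark-submit RatingsHistogram.py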

Let’s now understand this code piece :-

CodeLine #1 & 2.) The first couple of lines are just boilerplate stuff.

from pyspark import SparkConf, SparkContext
import collections
  • Imports → One thing you’ll see in every Python Spark script is the import statement that brings in SparkConf and SparkContext from the pyspark library that Spark includes → You will, at a minimum, need those two objects.
  • SparkContext → It’s sort of the fundamental starting point that the Spark framework gives you to create RDD’s from.
  • SparkConf → You can’t create a SparkContext without a SparkConf, which allows you to configure the SparkContext and tell it things like : do I want to run on just one computer, or on a cluster, and if so in what way?
  • collections → And as a bit of housekeeping, we are importing the collections package from Python.

CodeLine #3 & 4.)

conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)
  • Conf → We create the SparkConf object and tell it to set its master node as the local machine. So basically this says we’re going to run on the local box only, not on a cluster, just on this one system. “local” means we’re not really doing any sort of distribution of the data in this case; it just runs in a single process, to keep things simple for now.
  • Name → Finally, we need to set the app name as part of this call. We’re going to call it RatingsHistogram, so that if you look in the Spark Web UI while the job is running, you can look the job up by its name there and identify it. It’s good practice to give your job a name.
  • SparkContext → Next, we actually create our SparkContext using that configuration object and assign it to something called sc; by convention we will always call it sc, for SparkContext. Note that this is going to look very similar in every Spark script that we write in Python.

CodeLine #5.) lines = sc.textFile("/Users/adityagoel/Documents/SPARK-LEARN/ml-100k/u.data") : So now we’re going to load up our data file. Also, this is a very common way of creating an RDD, i.e. through the sc.textFile method.

  • So, this is actually going to go out to our local file system and load up the data file that includes all of the movie ratings data.
  • "textFile" breaks up the input file line by line, so that every line of text corresponds to one value in your RDD. Once the file is fully loaded, the RDD will contain one entry for every line of the file.

CodeLine #6.) ratings = lines.map(lambda x: x.split()[2]) : The lambda syntax is a shorthand for passing the function that we want our map to apply.

  • Here, we extract field number 2 from the data. Now, remember that in programming we always start counting with zero. For each line of input, splitting it on whitespace into its individual fields results in a list of values such as 196, 242, 3 and 881250949.
  • These numbers represent : the user ID, the movie ID, the rating value, and a timestamp.
  • So this map function splits up each line into its individual fields based on whitespace, then takes field number 2, which is nothing but the actual rating value, and puts it into a new RDD that we’re calling “ratings”.

CodeLine #7.) result = ratings.countByValue() : Finally, we’re going to perform an action on our RDD.

  • So we’ve transformed the RDD into the form that we want, i.e. we’ve taken our raw input data and created an RDD named ratings which contains the ratings as its values.
  • And now we’re going to call an action method, “countByValue”, on that RDD. What it does is count up how many times each unique value in the RDD occurs.
  • Example → In the example shown below, the rating 3 occurs twice, the rating 1 occurs twice, and the rating 2 occurs only once, and that’s the output we’ll get.
  • That’s what will end up in our result object. Note that the object “result” is just a plain old Python object at this point and is no longer an RDD.
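
As a hedged illustration of that behaviour (the ordering of keys in the printed result may vary) :-

>>> sc.parallelize([3, 1, 3, 1, 2]).countByValue()
defaultdict(<class 'int'>, {3: 2, 1: 2, 2: 1})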

CodeLine #8.)

sortedResults = collections.OrderedDict(sorted(result.items()))
for key, value in sortedResults.items():
    print("%s %i" % (key, value))
  • Finally, in this code piece, we sort the results. All it’s doing is using the collections package from Python to create an ordered dictionary that sorts the results based on the key, which is the actual rating itself.
  • And then we iterate through every key/value pair of those results and print them out one at a time.

That’s all in this blog. We shall see you in next blog. Happy Learning.
