Deep Dive into Apache Spark || Part-2

aditya goel
7 min readMar 5, 2024

If you are landing here directly, it’s strongly advisable to read through this blog first.

Question #1 → In Apache-Spark, show a real example of finding the average number of friends by age ?

Answer → Let’s consider the example from a social network of friends, for which the data looks something like this :-

For our task of finding the average number of friends by age, let’s take an example : What’s the average number of friends for the average 33 year old in our data set ? The code for the same would look like below :-

Step #1.) Importing Necessary Modules → from pyspark import SparkConf, SparkContext : This line imports the required modules from PySpark, which is a Python API for Apache Spark.

Step #2.) Configuring the SparkContext → Here, we create a SparkConf object to set configurations for the Spark application.

  • We set the master node to "local" which means Spark will run locally on one machine.
  • We set the application name to "FriendsByAge". This can be any random name that we like, but it should be contextual.
  • Then, we create a SparkContext (sc) using the configuration.
conf = SparkConf().setMaster("local").setAppName("FriendsByAge")
sc = SparkContext(conf = conf)

Step #3.) Method to parse each Line → This function parseLine() takes a line of input as a parameter, splits it by commas (as it's a CSV file as shown above), extracts the age (as it's at index 2) and the number of friends (assuming it's at index 3), converts them to integers, and returns a tuple of (age, numFriends).

def parseLine(line):
fields = line.split(',')
age = int(fields[2])
numFriends = int(fields[3])
return (age, numFriends)

Step #4.) Loading Data from File → This line reads the text file containing the data. It creates an RDD (Resilient Distributed Dataset) named lines, where each element of the RDD represents a line from the text file.

lines = sc.textFile("/Users/adityagoel/Documents/SPARK-LEARN/Example2/socialCircle.csv")

Step #5.) Mapping the Data → This line applies the parseLine() function to each element (line) of the RDD lines.

  • It transforms each line of text into a tuple (age, numFriends).
  • Here, Spark transformations like map() create a new RDD as a result. So, when you apply the map() operation on the lines RDD using parseLine function, It results in a new RDD named rdd.
  • Important concept → Here, rdd is a key-value pair RDD where the keys are “age” and the values are the “number of friends”.
rdd = lines.map(parseLine)

Step #5.) Calculating Totals by Age → This line performs two operations :-

  • First, it maps each value (age, numFriends) to (age, (numFriends, 1))
  • Then, it reduces by key (age) by summing the values of friends and the count of records for each age.
  • Note here that, totalsByAge is an RDD as well which has resulted from the transformation operations applied to the rdd RDD. It contains key-value pairs where each key is an “age”, and each value is a tuple (totalFriends, count).
totalsByAge = rdd.mapValues(lambda x: (x, 1))
.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

Step #6.) Calculating Averages by Age → This line maps each value (age, (totalFriends, count)) to (age, totalFriends / count), which calculates the average number of friends for each age.

  • In this line, totalsByAge is an RDD resulting from the previous transformation operations. It contains key-value pairs where each key is an age, and each value is a tuple (totalFriends, count).
  • Now, the mapValues() transformation is applied to totalsByAge. It applies the provided lambda function (lambda x: x[0] / x[1]) to each value (tuple) in the key-value pairs. This lambda function calculates the average number of friends for each age by dividing the total number of friends by the count of records for that age.
  • Thus, the RDD averagesByAge is contains the key-value pairs where each key is an “age”, and each value is the “average number of friends” for that age.
averagesByAge = totalsByAge.mapValues(lambda x: x[0] / x[1])

Step #6.) Collecting Results → This line collects all the results from the RDD averagesByAge and stores them in the variable results.

  • In this line, collect() is an action that triggers the execution of all the previous transformations on the RDD averagesByAge. It collects all the elements of the RDD averagesByAge from the distributed nodes in the Spark cluster and brings them back to the driver program as a local Python list.
  • So, results is a Python list that contains the collected results from the RDD averagesByAge. Each element in results is a tuple representing an age and its corresponding average number of friends. This list is now available for further processing or analysis within the Python program.
results = averagesByAge.collect()

Step #7.) Printing Results → This loop iterates over the collected results and prints them. Each result is a tuple containing an age and its corresponding average number of friends.

for result in results:     print(result)
print(result)

The output of the above program looks something like this :-

Question #2 → In Apache-Spark, show a real example of finding the Minimum Temperature in the year for the data as shown below ?

Answer → The code for this problem would look something like this :-

Step #1.) Parsing Function → This function parseLine() takes a line of input as a parameter, splits it by commas (as it's a CSV) and extracts following information :-

  • The station ID.
  • The entry type.
  • The Temperature → The temperature is converted from Celsius to Fahrenheit. It returns a tuple of (stationID, entryType, temperature).
def parseLine(line):     
fields = line.split(',')
stationID = fields[0]
entryType = fields[2]
temperature = float(fields[3]) * (9.0 / 5.0) + 32.0
return (stationID, entryType, temperature)

Step #2.) Loading Data from File → This line reads the text file containing the weather data. It creates an RDD (Resilient Distributed Dataset) named lines, where each element of the RDD represents a line from the text file.

lines = sc.textFile("/Users/adityagoel/Documents/SPARK-LEARN/Example3/1800.csv")

Step #3.) Parsing the Lines → This line applies the parseLine() function to each element (line) of the RDD lines.

  • It transforms each line of text into a tuple (stationID, entryType, temperature).
  • The parsedLines is also an RDD where each element is a tuple containing three values: “stationID”, “entry type”, and “temperature”. So, it's essentially a key-value pair RDD where the first element (station ID) serves as the key, and the tuple (entryType, temperature) serves as the value.
parsedLines = lines.map(parseLine)

Step #4.) Filtering Minimum Temperatures → This line filters the parsed lines to keep only those with “TMIN” in the entryType field i.e. Field №1 i.e. entryType, indicating minimum temperatures.

  • This line filters the parsed lines to keep only those with “TMIN” in the entryType field, indicating minimum temperatures.
  • Here, x[1] represents the entry type. So, minTemps RDD will contain only the entries related to minimum temperatures.
minTemps = parsedLines.filter(lambda x: "TMIN" in x[1])

Step #5.) Extracting Station ID and Temperature → This line maps each element of minTemps RDD to a tuple (stationID, temperature), discarding the entryType field, because we anyways have only those entries now which contains TMIN in their entryType field.

stationTemps = minTemps.map(lambda x: (x[0], x[2]))

Step #6.) Reducing by Key to Find Minimum Temperature per Station → This line reduces the RDD stationTemps by key (station ID) to find the minimum temperature for each station.

  • The lambda function lambda x, y: min(x, y) is applied to each pair of temperatures for the same station to find the minimum.
  • Here, we are saying that for whenever we try to combine two observations together for minimum temperature for a given station we’re gonna call the min function to actually only take the minimum value between those two and as we keep feeding more and more observations for each weather station in there only the smallest value the minimum value will survive in the end.
minTemps = stationTemps.reduceByKey(lambda x, y: min(x, y))

Step #7.) Collecting Results → This line collects all the results from the RDD minTemps and stores them in the variable results. At this point, the Spark job is executed, and the results are returned to the driver program.

results = minTemps.collect()

Step #8.) Printing Results → This loop iterates over the collected results and prints them. Each result is a tuple containing a station ID and its corresponding minimum temperature in Fahrenheit. The temperature is formatted with two decimal places and the unit "F".

for result in results:     
print(result[0] + "\t{:.2f}F".format(result[1]))

The output of this code looks like this :-

That’s all in this blog. We shall see you in next one. Pl clap, if you liked it.

--

--

aditya goel

Software Engineer for Big Data distributed systems