Wikipedia word counter using Airflow and Spark — Part 2 — Spark job

Anurag Chatterjee
3 min read · Jan 16, 2021

This post is a continuation of my previous post, where I showed how an Airflow DAG can be developed to schedule the activities in a typical data pipeline, including Spark jobs. This post aims to serve as an independent guide on writing a Spark job in Scala that counts how many times each word appears in a text. The word counter built here is slightly more involved than the usual example: it removes stop words, converts the text to lower case, and strips punctuation, while using only the high-level DataFrame API.

The objective

Read content from a source into a Spark DataFrame, then clean it by converting to lower case and removing stop words. Count the occurrences of the words in the cleaned content to create a new Spark DataFrame. Finally, write this word-count DataFrame to a sink (a file here) for further processing. Along the way, see how the job can be packaged with sbt and submitted as a Spark job with the appropriate parameters.

Word counts output for Malaysia’s Wikipedia article

Set-up

To replicate the code and the outcome, you will need the Spark binaries, IntelliJ IDEA with the Scala plugin, and sbt set up on your machine along with their dependencies.

Complete code

The complete code for the Spark job, along with the build.sbt, is provided below for reference. A walkthrough of the code follows.

Code for the Spark job
build.sbt to list the Scala library dependencies
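In case the embedded gist does not render, a minimal build.sbt for this project might look like the sketch below. The Scala and Spark versions are assumptions inferred from the spark-submit command shown later (Scala 2.12, Spark 2.4.6), and spark-mllib is assumed to be needed for the tokenizer and stop word remover used in the processing step.

name := "wordcounter"

version := "0.1"

scalaVersion := "2.12.10"

// Spark itself is marked "provided" because spark-submit supplies it at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.4.6" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.4.6" % "provided"
)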

Code walkthrough

1. Get Spark DataFrame from the input file — getDataFramFromFile

The first step is to read the data in the input file and create a Spark data frame from it. This pattern helps in effective unit testing as we can mock this method and create a data frame of our choice for further processing without worrying about an actual file in the unit tests.
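The original gist is not reproduced here, but a minimal sketch of such a method, assuming the input is a plain text file read into the default value column, could be:

import org.apache.spark.sql.{DataFrame, SparkSession}

def getDataFramFromFile(spark: SparkSession, inputPath: String): DataFrame =
  // each line of the input text file becomes one row in a single "value" column
  spark.read.text(inputPath)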

2. Process DataFrame to remove stop words — processTextDataFrame

This method lowers the case of the text and removes all non-alphabetical content. It then tokenizes the text, removes all the stop words, and returns the resulting data frame.
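A sketch of this processing, assuming the input column is named value and using Spark ML's Tokenizer and StopWordsRemover (the actual implementation may differ in its details):

import org.apache.spark.ml.feature.{StopWordsRemover, Tokenizer}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lower, regexp_replace}

def processTextDataFrame(textDF: DataFrame): DataFrame = {
  // lower-case the text and keep only letters and whitespace
  val cleanedDF = textDF.withColumn(
    "value",
    regexp_replace(lower(col("value")), "[^a-z\\s]", " ")
  )

  // split each line into an array of tokens
  val tokenizer = new Tokenizer().setInputCol("value").setOutputCol("tokens")
  val tokenizedDF = tokenizer.transform(cleanedDF)

  // drop common English stop words ("the", "and", ...) from the token arrays
  val remover = new StopWordsRemover().setInputCol("tokens").setOutputCol("words")
  remover.transform(tokenizedDF).select("words")
}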

3. Count the occurrence of words — countWords

Here the data frame without the stop words is processed and a new data frame is created with each word and its count in the content. The explode method separates the words in each sentence into separate rows. A group by followed by the count aggregate groups identical words together and counts their occurrences. The resulting word count data frame is sorted in descending order by the count of the words.
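Expressed with the DataFrame API, and assuming the column names from the previous sketch, this step might look roughly like:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc, explode}

def countWords(wordsDF: DataFrame): DataFrame =
  wordsDF
    .select(explode(col("words")).as("word")) // one row per word
    .filter(col("word") =!= "")               // drop empty tokens left over from cleaning
    .groupBy("word")
    .count()                                  // word -> number of occurrences
    .orderBy(desc("count"))                   // most frequent words first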

Tying it all together in the main method

The word count data frame is finally written to a CSV file after being reduced to a single partition with the coalesce method in the writeWordCountsDFToCSV method. This ensures a single output file is created with the word counts.
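A minimal sketch of such a write method (the header and overwrite options are assumptions):

import org.apache.spark.sql.DataFrame

def writeWordCountsDFToCSV(wordCountsDF: DataFrame, outputPath: String): Unit =
  wordCountsDF
    .coalesce(1) // collapse to a single partition so a single CSV part file is written
    .write
    .option("header", "true")
    .mode("overwrite")
    .csv(outputPath)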

In the main method, after initializing the Spark session, the input file location is read from the arguments, the output file location is derived from the input location, and then the methods described above are called one after the other, with the output of each method serving as the input to the next.
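Putting the pieces together, the main method could look roughly like the following; the way the output location is derived from the input location here is a hypothetical choice, not necessarily the one used in the original code:

import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val spark = SparkSession.builder()
    .appName("WikiContentWordCounter")
    .getOrCreate()

  val inputPath = args(0)
  // hypothetical convention: write the counts next to the input file
  val outputPath = inputPath.stripSuffix(".txt") + "_word_counts"

  val rawDF     = getDataFramFromFile(spark, inputPath)
  val cleanedDF = processTextDataFrame(rawDF)
  val countsDF  = countWords(cleanedDF)
  writeWordCountsDFToCSV(countsDF, outputPath)

  spark.stop()
}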

Package and submit the Spark job with the input parameter

In order to submit the Scala Spark job, a JAR needs to be created. Navigate to the location of build.sbt and run the command sbt package.

This creates the JAR file inside the scala-version folder within the 'target' folder. The name of the JAR file is the name specified in build.sbt, with the Scala version and the application version (also specified in build.sbt) appended to it.

In order to submit the job for execution, a spark-submit command following the below pattern needs to be issued:

%SPARK_HOME%/bin/spark-submit --class "class_name" spark_config JAR_location application_arguments

Sample usage of the above command, when the data is stored in /home/bravo/data/Malaysia/content.txt and the current directory contains the 'target' directory with the packaged JAR:

~/Documents/Spark/spark-2.4.6-bin-hadoop2.7/bin/spark-submit --class "org.spark.learning.WikiContentWordCounter" --master local[4] target/scala-2.12/wordcounter_2.12-0.1.jar /home/bravo/data/Malaysia/content.txt

Logs from the Spark application

