Hello, my readers! If you are wondering what a UDF is, let me ask you a simple question: you must have written your own functions while coding, right?

Just like those, we have the privilege of building our own "USER DEFINED FUNCTIONS" in Apache Spark too. A UDF is a function provided by the user when the pre-defined functions are unable to achieve the goal.

  • These are column-based functions.
  • They help in extending Spark's basic functionality.
  • They allow developers to create and use new functions of their own choice.
  • They take one row at a time and produce a single corresponding value per row.
What’s the need to register a function?

After creating a function, it exists only on the driver node of Spark; the executors cannot call it. Thus, we need to register it so Spark can serialize it and send it to each executor (you may also see this described as cloning or broadcasting the function).

After registering, we can use it as if it were Spark's own function.

We’ll be checking out all these while solving the given problems.


Create a UDF to calculate the bonus received by an employee based on the number of months they worked in a year.

Formula: No of months * Fixed amount – Tax

No of months*1000-250.50
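Before wiring this into Spark, the formula can be sanity-checked as a plain Scala function. The name bonusFor is just a throwaway label for this sketch; the fixed amount of 1000 and the tax of 250.50 come from the formula above.

```scala
// Bonus = number of months * fixed amount (1000) - tax (250.50)
val bonusFor = (months: Int) => months * 1000 - 250.50

println(bonusFor(4))   // 4 * 1000 - 250.50 = 3749.5
println(bonusFor(10))  // 10 * 1000 - 250.50 = 9749.5
```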

Let’s create a dataframe first.

import spark.implicits._ // needed for toDF outside the spark-shell
val emp = Seq(("Nelsinki",101,4), ("Nairobi",103,10), ("Beck",101,8), ("Ruth",104,6))
// creating dataframe
val df = emp.toDF("Name","Dept_Id","Months_worked")
// View Dataframe
df.show()
// View Schema
df.printSchema()

We need to import the udf and col functions to get working with Spark SQL.

How to create a function?

import org.apache.spark.sql.functions.{udf, col}
// s is the input parameter of Int type; the result is the bonus as a Double
def money = (s: Int) => s * 1000 - 250.50

How to register the function as Spark UDF?

Method 1: Using the udf function

The first type parameter of udf denotes the datatype of the output and the second denotes the datatype of the input. Here, the money function returns a Double and takes an Int.

val bonus = udf[Double, Int](money)
// Use the bonus function and pass the column as its argument (invoking the UDF)
df.withColumn("Bonus", bonus(col("Months_worked"))).show()
Method 2: Using spark.udf.register

The API spark.udf.register is the standard way to register a Spark UDF; it also makes the UDF callable by name from SQL queries. It takes:

  • First argument: name of the UDF (user's choice)
  • Second argument: the Scala function
val Salary = spark.udf.register("Income", money)
// Invoking the UDF via the returned handle
df.withColumn("Bonus", Salary(col("Months_worked"))).show()

A simple use case to shorten things up:

Create a UDF to find the length of the employee's name.

To simplify the code, we can put the whole function inside the udf as depicted below.

val count = udf { (a: String) => a.length }

Create a UDF to increase Months_worked by 2.

val add = udf { (a: Int) => a + 2 }
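Both of these one-liners wrap ordinary Scala lambdas, so their logic can be sanity-checked without a Spark session. The names nameLength and addTwo below are only labels for this sketch:

```scala
// The same lambdas outside the udf wrapper
val nameLength = (a: String) => a.length
val addTwo = (a: Int) => a + 2

println(nameLength("Nairobi"))  // 7
println(addTwo(8))              // 10
```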


We are creating a Student dataframe.

val row = Seq(("Jack",20,"F"), ("Ayush",4,"Male"), ("Rio",43,"M"), ("Stan",56,"Other"))
// Creating dataframe
val df = row.toDF("Name","Roll_No","Gender")
// View Dataframe
df.show()

Update the Gender column by creating a UDF that maps the values to "Female", "Male", or "Unknown" accordingly.

Creating a Scala Function

def gen = (s: String) => {
  val f = Seq("F", "female", "FEMALE")
  val m = Seq("M", "Male", "MALE")
  if (f.contains(s)) "Female"
  else if (m.contains(s)) "Male"
  else "Unknown"
}

Registering the function as UDF

Method 1
val know_gender = udf[String, String](gen)
Method 2
val fun = spark.udf.register("gender", gen)
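Either handle can then be applied to the column, e.g. df.withColumn("Gender", know_gender(col("Gender"))).show(). The mapping itself is plain Scala, so it can be checked standalone against the sample values from the Student dataframe, with no Spark session needed:

```scala
// Same mapping as the gen function above, checked on its own
val gen = (s: String) => {
  val f = Seq("F", "female", "FEMALE")
  val m = Seq("M", "Male", "MALE")
  if (f.contains(s)) "Female"
  else if (m.contains(s)) "Male"
  else "Unknown"
}

println(gen("F"))      // Female
println(gen("Male"))   // Male
println(gen("Other"))  // Unknown
```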

Today we learned:

  • How to create a user-defined function
  • How to register a UDF
  • How to use the udf function

Did you find it interesting? Put your thoughts forward using the comment section below. I hope you got something to grab today. Thanks for reading.

Happy Sparking!!
