PySpark Aggregations – Cube, Rollup

Hola 😛 Let’s get started and dig into some essential PySpark functions. PySpark offers plenty of aggregate functions for extracting statistical information from DataFrames using group by, cube, and rollup. Today, we’ll check out some aggregate functions that simplify operations on Spark DataFrames. Before moving ahead, Read more…

Joins in PySpark

Have you ever wondered whether we could apply joins to PySpark DataFrames as we do to SQL tables? Would it be possible? Woohoo!! You guessed it right. Spark has a module called Spark SQL for structured data processing, and it supports the standard SQL join types. Read more…

PySpark DataFrame – withColumn

In this article, I will walk you through commonly used DataFrame column operations. Spark’s withColumn() is used to change the value or data type of an existing column and to create a new column; renaming and dropping columns are handled by withColumnRenamed() and drop(). Let’s create a DataFrame first. Suppose you want to calculate the percentage of the student using Read more…

PySpark – Data Type Conversion

Hey there!! In today’s article, we’ll learn how to type cast DataFrame columns as needed. As you may know, data type conversion is an important step when transforming a DataFrame. Let’s say we would like to add a number to a DataFrame column Read more…

PySpark – Create DataFrame

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data Read more…

Introduction to Delta Lake

Hey there!! Have you heard of a data lake or used one in your projects? Maybe. Let me help you recall it. A data lake is a repository for storing structured and unstructured data, such as HDFS, Azure Data Lake, or AWS S3. But are you familiar with Delta Lake? Read more…

