How does code run between Spark and Python when using PySpark?

In https://app.dataquest.io/m/61/transformations-and-actions/3/beyond-lambda-functions it says:

Spark compiles the function's code to Java to run on the RDD objects (which are also in Java).

This refers to the hamlet_speaks function on the next page:

def hamlet_speaks(line):
    # line is a list of fields; the first element is the line id
    id = line[0]
    speaketh = False

    # checks whether any field in the list equals "HAMLET"
    if "HAMLET" in line:
        speaketh = True

    if speaketh:
        yield id, "hamlet speaketh!"

hamlet_spoken = split_hamlet.flatMap(lambda x: hamlet_speaks(x))
hamlet_spoken.take(10)
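
For what it's worth, hamlet_speaks on its own is just an ordinary Python generator; calling it locally (with a made-up input line, since I don't have the exact dataset rows in front of me) shows what flatMap collects from each line:

# A made-up line in the shape split_hamlet produces: the id first, then the text fields
sample = ["hamlet@32", "HAMLET", "Do", "you", "see", "nothing", "there?"]
print(list(hamlet_speaks(sample)))   # [('hamlet@32', 'hamlet speaketh!')]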

However, this article: https://dzone.com/articles/pyspark-java-udf-integration-1 says that data and code are transferred to the Python process for the calculation and then back. That sounds contradictory to the DQ lesson, which says hamlet_speaks is compiled to Java.
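
My rough reading of the dzone description is that the driver pickles the Python function and ships it to Python worker processes on the executors, and the JVM streams partition data to those workers and reads the results back. A tiny sketch of just the function-shipping part (I believe PySpark bundles cloudpickle for this, but treat that as my assumption):

import cloudpickle  # standalone package; PySpark ships its own copy, as far as I know

# Driver side: serialize the function's code and closure into bytes
payload = cloudpickle.dumps(hamlet_speaks)

# Worker side: a separate Python process rebuilds the function and applies it
# to the rows the JVM streams over to it
worker_fn = cloudpickle.loads(payload)
print(list(worker_fn(["hamlet@0", "HAMLET"])))   # [('hamlet@0', 'hamlet speaketh!')]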

Is hamlet_speaks a form of the UDFs mentioned in the dzone article?

Which part is Python and which part is Java when running hamlet_spoken = split_hamlet.flatMap(lambda x: hamlet_speaks(x))? (To answer this properly, it seems we need to separate the language an instruction is written in from the process in which it executes.)
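
Here is a rough local experiment I can run to at least see which objects live where (the underscore attribute is a PySpark internal, so it may differ between versions, and the file path is just a placeholder):

from pyspark import SparkContext

sc = SparkContext("local", "hamlet")   # Python driver process, talking to a JVM through Py4J
split_hamlet = sc.textFile("hamlet.txt").map(lambda l: l.split("\t"))   # placeholder path
hamlet_spoken = split_hamlet.flatMap(lambda x: hamlet_speaks(x))

print(type(hamlet_spoken))         # pyspark.rdd.PipelinedRDD, a Python-side wrapper object
print(type(hamlet_spoken._jrdd))   # a Py4J proxy to the JVM-side RDD (internal attribute)
hamlet_spoken.take(10)             # the JVM schedules the job; Python workers run hamlet_speaks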


Hi @hanqi,

I have no idea how PySpark works internally. However, I found an article that seems to dive into the internals of PySpark. Hope it helps you:

If you find the answer, I would request you to post it here.

Best,
Sahil