How does code run between Spark and Python when using PySpark?

It says,
"Spark compiles the function's code to Java to run on the RDD objects (which are also in Java)."
This refers to the hamlet_speaks function on the next page:

def hamlet_speaks(line):
    id = line[0]
    speaketh = False
    if "HAMLET" in line:
        speaketh = True
    if speaketh:
        yield id, "hamlet speaketh!"

hamlet_spoken = split_hamlet.flatMap(lambda x: hamlet_speaks(x))

However, in this article:
it says that data and code are transferred to the Python process for computation and then back.
This sounds contradictory to the DQ lesson, which says hamlet_speaks is compiled to Java.

Is hamlet_speaks a form of the UDF mentioned in the DZone article?

Which part is Python and which part is Java when running hamlet_spoken = split_hamlet.flatMap(lambda x: hamlet_speaks(x))? (To answer this properly, it seems we need to separate the language an instruction is written in from the process in which it executes.)
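As far as I understand, the lesson's wording is misleading: hamlet_speaks is never translated into Java. The JVM driver and executors handle scheduling and data movement, but the function itself is serialized on the driver (Spark uses cloudpickle for this) and executed by separate Python worker processes. Here is a minimal sketch of just the serialization step, using the stdlib pickle module instead of Spark's actual machinery — the process boundary, sockets, and cloudpickle are all omitted:

```python
import pickle

def hamlet_speaks(line):
    # Plain Python; in PySpark this body runs in a Python worker
    # process, never inside the JVM.
    line_id = line[0]
    speaketh = False
    if "HAMLET" in line:
        speaketh = True
    if speaketh:
        yield line_id, "hamlet speaketh!"

# "Driver" side: serialize the function. (Spark uses cloudpickle so the
# function is shipped by value; plain pickle stores it by reference,
# which is enough for this single-process sketch.)
payload = pickle.dumps(hamlet_speaks)

# "Worker" side: deserialize and apply it to one element of a partition.
worker_fn = pickle.loads(payload)
result = list(worker_fn(("hamlet@42", "HAMLET")))
print(result)  # [('hamlet@42', 'hamlet speaketh!')]
```

So in the flatMap line, the RDD plumbing is Java/Scala running in the JVM, while the lambda and hamlet_speaks are Python bytes that get shipped to Python workers — which is also why the DZone article describes data and code moving into a Python process and back.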


Hi @hanqi,

I have no idea how PySpark works internally. However, I found an article that seems to dive into the internals of PySpark. I hope it helps you:

If you find the answer, please post it here.