How does code run between Spark and Python when using PySpark?

It says,
"Spark compiles the function's code to Java to run on the RDD objects (which are also in Java)."
This refers to the hamlet_speaks function on the next page:

def hamlet_speaks(line):
    id = line[0]
    speaketh = False
    if "HAMLET" in line:
        speaketh = True
    if speaketh:
        yield id, "hamlet speaketh!"

hamlet_spoken = split_hamlet.flatMap(lambda x: hamlet_speaks(x))

However, in this article:
it says that data and code are transferred to the Python process for computation and then back.
This sounds contradictory to the DQ lesson, which says hamlet_speaks is compiled to Java.

Is hamlet_speaks a form of the UDF mentioned in the DZone article?

Which part is Python and which part is Java when running hamlet_spoken = split_hamlet.flatMap(lambda x: hamlet_speaks(x))? (To answer this properly, it seems we need to separate the language an instruction is written in from the process in which it executes.)
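As far as I understand, the lesson's wording is misleading: hamlet_speaks is never translated into Java. The JVM driver and executors handle scheduling and data movement, but the function itself is serialized on the driver (Spark uses cloudpickle for this) and executed by separate Python worker processes. Here is a minimal sketch of just the serialization step, using the stdlib pickle module instead of Spark's actual machinery — the process boundary, sockets, and cloudpickle are all omitted:

```python
import pickle

def hamlet_speaks(line):
    # Plain Python; in PySpark this body runs in a Python worker
    # process, never inside the JVM.
    line_id = line[0]
    speaketh = False
    if "HAMLET" in line:
        speaketh = True
    if speaketh:
        yield line_id, "hamlet speaketh!"

# "Driver" side: serialize the function. (Spark uses cloudpickle so the
# function is shipped by value; plain pickle stores it by reference,
# which is enough for this single-process sketch.)
payload = pickle.dumps(hamlet_speaks)

# "Worker" side: deserialize and apply it to one element of a partition.
worker_fn = pickle.loads(payload)
result = list(worker_fn(("hamlet@42", "HAMLET")))
print(result)  # [('hamlet@42', 'hamlet speaketh!')]
```

So in the flatMap line, the RDD plumbing is Java/Scala running in the JVM, while the lambda and hamlet_speaks are Python bytes that get shipped to Python workers — which is also why the DZone article describes data and code moving into a Python process and back.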


Hi @hanqi,

I have no idea how PySpark works internally. However, I found an article that seems to dive into the internals of PySpark. I hope it helps you:

If you find the answer, please post it here.