Spark and Map-Reduce - Challenge: Transforming Hamlet into a Data Set

I am doing the Challenge: Transforming Hamlet into a Data Set (Spark and Map-Reduce).
In section 2, Extract Line Numbers, the answer uses split to extract the number:

def format_id(x):
    id = x[0].split('@')[1]
    results =[]
    results.append(id)

I am wondering why a regular expression is not used to extract the number directly.
I tried the code below and an error occurred. How can I solve this issue? Thanks a lot!

def format_id(x):
    id = x[0].extract(r'(\d+)')
    results =[]
    results.append(id)

I have edited your code block to use ``` back ticks. You can click the edit button to see how I edited it.

You can refer to this article on how to use triple back ticks ``` to format a code block.

x[0].split('@') returns a list.
Then using indexing [1] to access the second element.
Therefore x[0].split('@')[1] returns the element after @ in the x[0].
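The steps above can be sketched in plain Python. The sample value `hamlet@0` is an assumption about what the first field looks like, since the raw data is not shown in this thread.

```python
# Hypothetical sample of a first field; the real data may differ.
first_field = "hamlet@0"

parts = first_field.split("@")  # split returns a list of substrings
line_number = parts[1]          # index 1 is the element after '@'

print(parts)        # ['hamlet', '0']
print(line_number)  # 0
```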

In your code:

Series.str.extract() is a pandas method that returns a new DataFrame or Series. The data type is different: here x[0] is not a pandas Series.

Avoid using id as a variable name. It is not a reserved keyword, but it shadows Python's built-in id() function. Choose another variable name.

Thanks alvinctk!
Is there a way to apply regular expression here?

I am not familiar with pyspark. Therefore, I won’t be able to help further.

There are some issues with the type that .extract returns.

And, maybe some concurrency issues.

Hi, I’m doing the same mission, but I don’t understand why the RDD from the Dataquest solution is different from the one produced by my own code:

raw_hamlet = sc.textFile("hamlet.txt")
split_hamlet = raw_hamlet.map(lambda line: line.split('\t'))

#Dataquest Solution
def format_id(x):
    id = x[0].split('@')[1]
    results = list()
    results.append(id)
    if len(x) > 1:
        for y in x[1:]:
            results.append(y)
    return results
hamlet_with_ids1=split_hamlet.map(lambda line: format_id(line))

#My own code
def line_id(x):
    x[0]=x[0].split('@')[1]
    return x
hamlet_with_ids2=split_hamlet.map(lambda line: line_id(line))

#Check RDDs 
print(hamlet_with_ids1==hamlet_with_ids2)

Output: False. Why?

@JoshZhang

x[0] is a Python string. It does not have an extract method.
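A regular expression can still be applied to a plain string through Python's standard `re` module. The following is a minimal sketch, not the Dataquest solution: the sample row and the fallback for a field without digits are assumptions made for illustration.

```python
import re

def format_id(x):
    """Return a new list whose first field is replaced by its first run of digits."""
    match = re.search(r"\d+", x[0])
    # Assumption: keep the original field if it contains no digits.
    line_id = match.group(0) if match else x[0]
    return [line_id] + list(x[1:])

# Hypothetical row; the real fields may differ.
row = ["hamlet@12", "HAMLET", "To be, or not to be"]
print(format_id(row))
```

Because this is an ordinary function that takes a list and returns a list, it should be usable the same way as the original, for example `split_hamlet.map(format_id)`, although that call has not been run here.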

@arredocana

What is this chunk of the Dataquest answer doing? Did you replicate this in your own code?
Open up the hint: it says it should return a list containing the clean version of the first value and the original remaining values in that element.
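One way to check whether the two functions really behave differently is to run both on the same rows outside Spark. This is a sketch in plain Python with made-up sample rows (an assumption, since the contents of hamlet.txt are not shown). Note that `line_id` modifies its input list in place, so each function gets its own copy.

```python
import copy

def format_id(x):
    # Dataquest solution from the post above
    id = x[0].split('@')[1]
    results = list()
    results.append(id)
    if len(x) > 1:
        for y in x[1:]:
            results.append(y)
    return results

def line_id(x):
    # Poster's version: overwrites the first element in place
    x[0] = x[0].split('@')[1]
    return x

# Hypothetical rows standing in for the split lines
rows = [["hamlet@0", "", "HAMLET"], ["hamlet@1"], ["hamlet@2", "Enter"]]

result1 = [format_id(r) for r in copy.deepcopy(rows)]
result2 = [line_id(r) for r in copy.deepcopy(rows)]

print(result1 == result2)
```

If the element lists match, the False in the post most likely comes from the comparison itself: as far as I know, `==` between two RDD objects compares the objects rather than their elements. Comparing `hamlet_with_ids1.collect() == hamlet_with_ids2.collect()` would compare the contents instead.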