Data Engineering: Analyzing Wikipedia Pages - MapReduce Function Not Working on Local Jupyter Notebook

Hello,

I am currently working on the Analyzing Wikipedia Pages guided project.

When I implement the MapReduce framework to count the total lines in all of the wiki files, the code fails in my local Jupyter Notebook (installed through Anaconda).

I get the following error message in the Anaconda interface:

```
[I 21:11:29.758 NotebookApp] Replaying 3 buffered messages
Process SpawnPoolWorker-1:
Traceback (most recent call last):
  File "C:\Users\Levi\anaconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\Users\Levi\anaconda3\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Levi\anaconda3\lib\multiprocessing\pool.py", line 114, in worker
    task = get()
  File "C:\Users\Levi\anaconda3\lib\multiprocessing\queues.py", line 358, in get
    return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'mapper_line_count' on <module '__main__' (built-in)>
Process SpawnPoolWorker-2:
Traceback (most recent call last):
  File "C:\Users\Levi\anaconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\Users\Levi\anaconda3\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Levi\anaconda3\lib\multiprocessing\pool.py", line 114, in worker
    task = get()
  File "C:\Users\Levi\anaconda3\lib\multiprocessing\queues.py", line 358, in get
    return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'mapper_line_count' on <module '__main__' (built-in)>
```

I am a bit confused, because the code works fine in the Dataquest embedded Jupyter Notebook. But when I download the notebook from there and move it to the local directory I use for all of my notebooks and datasets, it stops working.

Does anyone know how to resolve this?

Thanks,
Levi

Hello @deeble.levi

I searched for the error message that Anaconda gives you and found a StackOverflow post that describes the same problem:

Take a look and see whether it solves your issue. If it does, please mark your post as solved. That helps me and everyone who runs into the same problem in the future.

If it does not solve it, we will see what else we can do.

I hope it works!

A&E


Hi @Edelberth ,

Thanks for getting back to me. I would never have found that StackOverflow post by myself haha!

The suggested solution, using the `multiprocess` module instead of the `multiprocessing` module, resolved the error `AttributeError: Can't get attribute 'mapper_line_count' on <module '__main__' (built-in)>`.

So, the code is executing on Jupyter notebooks and there is no longer an error message on Anaconda…
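For anyone who finds this later, the change is essentially one import line. A rough sketch, assuming a framework file shaped like `map_reduce_framework.py`: `make_chunks` below is a stand-in rather than the project's exact helper, and the `try`/`except` fallback is only there so the sketch still imports where the third-party `multiprocess` package is not installed (the fallback would likely bring the original error back in a Windows notebook).

```python
import functools
import math

try:
    # Third-party drop-in replacement for multiprocessing (pip install multiprocess).
    from multiprocess import Pool
except ImportError:
    # Fallback for illustration only: the standard library pool.
    from multiprocessing import Pool


def make_chunks(data, num_chunks):
    # Hypothetical helper: split data into at most num_chunks slices.
    chunk_size = max(1, math.ceil(len(data) / num_chunks))
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def map_reduce(data, num_processes, mapper, reducer):
    # Run mapper on each chunk in a worker process, then combine the results.
    chunks = make_chunks(data, num_processes)
    with Pool(num_processes) as pool:
        chunk_results = pool.map(mapper, chunks)
    return functools.reduce(reducer, chunk_results)
```

Nothing else in the notebook had to change; the call to `map_reduce` stays the same.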

However, I’m now getting an error in the Jupyter notebook in relation to the os module:

```
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\Levi\anaconda3\lib\site-packages\multiprocess\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\Levi\anaconda3\lib\site-packages\multiprocess\pool.py", line 48, in mapstar
    return list(map(*args))
  File "", line 24, in mapper_line_count
NameError: name 'os' is not defined
"""

The above exception was the direct cause of the following exception:

NameError                                 Traceback (most recent call last)
in
     30
     31
---> 32 total_lines = map_reduce(file_names, 2, mapper_line_count, reducer_line_count)
     33 print(total_lines)
     34

~\data_engineering\projects\Analysing Wikipedia Pages\map_reduce_framework.py in map_reduce(data, num_processes, mapper, reducer)
     11     chunks = make_chunks(data, num_processes)
     12     with Pool(num_processes) as pool:
---> 13         chunk_results = pool.map(mapper, chunks)
     14     return functools.reduce(reducer, chunk_results)

~\anaconda3\lib\site-packages\multiprocess\pool.py in map(self, func, iterable, chunksize)
    362         in a list that is returned.
    363         '''
--> 364         return self._map_async(func, iterable, mapstar, chunksize).get()
    365
    366     def starmap(self, func, iterable, chunksize=None):

~\anaconda3\lib\site-packages\multiprocess\pool.py in get(self, timeout)
    769             return self._value
    770         else:
--> 771             raise self._value
    772
    773     def _set(self, i, obj):

NameError: name 'os' is not defined
```

I have imported the os module outside of the function, so I’m not sure why this is happening.

Thanks,
Levi

Hello, I’m glad it worked for you. :grinning_face_with_smiling_eyes:

According to the forum rules:

  • If the problem you raised has been solved, mark the thread as solved.

  • If you have a new problem, open a new thread. That helps keep topics separate.

I have some time at the moment, so I will take a look :thinking:, but as I said, please open a new thread for the new problem and, since the original one is solved, mark this thread as solved.
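In the meantime, one workaround that is often suggested for this kind of `NameError` is to import `os` inside the mapper itself, on the assumption that the worker processes do not see the notebook's top-level imports. A minimal sketch, not your actual code: the function bodies, the `folder` parameter, its `"wiki"` default, and the UTF-8 encoding are my guesses at what the project's mapper does.

```python
def mapper_line_count(chunk, folder="wiki"):
    # Imported inside the function on purpose: each worker process executes
    # this body itself, so the name `os` is bound in the worker as well.
    import os

    total = 0
    for file_name in chunk:
        # `folder` and its "wiki" default are assumptions for illustration.
        path = os.path.join(folder, file_name)
        # UTF-8 is assumed here; adjust if your files use another encoding.
        with open(path, encoding="utf-8") as f:
            total += sum(1 for _ in f)
    return total


def reducer_line_count(count_a, count_b):
    # Combine the line counts of two chunks.
    return count_a + count_b
```

The call from your traceback would stay as it is: `map_reduce(file_names, 2, mapper_line_count, reducer_line_count)`.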

I’m sure you’ll understand.

A&E.

Hi A&E,

Thanks for your help! I carried on with my project using the Dataquest dashboard for now, but I will return to the other issue later and create a new thread if it still occurs.

Best,
Levi
