Text Summarization Using the SumBasic Approach with NLTK

This program takes pages from Wikipedia and summarizes their text. It uses a word-frequency approach called SumBasic, in which sentences are ranked by the average probability of the words they contain.

The program uses ideas from functional programming, where functions are passed as inputs to other functions. What better way to test it than to have it summarize the Wikipedia pages for both Functional programming and Automatic Text Summarization?
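For anyone curious how the ranking works, here is a minimal sketch of SumBasic-style scoring. It is a toy version with naive whitespace tokenization, not the code from the notebook:

from collections import Counter

def sumbasic_summary(sentences, n_sentences=3):
    # word probabilities: p(w) = count(w) / total number of words
    words = [w for s in sentences for w in s.lower().split()]
    probs = {w: count / len(words) for w, count in Counter(words).items()}

    summary, remaining = [], list(sentences)
    while remaining and len(summary) < n_sentences:
        # score a sentence by the average probability of its words
        def avg_prob(sentence):
            tokens = sentence.lower().split()
            return sum(probs[t] for t in tokens) / len(tokens) if tokens else 0.0
        best = max(remaining, key=avg_prob)
        summary.append(best)
        remaining.remove(best)
        # square the probabilities of the chosen words so later picks favour new content
        for t in best.lower().split():
            probs[t] **= 2
    return summary

In the notebook the sentences come from NLTK's tokenizers rather than str.split, but the scoring idea is the same.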

Functional Progamming - Text Summarization Using SumBasic Method and NLKT.ipynb (12.0 KB)

4 Likes

Hi @monorienaghogho,

You know what? I had missed your notebook until the Dataquest Download published it! Now we are good.
This is pretty fascinating; with such a program, I guess you could easily get any job requiring summarization!
I have read part of the summaries, but I wonder about something: it looks like the program is selecting sentences (well, for sure not totally at random) from Wikipedia? I would have expected the program to be able to write its own sentences, but that is not the case, right? The sentences I have checked are exactly the same phrases as in the Wikipedia article. What worries me: say a client hires you for summarization work, they will probably expect you to write your own sentences and keep plagiarism to a minimum.
One more comment: you forget the readers a bit, since you don't explain your code; some comments would have been welcome.
In any case, congratulations on the great job you did! I also really appreciate the link to the SumBasic paper.

Best
W

Thank you very much for your comment. There are two types of automatic text summarization: extractive and abstractive. This method is extractive: it uses a sentence-importance score to select which existing sentences to display. The abstractive method, which writes new sentences, requires deep learning and is an ongoing area of research. This is a nice article about both methods.

2 Likes

Hi @monorienaghogho, I'm going further with your excellent program, as I am interested in NLP.
I read the SumBasic paper quickly and tried to understand your code, then used it for some homework of my own.
2 comments:

  • I had to add these 3 lines:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

  • Summarizing ‘Vehicle insurance’, I get this error message:

ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-13-76705753430b> in <module>
     79 
     80 for i, datum in enumerate(data('Actuary', 'Vehicle insurance')):
---> 81     sumbasic(lemmatizer(word_tokenizer(sentence_tokenizer(datum))))
     82     if i == 0:
     83         print('*'*100, sep='\n')

<ipython-input-13-76705753430b> in sumbasic(lemmatizer)
     66             for word in value:
     67                 scores[key].append(probs[word])
---> 68             importance[key] = sum(scores[key]) / len(scores[key])
     69 
     70         most_importance_sentence = max(scores, key=scores.get)

ZeroDivisionError: division by zero

So len(scores[key]) is zero: the program hit a sentence with no scored words at all. Strange. What do you think?

@WilfriedF

Yes, I have installed everything including these.

You will make a great Tester! You found the case where it broke.

Quick fix:

importance[key] = (sum(scores[key]) + 1) / (len(scores[key]) + 1)

That clears the zero division error. Got this:

1 Like

Nice! So this was a special sentence that broke the program, one where no word got a score at all? Wait, with a quick hack we should be able to find that special sentence. Something like: try…except Exception: print(key)
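To make the idea concrete, here is a self-contained toy reconstruction of the failing loop. The names scores, probs, importance and key follow the traceback above; the sentences and probabilities are made up:

probs = {'vehicle': 0.3, 'insurance': 0.4, 'cover': 0.1}
tokenised = {
    'Vehicle insurance covers vehicles.': ['vehicle', 'insurance', 'cover'],
    '|': [],  # a "sentence" that tokenised to nothing, like a Wikipedia formatting artefact
}

scores, importance = {}, {}
for key, value in tokenised.items():
    scores[key] = [probs.get(word, 0) for word in value]
    try:
        importance[key] = sum(scores[key]) / len(scores[key])
    except ZeroDivisionError:  # narrower than a bare Exception, so other bugs still surface
        print('Offending sentence:', repr(key))

Running this prints the key of the empty sentence instead of crashing, which is enough to spot the Wikipedia artefact.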

1 Like

Got it:

Internal Wikipedia formatting stuff, I guess.

1 Like

Great job!!!

Have you ever thought about testing?

1 Like

Never! It was just a coincidence. I tried ‘Actuary’ and ‘Vehicle insurance’ because I wanted some insights for writing some text in English (for the KNN Guided Project I am finishing), and the second one produced the error!

1 Like

Back to the subject: you fixed it, and now the summary starts at Third Party Property Damage… But if you look at the beginning of the Wikipedia article, we are now missing the introduction and two other paragraphs. Maybe your algorithm is suffering from the small changes you made, or do you think it is pretty close?

Note that we can make a tradeoff:

            try:
                importance[key] = sum(scores[key]) / len(scores[key])
            except Exception:
                importance[key] = (sum(scores[key]) + 1) / (len(scores[key]) + 1)
1 Like

Yes. I should do proper processing of the texts, since these two lines did not catch whatever caused the error.

for token in regexp_tokenize(one_sentence.lower(), r'\w+'):
    if token not in string.punctuation:
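One possible tightening, as a sketch only (this is an assumption, not the notebook's actual fix): keep only alphabetic, non-stopword tokens and drop any sentence that ends up with no usable tokens, so the scoring loop never sees an empty list.

# Assumes nltk.download('stopwords') has already been run, as mentioned earlier in the thread.
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize

stops = set(stopwords.words('english'))

def clean_tokens(one_sentence):
    # keep only alphabetic tokens that are not stopwords
    tokens = regexp_tokenize(one_sentence.lower(), r'[a-z]+')
    return [token for token in tokens if token not in stops]

def usable_sentences(sentences):
    # drop sentences that are only markup, digits, punctuation or stopwords
    cleaned = {}
    for sentence in sentences:
        tokens = clean_tokens(sentence)
        if tokens:
            cleaned[sentence] = tokens
    return cleaned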
1 Like

Yeah, so you mean there are other special characters that caused the error? Characters that are neither punctuation nor stopwords. Maybe internal Wikipedia markup or invisible whitespace, things like that.

1 Like

Yes. Things not caught here.

My hypothesis is that it has something to do with the Wikipedia Table of Contents (very large for Vehicle insurance), but I am not sure. The size of the Table of Contents seems to match the screenshot I posted before.

1 Like