Multi-Document Summarization with NLTK

Hello Everyone!

Please, I need help with something. I am working on a multi-document summarization problem using nltk, but I am having trouble getting the right results.

Can someone explain how to go about the steps, especially step 2?


The idea was developed in this paper.

Some of my questions include:

  • In a multi-document summarization problem, is the corpus the whole document set or each individual sub-document?

  • How do you handle cases where a word repeats within the same sentence, both when updating probabilities and when calculating weights?


NLP is not my forte, so I can't help extensively. But from what I understand -

Step 1

Let’s say you have the following two sentences in your input -

  • Hi, I am the doctor.
  • Hi doctor. I am monorienaghogho

So, to calculate p(w_i) for every i (where i ranges over every word in the data), you simply divide the word's frequency by the total number of words in the data.

So, for example, the probability of doctor would be 2/10 (ignoring all punctuation).
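A minimal sketch of Step 1, using only the standard library - `collections.Counter` stands in for nltk's `FreqDist`, and the tokenizer below (lowercase, strip punctuation) is my own simplifying assumption:

```python
# Step 1 sketch: word probabilities over the whole input.
# collections.Counter stands in for nltk's FreqDist; the naive
# tokenizer below is a simplifying assumption.
from collections import Counter

sentences = ["Hi, I am the doctor.", "Hi doctor. I am monorienaghogho"]

# Naive tokenization: split on whitespace, strip punctuation, lowercase.
tokens = [w.strip(".,").lower() for s in sentences for w in s.split()]

counts = Counter(tokens)
total_words = len(tokens)  # 10 word tokens in total

# p(w_i) = frequency of the word / total number of words
p = {w: c / total_words for w, c in counts.items()}

print(p["doctor"])  # 2/10 = 0.2
```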

Step 2

Now, for each sentence, you calculate a weight that is equal to the average probability of the words in that sentence.

So, for our first sentence - Hi, I am the doctor. we would have the following probability of each word -

2/10, 2/10, 2/10, 1/10, 2/10

So the average of the above is (9/10) / 5 = 0.9 / 5 = 0.18.

And the process continues for every sentence, based on the above two steps.
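The two steps can be combined into a short sketch. Again, `collections.Counter` and the naive tokenizer are my own stand-ins, not the paper's implementation:

```python
# Step 2 sketch: weight each sentence by the average probability of
# its words, using the probabilities from Step 1.
from collections import Counter

sentences = ["Hi, I am the doctor.", "Hi doctor. I am monorienaghogho"]

def tokenize(s):
    # Naive tokenizer: split on whitespace, strip punctuation, lowercase.
    return [w.strip(".,").lower() for w in s.split()]

tokens = [w for s in sentences for w in tokenize(s)]
p = {w: c / len(tokens) for w, c in Counter(tokens).items()}

def sentence_weight(sentence):
    # Average probability of the words in the sentence.
    words = tokenize(sentence)
    return sum(p[w] for w in words) / len(words)

print(sentence_weight("Hi, I am the doctor."))  # (2+2+2+1+2)/10 / 5 = 0.18
```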

As per the paper, it seems it’s across all documents -

With regard to our system design, it must be noted that this system, similar to almost all multi-document summarization systems, produces summaries by selecting sentences from the document set, either verbatim or with some simplification.

That seems to have already been accounted for by Step 2. The denominator is the count of the set of all unique words in the sentence, if I am not mistaken. And they are summing over w_i as well, where w_i is also every unique word in the sentence.

Beyond the above, I'm afraid any further details would have to come from a more focused reading of the paper, which I can't currently do. Maybe someone else can help out as well.


Thanks for responding.

I think it is okay to use the entire document set if it is all on the same topic, to avoid repetition. If you have to summarize multiple documents about different topics at once, my guess is that the corpus should be each particular document.

I am applying similar reasoning to updating the probabilities. If a human were to summarize two different documents, a word appearing in document A should not affect the probability of that same word occurring in document B.

I found here that the weight for a particular sentence is the sum of the probabilities divided by the length. That was the way I had implemented it, but the mathematical notation of the denominator looks like the cardinality of a set.

I have been able to produce the summaries, but the program output does not match 90 percent of the expected result.

I would have to test the code, but it does look like it's doing the same thing - using the cardinality of the set. In the code, they divide by total, which is defined as -

total = len(fdist)

fdist is defined as -

fdist = FreqDist(tokens)

FreqDist computes the frequency distribution, so the len() of that distribution is the number of unique words.
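To illustrate the point, `collections.Counter` behaves the same way as `FreqDist` does here: the `len()` of the distribution is the number of distinct words, not the total token count.

```python
from collections import Counter  # mirrors FreqDist for this purpose

tokens = ["hi", "i", "am", "the", "doctor",
          "hi", "doctor", "i", "am", "monorienaghogho"]

fdist = Counter(tokens)

print(len(fdist))   # 6  -> number of unique words
print(len(tokens))  # 10 -> total number of word tokens
```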



I wasn't using the cardinality; I was using the total number of words in the sentence.
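The two denominators only disagree when a word repeats within a sentence. A quick sketch, with hypothetical probabilities chosen just for illustration:

```python
# Contrast the two denominators discussed above: total word tokens in
# the sentence vs. the cardinality of its set of distinct words.
# The probabilities here are hypothetical, just for illustration.
p = {"the": 0.1, "doctor": 0.2, "is": 0.1}
sentence = ["the", "doctor", "is", "the", "doctor"]

# Divide by the total number of words (counts repeats in the numerator too):
weight_by_len = sum(p[w] for w in sentence) / len(sentence)

# Divide by the cardinality of the set of unique words
# (each distinct word contributes once to the numerator):
unique = set(sentence)
weight_by_set = sum(p[w] for w in unique) / len(unique)

print(weight_by_len, weight_by_set)  # 0.14 vs. ~0.1333
```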

I will make this change in the code. I hope things improve.
