This is a general modelling question I was hoping someone could give me advice on.
I have a series of “example” paragraphs selected from a set of PDF documents. These paragraphs fall into N classes; you can think of them as conveying sentiment. What I want to do is collect, say, 100 examples of each class and use them to train a machine learning model that detects when a paragraph of that class appears in an arbitrary PDF document, and in which portion of the document it appears. To make it concrete, consider the classic “sentiment analysis” task of classifying the sentiment of Amazon product reviews. My problem is the same, except that once the sentiment classes are learned, instead of classifying other short reviews I want to detect those sentiments inside larger documents. For example, suppose a professional critic wrote a longer, multi-page review. I want to use the model trained on the shorter Amazon reviews to detect when sub-sections of that longer review exhibit a sentiment similar to the training examples, and to distinguish those sub-sections from the rest of the larger text, which exhibits no sentiment at all.
Another way to put it: I expect the majority of each PDF document to fall into none of the classes, with paragraphs exhibiting one of the classes showing up only sporadically (once or twice per document at most), and I want to distinguish those sections of text from the rest of the document.
Currently I am modelling this scenario as follows: I take the example paragraphs with their classification tags, remove stop words, apply lemmatization, and feed them into an SVM. If I am given a partitioning of the document into paragraphs beforehand, I can use this model to analyze each paragraph and detect whether it conveys one of the N sentiments or none at all (I model the “no sentiment” case as a 0th class and train it on the portions of the training documents that exhibit no sentiment).
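For reference, here is a minimal sketch of that pipeline in scikit-learn. The training texts and labels are toy placeholders (not real data), stop-word removal is folded into the vectorizer, and lemmatization is omitted; it could be supplied via a custom `tokenizer=` callable (e.g. spaCy or NLTK):

```python
# Minimal sketch of the paragraph classifier described above.
# Class 0 = "no sentiment"; classes 1..N are the example categories.
# The toy texts below are placeholders standing in for real examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "the battery life is wonderful and I love it",   # class 1: positive
    "great product, works exactly as promised",      # class 1
    "terrible quality, broke after one day",         # class 2: negative
    "awful experience, would not recommend",         # class 2
    "the manual is forty pages long",                # class 0: no sentiment
    "shipping took three business days",             # class 0
]
labels = [1, 1, 2, 2, 0, 0]

clf = Pipeline([
    # stop-word removal happens inside the vectorizer; lemmatization
    # would go in a custom tokenizer passed via `tokenizer=`
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])
clf.fit(texts, labels)

prediction = clf.predict(["I absolutely love this thing"])[0]
```

With enough examples per class, `clf.predict` can then be applied to each candidate paragraph of a new document.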
Where I am struggling, however, is that the PDF documents have no established or consistent notion of a “paragraph” I can use: the spacing is inconsistent, and spurious elements like headers and footers confuse things further. Also, an example that is logically contained in a single paragraph is sometimes hard to model as one, because of inconsistent single and double spacing in the document.
As a result, I cannot map my example paragraphs in any sensible way onto the spacing of the PDF documents. I had planned on using an n-gram-style approach that varies the paragraph size dynamically to see if that would help, but it seems computationally expensive. I was also thinking that doc2vec might help, but I am having trouble putting the pieces together.
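One cheaper variant of the dynamic-size idea is a sliding window over sentences rather than over raw tokens: the number of windows grows linearly with document length instead of quadratically as with all n-gram spans. The sketch below assumes a `classify` stand-in for the trained SVM (here it just flags windows containing the word "love" as class 1, purely for illustration); overlapping hits can afterwards be merged to localize the sentiment-bearing region:

```python
# Sliding-window sketch: instead of trusting the PDF's paragraph breaks,
# slide a fixed-size window of sentences over the extracted text and
# classify each window independently.
import re

def classify(text):
    # placeholder for clf.predict([text])[0] from the trained pipeline
    return 1 if "love" in text.lower() else 0

def windows(text, size=3, stride=1):
    # naive sentence split on terminal punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for start in range(0, max(1, len(sentences) - size + 1), stride):
        yield start, " ".join(sentences[start:start + size])

doc = ("The manual covers installation. Setup took an hour. "
       "I love how quiet it runs. The warranty lasts a year. "
       "Returns require a receipt.")

# keep only windows the classifier assigns a non-zero class
hits = [(i, w) for i, w in windows(doc) if classify(w) != 0]
```

Every window overlapping the sentiment-bearing sentence is flagged; the intersection of consecutive flagged windows narrows the location down to roughly sentence granularity.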
Can anyone recommend a good way to model and implement a solution for this scenario?