Understanding how people feel about the news using Aspect-Based Sentiment Analysis

In this post I try to explain the motivation behind some of the (mostly technical) decisions made for the INA project. At the end of the post there are some example outputs.

Problem introduction and motivation

In the modern day and age, people are extremely connected, and sharing opinions with one another is easier than ever. People are also eager to share their thoughts (good and bad!) with the general public about services, products, and news happening around the globe. These opinions often take the form of various flavours of upvotes, of numbers picked from a Likert scale to rate some aspect of a product or service, and of free-form text comments. This project is concerned with the last of these.

Now, imagine the following scenario: you read an article online on your favourite news outlet. You think the article’s quality was really good, but the content made you feel distressed. You then start to wonder whether people think the same, whether they agree with your opinion on the article’s quality and content. The obvious thing to do, if you are no stranger to the internet, is to read the comments people have written under the article. After all, this is a direct representation of the general public’s thoughts and feelings. However, there could be hundreds or even thousands of comments, and unlike products in e-commerce shops, there is usually no Likert scale for judging particular aspects of the article. How could one form an objective picture of how people feel about the article then? Would they have to spend several hours reading all of the comments? Or simply look at the most upvoted ones? Both of these approaches have obvious downsides.

I propose that this problem can be tackled (at least partially) using aspect-based opinion mining and the methods presented in this project.

Aspect-Based Sentiment Analysis (ABSA)

ABSA came to the attention of the research community in 2014, when it was introduced as one of the tasks at the International Workshop on Semantic Evaluation (SemEval-2014):

The majority of current [sentiment analysis] approaches, however, attempt to detect the overall polarity of a sentence, paragraph, or text span, regardless of the entities mentioned (e.g., laptops, restaurants) and their aspects (e.g., battery, screen; food, service). By contrast, this task is concerned with aspect based sentiment analysis (ABSA), where the goal is to identify the aspects of given target entities and the sentiment expressed towards each aspect.

This approach is useful for us because we can discern which specific aspects people are commenting on and how they feel towards each of them.

At the moment INA is concerned with Subtask 2 (aspect term polarity) and Subtask 3 (aspect category detection) of SemEval-2014 Task 4 (ABSA).

Subtask 2 is introduced as follows:

For a given set of aspect terms within a sentence, determine whether the polarity of each aspect term is positive, negative, neutral or conflict (i.e., both positive and negative).

For example:

“I loved their fajitas” → {fajitas: positive}

“I hated their fajitas, but their salads were great” → {fajitas: negative, salads: positive}

“The fajitas are their first plate” → {fajitas: neutral}

“The fajitas were great to taste, but not to see” → {fajitas: conflict}

And Subtask 3 is introduced as follows:

Given a predefined set of aspect categories (e.g., price, food), identify the aspect categories discussed in a given sentence. Aspect categories are typically coarser than the aspect terms of Subtask 1, and they do not necessarily occur as terms in the given sentence.

For example, given the set of aspect categories {food, service, price, ambience, anecdotes/miscellaneous}:

“The restaurant was too expensive” → {price}

“The restaurant was expensive, but the menu was great” → {price, food}

The datasets introduced in SemEval-2014 have become a standard for researchers to train and test their models on. However, these datasets are insufficient for the task outlined above, which brings me to the next section.

Dataset

The classic SemEval datasets are collections of customer reviews about very specific things: laptops and restaurants. Thus, these datasets will not be all that useful for building a model that performs ABSA on comments about news.

There have been a few attempts at creating more general datasets, such as the Twitter dataset, but these are not ideal for our task either: comments under news articles tend to be more formal (and longer) than tweets.

So, an adequate dataset for this task did not seem to exist. And when a required dataset does not exist, there is nothing left to do but create one yourself, which is what I did. I stumbled upon the New York Times Comments dataset on Kaggle and thought it would be a brilliant candidate for conversion into an ABSA dataset suitable for this project. I then wrote a small Python script to help with the annotation procedure and went to work.
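
The actual annotation script lives in the INA repository; below is a minimal sketch of what such a helper can look like. The file name and CSV layout are assumptions made for illustration, while the category and polarity labels come from the statistics table below.

```python
import csv

# Illustrative annotation helper (not the actual INA script): show each
# sentence, ask for a category and a polarity, and append the result to a CSV.
CATEGORIES = ["content", "quality", "author", "personal", "misc"]
POLARITIES = ["negative", "neutral", "positive"]

def annotate(sentences, out_path="nyt_absa_annotations.csv"):
    with open(out_path, "a", newline="") as out_file:
        writer = csv.writer(out_file)
        for sentence in sentences:
            print("\n" + sentence)
            category = input(f"category {CATEGORIES}: ").strip().lower()
            polarity = input(f"polarity {POLARITIES}: ").strip().lower()
            if category in CATEGORIES and polarity in POLARITIES:
                writer.writerow([sentence, category, polarity])
```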

By the end, I had annotated 2059 sentences, which is not a whole lot considering the magnitude of corpora deep learning networks usually need to get good results, but it is a start. I tried to mix the topics of the articles as much as possible; however, the most prevalent keyword (appearing in about a quarter of the articles in this dataset) was ‘Donald J Trump’. Some of the dataset’s statistics:

| Category | Negative | Neutral | Positive | Total |
| --- | --- | --- | --- | --- |
| Content | 332 | 539 | 84 | 955 (46.4%) |
| Misc | 151 | 481 | 42 | 674 (32.8%) |
| Quality | 7 | 2 | 20 | 29 (1.3%) |
| Personal | 80 | 222 | 81 | 383 (18.6%) |
| Author | 1 | 3 | 14 | 18 (0.9%) |
| Total | 571 (27.7%) | 1247 (60.6%) | 241 (11.7%) | 2059 (100%) |

Only after doing so did I realize that, compared with the classical datasets, two things were missing from my annotation procedure. First, I did not mark the specific aspect terms discussed in each sentence (along with their locations, which would make processing easier). Second, every sentence was assigned to exactly one category, whereas in the Subtask 3 example above and in the classical ABSA datasets a sentence is not restricted to a single category: a comment can mention multiple topics and the annotation should reflect that. This is not a big problem in practice, since after a brief glance at the dataset most sentences seem to discuss a single topic. However, if a new dataset is created in the future, these critiques should be taken into consideration.

For more information about the datasets used in this project and how they could be further improved, refer to the README.md in the datasets directory of the INA project.

Training Deep Learning models

After the data was taken care of, the time came for training models that would later serve as one of the main components of the system.

As I have mentioned above, we are only concerned with inferring aspect-term polarity (the model for this is referred to as the ABSA model from now on) and aspect-category classification (referred to as the category classification model). Aspect terms are also needed for Aspect-Term Sentiment Analysis (ATSA), yet I forgot to add these myself during the annotation procedure, so an automatic way of extracting aspects was employed: noun phrases are retrieved from each sentence using TextBlob (or, when none are found, spaCy) and used as aspect terms, as sketched below. Perhaps later a separate model tackling ABSA Subtask 1 (aspect term extraction) should be added for a more precise approach.
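
A minimal sketch of this extraction step, assuming TextBlob (with its corpora) and spaCy’s small English model are installed; the function name is illustrative.

```python
from textblob import TextBlob
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model used only as a fallback

def extract_aspect_terms(sentence):
    # Prefer TextBlob's noun phrases...
    terms = list(TextBlob(sentence).noun_phrases)
    if not terms:
        # ...and fall back to spaCy's noun chunks when TextBlob finds none.
        terms = [chunk.text for chunk in nlp(sentence).noun_chunks]
    return terms

print(extract_aspect_terms("The fajitas were great to taste, but not to see"))
```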

Aspect-Based Sentiment Analysis model

I investigated implementations of DNN architectures proposed in papers and stumbled upon a great repository. This allowed me to easily train and compare different architectures on the NYT comments dataset.

This website graphically displays the accuracies attained by various models on ABSA Subtask 2. Many of these models are implemented in the repository mentioned above, so I decided to try out all three implemented BERT models and the best (according to the linked site) non-BERT models in my experiments.

I hereby present the results of these experiments. NYT is the novel dataset introduced in this project; Hybrid is the NYT dataset mixed with the Laptop, Restaurants, and Twitter datasets. All of the models were tested on the NYT dataset (because I was primarily interested in training a model specific to our project). The models’ hyperparameters were mostly the same as in the papers introducing them, with a few minor tweaks to reduce overfitting (reduced learning rate, increased L2 regularization, increased dropout, fewer epochs). The non-BERT models were trained with 10-fold cross-validation as their results fluctuated.

| Model | Dataset | Test accuracy | Test F1 |
| --- | --- | --- | --- |
| MGAN | NYT | 0.5841 | 0.5025 |
| MGAN | Hybrid | | |
| AOA | NYT | 0.5881 | 0.5137 |
| AOA | Hybrid | 0.6035 | 0.5356 |
| BERT SPC | NYT | 0.6467 | 0.5745 |
| BERT SPC | Hybrid | 0.6509 | 0.6073 |
| LCF BERT | NYT | 0.6780 | 0.6318 |
| LCF BERT | Hybrid | 0.6564 | 0.5963 |
| AEN-BERT | NYT | 0.6780 | 0.6500 |
| AEN-BERT | Hybrid | 0.6774 | 0.6461 |
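
For reference, the 10-fold evaluation of the non-BERT models roughly follows the usual pattern sketched below; `train_fn` and `eval_fn` stand in for the repository’s training and evaluation routines, so this is an illustration rather than the actual training code.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(samples, labels, train_fn, eval_fn, n_splits=10, seed=42):
    """Average a metric over K folds; train_fn/eval_fn are supplied callables."""
    samples, labels = np.asarray(samples), np.asarray(labels)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(samples):
        model = train_fn(samples[train_idx], labels[train_idx])
        scores.append(eval_fn(model, samples[test_idx], labels[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))
```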

Sentence Category classification model

Category classification is a more straightforward task, so I attempted to build a model myself. At the moment the model predicts exactly one category for each sentence, as described above, but that can easily be changed if a new, better dataset is introduced.
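
A minimal Keras sketch of such a convolutional sentence classifier is shown below. The embedding layer, pooling, dropout, and the exact hyperparameter values are assumptions made for illustration (including my guess that names like CNN_1_128_5 encode the number of convolution layers, filters, and kernel size); only the idea of stacking convolution layers comes from the experiments reported here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn_classifier(vocab_size, n_categories=5, max_len=60,
                         embed_dim=100, n_conv=1, filters=128, kernel_size=5):
    # Token ids -> embeddings -> stacked 1D convolutions -> 5-way softmax.
    model = tf.keras.Sequential([
        layers.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, embed_dim),
    ])
    for _ in range(n_conv):
        model.add(layers.Conv1D(filters, kernel_size, activation="relu"))
    model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(n_categories, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn_classifier(vocab_size=20000)  # a CNN_1_128_5-like configuration
```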

I hereby present the results of some experiments with convolution layers for sentence category classification. The model names in the table follow the form `CNN_<no. convolution layers>_…`, with the remaining fields encoding further hyperparameters of the convolution layers.

| Model | Test accuracy |
| --- | --- |
| CNN_1_128_5 | 0.6556 |
| CNN_1_128_10 | 0.6012 |
| CNN_1_256_5 | 0.6031 |
| CNN_1_256_10 | 0.5914 |
| CNN_2_128_5 | 0.6206 |
| CNN_2_128_10 | 0.5837 |
| CNN_2_256_5 | 0.6498 |
| CNN_2_256_10 | 0.6070 |

Putting it all together

The end product is a system that supplies the user with statistics about the comments written under a news article.

After running opinion_statistics.py the following steps are taken by the software:

  1. Retrieve comments from The New York Times using the given article’s URL
  2. Process them
  3. Use the trained sentence category classification model to classify sentences into the 5 predefined categories
  4. Use TextBlob (or, when none are found, spaCy) to extract noun phrases to be used later as aspect terms
  5. Use TextBlob to compute sentence-level (category-level) sentiment (see the sketch after this list)
  6. Use the trained ABSA model and the extracted aspect terms to compute aspect-level sentiment
  7. Retrieve the computed statistics, parse them, make plots, and output them to stdout and/or to a directory
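
As an illustration of step 5, sentence-level polarity can be obtained directly from TextBlob. The thresholds used to map the polarity score to a label are assumptions made here, not necessarily the ones INA uses.

```python
from textblob import TextBlob

def sentence_sentiment(sentence, threshold=0.05):
    # TextBlob polarity lies in [-1, 1]; map it to a three-way label.
    polarity = TextBlob(sentence).sentiment.polarity
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

print(sentence_sentiment("I loved their fajitas"))  # -> "positive"
```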

So, in essence this project covers all four ABSA subtasks (aspect-term extraction, aspect-term polarity, aspect-category extraction, aspect-category polarity); however, there are currently trained models only for Subtasks 2 and 3, while Subtasks 1 and 4 are carried out automatically using existing natural language toolkits.

Example results and possible interpretations

I chose a random article on The New York Times website. The title of the article: “The Virus Has Wrecked Some Families. It Has Brought Others Closer.”

I ran INA on it and these are the results. In total, 358 sentences were analyzed.

First, let’s take a look at what people commented under this article. The category classification model classified 26.13% of the sentences as relating to commenters’ personal experiences, 35.14% as being about the content of the article, and 38.74% as miscellaneous, not falling into any of the other categories.

Now, given the title of this article I would expect quite a few positive comments and many references to personal lives. We can see already that indeed a large portion (over a quarter) of the comments were about that.

Corona families percentage

Let’s take a look at how people felt regarding each category. Indeed, we see quite a large portion of positive comments. It is natural for the largest portion of the comments to be classified as neutral: comments under New York Times articles are manually approved, which means that toxic, fake, or erroneous comments can be expected to have been removed, leaving the remaining comments a bit more bland. Neutral comments also naturally occur more often, as many sentences are simply statements or questions. Our models do not yet discern sarcasm.

Corona families sentiment

category_stats_summary.txt provides the absolute and relative sentiment for each category. The absolute sentiment score is simply the sum of all sentiment scores associated with the category. As can already be discerned from the graph above, the relative score for each category is positive (there are more positively classified sentences than negative ones).
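
For illustration, one plausible way of computing these per-category scores is sketched below; the normalization used for the relative score is my assumption, not necessarily INA’s exact formula.

```python
from collections import defaultdict

def category_scores(records):
    """records: iterable of (category, polarity) pairs with polarity in [-1, 1]."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, polarity in records:
        sums[category] += polarity   # absolute score: plain sum of sentence polarities
        counts[category] += 1
    # relative score: the sum normalized by the number of sentences in the category
    return {c: {"absolute": sums[c], "relative": sums[c] / counts[c]} for c in sums}
```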

aspect_terms_summary.txt lists the 10 most frequently mentioned aspects in the comments. In this case, the top 5 are:

From this we can deduce that people frequently refer to their personal experiences. There is a common theme of referring positively to time (it is not clear whether this means time left or, e.g., a better period of time) and to families. The aspect ‘people’ has a negative relative sentiment, meaning that commenters most likely talked about society in a negative context.


I then took another article from The New York Times, expecting it to incite emotions a bit more. Article title: “Will the Coronavirus Kill What’s Left of Americans’ Faith in Washington?”

Much less of the comment content is about personal experience and much more is about the content of the article. I suppose a lot of people had something to say about how they feel about Washington.

Government trust percentage

Less positivity can be observed in the results; opinions are a bit more negative than under the previous article, and it seems a lot of people simply presented statements (neutral).

Government trust sentiment

The following six most frequently mentioned aspects probably sum up how people feel about Washington…

Conclusion

As can be seen, this tool can give a brief overview of how people feel about a few topics related to an article. Not only that, we also learn how commenters feel towards specific aspects.

As already mentioned in this post and in the README files in the INA repository, there is still more that could be done to improve this tool. For one, better data could be collected and better models built. Furthermore, more statistics could be computed, making it easier to gain insight into what people are talking about and how they are feeling. Lastly, a web application could be built so that non-tech-savvy users can benefit from this tool.