I pulled the data with SQL queries and used R to take a longitudinal look at the domains with the most stories, from 2006 to 2017 year-to-date. I also used R’s TidyText package to do a fairly terse sentiment analysis of the titles of the top stories by score, again using the method from this post.
Domains with Most Stories, 2006 to May 2017
I got the count of stories by domain for each year, from 2006 to 2017 year-to-date.
I read in the data:
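As a minimal sketch of this kind of per-year domain count (in Python with SQLite, not the original query or the R used for the analysis), the snippet below runs a `GROUP BY` over a small in-memory table. The table name `stories`, its columns, and the sample rows are assumptions made purely for illustration.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
rows = [
    (2016, "github.com"),
    (2016, "github.com"),
    (2016, "medium.com"),
    (2017, "github.com"),
    (2017, "medium.com"),
    (2017, "medium.com"),
    (2017, "nytimes.com"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stories (year INTEGER, domain TEXT)")
conn.executemany("INSERT INTO stories VALUES (?, ?)", rows)

# Count stories per domain per year, most frequent first within each year.
query = """
    SELECT year, domain, COUNT(*) AS n_stories
    FROM stories
    GROUP BY year, domain
    ORDER BY year, n_stories DESC, domain
"""
counts_by_year = conn.execute(query).fetchall()
conn.close()

for year, domain, n_stories in counts_by_year:
    print(year, domain, n_stories)
```

Grouping by both year and domain gives one row per domain per year, which is the shape needed for the longitudinal comparison later on.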
A look at the resulting dataframe:
Here’s a bar chart of the top 10 domains, from 2006 to 5/25/2017, by story count:
GitHub had the most stories submitted, with 55,751. Medium and YouTube were close behind. The top five are all pretty close in story count. There’s a big drop-off after the top five (after the NY Times), with arstechnica.com having 49% (11,990) fewer stories than the NY Times.
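The "percent fewer" comparison can be sketched as a small Python helper. The two counts below are illustrative values, not figures from the dataset; they were chosen only so that the gap equals the 11,990 stories quoted above. The function name `percent_fewer` is likewise hypothetical.

```python
def percent_fewer(reference: int, other: int) -> float:
    """Return how many percent fewer `other` is than `reference`."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return 100.0 * (reference - other) / reference

# Illustrative counts only (not taken from the dataset); chosen so that the
# gap is 11,990 stories, matching the figure quoted in the text.
nytimes_count = 24_470
arstechnica_count = 12_480

gap = nytimes_count - arstechnica_count
drop = percent_fewer(nytimes_count, arstechnica_count)
print(f"{gap:,} fewer stories ({drop:.0f}% fewer)")
```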
Most Popular 2017 Domains and Change over Time
I looked at the top 10 domains on Hacker News by story count for 2017 YTD (through May 2017), and at how these domains compared in previous years, going back to 2006.
First I ran this query to get the count of stories by the ten most popular domains in 2017.
Then I read in the results.
Here is a look at the results:
I then pulled the counts for each of these ten domains, by year.
I read in the data and arranged it ascending by year.
Here are some of the results:
This line chart shows the change in count of stories by each top 10 2017 domain, from 2006 to May 2017.
Sentiment Analysis on Top Stories’ Titles by Score
I then pulled the top 100 stories by score for each year, and used the TidyText package to compare sentiments expressed in the top stories for each year.
The negative emotions of anger and fear appear to have increased in 2017 compared to other years, and negative sentiment in top story titles is already higher, less than halfway through the year, than in 2016 or any previous year. Positive sentiment, on the other hand, is lower in 2017 than in any other year.
First, I ran this SQL query for each year from 2006 to 2017:
I then read in the data:
I then got the list of words and their associated sentiments and emotions from the TidyText package:
I then got only the words from the titles for each included story, for each year. Then I joined this table with the nrc dataframe, to associate each word included with the NRC sentiment/emotion table. I then counted the number of words associated with each sentiment/emotion.
I did this for each year, and combined them into one dataframe, joining on sentiment.
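The tokenize, join, and count steps described above can be sketched in plain Python (the analysis itself was done in R with TidyText). The lexicon here is a tiny made-up stand-in for the NRC word list, and the titles, `LEXICON`, and `sentiment_counts` are all hypothetical names and data used only to show the shape of the computation.

```python
import re
from collections import Counter

# Tiny made-up lexicon standing in for the NRC word list; the real lexicon
# is far larger, and a word can map to several sentiments/emotions.
LEXICON = {
    "attack": ["negative", "anger", "fear"],
    "free": ["positive", "joy"],
    "death": ["negative", "sadness", "fear"],
    "launch": ["positive", "anticipation"],
}

def sentiment_counts(titles):
    """Tokenize titles into lowercase words and count lexicon matches."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[a-z']+", title.lower()):
            # Words missing from the lexicon are dropped, as in an inner join.
            counts.update(LEXICON.get(word, []))
    return counts

# Hypothetical titles per year.
titles_by_year = {
    2016: ["Free launch tickets", "Launch day"],
    2017: ["Attack on free speech", "Death of a startup"],
}

per_year = {year: sentiment_counts(t) for year, t in titles_by_year.items()}

# Combine into one wide table keyed by sentiment: one column per year.
sentiments = sorted(set().union(*per_year.values()))
wide = {s: {year: per_year[year][s] for year in per_year} for s in sentiments}

for sentiment, by_year in wide.items():
    print(sentiment, by_year)
```

Because a single word can carry several labels, one title word may add to more than one sentiment count, which is why the totals can exceed the number of words in the titles.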
Here is what the data looks like:
I reshaped the data from wide to long format to create a bar chart:
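A wide-to-long reshape of this kind can be illustrated with a short Python function (a sketch only, not the R code used here). The counts are made up, and `wide_to_long` is a hypothetical helper name.

```python
def wide_to_long(wide, years):
    """Turn {sentiment: {year: count}} into (sentiment, year, count) rows."""
    return [
        (sentiment, year, by_year.get(year, 0))
        for sentiment, by_year in wide.items()
        for year in years
    ]

# Made-up counts: one entry per sentiment, one column per year.
wide = {"anger": {2016: 4, 2017: 9}, "joy": {2016: 7, 2017: 3}}
long_rows = wide_to_long(wide, years=[2016, 2017])

for row in long_rows:
    print(row)
```

The long format has one row per sentiment and year, which is the layout most plotting tools expect for a grouped or faceted bar chart.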
This bar chart creates a facet for each sentiment, with one bar for each year:
This stacked bar chart puts each sentiment at a single position on the x axis, with the bars stacked by year.
I then wanted to see if there was a difference in sentiments between the bottom 10 of the top 100 for each year, and the top ten scores for each year, from 2006 to 2017.
First, I took the top 10 from the top 100 for each year:
I did this for each year, and combined them:
Because I now had the top ten stories by score for each year, I graphed them to see what the score range was for each year in the dataset.
The biggest range was in 2016, and the second-biggest range was in 2011.
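The score range per year is simply the gap between the highest and lowest score in that year's group. A minimal Python sketch, using made-up (year, score) pairs and a hypothetical helper `score_range_by_year`:

```python
def score_range_by_year(stories):
    """Map each year to (min score, max score, max - min)."""
    by_year = {}
    for year, score in stories:
        by_year.setdefault(year, []).append(score)
    return {
        year: (min(scores), max(scores), max(scores) - min(scores))
        for year, scores in sorted(by_year.items())
    }

# Made-up (year, score) pairs, for illustration only.
stories = [(2011, 900), (2011, 2100), (2016, 1500), (2016, 5700), (2016, 2000)]
ranges = score_range_by_year(stories)

for year, (low, high, spread) in ranges.items():
    print(year, low, high, spread)
```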
I then got the words out of the top 10 stories by score for each year, merged them with the NRC dataset, and counted the sentiments for these stories.
I then got the bottom ten scores for each year:
I did this for each year and combined the results, then merged these bottom ten with the top ten for each year.
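Selecting the top ten and the bottom ten out of a year's top 100 amounts to ranking by score and slicing both ends. The Python sketch below shows the idea on made-up scores; `top_and_bottom` and its parameters are hypothetical names, and the selection here was actually done in R.

```python
def top_and_bottom(scores, pool=100, n=10):
    """Return the n highest and n lowest scores within the top `pool`."""
    ranked = sorted(scores, reverse=True)[:pool]
    # If the pool holds fewer than 2 * n scores, the two slices overlap.
    return ranked[:n], ranked[-n:]

# Made-up scores 1..250 standing in for one year's stories.
scores = list(range(1, 251))
top_ten, bottom_ten = top_and_bottom(scores)

print("top ten:", top_ten)
print("bottom ten of the top 100:", bottom_ten)
```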
I looked at the range for these bottom ten out of the top 100 stories by score. The range is much smaller than that for the top stories by score:
I then got just the words out of the titles of the stories, merged those words with the NRC datasets to get the sentiment/emotion for each word, and counted the number of words for each sentiment/emotion.
I combined the a17_1 dataframe, which holds the sentiments for the top ten stories, with the b17_1 dataframe, which holds the sentiments for the bottom ten out of the top 100 stories.
I reshaped the data to be able to create the bar chart comparing the sentiments between the two groups of stories.
It looks like the top 10 stories were more negative and had more negative emotions than the bottom ten out of the top 100 stories.
The top 10 stories were higher in anger, anticipation, disgust, fear, and sadness. They were also higher in negative sentiment.
The bottom 10 stories were higher in positive sentiments and emotions of joy, surprise, and trust.