Every day I check Hacker News for interesting information, be it articles, stories, software or tools. Most of the submissions that make it to the front page are extraordinarily interesting and useful, and the fact that post curation is so effectively community-driven fascinates me.

For the purpose of this article I've used the Hacker News API to gather around 200 of the best stories ever submitted to Hacker News, along with their comments, and played around with the data to get some insight into what makes a good HN post.

Before we begin, I must say I have no doubt that a Hacker News submission succeeds mainly thanks to the quality of the information provided and the level of interest in that particular topic. But there may be other factors which, in small percentages, help an HN submission make it to the front page.

With that in mind, let's see an overview of this article:

  • Getting the data for our analysis
  • Data visualisation: word clouds and scores analysis
  • When to post on HackerNews
  • Ask HN vs Show HN
  • Who people on HackerNews talk about: Entity Recognition and Keyword Extraction
  • Conclusions

Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.

Getting the data for our analysis

I've used the Hacker News API /beststories endpoint to gather a list of 188 of the best stories ever. For every story I've also gathered the comments (but only the top-level thread, not the replies to comments). Here's the data I've stored for every entry:

  • id - the id of the entry
  • parent - the id of the parent. For a story, it is the same as the id field. For a comment, it's the id of the story to which the comment was added
  • kids_number - only for stories: the number of comments
  • score - only for stories: the number of points the submission got
  • time - UNIX timestamp of when the entry was added
  • text - the title of a post or the text of a comment
  • type - 'story' or 'comment'

Full code of the class I've used to fetch the data will be available at the end of this article.
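
Until then, here's a minimal sketch of what talking to the API looks like: it fetches the first item returned by the /beststories endpoint and prints a few of the fields listed above.

    import requests

    BASE_URL = "https://hacker-news.firebaseio.com/v0/"

    # Fetch the ids of the current best stories, then fetch the first item
    best_ids = requests.get(BASE_URL + "beststories.json").json()
    item = requests.get(BASE_URL + "item/" + str(best_ids[0]) + ".json").json()

    # A story item is a plain JSON object with fields such as id, title, score, time, kids and type
    print(item['title'], item['score'], len(item.get('kids', [])))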

The data is then stored in a CSV file and loaded from there into a Pandas DataFrame. I also needed to create a few more columns for my analysis: DayOfWeek, HourOfDay, isAsk and isShow (plus a helper DateTime column). The names are pretty self-explanatory.

    import pandas as pd

    # Fetch the data and write it to data.csv (the DataFetcher class is listed at the end of the article)
    dataFetcher = DataFetcher("https://hacker-news.firebaseio.com/v0/", "data.csv")
    dataFetcher.fetchData()

    df = pd.read_csv("data.csv")

    # Derive the extra columns used throughout the analysis
    df['DateTime'] = pd.to_datetime(df['time'], unit='s')
    df['DayOfWeek'] = df['DateTime'].dt.day_name()
    df['HourOfDay'] = df['DateTime'].dt.hour
    df['isAsk'] = df.apply(lambda x: x.type == 'story' and x.text.lower().startswith("ask hn:"), axis=1)
    df['isShow'] = df.apply(lambda x: x.type == 'story' and x.text.lower().startswith("show hn:"), axis=1)

Data visualisation: word clouds and scores analysis

I started with some exploratory analysis of the data. First, I built two separate word clouds, one from the story titles and one from the comments, hoping to get an idea of frequently used words on Hacker News. I've removed the "Show HN" and "Ask HN" labels from the titles.

    from wordcloud import WordCloud, STOPWORDS
    import matplotlib.pyplot as plt

    # Ignore the "Ask HN" / "Show HN" labels on top of the standard stopwords
    stopwords = set(STOPWORDS)
    stopwords.update(["Ask", "Show", "HN"])
    titles_text = " ".join(df[df['type']=='story']['text'].unique())
    titles_cloud = WordCloud(stopwords=stopwords, background_color='white').generate(titles_text)
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(titles_cloud, interpolation="bilinear")
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
Building a word cloud from story titles

Besides the big, obvious Covid and Coronavirus words, most of the words are related to software, programming and technology. A good observation is that videos seem to work very well on Hacker News (at least that's what this word cloud tells us).

Let's look at the comments as well.

    comments = " ".join(df[df['type'] == 'comment']['text'].unique())
    comments_cloud = WordCloud(background_color='white').generate(comments)
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(comments_cloud, interpolation="bilinear")
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
Building a word cloud from comments

I am a little bit disappointed that I couldn't include all the comments in this analysis, but their number was very large and I wasn't sure it would help much for this article. But we all know we sometimes spend more time in the comments section than on the original submitted post 😀
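
If you ever want the full comment tree, the API does make it possible: every item lists the ids of its direct replies in a kids field, so the tree can be walked recursively. Here's a rough sketch of how that could look (I haven't used this for the analysis in this article):

    import requests

    def fetch_comment_tree(item_id, base_url="https://hacker-news.firebaseio.com/v0/"):
        # Recursively fetch a comment and all of its replies via the 'kids' field
        item = requests.get(base_url + "item/" + str(item_id) + ".json").json()
        if not item or item.get('type') != 'comment' or 'text' not in item:
            return []
        comments = [item['text']]
        for kid_id in item.get('kids', []):
            comments.extend(fetch_comment_tree(kid_id, base_url))
        return comments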

Then I wanted to look at the scores of the best posts. I've plotted a histogram to illustrate the values that scores tend to cluster around, and I've also calculated the mean and the median of the scores.

    # Histogram of scores

    scores = df[df['type']=='story']['score']
    scores.plot.hist(bins=12, alpha=0.5)
    plt.show()


    # Average score
    print ("Average score: ", df[df['type']=='story']['score'].mean())

    # Median score
    print("Median score: ", df[df['type'] == 'story']['score'].median())
Histogram on scores of best posts on Hacker News

We can see here that most of the stories received less than 200 points, but there are also some outliers, with at least 1,000 points.

The average score for my dataset was 194.80, but that is hugely influenced by the outliers. That's why I've also calculated the median, which is 140.0. That means roughly half of the best stories on Hacker News received less than 140 points, and the other half received more than that.

When to post on Hacker News

This is a question a lot of people are asking on the Internet. This article is by no means a recipe for finding the answer, but I still think I've found some interesting things.

First I've plotted the distribution of stories by day of the week.

    daysOfWeek = df[df['type']=='story'].groupby(['DayOfWeek']).size()
    daysOfWeek.plot.bar()
    plt.show()
When to post on HackerNews - posts by day of week
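
One small caveat about this chart: groupby sorts the day names alphabetically, so the bars don't come out in calendar order. A quick reindex (just a sketch) puts them back in order and makes the missing days show up as zero:

    # Reindex so the bars follow calendar order; days with no stories show up as 0
    weekOrder = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    daysOfWeek.reindex(weekOrder, fill_value=0).plot.bar()
    plt.show()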

Most of the best stories have been posted around weekends. Somehow, I was expecting this. But the most interesting fact for me is that none of the best stories were submitted on Tuesday or Wednesday. Monday seems a very bad day too, with very few successful submissions.

Before doing this analysis, I would have guessed that Friday would get the highest number of successful submissions. I can't say exactly why; it was just my intuition.

There is another temporal dimension we can look at, and that's the hour of the day. Let's plot the same distribution for that.

    hoursOfDay = df[df['type']=='story'].groupby(['HourOfDay']).size()
    hoursOfDay.plot.bar()
    plt.show()
When to post on Hacker News - posts by hour of day

Since the time columns are in UTC, we can see that most of the successful posts were submitted in the afternoon, with the biggest spike at 5 pm UTC.
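
If you'd rather read the chart in your own timezone, pandas can convert the timestamps before extracting the hour. A small sketch, using US/Eastern purely as an example:

    # Convert the UTC timestamps to a local timezone before extracting the hour
    localTime = df['DateTime'].dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
    df['HourOfDayLocal'] = localTime.dt.hour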

Another thing I wanted to check was whether there is any correlation between the number of points a post gets and the number of comments on that post. To me it seems obvious that there should be one: if people find something interesting enough to upvote it, they might also start a discussion on that post.

I've also included the hour of the day in this correlation matrix, to check whether there are times of day when people feel more like engaging in conversations.

    correlationsData = df[df['type'] == 'story'][['score', 'kids_number', 'HourOfDay']]
    print (correlationsData.corr(method='pearson'))
When to post on Hackernews - correlations
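
If you prefer a picture over a printed table, the same matrix can also be rendered as a quick heatmap with matplotlib; just a sketch, not what produced the screenshot above:

    # Render the correlation matrix as a heatmap instead of printing it
    corr = correlationsData.corr(method='pearson')
    plt.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=45)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.colorbar()
    plt.show()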

It seems there is a very strong correlation between the score and the number of comments. As I said, I somehow expected this. But I was a bit disappointed by the non-existent correlation between the score and the hour of the day.

Ask HN vs Show HN

Moving on, I wanted to check how many of the most successful posts on Hacker News were Ask/Show submissions.

    print ("Count of Ask HN stories: ", df[df['isAsk']==True].shape[0])
    print ("Percentage of Ask HN stories:", 100 * df[df['isAsk']==True].shape[0] / df[df['type']=='story'].shape[0])
    print ("Count of Show HN stories: ", df[df['isShow']==True].shape[0])
    print ("Percentage of Show HN stories:", 100 * df[df['isShow']==True].shape[0] / df[df['type']=='story'].shape[0])

It turns out only 8 of the posts were Ask HN (that's 4.30% of my dataset) and 16 posts were Show HN (or 8.60% of the dataset). Not much to see here after all; only a small number of these submissions were Ask HN or Show HN posts.

Who people on Hacker News talk about: Entity Recognition and Keyword Extraction

The next step was to run an entity extractor on the titles of the best posts on Hacker News and keep only the People and Organisation entities, to see if anything pops out. I used spaCy for entity extraction.

I obtained a list of 175 entities. Because that's a big list that doesn't tell us anything by itself, I've only kept the entities that appear more than once.

    import spacy
    from collections import Counter

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(". ".join(df[df['type']=='story']['text'].unique()))
    entity_names = [entity.text for entity in doc.ents if entity.label_ in ["PERSON", "ORG"]]

    # Count how many times each entity appears and keep only those appearing more than once
    for entity, count in Counter(entity_names).items():
        if count > 1:
            print (entity)

    # Prints: Amazon, Google, Apple
Named Entity Extraction on Hacker News titles

These three tech giants are the only entities that appeared more than once in the titles of the best Hacker News posts.

The last step was to use gensim to extract keywords from the titles of the posts.

    from gensim.summarization import keywords
    print (keywords(". ".join(df[df['type']=='story']['text'].unique())).split('\n'))
Keywords extraction

This results in a huge list of keywords, out of which the first 3 are: "covid", "pdf" and "video". Other than that, most of the keywords relate to "generators", "apps", and "machine learning".
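
If the full list is too noisy, the same function can be asked for a fixed number of results. Note that gensim's summarization module was removed in gensim 4.x, so this sketch assumes a 3.x version:

    from gensim.summarization import keywords

    # Keep only the 10 highest-ranked keywords instead of the full list
    titles_joined = ". ".join(df[df['type']=='story']['text'].unique())
    print (keywords(titles_joined, words=10).split('\n'))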

Let's not forget to add the code for the class I've used to extract the data from the Hacker News API, as I promised at the beginning of the article.

import csv
import requests
from bs4 import BeautifulSoup


BEST_STORIES="beststories.json"

class DataFetcher:

    def __init__(self, baseUrl, dataFile):
        self.baseUrl = baseUrl
        self.dataFile = dataFile

    def fetchData(self):
        with open(self.dataFile, mode='w') as data_file:
            data_writer = csv.writer(data_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            data_writer.writerow(['id', 'parent', 'kids_number', 'score', 'time', 'text', 'type'])

            # Best stories
            r = requests.get(url=self.baseUrl + BEST_STORIES)
            bestStoriesIds = r.json()
            count = 0
            for id in bestStoriesIds:
                count = count + 1
                print (str(count) + " / " + str(len(bestStoriesIds)))
                story = requests.get(url=self.baseUrl + "item/" + str(id) + ".json")
                storyJson = story.json()
                data_writer.writerow([storyJson['id'], storyJson['parent'] if "parent" in storyJson else storyJson['id'],
                                      len(storyJson['kids']) if 'kids' in storyJson else 0, storyJson['score'],
                                      storyJson['time'], BeautifulSoup(storyJson['title'], features="html.parser").getText(), storyJson['type']])

                # Get the top-level comments for this story
                if "kids" in storyJson:
                    for kidId in storyJson["kids"]:
                        kid = requests.get(url=self.baseUrl + "item/" + str(kidId) + ".json")
                        kidJson = kid.json()
                        if kidJson and kidJson['type'] == 'comment' and "text" in kidJson:
                            data_writer.writerow(
                                [kidJson['id'], storyJson['id'],
                                 len(kidJson['kids']) if 'kids' in kidJson else 0, 0,
                                 kidJson['time'], BeautifulSoup(kidJson['text'], features="html.parser").getText(), kidJson['type']])

            print ("Latest stories")
            maxId = requests.get(url=self.baseUrl + "maxitem.json").json()
            countDown = 1000
            while countDown > 0:
                print ("Countdown: ", str(countDown))
                story = requests.get(url=self.baseUrl + "item/" + str(maxId) + ".json")
                storyJson = story.json()
                if storyJson["type"] == "story" and storyJson["score"] > 50:
                    countDown = countDown - 1
                    maxId = maxId - 1
                    data_writer.writerow(
                        [storyJson['id'], storyJson['parent'] if "parent" in storyJson else storyJson['id'],
                         len(storyJson['kids']) if 'kids' in storyJson else 0, storyJson['score'],
                         storyJson['time'], BeautifulSoup(storyJson['title'], features="html.parser").getText(),
                         storyJson['type'],
                         storyJson['url'] if "url" in storyJson else ''])

                    # Get the top-level comments for this story
                    if "kids" in storyJson:
                        for kidId in storyJson["kids"]:
                            kid = requests.get(url=self.baseUrl + "item/" + str(kidId) + ".json")
                            kidJson = kid.json()
                            if kidJson and kidJson['type'] == 'comment' and "text" in kidJson:
                                data_writer.writerow(
                                    [kidJson['id'], storyJson['id'],
                                     len(kidJson['kids']) if 'kids' in kidJson else 0, 0,
                                     kidJson['time'], BeautifulSoup(kidJson['text'], features="html.parser").getText(),
                                     kidJson['type']])

Conclusions

That was it for my little analysis of the best Hacker News posts of all time. I've really enjoyed playing with the data, and I hope you enjoyed it too and got some meaningful insights from this project.

Thank you very much for reading this! Interested in more? Follow me on Twitter at @b_dmarius and I'll post there every new article.