Anand Sudhanaboina

Reddit's Ranking Algorithm

I was curious about how Reddit ranks the front page posts in the “hot” section. I looked into it and found a few interesting things.

Reddit ranks the posts on the front (hot) page by three factors:

  1. Up Votes
  2. Down Votes
  3. Posted Date

This is Reddit’s algorithm for hot posts:

from math import log10, sqrt
from datetime import datetime, timedelta

epoch = datetime(1970, 1, 1)

def epoch_seconds(date):
    """Returns the number of seconds from the epoch to date. Should
       match the number returned by the equivalent function in
       postgres."""
    td = date - epoch
    return td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000)

def score(ups, downs):
    return ups - downs

def hot(ups, downs, date):
    return _hot(ups, downs, epoch_seconds(date))

def _hot(ups, downs, date):
    """The hot formula. Should match the equivalent function in postgres."""
    s = score(ups, downs)
    order = log10(max(abs(s), 1))
    if s > 0:
        sign = 1
    elif s < 0:
        sign = -1
    else:
        sign = 0
    seconds = date - 1134028003
    return round(sign * order + seconds / 45000, 7)
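
To get a feel for what the formula does, here is a small sketch of my own. The name `hot_score`, the constants' names, and the sample timestamp are illustrative and not part of Reddit's code. It restates the formula with a Unix timestamp as input and checks two consequences: the vote term is logarithmic, so ten times more net votes adds one point, and being posted 45000 seconds (12.5 hours) later also adds one point.

```python
from math import log10

REDDIT_OFFSET = 1134028003  # constant taken from the formula above
DECAY = 45000               # seconds per point, taken from the formula above

def hot_score(ups, downs, created_ts):
    """Restatement of the hot formula; created_ts is a Unix timestamp."""
    s = ups - downs
    order = log10(max(abs(s), 1))
    sign = (s > 0) - (s < 0)  # 1, -1 or 0
    return round(sign * order + (created_ts - REDDIT_OFFSET) / DECAY, 7)

t = 1400000000  # arbitrary sample timestamp

# Ten times more net votes: how much does the score grow?
gain_votes = hot_score(1000, 0, t) - hot_score(100, 0, t)

# Same votes, posted 12.5 hours later: how much does the score grow?
gain_time = hot_score(100, 0, t + DECAY) - hot_score(100, 0, t)

print(gain_votes, gain_time)
```

Both gains come out at one point (up to rounding), which means a post needs roughly ten times the net votes to hold its place against an otherwise equal post that is 12.5 hours newer.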

This looked exciting, so I decided to use the Reddit search API to get the JSON for a day’s posts, run the algorithm on that data, and see whether I could reproduce Reddit’s front page. A whole day of site-wide data would be huge, so I went with a single subreddit and chose /r/technology. I fetched the JSON data through the Reddit search API and took a screenshot of /r/technology to compare against the results.

Now I have 3 things:

  1. Reddit ranking algorithm
  2. Data of /r/technology for a day (sorted based on posted date)
  3. Screenshot of /r/technology to compare with generated results

I wrote a Python script to do the job:

import json
from datetime import datetime

# The ranking code above, assumed to be saved as reddit.py:
from reddit import hot

# Load the data:
with open("data.json") as file:
    posts = json.load(file)

# Output list:
outJson = []

# Iterate and calculate the hot score:
for children in posts:
    for post in children:
        downs = post["data"]["downs"]
        ups = post["data"]["ups"]
        title = post["data"]["title"]
        created = datetime.fromtimestamp(int(post["data"]["created"]))
        hotScore = hot(ups, downs, created)
        outJson.append({"title": title, "hotScore": hotScore})

# Sort based on hotScore. The score is kept as a float: truncating it
# to an int would tie posts that differ only in the fractional part.
def sortScore(entry):
    return entry.get("hotScore", 0)
outJson.sort(key=sortScore, reverse=True)

# Print JSON of top 25 posts:
print(json.dumps(outJson[0:25]))
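
The nested loop in the script implies that data.json is a list of result pages, each page being a list of post objects with a "data" field. The sketch below illustrates that shape with made-up posts. The titles, vote counts, timestamps, and the helper name `flatten` are all hypothetical.

```python
import json
from datetime import datetime

# Hypothetical sample: two pages of results, each a list of post objects.
sample = [
    [
        {"data": {"title": "Post A", "ups": 120, "downs": 0, "created": 1400000000}},
        {"data": {"title": "Post B", "ups": 45, "downs": 0, "created": 1400003600}},
    ],
    [
        {"data": {"title": "Post C", "ups": 900, "downs": 0, "created": 1399990000}},
    ],
]

def flatten(pages):
    """Yield (title, ups, downs, created) for every post on every page."""
    for page in pages:
        for post in page:
            d = post["data"]
            created = datetime.fromtimestamp(int(d["created"]))
            yield d["title"], d["ups"], d["downs"], created

# Round-trip through JSON to mimic reading the structure from data.json.
rows = list(flatten(json.loads(json.dumps(sample))))
for row in rows:
    print(row)
```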

There is one big challenge, though: Reddit does not reveal the number of downvotes, either on the website or through the API, so the generated results match the screenshot closely but not exactly.
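
To see how missing downvotes can reorder posts, here is a minimal sketch with invented numbers. Two posts share the same posting time. Post A has more upvotes, so it ranks first when downvotes are taken as 0, but it drops below post B once a hypothetical set of real downvote counts is plugged in. The function `hot_score` restates the formula above, and all vote counts are made up.

```python
from math import log10

def hot_score(ups, downs, created_ts):
    """Restatement of the hot formula; created_ts is a Unix timestamp."""
    s = ups - downs
    sign = (s > 0) - (s < 0)
    return round(sign * log10(max(abs(s), 1)) + (created_ts - 1134028003) / 45000, 7)

t = 1400000000  # same arbitrary posting time for both posts

# (title, ups, downs as seen by the script, hypothetical real downs)
posts = [("A", 400, 0, 250), ("B", 300, 0, 20)]

def ranking(downs_index):
    """Return titles ordered by hot score, using the chosen downs column."""
    ranked = sorted(posts, key=lambda p: hot_score(p[1], p[downs_index], t), reverse=True)
    return [p[0] for p in ranked]

seen = ranking(2)    # order computed without downvote information
actual = ranking(3)  # order computed with the hypothetical real downvotes
print(seen, actual)
```

With these numbers the two orders are reversed, which is the kind of position swap that shows up when the generated list is compared with the screenshot.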

I now have the 25 hottest posts as generated by the algorithm from the input data. Of these 25, 22 also appear in the screenshot, though not at exactly the same positions, which I attribute to the missing downvote counts.
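
A comparison like this can also be automated once the titles from the screenshot are typed into a list. The sketch below is one way to do it, not the method used above. The function name `compare_rankings` and the sample titles are hypothetical. It reports which titles appear in both lists and how far each one moved.

```python
def compare_rankings(generated, expected):
    """Return (titles present in both lists, {title: generated_pos - expected_pos})."""
    expected_pos = {title: i for i, title in enumerate(expected)}
    common = [title for title in generated if title in expected_pos]
    shifts = {title: generated.index(title) - expected_pos[title] for title in common}
    return common, shifts

# Made-up example lists, best-ranked first.
generated = ["Post C", "Post A", "Post B", "Post D"]
expected = ["Post A", "Post C", "Post B", "Post E"]

common, shifts = compare_rankings(generated, expected)
print(len(common), "titles in common")
print(shifts)
```

A shift of 0 means the post sits at the same position in both lists. A negative value means the algorithm ranked it higher than the screenshot did.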
