
# Thread: How does digg/reddit do it?

1. ## How does digg/reddit do it?

Hi Guys

Have you ever wondered how digg, reddit and other huge social network sites keep track of who voted for what?

Take digg for example. Each user can post a comment, and each user can also vote a comment up or down. But they also stop you from voting for the same comment twice.

It's not a time-based thing - if you voted "+1" on a comment 6 months ago, you cannot go and "+1" it again 6 months later.

My question is: How do they keep track of all those votes programmatically?

The first and obvious solution is to create a separate table like userCommentVotes with cols:

commentID (int), userID (int)

So if userID 324 votes for the commentID 5556 you just insert that into the table, and if the user tries to vote for that same comment again, you do a
Code:
`SELECT * FROM userCommentVotes WHERE commentID = 5556 AND userID = 324`
If the database returns a row you know that the user has already voted and don't count his second vote.
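For concreteness, a minimal sketch of that table (the composite primary key is my addition - it both prevents duplicate votes and makes the lookup an index hit):

Code:
```sql
CREATE TABLE userCommentVotes (
    commentID INT NOT NULL,
    userID    INT NOT NULL,
    PRIMARY KEY (commentID, userID)
);
```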

That's great for small sites, but if you have 1,000 users and 1,000 comments, the table userCommentVotes could grow to 1,000 * 1,000 = 1 million rows in the worst case! Not very efficient

The next obvious solution would be to use bitwise operators. That way you could store just one integer value for each user and use a bitwise AND to figure out if user 324 has indeed already voted on commentID 5556. Much, much more efficient.

But! MySQL only allows you to store 64-bit numbers. And with bitwise operators you need

Code:
```
2^x - 1
where x = number of options
```
This means that if you have 1,000 comments, you'll need to store an integer value of 2^1000-1, which is the astronomical figure of about 1.0715e+301

And what if you have 7,000 comments? I'm not even going to try to calculate that. Also, I don't think doing bitwise operations on a number of that size is very efficient (not to mention that PHP cannot store that number as an integer - it will convert it to a float)
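Just to illustrate, within MySQL's 64-bit limit the bitmask idea would look roughly like this (the column name voteMask is made up; bit n stands for option n):

Code:
```sql
-- Record a vote on option 5 by setting bit 5
UPDATE users SET voteMask = voteMask | (1 << 5) WHERE userID = 324;

-- Check it: a non-zero result means the user already voted on option 5
SELECT voteMask & (1 << 5) FROM users WHERE userID = 324;
```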

No go, TEXT field in database maybe?

Now I'm starting to think that creating a table with a text field might just be the best option. For example

userID (int), votedOnComments (text) (i.e. MySQL's TEXT type)

I then use comma-separated values in the text field whenever a user votes on a comment. For example, if user 324 voted on comments 5564, 9323 and 10345, the row would look like this

userID = 324, votedOnComments = "5564,9323,10345"

Whenever the user tries to vote on comment 9323 again, I can select his votedOnComments from the database and do a string search to see if 9323 pops up. In this case it will, and I won't count his vote again
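If it helps, MySQL can do that comma-list search server-side with FIND_IN_SET() - though it's still a linear string scan, so no index will help (the table name userVotes is made up):

Code:
```sql
-- Returns 0 if 9323 is not in the list, otherwise its 1-based position
SELECT FIND_IN_SET('9323', votedOnComments) AS alreadyVoted
FROM userVotes
WHERE userID = 324;
```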

The last method seems to be the best approach, but it still isn't very efficient. If a user voted on 4,000 comments, that votedOnComments TEXT field will get incredibly long.

I'm pretty sure I'm missing something. There MUST be a better way.

Any ideas?

• I want to see the outcome of this :P

• I'd go with your first solution. It's really not that weird: you're counting the number of votes on an article and deciding whether a user is allowed to vote.

All information about an article is stored in a table articles, as is the information about a user. I'd give the votes their own table.

The moment a user logs on, his credentials get loaded. Also, a quick look is taken at the votes table. All this user info is stored in a session, so it's only looked up once. When the user votes, update the database and the session and you're done.
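A rough sketch of that flow in PHP (PDO, and the table/column names, are my assumptions):

PHP Code:
```php
<?php
session_start();

// At login: load the user's past votes once and cache them in the session.
// $pdo is an already-connected PDO handle.
function loadUserVotes(PDO $pdo, int $userID): void {
    $stmt = $pdo->prepare('SELECT commentID FROM userCommentVotes WHERE userID = ?');
    $stmt->execute([$userID]);
    $_SESSION['votedOn'] = array_fill_keys($stmt->fetchAll(PDO::FETCH_COLUMN), true);
}

// On vote: check the session first, then write through to the database.
function castVote(PDO $pdo, int $userID, int $commentID): bool {
    if (isset($_SESSION['votedOn'][$commentID])) {
        return false; // already voted, don't count it
    }
    $stmt = $pdo->prepare('INSERT INTO userCommentVotes (commentID, userID) VALUES (?, ?)');
    $stmt->execute([$commentID, $userID]);
    $_SESSION['votedOn'][$commentID] = true;
    return true;
}
```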

• ## Users who have thanked jaap for this post:

deamonlizard (02-20-2008)

• Mmm, you're probably right

The first method would be more efficient in terms of space - if I represent the comment ID 402137 as TEXT it would take up 6 bytes, whereas that same number as an INT would take up 32 bits / 4 bytes. (Sorry for that - I REALLY REALLY need to code this as efficiently as possible)

Also, MySQL has a lot of optimisations (like SQL_CACHE), so I think it will be faster to pinpoint a record compared to method 3, where PHP must do a strpos() on a massively long string

But I still have to ask - is this the way digg does it? They must have a table with over a billion records (if not more!). Just imagine doing a mysqldump on that!

• Maybe a billion, maybe more... can't tell, I'm not them

This way, you're using the database like it's meant to be used. This is always more efficient than storing comma-separated values in a text field.

• Are you sure they are using MySQL though?
Surely other databases should be more efficient at storing excessive data.

• yea no they definitely use MySQL (and PHP!!)

found the following articles, which give an in-depth view of the infrastructure at digg:
http://www.oreillynet.com/onlamp/blo..._and_perf.html
http://www.highscalability.com/digg-architecture

As for the amount of data, I have full trust in MySQL. Hell, even NASA is using MySQL to store massive amounts of data from space missions

• One thing that will help searches go faster is to break the data up into as many sub-tables as possible, within reason.

So if Digg has 1,000,000 articles that can be dugg, that's a huge table to search when you multiply it by the number of votes for each one. But if you split the tables up into, say, News_Technology_Apple_Rumors, there might only be 4,500 articles in that section - a much smaller list to search through.

You might think that won't work because articles make their way out of that section through each rung until they hit the homepage, but that article still belongs to that deep sub-category.

This method isn't perfect, as it could still get huge. You could even split it up by month as well as by category:

Something like this:
Code:
```
All                          1,000,000
News                           800,000
News_Tech                      630,000
News_Tech_Apple                 95,000
News_Tech_Apple_iPod            13,000
News_Tech_Apple_iPod_0208        1,278
```

• ## Users who have thanked Ultragames for this post:

deamonlizard (02-20-2008)

• Wait - I think we're onto something here!

I could do it the other way around - split based on the userID. The userID is always available from the login

So I could have these subtables

userFrom0_999VotedOn
userFrom1000_1999VotedOn
userFrom2000_2999VotedOn

I can then write a function getBounds that will return "0_999" if the userID is, for example, 324 and "4000_4999" if the userID is 4581

The code would then look like

PHP Code:
```php
$tableBoundString = getBounds($userID);
$query = "SELECT * FROM userFrom{$tableBoundString}VotedOn WHERE userID = $userID";
```
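For completeness, here's a minimal sketch of the getBounds helper described above (my own naming; assumes buckets of 1000 userIDs):

PHP Code:
```php
<?php
// Hypothetical sketch of getBounds: maps a userID to its 1000-wide
// table bucket, e.g. 324 -> "0_999", 4581 -> "4000_4999".
function getBounds(int $userID): string
{
    $low = intdiv($userID, 1000) * 1000;
    return $low . '_' . ($low + 999);
}
```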
By splitting the "voting" table up from a user point of view, it is also possible to implement the method Jaap suggested. Since the userID is always known (even if the user isn't on an article page - for example, he logged in on the front page), it is possible to query that massive table only once, when the user logs in. After that his "userVotedOn" data is stored in a session variable.

As a sidenote: I'm not actually using this on a digg/reddit-like page. I used that as an example because everyone has seen their comment voting system. I'm implementing this for a client on a custom-written forum, and I want to give users the ability to vote a message in a thread up or down.

The problem is - the website has only been running for a year and there are already 29,800 messages in the forum, generated by 17,200 users. The website grows between 11-14% per month, so I'm really worried about space and speed. There could easily be 80,000 messages in the forum a year down the line.

So using the method above (splitting the voting table for every 1000 users), I would have around 18 userFrom{bounds}VotedOn tables, and each of those tables could be:

(1000 users) * (30,000 possible new votes on messages in threads in the next year) = 30,000,000 entries per table! F! me

Oh well. I'll let you guys know how it went. If this works I'll have an absolute newfound respect for MySQL!

Thank you for everyone's input!!

• http://www.sythe.org (Illegal RuneS cheating site. Not advised to visit, simply an example.) uses that feature. You may be able to ask them.

Since you're using it for a forum, I would just suggest adding a column for ratings to the posts table in the database, then setting up the system from there. If you get what I mean.

• Originally Posted by deamonlizard
yea no they definitely use MySQL (and PHP!!)

found the following articles, which give an in-depth view of the infrastructure at digg:
http://www.oreillynet.com/onlamp/blo..._and_perf.html
http://www.highscalability.com/digg-architecture

As for the amount of data, I have full trust in MySQL. Hell, even NASA is using MySQL to store massive amounts of data from space missions

Brilliant site: highscalability. I think you found the article about Shards yourself, but here's an interesting one for you:
http://www.highscalability.com/unort...n-coming-shard

Don't know where they get their data though...

• Your supposition that method #1 is inefficient is incorrect. That table would have a unique index on (userID, commentID) that would make lookups very fast well beyond a million rows on commodity hardware, and if it does get slow it's trivial to partition horizontally by storing blocks of user or comment IDs in different physical databases.

The bitwise method will not work at all per your explanation, and method 3 is very inefficient and contrary to the logic of DB design - essentially using a column as a table that can't be indexed, and which involves a lot of reallocation of space to add even small amounts of data.
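To make the point above concrete: with a unique index on the pair, the SELECT-then-INSERT dance can even be collapsed into one statement (INSERT IGNORE is MySQL-specific):

Code:
```sql
CREATE UNIQUE INDEX idx_user_comment ON userCommentVotes (userID, commentID);

-- The insert is silently skipped if this (userID, commentID)
-- pair already exists, so duplicates never need a prior SELECT.
INSERT IGNORE INTO userCommentVotes (userID, commentID) VALUES (324, 5556);
```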

• It's great to build your system so that it would still work if every human voted on every thread, but you have to realize that that's probably not going to happen.

I've seen news hit the homepage of Digg with just a few hundred votes. Even then, you almost never see any go above a few thousand. Digg has a HUGE number of users. So I think when you do the math, it's a rather small percentage of users who actually vote. If so, that gives you a lot of breathing room.

• Originally Posted by Ultragames
It's great to build your system so that it would still work if every human voted on every thread, but you have to realize that that's probably not going to happen.

I've seen news hit the homepage of Digg with just a few hundred votes. Even then, you almost never see any go above a few thousand. Digg has a HUGE number of users. So I think when you do the math, it's a rather small percentage of users who actually vote. If so, that gives you a lot of breathing room.
Not to mention the fact that the last three posters (Ralph_L_Mayo, yourself and myself), who have been here for a combined total of more than 10 years, don't even have 3,000 combined posts.

I use method one for rating music on a website with about 2,000 members - every member MUST rate a song before they can download it - and I don't have any lag whatsoever. I haven't done a count on that database lately, but it was somewhere in the ballpark of 600,000-700,000 rows back in October.

As Ralph has stated, the rows are indexed, so it doesn't have to compare row by row. You can think about it like an encyclopedia set. (If you're old enough to remember them ) If you were to look up information on NASA, you wouldn't start at the first book and read every page until you found NASA. You'd simply select the N volume and start searching there.

Edit: I have a table that associates keywords to products with 2,440,919 rows. My system seems okay.

