  1. #1 (deamonlizard, New Coder)

    How does digg/reddit do it?

    Hi Guys

    Have you ever wondered how digg, reddit and other huge social network sites keep track of who voted for what?

    Take digg, for example. Each user can post a comment, and each user can also vote a comment up or down. But they also stop you from voting on the same comment twice.

    It's not a time-based thing: if you voted "+1" on a comment 6 months ago, you cannot go and "+1" it again 6 months later.

    My question is: How do they keep track of all those votes programmatically?

    The first and obvious solution is to create a separate table, say userCommentVotes, with columns:

    commentID (int), userID (int)

    So if userID 324 votes for the commentID 5556 you just insert that into the table, and if the user tries to vote for that same comment again, you do a
    Code:
    SELECT * FROM userCommentVotes WHERE commentID = 5556 AND userID = 324
    If the database returns a row, you know the user has already voted, and you don't count the second vote.
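A minimal sketch of this scheme, with SQLite standing in for MySQL (table and column names as in the post; the unique key and the try_vote helper are my additions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE userCommentVotes (
        commentID INTEGER NOT NULL,
        userID    INTEGER NOT NULL,
        UNIQUE (commentID, userID)  -- one vote per user per comment
    )
""")

def try_vote(user_id, comment_id):
    """Record a vote; return False if this user already voted on this comment."""
    cur = conn.execute(
        "SELECT 1 FROM userCommentVotes WHERE commentID = ? AND userID = ?",
        (comment_id, user_id))
    if cur.fetchone() is not None:
        return False  # already voted, so don't count it
    conn.execute("INSERT INTO userCommentVotes (commentID, userID) VALUES (?, ?)",
                 (comment_id, user_id))
    return True

print(try_vote(324, 5556))  # True: first vote is counted
print(try_vote(324, 5556))  # False: second vote is rejected
```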

    That's great for small sites, but if you have 1,000 users and 1,000 comments, the userCommentVotes table could grow to 1000 * 1000 = 1 million rows! Not very efficient.

    No go. What about bitwise?

    The next obvious solution would be to use bitwise operators. That way you could store just one integer value for each user and use a bitwise test to figure out whether user 324 has indeed already voted on commentID 5556. Much, much more efficient.

    But! MySQL only allows you to store 64-bit integers, and with bitwise operators you need values up to

    Code:
    2^x-1
    x = number of options
    This means that if you have 1,000 comments, you'll need to store an integer value of 2^1000 - 1, which is the astronomical figure of roughly 1.07 x 10^301.

    And what if you have 7,000 comments? I'm not even going to try to calculate that. Also, I don't think doing bitwise operations on a number around 1.07 x 10^301 is very efficient (never mind the fact that PHP cannot store that number as an integer; it will convert it to a float).
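The 64-bit ceiling is easy to see in a quick sketch (Python integers are arbitrary-precision, so this only illustrates the sizes involved; note that testing a bit uses AND and setting one uses OR):

```python
# x options need a bitmask able to hold values up to 2^x - 1
def mask_max(x):
    return 2 ** x - 1

print(mask_max(64))              # largest value a 64-bit column can hold
print(len(str(mask_max(1000))))  # 2^1000 - 1 has 302 decimal digits

# Testing whether a bit is set uses AND; setting it uses OR:
voted = 0
voted |= 1 << 5556                 # user votes on comment 5556
print(bool(voted & (1 << 5556)))   # True: already voted
```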

    No go, TEXT field in database maybe?

    Now I'm starting to think that creating a table with a text field might just be the best option. For example

    userID (int), votedOnComments (text) (ie mysql's TEXT type)

    I then use comma-separated values in the text field whenever a user votes on a comment. For example, if user 324 voted on comments 5556, 9323, and 10345, the row would look like this:

    userID = 324, votedOnComments = "5556,9323,10345"

    Whenever the user tries to vote on comment 9323 again, I can select his votedOnComments from the database and do a string search to see if 9323 pops up. In this case it will, and I won't count his vote again.


    The last method seems to be the best approach, but it still isn't very efficient. If a user voted on 4,000 comments, that votedOnComments TEXT field will get incredibly long.
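A sketch of that lookup also shows a subtle bug on top of the efficiency problem: a naive substring search finds 9323 inside 19323, so the field has to be split before checking (the has_voted helper is my own name):

```python
voted_on_comments = "5556,9323,10345"  # the TEXT column from the post

# A naive substring search gives a false positive:
print("9323" in "19323,10345")  # True, even though 9323 is not in the list

def has_voted(csv_text, comment_id):
    """Split the comma-separated TEXT field and check for an exact ID match."""
    return str(comment_id) in csv_text.split(",")

print(has_voted(voted_on_comments, 9323))  # True
print(has_voted("19323,10345", 9323))      # False
```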

    I'm pretty sure I'm missing something. There MUST be a better way.

    Any ideas?
    Last edited by deamonlizard; 02-19-2008 at 11:42 PM. Reason: Because I could

  • #2 (Ludatha, Regular Coder)
    I want to see the outcome of this :P

  • #3 (jaap, New Coder, Groningen, Netherlands)
    I'd go with your first solution. It's really not that strange: you're counting the number of votes on an article and deciding whether a user is allowed to vote.

    All information about an article is stored in an articles table, as is the information about a user. I'd give the votes their own table.

    The moment a user logs on, his credentials get loaded. At the same time, a quick look is taken at the votes table. All this user info is stored in a session, so it's only looked up once. When the user votes, update the database and the session and you're done.
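In code, the idea might look something like this (a Python sketch with invented names; the session dict stands in for PHP's $_SESSION, and SQLite stands in for MySQL):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE userCommentVotes (commentID INTEGER, userID INTEGER)")

def fetch_voted_comment_ids(user_id, db):
    """One SELECT against the votes table, done only at login."""
    return {comment_id for (comment_id,) in
            db.execute("SELECT commentID FROM userCommentVotes WHERE userID = ?",
                       (user_id,))}

session = {}

def on_login(user_id, db):
    session["user_id"] = user_id
    session["voted"] = fetch_voted_comment_ids(user_id, db)  # the one lookup

def vote(comment_id, db):
    if comment_id in session["voted"]:  # answered from memory, no query
        return False
    db.execute("INSERT INTO userCommentVotes (commentID, userID) VALUES (?, ?)",
               (comment_id, session["user_id"]))
    session["voted"].add(comment_id)    # keep session and database in step
    return True

on_login(324, db)
print(vote(5556, db))  # True: counted, database and session updated
print(vote(5556, db))  # False: caught without touching the database
```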

  • Users who have thanked jaap for this post:

    deamonlizard (02-20-2008)

  • #4 (deamonlizard, New Coder)
    Mmm, you're probably right

    The first method would also be more efficient in terms of space: if I represent the comment ID 402137 as TEXT, it takes up 6 bytes, whereas that same number as an INT takes up 32 bits / 4 bytes. (Sorry for that; I really need to code this as efficiently as possible.)

    Also, MySQL has a lot of optimisations (like SQL_CACHE), so I think it will be faster to pinpoint a record compared to method 3, where PHP must do a strpos() on a massively long string.

    But I still have to ask: is this the way digg does it? They must have a table with over a billion records (if not more!). Just imagine doing a mysqldump on that!
    Last edited by deamonlizard; 02-20-2008 at 12:25 AM.

  • #5 (jaap, New Coder, Groningen, Netherlands)
    Maybe a billion, maybe more... can't tell, I'm not them

    This way, you're using the database the way it's meant to be used. That is always more efficient than storing comma-separated values in a text field.

  • #6 (Ludatha, Regular Coder)
    Are you sure they are using MySQL, though?
    Surely other databases would be more efficient at storing that much data.

  • #7 (deamonlizard, New Coder)
    Yeah, they definitely use MySQL (and PHP!).

    I found the following articles, which give an in-depth view of the infrastructure at digg:
    http://www.oreillynet.com/onlamp/blo..._and_perf.html
    http://www.highscalability.com/digg-architecture

    As for the amount of data, I have full trust in MySQL. Hell, even NASA is using MySQL to store massive amounts of data from space missions:

    http://linux.sys-con.com/read/46490.htm
    Last edited by deamonlizard; 02-20-2008 at 12:53 AM.

  • #8 (Ultragames, Regular Coder, Oregon)
    One thing that will help searches go faster is to break the data up into as many sub-tables as possible, within reason.

    So if Digg has 1,000,000 articles that can be dugg, that's a huge table to search when you multiply it by the number of votes for each one. But if you split the tables up, into something like News_Technology_Apple_Rumors, there might only be 4,500 articles in that section, and therefore a much smaller list to search through.

    You might think that won't work because articles make their way out of that section through each rung until they hit the homepage, but an article still belongs to its deep sub-category.

    This method isn't perfect, as the tables could still get huge. You could even split them up by month as well as by category:

    Something like this:
    Code:
    All                          1,000,000
    News                           800,000
    News_Tech                      630,000
    News_Tech_Apple                 95,000
    News_Tech_Apple_iPod            13,000
    News_Tech_Apple_iPod_0208        1,278

  • Users who have thanked Ultragames for this post:

    deamonlizard (02-20-2008)

  • #9 (deamonlizard, New Coder)
    Wait - I think we're onto something here!

    I could do it the other way around and split based on the userID. The userID is always available from the login.

    So I could have these subtables

    userFrom0_999VotedOn
    userFrom1000_1999VotedOn
    userFrom2000_2999VotedOn

    I can then write a function getBounds that will return 0_999 if the userID is, for example, 324, and return 4000_4999 if the userID is 4581.

    The code would then look like this:

    PHP Code:
    $tableBoundString = getBounds($userID);
    $query = "SELECT * FROM userFrom{$tableBoundString}VotedOn WHERE userID = $userID";
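That getBounds helper (my own sketch of it, in Python for brevity) just rounds the userID down to its 1,000-wide bucket:

```python
def get_bounds(user_id, bucket_size=1000):
    """Return the '{low}_{high}' suffix of the sub-table holding this user."""
    low = (user_id // bucket_size) * bucket_size
    return f"{low}_{low + bucket_size - 1}"

print(get_bounds(324))   # 0_999
print(get_bounds(4581))  # 4000_4999
```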
    By splitting the "voting" table up from a user point of view, it is also possible to implement the method Jaap suggested. Since the userID is always known (even if the user isn't on an article page; for example, he logged in on the front page), it is possible to hit that massive table only once, when the user logs in. After that, his "userVotedOn" data is stored in a session variable.

    As a side note: I'm not actually using this on a digg/reddit-like page. I used that as an example because everyone has seen their comment voting system. I'm implementing this for a client on a custom-written forum, and I want to give users the ability to vote a message in a thread up or down.

    The problem is that the website has only been running for a year and there are already 29,800 messages in the forum, generated by 17,200 users. The website grows between 11 and 14% per month, so I'm really worried about space and speed. There could easily be 80,000 messages in the forum a year down the line.

    So using the method above (splitting the voting table for every 1,000 users), I would have 17 userFrom{bounds}VotedOn tables, and each of those tables could hold:

    (1,000 users) * (30,000 possible new votes on messages in threads in the next year) = 30,000,000 entries per table! F! me

    Oh well. I'll let you guys know how it went. If this works, I'll have an absolute newfound respect for MySQL!

    Thank you for everyone's input!!

  • #10 (Deacon Frost, Regular Coder)
    http://www.sythe.org (an illegal RuneS cheating site; not advised to visit, simply an example) uses that feature. You may be able to ask them.

    YouTube also uses it.

    Since you're using it for a forum, I would just suggest adding a column for ratings to the posts table in the database, then setting up the system from there. If you get what I mean.

  • #11 (jaap, New Coder, Groningen, Netherlands)
    Quote Originally Posted by deamonlizard View Post (post #7 quoted in full)
    Brilliant site, highscalability. I think you found the article about shards yourself, but here's an interesting one for you:
    http://www.highscalability.com/unort...n-coming-shard

    Don't know where they get their data though...

  • #12 (ralph l mayo, Regular Coder)
    Your supposition that method #1 is inefficient is incorrect. That table would have a unique index on (userID, commentID) that would make lookups very fast well beyond a million rows on commodity hardware, and if it does get slow, it's trivial to partition horizontally by storing blocks of user or comment IDs in different physical databases.

    The bitwise method will not work at all as you describe it, and method 3 is very inefficient and contrary to the logic of DB design: it essentially uses a column as a table that can't be indexed, and it involves a lot of reallocation of space to add even small amounts of data.
    Last edited by ralph l mayo; 02-21-2008 at 06:22 AM.

  • #13 (Ultragames, Regular Coder, Oregon)
    It's great to build your system so that it would still work if every human voted on every thread, but you have to realize that that's probably not going to happen.

    I've seen news hit the homepage of Digg with just a few hundred votes. Even then, you almost never see any go above a few thousand. Digg has a HUGE number of users, so when you do the math, it's a rather small percentage of users who actually vote. If so, that gives you a lot of breathing room.

  • #14 (StupidRalph, Senior Coder, Atlanta)
    Quote Originally Posted by Ultragames View Post (post #13 quoted in full)
    Not to mention that the last three posters (ralph l mayo, yourself, and myself), who have been here for a combined total of more than 10 years, don't even have 3,000 posts between us.

    I use method one for rating music on a website with about 2,000 members. Every member MUST rate a song before they can download it, and I don't have any lag whatsoever. I haven't done a count on that database lately, but it was somewhere in the ballpark of 600,000-700,000 rows back in October.

    As Ralph has stated, the rows are indexed, so the database doesn't have to compare row by row. You can think of it like an encyclopedia set (if you're old enough to remember them). If you were to look up information on NASA, you wouldn't start at the first book and read every page until you found NASA. You'd simply select the N volume and start searching there.

    Edit: I have a table that associates keywords to products with 2,440,919 rows. My system seems okay.
    Last edited by StupidRalph; 02-21-2008 at 01:51 AM.

