1. #1 PikaD (New Coder)

    Perform complex calculations entirely within MySQL

    I currently have this working by passing bucket-loads of data to PHP and then doing all the calculations there. I want (and most likely need) to perform all the calculations using MySQL only. I know that it is possible to write functions and procedures in MySQL but I've never done it before and every time I sit down to get this written I confuse myself into a crying mess of self-disappointment. So....

    Explanation of database:
    Let's start with a database diagram:


    You can immediately disregard the users table. We'll not bother with that in this question.

    So this database holds information drawn from web pages. You can see the links table holds the URL and title, as well as the maximum term frequency for any term in that document.

    The object table is just there to allow both users and links to be referenced by the same foreign key in the occurences table. BTW, I am aware of the constant misspelling of occurrences as occurences! =)

    The occurences table holds the majority of the data. The table has a primary foreign key of linkoruser (though just ignore user for this). It has word_id, which refers to the dictionary table; tag_id, which refers to the tags table; and finally score, which is the number of occurrences of that word, of that tag type, in that document.

    The tags table has a tag id, the name of a tag (e.g. title or H1) and its weight, which is normally set between 1 and 10.

    The dictionary table has a word id, the actual word, and the inverse document frequency of that word. If you care what an IDF is: it's the total number of web pages divided by the number of web pages containing that specific word somewhere.

    Explanation of the system's function:
    Given a link ID (we'll call it BaseID), all other links are compared for similarity (cosine similarity) and the IDs, url and titles of all links are shown to the user, ordered by most to least similar.

    Scores for each word are:
    ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf
    i.e. the word's number of occurrences in the document, multiplied by the tag weighting for this particular word's tag, divided by the maximum term frequency for that document. That is then multiplied by the word's inverse document frequency to give the final score for that word, in that context.
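    In SQL terms, the per-word score would look something like this (just a sketch to show what I mean, using the column names from the diagram above; 726 is only an example link ID):
    Code:
    -- Sketch: per-word scores for one document (linkoruser = 726 as an example)
    SELECT d.word,
           ((o.score * t.weight) / l.max_tf) * d.word_idf AS word_score
    FROM occurences AS o
    JOIN tags       AS t ON t.tag_id  = o.tag_id
    JOIN links      AS l ON l.idlinks = o.linkoruser
    JOIN dictionary AS d ON d.word_id = o.word_id
    WHERE o.linkoruser = 726;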

    The similarity is done via cosine similarity, best shown in the image below.
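    (In case the image doesn't come through, it's the standard cosine similarity formula:

    similarity(A, B) = (A · B) / (||A|| * ||B||)
                     = SUM(Ai * Bi) / ( SQRT(SUM(Ai * Ai)) * SQRT(SUM(Bi * Bi)) )

    where Ai and Bi are the final scores of word i in documents A and B.)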


    So each page has its similarity to the BaseID, and the system shows the user all the page IDs, URLs and titles, ordered by similarity from most to least similar.


    Explanation of the problem:
    Currently, massive amounts of data are dragged out of MySQL and handed over to PHP, which then does all the processing.

    This is because I only know my way around basic MySQL. I know it is possible to use functions and procedures, but I am getting mighty tangled.

    What I would like is to pass a BaseID to MySQL and have it return the page IDs, URLs and titles, ordered by similarity from most to least similar.

    I know this is a massive question, and nobody's getting paid to sit here and churn out solutions. So I really appreciate that you've even read this far down!

    Thanks in advance!

    P.S. If you want a download of the database so you can have a fiddle, it's available here:
    http://dl.dropbox.com/u/22289145/linksitev2.sql

  • #2 Old Pedant (Supreme Master coder!)
    Ummm...occurences table must also have a document id? Oh...I see...that's the linkoruser field, right?

    This could probably be done without using temp tables, but it's likely easier to code by creating at least one and probably two temp tables as part of the stored procedure.

    Hmmm...or maybe we should just add another field to the occurences table? No, wait. You already have the score in there.

    Hmmm...not sure this isn't easier than you think.

    Let me cogitate a while.

  • #3 Old Pedant
    Okay, explain this:
    The occurences table holds ... score which is the number of occurrences of that word, of that tag type, in that document.
    ...
    Scores for each word are:
    ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf
    You use "scores" to mean two different things. Wouldn't "frequency" have been a better name for the occurences.score field?

    Why can't you pre-compute the final score for each occurrence?? I don't see anything in your formula there which would be time-variant. Am I missing something?

    Hmmm...or does word_idf change over time, as new documents are added? Looks like it. Still, you could pre-compute all but that last multiply, no?

    The more tables you can leave out of the massive ugly JOIN that this will need, the better. So being able to leave out tags and links would be a big help.

  • #4 Old Pedant
    One more question: So the summations used must take into account any word found in *EITHER* of the two documents, right?

    But no, that's not really true. If a word has a "final score" of zero in either table (e.g., it's not there) then multiplying 0 by non-zero is still zero. So for the numerator we only care about words that are indeed in both documents. And for each of the sums in the denominator, we won't even *see* a zero value.

    Hmmm...I can see this thing is going to take forever, but it doesn't seem very complex, actually.

    So far, I have done these two operations on your occurences table:
    Code:
    -- add a "partial score" column: (score * tag weight) / max term frequency for the document
    ALTER TABLE occurences ADD pscore FLOAT;
    
    UPDATE occurences, tags, links
    SET occurences.pscore = 1.0 * occurences.score * tags.weight / links.max_tf
    WHERE occurences.tag_id = tags.tag_id
    AND occurences.linkoruser = links.idlinks;
    Adding that "pscore" field and then calculating it took several minutes for each operation, so that is several minutes that won't have to be repeated in each computation.

    It looks to me like it's going to be worth creating at least one TEMP table: The first operation will be to go out and get the ||A|| value for each document and store that in the TEMP table. Though it's tempting to add it to the LINKS table.

    You certainly aren't going to have two people running this query at the same time are you? And/or you aren't going to add a new document while the results are being generated? (Doing so would alter the dictionary.word_idf value midway through the operation, thus invalidating all work done to that point.)

  • #5 Old Pedant
    Okay, I've now done these steps:
    Code:
    CREATE TABLE squares (
        idlinks INT PRIMARY KEY,
        sumofsquares FLOAT,
        similarity FLOAT );
    
    -- one row per document: the sum of squared word scores (pscore * word_idf)
    INSERT INTO squares ( idlinks, sumofsquares )
    SELECT links.idlinks, SUM( POWER( (occurences.pscore * dictionary.word_idf), 2 ) ) AS ss
           FROM links, occurences, dictionary
           WHERE links.idlinks = occurences.linkoruser
           AND occurences.word_id = dictionary.word_id
           GROUP BY links.idlinks;
    When I convert this into a stored procedure, either the squares table will be a temp table or I will truncate it before doing the INSERT.
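    (If it ends up as a temp table, it would presumably just be the same definition declared TEMPORARY; sketch only:)
    Code:
    -- sketch: session-local version of the same table, dropped automatically when the session ends
    CREATE TEMPORARY TABLE squares (
        idlinks INT PRIMARY KEY,
        sumofsquares FLOAT,
        similarity FLOAT );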

    Does that math make sense to you so far?

    [By the by: The INSERT there took only 21 seconds on my 4-year old machine. Precalculating those pscore values was a tremendous time saver.]

  • #6 Old Pedant
    Okay, I completed it. But something seems to be wrong.

    The similarity for link 726 *TO ITSELF* is not even close to the highest similarity score.

    I've been over the numbers three times now, and they all seem right. But why, then, doesn't the similarity score make sense?

    FWIW, here are the top 10 (highest) similarity scores to 726:
    Code:
    mysql> select * from similarities order by similarity desc limit 10;
    +---------+-----------------+
    | idlinks | similarity      |
    +---------+-----------------+
    |    1933 | 0.000000118274  |
    |    2869 | 0.000000109086  |
    |     720 | 0.0000000935016 |
    |    4516 | 0.0000000910589 |
    |    2504 | 0.0000000862625 |
    |    5677 | 0.0000000843    |
    |    1132 | 0.0000000838255 |
    |    2792 | 0.0000000828706 |
    |    3799 | 0.0000000826505 |
    |    5202 | 0.0000000824886 |
    +---------+-----------------+
    and for comparison:
    Code:
    mysql> select * from similarities where idlinks = 726;
    +---------+-----------------+
    | idlinks | similarity      |
    +---------+-----------------+
    |     726 | 0.0000000527725 |
    +---------+-----------------+
    I'm going to try rerunning the numbers later, but I think it's time for you to chime in.

  • #7 Old Pedant
    Here's a sample dump of some of the data that goes into creating the numbers.

    pscore is that precalculated value of occurences.score * tags.weight / links.max_tf

    What do you think?
    Code:
    mysql> select d.word, o.pscore, d.word_idf, o.pscore * d.word_idf, o.score
        -> from dictionary as d, occurences as o
        -> where d.word_id = o.word_id
        -> and o.linkoruser = 725
        -> limit 20;
    +----------+----------+----------+-----------------------+-------+
    | word     | pscore   | word_idf | o.pscore * d.word_idf | score |
    +----------+----------+----------+-----------------------+-------+
    | new      | 0.333333 |     4376 |     1458.666710138321 |     1 |
    | what     | 0.333333 |     1619 |     539.6666827499866 |     1 |
    | usernam  | 0.333333 |      156 |    52.000001549720764 |     1 |
    | password | 0.666667 |      436 |     290.6666753292084 |     2 |
    | member   | 0.333333 |     1287 |     429.0000127851963 |     1 |
    | happen   | 0.333333 |      766 |    255.33334094285965 |     1 |
    | updat    | 0.333333 |     2353 |     784.3333567082882 |     1 |
    | world    | 0.333333 |     2447 |     815.6666909754276 |     1 |
    | expert   | 0.333333 |      268 |     89.33333599567413 |     1 |
    | favorit  | 0.333333 |      618 |     206.0000061392784 |     1 |
    | friend   | 0.333333 |     2087 |     695.6666873991489 |     1 |
    | twitter  | 0.666667 |     2242 |    1494.6667112112045 |     1 |
    | twitter  | 0.666667 |     2242 |    1494.6667112112045 |     1 |
    | twitter  |        1 |     2242 |                  2242 |     3 |
    | email    | 0.333333 |     2264 |      754.666689157486 |     1 |
    | celebr   | 0.333333 |     1308 |    436.00001299381256 |     1 |
    | sign     | 0.333333 |     2908 |     969.3333622217178 |     1 |
    | sign     | 0.666667 |     2908 |    1938.6667244434357 |     1 |
    | sign     |        1 |     2908 |                  2908 |     3 |
    | forgot   | 0.333333 |      166 |     55.33333498239517 |     1 |
    +----------+----------+----------+-----------------------+-------+

  • #8 PikaD (New Coder)
    Quote Originally Posted by Old Pedant View Post
    Here's a sample dump of some of the data that goes into creating the numbers. ... What do you think?

    I just checked this forum now. Three things I really need to say:
    1) I haven't read all this yet.
    2) I am amazed someone is helping! All the other forums were mean to me
    3) Old Pedant is truly an outstanding username.

    I will read through all this ASAP. Thanks for everything so far O.P.

  • #9 PikaD
    Quote Originally Posted by Old Pedant View Post
    Okay, explain this:

    You use "scores" to mean two different things. Wouldn't "frequency" have been a better name for the occurences.scores field?

    Why can't you pre-compute the final score for each occurrence?? I don't see anything in your formula there which would be time-variant. Am I missing something?

    Hmmm...or does word_idf change over time, as new documents are added. Looks like it. Still, you could pre-compute all but that last multiply, no?

    The more tables you can leave out of the massive ugly JOIN that this will need, the better. So being able to leave out tags and links would be a big help.


    Yes, frequency would indeed be better. I'm working with code previously written by someone else. I also think it would help if they'd spelled 'occurrences' correctly, but such is life.

    Anyhoo, word_idf can indeed change as more pages are added to the database. Similarly, tag weight can be changed by the administrator at any time, which negates your pre-calculation of pscore I'm afraid. I'm just sorry I wasn't at a PC to catch this message when you asked.


    Quote Originally Posted by Old Pedant View Post
    ...
    You certainly aren't going to have two people running this query at the same time are you? And/or you aren't going to add a new document while the results are being generated? (Doing so would alter the dictionary.word_idf value midway through the operation, thus invalidating all work done to that point.)
    Erm, in theory the system might be used by more than one user, though I think MySQL queues users, right? And yes, if a new doc is added the results are inaccurate, but the larger the database the smaller the inaccuracy.


    Quote Originally Posted by Old Pedant View Post
    Okay, I completed it. But something seems to be wrong.

    The similarity for link 726 *TO ITSELF* is not even close to the highest similarity score.
    ...
    I'm going to try rerunning the numbers later, but I think it's time for you to chime in.
    Chiming. Those similarities are very low. If it's any consolation, the current system has a bug where occasionally relevancy is higher than 1.0! I'm sure this bug can be hunted.

    Quote Originally Posted by Old Pedant View Post
    Here's a sample dump of some of the data that goes into creating the numbers.

    pscore is that precalculated value of occurences.score * tags.weight / links.max_tf

    What do you think? ...

    The PSCORE seems to be a third, two thirds or one. Perhaps there's something there?


    Also I think I'm missing out on some of your calculations.
    Your squares table says:

    SELECT links.idlinks, SUM( POWER( (occurences.pscore * dictionary.word_idf), 2 ) ) AS ss
    FROM links, occurences, dictionary
    WHERE links.idlinks = occurences.linkoruser
    AND occurences.word_id = dictionary.word_id
    GROUP BY links.idlinks;

    Which confuses me (maybe because I'm bad at maths!) as it doesn't seem to calculate similarity. Maybe I'm not seeing all of your calculations, or maybe I'm just being dumb.
    To calculate similarity you create a list of all words that appear in either the base document or the document you are comparing it with. We will call these DOC1 and DOC2 respectively.

    For each DOC/word pairing you take its true score, i.e. ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf.
    If the word does not appear in one of the documents then its score is zero.

    Let's assume the documents have a total of 6 words between them, all appearing in both. (In that case, if the scores matched exactly we'd see a similarity of 1.0, but this is more about explaining the sum than the result.)

    We'll represent these doc/word pairing scores as D1W1, D2W1, D1W2, D2W2, D1W3, D2W3, D1W4, D2W4, D1W5, D2W5, D1W6, D2W6

    Now the top part (T) of the similarity calculation is:
    T = (D1W1 * D2W1) + (D1W2 * D2W2) + (D1W3 * D2W3) + (D1W4 * D2W4) + (D1W5 * D2W5) + (D1W6 * D2W6)

    The bottom left (BL) is:
    BL = SquareRoot((D1W1*D1W1) + (D1W2*D1W2) + (D1W3*D1W3) + (D1W4*D1W4) + (D1W5*D1W5) + (D1W6*D1W6))

    The bottom right (BR) is:
    BR = SquareRoot((D2W1*D2W1) + (D2W2*D2W2) + (D2W3*D2W3) + (D2W4*D2W4) + (D2W5*D2W5) + (D2W6*D2W6))

    Making the bottom (B)
    B = BL * BR

    And the full similarity:
    T over B

    As seen in http://i.imgur.com/n3Bp8.jpg
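    If I try to write that out in MySQL myself, I get something like the sketch below. It is untested and just meant to show the shape of the calculation: it reuses your pscore column and your squares table for BL and BR, treats word_idf as the IDF itself (as your squares query does), and uses 726 as the example BaseID.
    Code:
    -- Sketch only: similarity of every document to the BaseID (726 here).
    -- T (the dot product) is built from words the two documents share;
    -- BL * BR comes from the squares table, with the square roots applied here.
    SELECT o2.linkoruser AS idlinks,
           SUM( (o1.pscore * d.word_idf) * (o2.pscore * d.word_idf) )
             / ( SQRT(sqA.sumofsquares) * SQRT(sqB.sumofsquares) ) AS similarity
    FROM occurences AS o1
    JOIN occurences AS o2  ON o2.word_id  = o1.word_id     -- words in BOTH documents
    JOIN dictionary AS d   ON d.word_id   = o1.word_id
    JOIN squares    AS sqA ON sqA.idlinks = o1.linkoruser
    JOIN squares    AS sqB ON sqB.idlinks = o2.linkoruser
    WHERE o1.linkoruser = 726
    GROUP BY o2.linkoruser, sqA.sumofsquares, sqB.sumofsquares
    ORDER BY similarity DESC;
    -- Note: a word can appear on several rows of occurences (one per tag), so strictly
    -- the per-word scores should be summed per word first; this sketch glosses over that.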

  • #10 PikaD
    Ooh, I just noticed a mistake I made in my description which explains a lot.
    For each word's Inverse Document Frequency there is actually a calculation required:

    Inverse Document Frequency = ( SELECT COUNT(idlinks) FROM links ) / dictionary.word_idf
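    (Or, in query terms, something like this sketch, pulling the count once per run:)
    Code:
    -- sketch: total number of documents, fetched once
    SET @doc_count = (SELECT COUNT(idlinks) FROM links);
    -- each word's real IDF is then:  @doc_count / dictionary.word_idf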

    I can see this would greatly improve things.

  • #11 Old Pedant
    AHA! That last message indeed helps explain a lot, I think.

    Regarding tag weight being changeable: Unless this happens frequently, it's still better to have the pscore precalculated. Just rerun the query that creates the pscore whenever the tag weights are changed.

    The squares table is *ONLY* finding the values of ||A|| (and ||B||) as shown in your post. That is, *ONLY* the sum-of-the-squares for one document.

    Those sums-of-squares are also "fixed" for any given document until such time as a tag weight changes or word_idf changes.

    Whether they get recalculated each time or you leave them in place until a change in tag weight or addition of another document occurs is up to you.

    It is only the product (the dividend of the final formula) that depends on WHICH idlinks is chosen as the base. So you can do all the other calculations once, and then you only have to calculate the product once for each base idlinks. Again, supposing that tag weights and word_idf don't change.

    Here's the sad part: We had a power outage just after I did all of the above last night, and I hadn't saved my final work. So I need to reconstruct it.

    Oh...and one major goof on my part: I forgot to do the square roots of the sums! *THAT* would make a *HUGE* difference.
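    (Probably a one-line fix once the squares table is filled in; something like this sketch, or equivalently apply SQRT() at the point of division:)
    Code:
    -- sketch: apply the forgotten square roots in place, so the stored value becomes ||A||
    UPDATE squares SET sumofsquares = SQRT(sumofsquares);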

    So...more later, but I have to get some work done right now.

  • #12 Old Pedant
    So this formula is actually wrong:
    ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf

    Yes?

    It should, instead, be
    ( ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf ) / COUNT(idlinks)

    Right?

  • #13 PikaD
    Quote Originally Posted by Old Pedant View Post
    So this formula is actually wrong:
    ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf

    Yes?

    It should, instead, be
    ( ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf ) / COUNT(idlinks)

    Right?
    Not quite, score is:

    ((occurences.score * tags.weight) / links.max_tf) * (COUNT(idlinks) / dictionary.word_idf)


    If you precalculate the score, and do a mass-recalculation upon tag weight change, it could also be an option to have the IDF precalculated, and have an option for the administrator to 'rebuild' the scores (and obviously do an automatic rebuild whenever tag weights are changed).
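    Something like this is the sort of 'rebuild' I have in mind (sketch only; the idf_value column is hypothetical and isn't in the dump):
    Code:
    -- one-time setup for the hypothetical column: ALTER TABLE dictionary ADD idf_value FLOAT;
    -- 1) refresh the stored IDF for every word
    SET @doc_count = (SELECT COUNT(idlinks) FROM links);
    UPDATE dictionary SET idf_value = @doc_count / word_idf;
    
    -- 2) refresh pscore, exactly as in your earlier UPDATE, whenever a tag weight changes
    UPDATE occurences, tags, links
    SET occurences.pscore = 1.0 * occurences.score * tags.weight / links.max_tf
    WHERE occurences.tag_id = tags.tag_id
    AND occurences.linkoruser = links.idlinks;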

    There has to be a fair pile of calculation done at the time of request though, because there will be a considerable number of different similarity comparisons. If comparing everything to 1 item requires a table of all things, then comparing everything to everything requires a table the size of links SQUARED. Clearly impractical.

  • #14 Old Pedant
    Quote Originally Posted by PikaD View Post
    There has to be a fair pile of calculation done at the time of request though, because there will be a considerable number of different similarity comparisons. If comparing everything to 1 item requires a table of all things, then comparing everything to everything requires a table the size of links SQUARED. Clearly impractical.
    Yes, but as I said, it is only the PRODUCTS part of the equations (the dividend, the product A.B in your diagram) that changes depending on which base idlinks is used.

    And yesterday I was able to calculate all 5106 of those products in around 30 seconds (I didn't time it, but it certainly wasn't a minute). And that's on my 4 year old machine. (Well, it is dual processor, but one of the early cheap ones...only paid $400 for it 4 years ago.)

  • #15 Old Pedant
    The calculation of the final similarity scores then becomes trivial, taking way less than a second.

