Hello and welcome to our community! Is this your first visit?
Register
Enjoy an ad free experience by logging in. Not a member yet? Register.
Results 1 to 4 of 4
  1. #1
    New Coder
    Join Date
    Dec 2008
    Posts
    71
    Thanks
    8
    Thanked 0 Times in 0 Posts

    A complicated search

    Hi,

    What is the fatest way for searching for a database with million of records?

    I have two goals:


    Code:
    rs.open "select count(*) as keywords.word from keywords WHERE keywords.Word IN (select economics.bodytext from economics where economics.indexid = keywords.indexid) and  keywords.Word IN ('" & StrKeyword & "'))  ", conn

    1. Not a list of keywords found at economics.bodytext, but the keyword found together their frequency.

    2. The images of the database





    3. The url que inspirated me to do that program:

    http://www.corpusdelespanol.org/

    It is impressive the speed of their search!


    regards

    BOB

  • #2
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    27,554
    Thanks
    80
    Thanked 4,620 Times in 4,583 Posts
    I can tell you how a company named Dialog did this, back in the late 1980s, early 1990s, when computers were many times slower than they are now.

    They created a custom database engine. With a specially designed way of storing the data.

    You won't do this with a standard relational database, especially things such as "find these two words within 7 words of each other."

    Oh, you might base it on a relational database, but you will have first pre-parsed all the text into a very specialized format. (For example, you might store a words such as "scaring" as "scare" root word and "scaring" as the variant. Which means you need a dictionary that gives you the root word of non-standard variants. "goose" as the root word of "geese", etc.)

    You will certainly never do it using the IN ( ... ) operator or even using LIKE '%goose%'.

    Again, you break each document into its component words, with each word cross indexed to its root word and with its word number in the document (so you can see if two words indeed have word numbers within 7 [or whatever] of each other).

    And and and and...

    In short, you break the problem down to its root components and then build back up to where you can construct a needed answer to a request.
    An optimist sees the glass as half full.
    A pessimist sees the glass as half empty.
    A realist drinks it no matter how much there is.

  • #3
    New Coder
    Join Date
    Dec 2008
    Posts
    71
    Thanks
    8
    Thanked 0 Times in 0 Posts
    Very interesting answer.

    Find data in database requires special techniques indeed. I was thinking in the logic of that system.

    I am surprised with your advice regarding the SQl wildcards "IN" and "Like".

    Thank you for all

    Bob

  • #4
    Supreme Master coder! Old Pedant's Avatar
    Join Date
    Feb 2009
    Posts
    27,554
    Thanks
    80
    Thanked 4,620 Times in 4,583 Posts
    Just for starters, using LIKE will only tell you if *TEXT* exists that exactly matches what you are searching for. So you might get false positives (e.g,, you are looking for "SPREAD" and LIKE finds WIDESPREAD and BEDSPREAD and...) and you won't know how many times the word appears in the text.

    And quite frankly I can't think of any reasonable way to use IN(...), at all. Certainly that query you postulated in the first post makes no sense. Look at the essence of it:
    Code:
    ... keywords.Word IN ( economics.bodytext ) ...
    That makes no sense at all. keywords.Word would presumably be one word (at a time) and economics.bodytext would be a long paragraph. IN would never match anything at all. That's not how IN works.
    An optimist sees the glass as half full.
    A pessimist sees the glass as half empty.
    A realist drinks it no matter how much there is.


  •  

    Posting Permissions

    • You may not post new threads
    • You may not post replies
    • You may not post attachments
    • You may not edit your posts
    •