  1. #1
    Senior Coder
    Join Date
    Jul 2005
    Location
    UK
    Posts
    1,051
    Thanks
    6
    Thanked 13 Times in 13 Posts

    PHP "Artificial Intelligence" - Site Search that learns??

    Over the weekend I've been working on a theoretical model for a PHP/MySQL driven site search consisting of multiple scripts and databases. As I developed the idea it became apparent that what I was imagining is a system that could "learn" about misspellings of the entries in my database and so, in theory, become more and more efficient as time goes on.

    My reason for working on this is that the users of one site I'm working on often have a poor grasp of English. In addition, spellings for many of the entries in the site's database aren't standardised, so there isn't really one version that is absolutely correct.

    While doing some research on advanced matching I became aware of the similar_text() function. It seems to offer something quite powerful but comes with stark warnings about how server intensive it is. Because my anticipated use of the search feature on the site is *huge* I wanted to minimise server load from each request as much as possible. One route for doing this seemed to be to make the search system learn about misspellings on the fly, so that as time goes on the need for similar_text() becomes less and less.

    That's enough waffle for now. I've mocked up a flow chart that shows the processes that the various scripts run through when a search is made. There's also a key describing the necessary databases. The reason I am posting this on CodingForums is because I know there are some great minds here and no doubt there will be things I've missed, things that can be made more efficient, maybe it's already been done, maybe some people want to collaborate in actually developing the scripts etc etc...

    http://www.adambunn.co.uk/Search%20Flow%20Chart.gif

    Note 1: The flow chart is set up with my specific needs in mind, where queries are made on one field which is just the title of the entry - essentially a dictionary-like function. However, I'm sure it could be adapted for a wider site search.

    Note 2: While creating this flow chart I was convinced that similar_text() was actually called similar_string() - so wherever I mention similar_string() assume I mean similar_text().
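
    To make the "lookup first, fuzzy match last" idea concrete, here is a rough sketch of the flow the chart describes, assuming a learned-misspellings table that maps a bad spelling to a known title. All the names (the `$learned` map, `find_match()`, the 70% threshold) are mine for illustration, not from the flow chart:

    ```php
    <?php
    // Sketch of the "check learned misspellings before similar_text()" flow.
    // Arrays stand in for the MySQL tables described in the flow chart key.

    // Learned-misspellings table: bad spelling => canonical title.
    $learned = ['nirvanna' => 'Nirvana', 'beatels' => 'Beatles'];

    // Titles table.
    $titles = ['Beatles', 'Nirvana', 'Oasis'];

    function find_match(string $query, array $learned, array $titles): ?string {
        // 1. Exact match against titles: the cheapest check, do it first.
        if (in_array($query, $titles, true)) {
            return $query;
        }
        // 2. Previously learned misspelling: a single indexed lookup.
        if (isset($learned[$query])) {
            return $learned[$query];
        }
        // 3. Expensive fallback: similar_text() against every title.
        $best = null;
        $bestPct = 0.0;
        foreach ($titles as $title) {
            similar_text(strtolower($query), strtolower($title), $pct);
            if ($pct > $bestPct) {
                $bestPct = $pct;
                $best = $title;
            }
        }
        // Only trust a sufficiently close fuzzy match (threshold is arbitrary).
        return $bestPct >= 70.0 ? $best : null;
    }
    ```

    The point of the ordering is that steps 1 and 2 are indexed lookups, so similar_text() only runs for queries the system has never seen; as the learned table grows, step 3 fires less and less often.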

  • #2
    Senior Coder
    Join Date
    Aug 2003
    Location
    One step ahead of you.
    Posts
    2,815
    Thanks
    0
    Thanked 3 Times in 3 Posts
    You should use soundex()/metaphone() instead. They are much faster and more useful, as they return a value you can compare against. It won't be a "learning" search engine, though.
    I'm not sure if this was any help, but I hope it didn't make you stupider.

    Experience is something you get just after you really need it.
    PHP Installation Guide Feedback welcome.
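
    A quick sketch of why the phonetic functions are cheaper: they reduce a word to a short key, so fuzzy matching becomes a plain equality test instead of a pairwise comparison against every row. The word pair here is illustrative (it's one from the misspelling list later in this thread):

    ```php
    <?php
    // soundex() maps a word to a phonetic key. Two words that sound alike
    // get the same key, so matching is a simple string comparison.

    $canonical = 'balance';
    $typo      = 'balence';

    // Both reduce to the same soundex key.
    $match = soundex($canonical) === soundex($typo);

    // So instead of calling similar_text() against every title per query,
    // you can precompute soundex() for every title once, store it in an
    // indexed column, and match with:
    //   WHERE title_soundex = soundex(:query)
    echo soundex($canonical), ' ', soundex($typo), ' ', $match ? 'match' : 'no match', "\n";
    ```

    metaphone() works the same way (key in, key out) and is generally considered more accurate for English than soundex().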

  • #3
    New Coder
    Join Date
    Sep 2006
    Posts
    51
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Normally what I do is take the search keywords and first run them through a spell-check routine. If I get a match for all keywords, I just stem the words and perform the database search.

    If some words don't match, I get the closest (3-5) suggestions for each word that didn't match, or the single suggestion when the spell checker returns an exact single match (like litttle == little, or poeple == people), and add them to the search request using relevance, so suggestions without an exact single-word match rank below the ones that have one. Then I stem the search words and perform the search!

    Afterwards, I add each misspelled word to the auto-change word list, so the next time a spell check is done, the auto-change list replaces any misspelled word that has already been flagged. This way the search system learns as more searches are done.

    For closed-in applications like a forum search system, I maintain a unique list of words for each user, so the search engine learns how each user likes to search for things. There is a lot more that goes into a refined closed-in search system, because you can add all kinds of neat stuff that makes the search work for each user the way they want.
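
    The pipeline described above (auto-change, spell check, stem, search) can be sketched roughly like this. The dictionary, the auto-change list, and the one-line stemmer are all stand-ins, not the poster's actual class:

    ```php
    <?php
    // Sketch of the pipeline: auto-change known misspellings, spell-check
    // the rest, stem, then search. All data and the stemmer are stand-ins.

    $autoChange = ['litttle' => 'little', 'poeple' => 'people'];
    $dictionary = ['little', 'people', 'search', 'engines'];

    function prepare_query(string $raw, array $autoChange, array $dictionary): array {
        $out = [];
        foreach (preg_split('/\s+/', strtolower(trim($raw))) as $word) {
            // 1. Auto-change list: replace misspellings already learned.
            if (isset($autoChange[$word])) {
                $word = $autoChange[$word];
            }
            // 2. Spell check: a real system would fetch the closest 3-5
            //    suggestions here for words not in the dictionary. This
            //    sketch just passes unknown words through unchanged.
            // 3. Very naive stemmer stand-in: strip a trailing "s".
            $out[] = preg_replace('/s$/', '', $word);
        }
        return $out;
    }
    ```

    For example, `prepare_query('litttle engines', ...)` auto-corrects the first word and stems the second before the database search runs.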

  • #4
    Senior Coder
    Join Date
    Jul 2005
    Location
    UK
    Posts
    1,051
    Thanks
    6
    Thanked 13 Times in 13 Posts
    That looks interesting... thanks.

    The way I'm reading it, it looks like I could replace the similar_text() part of the script with something built around metaphone(), but still carry on with the "learning" part of things as planned.

    Or would you anticipate that simply using metaphone() for all searches would actually be quicker and less server intensive than building a database with matched misspellings?

    My thinking is that by doing that, PHP will usually only have to consult 1 or 2, or occasionally 3, fields before finding a match, without having to resort to using metaphone() at all.

    What are your thoughts?


    EDIT: In reply to printf.

    I need to take a little time to swallow what you just said... but from my first reading it looks like you more or less have the same system as the one I'm theorising, perhaps with the different parts executed in a slightly different order. Do you use similar_text or soundex/metaphone for the spellchecking?
    Last edited by Pennimus; 06-04-2007 at 03:28 PM.

  • #5
    New Coder
    Join Date
    Sep 2006
    Posts
    51
    Thanks
    0
    Thanked 0 Times in 0 Posts
    No, I don't use any of the PHP linguistic functions, because they don't really work the way most misspelled words are written. Most of the functions use typographical-error reasoning, when phonetic similarity should also be part of the mix.

    Look at this example, it's pretty bad, but it's found 2016 times in 214,324 searches.

    Code:
    avhe
    The (3) nearest matches are (ave, ache, av he), but if you include phonetically similar reasoning you get the exact match (have), another similar one...

    found 862 times in 103,873 searches.

    Code:
    aveh
    The (3) nearest matches are (ave, aves, aver), but if you include phonetically similar reasoning you get the exact match (have)

    These are just simple examples, but they can screw a search result up big time, because they will match what the user really doesn't want, or fail to match what the user does want. Obviously you wouldn't be searching for (have), it's just an example, but less common words cause the same problems and ruin searches; the search engine can fix these if good reasoning is learned from each search. I track all search results (just the misspelled words), so I get an idea of which misspelled words appear and how often, and I know when they need to be added to the auto-change list so the searches return better matching relevance.
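
    The frequency tracking described above could be sketched like this: count each misspelling/correction pair, and promote the pair to the auto-change list once it crosses a threshold. The function name, the counting scheme, and the threshold of 3 are my invention for illustration:

    ```php
    <?php
    // Sketch: count how often each misspelling => correction pair is seen,
    // and promote it to the auto-change list past a threshold (arbitrary).

    const PROMOTE_AT = 3;

    function record_misspelling(string $bad, string $good, array &$counts, array &$autoChange): void {
        $key = $bad . '=>' . $good;
        $counts[$key] = ($counts[$key] ?? 0) + 1;
        if ($counts[$key] >= PROMOTE_AT) {
            // Seen often enough: auto-correct it silently from now on.
            $autoChange[$bad] = $good;
        }
    }

    $counts = [];
    $autoChange = [];
    // e.g. record_misspelling('avhe', 'have', $counts, $autoChange);
    ```

    In a real system the two arrays would be database tables, so the counts survive between requests.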

    When I go back to the office on Wednesday I'll get you a copy of my search / spell check class; both are learning engines that work really well! I don't have an example search engine up at the moment so I can't show how it works, but you can try the spell checker's auto-change feature. In a spell-checker environment, you use the auto-change feature so the user doesn't have to waste time fixing common mistakes that the spell checker can correct automatically, because it has learned the mistake from seeing it many times before!

    // place this in the box...

    Code:
    audeince
    audiance
    availalbe
    aveh
    avhe
    awya
    aywa
    bakc
    balence
    ballance
    baout
    barin
    bcak
    beacuse
    becasue
    becaus
    lerans
    levle
    libary
    lible
    librery
    lief
    lieing
    liekd
    liesure
    lieutenent
    liev
    likly
    lisense
    litature
    literture
    littel
    litttle
    liuke
    tjhe
    tje
    http://ws.ya-right.com/spell.php

  • #6
    Senior Coder
    Join Date
    Jul 2005
    Location
    UK
    Posts
    1,051
    Thanks
    6
    Thanked 13 Times in 13 Posts
    Yes, this is the gist of what I wanted to achieve, except without having to add the misspellings to the auto-change field (or whatever you want to call it) manually. In my proposed system it would be handled automatically (or maybe you also do it automatically and it wasn't clear to me), incorporating user feedback from when you serve a "did you mean xyz?" result. This is important because it means the whole system scales very well.

    But it does mean I need to decide on a method for deciphering the misspellings in the first place: similar_text() or metaphone(). Ultimately I suppose only testing can determine which one will be more accurate for my particular database.
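
    One cheap way to run that test: take real misspellings from the search logs, score each against its intended word with every candidate metric, and see which one separates right answers from wrong ones. The word pairs below are illustrative (pulled from the list earlier in the thread):

    ```php
    <?php
    // Compare candidate metrics on known typo => target pairs.
    $pairs = [
        ['litttle', 'little'],
        ['becasue', 'because'],
        ['libary',  'library'],
    ];

    foreach ($pairs as [$typo, $target]) {
        similar_text($typo, $target, $pct);
        printf(
            "%-8s vs %-8s  similar_text: %5.1f%%  levenshtein: %d  metaphone match: %s\n",
            $typo, $target, $pct,
            levenshtein($typo, $target),
            metaphone($typo) === metaphone($target) ? 'yes' : 'no'
        );
    }
    ```

    levenshtein() is worth including in the comparison: like similar_text() it compares two strings directly, but it is considerably cheaper per call.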

    Anyway, I look forward to seeing your script.

  • #7
    Banned
    Join Date
    Apr 2007
    Posts
    428
    Thanks
    29
    Thanked 5 Times in 5 Posts
    If you need help with the idea of making a search engine, there are a few thoughts I had in mind, and I would like to join whatever community you've got, or help you start a new one when it comes to search engines.

    Most people here who read about the future of the web probably realise that it's all going to be in databases. So the primary thing when building a search engine is having a massive database of information.

    2nd, and the most obvious thing, is building a smart search engine that will correct misspelled words. Maybe it would also be good to run a similar engine when storing information in the database, because the people who misspell are not only the people who search, but also the people who write websites.

    3rd, and probably the only thing that would set this "new" search engine apart from existing ones, is the user approach: website reviews and ratings. When you compare the biggest search engines now, you can see that their logic is fast, but not all that correct.

    Most of the engines rate sites based on the number of visitors, and that can be manipulated with massive advertising and other ways of promoting websites that don't have the needed quality. A new search engine should be smart, fast, and user friendly. It could never support as many searches as the best engines today, but the user side would give the engine some benefits, because users would personally rate sites and would work for the engine. Like one big happy family.

  • #8
    Senior Coder
    Join Date
    Jan 2007
    Posts
    1,648
    Thanks
    1
    Thanked 58 Times in 54 Posts
    Most of the engines rate sites based on the number of visitors
    I don't know of any big search engine that does this.

    What will be the next evolution in search engines has already been discovered. So there is little value in discussing what it is. The problem is getting there.

  • #9
    Banned
    Join Date
    Apr 2007
    Posts
    428
    Thanks
    29
    Thanked 5 Times in 5 Posts
    Quote Originally Posted by aedrin View Post
    What will be the next evolution in search engines has already been discovered
    Can you share that information? I tend to miss the obvious things.

    Do you mean it's going to be something like that OpenCyc thing?

  • #10
    Senior Coder
    Join Date
    Jan 2007
    Posts
    1,648
    Thanks
    1
    Thanked 58 Times in 54 Posts
    I can't think of the exact buzzword they use at the moment, but the basic thought is this.

    Currently, HTML has little meaning. Something can be marked as <p>, or <h1>, but you can assume nothing from this. It could be a paragraph, or a header, but it could also be used for many different things. It's the concept of 'markup' versus 'meaning'. <b> denotes that something is bold. <strong> denotes that something has a stronger meaning.

    If search engines could read this information correctly, search results would be a lot more valid. And this is what the next evolution would be. A search engine that understands the content of websites.

    Granted, search engines try to implement this today, but until HTML gets updated it is still guesswork. And even then, all existing websites would need to be updated.

  • #11
    Banned
    Join Date
    Apr 2007
    Posts
    428
    Thanks
    29
    Thanked 5 Times in 5 Posts
    The idea comes to my mind riiight, oh here it is: a search engine that reads CSS and flags pages with excessive design code. Going way off here..

    Why did people make it so complex? You have h1 for the main heading and so forth down to h6, p for paragraph, b (strong), etc.

    Hm, maybe it would be a really good idea to look into that CSS/tag denoting thingy.

    E.g. once I searched for something about ADSL - what it's about - and among the results found a quite simple website by a Finnish professor, in which she wrote a WOW article explaining ADSL technology. If I had found that maybe 10 years ago, I would be the main ADSL provider in my country today.

    Design killed the web!

  • #12
    Senior Coder
    Join Date
    Jan 2007
    Posts
    1,648
    Thanks
    1
    Thanked 58 Times in 54 Posts
    Correct.

    While all these graphics can entertain you, the real purpose of the internet was to share information and make it available to everyone.

    Few people use it for this purpose nowadays, and won't even consider a clean website that shows information in its most efficient state: a state without any design elements (basic markup is required, of course).

  • #13
    UE Antagonizer Fumigator's Avatar
    Join Date
    Dec 2005
    Location
    Utah, USA, Northwestern hemisphere, Earth, Solar System, Milky Way Galaxy, Alpha Quadrant
    Posts
    7,691
    Thanks
    42
    Thanked 637 Times in 625 Posts
    The term is "semantic HTML" I believe... (google it)

  • #14
    Banned
    Join Date
    Apr 2007
    Posts
    428
    Thanks
    29
    Thanked 5 Times in 5 Posts
    Do you guys think that, with XML, the option to hide the source, and HTML, the whole web is going to be fully commercial?
    I believe the internet has gone in a bad direction, from exchanging information to selling information.

  • #15
    Senior Coder
    Join Date
    Jan 2007
    Posts
    1,648
    Thanks
    1
    Thanked 58 Times in 54 Posts
    Quote Originally Posted by Fumigator View Post
    The term is "semantic HTML" I believe... (google it)
    Ah, yeah.

    I'm not good with buzzwords.

    Do you guys think that, with XML, the option to hide the source, and HTML, the whole web is going to be fully commercial?
    Such is the fate of most successful free services. Either the owner is unable to operate because of costs, or they are tempted by money. Either way, at one point or another the service will sell out and become part of a large corporation. It is slowed by the fact that the internet is technically not owned by anyone. However, it is 'controlled' by all ISPs.

