  1. #1
    Senior Coder Spudhead's Avatar
    Join Date
    Jun 2002
    Location
    London, UK
    Posts
    1,856
    Thanks
    8
    Thanked 110 Times in 109 Posts

    Character encoding, cleaning CMS input etc.

    I just don't understand this character encoding thing. Never have. ANSI, ASCII, UTF, Unicode... it may as well be Greek...

    So: my simple CMS lets admin users type content into a textarea. Before it goes into the database, I take care of any dodgy chars (e.g. single quotes) with escape(). When it comes out, I replace "%0D%0A" with a couple of <br/> tags, then unescape() it and dump it all on the page.

    Generally, that's fine. However - one client uses a Mac to update her site. It's not causing me any problems as such, although the double quotes look a bit...odd. But she's saying that some chars are getting replaced with those "I don't know what char this is supposed to be" question mark symbols.

    To clarify (hopefully - I hope the forum software doesn't do exactly what I'm trying to and fixes the dodgy char):

    - Client pastes a “ into textarea.
    - I escape() it. Apparently <%=escape("“")%> returns %E2%u20AC%u0153.
    - <%=asc("“")%> returns 226
    - I try to fix it with output= replace(output,"“", "&ldquo;") - which does, it seems, nothing.

    So... can anyone explain to me, preferably in words of two syllables or less, what the nuts is going on and how to fix it? Is it character encodings? Is it locale IDs? Is it ANSI or Unicode? What is it?

    How the chuff do I find these things and replace them with something... standard?

  • #2
    Regular Coder
    Join Date
    Mar 2007
    Posts
    505
    Thanks
    1
    Thanked 19 Times in 19 Posts
    Welcome to MS Word as an HTML editor...

    MS Word and Mac Word insert special Unicode characters (curly quotes, for example) in place of the plain ones you type, and those produce the effect that you are experiencing.

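    How do you fix it? Use UTF-8 everywhere. Windows servers do not default to it: classic ASP works in the system ANSI code page (usually Windows-1252) unless you set the code page to 65001 (UTF-8).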

    Set that encoding on your form page as well.
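
    To see where the %E2%u20AC%u0153 and the 226 in the first post come from, here is a small sketch in Python (used instead of VBScript only so it can be run anywhere). It assumes the browser sent the curly quote as UTF-8 and the server read those bytes as Windows-1252; classic_escape is a hypothetical stand-in for escape(), not the real function.

```python
# What the browser sent: the left curly quote U+201C as UTF-8 bytes.
curly = "\u201c"
utf8_bytes = curly.encode("utf-8")        # three bytes: E2 80 9C

# What the server saw, assuming it read those bytes as Windows-1252:
# three separate characters instead of one.
misread = utf8_bytes.decode("cp1252")


def classic_escape(text):
    """Rough stand-in for escape(): %XX below 256, %uXXXX above."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch.isascii() and (ch.isalnum() or ch in "@*_+-./"):
            out.append(ch)                      # left as-is
        elif cp < 256:
            out.append("%{:02X}".format(cp))
        else:
            out.append("%u{:04X}".format(cp))
    return "".join(out)


print(classic_escape(misread))   # %E2%u20AC%u0153
print(ord(misread[0]))           # 226, the same value asc() reported

# The damage is reversible as long as nothing was lost on the way.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == curly)         # True
```

    If that assumption holds, the replace() in the first post does nothing because the string no longer contains one curly quote; it contains three unrelated characters.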

    To see the characters that they are using, go to START > RUN > charmap (or, Start > All Programs > Accessories > Character Map)

    Font: Times New Roman

    The first character to look at is double quotes, first line, second character in.

    Now use the GO TO UNICODE Box: Type in 02DD, 201C, 201D, and 2033.

    This will show you all the different types of double quotes (although not all are named 'double quotes').
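
    If Character Map is not to hand, the same four code points can be looked up in a few lines of Python. This is only a sketch using the standard unicodedata module; the variable names are arbitrary.

```python
import unicodedata

# The four code points suggested for the GO TO UNICODE box.
codes = ("02DD", "201C", "201D", "2033")

# Map each hex code to the official Unicode name of its character.
names = {code: unicodedata.name(chr(int(code, 16))) for code in codes}

for code in codes:
    print("U+" + code, chr(int(code, 16)), names[code])
```

    Only U+201C and U+201D are the left and right curly quotes that Word inserts; the other two are an accent and a prime mark that merely look similar.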
    To say my fate is not tied to your fate is like saying, 'Your end of the boat is sinking.' -- Hugh Downs
    Please, if you found my post helpful, pay it forward. Go and help someone else today.

  • #3
    Senior Coder Spudhead's Avatar
    Join Date
    Jun 2002
    Location
    London, UK
    Posts
    1,856
    Thanks
    8
    Thanked 110 Times in 109 Posts
    It's getting a little clearer, thanks

    So... what you're saying is that I need to take the user input and UTF-8 encode it?

    The web seems awash with UTF-8 encoding functions: here are some I found at CodeToad:

    Code:
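    <%
    ' Decodes two-byte UTF-8 sequences (U+0080 to U+00FF) back to single characters.
    ' Anything longer, such as the three bytes of a curly quote (U+201C),
    ' is replaced with chr(191), the inverted question mark.
    ' ByVal keeps the caller's string untouched (VBScript passes ByRef by default).
    function DecodeUTF8(ByVal s)
      dim i
      dim c
      dim n
      i = 1
      do while i <= len(s)
        c = asc(mid(s,i,1))
        if c and &H80 then
          ' Count the lead byte plus the continuation bytes (10xxxxxx) that follow it.
          n = 1
          do while i + n <= len(s)   ' <= so a sequence at the very end is still counted
            if (asc(mid(s,i+n,1)) and &HC0) <> &H80 then
              exit do
            end if
            n = n + 1
          loop
          ' Only lead bytes C2 and C3 map to a character in the 128 to 255 range.
          if n = 2 and ((c and &HFE) = &HC2) then
            c = asc(mid(s,i+1,1)) + &H40 * (c and &H01)
          else
            c = 191
          end if
          s = left(s,i-1) + chr(c) + mid(s,i+n)
        end if
        i = i + 1
      loop
      DecodeUTF8 = s
    end function


    ' Encodes characters 128 to 255 as two UTF-8 bytes (lead byte C2 or C3).
    ' Characters outside that range are not handled.
    function EncodeUTF8(ByVal s)
      dim i
      dim c
      i = 1
      do while i <= len(s)
        c = asc(mid(s,i,1))
        if c >= &H80 then
          s = left(s,i-1) + chr(&HC2 + ((c and &H40) \ &H40)) + chr(c and &HBF) + mid(s,i+1)
          i = i + 1   ' skip the continuation byte just inserted
        end if
        i = i + 1
      loop
      EncodeUTF8 = s
    end function
    %>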
    That look about right to you? If so... integrating this into my current code would be something like:

    - take user input
    - UTF-8 encode
    - escape()
    - drop into database

    ... and exactly the same in reverse for displaying on a page?
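
    For comparison, here is roughly what that round trip looks like when the encoding is UTF-8 at every step. It is a Python sketch, not ASP: to_storage and to_page are hypothetical names, and quote()/unquote() from urllib.parse stand in for escape()/unescape().

```python
from urllib.parse import quote, unquote


def to_storage(text):
    """UTF-8 encode, then percent-escape everything that is not a plain
    letter, digit or one of _ . - ~ (so single quotes are escaped too)."""
    return quote(text, safe="", encoding="utf-8")


def to_page(stored):
    """Reverse of to_storage, turning each stored CRLF into two <br/> tags."""
    with_breaks = stored.replace("%0D%0A", "<br/><br/>")
    return unquote(with_breaks, encoding="utf-8")


stored = to_storage("\u201cHi\u201d it's me\r\n")
print(stored)            # %E2%80%9CHi%E2%80%9D%20it%27s%20me%0D%0A
print(to_page(stored))   # “Hi” it's me<br/><br/>
```

    Note that the curly quote is stored as three percent-escaped bytes (%E2%80%9C). A two-byte encoder limited to characters 128 to 255 cannot produce that.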

    God knows why I've never come up against this one before...


    ps. Just to clarify, all pages (admin forms and front-end display) have the following:
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    Last edited by Spudhead; 08-20-2007 at 05:05 PM. Reason: clarifimification

  • #4
    Regular Coder
    Join Date
    Mar 2007
    Posts
    505
    Thanks
    1
    Thanked 19 Times in 19 Posts
    You have probably not come up against this before because TEXTAREAs are not the same as WYSIWYG editors.

    If the client/user is using XML schemas at all, like in Office 2000 and above, WYSIWYG editors use said XML schemas and they can screw up your input. COPY AND PASTE is a blessing and a curse.

    XML schemas, unless specified otherwise, are UNICODE. Textareas submit their content in the encoding of the page that holds the form (i.e., UTF-8 or whatever you tell IIS to use).

    Happened to me the first time I created one, and I haven't looked back since.

    Your code looks right, but you may be able to use the built-in Server.HTMLEncode method to do the work for you.

    You might want to try that, but I cannot guarantee that will work.
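
    The general idea behind that suggestion, turning awkward characters into HTML entities so the page encoding no longer matters, can be sketched in Python as below. This is an analogue of the approach, not a description of what Server.HTMLEncode does on any given server; to_ascii_html is a hypothetical name.

```python
import html


def to_ascii_html(text):
    """Escape &, < and >, then turn every non-ASCII character into a
    numeric character reference such as &#8220;."""
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_html("\u201cFish & chips\u201d"))
# &#8220;Fish &amp; chips&#8221;
```

    The result is pure ASCII, so it displays the same whatever code page the server or database happens to use. It only helps if the text was decoded correctly before this step.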

  • #5
    Regular Coder
    Join Date
    Mar 2007
    Posts
    505
    Thanks
    1
    Thanked 19 Times in 19 Posts

  • #6
    Senior Coder Spudhead's Avatar
    Join Date
    Jun 2002
    Location
    London, UK
    Posts
    1,856
    Thanks
    8
    Thanked 110 Times in 109 Posts
    Ok, thanks for the info. Will look into altering the Codepage. Have taken interim measure of emailing client with "stop pasting stuff out of Word, it's screwing everything up".


