  1. #1
    New to the CF scene
    Join Date
    Jun 2009
    Thanked 0 Times in 0 Posts

    pdf to xml or to string in c#

    I need to extract data from pdf files. I'm using .NET.
    I've been poring over the web to find a way to do this. This is a case where the web is working against me. Putting data into pdf is easy and there's about a gazillion people posting how to do that. That makes it really hard to find how to do the opposite: get data out of pdf.
    Ideally, I'd like to convert pdf into xml. Failing that, I'd like to read the text out of it into a string or stream.
    I'd love to do it without using a COM component or some buggy open source product (I'm not anti-open source, but we all know there's a lot of half-baked open source software out there).
    Is it possible?

  • #2
    teh Moderatorinator
    Join Date
    Sep 2004
    Thanked 40 Times in 40 Posts
    Your best bet is to find something that can spit it out into some type of format for you, and you can work from there to decipher it. I did a quick google on "c# parse pdf" and found a few examples:

    Looks like it uses some type of library to get it into text format.

  • #3
    Regular Coder
    Join Date
    Apr 2009
    Thanked 20 Times in 20 Posts
    Hey bnewman,

    I hear you on the open source stuff, since you do run a higher risk of bugs. In this case, though, I do believe that's the way to go. Look into the following components:

    activePDF Server
    PDFlib + PDI
    TallPDF.NET 3.0

    Now, some of these you actually have to pay for, but I think if you just use one of the free components (iTextSharp is free, I think), you should be fine. Just test it thoroughly, that's all.
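
    For what it's worth, here is a rough sketch of how the iTextSharp route could look. It assumes an iTextSharp 5.x build that includes the iTextSharp.text.pdf.parser namespace. The class name PdfTextSketch, the method names, and the <document>/<page> XML layout are made up for illustration and are not part of any library.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;
using iTextSharp.text.pdf;          // PdfReader (assumes iTextSharp 5.x)
using iTextSharp.text.pdf.parser;   // PdfTextExtractor

// Hypothetical helper class, named here only for this example.
public static class PdfTextSketch
{
    // Reads the text of each page into a list (index 0 holds page 1).
    public static List<string> ExtractPages(string pdfPath)
    {
        var pages = new List<string>();
        var reader = new PdfReader(pdfPath);
        try
        {
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            reader.Close();
        }
        return pages;
    }

    // Wraps the page strings in a simple, made-up XML layout:
    // <document><page number="1">...</page>...</document>
    public static XDocument PagesToXml(IEnumerable<string> pages)
    {
        var root = new XElement("document");
        int number = 1;
        foreach (string text in pages)
        {
            // XElement escapes characters such as < and & when serializing.
            root.Add(new XElement("page", new XAttribute("number", number), text));
            number++;
        }
        return new XDocument(root);
    }
}
```

    You would call it along the lines of PdfTextSketch.PagesToXml(PdfTextSketch.ExtractPages("input.pdf")).ToString(), where "input.pdf" is a placeholder path. Keep in mind this only gets you raw page text, so any real structure (fields, tables) still has to be parsed out of the strings yourself, and a scanned PDF with no text layer will come back empty. Check the license terms of whichever iTextSharp version you pick up before shipping it.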


