I have been tasked with extracting text from PDFs made up of images; that is, it is not possible to copy and paste or otherwise select the text.
There are several tools that do this, and I am in the process of testing some of them. The previous person in charge told me they were using Tesseract, and after reading its documentation I thought: this is a really complex matter, so there must be studies on it.
So I am reading things like Comparison of Text Extraction Techniques- A,
in which I can see that someone has done proper research on what works in which cases.
Now to my question: how do I go from this to actually testing it out? Are these tests and performance results ever released in program form? Is it possible to find out which technique a program uses, to see whether it's the one the research identifies as best for my particular case?
I understand that this will not always be possible, but I find zero implementations, and I have to wonder: what good is it to develop these techniques and studies if no one else can use them?
Is there another way to go about this? I obviously do not have years of experience, nor am I writing a paper or thesis on this issue, so even if I were to start from scratch, I doubt I'd reach, say, the 98% accuracy stated in some cases. I need the best tool possible, and I am told that there are indeed people who were able to do this.
So is it possible for me, at least in some cases, to recreate what they have done? And if so, how do I find it?
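For context, this is roughly how I imagine scoring any tool I try, so that figures like "98% accuracy" become comparable on my own documents. It is only a sketch under my own assumptions: accuracy is taken as one minus the character-level edit distance divided by the length of a hand-typed reference text, which may not be the metric the papers use. The names tool_a and tool_b and the sample strings are made up for illustration and are not real OCR output.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """1 - (edit distance / reference length), clamped to [0, 1].

    Assumption: this is one common way to define OCR accuracy,
    not necessarily the one used in any particular study.
    """
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Hypothetical data: a hand-typed reference and two invented outputs.
    reference = "Invoice number 12345"
    outputs = {
        "tool_a": "Invoice number 12345",
        "tool_b": "lnvoice numbcr 12345",
    }
    for name, text in outputs.items():
        print(f"{name}: {character_accuracy(reference, text):.1%}")
```

With a handful of pages transcribed by hand as reference, I could run each candidate tool on the same pages and compare the scores, even without knowing which technique each tool uses internally.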