Re: Auto-indexing
Hello
Tibor Simko <tibor.simko@xxxxxxx> wrote on 15.05.2007 18:01:03:
> Hello Benedikt:
>
> (Sorry for the late reply on this message. I think some parts were
> already addressed in your private correspondence with Jean-Yves.)
Yep, that's right. He has already described some possible solutions to my
problems.
> On Tue, 01 May 2007, benedikt.koeppel@xxxxxxxxxxxxxx wrote:
>
> > is it possible to configure CDS Invenio to auto-index files,
> > e.g. every evening?
>
> > The new files are copied to directory A and the indexer should copy
> > the files to directory AA and index them. The files from directory A
> > should be in the category CatA. The same with files from directory
> > B. Copy to BB and index in Category CatB.
>
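The copy step described above (A to AA, B to BB, each with its category) could be sketched roughly as follows. This is only an illustration under assumptions: the `ROUTES` mapping and the `sweep` function are hypothetical names, the directories are taken to live under one base directory, and a file counts as "new" when no file of the same name exists in the target directory yet. It does not call any Invenio code.

```python
import shutil
from pathlib import Path

# Hypothetical mapping: incoming directory -> (archive directory, category).
ROUTES = {
    "A": ("AA", "CatA"),
    "B": ("BB", "CatB"),
}

def sweep(base, routes=ROUTES):
    """Copy not-yet-archived files from each incoming directory.

    Returns a list of (copied_path, category) pairs, one per new file.
    """
    base = Path(base)
    copied = []
    for incoming, (archive, category) in routes.items():
        src_dir = base / incoming
        dst_dir = base / archive
        if not src_dir.is_dir():
            continue  # nothing to do if the incoming directory is missing
        dst_dir.mkdir(parents=True, exist_ok=True)
        for src in sorted(src_dir.iterdir()):
            dst = dst_dir / src.name
            # Assumption: a file is "new" if its name is not archived yet.
            if src.is_file() and not dst.exists():
                shutil.copy2(src, dst)
                copied.append((dst, category))
    return copied
```

A scheduler such as cron could run a script like this every evening; the returned pairs say which files still need a record and which category each belongs to.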
> > Another issue is, that the software which generates the files (from OCR),
> > puts the metadata in the following format (into the file):
> > METAFIELD1#VALUE1
> > METAFIELD2#VALUE2
> > METAFIELDn#VALUEn
> > ...normal text/content...
>
> The best would be to write a script that transforms your input into
> the format:
>
> <record>
> <datafield tag="MF1" ind1=" " ind2=" ">
> <subfield code="a">VALUE1</subfield>
> </datafield>
> <datafield tag="MF2" ind1=" " ind2=" ">
> <subfield code="a">VALUE2</subfield>
> </datafield>
> [...]
> <datafield tag="FFT" ind1=" " ind2=" ">
> <subfield code="a">file:///tmp/file1.txt</subfield>
> </datafield>
> <datafield tag="980" ind1=" " ind2=" ">
> <subfield code="a">CatA</subfield>
> </datafield>
> </record>
>
> and then submit this to bibupload, which will then take care of
> downloading the file from the location specified in the FFT tag and of
> putting it into an appropriate place inside the Invenio file storage
> system.
I saw that BibConvert does something like that - would it be possible to
use BibConvert for my problem?
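Independently of whether BibConvert fits, the kind of script suggested above might look like the sketch below. It rests on several assumptions: the metadata header is taken to be the leading run of `NAME#VALUE` lines, ending at the first line that does not match; `tag_map` is a hypothetical mapping from metafield names to the placeholder tags (`MF1`, `MF2`) used in the quoted example; and the function names are made up for illustration. It only builds the XML, leaving the submission to bibupload aside.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a header line is NAME#VALUE, with NAME made of word characters.
HEADER_RE = re.compile(r"^(\w+)#(.*)$")

def split_header(text):
    """Split the leading NAME#VALUE lines from the remaining body text."""
    lines = text.splitlines()
    fields = []
    end = len(lines)  # used when every line is a header line
    for i, line in enumerate(lines):
        m = HEADER_RE.match(line)
        if not m:
            end = i  # first non-header line starts the body
            break
        fields.append((m.group(1), m.group(2).strip()))
    return fields, "\n".join(lines[end:])

def add_datafield(record, tag, value):
    """Append <datafield tag=...><subfield code="a">value</subfield></datafield>."""
    df = ET.SubElement(record, "datafield", {"tag": tag, "ind1": " ", "ind2": " "})
    ET.SubElement(df, "subfield", {"code": "a"}).text = value

def to_marcxml(text, file_url, category, tag_map):
    """Build one <record> from the header fields, file URL and category."""
    fields, _body = split_header(text)
    record = ET.Element("record")
    for name, value in fields:
        if name in tag_map:  # unmapped metafields are skipped
            add_datafield(record, tag_map[name], value)
    add_datafield(record, "FFT", file_url)
    add_datafield(record, "980", category)
    return ET.tostring(record, encoding="unicode")
```

Building the record with an XML library rather than string concatenation means that characters such as `&` or `<` in OCR output are escaped automatically.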
> > How difficult would it be to integrate a spell check in the indexer
> > (or is already one integrated)? I have OCR-documents where some
> > characters are some times misspelled; e.g. '...rn...' is in the OCR
> > as '...m...' and stuff like that.
>
> There is no integrated spelling corrector (yet). Three remarks:
>
> First, I think it would be better to run it before uploading, not only
> during indexing, so that other Invenio modules (displayer, keyworder)
> can take advantage of the correct spelling too.
Yes, that's right. But with automatic spell checking and its (sometimes)
incorrect corrections, the document becomes more difficult to understand
than if the words were not corrected at all.
With the example "ornament": In a sentence with "ormament" instead of
"ornament", the reader still understands the meaning - but the same
sentence with "armament" is possibly not understandable any more ;-).
So the spell check should not correct the file itself, but only the words
which go into the index.
> Second, have you tried some existing OCR spelling software?
There are no plans to replace the existing OCR software. It would be nice
just to integrate an OCR spell checker between the OCR step and Invenio.
> Third, such a spelling corrector would most probably necessitate human
> assistance, because automatic replacements can go wild. For a very
> simple example, let us assume that we scanned the word "ornament" that
> OCR recognized as "omament"; now if we check aspell's suggestions:
>
> $ echo omament | aspell -a
> & omament 12 0: armament, moment, ornament, immanent, momenta, [...]
>
> and simply take the first one, we will end up with "armament" instead
> of "ornament". ;-)
Yes of course, automatic spell checking is not perfect, but manual
corrections would take too much time.
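To make the "first suggestion" problem concrete, here is a small sketch that parses one result line of the `aspell -a` output quoted above. It assumes the usual pipe-mode layout (`& word count offset: suggestion, suggestion, ...` for a misspelling with suggestions, `# word offset` for one without); `parse_aspell_line` is a hypothetical helper name, and the sample line is the quoted one with the elided tail left out.

```python
def parse_aspell_line(line):
    """Parse one result line of `aspell -a` output.

    Returns (word, suggestions) for a reported misspelling, with an
    empty list when aspell offers no suggestions, and None for any
    other line (correct words, banner, blank lines).
    """
    line = line.strip()
    if line.startswith("& "):
        head, _, tail = line.partition(": ")
        word = head.split()[1]
        suggestions = [s.strip() for s in tail.split(",") if s.strip()]
        return word, suggestions
    if line.startswith("# "):
        return line.split()[1], []
    return None

sample = "& omament 12 0: armament, moment, ornament, immanent, momenta"
word, suggestions = parse_aspell_line(sample)
# Blindly taking the first suggestion picks the wrong word here.
naive_choice = suggestions[0]
```

With the sample line, `naive_choice` is "armament", although "ornament" is also in the list, which is exactly the failure described above.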
I think the best solution would be:
1. The file remains with all spelling mistakes.
2. The words are indexed both with their spelling mistakes _and_ with an
automatic correction (from aspell, for example). The corrected words could go
into a separate field, so that it is possible to choose not to search
within corrected words.
That's not 100% correct then, but I think the results could be better with
automatic spell check than without.
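The two-field idea can be illustrated with a toy inverted index. This is not how the Invenio indexer works; it is a sketch under assumptions: `build_indexes` and `search` are hypothetical names, and `correct` stands for any function that maps a word to its automatic correction (or to None when it has none).

```python
import re
from collections import defaultdict

def build_indexes(docs, correct):
    """Index every word as found, plus its correction in a separate index.

    docs: {doc_id: text}; correct: callable word -> corrected word or None.
    """
    original = defaultdict(set)
    corrected = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            original[word].add(doc_id)  # the text itself stays untouched
            fixed = correct(word)
            if fixed and fixed != word:
                corrected[fixed].add(doc_id)
    return original, corrected

def search(word, original, corrected, use_corrected=True):
    """Look a word up, optionally including the corrected-word index."""
    hits = set(original.get(word, ()))
    if use_corrected:
        hits |= corrected.get(word, set())
    return hits

# Toy data; the one-entry correction table is made up for the example.
docs = {1: "an omament on the wall", 2: "an ornament"}
original, corrected = build_indexes(docs, {"omament": "ornament"}.get)
```

Here a search for "ornament" finds both documents when the corrected index is included and only document 2 when it is switched off, while document 1 still shows "omament" to the reader.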
It would probably be useful to use two or three different spell checkers
that work differently. Each would give me a list of suggestions like "armament,
moment, ornament, immanent, momenta", and I could keep only the words that
both/all spell checkers return.
Do you know of any spell checkers other than aspell and ispell?
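Combining several spell checkers as described could be as simple as intersecting their suggestion lists. In the sketch below, `agreed_suggestions` is a hypothetical helper; the first list is the aspell output quoted earlier in the thread, and the second list is invented purely for illustration and does not come from a real tool.

```python
def agreed_suggestions(suggestion_lists):
    """Keep the suggestions that every checker proposed.

    The result preserves the order of the first checker's list.
    """
    if not suggestion_lists:
        return []
    common = set(suggestion_lists[0])
    for other in suggestion_lists[1:]:
        common &= set(other)
    return [s for s in suggestion_lists[0] if s in common]

aspell_list = ["armament", "moment", "ornament", "immanent", "momenta"]
# Invented second list, standing in for another checker's output.
other_list = ["ornament", "moment", "comment"]
agreed = agreed_suggestions([aspell_list, other_list])
```

Intersecting narrows the candidates but need not leave exactly one word, so some rule for choosing among the survivors (or for indexing all of them) would still be needed.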
Best regards,
Benedikt Köppel