The Daily WTF: Curious Perversions in Information Technology

An algorithm for index translation

Last post 07-05-2010 11:35 AM by alois.cochard. 4 replies.
Page 1 of 1 (5 items)
  • 06-30-2010 1:57 PM

    An algorithm for index translation

    Here is a little code challenge! I'm working on a text-mining/semantic web application, focused (for the moment) on biomedical information and developed in Java. We use external tools for text-mining analysis, and unfortunately these tools don't handle HTML very well... if we send raw HTML to the text-mining service, it simply breaks. So we must convert the HTML to plain text before processing it, and because the tools identify words by returning their positions, we must translate those positions (or indexes) back to find the corresponding words in the original HTML. I created a simple implementation and posted it on gist.github.com... I'm sure you can make it better ;)

    Here is the full blog entry: http://aloiscochard.blogspot.com/2010/06/bring-your-code-algorithm-for-index.html

    Thanks

    Alois Cochard
    http://aloiscochard.blogspot.com
    http://www.twitter.com/aloiscochard
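    The translation described above can be sketched as follows. This is a minimal hypothetical version, not the code from the gist: while stripping tags, record for each kept character its index in the original HTML, so that translating a position reported by the mining tool becomes a simple array lookup. (The tag handling is deliberately naive; it ignores comments and `>` inside attribute values.)

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class IndexTranslator {
        private final String plainText;
        private final int[] htmlIndex; // plain-text index -> index in original HTML

        public IndexTranslator(String html) {
            StringBuilder sb = new StringBuilder();
            List<Integer> map = new ArrayList<>();
            boolean inTag = false;
            for (int i = 0; i < html.length(); i++) {
                char c = html.charAt(i);
                if (c == '<') {
                    inTag = true;
                } else if (c == '>') {
                    inTag = false;
                } else if (!inTag) {
                    // Keep the character and remember where it came from.
                    sb.append(c);
                    map.add(i);
                }
            }
            plainText = sb.toString();
            htmlIndex = map.stream().mapToInt(Integer::intValue).toArray();
        }

        public String getPlainText() { return plainText; }

        // Translate a plain-text position (as reported by the mining tool)
        // back to the corresponding position in the original HTML.
        public int toHtmlIndex(int plainIndex) { return htmlIndex[plainIndex]; }

        public static void main(String[] args) {
            IndexTranslator t = new IndexTranslator("<b>important</b> words");
            System.out.println(t.getPlainText());  // important words
            System.out.println(t.toHtmlIndex(0));  // 3 (the 'i' after <b>)
        }
    }
    ```

    For real-world HTML you'd want a proper parser (e.g. jsoup) rather than character scanning, but the mapping idea stays the same.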
  • 06-30-2010 2:19 PM In reply to

    • Xyro
    • Top 25 Contributor
    • Joined on 06-10-2007
    • Location location = new EnterpriseLocatorFactory() .newLocator(Locale.US) .getLocationFacade(this, true) .getLocation();
    • Posts 1,715

    Re: An algorithm for index translation

    Why not just keep all the useful text in the same place in the document and wipe out the HTML with spaces?  That keeps all the indexes the same while still removing the HTML.  The output file may be sparse and ugly, but that shouldn't ultimately matter.

    For example, instead of "<tag>important words</tag>", it would become "[sp][sp][sp][sp][sp]important words[sp][sp][sp][sp][sp][sp]"  (where [sp] is a space character).
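    A minimal sketch of that idea (my own illustration, with the same simplifying assumption that tags are plain `<...>` pairs): overwrite every tag character with a space, so every surviving character keeps its original index.

    ```java
    public class HtmlBlanker {
        // Replace every character that is part of a tag with a space,
        // preserving the length of the string and hence all indexes.
        public static String blankTags(String html) {
            char[] out = html.toCharArray();
            boolean inTag = false;
            for (int i = 0; i < out.length; i++) {
                if (out[i] == '<') inTag = true;
                boolean closing = (out[i] == '>');
                if (inTag) out[i] = ' ';  // blank the '>' itself, too
                if (closing) inTag = false;
            }
            return new String(out);
        }

        public static void main(String[] args) {
            // "     important words      " -- same length as the input
            System.out.println(blankTags("<tag>important words</tag>"));
        }
    }
    ```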

    #!/usr/bin/perl

    ($_,@x,@y,$r,$o)=( "Xyro's signature!", 0xe4e10e9d0a4803ea, 0x92b0b7684cda510e,
    0x9d2a504b06a54c04, 0xb04804bca984dc0c, 0xea4889b4dadb1108, 0x90534665a4c79811)
    ;%_=map{$_,1}split//;%_=map{$o++,$_}sort keys%_;@x=map{sprintf'%x',$_}@x;for(@x
    ){$r.=$_{hex$_}for(split//)}$r=~s$Xa$xa$;$_=$r;s;(^|! )(\w+);$1\u\L$2;g;print#!
  • 07-01-2010 2:35 AM In reply to

    Re: An algorithm for index translation

    That's what I tried first!

    But the fact is that HTML tags consume a lot of space; the markup can approach a 1:2 ratio in terms of size.

    The text-mining service's performance degrades dramatically when the input data is too large.
    I don't know why... analyzing whitespace shouldn't be that complicated ;)

    So it works, but the performance isn't acceptable, since the service is used on-the-fly by users rather than in batch processing.

    Anyway, thanks for taking the time to find a solution!

    Cheers,

    Alois Cochard
    http://aloiscochard.blogspot.com
    http://www.twitter.com/aloiscochard
  • 07-01-2010 7:12 AM In reply to

    • Xyro
    • Top 25 Contributor
    • Joined on 06-10-2007
    • Location location = new EnterpriseLocatorFactory() .newLocator(Locale.US) .getLocationFacade(this, true) .getLocation();
    • Posts 1,715

    Re: An algorithm for index translation

    This is a very questionable mining service that slows down on whitespace and chokes on HTML.  Not to change the scope of the project, but I'm curious: are you able to change how the mining works?  Or replace it altogether?  How does it function?  (Or how is it supposed to?)  Is it a third-party package, and if so, do you mind sharing the name?  It sounds pretty awful.

    Why does it die on HTML anyway?  If it can handle plaintext, why can't it handle plaintext with lots of < and > characters?  Is it just too many repeating words?

    Also, does it mine for individual keywords, or whole phrases, or what?  There might be ways you can cheat and hand it a list of pre-processed mining results to work with; then, when the service says that a certain document has the matching words, you can re-index (i.e., search) the original document with custom code rather than looking up the index keys.  I dunno.  Searching text isn't exactly tricky.  But then, neither is skipping over whitespace or not dying on HTML.

    It's just that a large translation mapping between input and output fields seems so kludgey...  There's got to be a better way.
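    The "re-index" cheat could look something like this hypothetical sketch of mine: take a word the service matched and search for it in the original HTML, skipping any hit that falls inside a tag (again naive about comments and attribute values).

    ```java
    public class HtmlSearcher {
        // Find the first occurrence of `word` in `html` that lies outside
        // any tag; returns -1 if there is none.
        public static int findOutsideTags(String html, String word) {
            int from = 0;
            while (true) {
                int hit = html.indexOf(word, from);
                if (hit < 0) return -1;
                // A hit is inside a tag if the nearest '<' before it
                // has not yet been closed by a '>'.
                int open = html.lastIndexOf('<', hit);
                int close = html.lastIndexOf('>', hit);
                if (open <= close) return hit;  // not inside a tag
                from = hit + 1;                 // skip this hit, keep looking
            }
        }

        public static void main(String[] args) {
            // Skips the match inside the class attribute, returns 19.
            System.out.println(
                findOutsideTags("<div class=\"words\">words</div>", "words"));
        }
    }
    ```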

  • 07-05-2010 11:35 AM In reply to

    Re: An algorithm for index translation

    Yeah, very questionable, but the product was not chosen by me, and I don't have ANY control over that commercial 'black-box' style product.
    Unfortunately, I can't tell you the name of the product due to a really 'hardcore' policy at the company where I'm working now...

    We can replace it with another text-mining product, and we have done so. But the other product's performance isn't the same...

    The final solution will be to implement the text-mining algorithm ourselves, but for now we still use external tools, for lack of time to create our own implementation.

    To answer your question, it mines whole text; it recognizes paragraphs, and the mining results aren't the same if you split words/phrases, because of the loss of 'context'.
    For sure, there's got to be a better way; that's what motivated me to post this online ;)
    Thanks a lot for your help, even if, short of creating my own implementation, I don't see any other solution for now :(
    FYI, the product is a mix of C and Perl scripts... and was created in a 'scientific' way more than an 'enterprise' way... if you see what I mean?

    Alois Cochard
    http://aloiscochard.blogspot.com
    http://www.twitter.com/aloiscochard