An algorithm for index translation
Xyro


- Joined on 06-10-2007
- Location location = new EnterpriseLocatorFactory() .newLocator(Locale.US) .getLocationFacade(this, true) .getLocation();
- Posts 1,715
|
Re: An algorithm for index translation
Why not just keep all the useful text in the same place in the document and wipe out the HTML with spaces? That will keep all the indexes the same but still remove the HTML. The output file may be sparse and ugly, but it shouldn't ultimately matter. For example, instead of "<tag>important words</tag>", it would become "[sp][sp][sp][sp][sp]important words[sp][sp][sp][sp][sp][sp]" (where [sp] is a space character).
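A minimal sketch of that blanking idea in Python, assuming a naive regex is good enough to find the tags (it will misfire on a '>' inside an attribute value, on comments, and on script blocks). The names `TAG_RE` and `blank_out_tags` are made up for illustration:

```python
import re

# Naive tag matcher (an assumption): anything from '<' to the next '>'.
TAG_RE = re.compile(r"<[^>]*>")

def blank_out_tags(html: str) -> str:
    """Replace each tag with the same number of spaces so offsets do not move."""
    return TAG_RE.sub(lambda m: " " * len(m.group(0)), html)

source = "<tag>important words</tag>"
blanked = blank_out_tags(source)

# The blanked text has the same length, so an index found in it
# points at the same character in the original document.
start = blanked.index("important")
print(repr(blanked))
print(source[start:start + len("important words")])
```

Since every tag is replaced by a run of spaces of the same length, no translation table is needed at all; the cost is that the input stays just as large as the original HTML.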
#!/usr/bin/perl
($_,@x,@y,$r,$o)=( "Xyro's signature!", 0xe4e10e9d0a4803ea, 0x92b0b7684cda510e, 0x9d2a504b06a54c04, 0xb04804bca984dc0c, 0xea4889b4dadb1108, 0x90534665a4c79811) ;%_=map{$_,1}split//;%_=map{$o++,$_}sort keys%_;@x=map{sprintf'%x',$_}@x;for(@x ){$r.=$_{hex$_}for(split//)}$r=~s$Xa$xa$;$_=$r;s;(^|! )(\w+);$1\u\L$2;g;print#!
|
|
alois.cochard


- Joined on 06-30-2010
- Posts 3
|
Re: An algorithm for index translation
That's what I did first!
But the fact is that HTML tags take up a lot of data; you can approach a 1:2 ratio in terms of size.
The text-mining service's performance slows down dramatically if the input data is too large.
I don't know why... analyzing whitespace shouldn't be that complicated ;)
So it works, but the performance isn't acceptable, since this service is used 'on-the-fly' by users and not in batch processing.
Anyway, thanks for taking the time to find a solution!
Cheers,
Alois Cochard
http://aloiscochard.blogspot.com
http://www.twitter.com/aloiscochard
|
|
Xyro


|
Re: An algorithm for index translation
This is a very questionable mining service that slows down on whitespace and chokes on HTML. Not to change the scope of the project, but I'm curious: are you able to change how the mining works? Or replace it altogether? How does it function (or how is it supposed to function)? Is it a third-party package, and if so, do you mind sharing the name? It sounds pretty awful. Why does it die on HTML anyway? If it can handle plaintext, why can't it handle plaintext with lots of < and > characters? Is it just too many repeating words?
Also, does it mine for individual keywords, or whole phrases, or what? There might be ways you can cheat and hand it a list of pre-processed mining results for it to work with, then when the service says that a certain document has the matching words, you can then re-index (i.e., search) the original document with custom code rather than looking up the index keys. I dunno. Searching text isn't exactly tricky. But then, neither is skipping over whitespace or not dying on HTML.
It's just that a large translation mapping between input and output fields seems so klugey... There's got to be a better way.
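One way the mapping could stay small, sketched in Python under assumptions: strip the tags, but record only one (stripped offset, original offset) pair per run of kept text, then binary-search that list to translate an index back. The names `strip_with_map` and `to_html_index` are hypothetical, the tag regex is deliberately naive, and nothing here is known to match what the mining product expects:

```python
import re
from bisect import bisect_right

# Naive tag matcher (an assumption): anything from '<' to the next '>'.
TAG_RE = re.compile(r"<[^>]*>")

def strip_with_map(html):
    """Return (text, runs): the text without tags, plus one
    (text_offset, html_offset) pair per run of kept text."""
    parts, runs = [], []
    text_pos = html_pos = 0
    for m in TAG_RE.finditer(html):
        if m.start() > html_pos:              # text sits before this tag
            runs.append((text_pos, html_pos))
            parts.append(html[html_pos:m.start()])
            text_pos += m.start() - html_pos
        html_pos = m.end()                    # skip over the tag itself
    if html_pos < len(html):                  # trailing text after the last tag
        runs.append((text_pos, html_pos))
        parts.append(html[html_pos:])
    return "".join(parts), runs

def to_html_index(runs, text_index):
    """Translate an index in the stripped text back to the original HTML.
    Assumes runs is non-empty and text_index lies inside the stripped text."""
    i = bisect_right(runs, (text_index, float("inf"))) - 1
    text_start, html_start = runs[i]
    return html_start + (text_index - text_start)

html = "<p>Hello <b>big</b> world</p>"
text, runs = strip_with_map(html)
hit = text.index("world")                     # pretend the service reported this index
print(text, runs, to_html_index(runs, hit))
```

The table grows with the number of text runs rather than the number of characters, and each lookup is a binary search, so the service gets the smaller tag-free input while indexes can still be mapped back.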
|
|
alois.cochard


|
Re: An algorithm for index translation
Yeah, very questionable, but the product was not chosen by me, and I don't have ANY control over that commercial 'black-box' style product.
Unfortunately I can't tell you the name of the product due to a really 'hardcore' policy here at the company I'm working for now...
We can replace it with another text-mining product, and we have done so. But the other product's performance isn't the same...
The final solution will be to implement the text-mining algorithm ourselves, but for now we still use external tools because we lack the time to create our own implementation.
To answer your question, it mines the whole text: it recognizes paragraphs, and the mining isn't the same if you split the text into words/phrases, because of the loss of 'context'.
For sure, there's got to be a better way; that's what motivated me to post this online ;)
Thanks a lot for your help, even if, apart from creating my own implementation, I don't see any other solution for now :(
FYI, the product is a mix of C and Perl scripts... and was built in a 'scientific' way more than an 'enterprise' way... if you see what I mean?
Alois Cochard
|
|