TechHui

Hawaiʻi's Technology Community

Is anyone aware of a Java open source library for tokenizing words in Japanese sentences? In other words, I need something that will do this:

Note: If you can't see the Japanese text below, it means you don't have Japanese fonts installed on your machine.

Scanner scanner = new Scanner("私はマーケットに行きました。");

for(Word word : scanner.words()) {
    System.out.println(word.toString());
}

Output:


マーケット

行きました

I could write it, but it would be a pain in the butt. There are no spaces, and even changes in the writing system (kanji, hiragana, katakana and romaji) do not necessarily signal a word boundary. Because of this, I would have to make calls to a dictionary to determine where words end.
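
To make the dictionary idea concrete, here is a minimal sketch of greedy longest-match segmentation, the simplest dictionary-driven approach. It is not how MeCab, Sen, or any other real morphological analyzer works. The class name `LongestMatchTokenizer` and the six-entry dictionary are made up for this example; a real analyzer uses a large lexicon plus statistical scoring to choose between competing segmentations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: greedy longest-match segmentation.
final class LongestMatchTokenizer {

    // Toy dictionary invented for this example; a real lexicon has
    // hundreds of thousands of entries.
    static final Set<String> DICTIONARY =
            Set.of("私", "は", "マーケット", "に", "行きました", "。");

    static List<String> tokenize(String text, Set<String> dictionary) {
        // The longest dictionary entry bounds how far ahead we need to look.
        int maxLen = 0;
        for (String entry : dictionary) {
            maxLen = Math.max(maxLen, entry.length());
        }

        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            String match = null;
            int end = Math.min(text.length(), pos + maxLen);
            // Try the longest candidate first, then shrink one char at a time.
            for (int e = end; e > pos; e--) {
                String candidate = text.substring(pos, e);
                if (dictionary.contains(candidate)) {
                    match = candidate;
                    break;
                }
            }
            if (match == null) {
                // Unknown text: emit a single code point so we always advance.
                int len = Character.charCount(text.codePointAt(pos));
                match = text.substring(pos, pos + len);
            }
            tokens.add(match);
            pos += match.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String word : tokenize("私はマーケットに行きました。", DICTIONARY)) {
            System.out.println(word);
        }
    }
}
```

Greedy matching breaks down as soon as the longest match at one position forces a bad split later in the sentence, which is one reason a proper morphological analyzer is the better choice.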

Replies to This Discussion

It looks like the nice folks at Kyoto University have come to our rescue with MeCab. I'd prefer a simpler pure Java solution, but this library has Java bindings and a feature-rich, high-performance morphological engine.

Beat me to it, Dan...

I used MeCab extensively in my last gig back in Sapporo, where I was parsing spoken Japanese in various NHK news broadcasts, and then tokenizing the words to build up TF-IDF data for information retrieval. Very powerful. Very fast. The lack of good Java bindings didn't deter us at the time, given the speed and accuracy of this library. Now, I reckon that it would be a cinch to integrate it with Java by using Jython or JRuby.
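
For anyone wondering what the TF-IDF step looks like once the tokenizer has done its job, here is a rough sketch. It is not the code from that project. It assumes each document has already been split into tokens, and it uses one common weighting (term count divided by document length, times the natural log of total documents over documents containing the term). The class name `TfIdfSketch` and the sample token lists are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: TF-IDF weights over pre-tokenized documents.
final class TfIdfSketch {

    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts within this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.size();
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                weights.put(e.getKey(), tf * idf);
            }
            result.add(weights);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up token lists standing in for tokenizer output.
        List<List<String>> docs = List.of(
                List.of("私", "マーケット", "行きました"),
                List.of("私", "学校", "行きました"));
        System.out.println(tfIdf(docs));
    }
}
```

Terms that appear in every document get a weight of zero under this formula, so common words drop out and the distinctive ones stand out.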

And I like the name, too. めかぶ is very tasty topping off tofu, with just a bit of ponzu to dress it.

The package above is a great solution. I've mostly used the Lingua::JA Perl packages (though I know that doesn't help for Java).

Hi Daniel,

It looks like there is a Java port of MeCab named Sen here:
https://sen.dev.java.net/
http://www.nilab.info/wiki/Sen.html

Hope this helps,

Thank you Makoto! Sen is exactly what we need. I didn't think there was a robust Japanese morphological analyzer written in Java. Great find!

You are welcome!
I've never used this kind of tool, but it seems pretty interesting.
This is cute!

Sen is pronounced "chi hi ro". You must call it "chi hi ro" even if you have an important meeting with an executive.

© 2024   Created by Daniel Leuck.