WordNet Similarity for Java provides an API for several Semantic Relatedness/Similarity algorithms

WordNet Similarity for Java

WS4J provides a pure Java API for several published semantic relatedness/similarity algorithms for, in theory, any WordNet instance. You can immediately use WS4J on Princeton's English WordNet 3.0 lexical database through MIT Java WordNet Interface 2.4.0, which is the fastest Java library for interfacing with WordNet.

The codebase is mostly a Java re-implementation of the Perl module WordNet::Similarity. It uses the same data files (see src/main/resources) and includes test cases that verify the same logic. WS4J is designed to be thread-safe.

Relatedness/Similarity Algorithms

The semantic relatedness/similarity metrics available are:

  • HSO: Hirst & St-Onge, 1998 - The Hirst & St-Onge measure is based on the idea that two lexicalized concepts are semantically close if their WordNet synsets are connected by a path that is not too long and that "does not change direction too often":

HSO(s1, s2) = const_C - path_length(s1, s2) - const_k * num_of_changes_of_directions(s1, s2);

  • LCH: Leacock & Chodorow, 1998 - The Leacock & Chodorow measure relies on the length of the shortest path between two synsets for their measure of similarity:

LCH(s1, s2) = -Math.log_e(LCS(s1, s2).length / (2 * max_depth(pos)));

  • LESK: Banerjee & Pedersen, 2002 - Lesk (1985) proposed that the relatedness of two words is proportional to the extent of the overlaps in their dictionary definitions. This measure is based on the adapted Lesk of Banerjee and Pedersen (2002), which extends this notion by using WordNet as the dictionary for the word definitions:

LESK(s1, s2) = sum_{s1' in linked(s1), s2' in linked(s2)}(overlap(s1'.definition, s2'.definition));

  • WUP: Wu & Palmer, 1994 - The Wu & Palmer measure calculates relatedness by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of the LCS:

WUP(s1, s2) = 2 * dLCS.depth / (min_{dlcs in dLCS}(s1.depth - dlcs.depth) + min_{dlcs in dLCS}(s2.depth - dlcs.depth) + 2 * dLCS.depth), where dLCS(s1, s2) = argmax_{lcs in LCS(s1, s2)}(lcs.depth);

  • RES: Resnik, 1995 - Resnik defined the similarity between two synsets to be the information content of their lowest super-ordinate (most specific common subsumer):

RES(s1, s2) = IC(LCS(s1, s2));

  • PATH - The Path measure computes the semantic relatedness of word senses by counting the number of nodes along the shortest path between the senses in the 'is-a' hierarchies of WordNet:

PATH(s1, s2) = 1 / path_length(s1, s2);

  • JCN: Jiang & Conrath, 1997 - The Jiang & Conrath measure uses the notion of information content but in the form of the conditional probability of encountering an instance of a child synset given an instance of a parent synset:

JCN(s1, s2) = 1 / jcn_distance(s1, s2), where jcn_distance(s1, s2) = IC(s1) + IC(s2) - 2 * IC(LCS(s1, s2)); when it is 0, jcn_distance(s1, s2) = -Math.log_e((freq(LCS(s1, s2).root) - 0.01) / freq(LCS(s1, s2).root)) is used instead, so that the distance is non-zero and does not result in infinite similarity;

  • LIN: Lin, 1998 - The idea behind the Lin measure is similar to JCN, with a small modification:

LIN(s1, s2) = 2 * IC(LCS(s1, s2)) / (IC(s1) + IC(s2)).

The descriptions above are extracted either from each paper or from WordNet-Similarity CPAN documentation.
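To make the arithmetic behind the path-based and information-content-based formulas concrete, here is a minimal Java sketch. It is not the WS4J implementation and does not call WS4J or WordNet: the class name MeasureSketch, its method names, and every number passed in are hypothetical. IC is assumed to be the usual -ln(freq(s) / freq(root)), WUP uses the standard Wu & Palmer form 2 * depth(LCS) / (depth(s1) + depth(s2)), and the HSO constants are left as parameters.

```java
/** Toy versions of the formulas above; inputs are made-up numbers, not WordNet lookups. */
class MeasureSketch {

    // PATH: inverse of the number of nodes on the shortest is-a path.
    static double path(int pathLength) {
        return 1.0 / pathLength;
    }

    // LCH: negative natural log of the path length scaled by twice the taxonomy depth.
    static double lch(int pathLength, int maxDepth) {
        return -Math.log(pathLength / (2.0 * maxDepth));
    }

    // WUP: twice the depth of the deepest LCS over the sum of both synset depths.
    static double wup(int depth1, int depth2, int lcsDepth) {
        return 2.0 * lcsDepth / (depth1 + depth2);
    }

    // HSO: constant C minus path length minus k times the number of direction changes.
    static double hso(int c, int k, int pathLength, int directionChanges) {
        return c - pathLength - k * directionChanges;
    }

    // Information content (assumed definition): -ln of the relative frequency.
    static double ic(double freq, double rootFreq) {
        return -Math.log(freq / rootFreq);
    }

    // RES: the information content of the LCS itself.
    static double res(double icLcs) {
        return icLcs;
    }

    // JCN: inverse of the JCN distance, with the fallback for a zero distance.
    static double jcn(double ic1, double ic2, double icLcs, double rootFreq) {
        double distance = ic1 + ic2 - 2 * icLcs;
        if (distance == 0.0) {
            // Small non-zero distance so that the similarity stays finite.
            distance = -Math.log((rootFreq - 0.01) / rootFreq);
        }
        return 1.0 / distance;
    }

    // LIN: twice the IC of the LCS over the sum of both synsets' IC.
    static double lin(double ic1, double ic2, double icLcs) {
        return 2 * icLcs / (ic1 + ic2);
    }

    public static void main(String[] args) {
        System.out.println("PATH = " + path(4));
        System.out.println("LCH  = " + lch(1, 20));
        System.out.println("WUP  = " + wup(6, 8, 5));
        System.out.println("HSO  = " + hso(8, 1, 3, 1));
        System.out.println("RES  = " + res(ic(10, 1000)));
        System.out.println("JCN  = " + jcn(2.0, 4.0, 1.5, 1000));
        System.out.println("LIN  = " + lin(2.0, 4.0, 1.5));
    }
}
```

In WS4J itself, the path lengths, depths and frequencies come from the lexical database and the bundled data files rather than from hand-picked numbers, so the values printed here are only illustrative.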

Prerequisites

By default, the requirements for compilation are:

  • JDK 8+
  • Maven

Any WordNet instance can be used in WS4J if it implements the ILexicalDatabase interface.

Built with Maven

To create a jar file with dependencies including resource files:

$ mvn install assembly:single

Using WS4J

Then start playing with the facade WS4J API:

src/main/java/edu/uniba/di/lacam/kdde/ws4j/WS4J.java

and a simple demo class:

src/main/java/edu/uniba/di/lacam/kdde/ws4j/demo/SimilarityCalculationDemo.java

which can be run through jar-with-dependencies from the root folder by typing into the terminal:

$ java -jar target/ws4j-1.0.2-jar-with-dependencies.jar

When using the WS4J jar package from other projects, add the JitPack repository to your POM file:

<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>

and declare this GitHub repo as a dependency:

<dependencies>
    <dependency>
        <groupId>com.github.dmeoli</groupId>
        <artifactId>WS4J</artifactId>
        <version>x.y.z</version>
    </dependency>
</dependencies>

Running the tests

To run JUnit test cases:

$ mvn test

The expected results from the test cases are compatible with the original WordNet::Similarity.

Initial Work

The original author is Hideki Shima.

License

This software is released under the GNU GPL v3 license. See the LICENSE file for details.
