Set Expansion

Contributors: Sonam Gupta, Rajat Singh, Apaar Garg

The information available on the Internet is highly diverse and keeps growing every day. Methods are therefore needed that can interpret this information and turn it into something meaningful.

Set expansion is the task of expanding a small set of “seed terms” into a more complete set of entities of the same type. Put another way, it is the task of finding all the entities in a set given only a few of its members. Our purpose here is to propose a framework for performing set expansion.

What's out there already

A lot of research has been done in the field of named entity extraction. The following are some systems that perform set expansion:

Google Sets: One of the best-known and easiest-to-use set expansion tools. It takes three or more related seed terms as input and outputs members of the set these terms belong to. It has been used extensively in related research.

Set Expander for Any Language (SEAL): This system exploits the structure inside web pages to find terms for the expanded set and has its own ranking algorithm. Unlike Google Sets, which can be used only for English queries, it can be used for any language.

KnowItAll System: This system uses textual patterns to extract candidate entities from the web and then ranks them in a bootstrapping manner using statistical information gathered from a search engine.

A brief tour of how the model works

The aim is to find related words given a few members of a set as input. To do this, we use the Set Expander for Any Language (SEAL) architecture to expand the seed set, and then rank the entities in the expanded set with a ranking algorithm built on top of a word2vec model. The example below shows how the system accepts initial seed terms and outputs the expanded list of entities.

Example:
input: mango, apple, banana
number of results: 8

output:
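
To make the ranking step from the tour above a little more concrete, here is a minimal sketch in Java. It is not the project's actual code. It assumes that each word already has an embedding vector (for example, loaded from a trained word2vec model) and that a candidate is scored by its mean cosine similarity to the seed terms; this post does not spell out the exact scoring formula, so that choice is an assumption. The class name SeedSimilarityRanker, its methods, and the tiny three-dimensional vectors are all hypothetical and purely for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: ranks candidates by mean cosine similarity to the seed terms. */
class SeedSimilarityRanker {

    /** Cosine similarity of two equal-length vectors; returns 0 if either vector has zero norm. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns at most k candidates, ordered from highest to lowest mean similarity to the seeds. */
    static List<String> rank(List<String> seeds, List<String> candidates,
                             Map<String, double[]> vectors, int k) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (String candidate : candidates) {
            double[] cv = vectors.get(candidate);
            // Skip words without a vector and the seed terms themselves.
            if (cv == null || seeds.contains(candidate)) continue;
            double sum = 0;
            int n = 0;
            for (String seed : seeds) {
                double[] sv = vectors.get(seed);
                if (sv == null) continue;
                sum += cosine(cv, sv);
                n++;
            }
            if (n > 0) scores.put(candidate, sum / n);
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort(Comparator.comparingDouble((String c) -> scores.get(c)).reversed());
        return new ArrayList<>(ranked.subList(0, Math.min(k, ranked.size())));
    }

    public static void main(String[] args) {
        // Toy vectors made up for this sketch; real ones would come from a word2vec model.
        Map<String, double[]> vectors = new HashMap<>();
        vectors.put("mango", new double[] {0.9, 0.1, 0.0});
        vectors.put("apple", new double[] {0.8, 0.2, 0.1});
        vectors.put("banana", new double[] {0.85, 0.15, 0.05});
        vectors.put("orange", new double[] {0.9, 0.2, 0.0});
        vectors.put("grape", new double[] {0.8, 0.1, 0.1});
        vectors.put("car", new double[] {0.0, 0.1, 0.9});
        vectors.put("london", new double[] {0.1, 0.9, 0.1});

        List<String> seeds = Arrays.asList("mango", "apple", "banana");
        // In the real pipeline these candidates would come from the SEAL expansion step.
        List<String> candidates = Arrays.asList("orange", "car", "grape", "london");
        System.out.println(rank(seeds, candidates, vectors, 2));
    }
}
```

With these toy vectors the two fruit candidates score far above "car" and "london", so they are the two entries kept. Averaging over all seeds, rather than comparing against a single seed, is one simple way to favour candidates that resemble the seed set as a whole.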

Going forward

The code is all available on GitHub under the MIT License, and anyone is welcome to try it out (preferably on Ubuntu). If you're feeling extra adventurous, we warmly welcome pull requests for new features or bug fixes. The code base is written in Java. The project is in its fairly early stages and the code is not among the nicest we've produced, but we have tried our best to implement it properly.


If you want to explore further, here are some links that will help you learn more:


Tags related to the project

Information Retrieval and Extraction Course, IIIT-H, Major Project, Set expansion, Apaar Garg, Rajat Singh, Sonam Gupta