Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach

Source code and data for Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach

ReHession conducts relation extraction with heterogeneous supervision, e.g., the labeling functions shown in the left corner of the figure.

ReHession conducts relation extraction, featuring:

it employs heterogeneous supervision, e.g., a knowledge base and heuristic patterns, to train the model (as pictured in the left corner)
it infers true labels from noisy labels in a context-aware manner
true label discovery and relation extraction mutually enhance each other

This README is in an early-release beta. Expect some adventures...

Overview
Data
Labeling Functions
Feature Extraction
Model Learning and Evaluation
- encoding
- compile
- execute
Reference

Pipeline Overview

The pipeline of ReHession is:

recognize entities for the labeling functions
apply the labeling functions to get heterogeneous supervision
generate POS tags and Brown clusters
encode the training / testing corpus
train and evaluate the model

Data

We include corpus, labeling functions and knowledge base in the Data folder.

Corpus

We store the KBP[1] corpus at Data/source/KBP/corpus.txt.zip and the NYT[2] corpus at Data/source/NYT/corpus.txt.zip. POS tagging and entity detection have been performed with the Stanford NER tools.

Patterns

We store the pattern-based labeling functions in the folder of the corresponding corpus, in a file named nlf.json. Each line of this file holds one pattern-based labeling function in JSON format.
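Because nlf.json holds one JSON object per line, it can be read line by line. The following is a minimal sketch, not code from the repository: the helper name load_labeling_functions is hypothetical, and the sample line reuses the example labeling function given in the Pattern based section.

```python
import json


def load_labeling_functions(lines):
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in lines if line.strip()]


# A sample line in the nlf.json format; blank lines are skipped.
sample_lines = [
    '{"reserved": "1", "PID": 0, "rule": ["1", "1"], "Texture": "founder of", '
    '"relationType": "/business/company/founders", '
    '"Type1": "ORGANIZATION", "Type2": "PERSON"}',
    "",
]

functions = load_labeling_functions(sample_lines)
print(functions[0]["PID"], functions[0]["relationType"])
```

To read a real file, pass an open file handle for nlf.json instead of the sample list, since iterating over a file yields its lines.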

Knowledge base

We store the annotations generated by the KB-based labeling functions in the folder of the corresponding corpus, in a file named train.json. This file is also the training file used by CoType.

Labeling Functions

We adopt three kinds of labeling functions in ReHession: KB-based, pattern-based and inverse. The annotations generated by these labeling functions are saved under Data/intermediate/.

KB based

KB-based labeling functions encode the information in the KB. Accordingly, we adopt the training file generated by distant supervision and treat all annotations of the same relation type as generated by the same labeling function (in the form of if r(e1, e2) in KB: return r).
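The rule "if r(e1, e2) in KB: return r" can be sketched as one small function per relation type. This is only an illustration under stated assumptions: the KB is modeled as a set of (relation, e1, e2) triples, the helper name make_kb_labeling_function is hypothetical, and the entity names are made-up toy data rather than entries from the released knowledge base.

```python
def make_kb_labeling_function(relation, kb):
    """Build the labeling function for one relation type (hypothetical helper).

    kb is assumed to be a set of (relation, e1, e2) triples.
    """
    def labeling_function(e1, e2):
        # if r(e1, e2) in KB: return r; otherwise give no annotation
        return relation if (relation, e1, e2) in kb else None

    return labeling_function


# Toy KB with a single made-up fact.
toy_kb = {("/business/company/founders", "ExampleCorp", "Jane Doe")}
founders_lf = make_kb_labeling_function("/business/company/founders", toy_kb)

print(founders_lf("ExampleCorp", "Jane Doe"))
print(founders_lf("ExampleCorp", "John Roe"))
```

Returning None here only means "no annotation from this labeling function"; it is distinct from the None relation type produced by the inverse labeling functions.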

Pattern based

Pattern-based labeling functions annotate entity pairs that match the required entity types and textual pattern with a preset relation type. Each pattern-based labeling function (as stored in nlf.json) has the following fields:

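reserved: the order of the two entities around the pattern; in the example below, "1" means the entity of Type2 comes before the pattern and the entity of Type1 comes after it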
PID: pattern id
rule: the rule for matching multiple entities
Texture: the textual pattern to match
relationType: the relation type assigned by this labeling function
Type1: the type of entity 1
Type2: the type of entity 2

The rule field has a value in the format [a,b], where a and b can each be a number or n. It indicates how many entities the labeling function tries to match (n means matching all entities). For example, {"reserved": "1", "PID": 0, "rule": ["1", "1"], "Texture": "founder of", "relationType": "/business/company/founders", "Type1": "ORGANIZATION", "Type2": "PERSON"} requires the entity before the textual pattern (indicated by reserved) to be a PERSON (indicated by Type2) and the entity after the textual pattern to be an ORGANIZATION (indicated by Type1). This labeling function also annotates only the entity pair closest to the textual pattern (indicated by ["1", "1"]).
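To make the field semantics concrete, here is a rough sketch of how such a labeling function might be applied to a tokenized sentence. It is not the logic of LabelGeneration/applying_labelling_func.py. Several points are assumptions: mentions are given as (start, end, type) token spans with an exclusive end, rule [a,b] is read as "a closest mentions before the pattern, b closest after it", the closest mentions are selected before the type check, and a reserved value other than "1" is taken to put Type1 before the pattern. The sentence is a made-up toy example.

```python
def take_closest(mentions, limit):
    """Keep the `limit` closest mentions; "n" keeps all of them."""
    return mentions if limit == "n" else mentions[:int(limit)]


def apply_pattern(lf, tokens, mentions):
    """Return (entity1, entity2, relationType) annotations (hypothetical helper)."""
    pattern = lf["Texture"].split()
    k = len(pattern)
    swapped = lf["reserved"] == "1"
    # Assumption: reserved == "1" puts Type2 before the pattern and Type1 after.
    before_type = lf["Type2"] if swapped else lf["Type1"]
    after_type = lf["Type1"] if swapped else lf["Type2"]
    annotations = []
    for p in range(len(tokens) - k + 1):
        if tokens[p:p + k] != pattern:
            continue
        # Sort the mentions on each side by their distance to the pattern.
        before = sorted((m for m in mentions if m[1] <= p), key=lambda m: p - m[1])
        after = sorted((m for m in mentions if m[0] >= p + k), key=lambda m: m[0] - (p + k))
        before = [m for m in take_closest(before, lf["rule"][0]) if m[2] == before_type]
        after = [m for m in take_closest(after, lf["rule"][1]) if m[2] == after_type]
        for b in before:
            for a in after:
                e1, e2 = (a, b) if swapped else (b, a)
                annotations.append((e1, e2, lf["relationType"]))
    return annotations


lf = {"reserved": "1", "PID": 0, "rule": ["1", "1"], "Texture": "founder of",
      "relationType": "/business/company/founders",
      "Type1": "ORGANIZATION", "Type2": "PERSON"}
tokens = ["Jane", "Doe", ",", "founder", "of", "ExampleCorp", ",", "spoke", "."]
mentions = [(0, 2, "PERSON"), (5, 6, "ORGANIZATION")]
result = apply_pattern(lf, tokens, mentions)
print(result)
```

In the toy sentence, the PERSON mention precedes "founder of" and the ORGANIZATION mention follows it, so the pair is annotated with entity 1 being the organization and entity 2 the person.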

Inverse

To annotate the None type, we designed another kind of labeling function: if none of the labeling functions in a given set annotates an instance, it annotates that instance as None.

Specifically, for the KBP dataset, we adopt an inverse labeling function over all pattern-based labeling functions; for the NYT dataset, we adopt an inverse labeling function over all KB-based labeling functions.
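The inverse labeling function can be sketched as a set difference: every instance that no labeling function in the reference set has annotated receives None. This is a minimal illustration, not the code of LabelGeneration/applyingInverse.py; the helper name apply_inverse, the annotated_by mapping and the ids are all hypothetical.

```python
def apply_inverse(instance_ids, annotated_by, reference_lfs):
    """Annotate as "None" every instance left unannotated by the reference set.

    annotated_by maps a labeling function id to the set of instance ids it
    annotated (an assumed data layout for this sketch).
    """
    covered = set()
    for lf_id in reference_lfs:
        covered |= annotated_by.get(lf_id, set())
    return {i: "None" for i in instance_ids if i not in covered}


# Toy data: labeling functions 0 and 1 form the reference set, 2 does not.
annotated_by = {0: {10, 11}, 1: {12}, 2: {13}}
none_annotations = apply_inverse([10, 11, 12, 13, 14], annotated_by, [0, 1])
print(none_annotations)
```

Instance 13 is still labeled None here because the only labeling function that annotated it lies outside the reference set.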

KBP Dataset

We now use the KBP dataset as an example (also saved as labelGeneration.sh) to demonstrate the pipeline for generating heterogeneous supervision.

python LabelGeneration/UIDExtractKB.py
python LabelGeneration/post_process_chunked_corpus.py
python LabelGeneration/applying_labelling_func.py --save_all
python LabelGeneration/applying_KB.py
python LabelGeneration/reCodeFuncs.py
python LabelGeneration/MergeLFS.py
python LabelGeneration/cal_Mention_Distance.py
python LabelGeneration/applyingInverse.py

These commands require the original data to be stored in /Data/source/KBP/; the specific requirements are stored in the default settings.

Feature Extraction

With the annotated dataset and the Brown clustering file stored under /Data/intermediate/KBP/, feature extraction can be performed by

python DataProcessor/relation_feature_generation.py

Model Learning and Evaluation

Encoding

The model is designed to run on an encoded training / evaluation / testing corpus. Each line in an encoded file is one instance, in the format

InsID	FeatureNum	AnnotationNum	FeatureList	AnnotationList

InsID, FeatureNum and AnnotationNum are integers; FeatureList is a comma-separated list of feature ids (featureId,featureId,...); AnnotationList gives the lfId and typeId of each annotation. For example, an instance in the KBP dataset looks like:

10      78      2       457984,153472,646018,120323,1119382,739279,152889,199945,1146378,1077643,1138574,501136,1091345,55555,65558,1125465,131097,947866,1128477,869918,485663,289,201307,1066993,359723,681233,767278,1053381,1019443,742453,324534,570977,244,1059642,434747,267324,99135,376075,815283,84805,382022,1119585,1107934,43594,43595,809036,676557,749903,485662,78163,297812,43278,95417,1092708,673498,538331,218205,977502,189791,991328,380641,354658,1030244,1173222,492263,209340,520684,379757,335215,409201,786164,835966,21497,16250,872059,562651,137214,498815        6,10,6,9
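A line in this format can be parsed with a few lines of Python. The sketch below rests on assumptions: fields are separated by whitespace, and the annotation list is a flat comma-separated sequence read as consecutive lfId,typeId pairs, following the order in the description above; the pairing order is not confirmed by the example itself. The helper name parse_instance and the shortened sample line are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Instance(NamedTuple):
    ins_id: int
    features: List[int]
    annotations: List[Tuple[int, int]]  # assumed (lfId, typeId) pairs


def parse_instance(line: str) -> Instance:
    """Parse one encoded line (hypothetical helper, assumes non-empty lists)."""
    ins_id, feature_num, annotation_num, feature_list, annotation_list = line.split()
    features = [int(x) for x in feature_list.split(",")]
    flat = [int(x) for x in annotation_list.split(",")]
    # Assumption: the flat list alternates lfId, typeId, lfId, typeId, ...
    annotations = list(zip(flat[0::2], flat[1::2]))
    if len(features) != int(feature_num) or len(annotations) != int(annotation_num):
        raise ValueError("declared counts do not match the lists")
    return Instance(int(ins_id), features, annotations)


# Shortened, made-up line in the same layout as the KBP example.
instance = parse_instance("10\t3\t2\t457984,153472,646018\t6,10,6,9")
print(instance)
```

The count check catches truncated or misaligned lines early, before they reach the model.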

Compile

Running make in the Model folder compiles the model.

Execute

The execute commands for the KBP and NYT datasets (also saved in train.sh) are:

./Model/ReHession -train ./Data/intermediate/KBP/train.data -test ./Data/intermediate/KBP/test.data -none_idx 6 -instances 225977 -test_instances 2111
./Model/ReHession -train ./Data/intermediate/NYT/train.data -test ./Data/intermediate/NYT/test.data -none_idx 0 -instances 530767 -test_instances 3803
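Since each line of an encoded file is one instance, the values passed to -instances and -test_instances presumably equal the number of lines in the training and testing files. That reading is an assumption, not something this README states. Under it, the counts for a custom dataset could be obtained with a small helper such as the hypothetical count_instances below, shown here on in-memory toy data.

```python
import io


def count_instances(lines):
    """Count non-empty lines, assuming one encoded instance per line."""
    return sum(1 for line in lines if line.strip())


# Toy stand-in for an encoded file; an open file handle works the same way.
toy_file = io.StringIO(
    "0\t2\t1\t5,7\t0,6\n"
    "1\t1\t1\t9\t3,2\n"
    "\n"
)
n_instances = count_instances(toy_file)
print(n_instances)
```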

Reference

Please cite the following paper if you find the code and datasets useful:

@inproceedings{Liu2017rehession,
 title={Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach},
 author={Liu, Liyuan and Ren, Xiang and Zhu, Qi and Zhi, Shi and Gui, Huan and Ji, Heng and Han, Jiawei},
 booktitle={Proc. EMNLP},
 year={2017}
}
