Convert the Dataset into Unified OmniEvent Format
To simplify subsequent data loading and modeling, we provide pre-processing scripts for commonly-used Event Extraction
datasets. Users can download the dataset and convert it to the unified OmniEvent format by configuring the data path
defined in the run.sh
file under the
scripts/data_preprocessing folder with the same name as
the dataset.
Unified OmniEvent Format
A unified OmniEvent dataset is a JSON Line file with the extension .unified.jsonl (such
as, train.unified.jsonl
, valid.unified.jsonl
, and test.unified.jsonl
), which is a convenient format for
storing structured data that enables processing one record, in one line, at a time. Taking a record from TAC KBP 2016 as
an example, a piece of data in the unified OmniEvent format could be demonstrated as follows:
{
"id": "NYT_ENG_20130910.0002-6",
"text": "In 1997 , Chun was sentenced to life in prison and Roh to 17 years .",
"events": [{
"type": "sentence",
"triggers": [{
"id": "em-2342",
"trigger_word": "sentenced",
"position": [19, 28],
"arguments": [{
"role": "defendant",
"mentions": [{
"id": "m-291",
"mention": "Chun",
"position": [10, 14]}]}, ... ]}, ... ]} ... ],
"negative_triggers": [{
"id": 0,
"trigger_word": "In",
"position": [0, 2]}, ... ],
"entities": [{
"type": "PER",
"mentions": [{
"id": "m-291",
"mention": "Chun",
"position": [10, 14]}, ... ]}, ... ]}
Supported Datasets
The pre-processing scripts support almost all commonly-used Event Extraction datasets, so as to minimize the data conversion difficulties. Additional pre-processing scripts are still being developed, and you can submit datasets for which you wish us to complete in “Pull requests”. Currently, we have developed pre-processing scripts for the following datasets:
ACE2005: ACE2005-EN, ACE2005-DyGIE, ACE2005-OneIE, ACE2005-ZH
DuEE: DuEE1.0, DuEE-fin
ERE: LDC2015E29, LDC2015E68, LDC2015E78
FewFC
TAC KBP: TAC KBP 2014, TAC KBP 2015, TAC KBP 2016, TAC KBP 2017
LEVEN
MAVEN
Dataset Conversion
Step 1: Download the Dataset
The first step of data conversion is to download the proposed dataset from its corresponding website. For example, for the DuEE 1.0 dataset, it could be downloaded from here.
Step 2: Configure the Dataset Path
After downloading the dataset from the Internet, the run.sh
file under the folder with the same name as the dataset
should be configured. For example, for the DuEE 1.0 dataset, the run.sh
file under the path
scripts/data_preprocessing/duee
should be configured, in which the data_dir
path should be the same as the path of placing the downloaded dataset,
you can also modify the path of the processed dataset by configuring the save_dir
path:
python duee.py \
--data_dir ../../../data/original/DuEE1.0 \
--save_dir ../../../data/processed/DuEE1.0
Step 3: Execute the run.sh
File
After downloading the dataset and configuring the corresponding run.sh
file, finally, the dataset could finally be
converted to the unified OmniEvent format by executing the configured run.sh
file. For example, for the DuEE1.0
dataset, we could execute the run.sh
file as follows:
bash run.sh