The converjdm.sh script converts the JDM dump to three CSV files: relation_types.csv; relations.csv; nodes.csv; in order to allow loading the data into an SQL database with a simple relational schema.
The current version of the scrip should work on the gnu versions of sed
and grep
. If you are using non gnu versions, some adaptations could be necessary, for example, on MacOS, | should be escaped from regular expressions, while on gnu sed
, they shouldn't be escaped.
#!/bin/bash
DUMPFILE=$1
rtstart=$(grep "// ---- RELATION TYPES" $DUMPFILE -n -m1 | cut -d':' -f1)
rtend=$(grep "// ---- NODE TYPES" $DUMPFILE -n -m1 | cut -d':' -f1)
nstart=$(grep "// -- NODES" $DUMPFILE -n -m1 | cut -d':' -f1)
rstart=$(grep "// -- RELATIONS" $DUMPFILE -n -m1 | cut -d':' -f1)
rend=$(wc -l $DUMPFILE | cut -f1 -d' ')
#Relation Types
# rtid=0|name="r_associated"|nom_etendu="id<E9>e associ<E9>e"|info="Il est demand<E9> d'<E9>num<E9>rer les termes les plus <E9>troitement associ<E9>s au mot cible... Ce mot vous fait penser <E0> quoi ?"
head -$(expr $rtend - 2) $DUMPFILE | tail -n $(expr $rtend - $rtstart - 3) | sed 's/rtid=\([0-9]*\)|name=\("[^"]*"\)|nom_etendu=\("[^"]*"\)|info=\("[^"]*"\)\|t=\([0-9]*\)/\1;\2;\3;\4/g' > relation_types.csv
#Nodes
head -$(expr $rstart - 1) $DUMPFILE | tail -n $(expr $rstart - $nstart - 2) | sed 's/eid=\([0-9]*\)|n=\("[^"]*"\)|t=\([0-9]*\)|w=\([0-9]*\)\(|nf=\("[^"]*"\)\)\{0,1\}/\1;\2;\3;\4;\5/g' > nodes.csv
#Relations
echo "Extracting relations"
echo "$rstart"
echo "$rend"
tail -n$(expr $rend - $rstart) $DUMPFILE | sed 's/rid=\([0-9]*\)|n1=\([0-9]*\)|n2=\([0-9]*\)|t=\([0-9]*\)|w=\([0-9]*\)/\1;\2;\3;\4;\5/g' > relations.csv
-
Download the latest full dump from: http://www.jeuxdemots.org/JDM-LEXICALNET-FR/
-
Make the script executable:
chmod +x ./convertjdm.sh
-
Convert the dump to UTF-8: By default the encoding used by the JdM dump is
cp1252
(Windows) and should be converted to UTF-8, unless you are using bash on the Windows Subsystem For Linux (Windows 10.1 and later). For example:iconv -f WINDOWS-1252 -t UTF-8 02042019-LEXICALNET-JEUXDEMOTS-FR-NOHTML.txt > jdm_full_utf8.txt
-
Run the script:
./convertjdm.sh jdm_full_utf8.txt
Most SQL databases support loading CSV files in a straightforward way. We provide and example that loads the files on mysql. Please make sure to adapt the paths and to configure the database server properly (see below).
create database jdm;
use jdm;
CREATE TABLE IF NOT EXISTS nodes (eid INT, n TEXT, t INT, w INT, nf TEXT) engine = MYISAM;
CREATE TABLE IF NOT EXISTS relations (rid INT, n1 INT, n2 INT, t INT, w INT) engine = MYISAM;
CREATE TABLE IF NOT EXISTS relation_types (rtid INT, name TEXT, nom_etendu TEXT, info TEXT) engine = MYISAM;
LOAD DATA INFILE "/var/lib/mysql-files/nodes.csv" INTO TABLE nodes CHARACTER SET 'latin1' COLUMNS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '"' LINES TERMINATED BY '\n';
LOAD DATA INFILE "/var/lib/mysql-files/relation_types.csv" INTO TABLE relation_types CHARACTER SET 'latin1' COLUMNS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '"' LINES TERMINATED BY '\n';
LOAD DATA INFILE "/var/lib/mysql-files/relations.csv" INTO TABLE relations CHARACTER SET 'latin1' COLUMNS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '"' LINES TERMINATED BY '\n’;
You should configure MySQL to allow the loading of data from CSV file:
# On of the other depending on the version of MySQL, it is safe to put both
local-infile=ON
local_infile=ON
# The path from which loading files is allowed, that's where the csv files should be located
datadir=/var/lib/mysql