Rewrite to support additional nutrient databases
m5n committed Sep 9, 2012
1 parent ced9d5e commit 8adc332
Showing 25 changed files with 107 additions and 56 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
Copyright (c) 2012 Maarten van Egmond (https://github.com/m5n/usda-nutrient-database-sql-port)
Copyright (c) 2012 Maarten van Egmond (https://github.com/m5n/nutriana)

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
5 changes: 4 additions & 1 deletion MySQL.pm
@@ -1,5 +1,8 @@
#!/usr/bin/perl

#
# MySQL SQL generating perl module
# This file is part of http://github.com/m5n/nutriana

use strict;

sub sql_comment {
38 changes: 27 additions & 11 deletions README
@@ -1,17 +1,33 @@
This project converts the information contained within the USDA National Nutrient Database for Standard Reference into SQL files for import into specific database systems.
DATABASE SYSTEMS SUPPORTED:
- MySQL

The database was downloaded from the USDA site at http://www.ars.usda.gov/nutrientdata and was not altered in any way.
NUTRIENT DATABASES INCLUDED:
- USDA National Nutrient Database for Standard Reference
http://www.ars.usda.gov/nutrientdata

Run the build.sh file to generate the SQL files. Currently, only a MySQL file is created, but it's easy to add files for other database systems.
Simply copy one of the Perl module files (*.pm) and alter it to output the format that the other database system requires.
To alter the database name or user credentials, edit the "generate_sql.pl" file.
PROJECT DESCRIPTION:
This project converts the food composition data released by various official
sources in the world to more modern formats. Often this data is provided by the
source as an Access database, an Excel file, or a set of character delimited
text files. For programmatic access, however, some sort of SQL format is
usually preferred to any of the above formats.

HOW IT WORKS:
A human being is needed to create a description file for a given nutrient database. The JSON format was chosen for readability and portability reasons.
Nutriana always converts the official data without modification. However, some changes may be necessary to ensure a successful database creation and data import. For example:
- the database schema as indicated by the official source is not compatible with the raw data files provided
- additional data rows are needed to avoid conflicts with foreign key constraints
See the *.MODIFICATIONS files for more details.

The "db_description.json" file was created manually by extracting the information from the "data/sr24_doc.pdf" file.
Modifications were made to the information in the "data/sr24_doc.pdf" file as well as the resulting SQL to remove any problems importing the nutrient database data; see the "MODIFICATIONS" file.
IF YOUR PREFERRED DATABASE IS NOT SUPPORTED:
It should be easy to add other SQL-based databases by copying one of the Perl module files (*.pm) and editing it to output the format that your database system requires. (If you find it's not, let me know by creating an issue.)
Run the build.sh file to (re)generate the SQL files. The script will automatically detect the new .pm file and attempt to output SQL for it.
To alter the database name or user credentials, edit the "generate_sql.pl" file.

Author:
AUTHOR:
- Maarten van Egmond

License:
- Usda-nutrient-database-sql-port is released under the MIT license; see the LICENSE file.
- The USDA Nutrient Database "USDA food composition data" is in the public domain and there is no copyright or licensing fees.
LICENSE:
- Nutriana is released under the MIT license; see the LICENSE file.
- Full licensing and usage information for the included nutrient databases is available in the *.LICENSE files; below is a summary:
- The USDA Nutrient Database "USDA food composition data" is in the public domain and there is no copyright or licensing fees.
43 changes: 26 additions & 17 deletions build.sh
@@ -1,25 +1,34 @@
#!/bin/sh

# The SQL files are generated via perl, so make sure it's installed.
# The SQL files are generated via Perl, so make sure it's installed.
PERL=`which perl`
if [ "$PERL" == "" ]; then echo "Please install perl" ; exit 1 ; fi
if [ -z "$PERL" ]; then echo "Please install Perl" ; exit 1 ; fi

# Check that the data files do not contain any special characters.
# Because in shell scripts `file data/*.txt` does not preserve newlines, defer to perl for this.
$PERL ./check_data_files.pl
# Process all nutrient databases included.
for NUTDBDIR in `find . -mindepth 1 -maxdepth 1 -type d`; do
# Extract nutrient database identifier.
NUTDBID=`expr "$NUTDBDIR" : "\./\(.*\)"`

# The perl modules indicate the databases to generate SQL for.
for PMFILE in `find . -type f -name \*.pm`; do
# Extract database identifier.
DBID=`expr "$PMFILE" : "\.\/\(.*\).pm"`
# Convert outfile to lowercase.
OUTFILE="$(tr [A-Z] [a-z] <<< "usda_nndsr_$DBID.sql")"
# Ignore .git dir.
if [ "$NUTDBID" = ".git" ]; then continue; fi

# Generate the SQL file for this database.
# Make sure to add the current directory to the beginning of @INC
# to avoid accidentally using official modules with the same name.
$PERL -I . -M$DBID ./generate_sql.pl > $OUTFILE
# Check that the data files do not contain any special characters.
# Because in shell scripts `file $NUTDBID/*.txt` does not preserve newlines,
# defer to Perl for this.
$PERL ./check_data_files.pl $NUTDBID

echo "$DBID file generated: $OUTFILE"
done
# The Perl modules indicate the databases to generate SQL for.
for PMFILE in `find . -type f -name \*.pm`; do
# Extract database identifier.
RDBMSID=`expr "$PMFILE" : "\./\(.*\).pm"`
# Convert outfile to lowercase.
OUTFILE=$(printf '%s\n' "${NUTDBID}_${RDBMSID}.sql" | tr '[:upper:]' '[:lower:]')

# Generate the SQL file for this database.
# Make sure to add the current directory to the beginning of @INC
# to avoid accidentally using official modules with the same name.
$PERL -I . -M$RDBMSID ./generate_sql.pl $RDBMSID $NUTDBID > $OUTFILE

echo "$RDBMSID file for $NUTDBID generated: $OUTFILE"
done
done
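The two naming steps in the loop above (finding the nutrient database directories and deriving the lowercase output file name) can be sketched in plain POSIX sh as follows. This is only an illustration: `list_nutdb_ids` and `outfile_name` are hypothetical helper names that do not exist in the repository.

```shell
#!/bin/sh
# Sketch only: list_nutdb_ids and outfile_name are hypothetical helpers,
# not functions from the nutriana repository.

# Print the name of each immediate subdirectory of $1.
# The glob does not match hidden directories, so .git is skipped implicitly.
list_nutdb_ids() {
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue   # the glob matched nothing
        basename "$dir"
    done
}

# Build the lowercase output file name "<nutdbid>_<rdbmsid>.sql".
outfile_name() {
    printf '%s_%s.sql\n' "$1" "$2" | tr '[:upper:]' '[:lower:]'
}
```

Calling `outfile_name usda_nndsr MySQL` yields `usda_nndsr_mysql.sql`, which matches the name of the generated file in this commit.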
6 changes: 5 additions & 1 deletion check_data_files.pl
@@ -1,9 +1,13 @@
#!/usr/bin/perl
#
# Warns if a raw data file is not ASCII.
# This file is part of http://github.com/m5n/nutriana

use strict;

my $nutdbid = $ARGV[0];
my $pwd = `pwd`; chomp $pwd;
my @files = split /\n/, `file data/*.txt`;
my @files = split /\n/, `file $nutdbid/*.txt`;
foreach (@files) {
$_ =~ /^(.*):/;
print "WARNING: data file $pwd/$1 contains non-ASCII characters\n" if $_ !~ /ASCII/;
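check_data_files.pl decides whether a file is ASCII by parsing the description printed by the `file` utility. As a rough alternative sketch (a different technique, not what the script does), the non-ASCII bytes can be counted directly by deleting every byte in the range 0-127 and counting what remains. The helper names `count_non_ascii_bytes` and `warn_if_non_ascii` are hypothetical.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical and use a different technique
# than check_data_files.pl, which parses the output of `file`.

# Print the number of bytes in file $1 that fall outside the ASCII range.
# LC_ALL=C makes tr operate on single bytes rather than locale characters.
count_non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Print a warning if file $1 contains at least one non-ASCII byte.
warn_if_non_ascii() {
    if [ "$(count_non_ascii_bytes "$1")" -gt 0 ]; then
        echo "WARNING: data file $1 contains non-ASCII characters"
    fi
}
```

Unlike the `file` heuristic, this counts bytes exactly, but it says nothing about which encoding the offending bytes belong to.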
23 changes: 14 additions & 9 deletions generate_sql.pl
@@ -1,4 +1,8 @@
#!/usr/bin/perl
#
# Generates the SQL file for a specific RDBMS.
# This file is part of http://github.com/m5n/nutriana


# Change these to suit your needs.
my $DB_NAME = "food";
@@ -10,8 +14,11 @@
use strict;
use JSON;


my $dbid = $ARGV[0];
my $nutdbid = $ARGV[1];
my $pwd = `pwd`; chomp $pwd;
my $file = "$pwd/db_description.json";
my $file = "$pwd/$nutdbid.json";
my $json = do { local $/ = undef; open my $fh, "<", $file or die "Could not open $file: $!"; <$fh>; };
my $data = decode_json($json);
my $field_separator = $data->{"field_separator"};
@@ -23,7 +30,7 @@
# Output header.
print sql_comment("="x$header_length). "\n";
print sql_comment($header) . "\n";
print sql_comment("This file was generated by http://github/m5n/usda-nutrient-database-sql-port") . "\n";
print sql_comment("This file was generated by http://github.com/m5n/nutriana") . "\n";
print sql_comment("Run this SQL with an account that has admin privileges, e.g.: mysql -v -u root < file.sql") . "\n";
print sql_comment("="x$header_length). "\n\n";

@@ -70,23 +77,21 @@
print "\n";
}

# Then add data.
# Add data.
foreach (@{$data->{"tables"}}) {
my %table = %{$_};
my $table_name = substr($table{"file"}, 0, -length(".txt"));

print sql_load_file("./data/" . $table{"file"}, $table_name, $field_separator, $text_separator) . "\n";
print sql_load_file("./$nutdbid/" . $table{"file"}, $table_name, $field_separator, $text_separator) . "\n";

# Assert all records were loaded. Make sure a SQL error is generated if the count is off.
print sql_assert_record_count($table_name, $table{"records"}) . "\n\n";
}

# Correct data issues before adding foreign keys; see MODIFICATIONS file.
print sql_insert("DERIV_CD", ("Deriv_Cd" => "", "Deriv_Desc" => "Added by http://github/m5n/usda-nutrient-database-sql-port to avoid foreign key error on NUT_DATA")) . "\n";
print sql_insert("FOOD_DES", ("NDB_No" => "", "FdGrp_Cd" => "0100", "Long_Desc" => "Added by http://github/m5n/usda-nutrient-database-sql-port to avoid foreign key error on NUT_DATA", "Shrt_Desc" => "See Long_Desc")) . "\n";
print sql_insert("NUTR_DEF", ("Nutr_No" => "", "Units" => "g", "NutrDesc" => "Added by http://github/m5n/usda-nutrient-database-sql-port to avoid foreign key error on FOOTNOTE", "Num_Dec" => "0", "Sr_Order" => 0)) . "\n\n";
# Correct data before adding foreign keys; see $nutdbid.MODIFICATIONS file.
print `perl -I . -M$dbid ./$nutdbid.fix_data_pre_fk.pl http://github.com/m5n/nutriana` . "\n";

# Then add foreign keys.
# Add foreign keys.
foreach (@{$data->{"tables"}}) {
my %table = %{$_};
my $table_name = substr($table{"file"}, 0, -length(".txt"));
1 change: 1 addition & 0 deletions usda_nndsr.LICENSE
@@ -0,0 +1 @@
The USDA Nutrient Database "USDA food composition data" is in the public domain and there is no copyright or licensing fees.
File renamed without changes.
13 changes: 13 additions & 0 deletions usda_nndsr.fix_data_pre_fk.pl
@@ -0,0 +1,13 @@
#!/usr/bin/perl
#
# Fixes data rows in preparation of adding foreign keys.
# This file is part of http://github.com/m5n/nutriana

use strict;

my $project_url = $ARGV[0];

print sql_insert("DERIV_CD", ("Deriv_Cd" => "", "Deriv_Desc" => "Added by $project_url to avoid foreign key error")) . "\n";
print sql_insert("FOOD_DES", ("NDB_No" => "", "FdGrp_Cd" => "0100", "Long_Desc" => "Added by $project_url to avoid foreign key error", "Shrt_Desc" => "See Long_Desc")) . "\n";
print sql_insert("NUTR_DEF", ("Nutr_No" => "", "Units" => "g", "NutrDesc" => "Added by $project_url to avoid foreign key error", "Num_Dec" => "0", "Sr_Order" => 0)) . "\n";

32 changes: 16 additions & 16 deletions usda_nndsr_mysql.sql
@@ -1,6 +1,6 @@
-- =========================================================================================================
-- USDA National Nutrient Database for Standard Reference, Release 24 (http://www.ars.usda.gov/nutrientdata)
-- This file was generated by http://github/m5n/usda-nutrient-database-sql-port
-- This file was generated by http://github.com/m5n/nutriana
-- Run this SQL with an account that has admin privileges, e.g.: mysql -v -u root < file.sql
-- =========================================================================================================

@@ -276,105 +276,105 @@ create table DATSRCLN (
);
alter table DATSRCLN add primary key (NDB_No, Nutr_No, DataSrc_ID);

load data local infile './data/FOOD_DES.txt' into table FOOD_DES fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
load data local infile './usda_nndsr/FOOD_DES.txt' into table FOOD_DES fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
create table tmp (c int unique key);
insert into tmp (c) values (2);
insert into tmp (select count(*) from FOOD_DES);
delete from tmp where c = 7907;
insert into tmp (select count(*) from tmp);
drop table tmp;

load data local infile './data/NUT_DATA.txt' into table NUT_DATA fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
load data local infile './usda_nndsr/NUT_DATA.txt' into table NUT_DATA fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
create table tmp (c int unique key);
insert into tmp (c) values (2);
insert into tmp (select count(*) from NUT_DATA);
delete from tmp where c = 583957;
insert into tmp (select count(*) from tmp);
drop table tmp;

load data local infile './data/WEIGHT.txt' into table WEIGHT fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
load data local infile './usda_nndsr/WEIGHT.txt' into table WEIGHT fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
create table tmp (c int unique key);
insert into tmp (c) values (2);
insert into tmp (select count(*) from WEIGHT);
delete from tmp where c = 13817;
insert into tmp (select count(*) from tmp);
drop table tmp;

load data local infile './data/FOOTNOTE.txt' into table FOOTNOTE fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
load data local infile './usda_nndsr/FOOTNOTE.txt' into table FOOTNOTE fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
create table tmp (c int unique key);
insert into tmp (c) values (2);
insert into tmp (select count(*) from FOOTNOTE);
delete from tmp where c = 522;
insert into tmp (select count(*) from tmp);
drop table tmp;

load data local infile './data/FD_GROUP.txt' into table FD_GROUP fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
load data local infile './usda_nndsr/FD_GROUP.txt' into table FD_GROUP fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
create table tmp (c int unique key);
insert into tmp (c) values (2);
insert into tmp (select count(*) from FD_GROUP);
delete from tmp where c = 25;
insert into tmp (select count(*) from tmp);
drop table tmp;

load data local infile './data/LANGUAL.txt' into table LANGUAL fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
load data local infile './usda_nndsr/LANGUAL.txt' into table LANGUAL fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
create table tmp (c int unique key);
insert into tmp (c) values (2);
insert into tmp (select count(*) from LANGUAL);
delete from tmp where c = 40205;
insert into tmp (select count(*) from tmp);
drop table tmp;

load data local infile './data/LANGDESC.txt' into table LANGDESC fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
load data local infile './usda_nndsr/LANGDESC.txt' into table LANGDESC fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
create table tmp (c int unique key);
insert into tmp (c) values (2);
insert into tmp (select count(*) from LANGDESC);
delete from tmp where c = 774;
insert into tmp (select count(*) from tmp);
drop table tmp;

load data local infile './data/NUTR_DEF.txt' into table NUTR_DEF fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
load data local infile './usda_nndsr/NUTR_DEF.txt' into table NUTR_DEF fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
create table tmp (c int unique key);
insert into tmp (c) values (2);
insert into tmp (select count(*) from NUTR_DEF);
delete from tmp where c = 146;
insert into tmp (select count(*) from tmp);
drop table tmp;

load data local infile './data/SRC_CD.txt' into table SRC_CD fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
load data local infile './usda_nndsr/SRC_CD.txt' into table SRC_CD fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
create table tmp (c int unique key);
insert into tmp (c) values (2);
insert into tmp (select count(*) from SRC_CD);
delete from tmp where c = 10;
insert into tmp (select count(*) from tmp);
drop table tmp;

load data local infile './data/DERIV_CD.txt' into table DERIV_CD fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
load data local infile './usda_nndsr/DERIV_CD.txt' into table DERIV_CD fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
create table tmp (c int unique key);
insert into tmp (c) values (2);
insert into tmp (select count(*) from DERIV_CD);
delete from tmp where c = 54;
insert into tmp (select count(*) from tmp);
drop table tmp;

load data local infile './data/DATA_SRC.txt' into table DATA_SRC fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
load data local infile './usda_nndsr/DATA_SRC.txt' into table DATA_SRC fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
create table tmp (c int unique key);
insert into tmp (c) values (2);
insert into tmp (select count(*) from DATA_SRC);
delete from tmp where c = 589;
insert into tmp (select count(*) from tmp);
drop table tmp;

load data local infile './data/DATSRCLN.txt' into table DATSRCLN fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
load data local infile './usda_nndsr/DATSRCLN.txt' into table DATSRCLN fields terminated by '^' optionally enclosed by '~' lines terminated by '\r\n';
create table tmp (c int unique key);
insert into tmp (c) values (2);
insert into tmp (select count(*) from DATSRCLN);
delete from tmp where c = 171155;
insert into tmp (select count(*) from tmp);
drop table tmp;

insert into DERIV_CD (Deriv_Desc, Deriv_Cd) values ('Added by http://github/m5n/usda-nutrient-database-sql-port to avoid foreign key error on NUT_DATA', '');
insert into FOOD_DES (Shrt_Desc, NDB_No, FdGrp_Cd, Long_Desc) values ('See Long_Desc', '', '0100', 'Added by http://github/m5n/usda-nutrient-database-sql-port to avoid foreign key error on NUT_DATA');
insert into NUTR_DEF (NutrDesc, Sr_Order, Units, Nutr_No, Num_Dec) values ('Added by http://github/m5n/usda-nutrient-database-sql-port to avoid foreign key error on FOOTNOTE', '0', 'g', '', '0');
insert into DERIV_CD (Deriv_Desc, Deriv_Cd) values ('Added by http://github.com/m5n/nutriana to avoid foreign key error', '');
insert into FOOD_DES (Shrt_Desc, NDB_No, FdGrp_Cd, Long_Desc) values ('See Long_Desc', '', '0100', 'Added by http://github.com/m5n/nutriana to avoid foreign key error');
insert into NUTR_DEF (NutrDesc, Sr_Order, Units, Nutr_No, Num_Dec) values ('Added by http://github.com/m5n/nutriana to avoid foreign key error', '0', 'g', '', '0');

alter table FOOD_DES add foreign key (FdGrp_Cd) references FD_GROUP(FdGrp_Cd);
alter table NUT_DATA add foreign key (NDB_No) references FOOD_DES(NDB_No);
