Reading and writing to SQL database #158

Open
slwu89 opened this issue Nov 30, 2024 · 18 comments

slwu89 commented Nov 30, 2024

We would like to read/write acsets from/to a SQL database. We can rely on the https://github.com/MechanicalRabbit/FunSQL.jl package to help automate the writing of SQL code.


slwu89 commented Dec 2, 2024

@TheCedarPrince since you are the local FunSQL expert, I wanted to ask, does FunSQL have a SQLNode object to do CREATE TABLE? The feature this issue is tracking is very simple read/write functionality, essentially just read_json_acset and write_json_acset as they exist in the current acset serialization functionality, but for a general SQL DB connection. We'd need to use CREATE TABLE for each object in an acset.


slwu89 commented Dec 3, 2024

One thing we will need to think about is how to handle writing attribute data. SQLite, for example, supports a rather restricted set of data types (https://www.sqlite.org/datatype3.html). Unless we use BLOB, we are basically restricted to:

julia_to_sql_type(::Type{T}) where {T} = error("type $(T) not supported for SQL conversion")
julia_to_sql_type(::Type{T}) where {T<:Real} = "REAL"
julia_to_sql_type(::Type{T}) where {T<:AbstractString} = "TEXT"
julia_to_sql_type(::Type{T}) where {T<:Integer} = "INTEGER"

So any attributes that are unions, or vectors, etc., are out. One way to go about this is to let the user provide, if they want, an SQL target type and a conversion function for specific AttrTypes (e.g. to convert a Vector{String} into a string delimited by commas). Otherwise we use the fallback above.
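To make that concrete, here is a minimal sketch of how a user-supplied override table could sit on top of the fallback. The names here (ATTR_OVERRIDES, register_override!, sql_type, sql_value, and the :Tags attribute) are invented for illustration, not an existing ACSets.jl API; the fallback methods are repeated so the sketch is self-contained:

```julia
# Fallback: map simple Julia types to SQLite storage classes (repeated from above).
julia_to_sql_type(::Type{T}) where {T} = error("type $(T) not supported for SQL conversion")
julia_to_sql_type(::Type{T}) where {T<:Real} = "REAL"
julia_to_sql_type(::Type{T}) where {T<:AbstractString} = "TEXT"
julia_to_sql_type(::Type{T}) where {T<:Integer} = "INTEGER"

# Hypothetical registry of user overrides: AttrType name => (SQL target type, converter).
const ATTR_OVERRIDES = Dict{Symbol,Tuple{String,Function}}()

register_override!(attr::Symbol, sqltype::String, f) = (ATTR_OVERRIDES[attr] = (sqltype, f))

# Example: store a hypothetical Vector{String}-valued :Tags attribute as comma-delimited TEXT.
register_override!(:Tags, "TEXT", v -> join(v, ","))

# Look up the SQL type for an attribute, preferring a user override over the fallback.
sql_type(attr::Symbol, ::Type{T}) where {T} =
    haskey(ATTR_OVERRIDES, attr) ? ATTR_OVERRIDES[attr][1] : julia_to_sql_type(T)

# Convert a value for writing, applying the override's converter if one is registered.
sql_value(attr::Symbol, x) =
    haskey(ATTR_OVERRIDES, attr) ? ATTR_OVERRIDES[attr][2](x) : x
```

With this, `sql_type(:Tags, Vector{String})` yields `"TEXT"` and `sql_value(:Tags, ["a", "b"])` yields `"a,b"`, while unregistered attributes fall through to `julia_to_sql_type`.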


slwu89 commented Dec 3, 2024

For acsets, we can more or less name the obs, homs, etc. anything we want. But in SQLite (and other implementations), there are quite a few reserved words we cannot use for tables (see https://sqlite.org/lang_keywords.html). So by default we should give the table associated with an object the same name as the object, unless the user provides an override.

Also, in SQL (in yet another departure of SQL implementations from the relational model they supposedly "implement"), the order in which columns are specified in CREATE TABLE matters. Not sure if this will complicate things in practice, but we should be aware that subtle bugs might arise if it is not taken into account.


slwu89 commented Dec 4, 2024

Apparently SQLite does not allow multiple tables to be created in a single statement: the code below (silently, of course) generates only the artist table and ignores the remaining statements. This is a tricky problem for writing an arbitrary acset to a SQL database, because foreign keys play the role of homs in acsets, and in SQL, if you declare a column to be a foreign key to the primary key of another table, that other table had better exist already.

using SQLite
using DataFrames

function get_table(db, table)
    query = DBInterface.execute(db, "SELECT * FROM $table")
    return DataFrames.DataFrame(query)
end

db = SQLite.DB()

DBInterface.execute(db, """
    CREATE TABLE artist(
        artistid    INTEGER PRIMARY KEY, 
        artistname  TEXT
    );
    CREATE TABLE track(
        trackid     INTEGER, 
        trackname   TEXT, 
        trackartist INTEGER,
        FOREIGN KEY(trackartist) REFERENCES artist(artistid)
    );
""")

get_table(db, "artist")
get_table(db, "track")

One way around this is to first create each table "individually" and later add the foreign key columns. Of course there is another problem here, in that the FOREIGN KEY clause does not seem to be compatible with ALTER TABLE. The best I can get working is below. This is less than ideal, as REFERENCES is a "column level" foreign key, meaning that the column in the target table it points to doesn't necessarily have to be that table's primary key. FOREIGN KEY is what declares that the column points to another table's primary key (this is not documented clearly anywhere; I read it at https://stackoverflow.com/questions/8595695/what-is-difference-between-foreign-key-and-reference-key). So we still don't have proper foreign keys. @TheCedarPrince any thoughts?

using SQLite
using DataFrames

function get_table(db, table)
    query = DBInterface.execute(db, "SELECT * FROM $table")
    return DataFrames.DataFrame(query)
end

db = SQLite.DB()

DBInterface.execute(db, """
    CREATE TABLE artist(
        artistid    INTEGER PRIMARY KEY, 
        artistname  TEXT
    );
""")

DBInterface.execute(db, """
    CREATE TABLE track(
        trackid     INTEGER, 
        trackname   TEXT
    );
""")

get_table(db, "artist")
get_table(db, "track")

DBInterface.execute(db, """
    ALTER TABLE track
        ADD COLUMN trackartist INTEGER DEFAULT NULL
        REFERENCES artist(artistid);
""")


TheCedarPrince commented Dec 4, 2024

Hey @slwu89 , here are some out of order thoughts on this!

@TheCedarPrince since you are the local FunSQL expert, I wanted to ask, does FunSQL have a SQLNode object to do CREATE TABLE?

FunSQL.jl does not have support for CREATE TABLE, for I think precisely the reason we are running into -- all the variance in table creation across SQL flavors. If you want more clarification, the FunSQL.jl folks are very active on the Julia Zulip here. What they do support is creating the table first, using whatever mechanism the SQL flavor requires, and then inserting the result of a FunSQL query into it like this:

using FunSQL: From, reflect, render

fun_sql = From(...) # Some FunSQL DSL statement

catalog = reflect(
    db_conn;
    schema = "",
    dialect=:sqlite
);

sql = render(catalog, fun_sql);

res = execute(db_conn,
    """
    INSERT INTO
        table
    SELECT
        *
    FROM
        ($sql) AS foo;
    """
)

But in SQLite (and other implementations), there are quite a few reserved words we cannot use for tables (see sqlite.org/lang_keywords.html). So by default we should name the table associated with an object the same name, unless the user gives an override.

Hm. We may need to create a small sanitizer package or routine that checks that no statements use reserved words for each flavor we want to support.
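A minimal sketch of such a check is below. The keyword set here is a tiny hand-picked subset of SQLite's roughly 150 reserved words (a real sanitizer would carry the full list per dialect), and appending `_tbl` is just one possible rename policy:

```julia
# A small, incomplete sample of SQLite reserved words (see sqlite.org/lang_keywords.html).
const SQLITE_RESERVED = Set(["table", "select", "from", "where", "order", "group", "index", "to"])

# True if the name can be used as a table name without quoting.
is_safe_table_name(name::AbstractString) = !(lowercase(name) in SQLITE_RESERVED)

# One possible policy: rename colliding table names by suffixing "_tbl".
sanitize_table_name(name::AbstractString) =
    is_safe_table_name(name) ? String(name) : String(name) * "_tbl"
```

An alternative policy would be to quote identifiers (`"order"`) instead of renaming them, which most dialects also support.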

One thing we will need to think about is how to handle writing of attribute data.

So one thing I am a strong advocate for here is using DBInterface.jl to talk to the DBs we are interested in. I know that most DB packages within Julia support the DBInterface API, which is great and should help a lot with this.

So we still don't have proper foreign keys. @TheCedarPrince any thoughts?

Is there any particular reason why we are sticking to SQLite? Could we perhaps use something like DuckDB which acts very similarly to PostgreSQL, supports in-memory databases, and should be fit for high performance use? I've encountered some of the roadblocks you were describing in SQLite and it is not a fault of FunSQL here but rather SQLite being strange in this regard. Here's the DuckDB.jl package and info about constraints.

This is just my thought on SQLite as we could switch to something else that "just does" what we want for a proof of concept. Or is SQLite on your side a design requirement?


slwu89 commented Dec 4, 2024

Hey @TheCedarPrince, thanks!

Is there any particular reason why we are sticking to SQLite?

It's not a hard design requirement. I just started with it out of previous familiarity (though, obviously, not that familiar, as I forgot about all these footguns). It would be nice to support it nonetheless given how widely it's used (and for AlgebraicJulia/AlgebraicRelations.jl#33). But yes, for a proof of concept it may be better to switch to something more "fully featured". I'll check out DuckDB!

I've encountered some of the roadblocks you were describing in SQLite and it is not a fault of FunSQL here but rather SQLite being strange in this regard.

Yes, that makes sense. I haven't used FunSQL yet in my investigations as I think for this first feature addition I'll just be looking at writing an acset to an SQL db of some sort and reading in from a db a schema and an acset. I think I'll mostly rely on DBInterface to talk to whatever backend is used as the proof of concept.

@TheCedarPrince

Here is an example for creating a connection to DuckDB.jl:

using DBInterface
using DuckDB

db = DBInterface.connect(DuckDB.DB, ":memory:")

Let me know what else you need!

@TheCedarPrince

OH! Forgot to mention: if you want to write this to a file (like database.duckdb), all you actually need is DBInterface.connect(DuckDB.DB, "database.duckdb") and you should be fine.

What's neat is that, starting (if I recall) with v0.10.x, DuckDB database files are backwards compatible across DuckDB versions. Generated duckdb files should also be accessible from the DuckDB CLI.


slwu89 commented Dec 9, 2024

@TheCedarPrince testing out DuckDB a bit! Unfortunately it also raises the "Adding columns with constraints not yet supported" error. It's a problem because, in general, there's nothing stopping someone from making an acset whose schema gives each object a hom to some other object, so there isn't a general way to "start with the tables that don't point to anything" and work backwards.

using DBInterface
using DuckDB
using DataFrames
using Catlab

# create a new in-memory database
con = DBInterface.connect(DuckDB.DB, ":memory:")

julia_to_sql_type(::Type{T}) where {T} = error("type $(T) not supported for SQL conversion")
julia_to_sql_type(::Type{T}) where {T<:Real} = "REAL"
julia_to_sql_type(::Type{T}) where {T<:AbstractString} = "TEXT"
julia_to_sql_type(::Type{T}) where {T<:Symbol} = "TEXT"
julia_to_sql_type(::Type{T}) where {T<:Integer} = "INTEGER"

@present SchWeightedLabeledGraph <: SchLabeledGraph begin
    Weight::AttrType
    weight::Attr(E,Weight)
end

# to_graphviz(SchWeightedLabeledGraph)

@acset_type WeightedLabeledGraph(SchWeightedLabeledGraph, index=[:src, :tgt]) <: AbstractLabeledGraph

g = erdos_renyi(WeightedLabeledGraph{Symbol,Float64}, 10, 0.25)
g[:, :label] = Symbol.(collect('a':'z')[1:nv(g)])
g[:, :weight] = floor.(rand(ne(g)) .* 100)

# first make tables of each object and its attributes
for o in objects(acset_schema(g))
    o_attrs = attrs(acset_schema(g); from=o)
    stmt = Vector{String}()
    push!(stmt, "CREATE OR REPLACE TABLE $(o)(_id INTEGER PRIMARY KEY")
    for (col, _, type) in o_attrs
        push!(stmt, ", $(col) $(julia_to_sql_type(subpart_type(g, type)))")
    end
    push!(stmt, ");")
    
    DBInterface.execute(con, join(stmt))
end

# second add all homs (REFERENCE)
for (h, h_dom, h_codom) in homs(acset_schema(g))
    DBInterface.execute(con, """
        ALTER TABLE $(h_dom) ADD $(h) INTEGER REFERENCES $(h_codom)(_id);
    """)
end

To be clear, the issue is that it's possible to have an acset like the following, for which it seems quite difficult to instantiate a database.

@present SillySch(FreeSchema) begin
    (X,Y)::Ob
    x::Hom(X,Y)
    y::Hom(Y,X)
end

@acset_type SillyType(SillySch)

x = @acset SillyType begin
    X=2
    Y=3
    x=[1,2]
    y=[1,1,2]
end
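To spell out why this schema is hard: "create referenced tables first" amounts to topologically sorting the objects by their homs, and SillySch's hom graph is cyclic, so no valid creation order exists. A self-contained sketch of that check, with the schema passed as plain (hom, dom, codom) triples rather than a Catlab presentation:

```julia
# Return an order in which tables can be created so every referenced table
# already exists, or `nothing` if the hom graph has a cycle (Kahn's algorithm).
function creation_order(obs::Vector{Symbol}, homs::Vector{Tuple{Symbol,Symbol,Symbol}})
    indeg = Dict(o => 0 for o in obs)
    out = Dict(o => Symbol[] for o in obs)
    for (_, d, c) in homs
        d == c && continue  # a self-hom would need a deferred FK anyway; skip it here
        indeg[d] += 1       # table d carries a FK into c, so c must be created first
        push!(out[c], d)
    end
    order = Symbol[]
    queue = [o for o in obs if indeg[o] == 0]
    while !isempty(queue)
        o = popfirst!(queue)
        push!(order, o)
        for n in out[o]
            indeg[n] -= 1
            indeg[n] == 0 && push!(queue, n)
        end
    end
    length(order) == length(obs) ? order : nothing  # nothing => cycle, no valid order
end
```

For the graph schema this yields `[:V, :E]`, while for SillySch's X/Y cycle it yields `nothing`, matching the problem described above.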


slwu89 commented Dec 10, 2024

Update:

Unsure where to go from here.

@TheCedarPrince

Hi @slwu89 ,

Picking back up on this here! Let me share some thoughts:

Executing multiple statements in one execute or executemultiple call does not work in general, regardless of database backend (see discourse.julialang.org/t/how-to-execute-sql-file/123675); however, the DuckDB-specific function DuckDB.query will do it.
Interestingly, @TheCedarPrince discovered this several years ago: JuliaDatabases/SQLite.jl#303

As you saw in that issue, the workaround I adopted was what was suggested: split a SQL string on statements (i.e. ;'s) and execute each statement one at a time.
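That workaround can be sketched as a small helper. The splitting is naive, so it would break on string literals containing `;`, which is fine for generated DDL:

```julia
# Split a multi-statement SQL string on semicolons and run each statement
# through `f` (e.g. s -> DBInterface.execute(db, s)) one at a time.
function execute_statements(f::Function, sql::AbstractString)
    for stmt in split(sql, ';')
        s = strip(stmt)
        isempty(s) && continue
        f(string(s, ';'))
    end
end
```

Usage against a live connection would look like `execute_statements(s -> DBInterface.execute(db, s), ddl)`.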

It seems that neither SQLite nor DuckDB currently supports using ALTER TABLE to add a foreign key. However, there may be work going on in DuckDB to allow this somewhat soon (duckdb/duckdb#57). Interestingly enough, PostgreSQL does allow this; see their docs: postgresql.org/docs/current/sql-altertable.html

OK, as painful as this might be, it seems like we should probably just develop an MVP in PostgreSQL. My apologies, I totally thought DuckDB had this capability. I must've confused it with PostgreSQL.
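For reference, the PostgreSQL flavor of the earlier two-phase approach would be to add the column and then attach a genuine table-level foreign key with ADD CONSTRAINT. A sketch that just builds the SQL string (the `<table>_<column>_fkey` naming convention is an assumption, not something PostgreSQL requires):

```julia
# Build PostgreSQL DDL that adds a column and then a proper FOREIGN KEY
# constraint pointing at another table's key column.
add_fk_sql(tbl, col, ref_tbl, ref_col) = """
    ALTER TABLE $tbl ADD COLUMN $col INTEGER;
    ALTER TABLE $tbl ADD CONSTRAINT $(tbl)_$(col)_fkey
        FOREIGN KEY ($col) REFERENCES $ref_tbl ($ref_col);
    """
```

Each of the two statements would then be run separately (e.g. via DBInterface.execute on a LibPQ connection), given the multi-statement limitation discussed above.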

What can I do to help here @slwu89 ? I have more mental bandwidth now to contribute more wholly to discussions here. :)


TheCedarPrince commented Jan 15, 2025

OK, I redid this in PostgreSQL as follows:

using DataFrames
using DBInterface
using Catlab
using LibPQ

# create a new in-memory database
conn = DBInterface.connect(LibPQ.Connection, "user=thecedarprince dbname=acsets")

julia_to_sql_type(::Type{T}) where {T} = error("type $(T) not supported for SQL conversion")
julia_to_sql_type(::Type{T}) where {T<:Real} = "REAL"
julia_to_sql_type(::Type{T}) where {T<:AbstractString} = "TEXT"
julia_to_sql_type(::Type{T}) where {T<:Symbol} = "TEXT"
julia_to_sql_type(::Type{T}) where {T<:Integer} = "INTEGER"

@present SchWeightedLabeledGraph <: SchLabeledGraph begin
    Weight::AttrType
    weight::Attr(E,Weight)
end

# to_graphviz(SchWeightedLabeledGraph)

@acset_type WeightedLabeledGraph(SchWeightedLabeledGraph, index=[:src, :tgt]) <: AbstractLabeledGraph

g = erdos_renyi(WeightedLabeledGraph{Symbol,Float64}, 10, 0.25)
g[:, :label] = Symbol.(collect('a':'z')[1:nv(g)])
g[:, :weight] = floor.(rand(ne(g)) .* 100)

# first make tables of each object and its attributes
for o in objects(acset_schema(g))
    o_attrs = attrs(acset_schema(g); from=o)
    stmt = Vector{String}()
    push!(stmt, "CREATE TABLE $(o)(_id INTEGER PRIMARY KEY")
    for (col, _, type) in o_attrs
        push!(stmt, ", $(col) $(julia_to_sql_type(subpart_type(g, type)))")
    end
    push!(stmt, ");")
    
    DBInterface.execute(conn, join(stmt))
end

# second add all homs (REFERENCE)
for (h, h_dom, h_codom) in homs(acset_schema(g))
    DBInterface.execute(conn, """
        ALTER TABLE $(h_dom) ADD $(h) INTEGER REFERENCES $(h_codom)(_id);
    """)
end

However, for some reason, the tables are ending up empty:

[screenshot: psql output showing the created tables with no rows]

I am going to try investigating this a bit further to see why that is.

EDIT: Oh, I see! We haven't added any data yet!

@TheCedarPrince

OK, I tried experimenting with adding data from an ACSet representation -- haven't gotten it working just yet but this is the gist of it:

DBInterface.execute(conn, "BEGIN;")
LibPQ.load!(
    (label = tables(g)[1].label, ),
    conn,
    """
    INSERT INTO v VALUES (\$1);
    """
)
DBInterface.execute(conn, "COMMIT;")

Basically, I think there are some type issues going on that I need to be more careful about:

[error | LibPQ]: InvalidTextRepresentation: ERROR:  invalid input syntax for type integer: "a"
CONTEXT:  unnamed portal parameter $1 = '...'
ERROR: InvalidTextRepresentation: ERROR:  invalid input syntax for type integer: "a"
CONTEXT:  unnamed portal parameter $1 = '...'

Otherwise, this approach seems more promising, as none of the SQL has errored so far.


Long-term thinking, this could help with the writing to a database task. Next will be figuring out the "reading" from a database task. That'll be a fun one as we can use ACSets as the "loading" language! :D

@TheCedarPrince

OK, I went a little crazy and got this writing working more broadly. It can certainly be generalized, but it works now. Here is the full prototype script:

using DataFrames
using DBInterface
using Catlab
using LibPQ

conn = DBInterface.connect(LibPQ.Connection, "user=thecedarprince dbname=acsets")

julia_to_sql_type(::Type{T}) where {T} = error("type $(T) not supported for SQL conversion")
julia_to_sql_type(::Type{T}) where {T<:Real} = "REAL"
julia_to_sql_type(::Type{T}) where {T<:AbstractString} = "TEXT"
julia_to_sql_type(::Type{T}) where {T<:Symbol} = "TEXT"
julia_to_sql_type(::Type{T}) where {T<:Integer} = "INTEGER"

@present SchWeightedLabeledGraph <: SchLabeledGraph begin
    Weight::AttrType
    weight::Attr(E,Weight)
end

# to_graphviz(SchWeightedLabeledGraph)

@acset_type WeightedLabeledGraph(SchWeightedLabeledGraph, index=[:src, :tgt]) <: AbstractLabeledGraph

g = erdos_renyi(WeightedLabeledGraph{Symbol,Float64}, 10, 0.25)
g[:, :label] = Symbol.(collect('a':'z')[1:nv(g)])
g[:, :weight] = floor.(rand(ne(g)) .* 100)

# first make tables of each object and its attributes
for o in objects(acset_schema(g))
    o_attrs = attrs(acset_schema(g); from=o)
    stmt = Vector{String}()
    push!(stmt, "CREATE TABLE $(o)(_id INTEGER PRIMARY KEY")
    for (col, _, type) in o_attrs
        push!(stmt, ", $(col) $(julia_to_sql_type(subpart_type(g, type)))")
    end
    push!(stmt, ");")
    
    DBInterface.execute(conn, join(stmt))
end

# second add all homs (REFERENCE)
for (h, h_dom, h_codom) in homs(acset_schema(g))
    DBInterface.execute(conn, """
        ALTER TABLE $(h_dom) ADD $(h) INTEGER REFERENCES $(h_codom)(_id);
    """)
end

DBInterface.execute(conn, "BEGIN;")
LibPQ.load!(
    (_id = 1:length(tables(g)[1].label), label = tables(g)[1].label .|> String, ),
    conn,
    """
    INSERT INTO v VALUES (\$1, \$2);
    """
)
DBInterface.execute(conn, "COMMIT;")

DBInterface.execute(conn, "BEGIN;")
LibPQ.load!(
    (_id = 1:length(tables(g)[2].weight), weight = tables(g)[2].weight, src = tables(g)[2].src, tgt = tables(g)[2].tgt),
    conn,
    """
    INSERT INTO e VALUES (\$1, \$2, \$3, \$4);
    """
)
DBInterface.execute(conn, "COMMIT;")

And you will get tables full of ACSet data in your PostgreSQL database, such as this for "v":

_id | label 
-----+-------
   1 | a
   2 | b
   3 | c
   4 | d
   5 | e
   6 | f
   7 | g
   8 | h
   9 | i
  10 | j

And here it is for "e":

 _id | weight | src | tgt 
-----+--------+-----+-----
   1 |     40 |   9 |  10
   2 |      3 |   6 |   9
   3 |     94 |  10 |   1
   4 |      5 |   8 |   5
   5 |     88 |   3 |   9
   6 |     93 |   2 |   6
   7 |     72 |   2 |   5
   8 |     10 |   4 |   3
   9 |     12 |  10 |   7
  10 |     34 |   2 |   9
  11 |     76 |   8 |   1
  12 |     28 |   4 |   7
  13 |     31 |   4 |   1
  14 |     74 |   6 |   5
  15 |     57 |   7 |   9
  16 |     29 |  10 |   6
  17 |      2 |   9 |   5
  18 |     30 |   2 |   1
  19 |      8 |   5 |   2
  20 |      4 |   5 |   4
  21 |     67 |   8 |   3

All the constraints are satisfied as well which is neat! Let me know your thoughts @slwu89 or we can chat later of course.

@jpfairbanks

Thanks guys for looking into this. I agree that this would be a really killer feature for ACSets.jl and would help adoption a lot. Compatibility with pgsql seems fine since it is open source and fairly well supported across the industry. I am also disappointed that we can't just use SQLite or DuckDB, since I was hopeful that this could be an extremely lightweight integration, at least until someone with a serious need for enterprise features came along.


slwu89 commented Feb 6, 2025

@TheCedarPrince this is really great! I also know you mentioned there was probably a better way to do the type "conversion" than my julia_to_sql_type methods. @jpfairbanks and @TheCedarPrince, I'll likely miss the next 2 Catlab dev meetings...I can rejoin the convo late this month.

@TheCedarPrince

Thanks guys for looking into this. I agree that this would be a really killer feature for ACSets.jl and help adoption a lot.

Of course @jpfairbanks! This has been an ongoing question mark and I am eager to see some prototype here based on the discussion Sean and I have been having.

I also know you mentioned there was probably a better way to do the type "conversion" than my julia_to_sql_type methods.

I thought there might be with LibPQ.jl's built-in parsing here: https://github.com/JuliaDatabases/LibPQ.jl/blob/master/src/parsing.jl It seems like we could maybe use those for our parsing purposes? We probably still need to discuss what needs to be covered by parsing here.

I guess the reading of the data should be hopefully straightforward now. I'll take a stab at writing an ACSet "reading" interface now that we have a rough cut "writing" interface. Slowly but surely, we proceed! :D

My dream is still to have pointers into databases so we can have real big ACSets -- but that is still a dream... I'll have some more tinkering to do!

I'll likely miss the next 2 Catlab devs meetings...I can rejoin the convo late this month.

Hope you are doing well and see ya around! :D

@TheCedarPrince

After taking a day to sleep on things, I think I have a potential idea to try. I don't have too much time at the moment (trying to get some submissions and proposals out the door ASAP) but here is what I was envisioning after doing a bit of reading and research:

I have discovered that it seems entirely possible to interface a PostgreSQL instance with a DuckDB database via https://github.com/duckdb/pg_duckdb. My thinking now is that we can have the best of both worlds:

  • The performance of DuckDB
  • Storage and efficiency of DuckDB
  • Complex constraint configurations of PostgreSQL
  • Enforcement of constraints across databases

So, as a result, we might not have to lose out on DuckDB niceties just because it doesn't support the constraints we need. The only thing I don't know yet is whether we can store constraint logic within a PostgreSQL instance, read and write data consistently to DuckDB databases, and how portable such a configuration would be. Like, can we also generate a PostgreSQL dump with the constraint logic, load the dump and plugin into a new PostgreSQL instance, and interface with the DuckDB data storing ACSets? I do not know.

All the questions, none of the answers at the moment.

Wanted to write this thought down before I forgot this idea.
