- CSE@University of Michigan
- Ann Arbor, Michigan
- https://superctj.github.io
Stars
Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'
MLOS is a project to enable autotuning for systems.
Metadata and data identification tool and Python library. Identifies PII, common identifiers, and language-specific identifiers. Fully customizable, with flexible rules.
Python library for embedding inference of relational tables.
A tool facilitating matching for any dataset discovery method. Also, an extensible experiment suite for state-of-the-art schema matching methods.
Course notes for CS228: Probabilistic Graphical Models.
SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization
Loopy belief propagation for factor graphs on discrete variables in JAX
Generalized and Efficient Blackbox Optimization System.
Supplementary Material for "LlamaTune: Sample-Efficient DBMS Configuration Tuning"
DuckDB is an analytical in-process SQL database management system
A collection of research materials on SSL for non-sequential tabular data (SSL4NSTD)
⏰ Collaboratively track deadlines of conferences recommended by CCF (Website, Python CLI, WeChat Applet). If you find it useful, please star this project, thanks~
Explains complex systems using visuals and simple terms. Helps you prepare for system design interviews.
Characterization of relational table embeddings (VLDB 2024).
Codebase and data for our paper - Pylon: Semantic Table Union Search in Data Lakes.
Spider join dataset for our paper - WarpGate: A Semantic Join Discovery System for Cloud Data Warehouses (CIDR 2023).
Qdrant - High-performance, massive-scale Vector Database and Vector Search Engine for the next generation of AI. Also available in the cloud https://cloud.qdrant.io/
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
https://csstipendrankings.org
A library for efficient similarity search and clustering of dense vectors.
NextiaJD is a library that supports data discovery based on data profiles and machine learning algorithms to find joinable attributes in heterogeneous datasets
ATTA (Efficient Adversarial Training with Transferable Adversarial Examples)
Tools for training schema-aware Web table embeddings for unsupervised and supervised machine learning on tabular data
D3L dataset discovery framework - an implementation of the ICDE 2020 paper with the same name: https://arxiv.org/pdf/2011.10427.pdf