forked from cmu-db/dbgym
**Summary**: Redesigned logging across the whole project. Also fixed some bugs to pass the CI.

**Demo**: ![Screenshot 2024-10-20 at 12 37 33](https://github.com/user-attachments/assets/6052f7e0-74cf-4593-be56-c57b7a636b49)

The log files are archived in `run_*/dbgym/artifacts/`. There are logs from dbgym and from third-party libraries. You can see that all logs, even info logs, are captured in `dbgym.log`.

**Logging Design**:
* I was having difficulty debugging because the console was too cluttered, which motivated me to redesign logging.
* I removed all class-name loggers. They all behaved the same (at least from what I could tell), so it's simpler to make everything use a single shared logger.
* We use the loggers `dbgym`, `dbgym.output`, `dbgym.replay`, and the root logger.
  * `dbgym` is the "base" logger and should be used most of the time. It outputs errors to the console and all logs to the file `run_*/dbgym/artifacts/dbgym.log`.
  * `dbgym.output` is used when you actually want to show something to the user. It outputs the message straight to the console without any extra metadata. As a child of `dbgym`, anything logged here is also propagated to `dbgym` and thus archived in `dbgym.log`.
  * `dbgym.replay` is specific to Proto-X and is where Proto-X stores log information only relevant to replay. Making it its own logger insulates it from changes to the main logging system.
  * The root logger is used to help debug unit tests. Unit tests are isolated from the main logging system for simplicity. See `test_clean.py` for an example.
* Certain third-party loggers like `ray` are redirected to a file to reduce console clutter.
  * I kept the Ray dashboard in the console, though, because it's pretty useful.
* `print()` is reserved for actual debugging.
* I redirected `warnings` to a separate file as well to further reduce clutter.
* I added special handling to eliminate the warnings that appear every time tensorflow is imported (see `task.py` for an example).

**Other Details**:
* Upgraded nccl to version 2.20.* in requirements.txt to fix an import error.
* Embedding datagen was not working. I added additional unit tests to help me debug this.
* Made workload_tests.py more robust by checking fields other than class mapping. This is done by saving reference `Workload` and `IndexSpace` objects as `pkl` files.
* Verified that replay still works (since it relies on log files).
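The logger layout described above can be sketched with the standard `logging` module. This is a minimal, hypothetical configuration, not the project's actual `util/log.py`: the `DBGYM_LOGGER_NAME` constant, the `set_up_loggers` helper, the handler levels, and the formatter strings are all assumptions for illustration.

```python
# Hypothetical sketch of the "dbgym" / "dbgym.output" logger split plus
# warnings redirection. Names and levels are assumptions, not the repo's code.
import logging
import warnings
from pathlib import Path

DBGYM_LOGGER_NAME = "dbgym"
DBGYM_OUTPUT_LOGGER_NAME = "dbgym.output"


def set_up_loggers(artifacts_dpath: Path) -> None:
    artifacts_dpath.mkdir(parents=True, exist_ok=True)

    # "dbgym": errors go to the console, everything goes to dbgym.log.
    dbgym_logger = logging.getLogger(DBGYM_LOGGER_NAME)
    dbgym_logger.setLevel(logging.DEBUG)
    console_handler = logging.StreamHandler()
    console_handler.setLevel(logging.ERROR)
    console_handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    file_handler = logging.FileHandler(artifacts_dpath / "dbgym.log")
    file_handler.setLevel(logging.DEBUG)
    dbgym_logger.addHandler(console_handler)
    dbgym_logger.addHandler(file_handler)

    # "dbgym.output": bare messages straight to the console. Propagation is
    # left on, so these records also reach "dbgym" and land in dbgym.log.
    output_logger = logging.getLogger(DBGYM_OUTPUT_LOGGER_NAME)
    output_logger.setLevel(logging.DEBUG)
    output_handler = logging.StreamHandler()
    output_handler.setFormatter(logging.Formatter("%(message)s"))
    output_logger.addHandler(output_handler)

    # Route Python warnings to their own file instead of the console.
    py_warnings_logger = logging.getLogger("py.warnings")
    py_warnings_logger.addHandler(logging.FileHandler(artifacts_dpath / "warnings.log"))
    py_warnings_logger.propagate = False
    logging.captureWarnings(True)
```

The same pattern extends to third-party loggers: attaching a `FileHandler` to, say, `logging.getLogger("ray")` and disabling propagation keeps its output out of the console.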
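The reference-object approach mentioned for `workload_tests.py` can be sketched as follows. The `Workload` stand-in and the `save_reference` / `check_against_reference` helpers are hypothetical, not the repo's actual classes or test utilities; the idea is just that a known-good object is pickled once and later runs compare every field against it rather than only the class mapping.

```python
# Hypothetical sketch of pkl-based reference objects for regression tests.
import pickle
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Workload:
    # Stand-in for the real Workload class; fields are illustrative only.
    queries: dict[str, str]
    class_mapping: dict[int, str]


def save_reference(obj: object, ref_fpath: Path) -> None:
    # Serialize the known-good object once and commit the .pkl file.
    with ref_fpath.open("wb") as f:
        pickle.dump(obj, f)


def check_against_reference(obj: Workload, ref_fpath: Path) -> None:
    # Dataclass equality compares every field, not just class_mapping.
    with ref_fpath.open("rb") as f:
        ref = pickle.load(f)
    assert obj == ref, f"object diverged from reference: {obj} != {ref}"
```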
1 parent ac849f8 · commit 9e652d1 · 178 changed files with 679 additions and 1,689 deletions.
`@@ -0,0 +1,27 @@` (new file)

```bash
#!/bin/bash

set -euxo pipefail

SCALE_FACTOR=1
INTENDED_DBDATA_HARDWARE=ssd
. ./experiments/load_per_machine_envvars.sh

# Space for testing. Uncomment this to run individual commands from the script (copy-pasting is harder because there are envvars).
python3 task.py tune protox agent hpo tpch --scale-factor $SCALE_FACTOR --max-concurrent 4 --tune-duration-during-hpo 1 --intended-dbdata-hardware $INTENDED_DBDATA_HARDWARE --dbdata-parent-dpath $DBDATA_PARENT_DPATH --build-space-good-for-boot
exit 0

# benchmark
python3 task.py benchmark tpch data $SCALE_FACTOR
python3 task.py benchmark tpch workload --scale-factor $SCALE_FACTOR

# postgres
python3 task.py dbms postgres build
python3 task.py dbms postgres dbdata tpch --scale-factor $SCALE_FACTOR --intended-dbdata-hardware $INTENDED_DBDATA_HARDWARE --dbdata-parent-dpath $DBDATA_PARENT_DPATH

# embedding
python3 task.py tune protox embedding datagen tpch --scale-factor $SCALE_FACTOR --override-sample-limits "lineitem,32768" --intended-dbdata-hardware $INTENDED_DBDATA_HARDWARE --dbdata-parent-dpath $DBDATA_PARENT_DPATH
python3 task.py tune protox embedding train tpch --scale-factor $SCALE_FACTOR --train-max-concurrent 10

# agent
python3 task.py tune protox agent hpo tpch --scale-factor $SCALE_FACTOR --max-concurrent 4 --tune-duration-during-hpo 4 --intended-dbdata-hardware $INTENDED_DBDATA_HARDWARE --dbdata-parent-dpath $DBDATA_PARENT_DPATH --build-space-good-for-boot
python3 task.py tune protox agent tune tpch --scale-factor $SCALE_FACTOR
```
`@@ -0,0 +1,2 @@` (new file)

```bash
#!/bin/bash
mypy --config-file scripts/mypy.ini .
```
`@@ -1,21 +1,24 @@` (the file after the change: `read_and_print_parquet` renamed to `read_and_output_parquet`, with `print()` replaced by the `dbgym.output` logger)

```python
import logging
import sys
from pathlib import Path

import pandas as pd

from util.log import DBGYM_OUTPUT_LOGGER_NAME


def read_and_output_parquet(file_path: Path) -> None:
    # Read the Parquet file into a DataFrame
    df = pd.read_parquet(file_path)

    # Output the DataFrame
    logging.getLogger(DBGYM_OUTPUT_LOGGER_NAME).info("DataFrame:")
    logging.getLogger(DBGYM_OUTPUT_LOGGER_NAME).info(df)


if __name__ == "__main__":
    # Specify the path to the Parquet file
    parquet_file_path = Path(sys.argv[0])

    # Call the function to read and output the Parquet file
    read_and_output_parquet(parquet_file_path)
```