Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add benchmarks API-Bank, APIBench, Nexus #1136

Merged
merged 48 commits into from
Dec 29, 2024

Conversation

HHHHHejia
Copy link
Collaborator

@HHHHHejia HHHHHejia commented Oct 30, 2024

I've added the APIbank APIbench and Nexus benchmark, main method see benchmark test and utils folder (benchmark_base.py)

There're some problem to be solved for the APIBank, APIbench(gorilla) and Nexus benchmark. listed as below.

For Nexus:
run python nexus_test.py. You'll get error
1.OpenAI limits the size of the function passed into the function call api (function name, function description length, number of functions, etc.). You need to add judgment logic in Camel. If OpenAI does not allow function call, use structure output instead.

2.Critical: while true bug in camel.chatagent.step. When the incoming api is not executed correctly, while true will not terminate.The while true logic should be eliminated. You cannot assume that the function passed by the user will always be executed correctly.

For APIbench
There're three datasets 'torchhub', 'tensorhub', 'huggingface’ . "torchhub"works well. BUT
3.'tensorhub', 'huggingface’ could not be correctly evaluted by the ast matching program. This is a problem within the original repo. I have already proposed an issue. [(https://github.com/ShishirPatil/gorilla/issues/729)]

It could be version problem of tree_sitter, but if you don't use tree_sitter==0.20.4, you'll get an another bug.

For APIbank
There're three datasets 'level1', 'level2', 'level3’ . BUT
4.NO ONE knows how to eveluate 'level3'. See the issue in original repo:
[https://github.com/AlibabaResearch/DAMO-ConvAI/issues/167]
[https://github.com/AlibabaResearch/DAMO-ConvAI/issues/102]
[https://github.com/AlibabaResearch/DAMO-ConvAI/issues/114]

5.APIbank involves multiple "User-Assistant-System" messages as History Records. Camel ChatAgent does not support adding multiple rounds of system messages yet. Temporary solution: Use record_message and make_assistant_message instead of system messages.

6.The version conflict between openai in camel, Https, and Google translate in original repo, see
[https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank#demo]. Camel, Https and Google translate lib doesn't work together.
For now two way works:
-use original repo without camel, Google translate and Https works well.
-use camel, remove Google translate, it works but without Google translate tool.
See:
[https://github.com/microsoft/TaskWeaver/issues/172]

7.Some datasets need to be hosted on GitHub/HuggingFace. The original author did not do this, but we do not want to include these data in Camel's GitHub.

@harryeqs harryeqs self-assigned this Nov 25, 2024
@HHHHHejia HHHHHejia changed the title add benchmark gorilla, nexus add benchmark apibank, gorilla, nexus Nov 26, 2024
@Wendong-Fan Wendong-Fan marked this pull request as draft November 30, 2024 20:34
@harryeqs harryeqs marked this pull request as ready for review December 9, 2024 02:20
@harryeqs
Copy link
Collaborator

harryeqs commented Dec 9, 2024

@Wendong-Fan Hi Wendong, the three functional calling benchmarks have been integrated following the pattern of the GAIA benchmark integration. Sorry that the retriever has not yet been integrated into the APIBench benchmark due to time constraints and that can be done by Wednesday at the earliest, but the other parts are ready for review.

I also have a few questions regarding the integration:

  1. The code of all three benchmarks is under Apache 2.0 license, are there any patterns to follow when referencing the code I copied/adapted?
  2. What should be included in the unit test for these three benchmarks? I feel like the GAIA benchmark test is hard to reference for these benchmarks.
  3. The level 3 dataset of APIBank still can't run since there haven't been any updates from the authors on how to use the evaluator.
  4. How do you run the pre-commit test when committing to this fork?

Thanks!

@harryeqs harryeqs requested a review from Wendong-Fan December 9, 2024 02:41
@harryeqs harryeqs changed the title add benchmark apibank, gorilla, nexus feat: Add benchmarks API-Bank, APIBench, Nexus Dec 9, 2024
Copy link
Member

@Wendong-Fan Wendong-Fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HHHHHejia @harryeqs ! Left some comments for APIBank

camel/benchmarks/apibank.py Outdated Show resolved Hide resolved
camel/benchmarks/apibank.py Outdated Show resolved Hide resolved
camel/benchmarks/apibench.py Outdated Show resolved Hide resolved
ast_database.append(ast_tree)
self._data['ast'] = ast_database

def run( # type: ignore[override]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current run method in BaseBenchmark should be refactored. cc @liuxukun2000

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Wendong should this be done in this PR, or shall we set up a new issue and PR for the refactoring?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah we can do it in another pr, issue created here:#1338

camel/benchmarks/apibench.py Outdated Show resolved Hide resolved
camel/benchmarks/apibank.py Outdated Show resolved Hide resolved
camel/benchmarks/apibank.py Outdated Show resolved Hide resolved
camel/benchmarks/apibank.py Outdated Show resolved Hide resolved
camel/benchmarks/apibank.py Outdated Show resolved Hide resolved
camel/benchmarks/apibank.py Show resolved Hide resolved
@Wendong-Fan
Copy link
Member

some errors with running pre-commit run --all-files need to be fixed

camel/benchmarks/utils/ast_eval.py:28: error: Cannot find implementation or library stub for module named "tree_sitter_python"  [import-not-found]
camel/benchmarks/utils/ast_eval.py:28: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
camel/benchmarks/utils/ast_eval.py:29: error: Library stubs not installed for "tree_sitter"  [import-untyped]

Hi Wendong sorry I tried but could not reproduce this error. Could you please check if the tree-sitter-python package has been successfully installed in your venv after running poetry install --with dev,docs -E all. This package was added to the poetry lock and its stub check should have been disabled so there shouldn't be any errors?

Sorry @harryeqs , my bad, I didn't switch my environment properly

@harryeqs
Copy link
Collaborator

Thanks @HHHHHejia @harryeqs ! Left some comments for APIBank

Thanks @Wendong-Fan for the comprehensive review! I have made some changes and please have a look when possible.
Thanks!

Copy link
Member

@Wendong-Fan Wendong-Fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @harryeqs and sorry for the late review, left some comments below

camel/benchmarks/apibench.py Outdated Show resolved Hide resolved
camel/benchmarks/utils/ast_eval.py Outdated Show resolved Hide resolved
camel/benchmarks/apibench.py Outdated Show resolved Hide resolved
camel/benchmarks/apibench.py Outdated Show resolved Hide resolved
camel/benchmarks/nexus.py Outdated Show resolved Hide resolved
camel/benchmarks/nexus.py Outdated Show resolved Hide resolved
test/benchmarks/test_apibank_benchmark.py Outdated Show resolved Hide resolved
test/benchmarks/test_apibench_benchmark.py Outdated Show resolved Hide resolved
test/benchmarks/test_nexus_benchmark.py Outdated Show resolved Hide resolved
Comment on lines +46 to +48
tree-sitter = "*"
tree-sitter-python = "*"
googletrans-py = "4.0.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where did these dependencies used?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Wendong, the googletrans-py is used in the APIs defined for the API-Bank, while the tree-sitter and tree-sitter-python are used for evaluation of APIBench.

@Wendong-Fan
Copy link
Member

hey @harryeqs , I noticed some review comments were marked as resolved, but the updates don’t appear to have been pushed. Did you forget to push the code? Please ensure all comments are fully addressed before marking them as resolved.

@harryeqs
Copy link
Collaborator

hey @harryeqs , I noticed some review comments were marked as resolved, but the updates don’t appear to have been pushed. Did you forget to push the code? Please ensure all comments are fully addressed before marking them as resolved.

Hi Wendong I am very sorry that the update was delayed due to the hackathon. I've addressed some comments locally but have not pushed since I am still adding the tests. I will finish the tests asap this afternoon and push the code. Sorry for the wait!

@harryeqs
Copy link
Collaborator

thanks @harryeqs and sorry for the late review, left some comments below

Thank you @Wendong-Fan very much for the review! Sorry for the delay in updating the benchmarks. The tests have been added but they are based on a number of mocks as downloading the actual datasets is time-consuming and requires extra storage.

camel/benchmarks/apibench.py Outdated Show resolved Hide resolved
@Wendong-Fan Wendong-Fan enabled auto-merge (squash) December 29, 2024 08:56
@Wendong-Fan Wendong-Fan disabled auto-merge December 29, 2024 08:56
@Wendong-Fan Wendong-Fan merged commit 926596e into camel-ai:master Dec 29, 2024
0 of 6 checks passed
@Wendong-Fan
Copy link
Member

from Guohao: in the example we should add tools to the ChatAgent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

5 participants