feat: Add benchmarks API-Bank, APIBench, Nexus #1136
@Wendong-Fan Hi Wendong, the three function-calling benchmarks have been integrated following the pattern of the GAIA benchmark integration. Sorry, the retriever has not yet been integrated into the APIBench benchmark due to time constraints; that can be done by Wednesday at the earliest, but the other parts are ready for review. I also have a few questions regarding the integration:
Thanks!
Thanks @HHHHHejia @harryeqs! Left some comments for APIBank.
ast_database.append(ast_tree)
self._data['ast'] = ast_database

def run(  # type: ignore[override]
I think the current `run` method in `BaseBenchmark` should be refactored. cc @liuxukun2000
Hi Wendong, should this be done in this PR, or shall we open a new issue and PR for the refactoring?
Yeah, we can do it in another PR. Issue created here: #1338
Sorry @harryeqs, my bad, I didn't switch my environment properly.
Thanks @Wendong-Fan for the comprehensive review! I have made some changes; please have a look when possible.
Thanks @harryeqs, and sorry for the late review. Left some comments below.
tree-sitter = "*"
tree-sitter-python = "*"
googletrans-py = "4.0.0"
Where are these dependencies used?
Hi Wendong, googletrans-py is used in the APIs defined for API-Bank, while tree-sitter and tree-sitter-python are used for the evaluation of APIBench.
Hey @harryeqs, I noticed some review comments were marked as resolved, but the updates don’t appear to have been pushed. Did you forget to push the code? Please ensure all comments are fully addressed before marking them as resolved.
Hi Wendong, I am very sorry that the update was delayed due to the hackathon. I've addressed some comments locally but have not pushed yet since I am still adding the tests. I will finish the tests this afternoon and push the code. Sorry for the wait!
Thank you very much for the review, @Wendong-Fan! Sorry for the delay in updating the benchmarks. The tests have been added, but they rely on a number of mocks because downloading the actual datasets is time-consuming and requires extra storage.
From Guohao: in the example we should add tools to the ChatAgent.
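For readers unfamiliar with the mock-based approach mentioned above, here is a minimal sketch of the idea. `TinyBenchmark`, its methods, and `FAKE_ROWS` are hypothetical stand-ins invented for illustration; they are not the classes or fixtures used in this PR.

```python
from unittest.mock import patch


class TinyBenchmark:
    """Hypothetical stand-in for a benchmark class; not CAMEL's API."""

    def download(self):
        # A real benchmark would fetch a large dataset here.
        raise RuntimeError("network access is not allowed in unit tests")

    def load(self):
        self.data = self.download()
        return self

    def accuracy(self, predictions):
        # Fraction of predictions that match the stored answers.
        hits = sum(p == row["answer"] for p, row in zip(predictions, self.data))
        return hits / len(self.data)


# Small in-memory fixture that replaces the real dataset.
FAKE_ROWS = [
    {"question": "q1", "answer": "a"},
    {"question": "q2", "answer": "b"},
]


def run_mocked_benchmark():
    # Swap the expensive download for the fixture for the duration of the test.
    with patch.object(TinyBenchmark, "download", return_value=FAKE_ROWS) as fake:
        bench = TinyBenchmark().load()
        score = bench.accuracy(["a", "x"])
    return score, fake.call_count
```

The trade-off is that such tests exercise the scoring and loading logic only; they do not verify that the real download still works.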
I've added the API-Bank, APIBench, and Nexus benchmarks; for the main methods, see the benchmark tests and the utils folder (benchmark_base.py).
There are some problems still to be solved for the API-Bank, APIBench (Gorilla), and Nexus benchmarks, listed below.
For Nexus:
Run python nexus_test.py and you will hit the following errors:
1. OpenAI limits the size of the functions passed to the function-call API (function name, function description length, number of functions, etc.). Camel needs logic to check this: if OpenAI does not accept the function call, use structured output instead.
2. Critical: there is a while True bug in camel.chatagent.step. When the incoming API call is not executed correctly, the while True loop never terminates. The while True logic should be eliminated; you cannot assume that a function passed by the user will always execute correctly.
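A rough sketch of what the two Nexus points could look like in code, under stated assumptions: the limit constants are assumed values and should be checked against the provider's current documentation, and `ask_model`, `fits_function_call_api`, and `run_tool_loop` are hypothetical names, not Camel functions. The first helper decides whether a tool set is small enough to send as function calls (otherwise the caller could fall back to structured output); the second replaces an unbounded loop with a fixed budget of tool rounds.

```python
import json

# Assumed limits for illustration; verify against the provider's documentation.
MAX_TOOLS = 128
MAX_NAME_LEN = 64
MAX_DESCRIPTION_LEN = 1024
MAX_TOOL_ROUNDS = 5


def fits_function_call_api(schemas):
    """Return True if every schema stays within the assumed limits."""
    if len(schemas) > MAX_TOOLS:
        return False
    return all(
        len(s["name"]) <= MAX_NAME_LEN
        and len(s.get("description", "")) <= MAX_DESCRIPTION_LEN
        for s in schemas
    )


def run_tool_loop(ask_model, tools, max_rounds=MAX_TOOL_ROUNDS):
    """Execute requested tools, but never for more than max_rounds rounds.

    ask_model(history) is a hypothetical callable that returns either
    {"tool": name, "args": {...}} or {"answer": text}.
    """
    history = []
    for _ in range(max_rounds):
        reply = ask_model(history)
        if "answer" in reply:
            return reply["answer"], history
        try:
            result = tools[reply["tool"]](**reply["args"])
        except Exception as exc:
            # A failing or unknown tool is recorded, not retried forever.
            result = f"error: {exc!r}"
        history.append(
            {"tool": reply["tool"], "result": json.dumps(result, default=str)}
        )
    # Budget exhausted: hand control back to the caller instead of looping.
    return None, history
```

Returning `None` when the budget runs out leaves the policy (raise, retry with a different prompt, or report failure) to the caller.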
For APIBench:
There are three datasets: 'torchhub', 'tensorhub', and 'huggingface'. 'torchhub' works well, BUT
3. 'tensorhub' and 'huggingface' cannot be correctly evaluated by the AST matching program. This is a problem in the original repo, and I have already opened an issue: [https://github.com/ShishirPatil/gorilla/issues/729]
It could be a tree_sitter version problem, but if you don't use tree_sitter==0.20.4, you'll hit another bug.
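Since the evaluation is reported to depend on tree_sitter==0.20.4, a small guard could surface a mismatch before the AST matching runs. This is only a sketch: `check_pin` and `installed_version` are hypothetical helpers, and the distribution name "tree-sitter" is taken from the dependency list in this PR.

```python
from importlib import metadata

PINNED_TREE_SITTER = "0.20.4"  # version reported to work in this PR


def installed_version(dist_name="tree-sitter"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pin(installed, pinned=PINNED_TREE_SITTER):
    """Return None if the version matches the pin, else a warning message."""
    if installed is None:
        return "tree_sitter is not installed"
    if installed != pinned:
        return f"tree_sitter {installed} found, but {pinned} is expected"
    return None
```

A caller might run `check_pin(installed_version())` at benchmark start and log or raise on a non-None result.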
For API-Bank:
There are three datasets: 'level1', 'level2', and 'level3', BUT
4. No one knows how to evaluate 'level3'. See these issues in the original repo:
[https://github.com/AlibabaResearch/DAMO-ConvAI/issues/167]
[https://github.com/AlibabaResearch/DAMO-ConvAI/issues/102]
[https://github.com/AlibabaResearch/DAMO-ConvAI/issues/114]
5. API-Bank involves multiple "User-Assistant-System" messages as history records. Camel ChatAgent does not yet support adding multiple rounds of system messages. Temporary solution: use record_message and make_assistant_message instead of system messages.
6. There is a version conflict between the openai package in Camel, Https, and Google Translate in the original repo; see
[https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank#demo]. Camel, Https, and the Google Translate library don't work together.
For now, two approaches work:
- Use the original repo without Camel; Google Translate and Https work well.
- Use Camel and remove Google Translate; it works, but without the Google Translate tool.
See:
[https://github.com/microsoft/TaskWeaver/issues/172]
7. Some datasets need to be hosted on GitHub/HuggingFace. The original authors did not do this, but we do not want to include this data in Camel's GitHub repository.
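To illustrate the temporary workaround in point 5, the sketch below shows one way a "User-Assistant-System" history could be folded into two roles before being recorded. The turn format (`role` and `text` keys), the `[API result]` tag, and `remap_history` are assumptions made for illustration; the actual API-Bank data layout and the Camel calls used in the PR may differ.

```python
# Assumed mapping: "System" turns (API results) are replayed as assistant turns.
ROLE_MAP = {"User": "user", "Assistant": "assistant", "System": "assistant"}


def remap_history(turns):
    """Map User/Assistant/System turns onto roles a two-role memory accepts."""
    messages = []
    for turn in turns:
        content = turn["text"]
        if turn["role"] == "System":
            # Tag the content so the origin of the turn is not lost.
            content = f"[API result] {content}"
        messages.append({"role": ROLE_MAP[turn["role"]], "content": content})
    return messages
```

Tagging the remapped turns keeps API results distinguishable from genuine assistant replies, which matters if the history is later inspected or scored.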