Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

请教下:评测判断时用instructGPT+prompt和用这些数据finetune分类模型,哪个评测的相关性更高,有对比数据不 #2

Closed
lierer007 opened this issue Apr 25, 2023 · 5 comments

Comments

@lierer007
Copy link

看文章里引用的几篇用LLM做评测的论文,好像都是针对生成比较有优势,像这钟通用领域的安全性判别问题,也会有优势吗

@TissueC
Copy link
Member

TissueC commented Apr 26, 2023

我们其实也是针对模型的生成做安全评判,也会有优势

@TissueC TissueC closed this as completed Apr 26, 2023
@lierer007
Copy link
Author

不好意思,应该是我描述的不太准确,如果是对生成的总体质量评测,因为涉及流畅、事实、一致性类的指标不太好衡量,所以有优势;
但是具体到安全判别的话,可以明确的建模成分类问题,prompt+LLM 还会比 finetune有优势吗?

或者说如果有一个类似perspectiveAPI的判别器,只考虑效果的话,prompt+LLM会更有优势吗?
您有涉及这方面实验的文章介绍吗

@TissueC
Copy link
Member

TissueC commented Apr 27, 2023

安全本身的定义比较模糊复杂,场景多样,所以可能不像普通的分类任务(例如情感极性二分类)那么简单,或者说难以明确地建模为简单的分类问题。而且安全会涉及到一些知识,LLM也会更有优势。

@lierer007
Copy link
Author

嗯嗯确实,学习到了,多谢多谢
q+a平均长度上百,还有一些安全类型明显涉及推理能力,或者提前很难定义清楚,LLM确实有优势

不过看论文,如果没理解错的话,评测是在定义好的13种安全类型上分别做二分类
那么即便对于判断”脏话侮辱“这个相对比较清晰的类型,LLM也会更有优势吗

@TissueC
Copy link
Member

TissueC commented Apr 27, 2023

关于这一点我们有在做更细致的实验,可以关注我们的未来的工作

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants