Can we consider using Llama to solve this issue? #7

Open
ZhouNan2020 opened this issue Sep 27, 2024 · 5 comments
Comments

@ZhouNan2020

I think we can use zero-shot prompting to clean and normalize the free-text data in FAERS. Open-source LLMs such as Qwen 2.5 or Llama 3.2 may be useful for this project.
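
A minimal sketch of what such a zero-shot prompt could look like in Python. The prompt wording, the `UNKNOWN` sentinel, the focus on drug names, and all function names are assumptions made for illustration. `llm` stands for any function that sends a prompt to a model and returns its text reply; no specific Qwen or Llama API is implied.

```python
from typing import Callable, Optional

# Hypothetical prompt wording; it would need tuning against real FAERS entries.
PROMPT_TEMPLATE = (
    "You are cleaning drug names from FAERS free-text fields.\n"
    "Return only the standardized active ingredient in lowercase, "
    "or UNKNOWN if you cannot tell.\n"
    "Raw entry: {raw}\n"
    "Standardized name:"
)


def build_zero_shot_prompt(raw: str) -> str:
    """Fill the template with one raw entry (no examples, hence zero-shot)."""
    return PROMPT_TEMPLATE.format(raw=raw.strip())


def parse_answer(text: str) -> Optional[str]:
    """Keep the first line of the reply; map empty or UNKNOWN to None."""
    lines = text.strip().splitlines()
    answer = lines[0].strip().lower() if lines else ""
    return None if answer in ("", "unknown") else answer


def normalize(raw: str, llm: Callable[[str], str]) -> Optional[str]:
    """Send one raw entry through the model and return the cleaned name."""
    return parse_answer(llm(build_zero_shot_prompt(raw)))
```

Passing the model as a plain callable keeps the prompt logic testable with a stub before any real API is wired in.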

@fusarolimichele
Owner

@ZhouNan2020 Great idea! I imagine we would still need to validate some of the results to check that the performance is good enough, but if it is reliable we could achieve automatic translation. Do you have an idea of how to try it?

@ZhouNan2020
Author

First, I've used LLM playgrounds to clean some disease names, and asked GPT or Claude to convert these disease names to ICD-10 codes; the results seem to be good.

Second, if we want to plug this into DiAna, we should look into Llama's API instead of using playgrounds.

Third, I see that DiAna currently uses mainly JS and R, but connecting to LLM APIs would likely require more Python.

Finally, as you said, we need to verify the reliability of the results. I don't think we have to guarantee that every result is correct, as long as a reasonable verification process shows that reliability stays within an acceptable range. For example, if we randomly sample the zero-shot results and cross-validate them, and the AUC of the LLM's performance reaches 0.8, I think that will be fine. What we are looking for is a balance between human effort and accuracy, not complete accuracy, because even manual screening and cleaning does not necessarily give accurate results.
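
The random-sample check could be sketched as follows in Python. Note the swap: AUC needs a confidence score per prediction, so this sketch uses plain accuracy on a manually reviewed sample, with a Wilson score interval, as a simpler stand-in. The function names, the seed, and the sample size are assumptions for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple


def sample_for_review(items: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Draw a reproducible random sample of LLM outputs for manual checking."""
    rng = random.Random(seed)
    return rng.sample(list(items), min(k, len(items)))


def accuracy_with_wilson_ci(
    n_correct: int, n_total: int, z: float = 1.96
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using the Wilson score interval."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    p = n_correct / n_total
    denom = 1 + z**2 / n_total
    centre = (p + z**2 / (2 * n_total)) / denom
    half = z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return p, centre - half, centre + half
```

A reviewer would mark each sampled output as correct or not, then pass the counts to `accuracy_with_wilson_ci`; the lower bound, rather than the point estimate, is the safer number to compare against whatever threshold the project agrees on.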

@fusarolimichele
Owner

Great work! If the mapping from MedDRA to ICD-10 is of interest to you, also note that there is a human-generated validated mapping between MedDRA and ICD-10 that you could find useful.

For the non-translated drug names, it would definitely be worth a try, even just to provide a first automatic suggestion for a translation that can then be validated as needed. Validating and compiling the drug name translation row by row is a really big effort that we have to repeat at every quarterly update, and it would be great to have automatic support (at the moment we just use the already validated dictionary together with some fuzzy techniques based on Levenshtein distance and string editing to precompile translations to be validated)!
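
For readers unfamiliar with the dictionary-plus-fuzzy-matching idea, here is a rough Python sketch of the general technique. It is not the DiAna implementation: the dictionary layout (raw name mapped to validated name), the `max_distance` threshold, and the function names are all assumptions for illustration.

```python
from typing import Dict, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via two-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def suggest(
    raw: str, dictionary: Dict[str, str], max_distance: int = 2
) -> Optional[Tuple[str, int]]:
    """Return (validated name, distance) for the closest key, or None if too far."""
    key = raw.strip().lower()
    if key in dictionary:
        return dictionary[key], 0
    best = min(dictionary, key=lambda k: levenshtein(key, k), default=None)
    if best is None:
        return None
    distance = levenshtein(key, best)
    return (dictionary[best], distance) if distance <= max_distance else None
```

Entries for which `suggest` returns `None` are the ones where an LLM suggestion could be tried, with the result still going to manual validation.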

For this month we will be very busy, but if you are interested in trying to implement your idea we can discuss it next month. :)

@ZhouNan2020
Author

ZhouNan2020 commented Oct 2, 2024 via email

@fusarolimichele
Owner

Understood! Same situation ;) In case you have not seen it: if you need an already cleaned version of FAERS for your thesis, including drugs and events, https://github.com/fusarolimichele/DiAna_package hosts an R package that lets you download the cleaned version with just one command! Good luck with your thesis, and feel free to reach out for anything!
