Max length #113
-
Hello, the default max length (296) is pretty low for my usage. If I understand correctly, it can be modified in the YAML config files (not sure which one, though), but are there any recommendations / limits / bad practices around that? Should I try to split my text into smaller chunks instead? Thank you for your work and help.
-
Yes, you can modify the max in the config file and it should work. But as it is an out-of-domain scenario, I recommend chunking the text into parts.
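A hedged sketch of what that override might look like at load time (the `max_len` attribute is assumed from GLiNER's training configs and may differ across versions, so verify it against your installed release):

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")

# `max_len` is an assumption based on GLiNER's training configs;
# check the attribute name in your installed version before relying on it.
print(model.config.max_len)

# Raising it beyond the training length is out-of-domain for the
# pretrained weights, so chunking (see below) is usually the safer route.
model.config.max_len = 512
```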
-
Hi @urchade, @reouvenzana, I am using the model mediumv2.5. To change the default max length, which files do I have to modify? And do I have to clone the whole repository, or can I just download those files, make the changes, and run the fine-tuning training? Please help, it's urgent! Thanks & Regards
-
Here is my quick-and-dirty take at a 300-token splitter:

```python
import fitz  # install with: pip install PyMuPDF
import tqdm

# Extract the raw text of the PDF, page by page.
with fitz.open('ORANO-MAG-2021_205x275_FR_MEL.pdf') as doc:
    text = ""
    for page in doc:
        text += page.get_text()

from gliner import GLiNER

# model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")
model.to('cuda')

# Collapse newlines and double spaces into single spaces.
text = text.replace('\n', ' ').replace('  ', ' ')

labels = ["Facility", "Geo-Political Entity", "Location", "Organization",
          "Person", "Vehicle", "Weapon", "Date", "Event"]

"""
We can't simply tokenize the whole text at once, chunk it (with overlap),
and run inference after detokenizing, because we would "lose" the character
index (cursor offset) if we don't process the text sequentially.
"""
from transformers import AutoTokenizer

tokenizer_model_name = 'microsoft/deberta-v3-large'
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_name)

MAX_WINDOW_SIZE = 300  # tokens per chunk
OVERLAP = 50           # characters of overlap between consecutive chunks

def cap_text_tokens_count(text: str, token_count: int = MAX_WINDOW_SIZE) -> str:
    """Return the longest prefix of `text` that fits in `token_count` tokens."""
    # Conservative characters-per-token bound (empirically ~1/0.28 to 1/0.5;
    # 5.0 takes no risk) so we never tokenize a huge tail for nothing.
    token_scale_down_ratio = 5.0
    if len(text) > token_count * token_scale_down_ratio:
        text = text[:int(token_count * token_scale_down_ratio)]
    tokenized = tokenizer(text, add_special_tokens=False)
    detokenized = tokenizer.decode(tokenized['input_ids'][:token_count],
                                   skip_special_tokens=True)
    # The decoded length approximates the character span of the kept tokens.
    return text[:len(detokenized)]

# Walk through the text sequentially, recording each chunk's character offset.
cursor_index = 0
chunks = []
while cursor_index < len(text) - OVERLAP:
    window_text = cap_text_tokens_count(text[cursor_index:])
    # TODO: normalize window_text into readable text
    chunks.append({"text": window_text, "cursor_index": cursor_index})
    if len(window_text) <= OVERLAP:
        print('breaking ...')
        break
    cursor_index += len(window_text) - OVERLAP

entities_store = {}
for chunk in tqdm.tqdm(chunks):
    window_text = chunk["text"]
    cursor_index = chunk["cursor_index"]
    try:
        entities = model.predict_entities(window_text, labels)
    except Exception:
        continue  # skip chunks that fail inference
    for entity in entities:
        # Shift the spans from chunk-local to document-global offsets.
        entity['start'] += cursor_index
        entity['end'] += cursor_index
        # Keep only the first occurrence of each (case-insensitive) surface form.
        if entity['text'].lower() not in entities_store:
            entities_store[entity['text'].lower()] = entity

for key in entities_store:
    print(entities_store[key]["label"], entities_store[key]["text"])
print("found", len(entities_store))
```
-
What do you think about using the tokenizer's `return_overflowing_tokens` flag? Setting it to True makes the tokenizer return a list of lists, and you can then use the offset_mapping to create appropriate chunks (they are all of max size), cf. the example code below. Perhaps this could also be implemented in the model?
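A minimal sketch of that idea (assuming a fast tokenizer, and reusing `text` plus the 300/50 window sizes from the snippet above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')

enc = tokenizer(
    text,
    max_length=300,
    stride=50,                       # token overlap between consecutive chunks
    truncation=True,
    return_overflowing_tokens=True,  # return every chunk, not just the first
    return_offsets_mapping=True,     # character offsets back into `text`
    add_special_tokens=False,
)

# Each chunk's offset_mapping gives its character span in the original text,
# so the cursor_index bookkeeping above comes for free.
chunks = [
    {"text": text[offsets[0][0]:offsets[-1][1]], "cursor_index": offsets[0][0]}
    for offsets in enc['offset_mapping']
]
```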