Max length #113
-
Hello, the default max length (296) is pretty low for my usage. If I understand correctly, it can be modified in the YAML config files (not sure which one, though), but are there any recommendations / limits / bad practices around that? Should I try to split my text into smaller chunks instead? Thank you for your work and help.
-
Yes, you can modify the max in the config file and it should work. But as it is an out-of-domain scenario, I recommend chunking the text into parts.
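A hedged sketch of what that override might look like at load time (the `max_len` attribute is assumed from GLiNER's training configs and may differ across versions, so verify it against your installed release):

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")

# `max_len` is an assumption based on GLiNER's training configs;
# check the attribute name in your installed version before relying on it.
print(model.config.max_len)

# Raising it beyond the training length is out-of-domain for the
# pretrained weights, so chunking (see below) is usually the safer route.
model.config.max_len = 512
```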
-
Hi @urchade, @reouvenzana, I am using the model mediumv2.5. To change the default max length, which files do I have to modify? And do I have to clone the whole repository, or can I just download those files, make the changes, and run the fine-tuning training? Please help, it's urgent! Thanks & Regards
-
Here is my quick-and-dirty take at a 300-token splitter:

```python
import fitz  # install with: pip install PyMuPDF
import tqdm

# Extract the raw text of the PDF, page by page.
with fitz.open('ORANO-MAG-2021_205x275_FR_MEL.pdf') as doc:
    text = ""
    for page in doc:
        text += page.get_text()

from gliner import GLiNER

# model = GLiNER.from_pretrained("knowledgator/gliner-multitask-large-v0.5")
model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")
model.to('cuda')

# Collapse newlines and double spaces into single spaces.
text = text.replace('\n', ' ').replace('  ', ' ')

labels = ["Facility", "Geo-Political Entity", "Location", "Organization",
          "Person", "Vehicle", "Weapon", "Date", "Event"]

"""
We can't simply tokenize the whole text at once, chunk it (with overlap),
and run inference after detokenizing, because we would "lose" the character
index (cursor offset) if we don't process the text sequentially.
"""
from transformers import AutoTokenizer

tokenizer_model_name = 'microsoft/deberta-v3-large'
tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_name)

MAX_WINDOW_SIZE = 300  # tokens per chunk
OVERLAP = 50           # characters of overlap between consecutive chunks

def cap_text_tokens_count(text: str, token_count: int = MAX_WINDOW_SIZE) -> str:
    """Return the longest prefix of `text` that fits in `token_count` tokens."""
    # Conservative characters-per-token bound (empirically ~1/0.28 to 1/0.5;
    # 5.0 takes no risk) so we never tokenize a huge tail for nothing.
    token_scale_down_ratio = 5.0
    if len(text) > token_count * token_scale_down_ratio:
        text = text[:int(token_count * token_scale_down_ratio)]
    tokenized = tokenizer(text, add_special_tokens=False)
    detokenized = tokenizer.decode(tokenized['input_ids'][:token_count],
                                   skip_special_tokens=True)
    # The decoded length approximates the character span of the kept tokens.
    return text[:len(detokenized)]

# Walk through the text sequentially, recording each chunk's character offset.
cursor_index = 0
chunks = []
while cursor_index < len(text) - OVERLAP:
    window_text = cap_text_tokens_count(text[cursor_index:])
    # TODO: normalize window_text into readable text
    chunks.append({"text": window_text, "cursor_index": cursor_index})
    if len(window_text) <= OVERLAP:
        print('breaking ...')
        break
    cursor_index += len(window_text) - OVERLAP

entities_store = {}
for chunk in tqdm.tqdm(chunks):
    window_text = chunk["text"]
    cursor_index = chunk["cursor_index"]
    try:
        entities = model.predict_entities(window_text, labels)
    except Exception:
        continue  # skip chunks that fail inference
    for entity in entities:
        # Shift the spans from chunk-local to document-global offsets.
        entity['start'] += cursor_index
        entity['end'] += cursor_index
        # Keep only the first occurrence of each (case-insensitive) surface form.
        if entity['text'].lower() not in entities_store:
            entities_store[entity['text'].lower()] = entity

for key in entities_store:
    print(entities_store[key]["label"], entities_store[key]["text"])
print("found", len(entities_store))
```
-
What do you think about using the tokenizer's `return_overflowing_tokens` flag? Setting it to True makes the tokenizer return a list of lists, and you can then use the offset_mapping to create appropriate chunks (they are all of max size), cf. the example code below. Perhaps this could also be implemented in the model?
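A minimal sketch of that idea (assuming a fast tokenizer, and reusing `text` plus the 300/50 window sizes from the snippet above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-v3-large')

enc = tokenizer(
    text,
    max_length=300,
    stride=50,                       # token overlap between consecutive chunks
    truncation=True,
    return_overflowing_tokens=True,  # return every chunk, not just the first
    return_offsets_mapping=True,     # character offsets back into `text`
    add_special_tokens=False,
)

# Each chunk's offset_mapping gives its character span in the original text,
# so the cursor_index bookkeeping above comes for free.
chunks = [
    {"text": text[offsets[0][0]:offsets[-1][1]], "cursor_index": offsets[0][0]}
    for offsets in enc['offset_mapping']
]
```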