Embeddings #17

Open
Egelvein opened this issue Jan 25, 2024 · 0 comments

I want to get embeddings from the last hidden state, but I can't, because my output has no last_hidden_state attribute.

However, I did get embeddings the following way:

1. Load model

import torch
from transformers import AutoTokenizer
# SmilesClassificationModel is the class provided by this repository

trained_yield_bert = SmilesClassificationModel(
    'bert', model_path,
    num_labels=1,
    args={"regression": True,
          'config': {"output_hidden_states": True}},
    use_cuda=False,
)

tokenizer1 = AutoTokenizer.from_pretrained(model_path)
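As a quick sanity check (assuming trained_yield_bert.config is a standard HuggingFace config, as its use further below suggests), one can confirm that hidden states are actually enabled:

print(trained_yield_bert.config.output_hidden_states)  # should print True
print(trained_yield_bert.config.num_hidden_layers)     # number of encoder layers
print(trained_yield_bert.config.hidden_size)           # hidden/embedding dimensionality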

2. Inputs

Here, test_df.head(1).labels.values is an ordinary SMILES string (one row of the dataframe).

bert_inputs = tokenizer1.batch_encode_plus(
    str(test_df.head(1).labels.values),
    max_length=trained_yield_bert.config.max_position_embeddings,
    padding=True,
    truncation=True,
    pad_to_max_length=True,
    return_tensors='pt',
)

bert_inputs
{'input_ids': tensor([[12, 11, 13,  ...,  0,  0,  0],
        [12, 11, 13,  ...,  0,  0,  0],
        [12, 24, 13,  ...,  0,  0,  0],
        ...,
        [12, 43, 13,  ...,  0,  0,  0],
        [12, 98, 13,  ...,  0,  0,  0],
        [12, 11, 13,  ...,  0,  0,  0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}
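
Note that wrapping the numpy array in str() makes the tokenizer encode the array's textual representation character by character, which is probably why there are several rows above instead of one. A sketch of what I believe the intended call would be, passing a plain Python list of SMILES strings (assuming the SMILES really are stored in the labels column; the example SMILES in the comment is hypothetical):

smiles_list = test_df.head(1).labels.tolist()   # e.g. ['CC(=O)OC1=CC=CC=C1C(=O)O']

bert_inputs = tokenizer1.batch_encode_plus(
    smiles_list,
    max_length=trained_yield_bert.config.max_position_embeddings,
    padding='max_length',
    truncation=True,
    return_tensors='pt',
)
# input_ids should then have shape [1, max_position_embeddings]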

3. Outputs

with torch.no_grad():
    output = trained_yield_bert.model(**bert_inputs)

embeddings = output[0].squeeze().cpu().numpy().tolist()
embeddings
[0.672431230545044,
 0.672431230545044,
 0.8746748566627502,
 0.6140751242637634,
 0.5577840805053711,
 0.522050142288208,
 0.6576945781707764,
 0.6140751242637634,
 0.5635161995887756,
 0.5149366855621338,
 0.5635161995887756,
 0.672431230545044]

4. Questions

The output has two elements. The first one is shown above; the second one has dimensionality 13 × 12 × 512 × 256.
I want to understand which of these data are the embeddings.
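
My current understanding, assuming the underlying model behaves like a standard HuggingFace BERT with output_hidden_states=True and tuple outputs: output[0] holds the regression predictions (one value per input sequence, which is what is printed above), and output[1] is a tuple of 13 hidden-state tensors (embedding layer plus 12 encoder layers), each of shape [batch_size, seq_len, hidden_size], i.e. 12 × 512 × 256 here. Under that assumption, a sketch of pulling one embedding per sequence from the last hidden state (reusing output and bert_inputs from above):

predictions = output[0]                  # regression output, one value per sequence
hidden_states = output[1]                # tuple of 13 tensors: embedding layer + 12 encoder layers
last_hidden_state = hidden_states[-1]    # shape [batch_size, seq_len, hidden_size]

# mean-pool over non-padding tokens to get one vector per sequence
mask = bert_inputs['attention_mask'].unsqueeze(-1).float()
sequence_embeddings = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
# sequence_embeddings has shape [batch_size, hidden_size]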
