ESM-2 Sequence Embeddings

ESM-2 Sequence Embeddings takes one or more protein sequences and turns each into an embedding using the ESM-2 protein language model. An embedding is a list of numbers, a vector, that captures what the model has learned about that sequence from seeing millions of real proteins. Sequences that the model considers similar end up with similar vectors, so embeddings give you a numeric handle on proteins that you can compare, cluster, or feed into your own downstream model.

You give the tool protein sequences as one-letter amino acid strings, for example MKIEEL..., where each letter stands for one amino acid. ESM-2 reads each sequence and produces numbers that describe it.

You can choose how the numbers are returned. With mean pooling, the whole sequence is summarized into a single vector. With CLS pooling, the model’s own summary token is used. With no pooling, you get one vector for every individual amino acid in the sequence, which is more detailed but much larger.
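
To make the three pooling options concrete, here is a minimal sketch on toy data. The vectors are two numbers wide instead of 1280, the `pool` helper is hypothetical, and the layout (a summary token first, an end token last, both left out of the mean) is an assumption for illustration, not a description of how the tool computes its output.

```python
# Toy per-token vectors for a 3-residue sequence, width 2 instead of 1280.
# Row 0 stands in for the model's summary (CLS) token and the last row for
# an end-of-sequence token; both positions are assumptions for illustration.
token_vectors = [
    [0.5, 0.5],  # summary token
    [1.0, 0.0],  # residue 1
    [0.0, 1.0],  # residue 2
    [2.0, 2.0],  # residue 3
    [0.1, 0.1],  # end token
]


def pool(vectors, mode):
    """Summarize per-token vectors in the way each pooling option is described."""
    residues = vectors[1:-1]  # drop the two special tokens
    if mode == "none":
        return residues  # one vector per amino acid
    if mode == "cls":
        return vectors[0]  # the summary token's vector
    if mode == "mean":
        width = len(residues[0])
        return [sum(row[i] for row in residues) / len(residues) for i in range(width)]
    raise ValueError(f"unknown pooling mode: {mode!r}")


print(pool(token_vectors, "mean"))       # [1.0, 1.0]
print(pool(token_vectors, "cls"))        # [0.5, 0.5]
print(len(pool(token_vectors, "none")))  # 3
```

The mean and CLS results have the same length whatever the sequence length, while the unpooled result grows by one vector per amino acid.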

When to use it

Use it when you want a numeric representation of one or more proteins, for example to measure how similar two proteins are, to group a set of proteins by similarity, or to use the vectors as input features for your own machine learning model. Turn on contact prediction when you also want the model’s guess at which parts of the protein touch each other in 3D.
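
One common way to measure how similar two proteins are is the cosine similarity of their pooled vectors. The sketch below uses short made-up vectors in place of real embeddings; `cosine_similarity` is a helper written for this example and is not part of the tool.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, from -1 to 1."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-number vectors standing in for two pooled embeddings.
protein_a = [1.0, 0.0, 0.0]
protein_b = [1.0, 1.0, 0.0]
print(round(cosine_similarity(protein_a, protein_b), 4))  # 0.7071
```

Only compare vectors produced by the same model variant: 650M vectors have 1280 numbers and 3B vectors have 2560, so they cannot be compared directly.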

Inputs

Input	Required	What it is
`sequences`	yes	A list of one or more protein sequences. Each is a one-letter amino acid string between 1 and 2048 amino acids long.
`model_variant`	no, default `650M`	Which ESM-2 model to use. `650M` produces 1280 numbers per vector. `3B` is a larger model and produces 2560 numbers per vector.
`pool`	no, default `mean`	How the embedding is summarized. `mean` returns one vector per sequence inline. `cls` returns the model’s summary token vector. `none` returns one vector per amino acid, packaged as a downloadable gzipped tar file.
`return_contacts`	no, default `false`	If `true`, also returns a predicted contact map for each sequence, a grid showing which pairs of amino acids the model thinks are in contact in 3D.
`fp16`	no, default `true`	Runs the model in a faster, lower-precision number format on the GPU. It is ignored when `return_contacts` is `true`.
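
If you build inputs in code, a quick local check against the table above can catch mistakes before you submit. This is a sketch only: `check_inputs` is a hypothetical helper, not part of the SDK, and it does not check which letters are used, because the accepted amino acid alphabet is not listed here.

```python
ALLOWED_VARIANTS = {"650M", "3B"}
ALLOWED_POOLS = {"mean", "cls", "none"}
MAX_LENGTH = 2048  # longest sequence allowed, per the inputs table


def check_inputs(input_data):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    sequences = input_data.get("sequences")
    if not isinstance(sequences, list) or not sequences:
        problems.append("sequences must be a non-empty list")
    else:
        for i, seq in enumerate(sequences):
            if not isinstance(seq, str) or not 1 <= len(seq) <= MAX_LENGTH:
                problems.append(f"sequence {i} must be a string of 1 to {MAX_LENGTH} letters")
    # Optional inputs fall back to the documented defaults when absent.
    if input_data.get("model_variant", "650M") not in ALLOWED_VARIANTS:
        problems.append("model_variant must be '650M' or '3B'")
    if input_data.get("pool", "mean") not in ALLOWED_POOLS:
        problems.append("pool must be 'mean', 'cls', or 'none'")
    for flag in ("return_contacts", "fp16"):
        if flag in input_data and not isinstance(input_data[flag], bool):
            problems.append(f"{flag} must be true or false")
    return problems


print(check_inputs({"sequences": ["MKIEEL"], "pool": "mean"}))  # []
print(check_inputs({"sequences": [""], "pool": "max"}))
```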

How to run it

Submit your sequences from Azulene Studio, the Python SDK, or the CLI. New here? The Get started page walks through installing, logging in, and running a ready-made example first.

In Azulene Studio

Open ESM-2 Sequence Embeddings from the tools list. On the Inputs and Parameters step, enter your protein sequences as a JSON list and, if you want to change the defaults, pick the model variant and pooling. Then continue to Review and Submit.

From the Python SDK

from opal import jobs

result = jobs.submit(
    job_type="esm2_embed",
    input_data={
        "sequences": ["MKIEELKKWVEEFDKKLAEIFKFDFGGYRELADKVAEAVGKKVDEKQKKIVEIFEKVEAEA"],
        "model_variant": "650M",
        "pool": "mean",
    },
)

From the CLI

Pass the inputs as a JSON string.

opal jobs submit --job-type esm2_embed \
  --input-data '{"sequences": ["MKIEELKKWVEEFDKKLAEIFKFDFGGYRELADKVAEAVGKKVDEKQKKIVEIFEKVEAEA"], "model_variant": "650M", "pool": "mean"}'
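
Writing the JSON string by hand gets error-prone with many sequences. One option, sketched here with only the Python standard library, is to build the inputs as a dictionary and let `json.dumps` and `shlex.join` handle the quoting. The script only prints the command; it does not run it.

```python
import json
import shlex

input_data = {
    "sequences": ["MKIEELKKWVEEFDKKLAEIFKFDFGGYRELADKVAEAVGKKVDEKQKKIVEIFEKVEAEA"],
    "model_variant": "650M",
    "pool": "mean",
}

# Serialize once, then let shlex add the shell quoting around the JSON.
payload = json.dumps(input_data)
command = ["opal", "jobs", "submit", "--job-type", "esm2_embed", "--input-data", payload]
print(shlex.join(command))
```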

Reading the result

The main field is `embeddings`, a list with one entry per sequence you sent in. What each entry holds depends on the pooling you chose:

With `mean` or `cls` pooling, each entry is one vector for the whole sequence: a list of 1280 numbers for the 650M model, or 2560 numbers for the 3B model.
With `pool` set to `none`, you get one vector per amino acid for every sequence, which is too large to return inline, so it is delivered as a downloadable gzipped tar file on the Files tab.
With `return_contacts` set to `true`, the `contacts` field holds a predicted contact map for each sequence, a square grid where each entry says how likely two amino acids are to be in contact. When you do not ask for contacts, `contacts` is empty.

The result also reports `pool`, the pooling that was used, and a small `scalars` block with `runtime_s`, `max_length`, and `n_sequences`. In Azulene Studio the embedding values are shown as raw result data, and any large per-amino-acid output is offered as a file download.
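
As a sketch of working with these fields, the code below walks a hand-written stand-in for a result and lists the amino acid pairs whose contact score passes a cutoff. The dictionary shape mirrors the field names above, but how the SDK exposes a finished result is not covered here, so treat the access pattern as an assumption. The 0.5 cutoff, the 0 to 1 score scale, and the `contact_pairs` helper are also illustrative choices, not part of the tool.

```python
def contact_pairs(contact_map, threshold=0.5):
    """List (i, j, score) for each pair i < j scoring at or above threshold."""
    pairs = []
    for i, row in enumerate(contact_map):
        for j in range(i + 1, len(row)):  # upper triangle only, skip i == j
            if row[j] >= threshold:
                pairs.append((i, j, row[j]))
    return pairs


# Hand-written stand-in for a result; real vectors have 1280 or 2560 numbers.
result = {
    "pool": "mean",
    "embeddings": [[0.1, -0.2, 0.3]],
    "contacts": [[[1.0, 0.2, 0.9],
                  [0.2, 1.0, 0.1],
                  [0.9, 0.1, 1.0]]],
    "scalars": {"runtime_s": 1.5, "max_length": 3, "n_sequences": 1},
}

for vector, contact_map in zip(result["embeddings"], result["contacts"]):
    print(len(vector), contact_pairs(contact_map))  # 3 [(0, 2, 0.9)]
```

Positions are counted from 0 here, so the pair (0, 2) refers to the first and third amino acids.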

Notes

Embeddings are most useful as input to a comparison, a clustering step, or your own model, rather than as numbers to read directly. Per-amino-acid output (`pool` set to `none`) can be large and is delivered as a downloadable file. The 3B model is larger and produces longer vectors (2560 numbers instead of 1280), but is heavier to run than 650M.
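
To get a feel for how large per amino acid output can be, here is a rough back-of-the-envelope estimate. It assumes each number is stored as 4 bytes and ignores compression and file overhead, neither of which is specified here, so the real download size will differ. `per_residue_bytes` is a helper written for this example.

```python
# Vector widths per model variant, taken from the inputs table.
WIDTH = {"650M": 1280, "3B": 2560}


def per_residue_bytes(sequence_lengths, model_variant="650M", bytes_per_value=4):
    """Uncompressed size of one vector per amino acid across all sequences."""
    return sum(sequence_lengths) * WIDTH[model_variant] * bytes_per_value


# One sequence at the 2048 amino acid limit with the 3B model.
print(per_residue_bytes([2048], "3B"))  # 20971520 bytes, about 20 MiB

# For comparison, one mean-pooled 3B vector under the same assumption.
print(WIDTH["3B"] * 4)  # 10240 bytes
```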