ESM-2 Sequence Embeddings

Turn protein sequences into numeric vectors you can compare, cluster, or feed into your own models.

ESM-2 Sequence Embeddings takes one or more protein sequences and turns each into an embedding using the ESM-2 protein language model. An embedding is a list of numbers, a vector, that captures what the model has learned about that sequence from seeing millions of real proteins. Sequences that the model considers similar end up with similar vectors, so embeddings give you a numeric handle on proteins that you can compare, cluster, or feed into your own downstream model.

You give the tool protein sequences as one-letter amino acid strings such as MKIEEL..., where each letter is one amino acid. ESM-2 reads each sequence and produces numbers that describe it.

You can choose how the numbers are returned. With mean pooling, the whole sequence is summarized into a single vector by averaging. With CLS pooling, you get the vector of the model’s own summary token, a special token that ESM-2 places at the start of every sequence. With no pooling, you get one vector for every individual amino acid in the sequence, which is more detailed but much larger.
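To make the three pooling options concrete, here is a minimal sketch in plain Python. It assumes the model has already produced one vector per token, with the summary (CLS) token in the first row followed by one row per amino acid. That layout, the pool_embeddings helper, and the toy numbers are illustrative assumptions, not the tool's internal code, and the tool may handle special tokens differently when averaging.

```python
from statistics import fmean


def pool_embeddings(token_vectors, pool="mean"):
    """Summarize per-token vectors in the way the three pool options are described.

    token_vectors: list of equal-length vectors. Row 0 is assumed to be the
    summary (CLS) token; the remaining rows are one per amino acid.
    """
    cls_vector, residue_vectors = token_vectors[0], token_vectors[1:]
    if pool == "cls":
        return cls_vector
    if pool == "none":
        return residue_vectors
    if pool == "mean":
        # Average each dimension across all amino acids.
        return [fmean(column) for column in zip(*residue_vectors)]
    raise ValueError(f"unknown pool option: {pool!r}")


# Toy example: a 3-residue sequence with 2 numbers per vector
# (real ESM-2 650M vectors have 1280 numbers).
toy = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mean_vec = pool_embeddings(toy, "mean")     # one vector for the sequence
cls_vec = pool_embeddings(toy, "cls")       # the summary token's vector
per_residue = pool_embeddings(toy, "none")  # one vector per amino acid
```

Note how the "none" result grows with sequence length while the other two stay a fixed size, which is why per amino acid output is so much larger.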

Use it when you want a numeric representation of one or more proteins, for example to measure how similar two proteins are, to group a set of proteins by similarity, or to use the vectors as input features for your own machine learning model. Turn on contact prediction when you also want the model’s guess at which parts of the protein touch each other in 3D.
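One common way to measure how similar two proteins are is the cosine similarity of their embeddings. The sketch below is plain Python with a hypothetical cosine_similarity helper and toy 3-number vectors standing in for real embeddings. Cosine similarity is an assumption here, one reasonable choice among several, not something the tool computes for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors; real mean-pooled embeddings have 1280 or 2560 numbers.
protein_a = [1.0, 0.0, 1.0]
protein_b = [2.0, 0.0, 2.0]  # points the same way as protein_a
protein_c = [0.0, 1.0, 0.0]  # orthogonal to protein_a

sim_ab = cosine_similarity(protein_a, protein_b)
sim_ac = cosine_similarity(protein_a, protein_c)
```

Only compare vectors that came from the same model variant and the same pooling, since 650M and 3B vectors have different lengths.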

| Input | Required | What it is |
| --- | --- | --- |
| sequences | yes | A list of one or more protein sequences, each a one-letter amino acid string between 1 and 2048 amino acids long. |
| model_variant | no, default 650M | Which ESM-2 model to use. 650M produces 1280 numbers per vector. 3B is a larger model and produces 2560 numbers per vector. |
| pool | no, default mean | How the embedding is summarized. mean returns one vector per sequence inline. cls returns the model’s summary token vector. none returns one vector per amino acid, packaged as a downloadable gzipped tar file. |
| return_contacts | no, default false | If true, also returns a predicted contact map for each sequence, a grid showing which pairs of amino acids the model thinks are in contact in 3D. |
| fp16 | no, default true | Runs the model in a faster, lower-precision number format on the GPU. It is ignored when return_contacts is true. |
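If you build the inputs in code, it can help to check them locally before submitting. The sketch below uses a hypothetical build_input_data helper that is not part of the SDK. It applies only the limits listed in the inputs above and does not check which amino acid letters the tool accepts, since that is not specified here.

```python
VALID_VARIANTS = {"650M", "3B"}
VALID_POOLS = {"mean", "cls", "none"}
MAX_LENGTH = 2048  # longest sequence the inputs above allow


def build_input_data(sequences, model_variant="650M", pool="mean",
                     return_contacts=False, fp16=True):
    """Hypothetical helper: validate the inputs, then return the input dict."""
    if not sequences:
        raise ValueError("sequences must contain at least one sequence")
    for i, seq in enumerate(sequences):
        if not 1 <= len(seq) <= MAX_LENGTH:
            raise ValueError(
                f"sequence {i} has length {len(seq)}; expected 1 to {MAX_LENGTH}"
            )
    if model_variant not in VALID_VARIANTS:
        raise ValueError(f"model_variant must be one of {sorted(VALID_VARIANTS)}")
    if pool not in VALID_POOLS:
        raise ValueError(f"pool must be one of {sorted(VALID_POOLS)}")
    return {
        "sequences": list(sequences),
        "model_variant": model_variant,
        "pool": pool,
        "return_contacts": return_contacts,
        "fp16": fp16,
    }


# A short toy sequence, using the CLS summary instead of the default mean.
input_data = build_input_data(["MKIEELKKWV"], pool="cls")
```

The returned dictionary has the same shape as the input_data you pass when submitting a job.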

Submit your sequences from Azulene Studio, the Python SDK, or the CLI. New here? The Get started page walks through installing, logging in, and running a ready-made example first.

In Azulene Studio, open ESM-2 Sequence Embeddings from the tools list. On the Inputs and Parameters step, enter your protein sequences as a JSON list and pick the model variant and pooling if you want to change the defaults. Then Review and Submit.

# Submit one sequence from the Python SDK.
from opal import jobs

result = jobs.submit(
    job_type="esm2_embed",
    input_data={
        "sequences": ["MKIEELKKWVEEFDKKLAEIFKFDFGGYRELADKVAEAVGKKVDEKQKKIVEIFEKVEAEA"],
        "model_variant": "650M",  # or "3B"
        "pool": "mean",           # or "cls" / "none"
    },
)

From the CLI, pass the inputs as a JSON string.

opal jobs submit --job-type esm2_embed \
--input-data '{"sequences": ["MKIEELKKWVEEFDKKLAEIFKFDFGGYRELADKVAEAVGKKVDEKQKKIVEIFEKVEAEA"], "model_variant": "650M", "pool": "mean"}'

The main field is embeddings, a list with one entry per sequence you sent in. What the result holds depends on the options you chose:

  • With mean or cls pooling, each entry is one vector for the whole sequence: a list of 1280 numbers for the 650M model, or 2560 numbers for the 3B model.
  • With pool set to none, you get one vector per amino acid for every sequence, which is too large to return inline, so it is delivered as a downloadable gzipped tar file on the Files tab.
  • With return_contacts set to true, the contacts field holds a predicted contact map for each sequence, a square grid where each entry says how likely two amino acids are to be in contact. When you do not ask for contacts, contacts is empty.
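As an illustration of reading a contact map, the sketch below lists the residue pairs whose score is at or above a cutoff. It assumes each map arrives as a square list of lists with scores between 0 and 1. The likely_contacts helper, the 0.5 cutoff, and the minimum separation are hypothetical choices for this example, not settings of the tool.

```python
def likely_contacts(contact_map, threshold=0.5, min_separation=2):
    """Return (i, j, score) for residue pairs scoring at or above threshold.

    contact_map: square list of lists; entry [i][j] is the score for the
    amino acids at 0-based positions i and j. Only pairs with
    j - i >= min_separation are reported, because neighbors along the
    chain are close to each other by construction.
    """
    n = len(contact_map)
    if any(len(row) != n for row in contact_map):
        raise ValueError("contact map must be square")
    pairs = []
    for i in range(n):
        for j in range(i + min_separation, n):
            score = contact_map[i][j]
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Toy 4-residue map; a real map has one row and column per amino acid.
toy_map = [
    [1.0, 0.9, 0.1, 0.8],
    [0.9, 1.0, 0.7, 0.2],
    [0.1, 0.7, 1.0, 0.6],
    [0.8, 0.2, 0.6, 1.0],
]
pairs = likely_contacts(toy_map)
```

Because the grid is symmetric, the loop only visits the upper triangle, so each pair is reported once.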

The result also reports pool, the pooling that was used, and a small scalars block with runtime_s, max_length, and n_sequences. In Azulene Studio the embedding values are shown as raw result data, and any large per amino acid output is offered as a file download.
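A quick sanity check on a finished job might look like the sketch below. It assumes the result is available as a plain dictionary that uses the field names described above, with the scalars block under a "scalars" key. How the SDK actually exposes the result object is not covered here, so treat the summarize_result helper and the fake_result data as hypothetical.

```python
EXPECTED_DIM = {"650M": 1280, "3B": 2560}  # numbers per vector for each variant


def summarize_result(result, model_variant="650M"):
    """Check inline embeddings against the documented sizes and summarize the run."""
    pool = result["pool"]
    scalars = result["scalars"]
    if pool in ("mean", "cls"):
        # Inline results: one fixed-size vector per submitted sequence.
        embeddings = result["embeddings"]
        if len(embeddings) != scalars["n_sequences"]:
            raise ValueError("expected one embedding per submitted sequence")
        for i, vector in enumerate(embeddings):
            if len(vector) != EXPECTED_DIM[model_variant]:
                raise ValueError(f"embedding {i} has {len(vector)} numbers")
    return {
        "pool": pool,
        "n_sequences": scalars["n_sequences"],
        "max_length": scalars["max_length"],
        "runtime_s": scalars["runtime_s"],
        "has_contacts": bool(result.get("contacts")),
    }


# Made-up result with the documented fields, for illustration only.
fake_result = {
    "embeddings": [[0.0] * 1280, [0.0] * 1280],
    "contacts": [],
    "pool": "mean",
    "scalars": {"runtime_s": 1.5, "max_length": 61, "n_sequences": 2},
}
summary = summarize_result(fake_result)
```

With pool set to none the per amino acid vectors live in the downloaded file, so the sketch skips the size check in that case.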

The embeddings are most useful as input to a comparison, a clustering step, or your own model, rather than as numbers to read directly. Per amino acid output (pool set to none) can be large and is delivered as a downloadable file. The 3B model is larger and returns wider vectors (2560 numbers instead of 1280), but it is heavier to run than 650M.