ESM-2 Sequence Embeddings
ESM-2 Sequence Embeddings takes one or more protein sequences and turns each into an embedding using the ESM-2 protein language model. An embedding is a list of numbers, a vector, that captures what the model has learned about that sequence from seeing millions of real proteins. Sequences that the model considers similar end up with similar vectors, so embeddings give you a numeric handle on proteins that you can compare, cluster, or feed into your own downstream model.
You give the tool protein sequences as one-letter amino acid strings, for example MKIEEL..., where each letter stands for one amino acid. ESM-2 reads each sequence and produces the numbers that describe it.
You can choose how the numbers are returned. With mean pooling, the whole sequence is summarized into a single vector. With CLS pooling, the model’s own summary token is used. With no pooling, you get one vector for every individual amino acid in the sequence, which is more detailed but much larger.
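The three pooling options can be pictured on a toy example. The sketch below is illustrative only: the vectors have 3 values instead of 1280 or 2560, the numbers are made up, and it assumes mean pooling averages only the amino acid positions and leaves out the summary token, which this page does not state.

```python
# Toy token vectors: the first row stands in for the model's summary (CLS)
# token, the remaining rows for individual amino acids. Values are invented.
token_vectors = [
    [0.0, 1.0, 2.0],  # summary (CLS) token
    [1.0, 1.0, 1.0],  # amino acid 1
    [3.0, 5.0, 1.0],  # amino acid 2
]


def mean_pool(vectors):
    """Average a list of equal-length vectors column by column."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# pool="none": one vector per amino acid.
residue_vectors = token_vectors[1:]

# pool="cls": the summary token's vector.
cls_vector = token_vectors[0]

# pool="mean": a single averaged vector (assumed to exclude the summary token).
mean_vector = mean_pool(residue_vectors)
```

The per amino acid output grows with sequence length, while the pooled outputs stay the same size however long the sequence is.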
When to use it
Use it when you want a numeric representation of one or more proteins, for example to measure how similar two proteins are, to group a set of proteins by similarity, or to use the vectors as input features for your own machine learning model. Turn on contact prediction when you also want the model’s guess at which parts of the protein touch each other in 3D.
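One common way to measure how similar two proteins are is the cosine similarity of their pooled vectors. This is a minimal sketch, not something the tool computes for you; `cosine_similarity` is a hypothetical helper name, and cosine similarity is only one of several reasonable choices of measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, from -1 to 1."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Only compare vectors produced with the same model variant and the same pooling, since 650M and 3B vectors have different lengths.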
Inputs
| Input | Required | What it is |
|---|---|---|
| `sequences` | yes | A list of one or more protein sequences, each a one-letter amino acid string between 1 and 2048 amino acids long. |
| `model_variant` | no, default `650M` | Which ESM-2 model to use. `650M` produces 1280 numbers per vector. `3B` is a larger model and produces 2560 numbers per vector. |
| `pool` | no, default `mean` | How the embedding is summarized. `mean` returns one vector per sequence inline. `cls` returns the model’s summary token vector. `none` returns one vector per amino acid, packaged as a downloadable gzipped tar file. |
| `return_contacts` | no, default `false` | If true, also returns a predicted contact map for each sequence, a grid showing which pairs of amino acids the model thinks are in contact in 3D. |
| `fp16` | no, default `true` | Runs the model in a faster, lower-precision number format on the GPU. It is ignored when `return_contacts` is true. |
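Checking inputs locally before submitting can catch mistakes early. The sketch below mirrors the limits in the table above; `build_input_data` is a hypothetical helper, not part of the SDK, and it does not check which letters are allowed because this page does not list the accepted alphabet.

```python
MAX_LENGTH = 2048
MODEL_VARIANTS = ("650M", "3B")
POOL_MODES = ("mean", "cls", "none")


def build_input_data(sequences, model_variant="650M", pool="mean",
                     return_contacts=False, fp16=True):
    """Hypothetical helper: validate inputs and return an input_data dict."""
    if not sequences:
        raise ValueError("sequences must contain at least one sequence")
    for index, sequence in enumerate(sequences):
        if not isinstance(sequence, str) or not 1 <= len(sequence) <= MAX_LENGTH:
            raise ValueError(
                f"sequence {index} must be a string of 1 to {MAX_LENGTH} letters"
            )
    if model_variant not in MODEL_VARIANTS:
        raise ValueError(f"model_variant must be one of {MODEL_VARIANTS}")
    if pool not in POOL_MODES:
        raise ValueError(f"pool must be one of {POOL_MODES}")
    return {
        "sequences": list(sequences),
        "model_variant": model_variant,
        "pool": pool,
        "return_contacts": return_contacts,
        "fp16": fp16,
    }
```

The returned dictionary has the same keys as the inputs in the table, so it could be passed as the `input_data` of a submission.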
How to run it
Submit your sequences from Azulene Studio, the Python SDK, or the CLI. New here? The Get started page walks through installing, logging in, and running a ready-made example first.
In Azulene Studio
Open ESM-2 Sequence Embeddings from the tools list. On the Inputs and Parameters step, enter your protein sequences as a JSON list and pick the model variant and pooling if you want to change the defaults. Then choose Review and Submit.
From the Python SDK
```python
from opal import jobs

result = jobs.submit(
    job_type="esm2_embed",
    input_data={
        "sequences": ["MKIEELKKWVEEFDKKLAEIFKFDFGGYRELADKVAEAVGKKVDEKQKKIVEIFEKVEAEA"],
        "model_variant": "650M",
        "pool": "mean",
    },
)
```
From the CLI
Pass the inputs as a JSON string.
```shell
opal jobs submit --job-type esm2_embed \
  --input-data '{"sequences": ["MKIEELKKWVEEFDKKLAEIFKFDFGGYRELADKVAEAVGKKVDEKQKKIVEIFEKVEAEA"], "model_variant": "650M", "pool": "mean"}'
```
Reading the result
The main field is `embeddings`, a list with one entry per sequence you sent in. What each entry holds depends on the pooling you chose:
- With `mean` or `cls` pooling, each entry is one vector for the whole sequence: a list of 1280 numbers for the `650M` model, or 2560 numbers for the `3B` model.
- With `pool` set to `none`, you get one vector per amino acid for every sequence, which is too large to return inline, so it is delivered as a downloadable gzipped tar file on the Files tab.
- With `return_contacts` set to `true`, the `contacts` field holds a predicted contact map for each sequence, a square grid where each entry says how likely two amino acids are to be in contact. When you do not ask for contacts, `contacts` is empty.
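A contact map is easier to use once it is reduced to a list of likely pairs. The sketch below assumes one sequence's map is a square list of lists of probabilities between 0 and 1, which this page does not confirm. `likely_contacts`, the 0.5 threshold, and the minimum separation are all illustrative choices.

```python
def likely_contacts(contact_map, threshold=0.5, min_separation=6):
    """Return (i, j, probability) for pairs at or above the threshold.

    Only pairs at least min_separation positions apart are reported, since
    neighbours along the chain are close in 3D regardless of the fold.
    Positions are 0-based.
    """
    length = len(contact_map)
    pairs = []
    for i in range(length):
        for j in range(i + min_separation, length):
            probability = contact_map[i][j]
            if probability >= threshold:
                pairs.append((i, j, probability))
    return pairs


# Invented 3 x 3 map, with the separation filter relaxed for the tiny example.
toy_map = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.7],
    [0.2, 0.7, 1.0],
]
pairs = likely_contacts(toy_map, threshold=0.5, min_separation=1)
```

Only the upper triangle is scanned, so each pair is reported once.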
The result also reports pool, the pooling that was used, and a small scalars block with runtime_s, max_length, and n_sequences. In Azulene Studio the embedding values are shown as raw result data, and any large per amino acid output is offered as a file download.
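A quick sanity check on a pooled result is to confirm that the number of vectors and their length match what you asked for. This sketch assumes the result has already been loaded into a plain Python dictionary with the field names described above; how the SDK exposes the payload is not covered here. `check_pooled_result` is a hypothetical helper and the mock values are invented.

```python
EXPECTED_DIM = {"650M": 1280, "3B": 2560}


def check_pooled_result(result, model_variant):
    """Check a mean or cls result and return (number of vectors, vector length)."""
    embeddings = result["embeddings"]
    n_expected = result["scalars"]["n_sequences"]
    if len(embeddings) != n_expected:
        raise ValueError(f"expected {n_expected} vectors, got {len(embeddings)}")
    dim = EXPECTED_DIM[model_variant]
    for index, vector in enumerate(embeddings):
        if len(vector) != dim:
            raise ValueError(f"vector {index} has {len(vector)} values, expected {dim}")
    return len(embeddings), dim


# Mock payload for illustration only; every value here is made up.
mock_result = {
    "embeddings": [[0.0] * 1280, [0.1] * 1280],
    "pool": "mean",
    "contacts": [],
    "scalars": {"runtime_s": 1.2, "max_length": 61, "n_sequences": 2},
}
summary = check_pooled_result(mock_result, "650M")
```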
The embeddings are most useful as input to a comparison, a clustering step, or your own model, rather than as numbers to read directly. Per amino acid output (`pool` set to `none`) can be large, which is why it is delivered as a downloadable file. The `3B` model is larger than `650M` and produces vectors twice as long (2560 numbers rather than 1280), so it is heavier to run and its output takes more space.