Protein Property Prediction

This tool predicts a panel of properties for a protein or peptide from its sequence alone, without needing a structure. These properties determine whether a candidate is practical to make and use, and are often grouped under the term developability.

The panel covers solubility (how well it stays dissolved), aggregation (how likely it is to clump), subcellular localization (where in a cell it tends to go), intrinsic disorder (which regions have no fixed shape), toxicity, melting temperature or Tm (the temperature at which it unfolds, a measure of thermal stability), and MHC binding (how likely pieces of it are to be displayed to the immune system, for both class I and class II).

Provide exactly one input: either a plain amino acid sequence or a HELM string. By default, the tool runs the whole panel.

When to use it

Use it to triage or rank candidate sequences before committing to wet lab work. When you have many designs and limited capacity to test them, the tool gives a fast, sequence-only read on which ones look developable and which carry red flags such as high aggregation or strong predicted immune visibility. You can run the full panel or just the properties you care about.

Inputs

Input	Required	What it is
`sequence`	one of	The protein or peptide as a plain string of the 20 standard amino acids, for example `MKTAYIAKQRQ...`. Provide this or `helm`. Sequences over 1022 residues are truncated, and over 2044 are rejected.
`helm`	one of	The peptide written in HELM2 notation, for example `PEPTIDE1{A.G.F.K.L}$$$$V2.0`. Use this when your peptide has explicit linkers or modified ends that a plain string cannot capture. Provide this or `sequence`. Not available in batch mode.
`properties`	no	A list choosing which properties to predict, from `solubility`, `aggregation`, `disorder`, `localization`, `toxicity`, `tm`, and `mhc`. Leave it empty to run the full panel.
`properties_options`	no	Per property settings, for example `{"mhc": {"peptide_length": 15, "mhc_class": "II"}}` to run MHC binding for class II at a given length. Leave empty for sensible defaults.
`model_variant`	no	Model size: `650M` (the default, fast and accurate for most uses) or `3B` (larger, slightly higher accuracy at higher cost).

Two further optional flags return extra detail. `return_per_residue` adds per-residue scores where available (currently disorder and aggregation hotspots) so you can see which regions drive a prediction. `return_embedding` adds the sequence’s numeric fingerprint for downstream similarity search. Both are off by default.
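
The rules in the table above can be checked locally before you submit. The following is a minimal sketch, not part of the SDK: `build_input_data` and `will_be_truncated` are hypothetical helper names, and the sketch assumes every input in the table is passed as a key of `input_data`, which this page only shows for `sequence` and `properties`.

```python
# Names and limits come from the Inputs table; the helpers themselves are hypothetical.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
ALLOWED_PROPERTIES = {
    "solubility", "aggregation", "disorder", "localization", "toxicity", "tm", "mhc",
}
TRUNCATE_OVER = 1022  # sequences longer than this are truncated
REJECT_OVER = 2044    # sequences longer than this are rejected


def build_input_data(sequence=None, helm=None, properties=None,
                     properties_options=None, model_variant="650M"):
    """Return an input_data dict, raising ValueError for inputs the table rules out."""
    # Exactly one of sequence / helm must be given.
    if (sequence is None) == (helm is None):
        raise ValueError("provide exactly one of 'sequence' or 'helm'")
    if model_variant not in {"650M", "3B"}:
        raise ValueError(f"unknown model_variant: {model_variant!r}")
    unknown = set(properties or []) - ALLOWED_PROPERTIES
    if unknown:
        raise ValueError(f"unknown properties: {sorted(unknown)}")

    # Assumption: model_variant travels inside input_data like the other inputs.
    input_data = {"model_variant": model_variant}
    if sequence is not None:
        if not sequence:
            raise ValueError("sequence is empty")
        bad = set(sequence) - STANDARD_AMINO_ACIDS
        if bad:
            raise ValueError(f"non-standard residues: {sorted(bad)}")
        if len(sequence) > REJECT_OVER:
            raise ValueError(
                f"sequence has {len(sequence)} residues; over {REJECT_OVER} is rejected"
            )
        input_data["sequence"] = sequence
    else:
        input_data["helm"] = helm
    if properties:  # leaving this out runs the full panel
        input_data["properties"] = list(properties)
    if properties_options:
        input_data["properties_options"] = dict(properties_options)
    return input_data


def will_be_truncated(sequence):
    """True when the sequence is accepted but longer than the truncation limit."""
    return TRUNCATE_OVER < len(sequence) <= REJECT_OVER
```

Checking `will_be_truncated` before submitting is useful because a truncated sequence still returns a result, so the truncation is easy to miss.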

How to run it

Submit your sequence from Azulene Studio, the Python SDK, or the CLI. New here? The Get started page walks through installing, logging in, and running a ready-made example first.

In Azulene Studio

Open Protein Property Prediction from the tools list. On the Inputs and Parameters step, paste your sequence and optionally pick just the properties you want (or leave the selection empty to run the full panel). Then choose Review and Submit.

From the Python SDK

from opal import jobs

# Submit one sequence and request three properties instead of the full panel.
result = jobs.submit(
    job_type="predict_protein_properties",
    input_data={
        "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR",
        "properties": ["solubility", "aggregation", "tm"],
    },
)
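
To run a single property with non-default settings, the same call can carry `properties_options` and `model_variant`. The sketch below builds a request for MHC class II binding only. It assumes both settings are keys of `input_data` alongside `sequence`, which this page does not state explicitly, and `mhc_class_ii_request` and `submit` are hypothetical helper names rather than SDK functions.

```python
def mhc_class_ii_request(sequence, peptide_length=15, model_variant="650M"):
    """Build keyword arguments for jobs.submit that run only MHC class II binding."""
    return {
        "job_type": "predict_protein_properties",
        "input_data": {
            "sequence": sequence,
            "properties": ["mhc"],
            # Per-property settings, in the shape shown in the Inputs table.
            "properties_options": {
                "mhc": {"peptide_length": peptide_length, "mhc_class": "II"},
            },
            # Assumption: model_variant is passed inside input_data.
            "model_variant": model_variant,
        },
    }


def submit(request):
    """Send a prepared request; the SDK is imported lazily so the builder works without it."""
    from opal import jobs

    return jobs.submit(**request)


request = mhc_class_ii_request("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR")
# result = submit(request)  # uncomment once you are logged in
```

Keeping the request as a plain dictionary makes it easy to inspect or log before anything is sent.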

From the CLI

Pass the inputs as a JSON string.

opal jobs submit --job-type predict_protein_properties \
  --input-data '{"sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR", "properties": ["solubility", "aggregation", "tm"]}'
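
Hand-writing JSON inside shell quotes gets error-prone as the inputs grow. As a hedged sketch, you could generate the command from Python instead. It uses only the `opal jobs submit` flags shown above; `build_cli_command` is a hypothetical helper name.

```python
import json
import shlex


def build_cli_command(input_data):
    """Return the argument list for `opal jobs submit` with input_data serialised as JSON."""
    return [
        "opal", "jobs", "submit",
        "--job-type", "predict_protein_properties",
        "--input-data", json.dumps(input_data),
    ]


input_data = {
    "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR",
    "properties": ["solubility", "aggregation", "tm"],
}
argv = build_cli_command(input_data)

# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
# To run it from Python instead: subprocess.run(argv, check=True)
```

Passing the list form to `subprocess.run` skips the shell entirely, so no quoting is needed in that case.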

Reading the result

You get back one value per requested property. Solubility, aggregation, and toxicity come as scores you can use to rank candidates against each other. Localization tells you the predicted cellular compartment. Disorder flags which parts of the sequence have no fixed shape. Tm is reported as a temperature, where higher means more thermally stable. MHC binding reports how likely fragments of the sequence are to be displayed to the immune system.
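
One way to turn these values into a shortlist is to rank each candidate per property and average the ranks, which avoids comparing scores that sit on different scales. The sketch below is illustrative only. The result layout (a dictionary of property name to number per candidate) is an assumption, the values are made up rather than real output, and `mean_rank_order` is a hypothetical helper. It also assumes a higher solubility score is better and a higher aggregation score is worse.

```python
def mean_rank_order(results, higher_is_better):
    """Order candidate names from best to worst by their average per-property rank.

    results: {candidate_name: {property_name: value}}  (assumed layout)
    higher_is_better: {property_name: bool} for the properties to rank on
    """
    names = list(results)
    totals = {name: 0 for name in names}
    for prop, higher in higher_is_better.items():
        # Best candidate for this property gets rank 1.
        ordered = sorted(names, key=lambda n: results[n][prop], reverse=higher)
        for rank, name in enumerate(ordered, start=1):
            totals[name] += rank
    n_props = len(higher_is_better)
    return sorted(names, key=lambda n: totals[n] / n_props)


# Made-up values for illustration, not real predictions.
results = {
    "design_a": {"solubility": 0.82, "aggregation": 0.15, "tm": 64.0},
    "design_b": {"solubility": 0.55, "aggregation": 0.40, "tm": 71.5},
    "design_c": {"solubility": 0.30, "aggregation": 0.75, "tm": 52.0},
}
directions = {"solubility": True, "aggregation": False, "tm": True}

print(mean_rank_order(results, directions))
```

Ties keep their input order because Python's sort is stable. If one property matters more to you than the others, weight its rank before averaging.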

If you asked for per-residue scores, the result also shows where along the sequence the disorder and aggregation signals come from, so you can pinpoint the regions to redesign. The full numeric output is available to download from the result, and large results are delivered as a downloadable attachment.
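
To pinpoint regions from per-residue scores, you can collapse the scores into contiguous stretches above a cutoff. This is a minimal sketch that assumes the per-residue output can be read as a plain list of numbers, one per residue, in sequence order. The cutoff of 0.5 and the scores are illustrative, not documented values, and `regions_above` is a hypothetical helper.

```python
def regions_above(scores, threshold, min_length=1):
    """Return (start, end) pairs, 1-based and inclusive, for runs with score >= threshold."""
    regions = []
    start = None  # 1-based position where the current run began, if any
    for position, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = position
        elif start is not None:
            # The run ended at the previous residue.
            if position - start >= min_length:
                regions.append((start, position - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


# Made-up per-residue scores for a nine-residue peptide.
scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1, 0.6, 0.9, 0.95]
print(regions_above(scores, threshold=0.5, min_length=2))
```

Setting `min_length` above 1 drops isolated single-residue spikes, which are rarely worth a redesign on their own.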

Notes

Provide exactly one of `sequence` or `helm`. Leaving `properties` empty runs the full panel, which is the simplest way to start.

The `650M` model is a good default; switch to `3B` only when you want a little more accuracy and accept the higher cost. The model runs on a GPU.

For small molecule properties rather than protein properties, see Predict ADMET Properties and Predict Aqueous Solubility.