All molecules, large or small, deserve a common digital representation

With the rise of “new modalities” such as ADCs, representing molecular entities is becoming more challenging, yet it remains crucial.

Small molecules’ structural representation is well established

Because small molecules have been used and studied in depth for a long time now, their digital representation is one of the most advanced. Chemists are able to understand and draw representations of their molecules using one of the chemistry drawing software packages available on the market, in a manner fairly similar to what is done on paper. The visual representation obtained from these packages looks very similar to a handwritten one and is actually likely to be easier to read than scientists’ handwriting (no offence intended, my own handwriting is the stereotypical doctor’s handwriting). The various drawing packages may differ in the availability of this or that feature, but the overall result is consistent and understandable by other scientists.

When looking at the digital file formats for chemical structures, a variety is available. Some of the older formats, such as MOL V2000 and V3000, or RXN for reactions, are rich, widely used, interoperable and somewhat human readable. Drawing packages’ proprietary formats, by contrast, are neither user friendly when you look under the hood nor interoperable: conversion tools are necessary to open a file in a tool different from the one it was saved in. Shorter, more human-readable formats such as SMILES are also used for quick exchange of information as a string, but some finer information, such as stereochemistry, may be lost if it is not written explicitly. Other string notations, e.g. InChI and SMARTS, are also available; they are easier to exchange and collaborate with than MOL files because they are a single string rather than a table of atoms and bonds. So far, however, they are not as widely used and adopted, though their use is increasing, possibly because they are difficult for humans to read directly or because their rules are difficult to apply.
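As a small illustration of the point about stereochemistry, the sketch below compares two SMILES strings for alanine, one written without any stereo information and one with a chirality marker. It relies only on the SMILES convention that `@` marks tetrahedral centres and that `/` and `\` mark double-bond geometry. The helper name `has_stereo_markers` is made up for this example, and it is a plain character scan: a real cheminformatics toolkit would parse the string properly.

```python
# Characters that SMILES uses to encode stereochemistry:
# '@' for tetrahedral centres, '/' and '\' for double-bond geometry.
STEREO_MARKERS = ("@", "/", "\\")


def has_stereo_markers(smiles: str) -> bool:
    """Return True if the SMILES string carries any stereo marker.

    Hypothetical helper for illustration only: it scans characters
    and does not validate or parse the SMILES string.
    """
    return any(marker in smiles for marker in STEREO_MARKERS)


alanine_flat = "CC(N)C(=O)O"        # connectivity only, no stereochemistry
alanine_chiral = "C[C@H](N)C(=O)O"  # one enantiomer, written explicitly

print(has_stereo_markers(alanine_flat))
print(has_stereo_markers(alanine_chiral))
```

Both strings describe the same atoms and bonds, so anyone exchanging the first one has silently dropped the information about which enantiomer was meant.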

Biologic sequences and 3D visualisations rule the natural large molecules’ representation

When it comes to biologics, i.e. RNAs, DNAs, proteins, antibodies and other natural large molecules produced by a cell, the sequence in which the monomer units are strung together is usually used to differentiate two molecules. However, the full sequence is rarely used as such because it is far too long: hundreds to thousands of units, depending on the size and nature of the biological entity. Most of the time, biologists rely on identifiers for their compounds.

One of the most important things with biologics is the 3D shape of the molecules. Using the sequences and the structural knowledge of the building blocks (natural amino acids or nucleotides), computer models can predict and visualise the shape of these molecules. This is done with a “transcription” step which understands the molecular structure of the natural monomers and uses their properties to calculate the most probable 3D structures, as AlphaFold does. It is possible to drill down from the high-level view of a protein, for example from the quaternary structure to the tertiary, then the secondary (e.g. α-helix) and finally the primary structure, i.e. the amino acid sequence, where a unique letter represents a given amino acid of known chemical structure. Access to the atomic arrangements can inform scientists of the expected properties and possible interactions.
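To make the last step of that drill-down concrete, here is a minimal sketch of how a one-letter sequence maps back to named residues. The table holds the standard one-letter and three-letter codes for the 20 natural amino acids; the function name `expand_sequence` and the short test sequence are invented for this example and do not come from any particular tool.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def expand_sequence(sequence: str) -> list[str]:
    """Expand a one-letter protein sequence into three-letter residue names.

    Hypothetical helper: it rejects anything outside the 20 standard
    letters, which is exactly where non-natural residues become a problem.
    """
    expanded = []
    for position, letter in enumerate(sequence.upper(), start=1):
        if letter not in ONE_TO_THREE:
            raise ValueError(f"Unknown residue {letter!r} at position {position}")
        expanded.append(ONE_TO_THREE[letter])
    return expanded


# Made-up five-residue fragment, for illustration only.
print(expand_sequence("MKTAY"))
```

Each three-letter name stands for a residue of known chemical structure, which is why a single letter is enough for natural sequences and stops being enough once a residue is modified.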

The new modalities challenge

In both the small-molecule and the large natural molecule (biologics) use cases, the main persona is clearly identifiable: chemists or biologists, as a very rough simplification. Each has access to the right type of representation of the molecules to do their work.

With the introduction of non-natural building blocks into biologics, the representation of the new modalities is more challenging. This is due to the different needs that different personas have of the same entity. Putting it very simplistically: chemists want to see the chemical structure of the unnatural modification whilst biologists want to see the large molecule structure. When they were working on separate entities, getting the right representation for the right type of entity was not a problem. Now large and small molecule representations need to be combined on the same entity. The personas who previously worked relatively separately are now working closely together, or are actually the same scientist with combined skills.

Take the example of Antibody-Drug Conjugates (ADCs), which are very promising and are being worked on by many R&D groups: the antibody part is huge and would be difficult to represent chemically, while the linker and payload drug are small molecules for which the molecular structure is very important.

Another important piece of information is where the linker is attached on the antibody, how many attachment sites are available and how many are actually linked. This position information is crucial when investigating the binding, the mechanism of action and the degradation of the ADC.

We are here facing the dilemma of which representation to adopt for these types of compounds to serve the needs of the scientists: compact enough, yet detailed enough in the right parts of the molecule. The HELM notation, developed in partnership with the Pistoia Alliance, offers the beginning of a solution in my opinion, with a smooth transition between abbreviated residues and their molecular structure and the ability to define abbreviations for novel residues. We need to be careful with new abbreviations, though, as they are not standard and can mean one thing to one person and something else to another.
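To give a feel for what this looks like in practice, here is a rough sketch that splits a simplified HELM-style string into its polymers and connections. It assumes the general layout of the notation, with `$`-separated sections, polymers written as `ID{monomer.monomer}` and multi-letter monomers in square brackets. The function `parse_helm_sketch`, the four-residue peptide and the monomer name `LinkerPayload` are all made up for illustration; this is not the official HELM toolkit and it does not handle inline SMILES, repeats or annotations.

```python
from dataclasses import dataclass


@dataclass
class SimplePolymer:
    polymer_id: str      # e.g. "PEPTIDE1" or "CHEM1"
    monomers: list[str]  # monomer symbols with brackets removed


def parse_helm_sketch(helm: str) -> tuple[list[SimplePolymer], list[str]]:
    """Split a simplified HELM-style string into polymers and connections.

    Illustrative only: assumes section 1 lists polymers and section 2
    lists connections, both '|'-separated.
    """
    sections = helm.split("$")
    polymers = []
    for chunk in sections[0].split("|"):
        polymer_id, _, body = chunk.partition("{")
        symbols = [m.strip("[]") for m in body.rstrip("}").split(".")]
        polymers.append(SimplePolymer(polymer_id, symbols))
    connection_text = sections[1] if len(sections) > 1 else ""
    connections = [c for c in connection_text.split("|") if c]
    return polymers, connections


# Hypothetical conjugate: residue 2 (a cysteine) of the peptide is linked
# through its side-chain attachment point to a made-up chemical monomer.
example = "PEPTIDE1{A.C.G.K}|CHEM1{[LinkerPayload]}$PEPTIDE1,CHEM1,2:R3-1:R1$$$"
polymers, connections = parse_helm_sketch(example)

for polymer in polymers:
    print(polymer.polymer_id, polymer.monomers)
print(connections)
```

The peptide stays abbreviated as single letters while the chemical part is a named monomer whose structure lives in a monomer library, and the connection records which residue carries it. That is also where the caveat about non-standard abbreviations bites: `LinkerPayload` means nothing unless both parties share the same library entry.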

In conclusion, I wish for a flexible and unified representation.

What I would like to see is a representation that enables collapsing and expanding specific parts of the large molecule, for example the heavy chains and light chains of an antibody, while preserving an annotation of the position of the chemical modification and a clear representation of this modification as a molecular structure, so that both biologists and chemists can work on the same file, avoiding errors and misunderstandings. This may already be available, but I have not yet seen a software tool that does this really smoothly.
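To show the kind of behaviour I have in mind, here is a toy sketch of a data structure where a chain can be shown collapsed or expanded while the modification keeps its position and its small-molecule structure. Everything in it is hypothetical: the class names `Chain` and `Modification`, the eight-residue sequence and the placeholder structure string are invented for this example and do not describe any existing tool or file format.

```python
from dataclasses import dataclass, field


@dataclass
class Modification:
    position: int   # 1-based residue position on the chain
    structure: str  # small-molecule structure, e.g. a SMILES string


@dataclass
class Chain:
    name: str
    sequence: str
    modifications: list[Modification] = field(default_factory=list)

    def render(self, expanded: bool = False) -> str:
        """Show the chain collapsed or expanded; modifications always stay visible."""
        body = self.sequence if expanded else f"<{len(self.sequence)} residues>"
        notes = "; ".join(
            f"{self.sequence[m.position - 1]}{m.position}: {m.structure}"
            for m in self.modifications
        )
        return f"{self.name} {body}" + (f" [{notes}]" if notes else "")


# Made-up fragment with a placeholder structure attached to the cysteine at position 8.
heavy = Chain("Heavy chain", "EVQLVESC", [Modification(8, "CCO")])
light = Chain("Light chain", "DIQMT")

print(heavy.render())
print(heavy.render(expanded=True))
print(light.render())
```

The biologist can keep the chains collapsed and still see that position 8 is modified, while the chemist can read the attached structure without scrolling through the full sequence.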

Maybe I am wrong! I am open to learning from the readers.

Please contact me to discuss.
