From Sequence to Structure to Function.

Someone hands you a protein sequence. It's just letters on a screen – MKTAYIAKQR and so on for maybe three hundred amino acids. They ask what it does.

Twenty years back, you'd need months in the lab. X-ray crystallography, if the protein even crystallises. Functional assays, if you're lucky enough to have a decent hypothesis. Today I can open my laptop, run some software while making lunch, and have a reasonable answer by afternoon. The shift has been remarkable.

We always suspected the sequence held everything. It makes the structure. Structure determines function. Seemed logical. Actually cracking that code, though? Different story entirely.

Why Structure Matters So Much

Proteins are molecular machines. Each does one specific job, and shape is everything. Haemoglobin has pockets sized perfectly for oxygen molecules. Change that geometry even slightly, and you get sickle cell disease. Antibodies recognise pathogens because their binding sites fit specific molecular patterns. Enzymes catalyse reactions at precise active sites where substrate geometry aligns just right.

Your ribosome produces a linear amino acid chain. Then something interesting happens – the chain folds. No external instructions needed. Hydrophobic residues migrate inward, away from water. Hydrophilic ones stay on the surface. Segments twist into alpha helices or fold into beta sheets. Pure chemistry is driving the process.

Christian Anfinsen demonstrated this in 1961. He completely unfolded a protein, then left it alone. It refolded perfectly into its original structure. No chaperones, no cellular machinery – just the amino acid sequence and thermodynamics. That proved the sequence contains all structural information. Nobel Prize work, and rightfully so.

But there's a catch. Information existing in the sequence and being able to extract it are completely different problems. A three-hundred-residue protein can adopt an astronomical number of possible conformations. Something like 10^300 theoretical structures. That number is meaningless – it's just incomprehensibly large.

Experimental structure determination takes months, sometimes years. X-ray crystallography requires crystals, which some proteins refuse to form. We've sequenced billions of proteins by now. Solving all those structures experimentally? Impossible.

Computers Enter the Picture

Computational prediction efforts go back decades. Early methods were straightforward – find a similar sequence with known structure, use it as a template. Homology modelling worked reasonably well for close evolutionary relatives. Novel folds? Predictions were mostly wrong.

2020 changed everything.

DeepMind released AlphaFold2. I remember reading the CASP14 results – the biennial structure prediction competition – and being genuinely stunned. AlphaFold2 achieved 0.96 Angstrom median accuracy for backbone atoms (Jumper et al., 2021). Second place got 2.8 Angstroms. That's the difference between clinically useful and essentially useless predictions.

The approach was fundamentally different. Instead of simulating physics or matching templates, AlphaFold2 learned from evolutionary patterns. Related proteins across species show co-evolution at specific sequence positions. When amino acid 50 mutates, position 200 mutates too. Consistently. They're spatially close in the folded structure despite being distant in the linear sequence. They must change together to maintain structural integrity. AlphaFold2 learned these co-evolutionary signals and converted them to three-dimensional coordinates.

Impact was immediate. By 2022, AlphaFold predicted structures for over 200 million proteins (Tunyasuvunakool et al., 2021). Essentially, every sequence in major databases. Researchers who waited years for structures got predictions within hours. Structural coverage for human proteins jumped from roughly 17% to over 50% of residues. Transformative doesn't quite capture it.

From Structure to Function

Structure is invaluable, but it doesn't directly reveal function. You can examine a 3D model, identify potential binding pockets, and note surface properties. The actual biological role requires more investigation.

Bioinformatics offers multiple approaches here. Molecular docking simulations test protein-ligand interactions across millions of configurations. Network methods analyse residue connectivity patterns, treating structures as graphs. Tools like DeepFRI use graph neural networks combining sequence and structural data for function prediction (Gligorijević et al., 2021), achieving strong accuracy.

Modern analyses integrate diverse evidence – evolutionary conservation, structural features, molecular dynamics, and annotations from homologs. Recent work on FMO enzymes illustrates this well. These enzymes metabolise drugs, and researchers wanted to understand substrate specificity. They combined AlphaFold2 predictions with docking simulations to identify residues controlling selectivity (Hussain & Brooks, 2024). Computational analysis first, experimental validation second. That's become standard practice.

Engineering Better Proteins

This isn't just academic. Protein engineering depends entirely on understanding sequence-structure-function relationships.

Historically, you had two approaches. Directed evolution generates massive variant libraries, screening for desired properties. Powerful but blind – no structural knowledge guiding the search. Rational design makes targeted mutations based on mechanistic understanding. Precise, but requires having that structure first.

Fast structure prediction changed the equation. You can now combine approaches – generate variants while using structural information to guide exploration. Map plus compass instead of wandering blind.

Pharmaceutical companies model how mutations affect drug binding and stability before synthesis. Biotech firms engineer enzymes for industrial applications – plastic degradation, biofuel production, chemical synthesis. Medical researchers study disease mutations, understanding how they disrupt normal protein function. Computational hypothesis testing before lab work accelerates everything substantially.

Current Computational Tools

The available infrastructure would have seemed like a fantasy ten years ago. Protein language models trained on millions of sequences learned evolutionary patterns automatically. ESMFold predicts structures in minutes rather than hours, trading modest accuracy for speed.

Function prediction methods have diversified. Sequence-based approaches identify conserved motifs and domains. Structure-based methods analyse binding pockets and geometric features. Systems approaches examine pathway integration. Machine learning frameworks integrate all data types simultaneously.

Accessibility has improved dramatically. ColabFold lets anyone with internet access run AlphaFold2 without installations. Numerous web servers provide predictions and visualisations without requiring programming knowledge. That matters enormously for experimental biologists wanting structural insights.

Remaining Challenges

AlphaFold2 excels with globular, well-folded proteins. Not everything follows those rules. Intrinsically disordered proteins lack stable structures but perform critical functions – still computationally challenging. Multi-chain complexes can be difficult. Proteins undergoing large conformational changes present ongoing problems.

We've largely solved static structures. Proteins aren't static. They're dynamic machines. Understanding conformational ensembles and movements remains hard. Function prediction accuracy varies significantly. Well-studied enzyme families get excellent predictions. Rare or poorly annotated functions? Much less reliable. Post-translational modifications, cellular localisation, context-dependent interactions – all add complexity beyond static structures.

Active research addresses these gaps. Flexibility and dynamics prediction are advancing. Protein-protein interaction models improve steadily. AlphaFold3, released in 2024, handles complexes with proteins, DNA, RNA, and small molecules. Next-generation tools will incorporate more biological complexity – membrane environments, pH effects, and molecular crowding.

Why It Matters

This was molecular biology's grand challenge. Bioinformatics transformed years of work into hours. Experiments remain essential – predictions guide research, experiments provide ground truth. The combination is powerful.

We once observed sequence-structure-function relationships only retrospectively, after extensive experiments. Now we predict prospectively. Methods keep improving. Our capacity to understand and engineer biological systems expands continuously.

Drug discovery accelerates. Synthetic biology becomes more predictable. Disease understanding deepens. Biotechnology develops sustainable processes. All because we can computationally decode sequence to structure to function.

Ten years ago, this seemed nearly impossible. Now we routinely predict structures and infer functions from sequence data. We're just beginning to explore what becomes possible.

References

. . .

Discus