Genomics

Large genome model: Open source AI trained on trillions of bases

AI, Biology, DNA, genes, Genetics, Genomics, Science, training / Kris Guyer / March 4, 2026

System can identify genes, regulatory sequences, splice sites, and more.

Late in 2025, we covered the development of an AI system called Evo that was trained on massive numbers of bacterial genomes, so many that, when prompted with sequences from a cluster of related genes, it could correctly identify the next one or suggest a completely novel protein.

That system worked because bacteria tend to cluster related genes together—something that’s not true in organisms with complex cells, which tend to have equally complex genome structures. Given that, our coverage noted, “It’s not clear that this approach will work with more complex genomes.”

Apparently, the team behind Evo viewed that as a challenge, because today it is describing Evo 2, an open source AI that has been trained on genomes from all three domains of life (bacteria, archaea, and eukaryotes). After training on trillions of base pairs of DNA, Evo 2 developed internal representations of key features in even complex genomes like ours, including things like regulatory DNA and splice sites, which can be challenging for humans to spot.

Genome features

Bacterial genomes are organized along relatively straightforward principles. Any genes that encode proteins or RNAs are contiguous, with no interruptions in the coding sequence. Genes that perform related functions, like metabolizing a sugar or producing an amino acid, tend to be clustered together, allowing them to be controlled by a single, compact regulatory system. It’s all straightforward and efficient.

Eukaryotes are not like that. The coding sections of genes are interrupted by introns, which don’t encode anything. The genes are regulated by sequences that can be scattered across hundreds of thousands of base pairs. The sequences that define the edges of introns or the binding sites of regulatory proteins are all weakly defined: while they have a few bases that are absolutely required, many other positions just have an above-average tendency to have a specific base (something like “45 percent of the time it’s a T”). Surrounding all of this in most eukaryotic genomes is a huge amount of DNA that has been termed junk: inactive viruses, terminally damaged genes, and so on.
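To make the idea of a "weakly defined" sequence concrete, here is a minimal sketch of how a conventional tool might score candidate sites with a position weight matrix. The four-base motif and its frequencies are invented for illustration (they are not real splice-site data): two positions are nearly invariant, and two only lean toward a particular base.

```python
import math

# Hypothetical per-position base frequencies for a 4-base motif.
# Positions 0 and 1 are nearly invariant; positions 2 and 3 only
# lean toward one base (e.g. "45 percent of the time it's a T").
MOTIF = [
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.25, "C": 0.15, "G": 0.15, "T": 0.45},
    {"A": 0.40, "C": 0.20, "G": 0.20, "T": 0.20},
]
BACKGROUND = 0.25  # assume all four bases are equally likely elsewhere


def log_odds(window):
    """Sum of log2(motif frequency / background frequency) over the window."""
    return sum(math.log2(MOTIF[i][base] / BACKGROUND)
               for i, base in enumerate(window))


def scan(seq, threshold=0.0):
    """Return (start, score) for every window scoring above the threshold."""
    k = len(MOTIF)
    hits = []
    for start in range(len(seq) - k + 1):
        score = log_odds(seq[start:start + k])
        if score > threshold:
            hits.append((start, round(score, 2)))
    return hits


print(scan("ACGTTAAC"))
```

Because the "soft" positions contribute only small amounts to the score, a threshold loose enough to catch real sites will also let through many chance matches, which is why error rates add up across a genome billions of bases long.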

That complexity has made eukaryotic genomes more difficult to interpret. And, while a lot of specialized tools have been developed to identify things like splice sites, they’re all sufficiently error-prone that it becomes a problem when you’re analyzing something as large as a 3 billion-base-long genome. We can learn a lot more by making evolutionary comparisons and looking for sequences that have been conserved, but there are limits to that, and we’re often as interested in the differences between species.

These sorts of statistical probabilities, however, are well-suited to neural networks, which are great at recognizing subtle patterns that can be impossible to pick out by eye. But you’d need absolutely massive amounts of data and computing time to process it and pick out some of these subtle features.

We now have the raw genome data that the process needs. Putting together a system to feed it into an effective AI training program, however, remained a challenge. That’s the challenge the team behind Evo took on.

Training a large genome model

The foundation of the Evo 2 system is a convolutional neural network called StripedHyena 2. The training took place in two stages. The initial stage focused on teaching the system to identify important genome features by feeding it sequences rich in them in chunks about 8,000 bases long. After that, there was a second stage in which sequences were fed a million bases at a time to provide the system the opportunity to identify large-scale genome features.

The researchers trained their system using a dataset called OpenGenome2, which contains 8.8 trillion bases from all three domains of life, as well as viruses that infect bacteria. They did not include viruses that attack eukaryotes, out of concern that the system could be misused to create threats to humans. Two versions were trained: one with 7 billion parameters tuned using 2.4 trillion bases, and the full version, with 40 billion parameters, trained on the full OpenGenome2 dataset.

The logic behind the training is pretty simple: if something’s important enough to have been evolutionarily conserved across a lot of species, it will show up in multiple contexts, and the system should see it repeatedly during training. “By learning the likelihood of sequences across vast evolutionary datasets, biological sequence models capture conserved sequence patterns that often reflect functional importance,” the researchers behind the work write. “These constraints allow the models to perform zero-shot prediction without any task-specific fine-tuning or supervision.”

That last aspect is important. We could, for example, tell it about what known splice sites look like, which might help it pick out additional ones. But that might make it harder for it to recognize any unusual splice sites that we haven’t recognized yet. Skipping the fine-tuning might also help it identify genome features that we’re not aware of at all at the moment, but which could become apparent through future research.

All of this has now been made available to the public. “We have made Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset,” the paper announces.

The researchers also used a system that can identify internal features in neural networks to poke around inside of Evo 2 and figure out what things it had learned to recognize. They trained a separate neural network to recognize the firing patterns in Evo 2 and identify high-level features in it. It clearly recognized protein-coding regions and the boundaries of the introns that flanked them. It was also able to recognize some structural features of proteins within the coding regions (alpha helices and beta sheets), as well as mutations that disrupt their coding sequence. Even something like mobile genetic elements (which you can think of as DNA-level parasites) ended up with a feature within Evo 2.

What is this good for?

To test the system, the researchers started making single-base mutations and fed them into Evo 2 to see how it responded. Evo 2 could detect problems when the mutations affected the sites in DNA where transcription into RNA started, or the sites where translation of that RNA into protein started. It also recognized the severity of mutations. Those that would interrupt protein translation, such as the introduction of stop signals, were identified as more significant changes than those that left the translation intact.
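The general logic of this kind of test, scoring a mutation by how much it lowers the likelihood a sequence model assigns to the sequence, can be sketched with a toy stand-in. The code below uses a tiny Markov model trained on a made-up repeating sequence; it is not Evo 2 and does not use its API, and every name in it is hypothetical.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Fit a toy k-th order Markov model; returns a function prob(context, base)."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})
        total = sum(c.values()) + pseudocount * len(BASES)
        return (c.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(prob, seq, order=2):
    """Log-probability of seq under the model (ignoring the first `order` bases)."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))


def delta_score(prob, ref, pos, alt, order=2):
    """Mutant log-likelihood minus reference log-likelihood.

    More negative values mean the model finds the mutation more surprising.
    """
    mutant = ref[:pos] + alt + ref[pos + 1:]
    return log_likelihood(prob, mutant, order) - log_likelihood(prob, ref, order)


# Made-up "training data": a sequence with a strongly conserved repeat.
model = train_markov(["ATGGCC" * 10])
reference = "ATGGCCATGGCC"
print(delta_score(model, reference, 2, "T"))  # negative: breaks the learned pattern
```

No labels about which mutations are harmful are used anywhere; the score comes entirely from patterns the model saw during training, which is the sense in which this style of prediction is "zero-shot."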

It also recognized when sequences weren’t translated at all. Many key cellular functions are carried out directly by RNAs, and Evo 2 was able to recognize when mutations disrupted those, as well.

Impressively, the ability to recognize features in eukaryotic genomes occurred without the loss of its ability to recognize them in bacteria and archaea. In fact, the system seemed to be able to work out what species it was working in. A number of evolutionary groups use genetic codes with a different set of signals to stop the translation of proteins. Evo 2 was able to recognize when it was looking at a sequence from one of those species, and used the correct genetic code for them.
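As an illustration of why the choice of genetic code matters, the sketch below counts how many codons are read before the first stop signal under three different codes. The stop-codon sets follow the NCBI translation tables noted in the comments; the example sequences and function name are made up.

```python
# Stop codons under three genetic codes (NCBI translation table numbers).
STOP_CODONS = {
    "standard": {"TAA", "TAG", "TGA"},  # table 1
    "ciliate_nuclear": {"TGA"},         # table 6: TAA and TAG encode glutamine
    "mycoplasma": {"TAA", "TAG"},       # table 4: TGA encodes tryptophan
}


def codons_before_stop(seq, code="standard"):
    """Count codons read in frame from position 0 before the first stop codon."""
    stops = STOP_CODONS[code]
    n = 0
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            break
        n += 1
    return n


seq = "ATGCAATAACAATGAGGG"  # ATG CAA TAA CAA TGA GGG
for code in STOP_CODONS:
    print(code, codons_before_stop(seq, code))
```

Reading the same DNA with the wrong table either cuts a protein short or runs past its real end, so a model that infers which code applies avoids flagging normal sequences as damaged.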

It was also good at recognizing features that tolerate a lot of variability, such as sites that signal where to splice RNAs to remove introns from the coding sequence of proteins. By some measures, it was better than software specialized for that task. The same was true when evaluating mutations in the BRCA2 gene, where many of the mutations are associated with cancer. Given additional training on known BRCA2 mutations, its performance improved further.

Overall, Evo 2 seems great for evaluating genomes and identifying key features. The researchers who built it suggest it could serve as a good automated tool for preliminary genome annotation.

But the striking thing about the early version of Evo was that, when prompted with a chunk of sequence that includes known bacterial genes, some of its responses included entirely new proteins with related functions. Now that it was trained on more complex eukaryotic genes, could it do the same?

We don’t entirely know. If given a bunch of DNA from yeast (a eukaryote), it would respond with a sequence that included functional RNAs, and gene-like sequences with regulatory information and splice sites. But the researchers didn’t test whether any of the proteins did anything in particular. And it’s difficult to see how they could even do that test. With bacterial genes, they could safely assume that the AI-generated gene should be doing something related to the nearby genes. But that’s generally not the case in eukaryotes, so it’s difficult to guess what functions they should even test for.

In a somewhat more informative test, the researchers asked Evo 2 to make some regulatory DNA that was active in one cell type and not another after giving it information about what sequences were active in both those cell types. The sequences that came out were then inserted into these cells and tested, but the results were pretty weak, with only 17 percent having activity that differed by a factor of two or more between the two cell types. That’s still an achievement, but it isn’t in the same realm as designing brand new proteins.

What’s next?

Overall, given that this has come out less than four months after the paper describing the original Evo, it’s not at all surprising that there wasn’t more work done to test what Evo 2 can do for designing biologically relevant DNA sequences. Biology experiments are hard and time-consuming, and it’s not always easy to judge in advance which ones will provide the most compelling information. So we’ll probably have to wait months to years to find out whether the community finds interesting things to do with Evo 2, and whether it’s good at solving any useful protein design problems.

There’s also the question of whether further training and specialization can create Evo 2 relatives that are especially good at specific tasks, such as evaluating genomes from cancer cells or annotating newly sequenced genomes. To an extent, it appears the research team wanted to get this out so that others could start exploring how it might be put to use; that’s consistent with the fact that all of the software was made available.

The big open question is whether this system has identified anything that we don’t know how to test for. Things like intron/exon boundaries and regulatory DNA have been subjected to decades of study so that we already knew how to look for them and can recognize when Evo 2 spots them. But we’ve discovered a steady stream of new features in the genome—CRISPR repeats, microRNAs, and more—over the past decades. It remains technically possible that there are features in the genome we’re not aware of yet, and Evo 2 has picked them out.

It’s possible to imagine ways to use the tools described here to query Evo 2 and pick out new genome features. So I’m looking forward to seeing what might ultimately come out of that sort of work.

Nature, 2026. DOI: 10.1038/s41586-026-10176-5.

John is Ars Technica’s science editor. He has a Bachelor of Arts in Biochemistry from Columbia University, and a Ph.D. in Molecular and Cell Biology from the University of California, Berkeley. When physically separated from his keyboard, he tends to seek out a bicycle, or a scenic location for communing with his hiking boots.

Neanderthals seemed to have a thing for modern human women

Biology, evolution, Genomics, human evolution, Neanderthals, Science / Rejus Almole / February 26, 2026

By now, it’s firmly established that modern humans and their Neanderthal relatives met and mated as our ancestors expanded out of Africa, resulting in a substantial amount of Neanderthal DNA scattered throughout our genome. Less widely recognized is that some of the Neanderthal genomes we’ve seen have pieces of modern human DNA as well.

Not every modern human has the same set of Neanderthal DNA, however; different people will, by chance, have inherited different fragments. But there are also some areas, termed “Neanderthal deserts,” where none of the Neanderthal DNA seems to have persisted. Notably, the largest Neanderthal desert is the entire X chromosome, raising questions about whether this reflects the evolutionary fitness of genes there or mating preferences.

Now, three researchers at the University of Pennsylvania, Alexander Platt, Daniel N. Harris, and Sarah Tishkoff, have done the converse analysis: examining the X chromosomes of the handful of completed Neanderthal genomes we have. It turns out there’s a strong bias toward modern human sequences there as well, and the authors interpret that as evidence of selective mating, with Neanderthal males showing a strong preference for modern human females and their descendants.

What type of selection are we looking at?

Given how long modern humans and Neanderthals had been evolving as separate populations, some degree of genetic incompatibility is definitely possible. Lots of proteins interact in various ways, and the genes behind these interaction networks will evolve together—a change in one gene will often lead to compensatory changes in other genes in the network. Over time, those changes may mean re-introducing the original gene will actually disrupt the network, with a negative impact on fitness.

That means the introduction of some Neanderthal genes into the modern human genome (or vice versa) would be disruptive and make carriers of them less fit. So they’d be selected against and lost over the ensuing generations. Of course, some segments would likely be lost at random—the genome’s pretty big, and the modern human population was likely large and growing, allowing its DNA to dilute out the influence of other human populations. Figuring out which influence is dominant can be challenging.

Have we leapt into commercial genetic testing without understanding it?

ethics, genetic testing, Genetics, Genomics, Science / Kris Guyer / February 21, 2026

A new book argues that tests might reshape human diversity even if they don’t work.

Daphne O. Martschenko and Sam Trejo both want to make the world a better, fairer, more equitable place. But they disagree on whether studying social genomics—elucidating any potential genetic contributions to behaviors ranging from mental illnesses to educational attainment to political affiliation—can help achieve this goal.

Martschenko’s argument is largely that genetic research and data have almost always been used thus far as a justification to further entrench extant social inequalities. But we know the solutions to many of the injustices in our world—trying to lift people out of poverty, for example—and we certainly don’t need more genetic research to implement them. Trejo’s point is largely that more information is generally better than less. We can’t foresee the benefits that could come from basic research, and this research is happening anyway, whether we like it or not, so we may as well try to harness it as best we can toward good and not ill.

Obviously, they’re both right. In What We Inherit: How New Technologies and Old Myths Are Shaping Our Genomic Future, we get to see how their collaboration can shed light on our rapidly advancing genetic capabilities.

An “adversarial collaboration”

Trejo is a (quantitative) sociologist at Princeton; Martschenko is a (qualitative) bioethicist at Stanford. He’s a he, and she’s a she; he looks white, she looks black; he’s East Coast, she’s West. On the surface, it seems clear that they would hold different opinions. But they still chose to spend 10 years writing this book in an “adversarial collaboration.” While they still disagree, by now at least they can really listen to and understand each other. In today’s world, that seems pretty worthwhile in and of itself.

The titular “What we inherit” refers to both actual DNA (Trejo’s field) and the myths surrounding it (Martschenko’s). There are two “genetic myths” that most concern them. One is the Destiny Myth: the notion, first promulgated by Francis Galton in his 1869 book Hereditary Genius, that the effects of DNA are separable from the effects of environment. He didn’t deny the effects of nurture; he just erroneously pitted it against nature, as if it were one versus the other instead of each impacting and working through the other. (The most powerful “genetic” determinant of educational attainment in his world was a Y chromosome.) His ideas reached their apotheosis in the forced sterilizations of the eugenics movement in the early 20th century in the US and, eventually, in the policies of Nazi Germany.

The other genetic myth the authors address is the Race Myth, “the false belief that DNA differences divide humans into discrete and biologically distinct racial groups.” (Humans can be genetically sorted by ancestry, but that’s not quite the same thing.) But they spend most of the book discussing polygenic scores, which sum up the impact of lots of small genetic influences. They cover what they are, their strengths and weaknesses, their past, present, and potential uses, and how and how much their use should be regulated. And of course, their ultimate question: Are they worth generating and studying at all?

One thing they agree on is that science education in this country is abysmal and needs to be improved immediately. Most people’s knowledge of genetics is stuck at Mendel and his green versus yellow, smooth versus wrinkled peas: dominant and recessive traits with manifestations that can be neatly traced in Punnett squares. Alas, most human traits are much more complicated than that, especially the interesting ones.

Polygenic scores: uses and abuses

Polygenic scores tally the contributions of many genes to particular traits to predict certain outcomes. There’s no single gene for height, depression, or heart disease; there are a bunch of genes that each make very small contributions to making an outcome more or less likely. Polygenic scores can’t tell you that someone will drop out of high school or get a PhD; they can just tell you that someone might be slightly more or less likely to do so. They are probabilistic, not deterministic, because people’s mental health and educational attainment and, yes, even height, are determined by environmental factors as well as genes.
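In its simplest form, a polygenic score is just a weighted sum: for each variant, the number of copies a person carries (0, 1, or 2) is multiplied by that variant’s estimated effect, and the results are added up. The sketch below shows the arithmetic; the variant names and effect sizes are invented for illustration and do not come from any real study.

```python
# Hypothetical per-variant effect sizes (the kind of estimates a
# genome-wide study would produce). Names and values are made up.
EFFECT_SIZES = {
    "variant_a": 0.02,
    "variant_b": -0.01,
    "variant_c": 0.005,
}


def polygenic_score(genotype, effects=EFFECT_SIZES):
    """Sum of effect size times allele count (0, 1, or 2) for each variant.

    Variants missing from the genotype are treated as zero copies.
    """
    return sum(weight * genotype.get(variant, 0)
               for variant, weight in effects.items())


person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
print(polygenic_score(person))
```

The output is a single number that shifts a probability slightly; nothing in the calculation accounts for environment, which is one reason the scores are probabilistic rather than deterministic.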

Polygenic scores, besides only giving predictions, (a) are not that accurate by nature; (b) become less accurate for each trait if you select for more than one trait, like height and intelligence; and (c) are less accurate for those not of European descent, since most genetic studies have thus far been done only with Europeans. So right out of the gate, any potential benefits of the technology will be distributed unevenly.

Another thing that Martschenko and Trejo agree on is that the generation, sale, and use of polygenic scores must be regulated much more assiduously than they currently are to ensure that they are implemented responsibly and equitably. “While scientists and policymakers are guarding the front gate against gene editing, genetic embryo selection (using polygenic scores) is slipping in through the backdoor,” they write. Potential parents using IVF have long been able to choose which embryos to implant based on gender and the presence of very clear-cut genetic markers for certain serious diseases. Now, they can choose which embryos they want to implant based on their polygenic scores.

In 2020, a company called Genomic Prediction started offering genomic scores for diabetes, skin cancer, high blood pressure, elevated cholesterol, intellectual disability, and “idiopathic short stature.” They’ve stopped advertising the last two “because it’s too controversial.” Not, mind you, because the effects are minor and the science is unreliable. The theoretical maximum polygenic score for height would make a difference of 2.5 inches, and that theoretical maximum has not been seen yet, even in studies of Europeans. Polygenic scores for most other traits lag far behind. (And that’s just one company; another called Herasight has since picked up the slack and claims to offer embryo selection based on intelligence.)

Remember, the more traits one selects for, the less accurate each prediction is. Moreover, many genes affect multiple biological processes, so a gene implicated in one undesirable trait may have as yet undefined impacts on other desirable ones.

And all of this is ignoring the potential impact of the child’s environment. The first couple who used genetic screening for their daughter opted for an embryo that had a reduced risk of developing heart disease; her risk was less than 1 percent lower than that of the three embryos they rejected. Feeding her vegetables and sticking her on a soccer team would have been cheaper and probably more impactful.

The risks of reduced genetic diversity

Almost every family I know has a kid who has taken growth hormones, and plenty of them get tutoring, too. These interventions are hardly equitably distributed. But if embryos are selected based on polygenic scores, the authors fear that a new form of social inequality can arise. While growth hormone injections affect only one individual, embryonic selection based on polygenic scores affects all of that embryo’s descendants going forward. So the chosen embryos’ progeny could eventually end up treated as a new class of optimized people whose status might be elevated simply because their parents could afford to comb through their embryonic genomes—regardless of whether their “genetic” capabilities are actually significantly different from everyone else’s.

While it is understandable that parents want to give their kids the best chance of success, eliminating traits that they find objectionable will make humanity as a whole more uniform and society as a whole poorer for the lack of heterogeneity. Everyone can benefit from exposure to people who are different from them; if everyone is bred to be tall, smart, and good-looking, how will we learn to tolerate otherness?

Polygenic embryo selection is currently illegal in the UK, Israel, and much of Europe. In 2024, the FDA made some noise about planning to regulate the market, but for now companies offering polygenic scores to the public fall under the same nonmedical category as nutritional supplements—i.e. not regulatable. These companies advertise scores for traits like musical ability and acrophobia, but only for “wellness” or “educational” purposes.

So Americans are largely at the mercy of corporations that want to profit off of them at least as much as they claim to want to help them. And because this is still in the private sector, people who have the most social and environmental advantages—wealthy people with European ancestry—are often the only ones who can afford to try to give their kids any genetic advantages that might be had, further entrenching those social inequalities and potentially creating genetic inequalities that didn’t exist before. Hopefully, these parents will just be funding the snake-oil phase of the process so that if we can ever generate enough data to make polygenic scores actually reliable at predicting anything meaningful, they will be inexpensive enough to be accessible to anyone who wants them.

Bringing the “functionally extinct” American chestnut back from the dead

American chestnut, Biology, extinction, genetic engineering, Genetics, Genomics, Invasive species, pathogens, plants, Science, trees / Tim Belzer / February 12, 2026

Wiped out in its native range by invasive pathogens, the trees may make a comeback.

Very few people alive today have seen the Appalachian forests as they existed a century ago. Even as state and national parks preserved ever more of the ecosystem, fungal pathogens from Asia nearly wiped out one of the dominant species of these forests, the American chestnut, killing an estimated 3 billion trees. While new saplings continue to sprout from the stumps of the former trees, the fungus persists, killing them before they can seed a new generation.

But thanks in part to trees planted in areas where the two fungi don’t grow well, the American chestnut isn’t extinct. And efforts to revive it in its native range have continued, despite the long generation times needed to breed resistant trees. In Thursday’s issue of Science, researchers describe their efforts to apply modern genomic techniques and exhaustive testing to identify the best route to restoring chestnuts to their native range.

Multiple paths to restoration

While the American chestnut is functionally extinct (it’s no longer a participant in the ecosystems it once dominated), it’s most certainly not extinct. Two Asian fungi have killed it off in its native range; one causes chestnut blight, while a less common pathogen causes a root rot disease. Both prefer warmer, humid environments and persist there because they can grow asymptomatically on distantly related trees, such as oaks. Still, chestnuts planted outside the species’ original range, primarily in drier areas of western North America, have continued to thrive.

There is also a virus that attacks the chestnut blight fungus, allowing a few trees to survive in areas where that virus is common. Finally, a handful of trees have grown to maturity in the American chestnut’s original range. These trees, which the paper refers to as LSACs (large surviving American chestnuts), suggest that there might have been some low level of natural resistance within the now-vanished population.

Those trees are central to one of the efforts to restore the American chestnut. If enough of them have distinct means of resisting the fungi, interbreeding them might produce a strain that not only survives the fungi but can also thrive in the Appalachians.

A related approach took advantage of the fact that the American chestnut can produce fertile hybrids with the Chinese chestnut, which had co-evolved with the introduced fungi and was thus resistant to lethal infections. The hope was that continued back-breeding of these hybrids with American chestnuts would result in trees that were very similar to American chestnuts yet retained the fungal resistance of their Asian cousins.

Both efforts suffered from the same problem that faces any biologist working on trees: They are slow-growing and can take years to reach a size at which they produce seeds. The situation was further complicated by the fact that the American chestnut can’t pollinate itself, so you need at least two trees before any breeding is possible.

Concerned about what this might mean for the potential reintroduction of the chestnut into the Appalachians, a third project turned to biotechnology. Research had identified oxalic acid as a key factor in the blight’s virulence. Wheat naturally produces an enzyme that degrades oxalic acid, and researchers inserted the gene that encodes that enzyme into the American chestnut genome, creating a genetically modified tree that can potentially disarm the fungus’ attack.

Without understanding the nature of resistance or the effectiveness of the transgenic gene, there’s no way to know which method would be most effective. So researchers from the American Chestnut Foundation assembled a massive collaboration to examine all these options and determine what would be needed to reintroduce blight-resistant chestnuts into the wild.

Tracking resistance

The scale of the effort is immense. All told, the team infected over 4,000 individual trees with the blight fungus and tracked their growth in Appalachian nurseries for an average of over 14 years. The trees were scored for resistance on a zero-to-100 scale based on the damage caused by the infection. This data was combined with some serious lab work; the team produced the highest-quality chestnut genomes yet (of both American and Chinese species) and gathered biochemical data on how the trees respond to infection.

It quickly became apparent that there were significant differences in the growth rates of some of the resistant trees. When planted at sites where viruses kept the blight in check, the Chinese chestnuts grew more slowly than native trees, while hybrids grew at an intermediate rate. That could make a big difference, as rapid growth may have enabled the chestnut to reach its former dominance of the canopy.

Somewhat surprisingly, this slow growth turned out to be a problem for the genetically modified American chestnuts as well. By chance, the wheat gene ended up being inserted into a gene known to be important for the growth of other plants. It seems to be important in the chestnut as well; plants with two copies of the inserted gene survived at 16 percent of their expected rate, and those with a single copy grew 22 percent slower than unmodified trees.

That said, there was a lot of variability among the genetically modified trees, with 4 percent of the tested trees showing both high blight resistance and growth comparable to that of unmodified American chestnuts. It will be important to determine whether this collection of traits remains consistent in ensuing generations.

In a bit of good news, the progeny from surviving American chestnuts grew like American chestnuts. In less good news, among 143 of these trees, only seven had resistance levels above 50 on the team’s 100-point scale. It’s possible that interbreeding these trees could further boost resistance, but it also poses the risk of creating a population that’s too inbred to thrive after reintroduction.

Root causes

The research team decided to use their testing to investigate the genetic basis of resistance. There’s a very practical reason for this: If resistance is mediated by just a handful of genes that each have large impacts, it should be possible to continue breeding resistant strains back to regular American chestnuts and selecting for resistance. But if there are many factors with relatively small impacts, it will require directed interbreeding of hybrids to maximize both resistance and DNA originating from the American chestnut.

The team completed the highest-quality chestnut genomes for both the American and Chinese species, identifying about 25,000 to 30,000 genes in the different assemblies. They then used this information for two types of genetic analysis: quantitative trait locus identification and genome-wide association. Both approaches aim to identify regions of the genome associated with specific properties and estimate their impact.

The work suggested that resistance arises from a relatively large number of sites, each with relatively minor effects. For example, the sites in the genome identified by quantitative trait analysis typically boosted resistance by about 10 points on the researchers’ 100-point scale. In the genome-wide analysis, 17 individual genetic differences were associated with about a quarter of the heritable resistance traits. All of this suggests that, for the hybrids (and likely for the weaker blight resistance found in surviving American chestnuts), directed breeding among surviving trees will be needed.

For the root rot fungus, in contrast, it looks like there are a limited number of important alleles with a large impact.

The researchers also took an alternative approach to identify resistance factors, comparing 100 chemicals produced by resistant and susceptible strains. Among the 41 chemicals detected at higher levels in the Chinese chestnut, the researchers found a metabolite, lupeol, that completely suppressed the growth of the fungal pathogen. Another, erythrodiol, limited its growth. If we can identify the genes involved in producing those chemicals, we could use that knowledge to guide directed breeding programs—or even engage in gene editing to increase their production.

The team’s current plan is to use genomic predictions to select hybrid seedlings for planting in test orchards, aiming to identify plants with high growth and resistance. From there, the process can be repeated. But even after the exhaustive exploration of resistance traits, the researchers seem to believe that all three approaches—selecting resistant American chestnuts, breeding hybrids derived from Chinese chestnuts, and directed genetic modification—can help bring the American chestnut back.

The researchers warn, though, that as environmental disturbances and invasive species continue to push some key species to the brink of extinction, we need to get better at this kind of species rescue operation.

Science, 2026. DOI: 10.1126/science.adw3225.


Humans in southern Africa were an isolated population until recently

africa, Biology, evolution, Genetics, Genomics, human evolution, Science / Tim Belzer / December 3, 2025

Collectively, the genetic variants in this population are outside the range of previously described human diversity. That’s despite the fact that the present-day southern African hunter-gatherer populations are largely derived from southern African ancestors.

What’s distinct?

Estimates of when this ancient southern African population branched off from any modern-day populations place the split at over 200,000 years ago, or roughly around the origin of modern humans themselves. But this wasn’t some odd, isolated group; estimates of population size based on the frequency of genetic variation suggest it was substantial.

Instead, the researchers suggest that climate and geography kept the group separate from other African populations and that southern Africa may have served as a climate refuge, providing a safe area from which modern humans could expand out to the rest of the continent when conditions were favorable. That’s consistent with the finding that some of the ancient populations in eastern and western Africa contain some southern African variants by around 5,000 years ago.

As far as genetic traits are concerned, the population looked like pretty much everyone else present at the time: brown eyes, high skin pigmentation, and no lactose tolerance. None of the older individuals had the genetic resistance to malaria or sleeping sickness that is found in modern populations. In terms of changes that affect proteins, the most common are found in genes involved in immune function, a pattern that’s seen in many other human populations. More unusually, genes that affect kidney function also show a lot of variation.

So there’s nothing especially distinctive or modern apparent in this population, especially not in comparison to any other populations we know of in Africa at the same time. But they are unusual in that they suggest a large, stable group was present in Africa at the time, isolated from other populations. Over time, we’ll probably get additional evidence that fits this population into a coherent picture of human evolution. But for now, its presence is a bit of an enigma, given how often other populations intermingled in our past.

Nature, 2025. DOI: 10.1038/s41586-025-09811-4.


Many genes associated with dog behavior influence human personalities, too

Behavioral science, Biology, Dogs, Genetics, Genomics, golden retrievers, Science / Kris Guyer / November 26, 2025

Many dog breeds are noted for their personalities and behavioral traits, from the distinctive vocalizations of huskies to the herding of border collies. People have worked to identify the genes associated with many of these behaviors, taking advantage of the fact that dogs can interbreed. But that creates its own experimental challenges, as it can be difficult to separate some behaviors from physical traits distinctive to the breed—small dog breeds may seem more aggressive simply because they feel threatened more often.

To get around that, a team of researchers recently did the largest gene/behavior association study within a single dog breed. Taking advantage of a population of over 1,000 golden retrievers, they found a number of genes associated with behaviors within that breed. A high percentage of these genes turned out to correspond to regions of the human genome that have been associated with behavioral differences as well. But, in many cases, these associations have been with very different behaviors.

Gone to the dogs

The work, done by a team based largely at Cambridge University, utilized the Golden Retriever Lifetime Study, which involved over 3,000 owners of these dogs filling out annual surveys that included information on their dogs’ behavior. Over 1,000 of those owners also had blood samples obtained from their dogs and shipped in; the researchers used these samples to scan the dogs’ genomes for variants. Those were then compared to ratings of the dogs’ behavior on a range of issues, like fear or aggression directed toward strangers or other dogs.

Using the data, the researchers identified regions of the genome where specific variants were frequently associated with specific behaviors. In total, 14 behavioral tendencies were examined; 12 genomic regions were associated with specific behaviors, and another nine showed somewhat weaker associations. For many of these traits, it was difficult to find much because golden retrievers are notoriously friendly and mellow dogs, so they tended to score low on traits like aggression and fear.

That result was significant, as some of these same regions of the genome had been associated with very different behaviors in populations that were a mix of breeds. For example, two different regions associated with touch sensitivity in golden retrievers had been linked to a love of chasing and owner-directed aggression in a non-breed-specific study. That finding suggests that the studies were identifying genes that may be involved in setting the stage for behaviors, but were directed into specific outcomes by other genetic or environmental factors.


AI trained on bacterial genomes produces never-before-seen proteins

AI, bacteria, Biology, Computer science, evolution, Genomics, large language model, Science / Kris Guyer / November 21, 2025

The researchers argue that this setup lets Evo “link nucleotide-level patterns to kilobase-scale genomic context.” In other words, if you prompt it with a large chunk of genomic DNA, Evo can interpret that as an LLM would interpret a query and produce an output that, in a genomic sense, is appropriate for that interpretation.

The researchers reasoned that, given the training on bacterial genomes, they could use a known gene as a prompt, and Evo should produce an output that includes regions that encode proteins with related functions. The key question is whether it would simply output the sequences for proteins we know about already, or whether it would come up with output that’s less predictable.

Novel proteins

To start testing the system, the researchers prompted it with fragments of the genes for known proteins and determined whether Evo could complete them. In one example, if given 30 percent of the sequence of a gene for a known protein, Evo was able to output 85 percent of the rest. When prompted with 80 percent of the sequence, it could return all of the missing sequence. When a single gene was deleted from a functional cluster, Evo could also correctly identify and restore the missing gene.
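One way to picture this kind of test is sketched below: split a known gene into a prompt and a held-back remainder, then score how much of the remainder a generated completion reproduces. This is only an assumption about how such a score could be computed; the excerpt does not say how the researchers measured recovery, a real evaluation would likely use sequence alignment rather than position-by-position matching, and the function names and sequences here are made up.

```python
def split_prompt(gene, prompt_fraction):
    """Split a gene into the prompt and the part the model must supply."""
    cut = int(len(gene) * prompt_fraction)
    return gene[:cut], gene[cut:]

def fraction_recovered(true_rest, generated):
    """Share of held-back positions that the generated text reproduces."""
    if not true_rest:
        return 1.0
    matches = sum(1 for a, b in zip(true_rest, generated) if a == b)
    return matches / len(true_rest)

# Invented 18-base "gene"; half is used as the prompt.
gene = "ATGGCTAAAGGTCTGTAA"
prompt, rest = split_prompt(gene, 0.5)

# Stand-in for a model's output: one base differs from the true remainder.
generated = "GGTCAGTAA"
print(prompt, rest, fraction_recovered(rest, generated))
```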

The large amount of training data also ensured that Evo correctly identified the most important regions of the protein. If it made changes to the sequence, they typically resided in the areas of the protein where variability is tolerated. In other words, its training had enabled the system to incorporate the rules of evolutionary limits on changes in known genes.

So, the researchers decided to test what happened when Evo was asked to output something new. To do so, they used bacterial toxins, which are typically encoded along with an antitoxin that keeps the cell from killing itself whenever it activates the genes. There are a lot of examples of these out there, and they tend to evolve rapidly as part of an arms race between bacteria and their competitors. So, the team developed a toxin that was only mildly related to known ones, and had no known antitoxin, and fed its sequence to Evo as a prompt. And this time, they filtered out any responses that looked similar to known antitoxin genes.


World’s oldest RNA extracted from Ice Age woolly mammoth

ancient DNA, ancient RNA, Biology, evolution, Genomics, mammoths, paleontology, Science / Rejus Almole / November 14, 2025

A young woolly mammoth now known as Yuka was frozen in the Siberian permafrost for about 40,000 years before it was discovered by local tusk hunters in 2010. The hunters soon handed it over to scientists, who were excited to see its exquisite level of preservation, with skin, muscle tissue, and even reddish hair intact. Later research showed that, while full cloning was impossible, Yuka’s DNA was in such good condition that some cell nuclei could even begin limited activity when placed inside mouse eggs.

Now, a team has successfully sequenced Yuka’s RNA—a feat many researchers once thought impossible. Researchers at Stockholm University carefully ground up bits of muscle and other tissue from Yuka and nine other woolly mammoths, then used special chemical treatments to pull out any remaining RNA fragments, which are normally thought to be much too fragile to survive even a few hours after an organism has died. Scientists go to great lengths to extract RNA even from fresh samples, and most previous attempts with very old specimens have either failed or been contaminated.

A different view

The team used RNA-handling methods adapted for ancient, fragmented molecules. Their scientific séance allowed them to explore information that had never been accessible before, including which genes were active when Yuka died. In the creature’s final panicked moments, its muscles were tensing and its cells were signaling distress—perhaps unsurprising since Yuka is thought to have died as a result of a cave lion attack.

It’s an exquisite level of detail, and one that scientists can’t get from just analyzing DNA. “With RNA, you can access the actual biology of the cell or tissue happening in real time within the last moments of life of the organism,” said Emilio Mármol, a researcher who led the study. “In simple terms, studying DNA alone can give you lots of information about the whole evolutionary history and ancestry of the organism under study. Obtaining this fragile and mostly forgotten layer of the cell biology in old tissues/specimens, you can get for the first time a full picture of the whole pipeline of life (from DNA to proteins, with RNA as an intermediate messenger).”


Genetically engineered bacteria break down industrial contaminants

biochemistry, Biology, enzymes, Genetics, Genomics, pollution, Science / Tim Belzer / May 8, 2025

Once that was done, the researchers started looking through the genomes of species that have been identified as breaking down industrial contaminants. The breakdown of complex molecules typically involves more than one enzyme, and the genes for these enzymes tend to end up clustered together so they can be produced as a single, large RNA that encodes all the proteins needed. This simplifies regulating their production, making it easy to ensure the bacteria only make the proteins if the molecule they break down is actually present. In this case, the clusters ranged from just three genes all the way up to 11.

Once nine of these gene clusters were identified, the DNA that would encode them was ordered, and each cluster was assembled into a single DNA molecule in yeast. The researchers took some time while ordering this DNA to better optimize the genes to be active and produce proteins in Vibrio natriegens, rather than in whatever species the genes originally came from.

From yeast, each of these individual gene clusters was inserted into Vibrio natriegens, creating different strains that could digest one of the following: benzene, toluene, phenol, naphthalene, biphenyl, dibenzofuran (DBF), and dibenzothiophene (DBT). (Some of the nine clusters target the same contaminant.) Each of these bacterial strains was then put in a solution with the chemical they were engineered to digest. Five of the nine worked, giving researchers strains that could digest biphenyl, phenol, naphthalene, DBF, and toluene.

Good, but limited

From there, the researchers developed a system that would enable them to iteratively insert a new gene cluster at the tail end of a previously inserted gene cluster. This allowed them to build up a cluster of clusters, eventually including all five of the ones that had shown activity in the earlier tests. Given two days, this single strain could remove about a quarter of the phenol, a third of the biphenyl, 30 percent of the DBF, all of the naphthalene, and nearly all of the toluene.


DNA links modern pueblo dwellers to Chaco Canyon people

ancient DNA, Archeology, Biology, Genetics, Genomics, human biology, Science / Mike M. / May 1, 2025

A thousand years ago, the people living in Chaco Canyon were building massive structures of intricate masonry and trading with locations as far away as Mexico. Within a century, however, the area would be largely abandoned, with little indication that the same culture was re-established elsewhere. If the people of Chaco Canyon migrated to new homes, it’s unclear where they ended up.

Around the same time that construction expanded in Chaco Canyon, far smaller pueblos began appearing in the northern Rio Grande Valley hundreds of kilometers away. These have remained occupied to the present day in New Mexico, although their populations shrank dramatically after European contact. Their relationship to the Chaco culture has remained ambiguous. Until now, that is. People from one of these communities, Picuris Pueblo, worked with ancient DNA specialists to show that they are the closest relatives of the Chaco people yet discovered, confirming aspects of the pueblo’s oral traditions.

A pueblo-driven study

The list of authors of the new paper describing this genetic connection includes members of the Pueblo government, including its present governor. That’s because the study was initiated by the members of the Pueblo, who worked with archeologists to get in contact with DNA specialists at the Center for GeoGenetics at the University of Copenhagen. In a press conference, members of the Pueblo said they’d been aware of the power of DNA studies via their use in criminal cases and ancestry services. The leaders of Picuris Pueblo felt that it could help them understand their origin and the nature of some of their oral history, which linked them to the wider Pueblo-building peoples.

After two years of discussions, the collaboration settled on a plan of research, and the ancient DNA specialists were given access to both ancient skeletons at Picuris Pueblo and samples from present-day residents. These were used to generate complete genome sequences.

The first clear result is that there is a strong continuity in the population living at Picuris. The ancient skeletons range from 500 to 700 years old, and thus date back to roughly the time of European contact, with some predating it. They also share strong genetic connections to the people of Chaco Canyon, where DNA has also been obtained from remains. “No other sampled population, ancient or present-day, is more closely related to Ancestral Puebloans from Pueblo Bonito [in Chaco Canyon] than the Picuris individuals are,” the paper concludes.


The fish with the genome 30 times larger than ours gets sequenced

Biology, evolution, Genetics, Genomics, lungfish, Science, tetrapods / Paul Patrick / August 14, 2024

Image: The African lungfish, showing its thin, wispy fins.

When it was first discovered, the coelacanth caused a lot of excitement. It was a living example of a group of fish that was thought to only exist as fossils. And not just any group of fish. With their long, stalk-like fins, coelacanths and their kin are thought to include the ancestors of all vertebrates that aren’t fish—the tetrapods, or vertebrates with four limbs. Meaning, among a lot of other things, us.

Since then, however, evidence has piled up that we’re more closely related to lungfish, which live in freshwater and are found in Africa, Australia, and South America. But lungfish are a bit weird. The African and South American species have seen the limb-like fins of their ancestors reduced to thin, floppy strands. And getting some perspective on their evolutionary history has proven difficult because they have the largest genomes known in animals, with the South American lungfish genome containing over 90 billion base pairs. That’s 30 times the amount of DNA we have.

But new sequencing technology has made tackling that sort of challenge manageable, and an international collaboration has now completed the largest genome ever sequenced, one where all but one chromosome carry more DNA than is found in the human genome. The work points to a history where the South American lungfish has been adding 3 billion extra bases of DNA every 10 million years for the last 200 million years, all without adding a significant number of new genes. Instead, it seems to have lost the ability to keep junk DNA in check.

Going long

The work was enabled by a technology generically termed “long-read sequencing.” Most previously completed genomes were sequenced using short reads, typically in the area of 100–200 base pairs long. The secret was to do enough sequencing that, on average, every base in the genome should be sequenced multiple times. Given that, a cleverly designed computer program could figure out where two bits of sequence overlapped and register that as a single, longer piece of sequence, repeating the process until the computer spit out long strings of contiguous bases.
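The overlap-and-merge idea can be sketched in a few lines. What follows is a toy greedy assembler written for illustration, not the software used on any real genome: it repeatedly finds the pair of reads with the longest suffix-to-prefix overlap and fuses them. The function names, the `min_len` threshold, and the 18-base "genome" are all invented, and real assemblers also have to cope with sequencing errors, reverse-complement strands, and vastly more data.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=4):
    """Repeatedly merge the two reads that share the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more; what remains are separate pieces
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

def mean_coverage(reads, genome_length):
    """Average number of times each base is sequenced."""
    return sum(len(r) for r in reads) / genome_length

# Invented example: four overlapping 8-base reads from an 18-base sequence.
reads = ["CAGGTACC", "ATGGCATT", "GGTACCGA", "CATTCAGG"]
print(greedy_assemble(reads))
print(mean_coverage(reads, 18))
```

When no pair of reads overlaps by at least `min_len` bases, the loop stops and returns several separate pieces, which is the toy equivalent of an assembly with gaps.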

The problem is that most non-microbial species have stretches of repeated sequence (think hundreds of copies of the bases G and A in a row) that are longer than a few hundred bases, as well as nearly identical sequences that show up in multiple locations in the genome. These would be impossible to match to a unique location, and so the output of the genome assembly software would have lots of gaps of unknown length and sequence.

This creates extreme difficulty for genomes like that of the lungfish, which is filled with non-functional “junk” DNA, which is typically repetitive. The software tends to produce a genome that’s more gap than sequence.

Long-read technology gets around that by doing exactly what its name implies. Rather than being limited to fragments of 200 bases or so, it can generate sequences that are thousands of base pairs long, easily covering the entire repeat that would have otherwise created a gap. One early version of long-read technology involved stuffing long DNA molecules through pores and watching for different voltage changes across the pore as different bases passed through it. Another had a DNA copying enzyme make a duplicate of a long strand while the instrument watched for fluorescence changes as different bases were added. These early versions tended to be a bit error-prone but have since been improved, and several newer competing technologies are now on the market.
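The difference is easy to see in a toy example. In the sketch below, a made-up sequence contains two identical runs of repeated "GA." A short read taken from inside a run matches many positions, so it cannot be placed, while a longer read that includes the unique bases on either side of a run matches only one. The sequence and the `placements` helper are invented for illustration and use exact matching, which real read-mapping software does not rely on.

```python
def placements(genome, read):
    """Every start position where the read matches the genome exactly."""
    return [i for i in range(len(genome) - len(read) + 1)
            if genome.startswith(read, i)]

# Invented sequence: unique flanks around two identical 20-base GA repeats.
genome = "ACGT" + "GA" * 10 + "TTCA" + "GA" * 10 + "CCGT"

short_read = "GA" * 4        # lies entirely inside a repeat
long_read = genome[2:28]     # spans the first repeat plus unique flanks

print(len(placements(genome, short_read)))  # many candidate positions
print(placements(genome, long_read))        # a single position
```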

Back in 2021, researchers used this technology to complete the genome of the Australian lungfish—the one that maintains the limb-like fins of the ancestors that gave rise to tetrapods. Now they’re back with the genomes from African and South American species. These species seem to have gone their separate ways during the breakup of the supercontinent Gondwana, a process that started nearly 200 million years ago. And having the genomes of all three should give us some perspective on the features that are common to all lungfish species, and thus are more likely to have been shared with the distant ancestors that gave rise to tetrapods.


Much of Neanderthal genetic diversity came from modern humans

Biology, evolution, Genetics, Genomics, human evolution, Neanderthals, Science / Rejus Almole / July 12, 2024

The basic outline of the interactions between modern humans and Neanderthals is now well established. The two came in contact as modern humans began their major expansion out of Africa, which occurred roughly 60,000 years ago. Humans picked up some Neanderthal DNA through interbreeding, while the Neanderthal population, always fairly small, was swept away by the waves of new arrivals.

But there are some aspects of this big-picture view that don’t entirely line up with the data. While it nicely explains the fact that Neanderthal sequences are far more common in non-African populations, it doesn’t account for the fact that every African population we’ve looked at has some DNA that matches up with Neanderthal DNA.

A study published on Thursday argues that much of this match came about because an early modern human population also left Africa and interbred with Neanderthals. But in this case, the result was to introduce modern human DNA to the Neanderthal population. The study shows that this DNA accounts for a lot of Neanderthals’ genetic diversity, suggesting that their population was even smaller than earlier estimates had suggested.

Out of Africa early

This study isn’t the first to suggest that modern humans and their genes met Neanderthals well in advance of our major out-of-Africa expansion. The key to understanding this is the genome of a Neanderthal from the Altai region of Siberia, which dates from roughly 120,000 years ago. That’s well before modern humans expanded out of Africa, yet its genome has some regions that have excellent matches to the human genome but are absent from the Denisovan lineage.

One explanation for this is that these are segments of Neanderthal DNA that were later picked up by the population that expanded out of Africa. The problem with that view is that most of these sequences also show up in African populations. So, researchers advanced the idea that an ancestral population of modern humans left Africa about 200,000 years ago, and some of its DNA was retained by Siberian Neanderthals. That’s consistent with some fossil finds that place anatomically modern humans in the Mideast at roughly the same time.

There is, however, an alternative explanation: Some of the population that expanded out of Africa 60,000 years ago and picked up Neanderthal DNA migrated back to Africa, taking the Neanderthal DNA with them. That has led to a small bit of the Neanderthal DNA persisting within African populations.

To sort this all out, a research team based at Princeton University focused on the Neanderthal DNA found in Africans, taking advantage of the fact that we now have a much larger array of completed human genomes (approximately 2,000 of them).

The work was based on a simple hypothesis. All of our work on Neanderthal DNA indicates that their population was relatively small, and thus had less genetic diversity than modern humans did. If that’s the case, then the addition of modern human DNA to the Neanderthal population should have boosted its genetic diversity. If so, then the stretches of “Neanderthal” DNA found in African populations should include some of the more diverse regions of the Neanderthal genome.
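The logic can be illustrated with a standard measure, nucleotide diversity: the average fraction of sites that differ between any two sequences drawn from a population. The sketch below uses invented ten-base sequences and is not the method the Princeton team used, which this excerpt does not describe; it only shows how adding a single divergent sequence to a small, uniform population raises the measure.

```python
from itertools import combinations

def pairwise_differences(a, b):
    """Number of aligned positions where two sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def nucleotide_diversity(sequences):
    """Average per-site differences over all pairs of aligned sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    return total / (len(pairs) * len(sequences[0]))

# Invented data: a small population with almost no variation...
small_population = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT"]
# ...and the same population after one divergent sequence is introduced.
with_introgression = small_population + ["ACTTACGAAC"]

print(nucleotide_diversity(small_population))
print(nucleotide_diversity(with_introgression))
```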
