Graduate student uses data science to explore biodiversity

Using data science applied to plant and animal records in natural history museums, UO graduate student Jordan Rodriguez is finding new ways to study the evolution of key proteins.

As an undergraduate, Rodriguez embarked on a project to research the biases and limitations of biodiversity records from natural history collections and databases like iNaturalist. This work led to a recent publication in Nature Ecology & Evolution.

She is now a graduate student in the lab of biology professor Andrew Kern at UO, using machine learning approaches to track changes in protein diversity.

“I realized the statistical power of working with big data, but my first research experience really set the stage for understanding the hidden pitfalls of data,” Rodriguez said.

Having millions of data points can be extremely useful, she said, but only if you understand the limitations of the data.

Rodriguez’s journey to computational research began at the Ruth O’Brien Herbarium at Texas A&M University-Corpus Christi, where she helped digitize a collection of plant specimens. Along with biologist Barnabas Daru, now a professor at Stanford University, Rodriguez began exploring coverage gaps in different types of natural history data.

“We have access to an abundance of data on what species live where,” Rodriguez said, from legacy museum collections to field observations captured in online databases. “But something we had started to observe was that in areas generally known as biodiversity hotspots, like the Amazon rainforest, there seemed to be a disconnect between what the data was telling us and what the biology was telling us.”

Most natural history records fall into one of two categories. Vouchered records are backed by physical specimens, such as those held in museum and herbarium collections. Observational records document a sighting without a physical specimen to back it up.

Thanks to the rise of smartphone apps like iNaturalist and eBird, there has been an explosion in sighting records in recent years. With these tools, anyone, scientist or not, can take a picture of a plant, insect or bird and document the sighting in a public database.

Rodriguez and Daru looked at more than a billion records and analyzed how observational and specimen-backed data sets varied among groups such as plants, birds and butterflies.

The different collection methods “lead to these interesting differences in how the separate datasets represent global biodiversity,” Rodriguez said.

Both specimen-backed and observational data had gaps in coverage, Rodriguez and Daru report in their paper. Both types of datasets were more likely to report species in easily accessible areas: near roadsides, near airports and at lower elevations.

And they were both biased towards certain types of species. People are more likely to capture an image of a plant with a showy flower than the grass right next to it, Rodriguez said.

But the coverage gaps were larger for observational records, perhaps because specimen-backed records are often collected more deliberately by researchers on field collection trips. Records with vouchers also had a richer representation over time, with a better balance between years and seasons. Citizen scientists are more likely to take photos of chance wildlife sightings on a warm, sunny day than in winter, Rodriguez noted.

Despite these drawbacks, sighting records still have their place, she said. They are especially useful for endangered animal and plant species, where it is beneficial to record a sighting without killing anything. And because they are easier to collect, scientists can access more data points. Observational and specimen-backed records “work together,” Rodriguez said.

Rodriguez hopes her work will encourage scientists to think about the limitations of the datasets they are using and consider possible biases in their results. Her recently published research points to specific ways in which these biases appear in natural history datasets of various groups of plants and animals. But the lessons apply to other data-driven fields.

Now at UO, Rodriguez is moving away from natural history research and instead focusing on population genetics, also using a big data approach.

The undergraduate research project “gave me experience developing methods and tools in bioinformatics, working with billions of data points and trying to understand statistics,” she said. As a graduate student, “I knew I wanted to stay in a computationally focused lab.”

She recently joined Kern’s lab, a computational biology research group that is part of the UO Data Science Initiative and the College of Arts and Sciences. There she began an exploratory project applying artificial intelligence to biological data, to unravel the evolution of the full set of proteins in humans, chimpanzees, mice and rhesus monkeys.

Using machine learning tools similar to the technology behind ChatGPT, she hopes to learn more about how quickly proteins evolve in these animals.

“So much potential lies at the intersection of machine learning and evolutionary questions,” Rodriguez said.

Scientists have a wealth of genetic sequence data, and deep learning models might be able to uncover new information from it. Although such approaches require special skills in handling and understanding data, she noted, “this is the future of evolutionary research.”

—By Laurel Hamers, University Communications
—Top photo: Jordan Rodriguez
