Genetic risk scores, designed to summarize an individual’s likelihood of developing specific health conditions, can be manipulated using mathematical techniques to expose sensitive details about their DNA. This vulnerability could potentially be exploited by entities such as health insurers seeking to reconstruct genetic data from summary reports, thereby uncovering health risks that a patient has not disclosed. Furthermore, individuals who share their risk scores anonymously online might be identified by extracting their underlying genetic data and cross-referencing it with public genealogy databases.
Polygenic risk scores aggregate the influence of many individual genetic variants, known as single-nucleotide polymorphisms (SNPs). They serve as a condensed summary of potential health predispositions and are used by both academic researchers and commercial DNA testing services such as 23andMe. The scores are sometimes made public, often when individuals seek interpretations or advice regarding their results.
The process of decoding a polygenic risk score presents a significant computational challenge, akin to determining a phone number solely by knowing the sum of its digits. This complexity is rooted in a mathematical concept known as the knapsack problem, which is notoriously difficult to solve. Because of this perceived difficulty, polygenic risk scores have generally been considered to pose a low privacy risk.
However, each SNP in a risk score is multiplied by a weight that reflects its contribution to the overall disease risk, and these weights are highly precise, often extending to sixteen decimal places. With weights that precise, few combinations of SNPs add up to exactly the same score, which leaves smaller risk models susceptible to attack.
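The effect of weight precision can be illustrated with a toy calculation. The sketch below is not taken from the study: it assumes each SNP is coded as 0, 1 or 2 copies of a risk variant, uses eight made-up random weights, and counts how many genotype combinations share their score with at least one other combination when the weights are kept to 1 decimal place versus 12.

```python
from collections import Counter
from itertools import product
import random


def count_ambiguous_genotypes(weights, decimals):
    """Count genotype vectors whose score collides with another vector's score.

    Weights are rounded to `decimals` places and scaled to integers so that
    score comparisons are exact (no floating-point noise).
    """
    scale = 10 ** decimals
    int_weights = [round(w * scale) for w in weights]
    score_counts = Counter(
        sum(g * w for g, w in zip(genotype, int_weights))
        for genotype in product((0, 1, 2), repeat=len(weights))
    )
    return sum(n for n in score_counts.values() if n > 1)


# Hypothetical model: 8 SNPs with random weights (illustrative values only).
rng = random.Random(0)
weights = [rng.uniform(-0.5, 0.5) for _ in range(8)]

coarse = count_ambiguous_genotypes(weights, decimals=1)
fine = count_ambiguous_genotypes(weights, decimals=12)
print("ambiguous genotypes with 1-decimal weights:", coarse)
print("ambiguous genotypes with 12-decimal weights:", fine)
```

With coarse weights, thousands of the 6,561 possible combinations are indistinguishable from one another by score alone; with precise weights, essentially every combination produces its own score, which is what makes working backward possible in principle.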
“Since the final polygenic risk score is constrained by a finite number of ways to arrive at that numerical value, and a statistically probable arrangement of the underlying SNPs, it can be deduced with a high degree of accuracy,” explained Gamze Gürsoy, a researcher at Columbia University in New York.
Gürsoy and Kirill Nikitin, also at Columbia University, tested the attack on 298 polygenic risk models that each used 50 or fewer SNPs, applying it to genetic data from 2,353 individuals. Working backward from each risk score, they calculated all the possible genomes that could have generated it, then filtered out genomes containing a substantial number of uncommon genetic mutations. This allowed them to reconstruct the donor’s genetic makeup.
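The two steps described above, enumerating every genome that matches a score and then preferring the ones made of common variants, can be sketched in a few lines. This is a minimal brute-force illustration, not the researchers' actual algorithm: the weights, allele frequencies and function names are hypothetical, the weights are small integers so that more than one solution exists, and plausibility is estimated by assuming SNPs are independent and follow Hardy-Weinberg proportions.

```python
from itertools import product
from math import log


def genotype_log_prob(genotype, freqs):
    """Log-probability of a genotype vector.

    Assumes independent SNPs in Hardy-Weinberg proportions; each frequency
    must lie strictly between 0 and 1.
    """
    total = 0.0
    for copies, p in zip(genotype, freqs):
        q = 1.0 - p
        total += log((q * q, 2 * p * q, p * p)[copies])
    return total


def reconstruct(score, weights, freqs):
    """Return every genotype vector matching `score`, most probable first."""
    matches = [
        g for g in product((0, 1, 2), repeat=len(weights))
        if sum(copies * w for copies, w in zip(g, weights)) == score
    ]
    return sorted(matches, key=lambda g: genotype_log_prob(g, freqs), reverse=True)


# Hypothetical 4-SNP model with integer weights and risk-variant frequencies.
weights = [3, 5, 8, 13]
freqs = [0.4, 0.3, 0.05, 0.2]

candidates = reconstruct(score=8, weights=weights, freqs=freqs)
print(candidates)  # [(1, 1, 0, 0), (0, 0, 1, 0)]
```

Two genomes produce a score of 8 here, but the second relies on a variant carried by only 5 percent of people, so it ranks lower. Brute-force enumeration like this grows as 3 to the power of the number of SNPs, so it is only practical for very small models.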
A single SNP can also appear in multiple polygenic risk models. Gürsoy and Nikitin exploited this overlap to create a daisy-chain effect: SNPs recovered from smaller models were used to help solve larger, more complex ones.
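A rough sketch of the daisy-chain idea follows. It is an illustration under stated assumptions rather than the published method: the SNP identifiers, integer weights and scores are invented, and `solve_model` is a hypothetical brute-force helper. SNPs already recovered from one model are fixed, their contribution is subtracted from the second model's score, and only the remaining SNPs are searched.

```python
from itertools import product


def solve_model(score, weights, known=None):
    """Return every genotype assignment consistent with `score`.

    weights: dict mapping SNP id -> integer weight.
    known:   dict mapping SNP id -> genotype already recovered elsewhere.
    """
    known = known or {}
    fixed = {snp: known[snp] for snp in weights if snp in known}
    unknown = [snp for snp in weights if snp not in known]
    # Remove the contribution of SNPs we already know.
    residual = score - sum(weights[snp] * g for snp, g in fixed.items())
    solutions = []
    for combo in product((0, 1, 2), repeat=len(unknown)):
        if sum(weights[snp] * g for snp, g in zip(unknown, combo)) == residual:
            solutions.append({**fixed, **dict(zip(unknown, combo))})
    return solutions


# Hypothetical donor and two overlapping models (all values invented).
truth = {"rs1": 2, "rs2": 1, "rs3": 0, "rs4": 1}
small_model = {"rs1": 7, "rs2": 11}
large_model = {"rs1": 2, "rs2": 3, "rs3": 1, "rs4": 4}
small_score = sum(w * truth[s] for s, w in small_model.items())
large_score = sum(w * truth[s] for s, w in large_model.items())

from_small = solve_model(small_score, small_model)       # unique solution
alone = solve_model(large_score, large_model)            # several solutions
chained = solve_model(large_score, large_model, known=from_small[0])
print(len(alone), "candidates alone;", len(chained), "after chaining")
```

In this toy case the larger model is ambiguous on its own, but fixing the two SNPs recovered from the smaller model leaves a single consistent answer.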
Their research successfully reconstructed donor genotypes with an accuracy rate of 94.6 percent. This involved correctly predicting an average of 2,450 SNPs per individual. Further testing demonstrated that as few as 27 SNPs were sufficient to uniquely identify an individual within a dataset of half a million samples. The genetic profiles of family members could be predicted with an accuracy of up to 90 percent. Notably, individuals of African and East Asian descent were found to be more easily identified, a consequence of their underrepresentation in existing genetic databases.
According to Gürsoy, 447 small, high-precision models currently housed in a public database of polygenic scores are vulnerable to this type of attack. “We wanted to highlight that while the risk is generally low, under certain conditions, there can still be some data leakage,” Gürsoy stated. “This is something we should consider when designing research studies, particularly those involving vulnerable populations.”
Ying Wang at Massachusetts General Hospital offered a counterpoint, suggesting that existing data protection measures and computational limitations currently mitigate the risk of polygenic risk scores being exploited in this manner. “The findings may serve as a cautionary note, suggesting that small models should be regarded as potentially sensitive data in clinical reporting and during discussions regarding informed consent,” she commented.
Reference:
bioRxiv DOI: 10.64898/2026.02.16.706191
