The human genome carries a wide range of variation, including single nucleotide variants (SNVs), short insertions and deletions (indels), and structural variants (SVs). A variant whose frequency in a population exceeds 1% can be called a polymorphism, for example a single nucleotide polymorphism (SNP). Each individual's DNA sequence is inherited through the parents' germ cells and differs at many positions from the standard human reference genome. Most of these variants are "neutral", neither beneficial nor harmful to survival, but some can seriously affect health and cause disease.

In recent years, the rapid development of genome sequencing technology, especially second-generation sequencing (next-generation sequencing, NGS), has greatly reduced the cost of sequencing while significantly increasing its speed and maintaining high accuracy. This offers a fresh perspective for effectively identifying variants in the human genome and understanding their potential effects on health. At present, sequencing of an individual can reach 99.9% accuracy, provided that data quality is assured, the sequencing coverage depth is reasonable (generally 30X for whole-genome sequencing), and the correct analysis method is used. How to build a sound analysis strategy that interprets sequences accurately and efficiently and identifies disease-related mutations therefore places very high demands on bioinformatics, computer science, and related fields.
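To make the 30X figure concrete, mean coverage depth is commonly estimated as C = N × L / G, where N is the number of reads, L the read length, and G the genome size. The sketch below applies that relation. The genome size of about 3.1 billion bases and the 150 bp read length are illustrative assumptions, not values taken from this article.

```python
import math

# Assumed values for illustration only (not from the article).
GENOME_SIZE = 3_100_000_000  # approximate human genome size in bases
READ_LENGTH = 150            # a typical short-read length in bases


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth C = N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads whose mean depth reaches target_depth."""
    return math.ceil(target_depth * genome_size / read_length)


if __name__ == "__main__":
    n = reads_needed(30, READ_LENGTH, GENOME_SIZE)
    print(f"reads needed for 30X: {n:,}")
    print(f"resulting mean depth: {mean_coverage(n, READ_LENGTH, GENOME_SIZE):.1f}X")
```

Mean depth is only an average. Real coverage is uneven across the genome, which is one reason data quality and the analysis method matter alongside depth.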

Precision medicine is inseparable from the accurate interpretation of sequencing data

The demands placed on diagnostic analysis of genome sequencing today are quite different from those of a decade ago. Over the past two years, the concept of "precision medicine" has spread like a spring breeze. The 2016 CSCO annual meeting in particular was very different from previous years: a large number of gene sequencing companies took center stage at the conference, which shows that the supporting role of genetic testing in medicine has been widely and deeply recognized. The premise of precision medicine is the accurate interpretation of sequencing data, so the ability to provide accurate data analysis is the key to winning in this field.

Dr. Dong of the 23GENEBANK Bioinformatics Department said: "From the beginning, our company's technical preparation has been built on a sequence alignment and mutation analysis platform for second-generation sequencing technology. The platform fully integrates the latest developments in the field, which has improved the accuracy and breadth of mutation identification and annotation. It can be said that we are not a production-oriented testing enterprise, but a company that analyzes and mines genetic big data."

First, in terms of algorithms, 23GENEBANK introduced an improved algorithm based on Bayesian statistics to optimize the mutation identification process. This algorithm differs from other prior-probability methods, which are often based on simplified statistical models such as the diploid assumption and uniform copy number. The modified Bayesian method models loci with multiple alleles and non-uniform copy number in the sample, and obtains the most probable genotype by evaluating the probability of multiple candidate genotypes at each base site, which improves the precision of identification. In addition, the algorithm relies mainly on aligning the sequences to the reference genome, avoiding the potential errors introduced when the sequences themselves are assembled.
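The general idea of scoring several candidate genotypes at one site and keeping the most probable one can be sketched as follows. This is a minimal textbook-style illustration, not 23GENEBANK's algorithm, whose details the article does not give. The function names, the uniform default prior, the simple substitution error model, and the `ploidy` parameter (used here as a stand-in for non-diploid copy number) are all assumptions.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_posteriors(observations, ploidy=2, prior=None):
    """Posterior probability of each candidate genotype at one site.

    observations: list of (base, error_prob) pairs, one per read covering
                  the site, with 0 < error_prob < 1.
    ploidy:       number of allele copies at the site (hypothetical stand-in
                  for copy number; 2 is the usual diploid case).
    prior:        optional dict mapping genotype tuple -> prior probability;
                  a uniform prior is assumed when omitted.
    """
    log_post = {}
    for genotype in combinations_with_replacement(BASES, ploidy):
        lp = math.log(prior[genotype]) if prior else 0.0
        for base, err in observations:
            # A read is assumed to come from each allele copy with equal
            # probability; a wrong base is one of the 3 other bases.
            p = sum((1 - err) if a == base else err / 3 for a in genotype) / ploidy
            lp += math.log(p)
        log_post[genotype] = lp

    # Normalize in log space to avoid underflow with many reads.
    top = max(log_post.values())
    weights = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def call_genotype(observations, ploidy=2, prior=None):
    """Return (most probable genotype, its posterior probability)."""
    post = genotype_posteriors(observations, ploidy, prior)
    best = max(post, key=post.get)
    return best, post[best]


if __name__ == "__main__":
    reads = [("A", 0.01)] * 20 + [("G", 0.01)] * 10
    print(call_genotype(reads, ploidy=2))
    print(call_genotype(reads, ploidy=3))
```

Raising `ploidy` above 2 lets the same pile-up be explained by unequal allele dosage, which is the kind of case a strictly diploid model cannot represent.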

Second, for the detection of structural variation (SV), the company's bioinformatics team integrated three different prediction strategies, each based on a different signal of structural variation:

The first is read depth: the depth over a deleted region is usually lower than over a normal region, and the depth over a duplicated region is higher than over a normal region;

The second is split reads: since NGS reads are sampled at random from across the genome, if a deletion or duplication is present, some reads will span its breakpoint, with the front part of the read aligning to one position in the genome and the back part aligning to another position;

The third is read pairs: the distance between the two reads of a pair is generally 300-500 bp. If a deletion or duplication is present, the mapped distance between the paired reads will change.

Used together, the three methods complement one another: reliability is maintained while the recognition rate for potential SVs is also improved.
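The way the three signals can back each other up is sketched below for the deletion case. This is a toy illustration under stated assumptions, not the company's pipeline: the `Region` structure, the depth ratio cutoffs, the minimum split-read count, and the two-out-of-three voting rule are all hypothetical. Only the 300-500 bp expected pair distance comes from the text.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Region:
    """Evidence collected for one candidate region (hypothetical structure)."""
    depths: list          # read depth sampled inside the region
    split_reads: int      # reads whose two parts align on either side of it
    pair_distances: list  # mapped distances (bp) of read pairs spanning it


def depth_signal(depths, baseline, low_ratio=0.6, high_ratio=1.4):
    """Classify the region by its depth relative to the genome-wide baseline."""
    ratio = mean(depths) / baseline
    if ratio < low_ratio:
        return "deletion"
    if ratio > high_ratio:
        return "duplication"
    return None


def pair_signal(distances, expected=(300, 500), min_fraction=0.3):
    """Report deletion support when enough pairs map farther apart than expected."""
    if not distances:
        return None
    too_far = sum(d > expected[1] for d in distances) / len(distances)
    return "deletion" if too_far >= min_fraction else None


def evaluate_deletion(region, baseline_depth, min_split_reads=3, min_signals=2):
    """Call a deletion when at least min_signals of the three signals agree."""
    signals = {
        "read_depth": depth_signal(region.depths, baseline_depth) == "deletion",
        "split_reads": region.split_reads >= min_split_reads,
        "read_pairs": pair_signal(region.pair_distances) == "deletion",
    }
    return sum(signals.values()) >= min_signals, signals


if __name__ == "__main__":
    candidate = Region(depths=[12, 10, 14, 11], split_reads=5,
                       pair_distances=[1800, 1750, 420, 1900])
    print(evaluate_deletion(candidate, baseline_depth=30))
```

Requiring agreement between signals trades a little sensitivity for reliability, while lowering `min_signals` to 1 would flag more potential SVs at the cost of more false positives.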
