Position-correlation scoring feature (PCSF)

Contents

  • PCSF
    • Source 1
    • Source 2

PCSF

Position-correlation scoring feature (PCSF)

Source 1

2010-11_Theory in Biosciences_Eukaryotic and prokaryotic promoter prediction using hybrid approach

https://link.springer.com/article/10.1007/s12064-010-0114-8

The PWM can be constructed by counting the frequencies of oligonucleotides in conserved sites of training sequences. The probability $p_{xi}$ of an oligonucleotide $x$ at the $i$th site can be formulated as (Li and Lin 2006; Wasserman and Sandelin 2004; Kielbasa et al. 2005):

$$p_{xi} = \frac{n_{xi} + b_{xi}}{N_i + B_i} \tag{2}$$

where $n_{xi}$ and $b_{xi}$ are the real counts and pseudocounts of k-mer oligonucleotide $x$ at the $i$th site, respectively, and $N_i$ and $B_i$ are the total numbers of real counts and pseudocounts at the $i$th site, respectively. If there are relatively few real counts, many k-mer variants may not be represented because of the small sample of sequences. The goal of adding pseudocounts is to obtain an improved estimate of the probability $p_{xi}$ of k-mer oligonucleotide $x$ at the $i$th site. Relatively few pseudocounts should be added when the sequences are well sampled, and more when the data are sparser. One simple formula that has worked well in some studies is to set $B_i = \sqrt{N_i}$ and $b_{xi} = p_0\sqrt{N_i}$ (where $p_0$ is the average background frequency) in Eq. 2 (Wasserman and Sandelin 2004; Kielbasa et al. 2005). As $N_i$ increases, the influence of the pseudocounts decreases because $\sqrt{N_i}$ grows more slowly than $N_i$. Because of the pseudocounts, the estimated probabilities are strictly positive (Kielbasa et al. 2005). Based on the probabilities $p_{xi}$, the PCSF of an arbitrary sequence can be defined as (Li and Lin 2006):

$$F = \sum_i \ln(p_{xi}/p_0) \tag{3}$$

where $p_0$ is the average background probability of a k-mer. The score $F$ indicates how closely a sequence resembles the sequences from which the matrix was built.
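Eq. 2 and Eq. 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy training sequences, and the uniform background $p_0 = 1/4^k$ are assumptions made for illustration (the text above only says that $p_0$ is the average background frequency).

```python
import math
from collections import Counter


def position_counts(train_seqs, k=1):
    """Count the k-mers observed at each start position of aligned sequences."""
    length = len(train_seqs[0])
    return [Counter(s[i:i + k] for s in train_seqs)
            for i in range(length - k + 1)]


def kmer_prob(counts_i, x, p0):
    """Eq. 2 with B_i = sqrt(N_i) and b_xi = p0 * sqrt(N_i)."""
    n_total = sum(counts_i.values())        # N_i: total real counts at site i
    b_total = math.sqrt(n_total)            # B_i: total pseudocounts at site i
    return (counts_i[x] + p0 * b_total) / (n_total + b_total)


def pcsf_score(seq, counts, k=1):
    """Eq. 3: F = sum_i ln(p_xi / p0)."""
    p0 = 1.0 / 4 ** k                       # assumption: uniform background
    return sum(math.log(kmer_prob(c, seq[i:i + k], p0) / p0)
               for i, c in enumerate(counts))


train = ["ACGT", "ACGA", "ACGT", "TCGT"]    # toy data, not from the paper
counts = position_counts(train)
print(pcsf_score("ACGT", counts))           # resembles the training set
print(pcsf_score("TTTT", counts))           # lower score than the line above
```

Because the pseudocounts of all $4^k$ k-mers at a site sum to $B_i$, the probabilities at each site still sum to 1, and a k-mer never seen in training gets a small positive probability instead of zero, so the logarithm in Eq. 3 is always defined.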

Source 2

2019-09_Mol Ther-Nucleic Acids_iProEP: A Computational Predictor for Predicting Promoter

https://www.sciencedirect.com/science/article/pii/S2162253119301611

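By aligning the promoter sequences of each species, we can construct a position-correlation scoring matrix (PCSM). Each row of the PCSM consists of the factors $p_{xi}$, where $p_{xi}$ is the probability of k-mer $x$ at the $i$th position of the promoter samples. $p_{xi}$ can be calculated as:

$$p_{xi} = \frac{n_{xi} + b_{xi}}{N_i + B_i}$$

where $n_{xi}$ is the real count of $x$ occurring at the $i$th position and $b_{xi}$ is the corresponding pseudocount. $N_i$ is the sum of the real counts of all k-mers at the $i$th position (that is, the number of positive samples), and $B_i$ is the corresponding sum of pseudocounts. If the sample size is not large enough, some k-mers will be absent as k increases, so pseudocounts improve the estimate of the probability $p_{xi}$ of k-mer $x$ at the $i$th position. $B_i$ and $b_{xi}$ are given by:

$$\begin{aligned} B_i &= \sqrt{N_i},\\ b_{xi} &= p_0\sqrt{N_i} \end{aligned}$$

where $p_0$ is the background frequency of a k-mer, equal to $1/4^k$. As the number of samples $N_i$ increases, the influence of the pseudocounts weakens because $\sqrt{N_i}$ grows slowly.

Following the conservation analysis and ACC evaluation of Lin and Li, a set of conserved trimer sites was selected for each of the five species. Based on these sites and the PCSM, the PCSF features of the positive and negative samples of the five species can be expressed as:

$$PCSF = [f_1\ f_2\ \ldots\ f_i\ \ldots\ f_n]$$

where $n$ is the number of selected conserved sites and each element is defined as:

$$f_i = \ln(p_{xi}/p_0)$$

In this equation, $p_0$ is the background probability of each trimer ($p_0 = 1/4^3$), and $p_{xi}$ is obtained from the PCSM.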
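The feature construction described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the iProEP code: the function names `build_pcsm` and `pcsf_features`, the toy sequences, and the site indices `[0, 3]` are hypothetical, and the conserved sites are simply passed in rather than selected by the conservation analysis.

```python
import math
from collections import Counter


def build_pcsm(positives, sites, k=3):
    """Collect k-mer counts at each chosen site of the aligned positive samples."""
    n_i = len(positives)                    # N_i: number of positive samples
    return {i: (Counter(s[i:i + k] for s in positives), n_i) for i in sites}


def pcsf_features(seq, pcsm, k=3):
    """Return [f_1, ..., f_n] with f_i = ln(p_xi / p0), one entry per site."""
    p0 = 1.0 / 4 ** k                       # background frequency of a k-mer
    feats = []
    for i, (counts, n_i) in pcsm.items():
        root = math.sqrt(n_i)               # B_i = sqrt(N_i)
        p_xi = (counts[seq[i:i + k]] + p0 * root) / (n_i + root)
        feats.append(math.log(p_xi / p0))
    return feats


# Toy positive samples and hypothetical conserved sites (0-based start indices).
positives = ["ACGTAC", "ACGTAA", "ACGTTC", "TCGTAC"]
pcsm = build_pcsm(positives, sites=[0, 3])

print(pcsf_features("ACGTAC", pcsm))        # trimers seen in training: positive values
print(pcsf_features("GGGGGG", pcsm))        # unseen trimers: negative values
```

The same matrix built from the positive samples is used to score both positive and negative sequences, so each sequence maps to a fixed-length vector of $n$ values that a downstream classifier can consume.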
