PCSF

Position-correlation scoring feature (PCSF)

Source 1

2020-11_Theory in Biosciences_Eukaryotic and prokaryotic promoter prediction using hybrid approach

https://link.springer.com/article/10.1007/s12064-010-0114-8

Original text:

The PWM can be constructed by counting the frequencies of oligonucleotides in conserved sites of training sequences. The probability $p_{xi}$ of an oligonucleotide $x$ at the $i$th site can be formulated as (Li and Lin 2006; Wasserman and Sandelin 2004; Kielbasa et al. 2005):

$$p_{xi} = \frac{n_{xi} + b_{xi}}{N_i + B_i} \tag{2}$$

where $n_{xi}$ and $b_{xi}$ are real counts and pseudocounts of k-mer oligonucleotide $x$ at the $i$th site, respectively. $N_i$ and $B_i$ are total number of real counts and pseudocounts at the $i$th site, respectively. If there are relatively few real counts, many k-mer variations may not be presented because of the small sample of sequences. The goal of adding pseudocounts is to obtain an improved estimate of the probability $p_{xi}$ of k-mer oligonucleotide $x$ at the $i$th site. A relatively few pseudocounts should be added when there is a good sampling of sequences, and more pseudocounts should be added when the data is sparser. One simple formula that has worked well in some studies is to make $B_i$ equal to $\sqrt{N_i}$ and $b_{xi}$ equal to $p_0\sqrt{N_i}$ ($p_0$ is the average background frequency) in Eq. 2 (Wasserman and Sandelin 2004; Kielbasa et al. 2005), respectively. As $N_i$ increase, the influence of pseudocounts decrease because $\sqrt{N_i}$ increase more slowly. Due to the existence of pseudocounts, the estimated probabilities are strictly positive (Kielbasa et al. 2005). Based on the probabilities $p_{xi}$, the PCSF of an arbitrary sequence can be defined as (Li and Lin 2006):

$$F = \sum_i \ln\left(p_{xi}/p_0\right) \tag{3}$$

where $p_0$ is average background probability of k-mer. The score $F$ shows the degree of sequence closed to matrix resource.
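The counting scheme in Eq. 2 and the score in Eq. 3 can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names `build_pcsm`, `kmer_prob`, and `score` are hypothetical, the training sequences are assumed to be pre-aligned and of equal length, every overlapping k-mer start position is treated as a site, and the background $p_0$ is assumed to be uniform ($1/4^k$), which is one possible reading of "average background frequency".

```python
import math
from collections import Counter


def build_pcsm(seqs, k=3):
    """Count k-mers at every start position of equal-length, aligned sequences.

    Returns a list with one (counts, N_i, B_i) tuple per position, plus the
    background probability p0 (ASSUMPTION: uniform over the 4**k k-mers).
    """
    p0 = 1.0 / 4 ** k
    length = len(seqs[0])
    pcsm = []
    for i in range(length - k + 1):
        counts = Counter(s[i:i + k] for s in seqs)  # real counts n_xi
        n_total = sum(counts.values())              # N_i
        b_total = math.sqrt(n_total)                # B_i = sqrt(N_i)
        pcsm.append((counts, n_total, b_total))
    return pcsm, p0


def kmer_prob(pcsm, p0, i, kmer):
    """Eq. 2: p_xi = (n_xi + b_xi) / (N_i + B_i), with b_xi = p0 * sqrt(N_i)."""
    counts, n_total, b_total = pcsm[i]
    return (counts[kmer] + p0 * b_total) / (n_total + b_total)


def score(pcsm, p0, seq, k=3):
    """Eq. 3: F = sum over positions of ln(p_xi / p0)."""
    return sum(
        math.log(kmer_prob(pcsm, p0, i, seq[i:i + k]) / p0)
        for i in range(len(pcsm))
    )


# Toy data, invented purely for illustration.
training = ["ACGT", "ACGT", "ACGA", "ACGT"]
pcsm, p0 = build_pcsm(training, k=3)
print(score(pcsm, p0, "ACGT"))  # sequence resembling the training set
print(score(pcsm, p0, "TTTT"))  # sequence with only unseen k-mers
```

Because of the pseudocounts, a k-mer that never occurs at a position still receives a small positive probability, so the logarithm in Eq. 3 is always defined.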

Source 2

2019-09_Mol Ther-Nucleic Acids_iProEP：A Computational Predictor for Predicting Promoter

https://www.sciencedirect.com/science/article/pii/S2162253119301611

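By aligning the promoter sequences of each species, we can construct a position-correlation scoring matrix (PCSM). Each row of the PCSM consists of the factors $p_{xi}$, where $p_{xi}$ is the probability of k-mer $x$ at the $i$th position of the promoter samples. $p_{xi}$ can be calculated as:

$$p_{xi} = \frac{n_{xi} + b_{xi}}{N_i + B_i}$$

where $n_{xi}$ is the real count of $x$ at the $i$th position and $b_{xi}$ is the corresponding pseudocount. $N_i$ denotes the sum of the real counts of all k-mers at the $i$th position (that is, the number of positive samples), and $B_i$ is the corresponding sum of pseudocounts. If the sample size is not large enough, some k-mers will be absent as $k$ increases. Pseudocounts therefore improve the estimate of the probability $p_{xi}$ of k-mer $x$ at the $i$th position. $B_i$ and $b_{xi}$ can be given by:

$$B_i = \sqrt{N_i}, \qquad b_{xi} = p_o \sqrt{N_i}$$

where $p_o$ is the background frequency of a k-mer, equal to $1/4^k$. As the number of samples $N_i$ increases, the influence of the pseudocounts weakens because $\sqrt{N_i}$ grows slowly.

Through the extensive conservation analysis and ACC evaluation of Lin and Li, a number of conserved trimer sites were selected for the five species. Based on these sites and the PCSM, the PCSF of the positive and negative samples of the five species can be expressed as:

$$\mathrm{PCSF} = [f_1\ f_2\ \ldots\ f_i\ \ldots\ f_n]$$

where $n$ is the number of selected conserved sites, and each element is defined as:

$$f_i = \ln\left(p_{xi}/p_o\right)$$

In this equation, $p_o$ is the background probability of each trimer ($p_o = 1/4^3$), and $p_{xi}$ is obtained from the PCSM.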
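As a rough illustration of how the PCSF vector might be assembled from a set of selected sites, the sketch below computes $f_i = \ln(p_{xi}/p_o)$ for each site. It is not the iProEP code: the function name `pcsf_vector` is hypothetical, the `sites` argument (0-based trimer start positions) stands in for the conserved sites chosen by the conservation analysis, which is not reproduced here, and the toy sequences are invented.

```python
import math
from collections import Counter


def pcsf_vector(train_seqs, query, sites, k=3):
    """Return [f_1, ..., f_n] for `query`, one entry per selected site.

    train_seqs: aligned positive (promoter) samples of equal length.
    sites:      HYPOTHETICAL list of 0-based k-mer start positions standing
                in for the conserved sites selected in the paper.
    """
    p_o = 1.0 / 4 ** k                  # background probability of a k-mer
    n_i = len(train_seqs)               # N_i: number of positive samples
    b_i = math.sqrt(n_i)                # B_i = sqrt(N_i)
    features = []
    for i in sites:
        counts = Counter(s[i:i + k] for s in train_seqs)  # n_xi at site i
        b_xi = p_o * b_i                                   # pseudocount
        p_xi = (counts[query[i:i + k]] + b_xi) / (n_i + b_i)
        features.append(math.log(p_xi / p_o))              # f_i
    return features


# Toy data, invented purely for illustration.
promoters = ["TATAAT", "TATAAT", "TATACT", "TTTAAT"]
print(pcsf_vector(promoters, "TATAAT", sites=[0, 3]))
print(pcsf_vector(promoters, "GGGGGG", sites=[0, 3]))
```

A trimer that is enriched at a site relative to the background yields a positive $f_i$, and a trimer never seen at that site yields a negative one, so the vector can be passed directly to a downstream classifier.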
