dynast.estimation.p_c

Module Contents

Functions

read_p_c(p_c_path: str, group_by: Optional[List[str]] = None) → Union[float, Dict[str, float], Dict[Tuple[str, Ellipsis], float]]

Read p_c CSV as a dictionary, with group_by columns as keys.

binomial_pmf(k: int, n: int, p: int) → float

Numba-compiled binomial PMF function for faster calculation.

expectation_maximization_nasc(values: numpy.ndarray, p_e: float, threshold: float = 0.01) → float

NASC-seq pipeline variant of the EM algorithm to estimate average conversion rate in labeled RNA.

expectation_maximization(values: numpy.ndarray, p_e: float, p_c: float = 0.1, threshold: float = 0.01, max_iters: int = 300) → float

Run EM algorithm to estimate average conversion rate in labeled RNA.

estimate_p_c(df_aggregates: pandas.DataFrame, p_e: Union[float, Dict[str, float], Dict[Tuple[str, Ellipsis], float]], p_c_path: str, group_by: Optional[List[str]] = None, threshold: int = 1000, n_threads: int = 8, nasc: bool = False) → str

Estimate the average conversion rate in labeled RNA.

dynast.estimation.p_c.read_p_c(p_c_path: str, group_by: Optional[List[str]] = None) → Union[float, Dict[str, float], Dict[Tuple[str, Ellipsis], float]]

Read p_c CSV as a dictionary, with group_by columns as keys.

Parameters
p_c_path

Path to CSV containing p_c values

group_by

Columns to group by, defaults to None

Returns

Dictionary with group_by columns as keys (tuple if multiple)
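As an illustration of the return shapes described above, here is a minimal sketch that uses only the standard library. It is not dynast's implementation. The name `parse_p_c` is hypothetical, and the file layout is an assumption: a single number when `group_by` is None, otherwise one column per `group_by` name plus a column named `p_c`.

```python
import csv
from typing import Dict, Iterable, List, Optional, Tuple, Union

Key = Union[str, Tuple[str, ...]]


def parse_p_c(
    lines: Iterable[str], group_by: Optional[List[str]] = None
) -> Union[float, Dict[Key, float]]:
    """Hypothetical stand-in for read_p_c; the file layout is an assumption."""
    if group_by is None:
        # Assumed layout: the file holds one number and nothing else.
        return float("".join(lines).strip())

    result: Dict[Key, float] = {}
    for row in csv.DictReader(lines):
        # Assumed layout: one column per group_by name plus a `p_c` column.
        values = tuple(row[column] for column in group_by)
        # A single group_by column gives a plain string key,
        # several columns give a tuple key.
        key = values[0] if len(values) == 1 else values
        result[key] = float(row["p_c"])
    return result
```

An open file handle can be passed as `lines`, for example `with open(path) as f: parse_p_c(f, ["barcode"])`, where `barcode` is a placeholder column name.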

dynast.estimation.p_c.binomial_pmf(k: int, n: int, p: int) → float

Numba-compiled binomial PMF function for faster calculation.

Parameters
k

Number of successes

n

Number of trials

p

Probability of success

Returns

Probability of observing k successes in n trials with probability of success p
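For reference, the quantity being computed is the standard binomial probability C(n, k) * p^k * (1 - p)^(n - k). The sketch below is a plain-Python equivalent without Numba. The name `binomial_pmf_sketch` is hypothetical, and `p` is typed as a float here because it is a probability.

```python
from math import comb


def binomial_pmf_sketch(k: int, n: int, p: float) -> float:
    """Return P(X = k) for X ~ Binomial(n, p)."""
    if k < 0 or k > n:
        # Impossible outcomes have zero probability.
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)
```

For example, `binomial_pmf_sketch(2, 4, 0.5)` evaluates 6 * 0.25 * 0.25 = 0.375.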

dynast.estimation.p_c.expectation_maximization_nasc(values: numpy.ndarray, p_e: float, threshold: float = 0.01) → float

NASC-seq pipeline variant of the EM algorithm to estimate average conversion rate in labeled RNA.

Parameters
values

N x C NumPy array in which the row index is the number of conversions, the column index is the nucleotide content, and each value is the number of reads observed

p_e

Background mutation rate of unlabeled RNA

threshold

Filter threshold

Returns

Estimated conversion rate

dynast.estimation.p_c.expectation_maximization(values: numpy.ndarray, p_e: float, p_c: float = 0.1, threshold: float = 0.01, max_iters: int = 300) → float

Run EM algorithm to estimate average conversion rate in labeled RNA.

This function runs the following two steps.

  1. Constructs a sparse matrix representation of values and filters out certain indices that are expected to contain more than threshold proportion of unlabeled reads.

  2. Runs an EM algorithm that iteratively updates the filtered-out data and the estimate.

See https://doi.org/10.1093/bioinformatics/bty256.
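To give a feel for the EM idea, the sketch below fits a plain two-component binomial mixture: unlabeled reads convert at the fixed rate `p_e`, labeled reads at the unknown rate `p_c`. This is a simplified stand-in, not the algorithm in this module. It omits the filtering in step 1 and the updating of filtered-out data in step 2. The names `em_mixture_sketch`, `binom`, `pi` and `tol` are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def binom(k: int, n: int, p: float) -> float:
    """Binomial PMF: probability of k conversions among n nucleotides."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def em_mixture_sketch(
    triplets: Iterable[Tuple[int, int, float]],
    p_e: float,
    p_c: float = 0.1,
    max_iters: int = 300,
    tol: float = 1e-8,
) -> float:
    """Estimate p_c from (conversions, nucleotide content, read count) triplets."""
    data = [(k, n, count) for k, n, count in triplets if count > 0]
    pi = 0.5  # assumed starting guess for the fraction of labeled reads
    for _ in range(max_iters):
        conversions = content = labeled = total = 0.0
        for k, n, count in data:
            # E-step: probability that reads in this cell are labeled.
            from_labeled = pi * binom(k, n, p_c)
            from_unlabeled = (1 - pi) * binom(k, n, p_e)
            resp = from_labeled / (from_labeled + from_unlabeled)
            conversions += count * resp * k
            content += count * resp * n
            labeled += count * resp
            total += count
        # M-step: weighted conversion rate and labeled fraction.
        new_p_c = conversions / content
        pi = labeled / total
        converged = abs(new_p_c - p_c) < tol
        p_c = new_p_c
        if converged:
            break
    return p_c
```

The sketch assumes at least one triplet with nonzero nucleotide content and does not guard against degenerate input.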

Parameters
values

Array of three columns encoding a sparse array in (row, column, value) format, zero-indexed, where row is the number of conversions, column is the nucleotide content, and value is the number of reads

p_e

Background mutation rate of unlabeled RNA

p_c

Initial p_c value

threshold

Filter threshold: indices expected to contain more than this proportion of unlabeled reads are filtered out

max_iters

Maximum number of EM iterations

Returns

Estimated conversion rate
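The (row, column, value) layout expected for `values` can be illustrated by converting a small dense count table into triplets. This is a sketch with plain Python lists. The helper name `dense_to_triplets` is hypothetical, and dropping zero counts is an assumption about how the sparse form is built.

```python
from typing import List, Sequence, Tuple


def dense_to_triplets(dense: Sequence[Sequence[int]]) -> List[Tuple[int, int, int]]:
    """Convert dense[conversions][nucleotide content] = reads into triplets."""
    return [
        (row, column, count)
        for row, counts in enumerate(dense)
        for column, count in enumerate(counts)
        if count != 0  # assumption: zero counts are omitted from the sparse form
    ]
```

With `dense = [[5, 0, 3], [0, 2, 0]]`, the entry `(1, 1, 2)` means two reads had one conversion and a nucleotide content of one. The resulting list can be wrapped with `numpy.array(...)` to obtain a three-column array.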

dynast.estimation.p_c.estimate_p_c(df_aggregates: pandas.DataFrame, p_e: Union[float, Dict[str, float], Dict[Tuple[str, Ellipsis], float]], p_c_path: str, group_by: Optional[List[str]] = None, threshold: int = 1000, n_threads: int = 8, nasc: bool = False) → str

Estimate the average conversion rate in labeled RNA.

Parameters
df_aggregates

Pandas dataframe containing aggregate values

p_e

Background mutation rate of unlabeled RNA

p_c_path

Path to output CSV containing p_c estimates

group_by

Columns to group by

threshold

Read count threshold

n_threads

Number of threads

nasc

Flag to indicate whether to use NASC-seq pipeline variant of the EM algorithm

Returns

Path to output CSV containing p_c estimates