dynast.preprocessing.conversion

Module Contents

Functions

read_counts(counts_path: str, *args, **kwargs) → pandas.DataFrame

Read counts CSV as a pandas dataframe.

complement_counts(df_counts: pandas.DataFrame, gene_infos: dict) → pandas.DataFrame

Complement the counts in the counts dataframe according to gene strand.

subset_counts(df_counts: pandas.DataFrame, key: typing_extensions.Literal['total', 'transcriptome', 'spliced', 'unspliced']) → pandas.DataFrame

Subset the given counts DataFrame to only contain reads of the desired key.

drop_multimappers(df_counts: pandas.DataFrame, conversions: Optional[FrozenSet[str]] = None) → pandas.DataFrame

Drop multimappings that have the same read ID.

deduplicate_counts(df_counts: pandas.DataFrame, conversions: Optional[FrozenSet[str]] = None, use_conversions: bool = True) → pandas.DataFrame

Deduplicate counts based on barcode, UMI, and gene.

drop_multimappers_part(counter: multiprocessing.Value, lock: multiprocessing.Lock, split_path: str, out_path: str, conversions: Optional[FrozenSet[str]] = None) → str

Helper function to parallelize drop_multimappers().

deduplicate_counts_part(counter: multiprocessing.Value, lock: multiprocessing.Lock, split_path: str, out_path: str, conversions: Optional[FrozenSet[str]], use_conversions: bool = True)

Helper function to parallelize deduplicate_counts().

split_counts_by_velocity(df_counts: pandas.DataFrame) → Dict[str, pandas.DataFrame]

Split the given counts dataframe by the velocity column.

count_no_conversions(alignments_path: str, counter: multiprocessing.Value, lock: multiprocessing.Lock, index: List[Tuple[int, int, int]], barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, update_every: int = 10000) → str

Count reads that have no conversion.

count_conversions_part(conversions_path: str, alignments_path: str, counter: multiprocessing.Value, lock: multiprocessing.Lock, index: List[Tuple[int, int, int]], barcodes: Optional[List[str]] = None, snps: Optional[Dict[str, Dict[str, Set[int]]]] = None, quality: int = 27, temp_dir: Optional[str] = None, update_every: int = 10000) → str

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to.

count_conversions(conversions_path: str, alignments_path: str, index_path: str, counts_path: str, gene_infos: dict, barcodes: Optional[List[str]] = None, snps: Optional[Dict[str, Dict[str, Set[int]]]] = None, quality: int = 27, conversions: Optional[FrozenSet[str]] = None, dedup_use_conversions: bool = True, n_threads: int = 8, temp_dir: Optional[str] = None) → str

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to.

Attributes

CONVERSIONS_PARSER

ALIGNMENTS_PARSER

CONVERSION_IDX

BASE_IDX

CONVERSION_COMPLEMENT

CONVERSION_COLUMNS

BASE_COLUMNS

COLUMNS

CSV_COLUMNS

dynast.preprocessing.conversion.CONVERSIONS_PARSER
dynast.preprocessing.conversion.ALIGNMENTS_PARSER
dynast.preprocessing.conversion.CONVERSION_IDX
dynast.preprocessing.conversion.BASE_IDX
dynast.preprocessing.conversion.CONVERSION_COMPLEMENT
dynast.preprocessing.conversion.CONVERSION_COLUMNS
dynast.preprocessing.conversion.BASE_COLUMNS
dynast.preprocessing.conversion.COLUMNS
dynast.preprocessing.conversion.CSV_COLUMNS
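
As a rough illustration of how these constants likely relate (the actual values live in the module source; everything below is an assumption): conversions are two-letter reference-to-read codes over A/C/G/T, and complementing a conversion complements both bases.

    # Illustrative only; the real constants are defined in this module
    # and may differ.
    COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

    # The 12 possible conversions (reference base != read base).
    CONVERSION_COLUMNS = [
        ref + read for ref in 'ACGT' for read in 'ACGT' if ref != read
    ]

    # Complementing a conversion complements both bases: 'TC' -> 'AG'.
    CONVERSION_COMPLEMENT = {
        conversion: COMPLEMENT[conversion[0]] + COMPLEMENT[conversion[1]]
        for conversion in CONVERSION_COLUMNS
    }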
dynast.preprocessing.conversion.read_counts(counts_path: str, *args, **kwargs) → pandas.DataFrame

Read counts CSV as a pandas dataframe.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters
counts_path

Path to CSV

Returns

Counts dataframe
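
A minimal usage sketch; the path is a placeholder, and any extra arguments are forwarded to pandas.read_csv as described above:

    from dynast.preprocessing import conversion

    # 'counts.csv' is a placeholder path; nrows is forwarded to pandas.read_csv.
    df_counts = conversion.read_counts('counts.csv', nrows=1000)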

dynast.preprocessing.conversion.complement_counts(df_counts: pandas.DataFrame, gene_infos: dict) → pandas.DataFrame

Complement the counts in the counts dataframe according to gene strand.

Parameters
df_counts

Counts dataframe

gene_infos

Dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf

Returns

Counts dataframe with conversion counts complemented for reads mapping to genes on the reverse strand
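
A sketch of the underlying idea, assuming a hypothetical gene column GX, per-conversion count columns, and a strand field in each gene_infos entry; this is not the library's implementation:

    import pandas as pd

    def complement_counts_sketch(df_counts, gene_infos, conversion_complement):
        # conversion_complement maps e.g. 'TC' -> 'AG'. A full implementation
        # would also complement the base-content columns.
        on_reverse = df_counts['GX'].map(lambda gx: gene_infos[gx]['strand'] == '-')
        forward = df_counts[~on_reverse]
        # Renaming the conversion columns swaps each count with its complement.
        reverse = df_counts[on_reverse].rename(columns=conversion_complement)
        return pd.concat([forward, reverse], ignore_index=True)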

dynast.preprocessing.conversion.subset_counts(df_counts: pandas.DataFrame, key: typing_extensions.Literal['total', 'transcriptome', 'spliced', 'unspliced']) → pandas.DataFrame

Subset the given counts DataFrame to only contain reads of the desired key.

Parameters
df_counts

Counts dataframe

key

Read types to subset

Returns

Subset dataframe
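
A hedged sketch of what each key selects, assuming hypothetical boolean transcriptome and categorical velocity columns:

    import pandas as pd

    def subset_counts_sketch(df_counts: pd.DataFrame, key: str) -> pd.DataFrame:
        if key == 'total':
            # 'total' keeps every read.
            return df_counts
        if key == 'transcriptome':
            # Keep only reads flagged as aligning to the transcriptome.
            return df_counts[df_counts['transcriptome']]
        # 'spliced' / 'unspliced' select on the velocity assignment.
        return df_counts[df_counts['velocity'] == key]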

dynast.preprocessing.conversion.drop_multimappers(df_counts: pandas.DataFrame, conversions: Optional[FrozenSet[str]] = None) → pandas.DataFrame

Drop multimappings that have the same read ID where:

* some map to the transcriptome while some do not: drop non-transcriptome alignments
* none map to the transcriptome AND aligned to multiple genes: drop all
* none map to the transcriptome AND assigned multiple velocity types: set to ambiguous

TODO: This function can probably be removed because BAM parsing only considers primary alignments now.

Parameters
df_counts

Counts dataframe

conversions

Conversions to prioritize

Returns

Counts dataframe with multimappers appropriately filtered
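
The three rules above can be sketched as a per-read-ID resolution, assuming hypothetical read_id, GX, transcriptome, and velocity columns; this illustrates the logic, not the library's implementation:

    import pandas as pd

    def drop_multimappers_sketch(df_counts: pd.DataFrame) -> pd.DataFrame:
        def resolve(group: pd.DataFrame) -> pd.DataFrame:
            if group['transcriptome'].any():
                # Some alignments hit the transcriptome: keep only those.
                return group[group['transcriptome']]
            if group['GX'].nunique() > 1:
                # No transcriptome hit and multiple genes: drop all.
                return group.iloc[0:0]
            if group['velocity'].nunique() > 1:
                # One gene but conflicting velocity types: mark ambiguous.
                group = group.copy()
                group['velocity'] = 'ambiguous'
            return group

        return df_counts.groupby('read_id', group_keys=False).apply(resolve)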

dynast.preprocessing.conversion.deduplicate_counts(df_counts: pandas.DataFrame, conversions: Optional[FrozenSet[str]] = None, use_conversions: bool = True) → pandas.DataFrame

Deduplicate counts based on barcode, UMI, and gene.

The order of priority is the following:

1. If use_conversions=True, reads that have at least one such conversion
2. Reads that align to the transcriptome (exon only)
3. Reads that have the highest alignment score
4. If conversions is provided, reads that have a larger sum of such conversions; otherwise, reads that have a larger sum of all conversions

Parameters
df_counts

Counts dataframe

conversions

Conversions to prioritize, defaults to None

use_conversions

Whether to prioritize reads that have at least one conversion

Returns

Deduplicated counts dataframe
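
The priority order lends itself to a sort-then-drop-duplicates sketch, with hypothetical column names (barcode, umi, GX, transcriptome, score); this is not the library's implementation:

    import pandas as pd

    def deduplicate_counts_sketch(df_counts, conversions=('TC',), use_conversions=True):
        df = df_counts.copy()
        df['_conv_sum'] = df[list(conversions)].sum(axis=1)
        sort_cols = ['transcriptome', 'score', '_conv_sum']
        if use_conversions:
            # Highest priority: reads with at least one such conversion.
            df['_has_conv'] = df['_conv_sum'] > 0
            sort_cols = ['_has_conv'] + sort_cols
        # Sort so the highest-priority read is last within each group,
        # then keep that read per (barcode, UMI, gene).
        df = df.sort_values(sort_cols)
        df = df.drop_duplicates(subset=['barcode', 'umi', 'GX'], keep='last')
        return df.drop(columns=['_conv_sum', '_has_conv'], errors='ignore')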

dynast.preprocessing.conversion.drop_multimappers_part(counter: multiprocessing.Value, lock: multiprocessing.Lock, split_path: str, out_path: str, conversions: Optional[FrozenSet[str]] = None) → str

Helper function to parallelize drop_multimappers().

dynast.preprocessing.conversion.deduplicate_counts_part(counter: multiprocessing.Value, lock: multiprocessing.Lock, split_path: str, out_path: str, conversions: Optional[FrozenSet[str]], use_conversions: bool = True)

Helper function to parallelize deduplicate_counts().

dynast.preprocessing.conversion.split_counts_by_velocity(df_counts: pandas.DataFrame) → Dict[str, pandas.DataFrame]

Split the given counts dataframe by the velocity column.

Parameters
df_counts

Counts dataframe

Returns

Dictionary containing velocity column values as keys and the corresponding subset dataframes as values
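
In spirit, this is a pandas groupby on the velocity column:

    from typing import Dict

    import pandas as pd

    def split_by_velocity_sketch(df_counts: pd.DataFrame) -> Dict[str, pd.DataFrame]:
        # One DataFrame per distinct velocity value.
        return {
            velocity: group.reset_index(drop=True)
            for velocity, group in df_counts.groupby('velocity')
        }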

dynast.preprocessing.conversion.count_no_conversions(alignments_path: str, counter: multiprocessing.Value, lock: multiprocessing.Lock, index: List[Tuple[int, int, int]], barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, update_every: int = 10000) → str

Count reads that have no conversion.

Parameters
alignments_path

Alignments CSV path

counter

Counter that keeps track of how many reads have been processed

lock

Semaphore for the counter so that multiple processes do not modify it at the same time

index

Index for conversions CSV

barcodes

List of barcodes to be considered. All barcodes are considered if not provided

temp_dir

Path to temporary directory

update_every

Update the counter every this many reads

Returns

Path to temporary counts CSV
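
A sketch of how the shared counter and lock expected by this signature are typically constructed (the index and paths would come from earlier pipeline steps and are not shown):

    import multiprocessing

    # Shared integer counter and a lock guarding its updates, matching
    # the counter/lock parameters above.
    counter = multiprocessing.Value('i', 0)
    lock = multiprocessing.Lock()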

dynast.preprocessing.conversion.count_conversions_part(conversions_path: str, alignments_path: str, counter: multiprocessing.Value, lock: multiprocessing.Lock, index: List[Tuple[int, int, int]], barcodes: Optional[List[str]] = None, snps: Optional[Dict[str, Dict[str, Set[int]]]] = None, quality: int = 27, temp_dir: Optional[str] = None, update_every: int = 10000) → str

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode and gene. This function is used exclusively for multiprocessing.

Parameters
conversions_path

Path to conversions CSV

alignments_path

Path to the alignments CSV containing read alignment information

counter

Counter that keeps track of how many reads have been processed

lock

Semaphore for the counter so that multiple processes do not modify it at the same time

index

Index for conversions CSV

barcodes

List of barcodes to be considered. All barcodes are considered if not provided

snps

Dictionary with contigs as keys and sets of genomic positions as values, indicating SNP locations

quality

Only count conversions with PHRED quality greater than this value

temp_dir

Path to temporary directory, defaults to None

update_every

Update the counter every this many reads

Returns

Path to temporary counts CSV
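
For reference, PHRED qualities in standard Phred+33 encoding are decoded as ord(char) - 33, and a conversion is counted only when its quality exceeds the threshold; a minimal illustration (the helper name is hypothetical):

    # Phred+33 decoding; conversions with quality <= threshold are skipped,
    # per the quality parameter's description.
    def passes_quality(qual_char: str, threshold: int = 27) -> bool:
        return (ord(qual_char) - 33) > threshold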

dynast.preprocessing.conversion.count_conversions(conversions_path: str, alignments_path: str, index_path: str, counts_path: str, gene_infos: dict, barcodes: Optional[List[str]] = None, snps: Optional[Dict[str, Dict[str, Set[int]]]] = None, quality: int = 27, conversions: Optional[FrozenSet[str]] = None, dedup_use_conversions: bool = True, n_threads: int = 8, temp_dir: Optional[str] = None) → str

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode. When a duplicate UMI for a barcode is observed, the read with the greatest number of conversions is selected.

Parameters
conversions_path

Path to conversions CSV

alignments_path

Path to the alignments CSV containing read alignment information

index_path

Path to conversions index

counts_path

Path to write counts CSV

gene_infos

Dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

barcodes

List of barcodes to be considered. All barcodes are considered if not provided

snps

Dictionary with contigs as keys and sets of genomic positions as values, indicating SNP locations

conversions

Conversions to prioritize when deduplicating; only applicable to UMI technologies

dedup_use_conversions

Prioritize reads that have at least one conversion when deduplicating

quality

Only count conversions with PHRED quality greater than this value

n_threads

Number of threads

temp_dir

Path to temporary directory

Returns

Path to counts CSV
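
A hypothetical end-to-end invocation; every path below is a placeholder, and gene_infos would come from GTF parsing as noted in the parameter description:

    from dynast.preprocessing import conversion

    gene_infos = {}  # placeholder; populate from the GTF parser
    counts_csv = conversion.count_conversions(
        'conversions.csv',   # placeholder paths throughout
        'alignments.csv',
        'conversions.idx',
        'counts.csv',
        gene_infos,
        quality=27,
        conversions=frozenset({'TC'}),  # assumed two-letter code format
        n_threads=8,
    )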