dynast.preprocessing

Package Contents

Functions

aggregate_counts(df_counts: pandas.DataFrame, aggregates_path: str, conversions: FrozenSet[str] = frozenset({'TC'})) → str

Aggregate conversion counts for each pair of bases.

calculate_mutation_rates(df_counts: pandas.DataFrame, rates_path: str, group_by: Optional[List[str]] = None) → str

Calculate mutation rate for each pair of bases.

merge_aggregates(*dfs: pandas.DataFrame) → pandas.DataFrame

Merge multiple aggregate dataframes into one.

read_aggregates(aggregates_path: str) → pandas.DataFrame

Read aggregates CSV as a pandas dataframe.

read_rates(rates_path: str) → pandas.DataFrame

Read mutation rates CSV as a pandas dataframe.

check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8) → bool

Check whether BAM contains duplicates.

check_bam_contains_secondary(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → bool

Check whether BAM contains secondary alignments.

check_bam_contains_unmapped(bam_path: str) → bool

Check whether BAM contains unmapped reads.

get_tags_from_bam(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → Set[str]

Utility function to retrieve all read tags present in a BAM.

parse_all_reads(bam_path: str, conversions_path: str, alignments_path: str, index_path: str, gene_infos: dict, transcript_infos: dict, strand: typing_extensions.Literal[forward, reverse, unstranded] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, n_threads: int = 8, temp_dir: Optional[str] = None, nasc: bool = False, control: bool = False, velocity: bool = True, strict_exon_overlap: bool = False, return_splits: bool = False) → Union[Tuple[str, str, str], Tuple[str, str, str, List[Tuple[str, int]]]]

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

read_alignments(alignments_path: str, *args, **kwargs) → pandas.DataFrame

Read alignments CSV as a pandas DataFrame.

read_conversions(conversions_path: str, *args, **kwargs) → pandas.DataFrame

Read conversions CSV as a pandas DataFrame.

select_alignments(df_alignments: pandas.DataFrame) → Set[Tuple[str, str]]

Select alignments among duplicates.

sort_and_index_bam(bam_path: str, out_path: str, n_threads: int = 8, temp_dir: Optional[str] = None) → str

Sort and index BAM.

call_consensus(bam_path: str, out_path: str, gene_infos: dict, strand: typing_extensions.Literal[forward, reverse, unstranded] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, quality: int = 27, add_RS_RI: bool = False, temp_dir: Optional[str] = None, n_threads: int = 8) → str

Call consensus sequences from BAM.

complement_counts(df_counts: pandas.DataFrame, gene_infos: dict) → pandas.DataFrame

Complement the counts in the counts dataframe according to gene strand.

count_conversions(conversions_path: str, alignments_path: str, index_path: str, counts_path: str, gene_infos: dict, barcodes: Optional[List[str]] = None, snps: Optional[Dict[str, Dict[str, Set[int]]]] = None, quality: int = 27, conversions: Optional[FrozenSet[str]] = None, dedup_use_conversions: bool = True, n_threads: int = 8, temp_dir: Optional[str] = None) → str

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to.

deduplicate_counts(df_counts: pandas.DataFrame, conversions: Optional[FrozenSet[str]] = None, use_conversions: bool = True) → pandas.DataFrame

Deduplicate counts based on barcode, UMI, and gene.

read_counts(counts_path: str, *args, **kwargs) → pandas.DataFrame

Read counts CSV as a pandas dataframe.

split_counts_by_velocity(df_counts: pandas.DataFrame) → Dict[str, pandas.DataFrame]

Split the given counts dataframe by the velocity column.

subset_counts(df_counts: pandas.DataFrame, key: typing_extensions.Literal[total, transcriptome, spliced, unspliced]) → pandas.DataFrame

Subset the given counts DataFrame to only contain reads of the desired key.

calculate_coverage(bam_path: str, conversions: Dict[str, Set[int]], coverage_path: str, alignments: Optional[List[Tuple[str, int]]] = None, umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, velocity: bool = True) → str

Calculate coverage of each genomic position per barcode.

read_coverage(coverage_path: str) → Dict[str, Dict[int, int]]

Read coverage CSV as a dictionary.

detect_snps(conversions_path: str, index_path: str, coverage: Dict[str, Dict[int, int]], snps_path: str, alignments: Optional[List[Tuple[str, int]]] = None, conversions: Optional[FrozenSet[str]] = None, quality: int = 27, threshold: float = 0.5, min_coverage: int = 1, n_threads: int = 8) → str

Detect SNPs.

read_snp_csv(snp_csv: str) → Dict[str, Dict[str, Set[int]]]

Read a user-provided SNPs CSV.

read_snps(snps_path: str) → Dict[str, Dict[str, Set[int]]]

Read SNPs CSV as a dictionary.

Attributes

CONVERSION_COMPLEMENT

dynast.preprocessing.aggregate_counts(df_counts: pandas.DataFrame, aggregates_path: str, conversions: FrozenSet[str] = frozenset({'TC'})) str[source]

Aggregate conversion counts for each pair of bases.

Parameters
df_counts

Counts dataframe, with complemented reverse strand bases

aggregates_path

Path to write aggregate CSV

conversions

Conversion(s) in question

Returns

Path to aggregate CSV that was written

dynast.preprocessing.calculate_mutation_rates(df_counts: pandas.DataFrame, rates_path: str, group_by: Optional[List[str]] = None) str[source]

Calculate mutation rate for each pair of bases.

Parameters
df_counts

Counts dataframe, with complemented reverse strand bases

rates_path

Path to write rates CSV

group_by

Column(s) to group calculations by, defaults to None, which combines all rows

Returns

Path to rates CSV
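
A minimal sketch of computing rates and aggregates from a counts CSV (the file paths are placeholders, and grouping by a barcode column is an assumption):

    from dynast.preprocessing import (
        aggregate_counts,
        calculate_mutation_rates,
        read_counts,
        read_rates,
    )

    # Assumes counts.csv was produced by count_conversions and already has
    # reverse-strand bases complemented (see complement_counts below).
    df_counts = read_counts("counts.csv")

    # Per-barcode mutation rates; group_by=None would instead combine all rows.
    rates_path = calculate_mutation_rates(df_counts, "rates.csv", group_by=["barcode"])
    df_rates = read_rates(rates_path)

    # Aggregate T>C conversion counts for downstream estimation.
    aggregates_path = aggregate_counts(df_counts, "aggregates.csv", conversions=frozenset({"TC"}))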

dynast.preprocessing.merge_aggregates(*dfs: pandas.DataFrame) pandas.DataFrame[source]

Merge multiple aggregate dataframes into one.

Parameters
dfs

Dataframes to merge

Returns

Merged dataframe

dynast.preprocessing.read_aggregates(aggregates_path: str) pandas.DataFrame[source]

Read aggregates CSV as a pandas dataframe.

Parameters
aggregates_path

Path to aggregates CSV

Returns

Aggregates dataframe
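
For example (the paths are placeholders), aggregates written for separate runs can be read back and combined:

    from dynast.preprocessing import merge_aggregates, read_aggregates

    # Hypothetical aggregate CSVs produced by aggregate_counts.
    dfs = [read_aggregates(path) for path in ("aggregates_0.csv", "aggregates_1.csv")]
    df_merged = merge_aggregates(*dfs)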

dynast.preprocessing.read_rates(rates_path: str) pandas.DataFrame[source]

Read mutation rates CSV as a pandas dataframe.

Parameters
rates_path

Path to rates CSV

Returns

Rates dataframe

dynast.preprocessing.check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8) bool[source]

Check whether BAM contains duplicates.

Parameters
bam_path

Path to BAM

n_reads

Number of reads to consider

n_threads

Number of threads

Returns

Whether duplicates were detected

dynast.preprocessing.check_bam_contains_secondary(bam_path: str, n_reads: int = 100000, n_threads: int = 8) bool[source]

Check whether BAM contains secondary alignments.

Parameters
bam_path

Path to BAM

n_reads

Number of reads to consider

n_threads

Number of threads

Returns

Whether secondary alignments were detected

dynast.preprocessing.check_bam_contains_unmapped(bam_path: str) bool[source]

Check whether BAM contains unmapped reads.

Parameters
bam_path

Path to BAM

Returns

Whether unmapped reads were detected

dynast.preprocessing.get_tags_from_bam(bam_path: str, n_reads: int = 100000, n_threads: int = 8) Set[str][source]

Utility function to retrieve all read tags present in a BAM.

Parameters
bam_path

Path to BAM

n_reads

Number of reads to consider

n_threads

Number of threads

Returns

Set of all tags found
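
A quick pre-flight check over the first reads of a BAM might look like the following sketch (the BAM path and the CB/UB tag names are assumptions):

    from dynast.preprocessing import (
        check_bam_contains_duplicate,
        check_bam_contains_secondary,
        check_bam_contains_unmapped,
        get_tags_from_bam,
    )

    bam_path = "Aligned.sortedByCoord.out.bam"  # hypothetical path

    # Each check inspects at most n_reads reads.
    has_duplicates = check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=4)
    has_secondary = check_bam_contains_secondary(bam_path, n_reads=100000, n_threads=4)
    has_unmapped = check_bam_contains_unmapped(bam_path)

    # Inspect which tags are present to decide on umi_tag/barcode_tag/gene_tag.
    tags = get_tags_from_bam(bam_path, n_reads=100000, n_threads=4)
    print(has_duplicates, has_secondary, has_unmapped, "CB" in tags, "UB" in tags)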

dynast.preprocessing.parse_all_reads(bam_path: str, conversions_path: str, alignments_path: str, index_path: str, gene_infos: dict, transcript_infos: dict, strand: typing_extensions.Literal[forward, reverse, unstranded] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, n_threads: int = 8, temp_dir: Optional[str] = None, nasc: bool = False, control: bool = False, velocity: bool = True, strict_exon_overlap: bool = False, return_splits: bool = False) Union[Tuple[str, str, str], Tuple[str, str, str, List[Tuple[str, int]]]][source]

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

Parameters
bam_path

Path to alignment BAM file

conversions_path

Path to output information about reads that have conversions

alignments_path

Path to alignments information about reads

index_path

Path to conversions index

gene_infos

Dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

transcript_infos

Dictionary containing transcript information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

strand

Strandedness of the sequencing protocol, defaults to forward; may be one of the following: forward, reverse, unstranded

umi_tag

BAM tag that encodes UMI; if not provided, NA is output in the umi column

barcode_tag

BAM tag that encodes cell barcode; if not provided, NA is output in the barcode column

gene_tag

BAM tag that encodes gene assignment, defaults to GX

barcodes

List of barcodes to be considered. All barcodes are considered if not provided

n_threads

Number of threads

temp_dir

Path to temporary directory

nasc

Flag to change behavior to match the NASC-seq pipeline

velocity

Whether or not to assign a velocity type to each read

strict_exon_overlap

Whether to use a stricter algorithm to assign reads as spliced

return_splits

Whether to return BAM splits for later reuse

Returns

(path to conversions, path to alignments, path to conversions index)

If return_splits is True, an additional value is returned: a list of (split BAM path, number of reads) tuples.
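
A minimal sketch of a typical invocation is shown below. The BAM path, output paths, and the CB/UB tag names are assumptions; gene_infos and transcript_infos are placeholders for the dictionaries parsed from the GTF as described above.

    from dynast.preprocessing import parse_all_reads

    gene_infos = {}        # placeholder: gene information parsed from the GTF
    transcript_infos = {}  # placeholder: transcript information parsed from the GTF

    conversions_path, alignments_path, index_path = parse_all_reads(
        "Aligned.sortedByCoord.out.bam",  # hypothetical input BAM
        "conversions.csv",
        "alignments.csv",
        "conversions.idx",
        gene_infos,
        transcript_infos,
        strand="forward",
        umi_tag="UB",
        barcode_tag="CB",
        gene_tag="GX",
        n_threads=8,
        velocity=True,
    )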

dynast.preprocessing.read_alignments(alignments_path: str, *args, **kwargs) pandas.DataFrame[source]

Read alignments CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters
alignments_path

path to alignments CSV

Returns

Alignments dataframe

dynast.preprocessing.read_conversions(conversions_path: str, *args, **kwargs) pandas.DataFrame[source]

Read conversions CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters
conversions_path

Path to conversions CSV

Returns

Conversions dataframe

dynast.preprocessing.select_alignments(df_alignments: pandas.DataFrame) Set[Tuple[str, str]][source]

Select alignments among duplicates. This function performs preliminary deduplication and returns a set of (read_id, alignment index) tuples to use for coverage calculation and SNP detection.

Parameters
df_alignments

Alignments dataframe

Returns

Set of (read_id, alignment index) that were selected
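
For example (the CSV path is a placeholder), the alignments written by parse_all_reads can be deduplicated before coverage calculation and SNP detection:

    from dynast.preprocessing import read_alignments, select_alignments

    df_alignments = read_alignments("alignments.csv")  # written by parse_all_reads
    selected = select_alignments(df_alignments)        # set of (read_id, alignment index)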

dynast.preprocessing.sort_and_index_bam(bam_path: str, out_path: str, n_threads: int = 8, temp_dir: Optional[str] = None) str[source]

Sort and index BAM.

If the BAM is already sorted, the sorting step is skipped.

Parameters
bam_path

Path to alignment BAM file to sort

out_path

Path to output sorted BAM

n_threads

Number of threads

temp_dir

Path to temporary directory

Returns

Path to sorted and indexed BAM
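
For example (the paths are placeholders):

    from dynast.preprocessing import sort_and_index_bam

    sorted_bam = sort_and_index_bam("unsorted.bam", "sorted.bam", n_threads=8)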

dynast.preprocessing.call_consensus(bam_path: str, out_path: str, gene_infos: dict, strand: typing_extensions.Literal[forward, reverse, unstranded] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, quality: int = 27, add_RS_RI: bool = False, temp_dir: Optional[str] = None, n_threads: int = 8) str[source]

Call consensus sequences from BAM.

Parameters
bam_path

Path to BAM

out_path

Output BAM path

gene_infos

Gene information, as parsed from the GTF

strand

Protocol strandedness

umi_tag

BAM tag containing the UMI

barcode_tag

BAM tag containing the barcode

gene_tag

BAM tag containing the assigned gene

barcodes

List of barcodes to consider

quality

Quality threshold

add_RS_RI

Add RS and RI BAM tags for debugging

temp_dir

Temporary directory

n_threads

Number of threads

Returns

Path to sorted and indexed consensus BAM
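
A minimal sketch (the paths and the CB/UB tag names are assumptions; gene_infos is a placeholder for the parsed GTF information):

    from dynast.preprocessing import call_consensus

    gene_infos = {}  # placeholder: gene information parsed from the GTF

    consensus_bam = call_consensus(
        "sorted.bam",     # hypothetical input BAM
        "consensus.bam",
        gene_infos,
        strand="forward",
        umi_tag="UB",
        barcode_tag="CB",
        gene_tag="GX",
        quality=27,
        n_threads=8,
    )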

dynast.preprocessing.CONVERSION_COMPLEMENT[source]

dynast.preprocessing.complement_counts(df_counts: pandas.DataFrame, gene_infos: dict) pandas.DataFrame[source]

Complement the counts in the counts dataframe according to gene strand.

Parameters
df_counts

Counts dataframe

gene_infos

Dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf

Returns

Counts dataframe with counts complemented for reads mapping to genes on the reverse strand
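
For example (the counts path is a placeholder and gene_infos is a placeholder for the parsed GTF information):

    from dynast.preprocessing import complement_counts, read_counts

    gene_infos = {}  # placeholder: gene information parsed from the GTF

    df_counts = read_counts("counts.csv")  # hypothetical counts CSV
    # Complement base counts for reads assigned to genes on the reverse strand.
    df_complemented = complement_counts(df_counts, gene_infos)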

dynast.preprocessing.count_conversions(conversions_path: str, alignments_path: str, index_path: str, counts_path: str, gene_infos: dict, barcodes: Optional[List[str]] = None, snps: Optional[Dict[str, Dict[str, Set[int]]]] = None, quality: int = 27, conversions: Optional[FrozenSet[str]] = None, dedup_use_conversions: bool = True, n_threads: int = 8, temp_dir: Optional[str] = None) str[source]

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode. When a duplicate UMI for a barcode is observed, the read with the greatest number of conversions is selected.

Parameters
conversions_path

Path to conversions CSV

alignments_path

Path to alignments information about reads

index_path

Path to conversions index

counts_path

Path to write counts CSV

gene_infos

Dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

barcodes

List of barcodes to be considered. All barcodes are considered if not provided

snps

Dictionary with contigs as keys and sets of genomic positions as values that indicate SNP locations

conversions

Conversions to prioritize when deduplicating; only applicable for UMI technologies

dedup_use_conversions

Prioritize reads that have at least one conversion when deduplicating

quality

Only count conversions with PHRED quality greater than this value

n_threads

Number of threads

temp_dir

Path to temporary directory

Returns

Path to counts CSV
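
A minimal sketch chaining the outputs of parse_all_reads and detect_snps into conversion counting (all paths are placeholders; gene_infos is a placeholder for the parsed GTF information):

    from dynast.preprocessing import count_conversions, read_snps

    gene_infos = {}  # placeholder: gene information parsed from the GTF
    snps = read_snps("snps.csv")  # optional: SNPs detected by detect_snps

    counts_path = count_conversions(
        "conversions.csv",   # from parse_all_reads
        "alignments.csv",    # from parse_all_reads
        "conversions.idx",   # from parse_all_reads
        "counts.csv",
        gene_infos,
        snps=snps,
        quality=27,
        conversions=frozenset({"TC"}),
        n_threads=8,
    )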

dynast.preprocessing.deduplicate_counts(df_counts: pandas.DataFrame, conversions: Optional[FrozenSet[str]] = None, use_conversions: bool = True) pandas.DataFrame[source]

Deduplicate counts based on barcode, UMI, and gene.

The order of priority is the following:

1. If use_conversions=True, reads that have at least one such conversion
2. Reads that align to the transcriptome (exon only)
3. Reads that have the highest alignment score
4. If conversions is provided, reads that have a larger sum of such conversions; otherwise, reads that have a larger sum of all conversions

Parameters
df_counts

Counts dataframe

conversions

Conversions to prioritize, defaults to None

use_conversions

Prioritize reads that have conversions first

Returns

Deduplicated counts dataframe
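
For example (the counts path is a placeholder), prioritizing T>C conversions during deduplication:

    from dynast.preprocessing import deduplicate_counts, read_counts

    df_counts = read_counts("counts.csv")  # hypothetical counts CSV
    df_deduplicated = deduplicate_counts(
        df_counts, conversions=frozenset({"TC"}), use_conversions=True
    )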

dynast.preprocessing.read_counts(counts_path: str, *args, **kwargs) pandas.DataFrame[source]

Read counts CSV as a pandas dataframe.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters
counts_path

Path to CSV

Returns

Counts dataframe

dynast.preprocessing.split_counts_by_velocity(df_counts: pandas.DataFrame) Dict[str, pandas.DataFrame][source]

Split the given counts dataframe by the velocity column.

Parameters
df_counts

Counts dataframe

Returns

Dictionary containing velocity column values as keys and the subset dataframe as values

dynast.preprocessing.subset_counts(df_counts: pandas.DataFrame, key: typing_extensions.Literal[total, transcriptome, spliced, unspliced]) pandas.DataFrame[source]

Subset the given counts DataFrame to only contain reads of the desired key.

Parameters
df_counts

Counts dataframe

key

Read types to subset

Returns

Subset dataframe
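
A sketch of splitting and subsetting a counts dataframe (the counts path is a placeholder; the example velocity values are assumptions):

    from dynast.preprocessing import read_counts, split_counts_by_velocity, subset_counts

    df_counts = read_counts("counts.csv")  # hypothetical counts CSV

    # Keyed by the values of the velocity column (e.g. 'spliced', 'unspliced').
    by_velocity = split_counts_by_velocity(df_counts)

    # Or subset directly to one read class.
    df_spliced = subset_counts(df_counts, "spliced")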

dynast.preprocessing.calculate_coverage(bam_path: str, conversions: Dict[str, Set[int]], coverage_path: str, alignments: Optional[List[Tuple[str, int]]] = None, umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, velocity: bool = True) str[source]

Calculate coverage of each genomic position per barcode.

Parameters
bam_path

Path to alignment BAM file

conversions

Dictionary with contigs as keys and sets of genomic positions as values, indicating positions where conversions were observed

coverage_path

Path to write coverage CSV

alignments

Set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.

umi_tag

BAM tag that encodes UMI, if not provided, NA is output in the umi column

barcode_tag

BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column

gene_tag

BAM tag that encodes gene assignment

barcodes

List of barcodes to be considered. All barcodes are considered if not provided

temp_dir

Path to temporary directory

velocity

Whether or not velocities were assigned

Returns

Path to coverage CSV
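
A minimal sketch (the BAM and CSV paths, the example conversion positions, and the CB/UB tag names are assumptions):

    from dynast.preprocessing import calculate_coverage, read_coverage

    # Contig -> set of genomic positions where conversions were observed,
    # for instance collected from the conversions CSV.
    conversions = {"chr1": {1000, 2500}}

    coverage_path = calculate_coverage(
        "Aligned.sortedByCoord.out.bam",  # hypothetical input BAM
        conversions,
        "coverage.csv",
        umi_tag="UB",
        barcode_tag="CB",
        gene_tag="GX",
        velocity=True,
    )
    coverage = read_coverage(coverage_path)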

dynast.preprocessing.read_coverage(coverage_path: str) Dict[str, Dict[int, int]][source]

Read coverage CSV as a dictionary.

Parameters
coverage_path

Path to coverage CSV

Returns

Coverage as a nested dictionary

dynast.preprocessing.detect_snps(conversions_path: str, index_path: str, coverage: Dict[str, Dict[int, int]], snps_path: str, alignments: Optional[List[Tuple[str, int]]] = None, conversions: Optional[FrozenSet[str]] = None, quality: int = 27, threshold: float = 0.5, min_coverage: int = 1, n_threads: int = 8) str[source]

Detect SNPs.

Parameters
conversions_path

Path to conversions CSV

index_path

Path to conversions index

coverage

Dictionary containing genomic coverage

snps_path

Path to output SNPs

alignments

Set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.

conversions

Set of conversions to consider

quality

Only count conversions with PHRED quality greater than this value

threshold

Positions with conversions / coverage > threshold will be considered as SNPs

min_coverage

Only positions covered by at least this many mapping reads are considered

n_threads

Number of threads

Returns

Path to SNPs CSV
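
A minimal sketch of SNP detection from previously computed coverage (all paths are placeholders):

    from dynast.preprocessing import detect_snps, read_coverage, read_snps

    coverage = read_coverage("coverage.csv")  # from calculate_coverage

    snps_path = detect_snps(
        "conversions.csv",   # from parse_all_reads
        "conversions.idx",   # from parse_all_reads
        coverage,
        "snps.csv",
        conversions=frozenset({"TC"}),
        quality=27,
        threshold=0.5,
        min_coverage=1,
        n_threads=8,
    )
    snps = read_snps(snps_path)  # nested dictionary of SNP positions by contig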

dynast.preprocessing.read_snp_csv(snp_csv: str) Dict[str, Dict[str, Set[int]]][source]

Read a user-provided SNPs CSV.

Parameters
snp_csv

Path to SNPs CSV

Returns

Dictionary of contigs as keys and sets of genomic positions with SNPs as values

dynast.preprocessing.read_snps(snps_path: str) Dict[str, Dict[str, Set[int]]][source]

Read SNPs CSV as a dictionary.

Parameters
snps_path

Path to SNPs CSV

Returns

Dictionary of contigs as keys and sets of genomic positions with SNPs as values