dynast.preprocessing

Package Contents

Functions

aggregate_counts(df_counts: pandas.DataFrame, aggregates_path: str, conversions: FrozenSet[str] = frozenset({'TC'})) → str

Aggregate conversion counts for each pair of bases.

calculate_mutation_rates(df_counts: pandas.DataFrame, rates_path: str, group_by: Optional[List[str]] = None) → str

Calculate mutation rate for each pair of bases.

merge_aggregates(*dfs: pandas.DataFrame) → pandas.DataFrame

Merge multiple aggregate dataframes into one.

read_aggregates(aggregates_path: str) → pandas.DataFrame

Read aggregates CSV as a pandas dataframe.

read_rates(rates_path: str) → pandas.DataFrame

Read mutation rates CSV as a pandas dataframe.

check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8) → bool

Check whether BAM contains duplicates.

check_bam_contains_secondary(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → bool

Check whether BAM contains secondary alignments.

check_bam_contains_unmapped(bam_path: str) → bool

Check whether BAM contains unmapped reads.

get_tags_from_bam(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → Set[str]

Utility function to retrieve all read tags present in a BAM.

parse_all_reads(bam_path: str, conversions_path: str, alignments_path: str, index_path: str, gene_infos: dict, transcript_infos: dict, strand: typing_extensions.Literal[forward, reverse, unstranded] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, n_threads: int = 8, temp_dir: Optional[str] = None, nasc: bool = False, control: bool = False, velocity: bool = True, strict_exon_overlap: bool = False, return_splits: bool = False) → Union[Tuple[str, str, str], Tuple[str, str, str, List[Tuple[str, int]]]]

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

read_alignments(alignments_path: str, *args, **kwargs) → pandas.DataFrame

Read alignments CSV as a pandas DataFrame.

read_conversions(conversions_path: str, *args, **kwargs) → pandas.DataFrame

Read conversions CSV as a pandas DataFrame.

select_alignments(df_alignments: pandas.DataFrame) → Set[Tuple[str, str]]

Select alignments among duplicates.

sort_and_index_bam(bam_path: str, out_path: str, n_threads: int = 8, temp_dir: Optional[str] = None) → str

Sort and index BAM.

call_consensus(bam_path: str, out_path: str, gene_infos: dict, strand: typing_extensions.Literal[forward, reverse, unstranded] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, quality: int = 27, add_RS_RI: bool = False, temp_dir: Optional[str] = None, n_threads: int = 8) → str

Call consensus sequences from BAM.

complement_counts(df_counts: pandas.DataFrame, gene_infos: dict) → pandas.DataFrame

Complement the counts in the counts dataframe according to gene strand.

count_conversions(conversions_path: str, alignments_path: str, index_path: str, counts_path: str, gene_infos: dict, barcodes: Optional[List[str]] = None, snps: Optional[Dict[str, Dict[str, Set[int]]]] = None, quality: int = 27, conversions: Optional[FrozenSet[str]] = None, dedup_use_conversions: bool = True, n_threads: int = 8, temp_dir: Optional[str] = None) → str

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to.

deduplicate_counts(df_counts: pandas.DataFrame, conversions: Optional[FrozenSet[str]] = None, use_conversions: bool = True) → pandas.DataFrame

Deduplicate counts based on barcode, UMI, and gene.

read_counts(counts_path: str, *args, **kwargs) → pandas.DataFrame

Read counts CSV as a pandas dataframe.

split_counts_by_velocity(df_counts: pandas.DataFrame) → Dict[str, pandas.DataFrame]

Split the given counts dataframe by the velocity column.

subset_counts(df_counts: pandas.DataFrame, key: typing_extensions.Literal[total, transcriptome, spliced, unspliced]) → pandas.DataFrame

Subset the given counts DataFrame to only contain reads of the desired key.

calculate_coverage(bam_path: str, conversions: Dict[str, Set[int]], coverage_path: str, alignments: Optional[List[Tuple[str, int]]] = None, umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, velocity: bool = True) → str

Calculate coverage of each genomic position per barcode.

read_coverage(coverage_path: str) → Dict[str, Dict[int, int]]

Read coverage CSV as a dictionary.

detect_snps(conversions_path: str, index_path: str, coverage: Dict[str, Dict[int, int]], snps_path: str, alignments: Optional[List[Tuple[str, int]]] = None, conversions: Optional[FrozenSet[str]] = None, quality: int = 27, threshold: float = 0.5, min_coverage: int = 1, n_threads: int = 8) → str

Detect SNPs.

read_snp_csv(snp_csv: str) → Dict[str, Dict[str, Set[int]]]

Read a user-provided SNPs CSV.

read_snps(snps_path: str) → Dict[str, Dict[str, Set[int]]]

Read SNPs CSV as a dictionary.

Attributes

CONVERSION_COMPLEMENT

dynast.preprocessing.aggregate_counts(df_counts: pandas.DataFrame, aggregates_path: str, conversions: FrozenSet[str] = frozenset({'TC'})) str[source]

Aggregate conversion counts for each pair of bases.

Parameters
df_counts

Counts dataframe, with complemented reverse strand bases

aggregates_path

Path to write aggregate CSV

conversions

Conversion(s) in question

Returns

Path to aggregate CSV that was written

dynast.preprocessing.calculate_mutation_rates(df_counts: pandas.DataFrame, rates_path: str, group_by: Optional[List[str]] = None) str[source]

Calculate mutation rate for each pair of bases.

Parameters
df_counts

Counts dataframe, with complemented reverse strand bases

rates_path

Path to write rates CSV

group_by

Column(s) to group calculations by, defaults to None, which combines all rows

Returns

Path to rates CSV
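
A minimal sketch of computing rates and aggregates from a counts CSV (the file paths are placeholders, and grouping by a barcode column is an assumption):

    from dynast.preprocessing import (
        aggregate_counts,
        calculate_mutation_rates,
        read_counts,
        read_rates,
    )

    # Assumes counts.csv was produced by count_conversions and already has
    # reverse-strand bases complemented (see complement_counts below).
    df_counts = read_counts("counts.csv")

    # Per-barcode mutation rates; group_by=None would instead combine all rows.
    rates_path = calculate_mutation_rates(df_counts, "rates.csv", group_by=["barcode"])
    df_rates = read_rates(rates_path)

    # Aggregate T>C conversion counts for downstream estimation.
    aggregates_path = aggregate_counts(df_counts, "aggregates.csv", conversions=frozenset({"TC"}))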

dynast.preprocessing.merge_aggregates(*dfs: pandas.DataFrame) pandas.DataFrame[source]

Merge multiple aggregate dataframes into one.

Parameters
dfs

Dataframes to merge

Returns

Merged dataframe

dynast.preprocessing.read_aggregates(aggregates_path: str) pandas.DataFrame[source]

Read aggregates CSV as a pandas dataframe.

Parameters
aggregates_path

Path to aggregates CSV

Returns

Aggregates dataframe
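
For example (the paths are placeholders), aggregates written for separate runs can be read back and combined:

    from dynast.preprocessing import merge_aggregates, read_aggregates

    # Hypothetical aggregate CSVs produced by aggregate_counts.
    dfs = [read_aggregates(path) for path in ("aggregates_0.csv", "aggregates_1.csv")]
    df_merged = merge_aggregates(*dfs)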

dynast.preprocessing.read_rates(rates_path: str) pandas.DataFrame[source]

Read mutation rates CSV as a pandas dataframe.

Parameters
rates_path

Path to rates CSV

Returns

Rates dataframe

dynast.preprocessing.check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8) bool[source]

Check whether BAM contains duplicates.

Parameters
bam_path

Path to BAM

n_reads

Number of reads to consider

n_threads

Number of threads

Returns

Whether duplicates were detected

dynast.preprocessing.check_bam_contains_secondary(bam_path: str, n_reads: int = 100000, n_threads: int = 8) bool[source]

Check whether BAM contains secondary alignments.

Parameters
bam_path

Path to BAM

n_reads

Number of reads to consider

n_threads

Number of threads

Returns

Whether secondary alignments were detected

dynast.preprocessing.check_bam_contains_unmapped(bam_path: str) bool[source]

Check whether BAM contains unmapped reads.

Parameters
bam_path

Path to BAM

Returns

Whether unmapped reads were detected

dynast.preprocessing.get_tags_from_bam(bam_path: str, n_reads: int = 100000, n_threads: int = 8) Set[str][source]

Utility function to retrieve all read tags present in a BAM.

Parameters
bam_path

Path to BAM

n_reads

Number of reads to consider

n_threads

Number of threads

Returns

Set of all tags found
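
A quick pre-flight check over the first reads of a BAM might look like the following sketch (the BAM path and the CB/UB tag names are assumptions):

    from dynast.preprocessing import (
        check_bam_contains_duplicate,
        check_bam_contains_secondary,
        check_bam_contains_unmapped,
        get_tags_from_bam,
    )

    bam_path = "Aligned.sortedByCoord.out.bam"  # hypothetical path

    # Each check inspects at most n_reads reads.
    has_duplicates = check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=4)
    has_secondary = check_bam_contains_secondary(bam_path, n_reads=100000, n_threads=4)
    has_unmapped = check_bam_contains_unmapped(bam_path)

    # Inspect which tags are present to decide on umi_tag/barcode_tag/gene_tag.
    tags = get_tags_from_bam(bam_path, n_reads=100000, n_threads=4)
    print(has_duplicates, has_secondary, has_unmapped, "CB" in tags, "UB" in tags)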

dynast.preprocessing.parse_all_reads(bam_path: str, conversions_path: str, alignments_path: str, index_path: str, gene_infos: dict, transcript_infos: dict, strand: typing_extensions.Literal[forward, reverse, unstranded] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, n_threads: int = 8, temp_dir: Optional[str] = None, nasc: bool = False, control: bool = False, velocity: bool = True, strict_exon_overlap: bool = False, return_splits: bool = False) Union[Tuple[str, str, str], Tuple[str, str, str, List[Tuple[str, int]]]][source]

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

Parameters
bam_path

Path to alignment BAM file

conversions_path

Path to output information about reads that have conversions

alignments_path

Path to alignments information about reads

index_path

Path to conversions index

gene_infos

Dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

transcript_infos

Dictionary containing transcript information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

strand

Strandedness of the sequencing protocol, defaults to forward; may be one of the following: forward, reverse, unstranded

umi_tag

BAM tag that encodes UMI; if not provided, NA is output in the umi column

barcode_tag

BAM tag that encodes cell barcode; if not provided, NA is output in the barcode column

gene_tag

BAM tag that encodes gene assignment, defaults to GX

barcodes

List of barcodes to be considered. All barcodes are considered if not provided

n_threads

Number of threads

temp_dir

Path to temporary directory

nasc

Flag to change behavior to match the NASC-seq pipeline

velocity

Whether or not to assign a velocity type to each read

strict_exon_overlap

Whether to use a stricter algorithm to assign reads as spliced

return_splits

Whether to return BAM splits for later reuse

Returns

(path to conversions, path to alignments, path to conversions index)

If return_splits is True, an additional value is returned: a list of (split BAM path, number of reads) tuples.
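
A minimal sketch of a typical invocation is shown below. The BAM path, output paths, and the CB/UB tag names are assumptions; gene_infos and transcript_infos are placeholders for the dictionaries parsed from the GTF as described above.

    from dynast.preprocessing import parse_all_reads

    gene_infos = {}        # placeholder: gene information parsed from the GTF
    transcript_infos = {}  # placeholder: transcript information parsed from the GTF

    conversions_path, alignments_path, index_path = parse_all_reads(
        "Aligned.sortedByCoord.out.bam",  # hypothetical input BAM
        "conversions.csv",
        "alignments.csv",
        "conversions.idx",
        gene_infos,
        transcript_infos,
        strand="forward",
        umi_tag="UB",
        barcode_tag="CB",
        gene_tag="GX",
        n_threads=8,
        velocity=True,
    )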

dynast.preprocessing.read_alignments(alignments_path: str, *args, **kwargs) pandas.DataFrame[source]

Read alignments CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters
alignments_path

path to alignments CSV

Returns

Alignments dataframe

dynast.preprocessing.read_conversions(conversions_path: str, *args, **kwargs) pandas.DataFrame[source]

Read conversions CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters
conversions_path

Path to conversions CSV

Returns

Conversions dataframe

dynast.preprocessing.select_alignments(df_alignments: pandas.DataFrame) Set[Tuple[str, str]][source]

Select alignments among duplicates. This function performs preliminary deduplication and returns a set of (read_id, alignment index) tuples to use for coverage calculation and SNP detection.

Parameters
df_alignments

Alignments dataframe

Returns

Set of (read_id, alignment index) that were selected
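
For example (the CSV path is a placeholder), the alignments written by parse_all_reads can be deduplicated before coverage calculation and SNP detection:

    from dynast.preprocessing import read_alignments, select_alignments

    df_alignments = read_alignments("alignments.csv")  # written by parse_all_reads
    selected = select_alignments(df_alignments)        # set of (read_id, alignment index)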

dynast.preprocessing.sort_and_index_bam(bam_path: str, out_path: str, n_threads: int = 8, temp_dir: Optional[str] = None) str[source]

Sort and index BAM.

If the BAM is already sorted, the sorting step is skipped.

Parameters
bam_path

Path to alignment BAM file to sort

out_path

Path to output sorted BAM

n_threads

Number of threads

temp_dir

Path to temporary directory

Returns

Path to sorted and indexed BAM
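
For example (the paths are placeholders):

    from dynast.preprocessing import sort_and_index_bam

    sorted_bam = sort_and_index_bam("unsorted.bam", "sorted.bam", n_threads=8)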

dynast.preprocessing.call_consensus(bam_path: str, out_path: str, gene_infos: dict, strand: typing_extensions.Literal[forward, reverse, unstranded] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, quality: int = 27, add_RS_RI: bool = False, temp_dir: Optional[str] = None, n_threads: int = 8) str[source]

Call consensus sequences from BAM.

Parameters
bam_path

Path to BAM

out_path

Output BAM path

gene_infos

Gene information, as parsed from the GTF

strand

Protocol strandedness

umi_tag

BAM tag containing the UMI

barcode_tag

BAM tag containing the barcode

gene_tag

BAM tag containing the assigned gene

barcodes

List of barcodes to consider

quality

Quality threshold

add_RS_RI

Add RS and RI BAM tags for debugging

temp_dir

Temporary directory

n_threads

Number of threads

Returns

Path to sorted and indexed consensus BAM
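
A minimal sketch (the paths and the CB/UB tag names are assumptions; gene_infos is a placeholder for the parsed GTF information):

    from dynast.preprocessing import call_consensus

    gene_infos = {}  # placeholder: gene information parsed from the GTF

    consensus_bam = call_consensus(
        "sorted.bam",     # hypothetical input BAM
        "consensus.bam",
        gene_infos,
        strand="forward",
        umi_tag="UB",
        barcode_tag="CB",
        gene_tag="GX",
        quality=27,
        n_threads=8,
    )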

dynast.preprocessing.CONVERSION_COMPLEMENT[source]

dynast.preprocessing.complement_counts(df_counts: pandas.DataFrame, gene_infos: dict) pandas.DataFrame[source]

Complement the counts in the counts dataframe according to gene strand.

Parameters
df_counts

Counts dataframe

gene_infos

Dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf

Returns

Counts dataframe with counts complemented for reads mapping to genes on the reverse strand
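
For example (the counts path is a placeholder and gene_infos is a placeholder for the parsed GTF information):

    from dynast.preprocessing import complement_counts, read_counts

    gene_infos = {}  # placeholder: gene information parsed from the GTF

    df_counts = read_counts("counts.csv")  # hypothetical counts CSV
    # Complement base counts for reads assigned to genes on the reverse strand.
    df_complemented = complement_counts(df_counts, gene_infos)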

dynast.preprocessing.count_conversions(conversions_path: str, alignments_path: str, index_path: str, counts_path: str, gene_infos: dict, barcodes: Optional[List[str]] = None, snps: Optional[Dict[str, Dict[str, Set[int]]]] = None, quality: int = 27, conversions: Optional[FrozenSet[str]] = None, dedup_use_conversions: bool = True, n_threads: int = 8, temp_dir: Optional[str] = None) str[source]

Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode. When a duplicate UMI for a barcode is observed, the read with the greatest number of conversions is selected.

Parameters
conversions_path

Path to conversions CSV

alignments_path

Path to alignments information about reads

index_path

Path to conversions index

counts_path

Path to write counts CSV

gene_infos

Dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

barcodes

List of barcodes to be considered. All barcodes are considered if not provided

snps

Dictionary with contigs as keys and sets of genomic positions as values that indicate SNP locations

conversions

Conversions to prioritize when deduplicating; only applicable for UMI technologies

dedup_use_conversions

Prioritize reads that have at least one conversion when deduplicating

quality

Only count conversions with PHRED quality greater than this value

n_threads

Number of threads

temp_dir

Path to temporary directory

Returns

Path to counts CSV
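
A minimal sketch chaining the outputs of parse_all_reads and detect_snps into conversion counting (all paths are placeholders; gene_infos is a placeholder for the parsed GTF information):

    from dynast.preprocessing import count_conversions, read_snps

    gene_infos = {}  # placeholder: gene information parsed from the GTF
    snps = read_snps("snps.csv")  # optional: SNPs detected by detect_snps

    counts_path = count_conversions(
        "conversions.csv",   # from parse_all_reads
        "alignments.csv",    # from parse_all_reads
        "conversions.idx",   # from parse_all_reads
        "counts.csv",
        gene_infos,
        snps=snps,
        quality=27,
        conversions=frozenset({"TC"}),
        n_threads=8,
    )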

dynast.preprocessing.deduplicate_counts(df_counts: pandas.DataFrame, conversions: Optional[FrozenSet[str]] = None, use_conversions: bool = True) pandas.DataFrame[source]

Deduplicate counts based on barcode, UMI, and gene.

The order of priority is the following:

1. If use_conversions=True, reads that have at least one such conversion
2. Reads that align to the transcriptome (exon only)
3. Reads that have the highest alignment score
4. If conversions is provided, reads that have a larger sum of such conversions; otherwise, reads that have a larger sum of all conversions

Parameters
df_counts

Counts dataframe

conversions

Conversions to prioritize, defaults to None

use_conversions

Prioritize reads that have conversions first

Returns

Deduplicated counts dataframe
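
For example (the counts path is a placeholder), prioritizing T>C conversions during deduplication:

    from dynast.preprocessing import deduplicate_counts, read_counts

    df_counts = read_counts("counts.csv")  # hypothetical counts CSV
    df_deduplicated = deduplicate_counts(
        df_counts, conversions=frozenset({"TC"}), use_conversions=True
    )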

dynast.preprocessing.read_counts(counts_path: str, *args, **kwargs) pandas.DataFrame[source]

Read counts CSV as a pandas dataframe.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters
counts_path

Path to CSV

Returns

Counts dataframe

dynast.preprocessing.split_counts_by_velocity(df_counts: pandas.DataFrame) Dict[str, pandas.DataFrame][source]

Split the given counts dataframe by the velocity column.

Parameters
df_counts

Counts dataframe

Returns

Dictionary containing velocity column values as keys and the subset dataframe as values

dynast.preprocessing.subset_counts(df_counts: pandas.DataFrame, key: typing_extensions.Literal[total, transcriptome, spliced, unspliced]) pandas.DataFrame[source]

Subset the given counts DataFrame to only contain reads of the desired key.

Parameters
df_counts

Counts dataframe

key

Read types to subset

Returns

Subset dataframe
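
A sketch of splitting and subsetting a counts dataframe (the counts path is a placeholder; the example velocity values are assumptions):

    from dynast.preprocessing import read_counts, split_counts_by_velocity, subset_counts

    df_counts = read_counts("counts.csv")  # hypothetical counts CSV

    # Keyed by the values of the velocity column (e.g. 'spliced', 'unspliced').
    by_velocity = split_counts_by_velocity(df_counts)

    # Or subset directly to one read class.
    df_spliced = subset_counts(df_counts, "spliced")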

dynast.preprocessing.calculate_coverage(bam_path: str, conversions: Dict[str, Set[int]], coverage_path: str, alignments: Optional[List[Tuple[str, int]]] = None, umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, velocity: bool = True) str[source]

Calculate coverage of each genomic position per barcode.

Parameters
bam_path

Path to alignment BAM file

conversions

Dictionary with contigs as keys and sets of genomic positions as values, indicating positions where conversions were observed

coverage_path

Path to write coverage CSV

alignments

Set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.

umi_tag

BAM tag that encodes UMI, if not provided, NA is output in the umi column

barcode_tag

BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column

gene_tag

BAM tag that encodes gene assignment

barcodes

List of barcodes to be considered. All barcodes are considered if not provided

temp_dir

Path to temporary directory

velocity

Whether or not velocities were assigned

Returns

Path to coverage CSV
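
A minimal sketch (the BAM and CSV paths, the example conversion positions, and the CB/UB tag names are assumptions):

    from dynast.preprocessing import calculate_coverage, read_coverage

    # Contig -> set of genomic positions where conversions were observed,
    # for instance collected from the conversions CSV.
    conversions = {"chr1": {1000, 2500}}

    coverage_path = calculate_coverage(
        "Aligned.sortedByCoord.out.bam",  # hypothetical input BAM
        conversions,
        "coverage.csv",
        umi_tag="UB",
        barcode_tag="CB",
        gene_tag="GX",
        velocity=True,
    )
    coverage = read_coverage(coverage_path)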

dynast.preprocessing.read_coverage(coverage_path: str) Dict[str, Dict[int, int]][source]

Read coverage CSV as a dictionary.

Parameters
coverage_path

Path to coverage CSV

Returns

Coverage as a nested dictionary

dynast.preprocessing.detect_snps(conversions_path: str, index_path: str, coverage: Dict[str, Dict[int, int]], snps_path: str, alignments: Optional[List[Tuple[str, int]]] = None, conversions: Optional[FrozenSet[str]] = None, quality: int = 27, threshold: float = 0.5, min_coverage: int = 1, n_threads: int = 8) str[source]

Detect SNPs.

Parameters
conversions_path

Path to conversions CSV

index_path

Path to conversions index

coverage

Dictionary containing genomic coverage

snps_path

Path to output SNPs

alignments

Set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.

conversions

Set of conversions to consider

quality

Only count conversions with PHRED quality greater than this value

threshold

Positions with conversions / coverage > threshold will be considered as SNPs

min_coverage

Only positions covered by at least this many mapping reads are considered

n_threads

Number of threads

Returns

Path to SNPs CSV
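
A minimal sketch of SNP detection from previously computed coverage (all paths are placeholders):

    from dynast.preprocessing import detect_snps, read_coverage, read_snps

    coverage = read_coverage("coverage.csv")  # from calculate_coverage

    snps_path = detect_snps(
        "conversions.csv",   # from parse_all_reads
        "conversions.idx",   # from parse_all_reads
        coverage,
        "snps.csv",
        conversions=frozenset({"TC"}),
        quality=27,
        threshold=0.5,
        min_coverage=1,
        n_threads=8,
    )
    snps = read_snps(snps_path)  # nested dictionary of SNP positions by contig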

dynast.preprocessing.read_snp_csv(snp_csv: str) Dict[str, Dict[str, Set[int]]][source]

Read a user-provided SNPs CSV.

Parameters
snp_csv

Path to SNPs CSV

Returns

Dictionary of contigs as keys and sets of genomic positions with SNPs as values

dynast.preprocessing.read_snps(snps_path: str) Dict[str, Dict[str, Set[int]]][source]

Read SNPs CSV as a dictionary.

Parameters
snps_path

Path to SNPs CSV

Returns

Dictionary of contigs as keys and sets of genomic positions with SNPs as values