dynast.preprocessing
Submodules
Package Contents
Functions
aggregate_counts | Aggregate conversion counts for each pair of bases.
calculate_mutation_rates | Calculate mutation rate for each pair of bases.
merge_aggregates | Merge multiple aggregate dataframes into one.
read_aggregates | Read aggregates CSV as a pandas dataframe.
read_rates | Read mutation rates CSV as a pandas dataframe.
check_bam_contains_duplicate | Check whether BAM contains duplicates.
check_bam_contains_secondary | Check whether BAM contains secondary alignments.
check_bam_contains_unmapped | Check whether BAM contains unmapped reads.
get_tags_from_bam | Utility function to retrieve all read tags present in a BAM.
parse_all_reads | Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.
read_alignments | Read alignments CSV as a pandas DataFrame.
read_conversions | Read conversions CSV as a pandas DataFrame.
select_alignments | Select alignments among duplicates.
sort_and_index_bam | Sort and index BAM.
call_consensus | Call consensus sequences from BAM.
complement_counts | Complement the counts in the counts dataframe according to gene strand.
count_conversions | Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode.
deduplicate_counts | Deduplicate counts based on barcode, UMI, and gene.
read_counts | Read counts CSV as a pandas dataframe.
split_counts_by_velocity | Split the given counts dataframe by the velocity column.
subset_counts | Subset the given counts DataFrame to only contain reads of the desired key.
calculate_coverage | Calculate coverage of each genomic position per barcode.
read_coverage | Read coverage CSV as a dictionary.
detect_snps | Detect SNPs.
| Read a user-provided SNPs CSV.
| Read SNPs CSV as a dictionary.
Attributes
- dynast.preprocessing.aggregate_counts(df_counts: pandas.DataFrame, aggregates_path: str, conversions: FrozenSet[str] = frozenset({'TC'})) -> str
Aggregate conversion counts for each pair of bases.
- Parameters
- df_counts
Counts dataframe, with complemented reverse strand bases
- aggregates_path
Path to write aggregate CSV
- conversions
Conversion(s) in question
- Returns
Path to aggregate CSV that was written
- dynast.preprocessing.calculate_mutation_rates(df_counts: pandas.DataFrame, rates_path: str, group_by: Optional[List[str]] = None) -> str
Calculate mutation rate for each pair of bases.
- Parameters
- df_counts
Counts dataframe, with complemented reverse strand bases
- rates_path
Path to write rates CSV
- group_by
Column(s) to group calculations by, defaults to None, which combines all rows
- Returns
Path to rates CSV
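To make "mutation rate for each pair of bases" concrete, here is a minimal sketch that divides the number of observed X-to-Y conversions by the total number of X bases. The row layout (keys such as `TC` for a conversion count and `T` for base content) and the helper name `mutation_rates` are assumptions made for this illustration; the library's actual computation and column names may differ.

```python
from collections import defaultdict

BASES = "ACGT"
# All ordered pairs of distinct bases, e.g. "TC" means a T observed as a C.
CONVERSIONS = [a + b for a in BASES for b in BASES if a != b]


def mutation_rates(rows):
    """Return {conversion: rate} combined over all rows (hypothetical helper)."""
    totals = defaultdict(int)
    for row in rows:
        for key in list(BASES) + CONVERSIONS:
            totals[key] += row.get(key, 0)
    rates = {}
    for conv in CONVERSIONS:
        base_total = totals[conv[0]]
        # Avoid division by zero when a base was never observed.
        rates[conv] = totals[conv] / base_total if base_total else 0.0
    return rates


# Two toy reads: base content under "T"/"C", conversion counts under "TC"/"CT".
rows = [
    {"T": 10, "C": 5, "TC": 2},
    {"T": 30, "C": 15, "TC": 2, "CT": 1},
]
rates = mutation_rates(rows)
```

This corresponds to the default `group_by=None` case, where all rows are combined; grouping would apply the same division within each group.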
- dynast.preprocessing.merge_aggregates(*dfs: pandas.DataFrame) -> pandas.DataFrame
Merge multiple aggregate dataframes into one.
- Parameters
- dfs
Dataframes to merge
- Returns
Merged dataframe
- dynast.preprocessing.read_aggregates(aggregates_path: str) -> pandas.DataFrame
Read aggregates CSV as a pandas dataframe.
- Parameters
- aggregates_path
Path to aggregates CSV
- Returns
Aggregates dataframe
- dynast.preprocessing.read_rates(rates_path: str) -> pandas.DataFrame
Read mutation rates CSV as a pandas dataframe.
- Parameters
- rates_path
Path to rates CSV
- Returns
Rates dataframe
- dynast.preprocessing.check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8) -> bool
Check whether BAM contains duplicates.
- Parameters
- bam_path
Path to BAM
- n_reads
Number of reads to consider
- n_threads
Number of threads
- Returns
Whether duplicates were detected
- dynast.preprocessing.check_bam_contains_secondary(bam_path: str, n_reads: int = 100000, n_threads: int = 8) -> bool
Check whether BAM contains secondary alignments.
- Parameters
- bam_path
Path to BAM
- n_reads
Number of reads to consider
- n_threads
Number of threads
- Returns
Whether secondary alignments were detected
- dynast.preprocessing.check_bam_contains_unmapped(bam_path: str) -> bool
Check whether BAM contains unmapped reads.
- Parameters
- bam_path
Path to BAM
- Returns
Whether unmapped reads were detected
- dynast.preprocessing.get_tags_from_bam(bam_path: str, n_reads: int = 100000, n_threads: int = 8) -> Set[str]
Utility function to retrieve all read tags present in a BAM.
- Parameters
- bam_path
Path to BAM
- n_reads
Number of reads to consider
- n_threads
Number of threads
- Returns
Set of all tags found
- dynast.preprocessing.parse_all_reads(bam_path: str, conversions_path: str, alignments_path: str, index_path: str, gene_infos: dict, transcript_infos: dict, strand: typing_extensions.Literal['forward', 'reverse', 'unstranded'] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, n_threads: int = 8, temp_dir: Optional[str] = None, nasc: bool = False, control: bool = False, velocity: bool = True, strict_exon_overlap: bool = False, return_splits: bool = False) -> Union[Tuple[str, str, str], Tuple[str, str, str, List[Tuple[str, int]]]]
Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.
- Parameters
- bam_path
Path to alignment BAM file
- conversions_path
Path to output information about reads that have conversions
- alignments_path
Path to alignments information about reads
- index_path
Path to conversions index
- gene_infos
Dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
- transcript_infos
Dictionary containing transcript information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
- strand
Strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded
- umi_tag
BAM tag that encodes UMI, if not provided, NA is output in the umi column
- barcode_tag
BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column
- gene_tag
BAM tag that encodes gene assignment, defaults to GX
- barcodes
List of barcodes to be considered. All barcodes are considered if not provided
- n_threads
Number of threads
- temp_dir
Path to temporary directory
- nasc
Flag to change behavior to match NASC-seq pipeline
- velocity
Whether or not to assign a velocity type to each read
- strict_exon_overlap
Whether to use a stricter algorithm to assign reads as spliced
- return_splits
Whether to return BAM splits for later reuse
- Returns
(path to conversions, path to alignments, path to conversions index). If return_splits is True, there is an additional return value: a list of tuples containing split BAM paths and the number of reads in each BAM.
- dynast.preprocessing.read_alignments(alignments_path: str, *args, **kwargs) -> pandas.DataFrame
Read alignments CSV as a pandas DataFrame.
Any additional arguments and keyword arguments are passed to pandas.read_csv.
- Parameters
- alignments_path
Path to alignments CSV
- Returns
Alignments dataframe
- dynast.preprocessing.read_conversions(conversions_path: str, *args, **kwargs) -> pandas.DataFrame
Read conversions CSV as a pandas DataFrame.
Any additional arguments and keyword arguments are passed to pandas.read_csv.
- Parameters
- conversions_path
Path to conversions CSV
- Returns
Conversions dataframe
- dynast.preprocessing.select_alignments(df_alignments: pandas.DataFrame) -> Set[Tuple[str, str]]
Select alignments among duplicates. This function performs preliminary deduplication and returns a set of tuples (read_id, alignment index) to use for coverage calculation and SNP detection.
- Parameters
- df_alignments
Alignments dataframe
- Returns
Set of (read_id, alignment index) that were selected
- dynast.preprocessing.sort_and_index_bam(bam_path: str, out_path: str, n_threads: int = 8, temp_dir: Optional[str] = None) -> str
Sort and index BAM.
If the BAM is already sorted, the sorting step is skipped.
- Parameters
- bam_path
Path to alignment BAM file to sort
- out_path
Path to output sorted BAM
- n_threads
Number of threads
- temp_dir
Path to temporary directory
- Returns
Path to sorted and indexed BAM
- dynast.preprocessing.call_consensus(bam_path: str, out_path: str, gene_infos: dict, strand: typing_extensions.Literal['forward', 'reverse', 'unstranded'] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, quality: int = 27, add_RS_RI: bool = False, temp_dir: Optional[str] = None, n_threads: int = 8) -> str
Call consensus sequences from BAM.
- Parameters
- bam_path
Path to BAM
- out_path
Output BAM path
- gene_infos
Gene information, as parsed from the GTF
- strand
Protocol strandedness
- umi_tag
BAM tag containing the UMI
- barcode_tag
BAM tag containing the barcode
- gene_tag
BAM tag containing the assigned gene
- barcodes
List of barcodes to consider
- quality
Quality threshold
- add_RS_RI
Add RS and RI BAM tags for debugging
- temp_dir
Temporary directory
- n_threads
Number of threads
- Returns
Path to sorted and indexed consensus BAM
- dynast.preprocessing.complement_counts(df_counts: pandas.DataFrame, gene_infos: dict) -> pandas.DataFrame
Complement the counts in the counts dataframe according to gene strand.
- Parameters
- df_counts
Counts dataframe
- gene_infos
Dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf
- Returns
Counts dataframe with counts complemented for reads mapping to genes on the reverse strand
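The idea of complementing counts by gene strand can be sketched as follows: for a read assigned to a reverse-strand gene, every base in a column name is swapped for its complement, so an observed `AG` conversion is recorded as `TC`. The helper names, the `'+'`/`'-'` strand encoding, and the dictionary-per-read layout are assumptions for this example, not the library's implementation.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_key(key):
    """Complement every base in a column name such as 'T' or 'TC'."""
    return "".join(COMPLEMENT[base] for base in key)


def complement_row(row, strand):
    """Return a copy of `row` (base and conversion counts only) with keys
    complemented when the gene lies on the reverse ('-') strand."""
    if strand != "-":
        return dict(row)
    return {complement_key(key): value for key, value in row.items()}


# A forward-strand read is left unchanged.
forward = complement_row({"T": 10, "TC": 2, "A": 7}, "+")
# A reverse-strand read with A>G conversions is recorded as T>C.
reverse = complement_row({"A": 10, "AG": 2, "T": 7}, "-")
```

After this step, the same conversion key (for example `TC`) refers to the same event on the transcribed strand regardless of which genomic strand the gene lies on.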
- dynast.preprocessing.count_conversions(conversions_path: str, alignments_path: str, index_path: str, counts_path: str, gene_infos: dict, barcodes: Optional[List[str]] = None, snps: Optional[Dict[str, Dict[str, Set[int]]]] = None, quality: int = 27, conversions: Optional[FrozenSet[str]] = None, dedup_use_conversions: bool = True, n_threads: int = 8, temp_dir: Optional[str] = None) -> str
Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode. When a duplicate UMI for a barcode is observed, the read with the greatest number of conversions is selected.
- Parameters
- conversions_path
Path to conversions CSV
- alignments_path
Path to alignments information about reads
- index_path
Path to conversions index
- counts_path
Path to write counts CSV
- gene_infos
Dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
- barcodes
List of barcodes to be considered. All barcodes are considered if not provided
- snps
Dictionary with contigs as keys and lists of genomic positions as values, indicating SNP locations
- conversions
Conversions to prioritize when deduplicating; only applicable to UMI technologies
- dedup_use_conversions
Prioritize reads that have at least one conversion when deduplicating
- quality
Only count conversions with PHRED quality greater than this value
- n_threads
Number of threads
- temp_dir
Path to temporary directory
- Returns
Path to counts CSV
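The `quality` parameter can be illustrated with a short sketch: a conversion contributes to a read's count only if its PHRED quality is strictly greater than the threshold. The `(original, converted, phred)` tuple layout and the helper name are assumptions for this example; SNP filtering, barcode filtering and deduplication are left out.

```python
from collections import Counter


def count_conversions_for_read(observed, quality=27):
    """Count conversions per base pair, keeping only those whose PHRED
    quality is strictly greater than `quality` (hypothetical helper)."""
    counts = Counter()
    for original, converted, phred in observed:
        if phred > quality:
            counts[original + converted] += 1
    return counts


# (original base, converted base, PHRED quality) for one toy read.
observed = [("T", "C", 35), ("T", "C", 27), ("A", "G", 40), ("T", "C", 30)]
counts = count_conversions_for_read(observed)
```

Note that the conversion with quality exactly 27 is dropped, because the description says "greater than" rather than "at least".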
- dynast.preprocessing.deduplicate_counts(df_counts: pandas.DataFrame, conversions: Optional[FrozenSet[str]] = None, use_conversions: bool = True) -> pandas.DataFrame
Deduplicate counts based on barcode, UMI, and gene.
The order of priority is the following.
1. If use_conversions=True, reads that have at least one such conversion
2. Reads that align to the transcriptome (exon only)
3. Reads that have highest alignment score
4. If conversions is provided, reads that have a larger sum of such conversions. If conversions is not provided, reads that have larger sum of all conversions
- Parameters
- df_counts
Counts dataframe
- conversions
Conversions to prioritize, defaults to None
- use_conversions
Prioritize reads that have conversions first
- Returns
Deduplicated counts dataframe
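The priority order above maps naturally onto a tuple sort key, where earlier elements take precedence over later ones. The sketch below assumes each read is a plain dictionary with hypothetical fields (`barcode`, `umi`, `gene`, `transcriptome`, `score`, `conversions`); it illustrates the ordering only and is not the library's dataframe-based implementation.

```python
def dedup_key(read, conversions=None, use_conversions=True):
    """Build a priority tuple; larger tuples win (hypothetical helper)."""
    observed = read["conversions"]  # e.g. {"TC": 2, "AG": 1}
    if conversions is None:
        selected = sum(observed.values())
    else:
        selected = sum(observed.get(c, 0) for c in conversions)
    return (
        selected > 0 if use_conversions else False,  # 1. has a conversion
        read["transcriptome"],                       # 2. exon-only alignment
        read["score"],                               # 3. alignment score
        selected,                                    # 4. sum of conversions
    )


def deduplicate(reads, conversions=None, use_conversions=True):
    """Keep the highest-priority read per (barcode, UMI, gene) group."""
    best = {}
    for read in reads:
        group = (read["barcode"], read["umi"], read["gene"])
        key = dedup_key(read, conversions, use_conversions)
        if group not in best or key > best[group][0]:
            best[group] = (key, read)
    return [read for _, read in best.values()]


reads = [
    {"id": "r1", "barcode": "AAAC", "umi": "GGT", "gene": "G1",
     "transcriptome": True, "score": 50, "conversions": {}},
    {"id": "r2", "barcode": "AAAC", "umi": "GGT", "gene": "G1",
     "transcriptome": False, "score": 40, "conversions": {"TC": 1}},
    {"id": "r3", "barcode": "AAAC", "umi": "CCA", "gene": "G1",
     "transcriptome": True, "score": 30, "conversions": {"AG": 2}},
]
with_conversions = [r["id"] for r in deduplicate(reads, frozenset({"TC"}))]
without_conversions = [
    r["id"] for r in deduplicate(reads, frozenset({"TC"}), use_conversions=False)
]
```

With `use_conversions=True`, the converted read `r2` beats `r1` even though `r1` has the better alignment; with `use_conversions=False`, the first criterion is neutral and `r1` is kept instead.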
- dynast.preprocessing.read_counts(counts_path: str, *args, **kwargs) -> pandas.DataFrame
Read counts CSV as a pandas dataframe.
Any additional arguments and keyword arguments are passed to pandas.read_csv.
- Parameters
- counts_path
Path to CSV
- Returns
Counts dataframe
- dynast.preprocessing.split_counts_by_velocity(df_counts: pandas.DataFrame) -> Dict[str, pandas.DataFrame]
Split the given counts dataframe by the velocity column.
- Parameters
- df_counts
Counts dataframe
- Returns
Dictionary containing velocity column values as keys and the subset dataframe as values
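The shape of the returned dictionary can be sketched with plain Python: rows are grouped by the value of their velocity column, and each distinct value becomes a key. Rows are modeled here as dictionaries rather than a dataframe, and the velocity values `spliced` and `unspliced` are assumed for illustration.

```python
from collections import defaultdict


def split_by_velocity(rows):
    """Group rows by their 'velocity' value (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["velocity"]].append(row)
    return dict(groups)


rows = [
    {"read_id": "r1", "velocity": "spliced"},
    {"read_id": "r2", "velocity": "unspliced"},
    {"read_id": "r3", "velocity": "spliced"},
]
splits = split_by_velocity(rows)
```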
- dynast.preprocessing.subset_counts(df_counts: pandas.DataFrame, key: typing_extensions.Literal['total', 'transcriptome', 'spliced', 'unspliced']) -> pandas.DataFrame
Subset the given counts DataFrame to only contain reads of the desired key.
- Parameters
- df_counts
Counts dataframe
- key
Read types to subset
- Returns
Subset dataframe
- dynast.preprocessing.calculate_coverage(bam_path: str, conversions: Dict[str, Set[int]], coverage_path: str, alignments: Optional[List[Tuple[str, int]]] = None, umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, velocity: bool = True) -> str
Calculate coverage of each genomic position per barcode.
- Parameters
- bam_path
Path to alignment BAM file
- conversions
Dictionary of contigs as keys and sets of genomic positions as values that indicate positions where conversions were observed
- coverage_path
Path to write coverage CSV
- alignments
Set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.
- umi_tag
BAM tag that encodes UMI, if not provided, NA is output in the umi column
- barcode_tag
BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column
- gene_tag
BAM tag that encodes gene assignment
- barcodes
List of barcodes to be considered. All barcodes are considered if not provided
- temp_dir
Path to temporary directory
- velocity
Whether or not velocities were assigned
- Returns
Path to coverage CSV
- dynast.preprocessing.read_coverage(coverage_path: str) -> Dict[str, Dict[int, int]]
Read coverage CSV as a dictionary.
- Parameters
- coverage_path
Path to coverage CSV
- Returns
Coverage as a nested dictionary
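The return type `Dict[str, Dict[int, int]]` suggests a mapping from contig to genomic position to coverage. The sketch below builds such a structure from CSV text; the column names `contig`, `genome_i` and `coverage` are assumptions for this example, and the real file layout may differ.

```python
import csv
import io


def read_coverage_sketch(handle):
    """Parse CSV rows into {contig: {position: coverage}} (hypothetical helper)."""
    coverage = {}
    for row in csv.DictReader(handle):
        per_contig = coverage.setdefault(row["contig"], {})
        per_contig[int(row["genome_i"])] = int(row["coverage"])
    return coverage


# In-memory stand-in for a coverage CSV file.
example = io.StringIO(
    "contig,genome_i,coverage\n"
    "chr1,100,5\n"
    "chr1,101,7\n"
    "chr2,42,1\n"
)
coverage = read_coverage_sketch(example)
```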
- dynast.preprocessing.detect_snps(conversions_path: str, index_path: str, coverage: Dict[str, Dict[int, int]], snps_path: str, alignments: Optional[List[Tuple[str, int]]] = None, conversions: Optional[FrozenSet[str]] = None, quality: int = 27, threshold: float = 0.5, min_coverage: int = 1, n_threads: int = 8) -> str
Detect SNPs.
- Parameters
- conversions_path
Path to conversions CSV
- index_path
Path to conversions index
- coverage
Dictionary containing genomic coverage
- snps_path
Path to output SNPs
- alignments
Set of (read_id, alignment_index) tuples to process. All alignments are processed if this option is not provided.
- conversions
Set of conversions to consider
- quality
Only count conversions with PHRED quality greater than this value
- threshold
Positions with conversions / coverage > threshold will be considered as SNPs
- min_coverage
Only positions with at least this many mapping reads are considered
- n_threads
Number of threads
- Returns
Path to SNPs CSV
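The `threshold` and `min_coverage` parameters combine into a simple per-position rule: a position is called a SNP when it has at least `min_coverage` reads and the fraction conversions / coverage exceeds `threshold`. The sketch below applies that rule to nested dictionaries; the input layout and the helper name are assumptions, and quality filtering and alignment selection are omitted.

```python
def detect_snps_sketch(conversion_counts, coverage, threshold=0.5, min_coverage=1):
    """Return {contig: set of positions} passing the SNP rule (hypothetical helper)."""
    snps = {}
    for contig, positions in conversion_counts.items():
        for position, n_conversions in positions.items():
            depth = coverage.get(contig, {}).get(position, 0)
            # Skip uncovered positions and those below the coverage cutoff.
            if depth == 0 or depth < min_coverage:
                continue
            if n_conversions / depth > threshold:
                snps.setdefault(contig, set()).add(position)
    return snps


# {contig: {position: number of reads showing a conversion}}
conversion_counts = {"chr1": {100: 4, 101: 1}, "chr2": {42: 1}}
# {contig: {position: number of reads covering the position}}
coverage = {"chr1": {100: 5, 101: 7}, "chr2": {42: 1}}

strict = detect_snps_sketch(conversion_counts, coverage, min_coverage=2)
default = detect_snps_sketch(conversion_counts, coverage)
```

Raising `min_coverage` removes the call at `chr2:42`, where a single converted read would otherwise give a conversion fraction of 1.0.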