dynast.preprocessing.conversion
Module Contents
Functions
- read_counts(): Read counts CSV as a pandas dataframe.
- complement_counts(): Complement the counts in the counts dataframe according to gene strand.
- subset_counts(): Subset the given counts DataFrame to only contain reads of the desired key.
- drop_multimappers(): Drop multimappings that have the same read ID.
- deduplicate_counts(): Deduplicate counts based on barcode, UMI, and gene.
- drop_multimappers_part(): Helper function to parallelize drop_multimappers().
- deduplicate_counts_part(): Helper function to parallelize deduplicate_counts().
- split_counts_by_velocity(): Split the given counts dataframe by the velocity column.
- count_no_conversions(): Count reads that have no conversion.
- count_conversions_part(): Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to.
- count_conversions(): Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to.
- dynast.preprocessing.conversion.read_counts(counts_path: str, *args, **kwargs) pandas.DataFrame [source]
Read counts CSV as a pandas dataframe.
Any additional arguments and keyword arguments are passed to pandas.read_csv.
- Parameters
- counts_path
Path to CSV
- Returns
Counts dataframe
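The docstring implies read_counts is a thin wrapper around pandas.read_csv. A minimal sketch of that behavior (hypothetical; the real implementation may also fix column dtypes):

```python
from io import StringIO

import pandas as pd

# Hypothetical sketch: per the docstring, extra positional and keyword
# arguments are forwarded to pandas.read_csv.
def read_counts_sketch(counts_path, *args, **kwargs):
    return pd.read_csv(counts_path, *args, **kwargs)

csv = StringIO("barcode,GX,TC,AG\nAAACCTG,gene1,2,0\n")
df = read_counts_sketch(csv, dtype={"barcode": str})
```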
- dynast.preprocessing.conversion.complement_counts(df_counts: pandas.DataFrame, gene_infos: dict) pandas.DataFrame [source]
Complement the counts in the counts dataframe according to gene strand.
- Parameters
- df_counts
Counts dataframe
- gene_infos
Dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf
- Returns
counts dataframe with counts complemented for reads mapping to genes on the reverse strand
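A sketch of the complementing step, assuming hypothetical conversion column names keyed by the forward genomic strand (e.g. 'TC' = T-to-C): for a read whose gene is on the reverse strand, each conversion column is swapped with its complement, since a forward-strand T-to-C is an A-to-G on the gene:

```python
import pandas as pd

# Illustrative subset of complement pairs (assumption; the real table
# covers all conversion types).
COMPLEMENT_PAIRS = {"TC": "AG", "AG": "TC"}

def complement_counts_sketch(df_counts, gene_infos):
    # Reads whose gene lies on the reverse strand need complementing.
    reverse = df_counts["GX"].map(lambda gx: gene_infos[gx]["strand"] == "-")
    df = df_counts.copy()
    for col, comp in COMPLEMENT_PAIRS.items():
        # Read from the original frame so swapped pairs don't clobber
        # each other.
        df.loc[reverse, col] = df_counts.loc[reverse, comp]
    return df

df = pd.DataFrame({"GX": ["g1", "g2"], "TC": [3, 1], "AG": [0, 2]})
gene_infos = {"g1": {"strand": "+"}, "g2": {"strand": "-"}}
out = complement_counts_sketch(df, gene_infos)
```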
- dynast.preprocessing.conversion.subset_counts(df_counts: pandas.DataFrame, key: typing_extensions.Literal['total', 'transcriptome', 'spliced', 'unspliced']) pandas.DataFrame [source]
Subset the given counts DataFrame to only contain reads of the desired key.
- Parameters
- df_counts
Counts dataframe
- key
Read types to subset
- Returns
Subset dataframe
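How each key could select rows, as a sketch assuming a hypothetical boolean 'transcriptome' column and a 'velocity' column holding values such as 'spliced'/'unspliced' (see split_counts_by_velocity):

```python
import pandas as pd

# Hypothetical sketch of the four keys: 'total' keeps everything,
# 'transcriptome' keeps exonic alignments, and the remaining keys
# filter on the velocity assignment.
def subset_counts_sketch(df_counts, key):
    if key == "total":
        return df_counts
    if key == "transcriptome":
        return df_counts[df_counts["transcriptome"]]
    return df_counts[df_counts["velocity"] == key]

df = pd.DataFrame({
    "transcriptome": [True, False, True],
    "velocity": ["spliced", "unspliced", "spliced"],
})
```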
- dynast.preprocessing.conversion.drop_multimappers(df_counts: pandas.DataFrame, conversions: Optional[FrozenSet[str]] = None) pandas.DataFrame [source]
Drop multimappings that have the same read ID where
- some map to the transcriptome while some do not: drop non-transcriptome alignments
- none map to the transcriptome AND aligned to multiple genes: drop all
- none map to the transcriptome AND assigned multiple velocity types: set to ambiguous
TODO: This function can probably be removed because BAM parsing only considers primary alignments now.
- Parameters
- df_counts
Counts dataframe
- conversions
Conversions to prioritize
- Returns
Counts dataframe with multimappers appropriately filtered
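The first rule above can be sketched as a grouped filter (hypothetical column names; the other two rules follow the same groupby pattern):

```python
import pandas as pd

# Sketch of rule one: within a read ID, if any alignment maps to the
# transcriptome, keep only the transcriptome alignments.
def drop_non_transcriptome_sketch(df_counts):
    any_tx = df_counts.groupby("read_id")["transcriptome"].transform("any")
    # Keep rows whose group has no transcriptome alignment at all, or
    # that are themselves transcriptome alignments.
    return df_counts[~any_tx | df_counts["transcriptome"]]

df = pd.DataFrame({
    "read_id": ["r1", "r1", "r2"],
    "transcriptome": [True, False, False],
})
out = drop_non_transcriptome_sketch(df)
```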
- dynast.preprocessing.conversion.deduplicate_counts(df_counts: pandas.DataFrame, conversions: Optional[FrozenSet[str]] = None, use_conversions: bool = True) pandas.DataFrame [source]
Deduplicate counts based on barcode, UMI, and gene.
The order of priority is the following.
1. If use_conversions=True, reads that have at least one such conversion
2. Reads that align to the transcriptome (exon only)
3. Reads that have the highest alignment score
4. If conversions is provided, reads that have a larger sum of such conversions; if conversions is not provided, reads that have a larger sum of all conversions
- Parameters
- df_counts
Counts dataframe
- conversions
Conversions to prioritize, defaults to None
- use_conversions
Prioritize reads that have conversions first
- Returns
Deduplicated counts dataframe
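The four-level priority amounts to a lexicographic sort followed by keeping the first row per (barcode, UMI, gene) group. A sketch under assumed column names ('barcode', 'umi', 'GX', 'transcriptome', 'score'):

```python
import pandas as pd

# Hypothetical sketch: sort by conversion presence, transcriptome
# alignment, alignment score, then prioritized-conversion sum, all
# descending, so the highest-priority duplicate survives.
def deduplicate_counts_sketch(df_counts, conversions=("TC",), use_conversions=True):
    df = df_counts.copy()
    df["_conv_sum"] = df[list(conversions)].sum(axis=1)
    df["_has_conv"] = df["_conv_sum"] > 0 if use_conversions else False
    df = df.sort_values(
        ["_has_conv", "transcriptome", "score", "_conv_sum"], ascending=False
    )
    return df.drop_duplicates(["barcode", "umi", "GX"], keep="first").drop(
        columns=["_conv_sum", "_has_conv"]
    )

df = pd.DataFrame({
    "barcode": ["b1", "b1"],
    "umi": ["u1", "u1"],
    "GX": ["g1", "g1"],
    "transcriptome": [True, True],
    "score": [60, 60],
    "TC": [0, 2],
})
out = deduplicate_counts_sketch(df)
```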
- dynast.preprocessing.conversion.drop_multimappers_part(counter: multiprocessing.Value, lock: multiprocessing.Lock, split_path: str, out_path: str, conversions: Optional[FrozenSet[str]] = None) str [source]
Helper function to parallelize drop_multimappers().
- dynast.preprocessing.conversion.deduplicate_counts_part(counter: multiprocessing.Value, lock: multiprocessing.Lock, split_path: str, out_path: str, conversions: Optional[FrozenSet[str]], use_conversions: bool = True)[source]
Helper function to parallelize deduplicate_counts().
- dynast.preprocessing.conversion.split_counts_by_velocity(df_counts: pandas.DataFrame) Dict[str, pandas.DataFrame] [source]
Split the given counts dataframe by the velocity column.
- Parameters
- df_counts
Counts dataframe
- Returns
Dictionary containing velocity column values as keys and the subset dataframe as values
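The split is essentially a groupby on the velocity column. A minimal sketch (the real function may handle additional bookkeeping):

```python
import pandas as pd

# Sketch: one sub-dataframe per distinct velocity value, keyed by that
# value.
def split_counts_by_velocity_sketch(df_counts):
    return {
        velocity: df.reset_index(drop=True)
        for velocity, df in df_counts.groupby("velocity")
    }

df = pd.DataFrame({"velocity": ["spliced", "unspliced", "spliced"], "TC": [1, 0, 2]})
dfs = split_counts_by_velocity_sketch(df)
```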
- dynast.preprocessing.conversion.count_no_conversions(alignments_path: str, counter: multiprocessing.Value, lock: multiprocessing.Lock, index: List[Tuple[int, int, int]], barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, update_every: int = 10000) str [source]
Count reads that have no conversion.
- Parameters
- alignments_path
Alignments CSV path
- counter
Counter that keeps track of how many reads have been processed
- lock
Semaphore for the counter so that multiple processes do not modify it at the same time
- index
Index for conversions CSV
- barcodes
List of barcodes to be considered. All barcodes are considered if not provided
- temp_dir
Path to temporary directory
- update_every
Update the counter every this many reads
- Returns
Path to temporary counts CSV
- dynast.preprocessing.conversion.count_conversions_part(conversions_path: str, alignments_path: str, counter: multiprocessing.Value, lock: multiprocessing.Lock, index: List[Tuple[int, int, int]], barcodes: Optional[List[str]] = None, snps: Optional[Dict[str, Dict[str, Set[int]]]] = None, quality: int = 27, temp_dir: Optional[str] = None, update_every: int = 10000) str [source]
Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode and gene. This function is used exclusively for multiprocessing.
- Parameters
- conversions_path
Path to conversions CSV
- alignments_path
Path to alignments information about reads
- counter
Counter that keeps track of how many reads have been processed
- lock
Semaphore for the counter so that multiple processes do not modify it at the same time
- index
Index for conversions CSV
- barcodes
List of barcodes to be considered. All barcodes are considered if not provided
- snps
Dictionary with contigs as keys and genomic positions as values, indicating SNP locations
- quality
Only count conversions with PHRED quality greater than this value
- temp_dir
Path to temporary directory, defaults to None
- update_every
Update the counter every this many reads
- Returns
Path to temporary counts CSV
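The quality cutoff compares against PHRED scores. A minimal sketch of the filter, assuming qualities arrive as SAM/FASTQ-style PHRED+33 ASCII characters (an assumption; the actual decoding lives in the BAM parsing step):

```python
# Hypothetical helper: decode a PHRED+33 quality character and apply
# the strict 'greater than' cutoff described above.
def passes_quality(qual_char: str, quality: int = 27) -> bool:
    return ord(qual_char) - 33 > quality

# ord('I') - 33 == 40 (passes); ord(':') - 33 == 25 (fails)
```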
- dynast.preprocessing.conversion.count_conversions(conversions_path: str, alignments_path: str, index_path: str, counts_path: str, gene_infos: dict, barcodes: Optional[List[str]] = None, snps: Optional[Dict[str, Dict[str, Set[int]]]] = None, quality: int = 27, conversions: Optional[FrozenSet[str]] = None, dedup_use_conversions: bool = True, n_threads: int = 8, temp_dir: Optional[str] = None) str [source]
Count the number of conversions of each read per barcode and gene, along with the total nucleotide content of the region each read mapped to, also per barcode. When a duplicate UMI for a barcode is observed, the read with the greatest number of conversions is selected.
- Parameters
- conversions_path
Path to conversions CSV
- alignments_path
Path to alignments information about reads
- index_path
Path to conversions index
- counts_path
Path to write counts CSV
- gene_infos
Dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
- barcodes
List of barcodes to be considered. All barcodes are considered if not provided
- snps
Dictionary with contigs as keys and genomic positions as values, indicating SNP locations
- conversions
Conversions to prioritize when deduplicating; only applicable to UMI technologies
- dedup_use_conversions
Prioritize reads that have at least one conversion when deduplicating
- quality
Only count conversions with PHRED quality greater than this value
- n_threads
Number of threads
- temp_dir
Path to temporary directory
- Returns
Path to counts CSV