dynast.preprocessing.bam
Module Contents
Functions
- read_alignments: Read alignments CSV as a pandas DataFrame.
- read_conversions: Read conversions CSV as a pandas DataFrame.
- select_alignments: Select alignments among duplicates.
- parse_read_contig: Parse all reads mapped to a contig, outputting conversion information as temporary CSVs.
- get_tags_from_bam: Utility function to retrieve all read tags present in a BAM.
- check_bam_tags_exist: Utility function to check if BAM tags exist in a BAM within the first n_reads reads.
- check_bam_is_paired: Utility function to check if BAM has paired reads.
- check_bam_contains_secondary: Check whether BAM contains secondary alignments.
- check_bam_contains_unmapped: Check whether BAM contains unmapped reads.
- check_bam_contains_duplicate: Check whether BAM contains duplicates.
- sort_and_index_bam: Sort and index BAM.
- split_bam: Split BAM into n parts.
- parse_all_reads: Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.
Attributes
- dynast.preprocessing.bam.CONVERSION_CSV_COLUMNS = ['read_id', 'index', 'contig', 'genome_i', 'conversion', 'quality']
- dynast.preprocessing.bam.ALIGNMENT_COLUMNS = ['read_id', 'index', 'barcode', 'umi', 'GX', 'A', 'C', 'G', 'T', 'velocity', 'transcriptome', 'score']
- dynast.preprocessing.bam.read_alignments(alignments_path: str, *args, **kwargs) -> pandas.DataFrame
Read alignments CSV as a pandas DataFrame.
Any additional arguments and keyword arguments are passed to pandas.read_csv.
- Parameters
- alignments_path
Path to alignments CSV
- Returns
Alignments dataframe
- dynast.preprocessing.bam.read_conversions(conversions_path: str, *args, **kwargs) -> pandas.DataFrame
Read conversions CSV as a pandas DataFrame.
Any additional arguments and keyword arguments are passed to pandas.read_csv.
- Parameters
- conversions_path
Path to conversions CSV
- Returns
Conversions dataframe
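The file layout these readers consume can be sketched with the standard library alone. The example below assumes the conversions CSV carries a header row matching CONVERSION_CSV_COLUMNS; the rows, the `read_rows` helper, and the `TC` conversion encoding are made up for illustration and are not part of dynast.

```python
import csv
import io

# Column order copied from CONVERSION_CSV_COLUMNS above.
CONVERSION_CSV_COLUMNS = ['read_id', 'index', 'contig', 'genome_i', 'conversion', 'quality']

# Made-up rows for illustration only; a header row is assumed.
sample_csv = io.StringIO(
    "read_id,index,contig,genome_i,conversion,quality\n"
    "read1,0,chr1,1045,TC,37\n"
    "read1,0,chr1,1060,TC,30\n"
    "read2,0,chr2,88,AG,12\n"
)


def read_rows(handle):
    """Read rows as dicts and verify the header matches the expected columns."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != CONVERSION_CSV_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


rows = read_rows(sample_csv)
# Genomic positions of the (hypothetical) T-to-C conversions.
tc_positions = [int(r['genome_i']) for r in rows if r['conversion'] == 'TC']
```

In practice the pandas-based readers documented here would be the natural choice, since they forward extra arguments to pandas.read_csv.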
- dynast.preprocessing.bam.select_alignments(df_alignments: pandas.DataFrame) -> Set[Tuple[str, str]]
Select alignments among duplicates. This function performs preliminary deduplication and returns a set of tuples (read_id, alignment index) to use for coverage calculation and SNP detection.
- Parameters
- df_alignments
Alignments dataframe
- Returns
Set of (read_id, alignment index) that were selected
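The shape of this deduplication can be sketched as follows. The selection rule used here (keep the highest-scoring alignment per barcode, UMI and gene) is a hypothetical stand-in, not necessarily the rule dynast applies, and `select_best_alignments` and the sample records are illustrative names and data only.

```python
def select_best_alignments(records):
    """Keep one (read_id, index) per (barcode, umi, GX) group.

    Hypothetical rule: the highest 'score' wins; ties go to the first record seen.
    """
    best = {}
    for rec in records:
        key = (rec['barcode'], rec['umi'], rec['GX'])
        if key not in best or rec['score'] > best[key]['score']:
            best[key] = rec
    return {(rec['read_id'], rec['index']) for rec in best.values()}


# Made-up alignment records using a subset of ALIGNMENT_COLUMNS.
records = [
    {'read_id': 'r1', 'index': '0', 'barcode': 'AAAC', 'umi': 'GGT', 'GX': 'gene1', 'score': 10},
    {'read_id': 'r2', 'index': '0', 'barcode': 'AAAC', 'umi': 'GGT', 'GX': 'gene1', 'score': 25},
    {'read_id': 'r3', 'index': '0', 'barcode': 'TTTG', 'umi': 'CCA', 'GX': 'gene2', 'score': 5},
]
selected = select_best_alignments(records)
```

Returning a set makes membership tests against (read_id, alignment index) pairs cheap in later steps.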
- dynast.preprocessing.bam.parse_read_contig(counter: multiprocessing.Value, lock: multiprocessing.Lock, bam_path: str, contig: str, gene_infos: Optional[dict] = None, transcript_infos: Optional[dict] = None, strand: typing_extensions.Literal['forward', 'reverse', 'unstranded'] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, update_every: int = 2000, nasc: bool = False, velocity: bool = True, strict_exon_overlap: bool = False) -> Tuple[str, str, str]
Parse all reads mapped to a contig, outputting conversion information as temporary CSVs. This function is designed to be called as a separate process.
- Parameters
- counter
Counter that keeps track of how many reads have been processed
- lock
Semaphore for the counter so that multiple processes do not modify it at the same time
- bam_path
Path to alignment BAM file
- contig
Only reads that map to this contig will be processed
- gene_infos
Dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf, required if velocity=True
- transcript_infos
Dictionary containing transcript information, as returned by preprocessing.gtf.parse_gtf, required if velocity=True
- strand
Strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded
- umi_tag
BAM tag that encodes UMI, if not provided, NA is output in the umi column
- barcode_tag
BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column
- gene_tag
BAM tag that encodes gene assignment
- barcodes
List of barcodes to be considered. All barcodes are considered if not provided
- temp_dir
Path to temporary directory
- update_every
Update the counter every this many reads, defaults to 2000
- nasc
Flag to change behavior to match NASC-seq pipeline, defaults to False
- velocity
Whether or not to assign a velocity type to each read
- strict_exon_overlap
Whether to use a stricter algorithm to assign reads as spliced
- Returns
(path to conversions, path to conversions index, path to alignments)
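The counter, lock and update_every parameters suggest a batched progress-counting pattern. The sketch below shows one way such a pattern can work; `count_reads` is a hypothetical helper, and the real function may update the counter differently. It runs in a single process here to stay self-contained.

```python
import multiprocessing


def count_reads(counter, lock, reads, update_every=2000):
    """Iterate over reads, flushing the local count to the shared counter in batches."""
    pending = 0
    for _ in reads:
        pending += 1
        if pending == update_every:
            # Take the lock only once per batch to limit contention.
            with lock:
                counter.value += pending
            pending = 0
    # Flush whatever is left after the last full batch.
    if pending:
        with lock:
            counter.value += pending


# Shared unsigned integer without its own lock; the explicit lock guards it.
counter = multiprocessing.Value('I', 0, lock=False)
lock = multiprocessing.Lock()
count_reads(counter, lock, range(4500), update_every=2000)
total = counter.value
```

Batching keeps lock acquisitions rare, which matters when many worker processes report progress on the same counter.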
- dynast.preprocessing.bam.get_tags_from_bam(bam_path: str, n_reads: int = 100000, n_threads: int = 8) -> Set[str]
Utility function to retrieve all read tags present in a BAM.
- Parameters
- bam_path
Path to BAM
- n_reads
Number of reads to consider
- n_threads
Number of threads
- Returns
Set of all tags found
- dynast.preprocessing.bam.check_bam_tags_exist(bam_path: str, tags: List[str], n_reads: int = 100000, n_threads: int = 8) -> Tuple[bool, List[str]]
Utility function to check if BAM tags exist in a BAM within the first n_reads reads.
- Parameters
- bam_path
Path to BAM
- tags
Tags to check for
- n_reads
Number of reads to consider
- n_threads
Number of threads
- Returns
(whether all tags were found, list of not found tags)
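The logic of the two tag utilities can be sketched without a BAM file. Here each read is represented as a list of (tag, value) tuples, which is an assumption about the input made for illustration; `collect_tags` and `check_tags_exist` are hypothetical names, not dynast functions.

```python
from itertools import islice


def collect_tags(reads, n_reads=100000):
    """Union of tag names across the first n_reads reads."""
    tags = set()
    for read_tags in islice(reads, n_reads):
        tags.update(name for name, _ in read_tags)
    return tags


def check_tags_exist(reads, tags, n_reads=100000):
    """Return (whether all requested tags were found, list of tags not found)."""
    found = collect_tags(reads, n_reads)
    missing = [tag for tag in tags if tag not in found]
    return not missing, missing


# Made-up reads, each a list of (tag, value) pairs.
reads = [
    [('CB', 'AAAC'), ('UB', 'GGT'), ('GX', 'gene1')],
    [('CB', 'TTTG'), ('UB', 'CCA')],
]
ok, missing = check_tags_exist(reads, ['CB', 'UB', 'MD'])
```

Because only the first n_reads reads are inspected, a tag that first appears later in the file would be reported as missing.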
- dynast.preprocessing.bam.check_bam_is_paired(bam_path: str, n_reads: int = 100000, n_threads: int = 8) -> bool
Utility function to check if BAM has paired reads.
- Parameters
- bam_path
Path to BAM
- n_reads
Number of reads to consider
- n_threads
Number of threads
- Returns
Whether paired reads were detected
- dynast.preprocessing.bam.check_bam_contains_secondary(bam_path: str, n_reads: int = 100000, n_threads: int = 8) -> bool
Check whether BAM contains secondary alignments.
- Parameters
- bam_path
Path to BAM
- n_reads
Number of reads to consider
- n_threads
Number of threads
- Returns
Whether secondary alignments were detected
- dynast.preprocessing.bam.check_bam_contains_unmapped(bam_path: str) -> bool
Check whether BAM contains unmapped reads.
- Parameters
- bam_path
Path to BAM
- Returns
Whether unmapped reads were detected
- dynast.preprocessing.bam.check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8) -> bool
Check whether BAM contains duplicates.
- Parameters
- bam_path
Path to BAM
- n_reads
Number of reads to consider
- n_threads
Number of threads
- Returns
Whether duplicates were detected
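The paired, secondary, unmapped and duplicate checks all correspond to bits of the SAM FLAG field. The sketch below tests those bits on plain integers; the bit values come from the SAM specification, while `any_flag_set` and the sample flags are illustrative, and whether dynast implements its checks this way is an assumption.

```python
from itertools import islice

# Flag bits defined by the SAM specification.
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400


def any_flag_set(flags, bit, n_reads=None):
    """True if any of the first n_reads flag values has the given bit set.

    n_reads=None inspects every flag.
    """
    return any(flag & bit for flag in islice(flags, n_reads))


# Made-up FLAG values: forward, reverse, secondary, duplicate on reverse strand.
flags = [0, 16, 256, 1040]
summary = {
    'paired': any_flag_set(flags, FLAG_PAIRED),
    'unmapped': any_flag_set(flags, FLAG_UNMAPPED),
    'secondary': any_flag_set(flags, FLAG_SECONDARY),
    'duplicate': any_flag_set(flags, FLAG_DUPLICATE),
}
```

Limiting the scan to the first n_reads values trades completeness for speed, so a negative answer only covers the reads that were inspected.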
- dynast.preprocessing.bam.sort_and_index_bam(bam_path: str, out_path: str, n_threads: int = 8, temp_dir: Optional[str] = None) -> str
Sort and index BAM.
If the BAM is already sorted, the sorting step is skipped.
- Parameters
- bam_path
Path to alignment BAM file to sort
- out_path
Path to output sorted BAM
- n_threads
Number of threads
- temp_dir
Path to temporary directory
- Returns
Path to sorted and indexed BAM
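One way to decide whether sorting can be skipped is to read the sort order declared in the SAM header. The sketch below parses the SO field of the @HD line from header text; the header format follows the SAM specification, but whether dynast detects sortedness this way is an assumption, and both helper names are hypothetical.

```python
def sort_order_from_header(header_text):
    """Return the SO value from the @HD header line, or None if absent."""
    for line in header_text.splitlines():
        if line.startswith('@HD'):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.split('\t')[1:]:
                if field.startswith('SO:'):
                    return field[3:]
    return None


def needs_sorting(header_text):
    """True unless the header declares coordinate sort order."""
    return sort_order_from_header(header_text) != 'coordinate'


sorted_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000"
unsorted_header = "@HD\tVN:1.6\tSO:unsorted\n@SQ\tSN:chr1\tLN:1000"
```

A header without an @HD line or SO field is treated as unsorted here, which errs on the side of sorting again.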
- dynast.preprocessing.bam.split_bam(bam_path: str, n: int, n_threads: int = 8, temp_dir: Optional[str] = None) -> List[Tuple[str, int]]
Split BAM into n parts.
- Parameters
- bam_path
Path to alignment BAM file
- n
Number of splits
- n_threads
Number of threads
- temp_dir
Path to temporary directory
- Returns
List of tuples containing (split BAM path, number of reads)
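The return shape, a list of (split BAM path, number of reads) tuples, can be illustrated with a small planning helper. How dynast actually distributes reads across splits is not stated here, so the near-equal division below is an assumption, and the file names and both function names are hypothetical.

```python
def split_sizes(n_reads, n):
    """Divide n_reads into n parts whose sizes differ by at most one."""
    base, extra = divmod(n_reads, n)
    return [base + 1 if i < extra else base for i in range(n)]


def plan_splits(prefix, n_reads, n):
    """Pair a hypothetical output path with the read count of each split."""
    return [(f"{prefix}_{i}.bam", size) for i, size in enumerate(split_sizes(n_reads, n))]


plan = plan_splits('split', 10, 3)
```

Keeping the read count next to each path lets a caller report progress per split without reopening the files.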
- dynast.preprocessing.bam.parse_all_reads(bam_path: str, conversions_path: str, alignments_path: str, index_path: str, gene_infos: dict, transcript_infos: dict, strand: typing_extensions.Literal['forward', 'reverse', 'unstranded'] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, n_threads: int = 8, temp_dir: Optional[str] = None, nasc: bool = False, control: bool = False, velocity: bool = True, strict_exon_overlap: bool = False, return_splits: bool = False) -> Union[Tuple[str, str, str], Tuple[str, str, str, List[Tuple[str, int]]]]
Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.
- Parameters
- bam_path
Path to alignment BAM file
- conversions_path
Path to output information about reads that have conversions
- alignments_path
Path to alignments information about reads
- index_path
Path to conversions index
- no_index_path
Path to no conversions index (listed in the docstring but not present in the signature above)
- gene_infos
Dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
- transcript_infos
Dictionary containing transcript information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
- strand
Strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded
- umi_tag
BAM tag that encodes UMI, if not provided, NA is output in the umi column
- barcode_tag
BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column
- gene_tag
BAM tag that encodes gene assignment, defaults to GX
- barcodes
List of barcodes to be considered. All barcodes are considered if not provided
- n_threads
Number of threads
- temp_dir
Path to temporary directory
- nasc
Flag to change behavior to match NASC-seq pipeline
- velocity
Whether or not to assign a velocity type to each read
- strict_exon_overlap
Whether to use a stricter algorithm to assign reads as spliced
- return_splits
Return BAM splits for later reuse
- Returns
- (path to conversions, path to alignments, path to conversions index)
If return_splits is True, then there is an additional return value, which is a list of tuples containing split BAM paths and number of reads in each BAM.
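Because the return value has three or four elements depending on return_splits, callers may want to normalize it. The helper below is a hypothetical convenience, not part of dynast, and it assumes the element order given in the Returns entry above; the paths are made up.

```python
def unpack_parse_result(result):
    """Normalize a 3- or 4-tuple into (conversions, alignments, index, splits).

    splits is None when the result has no split information.
    """
    if len(result) == 3:
        conversions_path, alignments_path, index_path = result
        return conversions_path, alignments_path, index_path, None
    conversions_path, alignments_path, index_path, splits = result
    return conversions_path, alignments_path, index_path, splits


# Made-up return values for the two cases.
without_splits = ('conversions.csv', 'alignments.csv', 'conversions.idx')
with_splits = without_splits + ([('split_0.bam', 4), ('split_1.bam', 3)],)

plain = unpack_parse_result(without_splits)
full = unpack_parse_result(with_splits)
```

Normalizing once at the call site avoids scattering length checks through downstream code.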