`dynast.preprocessing.bam`

Module Contents

Functions

`read_alignments`(alignments_path: str, *args, **kwargs) → pandas.DataFrame	Read alignments CSV as a pandas DataFrame.
`read_conversions`(conversions_path: str, *args, **kwargs) → pandas.DataFrame	Read conversions CSV as a pandas DataFrame.
`select_alignments`(df_alignments: pandas.DataFrame) → Set[Tuple[str, str]]	Select alignments among duplicates.
`parse_read_contig`(counter: multiprocessing.Value, lock: multiprocessing.Lock, bam_path: str, contig: str, gene_infos: Optional[dict] = None, transcript_infos: Optional[dict] = None, strand: typing_extensions.Literal['forward', 'reverse', 'unstranded'] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, update_every: int = 2000, nasc: bool = False, velocity: bool = True, strict_exon_overlap: bool = False) → Tuple[str, str, str]	Parse all reads mapped to a contig, outputting conversion information as temporary CSVs.
`get_tags_from_bam`(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → Set[str]	Utility function to retrieve all read tags present in a BAM.
`check_bam_tags_exist`(bam_path: str, tags: List[str], n_reads: int = 100000, n_threads: int = 8) → Tuple[bool, List[str]]	Utility function to check if BAM tags exist in a BAM within the first n_reads reads.
`check_bam_is_paired`(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → bool	Utility function to check if BAM has paired reads.
`check_bam_contains_secondary`(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → bool	Check whether BAM contains secondary alignments.
`check_bam_contains_unmapped`(bam_path: str) → bool	Check whether BAM contains unmapped reads.
`check_bam_contains_duplicate`(bam_path, n_reads=100000, n_threads=8) → bool	Check whether BAM contains duplicates.
`sort_and_index_bam`(bam_path: str, out_path: str, n_threads: int = 8, temp_dir: Optional[str] = None) → str	Sort and index BAM.
`split_bam`(bam_path: str, n: int, n_threads: int = 8, temp_dir: Optional[str] = None) → List[Tuple[str, int]]	Split BAM into n parts.
`parse_all_reads`(bam_path: str, conversions_path: str, alignments_path: str, index_path: str, gene_infos: dict, transcript_infos: dict, strand: typing_extensions.Literal['forward', 'reverse', 'unstranded'] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, n_threads: int = 8, temp_dir: Optional[str] = None, nasc: bool = False, control: bool = False, velocity: bool = True, strict_exon_overlap: bool = False, return_splits: bool = False) → Union[Tuple[str, str, str], Tuple[str, str, str, List[Tuple[str, int]]]]	Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

Attributes

`CONVERSION_CSV_COLUMNS`
`ALIGNMENT_COLUMNS`

dynast.preprocessing.bam.CONVERSION_CSV_COLUMNS = ['read_id', 'index', 'contig', 'genome_i', 'conversion', 'quality']

dynast.preprocessing.bam.ALIGNMENT_COLUMNS = ['read_id', 'index', 'barcode', 'umi', 'GX', 'A', 'C', 'G', 'T', 'velocity', 'transcriptome', 'score']

dynast.preprocessing.bam.read_alignments(alignments_path: str, *args, **kwargs) → pandas.DataFrame

Read alignments CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters

alignments_path: Path to alignments CSV

Returns

Alignments dataframe

dynast.preprocessing.bam.read_conversions(conversions_path: str, *args, **kwargs) → pandas.DataFrame

Read conversions CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters

conversions_path: Path to conversions CSV

Returns

Conversions dataframe
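
The forwarding of extra arguments to pandas.read_csv can be illustrated with a minimal sketch. The wrapper name `read_csv_forwarding` and the sample rows are hypothetical, and the sketch assumes the CSV starts with a header row matching `CONVERSION_CSV_COLUMNS`; it is not the library's implementation.

```python
import io

import pandas as pd

# Column layout copied from CONVERSION_CSV_COLUMNS in this module's documentation.
CONVERSION_CSV_COLUMNS = ['read_id', 'index', 'contig', 'genome_i', 'conversion', 'quality']


def read_csv_forwarding(path_or_buffer, *args, **kwargs):
    """Hypothetical wrapper: pass every extra argument straight to pandas.read_csv."""
    return pd.read_csv(path_or_buffer, *args, **kwargs)


# Made-up sample data, assuming a header row is present.
csv_text = (
    ",".join(CONVERSION_CSV_COLUMNS) + "\n"
    "r1,0,chr1,100,TC,35\n"
    "r2,0,chr1,250,AG,30\n"
)

# usecols and nrows are ordinary pandas.read_csv keyword arguments.
df = read_csv_forwarding(io.StringIO(csv_text), usecols=["read_id", "conversion"], nrows=1)
```

Under these assumptions, any option accepted by pandas.read_csv (for example `usecols`, `nrows`, or `dtype`) could be supplied the same way to `read_alignments` or `read_conversions`.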

dynast.preprocessing.bam.select_alignments(df_alignments: pandas.DataFrame) → Set[Tuple[str, str]]

Select alignments among duplicates. This function performs preliminary deduplication and returns a set of tuples (read_id, alignment index) to use for coverage calculation and SNP detection.

Parameters

df_alignments: Alignments dataframe

Returns

Set of (read_id, alignment index) that were selected
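
As a rough illustration of what selecting among duplicates can look like, the sketch below keeps one alignment per (barcode, umi, GX) group and prefers the highest score. The grouping key, the tie-breaking rule, and the function name `select_best_alignments` are assumptions made for this example; the actual selection criteria used by `select_alignments` are not described on this page.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def select_best_alignments(rows: List[Dict]) -> Set[Tuple[str, int]]:
    """Hypothetical rule: keep the highest-scoring alignment per (barcode, umi, GX)."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["barcode"], row["umi"], row["GX"])].append(row)

    selected = set()
    for candidates in groups.values():
        # max() returns the first maximal element, so ties keep the earliest row.
        best = max(candidates, key=lambda r: r["score"])
        selected.add((best["read_id"], best["index"]))
    return selected


# Made-up rows using a subset of ALIGNMENT_COLUMNS.
rows = [
    {"read_id": "r1", "index": 0, "barcode": "AAAC", "umi": "GGT", "GX": "G1", "score": 10},
    {"read_id": "r2", "index": 0, "barcode": "AAAC", "umi": "GGT", "GX": "G1", "score": 25},
    {"read_id": "r3", "index": 1, "barcode": "TTTG", "umi": "CCA", "GX": "G2", "score": 7},
]
selected = select_best_alignments(rows)
```

Here `r1` and `r2` share a barcode, UMI, and gene, so only the higher-scoring `r2` is kept alongside `r3`.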

dynast.preprocessing.bam.parse_read_contig(counter: multiprocessing.Value, lock: multiprocessing.Lock, bam_path: str, contig: str, gene_infos: Optional[dict] = None, transcript_infos: Optional[dict] = None, strand: typing_extensions.Literal['forward', 'reverse', 'unstranded'] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, update_every: int = 2000, nasc: bool = False, velocity: bool = True, strict_exon_overlap: bool = False) → Tuple[str, str, str]

Parse all reads mapped to a contig, outputting conversion information as temporary CSVs. This function is designed to be called as a separate process.

Parameters

counter: Counter that keeps track of how many reads have been processed
lock: Semaphore for the counter so that multiple processes do not modify it at the same time
bam_path: Path to alignment BAM file
contig: Only reads that map to this contig will be processed
gene_infos: Dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf, required if velocity=True
transcript_infos: Dictionary containing transcript information, as returned by preprocessing.gtf.parse_gtf, required if velocity=True
strand: Strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded
umi_tag: BAM tag that encodes UMI, if not provided, NA is output in the umi column
barcode_tag: BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column
gene_tag: BAM tag that encodes gene assignment
barcodes: List of barcodes to be considered. All barcodes are considered if not provided
temp_dir: Path to temporary directory
update_every: Update the counter every this many reads, defaults to 2000
nasc: Flag to change behavior to match NASC-seq pipeline, defaults to False
velocity: Whether or not to assign a velocity type to each read
strict_exon_overlap: Whether to use a stricter algorithm to assign reads as spliced

Returns

(path to conversions, path to conversions index, path to alignments)
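
The `counter`, `lock`, and `update_every` parameters describe a common pattern for reporting progress from worker processes: each worker counts locally and touches the shared counter only once per batch. The sketch below shows that pattern in isolation. The function `count_reads` is hypothetical, a plain `range` stands in for reads from a BAM, and the sketch runs in a single process for simplicity.

```python
import multiprocessing


def count_reads(counter, lock, reads, update_every=2000):
    """Iterate over reads, adding to the shared counter once per update_every reads."""
    n = 0
    for n, _read in enumerate(reads, start=1):
        if n % update_every == 0:
            # Take the lock only once per batch to keep contention low.
            with lock:
                counter.value += update_every
    # Flush the reads left over after the last full batch.
    with lock:
        counter.value += n % update_every
    return n


counter = multiprocessing.Value("I", 0)  # shared unsigned int
lock = multiprocessing.Lock()
total = count_reads(counter, lock, range(4500), update_every=2000)
```

A larger `update_every` means fewer lock acquisitions at the cost of a coarser progress display.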

dynast.preprocessing.bam.get_tags_from_bam(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → Set[str]

Utility function to retrieve all read tags present in a BAM.

Parameters

bam_path: Path to BAM
n_reads: Number of reads to consider
n_threads: Number of threads

Returns

Set of all tags found

dynast.preprocessing.bam.check_bam_tags_exist(bam_path: str, tags: List[str], n_reads: int = 100000, n_threads: int = 8) → Tuple[bool, List[str]]

Utility function to check if BAM tags exist in a BAM within the first n_reads reads.

Parameters

bam_path: Path to BAM
tags: Tags to check for
n_reads: Number of reads to consider
n_threads: Number of threads

Returns

(whether all tags were found, list of not found tags)
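
The logic of scanning only the first n_reads reads and reporting missing tags can be sketched without a BAM file. In the example below each read is represented as a list of (tag, value) pairs, and the helper names `collect_tags` and `tags_exist` are hypothetical stand-ins rather than the library's code.

```python
from itertools import islice
from typing import Iterable, List, Set, Tuple


def collect_tags(reads: Iterable[List[Tuple[str, object]]], n_reads: int = 100000) -> Set[str]:
    """Collect every tag name seen in the first n_reads reads."""
    tags = set()
    for read_tags in islice(reads, n_reads):
        tags.update(tag for tag, _value in read_tags)
    return tags


def tags_exist(reads, tags: List[str], n_reads: int = 100000) -> Tuple[bool, List[str]]:
    """Return (whether all tags were found, list of tags that were not found)."""
    found = collect_tags(reads, n_reads)
    missing = [tag for tag in tags if tag not in found]
    return len(missing) == 0, missing


# Made-up reads; the third read lies outside the scanned window when n_reads=2.
reads = [
    [("CB", "AAAC"), ("UB", "GGT")],
    [("CB", "TTTG"), ("GX", "G1")],
    [("MD", "50")],
]
ok, missing = tags_exist(reads, ["CB", "UB", "MD"], n_reads=2)
```

Because only a prefix of the file is inspected, a tag that first appears after n_reads reads is reported as missing, as `MD` is here.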

dynast.preprocessing.bam.check_bam_is_paired(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → bool

Utility function to check if BAM has paired reads.

Parameters

bam_path: Path to BAM
n_reads: Number of reads to consider
n_threads: Number of threads

Returns

Whether paired reads were detected

dynast.preprocessing.bam.check_bam_contains_secondary(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → bool

Check whether BAM contains secondary alignments.

Parameters

bam_path: Path to BAM
n_reads: Number of reads to consider
n_threads: Number of threads

Returns

Whether secondary alignments were detected

dynast.preprocessing.bam.check_bam_contains_unmapped(bam_path: str) → bool

Check whether BAM contains unmapped reads.

Parameters

bam_path: Path to BAM

Returns

Whether unmapped reads were detected

dynast.preprocessing.bam.check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8) → bool

Check whether BAM contains duplicates.

Parameters

bam_path: Path to BAM
n_reads: Number of reads to consider
n_threads: Number of threads

Returns

Whether duplicates were detected

dynast.preprocessing.bam.sort_and_index_bam(bam_path: str, out_path: str, n_threads: int = 8, temp_dir: Optional[str] = None) → str

Sort and index BAM.

If the BAM is already sorted, the sorting step is skipped.

Parameters

bam_path: Path to alignment BAM file to sort
out_path: Path to output sorted BAM
n_threads: Number of threads
temp_dir: Path to temporary directory

Returns

Path to sorted and indexed BAM

dynast.preprocessing.bam.split_bam(bam_path: str, n: int, n_threads: int = 8, temp_dir: Optional[str] = None) → List[Tuple[str, int]]

Split BAM into n parts.

Parameters

bam_path: Path to alignment BAM file
n: Number of splits
n_threads: Number of threads
temp_dir: Path to temporary directory

Returns

List of tuples containing (split BAM path, number of reads)

dynast.preprocessing.bam.parse_all_reads(bam_path: str, conversions_path: str, alignments_path: str, index_path: str, gene_infos: dict, transcript_infos: dict, strand: typing_extensions.Literal['forward', 'reverse', 'unstranded'] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, n_threads: int = 8, temp_dir: Optional[str] = None, nasc: bool = False, control: bool = False, velocity: bool = True, strict_exon_overlap: bool = False, return_splits: bool = False) → Union[Tuple[str, str, str], Tuple[str, str, str, List[Tuple[str, int]]]]

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

Parameters

bam_path: Path to alignment BAM file
conversions_path: Path to output information about reads that have conversions
alignments_path: Path to alignments information about reads
index_path: Path to conversions index
no_index_path: Path to no conversions index (listed in the docstring but absent from the signature)
gene_infos: Dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
transcript_infos: Dictionary containing transcript information, as returned by ngs.gtf.genes_and_transcripts_from_gtf
strand: Strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded
umi_tag: BAM tag that encodes UMI, if not provided, NA is output in the umi column
barcode_tag: BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column
gene_tag: BAM tag that encodes gene assignment, defaults to GX
barcodes: List of barcodes to be considered. All barcodes are considered if not provided
n_threads: Number of threads
temp_dir: Path to temporary directory
nasc: Flag to change behavior to match NASC-seq pipeline
velocity: Whether or not to assign a velocity type to each read
strict_exon_overlap: Whether to use a stricter algorithm to assign reads as spliced
return_splits: Whether to return BAM splits for later reuse

Returns

(path to conversions, path to alignments, path to conversions index). If return_splits is True, there is an additional return value: a list of tuples containing the split BAM paths and the number of reads in each BAM.
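
Because the return value has three or four elements depending on return_splits, calling code may want to normalize it before use. The helper below is a hypothetical sketch of one way to do that; `unpack_result` is not part of the module, and the paths are placeholders.

```python
from typing import List, Optional, Tuple, Union

Splits = List[Tuple[str, int]]
Result = Union[Tuple[str, str, str], Tuple[str, str, str, Splits]]


def unpack_result(result: Result) -> Tuple[str, str, str, Optional[Splits]]:
    """Normalize a 3- or 4-tuple to (conversions, alignments, index, splits or None)."""
    if len(result) == 4:
        conversions, alignments, index, splits = result
        return conversions, alignments, index, splits
    conversions, alignments, index = result
    return conversions, alignments, index, None


# Placeholder values shaped like the documented return types.
without_splits = unpack_result(("conversions.csv", "alignments.csv", "conversions.idx"))
with_splits = unpack_result(
    ("conversions.csv", "alignments.csv", "conversions.idx", [("split_0.bam", 1200)])
)
```

Normalizing to a fixed-length tuple keeps downstream code free of length checks, at the cost of a `None` placeholder when no splits were requested.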
