dynast.preprocessing.bam

Module Contents

Functions

read_alignments(alignments_path: str, *args, **kwargs) → pandas.DataFrame

Read alignments CSV as a pandas DataFrame.

read_conversions(conversions_path: str, *args, **kwargs) → pandas.DataFrame

Read conversions CSV as a pandas DataFrame.

select_alignments(df_alignments: pandas.DataFrame) → Set[Tuple[str, str]]

Select alignments among duplicates.

parse_read_contig(counter: multiprocessing.Value, lock: multiprocessing.Lock, bam_path: str, contig: str, gene_infos: Optional[dict] = None, transcript_infos: Optional[dict] = None, strand: typing_extensions.Literal[forward, reverse, unstranded] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, update_every: int = 2000, nasc: bool = False, velocity: bool = True, strict_exon_overlap: bool = False) → Tuple[str, str, str]

Parse all reads mapped to a contig, outputting conversion information as temporary CSVs.

get_tags_from_bam(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → Set[str]

Utility function to retrieve all read tags present in a BAM.

check_bam_tags_exist(bam_path: str, tags: List[str], n_reads: int = 100000, n_threads: int = 8) → Tuple[bool, List[str]]

Utility function to check if BAM tags exist in a BAM within the first n_reads reads.

check_bam_is_paired(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → bool

Utility function to check if BAM has paired reads.

check_bam_contains_secondary(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → bool

Check whether BAM contains secondary alignments.

check_bam_contains_unmapped(bam_path: str) → bool

Check whether BAM contains unmapped reads.

check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8) → bool

Check whether BAM contains duplicates.

sort_and_index_bam(bam_path: str, out_path: str, n_threads: int = 8, temp_dir: Optional[str] = None) → str

Sort and index BAM.

split_bam(bam_path: str, n: int, n_threads: int = 8, temp_dir: Optional[str] = None) → List[Tuple[str, int]]

Split BAM into n parts.

parse_all_reads(bam_path: str, conversions_path: str, alignments_path: str, index_path: str, gene_infos: dict, transcript_infos: dict, strand: typing_extensions.Literal[forward, reverse, unstranded] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, n_threads: int = 8, temp_dir: Optional[str] = None, nasc: bool = False, control: bool = False, velocity: bool = True, strict_exon_overlap: bool = False, return_splits: bool = False) → Union[Tuple[str, str, str], Tuple[str, str, str, List[Tuple[str, int]]]]

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

Attributes

CONVERSION_CSV_COLUMNS

ALIGNMENT_COLUMNS

dynast.preprocessing.bam.CONVERSION_CSV_COLUMNS = ['read_id', 'index', 'contig', 'genome_i', 'conversion', 'quality']

dynast.preprocessing.bam.ALIGNMENT_COLUMNS = ['read_id', 'index', 'barcode', 'umi', 'GX', 'A', 'C', 'G', 'T', 'velocity', 'transcriptome', 'score']

dynast.preprocessing.bam.read_alignments(alignments_path: str, *args, **kwargs) → pandas.DataFrame

Read alignments CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters
alignments_path

Path to alignments CSV

Returns

Alignments dataframe

dynast.preprocessing.bam.read_conversions(conversions_path: str, *args, **kwargs) → pandas.DataFrame

Read conversions CSV as a pandas DataFrame.

Any additional arguments and keyword arguments are passed to pandas.read_csv.

Parameters
conversions_path

Path to conversions CSV

Returns

Conversions dataframe
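
To illustrate the column layout, the following sketch writes and re-reads a small conversions table using the names in CONVERSION_CSV_COLUMNS. It uses the standard-library csv module rather than dynast or pandas, the row values are made up, and the presence of a header row in the real file is an assumption.

```python
import csv
import io

# Column names copied from CONVERSION_CSV_COLUMNS above.
CONVERSION_CSV_COLUMNS = ['read_id', 'index', 'contig', 'genome_i', 'conversion', 'quality']

# Hypothetical rows; every value here is invented for illustration.
rows = [
    {'read_id': 'read1', 'index': 0, 'contig': 'chr1', 'genome_i': 1042, 'conversion': 'TC', 'quality': 37},
    {'read_id': 'read1', 'index': 0, 'contig': 'chr1', 'genome_i': 1077, 'conversion': 'TC', 'quality': 12},
]

# Write the rows to an in-memory CSV with a header (assumed layout).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=CONVERSION_CSV_COLUMNS)
writer.writeheader()
writer.writerows(rows)

# Read the CSV back; csv.DictReader returns every field as a string.
buffer.seek(0)
records = list(csv.DictReader(buffer))

# Keep conversions with base quality of at least 20 (an arbitrary threshold).
high_quality = [r for r in records if int(r['quality']) >= 20]
```

With the real function, the same file would be loaded as a DataFrame, and any extra arguments would be forwarded to pandas.read_csv as described above.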

dynast.preprocessing.bam.select_alignments(df_alignments: pandas.DataFrame) → Set[Tuple[str, str]]

Select alignments among duplicates. This function performs preliminary deduplication and returns a set of tuples (read_id, alignment index) to use for coverage calculation and SNP detection.

Parameters
df_alignments

Alignments dataframe

Returns

Set of (read_id, alignment index) that were selected
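
The selection criteria are not spelled out here, so the following is only a sketch of one plausible deduplication rule: keep a single alignment per (barcode, umi, GX) group, preferring the highest score. The helper name select_best_per_group, the grouping key, and the tie-breaking rule are all assumptions, and plain dictionaries stand in for DataFrame rows.

```python
from typing import Dict, List, Set, Tuple


def select_best_per_group(alignments: List[dict]) -> Set[Tuple[str, int]]:
    """Hypothetical rule: keep the highest-scoring alignment per (barcode, umi, GX)."""
    best: Dict[Tuple[str, str, str], dict] = {}
    for aln in alignments:
        key = (aln['barcode'], aln['umi'], aln['GX'])
        # Strictly greater, so ties keep the first alignment encountered.
        if key not in best or aln['score'] > best[key]['score']:
            best[key] = aln
    return {(aln['read_id'], aln['index']) for aln in best.values()}


# Hypothetical alignments: r1 and r2 share a barcode, UMI and gene.
alignments = [
    {'read_id': 'r1', 'index': 0, 'barcode': 'AAAC', 'umi': 'GGT', 'GX': 'gene1', 'score': 10},
    {'read_id': 'r2', 'index': 0, 'barcode': 'AAAC', 'umi': 'GGT', 'GX': 'gene1', 'score': 30},
    {'read_id': 'r3', 'index': 0, 'barcode': 'AAAC', 'umi': 'CCA', 'GX': 'gene1', 'score': 5},
]
selected = select_best_per_group(alignments)
```

The returned set has the same shape as the documented return value, so membership tests such as `(read_id, index) in selected` are cheap.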

dynast.preprocessing.bam.parse_read_contig(counter: multiprocessing.Value, lock: multiprocessing.Lock, bam_path: str, contig: str, gene_infos: Optional[dict] = None, transcript_infos: Optional[dict] = None, strand: typing_extensions.Literal[forward, reverse, unstranded] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, temp_dir: Optional[str] = None, update_every: int = 2000, nasc: bool = False, velocity: bool = True, strict_exon_overlap: bool = False) → Tuple[str, str, str]

Parse all reads mapped to a contig, outputting conversion information as temporary CSVs. This function is designed to be called as a separate process.

Parameters
counter

Counter that keeps track of how many reads have been processed

lock

Lock guarding the counter so that multiple processes do not modify it at the same time

bam_path

Path to alignment BAM file

contig

Only reads that map to this contig will be processed

gene_infos

Dictionary containing gene information, as returned by preprocessing.gtf.parse_gtf, required if velocity=True

transcript_infos

Dictionary containing transcript information, as returned by preprocessing.gtf.parse_gtf, required if velocity=True

strand

Strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded

umi_tag

BAM tag that encodes UMI, if not provided, NA is output in the umi column

barcode_tag

BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column

gene_tag

BAM tag that encodes gene assignment

barcodes

List of barcodes to be considered. All barcodes are considered if not provided

temp_dir

Path to temporary directory

update_every

Update the counter every this many reads, defaults to 2000

nasc

Flag to change behavior to match NASC-seq pipeline, defaults to False

velocity

Whether or not to assign a velocity type to each read

strict_exon_overlap

Whether to use a stricter algorithm to assign reads as spliced

Returns

(path to conversions, path to conversions index, path to alignments)
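
The counter, lock and update_every parameters describe a batched progress counter. The sketch below shows that pattern in isolation, assuming the counter is incremented in batches under the lock; the function name process_reads and the loop body are hypothetical, and no BAM file is read.

```python
import multiprocessing


def process_reads(counter, lock, n_reads: int, update_every: int = 2000) -> None:
    """Count processed reads, touching the shared counter once per batch."""
    pending = 0
    for _ in range(n_reads):
        # Per-read work would happen here.
        pending += 1
        if pending == update_every:
            # Acquire the lock only once per batch to limit contention.
            with lock:
                counter.value += pending
            pending = 0
    # Flush whatever is left over after the last full batch.
    if pending:
        with lock:
            counter.value += pending


# Shared unsigned integer and lock, as a parent process might create them.
counter = multiprocessing.Value('I', 0)
lock = multiprocessing.Lock()

# Called directly here; in practice one call would run per worker process.
process_reads(counter, lock, n_reads=4500)
```

Batching keeps workers from contending on the lock for every read, at the cost of a progress display that lags by up to update_every reads per worker.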

dynast.preprocessing.bam.get_tags_from_bam(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → Set[str]

Utility function to retrieve all read tags present in a BAM.

Parameters
bam_path

Path to BAM

n_reads

Number of reads to consider

n_threads

Number of threads

Returns

Set of all tags found

dynast.preprocessing.bam.check_bam_tags_exist(bam_path: str, tags: List[str], n_reads: int = 100000, n_threads: int = 8) → Tuple[bool, List[str]]

Utility function to check if BAM tags exist in a BAM within the first n_reads reads.

Parameters
bam_path

Path to BAM

tags

Tags to check for

n_reads

Number of reads to consider

n_threads

Number of threads

Returns

(whether all tags were found, list of not found tags)
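
The shape of the return value can be illustrated without a BAM file. This sketch assumes the set of observed tags is already available (for example, from get_tags_from_bam) and only shows how a (found, missing) pair might be derived; summarize_tags is a hypothetical helper, not part of dynast.

```python
from typing import Iterable, List, Set, Tuple


def summarize_tags(found_tags: Set[str], required_tags: Iterable[str]) -> Tuple[bool, List[str]]:
    """Return (whether all required tags were found, list of tags not found)."""
    missing = [tag for tag in required_tags if tag not in found_tags]
    return len(missing) == 0, missing


# Hypothetical tags observed in the first reads of a BAM.
observed = {'CB', 'UB', 'GX', 'NH'}

all_found, missing = summarize_tags(observed, ['CB', 'UB', 'GN'])
```

A caller would typically raise an error or log the missing tags when the first element is False.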

dynast.preprocessing.bam.check_bam_is_paired(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → bool

Utility function to check if BAM has paired reads.

Parameters
bam_path

Path to BAM

n_reads

Number of reads to consider

n_threads

Number of threads

Returns

Whether paired reads were detected

dynast.preprocessing.bam.check_bam_contains_secondary(bam_path: str, n_reads: int = 100000, n_threads: int = 8) → bool

Check whether BAM contains secondary alignments.

Parameters
bam_path

Path to BAM

n_reads

Number of reads to consider

n_threads

Number of threads

Returns

Whether secondary alignments were detected

dynast.preprocessing.bam.check_bam_contains_unmapped(bam_path: str) → bool

Check whether BAM contains unmapped reads.

Parameters
bam_path

Path to BAM

Returns

Whether unmapped reads were detected

dynast.preprocessing.bam.check_bam_contains_duplicate(bam_path, n_reads=100000, n_threads=8) → bool

Check whether BAM contains duplicates.

Parameters
bam_path

Path to BAM

n_reads

Number of reads to consider

n_threads

Number of threads

Returns

Whether duplicates were detected
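
The paired, secondary, unmapped and duplicate properties each correspond to one bit of the SAM FLAG field. The sketch below decodes those bits from an integer flag; the bit values come from the SAM specification, while the helper describe_flag is hypothetical and whether dynast inspects flags in exactly this way is an assumption.

```python
# Bit values defined by the SAM specification.
FLAG_PAIRED = 0x1       # template has multiple segments (paired-end)
FLAG_UNMAPPED = 0x4     # segment is unmapped
FLAG_SECONDARY = 0x100  # secondary alignment
FLAG_DUPLICATE = 0x400  # PCR or optical duplicate


def describe_flag(flag: int) -> dict:
    """Decode the four properties checked above from a SAM FLAG value."""
    return {
        'paired': bool(flag & FLAG_PAIRED),
        'unmapped': bool(flag & FLAG_UNMAPPED),
        'secondary': bool(flag & FLAG_SECONDARY),
        'duplicate': bool(flag & FLAG_DUPLICATE),
    }


# 1025 = 0x400 + 0x1: a paired read marked as a duplicate.
info = describe_flag(1025)

# A collection "contains" a property if any of its flags sets that bit.
flags = [0, 16, 256]
contains_secondary = any(describe_flag(f)['secondary'] for f in flags)
```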

dynast.preprocessing.bam.sort_and_index_bam(bam_path: str, out_path: str, n_threads: int = 8, temp_dir: Optional[str] = None) → str

Sort and index BAM.

If the BAM is already sorted, the sorting step is skipped.

Parameters
bam_path

Path to alignment BAM file to sort

out_path

Path to output sorted BAM

n_threads

Number of threads

temp_dir

Path to temporary directory

Returns

Path to sorted and indexed BAM

dynast.preprocessing.bam.split_bam(bam_path: str, n: int, n_threads: int = 8, temp_dir: Optional[str] = None) → List[Tuple[str, int]]

Split BAM into n parts.

Parameters
bam_path

Path to alignment BAM file

n

Number of splits

n_threads

Number of threads

temp_dir

Path to temporary directory

Returns

List of tuples containing (split BAM path, number of reads)
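
How reads are distributed across the n parts is not described here. As a rough illustration only, the sketch below computes near-equal part sizes and pairs them with made-up file names to mimic the documented return shape; split_counts and the path pattern are hypothetical and do not reflect the actual splitting strategy.

```python
from typing import List, Tuple


def split_counts(total_reads: int, n: int) -> List[int]:
    """Divide total_reads into n part sizes that differ by at most one."""
    base, extra = divmod(total_reads, n)
    # The first `extra` parts each receive one additional read.
    return [base + 1 if i < extra else base for i in range(n)]


counts = split_counts(10, 3)

# Mimic the documented return shape: (split BAM path, number of reads).
splits: List[Tuple[str, int]] = [
    (f'split_{i}.bam', count) for i, count in enumerate(counts)
]
total = sum(count for _, count in splits)
```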

dynast.preprocessing.bam.parse_all_reads(bam_path: str, conversions_path: str, alignments_path: str, index_path: str, gene_infos: dict, transcript_infos: dict, strand: typing_extensions.Literal[forward, reverse, unstranded] = 'forward', umi_tag: Optional[str] = None, barcode_tag: Optional[str] = None, gene_tag: str = 'GX', barcodes: Optional[List[str]] = None, n_threads: int = 8, temp_dir: Optional[str] = None, nasc: bool = False, control: bool = False, velocity: bool = True, strict_exon_overlap: bool = False, return_splits: bool = False) → Union[Tuple[str, str, str], Tuple[str, str, str, List[Tuple[str, int]]]]

Parse all reads in a BAM and extract conversion, content and alignment information as CSVs.

Parameters
bam_path

Path to alignment BAM file

conversions_path

Path to output information about reads that have conversions

alignments_path

Path to alignments information about reads

index_path

Path to conversions index

no_index_path

Path to no conversions index

gene_infos

Dictionary containing gene information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

transcript_infos

Dictionary containing transcript information, as returned by ngs.gtf.genes_and_transcripts_from_gtf

strand

Strandedness of the sequencing protocol, defaults to forward, may be one of the following: forward, reverse, unstranded

umi_tag

BAM tag that encodes UMI, if not provided, NA is output in the umi column

barcode_tag

BAM tag that encodes cell barcode, if not provided, NA is output in the barcode column

gene_tag

BAM tag that encodes gene assignment, defaults to GX

barcodes

List of barcodes to be considered. All barcodes are considered if not provided

n_threads

Number of threads

temp_dir

Path to temporary directory

nasc

Flag to change behavior to match NASC-seq pipeline

velocity

Whether or not to assign a velocity type to each read

strict_exon_overlap

Whether to use a stricter algorithm to assign reads as spliced

return_splits

Whether to return BAM splits for later reuse

Returns

(path to conversions, path to alignments, path to conversions index)

If return_splits is True, then there is an additional return value, which is a list of tuples containing split BAM paths and number of reads in each BAM.
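
Because the length of the returned tuple depends on return_splits, calling code may want to normalize it. The following sketch does so with a hypothetical helper, unpack_result, applied to made-up tuples; it does not call dynast.

```python
from typing import List, Optional, Tuple

Splits = List[Tuple[str, int]]


def unpack_result(result: tuple) -> Tuple[str, str, str, Optional[Splits]]:
    """Normalize a 3- or 4-element result to four values, using None for absent splits."""
    if len(result) == 4:
        conversions_path, alignments_path, index_path, splits = result
    else:
        conversions_path, alignments_path, index_path = result
        splits = None
    return conversions_path, alignments_path, index_path, splits


# Hypothetical results, shaped like the documented return values.
without_splits = ('conversions.csv', 'alignments.csv', 'conversions.idx')
with_splits = without_splits + ([('part_0.bam', 1200), ('part_1.bam', 1180)],)

_, _, _, no_splits = unpack_result(without_splits)
_, _, _, some_splits = unpack_result(with_splits)
```

Returning None for the missing element lets downstream code branch on a single value instead of checking the tuple length repeatedly.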