Hierarchical TAD Identification¶

class tadlib.hitad.chromLev.Chrom(chrom, res, hicdata, replabel, maxapart=4000000)[source]¶

Chrom is defined to:

Hold Hi-C data within a certain chromosome
Identify hierarchical domains in 4 steps: 1.Calculate adaptive DIs. 2.Identify original candidate bounds by 5-state Gaussian mixture Hidden Markov Model using adaptive DIs as input. 3.Select TAD bounds from candidate bounds. 4.Recursively identify inner domain structures of each TAD.
Visualize any region of the chromosome. Hierarchical domains will be plotted as boxes along with the diagonal of the heatmap, and adaptive DI track will be placed on top of the heatmap.

Parameters

chromstr: Chromosome label.
resint: Resolution of the Hi-C data in base-pair unit.
hicdataCSR sparse matrix: Bin-level Hi-C matrix of the specified chromosome.
replabelstr: Biological replicate label.
maxapartint: Maximum allowable TAD size in base-pair unit. (Default: 4000000)

Attributes

chromstr: Chromosome label.
resint: Resolution in base-pair unit.
maxapartint: Maximum allowable TAD size.
replabelstr: Biological replicate label.
chromLenint: Total bin number of the chromosome.
rawMatrixsparse matrix in Compressed Sparse Row format: CSR sparse matrix is used to extract Hi-C data by slicing conveniently while guarantee low memory overhead.

Methods

`calDI`(self, windows, start)	Calculate Directionality Index (DI) for each bin with adaptive window size.
`callDomain`(self)	Direct API for our hierarchical domain identification pipeline:
`detectPeaks`(self, trends[, mph, mpd])	Detect peaks (local maxima) in a 1-D array intuitively (a peak must be greater than its immediate neighbors).
`fineDomain`(self)	Identify hierarchical domains within each TAD.
`getDomainList`(self, byregion)	Combine by-region domains into a single list.
`getSelfMatrix`(self, start, end)	Return the contact matrix of any given region.
`getWeightMatrix`(self, start, end[, bases])	Calculate weights for each intra-domain interaction by considering the genomic distance and the local interaction background.
`idxmatch`(self, domain)	Pair interactions of the given domain with the upstream and downstream interactions under the same genomic distance.
`iterCore`(self, minDomains, tmpDomains)	Calculate the mismatch ratio for the input two domain lists.
`maxCore`(self[, cache])	Perform TAD identification.
`maxscorepath`(self, domainlist, cache)	An implementation for our proposed algorithm to find the best separation solution at chromosomal/domain level, given bottom domain list and pre-computed scores.
`minCore`(self, regionDIs)	Output domain list for each gap-free region.
`minWindows`(self, start, end, maxw)	Estimate best window size for each bin of a given range.
`oriIter`(self, minDomains)	Iteratvely approximate adaptive window sizes and return the final bottom domain list which will be used in subsequent procedures.
`oriWindow`(self, P)	Estimate the most appropriate window size for current bin to best capture the local interaction bias direction.
`pipe`(self, seq, start)	Transform an observed sequence into a list of domains.
`plot`(self, start, end, Domains, figname[, …])	Given a genomic region and a domain list, plot corresponding contact heatmap and all domains (represented as diagonal squares) within the region.
`randomCheck`(self, seq[, pthre])	We use chi square test to test the randomness of a sequence by looking at the conversion frequency between neighbors in the sequence.
`refNoise`(self, domain)	Return noise level of a domain, which is simply defined as the zero entry ratio of the contact matrix.
`scoreCache`(self, domainlist[, cache])	Calculate and cache the TAD scores for any combinations of consecutive bottom domains.
`splitChrom`(self, DIs)	Split a chromosome into gap-free regions.
`stablescore`(self, domain[, bases])	Calculate TAD score for the given domain.
`subDomains`(self, domain, reflist[, clv, aM, …])	A recusive method (function) to identify inner domain hierarchy of a TAD (or a domain).
`toArrowhead`(self, start, end)	Perform Arrowhead transformation on contact matrix of a given genomic region.
`viterbi`(self, seq)	Find the most likely hidden state series using the viterbi algorithm.

calIS
preciseBound

calDI(self, windows, start)[source]¶

Calculate Directionality Index (DI) for each bin with adaptive window size.

Parameters

windows1-D numpy.ndarray, int32: Returned by tadlib.hitad.chromLev.Chrom.minWindows().
startint: Starting bin index, the window size of which is taken from the 1st place of windows.

Attributes

DIs1-D numpy ndarray, float: Calculated adaptive DI array, which has the same size as the input windows.

callDomain(self)[source]¶

Direct API for our hierarchical domain identification pipeline:

Adaptively estimate window size for each bin. (tadlib.hitad.chromLev.Chrom.minWindows())
Calculate adaptive DIs. (tadlib.hitad.chromLev.Chrom.calDI())
Iteratively correct adaptive window size and bottom boundary positions. (tadlib.hitad.chromLev.Chrom.oriIter())
Identify TADs based on bottom domains. (tadlib.hitad.chromLev.Chrom.maxCore())
Resolve domain hierarchy within each TAD. (tadlib.hitad.chromLev.Chrom.fineDomain())

detectPeaks(self, trends, mph=0, mpd=5)[source]¶

Detect peaks (local maxima) in a 1-D array intuitively (a peak must be greater than its immediate neighbors).

Parameters

trends1-D numpy ndarray: Data.
mphfloat: Only peaks that are greater than this value will be detected. (Default: 0)
mpdpositive integer: Only peaks whose indices are at least separated by this value will be reported. (Default: 5)

Returns

ind1-D numpy ndarray: Indices of peaks detected in trends.

fineDomain(self)[source]¶

Identify hierarchical domains within each TAD.

See also

tadlib.hitad.chromLev.Chrom.maxCore: identify TADs
tadlib.hitad.chromLev.Chrom.subDomains: resolve domain hierarchy within a given TAD

Attributes

hierDomainsdict: The keys are tuples representing gap-free regions of the chromosome, and the values are corresponding identified hierarchical domain lists. Start and end of the domain are in base-pair unit.

getDomainList(self, byregion)[source]¶

Combine by-region domains into a single list.

Parameters

byregiondict: The keys are tuples representing gap-free regions of the chromosome, and the values are corresponding identified domain lists.

Returns

DomainListlist: A merged domain list of all regions

getSelfMatrix(self, start, end)[source]¶

Return the contact matrix of any given region.

Parameters

start, endint: The region interval in base-pair unit.

Returns

Matrix2-D numpy ndarray, float: Sub contact matrix.

getWeightMatrix(self, start, end, bases=[])[source]¶

Calculate weights for each intra-domain interaction by considering the genomic distance and the local interaction background.

Parameters

start, endint: The domain interval in base-pair unit.
baseslist: List of the bottom domains within the given interval.

Returns

W2-D numpy.ndarray: The weight matrix. (An upper triangular matrix)

idxmatch(self, domain)[source]¶

Pair interactions of the given domain with the upstream and downstream interactions under the same genomic distance.

Parameters

domain[start, end]: Domain interval in base-pair unit.

Returns

cur_storetuple, (x-coordinates, y-coordinates, interaction frequencies): Interactions within the given domain.
up_storetuple, (x-coordinates, y-coordinates, interaction frequencies): Corresponding upstream interctions.
down_storetuple, (x-coordinates, y-coordinates, interaction frequencies): Corresponding downstream interactions.

iterCore(self, minDomains, tmpDomains)[source]¶

Calculate the mismatch ratio for the input two domain lists. Return 1 if minDomains is empty.

tadlib.hitad.chromLev.Chrom.oriIter() uses this method to determine whether to break the iteration.

Parameters

minDomainsdict: Target domains calculated by the last loop.
tmpDomainsdict: Query domains returned by the current loop.

Returns

tolfloat: Mismatch ratio.

maxCore(self, cache={})[source]¶

Perform TAD identification. We define TADs as domains to optimize chromosome separation(based on some objective function), which is solved by using an algorithm like dynamic programming implemented in tadlib.hitad.chromLev.Chrom.maxscorepath().

Parameters

cachedict: TAD scores for all combinations of consecutive bottom domains will be pre-computed and stored in this dict. The keys are tuples (start, end) representing merged domain intervals. (Default: {})

See also

tadlib.hitad.chromLev.Chrom.scoreCache: Pre-compute TAD scores for all combinations of consecutive bottom domains.
tadlib.hitad.chromLev.Chrom.maxscorepath: find the best TAD list

Attributes

maxDomainsdict: Optimized TADs.
cachedict: Cached TAD scores.

maxscorepath(self, domainlist, cache)[source]¶

An implementation for our proposed algorithm to find the best separation solution at chromosomal/domain level, given bottom domain list and pre-computed scores.

Parameters

domainlistlist of [start,end]: List of bottom domain intervals in base-pair unit.
cachedict: Pre-computed scores for any combinations of continuous bottom domains. The keys are intervals of continuous regions in (start,end) format, and the values are corresponding scores.

Returns

bestsset of (sidx, eidx): Each element indicates one merged domain of the solution, represented as (start index, end index) of the input domainlist.

minCore(self, regionDIs)[source]¶

Output domain list for each gap-free region.

Parameters

regionDIsdict: Gap-free regions and corresponding adaptive DI arrays.

Returns

minDomainsdict: Gap-free regions and corresponding identified bottom domain list. Different from tadlib.hitad.chromLev.Chrom.pipe(), the start and the end of a domain are in base-pair unit.

minWindows(self, start, end, maxw)[source]¶

Estimate best window size for each bin of a given range.

Parameters

start, endint: Specify range of the bin indices.
maxwint: Maximum allowable window size.

See also

tadlib.hitad.chromLev.Chrom.oriWindow: Window size estimation for a single bin.

Attributes

windows1-D numpy.ndarray, int32

oriIter(self, minDomains)[source]¶

Iteratvely approximate adaptive window sizes and return the final bottom domain list which will be used in subsequent procedures. For each loop, window sizes are updated according to the latest bottom domains and next loop will re-run the identification pipeline using new window sizes. The iteration terminates if domains between two consecutive loops are very similar (estimated by tadlib.hitad.chromLev.Chrom.iterCore()).

Parameters

minDomainsdict: Initial domains served as the target domain list for tadlib.hitad.chromLev.Chrom.iterCore() at the first iteration. We set it empty in our pipeline.

See also

tadlib.hitad.chromLev.Chrom.calDI: calculate DI values according to input window sizes
tadlib.hitad.chromLev.Chrom.iterCore: estimate the degree of divergence between two domain lists

Attributes

minDomainsdict: The keys are tuples representing gap-free regions of the chromosome, and the values are corresponding identified bottom domain lists. Start and end of the domain are in base-pair unit.

oriWindow(self, P)[source]¶

Estimate the most appropriate window size for current bin to best capture the local interaction bias direction.

See also

tadlib.hitad.chromLev.Chrom.detectPeaks: detect peaks given a 1-D array
tadlib.hitad.chromLev.Chrom.randomCheck: randomness test for a two-valued (0-1) sequence

pipe(self, seq, start)[source]¶

Transform an observed sequence into a list of domains.

Parameters

seq1-D numbpy ndarray, float: Adaptive DI array for any region.
startint: Chromosome bin index of the seq start.

Returns

domainslist: List of domains in the format [start bin, end bin, noise level, hierarchical level].

See also

tadlib.hitad.chromLev.Chrom.refNoise: Calculate the noise level of a given domain
tadlib.hitad.aligner.BoundSet: where the meanings of the hierarchical level labels are explained in detail.

plot(self, start, end, Domains, figname, arrowhead=False, vmin=None, vmax=None)[source]¶

Given a genomic region and a domain list, plot corresponding contact heatmap and all domains (represented as diagonal squares) within the region.

Parameters

start, endint: The region interval.
Domainsdict: The keys are tuples representing gap-free regions, and values are corresponding identified domains.
fignamestr or None: If not None, the figure will be saved, otherwise it will only be shown in an interactive window. (Default: None)
arrowheadbool: If True, the Arrowhead transformed matrix will be plotted instead of the raw contact matrix. (Default: False)

randomCheck(self, seq, pthre=0.05)[source]¶

We use chi square test to test the randomness of a sequence by looking at the conversion frequency between neighbors in the sequence.

Parameters

seqstr: A string containing only ‘1’ or ‘0’. (e.g. ‘101000101’)
pthrefloat, 0-1: Significance level of the hypothesis tests.

Returns

rejectbool: True if we should reject the null hypothesis (the sequence is generated randomly) under the selected significance level.

refNoise(self, domain)[source]¶: Return noise level of a domain, which is simply defined as the zero entry ratio of the contact matrix.

scoreCache(self, domainlist, cache={})[source]¶

Calculate and cache the TAD scores for any combinations of consecutive bottom domains.

Parameters

domainlistlist of [start,end]: List of bottom domain intervals in base-pair unit.
cachedict: Cache of the TAD scores. The keys are in (start,end) format, intervals of continuous regions, and the values are corresponding calculated scores.

See also

tadlib.hitad.chromLev.Chrom.stablescore: Calculate TAD score of any genomic interval

splitChrom(self, DIs)[source]¶

Split a chromosome into gap-free regions. HMM learning and domain identification procedures will be performed on these regions separately.

Parameters

DIs1-D numpy ndarray, float: Adaptive DI array of the whole chromosome. Generally, we detect runs of zeros in the array as gaps, which will be cut off the chromosome, making entire chromosome pieces of gap-free regions.

Attributes

chromRegionsdict, {(start,end):DIs[start:end]}: The keys are gap-free regions, and the values are corresponding adaptive DI pieces.
gapbinsset: Set of bins (in base-pair unit) located in gap regions.

stablescore(self, domain, bases=[])[source]¶

Calculate TAD score for the given domain.

Parameters

domain[start, end]: Domain interval in base-pair unit.
baseslist: List of bottom domains within the given domain region.

Returns

inoutfloat: TAD score defined as the enrichment between intra-domain interaction frequencies and inter-domain interaction frequencies controlling for the impact of genomic distance.

See also

tadlib.hitad.chromLev.Chrom.idxmatch: Pair intra-domain and inter-domain interactions with the same genomic distance
tadlib.calfea.analyze.Core.longrange: Assign the weight value for each intra-domain interaction according to the genomic distance and the local interaction background.

subDomains(self, domain, reflist, clv=0, aM=None, W=None, subdomains={})[source]¶

A recusive method (function) to identify inner domain hierarchy of a TAD (or a domain). Sub-TADs of each level are defined as the best separation of the outer domain.

Parameters

domain[start, end]: Outer layer domain interval in base-pair unit.
reflistlist: List of bottom domains within the region of the outer domain.
clvint: Global domain level of the outer domain. The TADs have the level 0, the sub-TADs within a TAD have the level 1, and sub-sub-TADs within a sub-TAD have the level 2, and so forth. (Default: 0)
aM2-D numpy ndarray or None: Arrowhead matrix of the outer domain. (Default: None)
W2-D numpy ndarray or None: Weight matrix corresponding to the contact matrix of the outer domain. See tadlib.hitad.chromLev.Chrom.getWeightMatrix() for detailed calculation. (Default: None)
subdomainsdict: A container for domains of all hierarchy. The keys are domain intervals in base-pair unit, and values are corresponding global levels.

See also

tadlib.hitad.chromLev.Chrom.maxscorepath: find the best domain list optimally separating domain-level interactions.

toArrowhead(self, start, end)[source]¶

Perform Arrowhead transformation on contact matrix of a given genomic region.

Parameters

start, endint: The region interval.

Returns

A2-D numpy ndarray, float: Transformed matrix. (A symmetric matrix)

viterbi(self, seq)[source]¶

Find the most likely hidden state series using the viterbi algorithm.

Parameters

seq1-D numbpy ndarray, float: Adaptive DI array for any region.

Returns

pathlist: List of hidden state labels. Has the same length as the input seq.

class tadlib.hitad.chromLev.MultiReps(chrom, res, datasets)[source]¶

We define MultiReps to:

Hold Hi-C data of the same chromosome from different biological replicates at the same time
Provide an interface to identify hierarchical domains by using different replicate data independently and maintain the reproducible domains to form the final domain list.

Parameters

chromstr: Chromosome label.
resint: Hi-C data resolution in base-pair unit.
datasetsdict: The keys are unique replicate labels, and the values are constructed Chrom objects by using the Hi-C data of corresponding biological replicates.

Methods

`align`(self, tn, qn)	Construct hierarchical alignment between tn and qn.
`callDomain`(self)	Find reproducible domains between replicates by using our domain-level alignment strategy.
`conserved`(self, tn, qn)	Return conserved TAD pairs.
`getDomainList`(self, byregion)	Combine by-region domains into a single list.
`inner_changed`(self, tn, qn)	Return semi-conserved TAD pairs.
`merged`(self, tn, qn)	Return merged region pairs and merged TAD details.
`overlap`(self, ti, qi)	Calculate overlap ratio of any two regions.
`reproducible`(self, tg, qy)	Subprocess of callDomain for replicate alignment and reproducibility determination.
`split`(self, tn, qn)	Return split region pairs and split TAD details.

intersect

callDomain(self)[source]¶: Find reproducible domains between replicates by using our domain-level alignment strategy.

getDomainList(self, byregion)[source]¶

Combine by-region domains into a single list.

Each domain in the returned list is represented in the format [chromosome label, start bp, end bp, hierarchical level].

reproducible(self, tg, qy)[source]¶: Subprocess of callDomain for replicate alignment and reproducibility determination.

class tadlib.hitad.genomeLev.Genome(datasets, balance_type='weight', maxsize=4000000, cache=None, exclude=['chrM', 'chrY'], DIcol='DIs')[source]¶

Genome is built on top of tadlib.hitad.chromLev.Chrom. We use it to:

Load bin-level Hi-C data
Initialize, pickle and organize Chrom objects
Call hierarchical domains of each chromosome in parallel

Parameters

datasets2-level dict, {resolution(int):{biological_replicate_label(str):data_path,…}}: data_path indicates the cool URI under corresponding resolution and biological replicate label.
maxsizeint: Maximum allowable domain size in base-pair unit. (Default: 4000000)
cachestr or None: Cache folder path. If None, the folder returned by tempfile.gettempdir() will be used. (Default: None)

Attributes

data3-level dict. {chrom(str):{resolution(int):{biological_replicate_label(str):cachedfile,…}},…}: Different from the input datasets, it organizes data by chromosome label, and each bottom-level value indicates one pickled tadlib.hitad.chromLev.Chrom file under the cache folder.
Resultslist: Final consistent domain list merged from all chromosome results.

Methods

`callHierDomain`(self[, cpu_core])	Identify hierarchical domains of each chromosome independently and concurrently, find consistent domains between biological replicates, and finally combine results of all chromosomes.
`learning`(self[, cpu_core])	Prepare training data and learn HMM model parameters for each dataset.
`oriHMMParams`(self)	Set initial parameters for the Hidden Markov Model (HMM).
`wipeDisk`(self)	Remove catched (pickled) `tadlib.hitad.chromLev.Chrom` objects before exiting.

outputDomain
train_data

callHierDomain(self, cpu_core=1)[source]¶

Identify hierarchical domains of each chromosome independently and concurrently, find consistent domains between biological replicates, and finally combine results of all chromosomes.

Parameters

cpu_coreint: Number of processes to launch. For now, hitad only supports parallel computing on a quite high layer, that is, it simply allocates an uncompleted tadlib.hitad.chromLev.Chrom object to an idle processor and invokes its callDomain method.

learning(self, cpu_core=1)[source]¶

Prepare training data and learn HMM model parameters for each dataset.

Parameters

cpu_coreint: Number of processes to launch.

oriHMMParams(self)[source]¶

Set initial parameters for the Hidden Markov Model (HMM).

Attributes

HMMParamsdict: Has 3 keys: “A”, state transition matrix, “B” (emission probabilities), specifying parameters (Means, Variances, Weights) of the mixture Gaussian distributions for each hidden state, and “pi”, indicating the hidden state weights. This dict will be updated after learning procedure.

wipeDisk(self)[source]¶: Remove catched (pickled) tadlib.hitad.chromLev.Chrom objects before exiting.

Domain Loading and Aligning¶

class tadlib.hitad.aligner.DomainAligner(*args)[source]¶

This class is the work horse we define to:

Hold multiple tadlib.hitad.aligner.DomainSet instances at the same time.
Perform domain-based hierarchical alignment between any two tadlib.hitad.aligner.DomainSet.
Define and extract domain-level change types from alignment results between two tadlib.hitad.aligner.DomainSet.

Parameters

argstwo or more tadlib.hitad.aligner.DomainSet instances

Attributes

DomainSetsdict: Pool of tadlib.hitad.aligner.DomainSet. The keys are unique identifiers extracted from tadlib.hitad.aligner.DomainSet, and the values are corresponding tadlib.hitad.aligner.DomainSet instances.
Resultsdict: Container for alignment results between any pair of tadlib.hitad.aligner.DomainSet instances.

Methods

`align`(self, tn, qn)	Construct hierarchical alignment between tn and qn.
`conserved`(self, tn, qn)	Return conserved TAD pairs.
`inner_changed`(self, tn, qn)	Return semi-conserved TAD pairs.
`merged`(self, tn, qn)	Return merged region pairs and merged TAD details.
`overlap`(self, ti, qi)	Calculate overlap ratio of any two regions.
`split`(self, tn, qn)	Return split region pairs and split TAD details.

align(self, tn, qn)[source]¶

Construct hierarchical alignment between tn and qn.

Parameters

tn, qnstr: Unique identifiers of tadlib.hitad.aligner.DomainSet instances which are collected by arg during initialization.

Notes

The alignment results are organized in a hierarchical way in a dictionary. The keys are matched chromosome region (corresponds to either one TAD or several continuous TADs) pairs at the TAD level, and the values are tadlib.hitad.aligner.Container instances with info attribute set to be pair of detailed TAD lists; the keys of these tadlib.hitad.aligner.Container indicate domain levels, the values again are dictionaries containing sub-alignment results within the upper-layer TAD region.

You can access the results from Results attribute: self.Results[tn][qn] or self.Results[qn][tn].

conserved(self, tn, qn)[source]¶

Return conserved TAD pairs.

Parameters

tn, qnstr: Unique identifiers of tadlib.hitad.aligner.DomainSet.

Returns

pairsset of tuples: Each tuple has two elements (domain intervals), corresponding to tn and qn respectively.

inner_changed(self, tn, qn)[source]¶

Return semi-conserved TAD pairs.

Parameters

tn, qnstr: Unique identifiers of tadlib.hitad.aligner.DomainSet.

Returns

pairsset of tuples: Each tuple has two elements (domain intervals), corresponding to tn and qn respectively.

merged(self, tn, qn)[source]¶

Return merged region pairs and merged TAD details.

Parameters

tn, qnstr: Unique identifiers of tadlib.hitad.aligner.DomainSet.

Returns

pairsdict: The keys are merged region pairs in tuple, and the values are corresponding TAD list pairs within the region.

split(self, tn, qn)[source]¶

Return split region pairs and split TAD details.

Parameters

tn, qnstr: Unique identifiers of tadlib.hitad.aligner.DomainSet.

Returns

pairsdict: The keys are split region pairs in tuple, and the values are corresponding TAD list pairs within the region.

class tadlib.hitad.aligner.BoundAligner(*args)[source]¶

Based on our hierarchical domain alignment scheme, we also define several change types on boundary level between two datasets, including conserved TAD boundary, conserved sub-TAD boundary, disappeared TAD boundary, disappeared sub-TAD boundary, and TAD-to-sub-TAD boundary switch.

Boundaries are expressed in (chrom,pos) format in this class.

Parameters

argstwo or more tadlib.hitad.aligner.DomainSet instances

Attributes

byclassdict: Cache boundary pairs of each change type between datasets.

Methods

`align`(self, tn, qn)	Construct hierarchical alignment between tn and qn.
`all_in_one`(self, tn, qn, cache)	Parse domain alignment results between tn and qn and cache all detected cases of 6 defined change types.
`conserved`(self, tn, qn)	Return conserved TAD pairs.
`conserved_sub_bounds`(self, tn, qn)	Return pairs of conserved sub-TAD boundaries.
`conserved_tad_bounds`(self, tn, qn)	Return pairs of conserved TAD boundaries.
`disappeared_sub`(self, tn, qn)	Sub-TAD boundaries that exist in tn, but disappear in qn.
`disappeared_tad`(self, tn, qn)	TAD boundaries that exist in tn, but disappear in qn.
`inner_changed`(self, tn, qn)	Return semi-conserved TAD pairs.
`merged`(self, tn, qn)	Return merged region pairs and merged TAD details.
`overlap`(self, ti, qi)	Calculate overlap ratio of any two regions.
`split`(self, tn, qn)	Return split region pairs and split TAD details.
`sub2tad`(self, tn, qn)	Return sub-TAD to TAD switch cases.
`tad2sub`(self, tn, qn)	Return TAD to sub-TAD switch cases.

pairwise_alignment

all_in_one(self, tn, qn, cache)[source]¶

Parse domain alignment results between tn and qn and cache all detected cases of 6 defined change types.

Parameters

tn, qnstr: Unique identifiers of tadlib.hitad.aligner.DomainSet instances.
cachedict: An empty dictionary.

conserved_sub_bounds(self, tn, qn)[source]¶

Return pairs of conserved sub-TAD boundaries.

Parameters

tn, qn: str: Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

Returns

pairsdict: Keys and values indicate sub-TAD boundaries in tn and qn, respectively.

conserved_tad_bounds(self, tn, qn)[source]¶

Return pairs of conserved TAD boundaries.

Parameters

tn, qn: str: Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

Returns

pairsdict: Keys and values indicate TAD boundaries in tn and qn, respectively.

disappeared_sub(self, tn, qn)[source]¶

Sub-TAD boundaries that exist in tn, but disappear in qn.

Parameters

tn, qn: str: Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

Returns

pairsset of tuples: Sub-TAD boundary positions in tn.

disappeared_tad(self, tn, qn)[source]¶

TAD boundaries that exist in tn, but disappear in qn.

Parameters

tn, qn: str: Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

Returns

pairsset of tuples: TAD boundary positions in tn.

sub2tad(self, tn, qn)[source]¶

Return sub-TAD to TAD switch cases.

Parameters

tn, qn: str: Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

Returns

pairsdict: Keys are sub-TAD boundaries in tn, and values indicate corresponding TAD boundaries in qn.

tad2sub(self, tn, qn)[source]¶

Return TAD to sub-TAD switch cases.

Parameters

tn, qn: str: Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

Returns

pairsdict: Keys are TAD boundaries in tn, and values indicate corresponding sub-TAD boundaries in qn.

class tadlib.hitad.aligner.DomainSet(en, domainlist, res, hier=True)[source]¶

Parse and hold a hierarchical domain set.

Parameters

enstr: Unique identifier for input domain set.
domainlistlist: List of domains. See tadlib.hitad.aligner.BoundSet for details.
resint: Resolution of the Hi-C data in base-pair unit.
hierbool: Whether domainlist contains multiple-level domains or not. (Default: True)

Attributes

resint: Resolution of the Hi-C data in base-pair unit.
levsset of int: All possible domain levels contained in domainlist.
bychromsdict: Bychromosomal rearrangement of domainlist. The keys are chromosome labels(1,2,…,22,X,Y), and the values are list of [start,end,level].
pretreedict: Nested domain list within any domain interval. Returned by tadlib.hitad.aligner.DomainSet.NestedDomains().
subpooldict: Domain list within any domain interval. (tadlib.hitad.aligner.DomainSet.NestedDomains())
lidxdict: The smallest indices of left domain boundaries in a by-chromosomal domain list. (tadlib.hitad.aligner.DomainSet.NestedDomains())
ridxdict: The largest indices of right domain boundaries in a by-chromosomal domain list. (tadlib.hitad.aligner.DomainSet.NestedDomains())
Domainsdict: Store each TAD and its nested domains as a tree. Each node in the tree indicates one domain, in particular, the root corresponds to the TAD, and the leaves correspond to bottom domains.

Methods

`NestedDomains`(self, bychroms)	Pre-parse domain lists for accelerating subsequent calculations.
`genDomainTree`(self, node, pretree, cur)	Recursively generate a tree/sub-tree taking node as starting point.
`getBottoms`(self)	Link bottom domains to corresponding outer TADs.
`getregion`(self, chrom, start, end[, lev])	Extract all domains (or domains at specific level) within a given region.

NestedDomains(self, bychroms)[source]¶

Pre-parse domain lists for accelerating subsequent calculations.

Parameters

bychromsdict: By-chromosomal domain lists.

Returns

tmpdictdict: Nested domain list within any domain interval. If a domain have no nested domains, then its value is an empty list.
subpooldict: Domain list within any domain interval. Different from tmpdict, if a domain have no nested domains, the value is a list only containing itself.
lidxdict: The smallest indices of left domain boundaries in a by-chromosomal domain list.
ridxdict: The largest indices of right domain boundaries in a by-chromosomal domain list.

genDomainTree(self, node, pretree, cur)[source]¶

Recursively generate a tree/sub-tree taking node as starting point.

Parameters

nodeNode: A dict-like container for current domain.
pretreedict: Nested domain list within any domain interval. (tadlib.hitad.aligner.DomainSet.NestedDomains())
curlist: Nested domain list within current domain interval.

getBottoms(self)[source]¶

Link bottom domains to corresponding outer TADs.

Attributes

bottomsdict: Used to quickly retrieve (bottom domain, TAD) pairs.

getregion(self, chrom, start, end, lev=None)[source]¶

Extract all domains (or domains at specific level) within a given region.

Parameters

chromstr: Chromosome label.
start, endint: Domain interval in base-pair unit.
levint or None: Specify the desired domain level. (Default: None, domains of all levels will be returned)

Returns

rdomainslist: Sorted domain list. Each element corresponds to one domain in the format [chrom,start,end,level].

class tadlib.hitad.aligner.BoundSet(en, domainlist, res)[source]¶

As the name suggests, we use BoundSet to hold all bounds of a domain list.

Parameters

enstr: Unique identifier for current set of bounds.
domainlistlist: List of the domains. Each domain is represented by [chrom,start,end,level]. I think chrom, start and end are self-explanatory, all you need to keep in mind is that start and *end should be in base-pair unit. level indicates the hierarchical level of the domain. In our work, TAD is denoted as 0, sub-TAD is denoted as 1, and subsequent domain level is denoted as 2, etc.
resint: Resolution of the Hi-C data in base-pair unit.

Attributes

Labelstr: Unique identifier.
boundclassdict: The keys are bound representations (chrom,pos), and the values indicate corresponding hierarchical level notations. The Level of a bound is determined by the domain with the lowest level notation. For example, if we have two domains, [‘1’,100000,500000,0] and [‘1’,100000,200000,1], according to our definition, the level of (‘1’,200000) is 1, but the level of (‘1’,100000) is 0.
Boundslist: Sorted bound list. This attribute can be used as the reference bound list in tadlib.hitad.aligner.SingleBound.align() directly.

class tadlib.hitad.aligner.SingleDomain(chrom, start, end)[source]¶

We use SingleDomain to:

Represent a single domain (chrom, start, end).
Map the domain to another domain set

Parameters

chromstr: Chromosome label.
start, endint: Interval of the domain in base-pair unit.

Attributes

chromstr: Chromosome label.
intervallist: [start, end]
cachedict: Container for matched details.

Methods

`align`(self, qy)	Find the domain D in qy maximizing the overlap ratio.
`overlap`(self, ti, qi)	Calculate overlap ratio of any two regions.

align(self, qy)[source]¶

Find the domain D in qy maximizing the overlap ratio. Binary search method is used internally for accelerating the search process.

Parameters

qya tadlib.hitad.aligner.DomainSet instance: Reference domain set. (Recall sequence mapping and reference genome)

Notes

The matched details are stored in cache using the unique identifier of qy (qy.Label) as the key, the value is also a dict with 2 keys: hitdomain records the matched domain interval in (chrom,start,end) format, and hitoverlap records the overlap ratio between the hitdomain and current query domain.

overlap(self, ti, qi)[source]¶

Calculate overlap ratio of any two regions.

Parameters

ti, qilist: Interval ([start,end]) of the region.

Returns

ORfloat, 0-1: Overlap ratio.

class tadlib.hitad.aligner.SingleBound(chrom, pos)[source]¶

SingleBound is defined to:

Represent a single bound (chrom, pos)
Map the bound to a pool of bounds

Parameters

chromstr: Chromosome label.
posint: Bound position on the chromosome.

Attributes

chromstr: Chromosome label.
posint: Position on the chromosome.
cachedict: Container for matched details.

Methods

align(self, qn, qb, tol)

Map the bound to qb using the binary search method.

align(self, qn, qb, tol)[source]¶

Map the bound to qb using the binary search method.

Parameters

qnstr: Unique identifier for qb.
qblist of tuples: Reference bound (remember reference genome?) list. Each element is a tuple (chrom, pos) representing a single bound. And the list must be sorted in advance for binary search.
tolint: Mismatch tolerance. If the genomic distance between the bound and the best hit is less than this value, we say we have found a match, otherwise the bound is missed in qb.

Notes

Internally, the matched details will be stored in cache under the key qn, the value is also a dict recording the matched bound index (midx) and the indices of the matched neighbors (nindices).

class tadlib.hitad.aligner.Container(info)[source]¶

Dict-like. Used in the organizing of domain alignment results.

Parameters

infolist: Pair of domain lists from two domain sets.

Methods

`clear`()
`copy`()
`fromkeys`(iterable[, value])	Create a new dictionary with keys from iterable and values set to value.
`get`(self, key[, default])	Return the value for key if key is in the dictionary, else default.
`items`()
`keys`()
`pop`()	If key is not found, d is returned if given, otherwise KeyError is raised
`popitem`()	2-tuple; but raise KeyError if D is empty.
`setdefault`(self, key[, default])	Insert key with a value of default if key is not in the dictionary.
`update`()	If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
`values`()

class tadlib.hitad.aligner.Node(bounds=None)[source]¶

Dick-like. We use it to represent nodes of a hierarchical domain tree in DomainSet.

Parameters

boundslist or None: Domain interval represented by [chrom,start,end].

Methods

`clear`()
`copy`()
`fromkeys`(iterable[, value])	Create a new dictionary with keys from iterable and values set to value.
`get`(self, key[, default])	Return the value for key if key is in the dictionary, else default.
`items`()
`keys`()
`pop`()	If key is not found, d is returned if given, otherwise KeyError is raised
`popitem`()	2-tuple; but raise KeyError if D is empty.
`setdefault`(self, key[, default])	Insert key with a value of default if key is not in the dictionary.
`update`()	If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
`values`()

tadlib.hitad.aligner.readHierDomain(domainfile, pre='')[source]¶

Load hierarchical domain list from a text file.

The source file should contain 4 columns indicating chromosome label (1,2,…,X,Y), domain start (bp), domain end (bp), and hierarchical level (0,1,2,…), respectively.

In our paper, TAD is denoted as level 0, sub-TAD is denoted as level 1, and subsequent domain level is denoted as level 2, etc.

Parameters

domainfilestr: Domain file path.

Returns

domainlistlist: Each element of the list indicates one domain represented by [chrom,start,end,level].

tadlib.hitad.aligner.readPlainDomain(domainfile, pre='chr')[source]¶

Load domain list from a text file.

The source file should contain 3 columns indicating chromosome name, domain start (bp) and domain end (bp), respectively.

Parameters

domainfilestr: Domain file path.
prestr: Leading string of the chromosome name. (Default: chr)

Returns

domainlistlist: Each element indicates one domain represented by [chrom(leading string removed),start,end].

See also

tadlib.hitad.aligner.hierFormat: parse hierarchical relationships between domains

tadlib.hitad.aligner.hierFormat(domainlist)[source]¶

Resolve the nested/hierarchical relationships between domains, and transform the input [chrom,start,end] format domains into a format including hierarchical level information.

Parameters

domainlistlist: Domains with the format [chrom,start,end].

Returns

domainlistlist: Domains with the format [chrom,start,end,level].

TADLib

Navigation

Hierarchical TAD Identification¶

Domain Loading and Aligning¶