Hierarchical TAD Identification¶
-
class
tadlib.hitad.chromLev.
Chrom
(chrom, res, hicdata, replabel, maxapart=4000000)[source]¶ Chrom is defined to:
Hold Hi-C data within a certain chromosome
Identify hierarchical domains in 4 steps: 1.Calculate adaptive DIs. 2.Identify original candidate bounds by 5-state Gaussian mixture Hidden Markov Model using adaptive DIs as input. 3.Select TAD bounds from candidate bounds. 4.Recursively identify inner domain structures of each TAD.
Visualize any region of the chromosome. Hierarchical domains will be plotted as boxes along with the diagonal of the heatmap, and adaptive DI track will be placed on top of the heatmap.
- Parameters
- chromstr
Chromosome label.
- resint
Resolution of the Hi-C data in base-pair unit.
- hicdataCSR sparse matrix
Bin-level Hi-C matrix of the specified chromosome.
- replabelstr
Biological replicate label.
- maxapartint
Maximum allowable TAD size in base-pair unit. (Default: 4000000)
- Attributes
- chromstr
Chromosome label.
- resint
Resolution in base-pair unit.
- maxapartint
Maximum allowable TAD size.
- replabelstr
Biological replicate label.
- chromLenint
Total bin number of the chromosome.
- rawMatrixsparse matrix in Compressed Sparse Row format
CSR sparse matrix is used to extract Hi-C data by slicing conveniently while guarantee low memory overhead.
Methods
calDI
(self, windows, start)Calculate Directionality Index (DI) for each bin with adaptive window size.
callDomain
(self)Direct API for our hierarchical domain identification pipeline:
detectPeaks
(self, trends[, mph, mpd])Detect peaks (local maxima) in a 1-D array intuitively (a peak must be greater than its immediate neighbors).
fineDomain
(self)Identify hierarchical domains within each TAD.
getDomainList
(self, byregion)Combine by-region domains into a single list.
getSelfMatrix
(self, start, end)Return the contact matrix of any given region.
getWeightMatrix
(self, start, end[, bases])Calculate weights for each intra-domain interaction by considering the genomic distance and the local interaction background.
idxmatch
(self, domain)Pair interactions of the given domain with the upstream and downstream interactions under the same genomic distance.
iterCore
(self, minDomains, tmpDomains)Calculate the mismatch ratio for the input two domain lists.
maxCore
(self[, cache])Perform TAD identification.
maxscorepath
(self, domainlist, cache)An implementation for our proposed algorithm to find the best separation solution at chromosomal/domain level, given bottom domain list and pre-computed scores.
minCore
(self, regionDIs)Output domain list for each gap-free region.
minWindows
(self, start, end, maxw)Estimate best window size for each bin of a given range.
oriIter
(self, minDomains)Iteratvely approximate adaptive window sizes and return the final bottom domain list which will be used in subsequent procedures.
oriWindow
(self, P)Estimate the most appropriate window size for current bin to best capture the local interaction bias direction.
pipe
(self, seq, start)Transform an observed sequence into a list of domains.
plot
(self, start, end, Domains, figname[, …])Given a genomic region and a domain list, plot corresponding contact heatmap and all domains (represented as diagonal squares) within the region.
randomCheck
(self, seq[, pthre])We use chi square test to test the randomness of a sequence by looking at the conversion frequency between neighbors in the sequence.
refNoise
(self, domain)Return noise level of a domain, which is simply defined as the zero entry ratio of the contact matrix.
scoreCache
(self, domainlist[, cache])Calculate and cache the TAD scores for any combinations of consecutive bottom domains.
splitChrom
(self, DIs)Split a chromosome into gap-free regions.
stablescore
(self, domain[, bases])Calculate TAD score for the given domain.
subDomains
(self, domain, reflist[, clv, aM, …])A recusive method (function) to identify inner domain hierarchy of a TAD (or a domain).
toArrowhead
(self, start, end)Perform Arrowhead transformation on contact matrix of a given genomic region.
viterbi
(self, seq)Find the most likely hidden state series using the viterbi algorithm.
calIS
preciseBound
-
calDI
(self, windows, start)[source]¶ Calculate Directionality Index (DI) for each bin with adaptive window size.
- Parameters
- windows1-D numpy.ndarray, int32
Returned by
tadlib.hitad.chromLev.Chrom.minWindows()
.- startint
Starting bin index, the window size of which is taken from the 1st place of windows.
- Attributes
- DIs1-D numpy ndarray, float
Calculated adaptive DI array, which has the same size as the input windows.
-
callDomain
(self)[source]¶ Direct API for our hierarchical domain identification pipeline:
Adaptively estimate window size for each bin. (
tadlib.hitad.chromLev.Chrom.minWindows()
)Calculate adaptive DIs. (
tadlib.hitad.chromLev.Chrom.calDI()
)Iteratively correct adaptive window size and bottom boundary positions. (
tadlib.hitad.chromLev.Chrom.oriIter()
)Identify TADs based on bottom domains. (
tadlib.hitad.chromLev.Chrom.maxCore()
)Resolve domain hierarchy within each TAD. (
tadlib.hitad.chromLev.Chrom.fineDomain()
)
-
detectPeaks
(self, trends, mph=0, mpd=5)[source]¶ Detect peaks (local maxima) in a 1-D array intuitively (a peak must be greater than its immediate neighbors).
- Parameters
- trends1-D numpy ndarray
Data.
- mphfloat
Only peaks that are greater than this value will be detected. (Default: 0)
- mpdpositive integer
Only peaks whose indices are at least separated by this value will be reported. (Default: 5)
- Returns
- ind1-D numpy ndarray
Indices of peaks detected in trends.
-
fineDomain
(self)[source]¶ Identify hierarchical domains within each TAD.
See also
tadlib.hitad.chromLev.Chrom.maxCore
identify TADs
tadlib.hitad.chromLev.Chrom.subDomains
resolve domain hierarchy within a given TAD
- Attributes
- hierDomainsdict
The keys are tuples representing gap-free regions of the chromosome, and the values are corresponding identified hierarchical domain lists. Start and end of the domain are in base-pair unit.
-
getDomainList
(self, byregion)[source]¶ Combine by-region domains into a single list.
- Parameters
- byregiondict
The keys are tuples representing gap-free regions of the chromosome, and the values are corresponding identified domain lists.
- Returns
- DomainListlist
A merged domain list of all regions
-
getSelfMatrix
(self, start, end)[source]¶ Return the contact matrix of any given region.
- Parameters
- start, endint
The region interval in base-pair unit.
- Returns
- Matrix2-D numpy ndarray, float
Sub contact matrix.
-
getWeightMatrix
(self, start, end, bases=[])[source]¶ Calculate weights for each intra-domain interaction by considering the genomic distance and the local interaction background.
- Parameters
- start, endint
The domain interval in base-pair unit.
- baseslist
List of the bottom domains within the given interval.
- Returns
- W2-D numpy.ndarray
The weight matrix. (An upper triangular matrix)
-
idxmatch
(self, domain)[source]¶ Pair interactions of the given domain with the upstream and downstream interactions under the same genomic distance.
- Parameters
- domain[start, end]
Domain interval in base-pair unit.
- Returns
- cur_storetuple, (x-coordinates, y-coordinates, interaction frequencies)
Interactions within the given domain.
- up_storetuple, (x-coordinates, y-coordinates, interaction frequencies)
Corresponding upstream interctions.
- down_storetuple, (x-coordinates, y-coordinates, interaction frequencies)
Corresponding downstream interactions.
-
iterCore
(self, minDomains, tmpDomains)[source]¶ Calculate the mismatch ratio for the input two domain lists. Return 1 if minDomains is empty.
tadlib.hitad.chromLev.Chrom.oriIter()
uses this method to determine whether to break the iteration.- Parameters
- minDomainsdict
Target domains calculated by the last loop.
- tmpDomainsdict
Query domains returned by the current loop.
- Returns
- tolfloat
Mismatch ratio.
-
maxCore
(self, cache={})[source]¶ Perform TAD identification. We define TADs as domains to optimize chromosome separation(based on some objective function), which is solved by using an algorithm like dynamic programming implemented in
tadlib.hitad.chromLev.Chrom.maxscorepath()
.- Parameters
- cachedict
TAD scores for all combinations of consecutive bottom domains will be pre-computed and stored in this dict. The keys are tuples (start, end) representing merged domain intervals. (Default: {})
See also
tadlib.hitad.chromLev.Chrom.scoreCache
Pre-compute TAD scores for all combinations of consecutive bottom domains.
tadlib.hitad.chromLev.Chrom.maxscorepath
find the best TAD list
- Attributes
- maxDomainsdict
Optimized TADs.
- cachedict
Cached TAD scores.
-
maxscorepath
(self, domainlist, cache)[source]¶ An implementation for our proposed algorithm to find the best separation solution at chromosomal/domain level, given bottom domain list and pre-computed scores.
- Parameters
- domainlistlist of [start,end]
List of bottom domain intervals in base-pair unit.
- cachedict
Pre-computed scores for any combinations of continuous bottom domains. The keys are intervals of continuous regions in (start,end) format, and the values are corresponding scores.
- Returns
- bestsset of (sidx, eidx)
Each element indicates one merged domain of the solution, represented as (start index, end index) of the input domainlist.
-
minCore
(self, regionDIs)[source]¶ Output domain list for each gap-free region.
- Parameters
- regionDIsdict
Gap-free regions and corresponding adaptive DI arrays.
- Returns
- minDomainsdict
Gap-free regions and corresponding identified bottom domain list. Different from
tadlib.hitad.chromLev.Chrom.pipe()
, the start and the end of a domain are in base-pair unit.
-
minWindows
(self, start, end, maxw)[source]¶ Estimate best window size for each bin of a given range.
- Parameters
- start, endint
Specify range of the bin indices.
- maxwint
Maximum allowable window size.
See also
tadlib.hitad.chromLev.Chrom.oriWindow
Window size estimation for a single bin.
- Attributes
- windows1-D numpy.ndarray, int32
-
oriIter
(self, minDomains)[source]¶ Iteratvely approximate adaptive window sizes and return the final bottom domain list which will be used in subsequent procedures. For each loop, window sizes are updated according to the latest bottom domains and next loop will re-run the identification pipeline using new window sizes. The iteration terminates if domains between two consecutive loops are very similar (estimated by
tadlib.hitad.chromLev.Chrom.iterCore()
).- Parameters
- minDomainsdict
Initial domains served as the target domain list for
tadlib.hitad.chromLev.Chrom.iterCore()
at the first iteration. We set it empty in our pipeline.
See also
tadlib.hitad.chromLev.Chrom.calDI
calculate DI values according to input window sizes
tadlib.hitad.chromLev.Chrom.iterCore
estimate the degree of divergence between two domain lists
- Attributes
- minDomainsdict
The keys are tuples representing gap-free regions of the chromosome, and the values are corresponding identified bottom domain lists. Start and end of the domain are in base-pair unit.
-
oriWindow
(self, P)[source]¶ Estimate the most appropriate window size for current bin to best capture the local interaction bias direction.
See also
tadlib.hitad.chromLev.Chrom.detectPeaks
detect peaks given a 1-D array
tadlib.hitad.chromLev.Chrom.randomCheck
randomness test for a two-valued (0-1) sequence
-
pipe
(self, seq, start)[source]¶ Transform an observed sequence into a list of domains.
- Parameters
- seq1-D numbpy ndarray, float
Adaptive DI array for any region.
- startint
Chromosome bin index of the seq start.
- Returns
- domainslist
List of domains in the format
[start bin, end bin, noise level, hierarchical level]
.
See also
tadlib.hitad.chromLev.Chrom.refNoise
Calculate the noise level of a given domain
tadlib.hitad.aligner.BoundSet
where the meanings of the hierarchical level labels are explained in detail.
-
plot
(self, start, end, Domains, figname, arrowhead=False, vmin=None, vmax=None)[source]¶ Given a genomic region and a domain list, plot corresponding contact heatmap and all domains (represented as diagonal squares) within the region.
- Parameters
- start, endint
The region interval.
- Domainsdict
The keys are tuples representing gap-free regions, and values are corresponding identified domains.
- fignamestr or None
If not None, the figure will be saved, otherwise it will only be shown in an interactive window. (Default: None)
- arrowheadbool
If True, the Arrowhead transformed matrix will be plotted instead of the raw contact matrix. (Default: False)
-
randomCheck
(self, seq, pthre=0.05)[source]¶ We use chi square test to test the randomness of a sequence by looking at the conversion frequency between neighbors in the sequence.
- Parameters
- seqstr
A string containing only ‘1’ or ‘0’. (e.g. ‘101000101’)
- pthrefloat, 0-1
Significance level of the hypothesis tests.
- Returns
- rejectbool
True if we should reject the null hypothesis (the sequence is generated randomly) under the selected significance level.
-
refNoise
(self, domain)[source]¶ Return noise level of a domain, which is simply defined as the zero entry ratio of the contact matrix.
-
scoreCache
(self, domainlist, cache={})[source]¶ Calculate and cache the TAD scores for any combinations of consecutive bottom domains.
- Parameters
- domainlistlist of [start,end]
List of bottom domain intervals in base-pair unit.
- cachedict
Cache of the TAD scores. The keys are in (start,end) format, intervals of continuous regions, and the values are corresponding calculated scores.
See also
tadlib.hitad.chromLev.Chrom.stablescore
Calculate TAD score of any genomic interval
-
splitChrom
(self, DIs)[source]¶ Split a chromosome into gap-free regions. HMM learning and domain identification procedures will be performed on these regions separately.
- Parameters
- DIs1-D numpy ndarray, float
Adaptive DI array of the whole chromosome. Generally, we detect runs of zeros in the array as gaps, which will be cut off the chromosome, making entire chromosome pieces of gap-free regions.
- Attributes
- chromRegionsdict, {(start,end):DIs[start:end]}
The keys are gap-free regions, and the values are corresponding adaptive DI pieces.
- gapbinsset
Set of bins (in base-pair unit) located in gap regions.
-
stablescore
(self, domain, bases=[])[source]¶ Calculate TAD score for the given domain.
- Parameters
- domain[start, end]
Domain interval in base-pair unit.
- baseslist
List of bottom domains within the given domain region.
- Returns
- inoutfloat
TAD score defined as the enrichment between intra-domain interaction frequencies and inter-domain interaction frequencies controlling for the impact of genomic distance.
See also
tadlib.hitad.chromLev.Chrom.idxmatch
Pair intra-domain and inter-domain interactions with the same genomic distance
tadlib.calfea.analyze.Core.longrange
Assign the weight value for each intra-domain interaction according to the genomic distance and the local interaction background.
-
subDomains
(self, domain, reflist, clv=0, aM=None, W=None, subdomains={})[source]¶ A recusive method (function) to identify inner domain hierarchy of a TAD (or a domain). Sub-TADs of each level are defined as the best separation of the outer domain.
- Parameters
- domain[start, end]
Outer layer domain interval in base-pair unit.
- reflistlist
List of bottom domains within the region of the outer domain.
- clvint
Global domain level of the outer domain. The TADs have the level 0, the sub-TADs within a TAD have the level 1, and sub-sub-TADs within a sub-TAD have the level 2, and so forth. (Default: 0)
- aM2-D numpy ndarray or None
Arrowhead matrix of the outer domain. (Default: None)
- W2-D numpy ndarray or None
Weight matrix corresponding to the contact matrix of the outer domain. See
tadlib.hitad.chromLev.Chrom.getWeightMatrix()
for detailed calculation. (Default: None)- subdomainsdict
A container for domains of all hierarchy. The keys are domain intervals in base-pair unit, and values are corresponding global levels.
See also
tadlib.hitad.chromLev.Chrom.maxscorepath
find the best domain list optimally separating domain-level interactions.
-
class
tadlib.hitad.chromLev.
MultiReps
(chrom, res, datasets)[source]¶ We define MultiReps to:
Hold Hi-C data of the same chromosome from different biological replicates at the same time
Provide an interface to identify hierarchical domains by using different replicate data independently and maintain the reproducible domains to form the final domain list.
- Parameters
- chromstr
Chromosome label.
- resint
Hi-C data resolution in base-pair unit.
- datasetsdict
The keys are unique replicate labels, and the values are constructed Chrom objects by using the Hi-C data of corresponding biological replicates.
Methods
align
(self, tn, qn)Construct hierarchical alignment between tn and qn.
callDomain
(self)Find reproducible domains between replicates by using our domain-level alignment strategy.
conserved
(self, tn, qn)Return conserved TAD pairs.
getDomainList
(self, byregion)Combine by-region domains into a single list.
inner_changed
(self, tn, qn)Return semi-conserved TAD pairs.
merged
(self, tn, qn)Return merged region pairs and merged TAD details.
overlap
(self, ti, qi)Calculate overlap ratio of any two regions.
reproducible
(self, tg, qy)Subprocess of callDomain for replicate alignment and reproducibility determination.
split
(self, tn, qn)Return split region pairs and split TAD details.
intersect
-
callDomain
(self)[source]¶ Find reproducible domains between replicates by using our domain-level alignment strategy.
-
class
tadlib.hitad.genomeLev.
Genome
(datasets, balance_type='weight', maxsize=4000000, cache=None, exclude=['chrM', 'chrY'], DIcol='DIs')[source]¶ Genome is built on top of
tadlib.hitad.chromLev.Chrom
. We use it to:Load bin-level Hi-C data
Initialize, pickle and organize Chrom objects
Call hierarchical domains of each chromosome in parallel
- Parameters
- datasets2-level dict, {resolution(int):{biological_replicate_label(str):data_path,…}}
data_path indicates the cool URI under corresponding resolution and biological replicate label.
- maxsizeint
Maximum allowable domain size in base-pair unit. (Default: 4000000)
- cachestr or None
Cache folder path. If None, the folder returned by
tempfile.gettempdir()
will be used. (Default: None)
- Attributes
- data3-level dict. {chrom(str):{resolution(int):{biological_replicate_label(str):cachedfile,…}},…}
Different from the input datasets, it organizes data by chromosome label, and each bottom-level value indicates one pickled
tadlib.hitad.chromLev.Chrom
file under the cache folder.- Resultslist
Final consistent domain list merged from all chromosome results.
Methods
callHierDomain
(self[, cpu_core])Identify hierarchical domains of each chromosome independently and concurrently, find consistent domains between biological replicates, and finally combine results of all chromosomes.
learning
(self[, cpu_core])Prepare training data and learn HMM model parameters for each dataset.
oriHMMParams
(self)Set initial parameters for the Hidden Markov Model (HMM).
wipeDisk
(self)Remove catched (pickled)
tadlib.hitad.chromLev.Chrom
objects before exiting.outputDomain
train_data
-
callHierDomain
(self, cpu_core=1)[source]¶ Identify hierarchical domains of each chromosome independently and concurrently, find consistent domains between biological replicates, and finally combine results of all chromosomes.
- Parameters
- cpu_coreint
Number of processes to launch. For now, hitad only supports parallel computing on a quite high layer, that is, it simply allocates an uncompleted
tadlib.hitad.chromLev.Chrom
object to an idle processor and invokes its callDomain method.
-
learning
(self, cpu_core=1)[source]¶ Prepare training data and learn HMM model parameters for each dataset.
- Parameters
- cpu_coreint
Number of processes to launch.
-
oriHMMParams
(self)[source]¶ Set initial parameters for the Hidden Markov Model (HMM).
- Attributes
- HMMParamsdict
Has 3 keys: “A”, state transition matrix, “B” (emission probabilities), specifying parameters (Means, Variances, Weights) of the mixture Gaussian distributions for each hidden state, and “pi”, indicating the hidden state weights. This dict will be updated after learning procedure.
-
wipeDisk
(self)[source]¶ Remove catched (pickled)
tadlib.hitad.chromLev.Chrom
objects before exiting.
Domain Loading and Aligning¶
-
class
tadlib.hitad.aligner.
DomainAligner
(*args)[source]¶ This class is the work horse we define to:
Hold multiple
tadlib.hitad.aligner.DomainSet
instances at the same time.Perform domain-based hierarchical alignment between any two
tadlib.hitad.aligner.DomainSet
.Define and extract domain-level change types from alignment results between two
tadlib.hitad.aligner.DomainSet
.
- Parameters
- argstwo or more
tadlib.hitad.aligner.DomainSet
instances
- argstwo or more
- Attributes
- DomainSetsdict
Pool of
tadlib.hitad.aligner.DomainSet
. The keys are unique identifiers extracted fromtadlib.hitad.aligner.DomainSet
, and the values are correspondingtadlib.hitad.aligner.DomainSet
instances.- Resultsdict
Container for alignment results between any pair of
tadlib.hitad.aligner.DomainSet
instances.
Methods
align
(self, tn, qn)Construct hierarchical alignment between tn and qn.
conserved
(self, tn, qn)Return conserved TAD pairs.
inner_changed
(self, tn, qn)Return semi-conserved TAD pairs.
merged
(self, tn, qn)Return merged region pairs and merged TAD details.
overlap
(self, ti, qi)Calculate overlap ratio of any two regions.
split
(self, tn, qn)Return split region pairs and split TAD details.
-
align
(self, tn, qn)[source]¶ Construct hierarchical alignment between tn and qn.
- Parameters
- tn, qnstr
Unique identifiers of
tadlib.hitad.aligner.DomainSet
instances which are collected by arg during initialization.
Notes
The alignment results are organized in a hierarchical way in a dictionary. The keys are matched chromosome region (corresponds to either one TAD or several continuous TADs) pairs at the TAD level, and the values are
tadlib.hitad.aligner.Container
instances with info attribute set to be pair of detailed TAD lists; the keys of thesetadlib.hitad.aligner.Container
indicate domain levels, the values again are dictionaries containing sub-alignment results within the upper-layer TAD region.You can access the results from Results attribute:
self.Results[tn][qn]
orself.Results[qn][tn]
.
-
conserved
(self, tn, qn)[source]¶ Return conserved TAD pairs.
- Parameters
- tn, qnstr
Unique identifiers of
tadlib.hitad.aligner.DomainSet
.
- Returns
- pairsset of tuples
Each tuple has two elements (domain intervals), corresponding to tn and qn respectively.
-
inner_changed
(self, tn, qn)[source]¶ Return semi-conserved TAD pairs.
- Parameters
- tn, qnstr
Unique identifiers of
tadlib.hitad.aligner.DomainSet
.
- Returns
- pairsset of tuples
Each tuple has two elements (domain intervals), corresponding to tn and qn respectively.
-
merged
(self, tn, qn)[source]¶ Return merged region pairs and merged TAD details.
- Parameters
- tn, qnstr
Unique identifiers of
tadlib.hitad.aligner.DomainSet
.
- Returns
- pairsdict
The keys are merged region pairs in tuple, and the values are corresponding TAD list pairs within the region.
-
split
(self, tn, qn)[source]¶ Return split region pairs and split TAD details.
- Parameters
- tn, qnstr
Unique identifiers of
tadlib.hitad.aligner.DomainSet
.
- Returns
- pairsdict
The keys are split region pairs in tuple, and the values are corresponding TAD list pairs within the region.
-
class
tadlib.hitad.aligner.
BoundAligner
(*args)[source]¶ Based on our hierarchical domain alignment scheme, we also define several change types on boundary level between two datasets, including conserved TAD boundary, conserved sub-TAD boundary, disappeared TAD boundary, disappeared sub-TAD boundary, and TAD-to-sub-TAD boundary switch.
Boundaries are expressed in (chrom,pos) format in this class.
- Parameters
- argstwo or more
tadlib.hitad.aligner.DomainSet
instances
- argstwo or more
- Attributes
- byclassdict
Cache boundary pairs of each change type between datasets.
Methods
align
(self, tn, qn)Construct hierarchical alignment between tn and qn.
all_in_one
(self, tn, qn, cache)Parse domain alignment results between tn and qn and cache all detected cases of 6 defined change types.
conserved
(self, tn, qn)Return conserved TAD pairs.
conserved_sub_bounds
(self, tn, qn)Return pairs of conserved sub-TAD boundaries.
conserved_tad_bounds
(self, tn, qn)Return pairs of conserved TAD boundaries.
disappeared_sub
(self, tn, qn)Sub-TAD boundaries that exist in tn, but disappear in qn.
disappeared_tad
(self, tn, qn)TAD boundaries that exist in tn, but disappear in qn.
inner_changed
(self, tn, qn)Return semi-conserved TAD pairs.
merged
(self, tn, qn)Return merged region pairs and merged TAD details.
overlap
(self, ti, qi)Calculate overlap ratio of any two regions.
split
(self, tn, qn)Return split region pairs and split TAD details.
sub2tad
(self, tn, qn)Return sub-TAD to TAD switch cases.
tad2sub
(self, tn, qn)Return TAD to sub-TAD switch cases.
pairwise_alignment
-
all_in_one
(self, tn, qn, cache)[source]¶ Parse domain alignment results between tn and qn and cache all detected cases of 6 defined change types.
- Parameters
- tn, qnstr
Unique identifiers of
tadlib.hitad.aligner.DomainSet
instances.- cachedict
An empty dictionary.
-
conserved_sub_bounds
(self, tn, qn)[source]¶ Return pairs of conserved sub-TAD boundaries.
- Parameters
- tn, qn: str
Unique identifiers of
tadlib.hitad.aligner.DomainSet
instances.
- Returns
- pairsdict
Keys and values indicate sub-TAD boundaries in tn and qn, respectively.
-
conserved_tad_bounds
(self, tn, qn)[source]¶ Return pairs of conserved TAD boundaries.
- Parameters
- tn, qn: str
Unique identifiers of
tadlib.hitad.aligner.DomainSet
instances.
- Returns
- pairsdict
Keys and values indicate TAD boundaries in tn and qn, respectively.
-
disappeared_sub
(self, tn, qn)[source]¶ Sub-TAD boundaries that exist in tn, but disappear in qn.
- Parameters
- tn, qn: str
Unique identifiers of
tadlib.hitad.aligner.DomainSet
instances.
- Returns
- pairsset of tuples
Sub-TAD boundary positions in tn.
-
disappeared_tad
(self, tn, qn)[source]¶ TAD boundaries that exist in tn, but disappear in qn.
- Parameters
- tn, qn: str
Unique identifiers of
tadlib.hitad.aligner.DomainSet
instances.
- Returns
- pairsset of tuples
TAD boundary positions in tn.
-
sub2tad
(self, tn, qn)[source]¶ Return sub-TAD to TAD switch cases.
- Parameters
- tn, qn: str
Unique identifiers of
tadlib.hitad.aligner.DomainSet
instances.
- Returns
- pairsdict
Keys are sub-TAD boundaries in tn, and values indicate corresponding TAD boundaries in qn.
-
tad2sub
(self, tn, qn)[source]¶ Return TAD to sub-TAD switch cases.
- Parameters
- tn, qn: str
Unique identifiers of
tadlib.hitad.aligner.DomainSet
instances.
- Returns
- pairsdict
Keys are TAD boundaries in tn, and values indicate corresponding sub-TAD boundaries in qn.
-
class
tadlib.hitad.aligner.
DomainSet
(en, domainlist, res, hier=True)[source]¶ Parse and hold a hierarchical domain set.
- Parameters
- enstr
Unique identifier for input domain set.
- domainlistlist
List of domains. See
tadlib.hitad.aligner.BoundSet
for details.- resint
Resolution of the Hi-C data in base-pair unit.
- hierbool
Whether domainlist contains multiple-level domains or not. (Default: True)
- Attributes
- resint
Resolution of the Hi-C data in base-pair unit.
- levsset of int
All possible domain levels contained in domainlist.
- bychromsdict
Bychromosomal rearrangement of domainlist. The keys are chromosome labels(1,2,…,22,X,Y), and the values are list of [start,end,level].
- pretreedict
Nested domain list within any domain interval. Returned by
tadlib.hitad.aligner.DomainSet.NestedDomains()
.- subpooldict
Domain list within any domain interval. (
tadlib.hitad.aligner.DomainSet.NestedDomains()
)- lidxdict
The smallest indices of left domain boundaries in a by-chromosomal domain list. (
tadlib.hitad.aligner.DomainSet.NestedDomains()
)- ridxdict
The largest indices of right domain boundaries in a by-chromosomal domain list. (
tadlib.hitad.aligner.DomainSet.NestedDomains()
)- Domainsdict
Store each TAD and its nested domains as a tree. Each node in the tree indicates one domain, in particular, the root corresponds to the TAD, and the leaves correspond to bottom domains.
Methods
NestedDomains
(self, bychroms)Pre-parse domain lists for accelerating subsequent calculations.
genDomainTree
(self, node, pretree, cur)Recursively generate a tree/sub-tree taking node as starting point.
getBottoms
(self)Link bottom domains to corresponding outer TADs.
getregion
(self, chrom, start, end[, lev])Extract all domains (or domains at specific level) within a given region.
-
NestedDomains
(self, bychroms)[source]¶ Pre-parse domain lists for accelerating subsequent calculations.
- Parameters
- bychromsdict
By-chromosomal domain lists.
- Returns
- tmpdictdict
Nested domain list within any domain interval. If a domain have no nested domains, then its value is an empty list.
- subpooldict
Domain list within any domain interval. Different from tmpdict, if a domain have no nested domains, the value is a list only containing itself.
- lidxdict
The smallest indices of left domain boundaries in a by-chromosomal domain list.
- ridxdict
The largest indices of right domain boundaries in a by-chromosomal domain list.
-
genDomainTree
(self, node, pretree, cur)[source]¶ Recursively generate a tree/sub-tree taking node as starting point.
- Parameters
- nodeNode
A dict-like container for current domain.
- pretreedict
Nested domain list within any domain interval. (
tadlib.hitad.aligner.DomainSet.NestedDomains()
)- curlist
Nested domain list within current domain interval.
-
getBottoms
(self)[source]¶ Link bottom domains to corresponding outer TADs.
- Attributes
- bottomsdict
Used to quickly retrieve (bottom domain, TAD) pairs.
-
getregion
(self, chrom, start, end, lev=None)[source]¶ Extract all domains (or domains at specific level) within a given region.
- Parameters
- chromstr
Chromosome label.
- start, endint
Domain interval in base-pair unit.
- levint or None
Specify the desired domain level. (Default: None, domains of all levels will be returned)
- Returns
- rdomainslist
Sorted domain list. Each element corresponds to one domain in the format
[chrom,start,end,level]
.
-
class
tadlib.hitad.aligner.
BoundSet
(en, domainlist, res)[source]¶ As the name suggests, we use BoundSet to hold all bounds of a domain list.
- Parameters
- enstr
Unique identifier for current set of bounds.
- domainlistlist
List of the domains. Each domain is represented by
[chrom,start,end,level]
. I think chrom, start and end are self-explanatory, all you need to keep in mind is that start and *end should be in base-pair unit. level indicates the hierarchical level of the domain. In our work, TAD is denoted as 0, sub-TAD is denoted as 1, and subsequent domain level is denoted as 2, etc.- resint
Resolution of the Hi-C data in base-pair unit.
- Attributes
- Labelstr
Unique identifier.
- boundclassdict
The keys are bound representations (chrom,pos), and the values indicate corresponding hierarchical level notations. The Level of a bound is determined by the domain with the lowest level notation. For example, if we have two domains, [‘1’,100000,500000,0] and [‘1’,100000,200000,1], according to our definition, the level of (‘1’,200000) is 1, but the level of (‘1’,100000) is 0.
- Boundslist
Sorted bound list. This attribute can be used as the reference bound list in
tadlib.hitad.aligner.SingleBound.align()
directly.
-
class
tadlib.hitad.aligner.
SingleDomain
(chrom, start, end)[source]¶ We use SingleDomain to:
Represent a single domain (chrom, start, end).
Map the domain to another domain set
- Parameters
- chromstr
Chromosome label.
- start, endint
Interval of the domain in base-pair unit.
- Attributes
- chromstr
Chromosome label.
- intervallist
[start, end]
- cachedict
Container for matched details.
Methods
align
(self, qy)Find the domain D in qy maximizing the overlap ratio.
overlap
(self, ti, qi)Calculate overlap ratio of any two regions.
-
align
(self, qy)[source]¶ Find the domain D in qy maximizing the overlap ratio. Binary search method is used internally for accelerating the search process.
- Parameters
- qya
tadlib.hitad.aligner.DomainSet
instance Reference domain set. (Recall sequence mapping and reference genome)
- qya
Notes
The matched details are stored in cache using the unique identifier of qy (
qy.Label
) as the key, the value is also a dict with 2 keys: hitdomain records the matched domain interval in (chrom,start,end) format, and hitoverlap records the overlap ratio between the hitdomain and current query domain.
-
class
tadlib.hitad.aligner.
SingleBound
(chrom, pos)[source]¶ SingleBound is defined to:
Represent a single bound (chrom, pos)
Map the bound to a pool of bounds
- Parameters
- chromstr
Chromosome label.
- posint
Bound position on the chromosome.
- Attributes
- chromstr
Chromosome label.
- posint
Position on the chromosome.
- cachedict
Container for matched details.
Methods
align
(self, qn, qb, tol)Map the bound to qb using the binary search method.
-
align
(self, qn, qb, tol)[source]¶ Map the bound to qb using the binary search method.
- Parameters
- qnstr
Unique identifier for qb.
- qblist of tuples
Reference bound (remember reference genome?) list. Each element is a tuple (chrom, pos) representing a single bound. And the list must be sorted in advance for binary search.
- tolint
Mismatch tolerance. If the genomic distance between the bound and the best hit is less than this value, we say we have found a match, otherwise the bound is missed in qb.
Notes
Internally, the matched details will be stored in cache under the key qn, the value is also a dict recording the matched bound index (midx) and the indices of the matched neighbors (nindices).
-
class
tadlib.hitad.aligner.
Container
(info)[source]¶ Dict-like. Used in the organizing of domain alignment results.
- Parameters
- infolist
Pair of domain lists from two domain sets.
Methods
clear
()copy
()fromkeys
(iterable[, value])Create a new dictionary with keys from iterable and values set to value.
get
(self, key[, default])Return the value for key if key is in the dictionary, else default.
items
()keys
()pop
()If key is not found, d is returned if given, otherwise KeyError is raised
popitem
()2-tuple; but raise KeyError if D is empty.
setdefault
(self, key[, default])Insert key with a value of default if key is not in the dictionary.
update
()If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
values
()
-
class
tadlib.hitad.aligner.
Node
(bounds=None)[source]¶ Dick-like. We use it to represent nodes of a hierarchical domain tree in DomainSet.
- Parameters
- boundslist or None
Domain interval represented by [chrom,start,end].
Methods
clear
()copy
()fromkeys
(iterable[, value])Create a new dictionary with keys from iterable and values set to value.
get
(self, key[, default])Return the value for key if key is in the dictionary, else default.
items
()keys
()pop
()If key is not found, d is returned if given, otherwise KeyError is raised
popitem
()2-tuple; but raise KeyError if D is empty.
setdefault
(self, key[, default])Insert key with a value of default if key is not in the dictionary.
update
()If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
values
()
-
tadlib.hitad.aligner.
readHierDomain
(domainfile, pre='')[source]¶ Load hierarchical domain list from a text file.
The source file should contain 4 columns indicating chromosome label (1,2,…,X,Y), domain start (bp), domain end (bp), and hierarchical level (0,1,2,…), respectively.
In our paper, TAD is denoted as level 0, sub-TAD is denoted as level 1, and subsequent domain level is denoted as level 2, etc.
- Parameters
- domainfilestr
Domain file path.
- Returns
- domainlistlist
Each element of the list indicates one domain represented by [chrom,start,end,level].
-
tadlib.hitad.aligner.
readPlainDomain
(domainfile, pre='chr')[source]¶ Load domain list from a text file.
The source file should contain 3 columns indicating chromosome name, domain start (bp) and domain end (bp), respectively.
- Parameters
- domainfilestr
Domain file path.
- prestr
Leading string of the chromosome name. (Default: chr)
- Returns
- domainlistlist
Each element indicates one domain represented by [chrom(leading string removed),start,end].
See also
tadlib.hitad.aligner.hierFormat
parse hierarchical relationships between domains
-
tadlib.hitad.aligner.
hierFormat
(domainlist)[source]¶ Resolve the nested/hierarchical relationships between domains, and transform the input [chrom,start,end] format domains into a format including hierarchical level information.
- Parameters
- domainlistlist
Domains with the format [chrom,start,end].
- Returns
- domainlistlist
Domains with the format [chrom,start,end,level].