Hierarchical TAD Identification

class tadlib.hitad.chromLev.Chrom(chrom, res, hicdata, replabel, maxapart=4000000)[source]

Chrom is defined to:

  • Hold Hi-C data within a certain chromosome

  • Identify hierarchical domains in 4 steps: 1.Calculate adaptive DIs. 2.Identify original candidate bounds by 5-state Gaussian mixture Hidden Markov Model using adaptive DIs as input. 3.Select TAD bounds from candidate bounds. 4.Recursively identify inner domain structures of each TAD.

  • Visualize any region of the chromosome. Hierarchical domains will be plotted as boxes along with the diagonal of the heatmap, and adaptive DI track will be placed on top of the heatmap.

Parameters
chromstr

Chromosome label.

resint

Resolution of the Hi-C data in base-pair unit.

hicdataCSR sparse matrix

Bin-level Hi-C matrix of the specified chromosome.

replabelstr

Biological replicate label.

maxapartint

Maximum allowable TAD size in base-pair unit. (Default: 4000000)

Attributes
chromstr

Chromosome label.

resint

Resolution in base-pair unit.

maxapartint

Maximum allowable TAD size.

replabelstr

Biological replicate label.

chromLenint

Total bin number of the chromosome.

rawMatrixsparse matrix in Compressed Sparse Row format

CSR sparse matrix is used to extract Hi-C data by slicing conveniently while guarantee low memory overhead.

Methods

calDI(self, windows, start)

Calculate Directionality Index (DI) for each bin with adaptive window size.

callDomain(self)

Direct API for our hierarchical domain identification pipeline:

detectPeaks(self, trends[, mph, mpd])

Detect peaks (local maxima) in a 1-D array intuitively (a peak must be greater than its immediate neighbors).

fineDomain(self)

Identify hierarchical domains within each TAD.

getDomainList(self, byregion)

Combine by-region domains into a single list.

getSelfMatrix(self, start, end)

Return the contact matrix of any given region.

getWeightMatrix(self, start, end[, bases])

Calculate weights for each intra-domain interaction by considering the genomic distance and the local interaction background.

idxmatch(self, domain)

Pair interactions of the given domain with the upstream and downstream interactions under the same genomic distance.

iterCore(self, minDomains, tmpDomains)

Calculate the mismatch ratio for the input two domain lists.

maxCore(self[, cache])

Perform TAD identification.

maxscorepath(self, domainlist, cache)

An implementation for our proposed algorithm to find the best separation solution at chromosomal/domain level, given bottom domain list and pre-computed scores.

minCore(self, regionDIs)

Output domain list for each gap-free region.

minWindows(self, start, end, maxw)

Estimate best window size for each bin of a given range.

oriIter(self, minDomains)

Iteratvely approximate adaptive window sizes and return the final bottom domain list which will be used in subsequent procedures.

oriWindow(self, P)

Estimate the most appropriate window size for current bin to best capture the local interaction bias direction.

pipe(self, seq, start)

Transform an observed sequence into a list of domains.

plot(self, start, end, Domains, figname[, …])

Given a genomic region and a domain list, plot corresponding contact heatmap and all domains (represented as diagonal squares) within the region.

randomCheck(self, seq[, pthre])

We use chi square test to test the randomness of a sequence by looking at the conversion frequency between neighbors in the sequence.

refNoise(self, domain)

Return noise level of a domain, which is simply defined as the zero entry ratio of the contact matrix.

scoreCache(self, domainlist[, cache])

Calculate and cache the TAD scores for any combinations of consecutive bottom domains.

splitChrom(self, DIs)

Split a chromosome into gap-free regions.

stablescore(self, domain[, bases])

Calculate TAD score for the given domain.

subDomains(self, domain, reflist[, clv, aM, …])

A recusive method (function) to identify inner domain hierarchy of a TAD (or a domain).

toArrowhead(self, start, end)

Perform Arrowhead transformation on contact matrix of a given genomic region.

viterbi(self, seq)

Find the most likely hidden state series using the viterbi algorithm.

calIS

preciseBound

calDI(self, windows, start)[source]

Calculate Directionality Index (DI) for each bin with adaptive window size.

Parameters
windows1-D numpy.ndarray, int32

Returned by tadlib.hitad.chromLev.Chrom.minWindows().

startint

Starting bin index, the window size of which is taken from the 1st place of windows.

Attributes
DIs1-D numpy ndarray, float

Calculated adaptive DI array, which has the same size as the input windows.

callDomain(self)[source]

Direct API for our hierarchical domain identification pipeline:

detectPeaks(self, trends, mph=0, mpd=5)[source]

Detect peaks (local maxima) in a 1-D array intuitively (a peak must be greater than its immediate neighbors).

Parameters
trends1-D numpy ndarray

Data.

mphfloat

Only peaks that are greater than this value will be detected. (Default: 0)

mpdpositive integer

Only peaks whose indices are at least separated by this value will be reported. (Default: 5)

Returns
ind1-D numpy ndarray

Indices of peaks detected in trends.

fineDomain(self)[source]

Identify hierarchical domains within each TAD.

See also

tadlib.hitad.chromLev.Chrom.maxCore

identify TADs

tadlib.hitad.chromLev.Chrom.subDomains

resolve domain hierarchy within a given TAD

Attributes
hierDomainsdict

The keys are tuples representing gap-free regions of the chromosome, and the values are corresponding identified hierarchical domain lists. Start and end of the domain are in base-pair unit.

getDomainList(self, byregion)[source]

Combine by-region domains into a single list.

Parameters
byregiondict

The keys are tuples representing gap-free regions of the chromosome, and the values are corresponding identified domain lists.

Returns
DomainListlist

A merged domain list of all regions

getSelfMatrix(self, start, end)[source]

Return the contact matrix of any given region.

Parameters
start, endint

The region interval in base-pair unit.

Returns
Matrix2-D numpy ndarray, float

Sub contact matrix.

getWeightMatrix(self, start, end, bases=[])[source]

Calculate weights for each intra-domain interaction by considering the genomic distance and the local interaction background.

Parameters
start, endint

The domain interval in base-pair unit.

baseslist

List of the bottom domains within the given interval.

Returns
W2-D numpy.ndarray

The weight matrix. (An upper triangular matrix)

idxmatch(self, domain)[source]

Pair interactions of the given domain with the upstream and downstream interactions under the same genomic distance.

Parameters
domain[start, end]

Domain interval in base-pair unit.

Returns
cur_storetuple, (x-coordinates, y-coordinates, interaction frequencies)

Interactions within the given domain.

up_storetuple, (x-coordinates, y-coordinates, interaction frequencies)

Corresponding upstream interctions.

down_storetuple, (x-coordinates, y-coordinates, interaction frequencies)

Corresponding downstream interactions.

iterCore(self, minDomains, tmpDomains)[source]

Calculate the mismatch ratio for the input two domain lists. Return 1 if minDomains is empty.

tadlib.hitad.chromLev.Chrom.oriIter() uses this method to determine whether to break the iteration.

Parameters
minDomainsdict

Target domains calculated by the last loop.

tmpDomainsdict

Query domains returned by the current loop.

Returns
tolfloat

Mismatch ratio.

maxCore(self, cache={})[source]

Perform TAD identification. We define TADs as domains to optimize chromosome separation(based on some objective function), which is solved by using an algorithm like dynamic programming implemented in tadlib.hitad.chromLev.Chrom.maxscorepath().

Parameters
cachedict

TAD scores for all combinations of consecutive bottom domains will be pre-computed and stored in this dict. The keys are tuples (start, end) representing merged domain intervals. (Default: {})

See also

tadlib.hitad.chromLev.Chrom.scoreCache

Pre-compute TAD scores for all combinations of consecutive bottom domains.

tadlib.hitad.chromLev.Chrom.maxscorepath

find the best TAD list

Attributes
maxDomainsdict

Optimized TADs.

cachedict

Cached TAD scores.

maxscorepath(self, domainlist, cache)[source]

An implementation for our proposed algorithm to find the best separation solution at chromosomal/domain level, given bottom domain list and pre-computed scores.

Parameters
domainlistlist of [start,end]

List of bottom domain intervals in base-pair unit.

cachedict

Pre-computed scores for any combinations of continuous bottom domains. The keys are intervals of continuous regions in (start,end) format, and the values are corresponding scores.

Returns
bestsset of (sidx, eidx)

Each element indicates one merged domain of the solution, represented as (start index, end index) of the input domainlist.

minCore(self, regionDIs)[source]

Output domain list for each gap-free region.

Parameters
regionDIsdict

Gap-free regions and corresponding adaptive DI arrays.

Returns
minDomainsdict

Gap-free regions and corresponding identified bottom domain list. Different from tadlib.hitad.chromLev.Chrom.pipe(), the start and the end of a domain are in base-pair unit.

minWindows(self, start, end, maxw)[source]

Estimate best window size for each bin of a given range.

Parameters
start, endint

Specify range of the bin indices.

maxwint

Maximum allowable window size.

See also

tadlib.hitad.chromLev.Chrom.oriWindow

Window size estimation for a single bin.

Attributes
windows1-D numpy.ndarray, int32
oriIter(self, minDomains)[source]

Iteratvely approximate adaptive window sizes and return the final bottom domain list which will be used in subsequent procedures. For each loop, window sizes are updated according to the latest bottom domains and next loop will re-run the identification pipeline using new window sizes. The iteration terminates if domains between two consecutive loops are very similar (estimated by tadlib.hitad.chromLev.Chrom.iterCore()).

Parameters
minDomainsdict

Initial domains served as the target domain list for tadlib.hitad.chromLev.Chrom.iterCore() at the first iteration. We set it empty in our pipeline.

See also

tadlib.hitad.chromLev.Chrom.calDI

calculate DI values according to input window sizes

tadlib.hitad.chromLev.Chrom.iterCore

estimate the degree of divergence between two domain lists

Attributes
minDomainsdict

The keys are tuples representing gap-free regions of the chromosome, and the values are corresponding identified bottom domain lists. Start and end of the domain are in base-pair unit.

oriWindow(self, P)[source]

Estimate the most appropriate window size for current bin to best capture the local interaction bias direction.

See also

tadlib.hitad.chromLev.Chrom.detectPeaks

detect peaks given a 1-D array

tadlib.hitad.chromLev.Chrom.randomCheck

randomness test for a two-valued (0-1) sequence

pipe(self, seq, start)[source]

Transform an observed sequence into a list of domains.

Parameters
seq1-D numbpy ndarray, float

Adaptive DI array for any region.

startint

Chromosome bin index of the seq start.

Returns
domainslist

List of domains in the format [start bin, end bin, noise level, hierarchical level].

See also

tadlib.hitad.chromLev.Chrom.refNoise

Calculate the noise level of a given domain

tadlib.hitad.aligner.BoundSet

where the meanings of the hierarchical level labels are explained in detail.

plot(self, start, end, Domains, figname, arrowhead=False, vmin=None, vmax=None)[source]

Given a genomic region and a domain list, plot corresponding contact heatmap and all domains (represented as diagonal squares) within the region.

Parameters
start, endint

The region interval.

Domainsdict

The keys are tuples representing gap-free regions, and values are corresponding identified domains.

fignamestr or None

If not None, the figure will be saved, otherwise it will only be shown in an interactive window. (Default: None)

arrowheadbool

If True, the Arrowhead transformed matrix will be plotted instead of the raw contact matrix. (Default: False)

randomCheck(self, seq, pthre=0.05)[source]

We use chi square test to test the randomness of a sequence by looking at the conversion frequency between neighbors in the sequence.

Parameters
seqstr

A string containing only ‘1’ or ‘0’. (e.g. ‘101000101’)

pthrefloat, 0-1

Significance level of the hypothesis tests.

Returns
rejectbool

True if we should reject the null hypothesis (the sequence is generated randomly) under the selected significance level.

refNoise(self, domain)[source]

Return noise level of a domain, which is simply defined as the zero entry ratio of the contact matrix.

scoreCache(self, domainlist, cache={})[source]

Calculate and cache the TAD scores for any combinations of consecutive bottom domains.

Parameters
domainlistlist of [start,end]

List of bottom domain intervals in base-pair unit.

cachedict

Cache of the TAD scores. The keys are in (start,end) format, intervals of continuous regions, and the values are corresponding calculated scores.

See also

tadlib.hitad.chromLev.Chrom.stablescore

Calculate TAD score of any genomic interval

splitChrom(self, DIs)[source]

Split a chromosome into gap-free regions. HMM learning and domain identification procedures will be performed on these regions separately.

Parameters
DIs1-D numpy ndarray, float

Adaptive DI array of the whole chromosome. Generally, we detect runs of zeros in the array as gaps, which will be cut off the chromosome, making entire chromosome pieces of gap-free regions.

Attributes
chromRegionsdict, {(start,end):DIs[start:end]}

The keys are gap-free regions, and the values are corresponding adaptive DI pieces.

gapbinsset

Set of bins (in base-pair unit) located in gap regions.

stablescore(self, domain, bases=[])[source]

Calculate TAD score for the given domain.

Parameters
domain[start, end]

Domain interval in base-pair unit.

baseslist

List of bottom domains within the given domain region.

Returns
inoutfloat

TAD score defined as the enrichment between intra-domain interaction frequencies and inter-domain interaction frequencies controlling for the impact of genomic distance.

See also

tadlib.hitad.chromLev.Chrom.idxmatch

Pair intra-domain and inter-domain interactions with the same genomic distance

tadlib.calfea.analyze.Core.longrange

Assign the weight value for each intra-domain interaction according to the genomic distance and the local interaction background.

subDomains(self, domain, reflist, clv=0, aM=None, W=None, subdomains={})[source]

A recusive method (function) to identify inner domain hierarchy of a TAD (or a domain). Sub-TADs of each level are defined as the best separation of the outer domain.

Parameters
domain[start, end]

Outer layer domain interval in base-pair unit.

reflistlist

List of bottom domains within the region of the outer domain.

clvint

Global domain level of the outer domain. The TADs have the level 0, the sub-TADs within a TAD have the level 1, and sub-sub-TADs within a sub-TAD have the level 2, and so forth. (Default: 0)

aM2-D numpy ndarray or None

Arrowhead matrix of the outer domain. (Default: None)

W2-D numpy ndarray or None

Weight matrix corresponding to the contact matrix of the outer domain. See tadlib.hitad.chromLev.Chrom.getWeightMatrix() for detailed calculation. (Default: None)

subdomainsdict

A container for domains of all hierarchy. The keys are domain intervals in base-pair unit, and values are corresponding global levels.

See also

tadlib.hitad.chromLev.Chrom.maxscorepath

find the best domain list optimally separating domain-level interactions.

toArrowhead(self, start, end)[source]

Perform Arrowhead transformation on contact matrix of a given genomic region.

Parameters
start, endint

The region interval.

Returns
A2-D numpy ndarray, float

Transformed matrix. (A symmetric matrix)

viterbi(self, seq)[source]

Find the most likely hidden state series using the viterbi algorithm.

Parameters
seq1-D numbpy ndarray, float

Adaptive DI array for any region.

Returns
pathlist

List of hidden state labels. Has the same length as the input seq.

class tadlib.hitad.chromLev.MultiReps(chrom, res, datasets)[source]

We define MultiReps to:

  • Hold Hi-C data of the same chromosome from different biological replicates at the same time

  • Provide an interface to identify hierarchical domains by using different replicate data independently and maintain the reproducible domains to form the final domain list.

Parameters
chromstr

Chromosome label.

resint

Hi-C data resolution in base-pair unit.

datasetsdict

The keys are unique replicate labels, and the values are constructed Chrom objects by using the Hi-C data of corresponding biological replicates.

Methods

align(self, tn, qn)

Construct hierarchical alignment between tn and qn.

callDomain(self)

Find reproducible domains between replicates by using our domain-level alignment strategy.

conserved(self, tn, qn)

Return conserved TAD pairs.

getDomainList(self, byregion)

Combine by-region domains into a single list.

inner_changed(self, tn, qn)

Return semi-conserved TAD pairs.

merged(self, tn, qn)

Return merged region pairs and merged TAD details.

overlap(self, ti, qi)

Calculate overlap ratio of any two regions.

reproducible(self, tg, qy)

Subprocess of callDomain for replicate alignment and reproducibility determination.

split(self, tn, qn)

Return split region pairs and split TAD details.

intersect

callDomain(self)[source]

Find reproducible domains between replicates by using our domain-level alignment strategy.

getDomainList(self, byregion)[source]

Combine by-region domains into a single list.

Each domain in the returned list is represented in the format [chromosome label, start bp, end bp, hierarchical level].

reproducible(self, tg, qy)[source]

Subprocess of callDomain for replicate alignment and reproducibility determination.

class tadlib.hitad.genomeLev.Genome(datasets, balance_type='weight', maxsize=4000000, cache=None, exclude=['chrM', 'chrY'], DIcol='DIs')[source]

Genome is built on top of tadlib.hitad.chromLev.Chrom. We use it to:

  • Load bin-level Hi-C data

  • Initialize, pickle and organize Chrom objects

  • Call hierarchical domains of each chromosome in parallel

Parameters
datasets2-level dict, {resolution(int):{biological_replicate_label(str):data_path,…}}

data_path indicates the cool URI under corresponding resolution and biological replicate label.

maxsizeint

Maximum allowable domain size in base-pair unit. (Default: 4000000)

cachestr or None

Cache folder path. If None, the folder returned by tempfile.gettempdir() will be used. (Default: None)

Attributes
data3-level dict. {chrom(str):{resolution(int):{biological_replicate_label(str):cachedfile,…}},…}

Different from the input datasets, it organizes data by chromosome label, and each bottom-level value indicates one pickled tadlib.hitad.chromLev.Chrom file under the cache folder.

Resultslist

Final consistent domain list merged from all chromosome results.

Methods

callHierDomain(self[, cpu_core])

Identify hierarchical domains of each chromosome independently and concurrently, find consistent domains between biological replicates, and finally combine results of all chromosomes.

learning(self[, cpu_core])

Prepare training data and learn HMM model parameters for each dataset.

oriHMMParams(self)

Set initial parameters for the Hidden Markov Model (HMM).

wipeDisk(self)

Remove catched (pickled) tadlib.hitad.chromLev.Chrom objects before exiting.

outputDomain

train_data

callHierDomain(self, cpu_core=1)[source]

Identify hierarchical domains of each chromosome independently and concurrently, find consistent domains between biological replicates, and finally combine results of all chromosomes.

Parameters
cpu_coreint

Number of processes to launch. For now, hitad only supports parallel computing on a quite high layer, that is, it simply allocates an uncompleted tadlib.hitad.chromLev.Chrom object to an idle processor and invokes its callDomain method.

learning(self, cpu_core=1)[source]

Prepare training data and learn HMM model parameters for each dataset.

Parameters
cpu_coreint

Number of processes to launch.

oriHMMParams(self)[source]

Set initial parameters for the Hidden Markov Model (HMM).

Attributes
HMMParamsdict

Has 3 keys: “A”, state transition matrix, “B” (emission probabilities), specifying parameters (Means, Variances, Weights) of the mixture Gaussian distributions for each hidden state, and “pi”, indicating the hidden state weights. This dict will be updated after learning procedure.

wipeDisk(self)[source]

Remove catched (pickled) tadlib.hitad.chromLev.Chrom objects before exiting.

Domain Loading and Aligning

class tadlib.hitad.aligner.DomainAligner(*args)[source]

This class is the work horse we define to:

  1. Hold multiple tadlib.hitad.aligner.DomainSet instances at the same time.

  2. Perform domain-based hierarchical alignment between any two tadlib.hitad.aligner.DomainSet.

  3. Define and extract domain-level change types from alignment results between two tadlib.hitad.aligner.DomainSet.

Parameters
argstwo or more tadlib.hitad.aligner.DomainSet instances
Attributes
DomainSetsdict

Pool of tadlib.hitad.aligner.DomainSet. The keys are unique identifiers extracted from tadlib.hitad.aligner.DomainSet, and the values are corresponding tadlib.hitad.aligner.DomainSet instances.

Resultsdict

Container for alignment results between any pair of tadlib.hitad.aligner.DomainSet instances.

Methods

align(self, tn, qn)

Construct hierarchical alignment between tn and qn.

conserved(self, tn, qn)

Return conserved TAD pairs.

inner_changed(self, tn, qn)

Return semi-conserved TAD pairs.

merged(self, tn, qn)

Return merged region pairs and merged TAD details.

overlap(self, ti, qi)

Calculate overlap ratio of any two regions.

split(self, tn, qn)

Return split region pairs and split TAD details.

align(self, tn, qn)[source]

Construct hierarchical alignment between tn and qn.

Parameters
tn, qnstr

Unique identifiers of tadlib.hitad.aligner.DomainSet instances which are collected by arg during initialization.

Notes

The alignment results are organized in a hierarchical way in a dictionary. The keys are matched chromosome region (corresponds to either one TAD or several continuous TADs) pairs at the TAD level, and the values are tadlib.hitad.aligner.Container instances with info attribute set to be pair of detailed TAD lists; the keys of these tadlib.hitad.aligner.Container indicate domain levels, the values again are dictionaries containing sub-alignment results within the upper-layer TAD region.

You can access the results from Results attribute: self.Results[tn][qn] or self.Results[qn][tn].

conserved(self, tn, qn)[source]

Return conserved TAD pairs.

Parameters
tn, qnstr

Unique identifiers of tadlib.hitad.aligner.DomainSet.

Returns
pairsset of tuples

Each tuple has two elements (domain intervals), corresponding to tn and qn respectively.

inner_changed(self, tn, qn)[source]

Return semi-conserved TAD pairs.

Parameters
tn, qnstr

Unique identifiers of tadlib.hitad.aligner.DomainSet.

Returns
pairsset of tuples

Each tuple has two elements (domain intervals), corresponding to tn and qn respectively.

merged(self, tn, qn)[source]

Return merged region pairs and merged TAD details.

Parameters
tn, qnstr

Unique identifiers of tadlib.hitad.aligner.DomainSet.

Returns
pairsdict

The keys are merged region pairs in tuple, and the values are corresponding TAD list pairs within the region.

split(self, tn, qn)[source]

Return split region pairs and split TAD details.

Parameters
tn, qnstr

Unique identifiers of tadlib.hitad.aligner.DomainSet.

Returns
pairsdict

The keys are split region pairs in tuple, and the values are corresponding TAD list pairs within the region.

class tadlib.hitad.aligner.BoundAligner(*args)[source]

Based on our hierarchical domain alignment scheme, we also define several change types on boundary level between two datasets, including conserved TAD boundary, conserved sub-TAD boundary, disappeared TAD boundary, disappeared sub-TAD boundary, and TAD-to-sub-TAD boundary switch.

Boundaries are expressed in (chrom,pos) format in this class.

Parameters
argstwo or more tadlib.hitad.aligner.DomainSet instances
Attributes
byclassdict

Cache boundary pairs of each change type between datasets.

Methods

align(self, tn, qn)

Construct hierarchical alignment between tn and qn.

all_in_one(self, tn, qn, cache)

Parse domain alignment results between tn and qn and cache all detected cases of 6 defined change types.

conserved(self, tn, qn)

Return conserved TAD pairs.

conserved_sub_bounds(self, tn, qn)

Return pairs of conserved sub-TAD boundaries.

conserved_tad_bounds(self, tn, qn)

Return pairs of conserved TAD boundaries.

disappeared_sub(self, tn, qn)

Sub-TAD boundaries that exist in tn, but disappear in qn.

disappeared_tad(self, tn, qn)

TAD boundaries that exist in tn, but disappear in qn.

inner_changed(self, tn, qn)

Return semi-conserved TAD pairs.

merged(self, tn, qn)

Return merged region pairs and merged TAD details.

overlap(self, ti, qi)

Calculate overlap ratio of any two regions.

split(self, tn, qn)

Return split region pairs and split TAD details.

sub2tad(self, tn, qn)

Return sub-TAD to TAD switch cases.

tad2sub(self, tn, qn)

Return TAD to sub-TAD switch cases.

pairwise_alignment

all_in_one(self, tn, qn, cache)[source]

Parse domain alignment results between tn and qn and cache all detected cases of 6 defined change types.

Parameters
tn, qnstr

Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

cachedict

An empty dictionary.

conserved_sub_bounds(self, tn, qn)[source]

Return pairs of conserved sub-TAD boundaries.

Parameters
tn, qn: str

Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

Returns
pairsdict

Keys and values indicate sub-TAD boundaries in tn and qn, respectively.

conserved_tad_bounds(self, tn, qn)[source]

Return pairs of conserved TAD boundaries.

Parameters
tn, qn: str

Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

Returns
pairsdict

Keys and values indicate TAD boundaries in tn and qn, respectively.

disappeared_sub(self, tn, qn)[source]

Sub-TAD boundaries that exist in tn, but disappear in qn.

Parameters
tn, qn: str

Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

Returns
pairsset of tuples

Sub-TAD boundary positions in tn.

disappeared_tad(self, tn, qn)[source]

TAD boundaries that exist in tn, but disappear in qn.

Parameters
tn, qn: str

Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

Returns
pairsset of tuples

TAD boundary positions in tn.

sub2tad(self, tn, qn)[source]

Return sub-TAD to TAD switch cases.

Parameters
tn, qn: str

Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

Returns
pairsdict

Keys are sub-TAD boundaries in tn, and values indicate corresponding TAD boundaries in qn.

tad2sub(self, tn, qn)[source]

Return TAD to sub-TAD switch cases.

Parameters
tn, qn: str

Unique identifiers of tadlib.hitad.aligner.DomainSet instances.

Returns
pairsdict

Keys are TAD boundaries in tn, and values indicate corresponding sub-TAD boundaries in qn.

class tadlib.hitad.aligner.DomainSet(en, domainlist, res, hier=True)[source]

Parse and hold a hierarchical domain set.

Parameters
enstr

Unique identifier for input domain set.

domainlistlist

List of domains. See tadlib.hitad.aligner.BoundSet for details.

resint

Resolution of the Hi-C data in base-pair unit.

hierbool

Whether domainlist contains multiple-level domains or not. (Default: True)

Attributes
resint

Resolution of the Hi-C data in base-pair unit.

levsset of int

All possible domain levels contained in domainlist.

bychromsdict

Bychromosomal rearrangement of domainlist. The keys are chromosome labels(1,2,…,22,X,Y), and the values are list of [start,end,level].

pretreedict

Nested domain list within any domain interval. Returned by tadlib.hitad.aligner.DomainSet.NestedDomains().

subpooldict

Domain list within any domain interval. (tadlib.hitad.aligner.DomainSet.NestedDomains())

lidxdict

The smallest indices of left domain boundaries in a by-chromosomal domain list. (tadlib.hitad.aligner.DomainSet.NestedDomains())

ridxdict

The largest indices of right domain boundaries in a by-chromosomal domain list. (tadlib.hitad.aligner.DomainSet.NestedDomains())

Domainsdict

Store each TAD and its nested domains as a tree. Each node in the tree indicates one domain, in particular, the root corresponds to the TAD, and the leaves correspond to bottom domains.

Methods

NestedDomains(self, bychroms)

Pre-parse domain lists for accelerating subsequent calculations.

genDomainTree(self, node, pretree, cur)

Recursively generate a tree/sub-tree taking node as starting point.

getBottoms(self)

Link bottom domains to corresponding outer TADs.

getregion(self, chrom, start, end[, lev])

Extract all domains (or domains at specific level) within a given region.

NestedDomains(self, bychroms)[source]

Pre-parse domain lists for accelerating subsequent calculations.

Parameters
bychromsdict

By-chromosomal domain lists.

Returns
tmpdictdict

Nested domain list within any domain interval. If a domain have no nested domains, then its value is an empty list.

subpooldict

Domain list within any domain interval. Different from tmpdict, if a domain have no nested domains, the value is a list only containing itself.

lidxdict

The smallest indices of left domain boundaries in a by-chromosomal domain list.

ridxdict

The largest indices of right domain boundaries in a by-chromosomal domain list.

genDomainTree(self, node, pretree, cur)[source]

Recursively generate a tree/sub-tree taking node as starting point.

Parameters
nodeNode

A dict-like container for current domain.

pretreedict

Nested domain list within any domain interval. (tadlib.hitad.aligner.DomainSet.NestedDomains())

curlist

Nested domain list within current domain interval.

getBottoms(self)[source]

Link bottom domains to corresponding outer TADs.

Attributes
bottomsdict

Used to quickly retrieve (bottom domain, TAD) pairs.

getregion(self, chrom, start, end, lev=None)[source]

Extract all domains (or domains at specific level) within a given region.

Parameters
chromstr

Chromosome label.

start, endint

Domain interval in base-pair unit.

levint or None

Specify the desired domain level. (Default: None, domains of all levels will be returned)

Returns
rdomainslist

Sorted domain list. Each element corresponds to one domain in the format [chrom,start,end,level].

class tadlib.hitad.aligner.BoundSet(en, domainlist, res)[source]

As the name suggests, we use BoundSet to hold all bounds of a domain list.

Parameters
enstr

Unique identifier for current set of bounds.

domainlistlist

List of the domains. Each domain is represented by [chrom,start,end,level]. I think chrom, start and end are self-explanatory, all you need to keep in mind is that start and *end should be in base-pair unit. level indicates the hierarchical level of the domain. In our work, TAD is denoted as 0, sub-TAD is denoted as 1, and subsequent domain level is denoted as 2, etc.

resint

Resolution of the Hi-C data in base-pair unit.

Attributes
Labelstr

Unique identifier.

boundclassdict

The keys are bound representations (chrom,pos), and the values indicate corresponding hierarchical level notations. The Level of a bound is determined by the domain with the lowest level notation. For example, if we have two domains, [‘1’,100000,500000,0] and [‘1’,100000,200000,1], according to our definition, the level of (‘1’,200000) is 1, but the level of (‘1’,100000) is 0.

Boundslist

Sorted bound list. This attribute can be used as the reference bound list in tadlib.hitad.aligner.SingleBound.align() directly.

class tadlib.hitad.aligner.SingleDomain(chrom, start, end)[source]

We use SingleDomain to:

  1. Represent a single domain (chrom, start, end).

  2. Map the domain to another domain set

Parameters
chromstr

Chromosome label.

start, endint

Interval of the domain in base-pair unit.

Attributes
chromstr

Chromosome label.

intervallist

[start, end]

cachedict

Container for matched details.

Methods

align(self, qy)

Find the domain D in qy maximizing the overlap ratio.

overlap(self, ti, qi)

Calculate overlap ratio of any two regions.

align(self, qy)[source]

Find the domain D in qy maximizing the overlap ratio. Binary search method is used internally for accelerating the search process.

Parameters
qya tadlib.hitad.aligner.DomainSet instance

Reference domain set. (Recall sequence mapping and reference genome)

Notes

The matched details are stored in cache using the unique identifier of qy (qy.Label) as the key, the value is also a dict with 2 keys: hitdomain records the matched domain interval in (chrom,start,end) format, and hitoverlap records the overlap ratio between the hitdomain and current query domain.

overlap(self, ti, qi)[source]

Calculate overlap ratio of any two regions.

Parameters
ti, qilist

Interval ([start,end]) of the region.

Returns
ORfloat, 0-1

Overlap ratio.

class tadlib.hitad.aligner.SingleBound(chrom, pos)[source]

SingleBound is defined to:

  • Represent a single bound (chrom, pos)

  • Map the bound to a pool of bounds

Parameters
chromstr

Chromosome label.

posint

Bound position on the chromosome.

Attributes
chromstr

Chromosome label.

posint

Position on the chromosome.

cachedict

Container for matched details.

Methods

align(self, qn, qb, tol)

Map the bound to qb using the binary search method.

align(self, qn, qb, tol)[source]

Map the bound to qb using the binary search method.

Parameters
qnstr

Unique identifier for qb.

qblist of tuples

Reference bound (remember reference genome?) list. Each element is a tuple (chrom, pos) representing a single bound. And the list must be sorted in advance for binary search.

tolint

Mismatch tolerance. If the genomic distance between the bound and the best hit is less than this value, we say we have found a match, otherwise the bound is missed in qb.

Notes

Internally, the matched details will be stored in cache under the key qn, the value is also a dict recording the matched bound index (midx) and the indices of the matched neighbors (nindices).

class tadlib.hitad.aligner.Container(info)[source]

Dict-like. Used in the organizing of domain alignment results.

Parameters
infolist

Pair of domain lists from two domain sets.

Methods

clear()

copy()

fromkeys(iterable[, value])

Create a new dictionary with keys from iterable and values set to value.

get(self, key[, default])

Return the value for key if key is in the dictionary, else default.

items()

keys()

pop()

If key is not found, d is returned if given, otherwise KeyError is raised

popitem()

2-tuple; but raise KeyError if D is empty.

setdefault(self, key[, default])

Insert key with a value of default if key is not in the dictionary.

update()

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values()

class tadlib.hitad.aligner.Node(bounds=None)[source]

Dick-like. We use it to represent nodes of a hierarchical domain tree in DomainSet.

Parameters
boundslist or None

Domain interval represented by [chrom,start,end].

Methods

clear()

copy()

fromkeys(iterable[, value])

Create a new dictionary with keys from iterable and values set to value.

get(self, key[, default])

Return the value for key if key is in the dictionary, else default.

items()

keys()

pop()

If key is not found, d is returned if given, otherwise KeyError is raised

popitem()

2-tuple; but raise KeyError if D is empty.

setdefault(self, key[, default])

Insert key with a value of default if key is not in the dictionary.

update()

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values()

tadlib.hitad.aligner.readHierDomain(domainfile, pre='')[source]

Load hierarchical domain list from a text file.

The source file should contain 4 columns indicating chromosome label (1,2,…,X,Y), domain start (bp), domain end (bp), and hierarchical level (0,1,2,…), respectively.

In our paper, TAD is denoted as level 0, sub-TAD is denoted as level 1, and subsequent domain level is denoted as level 2, etc.

Parameters
domainfilestr

Domain file path.

Returns
domainlistlist

Each element of the list indicates one domain represented by [chrom,start,end,level].

tadlib.hitad.aligner.readPlainDomain(domainfile, pre='chr')[source]

Load domain list from a text file.

The source file should contain 3 columns indicating chromosome name, domain start (bp) and domain end (bp), respectively.

Parameters
domainfilestr

Domain file path.

prestr

Leading string of the chromosome name. (Default: chr)

Returns
domainlistlist

Each element indicates one domain represented by [chrom(leading string removed),start,end].

See also

tadlib.hitad.aligner.hierFormat

parse hierarchical relationships between domains

tadlib.hitad.aligner.hierFormat(domainlist)[source]

Resolve the nested/hierarchical relationships between domains, and transform the input [chrom,start,end] format domains into a format including hierarchical level information.

Parameters
domainlistlist

Domains with the format [chrom,start,end].

Returns
domainlistlist

Domains with the format [chrom,start,end,level].