TAD Data Parsing

tadlib.calfea.analyze.load_TAD(source_fil, chromname=None, cols=[0, 1, 2])[source]

Load TAD data from a TXT file.

Parameters
sourcestr

Path to the TAD file.

chromnameNone or str

Template of chromosome names. Suppose chromname = 'chr', then the source data should be as follows:

chr1    0   350000
chr1    350000  800000
chr1    800000  1450000

Default: None

colslist of int (length: 3)

Which columns to read, with 0 being the first. For example, cols = [0, 1, 2] will extract the 1st, 2nd and 3rd columns. (Default: [0, 1, 2])

Returns
dataNumpy Structured Array

The parsed TAD intervals are contained in a numpy structured array containing 3 fields: “chr”, “start” and “end”.

Intra-TAD Interaction Analysis

class tadlib.calfea.analyze.Core(matrix, left=0)[source]

Interaction analysis at TAD level.

High IFs off the diagonal region can be identified using tadlib.calfea.analyze.Core.longrange(). tadlib.calfea.analyze.Core.DBSCAN() performs a density-based clustering algorithm to detect aggregation patterns in those IFs. Furthermore, two structural features, called AP (Aggregation Preference) and Coverage in our original research, can be calculated by tadlib.calfea.analyze.Core.gdensity() and tadlib.calfea.analyze.Core.totalCover() respectively.

Parameters
matrixnumpy.ndarray, (ndim = 2)

Interaction matrix of a TAD.

leftint

Starting point of TAD. For example, if the bin size is 10kb, left = 50 means position 500000(bp) on the genome.

Attributes
newMnumpy.ndarray, (ndim = 2)

Gap-free interaction matrix.

convertlist

Information required for converting newM to matrix.

cEMnumpy.ndarray, (ndim = 2)

Expected interaction matrix. An upper triangular matrix. Value in each entry will be used to construct a Poisson Model for statistical significance calculation.

fEnumpy.ndarray, (ndim = 2)

An upper triangular matrix. Each entry represents the fold enrichment of corresponding observed interaction frequency.

Psnumpy.ndarray, (ndim = 2)

An upper triangular matrix. Value in each entry indicates the p-value under corresponding Poisson Model.

posnumpy.ndarray, (shape = (N, 2))

Coordinates of the selected IFs in newM.

Npint

Number of the selected IFs.

cluster_idnumpy.ndarray, (shape = (N,))

Cluster labels for each point in pos. -1 indicates noisy points.

Ncint

Cluster number.

clustersdict

Details of each cluster. “Average density”, “radius”, “area (polygon)”, “point coordinates”, and “object number” are all recorded.

Hullsdict

Details of each convex hull (clusters which can be enclosed by a convex polygon).

ptracelist of int

Labels of clusters in which all points are collinear. These clusters cannot be enclosed by a convex polygon.

gdenfloat, [0, 1]

Weighted average density.

coveragefloat, [0, 1]

Total coverage of clusters.

Methods

DBSCAN(self)

Detect natural patterns in selected interactions using DBSCAN.

convertMatrix(self, M)

Convert an internal gap-free matrix(e.g., newM, cEM, fE, and Ps) into a new matrix with the same shape as the original interaction matrix by using the recorded index map(see the convert attribute).

gdensity(self)

Weighted density calculation.

longrange(self[, pw, ww, top, ratio])

Select statistically significant interactions of the TAD.

totalCover(self)

Total coverage of clusters.

DBSCAN(self)[source]

Detect natural patterns in selected interactions using DBSCAN.

DBSCAN is a dennsity-based clustering algorithm. [1] Two input parameters eps and MinPts are calculated in an analytical way. [2] Before further analysis, some basic features are calculated for each cluster, including “density”, “radius” and the “area”, among which, “area” stands for corresponding convex polygon area.

See also

sklearn.cluster.DBSCAN

an implementation of DBSCAN

tadlib.calfea.polygon.Polygon

calculations based on polygon.

Notes

Both “radius” and “density” are defined based on core objects of a cluster. “radius” is the average distance from the core object to its MinPts-nearest neighbors while “density” is the inverse of it.

References

1

Ester M, Kriegel H, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, AAAI, Press, 1996, pp. 226-231

2

Daszykowski M, Walczak B, Massart DL. Looking for Natural Patterns in Data, Part 1: Density Based Approach. Chemom. Intell. Lab. Syst., 2001, 56: 83-92.

convertMatrix(self, M)[source]

Convert an internal gap-free matrix(e.g., newM, cEM, fE, and Ps) into a new matrix with the same shape as the original interaction matrix by using the recorded index map(see the convert attribute).

gdensity(self)[source]

Weighted density calculation.

tadlib.calfea.analyze.Core.longrange() and tadlib.calfea.analyze.Core.DBSCAN() have to be called in advance.

Density of a TAD is the weighted average density of each cluster. Weight is the ratio of object number of a cluster to Np.

longrange(self, pw=2, ww=5, top=0.7, ratio=0.05)[source]

Select statistically significant interactions of the TAD. Both genomic distance and local interaction background are taken into account.

Parameters
pwint

Width of the peak region. Default: 2

wwint

Width of the donut. Default: 5

topfloat, [0.5, 1]

Parameter for noisy interaction filtering. Default: 0.7

ratiofloat, [0.01, 0.1]

Specifies the sample size of significant interactions. Default: 0.05

Notes

pw and ww are sensitive to data resolution. It performs well when we set pw to 4 and ww to 7 at 5 kb, and (2, 5) at 10 kb. [1]

References

1

Rao, S.S., Huntley, M.H., Durand, N.C. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 2014, 159: 1665-1680.

totalCover(self)[source]

Total coverage of clusters.

tadlib.calfea.analyze.Core.longrange() and tadlib.calfea.analyze.Core.DBSCAN() have to be called in advance.

Matrix Manipulating

tadlib.calfea.analyze.manipulation(matrix, start=0)[source]

Remove gaps of the original interaction matrix.

Parameters
matrixnumpy.ndarray, (ndim = 2)

Interaction matrix.

startint

The begining of the region. (Default: 0)

Returns
newMnumpy.ndarray, (ndim = 2)

The gap-removed matrix.

convertlist

The first element is the index map from newM to matrix, and the second element records the length of matrix.

Examples

>>> import numpy as np
>>> from tadlib.calfea.analyze import manipulation
>>> matrix = np.random.rand(4, 4)
>>> matrix[1,:] = 0; matrix[:,1] = 0
>>> print matrix
[[ 0.24822414  0.          0.07782508  0.01812965]
 [ 0.          0.          0.          0.        ]
 [ 0.93870151  0.          0.21986474  0.20462965]
 [ 0.13022712  0.          0.78674168  0.77068304]]
>>> newM, convert = manipulation(matrix)
>>> print newM
[[ 0.24822414  0.07782508  0.01812965]
 [ 0.93870151  0.21986474  0.20462965]
 [ 0.13022712  0.78674168  0.77068304]]
>>> print convert
[{0: 0, 1: 2, 2: 3}, 4]

Polygon Creation and Operations

Here, polygon means convex polygon. In the simplest case, a polygon can be created or presented by vertices. More ofen, you don’t know the vertices but want a polygon enclosing a set of points.

ConvexHull defined in scipy.spatial module uses the Qhull library to compute convex hull for a finite set of points, but doesn’t provide any further polygon operations.

We fix these problems by customizing more methods for ConvexHull.

Note

Obviously, collinear point sets cannot be used to construct polygon. So a collinear test should be performed in advance.

class tadlib.calfea.polygon.Polygon(points)[source]

Main class for polygon creation, manipulation and calulation.

The API is inherited from scipy.spatial.ConvexHull, which constructs convex hull from a point set. So you should input coordinates of a point set to initiate an instance.

More general, not all points are required. Vertices alone are enough to construct the convex hull.

Also, this class is designed to operate on 2-D space, although ConvexHull is okay for higher dimensions.

Parameters
pointsarray_like

Coordinates of points to construct a convex hull from. All objects that can be converted to a ndarray shaped (npoints, 2) are okay.

Examples

Convex hull of a random set of points:

>>> import numpy as np
>>> from tadlib.calfea.polygon import Polygon
>>> points = np.random.rand(20, 2) # 20 random points in 2-D space
>>> P = Polygon(points)
>>> print P.anchors
[[ 0.71119071  0.216714  ]
 [ 0.92908323  0.77903127]
 [ 0.79530032  0.93279735]
 [ 0.02160083  0.98758203]
 [ 0.09094998  0.4423167 ]
 [ 0.16924963  0.05900692]
 [ 0.71119071  0.216714  ]]
Attributes
Common attributes from **ConvexHull:**
pointsndarray of double, shape (npoints, 2)

Coordinates of input points. Converted ndarray.

verticesndarray of ints, shape (nvertices,)

Indices of points forming the vertices of the convex hull. The vertices are counter-clockwise ordered.

simplicesndarray of ints, shape (nfacet, 2)

Indices of points forming the simplical facets of the convex hull.

Customized attributes:
anchorsndarray of double, shape (nvertices,)

Vertices of the convex hull, counter-clockwise ordered.

areafloat

Area of the polygon.

Methods

add_points(points[, restart])

Process a set of additional new points.

calarea(self)

Calculate the polygon area.

close(self)

Close the polygon, i.e., the first point is also the last one.

isinside(self, query[, zerolike])

Compute point location relative to a polygon.

calarea(self)[source]

Calculate the polygon area.

An attribute called area is assigned.

See also

tadlib.calfea.polygon.shoelace

Twice the area of polygon

close(self)[source]

Close the polygon, i.e., the first point is also the last one.

See also

tadlib.calfea.polygon.isinside

judge if points are inside a polygon or not.

Notes

Must be called before tadlib.calfea.polygon.Polygon.isinside().

isinside(self, query, zerolike=1e-12)[source]

Compute point location relative to a polygon.

Parameters
querytuple or array_like

A tuple indicates one point. (The length must be 2) Or you can input an array-like sequence. Any object that can be converted to a ndarray shaped (npoints, 2) is okay.

zerolikefloat

A number used to approximate 0.

Returns
mindstscalar or array_like

If mindst < 0, point is outside the polygon. If mindst = 0, point in on a side of the polygon. If mindst > 0, point is inside the polygon.

Notes

Sloan’s improved version of the Nordbeck and Rydstedt algorithm.

Examples

We start with tadlib.calfea.polygon.Polygon construction:

>>> import numpy as np
>>> from tadlib.calfea.polygon import Polygon
>>> points = np.random.rand(20, 2) # Used for constructing Polygon
>>> P = Polygon(points)
>>> check = np.random.rand(3, 2) # Another 3 random points
>>> P.isinside(check)
array([ 0.21145311,  0.09807244, -0.15341914])
tadlib.calfea.polygon.shoelace(vertices)[source]

Calculate twice the area of polygon using Shoelace formula.

Polygon is defined by vertices.

Parameters
verticesarray_like

Vertex coordinates in a 2-D space. Coordinates must be placed along the last axis. And data points are along the first axis.

Returns
areafloat

You can deduce the order of input vertices from the sign: area is positive if vertices are in counter-clockwise order. area is negative if vertices are in clockwise order. area is zero if all points are colinear.

Notes

This function can be also used to judge if all points in a data set are collinear. Collinear points as input for initializing Polygon instance will raise a QhullError.

Examples

Vertices of a square:

Clockwise:

>>> from tadlib.calfea.polygon import shoelace
>>> sq = [(0,0), (0,1), (1,1), (1,0)]
>>> shoelace(sq)
-2.0

Counter-clockwise:

>>> sq = [(0,0), (1,0), (1,1), (0,1)]
>>> shoelace(sq)
2.0
tadlib.calfea.polygon.collinear(points)[source]

Test whether all given points are collinear.

Collinear points will trigger an error called QhullError when used to initialize a tadlib.calfea.polygon.Polygon instance. However, other conditions may also trigger QhullError. Doing this test in advance can avoid this error and make things clearer.

Parameters
pointsarray_like

Coordinates of points to construct a polygon. Any object that can be converted to a ndarray shaped (npoints, 2) is okay.

Returns
judgebool

True if the input points are collinear else False.

Notes

To judge if all points are collinear, we use a simple heuristic algorithm. We sort the points at first. Then test whether consecutive triples are collinear.

Examples

Trival but still effective:

>>> from tadlib.calfea.polygon import collinear
>>> line = [(2, 0.4), (2, 0.8), (2, 4), (2, 100)]
>>> collinear(line)
True