TAD Data Parsing¶
-
tadlib.calfea.analyze.
load_TAD
(source_fil, chromname=None, cols=[0, 1, 2])[source]¶ Load TAD data from a TXT file.
- Parameters
- sourcestr
Path to the TAD file.
- chromnameNone or str
Template of chromosome names. Suppose
chromname = 'chr'
, then the source data should be as follows:chr1 0 350000 chr1 350000 800000 chr1 800000 1450000
Default: None
- colslist of int (length: 3)
Which columns to read, with 0 being the first. For example,
cols = [0, 1, 2]
will extract the 1st, 2nd and 3rd columns. (Default: [0, 1, 2])
- Returns
- dataNumpy Structured Array
The parsed TAD intervals are contained in a numpy structured array containing 3 fields: “chr”, “start” and “end”.
Intra-TAD Interaction Analysis¶
-
class
tadlib.calfea.analyze.
Core
(matrix, left=0)[source]¶ Interaction analysis at TAD level.
High IFs off the diagonal region can be identified using
tadlib.calfea.analyze.Core.longrange()
.tadlib.calfea.analyze.Core.DBSCAN()
performs a density-based clustering algorithm to detect aggregation patterns in those IFs. Furthermore, two structural features, called AP (Aggregation Preference) and Coverage in our original research, can be calculated bytadlib.calfea.analyze.Core.gdensity()
andtadlib.calfea.analyze.Core.totalCover()
respectively.- Parameters
- matrixnumpy.ndarray, (ndim = 2)
Interaction matrix of a TAD.
- leftint
Starting point of TAD. For example, if the bin size is 10kb,
left = 50
means position 500000(bp) on the genome.
- Attributes
- newMnumpy.ndarray, (ndim = 2)
Gap-free interaction matrix.
- convertlist
Information required for converting newM to matrix.
- cEMnumpy.ndarray, (ndim = 2)
Expected interaction matrix. An upper triangular matrix. Value in each entry will be used to construct a Poisson Model for statistical significance calculation.
- fEnumpy.ndarray, (ndim = 2)
An upper triangular matrix. Each entry represents the fold enrichment of corresponding observed interaction frequency.
- Psnumpy.ndarray, (ndim = 2)
An upper triangular matrix. Value in each entry indicates the p-value under corresponding Poisson Model.
- posnumpy.ndarray, (shape = (N, 2))
Coordinates of the selected IFs in newM.
- Npint
Number of the selected IFs.
- cluster_idnumpy.ndarray, (shape = (N,))
Cluster labels for each point in pos. -1 indicates noisy points.
- Ncint
Cluster number.
- clustersdict
Details of each cluster. “Average density”, “radius”, “area (polygon)”, “point coordinates”, and “object number” are all recorded.
- Hullsdict
Details of each convex hull (clusters which can be enclosed by a convex polygon).
- ptracelist of int
Labels of clusters in which all points are collinear. These clusters cannot be enclosed by a convex polygon.
- gdenfloat, [0, 1]
Weighted average density.
- coveragefloat, [0, 1]
Total coverage of clusters.
Methods
DBSCAN
(self)Detect natural patterns in selected interactions using DBSCAN.
convertMatrix
(self, M)Convert an internal gap-free matrix(e.g., newM, cEM, fE, and Ps) into a new matrix with the same shape as the original interaction matrix by using the recorded index map(see the convert attribute).
gdensity
(self)Weighted density calculation.
longrange
(self[, pw, ww, top, ratio])Select statistically significant interactions of the TAD.
totalCover
(self)Total coverage of clusters.
-
DBSCAN
(self)[source]¶ Detect natural patterns in selected interactions using DBSCAN.
DBSCAN is a dennsity-based clustering algorithm. [1] Two input parameters eps and MinPts are calculated in an analytical way. [2] Before further analysis, some basic features are calculated for each cluster, including “density”, “radius” and the “area”, among which, “area” stands for corresponding convex polygon area.
See also
sklearn.cluster.DBSCAN
an implementation of DBSCAN
tadlib.calfea.polygon.Polygon
calculations based on polygon.
Notes
Both “radius” and “density” are defined based on core objects of a cluster. “radius” is the average distance from the core object to its MinPts-nearest neighbors while “density” is the inverse of it.
References
- 1
Ester M, Kriegel H, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, AAAI, Press, 1996, pp. 226-231
- 2
Daszykowski M, Walczak B, Massart DL. Looking for Natural Patterns in Data, Part 1: Density Based Approach. Chemom. Intell. Lab. Syst., 2001, 56: 83-92.
-
convertMatrix
(self, M)[source]¶ Convert an internal gap-free matrix(e.g., newM, cEM, fE, and Ps) into a new matrix with the same shape as the original interaction matrix by using the recorded index map(see the convert attribute).
-
gdensity
(self)[source]¶ Weighted density calculation.
tadlib.calfea.analyze.Core.longrange()
andtadlib.calfea.analyze.Core.DBSCAN()
have to be called in advance.Density of a TAD is the weighted average density of each cluster. Weight is the ratio of object number of a cluster to
Np
.
-
longrange
(self, pw=2, ww=5, top=0.7, ratio=0.05)[source]¶ Select statistically significant interactions of the TAD. Both genomic distance and local interaction background are taken into account.
- Parameters
- pwint
Width of the peak region. Default: 2
- wwint
Width of the donut. Default: 5
- topfloat, [0.5, 1]
Parameter for noisy interaction filtering. Default: 0.7
- ratiofloat, [0.01, 0.1]
Specifies the sample size of significant interactions. Default: 0.05
Notes
pw and ww are sensitive to data resolution. It performs well when we set pw to 4 and ww to 7 at 5 kb, and (2, 5) at 10 kb. [1]
References
- 1
Rao, S.S., Huntley, M.H., Durand, N.C. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell, 2014, 159: 1665-1680.
-
totalCover
(self)[source]¶ Total coverage of clusters.
tadlib.calfea.analyze.Core.longrange()
andtadlib.calfea.analyze.Core.DBSCAN()
have to be called in advance.
Matrix Manipulating¶
-
tadlib.calfea.analyze.
manipulation
(matrix, start=0)[source]¶ Remove gaps of the original interaction matrix.
- Parameters
- matrixnumpy.ndarray, (ndim = 2)
Interaction matrix.
- startint
The begining of the region. (Default: 0)
- Returns
- newMnumpy.ndarray, (ndim = 2)
The gap-removed matrix.
- convertlist
The first element is the index map from newM to matrix, and the second element records the length of matrix.
Examples
>>> import numpy as np >>> from tadlib.calfea.analyze import manipulation >>> matrix = np.random.rand(4, 4) >>> matrix[1,:] = 0; matrix[:,1] = 0 >>> print matrix [[ 0.24822414 0. 0.07782508 0.01812965] [ 0. 0. 0. 0. ] [ 0.93870151 0. 0.21986474 0.20462965] [ 0.13022712 0. 0.78674168 0.77068304]]
>>> newM, convert = manipulation(matrix) >>> print newM [[ 0.24822414 0.07782508 0.01812965] [ 0.93870151 0.21986474 0.20462965] [ 0.13022712 0.78674168 0.77068304]]
>>> print convert [{0: 0, 1: 2, 2: 3}, 4]
Polygon Creation and Operations¶
Here, polygon means convex polygon. In the simplest case, a polygon can be created or presented by vertices. More ofen, you don’t know the vertices but want a polygon enclosing a set of points.
ConvexHull defined in scipy.spatial module uses the Qhull library to compute convex hull for a finite set of points, but doesn’t provide any further polygon operations.
We fix these problems by customizing more methods for ConvexHull.
Note
Obviously, collinear point sets cannot be used to construct polygon. So a collinear test should be performed in advance.
-
class
tadlib.calfea.polygon.
Polygon
(points)[source]¶ Main class for polygon creation, manipulation and calulation.
The API is inherited from
scipy.spatial.ConvexHull
, which constructs convex hull from a point set. So you should input coordinates of a point set to initiate an instance.More general, not all points are required. Vertices alone are enough to construct the convex hull.
Also, this class is designed to operate on 2-D space, although ConvexHull is okay for higher dimensions.
- Parameters
- pointsarray_like
Coordinates of points to construct a convex hull from. All objects that can be converted to a ndarray shaped (npoints, 2) are okay.
Examples
Convex hull of a random set of points:
>>> import numpy as np >>> from tadlib.calfea.polygon import Polygon >>> points = np.random.rand(20, 2) # 20 random points in 2-D space >>> P = Polygon(points) >>> print P.anchors [[ 0.71119071 0.216714 ] [ 0.92908323 0.77903127] [ 0.79530032 0.93279735] [ 0.02160083 0.98758203] [ 0.09094998 0.4423167 ] [ 0.16924963 0.05900692] [ 0.71119071 0.216714 ]]
- Attributes
- Common attributes from **ConvexHull:**
- pointsndarray of double, shape (npoints, 2)
Coordinates of input points. Converted ndarray.
- verticesndarray of ints, shape (nvertices,)
Indices of points forming the vertices of the convex hull. The vertices are counter-clockwise ordered.
- simplicesndarray of ints, shape (nfacet, 2)
Indices of points forming the simplical facets of the convex hull.
- Customized attributes:
- anchorsndarray of double, shape (nvertices,)
Vertices of the convex hull, counter-clockwise ordered.
- areafloat
Area of the polygon.
Methods
add_points
(points[, restart])Process a set of additional new points.
calarea
(self)Calculate the polygon area.
close
(self)Close the polygon, i.e., the first point is also the last one.
isinside
(self, query[, zerolike])Compute point location relative to a polygon.
-
calarea
(self)[source]¶ Calculate the polygon area.
An attribute called area is assigned.
See also
tadlib.calfea.polygon.shoelace
Twice the area of polygon
-
close
(self)[source]¶ Close the polygon, i.e., the first point is also the last one.
See also
tadlib.calfea.polygon.isinside
judge if points are inside a polygon or not.
Notes
Must be called before
tadlib.calfea.polygon.Polygon.isinside()
.
-
isinside
(self, query, zerolike=1e-12)[source]¶ Compute point location relative to a polygon.
- Parameters
- querytuple or array_like
A tuple indicates one point. (The length must be 2) Or you can input an array-like sequence. Any object that can be converted to a ndarray shaped (npoints, 2) is okay.
- zerolikefloat
A number used to approximate 0.
- Returns
- mindstscalar or array_like
If mindst < 0, point is outside the polygon. If mindst = 0, point in on a side of the polygon. If mindst > 0, point is inside the polygon.
Notes
Sloan’s improved version of the Nordbeck and Rydstedt algorithm.
Examples
We start with
tadlib.calfea.polygon.Polygon
construction:>>> import numpy as np >>> from tadlib.calfea.polygon import Polygon >>> points = np.random.rand(20, 2) # Used for constructing Polygon >>> P = Polygon(points) >>> check = np.random.rand(3, 2) # Another 3 random points >>> P.isinside(check) array([ 0.21145311, 0.09807244, -0.15341914])
-
tadlib.calfea.polygon.
shoelace
(vertices)[source]¶ Calculate twice the area of polygon using Shoelace formula.
Polygon is defined by vertices.
- Parameters
- verticesarray_like
Vertex coordinates in a 2-D space. Coordinates must be placed along the last axis. And data points are along the first axis.
- Returns
- areafloat
You can deduce the order of input vertices from the sign: area is positive if vertices are in counter-clockwise order. area is negative if vertices are in clockwise order. area is zero if all points are colinear.
Notes
This function can be also used to judge if all points in a data set are collinear. Collinear points as input for initializing Polygon instance will raise a QhullError.
Examples
Vertices of a square:
Clockwise:
>>> from tadlib.calfea.polygon import shoelace >>> sq = [(0,0), (0,1), (1,1), (1,0)] >>> shoelace(sq) -2.0
Counter-clockwise:
>>> sq = [(0,0), (1,0), (1,1), (0,1)] >>> shoelace(sq) 2.0
-
tadlib.calfea.polygon.
collinear
(points)[source]¶ Test whether all given points are collinear.
Collinear points will trigger an error called QhullError when used to initialize a
tadlib.calfea.polygon.Polygon
instance. However, other conditions may also trigger QhullError. Doing this test in advance can avoid this error and make things clearer.- Parameters
- pointsarray_like
Coordinates of points to construct a polygon. Any object that can be converted to a ndarray shaped (npoints, 2) is okay.
- Returns
- judgebool
True if the input points are collinear else False.
See also
Notes
To judge if all points are collinear, we use a simple heuristic algorithm. We sort the points at first. Then test whether consecutive triples are collinear.
Examples
Trival but still effective:
>>> from tadlib.calfea.polygon import collinear >>> line = [(2, 0.4), (2, 0.8), (2, 4), (2, 100)] >>> collinear(line) True