Experiment Quality Assessment

In this tutorial, I will show you how runHiC can be used in data quality assessment.

All you need to do is to run a sinlge-line command after runHiC filtering or runHiC pileup has been finished (refer to quickstart for more details):

$ runHiC quality -m datasets.tsv -L filtered-hg38

Statistic Table

Then statistic tables about your data can be found in .stats files under the *filtered-hg38 sub-folder.

Here’s a snapshot:

000_SequencedReads:   14332993
        010_DoubleSideMappedReads:  10323445
        020_SingleSideMappedReads:  3238504
        030_UnmappedReads:  771044
100_NormalPairs:  10323445
        110_AfterFilteringReads:  9050818
        120_SameFragmentReads:  1193185
                122_SelfLigationReads:  245343
                124_DanglingReads:  934456
                126_UnknownMechanism:  13386
        130_DuplicateRemoved:  79442
400_TotalContacts:  9050818
        410_IntraChromosomalReads:  4170313
                412_IntraLongRangeReads(>=20Kb):  2812664
                412_IntraShortRangeReads(<20Kb):  1357649
        420_InterChromosomalReads:  4880505

Critical Indicators:
Double Unique Mapped Ratio = 10323445 / 14332993 = 0.7203
Self-Ligation Ratio = 245343 / 14332993 = 0.0171
Dangling-Reads Ratio = 934456 / 14332993 = 0.0652
Long-Range Ratio = 2812664 / 9050818 = 0.3108
Data Usage = 9050818 / 14332993 = 0.6315

The following table lists possible statistic names and their meanings:

Statistic Name

Meaning

000_SequencedReads

Total number of sequenced read pairs

010_DoubleSideMappedReads

Number of read pairs of which both sides can be uniquely mapped to the reference genome.

020_SingleSideMappedReads

Number of read pairs of which only one side can be uniquely mapped to the reference genome.

030_UnmappedReads

Number of read pairs of which neither side can be uniquely mapped to the reference genome.

100_NormalPairs

Number of read pairs of which both sides can be uniquely mapped.

110_AfterFilteringReads

Number of read pairs that have passed all filtering criteria.

120_SameFragmentReads

Number of read pairs of which both sides are mapped to the same restriction fragment. Such read pairs are filtered out in our pipeline.

122_SelfLigationReads

Number of read pairs deriving from self-circularized ligation product. The two sides are mapped to the same restriction fragment and face in opposite directions.

124_DanglingReads

Both sides of these read pairs are mapped to the same fragment and face toward each other. There can be many causes of such products, ranging from low ligation efficiency to poor streptavidin specificity.

126_UnknownMechanism

Unknown sources of “120_SameFragmentReads”. Both sides are mapped to the same strand.

310_DuplicatedRemoved

Number of read pairs from PCR products.

400_TotalContacts

Number of read pairs from true contacts, i.e., the remaining read pairs after all filtering processes.

410_IntraChromosomalReads

Number of intra-chromosomal contacts

412_IntraLongRangeReads

Number of long-range contacts (genomic distance >= 20Kb)

412_IntraShortRangeReads

Number of short-range contacts (genomic distance < 20Kb)

420_InterChromosomalReads

Number of inter-chromosomal contacts

Note that we try to organize these statistics hierarchically using indentation, so that, for example, “010_DoubleSideMappedReads”, “020_SingleSideMappedReads” and “030_UnmappedReads” constitutes “000_SequencedReads”.

At the bottom of the statistic table, we include some important quality indicators:

  1. Unique-Mapping Ratio. A low value of this metric indicates low sequencing quality, sample contamination, or poor genome assembly.

  2. Self-Ligation Ratio.

  3. Dangling-Reads Ratio.

  4. Long-Range Ratio. A low value (<0.15) of this metric indicates failed experiment.

Library-size Estimation

Dangling reads can be applied to estimate your library size in nature. Here’s an example of size distribution of dangling read molecules from a typical 300~500bp library:

_images/GM06990-HindIII-allReps-librarySize.png

The inconsistency between this distribution and the experimental library size suggests a failure in DNA size selection.

Read-pair Type Plotting

Read-pair type ratios will also be reported under filtered-hg38. Intra-chromosomal contacts are broken down into four types: “left pair” (both sides map to the reverse strand), “right pair” (both sides map to the forward strand), “inner pair” (two sides map to different strands and point towards each other) and “outer pair” (two sides map to different strands and point away from one another). If reads come from proximity ligation, each pair type should account for roughly 25% of contacts. Thus, distance at which the percentage of each type converges to 25% is a good indication of the minimum distance at which it is meaningful to examine Hi-C contact patterns. Here’s an example below:

_images/GM06990-HindIII-allReps-PairType.png

We can see a distinct turning point around 40kb. While there may be several unknown mechanisms making biases below this point, we should only consider contacts whose genomic distances are greater than 40kb in further deeper analysis.