Quick Start

This tutorial covers the basic usage of 4 subcommands (mapping, filtering, binning, and pileup) provided by runHiC. We will first download an example Hi-C dataset and corresponding reference genome data. Then we will process the Hi-C data step by step from raw sequencing reads (.sra, .fastq, and .fastq.gz) to the ICE-corrected contact matrices. Lastly, we will demonstrate how to streamline the processing pipeline by using the pileup subcommand.

Data Preparation

First of all, let’s make a temporary blank folder somewhere in your system, change your working directory to it, and make two sub-folders named data and workspace within it:

$ mkdir data
$ mkdir workspace
$ ls -lh

total 0
drwxr-xr-x  2 xtwang  staff    64B Sep 16 11:24 data
drwxr-xr-x  2 xtwang  staff    64B Sep 16 11:25 workspace

During this tutorial, all input data including the raw sequencing data and the reference genome data will be placed under the data sub-folder, and runHiC will be run under the workspace sub-folder.

Then download an example Hi-C dataset using the prefetch command of the SRA toolkit:

$ cd data
$ mkdir HiC-SRA
$ cd HiC-SRA
$ prefetch -o SRR027956.sra SRR027956
$ prefetch -o SRR027958.sra SRR027958
$ ls -lh

total 1.4G
-rw-r--r-- 1 xtwang staff 623M Sep 16 11:49 SRR027956.sra
-rw-r--r-- 1 xtwang staff 783M Sep 16 11:53 SRR027958.sra

runHiC currently supports three read formats: SRA (Sequence Read Archive), FASTQ, and compressed FASTQ (.fastq.gz). To demonstrate this, let’s first dump reads using the fastq-dump command, and then compress the FASTQ files using gzip:

$ for i in ./*.sra; do fastq-dump --split-files $i; done
$ for i in ./*.fastq; do gzip -c $i > `basename $i`.gz; done
$ cd ..
$ mkdir HiC-FASTQ
$ mkdir HiC-gzip
$ mv ./HiC-SRA/*.fastq ./HiC-FASTQ
$ mv ./HiC-SRA/*.gz ./HiC-gzip

Note

If your FASTQ files are suffixed with “R1.fastq” and “R2.fastq”, please make sure rename them as “_1.fastq” and “_2.fastq” before you run runHiC, as if they are dumped from an sra file.

Then download the reference genome (hg38) data from UCSC:

$ mkdir hg38
$ cd hg38
$ wget ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/*

Note that above commands can be modified to download any other genomes available in UCSC, by replacing “hg38” with the desired reference genome release name.

Let’s include chromosomes that are completely assembled only. To do so, open a python interpreter and follow the commands below:

>>> import os, glob
>>> labels = list(map(str,range(1,23))) + ['X','Y','M']
>>> pool = ['chr{0}.fa.gz'.format(i) for i in labels]
>>> for c in glob.glob('*.fa.gz'):
...     if not c in pool:
...         os.remove(c)
>>> exit()

Finally, uncompress the .gz files and merge all chromosomes into hg38.fa:

$ gunzip *.gz
$ cat *.fa > hg38.fa
$ cd ../..

Mapping

The first step of runHiC is conducted by the mapping subcommand, which maps raw sequencing reads to the reference genome and parses the resulted SAM/BAM alignments into .pairs.

Usage

runHiC mapping [options]

Create the Meta Data File

Before running runHiC, another thing you need to do is to create a TXT file named “datasets.tsv” under the workspace sub-folder:

$ cd workspace
$ vim datasets.tsv

And fill in the following content:

SRR027956 GM06990 R1 HindIII
SRR027958 GM06990 R2 HindIII

Here, “datasets.tsv” is a meta data file describing your Hi-C data, which should contain 4 columns. In order, they are: the prefix of the SRA file name (in the case of the FASTQ read format, it should be the leading part of the file names apart from the “_1.fastq” or “_2.fastq” substring), cell line name, biological replicate label, and the restriction enzyme name. Note that for Arima Hi-C, you can set the enzyme name to Arima; for experiments that do not use restriction enzymes for DNA fragmentation, you can set the enzyme name arbitrarily for your record. For example, for Micro-C, you can set it to MNase; for ChIA-PET, you can set it to sonication.

runHiC Command

Now type and execute the command below:

$ runHiC mapping -p ../data/ -g hg38 -f HiC-SRA -F SRA -A bwa-mem -t 10 --include-readid --drop-seq --chunkSize 1500000 --logFile runHiC-mapping.log

For FASTQ and the compressed FASTQ format, replace “HiC-SRA” with “HiC-FASTQ” or “HiC-gzip”, and reset the “-F” argument accordingly:

$ runHiC mapping -p ../data/ -g hg38 -f HiC-gzip -F FASTQ -A bwa-mem -t 10 --include-readid --drop-seq --chunkSize 1500000 --logFile runHiC-mapping.log

Two sub-folders named alignments-hg38 and pairs-hg38 will be created under current working directory (workspace). During this process:

  1. Read pairs will be mapped to the hg38 reference genome with bwa mem, and the alignment results will be reported in BAM format and placed under alignments-hg38.

  2. BAM files will be parsed into .pairs using pairtools under pairs-hg38.

runHiC currently supports three read aligners, bwa-mem, chromap, and minimap2. You can switch it by the -A/--aligner argument.

During the alignment parsing, runHiC detects ligation junctions, marks various situations (Unmapped, Multimapped, Multiple ligations-Walks, and valid Single ligations), and sorts pairs for further analysis. In this example, .pairsam.gz files under pairs-hg38 are valid .pairs files defined by the 4DN consortium. By default, it will only contain 7 columns: chr1, pos1, chr2, pos2, strand1, strand2, and pair_type; if you add --include-readid on the command, you will get an additional “readID” column; if you specify --include-sam, two extra columns “sam1” and “sam2” will be added to store the original alignments; if you add --drop-seq, SEQ and QUAL will be removed from the sam fields to save the disk space.

Filtering

The filtering subcommand of runHiC is designed to perform basic filtering procedures on the aligned read pairs. These filtering procedures include:

  1. Remove redundant PCR artifacts.

  2. Remove the read pair that maps to the same restriction fragment (since version 0.8.5, runHiC only performs this filtering if you specify --add-frag when you run runHiC mapping).

During the filtering process, runHiC also records read-level, fragment-level and the contact-level statistics for quality assessment of your Hi-C data. (See quality)

Here’s the command you should type in the terminal:

$ runHiC filtering --pairFolder pairs-hg38/ --logFile runHiC-filtering.log --nproc 10

That will create a new sub-folder named filtered-hg38. Please find the final valid contact pairs in .pairs.gz files. If you specify --include-sam when you run runHiC mapping, it will also output a .bam file accompanying each .pairs.gz file to store alignments that passed all filtering criteria.

Binning

In this step, an .mcool file will be produced under the coolers-hg38 sub-folder for each .pairs.gz file using cooler. The mcool format is the official Hi-C data format for the 4DN consortium and can be visualized using HiGlass:

$ runHiC binning -f filtered-hg38/ --logFile runHiC-binning.log --nproc 10

Pileup

runHiC also provides a handy subcommand called “pileup” by which you can perform all processing steps above using the single-line command below:

$ runHiC pileup -p ../data/ -g hg38 -f HiC-SRA -F SRA -A bwa-mem -t 10 --include-readid --drop-seq --chunkSize 1500000 --logFile runHiC.log