Welcome to iCLIPlib’s documentation!

This library provides a set of tools for manipulating data from RNA Cross-Linking and ImmunoPrecipitation (CLIP)-like experiments. It also includes example scripts that perform common analyses. Most of the code was developed with iCLIP in mind. This documentation is currently mostly a stub while I work on more complete narrative documentation.

Installation

Installation currently relies on the user to install the dependencies. These are:

  • CGAT (www.github.com/CGATOxford/cgat)
  • numpy
  • pysam
  • pandas
  • bx-python

After installation of the dependencies, run python setup.py install, and put the contents of the scripts directory somewhere your shell can find them.

Look out very soon for proper dependency management and installers from conda and PyPI.

Getter functions

The key functions are the *_getter functions, which return a Series of signal across a specified genomic region. The usual way to use them is to pass the file(s) containing the data to the factory function, which returns a function that implements the expected interface. That function can then be passed to various other functions without those functions being aware of the source of the data.

getters.make_getter([bamfile, plus_wig, …]) A factory for getter functions
getters.getter(contig[, start, end, strand, …]) Get the profile of crosslinks across a genomic region, returning the number of crosslinks on each base.
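The factory pattern described above can be sketched as follows. This is only an illustration of the idea, not the library's implementation: make_toy_getter and its in-memory sites dictionary are hypothetical stand-ins for make_getter and a real BAM or wig file, and the half-open interval convention is an assumption.

```python
import pandas as pd


def make_toy_getter(sites):
    """Return a getter closed over an in-memory data source.

    `sites` is a hypothetical stand-in for a BAM/wig file: it maps
    (contig, strand) to a dict of {position: crosslink count}.
    """
    def getter(contig, start=0, end=None, strand="."):
        positions = sites.get((contig, strand), {})
        counts = pd.Series(
            list(positions.values()),
            index=pd.Index(list(positions.keys()), dtype="int64"),
            dtype=float,
        ).sort_index()
        if end is None:
            return counts[counts.index >= start]
        # Assumption: regions are half-open, [start, end).
        return counts[(counts.index >= start) & (counts.index < end)]
    return getter


def total_signal(getter, contig, start, end, strand):
    # Downstream code relies only on the getter interface,
    # not on where the data came from.
    return getter(contig, start, end, strand).sum()


toy = make_toy_getter({("chr1", "+"): {100: 2, 105: 1, 250: 4}})
region_total = total_signal(toy, "chr1", 90, 200, "+")
```

Because total_signal only ever calls the getter, the same function would work unchanged with a getter built from a different kind of file.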

The following should not need to be called directly, but contain details of how bamfiles are converted into cross-link profiles:

counting.getCrosslink
counting.find_first_deletion

Counter functions

The getter functions themselves return counts over regions. The counter functions return pandas.Series over various objects, such as lists of genomic intervals or transcripts. Transcripts are generally handled by the CGAT.GTF interface.

counting.count_intervals(getter, intervals, …) Count the crosslinked bases across a set of intervals
counting.count_transcript(transcript, bam[, …]) Count clip cross-link sites across a transcript
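A rough sketch of counting over a set of intervals with a getter is given here. The names toy_getter and count_over_intervals are hypothetical, and counting each base only once when intervals overlap is an assumption made for this example rather than documented behaviour of counting.count_intervals.

```python
import pandas as pd

# Hypothetical crosslink profile on one contig: index = position, value = count.
profile = pd.Series({100: 2, 105: 1, 250: 4, 300: 1}, dtype=float)


def toy_getter(contig, start, end, strand="+"):
    # Stand-in for a real getter: ignores contig and strand.
    return profile[(profile.index >= start) & (profile.index < end)]


def count_over_intervals(getter, intervals, contig, strand="+"):
    """Sum crosslinks over (start, end) intervals, counting each base once."""
    pieces = [getter(contig, start, end, strand) for start, end in intervals]
    if not pieces:
        return 0.0
    merged = pd.concat(pieces)
    # Assumption: a base covered by two overlapping intervals counts once.
    merged = merged[~merged.index.duplicated()]
    return float(merged.sum())


exon_counts = count_over_intervals(toy_getter, [(90, 110), (100, 260)], "chr1")
```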

Metagene profiles

Metagenes are representations of average profiles. Counts can be binned into a set number of bins across a transcript or using a set bin width. Various normalisation approaches can be applied.

meta.meta_gene(gtf_filelike, bam[, bins, …]) Produce a metagene across a gtf file from CLIP data, where each gene is divided into the same number of bins.
meta.get_binding_matrix(bamfile, …[, …]) Get matrix containing the binding counts across all the genes in the iterator binned into equal sized bins
meta.processing_index(interval_iterator, bam) Calculate the ratio of processed transcripts to non-processed
meta.get_window(profile, position, upstream, …) Return a window around point with an index aligned so the specified point is 0

Users with more complex requirements might also want to look at the underlying binning function:

meta.bin_counts(counts, length, nbins) Aggregate counts into a specified number of spatial bins.
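To make the idea of binning a profile into a fixed number of bins concrete, here is a minimal sketch. bin_profile is a hypothetical helper, and the rule used to assign positions to bins (integer division of position * nbins by length) is an assumption; meta.bin_counts may draw the bin boundaries differently.

```python
import numpy as np
import pandas as pd


def bin_profile(counts, length, nbins):
    """Sum a position-indexed Series into `nbins` near-equal-width bins."""
    # Assumption: position p in [0, length) falls in bin (p * nbins) // length.
    bins = (np.asarray(counts.index) * nbins) // length
    binned = counts.groupby(bins).sum()
    # Report every bin, including those with no crosslinks.
    return binned.reindex(range(nbins), fill_value=0)


counts = pd.Series({0: 1, 4: 2, 5: 3, 9: 1})
binned = bin_profile(counts, length=10, nbins=5)
```

Binning by a fixed number of bins puts transcripts of different lengths on a common axis, which is what allows them to be averaged into a metagene.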

Normalisation functions:

meta.quantile_row_norm(matrix[, quantile, …]) Normalise the rows of a matrix so that 1 corresponds to the given quantile.
meta.sum_row_norm(matrix) Normalise the rows of a matrix by the sum of the row
meta.compress_matrix(matrix[, nrows, ncols]) Compress a matrix to a new number of rows/columns.
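The two row normalisations can be sketched on a small genes-by-bins matrix. sum_norm_rows and quantile_norm_rows are hypothetical names, and leaving all-zero rows as zeros is an assumption of this sketch, not a statement about how the library handles them.

```python
import numpy as np
import pandas as pd


def sum_norm_rows(matrix):
    """Divide each row by its sum so that every non-empty row sums to 1."""
    row_sums = matrix.sum(axis=1)
    # Assumption: rows with no signal stay as zeros instead of dividing by zero.
    return matrix.div(row_sums.replace(0, np.nan), axis=0).fillna(0)


def quantile_norm_rows(matrix, quantile=0.99):
    """Scale each row so that its `quantile` value maps to 1."""
    row_q = matrix.quantile(quantile, axis=1)
    return matrix.div(row_q.replace(0, np.nan), axis=0).fillna(0)


genes = pd.DataFrame(
    [[1.0, 1.0, 2.0], [0.0, 0.0, 0.0], [5.0, 0.0, 5.0]],
    index=["geneA", "geneB", "geneC"],
)
by_sum = sum_norm_rows(genes)
by_max = quantile_norm_rows(genes, quantile=1.0)
```

Normalising rows before averaging stops a handful of highly bound genes from dominating the metagene profile.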

Randomisation

Since we know very little about the statistical distributions involved with many aspects of this sort of data, the significance of patterns is often measured by comparison to randomised patterns. These tools help create that type of analysis:

random.bootstrap(x, bootstraps, fun, *args, …) Calculate bootstraps of a function operating on a Series.
random.boot_ci(x[, quantiles]) Calculate confidence intervals on the results of a function using bootstrapping.
random.count_ratio(x, test, control) Calculate the ratio of the sums of two columns of a DataFrame.
random.ratio_and_ci(test, control[, …]) Calculate the ratio between the sums of two columns and add a confidence interval calculated using bootstrapping.
utils.randomiseSites(profile, start, end[, …]) Randomise clipped sites within an interval.
utils.rand_apply(profile, exon, n, func[, …]) Randomise a profile multiple times and apply a function to each randomised profile.
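The bootstrap-and-confidence-interval idea can be sketched as below. bootstrap_stat and column_ratio are hypothetical names, and resampling rows with replacement followed by a percentile interval is the general technique, assumed here rather than taken from the library's code.

```python
import numpy as np
import pandas as pd


def bootstrap_stat(frame, n_boot, fun, seed=0):
    """Apply `fun` to `n_boot` resamples (with replacement) of the rows."""
    rng = np.random.default_rng(seed)
    n = len(frame)
    stats = [fun(frame.iloc[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    return pd.Series(stats)


def column_ratio(frame, test="test", control="control"):
    # Ratio of summed counts, not the mean of per-row ratios.
    return frame[test].sum() / frame[control].sum()


# Hypothetical per-gene counts for a test and a control sample.
counts = pd.DataFrame({"test": [4, 6, 10, 0, 8], "control": [2, 3, 5, 1, 4]})
boots = bootstrap_stat(counts, 200, column_ratio)
# Percentile confidence interval from the bootstrap distribution.
low, high = boots.quantile([0.025, 0.975])
```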

Coordinates Converter

There are many good reasons to align reads to the genome, but many of our analyses take place on spliced transcripts. The TranscriptCoordInterconverter is a tool for moving between these two worlds, converting genome coordinates to transcript coordinates and vice versa.

iCLIP.utils.TranscriptCoordInterconverter(…) Interconvert between genome domain and transcript domain coordinates.
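The conversion itself can be sketched with two small functions. These are hypothetical helpers, not the interface of TranscriptCoordInterconverter, and they assume 0-based positions and half-open exon intervals.

```python
def genome_to_transcript(pos, exons, strand="+"):
    """Convert a 0-based genome position to a 0-based transcript position.

    `exons` is a list of half-open (start, end) genome intervals.
    """
    # Walk exons in transcript order (reversed on the minus strand).
    exons = sorted(exons, reverse=(strand == "-"))
    offset = 0  # transcript bases contributed by earlier exons
    for start, end in exons:
        if start <= pos < end:
            within = pos - start if strand == "+" else end - 1 - pos
            return offset + within
        offset += end - start
    raise ValueError("position %d is not in an exon" % pos)


def transcript_to_genome(tpos, exons, strand="+"):
    """Inverse of genome_to_transcript."""
    exons = sorted(exons, reverse=(strand == "-"))
    for start, end in exons:
        length = end - start
        if tpos < length:
            return start + tpos if strand == "+" else end - 1 - tpos
        tpos -= length
    raise ValueError("position is beyond the end of the transcript")


exons = [(100, 110), (200, 220)]
plus_pos = genome_to_transcript(205, exons, "+")
minus_pos = genome_to_transcript(205, exons, "-")
```

Intronic positions have no transcript coordinate, so the sketch raises an error for them rather than guessing.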

Clusters

Although there are now many more advanced ways to calculate clusters, the method described by Wang et al., PLoS Biol. 2010:e1000530 (PMID: 2104891) remains the most popular. These tools help to produce analyses of this type:

clusters.Ph(profile, exon, nspread) Calculates a Series, Ph, such that Ph[i] is the P(X >= i) where X is the height of signal on a base of the profile
clusters.fdr(profile, exon, nspread, …) Calculate the FDR of finding a particular height by using randomizations
clusters.get_crosslink_fdr_by_randomisation(…) This function will carry out the assessment of crosslink site significance using the method outlined in Wang Z et al.
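The quantity Ph, with Ph[i] = P(X >= i), can be illustrated with a short sketch. height_survival is a hypothetical name, and computing the probability over only the bases that carry signal is an assumption of this example; clusters.Ph may spread the sites first or use a different denominator.

```python
import pandas as pd


def height_survival(profile):
    """Return a Series S with S[h] = P(X >= h) over the bases in `profile`."""
    # Fraction of bases at each exact height, ordered from tallest to shortest.
    freqs = profile.value_counts(normalize=True).sort_index(ascending=False)
    # Cumulative sum from the tallest height downwards gives P(X >= h).
    return freqs.cumsum().sort_index()


# Hypothetical profile: three bases of height 1, one of height 3, one of height 5.
profile = pd.Series({10: 1, 11: 1, 15: 3, 40: 1, 41: 5})
ph = height_survival(profile)
```

Comparing such a distribution from the observed profile with the same distribution from randomised profiles is the basis of a randomisation-based FDR.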

Measures of similarity and difference between profiles

There is no agreed best way to measure distance between profiles, but these techniques should give you some ideas. Warning, they can be slow…

distance.calcAverageDistance(profile1, profile2) This function calculates the average distance of all pairwise distances in two profiles
distance.findMinDistance(profile1, profile2) Finds mean distance between each cross-link in profile1 and the closest cross-link in profile2
distance.corr_profile(profile1, profile2, …) Calculate the Spearman's correlation coefficient between two (possibly extended) cross-link profiles.
These measures are applied to profiles across whole transcript sets/genomes in reproducibility_by_exon.py.
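As one example of such a measure, the mean distance from each cross-link in one profile to the closest cross-link in another can be sketched like this. mean_nearest_distance is a hypothetical name, and ignoring the counts at each site is an assumption of this sketch; distance.findMinDistance may weight sites differently.

```python
import numpy as np
import pandas as pd


def mean_nearest_distance(profile1, profile2):
    """Mean distance from each site in profile1 to the closest site in profile2."""
    sites1 = np.asarray(profile1.index)
    sites2 = np.asarray(profile2.index)
    # All pairwise distances: time and memory grow as len1 * len2.
    pairwise = np.abs(sites1[:, None] - sites2[None, :])
    return float(pairwise.min(axis=1).mean())


# Hypothetical replicate profiles: index = position, value = count.
rep1 = pd.Series({100: 1, 150: 2, 300: 1})
rep2 = pd.Series({102: 3, 160: 1})
forward = mean_nearest_distance(rep1, rep2)
backward = mean_nearest_distance(rep2, rep1)
```

The measure is not symmetric, and the pairwise matrix shows why distance calculations of this kind become slow on dense profiles.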

Analysis of kmers

Tools for examining the distribution and enrichment of kmers and other landmarks around clipped bases.

kmers.find_all_matches(sequence, regexes) Return start positions of a collection of regexes
kmers.pentamer_frequency(profile, length, …) Calculate the frequency of overlaps between a list of cross-link sites (possibly extended) and a collection of sites
kmers.pentamer_enrichment(…[, …]) This function calculates the z-score of enrichment of kmers around
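Finding the start positions of a collection of regular expressions in a sequence can be sketched with the standard re module. match_starts is a hypothetical helper, and reporting overlapping matches is an assumption made for this example rather than documented behaviour of kmers.find_all_matches.

```python
import re


def match_starts(sequence, patterns):
    """Return sorted start positions of all (overlapping) pattern matches."""
    starts = set()
    for pattern in patterns:
        # A lookahead is tried at every position, so overlapping hits are kept.
        lookahead = re.compile("(?=%s)" % pattern)
        starts.update(m.start() for m in lookahead.finditer(sequence))
    return sorted(starts)


hits = match_starts("UUUUAGUUU", ["UUU", "UAG"])
```

These start positions can then be compared with cross-link positions to ask how often a kmer lies at or near a clipped base.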

Other

utils.spread(profile, bases[, reindex, …]) Extend cross-link sites in both directions.
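Extending cross-link sites in both directions can be sketched as follows. spread_sites is a hypothetical name, and adding counts together where extended sites overlap is an assumption of this sketch, not a description of utils.spread.

```python
import pandas as pd


def spread_sites(profile, bases):
    """Extend each cross-link site by `bases` positions in both directions."""
    extended = {}
    for position, count in profile.items():
        for offset in range(-bases, bases + 1):
            # Assumption: overlapping extensions add up.
            target = position + offset
            extended[target] = extended.get(target, 0) + count
    return pd.Series(extended).sort_index()


profile = pd.Series({10: 2, 12: 1})
extended = spread_sites(profile, bases=1)
```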
