Human Genome Module

Welcome to the GenomeModule database

Introduction

The identification of motif modules, groups of multiple motifs that frequently occur together in DNA sequences, is an important task in annotating the human genome. To identify motif modules systematically, we developed a computational method that covers the entire non-coding regions around human genes and does not rely on multiple genome alignments.

The GenomeModule database provides genome-wide predictions of functional elements in the human genome at three levels: transcription factor (TF) binding sites (TFBSs), cis-regulatory modules (CRMs), and motif modules. It currently includes 231,790 conserved blocks, of which 116,226 were predicted as CRMs. At a false discovery rate cutoff of 0.05, we also predicted 3,161,839 motif modules, 90.8% of which are supported by various forms of functional evidence. In a comparison with data from 14 ChIP-seq experiments, our method predicted on average 69.6% of the ChIP-seq peaks with TFBSs of multiple TFs.

These predictions are based on a novel algorithm for genome-wide identification of motif modules [1]. All human, mouse and rat ortholog information was collected, and for every human-mouse and human-rat orthologous gene group the non-coding sequences were aligned with the local alignment software CHAOS. The non-coding sequence of a gene comprises the upstream sequence up to the stop codon of the 5’ adjacent gene, the downstream sequence up to the start codon of the 3’ adjacent gene, and the intron sequences within the gene.

Next, for every 1 kb human non-coding region starting from a high-quality aligned segment in the local alignments, we define its “orthologous” region in mouse and in rat as the 1 kb region in that species that aligns best with the human one. Here “best” means that the sum of the local alignment scores of all pairs of segments within the pair of 1 kb regions is maximal. The top-scoring 1 kb human regions were then selected separately for mouse and for rat and merged into non-overlapping blocks. These non-overlapping human blocks potentially contain CRMs, and together they cover about 5% of the human genome.

Within these human blocks, TF binding sites were predicted using the position weight matrix (PWM) method. Finally, FP-tree (frequent pattern tree) techniques were used to identify motif modules, that is, frequently shared combinations of the predicted motifs.
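The last three steps of the pipeline (merging regions into blocks, PWM scanning, and finding frequently shared motif combinations) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation used to build the database: the function names, the toy motifs "M1" and "M2", the uniform background, and the score threshold are all hypothetical, and a brute-force co-occurrence count stands in for the FP-tree, which computes the same kind of frequent combinations far more efficiently.

```python
import math
from collections import Counter
from itertools import combinations


def merge_regions(regions):
    """Merge overlapping (start, end) regions into non-overlapping blocks."""
    blocks = []
    for start, end in sorted(regions):
        if blocks and start <= blocks[-1][1]:
            # Overlaps the previous block: extend it.
            blocks[-1][1] = max(blocks[-1][1], end)
        else:
            blocks.append([start, end])
    return [tuple(b) for b in blocks]


def consensus_pwm(consensus, p=0.85, background=0.25):
    """Toy log-odds PWM (assumption): consensus base has probability p."""
    other = (1 - p) / 3
    return [
        {b: math.log2((p if b == c else other) / background) for b in "ACGT"}
        for c in consensus
    ]


def pwm_scan(seq, pwm, threshold):
    """Return start positions whose summed log-odds score is >= threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(pwm[j][seq[i + j]] for j in range(width))
        if score >= threshold:
            hits.append(i)
    return hits


def frequent_motif_sets(block_motifs, min_support, max_size=3):
    """Motif combinations found in at least min_support blocks.

    Brute-force counting, used here in place of an FP-tree.
    """
    counts = Counter()
    for motifs in block_motifs:
        for k in range(2, max_size + 1):
            for combo in combinations(sorted(motifs), k):
                counts[combo] += 1
    return {c: n for c, n in counts.items() if n >= min_support}


# Hypothetical toy data: two motifs and three short "blocks".
pwms = {"M1": consensus_pwm("TATA"), "M2": consensus_pwm("GGCC")}
block_seqs = ["ACTATAGGCCA", "GGCCTTTATAC", "AAAATATAAAA"]

# Set of motifs with at least one predicted site in each block.
block_motifs = [
    {name for name, pwm in pwms.items() if pwm_scan(seq, pwm, threshold=5.0)}
    for seq in block_seqs
]
modules = frequent_motif_sets(block_motifs, min_support=2)
```

With this toy input, `modules` holds the motif pairs that co-occur in at least two blocks. On genome-scale data the number of candidate combinations grows quickly, which is the reason for using a frequent-pattern structure such as the FP-tree instead of exhaustive enumeration.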

Publication

1. Xiaohui Cai, Lin Hou, Naifang Su, Haiyan Hu, Minghua Deng, Xiaoman Li. Systematic identification of conserved motif modules in the human genome. BMC Genomics 2010, 11:567.

2. Xiaohui Cai, Haiyan Hu, Xiaoman Li. A new measurement of sequence conservation. BMC Genomics 2009, 10:623.