2013-02-07 Marina Lizio
Whole Genome CTSS files provided by Charles Plessy

FREEZE_PHASE1.1 in whole-genome BED files
=========================================

Introduction
------------

The BED files in this directory recapitulate the expression of all
FREEZE_PHASE1.1 HeliScopeCAGE libraries aligned on the human (hg19) and mouse
(mm9) genomes.  They are sorted by position (with the chromosome names in
lexical order), compressed with bgzip and indexed with tabix.

In CTSS files, column 4 contains the library ID.  For example, here is the
result of the following command.

    tabix hg19.ctss.bed.gz chr1:564595-564596

    chr1  564594  564595  CNhs10740  1   +
    chr1  564594  564595  CNhs10750  1   +
    chr1  564594  564595  CNhs11261  3   +
    chr1  564594  564595  CNhs12850  1   +
    chr1  564594  564595  CNhs13060  1   +
    chr1  564594  564595  CNhs13505  1   +
    chr1  564595  564596  CNhs10750  1   +
    chr1  564595  564596  CNhs10874  3   +
    chr1  564595  564596  CNhs11261  13  +
    chr1  564595  564596  CNhs11761  1   +
    chr1  564595  564596  CNhs11786  6   +
    chr1  564595  564596  CNhs11859  12  +
    chr1  564595  564596  CNhs12311  1   +
    chr1  564595  564596  CNhs12332  1   +
    chr1  564595  564596  CNhs12610  1   +
    chr1  564595  564596  CNhs12627  1   +
    chr1  564595  564596  CNhs12842  2   +
    chr1  564595  564596  CNhs12850  5   +
    chr1  564595  564596  CNhs13060  2   +
    chr1  564595  564596  CNhs13505  7   +

This indicates, for instance, that in the library CNhs11261 there were 3 tags
on chromosome 1 at position 564594 (the BED start coordinate), and 13 at
position 564595.  At position 564594, only the libraries CNhs10740, CNhs10750,
CNhs11261, CNhs12850, CNhs13060, and CNhs13505 had tags.

In TSS files, column 4 contains the genome name, and column 5 contains the
tag counts summed over all libraries.  For example, on the same region as
above:

    tabix hg19.tss.bed.gz chr1:564595-564596

    chr1  564594  564595  hg19  8   +
    chr1  564595  564596  hg19  56  +

The CTSS files work particularly well with the groupby command from the
BEDTools package, for example to sum the tag counts of each library over an
interval.

    tabix hg19.ctss.bed.gz chr1:564595-564596 | sort -k4 | bedtools groupby -g 4 -c 5 -ops sum

    CNhs10740  1
    CNhs10750  2
    CNhs10874  3
    CNhs11261  16
    CNhs11761  1
    CNhs11786  6
    CNhs11859  12
    CNhs12311  1
    CNhs12332  1
    CNhs12610  1
    CNhs12627  1
    CNhs12842  2
    CNhs12850  6
    CNhs13060  3
    CNhs13505  8

Making of
---------

FREEZE_PHASE1.1 BED CTSS files (./f5pipeline/*/*bed.gz) were downloaded into a
local directory called longnames.  The list of the files with their MD5
checksums (md5sum.txt) was also downloaded; it is used below to list human and
mouse libraries separately.

Intermediate BED CTSS files were produced, in which the fourth field was
edited to contain the library identifier instead of the cluster coordinates.

    mkdir -p bgrezip
    for LIB in $(ls longnames | cut -f2 -d '.')
    do
      echo -ne "${LIB}.ctss.bed.gz\t"
      zcat longnames/*${LIB}*ctss.bed.gz |
        awk -v LIB=$LIB '{OFS="\t"} {$4=LIB ; print}' |
        sort -k1,1 -k2,2n -k6,6 |
        bgzip |
        tee bgrezip/${LIB}.ctss.bed.gz |
        md5sum |
        cut -f1 -d' '
    done | tee bgrezip.txt

The files were then indexed with tabix, so that whole chromosomes could later
be retrieved efficiently.

    cd bgrezip
    for BED in *ctss.bed.gz
    do
      tabix -p bed $BED
    done

For each chromosome, all the libraries were pooled and sorted.

    CHRLIST='chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY'
    OUTPUT=mm9.ctss.bed.gz

    rm -rf chrtmp
    mkdir chrtmp

    for CHR in $CHRLIST
    do
      for BED in $( grep $OUTPUT ../md5sum.txt | cut -f 5 -d '.' )
      do
        if [ -e ${BED}.ctss.bed.gz ]
        then
          echo -e "${CHR}\t${BED}" 1>&2
          tabix ${BED}.ctss.bed.gz ${CHR}
        fi
      done | sort --temporary-directory=$(pwd) -k2,2n -k6,6 > chrtmp/${CHR}
      # Make sure there is enough space for sort's temporary files!
    done

    cd chrtmp
    cat $CHRLIST | bgzip > $OUTPUT

The commands above produce a file called mm9.ctss.bed.gz, to be saved
elsewhere.  The file for human libraries was produced using the following
environment variables.

    CHRLIST='chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY'
    OUTPUT=hg19.ctss.bed.gz

The TSS files were produced from the CTSS files with the following command.

    for GENOME in hg19 mm9
    do
      zcat $GENOME.ctss.bed.gz |
        bedtools groupby -g 1,2,3,6 -c 5 -o sum |
        awk -v GENOME=$GENOME '{OFS="\t"} {print $1, $2, $3, GENOME, $5, $4}' |
        bgzip > $GENOME.tss.bed.gz
    done

Quality controls
----------------

All the chromosomes are present, in lexical order.

    zcat hg19.ctss.bed.gz | cut -f 1 | uniq -c

    121890678 chr1
     53321429 chr10
     73576018 chr11
     68109558 chr12
     29371755 chr13
     42488556 chr14
     40208741 chr15
     44954965 chr16
     63438358 chr17
     23501095 chr18
     61988588 chr19
     99045918 chr2
     33303067 chr20
     15327855 chr21
     24321179 chr22
     80876028 chr3
     56693077 chr4
     68831216 chr5
     72170072 chr6
     63707181 chr7
     51816458 chr8
     50103431 chr9
      6931043 chrM
     39980869 chrX
      1083737 chrY

    zcat hg19.tss.bed.gz | cut -f 1 | uniq -c

    34459166 chr1
    18336074 chr10
    19328338 chr11
    19931663 chr12
    12265256 chr13
    13061688 chr14
    12360228 chr15
    11435409 chr16
    13350649 chr17
     9694627 chr18
    10085405 chr19
    33814697 chr2
     9200715 chr20
     4889183 chr21
     5602242 chr22
    28268082 chr3
    23120090 chr4
    23921895 chr5
    23915499 chr6
    21452225 chr7
    19485268 chr8
    16081785 chr9
       27350 chrM
    13408931 chrX
      519303 chrY

    zcat mm9.ctss.bed.gz | cut -f 1 | uniq -c

    37417565 chr1
    29987001 chr10
    44438682 chr11
    22100967 chr12
    22952932 chr13
    22375983 chr14
    24138639 chr15
    20576967 chr16
    26617381 chr17
    18006473 chr18
    20250625 chr19
    46148936 chr2
    29748709 chr3
    35000448 chr4
    37096006 chr5
    31533044 chr6
    39924850 chr7
    29623890 chr8
    32838184 chr9
     2583405 chrM
    17648005 chrX
      169080 chrY

    zcat mm9.tss.bed.gz | cut -f 1 | uniq -c

    14018544 chr1
    10241318 chr10
    12765491 chr11
     8220557 chr12
     8467492 chr13
     8410335 chr14
     8175492 chr15
     7642188 chr16
     8031490 chr17
     6724594 chr18
     5812328 chr19
    15467771 chr2
    10742750 chr3
    11852990 chr4
    12601004 chr5
    11375935 chr6
    11579276 chr7
    10119909 chr8
    10984602 chr9
       24405 chrM
     6197424 chrX
       73878 chrY

All the FREEZE_PHASE1.1 libraries are present.

    grep hg19.ctss.bed.gz md5sum.txt | cut -f 5 -d '.' | wc -l
    988

    zcat hg19.ctss.bed.gz | cut -f 4 | sort | uniq -c | wc -l
    988

    grep mm9.ctss.bed.gz md5sum.txt | cut -f 5 -d '.' | wc -l
    395

    zcat mm9.ctss.bed.gz | cut -f 4 | sort | uniq -c | wc -l
    395

Within each base, the clusters are sorted by strand.  This matters because
bedtools groupby only merges adjacent lines: without the strand sorting, the
TSS file would have more than one entry per base and per strand.  For
instance, here is the interval chr1:564597-564598 when the sorting is correct.

    tabix hg19.tss.bed.gz chr1:564597-564598

    chr1  564596  564597  hg19  9   +
    chr1  564597  564598  hg19  14  -
    chr1  564597  564598  hg19  95  +

Here is the same interval when the sorting was not correct.

    tabix hg19.tss.bed.gz chr1:564597-564598

    chr1  564596  564597  hg19  9   +
    chr1  564597  564598  hg19  4   +
    chr1  564597  564598  hg19  1   -
    chr1  564597  564598  hg19  1   +
    chr1  564597  564598  hg19  1   -
    chr1  564597  564598  hg19  2   +
    chr1  564597  564598  hg19  1   -
    chr1  564597  564598  hg19  5   +
    chr1  564597  564598  hg19  1   -
    chr1  564597  564598  hg19  12  +
    chr1  564597  564598  hg19  1   -
    chr1  564597  564598  hg19  17  +
    chr1  564597  564598  hg19  2   -
    chr1  564597  564598  hg19  6   +
    chr1  564597  564598  hg19  1   -
    chr1  564597  564598  hg19  1   +
    chr1  564597  564598  hg19  2   -
    chr1  564597  564598  hg19  1   +
    chr1  564597  564598  hg19  1   -
    chr1  564597  564598  hg19  2   +
    chr1  564597  564598  hg19  1   -
    chr1  564597  564598  hg19  3   +
    chr1  564597  564598  hg19  1   -
    chr1  564597  564598  hg19  19  +
    chr1  564597  564598  hg19  1   -
    chr1  564597  564598  hg19  22  +

MD5 sums
--------

    087de6600bae03e9fc3e423837c2bf0b  hg19.ctss.bed.gz
    6bdc89b48966f92227e43dc1799cdd2d  hg19.ctss.bed.gz.tbi
    b0607adac1e2c519b9b8738d9bcfa31a  hg19.tss.bed.gz
    5b1ba6335466151efe2662dccf7a04f3  hg19.tss.bed.gz.tbi
    bffc4ac3f762a493eca09a0e4543a483  mm9.ctss.bed.gz
    fb893d2e4e6d7343bbe6cdde0fa27660  mm9.ctss.bed.gz.tbi
    d7158613f3f6640527b7cdd02d69587e  mm9.tss.bed.gz
    2dd3e26540a2087709ad65c55244748f  mm9.tss.bed.gz.tbi

Known issues
------------

Be careful that bedtools intersect -sorted
expects both files to have their chromosome names sorted in lexical order.

See also
--------

 * http://samtools.sourceforge.net/tabix.shtml
 * http://code.google.com/p/bedtools/

 -- Charles Plessy  Fri, 25 Jan 2013 14:06:06 +0900
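
Addendum: checking the chromosome order.  Since the known issue above depends
on the chromosome names being in lexical order, here is a minimal sketch of
how that order might be checked on any BED stream before it is passed to
bedtools intersect.  The function name chroms_in_lexical_order is hypothetical
and not part of the original procedure; the inline sample only stands in for
the real files so that the sketch runs without them.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original procedure): succeeds when the
# chromosome names in column 1 of a BED stream read on standard input appear
# in strictly increasing lexical (C locale) order.
chroms_in_lexical_order() {
    # uniq collapses each run of identical names to a single line;
    # sort -c -u then only checks the order, without sorting anything.
    cut -f 1 | uniq | LC_ALL=C sort -c -u 2>/dev/null
}

# Small inline sample standing in for a real file.
if printf 'chr1\t0\t1\nchr10\t0\t1\nchr2\t0\t1\n' | chroms_in_lexical_order
then
    echo "lexical order: ok"
else
    echo "lexical order: NOT ok"
fi
```

On the real data, the same check would read something like
`zcat hg19.ctss.bed.gz | chroms_in_lexical_order`.  Because uniq is applied
first, a chromosome whose lines are split into two separate runs is also
reported as out of order.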