Chromosome Phasing

HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

$ conda create -n hapcut2 -c bioconda hapcut2 -y

The following data types are supported:
NGS short reads (Illumina HiSeq)
single-molecule long reads (PacBio and Oxford Nanopore)
Linked-Reads (e.g. 10X Genomics, stLFR or TELL-seq)
proximity-ligation (Hi-C) reads
high-coverage sequencing (>40x coverage-per-SNP) using above technologies
combinations of the above technologies (e.g. scaffold long reads with Hi-C reads)

Running it takes two steps:

  1. ./build/extractHAIRS [options] --bam reads.sorted.bam --VCF variants.vcf --out fragment_file
    Extracts the haplotype-informative reads (fragments).
  2. ./build/HAPCUT2 --fragments fragment_file --VCF variants.vcf --output haplotype_output_file
    Assembles the haplotypes.
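
The two steps above can be wrapped in a small shell function. This is only a sketch: it assumes the conda install puts `extractHAIRS` and `HAPCUT2` on the PATH (instead of `./build/`), and the `run` and `phase_sample` helpers and the output names are my own placeholders. By default it only prints the commands so they can be checked first.

```shell
#!/bin/sh
# Sketch of the two-step HapCUT2 workflow; file names are placeholders.
# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 executes it.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: phase_sample <sorted.bam> <variants.vcf> <output prefix>
phase_sample() {
    bam=$1; vcf=$2; prefix=$3
    # Step 1: extract haplotype-informative fragments from the alignments.
    run extractHAIRS --bam "$bam" --VCF "$vcf" --out "$prefix.fragments" || return 1
    # Step 2: assemble haplotype blocks from the fragments, using the same VCF.
    run HAPCUT2 --fragments "$prefix.fragments" --VCF "$vcf" --output "$prefix.hap"
}
```

For example, `phase_sample reads.sorted.bam variants.vcf sample1` prints the two commands, and `DRY_RUN=0 phase_sample reads.sorted.bam variants.vcf sample1` actually runs them.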

Input files: a sorted BAM file of aligned reads and an uncompressed VCF file.
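
Since the VCF has to be uncompressed, a gzipped VCF needs unpacking first. A minimal sketch, assuming the compressed file ends in `.gz`; the `ensure_plain_vcf` helper name is my own:

```shell
# Print the path of an uncompressed VCF, unpacking a .gz copy next to it if needed.
ensure_plain_vcf() {
    vcf=$1
    case "$vcf" in
        *.gz)
            plain=${vcf%.gz}
            # gzip -dc also reads bgzip-compressed files (bgzip output is valid gzip).
            gzip -dc "$vcf" > "$plain" || return 1
            echo "$plain"
            ;;
        *)
            # Already uncompressed: use it as is.
            echo "$vcf"
            ;;
    esac
}
```

Usage: `vcf=$(ensure_plain_vcf variants.vcf.gz)`, then pass `"$vcf"` to both extractHAIRS and HAPCUT2 so that the two steps see exactly the same file.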

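I am not sure how this behaves on a hexaploid genome (HapCUT2 is documented as a phasing tool for diploid genomes), so I will just try it and see.

I usually only understand a tool by debugging my way through it (mostly because I lack the patience to read first).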

./extractHAIRS [options] --bam reads.sorted.bam --VCF variants.VCF --out fragment_file

Options:
--qvoffset <33/64> : quality value offset, 33/64 depending on how quality values were encoded, default is 33
--mbq <INTEGER> : minimum base quality to consider a base for haplotype fragment, default 13
--mmq <INTEGER> : minimum read mapping quality to consider a read for phasing, default 20
--realign_variants <0/1> : Perform sensitive realignment and scoring of variants.
--hic <0/1> : sets default maxIS to 40MB, prints matrix in new HiC format
--10X <0/1> : 10X reads. NOTE: Output fragments MUST be processed with LinkReads.py script after extractHAIRS to work with HapCUT2.
--pacbio <0/1> : Pacific Biosciences reads. Similar to --realign_variants, but with alignment parameters tuned for PacBio reads.
--ONT, --ont <0/1> : Oxford nanopore technology reads. Similar to --realign_variants, but with alignment parameters tuned for Oxford Nanopore Reads.
--new_format, --nf <0/1> : prints matrix in new format. Requires --new_format option when running HapCUT2.
--VCF <FILENAME> : variant file with genotypes for a single individual in VCF format (unzipped)
--maxIS <INTEGER> : maximum insert size for a paired-end read to be considered as a single fragment for phasing, default 1000
--minIS <INTEGER> : minimum insert size for a paired-end read to be considered as single fragment for phasing, default 0
--PEonly <0/1> : do not use single end reads, default is 0 (use all reads)
--indels <0/1> : extract reads spanning INDELS, default is 0, variants need to be specified in VCF format to use this option
--noquality <INTEGER> : if the bam file does not have quality string, this value will be used as the uniform quality value, default 0
--triallelic <0/1> : include variants with genotype 1/2 for parsing, default 0
--ref <FILENAME> : reference sequence file (in fasta format, gzipped is okay), optional but required for indels, should be indexed
--out <FILENAME> : output filename for haplotype fragments, if not provided, fragments will be output to stdout
--region chr:start-end : chromosome and region in BAM file, useful to process individual chromosomes or genomic regions
--ep <0/1> : set to 1 to estimate HMM parameters from aligned reads (only with long reads), default = 0
--hom <0/1> : set to 1 to include homozygous variants for processing, default = 0 (only heterozygous)
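
To keep the technology switches straight, here is a small sketch that maps a data type to the extractHAIRS flag listed above and prints the resulting command. The `hairs_flags` and `hairs_cmd` helpers and the technology labels are my own invention. Two things it does not cover: 10X fragments still need the post-processing script mentioned under `--10X`, and the PacBio/ONT modes realign variants, which as far as I know also needs the reference passed with `--ref` (an assumption on my part, not stated in the help text).

```shell
# Map a technology label (hypothetical names) to the extractHAIRS switch from the help text.
hairs_flags() {
    case "$1" in
        hic)      echo "--hic 1" ;;
        10x)      echo "--10X 1" ;;
        pacbio)   echo "--pacbio 1" ;;
        ont)      echo "--ont 1" ;;
        illumina) echo "" ;;   # plain short reads: no extra switch
        *)        echo "unknown technology: $1" >&2; return 1 ;;
    esac
}

# Print (not run) the extractHAIRS command for a given technology.
hairs_cmd() {
    tech=$1; bam=$2; vcf=$3; out=$4
    flags=$(hairs_flags "$tech") || return 1
    # $flags is deliberately unquoted so "--hic 1" splits into two words.
    echo extractHAIRS $flags --bam "$bam" --VCF "$vcf" --out "$out"
}
```

Usage: `hairs_cmd hic reads.sorted.bam variants.vcf fragment_file` prints the Hi-C variant of the command. For Hi-C the second step would also get `--hic 1`, as described in the HAPCUT2 options.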

HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

USAGE : ./HAPCUT2 --fragments fragment_file --VCF variantcalls.vcf --output haplotype_output_file

Basic Options:
--fragments, --f <FILENAME> : file with haplotype-informative reads generated using the extracthairs program
--VCF <FILENAME> : variant file in VCF format (use EXACT SAME file that was used for the extracthairs program)
--output, --o <OUTFILE> : file to which phased haplotype segments/blocks will be output
--outvcf <0/1> : output phased variants to VCF file (<OUTFILE>.phased.vcf), default: 0
--converge, --c <int> : cut off iterations (global or maxcut) after this many iterations with no improvement. default: 5
--verbose, --v <0/1> : verbose mode: print extra information to stdout and stderr. default: 0

Read Technology Options:
--hic <0/1> : increases accuracy on Hi-C data; models h-trans errors directly from the data. default: 0
--hic_htrans_file, --hf <FILENAME> : optional tab-delimited input file where second column specifies h-trans error probabilities for insert size bins 0-50Kb, 50Kb-100Kb, etc.
--qv_offset, --qo <33/48/64> : quality value offset for base quality scores, default: 33 (use same value as for extracthairs)
--long_reads, --lr <0/1> : reduces memory when phasing long read data with many SNPs per read. default: automatic.

Haplotype Post-Processing Options:
--threshold, --t <float> : PHRED SCALED threshold for pruning low-confidence SNPs (range 0-100, larger values prune more). default: 6.98
--skip_prune, --sp <0/1> : skip default likelihood pruning step (prune SNPs after the fact using column 11 of the output). default: 0
--call_homozygous, --ch <0/1> : call positions as homozygous if they appear to be false heterozygotes. default: 0
--discrete_pruning, --dp <0/1> : use discrete heuristic to prune SNPs. default: 0
--error_analysis_mode, --ea <0/1> : compute switch confidence scores and print to haplotype file but don’t split blocks or prune. default: 0
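
The `--skip_prune` help mentions pruning SNPs after the fact using column 11 of the output. A rough awk sketch of that idea follows. It assumes the haplotype file is tab-delimited, that block header lines start with `BLOCK` and separator lines with `*`, and that column 11 holds a PHRED-scaled quality where higher means more confident. The `filter_by_col11` name is my own. It simply drops variant lines below the cutoff, which is not necessarily how HapCUT2 itself marks pruned SNPs.

```shell
# Usage: filter_by_col11 <min_quality> < haplotype_output_file
# Keeps block header/separator lines, plus variant lines with column 11 >= cutoff.
filter_by_col11() {
    awk -F'\t' -v min="$1" '
        /^BLOCK/ || /^\*/ { print; next }    # block header and separator lines
        # Variant lines: keep only when column 11 is numeric and at least the cutoff.
        $11 ~ /^[0-9]+(\.[0-9]+)?$/ && $11 + 0 >= min + 0 { print }
    '
}
```

Usage: `filter_by_col11 30 < haplotype_output_file > haplotype_output_file.q30`. Variant lines whose column 11 is not a plain number are dropped as well.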

Advanced Options:
--new_format, --nf <0/1> : use new Hi-C fragment matrix file format (but don’t do h-trans error modeling). default: 0
--max_iter, --mi <int> : maximum number of global iterations. Preferable to tweak --converge option instead. default: 10000
--maxcut_iter, --mc <int> : maximum number of max-likelihood-cut iterations. Preferable to tweak --converge option instead. default: 10000
--htrans_read_lowbound, --hrl <int> : with --hic on, h-trans probability estimation will require this many matepairs per window. default: 500
--htrans_max_window, --hmw <int> : with --hic on, the insert-size window for h-trans probability estimation will not expand larger than this many basepairs. default: 4000000