标签:moved optional getting opera png car local learning his
The objective of this activity is to understand some relevant properties of an ensemble of next generation sequencing reads such as length, quality scores and base distribution in order to assess the quality of the data and to discard low quality reads.
Next-generation sequencing technologies are becoming widely used, and although a massive number of sequences can be generated in a single experiment, it is still important to characterize the reads. Characterizing reads may prevent undesirable outcomes in the assembly or mapping processes. Discarding low-quality reads, controlling for contamination, and trimming adaptor sequences are examples of preliminary quality control procedures that can be applied to raw reads before further analysis.
We will use basic UNIX commands, the FASTX-Toolkit, and FastQC applied to two small 454 and Illumina datasets.
FIX FOR MISSING SCRIPTS
Execute the following commands:
cd ~/wcg_oct2011/local/scripts
wget http://www.molecularevolution.org/molevolfiles/exercises/QC_of_NGS/scripts.tar.gz
tar xvfz scripts.tar.gz
The goal of this exercise is to inspect the sequence data of an Illumina run. The sequence data belongs to the intracellular bacterium Bartonella bacilliformis which is zoonotic pathogen transmitted by insect vectors. Although FASTQ is the main file received from a sequence provider, some users want to perform the base calling step themselves, using a different package than the proprietary Illumina software. This is not covererd in this exercise, and we start directly from the FASTQ file.
fastaNamesSizes.pl -f fastq-illumina bartonella_illumina.fastq > bartonella_illumina.ns
-f fastq-illumina = specifies what format the input sequence file is in. Format options are specified in Bioperl (see link for more information about the fastq format specified)
> bartonella_illumina.ns = specifies the output file
Have a look at the numbers output on the terminal. How many sequences do we have?
This information can found in the standard error:
# 10000 sequences, total length 380000.
# Minimum len: 38. Max: 38. Average: 38.
Inspect the output.
[show answer]
Answer:
The output will be in tabular format, with each line containing the sequence ID and length.
(Note: this is the same as using which to find the file location, and then less to open the file at that path. The backticks (not single quotations) specify that the which command should be performed before less).
fastq_to_fasta -n -i bartonella_illumina.fastq > bartonella_illumina.fasta
-n = keeps sequences with unknown (N) nucleotides
-i bartonella_illumina.fastq = specifies the input file
fastx_quality_stats -i bartonella_illumina.fastq > bartonella_illumina.qualstats
Answer:
Run the program with -h to show help documentation which contains this information. The format of this output is described as the "old" style.
Answer:
The ‘new‘ output contains the original information followed by the analysis for each nucleotide at each cycle (explained in the help documentation).
fastq_quality_boxplot_graph.sh -i bartonella_illumina.qualstats -o bartonella_illumina.boxplot.png
(or, alternatively with R: Rscript fastx_quality_boxplot.R bartonella_illumina.qualstats bartonella_illumina.boxplot2.png
)
Answer:
x axis represents the read length, y axis indicates the Phred quality score. From the top each box represents upper whisker, Q1, median, Q3 and lower whisker.
The sequence reads have a high quality. The probability of error is about 0.0001. The quality drops towards the end of the reads.
fastx_nucleotide_distribution_graph.sh -i bartonella_illumina.qualstats -o bartonella_illumina.nucdistr.png
(or, alternatively with R: Rscript fastx_nucleotide_distribution.R bartonella_illumina.qualstats bartonella_illumina.nucdistr2.png
)
Answer:
The sequence reads are A-T rich. The estimate of G+C content is about 30-40%.
The goal of this exercise is to do the same kind of quality checks as in exercise 1, but on 454 data this time. The primary data from 454 is stored in a sff file, but in general, FASTA and qual files are also provided. It is possible to parse the sff files with different parameters using the proprietary software provided by 454 (sfffile/sffinfo) or a free, open-source tool,sff_extract. Sff extraction is not covered in this exercise, and we start from the bartonella_454.fasta andbartonella_454.qual files. How would you perform the following tasks?
Answer: fasta_formatter < bartonella_454.fasta > bartonella_454_1.fasta
fastaQual2fastq.pl bartonella_454.fasta bartonella_454.qual > bartonella_454.fastq
Answer: fastaNamesSizes.pl bartonella_454.fasta > bartonella_454.ns
Standard error:
# 10000 sequences, total length 3666752.
# Minimum len: 40. Max: 633. Average: 367
Although the average read length is 367 bp, very few sequences are ~200-300 bp. The read length of the majority of the sequences in the file fits in two major groups: less than 100bp and more than 300bp.
Rscript plotLengthDistribution.R bartonella_454.ns bartonella_454.readDistr.png
Answer:
The read length distribution is bimodal. Mean read length=367 bp, Median=449 bp.
Answer:
The following code can be inserted after the hist
line and before the dev.off()
line:abline(v=mean(data$V2),col="red")
abline(v=median(data$V2),col="blue")
Answer: fastx_quality_stats -i bartonella_454.fastq > bartonella_454.qualstats
fastx_quality_stats -i bartonella_454.fastq -N > bartonella_454.extqualstats
Answer: fastq_quality_boxplot_graph.sh -i bartonella_454.qualstats -o bartonella_454.qualstats.png
The quality of the sequence reads is high but it drops after ~400bp.
Answer: fastx_nucleotide_distribution_graph.sh -i bartonella_454.qualstats -o bartonella_454.nucdistr.png
The sequence reads are G-C rich. The distribution of nucleotides is not uniform at the 5‘ end of the reads.
Answer: head -n 51 bartonella_454.qualstats > bartonella_454.50f.qualstats
fastx_nucleotide_distribution_graph.sh -i bartonella_454.50f.qualstats -o bartonella_454.50f.nucdistr.png
The distribution of the nucleotides is not uniform in the first 10 nucleotide position of the reads. Adaptor sequences at the 5‘ end are included in the sequenced reads. The sequence reads need to be trimmed.
Most of the sequences have some adaptor sequences at the 5‘ end. We need to remove these adaptor sequences to avoid problems with the assembly or mapping. We will use fastx_trimmer to do this.
Answer: fastx_trimmer -f 11 -m 50 -i bartonella_454.fastq -o bartonella_454.trim.fastq
-f 11 = the position on the read (from the 5‘ end) of first nucleotide to keep. The first 10 bp will be removed from the 5‘ end.
-m 50 = the minimum final read length. Reads shorter than this length will be discarded.
Answer: fastx_quality_stats -i bartonella_454.trim.fastq | head -n 100 > bartonella_454.trim.qualstats
fastx_nucleotide_distribution_graph.sh -i bartonella_454.trim.qualstats -o bartonella_454.trim.nucdistr.png
The distribution of the nucleotides is now similar over the length of the reads.
Answer:
If the adapter was removed by searching for a specific sequence then reads with sequencing or PCR errors in the adapter would not being trimmed. These untrimmed reads would cause problems with assembly.
Another tool to assess sequence quality is FastQC. FastQC can operate in two different modes - as a stand-alone interactive application or in a non-interactive mode which is more suitable for larger pipelines. For the purposes of this activity we will be using the interactive application. To start FastQC:
If you are running the WCG Ubuntu Linux distribution, first run the following commands:
rm ~/wcg_oct2011/local/bin/fastqc
ln -s ~/wcg_oct2011/software/FastQC/fastqc ~/wcg_oct2011/local/bin/fastqc
Then, you should be able to run fastqc from anywhere like so:
fastqc
Or if you are running on Mac OS X, navigate to the ~/wcg_oct2011/software/fastqc_v0.9.3 folder (already copied from your USB during the software installation) and double-click the FastQC icon.
FastQC may take a moment to open and should display a blank window once it has.
Load the bartonella_illumina.fastq Illumina file (File->Open) from the ~/wcg_oct2011/activities/QC_of_NGS_datadirectory. Once loaded, FastQC will process it and generate the report. This report can be exported (File->Save report...) as a zipped HTML folder which makes for easy sharing. You can view the results either within the FastQC application or theexported report. The FastQC documentation provides a detailed explanation of the tool and an explanation of each module report.
One nice feature of FastQC is that it easily handles and detects the many FASTQ variants. Within the Basic Statisticsreport, it shows that our data was Illumina 1.5 encoded.
The per base sequence quality plot is similar to the plot we produced in Exercise 1 with thefastq_quality_boxplot_graph.sh
FastX-Toolkit command. Are there any differences between the two plots?
Explore the rest of the plots. What new information can be gathered about this Illumina dataset?
Load the bartonella_454.fastq file (File->Open) that we created from the 454 bartonella_454.fasta andbartonella_454.qual files in Exercise 2, step 3. Note that FastQC handles the following file types:
You can view the results either within the FastQC application or the exported report.
You will notice for this dataset it shows that our data was Illumina 1.5 encoded. If you remember in exercise 2, we used thefastaQual2fastq.pl script to convert the .fasta and .qual files to FASTQ and it encoded them as Illumina 1.5.
You should notice in this case that 5 of the modules are marked with the red cross as "very unusual." Why was this the case for the 454 data? Try loading the bartonella_454.trim.fastq file to see the impact of our quality control efforts.
Another nice feature of FastQC is that many of the plots bin the nucleotide position on the x-axis which is useful for long reads such as 454. It is especially useful for the per base sequence content plot.
FastQC also provides example datasets on their website for a good Illumina dataset, poor Illumina dataset, a 454 dataset, and PacBio dataset.
Other tools to verify quality of second-generation sequencing results are available:
Quality assessment and quality control of NGS data
标签:moved optional getting opera png car local learning his
原文地址:http://www.cnblogs.com/nkwy2012/p/6089933.html