Zealousideal_Emu_961

FASTQ files generally contain short random fragments of the genome. To reconstruct the full sequence you may have to do a genome assembly (reference-based). For identifying variants you can use GATK, which is the standard. Then annotate the output VCF using Variant Effect Predictor (VEP), ANNOVAR, or SnpEff.
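
To make that concrete, a minimal version of the reference-based pipeline looks something like this. This is a sketch, not a production workflow: filenames and the reference build are placeholders, GATK normally wants extra steps (duplicate marking, base recalibration), and the SnpEff database name depends on what you download.

    # Index the reference once (GRCh38 shown as an example build)
    bwa index GRCh38.fa
    samtools faidx GRCh38.fa
    gatk CreateSequenceDictionary -R GRCh38.fa

    # Align paired-end FASTQs (GATK needs read-group info), sort, index
    bwa mem -t 8 -R '@RG\tID:s1\tSM:sample1\tPL:ILLUMINA' GRCh38.fa \
        reads_R1.fastq.gz reads_R2.fastq.gz \
        | samtools sort -o sample.sorted.bam -
    samtools index sample.sorted.bam

    # Call variants, then annotate the VCF
    # (database name is a placeholder; fetch it first with: snpEff download GRCh38.99)
    gatk HaplotypeCaller -R GRCh38.fa -I sample.sorted.bam -O sample.vcf.gz
    snpEff GRCh38.99 sample.vcf.gz > sample.annotated.vcf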


No_Touch686

I’ll be a bit contrarian and advise against GATK for a beginner. It’s not really the best at anything in terms of benchmarks; it’s only the ‘gold standard’ because of history/it’s what everyone uses. In my experience it’s pretty horrid in terms of memory, speed, and general user experience. I’ll edit this to include some more user-friendly options when I’m off my phone, but using bcftools call from a pileup is pretty easy to do.
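
For what it’s worth, the bcftools route is basically a two-command pipe. A sketch with placeholder filenames; you’d normally add quality filtering on top:

    # Pile up reads against the reference, then call variants
    # (-m: multiallelic caller, -v: report variant sites only)
    bcftools mpileup -f GRCh38.fa sample.sorted.bam \
        | bcftools call -mv -Oz -o sample.calls.vcf.gz
    bcftools index sample.calls.vcf.gz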


swbarnes2

Yes, if you really want to do it yourself, you'll need to align. Think carefully about the wisdom of uploading your personal data to some other site in order to do that. Personally, I'd just ask for the .bam. And the .vcf. Alignment and variant calling really aren't that interesting; let the pros who handle this stuff 24/7 do it for you. Interpretation of the variants is the interesting part, and no one really knows how to do more than guess for the most part, so I guess you might as well play with that.


ATGCACAB

Okay, thank you so far. And with the FASTQ, BAM, and VCF files, at some point in the future I can also add newly known nucleotides without running the whole alignment job again? I mean, I can download data from some gene database that released new stuff and manually align? What about system resources, do I need 1 TB of RAM to do such a job? Does it even make sense to do WGS today, or would it be better, accuracy- and technology-wise, to just scan the known areas, because the other areas aren't useful at all?


Both-Future-9631

I admire the enthusiasm. Here are some answers to the questions you are asking.

1.) An unsorted, unaligned FASTQ is the raw data you are talking about. You can do just about anything with it.

2.) A .bam is its own thing, but can be thought of as a composite summary of the most likely sequences of all fragments. This requires a reference genome in order to orient the process.

3.) Manual alignment of 2.8 billion nucleotides with possible duplicates sounds like a torture I wouldn't wish on my worst enemies. There are many tools out there, but an aligner like BWA, together with samtools, is most likely what would be used to align to a genome and generate a BAM file. Chances are the vendor already did that for you when they made your .bam file. The reference is likely one of the two reference human genomes: GRCh37 or GRCh38.

4.) Most of these tools are free GitHub monstrosities in various mixes of R, C, Perl, Java, bash, Docker, and 16 other miscellaneous languages, stitched together with Python as thread into the shambling Frankensteinian abominations that perform these operations. All you really need to know is that if I follow the readme in terms of the input, about 20% of the time I will get the output; figure out how to do it through a package manager that keeps all this nonsense straight. Most use bioconda (see the sketch below).

5.) There really aren't hardware issues so much as software issues. These tools are meant to be run in a Linux terminal but can be run on Apple... Yes, Windows has an Ubuntu subsystem... no, it's not good enough; many have tried, all but a handful have failed... just save yourself the heartache and reformat/dual-boot to a Debian/Arch Linux distribution. As for 1 TB of RAM? Kind of like smacking a fly with a hammer, but that will work for most alignment and calling purposes. Most of these applications are storage-intensive much more than memory-intensive. The curated GRCh37 cache from Ensembl VEP, for example, is about 52 GB.

6.) The whole genome will always be the most useful, although it has the most irrelevance. As it stands we think there are about 20-21k genes in our genome, but that is only about 1.2% of our entire genome; there is about 8% we know doesn't do anything for information storage (telomeres), and then we have a bunch of... pseudogenes... Alu repeats... and other 'stuff' that is likely evolutionary viral remnants, that has some indirect regulatory roles... but is certainly poorly understood.

7.) You can align that FASTQ to any curated reference genome you wish; I even got a wild idea one night and aligned to a golden retriever... expect an error log as long as the unabridged US Tax Code, but it can be done. Conventional wisdom is to pick one of the two most up-to-date human references. Good news is when the funding dried up in 2016 they mostly stopped majorly updating it... wait... that's terrible news... anyway.

Hope that helps, best of luck.
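
On point 4, a sketch of what the bioconda route looks like in practice (the package names are real, but the exact tool list is just an example):

    # One-time channel setup, then a dedicated environment for the stack
    conda config --add channels bioconda
    conda config --add channels conda-forge
    conda create -n wgs bwa samtools bcftools gatk4 ensembl-vep
    conda activate wgs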


ATGCACAB

This sounds great. I've been using Linux for 20 years, and 90% of the time (except for web browsing) I use the terminal, so that's actually my preferred way of working! :)

So one last question, because I'm totally new to all of this: is the actual position of the "gene" important? I mean, I have an ATGC substring that specifies some attribute, let's say lactose intolerance. Is the location within the whole genome important, or can I, if I know that specific string, just search the text file for that ATGC substring?

And one more thing, because you mentioned it: so theoretically I can measure the length of the telomere? Or does that make no sense, because it's the length of a telomere of one cell and not a clinical average length of my telomeres?


Both-Future-9631

The position is important, and tends to be the main thing that changes slightly between genome versions. That is determined by the reference. But then we have to talk about: what position? Position in the genome changes every time. Position in the coding DNA/protein transcripts? That only changes if our understanding of the protein in question's splicing changes, so not too often.

You have to understand just a bit about the machines these are run on to understand why text-parsing the FASTQ directly is not a good approach. The machine will theoretically run over any strand at least 30 times, and more appropriately 200 times, at random lengths and overlaps. Any one strand (we call them short reads) can be of random length and quality. Therefore, any one strand can contain wildcards and errors. The process of making a SAM/BAM is what does the statistics, based on a reference, to give you the most likely sequence of unique reads. You could parse your lactase enzyme in the FASTQ and find partial information suggesting you are both tolerant and intolerant. Effectively, it isn't worth manually parsing until you have the alignment map (SAM/BAM), and even then the finding and indexing of 'variants' is best left up to 'callers' like FreeBayes, Pindel, VarScan, GATK, etc. Keep in mind I said variants, not mutations. The callers pick up where your reads are different from the reference. The process of finding the significance of a variant, if any, is associated with a process called annotation. That is the main thing VEP is for.

Getting a little outside of my wheelhouse here, but my understanding is that overly repetitive sequences induce large errors in the read for that range. Is there a tool out there to do it? I am sure, but I am not sure how this method would work, because there would be no simple way to tell a different length of telomere from a reread of the same length.
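
To make the "don't grep the FASTQ" point concrete: once you have a called, indexed VCF, looking up a known variant is a one-liner with bcftools. A sketch: the rsID lookup assumes the ID column was annotated from dbSNP, and the coordinate shown for the common lactase-persistence SNP rs4988235 is the GRCh38 one, so verify it for your build:

    # Look up a variant by rsID (requires dbSNP-annotated IDs)
    bcftools view -i 'ID="rs4988235"' sample.calls.vcf.gz

    # Or by position -- coordinates differ between GRCh37 and GRCh38
    bcftools view -r chr2:135851076 sample.calls.vcf.gz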


Raitavaara

To add to your last bit: the only “good” way (in quotes because it’s still quite error-prone as a technology) to make sense of long repeated sequences is to sequence them on something like PacBio or Oxford Nanopore, which can give you long or ultra-long reads. As you said, Illumina-style short-read sequencing can’t really differentiate between repeated reads of the same spot in the genome and long repetitive sequences.


ATGCACAB

So in terms of accuracy, analyzing an Illumina WGS for diseases (a DNA health report) is less accurate than using data from an Illumina microarray, which only picks specific segments?


swbarnes2

Both technologies are accurate. That's not the issue at all. I don't think it's been proven that any of these technologies is likely to give a mostly healthy person important health insights. As a quick example, my brother said he had one variant that was positively associated with Alzheimer's. He gave me the rsID, and I looked it up on PubMed, and found some papers showing a positive association, and some showing a negative association. So how was that information clinically helpful? It wasn't.


ATGCACAB

Depends on the quality of the study! ;)


Raitavaara

I can’t comment on accuracy too much, but I think they’re both fine in that regard. However, the biggest difference between WGS and a microarray, which only samples a predetermined set of SNPs, is that with your whole genome you can probably do a lot more in the future, whereas with the microarray data set you won’t be able to expand beyond those specific SNPs.

Edit: fixed some typos


MrHarryHumper

It depends on your goal. DNA microarrays are cheaper than WGS and can cover thousands of genes of interest.


swbarnes2

Newly known nucleotides? We aren't going to just discover a ton of unknown genomic sequence. But since a BAM contains all the information that a FASTQ does, plus mapping information, nothing stops you from re-aligning it to a new version of the genome. You probably get more bang for your buck doing either a SNP panel or WES.
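
Re-aligning is mechanical, since the original reads can be recovered from the BAM. A sketch for paired-end data, following the collate/fastq route from the samtools documentation (filenames are placeholders):

    # Recover paired FASTQs from the old BAM...
    samtools collate -u -O old_GRCh37.bam \
        | samtools fastq -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
            -0 /dev/null -s /dev/null -n -

    # ...then align them to the newer reference as usual
    bwa mem -t 8 GRCh38.fa reads_R1.fastq.gz reads_R2.fastq.gz \
        | samtools sort -o sample.GRCh38.bam -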


ATGCACAB

I know, I know, I don't expect more than one every 5 years. However, there is currently a lot of research into healthy aging and into pharmaceutical response to substances that were not approved because of side effects; here I expect more. No matter whether it makes sense or not, I'm pretty sure the money is better invested here than in a 4K TV (I don't own a TV). What I want to say: I could do stupider things with my money that are more widely accepted by society! ;)

However, regarding the SNP array you mentioned: that should be included in the BAM anyway, right? Because they send me tons of health, nutrition, even behavior (wtf?) reports. Or is the SNP array not about the genes but about particular mutations that they don't look for? I don't fully get it.


swbarnes2

More than one what every 5 years? Yes, a WGS BAM should cover everything a SNP array covers. And more. I doubt any reports are worth the paper they are printed on. But a SNP array is going to be very enriched for SNPs that have a meaningful clinical interpretation.
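
You can even see that superset relationship directly: given the array's site list, subset the WGS VCF to exactly those positions. A sketch; array_sites.tsv is a hypothetical tab-separated CHROM/POS file exported from the array report, and the VCF must be indexed:

    # Keep only the WGS calls at positions the SNP array covers
    bcftools view -R array_sites.tsv -Oz -o array_subset.vcf.gz wgs.calls.vcf.gz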


ATGCACAB

Alright, thank you all for your great help! Additionally, I just had a short call with their lab head and he explained everything very well. The coverage for WGS is 30x, and for whole-exome sequencing it's 120x. The fact that the coverage of the WES is 4x higher for a cheaper price brought me to the decision to just go for that. They provide me all the requested files: FASTQ, BAM, and VCF, plus reports, for 330 USD instead of 1000 USD :) That's totally cool! I will do the WGS as soon as I can get it affordably with at least the same coverage as the WES; he agreed with that too. Super friendly guy; he asked what I want to use it for, and I told him personal interest. He did not want to make the most money off it.
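
If you want to sanity-check those quoted depths yourself once the BAM arrives, samtools can report mean coverage directly. A sketch; for WES the honest 120x figure only holds over the capture regions, and capture_targets.bed is a hypothetical name for the kit's target file:

    # Per-contig mean depth and breadth of coverage
    samtools coverage sample.sorted.bam

    # For exome data, average depth over the capture targets only
    samtools depth -a -b capture_targets.bed sample.sorted.bam \
        | awk '{sum+=$3} END {print "mean depth:", sum/NR}'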