Submitting RCC sequences to Genbank with Geneious
1 Aim of document
This document explains how to use Geneious to :
- assemble and clean final sequences from several traces using different internal primers
- annotate the sequences
- submit to Genbank using Bankit through the Geneious plug-in
- submit 18S, ITS and 16S sequences to Genbank, which can no longer be submitted using Bankit
Notes
- Look at legends below screen captures for directions.
- Changes from previous versions have been labelled with
2 Assemble and clean sequences
- Import the ab1 trace
Drag and Drop
Import trace sequences
- Trim the sequences
Annotate & Predict -> Trim Ends
- Use an error probability limit from 0.01 to 0.02 (increase to 0.05 if the traces cannot be assembled correctly; the trimming will be less drastic). For single reads (e.g. 528F) use a maximum of 0.02.
Trim sequences
Visualize the trimmed sequences
- Assemble if several primers have been used
Align/Assemble/De Novo Assemble
- Use for assembly name: RCC####_gene-name_your-initials_date
- e.g. RCC2497_18S_PG_2018_02_15
The name should not contain any space
- Select “save the consensus”
- Select “save contigs”.
- You may have to change the trimming level (increase probability level - see above) if traces cannot be assembled
Assemble sequences
Visualize the assembled sequences
- Check the assembly and edit the consensus if necessary.
This is very important to make sure that your sequence is clean.
Allow editing
- Edit bases that may be wrongly assigned in one of the traces.
Check and correct assembly
- Select and extract consensus
Extract consensus
- Reverse complement if necessary (if the sequence was assembled the other way around).
- Locate primers, test forward and reverse separately.
Tools > Primers > Test with saved primers
Test with saved primers
Locate primers
- Remove everything which is outside of primers including the primers.
- allow editing
- holding down the left mouse button, mark the region to be deleted. It will show in green
- press delete
Remove everything outside primers
- Et voilà, you have a clean sequence
- The coloring corresponds to sequence quality based on the traces and assembly.
Final sequence
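The error probability limit used for trimming can be read as a Phred quality score, which may help when choosing between 0.01, 0.02 and 0.05. The short sketch below only applies the standard Phred relation Q = -10 log10(P). The function name is ours, and nothing here calls Geneious itself.

```python
import math


def phred_from_error_probability(p: float) -> float:
    """Convert a per-base error probability into a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


# The three limits mentioned in the trimming step above.
for limit in (0.01, 0.02, 0.05):
    print(f"P = {limit:.2f} -> Q = {phred_from_error_probability(limit):.1f}")
```

A larger limit corresponds to a lower quality score, so fewer bases are trimmed. This is why raising the limit to 0.05 makes the trimming less drastic.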
3 Add information to sequences
3.1 Taxonomy
- Do a batch BLAST search
Select the files > sequence search or Blast search
- Sometimes it does not work, in which case you can use the NCBI BLAST server instead
- Pay attention to the following parameters:
- database : nr genbank
- program : blastn (for protein-coding genes like rbcL, blastx can also be used to confirm)
- results : hit table
- maximum hits : 25 at least
- Parameters can be saved, recalled and deleted by clicking at the bottom of dialog box
- save current settings > name > save.
- You can request an API key from NCBI, which increases the number of requests you can make. The process is explained on the NCBI web site
BLASTN
Request a NCBI key
Enter the key in Geneious preferences
- Retrieve the closest sequence from GenBank (Optional)
- From Geneious folder with the Blast results, select the closest result, drag the file into your folder in your local database if you wish to retain the file and/or modify it.
- From Genbank: Copy the accession number > go to NCBI > nucleotide > paste the accession number (see the figure below). Drag the file into your folder in your local database if you wish to retain the file and/or modify it.
Retrieve closely related sequences from Genbank
- Do a manual alignment (Optional)
- This is very useful to detect introns, for ITS sequences, and for sequences that combine several genes, for example partial 18S + 28S.
Align/Assemble > Pairwise Align MAFFT using the default parameters
Alignment parameters
Alignment results
3.2 Gene annotation
This step is NOT necessary for 16S, 18S, ITS
- With the mouse, select your sequence, then add an annotation
- Parameters to be changed (see the picture below)
- Name: name of the gene
- Type:
- select rRNA for 18S, ITS, plastidial 16S and 28S
- CDS or gene for coding sequences, for example rbcL
- Add property using the 1st ADD: name = product, value = name of gene, for example 18S rRNA.
- Add annotation using the 2nd ADD (click on INTERVALS to see it): tick truncated left end and truncated right end. This indicates that the sequence is not complete. For example, the 18S in this tutorial had the extremities outside the primers removed, so it is incomplete.
Make sure you do not have two annotations for the same gene !
Annotate genes
Annotated genes
3.3 Metadata
- Add two new type of metadata (it has to be done only once) in the GenBank submission category:
- Strain
- Culture_collection
Edit Meta data Types > Genbank Submission > click on the + on the right side > write Culture_collection in the new field > OK
Make sure that these new fields are in the Genbank Submission category. Do not create a new category.
Use the exact spelling for the names, especially the underscores: “Culture_collection” and not, as before, “Culture Collection”.
Add new meta-data type: Culture Collection
- Click on the final sequence, go to info and change or correct the following fields.
- Name : RCC####_gene-name_your-initials_date, e.g. RCC9999_18S_PG_2015_10_01 (change if it is not in this format at this point).
- This will be the ID of the sequence submitted to GenBank.
- This name must not contain any space
- This name must be unique. For example if you submit 2 sequences for the same strain and same gene you must use different names e.g. RCC9999_18S_PG_2015_10_01_A and RCC9999_18S_PG_2015_10_01_B
- Organism : Picochlorum sp. or Trebouxiophyceae.
- Enter the genus name or, if not known, the lowest taxonomic level known.
Only use the species name if you are absolutely sure of the species as determined by microscopy or ITS. Do not rely on BLAST!!
- DO NOT add the RCC number at the end of the organism name.
- For levels above the genus, do not use sp. For example use Trebouxiophyceae and not Trebouxiophyceae sp. or Chlorophyta and not Chlorophyta sp.
- Strain : This is the RCC code as RCCxxxx without space between RCC and number e.g. RCC1236.
- Culture_collection : This is the RCC number as RCC:xxxx with “:” between RCC and number e.g. RCC:1236.
Update the different meta-data fields
- It is possible to quickly change metadata for a set of sequences using the Batch edit mode. For example you can :
- Copy the Strain field to the Culture Collection field
- Add the “:” automatically for all sequences by replacing “RCC” by “RCC:”.
Batch edit - simple
Batch edit - advanced
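The rules for the Name, Strain and Culture_collection fields can also be checked outside Geneious, for instance on an exported table. The sketch below is only an illustration: the regular expressions are our reading of the convention described above (including an optional suffix such as _A or _B), and the function names are hypothetical.

```python
import re

# Assumed pattern for RCC####_gene-name_your-initials_date, with an optional
# suffix (e.g. _A, _B) used to keep names unique.
NAME_PATTERN = re.compile(
    r"RCC\d+_[A-Za-z0-9-]+_[A-Za-z]+_\d{4}_\d{2}_\d{2}(_[A-Za-z0-9]+)?"
)
STRAIN_PATTERN = re.compile(r"RCC(\d+)")


def culture_collection_from_strain(strain: str) -> str:
    """Turn a Strain value such as RCC1236 into RCC:1236."""
    match = STRAIN_PATTERN.fullmatch(strain)
    if match is None:
        raise ValueError(f"Strain {strain!r} is not of the form RCCxxxx")
    return f"RCC:{match.group(1)}"


def check_metadata(name: str, strain: str, culture_collection: str) -> list[str]:
    """Return the problems found in the three fields (empty list if none)."""
    problems = []
    if " " in name:
        problems.append("Name contains a space")
    if NAME_PATTERN.fullmatch(name) is None:
        problems.append("Name does not follow RCC####_gene-name_your-initials_date")
    if STRAIN_PATTERN.fullmatch(strain) is None:
        problems.append("Strain is not of the form RCCxxxx")
    elif culture_collection != culture_collection_from_strain(strain):
        problems.append("Culture_collection does not match Strain")
    return problems


print(check_metadata("RCC9999_18S_PG_2015_10_01", "RCC9999", "RCC:9999"))
```

The same replacement of “RCC” by “RCC:” is what the Batch edit mode does inside Geneious. The script is only a way to double-check the result.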
3.4 Primers information
This step is optional for 18S, but must be done for ITS, 28S and other genes
Edit Meta data Types > Sequencing Primer > OK
You can also use Batch edit to go faster
Add primer information to meta-data
4 GenBank submission - General case (not for 16S, 18S or ITS, see next part)
- Note that since August 2018 16S, 18S, 28S and ITS cannot be submitted by BankIt and must be submitted through a web interface.
- Install plugin GenBank submission
Tools > plugin > choose the plugin and click in install
- Select the sequences you want to submit
- Select GenBank submission
- Enter first the Publisher details (add the information as in the picture below, except that the sequence authors are Daniel Vaulot + the person who did the sequence)
- Name
- Address
- Sequence authors
- Select Unpublished
- Reference should be “Roscoff Culture Collection”
- Check very carefully all the fields
- Submission name : the name of the file to be saved (this should be kept on the Databases computer)
- Save a local file (only upload when everything is OK)
- Project name : Roscoff Culture Collection
- Molecule type : Genomic DNA
- Genetic location : in general Genomic but can also be Plastid or Mitochondrion or Nucleomorph for Cryptophytes
- Sequence ID : Name
- Organism : Organism
- Include features/annotation : Yes
- Include other fields : Yes
- Culture_collection : Culture_collection (GenBank submission)
- Strain : Strain (GenBank submission).
- Primers : You can add the primers if necessary, but they need to be entered as Sequencing primers
- Check submission in the Preview mode
- If there are errors, you need to correct them
- Ignore warning about Organism not found and Collection
Warnings - Ignore
Genbank record preview
- Save as tar file.
- The submission has to be completed before you start processing a new one, because Geneious keeps in memory the information from the last .tar file you saved.
- The tar file can be uncompressed to an .asn file, which can be opened with Sequin (downloadable from NCBI).
- Finally submit using the Geneious BankIt account and record the BankIt number
Submit to GenBank
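If you want to inspect the saved submission before uploading, the .asn file can be pulled out of the tar file with a few lines of Python. This is a minimal sketch using only the standard library. It assumes, as stated above, that the archive contains a file ending in .asn. The function name and the paths are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_asn_files(tar_path, out_dir):
    """Copy every .asn member of the tar file into out_dir and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(tar_path) as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.lower().endswith(".asn"):
                # Keep only the file name so nothing is written outside out_dir.
                target = out / Path(member.name).name
                target.write_bytes(archive.extractfile(member).read())
                extracted.append(target)
    return extracted


# Example call (hypothetical file name):
# extract_asn_files("RCC_submission.tar", "RCC_submission_asn")
```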
5 GenBank submission - 16S, 18S or ITS
Submission must now be done at https://submit.ncbi.nlm.nih.gov/subs/genbank/. If you do not have a login you must create one.
Information about the NCBI submission portal is here. We recommend reading these instructions very carefully before submitting the sequences.
The main steps are :
- Create fasta file with unique Name for each sequence.
- Sequence Name (Sequence_ID) cannot contain spaces. The Sequence_ID identifies the same specimen in all the steps of a submission. We use a convention of the following type RCC9999_18S_PG_2015_10_01 (see above)
- Sequence Name must be unique within the set and may not contain spaces.
- Sequence Name may contain only the following characters - letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).
- Create a tabulated file as Text (tsv - tab-delimited) containing all the information about the sequence. See this link for the description of all the modifiers. This file can be easily exported from Geneious and finalized with Excel. For the RCC, the following columns are necessary (fields in bold are mandatory):
- Name - This field will be used as the Sequence_ID for submission
- Organism - Picochlorum sp. or Trebouxiophyceae
- Genbank Submission : Strain - e.g. RCC1236.
- Genbank Submission : Culture_collection - e.g. RCC:1236.
- Fwd_primer_name - name of forward PCR primer
- Fwd_primer_seq - nucleotide sequence of forward PCR primer
- Rev_primer_name - name of reverse PCR primer
- Rev_primer_seq - nucleotide sequence of reverse PCR primer
- The column names of the tabulated file must be edited as follows, not forgetting the underscores. This is best done with an editor such as Notepad++ or with Excel. In the latter case the file must be saved as a tab-delimited text file.
- Name -> Sequence_ID
- Genbank Submission : Strain -> Strain
- Genbank Submission : Culture_collection -> Culture_collection
- name of forward PCR primer -> Fwd_primer_name
- nucleotide sequence of forward PCR primer -> Fwd_primer_seq
- name of reverse PCR primer -> Rev_primer_name
- nucleotide sequence of reverse PCR primer -> Rev_primer_seq
Example of header for the tsv file (columns are separated by tabs): Sequence_ID Culture_collection Strain Organism Fwd_primer_name Fwd_primer_seq Rev_primer_name Rev_primer_seq
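The column renaming can also be scripted instead of being done by hand. The sketch below is one possible way, using only the Python standard library. It assumes that the exported column titles look like "Name" and "Genbank Submission : Strain" (the exact titles written by Geneious may differ), and the function names are hypothetical.

```python
import csv
import re

# Assumed prefix added by Geneious to the GenBank submission fields.
PREFIX = re.compile(r"^\s*genbank submission\s*:\s*", re.IGNORECASE)
# Remaining renames, including the old spelling without underscore.
RENAMES = {"Name": "Sequence_ID", "Culture Collection": "Culture_collection"}


def ncbi_header(column: str) -> str:
    """Return the column title expected in the source information file."""
    stripped = PREFIX.sub("", column).strip()
    return RENAMES.get(stripped, stripped)


def rename_tsv_headers(in_path, out_path):
    """Rewrite a tab-delimited file with corrected column titles."""
    with open(in_path, newline="", encoding="utf-8") as src:
        rows = list(csv.reader(src, delimiter="\t"))
    if not rows:
        raise ValueError("the tsv file is empty")
    rows[0] = [ncbi_header(column) for column in rows[0]]
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst, delimiter="\t", lineterminator="\n").writerows(rows)
    return rows[0]


print(ncbi_header("Genbank Submission : Culture_collection"))
```

Only the first row is modified, and the data rows are copied unchanged. It is still worth opening the result in a text editor to confirm the titles before uploading.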
5.1 Prepare files
We will do a simple case but you can add more columns (see list of modifiers).
- Fasta file
- Select sequences
- Export as fasta
Export to Fasta
Ignore this warning
Wrap sequences to 80 characters
Final fasta file
- Source information file tab-delimited (tsv file)
Select sequences. Check the 4 fields (Sequence_ID, Strain, Culture_collection and Organism) are correct.
Export selected documents
Export as tsv
Select the columns to be exported: Name, Culture_collection, Strain, Organism
Edit the tsv file to remove GenBank Submission: from the column titles and change Name to Sequence_ID. This is best done with an editor such as Notepad++ or with Excel. In the latter case the file must be saved as a tab-delimited text file.
After editing and removing GenBank Submission:
Editing with Excel (save file as tab-delimited tsv)
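Before uploading, it can be useful to check that the two files agree with each other. The sketch below verifies that every fasta header is an allowed and unique Sequence_ID, and that the same identifiers appear in the Sequence_ID column of the tsv file. It assumes that each fasta header line contains only the sequence name. The function names are hypothetical.

```python
import csv
import re

# Characters allowed in a Sequence_ID, as listed in the rules above.
ALLOWED_ID = re.compile(r"[A-Za-z0-9\-_.:*#]+")


def fasta_ids(fasta_path):
    """Return the header of each sequence (text after '>'), in file order."""
    with open(fasta_path, encoding="utf-8") as handle:
        return [line[1:].strip() for line in handle if line.startswith(">")]


def check_submission(fasta_path, tsv_path):
    """Return a list of problems found in the pair of files (empty if none)."""
    problems = []
    seen = set()
    for seq_id in fasta_ids(fasta_path):
        if ALLOWED_ID.fullmatch(seq_id) is None:
            problems.append(f"{seq_id!r}: characters not allowed in a Sequence_ID")
        if seq_id in seen:
            problems.append(f"{seq_id!r}: duplicated in the fasta file")
        seen.add(seq_id)
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if "Sequence_ID" not in (reader.fieldnames or []):
            problems.append("the tsv file has no Sequence_ID column")
            return problems
        tsv_ids = {row["Sequence_ID"] for row in reader}
    for seq_id in sorted(seen - tsv_ids):
        problems.append(f"{seq_id!r}: in the fasta file but not in the tsv file")
    for seq_id in sorted(tsv_ids - seen):
        problems.append(f"{seq_id!r}: in the tsv file but not in the fasta file")
    return problems


# Example call (hypothetical file names):
# print(check_submission("RCC_18S.fasta", "RCC_18S.tsv"))
```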
5.2 Submit to NCBI web portal
- Go to web portal : https://submit.ncbi.nlm.nih.gov/subs/genbank/
Web portal. Register or login if you have already an ID
Enter the type of sequence
Enter submitter information
Enter the sequence technology. In almost all cases choose Sanger and Assembly
Sequences.
- Release date: Choose immediate release in most cases, there is really no need to delay release.
- The chimera question is only for Prokaryotes.
- Choose pure cultures for cyanos.
- Upload the fasta sequence file
Source information. Since it will be loaded in the text file, choose - NONE of these
Upload tsv file saved from Geneious
After uploading tsv file
Taxonomy error - This error is due to the addition of sp. to taxa at a rank above the genus. You need to correct the tsv file by removing sp. If the error comes from a new taxon not yet described, you can ignore it; GenBank will probably contact you to add this taxon to their database.
Add the reference. For the Roscoff Culture Collection just fill as indicated with the name of the person who produced the sequence first.
Final check
Submission status. When you press submit you should arrive at the final screen showing your submission.
Some time later (from a few minutes to a few days), you should receive an email with the accession numbers. Please forward it to rcc@sb-roscoff.fr.
- In case of errors in the submission, you may receive an email as follows.
This email explains which sequence(s) is (are) incorrect.
- You will need to carefully examine the sequence that Genbank has identified as problematic. Do not disregard their analysis because they really seem to catch errors. In most cases this is due to bad assembly. You need to go back to your trace files (see above), use a stricter error probability limit to trim the sequences (e.g. 0.02 instead of 0.05) and redo the assembly.
- Once this is done, remove the old sequence which was bad and add the new sequence (using the same sequence name) to your fasta file
- Update the .tsv file if necessary.
- Follow the link in the email to go back to the NCBI web site. You will need to reload ALL sequences in the submission that had problems, as well as the tsv file with all the information
The web page to correct your submission. When you click on the link you will be asked to reload the fasta and tsv files.
After doing the correction you should see both the validated submissions as well as the one you just fixed.
6 Appendixes
6.1 Retrieve sequences from Genbank using Geneious
- For a list: Go to nucleotide, type the accession numbers separated by commas and click search. The results will appear in the bottom panel. You must drag the file into a folder in your local database if you wish to retain the file and/or modify it.
- For consecutive accession numbers: type the first and last numbers separated by :, click on more options, change All fields to Accession