Submitting RCC sequences to Genbank with Geneious

1 Aim of document

This document explains how to use Geneious to :

  • assemble and clean final sequences from several traces using different internal primers
  • annotate the sequences
  • submit to Genbank using Bankit through the Geneious plug-in
  • submit 18S, ITS and 16S sequences to Genbank, which can no longer be submitted using Bankit

Notes

  • Look at legends below screen captures for directions.
  • Changes from previous versions have been labelled with

2 Assemble and clean sequences

  • Import the ab1 trace
    • Drag and Drop
Import trace sequences

  • Trim the sequences
    • Annotate & Predict -> Trim Ends
    • Use an error probability limit between 0.01 and 0.02 (increase it to 0.05 if the traces cannot be assembled correctly; the trimming will then be less drastic). For single reads (e.g. 528F) use a maximum of 0.02.
Trim sequences
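
To get a feeling for what these limits mean, an error probability P can be related to the more familiar Phred quality score through Q = -10 log10(P). The short Python sketch below only illustrates this general relationship. It is not part of Geneious, and the function name is made up for this example.

```python
import math


def phred_from_error_probability(p: float) -> float:
    """Convert a per-base error probability into a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


# The three limits mentioned in this tutorial
for limit in (0.01, 0.02, 0.05):
    q = phred_from_error_probability(limit)
    print(f"error probability {limit} corresponds to about Q{q:.0f}")
```

The limits 0.01, 0.02 and 0.05 correspond to roughly Q20, Q17 and Q13, which is why raising the limit makes the trimming less drastic.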

Visualize the trimmed sequences

  • Assemble if several primers have been used
    • Align/Assemble > De Novo Assemble
    • Use for assembly name: RCC####_gene-name_your-initials_date
      • e.g. RCC2497_18S_PG_2018_02_15
        The name should not contain any space
    • Select “save the consensus”
    • Select “save contigs”.
    • You may have to change the trimming level (increase probability level - see above) if traces cannot be assembled
Assemble sequences

Visualize the assembled sequences

  • Check the assembly and edit the consensus if necessary.
    This is very important to make sure that your sequence is clean.
    • Allow editing
    • Edit bases that may be wrongly assigned in one of the traces.
Check and correct assembly

  • Select and extract consensus
Extract consensus

  • Reverse complement if necessary (if the sequence was assembled the other way around).
  • Locate primers, test forward and reverse separately.
    • Tools > Primers > Test with saved primers
Test with saved primers

Locate primers

  • Remove everything which is outside of primers including the primers.
    • Allow editing
    • Holding down the left mouse button, select the region to be deleted. It will show in green.
    • Press Delete
Remove everything outside primers
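
The logic of the last three steps (reverse complement if needed, locate the primers, keep only what lies between them) can be sketched in Python as follows. This is only an illustration of the idea, not what Geneious does internally: it requires exact primer matches and does not handle degenerate bases, and the function names and toy sequences are invented for the example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def trim_to_amplicon(seq: str, fwd_primer: str, rev_primer: str) -> str:
    """Return the region between the two primers, primers excluded.

    Exact matching only. Raises ValueError if a primer cannot be located.
    """
    seq = seq.upper()
    fwd = fwd_primer.upper()
    # The reverse primer binds the opposite strand, so look for its
    # reverse complement on the forward strand.
    rev_site = reverse_complement(rev_primer.upper())

    # If the sequence was assembled the other way around, flip it first.
    if fwd not in seq:
        seq = reverse_complement(seq)

    start = seq.find(fwd)
    if start == -1:
        raise ValueError("forward primer not found")
    start += len(fwd)

    end = seq.find(rev_site, start)
    if end == -1:
        raise ValueError("reverse primer not found")
    return seq[start:end]


# Toy data (not real primers)
consensus = "TT" + "GACCTG" + "AAAACCCGGGTTT" + "ACGTCA" + "GG"
print(trim_to_amplicon(consensus, "GACCTG", "TGACGT"))
```

In the toy example only the 13 bases between the two primer sites are kept, whichever orientation the consensus is supplied in.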

  • Et voilà, you have a clean sequence
    • The coloring corresponds to sequence quality based on the traces and assembly.
Final sequence

3 Add information to sequences

3.1 Taxonomy

  • Do a batch BLAST search
    • Select the files > sequence search or Blast search
    • This sometimes fails, in which case you can use the NCBI BLAST server instead
    • Pay attention to the following parameters:
      • database : nr genbank
      • program : blastn (for protein-coding genes such as rbcL, blastx can also be used for confirmation)
      • results : hit table
      • maximum hits : 25 at least
    • Parameters can be saved, recalled and deleted by clicking at the bottom of the dialog box
      • save current settings > name > save
    • You can request an API key from NCBI, which increases the number of requests you can make. The process is explained on the NCBI web site.
BLASTN

Request an NCBI key

Enter the key in Geneious preferences

  • Retrieve the closest sequence from GenBank (Optional)
    • From the Geneious folder with the Blast results: select the closest result and drag the file into a folder in your local database if you wish to retain and/or modify it.
    • From Genbank: copy the accession number > go to NCBI > nucleotide > paste the accession number (see the figure below). Drag the file into a folder in your local database if you wish to retain and/or modify it.
Retrieve closely related sequences from Genbank

  • Do a manual alignment (Optional)
    • This is very useful to detect introns, for ITS sequences, or for combined gene sequencing (partial 18S + 28S, for example).
    • Align/Assemble > Pairwise Align MAFFT using the default parameters
Alignment parameters

Alignment results

3.2 Gene annotation

This step is NOT necessary for 16S, 18S, ITS

  • With the mouse, select your sequence and add an annotation
  • Parameters to be changed (see the picture below)
    • Name: name of the gene
    • Type:
      • select rRNA for 18S, ITS, plastidial 16S and 28S
      • select CDS or gene for a coding sequence, for example rbcL
  • Add a property using the 1st ADD: name = product, value = name of gene, for example 18S rRNA.
  • Add an annotation using the 2nd ADD (click on INTERVALS to see it): check truncated left end and truncated right end. This indicates that the sequence is not complete. For example, the 18S in this tutorial had its extremities outside the primers removed, so it is incomplete.
    Make sure you do not have two annotations for the same gene!
Annotate genes

Annotated genes

3.3 Metadata

  • Add two new types of metadata (this has to be done only once) in the GenBank Submission category:
    • Strain
    • Culture_collection
      Edit Meta data Types > Genbank Submission > click on the + on the right side > write Culture_collection in the new field > OK
      Make sure that these new fields are in the Genbank Submission category. Do not create a new category.
      Use exactly this spelling for the names, in particular the underscore: “Culture_collection” and not, as before, “Culture Collection”.
Add new meta-data type: Culture Collection

  • Click on the final sequence, go to info and change or correct the following fields.
    • Name : RCC####_gene-name_your-initials_date, e.g. RCC9999_18S_PG_2015_10_01 (change if it is not in this format at this point).
      • This will be the ID of the sequence submitted to GenBank.
      • This name must not contain any space
      • This name must be unique. For example if you submit 2 sequences for the same strain and same gene you must use different names, e.g. RCC9999_18S_PG_2015_10_01_A and RCC9999_18S_PG_2015_10_01_B
    • Organism : e.g. Picochlorum sp. or Trebouxiophyceae.
      • Enter the genus name or, if not known, the lowest taxonomic level known.
      • Only use the species name if you are absolutely sure of the species as determined by microscopy or ITS. Do not rely on BLAST!!
      • DO NOT add the RCC number at the end of the organism name.
      • For levels above the genus, do not use sp. For example use Trebouxiophyceae and not Trebouxiophyceae sp. or Chlorophyta and not Chlorophyta sp.
    • Strain : This is the RCC code as RCCxxxx without space between RCC and number e.g. RCC1236.
    • Culture_collection : This is the RCC number as RCC:xxxx with “:” between RCC and number e.g. RCC:1236.
Update the different meta-data fields
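
The naming rules above are easy to get wrong when typing by hand. As a cross-check, the three fields can be derived from the RCC number with a few lines of Python. This is only a sketch of the convention described in this section; the function name and its arguments are invented for the example.

```python
import re
from datetime import date


def rcc_metadata(rcc_number: int, gene: str, initials: str,
                 day: date, suffix: str = "") -> dict:
    """Build the Name, Strain and Culture_collection fields for one sequence."""
    name = f"RCC{rcc_number}_{gene}_{initials}_{day:%Y_%m_%d}"
    if suffix:  # e.g. "A" or "B" when the same strain and gene are submitted twice
        name += f"_{suffix}"
    if re.search(r"\s", name):
        raise ValueError(f"name must not contain any space: {name!r}")
    return {
        "Name": name,                               # becomes the sequence ID
        "Strain": f"RCC{rcc_number}",               # no separator
        "Culture_collection": f"RCC:{rcc_number}",  # colon between RCC and number
    }


print(rcc_metadata(1236, "18S", "PG", date(2015, 10, 1)))
```

The Organism field is deliberately left out: it depends on how well the strain has been identified and should be filled in by hand.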

  • It is possible to quickly change metadata for a set of sequences using the Batch edit mode. For example you can :
    1. Copy the Strain field to the Culture_collection field
    2. Add the “:” automatically for all sequences by replacing “RCC” by “RCC:”.
Batch edit - simple

Batch edit - advanced

3.4 Primers information

This step is optional for 18S, but primers must be added for ITS, 28S and other genes

Edit Meta data Types > Sequencing Primer > OK

You can also use Batch edit to go faster

Add primer information to meta-data

4 GenBank submission - General case (not for 16S, 18S or ITS, see next part)

  • Note that since August 2018, 16S, 18S, 28S and ITS sequences cannot be submitted by BankIt and must be submitted through a web interface.
  • Install the GenBank submission plugin
    • Tools > plugin > choose the plugin and click Install

  • Select the sequences you want to submit
  • Select GenBank submission
  • First enter the Publisher details (fill in the information as in the picture below, except that the sequence authors are Daniel Vaulot + the person who produced the sequence)
    1. Name
    2. email
    3. Address
    4. Sequence authors
    5. Select Unpublished
    6. Reference should be “Roscoff Culture Collection”

  • Check very carefully all the fields
    • Submission name : the name of the file to be saved (this should be kept on the Databases computer)
    • Save a local file (only upload when everything is OK)
    • Project name : Roscoff Culture Collection
    • Molecule type : Genomic DNA
    • Genetic location : in general Genomic but can also be Plastid or Mitochondrion or Nucleomorph for Cryptophytes
    • Sequence ID : Name
    • Organism : Organism
    • Include features/annotation : Yes
    • Include other fields : Yes
      • Culture_collection : Culture_collection (GenBank submission)
      • Strain : Strain (GenBank submission).
    • Primers : You can add the primers if necessary, but they need to be entered as Sequencing primers (see 3.4)

  • Check submission in the Preview mode
    • If there are errors, you need to correct them
    • Ignore warning about Organism not found and Collection
Warnings - Ignore

Genbank record preview

  • Save as tar file.
    • The submission has to be completed before you start processing a new one, because Geneious keeps in memory the information from the last .tar file you saved.
    • The tar file can be uncompressed to an .asn file which can be opened with Sequin which can be downloaded from NCBI.
  • Finally submit using the Geneious BankIt account and record the BankIt number
Submit to GenBank

5 GenBank submission - 16S, 18S or ITS

Submission must now be done at https://submit.ncbi.nlm.nih.gov/subs/genbank/. If you do not have a login you must create one.

Information about the NCBI submission portal is here. We recommend reading these instructions very carefully before submitting the sequences.

The main steps are :

  1. Create fasta file with unique Name for each sequence.
    • Sequence Name (Sequence_ID) identifies the same specimen in all the steps of a submission. We use a convention of the following type: RCC9999_18S_PG_2015_10_01 (see above)
    • Sequence Name must be unique within the set and may not contain spaces.
    • Sequence Name may contain only the following characters: letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).
  2. Create a tabulated file as Text (tsv - tab-delimited) containing all the information about the sequence. See this link for the description of all the modifiers. This file can be easily exported from Geneious and finalized with Excel. For the RCC, the following columns are necessary (fields in bold are mandatory):
    • Name - This field will be used as the Sequence_ID for submission
    • Organism - e.g. Picochlorum sp. or Trebouxiophyceae (no sp. above the genus level, see above)
    • Genbank Submission : Strain - e.g. RCC1236.
    • Genbank Submission : Culture_collection - e.g. RCC:1236.
    • Fwd_primer_name - name of forward PCR primer
    • Fwd_primer_seq - nucleotide sequence of forward PCR primer
    • Rev_primer_name - name of reverse PCR primer
    • Rev_primer_seq - nucleotide sequence of reverse PCR primer
  3. The columns of the tabulated file must be edited, not forgetting the underscores. This is best done with an editor such as Notepad++ or with Excel. In the latter case the file must be saved as a text tabulated file.
    • Name -> Sequence_ID
    • Genbank Submission : Strain -> Strain
    • Genbank Submission : Culture_collection -> Culture_collection
    • name of forward PCR primer -> Fwd_primer_name
    • nucleotide sequence of forward PCR primer -> Fwd_primer_seq
    • name of reverse PCR primer -> Rev_primer_name
    • nucleotide sequence of reverse PCR primer -> Rev_primer_seq

Example of header for the tsv file : Sequence_ID Culture_collection Strain Organism Fwd_primer_name Fwd_primer_seq Rev_primer_name Rev_primer_seq
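
Before uploading, it can be worth checking the sequence names against the rules listed in step 1. The Python sketch below is one possible way to do this; it only encodes the three rules stated above (allowed characters, no spaces, uniqueness), and the function name is invented for the example.

```python
import re
from collections import Counter

# Letters, digits, hyphens, underscores, periods, colons, asterisks, number signs
ALLOWED_ID = re.compile(r"[A-Za-z0-9\-_.:*#]+")


def check_sequence_ids(ids):
    """Return a list of human-readable problems (empty if all IDs are fine)."""
    problems = []
    for seq_id, n in Counter(ids).items():
        if n > 1:
            problems.append(f"{seq_id!r} is used {n} times")
        if not ALLOWED_ID.fullmatch(seq_id):
            problems.append(f"{seq_id!r} contains characters that are not allowed")
    return problems


ids = [
    "RCC9999_18S_PG_2015_10_01_A",
    "RCC9999_18S_PG_2015_10_01_B",
    "RCC 1236_18S_PG_2015_10_01",  # space: not allowed
]
for problem in check_sequence_ids(ids):
    print(problem)
```

The same list of names can be taken from the first column of the tsv file or from the header lines of the fasta file, since both must use identical Sequence_ID values.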

5.1 Prepare files

We will do a simple case but you can add more columns (see list of modifiers).

  • Fasta file
    • Select sequences
    • Export as fasta
Export to Fasta

Ignore this warning

Wrap sequences to 80 characters

Final fasta file
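
For reference, the resulting file has one header line per sequence (">" followed by the sequence name) and the sequence wrapped to 80 characters per line. If you ever need to produce or repair such a file outside Geneious, a minimal Python sketch could look like this (the function name and the toy sequence are invented for the example):

```python
def to_fasta(records, width=80):
    """Format (sequence_id, sequence) pairs as FASTA text wrapped to `width`."""
    lines = []
    for seq_id, seq in records:
        lines.append(f">{seq_id}")
        # Cut the sequence into chunks of at most `width` characters
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy sequence of 200 bases, not a real 18S sequence
fasta_text = to_fasta([("RCC1236_18S_PG_2015_10_01", "ACGT" * 50)])
print(fasta_text)
```

A 200-base sequence is written as two lines of 80 bases followed by one line of 40.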

  • Source information file tab-delimited (tsv file)
Select sequences. Check the 4 fields (Sequence_ID, Strain, Culture_collection and Organism) are correct.

Export selected documents

Export as tsv

Select the columns to be exported ; Name, Culture_collection, Strain, Organism

Edit the tsv file to remove GenBank Submission: from the column titles and change Name to Sequence_ID. This is best done with an editor such as Notepad++ or with Excel. In the latter case the file must be saved as a tab-delimited text file.

After editing and removing GenBank Submission:

Editing with Excel (save file as tab-delimited tsv)
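
As an alternative to editing by hand, the header line can also be cleaned with a short script. The Python sketch below strips the category prefix from the column titles and renames Name to Sequence_ID. The exact spelling of the prefix in the exported file is an assumption (it is matched without regard to case or spacing around the colon), and the function names are invented for the example.

```python
import csv
import io
import re

# Matches "GenBank Submission:", "Genbank Submission : ", etc. at the start of a title
PREFIX = re.compile(r"^\s*genbank submission\s*:\s*", re.IGNORECASE)


def clean_header(column: str) -> str:
    """Remove the category prefix and rename Name to Sequence_ID."""
    column = PREFIX.sub("", column).strip()
    return "Sequence_ID" if column == "Name" else column


def clean_tsv(text: str) -> str:
    """Return the tsv text with a cleaned header line; data rows are untouched."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return text
    rows[0] = [clean_header(c) for c in rows[0]]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


exported = (
    "Name\tGenBank Submission: Culture_collection\t"
    "GenBank Submission: Strain\tOrganism\n"
    "RCC1236_18S_PG_2015_10_01\tRCC:1236\tRCC1236\tPicochlorum sp.\n"
)
print(clean_tsv(exported))
```

To process a real file, read its content into a string, pass it to clean_tsv and write the result back to a new tab-delimited file.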

5.2 Submit to NCBI web portal

Web portal. Register, or log in if you already have an ID

Enter the type of sequence

Enter submitter information

Enter the sequence technology. In almost all cases choose Sanger and Assembly

Sequences. Release date: choose immediate release in most cases, there is really no need to delay release. The chimera question is only for prokaryotes. Choose pure cultures for cyanobacteria. Upload the fasta sequence file.

Source information. Since it will be loaded from the text file, choose “NONE of these”

Upload tsv file saved from Geneious

After uploading tsv file

Taxonomy error - This error is due to the addition of sp. to taxa at a rank above the genus. You need to correct the tsv file by removing sp. If the error comes from a new taxon not yet described, you can ignore it; GenBank will probably contact you to add this taxon to their database.
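
This kind of error can often be spotted before uploading. The Python sketch below flags Organism values where sp. follows a name that looks like a rank above the genus. It is only a heuristic: the list of suffixes is an assumption chosen for this example and is certainly incomplete, so the result should always be checked by eye.

```python
# Name endings that usually indicate a rank above the genus
# (assumption made for this example, incomplete by design)
HIGHER_RANK_SUFFIXES = ("phyceae", "phyta", "ales", "aceae")


def fix_organism(organism: str) -> str:
    """Drop a trailing 'sp.' when the name looks like a rank above the genus."""
    parts = organism.split()
    if (
        len(parts) == 2
        and parts[1] == "sp."
        and parts[0].lower().endswith(HIGHER_RANK_SUFFIXES)
    ):
        return parts[0]
    return organism


for name in ("Picochlorum sp.", "Trebouxiophyceae sp.", "Chlorophyta sp."):
    print(f"{name} -> {fix_organism(name)}")
```

Genus-level entries such as Picochlorum sp. are left unchanged, while Trebouxiophyceae sp. and Chlorophyta sp. lose their sp.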

Add the reference. For the Roscoff Culture Collection, just fill in as indicated, with the name of the person who produced the sequence first.

Final check

Submission status. When you press submit you should arrive at the final screen showing your submission.

Some time later (from a few minutes to a few days), you should receive an email with the accession numbers. Please forward it to rcc@sb-roscoff.fr.

  • In case of errors in the submission, you may receive an email as follows.
This email explains which sequence(s) is (are) incorrect.

  • You will need to carefully examine the sequence that Genbank has identified as problematic. Do not disregard their analysis, because they really do catch errors. In most cases this is due to bad assembly. You need to go back to your trace files (see above), use a more stringent threshold to trim the sequences (e.g. 0.02 instead of 0.05) and redo the assembly.
  • Once this is done, remove the old, bad sequence from your fasta file and add the new sequence (using the same sequence name).
  • Update the .tsv file if necessary.
  • Follow the link in the email to go back to the NCBI web site. You will need to reload ALL sequences in the submission that had a problem, as well as the tsv file with all the information.
The web page to correct your submission. When you click on the link you will be asked to reload the fasta and tsv files.

After doing the correction you should see both the validated submissions as well as the one you just fixed.

6 Appendixes

6.1 Retrieve sequences from Genbank using Geneious

  • For a list: Go to nucleotide, type the accession numbers separated by commas and click search. The results will appear in the bottom panel. You must drag the file into a folder in your local database if you wish to retain and/or modify it.

  • For consecutive accession numbers: type the first and last numbers separated by :, click on more options, change All fields to Accession
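
If you prefer to paste an explicit comma-separated list (first case above) even for consecutive numbers, the range can be expanded with a few lines of Python. This sketch assumes accession numbers made of a letter prefix followed by digits; the function name and the accession numbers used are made up for the example.

```python
import re


def expand_accession_range(first: str, last: str) -> list:
    """List all accession numbers from `first` to `last`, both included."""
    m1 = re.fullmatch(r"([A-Z]+)(\d+)", first)
    m2 = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not m1 or not m2:
        raise ValueError("expected a letter prefix followed by digits")
    if m1.group(1) != m2.group(1) or len(m1.group(2)) != len(m2.group(2)):
        raise ValueError("both accession numbers must share the same format")
    start, end = int(m1.group(2)), int(m2.group(2))
    if end < start:
        raise ValueError("last accession number is before the first one")
    width = len(m1.group(2))  # keep leading zeros
    return [f"{m1.group(1)}{n:0{width}d}" for n in range(start, end + 1)]


# Made-up accession numbers
print(",".join(expand_accession_range("XX000098", "XX000101")))
```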

Daniel Vaulot & Adriana Lopes dos Santos

Version 2.0 - 23 08 2018