How to remove EOL characters from imported literal strings?

asked 2020-07-26 00:54:54 +0200

magviana
seq=Word("TCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGA
TCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGACTTGAAAGCGAA
AGGGAAACCAGAGGAGCTCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGGCAAGAGGCGAGGGG
AGGCGACTGGTGAGTACGCCAAAAATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAGAGCG")

returns

File "<ipython-input-4-036a51caf660>", line 1
    seq=Word("TCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGA
                                                                                   ^
SyntaxError: EOL while scanning string literal

Of course one could remove the EOLs manually in this case, but when dealing with common DNA sequences this is hardly practical.

1 Answer

answered 2020-07-26 17:28:33 +0200

slelievre

There are several ways to deal with this:

  • use a multi-line string and strip the whitespace
  • use string concatenation
  • store just the DNA sequence in a file and read it from the file

Multi-line strings

Use triple quotes (triple single-quotes or triple double-quotes) for multi-line strings that can include newlines.

In your case you want to remove the newlines and, since the lines are indented, the leading spaces too. Calling split() with no argument and joining the pieces removes all whitespace at once, whereas replace('\n', '') alone would leave the indentation inside the word.

seq = Word(''.join("""
    TCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGA
    TCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGACTTGAAAGCGAA
    AGGGAAACCAGAGGAGCTCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGGCAAGAGGCGAGGGG
    AGGCGACTGGTGAGTACGCCAAAAATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAGAGCG
    """.split())
)
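To guard against stray characters as well, the cleaning step can be wrapped in a small function. The following is only a sketch in plain Python: the name clean_dna is hypothetical (it is not part of Sage), and the alphabet A, C, G, T is an assumption that would need widening for sequences with ambiguity codes such as N.

```python
def clean_dna(text, alphabet="ACGT"):
    """Remove all whitespace from text and check the remaining letters.

    Hypothetical helper for illustration; not part of Sage.
    """
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines), so joining the pieces drops all of it.
    cleaned = "".join(text.split())
    unexpected = set(cleaned) - set(alphabet)
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return cleaned


# Shortened sample: the first 20 letters of the sequence, indented.
raw = """
    TCAATAAAGC
    TTGCCTTGAG
"""
dna = clean_dna(raw)
print(dna)
```

In Sage, the cleaned string could then be passed on as Word(dna).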

Splitting a string into several lines

Use auto concatenation of strings:

seq = Word(
    "TCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGA"
    "TCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGACTTGAAAGCGAA"
    "AGGGAAACCAGAGGAGCTCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGGCAAGAGGCGAGGGG"
    "AGGCGACTGGTGAGTACGCCAAAAATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAGAGCG"
)
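Adjacent string literals are merged by Python itself when the code is compiled, so no newline ever ends up in the result. A short check in plain Python, using only the first 20 letters of the sequence:

```python
# Adjacent string literals inside parentheses are merged into one string
# at compile time; no "+" and no newline characters are involved.
fragment = (
    "TCAATAAAGC"
    "TTGCCTTGAG"
)

print(fragment)       # TCAATAAAGCTTGCCTTGAG
print(len(fragment))  # 20
```

Note that a comma between the literals would build a tuple of strings instead of a single string.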

Read string from file

It could work better to store just the DNA sequence in the file.

Say the file is dna.txt and contains:

TCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGA
TCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGACTTGAAAGCGAA
AGGGAAACCAGAGGAGCTCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGGCAAGAGGCGAGGGG
AGGCGACTGGTGAGTACGCCAAAAATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAGAGCG

Then you can do one of two things.

Read the whole file:

with open('dna.txt', 'r') as f:
    s = f.read()

seq = Word(s.replace('\n', ''))

The advantage here is that the file stays open only for as long as it takes to read it. Once it is closed again, we work on the resulting multi-line string: remove the newlines and make a word.
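A self-contained way to try the read-everything approach in plain Python is sketched below. The temporary directory and the two-line sample are assumptions made only so that the example runs anywhere; in practice dna.txt would already exist.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "dna.txt")

    # Write a small two-line sample file.
    with open(path, "w") as f:
        f.write("TCAATAAAGC\nTTGCCTTGAG\n")

    # Read everything while the file is open...
    with open(path, "r") as f:
        s = f.read()

# ...then work on the string after the file has been closed.
dna = "".join(s.split())  # drops newlines and any other whitespace
print(dna)                # TCAATAAAGCTTGCCTTGAG
```

In Sage, Word(dna) would then build the word.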

Read line by line:

with open('dna.txt', 'r') as f:
    seq = Word(''.join(line.strip() for line in f))

Here, iterating over f yields the lines one at a time, each still ending in its newline. Stripping each line and joining the results with an empty string as separator reconstructs the DNA string without line breaks.
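Iterating over an open text file yields each line with its trailing newline still attached, so the lines need stripping before they are joined. A small check in plain Python, with io.StringIO standing in for the open file and a shortened sample as content:

```python
import io

# io.StringIO behaves like an open text file for iteration purposes.
f = io.StringIO("TCAATAAAGC\nTTGCCTTGAG\n")
lines = list(f)
print(lines)  # ['TCAATAAAGC\n', 'TTGCCTTGAG\n']

# Joining the raw lines keeps the newlines...
joined_raw = "".join(lines)

# ...while stripping each line first removes them.
joined_clean = "".join(line.strip() for line in lines)
print(joined_clean)  # TCAATAAAGCTTGCCTTGAG
```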

Comments

With your suggestion I can now work directly with DNA files in the FASTA format, of the form

>NC_014373.1 Bundibugyo ebolavirus, complete genome
GACACACAAAAAGAATGAAGGATTTTGAATCTTTATTGTGTGCGAGTA...

To extract its header use

def header(xseq):
    with open(xseq, 'r') as f:
        # The header is the first line; drop the leading '>' and the newline.
        print(f.readline().rstrip('\n')[1:])

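To remove the header and all subsequent newlines use

def fastaseq(xseq):
    with open(xseq, 'r') as f:
        s = f.read()
    # Cut off the header line first, then remove the newlines from the rest.
    # Slicing only after the newlines are gone would shift the start index
    # by one and drop the first base.
    body = s[s.index('\n')+1:]
    return Word(body.replace('\n',''))

Example:

header('ebola.fasta')

returns

 NC_014373.1 Bundibugyo ebolavirus, complete genome

and

  len(fastaseq('ebola.fasta'))

returns

18940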
magviana ( 2020-07-26 22:00:02 +0200 )
magviana ( 2020-07-26 22:29:28 +0200 )
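The two functions in the comment above assume a single record per file, while a FASTA file may hold several records, each starting with its own '>' line. A minimal sketch of a reader for that case, in plain Python: the name read_fasta is hypothetical (not part of Sage), and the two-record sample is made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle.

    Hypothetical helper for illustration; not part of Sage.
    """
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the last record once the input is exhausted.
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-record sample; io.StringIO stands in for an open file.
sample = io.StringIO(">seq1 first\nACGT\nACGT\n>seq2 second\nGGCC\n")
records = list(read_fasta(sample))
print(records)
# [('seq1 first', 'ACGTACGT'), ('seq2 second', 'GGCC')]
```

With a real file one would pass the object returned by open(...) instead, and in Sage each sequence could be wrapped as Word(sequence).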
