How to remove EOL characters from imported literal strings?

asked 2020-07-26 00:54:54 +0200

magviana

seq=Word("TCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGA
TCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGACTTGAAAGCGAA
AGGGAAACCAGAGGAGCTCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGGCAAGAGGCGAGGGG
AGGCGACTGGTGAGTACGCCAAAAATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAGAGCG")

returns

File "<ipython-input-4-036a51caf660>", line 1
    seq=Word("TCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGA
                                                                                   ^
SyntaxError: EOL while scanning string literal

Of course one could remove the EOLs manually in this case, but when dealing with common DNA sequences this is hardly practical.

answered 2020-07-26 17:28:33 +0200

slelievre

There are several ways to deal with this:

use multi-line strings
use string concatenation
store just the DNA sequence in a file and read it from the file

Multi-line strings

Use triple quotes (triple single-quotes or triple double-quotes) for multi-line strings that can include newlines.

In your case you want to remove the newlines, and also the indentation spaces since they are part of the string, so chain two calls to the replace method.

seq = Word("""
    TCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGA
    TCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGACTTGAAAGCGAA
    AGGGAAACCAGAGGAGCTCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGGCAAGAGGCGAGGGG
    AGGCGACTGGTGAGTACGCCAAAAATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAGAGCG
    """.replace('\n', '').replace(' ', '')
)
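
If the lines inside the triple quotes are indented, the spaces are part of the string as well. Here is a minimal sketch in plain Python (no Word, so it also runs outside Sage) of one way to drop every whitespace character at once: str.split with no argument, followed by ''.join. The names raw and clean and the shortened sequence are only illustrative.

```python
raw = """
    TCAATAAAGC
    TTGCCTTGAG
    """

# split() with no argument splits on any run of whitespace
# (spaces, tabs, newlines) and discards the empty pieces.
clean = ''.join(raw.split())

print(clean)       # TCAATAAAGCTTGCCTTGAG
print(len(clean))  # 20
```

In Sage, the cleaned string can then be passed to Word as before.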

Splitting a string into several lines

Python automatically concatenates adjacent string literals, so inside the parentheses the sequence can be split over several lines:

seq = Word(
    "TCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGA"
    "TCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGACTTGAAAGCGAA"
    "AGGGAAACCAGAGGAGCTCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGGCAAGAGGCGAGGGG"
    "AGGCGACTGGTGAGTACGCCAAAAATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAGAGCG"
)
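
To check that this really gives a single string with no newlines in it, here is a small sketch in plain Python with a shortened sequence (the variable name s is illustrative):

```python
s = (
    "TCAATAAAGC"
    "TTGCCTTGAG"
)

# Adjacent string literals are joined into one string,
# so no separator (and no newline) ends up in the result.
print(s)          # TCAATAAAGCTTGCCTTGAG
print('\n' in s)  # False
```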

Read string from file

It may work better to store just the DNA sequence in a file.

Say the file is dna.txt and contains:

TCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGA
TCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGACTTGAAAGCGAA
AGGGAAACCAGAGGAGCTCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGGCAAGAGGCGAGGGG
AGGCGACTGGTGAGTACGCCAAAAATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAGAGCG

Then you can do one of two things.

Read the whole file:

with open('dna.txt', 'r') as f:
    s = f.read()

seq = Word(s.replace('\n', ''))

The advantage here is that the file stays open only for as long as it takes to read it. Once it is closed again, we work on the resulting multi-line string: remove the newlines and make a word.

Read line by line:

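with open('dna.txt', 'r') as f:
    seq = Word(''.join(line.strip() for line in f))

Here, iterating over f yields the lines of the file, each still ending with its newline character. Calling line.strip() removes it, and ''.join(...) glues the stripped lines together with an empty string as separator, thus reconstructing the DNA string without line breaks.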
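
To try both ways of reading the file without Sage, here is a self-contained sketch in plain Python. It writes a small two-line sample to a temporary file (the location and the shortened sequence are made up for the illustration) and compares the two results.

```python
import os
import tempfile

# Write a small sample file; a real dna.txt would hold the full sequence.
path = os.path.join(tempfile.mkdtemp(), 'dna.txt')
with open(path, 'w') as f:
    f.write("TCAATAAAGC\nTTGCCTTGAG\n")

# Way 1: read everything, then remove the newlines.
with open(path, 'r') as f:
    whole = f.read().replace('\n', '')

# Way 2: iterate over the lines, stripping each line's trailing newline.
with open(path, 'r') as f:
    by_line = ''.join(line.rstrip('\n') for line in f)

print(whole == by_line)  # True
print(whole)             # TCAATAAAGCTTGCCTTGAG
```

In Sage, either string can then be passed to Word.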


Comments

With your suggestion I can now work directly with DNA files in the FASTA format, of the form

>NC_014373.1 Bundibugyo ebolavirus, complete genome
GACACACAAAAAGAATGAAGGATTTTGAATCTTTATTGTGTGCGAGTA...

To extract its header use

def header(xseq):  
    with open(xseq, 'r') as f:
        s = f.read()
        print(Word(s.replace('\n',''))[1:s.index('\n')])

To remove the header and all subsequent newlines use

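def fastaseq(xseq):
    with open(xseq, 'r') as f:
        s = f.read()
    # Cut off the header line first, then remove the newlines from the rest;
    # slicing after the replace would use an index that is off by one.
    return Word(s[s.index('\n')+1:].replace('\n', ''))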
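
Both pieces can also be obtained in one pass. Here is a minimal sketch in plain Python, assuming a FASTA file with a single record whose first line is the header. The name read_fasta is a hypothetical helper, and the sample file content is made up for the illustration.

```python
import os
import tempfile

def read_fasta(path):
    """Return (header, sequence) for a single-record FASTA file."""
    with open(path, 'r') as f:
        lines = f.read().splitlines()
    header = lines[0][1:]          # first line, without the leading '>'
    sequence = ''.join(lines[1:])  # remaining lines, newlines removed
    return header, sequence

# Tiny made-up sample standing in for a real FASTA file.
path = os.path.join(tempfile.mkdtemp(), 'sample.fasta')
with open(path, 'w') as f:
    f.write(">sample header\nGACACACA\nAAAAGAAT\n")

header, sequence = read_fasta(path)
print(header)    # sample header
print(sequence)  # GACACACAAAAAGAAT
```

In Sage, Word(sequence) then gives the word for the sequence.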

Example:

header('ebola.fasta')

returns

 NC_014373.1 Bundibugyo ebolavirus, complete genome

and

  len(fastaseq('ebola.fasta'))

returns

magviana ( 2020-07-26 22:00:02 +0200 )

See:

complete sequence

magviana ( 2020-07-26 22:29:28 +0200 )
