Python read through file until match, read until next pattern

Python 2.4.3

I need to read through some files (they can be as large as 10 GB). I need the script to go through a file until a line matches a pattern, then print that line and every line after it until another pattern matches. At that point, it should resume reading through the file until the next match of the first pattern.

For example, the file contains:

    ---- Alpha ---- Zeta
    ...(text lines)
    ---- Bravo ---- Delta
    ...(text lines)

etc.

If matching on ---- Alpha ---- Zeta, it should print ---- Alpha ---- Zeta and every line after that until it encounters ---- Bravo ---- Delta (or any header other than ---- Alpha ---- Zeta), then skip past the following lines until it matches ---- Alpha ---- Zeta again.

The following matches what I'm looking for, but it only prints the matching line and not the text that follows it.

Any idea where I'm going wrong on it?

    import re

    fh = open('text.txt', 'r')

    re1='(-)'    # Any Single Character 1
    re2='(-)'    # Any Single Character 2
    re3='(-)'    # Any Single Character 3
    re4='(-)'    # Any Single Character 4
    re5='( )'    # White Space 1
    re6='(Alpha)'    # Word 1
    re6a='((?:[a-z][a-z]+))'    # Word 1 alternate
    re7='( )'    # White Space 2
    re8='(-)'    # Any Single Character 5
    re9='(-)'    # Any Single Character 6
    re10='(-)'    # Any Single Character 7
    re11='(-)'    # Any Single Character 8
    re12='(\\s+)'    # White Space 3
    re13='(Zeta)'    # Word 2
    re13a='((?:[a-z][a-z]+))'    # Word 2 alternate

    rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10+re11+re12+re13,re.IGNORECASE|re.DOTALL)
    rga = re.compile(re1+re2+re3+re4+re5+re6a+re7+re8+re9+re10+re11+re12+re13a,re.IGNORECASE|re.DOTALL)

    for line in fh:
        if re.match(rg, line):
            print line
            fh.next()
            while not re.match(rga, line):
                print fh.next()

    fh.close()

and my example text file.

    ---- Pappa ---- Oscar
    Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris eleifend imperdiet lacus quis imperdiet. Nulla erat neque, laoreet vel fermentum a, dapibus in sem. Maecenas elementum nisi nec neque pellentesque ac rutrum urna cursus. Nam non purus sit amet dolor fringilla venenatis. Integer augue neque, scelerisque ac dictum at, venenatis elementum libero. Etiam nec ante in augue porttitor laoreet. Aenean ultrices pellentesque erat, id porta nulla vehicula id. Cras eu ante nec diam dapibus hendrerit in ac diam. Vivamus velit erat, tincidunt id tempus vitae, tempor vel leo. Donec aliquam nibh mi, non dignissim justo.
    ---- Alpha ---- Zeta
    Sed molestie tincidunt euismod. Morbi ultrices diam a nibh varius congue. Nulla velit erat, luctus ac ornare vitae, pharetra quis felis. Sed diam orci, accumsan eget commodo eu, posuere sed mi. Phasellus non leo erat. Mauris turpis ipsum, mollis sed ismod nec, aliquam non quam. Vestibulum sem eros, euismod ut pharetra sit amet, dignissim eget leo.
    ---- Charley ---- Oscar
    Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Aliquam commodo, metus at vulputate hendrerit, dui justo tempor dui, at posuere ante vitae lorem. Fusce rutrum nibh a erat condimentum laoreet. Nullam eu hendrerit sapien. Suspendisse id lobortis urna. Maecenas ut suscipit nisi. Proin et metus at urna euismod sollicitudin eu at mi. Aliquam ac egestas magna. Quisque ac vestibulum lectus. Duis ac libero magna, et volutpat odio. Cras mollis tincidunt nibh vel rutrum. Curabitur fringilla, ante eget scelerisque rhoncus, libero nisl porta leo, ac vulputate mi erat vitae felis. Praesent auctor fringilla rutrum. Aenean sapien ligula, imperdiet sodales ullamcorper ut, vulputate at enim.
    ---- Bravo ---- Delta
    Donec cursus tincidunt pellentesque. Maecenas neque nisi, dignissim ac aliquet ac, vestibulum ut tortor. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.
    Aenean ullamcorper dapibus accumsan. Aenean eros tortor, ultrices at adipiscing sed, lobortis nec dolor. Fusce eros ligula, posuere quis porta nec, rhoncus et leo. Curabitur turpis nunc, accumsan posuere pulvinar eget, sollicitudin eget ipsum. Sed a nibh ac est porta sollicitudin. Pellentesque ut urna ut risus pharetra mollis tincidunt sit amet sapien. Sed semper sollicitudin eros quis pellentesque. Curabitur ac metus lorem, ac malesuada ipsum. Nulla turpis erat, congue eu gravida nec, egestas id nisi. Praesent tellus ligula, pretium vitae ullamcorper vitae, gravida eu ipsum. Cras sed erat ligula.
    ---- Alpha ---- Zeta
    Cras id condimentum lectus. Sed sit amet odio eros, ut mollis sapien. Etiam varius tincidunt quam nec mattis. Nunc eu varius magna. Maecenas id ante nisl. Cras sed augue ipsum, non mollis velit. Fusce eu urna id justo sagittis laoreet non id urna. Nullam venenatis tincidunt gravida. Proin mattis est sit amet dolor malesuada sagittis. Curabitur in lacus rhoncus mi posuere ullamcorper. Phasellus eget odio libero, ut lacinia orci. Pellentesque iaculis, ligula at varius vulputate, arcu leo dignissim massa, non adipiscing lectus magna nec dolor. Quisque in libero nec orci vestibulum dapibus. Nulla turpis massa, varius quis gravida eu, bibendum et nisl. Fusce tincidunt laoreet elit, sed egestas diam pharetra eget. Maecenas lacus velit, egestas nec tempor eget, hendrerit et massa.

+++++++++++++++++++++ Update ++++++++++++++++++++++++++++++++

The following code does work: it matches on the header row, then prints that row and every line after it until the next header row. If that header doesn't match, it skips ahead until the next header that does.

The only problem is that it's really, really slow. It takes about a minute to go through 10 million lines.

    import re

    re1='(-)'    # Any Single Character 1
    re2='(-)'    # Any Single Character 2
    re3='(-)'    # Any Single Character 3
    re4='(-)'    # Any Single Character 4
    re5='( )'    # White Space 1
    re6='(Alpha)'    # Word 1
    re6a='((?:[a-z][a-z]+))'    # Word 1 alternate
    re7='( )'    # White Space 2
    re8='(-)'    # Any Single Character 5
    re9='(-)'    # Any Single Character 6
    re10='(-)'    # Any Single Character 7
    re11='(-)'    # Any Single Character 8
    re12='(\\s+)'    # White Space 3
    re13='(Zeta)'    # Word 2
    re13a='((?:[a-z][a-z]+))'    # Word 2 alternate

    rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10+re11+re12+re13,re.IGNORECASE|re.DOTALL)
    rga = re.compile(re1+re2+re3+re4+re5+re6a+re7+re8+re9+re10+re11+re12+re13a,re.IGNORECASE|re.DOTALL)

    linestop = 0    # 0 = looking for the wanted header, 1 = printing a section

    fh = open('test.txt', 'r')

    for line in fh:
        if linestop == 0:
            if re.match(rg, line):
                print line
                linestop = 1
        else:
            if re.match(rga, line):
                linestop = 0
            else:
                print line

    fh.close()

+++++++++ If I add a grep step first, I'm thinking that'll speed things up tremendously, i.e. grep first, then run the above regex script on the result.

I got os.system to work fine, but I can't see how to pass a regex pattern via Popen.

**** Final Update **********

I'm calling this completed. What I ended up doing was:

  • Grepping through the file using os.system and writing the results out to a file.
  • Reading that file in and applying the re.match logic from above, printing out only the necessary items.

The net result: reading through a 10 million line file and printing out the necessary items went from about 65 seconds to about 3.5 seconds. I wish I could have figured out how to call grep some way other than os.system, but maybe it's just not well implemented in Python 2.4.
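For reference, one way the grep step might be driven without os.system is the subprocess module, which has been in the standard library since Python 2.4. The sketch below is only an illustration: the function name grep_lines and the header regex in the usage note are assumptions, not what was actually run, and it presumes a grep binary on the PATH. Passing the arguments as a list bypasses the shell, so the regex needs no extra quoting.

```python
import subprocess


def grep_lines(pattern, path):
    """Yield the lines of `path` that grep reports as matching `pattern`.

    `grep_lines` is a hypothetical helper name; it assumes a `grep`
    executable is available on the PATH.
    """
    proc = subprocess.Popen(
        # -E: extended regex; -e: the next argument is the pattern, even
        # if it starts with '-'; --: everything after it is a file name.
        ['grep', '-E', '-e', pattern, '--', path],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # read text lines rather than bytes
    )
    for line in proc.stdout:
        yield line
    proc.stdout.close()
    proc.wait()
```

Calling grep_lines('^---- [A-Za-z]+ ---- [A-Za-z]+', 'test.txt') would then stream the header lines one at a time, without writing an intermediate file.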

--------------Solutions-------------

You're still matching against line, which doesn't change because you're still in the same iteration of the for loop.

I don't see any need to use regex here. Not that they're that bad, but if you're looking for such a specific pattern, it's just overkill to use regex. Try something like this:

    def record(filename, pattern):
        with open(filename) as file:
            recording = False
            for line in file:
                if line.startswith(pattern):
                    yield line
                    recording = not recording
                elif recording:
                    yield line

Calling record with a filename and your pattern gives you a generator object yielding line after line. This is better when dealing with large files, so you don't have to slurp them in at once.
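Two caveats about record as written: it only toggles when it sees the same pattern again, so it keeps yielding through other headers such as ---- Bravo ---- Delta, and the with statement is not available in Python 2.4. A possible variant is sketched below. It assumes every header line starts with '---- ' (a guess based on the sample file), takes any iterable of lines instead of a filename, and uses the hypothetical names record_sections and header_prefix.

```python
def record_sections(lines, wanted, header_prefix='---- '):
    """Yield each `wanted` header line and the lines that follow it,
    stopping at the next header of any other kind.

    Assumption: every header line starts with `header_prefix`.
    """
    recording = False
    for line in lines:
        if line.startswith(header_prefix):
            # Every header decides afresh whether its section is kept.
            recording = line.startswith(wanted)
        if recording:
            yield line


# Small in-memory demonstration; an open file object works the same way.
sample = [
    '---- Pappa ---- Oscar\n',
    'skipped text\n',
    '---- Alpha ---- Zeta\n',
    'kept text\n',
    '---- Bravo ---- Delta\n',
    'skipped again\n',
]
kept = list(record_sections(sample, '---- Alpha ---- Zeta'))
```

With a real file, the caller opens it, passes the file object as lines, and closes it afterwards. Note that an ordinary text line that happens to begin with the prefix would be treated as a header.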

Printing your lines could then work like this - assuming your file is named example.txt:

    filename = 'example.txt'
    for rec in record(filename, '---- Alpha ---- Zeta'):
        print rec

To be precise: The record generator yields the lines including the line breaks, so you might want to join them back together without any additional line breaks:

    print "".join(record(filename, '---- Alpha ---- Zeta'))
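Joining builds the whole output in memory first, which may matter for a 10 GB input. A hedged alternative is to write each line as it arrives, since write() adds no newline of its own. In the sketch below, emit is a hypothetical helper name and the demo generator merely stands in for record(filename, pattern).

```python
import sys


def emit(lines, out=None):
    """Write each line to `out` as yielded; return the number written."""
    if out is None:
        out = sys.stdout
    count = 0
    for line in lines:
        out.write(line)  # the line already carries its own line break
        count += 1
    return count


# Stand-in for record(filename, pattern): any iterable of lines works.
demo = (text for text in ['---- Alpha ---- Zeta\n', 'first line\n'])
emit(demo)
```

Because the lines are consumed one by one, memory use stays flat however large the matching sections are.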

Category:python Time:2011-09-14 Views:1
