This does sound like a common thing to do. You have a lengthy text file and you only want to get some part of it. For example, I have a file that contains a structure like this:
++++++++++++++
CONTENT I WANT TO GET

END OF FILE
Here, the part I want to grab is between the line with “++++++++++++++” and the blank line.
awk '/\+\+/,/^$/' INFILE
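With this small awk trick, awk prints everything from the +++ line through the blank line to your terminal. The two patterns separated by a comma form a range: printing starts at a line matching /\+\+/ and stops after the next empty line (/^$/).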
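To see the range pattern in action, here is a minimal sketch on a throwaway file. The file name sample.txt and the HEADER line are made up for illustration; only the marker line, the content line, and the blank line mirror the structure described above.

```shell
# Build a small sample file (hypothetical name and contents).
printf '%s\n' 'HEADER' '++++++++++++++' 'CONTENT I WANT TO GET' '' 'END OF FILE' > sample.txt

# Print from the line containing "++" through the next empty line, inclusive.
awk '/\+\+/,/^$/' sample.txt
```

Both the marker line and the closing blank line are part of the output at this stage, which is why a clean-up step is still needed.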
Now you just have to remove the +++++++++ line and the blank line. I do this with the “Stream EDitor”, i.e. sed. So the complete command becomes something like this:
awk '/\+\+/,/^$/' INFILE | sed '/++/d;/^$/d'
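If you prefer to avoid the extra sed step, the same result can probably be had from awk alone by tracking a flag. This is only a sketch under the same assumptions as above: the function name extract_block and the file sample.txt are hypothetical, and the marker is taken to be any line containing "++".

```shell
# extract_block FILE: print the lines strictly between each "++" marker line
# and the blank line that follows it. The function name is hypothetical.
extract_block() {
    awk '
        /\+\+/ { inside = 1; next }   # marker line: start capturing, skip the marker itself
        /^$/   { inside = 0 }         # blank line: stop capturing
        inside                        # print the line while the flag is set
    ' "$1"
}

# Hypothetical sample file with the structure described in this post.
printf '%s\n' 'HEADER' '++++++++++++++' 'CONTENT I WANT TO GET' '' 'END OF FILE' > sample.txt
extract_block sample.txt
```

The flag version skips the delimiters as it goes, so nothing has to be filtered out afterwards. The awk | sed pipeline is shorter to type, though, so which one to use is mostly a matter of taste.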
This can also be applied to extract parts of a file with tags, such as an XML file. However, it is probably not a very efficient way to parse an XML file, manually, one tag at a time. In R, you can do this more efficiently using RSXML [http://www.omegahat.org/RSXML/]. And if you are interacting with a website, you can easily combine it with RCurl [http://www.omegahat.org/RCurl/].