As the doctor gone rogue

September 29, 2011

Print from here to there with “awk”

Filed under: bash, data management, R — hypotheses @ 12:16 pm

This sounds like a common thing to do. You have a lengthy text file and you only want to extract part of it. For example, I have a file that contains a structure like this:

HEADER
BODY
++++++++++++++++++++++++++++
CONTENT I WANT TO GET

END OF FILE

Here, the part I want to grab is between the line with “++++++++++++++” and the blank line.


awk '/\+\+/,/^$/' INFILE

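With this small awk trick, you ask awk to print everything from the +++ line through the blank line to your terminal. The two patterns separated by a comma form a range pattern: awk prints every line from the first one matching /\+\+/ up to and including the next one matching /^$/.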
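To see the range in action, here is a minimal sketch that rebuilds the sample structure shown earlier in a scratch file and runs the same awk command on it. The file name sample.txt is only an assumption for illustration.

```shell
# Recreate the example structure in a scratch file (sample.txt is a made-up name).
cat > sample.txt <<'EOF'
HEADER
BODY
++++++++++++++++++++++++++++
CONTENT I WANT TO GET

END OF FILE
EOF

# Print from the first line containing "++" through the next blank line, inclusive.
awk '/\+\+/,/^$/' sample.txt
```

This should print three lines: the +++ line, the content line, and the blank line that closes the range. HEADER, BODY and END OF FILE are left out.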

Now you just have to remove the +++++++++ line and the blank line. I do this with the “Stream EDitor”, i.e. sed. So the complete command becomes something like this…


awk '/\+\+/,/^$/' INFILE | sed '/\+\+/d;/^$/d'
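If you would rather not pipe through sed, the same result can probably be had from awk alone by tracking a flag instead of using a range pattern. This is only a sketch of an alternative, not a replacement for the pipeline above; the flag name in_block and the file name sample.txt are assumptions.

```shell
# Recreate the example structure in a scratch file (sample.txt is a made-up name).
cat > sample.txt <<'EOF'
HEADER
BODY
++++++++++++++++++++++++++++
CONTENT I WANT TO GET

END OF FILE
EOF

content=$(awk '
  /\+\+/ { in_block = 1; next }  # marker line: switch the flag on and skip the marker itself
  /^$/   { in_block = 0 }        # blank line: switch the flag off before anything is printed
  in_block                       # print the current line only while the flag is on
' sample.txt)

printf '%s\n' "$content"
```

Because the marker line is skipped with next and the blank line clears the flag before the print rule runs, neither delimiter reaches the output, so no second tool is needed.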

This can also be applied to extract part of a file delimited by tags, such as an “XML” file. However, it is probably not a very efficient way to parse an XML file, manually one tag at a time. In R, you can do this more efficiently using RSXML [http://www.omegahat.org/RSXML/]. And if you are interacting with a website, you can easily combine it with RCurl [http://www.omegahat.org/RCurl/]
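As a rough illustration of the tag case, the same range pattern can pull out everything between an opening and a closing tag. The file sample.xml and the tag names record, note and catalog are invented for this sketch, and it only works when the opening and closing tags sit on their own lines and are not nested.

```shell
# A tiny XML-like file; the file name and all tag names are hypothetical.
cat > sample.xml <<'EOF'
<catalog>
  <note>skip me</note>
  <record>
    <name>example</name>
  </record>
  <note>skip me too</note>
</catalog>
EOF

# Print from the line containing <record> through the line containing </record>.
record=$(awk '/<record>/,/<\/record>/' sample.xml)

printf '%s\n' "$record"
```

This is line-based matching, not real XML parsing, so anything beyond simple, regularly formatted files is better handled by a proper parser.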
