Go through tar archive in memory to extract metadata?

I have several tar archives that I need to read in memory. The problem is that each tar contains many ZIP archives, and each ZIP contains unique XML documents.

So the structure of each tar is as follows: tar -> directories -> ZIPs -> XML.

Obviously I can manually extract a single TAR, but I have about 1000 TAR archives of about 3 GB each, and each one contains about 6000 ZIP archives. I'm looking for a way to handle the .tar archives in memory and extract the XML data from each ZIP. Is there a way to do this?

--------------Solutions-------------

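This should be doable without unpacking anything to disk, since each of the relevant calls in the standard library (tarfile, zipfile) either accepts or returns a file object instead of requiring a path.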

Lots of loops here, so let's dig in.

For each tar archive:

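  • Call tarfile.open to open the tar archive. Opening it does not load the member data into memory.
  • Call .getmembers on the resulting TarFile instance to get a list of TarInfo objects, one per entry in the archive. The list includes directories and any non-zip files, so filter it (for example with TarInfo.isfile and a check on the file name) to keep only the zips.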
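
As a minimal sketch of these two steps: the helper name zip_members and the tiny in-memory demo archive (batch1/a.zip, batch1/readme.txt) are made up for illustration, and with a real archive you would pass its path to tarfile.open instead.

```python
import io
import tarfile


def zip_members(tar):
    """Yield the TarInfo of each regular file in `tar` whose name ends in .zip."""
    for member in tar.getmembers():
        if member.isfile() and member.name.lower().endswith(".zip"):
            yield member


def build_demo_tar():
    """Build a tiny uncompressed tar in memory (stand-in for a real archive)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # One directory entry, to show that it gets filtered out.
        folder = tarfile.TarInfo("batch1")
        folder.type = tarfile.DIRTYPE
        tar.addfile(folder)
        for name, payload in [("batch1/a.zip", b"placeholder"),
                              ("batch1/readme.txt", b"hello")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_demo_tar(), mode="r") as tar:
    names = [m.name for m in zip_members(tar)]

print(names)  # ['batch1/a.zip']
```

For archives with thousands of members, iterating over the TarFile directly (for member in tar) should avoid building the full member list up front.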

For each zip within the tar archive:

  • Once you know which member (i.e., one of your zips) you want to look through, call .extractfile on your TarFile instance to get a file object for that zip. It returns None for members that are not regular files or links, such as directories.
  • Instantiate a new zipfile.ZipFile with that file object to open the zip so you can work with it. ZipFile needs a seekable file object; if in doubt, read the bytes into an io.BytesIO first.
  • Call .infolist on your ZipFile instance to get a list of ZipInfo objects for the files it contains (including your XML files).
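
A sketch of this middle loop, under the assumption that each zip is small enough to hold in memory (about 3 GB spread over about 6000 zips suggests well under a megabyte each). The helper list_zip_entries and the demo file names are hypothetical.

```python
import io
import tarfile
import zipfile


def list_zip_entries(tar, member):
    """Return the file names inside one zip member of an open TarFile."""
    fileobj = tar.extractfile(member)  # None for directories and the like
    if fileobj is None:
        return []
    # Copying into BytesIO guarantees ZipFile gets a seekable object.
    with zipfile.ZipFile(io.BytesIO(fileobj.read())) as zf:
        return [info.filename for info in zf.infolist() if not info.is_dir()]


def build_demo_tar():
    """Build a tar in memory that holds one zip with two XML files."""
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, mode="w") as zf:
        zf.writestr("doc1.xml", "<a/>")
        zf.writestr("doc2.xml", "<b/>")
    zip_bytes = zip_buf.getvalue()

    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w") as tar:
        info = tarfile.TarInfo("batch1/a.zip")
        info.size = len(zip_bytes)
        tar.addfile(info, io.BytesIO(zip_bytes))
    tar_buf.seek(0)
    return tar_buf


with tarfile.open(fileobj=build_demo_tar(), mode="r") as tar:
    member = tar.getmember("batch1/a.zip")
    entries = list_zip_entries(tar, member)

print(entries)  # ['doc1.xml', 'doc2.xml']
```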

For each XML file within the zip:

  • Call .open on your ZipFile instance, passing either the file name or its ZipInfo, to get a file object for one of your XML files.
  • You now have a file object corresponding to one of your XML files. Do whatever you want with it: .read it, copy it to disk somewhere, parse it with xml.etree.ElementTree, etc.
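
Putting the three loops together, one possible shape is a generator that walks an open TarFile and yields a parsed root element per XML file. This is a sketch, not a tuned implementation: iter_xml_roots, the demo archive, and the record/id markup are invented for illustration, and it assumes each zip fits comfortably in memory.

```python
import io
import tarfile
import zipfile
import xml.etree.ElementTree as ET


def iter_xml_roots(tar):
    """Yield (zip_name, xml_name, root_element) for each XML file in each zip."""
    for member in tar:  # outer loop: entries of the tar
        if not (member.isfile() and member.name.lower().endswith(".zip")):
            continue
        zip_bytes = tar.extractfile(member).read()
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
            for info in zf.infolist():  # inner loop: entries of the zip
                if info.is_dir() or not info.filename.lower().endswith(".xml"):
                    continue
                with zf.open(info) as xml_file:
                    root = ET.parse(xml_file).getroot()
                yield member.name, info.filename, root


def build_demo_tar():
    """Build a tar in memory holding one zip that holds one XML document."""
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, mode="w") as zf:
        zf.writestr("doc.xml", b'<record id="1"><title>Example</title></record>')
    zip_bytes = zip_buf.getvalue()

    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w") as tar:
        info = tarfile.TarInfo("batch1/a.zip")
        info.size = len(zip_bytes)
        tar.addfile(info, io.BytesIO(zip_bytes))
    tar_buf.seek(0)
    return tar_buf


with tarfile.open(fileobj=build_demo_tar(), mode="r") as tar:
    results = [(zip_name, xml_name, root.tag, root.get("id"))
               for zip_name, xml_name, root in iter_xml_roots(tar)]

print(results)  # [('batch1/a.zip', 'doc.xml', 'record', '1')]
```

For the real archives you would open each tar by path, e.g. tarfile.open(path, mode="r"), and collect whatever fields you need from each root instead of the tag and id used here.
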
Category:python Time:2018-11-07 Views:0
Tags: python zip tar
