gswrap

Wrap Google Cloud Storage API for multi-threaded data manipulation.

class gswrap.Client(project=None)

Google Cloud Storage Client for simple usage of gsutil commands.

cp(src, dst, recursive=False, no_clobber=False, multithreaded=False, preserve_posix=False)

Copy objects from source to destination URL.

Parameters:
  • src (Union[str, Path]) – Source URL
  • dst (Union[str, Path]) – Destination URL
  • recursive (bool) –

    (from https://cloud.google.com/storage/docs/gsutil/commands/cp) Causes directories, buckets, and bucket subdirectories to be copied recursively. If you neglect to use this option for an upload/download, gswrap will raise an exception and inform you that no URL matched. Same behaviour as gsutil as long as no wildcards are used.

    # your-bucket before:
    # "empty"
    client.cp(src="gs://your-bucket/some-dir/",
              dst="gs://your-bucket/another-dir/", recursive=False)
    # google.api_core.exceptions.GoogleAPIError: No URLs matched
    # current some-dir:
    # gs://your-bucket/some-dir/file1
    # gs://your-bucket/some-dir/dir1/file11
    # destination URL without slash
    client.cp(src="gs://your-bucket/some-dir/",
              dst="gs://your-bucket/another-dir", recursive=True)
    # another-dir after:
    # gs://your-bucket/another-dir/file1
    # gs://your-bucket/another-dir/dir1/file11
    # destination URL with slash
    client.cp(src="gs://your-bucket/some-dir/",
              dst="gs://your-bucket/another-dir/", recursive=True)
    # another-dir after:
    # gs://your-bucket/another-dir/some-dir/file1
    # gs://your-bucket/another-dir/some-dir/dir1/file11
  • no_clobber (bool) – (from https://cloud.google.com/storage/docs/gsutil/commands/cp) When specified, existing files or objects at the destination will not be overwritten.
  • multithreaded (bool) – if set to False the copy will be performed single-threaded. If set to True it will use multiple threads to perform the copy.
  • preserve_posix (bool) – (from https://cloud.google.com/storage/docs/gsutil/commands/cp) Causes POSIX attributes to be preserved when objects are copied. With this feature enabled, gsutil cp will copy fields provided by stat. These are the user ID of the owner, the group ID of the owning group, the mode (permissions) of the file, and the access/modification time of the file. POSIX attributes are always preserved when a blob is copied within Google Cloud Storage.
Return type:

None

Requires:
  • not contains_wildcard(prefix=str(dst))
  • not contains_wildcard(prefix=str(src))
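The effect of the trailing slash on the destination in the recursive examples above can be sketched as a pure function. This is only an illustration of the documented mapping, not gswrap's implementation; `plan_recursive_copy` is a hypothetical helper that computes destination URLs and performs no I/O.

```python
from typing import Dict, List


def plan_recursive_copy(src: str, dst: str, src_urls: List[str]) -> Dict[str, str]:
    """Map each source URL to the destination URL shown in the cp examples."""
    src_dir = src.rstrip('/')
    if dst.endswith('/'):
        # Destination with slash: the source directory is nested under dst.
        base = dst + src_dir.rsplit('/', 1)[-1]
    else:
        # Destination without slash: the source contents land directly in dst.
        base = dst
    return {
        url: base + url[len(src_dir):]
        for url in src_urls
        if url.startswith(src_dir + '/')
    }


sources = [
    'gs://your-bucket/some-dir/file1',
    'gs://your-bucket/some-dir/dir1/file11',
]
plan = plan_recursive_copy(
    'gs://your-bucket/some-dir/', 'gs://your-bucket/another-dir/', sources)
# plan['gs://your-bucket/some-dir/file1']
# == 'gs://your-bucket/another-dir/some-dir/file1'
```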
cp_many_to_many(srcs_dsts, recursive=False, no_clobber=False, multithreaded=False, preserve_posix=False)

Perform multiple copy operations in a single function call.

Each source is copied to the corresponding destination. Batching the copies into a single function call reduces the per-call overhead, so the operations can complete significantly faster than with separate calls.

import pathlib

sources_destinations = [
    # Copy on Google Cloud Storage
    ('gs://your-bucket/your-dir/file',
     'gs://your-bucket/backup-dir/file'),
    # Copy from gcs to local
    ('gs://your-bucket/your-dir/file',
     pathlib.Path('/home/user/storage/backup-file')),
    # Copy from local to gcs
    (pathlib.Path('/home/user/storage/new-file'),
     'gs://your-bucket/your-dir/new-file'),
    # Copy locally
    (pathlib.Path('/home/user/storage/file'),
     pathlib.Path('/home/user/storage/new-file'))]
client.cp_many_to_many(srcs_dsts=sources_destinations)
Parameters:
  • srcs_dsts (Sequence[Tuple[Union[str, Path], Union[str, Path]]]) – source URLs/paths and destination URLs/paths
  • recursive (bool) – (from https://cloud.google.com/storage/docs/gsutil/commands/cp) Causes directories, buckets, and bucket subdirectories to be copied recursively. If you neglect to use this option for an upload/download, gswrap will raise an exception and inform you that no URL matched. Same behaviour as gsutil as long as no wildcards are used.
  • no_clobber (bool) – (from https://cloud.google.com/storage/docs/gsutil/commands/cp) When specified, existing files or objects at the destination will not be overwritten.
  • multithreaded (bool) – if set to False the copy will be performed single-threaded. If set to True it will use multiple threads to perform the copy.
  • preserve_posix (bool) – (from https://cloud.google.com/storage/docs/gsutil/commands/cp) Causes POSIX attributes to be preserved when objects are copied. With this feature enabled, gsutil cp will copy fields provided by stat. These are the user ID of the owner, the group ID of the owning group, the mode (permissions) of the file, and the access/modification time of the file. POSIX attributes are always preserved when a blob is copied within Google Cloud Storage.
Return type:

None
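
The idea of running many copy pairs from one call, optionally on several threads, can be sketched for local files only. This is not how gswrap is implemented; `copy_many_local` is a hypothetical stand-in that uses only the standard library and handles plain files, not directories or gs:// URLs.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Sequence, Tuple


def copy_many_local(srcs_dsts: Sequence[Tuple[Path, Path]],
                    multithreaded: bool = False) -> None:
    """Copy each local source file to its paired destination."""
    if not multithreaded:
        for src, dst in srcs_dsts:
            shutil.copyfile(str(src), str(dst))
        return

    with ThreadPoolExecutor() as executor:
        futures = [
            executor.submit(shutil.copyfile, str(src), str(dst))
            for src, dst in srcs_dsts
        ]
        for future in futures:
            # Re-raise any exception that occurred in a worker thread.
            future.result()
```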

long_ls(url, recursive=False)

List URLs with their stats given the url.

client.long_ls(url="gs://your-bucket/your-dir", recursive=False)
# ('gs://your-bucket/your-dir/your-subdir1/', None)
# ('gs://your-bucket/your-dir/your-subdir2/', None)
# ('gs://your-bucket/your-dir/file1',
#  <gswrap.Stat object at 0x7fea01c4a550>)
Parameters:
  • url (str) – Google Cloud Storage URL
  • recursive (bool) – if True, list directories recursively; if False, list only direct subdirectories
Return type:

List[Tuple[str, Optional[Stat]]]

Returns:

List of the urls of the blobs found and their stats

Requires:
  • not contains_wildcard(prefix=url)
  • url.startswith('gs://')
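A result of the documented shape, List[Tuple[str, Optional[Stat]]], can be split into directories and files. The sketch below assumes, as the example output above suggests, that entries whose stat is None are directories; `split_listing` is a hypothetical helper and works on any list of that shape.

```python
from typing import List, Optional, Tuple, TypeVar

S = TypeVar('S')


def split_listing(
        entries: List[Tuple[str, Optional[S]]]
) -> Tuple[List[str], List[Tuple[str, S]]]:
    """Separate entries without a stat (assumed directories) from files."""
    directories = [url for url, stat in entries if stat is None]
    files = [(url, stat) for url, stat in entries if stat is not None]
    return directories, files
```

With a real client, the input would be the return value of client.long_ls(url=...).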
ls(url, recursive=False)

List the files on Google Cloud Storage given the prefix.

Functionality is the same as the "gsutil ls (-r)" command, except that no wildcards are allowed. For more information about "gsutil ls", see: https://cloud.google.com/storage/docs/gsutil/commands/ls

client.ls(url="gs://your-bucket/your-dir", recursive=False)
# gs://your-bucket/your-dir/your-subdir1/
# gs://your-bucket/your-dir/your-subdir2/
# gs://your-bucket/your-dir/file1
client.ls(url="gs://your-bucket/your-dir", recursive=True)
# gs://your-bucket/your-dir/your-subdir1/file1
# gs://your-bucket/your-dir/your-subdir1/file2
# gs://your-bucket/your-dir/your-subdir2/file1
# gs://your-bucket/your-dir/file1
client.ls(url="gs://your-bucket/your-", recursive=True)
# will return an empty list
Parameters:
  • url (str) – Google Cloud Storage URL
  • recursive (bool) – if True, list directories recursively; if False, list only direct subdirectories and files
Return type:

List[str]

Returns:

List of Google Cloud Storage URLs matching the given URL

Requires:
  • not contains_wildcard(prefix=url)
  • url.startswith('gs://')
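The listing behaviour in the examples above, including the empty result for a partial directory name, can be reproduced over a plain list of blob URLs. This is a simplified simulation under the assumption that the URL names a directory; `simulate_ls` is hypothetical and does not talk to Google Cloud Storage.

```python
from typing import List


def simulate_ls(blob_urls: List[str], url: str, recursive: bool = False) -> List[str]:
    """List entries below a directory URL from a flat list of blob URLs."""
    # Match only at a directory boundary, so "your-" does not match "your-dir".
    prefix = url.rstrip('/') + '/'
    result = []  # type: List[str]
    for blob_url in blob_urls:
        if not blob_url.startswith(prefix):
            continue
        if recursive:
            result.append(blob_url)
            continue
        # Keep only the first path component below the prefix.
        head, sep, _ = blob_url[len(prefix):].partition('/')
        entry = prefix + head + sep
        if entry not in result:
            result.append(entry)
    return result
```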
md5_hexdigests(urls, multithreaded=False)

Retrieve hex digests of MD5 checksums for multiple URLs.

urls = ['gs://your-bucket/file1', 'gs://your-bucket/file2']
client.md5_hexdigests(urls=urls, multithreaded=False)
Parameters:
  • urls (List[str]) – URLs to stat and retrieve MD5 of
  • multithreaded (bool) – if set to False, the hex digests are retrieved single-threaded. If set to True, multiple threads are used to retrieve them.
Return type:

List[Optional[str]]

Returns:

list of hexdigests; if a URL does not exist, the corresponding item is None.
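
To compare a returned hex digest against a local file, the local MD5 can be computed with the standard library. A minimal sketch; `local_md5_hexdigest` is a hypothetical helper and is not part of gswrap.

```python
import hashlib
from pathlib import Path
from typing import Union


def local_md5_hexdigest(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a local file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(str(path), 'rb') as fid:
        # Read fixed-size chunks so that large files need not fit in memory.
        for chunk in iter(lambda: fid.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()
```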

read_bytes(url)

Retrieve the bytes of the blob at the URL.

The caller is expected to make sure that the file fits in memory.

data = client.read_bytes(url="gs://your-bucket/data")
data.decode('utf-8')
# I'm important data
Parameters:

url (str) – to the blob on the storage

Return type:

bytes

Returns:

bytes of the blob

Requires:
  • not contains_wildcard(prefix=url)
  • url.startswith('gs://')
read_text(url, encoding='utf-8')

Retrieve the text of the blob at the URL.

The caller is expected to make sure that the file fits in memory.

client.read_text(url="gs://your-bucket/file",
                 encoding='utf-8')
# Hello I'm text
Parameters:
  • url (str) – to the blob on the storage
  • encoding (str) – used to decode the text, defaults to ‘utf-8’
Return type:

str

Returns:

text of the blob

Requires:
  • not contains_wildcard(prefix=url)
  • url.startswith('gs://')
rm(url, recursive=False, multithreaded=False)

Remove blobs at given URL from Google Cloud Storage.

# your-bucket before:
# gs://your-bucket/file
client.rm(url="gs://your-bucket/file")
# your-bucket after:
# "empty"
# your-bucket before:
# gs://your-bucket/file1
# gs://your-bucket/your-dir/file2
# gs://your-bucket/your-dir/sub-dir/file3
client.rm(url="gs://your-bucket/your-dir", recursive=True)
# your-bucket after:
# gs://your-bucket/file1
Parameters:
  • url (str) – Google Cloud Storage URL
  • recursive (bool) – if True remove files within folders
  • multithreaded (bool) – if set to False the remove will be performed single-threaded. If set to True it will use multiple threads to perform the remove.
Return type:

None

Requires:
  • not contains_wildcard(prefix=url)
  • url.startswith('gs://')
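Which blobs the two examples above remove can be modelled over a flat list of URLs. This sketch assumes that a non-recursive call matches only the exact URL, which the documentation does not state explicitly; `urls_to_remove` is hypothetical and deletes nothing.

```python
from typing import List


def urls_to_remove(blob_urls: List[str], url: str, recursive: bool = False) -> List[str]:
    """Select the blob URLs that a removal at the given URL would cover."""
    if not recursive:
        # Assumption: without recursion only the exact blob is matched.
        return [blob_url for blob_url in blob_urls if blob_url == url]
    prefix = url.rstrip('/') + '/'
    return [
        blob_url for blob_url in blob_urls
        if blob_url == url or blob_url.startswith(prefix)
    ]
```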
same_md5(path, url)

Check if the local file and the blob have the same MD5.

client.same_md5(path='/home/user/storage/file',
                url='gs://your-bucket/file')
Parameters:
  • path (Union[str, Path]) – to the local file
  • url (str) – to the remote object in Google storage
Return type:

bool

Returns:

True if the MD5 is the same. False if the checksum differs or local file and/or the remote object do not exist.

Requires:
  • not contains_wildcard(prefix=url)
  • url.startswith('gs://')
same_modtime(path, url)

Check if local path and URL have equal modification times (up to secs).

Mind that the object must have been copied with POSIX attributes preserved (preserve_posix=True in cp, equivalent to the gsutil -P flag).

client.same_modtime(path='/home/user/storage/file',
                    url='gs://your-bucket/file')
Parameters:
  • path (Union[str, Path]) – to the local file
  • url (str) – URL to an object
Return type:

bool

Returns:

True if the modification time is the same

Requires:
  • not contains_wildcard(prefix=url)
  • url.startswith('gs://')
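Equality "up to seconds" can be illustrated by dropping the sub-second part of both timestamps before comparing. This is a sketch of the comparison rule only, not of gswrap's code; `same_up_to_seconds` is a hypothetical helper.

```python
import datetime


def same_up_to_seconds(first: datetime.datetime, second: datetime.datetime) -> bool:
    """Compare two timestamps while ignoring microseconds."""
    return first.replace(microsecond=0) == second.replace(microsecond=0)
```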
stat(url)

Retrieve the stat of the object in the Google Cloud Storage.

stats = client.stat(url="gs://your-bucket/file")
stats.creation_time # 2018-11-21 13:27:46.255000+00:00
stats.update_time # 2018-11-21 13:27:46.255000+00:00
stats.content_length # 1024 [bytes]
stats.storage_class # REGIONAL
stats.file_atime # 2018-11-21 13:27:46
stats.file_mtime # 2018-11-21 13:27:46
stats.posix_uid # 1000
stats.posix_gid # 1000
stats.posix_mode # 777
stats.md5 # b'1B2M2Y8AsgTpgAmY7PhCfg=='
stats.crc32c # b'AAAAAA=='
Parameters:

url (str) – to the object

Return type:

Optional[Stat]

Returns:

object status, or None if the object does not exist or is a directory.

Requires:
  • not contains_wildcard(prefix=url)
  • url.startswith('gs://')
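The md5 value in the example above looks base64-encoded, whereas md5_hexdigests returns hex strings. Assuming Stat.md5 is always base64-encoded as in that example, the conversion can be sketched as follows; `md5_base64_to_hex` is a hypothetical helper.

```python
import base64


def md5_base64_to_hex(md5_b64: bytes) -> str:
    """Convert a base64-encoded MD5 digest to its hex representation."""
    return base64.b64decode(md5_b64).hex()


md5_base64_to_hex(b'1B2M2Y8AsgTpgAmY7PhCfg==')
# 'd41d8cd98f00b204e9800998ecf8427e' (the MD5 of empty content)
```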
write_bytes(url, data)

Write bytes to the storage at the given URL.

client.write_bytes(url="gs://your-bucket/data",
                   data="I'm important data".encode('utf-8'))
Parameters:
  • url (str) – where to write in the storage
  • data (bytes) – what to write
Return type:

None

Requires:
  • not contains_wildcard(prefix=url)
  • url.startswith('gs://')
write_text(url, text, encoding='utf-8')

Write text to the storage at the given URL.

client.write_text(url="gs://your-bucket/file",
                  text="Hello, I'm text",
                  encoding='utf-8')
Parameters:
  • url (str) – where to write in the storage
  • text (str) – what to write
  • encoding (str) – how to encode, defaults to ‘utf-8’
Return type:

None

Requires:
  • not contains_wildcard(prefix=url)
  • url.startswith('gs://')
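How the text and bytes methods relate through the encoding parameter can be shown with an in-memory stand-in. `InMemoryStorage` is hypothetical and only mirrors the four documented signatures; it assumes that writing text is equivalent to writing the encoded bytes, which the documentation implies but does not state.

```python
from typing import Dict


class InMemoryStorage:
    """Stand-in that keeps blobs in a dictionary keyed by URL."""

    def __init__(self) -> None:
        self._blobs = {}  # type: Dict[str, bytes]

    def write_bytes(self, url: str, data: bytes) -> None:
        self._blobs[url] = data

    def read_bytes(self, url: str) -> bytes:
        return self._blobs[url]

    def write_text(self, url: str, text: str, encoding: str = 'utf-8') -> None:
        # Assumption: text is stored as its encoded bytes.
        self.write_bytes(url=url, data=text.encode(encoding))

    def read_text(self, url: str, encoding: str = 'utf-8') -> str:
        return self.read_bytes(url=url).decode(encoding)


storage = InMemoryStorage()
storage.write_text(url='gs://your-bucket/file', text="Hello, I'm text")
```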
class gswrap.Stat

Represent stat of an object in Google Storage.

Times are given in UTC.

Variables:
  • creation_time (Optional[datetime.datetime]) – time when the blob on Google Cloud Storage was created. Not equal to the creation time of the local file.
  • update_time (Optional[datetime.datetime]) – time when the blob on Google Cloud Storage was last updated. Not equal to the modification time of the local file.
  • storage_class (Optional[str]) – tells in what kind of storage data is stored. More information: https://cloud.google.com/storage/docs/storage-classes
  • content_length (Optional[int]) – size of the object
  • file_mtime (Optional[datetime.datetime]) – modification time of the local file stored in the metadata of the blob. Only available when file was uploaded with preserve_posix.
  • file_atime (Optional[datetime.datetime]) – last access time of the local file stored in the metadata of the blob. Only available when file was uploaded with preserve_posix.
  • posix_uid (Optional[str]) – user id of the owner of the local file stored in the metadata of the blob. Only available when file was uploaded with preserve_posix.
  • posix_gid (Optional[str]) – group id of the owner of the local file stored in the metadata of the blob. Only available when file was uploaded with preserve_posix.
  • posix_mode (Optional[str]) – inode protection mode of the local file stored in the metadata of the blob. Only available when file was uploaded with preserve_posix.
  • crc32c (Optional[bytes]) – CRC32C checksum for this object.
  • md5 (Optional[bytes]) – MD5 hash for this object.
gswrap.contains_wildcard(prefix)

Check if prefix contains any wildcards.

>>> contains_wildcard(prefix='gs://your-bucket/some-dir/file')
False
>>> contains_wildcard(prefix='gs://your-bucket/*/file')
True
Parameters:

prefix (str) – path to a file or a directory

Return type:

bool

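A check of this kind can be approximated by looking for wildcard characters. The sketch below assumes the gsutil wildcard characters `*`, `?`, `[` and `]`; the exact set that gswrap tests for is not given here, and `looks_like_wildcard` is a hypothetical name.

```python
def looks_like_wildcard(prefix: str) -> bool:
    """Report whether the prefix contains an assumed wildcard character."""
    return any(char in prefix for char in '*?[]')
```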
gswrap.resource_type(res_loc)

Determine resource type.

>>> url = resource_type(res_loc='gs://your-bucket/some-dir/file')
>>> isinstance(url, _GCSURL)
True
>>> url.bucket
'your-bucket'
>>> url.prefix
'some-dir/file'
>>> path = resource_type(res_loc='/home/user/work/file')
>>> path
'/home/user/work/file'
>>> isinstance(path, str)
True
Parameters:

res_loc (str) – resource location

Return type:

Union[_GCSURL, str]

Returns:

class corresponding to the file/directory location
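
The bucket and prefix values in the doctest above can be obtained by splitting the URL at the first slash after the scheme. A minimal sketch that does not use the private _GCSURL class; `split_gcs_url` is a hypothetical helper.

```python
from typing import Tuple


def split_gcs_url(url: str) -> Tuple[str, str]:
    """Split a gs:// URL into its bucket name and object prefix."""
    scheme = 'gs://'
    if not url.startswith(scheme):
        raise ValueError('Not a Google Cloud Storage URL: {}'.format(url))
    bucket, _, prefix = url[len(scheme):].partition('/')
    return bucket, prefix


split_gcs_url('gs://your-bucket/some-dir/file')
# ('your-bucket', 'some-dir/file')
```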