Basic Usage

Creating an AppUrl

There are two ways to create an AppUrl: parse a string, or instantiate a class. If you are starting from a string, and in particular if you don’t know what the Url class should be, use parse_app_url(). If you do know what kind of Url you want to generate, use the Url subclass directly.

After creating a URL, basic usage involves either manipulating it or fetching it. There are two fetching methods: Url.get_resource(), which downloads files from the web, and Url.get_target(), which extracts a file from an archive or other container. When the operation is unnecessary, such as getting the resource for a resource URL that has already been downloaded, Url.get_resource() returns self.

AppUrls have these components in addition to those of standard URLs:

  • A scheme extension, which precedes the scheme and is joined to it with a ‘+’
  • A target_file, the first part of the URL fragment
  • A target_segment, the second part of the URL fragment, delimited by a ‘;’

The scheme_extension specifies the protocol to use with a standard web scheme, inspired by GitHub URLs like git+http://github.com/example. The target_file is usually the file within an archive. It is interpreted as a regular expression. The target_segment may be either a name or a number, and is usually interpreted as the name or number of a worksheet in a spreadsheet file. Combining these extensions:

ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet

This URL might indicate the following steps: fetch a ZIP file from a CKAN server using the CKAN protocol, extract the excel.xlsx file from the ZIP archive, and open the worksheet named worksheet.
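
As a rough illustration of how these extra components sit around a standard URL, the sketch below splits the example with Python's standard urllib.parse module. It is not how appurl itself parses URLs, and the helper name split_app_url is made up for this example.

```python
from urllib.parse import urlsplit


def split_app_url(u_str):
    """Split an application URL into its extra components (illustrative only)."""
    parts = urlsplit(u_str)

    # The scheme extension, if any, precedes the scheme and is joined with '+'
    if '+' in parts.scheme:
        scheme_extension, scheme = parts.scheme.split('+', 1)
    else:
        scheme_extension, scheme = None, parts.scheme

    # The fragment holds the target_file and, after a ';', the target_segment
    target_file, _, target_segment = parts.fragment.partition(';')

    return {
        'scheme_extension': scheme_extension,
        'scheme': scheme,
        'netloc': parts.netloc,
        'path': parts.path,
        'target_file': target_file or None,
        'target_segment': target_segment or None,
    }


components = split_app_url(
    'ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet')
```

Missing parts are reported as None here; that is a choice made for the sketch, not a statement about what appurl returns.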

The URLs define a few important concepts:

  • resource_url: The portion of the URL that defines only the resource to be accessed or downloaded. In the example above, the resource URL is ‘http://example.com/dataset/archive.zip’
  • resource_file: The basename of the resource URL: ‘archive.zip’
  • resource_format: Usually, the extension of the resource_file: ‘zip’
  • target_file: The name of the target_file: ‘excel.xlsx’
  • target_format: The extension of the target_file: ‘xlsx’
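
The five concepts above can be derived from the example URL with the standard library alone. The following is a minimal sketch under that assumption; the function name describe_app_url is hypothetical and the code does not reproduce the override rules that appurl applies.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def describe_app_url(u_str):
    """Derive the resource and target concepts from a URL string (sketch)."""
    parts = urlsplit(u_str)

    # Drop the scheme extension (the part before '+'), if there is one
    scheme = parts.scheme.split('+', 1)[-1]

    # The resource URL has neither the scheme extension nor the fragment
    resource_url = urlunsplit((scheme, parts.netloc, parts.path, parts.query, ''))

    resource_file = posixpath.basename(parts.path)
    resource_format = posixpath.splitext(resource_file)[1].lstrip('.')

    # The target file is the part of the fragment before any ';'
    target_file = parts.fragment.partition(';')[0]
    target_format = posixpath.splitext(target_file)[1].lstrip('.')

    return {
        'resource_url': resource_url,
        'resource_file': resource_file,
        'resource_format': resource_format,
        'target_file': target_file,
        'target_format': target_format,
    }


concepts = describe_app_url(
    'ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet')
```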

Using AppUrls

Typical use is:

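from appurl import parse_app_url

url = parse_app_url("http://example.com/archive.zip#file.csv")

# Download the resource (archive.zip) and store it in the cache
resource_url = url.get_resource()

# Extract the target (file.csv) from the downloaded archive
target_path = resource_url.get_target()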

The call to url.get_resource() downloads the resource file and stores it in the cache, returning a file: URL pointing to the downloaded file. If the file is an archive, the call to resource_url.get_target() extracts the target file from the archive. If it is not an archive, get_target() just returns the resource URL. The final result is that target_path is a Url pointing to a file in the filesystem.
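
To make the extraction step concrete without touching the network, the sketch below imitates only what get_target() is described as doing for a ZIP archive: it looks for an archive member whose name matches the target_file, treated as a regular expression, and extracts it into a cache directory. It uses the standard zipfile module rather than appurl; the function name extract_target and the use of a temporary directory as the cache are assumptions of the example.

```python
import os
import re
import tempfile
import zipfile


def extract_target(archive_path, target_pattern, cache_dir):
    """Extract the first archive member whose name matches target_pattern."""
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            if re.search(target_pattern, name):
                # extract() returns the path of the file it wrote
                return zf.extract(name, path=cache_dir)
    raise FileNotFoundError(target_pattern)


# Build a small archive to stand in for an already downloaded resource
cache_dir = tempfile.mkdtemp()
archive_path = os.path.join(cache_dir, 'archive.zip')
with zipfile.ZipFile(archive_path, 'w') as zf:
    zf.writestr('readme.txt', 'not the target')
    zf.writestr('file.csv', 'a,b\n1,2\n')

target_path = extract_target(archive_path, 'file.csv', cache_dir)
```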

Parsing Strings

appurl.parse_app_url(u_str, downloader='default', **kwargs)[source]

Parse a URL string and return a Url object. The class of the returned object is taken from the highest-priority entry point whose class matches the URL, that is, whose match() test passes.

Parameters:
  • u_str – Url string
  • downloader – Downloader object to use for downloading objects.
  • kwargs – Args passed to the Url constructor.
Returns:A Url object of the matching class
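
The selection rule can be pictured as a loop over candidate classes in priority order. The sketch below is a simplified model, not appurl's code: the class names, the match_priority attribute, the convention that a lower number means a higher priority, and the in-memory REGISTRY list are all assumptions made for illustration. Only the match() test is named in the description above.

```python
class BaseSketchUrl:
    """Hypothetical stand-in for a Url class; not part of appurl."""
    match_priority = 100  # assumption: a lower number means a higher priority

    def __init__(self, url):
        self.url = url

    @classmethod
    def match(cls, url):
        # The base class accepts any URL, so it acts as the fallback
        return True


class ZipSketchUrl(BaseSketchUrl):
    match_priority = 40

    @classmethod
    def match(cls, url):
        # Ignore the fragment and test the resource file extension
        return url.split('#', 1)[0].endswith('.zip')


class CsvSketchUrl(BaseSketchUrl):
    match_priority = 50

    @classmethod
    def match(cls, url):
        return url.split('#', 1)[0].endswith('.csv')


# Hypothetical registry standing in for the entry point lookup
REGISTRY = [BaseSketchUrl, CsvSketchUrl, ZipSketchUrl]


def select_url(u_str):
    """Return an instance of the highest-priority class whose match() passes."""
    for cls in sorted(REGISTRY, key=lambda c: c.match_priority):
        if cls.match(u_str):
            return cls(u_str)
    raise ValueError('No class matches ' + u_str)


url = select_url('http://example.com/archive.zip#file.csv')
```

In this model the URL above is handled by ZipSketchUrl, because the part before the fragment ends in .zip and that class is tried first.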

The URL Base Class

class appurl.Url(url=None, downloader=None, **kwargs)[source]

Base class for Application URLs.

After construction, a Url object has a set of properties and attributes for accessing the parts of the URL, and methods for manipulating it. The attributes and properties include the typical properties of a parsed URL, plus properties that are derived from the typical parts, and a few extra components that can be part of the fragment query.

The typical parts are:

  • scheme
  • scheme_extension
  • netloc
  • hostname
  • path
  • params
  • query
  • fragment
  • username
  • password
  • port

The fragment is special; it is an array of two elements, the first of which is the target_file and the second the target_segment. If the fragment has other parts, they must be formatted as query components, and will be parsed into the fragment_query.

Special application components are:

  • proto. This is set to the scheme_extension if it exists, the scheme otherwise.
  • resource_file. The filename of the resource to download. It is usually the last part of the URL, but can be overridden in the fragment.
  • resource_format. The format name of the resource, normally drawn from the resource_file extension, but can be overridden in the fragment.
  • target_file. The filename of the file that will be produced by Url.get_target(), but may be overridden.
  • target_format. The format of the target_file, but may be overridden.
  • target_segment. A sub-component of the target_file, such as the worksheet in a spreadsheet.
  • fragment_query. Holds additional parts of the fragment.

When the fragment holds extra parts, these can be formatted as a URL query. Recognized keys are:

  • resource_file
  • resource_format
  • target_file
  • target_format
  • encoding. Text encoding to be used when reading the target.
  • headers. For row-oriented data, the row numbers of the headers, as a comma-separated list of integers.
  • start. For row-oriented data, the row number of the first row of data (as opposed to headers).
  • end. For row-oriented data, the row number of the last row of data.
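
The sketch below shows one way such keys could be read from a fragment using urllib.parse.parse_qsl. It assumes, for simplicity, that the fragment consists only of key=value pairs joined with '&'; how appurl combines these pairs with a target_file and target_segment in the same fragment is not covered here. The function name parse_fragment_query and the example URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def parse_fragment_query(u_str):
    """Read recognized keys from a fragment written as a URL query (sketch)."""
    fragment = urlsplit(u_str).fragment
    options = dict(parse_qsl(fragment))

    # headers is a comma-separated list of integer row numbers
    if 'headers' in options:
        options['headers'] = [int(n) for n in options['headers'].split(',')]

    # start and end are single row numbers
    for key in ('start', 'end'):
        if key in options:
            options[key] = int(options[key])

    return options


options = parse_fragment_query(
    'http://example.com/data.csv#encoding=latin-1&headers=0,1&start=2&end=100')
```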

Initialize a new Application Url.

Parameters:
  • url – URL string
  • downloader – appurl.web.download.Downloader object.
  • kwargs – Additional arguments override URL properties.
Returns:An Application Url object

Keyword arguments will override properties set by parsing the URL string.
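
A minimal model of this override behavior, together with the proto rule described earlier, might look like the following. SketchUrl is a hypothetical stand-in written for this example; it parses only a few parts and is not the appurl.Url implementation.

```python
from urllib.parse import urlsplit


class SketchUrl:
    """Hypothetical stand-in for appurl.Url; it only illustrates the override rule."""

    def __init__(self, url=None, **kwargs):
        parts = urlsplit(url or '')

        # Split an optional scheme extension from the scheme
        if '+' in parts.scheme:
            self.scheme_extension, self.scheme = parts.scheme.split('+', 1)
        else:
            self.scheme_extension, self.scheme = None, parts.scheme

        self.hostname = parts.hostname
        self.path = parts.path

        target_file, _, target_segment = parts.fragment.partition(';')
        self.target_file = target_file or None
        self.target_segment = target_segment or None

        # Keyword arguments win over anything parsed from the string
        for key, value in kwargs.items():
            setattr(self, key, value)

    @property
    def proto(self):
        # The scheme_extension if it exists, the scheme otherwise
        return self.scheme_extension or self.scheme


url = SketchUrl('ckan+http://example.com/archive.zip#file.csv',
                target_file='other.csv')
```

Here the string says file.csv, but the keyword argument replaces it with other.csv, while the parts that were not overridden keep their parsed values.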

archive_file()[source]

Return the name of the archive file, if there is one.

as_type(cls)[source]

Return the URL transformed to a different class. Copies the downloader and builds the new URL using Url.dict()

Parameters:cls – Class of Url to construct
Returns:A new Url object
clone(**kwargs)[source]

Return a clone of this Url, possibly with some arguments replaced.

Parameters:kwargs – Keyword arguments are arguments to set in the copy, using setattr()
Returns:A cloned Url object.
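
The copy-then-setattr() pattern described for clone() can be sketched as follows. The class SketchUrl and its three attributes are hypothetical and exist only to show that the original object is left unchanged.

```python
import copy


class SketchUrl:
    """Hypothetical stand-in for a Url object with a few plain attributes."""

    def __init__(self, scheme, hostname, path):
        self.scheme = scheme
        self.hostname = hostname
        self.path = path

    def clone(self, **kwargs):
        # Copy first, then set the replacement values on the copy only
        c = copy.copy(self)
        for key, value in kwargs.items():
            setattr(c, key, value)
        return c


original = SketchUrl('http', 'example.com', '/archive.zip')
changed = original.clone(path='/other.zip')
```
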
dirname()[source]

Return the dirname of the path

downloader

Return the Downloader() for this URL

fspath

The path in a form suitable for use in a filesystem

generator

Return the generator for this URL, if the rowgenerator package is installed.

Returns:A row generator object.
get_resource()[source]

Get the contents of resource and save it to the cache, returning a file-like object

get_target()[source]

Get the contents of the target, and save it to the cache, returning a file-like object

inner

Return the URL without the scheme extension and fragment. Re-parses the URL, so it should return the correct class for the inner URL.

interpolate(context=None)[source]

Use the Downloader.context to interpolate format strings in the URL. Re-parses the URL, returning a new URL.

Parameters:context – Extra context to interpolate with
Returns:A new Url
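
As a sketch of the interpolation idea, the function below merges a base context with an extra context and applies it to a URL template. It assumes that "format strings" means str.format() style placeholders and that the extra context takes precedence over the base one; both are assumptions, and the function works on plain strings rather than on Url objects.

```python
def interpolate_url(url_template, base_context, extra_context=None):
    """Fill str.format() placeholders in a URL template (sketch)."""
    context = dict(base_context)
    # Assumption: values in the extra context override the base context
    context.update(extra_context or {})
    return url_template.format(**context)


interpolated = interpolate_url(
    'http://{host}/data/{year}/file.csv',
    {'host': 'example.com', 'year': 2016},
    {'year': 2017},
)
```
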
is_archive

Return True if this URL is for an archive. Currently only ZIP is recognized.

join(s)[source]

Join a component to the end of the path, using os.path.join(). The argument s may be an appurl.Url or a string. If s includes a netloc property, it is assumed to be an absolute URL, and it is returned after parsing as a Url. Otherwise, the path component of s is extracted and joined to the path component of this URL.

Parameters:s – A Url object, or a string.
Returns:A copy of this url.
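
The two branches of join() can be modelled on plain strings as shown below. This is a sketch, not the library's method: it returns strings instead of Url objects, and it uses posixpath.join() rather than os.path.join() so that the result does not depend on the platform. The function name join_url is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def join_url(base, s):
    """Join s to the end of the path of base, or return s if it is absolute."""
    s_parts = urlsplit(s)

    # A netloc means s is an absolute URL, so it replaces base entirely
    if s_parts.netloc:
        return s

    b = urlsplit(base)
    new_path = posixpath.join(b.path, s_parts.path)
    return urlunsplit((b.scheme, b.netloc, new_path, b.query, b.fragment))


joined = join_url('http://example.com/data', 'file.csv')
absolute = join_url('http://example.com/data', 'http://other.org/x.csv')
```
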
join_dir(s)[source]

Join a component to the parent directory of the path, using join(dirname())

Parameters:s
Returns:a copy of this url.
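
Joining to the parent directory differs from a plain join only in that the last path component is dropped first. A string-based sketch, with the hypothetical name join_dir_url and posixpath used for platform-independent results:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def join_dir_url(base, s):
    """Join s to the parent directory of the path of base (sketch)."""
    b = urlsplit(base)
    # dirname() drops the last component, here the file name
    parent = posixpath.dirname(b.path)
    new_path = posixpath.join(parent, urlsplit(s).path)
    return urlunsplit((b.scheme, b.netloc, new_path, b.query, b.fragment))


sibling = join_dir_url('http://example.com/data/old.csv', 'new.csv')
```
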
join_target(tf)[source]

Return a new URL, possibly of a new class, with a new target_file

list()[source]

Return URLs for files contained in a container. This implementation just returns [self], but subclasses may, for instance, list all of the sub-components of a directory, or all of the worksheets in an Excel file.
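
As an illustration of what a container-aware subclass might return, the sketch below lists the members of a ZIP archive as URL strings, putting each member name in the fragment as the target_file. It builds the archive in memory with the standard zipfile module; the function name list_archive_urls is hypothetical, and whether any appurl subclass lists ZIP members this way is an assumption.

```python
import io
import zipfile


def list_archive_urls(resource_url, archive_bytes):
    """Return one URL string per archive member, with the member as target_file."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return ['{}#{}'.format(resource_url, name) for name in zf.namelist()]


# Build a small in-memory archive with two members
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('a.csv', 'x\n1\n')
    zf.writestr('b.csv', 'y\n2\n')

member_urls = list_archive_urls('http://example.com/archive.zip', buffer.getvalue())
```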

resolve()[source]

Resolve a URL to another form, such as by looking up a URL that specifies a search and converting the result into another URL. The default implementation returns self.