Basic Usage

Creating an AppUrl

There are two ways to create an AppUrl: parse a string, or instantiate a class. If you are starting from a string, and in particular, don’t know what the Url class should be, use parse_app_url(). If you do know what kind of Url you want to generate, use the Url subclass directly.

After creating a URL, basic usage involves either manipulating it or fetching it. There are two fetching methods: Url.get_resource(), to download files from the web, and Url.get_target(), to extract a file from an archive or other container. When the operation is unnecessary, such as getting the resource for a resource URL that has already been downloaded, Url.get_resource() returns self.

AppUrls have these components in addition to those of standard URLs:

A scheme_extension, which precedes the scheme and is joined to it with a ‘+’
A target_file, the first part of the URL fragment
A target_segment, the second part of the URL fragment, delineated by a ‘;’

The scheme_extension specifies the protocol to use with a standard web scheme, inspired by GitHub URLs like git+http://github.com/example. The target_file is usually the file within an archive. It is interpreted as a regular expression. The target_segment may be either a name or a number, and is usually interpreted as the name or number of a worksheet in a spreadsheet file. Combining these extensions:

ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet

This URL indicates that the client should fetch a ZIP file from a CKAN server using the CKAN protocol, extract the excel.xlsx file from the ZIP archive, and open the worksheet named worksheet.
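As a rough illustration of how the extra components sit inside the string, the following sketch pulls them apart with the standard library only. It is not the appurl implementation; the function name split_app_url is hypothetical and exists only for this example.

```python
from urllib.parse import urlsplit


def split_app_url(url_str):
    """Split an AppUrl-style string into its extra components (illustrative only)."""
    parts = urlsplit(url_str)

    # Scheme extension: the part of the scheme before a '+', if there is one.
    if "+" in parts.scheme:
        scheme_extension, scheme = parts.scheme.split("+", 1)
    else:
        scheme_extension, scheme = None, parts.scheme

    # Fragment: the target_file, then an optional target_segment after ';'.
    target_file, _, target_segment = parts.fragment.partition(";")

    return {
        "scheme_extension": scheme_extension,
        "scheme": scheme,
        "target_file": target_file or None,
        "target_segment": target_segment or None,
    }


components = split_app_url(
    "ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet"
)
```

For the example URL, components holds ckan, http, excel.xlsx and worksheet under the four keys.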

The URLs define a few important concepts:

resource_url: The portion of the URL that defines only the resource to be accessed or downloaded. In the example above, the resource URL is ‘http://example.com/dataset/archive.zip’
resource_file: The basename of the resource URL: ‘archive.zip’
resource_format: Usually, the extension of the resource_file: ‘zip’
target_file: The name of the target_file: ‘excel.xlsx’
target_format: The extension of the target_file: ‘xlsx’
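The concepts listed above can be derived from the example URL with a few string operations. This is a minimal sketch using only the standard library; the helper name describe is hypothetical, and the real library may compute these values differently (for example, when they are overridden in the fragment).

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def describe(url_str):
    """Derive the resource and target concepts from an AppUrl-style string."""
    parts = urlsplit(url_str)

    # Drop the scheme extension (before '+') and the fragment to get the resource URL.
    scheme = parts.scheme.split("+", 1)[-1]
    resource_url = urlunsplit((scheme, parts.netloc, parts.path, parts.query, ""))

    resource_file = posixpath.basename(parts.path)
    target_file = parts.fragment.partition(";")[0]

    def extension(name):
        # 'archive.zip' -> 'zip'; a name without an extension gives ''.
        return posixpath.splitext(name)[1].lstrip(".").lower()

    return {
        "resource_url": resource_url,
        "resource_file": resource_file,
        "resource_format": extension(resource_file),
        "target_file": target_file,
        "target_format": extension(target_file),
    }


info = describe("ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet")
```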

Using AppUrls

Typical use is:

from appurl import parse_app_url

url = parse_app_url("http://example.com/archive.zip#file.csv")

resource_url = url.get_resource()

target_path = resource_url.get_target()

The call to url.get_resource() will download the resource file and store it in the cache, returning a File: URL pointing to the downloaded file. If the file is an archive, the call to resource_url.get_target() will extract the target file from the archive. If it is not an archive, it just returns the resource URL. The final result is that target_path is a Url pointing to a file in the filesystem.
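The extraction step can be pictured with a small standard-library sketch. It is an assumption about the general shape of the operation, not the library's code: the function extract_target is hypothetical, and the archive is built locally so the example runs without network access. The target name is matched as a regular expression, as described for target_file above.

```python
import re
import tempfile
import zipfile
from pathlib import Path


def extract_target(archive_path, target_pattern, cache_dir):
    """Extract the first archive member whose name matches target_pattern."""
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            if re.search(target_pattern, name):
                # extract() returns the path of the file it wrote.
                return Path(zf.extract(name, path=cache_dir))
    raise FileNotFoundError(
        f"No member matching {target_pattern!r} in {archive_path}"
    )


# Build a small archive locally, standing in for a downloaded resource.
work_dir = Path(tempfile.mkdtemp())
archive = work_dir / "archive.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("readme.txt", "not the target")
    zf.writestr("file.csv", "a,b\n1,2\n")

target_path = extract_target(archive, r"file\.csv", work_dir / "cache")
```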

Parsing Strings

appurl.parse_app_url(u_str, downloader='default', **kwargs)

Parse a URL string and return a Url object. The class is chosen from the registered entry points: of the entry point classes whose match() test accepts the URL, the one with the highest priority is used.

Parameters:	u_str – Url string
downloader – Downloader object to use for downloading objects.
kwargs – Args passed to the Url constructor.
Returns:	A Url object
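The selection rule can be sketched as follows. Everything here is hypothetical: the class names, the priority attribute, and the convention that a larger number means higher priority are assumptions made for the illustration, and the real library discovers its classes through entry points rather than a hard-coded tuple.

```python
class BaseUrl:
    priority = 0  # hypothetical: larger numbers win in this sketch

    def __init__(self, url):
        self.url = url

    @classmethod
    def match(cls, url):
        # The base class accepts anything, so it acts as the fallback.
        return True


class ZipUrl(BaseUrl):
    priority = 10

    @classmethod
    def match(cls, url):
        # Look only at the part before the fragment.
        return url.split("#", 1)[0].lower().endswith(".zip")


class CsvUrl(BaseUrl):
    priority = 10

    @classmethod
    def match(cls, url):
        return url.split("#", 1)[0].lower().endswith(".csv")


def parse(url, candidates=(BaseUrl, ZipUrl, CsvUrl)):
    """Return an instance of the highest-priority class whose match() passes."""
    ordered = sorted(candidates, key=lambda c: c.priority, reverse=True)
    for cls in ordered:
        if cls.match(url):
            return cls(url)
    raise ValueError(f"No class matches {url!r}")


zip_url = parse("http://example.com/archive.zip#file.csv")
other_url = parse("http://example.com/page")
```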

The URL Base Class

class appurl.Url(url=None, downloader=None, **kwargs)

Base class for Application URLs.

After construction, a Url object has a set of properties and attributes for accessing the parts of the URL, and methods for manipulating it. The attributes and properties include the typical properties of a parsed URL, plus properties that are derived from the typical parts, and a few extra components that can be part of the fragment query.

The typical parts are:

scheme
scheme_extension
netloc
hostname
path
params
query
fragment
username
password
port

The fragment is special; it is an array of two elements, the first of which is the target_file and the second the target_segment. If the fragment of the source URL has other parts, they must be formatted as query components, and will be parsed into the fragment_query.

Special application components are:

proto. This is set to the scheme_extension if it exists, the scheme otherwise.
resource_file. The filename of the resource to download. It is usually the last part of the URL, but can be overridden in the fragment.
resource_format. The format name of the resource, normally drawn from the resource_file extension, but can be overridden in the fragment.
target_file. The filename of the file that will be produced by Url.get_target(), but may be overridden.
target_format. The format of the target_file, but may be overridden.
target_segment. A sub-component of the target_file, such as the worksheet in a spreadsheet.
fragment_query. Holds additional parts of the fragment.

When the fragment holds extra parts, these can be formatted as a URL query. Recognized keys are:

resource_file
resource_format
target_file
target_format
encoding. Text encoding to be used when reading the target.
headers. For row-oriented data, the row numbers of the headers, as a comma-separated list of integers.
start. For row-oriented data, the row number of the first row of data (as opposed to headers).
end. For row-oriented data, the row number of the last row of data.
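A fragment with extra parts might be taken apart along these lines. The exact layout is an assumption: this sketch supposes the query-style parts follow the target parts after an ‘&’, which the text above does not spell out, and parse_fragment is a hypothetical helper rather than a library function.

```python
from urllib.parse import parse_qsl


def parse_fragment(fragment):
    """Split a fragment into [target_file, target_segment] and a query dict.

    Assumed layout: 'target_file;target_segment&key=value&key=value'.
    """
    target_part, _, query_part = fragment.partition("&")
    target_file, _, target_segment = target_part.partition(";")

    query = dict(parse_qsl(query_part))

    # headers is a comma-separated list of integers; start and end are integers.
    if "headers" in query:
        query["headers"] = [int(n) for n in query["headers"].split(",")]
    for key in ("start", "end"):
        if key in query:
            query[key] = int(query[key])

    return [target_file or None, target_segment or None], query


fragment, fragment_query = parse_fragment(
    "file.csv;sheet&encoding=latin-1&headers=0,1&start=2"
)
```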

Initialize a new Application Url.

Parameters:	url – URL string
downloader – appurl.web.download.Downloader object.
kwargs – Additional arguments override URL properties.
Returns:	An Application Url object

Keyword arguments will override properties set by parsing the URL string. Valid keywords that will set object properties are listed below. Other keywords are accepted and ignored.

scheme
scheme_extension
netloc
hostname
path
params
fragment
fragment_query
username
password
port
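The override rule can be illustrated with a small sketch: parse the string first, then let recognized keywords replace the parsed values and silently drop the rest. The function build_properties is hypothetical, and it covers only the subset of keywords that the standard library parser also exposes.

```python
from urllib.parse import urlsplit

# Subset of the keywords listed above that urlsplit() results also provide.
SETTABLE = (
    "scheme", "netloc", "hostname", "path",
    "fragment", "username", "password", "port",
)


def build_properties(url_str, **kwargs):
    """Parse url_str, then apply keyword overrides; unknown keywords are ignored."""
    parts = urlsplit(url_str)
    props = {name: getattr(parts, name) for name in SETTABLE}

    for key, value in kwargs.items():
        if key in props:
            props[key] = value  # keyword wins over the parsed value

    return props


props = build_properties(
    "http://example.com/a.csv#x", path="/b.csv", not_a_property=1
)
```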

archive_file(): Return the name of the archive file, if there is one.

as_type(cls)

Return the URL transformed to a different class. Copies the downloader and builds the new URL using Url.dict().

Parameters:	cls – Class of Url to construct
Returns:	A new Url object

clear_fragment()

Return a copy of the URL with no fragment components.

Returns:	A cloned Url object, with the fragment and fragment queries cleared.

clone(**kwargs)

Return a clone of this Url, possibly with some arguments replaced.

Parameters:	kwargs – Keyword arguments are arguments to set in the copy, using `setattr()`
Returns:	A cloned Url object.
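The copy-then-setattr pattern described for clone() looks roughly like this. SimpleUrl is a stand-in class invented for the example, not part of the library.

```python
import copy


class SimpleUrl:
    """Tiny stand-in for a Url object (hypothetical)."""

    def __init__(self, scheme, netloc, path, fragment=""):
        self.scheme = scheme
        self.netloc = netloc
        self.path = path
        self.fragment = fragment

    def clone(self, **kwargs):
        # Copy first, then set each keyword on the copy; the original is untouched.
        c = copy.copy(self)
        for key, value in kwargs.items():
            setattr(c, key, value)
        return c

    def __str__(self):
        base = f"{self.scheme}://{self.netloc}{self.path}"
        return f"{base}#{self.fragment}" if self.fragment else base


original = SimpleUrl("http", "example.com", "/archive.zip", "file.csv")
changed = original.clone(fragment="other.csv")
```

Because the keywords are applied to the copy, original still points at file.csv after the call.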

dict

Returns a dictionary of the object components.

Returns:	a dict.

dirname(): Return the dirname of the path

downloader: Return the Downloader() for this URL

fspath: The path in a form suitable for use in a filesystem

generator

Return the generator for this URL, if the rowgenerator package is installed.

Returns:	A row generator object.

get_resource(): Get the contents of resource and save it to the cache, returning a file-like object

get_target(): Get the contents of the target, and save it to the cache, returning a file-like object

inner: Return the URL without the scheme extension and fragment. Re-parses the URL, so it should return the correct class for the inner URL.

interpolate(context=None)

Use the Downloader.context to interpolate format strings in the URL. Re-parses the URL, returning a new URL.

Parameters:	context – Extra context to interpolate with
Returns:	A new Url object
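Interpolation of format strings can be pictured with str.format(). This is a sketch under assumptions: the function below is a free-standing hypothetical helper, the downloader context is passed in as a plain dict, and letting the extra context take precedence over the downloader context is a choice made for the example.

```python
def interpolate(url_str, downloader_context, context=None):
    """Fill {name} placeholders in url_str from the merged contexts."""
    merged = dict(downloader_context)
    merged.update(context or {})  # assumed: extra context overrides the downloader's
    return url_str.format(**merged)


result = interpolate(
    "http://{host}/data/{year}.csv",
    {"host": "example.com", "year": 2016},
    {"year": 2017},
)
```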

is_archive: Return true if this URL is for an archive. Currently only ZIP is recognized.

join(s)

Join a component to the end of the path, using os.path.join(). The argument s may be an appurl.Url or a string. If s includes a netloc property, it is assumed to be an absolute URL, and it is returned after parsing as a Url. Otherwise, the path component of s is extracted and joined to the path component of this URL.

Parameters:	s – A Url object, or a string.
Returns:	A copy of this url.

join_dir(s)

Join a component to the parent directory of the path, using join(dirname())

Parameters:	s –
Returns:	a copy of this url.
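The behaviour described for join() and join_dir() can be sketched on plain strings. These are hypothetical free functions, not the library methods; posixpath.join() is used in place of os.path.join() so the result is the same on every platform, and keeping the base URL's query and fragment is an assumption.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def join(base, s):
    """Join s to the end of base's path, or return s if it is absolute."""
    other = urlsplit(s)
    if other.netloc:
        # s has its own netloc, so it is treated as an absolute URL.
        return s
    b = urlsplit(base)
    new_path = posixpath.join(b.path, other.path)
    return urlunsplit((b.scheme, b.netloc, new_path, b.query, b.fragment))


def join_dir(base, s):
    """Join s to the parent directory of base's path."""
    b = urlsplit(base)
    parent = urlunsplit(
        (b.scheme, b.netloc, posixpath.dirname(b.path), b.query, b.fragment)
    )
    return join(parent, s)


joined = join("http://example.com/data", "file.csv")
sibling = join_dir("http://example.com/data/a.csv", "b.csv")
absolute = join("http://example.com/data", "http://other.org/x.csv")
```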

join_target(tf): Return a new URL, possibly of a new class, with a new target_file

list(): Return URLs for files contained in a container. This implementation just returns [self], but subclasses may, for instance, list all of the sub-components of a directory, or all of the worksheets in an Excel file.

resolve(): Resolve a URL to another format, such as by looking up a URL that specifies a search and returning the resulting URL. The default implementation returns self.

set_fragment(f): Return a clone with the fragment set

set_target_file(v): Return a clone with a target_file set

set_target_segment(v): Return a clone with a target_segment set
