Basic Usage

Creating an AppUrl

There are two ways to create an AppUrl: parse a string, or instantiate a class. If you are starting from a string, and in particular, don’t know what the Url class should be, use parse_app_url(). If you do know what kind of Url you want to generate, use the Url subclass directly.

After creating a URL, basic usage involves either manipulating it or fetching it. There are two fetching methods: Url.get_resource(), to download files from the web, and Url.get_target(), to extract a file from an archive or other container. When the operation is unnecessary, such as getting the resource for a resource URL that has already been downloaded, Url.get_resource() returns self.

AppUrls have these components in addition to those of standard URLs:

A scheme_extension, which precedes the scheme and is separated from it by a ‘+’
A target_file, the first part of the URL fragment
A target_segment, the second part of the URL fragment, separated from the first by a ‘;’

The scheme_extension specifies the protocol to use with a standard web scheme, inspired by github URLs like git+http://github.com/example. The target_file is usually the file within an archive. It is interpreted as a regular expression. The target_segment may be either a name or a number, and is usually interpreted as the name or number of a worksheet in a spreadsheet file. Combining these extensions:

ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet

This URL might mean: fetch a ZIP file from a CKAN server using the CKAN protocol, extract the excel.xlsx file from the ZIP archive, and open the worksheet named worksheet.
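As an illustration of how the example URL decomposes, here is a minimal sketch that uses only the standard library. The function name split_app_url is hypothetical and this is not how appurl itself parses URLs; it only shows where the ‘+’ and ‘;’ separators fall.

```python
from urllib.parse import urlsplit


def split_app_url(url):
    """Split a URL into the extended components (illustrative sketch only)."""
    parts = urlsplit(url)
    # A '+' in the scheme separates the scheme extension from the web scheme.
    scheme_extension, _, scheme = parts.scheme.rpartition("+")
    # The fragment holds the target file and, after ';', the target segment.
    target_file, _, target_segment = parts.fragment.partition(";")
    return {
        "scheme_extension": scheme_extension or None,
        "scheme": scheme,
        "target_file": target_file or None,
        "target_segment": target_segment or None,
    }


components = split_app_url(
    "ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet"
)
print(components)
```

A URL with no ‘+’ in the scheme and no fragment yields None for the three extension components in this sketch.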

The URLs define a few important concepts:

resource_url: The portion of the URL that defines only the resource to be accessed or downloaded. In the example above, the resource URL is ‘http://example.com/dataset/archive.zip’
resource_file: The basename of the resource URL: ‘archive.zip’
resource_format: Usually, the extension of the resource_file: ‘zip’
target_file: The name of the target_file: ‘excel.xlsx’
target_format: The extension of the target_file: ‘xlsx’
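These derived values can be reproduced from the example URL with standard-library string handling. The sketch below is an assumption about how the values relate to one another, not the library's own code; the variable names simply mirror the concept names in the list.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

url = "ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet"
parts = urlsplit(url)

# Drop the scheme extension and the fragment to get the resource URL.
scheme = parts.scheme.rpartition("+")[2]
resource_url = urlunsplit((scheme, parts.netloc, parts.path, parts.query, ""))

# The resource file is the basename of the path; its extension is the format.
resource_file = posixpath.basename(parts.path)
resource_format = posixpath.splitext(resource_file)[1].lstrip(".")

# The target file is the first part of the fragment.
target_file = parts.fragment.partition(";")[0]
target_format = posixpath.splitext(target_file)[1].lstrip(".")

print(resource_url, resource_file, resource_format, target_file, target_format)
```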

Using AppUrls

Typical use is:

from appurl import parse_app_url

url = parse_app_url("http://example.com/archive.zip#file.csv")

# Download archive.zip and store it in the cache
resource_url = url.get_resource()

# Extract file.csv from the downloaded archive
target_path = resource_url.get_target()

The call to url.get_resource() will download the resource file and store it in the cache, returning a File: url pointing to the downloaded file. If the file is an archive, the call to resource_url.get_target() will extract the target file from the archive. If it is not an archive, it just returns the resource url. The final result is that target_path is a Url pointing to a file in the filesystem.
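The extraction step can be pictured with the standard zipfile module. The sketch below is not appurl's implementation: extract_target is a hypothetical stand-alone function, the "downloaded" archive is created locally in a temporary directory, and matching the target file with re.search is an assumption based on the statement that target_file is interpreted as a regular expression.

```python
import re
import tempfile
import zipfile
from pathlib import Path


def extract_target(resource_path, target_file, cache_dir):
    """Return the path of the target for an already-downloaded resource."""
    resource_path = Path(resource_path)
    if not zipfile.is_zipfile(resource_path):
        # Not an archive: the resource is already the target.
        return resource_path
    with zipfile.ZipFile(resource_path) as zf:
        for name in zf.namelist():
            # Treat the target file as a regular expression.
            if re.search(target_file, name):
                return Path(zf.extract(name, path=cache_dir))
    raise FileNotFoundError(f"no archive member matches {target_file!r}")


# Stand-in for a downloaded resource: a small ZIP file in a temporary cache.
cache_dir = Path(tempfile.mkdtemp())
archive = cache_dir / "archive.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("file.csv", "a,b\n1,2\n")

target_path = extract_target(archive, r"file\.csv", cache_dir / "extracted")
print(target_path)
```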

Parsing Strings

appurl.parse_app_url(u_str, downloader='default', **kwargs)

Parse a URL string and return a Url object. The class is chosen from the registered entry points: the highest-priority entry point that matches the URL and whose class passes the match() test.

Parameters:
u_str – Url string
downloader – Downloader object to use for downloading objects.
kwargs – Args passed to the Url constructor.
Returns:	A Url object
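The selection rule can be sketched with a plain list of candidate classes. Everything here is hypothetical: the class names, the match_priority attribute, the convention that a lower number is tried first, and the use of a list in place of the package's entry points are all assumptions made for illustration.

```python
class BaseUrl:
    # Hypothetical priority: in this sketch, lower numbers are tried first.
    match_priority = 100

    def __init__(self, url):
        self.url = url

    @classmethod
    def match(cls, url):
        # The base class accepts any URL, so it acts as the fallback.
        return True


class ZipUrl(BaseUrl):
    match_priority = 20

    @classmethod
    def match(cls, url):
        # Look only at the part before the fragment.
        return url.split("#")[0].endswith(".zip")


class CsvUrl(BaseUrl):
    match_priority = 30

    @classmethod
    def match(cls, url):
        return url.split("#")[0].endswith(".csv")


# Stand-in for the registered entry points.
REGISTRY = [BaseUrl, CsvUrl, ZipUrl]


def pick_url_class(u_str):
    """Return an instance of the first class, by priority, whose match() passes."""
    for cls in sorted(REGISTRY, key=lambda c: c.match_priority):
        if cls.match(u_str):
            return cls(u_str)
    raise ValueError(f"no class matches {u_str!r}")


print(type(pick_url_class("http://example.com/archive.zip#file.csv")).__name__)
```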

The URL Base Class

class appurl.Url(url=None, downloader=None, **kwargs)

Base class for Application URLs.

After construction, a Url object has a set of properties and attributes for accessing the parts of the URL, and methods for manipulating it. The attributes and properties include the typical properties of a parsed URL, plus properties that are derived from the typical parts, and a few extra components that can be part of the fragment query.

The typical parts are:

scheme
scheme_extension
netloc
hostname
path
params
query
fragment
username
password
port

The fragment is special; it is an array of two elements, the first of which is the target_file and the second the target_segment. Any other parts of the fragment must be formatted as query components, and will be parsed into the fragment_query.

Special application components are:

proto. This is set to the scheme_extension if it exists, the scheme otherwise.
resource_file. The filename of the resource to download. It is usually the last part of the URL, but can be overridden in the fragment.
resource_format. The format name of the resource, normally drawn from the resource_file extension, but can be overridden in the fragment.
target_file. The filename of the file that will be produced by Url.get_target(), but may be overridden.
target_format. The format of the target_file, but may be overridden.
target_segment. A sub-component of the target_file, such as the worksheet in a spreadsheet.
fragment_query. Holds additional parts of the fragment.

When the fragment holds extra parts, these can be formatted as a URL query. Recognized keys are:

resource_file
resource_format
target_file
target_format
encoding. Text encoding to be used when reading the target.
headers. For row-oriented data, the row numbers of the headers, as a comma-separated list of integers.
start. For row-oriented data, the row number of the first row of data (as opposed to headers).
end. For row-oriented data, the row number of the last row of data.
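A sketch of how such a fragment might be taken apart follows. It assumes, without confirmation from the library, that the query components follow the target_file and target_segment after an ‘&’; the function name parse_fragment and the example fragment are hypothetical.

```python
from urllib.parse import parse_qsl


def parse_fragment(fragment):
    """Split a fragment into [target_file, target_segment] and a query dict."""
    # Assumed layout: "target_file;target_segment&key=value&key=value"
    head, _, extra = fragment.partition("&")
    target_file, _, target_segment = head.partition(";")
    query = dict(parse_qsl(extra))
    # headers is a comma-separated list of row numbers.
    if "headers" in query:
        query["headers"] = [int(n) for n in query["headers"].split(",")]
    # start and end are single row numbers.
    for key in ("start", "end"):
        if key in query:
            query[key] = int(query[key])
    return [target_file or None, target_segment or None], query


fragment, fragment_query = parse_fragment(
    "excel.xlsx;worksheet&encoding=latin1&headers=0,1&start=2"
)
print(fragment, fragment_query)
```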

Initialize a new Application Url.

Parameters:
url – URL string
downloader – appurl.web.download.Downloader object.
kwargs – Additional arguments override URL properties.
Returns:	An Application Url object

Keyword arguments will override properties set by parsing the URL string.

archive_file(): Return the name of the archive file, if there is one.

as_type(cls)

Return the URL transformed to a different class. Copies the downloader and builds the new url using Url.dict()

Parameters:	cls – Class of Url to construct
Returns:	A new Url object

clone(**kwargs)

Return a clone of this Url, possibly with some arguments replaced.

Parameters:	kwargs – Keyword arguments are arguments to set in the copy, using `setattr()`
Returns:	A cloned Url object.
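The clone-then-override behaviour described here can be sketched on a small stand-in class. SimpleUrl and its attributes are hypothetical; only the use of setattr() on a copy is taken from the description above.

```python
import copy


class SimpleUrl:
    """Hypothetical stand-in for a Url, holding a few plain attributes."""

    def __init__(self, scheme, path, target_file=None):
        self.scheme = scheme
        self.path = path
        self.target_file = target_file

    def clone(self, **kwargs):
        # Copy the object, then overwrite the requested attributes.
        c = copy.copy(self)
        for key, value in kwargs.items():
            setattr(c, key, value)
        return c


original = SimpleUrl("http", "/archive.zip", target_file="a.csv")
changed = original.clone(target_file="b.csv")
print(original.target_file, changed.target_file)
```

The original object is left untouched, so a clone can be used to try a different target_file without re-parsing the URL string.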

dirname(): Return the dirname of the path

downloader: Return the Downloader() for this URL

fspath: The path in a form suitable for use in a filesystem

generator

Return the generator for this URL, if the rowgenerator package is installed.

Returns:	A row generator object.

get_resource(): Get the contents of the resource and save it to the cache, returning a file-like object

get_target(): Get the contents of the target, and save it to the cache, returning a file-like object

inner: Return the URL without the scheme extension and fragment. Re-parses the URL, so it should return the correct class for the inner URL.

interpolate(context=None)

Use the Downloader.context to interpolate format strings in the URL. Re-parses the URL, returning a new URL.

Parameters:	context – Extra context to interpolate with
Returns:	A new Url object
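A rough picture of interpolation, assuming the format strings use Python's str.format() syntax, which is not confirmed here. The function interpolate_url and both context dictionaries are hypothetical, and the sketch returns a string rather than re-parsing into a Url.

```python
def interpolate_url(url, downloader_context, context=None):
    """Fill {name} placeholders in a URL string from two context mappings."""
    merged = dict(downloader_context)
    # Extra context takes precedence over the downloader's context.
    merged.update(context or {})
    return url.format(**merged)


template = "http://{host}/data/{year}.csv"
first = interpolate_url(template, {"host": "example.com", "year": 2016})
second = interpolate_url(template, {"host": "example.com", "year": 2016}, {"year": 2017})
print(first)
print(second)
```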

is_archive: Return true if this URL is for an archive. Currently only ZIP is recognized.

join(s)

Join a component to the end of the path, using os.path.join(). The argument s may be an appurl.Url or a string. If s includes a netloc property, it is assumed to be an absolute url, and it is returned after parsing as a Url. Otherwise, the path component of s is extracted and joined to the path component of this url.

Parameters:	s – A Url object, or a string.
Returns:	A copy of this url.
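The two branches of this rule can be sketched on plain strings. join_url is a hypothetical helper, not the Url.join() method; it uses posixpath.join() in place of os.path.join() so that the result does not depend on the platform.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def join_url(base, s):
    """Join s to the end of base's path, unless s is an absolute URL."""
    s_parts = urlsplit(s)
    if s_parts.netloc:
        # s has a netloc, so it is treated as absolute and returned as is.
        return s
    b = urlsplit(base)
    new_path = posixpath.join(b.path, s_parts.path)
    return urlunsplit((b.scheme, b.netloc, new_path, b.query, b.fragment))


relative = join_url("http://example.com/data", "sub/file.csv")
absolute = join_url("http://example.com/data", "http://other.org/x.csv")
print(relative)
print(absolute)
```

As with os.path.join(), a component that starts with ‘/’ replaces the base path rather than extending it.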

join_dir(s)

Join a component to the parent directory of the path, using join(dirname())

Parameters:	s – A Url object, or a string.
Returns:	A copy of this url.
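Joining to the parent directory is useful for reaching a sibling of the current file. The sketch below works on plain strings; join_dir_url is a hypothetical helper that mirrors the join(dirname()) description using posixpath, not the library's method.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def join_dir_url(base, s):
    """Join s to the parent directory of base's path."""
    b = urlsplit(base)
    # dirname() drops the last path component, leaving the parent directory.
    parent = posixpath.dirname(b.path)
    new_path = posixpath.join(parent, urlsplit(s).path)
    return urlunsplit((b.scheme, b.netloc, new_path, b.query, b.fragment))


sibling = join_dir_url("http://example.com/data/archive.zip", "other.csv")
print(sibling)
```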

join_target(tf): Return a new URL, possibly of a new class, with a new target_file

list(): Return URLs for files contained in a container. This implementation just returns [self], but subclasses may, for instance, list all of the sub-components of a directory, or all of the worksheets in an Excel file.

resolve(): Resolve a URL to another format, such as by looking up a URL that specifies a search and converting it into another URL. The default implementation returns self.