Basic Usage¶
Creating an AppUrl¶
There are two ways to create an AppUrl: parse a string, or instantiate a class.
If you are starting from a string, and in particular, don’t know what the Url class
should be, use parse_app_url(). If you do know what kind of Url you want
to generate, use the Url subclass directly.
After creating a URL, basic usage involves either manipulating it or fetching it.
There are two fetching methods: Url.get_resource(), to download files from
the web, and Url.get_target(), to extract a file from an archive or other
container. When the operation is unnecessary, such as getting the resource for
a resource URL that has already been downloaded, Url.get_resource()
returns self.
AppUrls have these components in addition to those of standard URLs:
- A scheme_extension, which precedes the scheme and is joined to it with a ‘+’
- A target_file, the first part of the URL fragment
- A target_segment, the second part of the URL fragment, delimited by a ‘;’
The scheme_extension specifies the protocol to use with a standard
web scheme, inspired by github URLs like
git+http://github.com/example. The target_file is usually the
file within an archive. It is interpreted as a regular expression. The
target_segment may be either a name or a number, and is usually
interpreted as the name or number of a worksheet in a spreadsheet file.
Combining these extensions:
ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet
This URL may indicate to fetch a ZIP file from a CKAN server,
using the CKAN protocol, extract the excel.xlsx file from the ZIP
archive, and open the worksheet named worksheet.
The URLs define a few important concepts:
- resource_url: The portion of the URL that defines only the resource to be accessed or downloaded. In the example above, the resource URL is ‘http://example.com/dataset/archive.zip’
- resource_file: The basename of the resource URL: ‘archive.zip’
- resource_format: Usually, the extension of the resource_file: ‘zip’
- target_file: The file named in the first part of the fragment: ‘excel.xlsx’
- target_format: The extension of the target_file: ‘xlsx’
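The component names above can be illustrated with a minimal sketch that uses only the Python standard library. It is not appurl's own parser: the helper name `split_app_url` is hypothetical, and it assumes the fragment holds only a target_file and an optional target_segment.

```python
from posixpath import basename, splitext
from urllib.parse import urlsplit, urlunsplit


def split_app_url(u_str):
    """Hypothetical helper: split an AppUrl-style string into named parts."""
    parts = urlsplit(u_str)

    # The scheme extension precedes the scheme, joined with '+'.
    if "+" in parts.scheme:
        scheme_extension, scheme = parts.scheme.split("+", 1)
    else:
        scheme_extension, scheme = None, parts.scheme

    # The fragment is 'target_file;target_segment'; the segment is optional.
    target_file, _, target_segment = parts.fragment.partition(";")

    resource_file = basename(parts.path)
    return {
        "scheme_extension": scheme_extension,
        "scheme": scheme,
        "resource_url": urlunsplit(
            (scheme, parts.netloc, parts.path, parts.query, "")
        ),
        "resource_file": resource_file,
        "resource_format": splitext(resource_file)[1].lstrip(".") or None,
        "target_file": target_file or None,
        "target_format": splitext(target_file)[1].lstrip(".") or None,
        "target_segment": target_segment or None,
    }


parts = split_app_url(
    "ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet"
)
```

For the example URL, `parts["resource_url"]` is the plain `http://` URL of the archive, and the fragment yields the target file and worksheet name.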
Using AppUrls¶
Typical use is:
from appurl import parse_app_url
url = parse_app_url("http://example.com/archive.zip#file.csv")
resource_url = url.get_resource()
target_path = resource_url.get_target()
The call to url.get_resource() will download the resource file and store it in the cache, returning a
file: URL pointing to the downloaded file. If the file is an archive, the call to resource_url.get_target()
will extract the target file from the archive. If it is not an archive, it just returns the resource URL. The final
result is that target_path is a Url pointing to a file in the filesystem.
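The resource and target flow described above can be sketched locally, without any network access. This is not appurl's implementation: `fetch_target` is a hypothetical helper, the cache directory is a plain folder, and the archive is a ZIP file built on the spot. It treats the target file name as a regular expression, as the text above describes.

```python
import re
import tempfile
import zipfile
from pathlib import Path


def fetch_target(resource_path, target_file, cache_dir):
    """Hypothetical helper: return the path of the target file.

    If the resource is a ZIP archive, extract the first member whose name
    matches the target_file regular expression. Otherwise the resource is
    already the target, so return it unchanged.
    """
    resource_path = Path(resource_path)
    if not zipfile.is_zipfile(resource_path):
        return resource_path

    with zipfile.ZipFile(resource_path) as archive:
        for name in archive.namelist():
            if re.search(target_file, name):
                return Path(archive.extract(name, path=cache_dir))
    raise FileNotFoundError(f"no member matching {target_file!r}")


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)

        # Stand-in for a downloaded resource: a ZIP holding one CSV file.
        archive_path = tmp / "archive.zip"
        with zipfile.ZipFile(archive_path, "w") as archive:
            archive.writestr("file.csv", "a,b\n1,2\n")
        extracted = fetch_target(archive_path, r"file\.csv$", tmp / "cache")

        # A resource that is not an archive is returned as-is.
        plain_path = tmp / "plain.csv"
        plain_path.write_text("x\n")
        unchanged = fetch_target(plain_path, r"plain\.csv$", tmp / "cache")

        return extracted.name, extracted.read_text(), unchanged == plain_path
```

Calling `demo()` exercises both branches: the archive case extracts `file.csv` into the cache folder, and the plain-file case returns the same path it was given.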
Parsing Strings¶
-
appurl.parse_app_url(u_str, downloader='default', **kwargs)[source]¶ Parse a URL string and return a Url object. The class is chosen from the registered entry points: the highest-priority entry point class whose match() test passes for the URL.
Parameters: - u_str – Url string
- downloader – Downloader object to use for downloading objects.
- kwargs – Args passed to the Url constructor.
Returns: A Url object
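The selection rule, trying candidate classes in priority order and taking the first whose match() test passes, can be sketched as follows. All class names here are hypothetical, the candidates are a plain tuple rather than real entry points, and the convention that a lower `match_priority` number is tried first is an assumption of this sketch.

```python
from urllib.parse import urlsplit


class SketchUrl:
    """Hypothetical fallback class that accepts any URL string."""

    match_priority = 100  # assumption: lower numbers are tried first

    def __init__(self, url):
        self.url = url

    @classmethod
    def match(cls, url):
        return True


class SketchWebUrl(SketchUrl):
    """Hypothetical class for http and https URLs."""

    match_priority = 50

    @classmethod
    def match(cls, url):
        return urlsplit(url).scheme in ("http", "https")


class SketchZipUrl(SketchUrl):
    """Hypothetical class for URLs whose path ends in .zip."""

    match_priority = 20

    @classmethod
    def match(cls, url):
        return urlsplit(url).path.lower().endswith(".zip")


def parse_sketch(u_str, classes=(SketchUrl, SketchWebUrl, SketchZipUrl)):
    """Return an instance of the first class, in priority order, that matches."""
    for cls in sorted(classes, key=lambda c: c.match_priority):
        if cls.match(u_str):
            return cls(u_str)
    raise ValueError(f"no class matches {u_str!r}")
```

With these candidates, a URL ending in `.zip` is claimed by `SketchZipUrl` before the more general web class is consulted, and anything unmatched falls through to `SketchUrl`.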
The URL Base Class¶
-
class appurl.Url(url=None, downloader=None, **kwargs)[source]¶ Base class for Application URLs.
After construction, a Url object has a set of properties and attributes for accessing the parts of the URL, and methods for manipulating it. The attributes and properties include the typical properties of a parsed URL, plus properties that are derived from the typical parts, and a few extra components that can be part of the fragment query.
The typical parts are:
- scheme
- scheme_extension
- netloc
- hostname
- path
- params
- query
- fragment
- username
- password
- port
The fragment is special; it is an array of two elements, the first of which is the target_file and the second the target_segment. If there are other parts of the source URL, they must be formatted as query components, and will be parsed into the fragment_query.
Special application components are:
- proto. This is set to the scheme_extension if it exists, the scheme otherwise.
- resource_file. The filename of the resource to download. It is usually the last part of the URL, but can be overridden in the fragment.
- resource_format. The format name of the resource, normally drawn from the resource_file extension, but can be overridden in the fragment.
- target_file. The filename of the file that will be produced by Url.get_target(), but may be overridden.
- target_format. The format of the target_file, but may be overridden.
- target_segment. A sub-component of the target_file, such as the worksheet in a spreadsheet.
- fragment_query. Holds additional parts of the fragment.
When the fragment holds extra parts, these can be formatted as a URL query. Recognized keys are:
- resource_file
- resource_format
- target_file
- target_format
- encoding. Text encoding to be used when reading the target.
- headers. For row-oriented data, the row numbers of the headers, as a comma-separated list of integers.
- start. For row-oriented data, the row number of the first row of data (as opposed to headers).
- end. For row-oriented data, the row number of the last row of data.
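As an illustration of the recognized keys, the sketch below decodes a query-style string into typed values using the standard library. The helper `parse_fragment_query` is hypothetical and covers only the query portion; how that portion is attached to the rest of the fragment is not shown here.

```python
from urllib.parse import parse_qsl


def parse_fragment_query(query):
    """Hypothetical helper: decode the query-style part of a fragment."""
    values = dict(parse_qsl(query))

    # 'headers' is a comma-separated list of row numbers.
    if "headers" in values:
        values["headers"] = [int(n) for n in values["headers"].split(",")]

    # 'start' and 'end' are single row numbers.
    for key in ("start", "end"):
        if key in values:
            values[key] = int(values[key])

    return values


options = parse_fragment_query("encoding=latin1&headers=0,1&start=2&end=100")
```

Keys that carry names, such as encoding or target_format, stay as strings; only the row-number keys are converted to integers.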
Initialize a new Application Url.
Parameters: - url – URL string
- downloader – appurl.web.download.Downloader object.
- kwargs – Additional arguments override URL properties.
Returns: An Application Url object
Keyword arguments will override properties set by parsing the URL string. Valid keywords that will set object properties are listed below. Other keywords are accepted and ignored.
- scheme
- scheme_extension
- netloc
- hostname
- path
- params
- fragment
- fragment_query
- username
- password
- port
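The override rule can be sketched with the standard library: parse the string first, then let recognized keyword arguments replace the parsed values while unrecognized ones are dropped. The helper `build_properties` is hypothetical, and it covers only the keys that `urlsplit()` supplies directly, not the full list above.

```python
from urllib.parse import urlsplit

# Subset of the keywords listed above that urlsplit() can supply directly.
KNOWN_KEYS = ("scheme", "netloc", "hostname", "path", "fragment",
              "username", "password", "port")


def build_properties(url, **kwargs):
    """Hypothetical helper: parse a URL, then let keyword arguments win."""
    parts = urlsplit(url)
    props = {key: getattr(parts, key) for key in KNOWN_KEYS}

    for key, value in kwargs.items():
        if key in props:
            # Recognized keyword: override the parsed value.
            props[key] = value
        # Unrecognized keywords are accepted and ignored.
    return props


props = build_properties("http://example.com/a.csv", path="/b.csv", color="red")
```

Here `path` replaces the parsed path, while `color` is silently ignored because it is not a known property.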
-
as_type(cls)[source]¶ Return the URL transformed to a different class. Copies the downloader and builds the new URL using Url.dict().
Parameters: cls – Class of Url to construct
Returns: A new Url object
-
clear_fragment()[source]¶ Return a copy of the URL with no fragment components
Returns: A cloned Url object, with the fragment and fragment queries cleared.
-
clone(**kwargs)[source]¶ Return a clone of this Url, possibly with some arguments replaced.
Parameters: kwargs – Keyword arguments are arguments to set in the copy, using setattr()
Returns: A cloned Url object.
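The copy-then-setattr() pattern described for clone() can be sketched on a small stand-in class. `CloneableUrl` is hypothetical and holds only three attributes; it illustrates the pattern, not the library's actual code.

```python
import copy


class CloneableUrl:
    """Hypothetical stand-in for a Url object with a few attributes."""

    def __init__(self, scheme, hostname, path):
        self.scheme = scheme
        self.hostname = hostname
        self.path = path

    def clone(self, **kwargs):
        # Copy first, so the original object is never modified.
        duplicate = copy.copy(self)
        for name, value in kwargs.items():
            setattr(duplicate, name, value)
        return duplicate


original = CloneableUrl("http", "example.com", "/archive.zip")
changed = original.clone(path="/other.zip")
```

After the call, `changed` carries the new path while `original` keeps its own, which is the property that makes clone() safe to use for derived URLs.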
-
dict¶ Returns a dictionary of the object components.
Returns: a dict.
-
downloader¶ Return the Downloader() for this URL
-
fspath¶ The path in a form suitable for use in a filesystem
-
generator¶ Return the generator for this URL, if the rowgenerator package is installed.
Returns: A row generator object.
-
get_resource()[source]¶ Get the contents of the resource and save it to the cache, returning a file-like object
-
get_target()[source]¶ Get the contents of the target, and save it to the cache, returning a file-like object
-
inner¶ Return the URL without the scheme extension and fragment. Re-parses the URL, so it should return the correct class for the inner URL.
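A minimal sketch of the string transformation behind inner, using only the standard library. The helper `inner_url` is hypothetical and returns a plain string; the re-parsing step that selects the class for the inner URL is not reproduced.

```python
from urllib.parse import urlsplit, urlunsplit


def inner_url(u_str):
    """Hypothetical helper: drop the scheme extension and the fragment."""
    parts = urlsplit(u_str)
    scheme = parts.scheme.split("+", 1)[-1]  # keep what follows the '+'
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, ""))


inner = inner_url(
    "ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet"
)
```

A URL with no scheme extension and no fragment passes through unchanged.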
-
interpolate(context=None)[source]¶ Use the Downloader.context to interpolate format strings in the URL. Re-parses the URL, returning a new URL.
Parameters: context – Extra context to interpolate with
Returns: A new Url object
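Interpolation can be sketched with Python's `str.format()`. Two points are assumptions of this sketch rather than documented behavior: that the format strings use `{name}` placeholders, and that the extra context takes precedence over the downloader's context. The helper `interpolate_url` is hypothetical.

```python
def interpolate_url(u_str, base_context, context=None):
    """Hypothetical helper: fill '{name}' fields in a URL string."""
    merged = dict(base_context)
    merged.update(context or {})  # assumption: extra context wins
    return u_str.format(**merged)


url = interpolate_url(
    "http://{host}/data/{year}.csv",
    {"host": "example.com", "year": 2016},
    {"year": 2017},
)
```

In this call, `host` comes from the base context and `year` from the extra context.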
-
is_archive¶ Return true if this URL is for an archive. Currently only ZIP is recognized
-
join(s)[source]¶ Join a component to the end of the path, using os.path.join(). The argument s may be an appurl.Url or a string. If s includes a netloc property, it is assumed to be an absolute URL, and it is returned after parsing as a Url. Otherwise, the path component of s is extracted and joined to the path component of this URL.
Parameters: s – A Url object, or a string.
Returns: A copy of this url.
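The two join() cases can be sketched on plain strings. The helper `join_url` is hypothetical; it uses `posixpath.join()` in place of `os.path.join()` so that the result uses forward slashes on every platform, and it returns a string rather than a Url object.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def join_url(base, s):
    """Hypothetical helper mirroring the join() rules described above."""
    s_parts = urlsplit(s)
    if s_parts.netloc:
        # s is an absolute URL: it replaces the base entirely.
        return s

    base_parts = urlsplit(base)
    path = posixpath.join(base_parts.path, s_parts.path)
    return urlunsplit(base_parts._replace(path=path))


joined = join_url("http://example.com/data", "file.csv")
absolute = join_url("http://example.com/data", "http://other.org/x.csv")
```

The first call appends `file.csv` to the base path; the second returns the absolute URL unchanged because it has its own netloc.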
-
join_dir(s)[source]¶ Join a component to the parent directory of the path, using join(dirname())
Parameters: s – The component to join.
Returns: A copy of this url.
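The join(dirname()) combination can be sketched on plain strings. The helper `join_dir_url` is hypothetical, takes `s` as a string only, and uses `posixpath` so the path separators stay as forward slashes.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def join_dir_url(base, s):
    """Hypothetical helper: join s to the parent directory of base's path."""
    parts = urlsplit(base)
    parent = posixpath.dirname(parts.path)
    return urlunsplit(parts._replace(path=posixpath.join(parent, s)))


sibling = join_dir_url("http://example.com/data/a.csv", "b.csv")
```

The effect is to replace the last path component, which is convenient for addressing a file that sits next to the current one.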
-
list()[source]¶ Return URLs for files contained in a container. This implementation just returns
[self], but subclasses may, for instance, list all of the sub-components of a directory, or all of the worksheets in an Excel file.
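The base-class and subclass behavior of list() can be sketched with two stand-in classes. Both class names are hypothetical, the archive is built in memory, and member URLs are represented here simply as `name#member` strings.

```python
import io
import zipfile


class SketchContainer:
    """Hypothetical base class: a URL that contains only itself."""

    def __init__(self, name):
        self.name = name

    def list(self):
        return [self]


class SketchZipContainer(SketchContainer):
    """Hypothetical subclass: one entry per member of a ZIP archive."""

    def __init__(self, name, data):
        super().__init__(name)
        self.data = data  # raw bytes of the archive

    def list(self):
        with zipfile.ZipFile(io.BytesIO(self.data)) as archive:
            return [SketchContainer(f"{self.name}#{member}")
                    for member in archive.namelist()]


def make_zip(members):
    """Build a small in-memory ZIP archive for the demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for member in members:
            archive.writestr(member, "")
    return buffer.getvalue()


container = SketchZipContainer("archive.zip", make_zip(["a.csv", "b.csv"]))
names = [entry.name for entry in container.list()]
```

Callers can iterate over `list()` without knowing whether they hold a single file or a container, since the base class always returns a one-element list.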