Basic Usage
Creating an AppUrl
There are two ways to create an AppUrl: parse a string, or instantiate a class. If you are starting from a string and, in particular, don’t know what the Url class should be, use parse_app_url(). If you do know what kind of Url you want to generate, use the Url subclass directly.
After creating a URL, basic usage involves either manipulating it or fetching it. There are two fetching methods: Url.get_resource(), which downloads files from the web, and Url.get_target(), which extracts a file from an archive or other container. When the operation is unnecessary, such as getting the resource for a resource URL that has already been downloaded, Url.get_resource() returns self.
AppUrls have these components in addition to those of standard URLs:
- A scheme_extension, which precedes the scheme and is joined to it with a ‘+’
- A target_file, the first part of the URL fragment
- A target_segment, the second part of the URL fragment, separated from the first by a ‘;’
The scheme_extension specifies the protocol to use with a standard web scheme, inspired by GitHub URLs like git+http://github.com/example. The target_file is usually the file within an archive. It is interpreted as a regular expression. The target_segment may be either a name or a number, and is usually interpreted as the name or number of a worksheet in a spreadsheet file.
Combining these extensions:
ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet
This URL could tell an application to fetch a ZIP file from a CKAN server using the CKAN protocol, extract the excel.xlsx file from the ZIP archive, and open the worksheet named worksheet.
The URLs define a few important concepts:
- resource_url: The portion of the URL that defines only the resource to be accessed or downloaded. In the example above, the resource URL is ‘http://example.com/dataset/archive.zip’
- resource_file: The basename of the resource URL: ‘archive.zip’
- resource_format: Usually, the extension of the resource_file: ‘zip’
- target_file: The name of the target_file: ‘excel.xlsx’
- target_format: The extension of the target_file: ‘xlsx’
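The decomposition above can be sketched with the standard library alone. This is only an illustration of how the parts relate to each other, not the way appurl parses URLs; the helper name split_app_url is hypothetical and is not part of the package.

```python
from posixpath import basename, splitext
from urllib.parse import urlsplit


def split_app_url(u_str):
    """Hypothetical helper: split an AppUrl string into the parts named above."""
    parts = urlsplit(u_str)

    # "ckan+http" -> scheme_extension "ckan", scheme "http"
    if "+" in parts.scheme:
        scheme_extension, scheme = parts.scheme.split("+", 1)
    else:
        scheme_extension, scheme = None, parts.scheme

    # "excel.xlsx;worksheet" -> target_file "excel.xlsx", target_segment "worksheet"
    target_file, _, target_segment = parts.fragment.partition(";")

    # The resource URL drops the scheme extension and the fragment.
    resource_url = "{}://{}{}".format(scheme, parts.netloc, parts.path)
    resource_file = basename(parts.path)

    return {
        "scheme_extension": scheme_extension,
        "scheme": scheme,
        "resource_url": resource_url,
        "resource_file": resource_file,
        "resource_format": splitext(resource_file)[1].lstrip(".") or None,
        "target_file": target_file or None,
        "target_format": splitext(target_file)[1].lstrip(".") or None,
        "target_segment": target_segment or None,
    }


parts = split_app_url(
    "ckan+http://example.com/dataset/archive.zip#excel.xlsx;worksheet"
)
for key, value in parts.items():
    print(key, "=", value)
```

The sketch ignores the overrides that the fragment query can carry, so it only covers the simple case shown in the example URL.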
Using AppUrls
Typical use is:
from appurl import parse_app_url
url = parse_app_url("http://example.com/archive.zip#file.csv")
resource_url = url.get_resource()
target_path = resource_url.get_target()
The call to url.get_resource() downloads the resource file and stores it in the cache, returning a file: URL that points to the downloaded file. If the file is an archive, the call to resource_url.get_target() extracts the target file from the archive. If it is not an archive, the call just returns the resource URL. The final result is that target_path is a Url pointing to a file in the filesystem.
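To make the extraction step concrete, here is a minimal standard-library sketch of pulling a target file out of a ZIP archive, with the target name treated as a regular expression as described earlier. It does not use appurl, does no downloading or caching, and the function name extract_target is hypothetical; the archive is built in memory so the example runs without network access.

```python
import io
import re
import zipfile


def extract_target(archive_bytes, target_pattern):
    """Return (name, data) for the first archive member matching target_pattern."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in zf.namelist():
            # The target is matched as a regular expression, not a literal name.
            if re.search(target_pattern, name):
                return name, zf.read(name)
    raise KeyError("no archive member matches {!r}".format(target_pattern))


# Build a small ZIP in memory to stand in for the downloaded resource.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("readme.txt", "not the target")
    zf.writestr("file.csv", "a,b\n1,2\n")

name, data = extract_target(buf.getvalue(), r"file\.csv")
print(name, len(data))
```

Taking the first match is an assumption made for the sketch; how the library resolves several matching members is not stated here.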
Parsing Strings
- appurl.parse_app_url(u_str, downloader='default', **kwargs)
Parse a URL string and return a Url object. The class is chosen from the entry points that match the URL and whose classes pass the match() test, taking the one with the highest priority.
Parameters:
- u_str – URL string
- downloader – Downloader object to use for downloading objects.
- kwargs – Arguments passed to the Url constructor.
Returns: A Url object.
The URL Base Class
- class appurl.Url(url=None, downloader=None, **kwargs)
Base class for Application URLs.
After construction, a Url object has a set of properties and attributes for accessing the parts of the URL, and methods for manipulating it. The attributes and properties include the typical properties of a parsed URL, plus properties that are derived from the typical parts, and a few extra components that can be part of the fragment query.
The typical parts are:
scheme
scheme_extension
netloc
hostname
path
params
query
fragment
username
password
port
The fragment is special; it is an array of two elements, the first of which is the target_file and the second of which is the target_segment. If the fragment of the source URL has other parts, they must be formatted as query components, and will be parsed into the fragment_query.
Special application components are:
- proto. This is set to the scheme_extension if it exists, the scheme otherwise.
- resource_file. The filename of the resource to download. It is usually the last part of the URL, but can be overridden in the fragment.
- resource_format. The format name of the resource, normally drawn from the resource_file extension, but can be overridden in the fragment.
- target_file. The filename of the file that will be produced by Url.get_target(), but may be overridden.
- target_format. The format of the target_file, but may be overridden.
- target_segment. A sub-component of the target_file, such as the worksheet in a spreadsheet.
- fragment_query. Holds additional parts of the fragment.
When the fragment holds extra parts, these can be formatted as a URL query. Recognized keys are:
- resource_file
- resource_format
- target_file
- target_format
- encoding. Text encoding to be used when reading the target.
- headers. For row-oriented data, the row numbers of the headers, as a comma-separated list of integers.
- start. For row-oriented data, the row number of the first row of data (as opposed to headers).
- end. For row-oriented data, the row number of the last row of data.
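As a rough illustration of the keys listed above, the following standard-library sketch parses a query-formatted string and converts the row-number keys to integers. The helper name parse_fragment_query, the sample string, and the choice of which values to convert are all assumptions made for this example; the library's own handling of these keys may differ.

```python
from urllib.parse import parse_qs


def parse_fragment_query(query):
    """Hypothetical helper: turn a query-formatted string into typed values."""
    # parse_qs returns a list per key; keep the last value given for each key.
    raw = {key: values[-1] for key, values in parse_qs(query).items()}
    parsed = dict(raw)

    # headers is a comma-separated list of row numbers.
    if "headers" in raw:
        parsed["headers"] = [int(n) for n in raw["headers"].split(",")]

    # start and end are single row numbers.
    for key in ("start", "end"):
        if key in raw:
            parsed[key] = int(raw[key])

    return parsed


options = parse_fragment_query("encoding=latin1&headers=0,1&start=2&end=100")
print(options)
```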
Initialize a new Application Url.
Parameters:
- url – URL string
- downloader – appurl.web.download.Downloader object.
- kwargs – Additional arguments override URL properties.
Returns: An Application Url object.
Keyword arguments will override properties set by parsing the URL string.
- as_type(cls)
Return the URL transformed to a different class. Copies the downloader and builds the new URL using Url.dict().
Parameters: cls – Class of Url to construct
Returns: A new Url object
- clone(**kwargs)
Return a clone of this Url, possibly with some arguments replaced.
Parameters: kwargs – Keyword arguments are arguments to set in the copy, using setattr()
Returns: A cloned Url object.
- downloader
Return the Downloader() for this URL.
- fspath
The path in a form suitable for use in a filesystem.
- generator
Return the generator for this URL, if the rowgenerator package is installed.
Returns: A row generator object.
- get_resource()
Get the contents of the resource and save it to the cache, returning a file-like object.
- get_target()
Get the contents of the target and save it to the cache, returning a file-like object.
- inner
Return the URL without the scheme extension and fragment. Re-parses the URL, so it should return the correct class for the inner URL.
- interpolate(context=None)
Use the Downloader.context to interpolate format strings in the URL. Re-parses the URL, returning a new URL.
Parameters: context – Extra context to interpolate with
Returns: A new Url object.
- is_archive
Return true if this URL is for an archive. Currently only ZIP is recognized.
- join(s)
Join a component to the end of the path, using os.path.join(). The argument s may be an appurl.Url or a string. If s includes a netloc property, it is assumed to be an absolute URL, and it is returned after parsing as a Url. Otherwise, the path component of s is extracted and joined to the path component of this URL.
Parameters: s – A Url object, or a string.
Returns: A copy of this URL.
- join_dir(s)
Join a component to the parent directory of the path, using join(dirname()).
Parameters: s – The component to join.
Returns: A copy of this URL.
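The difference between join() and join_dir() may be easier to see on plain strings. The sketch below mimics the behavior described above using only the standard library; it works on URL strings rather than Url objects, and it uses posixpath instead of os.path so the result does not depend on the platform. Both function bodies are assumptions based on the descriptions, not the library's code.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def join(base, s):
    """Sketch of join(): append s to the path of base, unless s is absolute."""
    other = urlsplit(s)
    if other.netloc:
        # s has a netloc, so it is treated as an absolute URL and returned.
        return s
    parts = urlsplit(base)
    new_path = posixpath.join(parts.path, other.path)
    return urlunsplit(parts._replace(path=new_path))


def join_dir(base, s):
    """Sketch of join_dir(): append s to the parent directory of base's path."""
    parts = urlsplit(base)
    parent = parts._replace(path=posixpath.dirname(parts.path))
    return join(urlunsplit(parent), s)


base = "http://example.com/dataset/archive.zip"
joined = join(base, "extra")
sibling = join_dir(base, "other.zip")
absolute = join(base, "http://other.org/file.csv")
print(joined)
print(sibling)
print(absolute)
```

In short, join() extends the existing path, while join_dir() replaces its last component.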