Expected duration: less than 1 week

I'm looking for a new feature to be added to GNU Wget which will alter URLs before they are added to the download queue, based on user-provided PCRE (Perl Compatible Regular Expression) patterns.
The feature should support multiple PCREs, passed via command-line arguments (and optionally via other mechanisms, such as an input file).
[ Note that "--url-replace" does not currently exist, and is there to demonstrate the feature desired ]
Given the above command, Wget should apply the patterns given with the "--url-replace" arguments early enough such that checks for the corresponding local file already existing (i.e. --no-clobber) are based on the post-replacement URL's value.
A Worked Example:
When recursively downloading / mirroring from the start page "https://site.tld/", using the "Example Syntax" command above:
- Wget finds the link "https://site.tld/aaa.html?a=1&foo=2&bar=3".
- This link is converted into "https://site.tld/BBB.html?a=1&bar=3" after applying the '--url-replace' patterns above (NB: there is a /gi modifier on the second pattern).
- If the local file "./site.tld/BBB.html?a=1&bar=3" already exists, the content will not be downloaded. Otherwise, wget will fetch it as normal.
- If Wget later finds a link to "https://site.tld/a3.bkp.html", this URL is converted to "https://site.tld/a3.BACKUP.html" before being downloaded and saved at "./site.tld/a3.BACKUP.html" (again, assuming that ./site.tld/a3.BACKUP.html does not already exist).
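
The worked example can be modelled roughly in Python. This is only a sketch of the intended behaviour, not Wget code: Python's `re` module stands in for PCRE, the helper names (`apply_rules`, `local_path`, `should_download`) are made up for illustration, and the three rules are hypothetical patterns that happen to reproduce the conversions listed above, not necessarily the ones from the command.

```python
import os
import re
from urllib.parse import urlsplit

# Hypothetical rules: (pattern, replacement, flags, replace_all).
# Illustrative only; they are not taken from the posting's command.
RULES = [
    (r"aaa\.html", "BBB.html", 0, False),
    (r"&foo=[^&]*", "", re.IGNORECASE, True),   # models a /gi modifier
    (r"\.bkp\.", ".BACKUP.", 0, False),
]

def apply_rules(url, rules=RULES):
    """Apply each rule in order and return the rewritten URL."""
    for pattern, replacement, flags, replace_all in rules:
        count = 0 if replace_all else 1   # in re.sub, count=0 means "replace all"
        url = re.sub(pattern, replacement, url, count=count, flags=flags)
    return url

def local_path(url, root="."):
    """Approximate the local file name for a URL (host + path + query)."""
    parts = urlsplit(url)
    name = parts.path.lstrip("/")
    if parts.query:
        name += "?" + parts.query
    return os.path.join(root, parts.netloc, name)

def should_download(url, root="."):
    """Rewrite first, then do the no-clobber style existence check."""
    rewritten = apply_rules(url)
    return rewritten, not os.path.exists(local_path(rewritten, root))
```

The point of the sketch is the ordering in `should_download`: the existence check runs on the post-replacement URL, so a file saved under the rewritten name is what suppresses a later download.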
NOTE: If you can suggest an alternative to the approach above which achieves the same objective and fits within the project budget, I'm willing to consider it. As an example, passing each URL through an external script for rewriting before it is added to the queue might be a reasonable approach.
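
To make the external-script idea concrete, here is a minimal sketch of what such a script could look like. How Wget would invoke it is left open; the protocol assumed here (one URL per line in, one rewritten URL per line out) is purely an assumption, as are the function names and the single sample rule.

```python
import re

# Hypothetical rewrite rules; a real script would load the user's own patterns.
RULES = [
    (re.compile(r"\.bkp\."), ".BACKUP."),
]

def rewrite(url):
    """Return the URL after applying every rule in order."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url

def filter_stream(src, dst):
    """Read one URL per line from src and write each rewritten URL to dst."""
    for line in src:
        url = line.strip()
        if url:                      # skip blank lines
            dst.write(rewrite(url) + "\n")
```

Saved as a standalone script, it would finish by calling `filter_stream(sys.stdin, sys.stdout)` after `import sys`, so that Wget (or a wrapper) could pipe candidate URLs through it before queueing them.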
Delivery requirements:
- You will add the feature to the existing code available from https://ftp.gnu.org/gnu/wget/wget-latest.tar.gz (currently: 1.20.3)
- You will provide a zip file of the complete source code with the new feature added, and a patch file which can be applied to the original (clean) wget-1.20.3 source code.
- The updated code will compile cleanly on a Fedora 32/CentOS 8 system with normal wget dependencies already installed via DNF/yum (pcre/pcre2/gnutls/nettle/zlib/libidn2)
- The usual commands "make distclean; ./configure --enable-pcre; make" will work as expected, leaving a ready-to-execute "wget" binary under ./src/wget.