Here is a simple script, written in about 30 minutes, that downloads the Hindi newspaper Dainik Jagran in PDF format. The code is a bit messy, but it does the job. Here is the code:
    import os
    import shutil
    import urllib2
    from datetime import date, timedelta

    from pyPdf import PdfFileWriter, PdfFileReader

    # Folder in which one sub-folder per city is created (Windows path).
    BASE_DIR = r'C:\Users\Agnes\Desktop'


    def make_date():
        """Return today's date as DDMMYYYY and yesterday's day of month as DD."""
        today = date.today()
        yesterday = today - timedelta(1)
        return today.strftime('%d%m%Y'), yesterday.strftime('%d')


    def create_dir(city):
        """Create an empty folder for the city and make it the working directory."""
        directory = os.path.join(BASE_DIR, city)
        if os.path.exists(directory):
            shutil.rmtree(directory)
        os.makedirs(directory)
        os.chdir(directory)


    def make_pages(city, url_date, url_val, city_short, page):
        """Download pages starting at `page` until a 404, then merge them."""
        output = PdfFileWriter()
        while True:
            url = ("http://epaper.jagran.com/epaperimages/" + url_date + "/" + city +
                   "/" + url_val + city_short + "-pg" + str(page) + "-0.pdf")
            print url
            try:
                request = urllib2.urlopen(url)
            except urllib2.HTTPError, err:
                if err.code == 404:
                    break  # no such page: the edition is complete
                raise  # any other HTTP error is unexpected, so stop here
            print("Downloading %s page number %s \n" % (city, page))
            filename = 'Page-' + str(page) + '.pdf'
            with open(filename, "wb") as page_file:
                page_file.write(request.read())
            # The reader's file must stay open until output.write() is called.
            input1 = PdfFileReader(open(filename, "rb"))
            output.addPage(input1.getPage(0))
            page += 1
        with open("Full-Paper.pdf", "wb") as output_stream:
            output.write(output_stream)
        print("Download for %s completed!!!\n\n\n" % city)


    ########################## Execution ##################
    date_str, day = make_date()
    create_dir("Delhi")
    make_pages("delhi", date_str, day, 'del', 1)
    create_dir("Jhajjar")
    make_pages("panipat", date_str, day, 'jhr', 1)
It can download any city edition provided by the website. All you need to do is replace the city name in the "make_pages" calls at the end with the desired city name. The 4th parameter of the function needs to be set to the city code used in the URL (for example 'del' for Delhi). The 5th parameter is the page number to start downloading from, which can be set to any page.
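To see how these parameters end up in the download address, the URL construction can be pulled out into a small helper. This is only a sketch: `build_page_url` is a hypothetical name that is not part of the script, the example values (date string, day, city code) are purely illustrative, and it simply reproduces the string concatenation from `make_pages` without contacting the site.

```python
BASE_URL = "http://epaper.jagran.com/epaperimages/"  # prefix used in make_pages


def build_page_url(city, url_date, url_val, city_short, page):
    """Hypothetical helper: build the URL of one page the way make_pages does."""
    return "%s%s/%s/%s%s-pg%s-0.pdf" % (
        BASE_URL, url_date, city, url_val, city_short, page)


# Illustrative values only: edition dated 15 March 2012, previous day "14".
example = build_page_url("delhi", "15032012", "14", "del", 1)
print(example)
```

Printing a URL like this before running the full download is a quick way to check that the city name and city code are right for a new edition.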
The code targets Python 2 (it relies on urllib2 and the print statement) and uses one additional module that is not part of the Python standard library: pyPdf, which merges the downloaded pages into a single PDF.
This code is Windows-specific, but it can be used on other platforms easily by changing the directory path in the code.
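One way to remove the hard-coded Windows path is to build the target folder from the user's home directory. The sketch below is an assumption about how that could look, not part of the original script: `city_directory` is a hypothetical helper, and it assumes that a `Desktop` folder under the home directory is where the files should go.

```python
import os


def city_directory(city, base=None):
    """Hypothetical helper: return the folder for a city edition on any OS.

    `base` defaults to a Desktop folder under the user's home directory
    (an assumption; pass another folder if there is no Desktop).
    """
    if base is None:
        base = os.path.join(os.path.expanduser("~"), "Desktop")
    return os.path.join(base, city)


print(city_directory("Delhi"))
```

In `create_dir`, the line that sets `directory` could then become `directory = city_directory(city)`.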