Monday, 28 October 2013


ENEWSPAPER DOWNLOAD SCRIPT (PYTHON) [COOL PYTHON SCRIPTS]



Here is a simple script, written in about 30 minutes, that downloads the Hindi newspaper Dainik Jagran in PDF format. The code is a bit messy, but it does the job. Here is the code:

import os
from pyPdf import PdfFileWriter, PdfFileReader
import urllib2
import shutil

def make_date():
 from datetime import date, timedelta
 today = str(date.today())
 yesterday = date.today() - timedelta(1)
 yesterday = yesterday.day
 if yesterday in range(10):
  yesterday = '0'+str(yesterday)
 else:
  yesterday = str(yesterday)
 return ''.join(reversed(today.split('-'))), yesterday

def create_dir(city):
 # Backslashes are doubled so the last one does not escape the closing quote
 directory = 'C:\\Users\\Agnes\\Desktop\\' + city
 if os.path.exists(directory):
  shutil.rmtree(directory)
 os.makedirs(directory)
 os.chdir(directory)

def make_pages(city, url_date, url_val, city_short, page):
 continue_loop = True
 output = PdfFileWriter()
 while continue_loop:
  url = "http://epaper.jagran.com/epaperimages/" + url_date + "/" + city + "/" + url_val + city_short +"-pg"+ str(page) +"-0.pdf"
  print url
  try:
   request = urllib2.urlopen(url)
   print("Downloading %s page number %s \n" % (city, page))
   data = request.read()
   FILE = open('Page-' + str(page) + '.pdf', "wb")
   FILE.write(data)
   FILE.close()
   input1 = PdfFileReader(file("Page-"+ str(page)+".pdf", "rb"))
   output.addPage(input1.getPage(0))
  except urllib2.HTTPError, err:
   if err.code == 404:
    continue_loop = False
  page +=1
 outputStream = file("Full-Paper.pdf", "wb")
 output.write(outputStream)
 outputStream.close()
 print("Download for %s completed!!!\n\n\n" % city)


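def make_pdf(city, city_short, page=1):
 # Convenience wrapper (not used below); relies on the module-level date and day set in the Execution section
 make_pages(city, date, day, city_short, page)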
 
 
########################## Execution ##################
date, day =  make_date()

create_dir("Delhi")
make_pages("delhi", date, day, 'del', 1)
create_dir("Jhajjar")
make_pages("panipat", date, day, 'jhr', 1)
It can download any city edition provided by the website. All you need to do is replace the city name in the "make_pages" calls at the end with the desired city name. The 4th parameter of the function needs to be set according to the URL city code. The 5th parameter is the page number to start from, which can be set to any page.
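
As a rough illustration of how those parameters end up in the page URL, here is a small standalone sketch. The helper names make_date_parts and build_page_url are hypothetical and not part of the script above; the URL pattern is copied from make_pages, and the date used is only a sample.

```python
from datetime import date, timedelta

BASE_URL = "http://epaper.jagran.com/epaperimages/"  # same base as in make_pages

def make_date_parts(today):
    """Return (today as DDMMYYYY, yesterday's day of month, zero-padded)."""
    yesterday = today - timedelta(days=1)
    return today.strftime("%d%m%Y"), yesterday.strftime("%d")

def build_page_url(url_date, city, url_val, city_short, page):
    """Hypothetical helper: assemble one page URL the way make_pages does."""
    return "%s%s/%s/%s%s-pg%d-0.pdf" % (
        BASE_URL, url_date, city, url_val, city_short, page)

# Sample run for the Delhi edition, page 1, using the date of this post
url_date, url_val = make_date_parts(date(2013, 10, 28))
print(build_page_url(url_date, "delhi", url_val, "del", 1))
# http://epaper.jagran.com/epaperimages/28102013/delhi/27del-pg1-0.pdf
```

Changing "delhi" and "del" to another edition's city name and city code gives the URL for that edition, provided the website still follows this pattern.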

The code uses one additional module that is not part of the Python standard library, pyPdf, to read the downloaded pages and write the combined PDF.

This code is Windows-specific, but it can easily be used on other platforms by changing the directory path in the code.
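
One way to avoid the hard-coded Windows path is to build the folder from the current user's home directory. This is only a minimal sketch, assuming the home directory contains a Desktop folder (which is not true on every system); desktop_dir and its home parameter are hypothetical names, not part of the original script.

```python
import os

def desktop_dir(city, home=None):
    """Return <home>/Desktop/<city> using the platform's path separator."""
    if home is None:
        # Resolves to the current user's home directory on Windows, Linux and macOS
        home = os.path.expanduser("~")
    return os.path.join(home, "Desktop", city)

print(desktop_dir("Delhi"))
```

The returned path could replace the directory string built in create_dir, leaving the rest of that function unchanged.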

