Friday, 3 July 2015

ECJ clarifies Database Directive scope in screen scraping case

EC on the legal protection of databases (Database Directive) in a case concerning the extraction of data from a third party’s website by means of automated systems or software for commercial purposes (so called 'screen scraping').

Flight data extracted

The case, Ryanair Ltd vs. PR Aviation BV, C-30/14, is of interest to a range of companies such as price comparison websites. It stemmed from  Dutch company PR Aviation operation of a website where consumers can search through flight data of low-cost airlines  (including Ryanair), compare prices and, on payment of a commission, book a flight. The relevant flight data is extracted from third-parties’ websites by means of ‘screen scraping’ practices.

Ryanair claimed that PR Aviation’s activity:

• amounted to infringement of copyright (relating to the structure and architecture of the database) and of the so-called sui generis database right (i.e. the right granted to the ‘maker’ of the database where certain investments have been made to obtain, verify, or present the contents of a database) under the Netherlands law implementing the Database Directive;

• constituted breach of contract. In this respect, Ryanair claimed that a contract existed with PR Aviation for the use of its website. Access to the latter requires acceptance, by clicking a box, of the airline’s general terms and conditions which, amongst others, prohibit unauthorized ‘screen scraping’ practices for commercial purposes.

Ryanair asked Dutch courts to prohibit the infringement and order damages. In recent years the company has been engaged in several legal cases against web scrapers across Europe.

The Local Court, Utrecht, and the Court of Appeals of Amsterdam dismissed Ryanair’s claims on different grounds. The Court of Appeals, in particular, cited PR Aviation’s screen scraping of Ryanair’s website as amounting to a “normal use” of said website within the meaning of the lawful user exceptions under Sections 6 and 8 of the Database Directive, which cannot be derogated by contract (Section 15).

Ryanair appealed

Ryanair appealed the decision before the Netherlands Supreme Court (Hoge Raad der Nederlanden), which decided to refer the following question to the ECJ for a preliminary ruling: “Does the application of [Directive 96/9] also extend to online databases which are not protected by copyright on the basis of Chapter II of said directive or by a sui generis right on the basis of Chapter III, in the sense that the freedom to use such databases through the (whether or not analogous) application of Article[s] 6(1) and 8, in conjunction with Article 15 [of Directive 96/9] may not be limited contractually?.”

The ECJ’s ruling

The ECJ (without the need of the opinion of the advocate general) ruled that the Database Directive is not applicable to databases which are not protected either by copyright or by the sui generis database right. Therefore, exceptions to restricted acts set forth by Sections 6 and 8 of the Directive do not prevent the database owner from establishing contractual limitations on its use by third parties. In other words, restrictions to the freedom to contract set forth by the Database Directive do not apply in cases of unprotected databases. Whether Ryanair’s website may be entitled to copyright or sui generis database right protection needs to be determined by the competent national court.

The ECJ’s decision is not particularly striking from a legal standpoint. Yet, it could have a significant impact on the business model of price comparison websites, aggregators, and similar businesses. Owners of databases that could not rely on intellectual property protection may contractually prevent extraction and use (“scraping”) of content from their online databases. Thus, unprotected databases could receive greater protection than the one granted by IP law.

Antitrust implications

However, the lawfulness of contractual restrictions prohibiting access and reuse of data through screen scraping practices should be assessed under an antitrust perspective. In this respect, in 2013 the Court of Milan ruled that Ryanair’s refusal to grant access to its database to the online travel agency Viaggiare S.r.l. amounted to an abuse of dominant position in the downstream market of information and intermediation on flights (decision of June 4, 2013 Viaggiare S.r.l. vs Ryanair Ltd). Indeed, a balance should be struck between the need to compensate the efforts and investments made by the creator of the database with the interest of third parties to be granted with access to information (especially in those cases where the latter are not entitled to copyright protection).

Additionally, web scraping triggers other issues which have not been considered by the ECJ’s ruling. These include, but are not limited to trademark law (i.e., whether the use of a company’s names/logos by the web scraper without consent may amount to trademark infringement), data protection (e.g., in case the scraping involves personal data), or unfair competition.

Source: http://www.globallegalpost.com/blogs/global-view/ecj-clarifies-database-directive-scope-in-screen-scraping-case-128701/

Wednesday, 24 June 2015

Data Scraping - What Are Hand-Scraped Hardwood Floors and What Are the Benefits?

If you love the look of hardwood flooring with lots of character, then you may want to check out hand-scraped hardwood flooring. Hand-scraped wood provides a warm vintage look, providing the floor instant character. These types of scraped hardwoods are suitable for living rooms, dining rooms, hallways and bedrooms. But what exactly is hand-scraped hardwood flooring?

Well, it is literally what you think it is. Hand-scraped hardwood flooring is created by hand using specialized wood working tools to make each board unique and giving an overall "old worn" appearance.

At Innovation Builders we offer solid wood floors finished on site with an actual hand-scraping technique followed by stain and sealer. Solid wood floors are installed by an expert team of technicians who work each board with skilled craftsman-like attention to detail. Following the scraping procedure the floor is stained by hand with a customer selected stain color, and then protected with multiple coats of sealing and finishing polyurethane. This finishing process of staining, sealing and coating the wood floors contributes to providing the look and durability of an old reclaimed wood floor, but with today's tough, urethane finishes.

There are many, many benefits to hand-scraped wood flooring. Overall, these floors are extremely durable and hard wearing, providing years of trouble-free use. These wood floors remain looking newer for longer because the texture that the process provides hides the typical dents, dings and scratches that other floors can't hide so easily. That's great news for households with kids, dogs, and cats.

These types of wood flooring have another unique advantage as well. When you do scratch these floors during their lifetime, the scratches are easily repaired. As long as the scratch isn't too deep you can make them practically disappear without ever having to hire a professional. It's simple to hide the scratch by using a color-matched stain marker or repair kit that is readily available through local flooring distributors. These features make hand-scraped hardwood flooring a lot more durable and hassle-free to maintain than other types of wood flooring.

The expert processes utilized in the creation of these floors provides a custom look of worn wood with deep color and subtle highlights. When the light hits the wood at different times during the day, it provides an understated but powerful effect of depth and beauty. They instantly offer your rooms a rustic look full of character, allowing your home to become a warm and inviting environment. The rustic look of this wood provides a texture, style and rustic appeal that cannot be matched by any other type of flooring.

Hand-Scraped Hardwood Flooring is a floor that says welcome and adds a touch of elegance to any home. If you are looking to buy a new home and you haven't had the opportunity to see or feel hand scraped hardwoods, stop in any of the model homes at Innovation Builders in Keller, North Richland Hills or Grand Prairie, Texas and check it out!

Source: http://ezinearticles.com/?What-Are-Hand-Scraped-Hardwood-Floors-and-What-Are-the-Benefits?&id=6026646

Monday, 8 June 2015

Web Scraping Services : Making Modern File Formats More Accessible

Data scraping is the process of automatically sorting through information contained on the internet inside html, PDF or other documents and collecting relevant information to into databases and spreadsheets for later retrieval. On most websites, the text is easily and accessibly written in the source code but an increasing number of businesses are using Adobe PDF format (Portable Document Format: A format which can be viewed by the free Adobe Acrobat software on almost any operating system. See below for a link.). The advantage of PDF format is that the document looks exactly the same no matter which computer you view it from making it ideal for business forms, specification sheets, etc.; the disadvantage is that the text is converted into an image from which you often cannot easily copy and paste. PDF Scraping is the process of data scraping information contained in PDF files. To PDF scrape a PDF document, you must employ a more diverse set of tools.

There are two main types of PDF files: those built from a text file and those built from an image (likely scanned in). Adobe's own software is capable of PDF scraping from text-based PDF files but special tools are needed for PDF scraping text from image-based PDF files. The primary tool for PDF scraping is the OCR program. OCR, or Optical Character Recognition, programs scan a document for small pictures that they can separate into letters. These pictures are then compared to actual letters and if matches are found, the letters are copied into a file. OCR programs can perform PDF scraping of image-based PDF files quite accurately but they are not perfect.

Once the OCR program or Adobe program has finished PDF scraping a document, you can search through the data to find the parts you are most interested in. This information can then be stored into your favorite database or spreadsheet program. Some PDF scraping programs can sort the data into databases and/or spreadsheets automatically making your job that much easier.

Quite often you will not find a PDF scraping program that will obtain exactly the data you want without customization. Surprisingly a search on Google only turned up one business, that will create a customized PDF scraping utility for your project. A handful of off the shelf utilities claim to be customizable, but seem to require a bit of programming knowledge and time commitment to use effectively. Obtaining the data yourself with one of these tools may be possible but will likely prove quite tedious and time consuming. It may be advisable to contract a company that specializes in PDF scraping to do it for you quickly and professionally.

Let's explore some real world examples of the uses of PDF scraping technology. A group at Cornell University wanted to improve a database of technical documents in PDF format by taking the old PDF file where the links and references were just images of text and changing the links and references into working clickable links thus making the database easy to navigate and cross-reference. They employed a PDF scraping utility to deconstruct the PDF files and figure out where the links were. They then could create a simple script to re-create the PDF files with working links replacing the old text image.

A computer hardware vendor wanted to display specifications data for his hardware on his website. He hired a company to perform PDF scraping of the hardware documentation on the manufacturers' website and save the PDF scraped data into a database he could use to update his webpage automatically.

PDF Scraping is just collecting information that is available on the public internet. PDF Scraping does not violate copyright laws.

PDF Scraping is a great new technology that can significantly reduce your workload if it involves retrieving information from PDF files. Applications exist that can help you with smaller, easier PDF Scraping projects but companies exist that will create custom applications for larger or more intricate PDF Scraping jobs.

Source: http://ezinearticles.com/?PDF-Scraping:-Making-Modern-File-Formats-More-Accessible&id=193321

Tuesday, 2 June 2015

Scraping the Royal Society membership list

To a data scientist any data is fair game, from my interest in the history of science I came across the membership records of the Royal Society from 1660 to 2007 which are available as a single PDF file. I’ve scraped the membership list before: the first time around I wrote a C# application which parsed a plain text file which I had made from the original PDF using an online converting service, looking back at the code it is fiendishly complicated and cluttered by boilerplate code required to build a GUI. ScraperWiki includes a pdftoxml function so I thought I’d see if this would make the process of parsing easier, and compare the ScraperWiki experience more widely with my earlier scraper.

The membership list is laid out quite simply, as shown in the image below, each member (or Fellow) record spans two lines with the member name in the left most column on the first line and information on their birth date and the day they died, the class of their Fellowship and their election date on the second line.

Later in the document we find that information on the Presidents of the Royal Society is found on the same line as the Fellow name and that Royal Patrons are formatted a little differently. There are also alias records where the second line points to the primary record for the name on the first line.

pdftoxml converts a PDF into an xml file, wherein each piece of text is located on the page using spatial coordinates, an individual line looks like this:

<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>

This makes parsing columnar data straightforward you simply need to select elements with particular values of the “left” attribute. It turns out that the columns are not in exactly the same positions throughout the whole document, which appears to have been constructed by tacking together the membership list A-J with that of K-Z, but this can easily be resolved by accepting a small range of positions for each column.

Attempting to automatically parse all 395 pages of the document reveals some transcription errors: one Fellow was apparently elected on 16th March 197 – a bit of Googling reveals that the real date is 16th March 1978. Another fellow is classed as a “Felllow”, and whilst most of the dates of birth and death are separated by a dash some are separated by an en dash which as far as the code is concerned is something completely different and so on. In my earlier iteration I missed some of these quirks or fixed them by editing the converted text file. These variations suggest that the source document was typed manually rather than being output from a pre-existing database. Since I couldn’t edit the source document I was obliged to code around these quirks.

ScraperWiki helpfully makes putting data into a SQLite database the simplest option for a scraper. My handling of dates in this version of the scraper is a little unsatisfactory: presidential terms are described in terms of a start and end year but are rendered 1st January of those years in the database. Furthermore, in historical documents dates may not be known accurately so someone may have a birth date described as “circa 1782″ or “c 1782″, even more vaguely they may be described as having “flourished 1663-1778″ or “fl. 1663-1778″. Python’s default datetime module does not capture this subtlety and if it did the database used to store dates would need to support it too to be useful – I’ve addressed this by storing the original life span data as text so that it can be analysed should the need arise. Storing dates as proper dates in the database, rather than text strings means we can query the database using date based queries.

ScraperWiki provides an API to my dataset so that I can query it using SQL, and since it is public anyone else can do this too. So, for example, it’s easy to write queries that tell you the the database contains 8019 Fellows, 56 Presidents, 387 born before 1700, 3657 with no birth date, 2360 with no death date, 204 “flourished”, 450 have birth dates “circa” some year.

I can count the number of classes of fellows:

select distinct class,count(*) from `RoyalSocietyFellows` group by class

Make a table of all of the Presidents of the Royal Society

select * from `RoyalSocietyFellows` where StartPresident not null order by StartPresident desc

…and so on. These illustrations just use the ScraperWiki htmltable export option to display the data as a table but equally I could use similar queries to pull data into a visualisation.

Comparing this to my earlier experience, the benefits of using ScraperWiki are:

•    Nice traceable code to provide a provenance for the dataset;

•    Access to the pdftoxml library;

•    Strong encouragement to “do the right thing” and put the data into a database;

•    Publication of the data;

•    A simple API giving access to the data for reuse by all.

My next target for ScraperWiki may well be the membership lists for the French Academie des Sciences, a task which proved too complex for a simple plain text scraper…

Source: https://scraperwiki.wordpress.com/2012/12/28/scraping-the-royal-society-membership-list/

Thursday, 28 May 2015

Data Scraping Services - Scraping Yelp Business Data With Python Scraping Script

Yelp is a great source of business contact information with details like address, postal code, contact information; website addresses etc. that other site like Google Maps just does not. Yelp also provides reviews about the particular business. The yelp business database can be useful for telemarketing, email marketing and lead generation.

Are you looking for yelp business details database? Are you looking for scraping data from yelp website/business directory? Are you looking for yelp screen scraping software? Are you looking for scraping the business contact information from the online Yelp? Then you are at the right place.

Here I am going to discuss how to scrape yelp data for lead generation and email marketing. I have made a simple and straight forward yelp data scraping script in python that can scrape data from yelp website. You can use this yelp scraper script absolutely free.

I have used urllib, BeautifulSoup packages. Urllib package to make http request and parsed the HTML using BeautifulSoup, used Threads to make the scraping faster.

Yelp Scraping Python Script

import urllib from bs4 import BeautifulSoup import re from threading import Thread #List of yelp urls to scrape url=['http://www.yelp.com/biz/liman-fisch-restaurant-hamburg','http://www.yelp.com/biz/casa-franco-caramba-hamburg','http://www.yelp.com/biz/o-ren-ishii-hamburg','http://www.yelp.com/biz/gastwerk-hotel-hamburg-hamburg-2','http://www.yelp.com/biz/superbude-hamburg-2','http://www.yelp.com/biz/hotel-hafen-hamburg-hamburg','http://www.yelp.com/biz/hamburg-marriott-hotel-hamburg','http://www.yelp.com/biz/yoho-hamburg'] i=0 #function that will do actual scraping job def scrape(ur): html = urllib.urlopen(ur).read() soup = BeautifulSoup(html) title = soup.find('h1',itemprop="name") saddress = soup.find('span',itemprop="streetAddress") postalcode = soup.find('span',itemprop="postalCode") print title.text print saddress.text print postalcode.text print "-------------------" threadlist = [] #making threads while i<len(url): t = Thread(target=scrape,args=(url[i],)) t.start() threadlist.append(t) i=i+1 for b in

threadlist: b.join()

import urllib

from bs4 import BeautifulSoup

import re

from threading import Thread

 #List of yelp urls to scrape

url=['http://www.yelp.com/biz/liman-fisch-restaurant-hamburg','http://www.yelp.com/biz/casa-franco-caramba-hamburg','http://www.yelp.com/biz/o-ren-ishii-hamburg','http://www.yelp.com/biz/gastwerk-hotel-hamburg-hamburg-2','http://www.yelp.com/biz/superbude-hamburg-2','http://www.yelp.com/biz/hotel-hafen-hamburg-hamburg','http://www.yelp.com/biz/hamburg-marriott-hotel-hamburg','http://www.yelp.com/biz/yoho-hamburg']

 i=0

#function that will do actual scraping job

def scrape(ur):

           html = urllib.urlopen(ur).read()

          soup = BeautifulSoup(html)

       title = soup.find('h1',itemprop="name")

          saddress = soup.find('span',itemprop="streetAddress")

          postalcode = soup.find('span',itemprop="postalCode")

          print title.text

          print saddress.text

          print postalcode.text

          print "-------------------"

 threadlist = []

#making threads

while i<len(url):

          t = Thread(target=scrape,args=(url[i],))

          t.start()

          threadlist.append(t)

          i=i+1

for b in threadlist:

          b.join()

Recently I had worked for one German company and did yelp scraping project for them and delivered data as per their requirement. If you looking for scraping data from business directories like yelp then send me your requirement and I will get back to you with sample.

Source: http://webdata-scraping.com/scraping-yelp-business-data-python-scraping-script/

Tuesday, 26 May 2015

Data Mining Services

Data Minng Services, through its data mining services can mine required data for you from any of the available sources. Over the years, we have successfully catered to wide variety of outsource data mining requirements, which specifies our competency in dealing with your data mining requirements.

Based on your requirements, we can mine data from your preferred data sources, or we will use our own reliable sources to mine the data required by you. We have been using automated as well manual data mining strategies to deliver superior data mining services.

Types of data mining services delivered by us

With an extensive variety of data mining services provided by us, you will definitely be able to find the most perfect service package to cater to your requirements. Below listed are just some of the data mining services offered by us:

•    Web data mining
•    Data extraction
•    Data capture
•    Data gathering
•    Collection of required data
•    Validation of data

Outsource data mining requirements to us, and we are sure that the data mining India unit of Hi-Tech BPO Services will be able to formulate the most appropriate and cost effective solutions to include your entire requirements.

Highlights of our data mining services:

•    Most affordable rates
•    Dedicated data mining India unit
•    Latest data mining technologies used to mine all required data
•    Data will be mined, gathered, processed and validated as per your requirements
•    Mined data can be directly included into your database

Competitive advantage of using our data mining services

To mine accurate and relevant data, some level of internet knowledge is essential. And it would also consume a lot of your valuable time. With our data mining services, we will take care of all your data mining tasks, while you look after your business and its core functions.

The affordably priced data mining services delivered by the data mining India unit will also help you to save considerable amount of your money, which you can put into more productive purposes.

Source: http://www.hitechbposervices.com/data-mining.php

Monday, 25 May 2015

Improving performance for web scraping code

2 down vote favorite

I have a website in which the code scrapes other websites for getting the accurate data. While the code works good but there a decent lag in performance because the code firsts downloads the html stream from various sites(some times 9 websites), extracts the relative part and then renders the html page.

What should I do to get an optimal performance. Should I change from shared hosting (godaddy) to my own server or it has nothing to do with my hosting and I need to make changes to my code?

1 Answer

API/CSV

Ask those websites if they provide an API, or, if you don't need an up-to-date information or the information you need doesn't change frequently, if they can sell/give you for free the data itself (for example in an CSV file). Some small websites may have fancier ways to access data, like a CSV file for the older information, and an RSS feed for the changed one.

Those websites would probably be happy to help you, since providing you with an API would reduce their own CPU and bandwidth usage by you.

Profile

Screen scrapping is really ugly when it comes to performance and scaling. You may be limited by:

    your machine performance, since parsing, sometimes an invalid HTML file, takes time,

    your network speed,

    their network speed usage, i.e. how fast can you access the pages of their website depending on the restrictions they set, like the DOS protection and the number of requests per second for screen scrappers and search engine crawlers,

    their machine performance: if they spend 500 ms. to generate every page, you can't do anything to reduce this delay.

If, despite your requests to them, those websites cannot provide any convenient way to access their data, but they give you a written consent to screen scrape their website, then profile your code to determine the bottleneck. It may be the internet speed. It may be your database queries. It may be anything.

For example, you may discover that you spend too much time finding with regular expressions the relevant information in the received HTML. In that case, you would want to stop doing it wrong and use a parser instead of regular expressions, then see how this improve the performance.

You may also find that the bottleneck is the time the remote server spends generating every page. In this case, there is nothing to do: you may have the fastest server, the fastest connection and the most optimized code, the performance will be the same.

Do things in parallel:

Remember to use parallel computing wisely and to always profile what you're doing, instead of doing premature optimization, in hope that you're smarter than the profiler.

Especially when it comes to using network, you may be very surprised. For example, you may believe that making more requests in parallel will be faster, but as Steve Gibson explains in episode 345 of Security Now, this is not always the case.

Legal aspects

Also note that screen scrapping is explicitly forbidden by the conditions of use (like on IMDB) on many websites. And if nothing is said on this subject in conditions of use, it doesn't mean that you can screen scrape those websites.

The fact that the information is available publicly on the internet doesn't give you the right to copy and reuse it this way neither.

Why? you may ask. For two reasons:

    Most websites are relying on advertisement and marketing. When people use one of those websites directly, they waste some CPU/network bandwidth of the website, but in response, they may click on an ad or buy something sold on the website. When you screen scrape, your bot waste their CPU/network bandwidth, but will never click on an ad or buy something.

    Displaying the information you screen scrapped on your website can have even worse effects. Example: in France, there are two major websites selling hardware. The first one is easy and fast to use, has a nice visual design, better SEO, and in general is very well done. The second one is a crap, but the prices are lower. If you screen scrape them and give the raw results (prices with links) to your users, they will obviously click on the lower price every time, which means that the website with pretty design will have less chances to sell the products.

    People made an effort in collecting, processing and displaying some data. Sometimes they paid to get it. Why would they enjoy seeing you pulling this data conveniently and for free?

Source: http://programmers.stackexchange.com/questions/141403/improving-performance-for-web-scraping-code/141406#141406