Thursday, 27 November 2014

Scraping Online Communities for your Outreach Campaigns

Online communities offer a wealth of intelligence for blog owners and business owners alike.

Exploring the data within popular communities will help you to understand who the major influencers are, what content is popular and who are the key content aggregators within your niche.

This is all well and good to talk about, but is it feasible to sort through online communities manually to find this information? Probably not.

This is where data scraping comes in.
What is Scraping and What Can it do?

I’m not going to go into great detail on what data scraping actually means, but to simplify this, here’s a definition from the Wikipedia page:

    “Data scraping is a technique in which a computer program extracts data from human-readable output coming from another program.”

Let me explain this with a little example…

Imagine a huge community full of individuals within your industry. Each person within the community has a personal profile page that contains information about their interests, contact details, social profiles, etc.

If you were tasked with gathering all of this data on every individual, you might start hyperventilating at the thought of all the copying and pasting you’d need to do.

Well, an alternative is to scrape this content: you automate the whole process and export the information into a manageable, more consumable format in a matter of seconds. It’d be pretty awesome, right?
Luckily for you, I’m going to show you how to do just that!
The Example of Inbound.org

Recently, I wanted to gather a list of digital marketers that were fairly active on social media and shared a lot of content online within communities. These people were going to be some of my core targets to get content from the blog in front of.

To do this, I first found some active communities online where these types of individuals hang out. Being a digital marketer myself, this process was fairly easy and I chose Inbound.org as my starting place.

Scoping out Data Requirements
Each community is different and you’ll be able to gather varying information within each.

The place to look for this information is within the individual user profile pages. This is usually where the contact information or links to social media accounts are likely to be displayed.

For this particular exercise, I wanted to gather the following information:

    Full name
    Job title
    Company name and URL
    Location
    Personal website URL
    Twitter URL, handle and follower/following stats
    Google+ URL, follower count and list of contributor URLs
    Profile image URL
    Facebook URL
    LinkedIn URL

With all of this information I’ll be able to get a huge amount of intelligence about the community members. I’ll also have a list of social media accounts to add and engage with.
On top of this, with all the information on their websites and sites that they write for, I’ll have a wealth of potential link building prospects to work on.

Inbound.org Profiles

You’ll see in the above screenshot that a few of the pieces of data are available to see on the Inbound.org user profiles. We’ll need to get the other bits of information from the likes of Twitter and Google+, but this will all stem from the scraping of Inbound.org.

Scraping the Data

The idea behind this is that we can set up a template based on one of the user profiles and then automate the data gathering across the rest of the profiles on the site.

This is where you’ll need to install the SEO Tools plugin for Excel (it’s free). If you’ve not used this plugin before, don’t worry – I’ve put together a full tutorial here.

Once you’ve installed the plugin, you’re good to go on the actual scraping side of things…
Quick Note: Don’t worry if you don’t have a good knowledge of coding – you don’t need it. All you’ll need is a very basic understanding of reading some code and some basic Excel skills.

To begin with, you’ll need to do a little Excel admin. Simply add in some column titles based around the data that you’re gathering. For example, with my example of Inbound.org, I had, ‘Name’, ‘Position’, ‘Company’, ‘Company URL’, etc. which you can see in the screenshot below. You’ll also want to add in a sample profile URL to work on building the template around.

spreadsheet admin
Now it’s time to start getting hands on with XPath.
How to Use XPathOnURL()

This handy little formula is made possible within Excel by the SEO Tools plugin. Now, I’m going to keep this very basic because there are loads of XPath tutorials available online that can go into the very advanced queries that are possible to use.

For this, I’m simply going to show you how to get the data we want and you can have a play around yourself afterwards (you can download the full template at the end of this post).

Here’s an example of an XPath query that gathers the name of the person within the profile that we’re scraping:

=XPathOnUrl(A2, "//*[@id='user-profile']/h2")

A2 is simply referencing the cell that contains the URL that we’re scraping. You’ll see in the screenshot above that this is Jason Acidre’s profile page.

The next part of the formula is the XPath.

What this essentially says is to scrape through the HTML to find a tag that has ‘user-profile’ id attached to it. This could be a div, span, a or whatever.

Once it’s found this tag, it then needs to look at the first h2 tag within this area and grab the text within it. This is Jason’s name, which you’ll see in the screenshot below of the code:
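
For readers who want to try the same query outside Excel, here is a minimal Python sketch that uses only the standard library. The HTML snippet is a made-up stand-in for a profile page (the real Inbound.org markup will differ), and because ElementTree only supports a subset of XPath, the query is written relative to the root as `.//*[@id='user-profile']/h2`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified profile markup; not the real Inbound.org page.
SAMPLE_PAGE = """
<html>
  <body>
    <div id="user-profile">
      <h2>Jason Acidre</h2>
      <p>Job title goes here</p>
    </div>
  </body>
</html>
"""

def first_text(page_source, xpath):
    """Return the text of the first element matching xpath, or None."""
    root = ET.fromstring(page_source.strip())
    match = root.find(xpath)
    if match is None or match.text is None:
        return None
    return match.text.strip()

# Any tag with id 'user-profile', then the first h2 directly inside it.
profile_name = first_text(SAMPLE_PAGE, ".//*[@id='user-profile']/h2")
print(profile_name)
```

Real web pages are rarely well-formed XML, so a production scraper would likely need a forgiving HTML parser; this sketch only illustrates how the query is read.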

website code

Don’t be put off at this stage because you don’t need to go manually trawling through code to build these queries, there’s a much simpler way.

The easiest way to do this is by right-clicking on the element you want to scrape on the webpage (within Chrome); for example, on Inbound.org, this would be the profile name. Now click ‘Inspect element’.

inspect element

The developer tools window should now appear at the bottom of your browser (or in a separate window). Within that, you should see the element that you’ve drilled down on.

All you need to do now is right-click on it and press ‘Copy XPath’.
copy XPath

This will now copy the XPath code for your Excel formula to the clipboard. You’ll just need to add in the first part of the query, i.e. =XPathOnUrl(A2,

You can then paste in the copied XPath after this and add a closing bracket.

Note: When you use ‘Copy XPath’, some parts of the code will be wrapped in double quotes (") which you’ll need to change to single quotes ('). You’ll also need to wrap the whole copied XPath in double quotes.

Your finished code will look like this:
=XPathOnUrl(A2, "//*[@id='user-profile']/h2")
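
If you are building lots of these formulas, the quote swapping can be scripted. The following is a small Python sketch of that idea; the helper name `build_xpath_formula` is made up for illustration and is not part of the SEO Tools plugin.

```python
def build_xpath_formula(cell, copied_xpath):
    """Turn an XPath copied from Chrome into an XPathOnUrl formula string."""
    # Excel string literals are delimited by double quotes, so the quotes
    # inside the XPath are switched to single quotes first.
    safe_xpath = copied_xpath.replace('"', "'")
    return '=XPathOnUrl({}, "{}")'.format(cell, safe_xpath)

formula = build_xpath_formula("A2", '//*[@id="user-profile"]/h2')
print(formula)  # =XPathOnUrl(A2, "//*[@id='user-profile']/h2")
```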

You can then apply this formula against any Inbound.org profile and it will automatically grab the user’s full name. Pretty good, right?

Check out the full video tutorial below that I’ve put together that talks you through this whole process:


XPath Examples for Grabbing Other Data

As you’re probably starting to see, this technique could be scaled across any website online. This makes big data much more attainable and gives you the kind of results that an expensive paid tool would offer without any of the cost – bonus!

Here’s a few more examples of XPath that you can use in conjunction with the SEO Tools plugin within Excel to get some handy information.

Twitter Follower Count

If you want to grab the number of followers for a Twitter user then you can use the following formula. Simply replace D57 with the cell that holds the Twitter profile URL of the user you want data on. Just a quick word of warning with this one: it looks really long and complicated, but really I’ve just wrapped the query in another Excel formula, RIGHT, which keeps everything after the first 10 characters and so snips off the ‘followers’ label text.

=RIGHT(XPathOnUrl(D57,"//li[@class='ProfileNav-item ProfileNav-item--followers']"),LEN(XPathOnUrl(D57,"//li[@class='ProfileNav-item ProfileNav-item--followers']"))-10)
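
To see what the RIGHT and LEN combination does, here is a rough Python equivalent. The scraped string is an assumption: it supposes the list item text is a 10-character "Followers " label followed by the count, and the number itself is invented.

```python
def excel_right(text, num_chars):
    """Mimic Excel's RIGHT(): keep the last num_chars characters."""
    # Excel raises an error for negative counts; this sketch simply
    # returns an empty string in that case.
    if num_chars <= 0:
        return ""
    return text[-num_chars:]

# Assumed scraped text: a 10-character "Followers " label, then the count.
scraped = "Followers 4,210"
follower_count = excel_right(scraped, len(scraped) - 10)
print(follower_count)  # 4,210
```

If the label on the page has a different length, the 10 in the formula needs to change to match.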

Google+ Follower Count

Like with the Twitter follower formula, you’ll need to replace H67 with the cell that holds the full Google+ profile URL of the user you want this data for.

=XPathOnUrl(H67,"//span[@class='BOfSxb']")

List of ‘Contributor to’ URLs

I don’t think I need to tell you the value of pulling in a list of websites that someone contributes content to. If you do want to know then check out this post that I wrote.

This formula is a little more complex than the rest. This is because I’m pulling in a list of URLs as opposed to just one entity. This requires me to use the StringJoin function to separate all of the outputs with a comma (or whatever character you’d like).

Also, you may notice that there is an additional section to the XPath query, “href”. This pulls in the link within the specific code block instead of the text.

As you’ll see in the full Inbound.org scraper template that I’ve made, this is how I pull in the Twitter, Google+, Facebook and LinkedIn profile links.

You’ll want to replace A2 with the Google+ profile URL of the person you wish to gather data on.

=StringJoin(", ",XPathOnUrl(A2,"//a[@rel='contributor-to nofollow']","href"))
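
The same two ideas (reading an attribute instead of the text, then joining several matches into one string) can be sketched in Python with the standard library. The markup and URLs below are placeholders, not real Google+ output, and `joined_attribute` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for a profile's "Contributor to" links.
SAMPLE_LINKS = """
<div>
  <a rel="contributor-to nofollow" href="http://example.com/blog">Example Blog</a>
  <a rel="contributor-to nofollow" href="http://example.org/news">Example News</a>
  <a rel="me" href="http://example.net/about">About me</a>
</div>
"""

def joined_attribute(page_source, xpath, attribute, separator=", "):
    """Collect one attribute from every match and join the values."""
    root = ET.fromstring(page_source.strip())
    values = [element.get(attribute) for element in root.findall(xpath)]
    # Skip elements that lack the attribute altogether.
    return separator.join(value for value in values if value)

contributor_urls = joined_attribute(
    SAMPLE_LINKS, ".//a[@rel='contributor-to nofollow']", "href"
)
print(contributor_urls)
```

Only the two links whose rel value matches exactly are returned; the third link is ignored.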

Twitter Profile Image URL
If you want to get a large version of someone’s Twitter profile image then I’ve got just the thing for you.
Again, you’ll just need to substitute A2 with their Twitter profile URL.
=XPathOnUrl(A2,"//*[@class='profile-picture media-thumbnail js-tooltip']","data-resolved-url-large")


Some Findings from the Data I’ve Gathered

All big data sets come with some interesting findings. Here’s a few little things that I’ve found from the top 100 influential users on Inbound.org.

average followers chart

The chart above maps out the average number of followers that the top 100 users have on both Twitter (12,959) and Google+ (9,601). As well as this, it shows the average number of users that they follow on Twitter (1,363).

The next thing that I’ve looked at is the job titles of the top 100 users. You can see the most common occurrences of terms within the tag cloud below:

Job titles

Finally, I had a look through all of the domains listed within each of the top 100 Inbound.org users’ Google+ ‘contributor to’ sections and mapped out the most frequently mentioned sites.

Here’s the spread of domains that were the most popular to be contributed to:

domain frequency
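
A domain count like this can be produced from the comma-separated ‘contributor to’ cells in a few lines of Python. This is only a sketch of one way to do it, not necessarily how the chart above was made; the sample cells are invented, and it assumes the URLs themselves contain no commas.

```python
from collections import Counter
from urllib.parse import urlparse

def domain_counts(url_cells):
    """Count how often each domain appears across comma-separated URL cells."""
    counts = Counter()
    for cell in url_cells:
        for url in cell.split(","):
            host = urlparse(url.strip()).netloc.lower()
            # Treat www.example.com and example.com as the same site.
            if host.startswith("www."):
                host = host[4:]
            if host:
                counts[host] += 1
    return counts

# Invented sample cells, one per user.
sample_cells = [
    "http://www.example.com/author/a, http://example.org/blog",
    "http://example.com/author/b",
]
top_domains = domain_counts(sample_cells).most_common(2)
print(top_domains)  # [('example.com', 2), ('example.org', 1)]
```
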
It Doesn’t Stop There

As you’ve probably gathered, this can be scaled out across pretty much any community/forum/directory/website online.

With this kind of intelligence in your armoury, you’ll know far more about your targets and can increase the effectiveness of your outreach campaigns dramatically.

Also, as promised, you can download my full Inbound.org scraper template below:

http://www.matthewbarby.com/goodies/MatthewBarby-Inbound-Scraper.xlsx

TL;DR

    Online communities hold valuable data on your target audiences – use it!
    Scale out your intelligence gathering by brushing up on your XPath.
    Download my Inbound.org scraper template and let it work its magic.

Source: http://www.matthewbarby.com/scraping-communities-with-xpath/

Sunday, 23 November 2014

4 Data Mining Tips to Scrap Real Estate Data; Innovative Way to Give Realty Business a boost!

The Internet has become a huge source of data. In fact, it has turned into a goldmine for marketers, who can easily dig out useful data!

    Web scraping has become the norm in today’s competitive era, where whoever has the most relevant information wins the race!

Real Estate Data Extraction and Scraping Service

It has helped many industries to carve a niche in the market, especially real estate. Scraping real estate data has helped professionals reach a large number of people and gather reliable property data. However, for some people web scraping is still an alien concept, most probably because its advantages are rarely discussed.

    Institutions, companies, organizations, entrepreneurs and ordinary citizens generate an extraordinary amount of information every day. Property information extraction can be used to get an idea of the customer psyche and even generate valuable leads to further the business.

In addition to this, data mining has the following uses, which make it an indispensable part of marketing.

Gather Property Details from Different Geographical Locations

Say you are an estate agent and want to expand your business to a neighboring city or state, but you are short of information. You know the properties in your own town and its vicinity well; data mining services will help you get an idea of the properties in the other state. You can also approach probable clients and grow your database to offer more extensive services.

Online Offers and Discounts are just a Click Away

It is tough to deal with clients, show them the property of their choice and then act as a mediator between buyer and seller. In all this, it becomes difficult to keep track of special discounts or offers. With data mining services, you can get an insight into these offers. Thus, you can plan a move or even provide your client with an amazing deal.

What people are talking about – Easy Monitoring of your Online Reputation

The Internet has become a melting pot where different people come together. In fact, it provides a huge platform where people discuss their likes and dislikes. When you dig into such online forums, you can get an idea of the reputation that you or your firm holds. You can learn what people think about you, where you need to improve and where you need to slow down.

A Chance to Know your Competitors Better!

Last, but not least, you can keep an eye on the competition. Real estate is getting more competitive; therefore, it is important to know your competitors to get the upper hand. It will help you to plan your moves and strategize with more ease. Moreover, you will also know what that “something” is that you have and your competitor does not, which can be subtly highlighted.

Property information extraction can prove to be the most fruitful method to get a cutting edge in the industry.

Source: http://www.hitechbposervices.com/blog/4-data-mining-tips-to-scrap-real-estate-data-innovative-way-to-give-realty-business-a-boost/

Monday, 17 November 2014

Kimono Is A Smarter Web Scraper That Lets You “API-ify” The Web, No Code Required

A new Y Combinator-backed startup called Kimono wants to make it easier to access data from the unstructured web with a point-and-click tool that can extract information from webpages that don’t have an API available. And for non-developers, Kimono plans to eventually allow anyone to track data without needing to understand APIs at all.

This sort of smarter “web scraper” idea has been tried before, and has always struggled to find more than a niche audience. Previous attempts with similar services like Dapper or Needlebase, for example, folded. Yahoo Pipes still chugs along, but it’s fair to say that the service has long since ceased to be a priority for its parent company.

But Kimono’s founders believe that the issue at hand is largely timing.

“Companies more and more are realizing there’s a lot of value in opening up some of their data sets via APIs to allow developers to build these ecosystems of interesting apps and visualizations that people will share and drive up awareness of the company,” says Kimono co-founder Pratap Ranade. (He also delves into this subject deeper in a Forbes piece here). But often, companies don’t know how to begin in terms of what data to open up, or how. Kimono could inform them.

Plus, adds Ranade, Kimono is materially different from earlier efforts like Dapper or Needlebase, because it’s outputting to APIs and is starting off by focusing on the developer user base, with an expansion to non-technical users planned for the future. (Meanwhile, older competitors were often the other way around).

The company itself is only a month old, and was built by former Columbia grad school companions Ranade and Ryan Rowe. Both left grad school to work elsewhere, with Rowe off to Frog Design and Ranade at McKinsey. But over the half-dozen or so years they continued their career paths separately, the two stayed in touch and worked on various small projects together.

One of those was Airpapa.com, a website that told you which movies were showing on your flights. This ended up giving them the idea for Kimono, as it turned out. To get the data they needed for the site, they had to scrape data from several publicly available websites.

“The whole process of cleaning that [data] up, extracting it on a schedule…it was kind of a painful process,” explains Rowe. “We spent most of our time doing that, and very little time building the website itself,” he says. At the same time, while Rowe was at Frog, he realized that the company had a lot of non-technical designers who needed access to data to make interesting design decisions, but who weren’t equipped to go out and get the data for themselves.

With Kimono, the end goal is to simplify data extraction so that anyone can manage it. After signing up, you install a bookmarklet in your browser, which, when clicked, puts the website into a special state that allows you to point to the items you want to track. For example, if you were trying to track movie times, you might click on the movie titles and showtimes. Then Kimono’s learning algorithm will build a data model involving the items you’ve selected.


That data can be tracked in real time and extracted in a variety of ways, including to Excel as a .CSV file, to RSS in the form of email alerts, or for developers as a RESTful API that returns JSON. Kimono also offers “Kimonoblocks,” which lets you drop the data as an embed on a webpage, and it offers a simple mobile app builder, which lets you turn the data into a mobile web application.
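
To illustrate the step from a JSON API response to a CSV file that Excel can open, here is a small Python sketch. The payload shape, the `results` key and the field names are invented for illustration and are not Kimono’s actual response format.

```python
import csv
import io
import json

# Invented payload for illustration; not Kimono's actual response format.
SAMPLE_RESPONSE = """
{"results": [
  {"title": "Movie A", "showtime": "18:30"},
  {"title": "Movie B", "showtime": "21:00"}
]}
"""

def json_rows_to_csv(payload, fields):
    """Convert a JSON document with a 'results' list into CSV text."""
    rows = json.loads(payload)["results"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing fields become empty cells rather than raising an error.
        writer.writerow({name: row.get(name, "") for name in fields})
    return buffer.getvalue()

csv_text = json_rows_to_csv(SAMPLE_RESPONSE, ["title", "showtime"])
print(csv_text)
```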


For developer users, the company is currently working on an API editor, which would allow you to combine multiple APIs into one.

So far, the team says, they’ve been “very pleasantly surprised” by the number of sign-ups, which have reached ten thousand. And even though only a month old, they’ve seen active users in the thousands.

Initially, they’ve found traction with hardware hackers who have done fun things like making an airhorn blow every time someone funds their Kickstarter campaign, for instance, as well as with those who have used Kimono for visualization purposes, or monitoring the exchange rates of various cryptocurrencies like Bitcoin and dogecoin. Others still are monitoring data that’s later spit back out as a Twitter bot.

Kimono APIs are now making over 100,000 calls every week, and usage is growing by over 50 percent per week. The company also put out an unofficial “Sochi Olympics API” to showcase what the platform can do.

The current business model is freemium based, with pricing that kicks in for higher-frequency usage at scale.

The Mountain View-based company is a team of just the two founders for now, and has initial investment from YC, YC VC and SV Angel.

Source:http://techcrunch.com/2014/02/18/kimono-is-a-smarter-web-scraper-that-lets-you-api-ify-the-web-no-code-required/

Thursday, 13 November 2014

Future of Web Scraping

The Internet is large, complex and ever-evolving. Nearly 90% of all the data in the world has been generated over the last two years. In this vast ocean of data, how does one get to the relevant piece of information? This is where web scraping takes over.

Web scrapers attach themselves, like a leech, to this beast and ride the waves by extracting information from websites at will. Granted, “scraping” doesn’t have a lot of positive connotations, yet it happens to be the only way to access data or content from a web site without RSS or an open API.


Web scraping faces testing times ahead. We outline why there may be some serious challenges to its future.

With the rise in data, redundancies in web scraping are rising too. Web scraping is no longer the domain of coders alone; in fact, companies now offer customized scraping tools that clients can use to get the data they want. The outcome of everyone being equipped to crawl, scrape and extract is an unnecessary waste of precious manpower. Collaborative scraping could well heal this hurt: one web crawler does the broad scraping, while the others pull the data off an API. An extension of the problem is that text retrieval attracts more attention than multimedia, and as websites become more complex, this limits scraping capacity.

Easily the biggest challenge to web scraping technology is privacy. With data freely available (most of it voluntary, much of it involuntary), the call for stricter legislation rings loudest. Unintended users can easily target a company and take advantage of the business using web scraping. The disdain with which “do not scrape” policies are treated and terms of usage violated tells us that even legal restrictions are not enough. This raises an age-old question: is scraping legal?
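
One common machine-readable form of a “do not scrape” policy is a site’s robots.txt file. As a minimal sketch, Python’s standard `urllib.robotparser` module can check a URL against such rules before a scraper requests it. The rules, the user agent name and the URLs below are made up, and robots.txt is only one signal; it does not replace a site’s terms of usage.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real site publishes its own at /robots.txt.
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /members/
"""

def is_allowed(robots_text, user_agent, url):
    """Check whether robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

profile_ok = is_allowed(SAMPLE_ROBOTS, "my-scraper", "http://example.com/members/jane")
blog_ok = is_allowed(SAMPLE_ROBOTS, "my-scraper", "http://example.com/blog/")
print(profile_ok, blog_ok)  # False True
```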


The flipside to this argument is that if technological barriers replace legal clauses, then web scraping will see a steady, and sure, decline. This is a distinct possibility since the only way scraping activity thrives is on the grid, and if the very means are taken away and programs no longer have access to website information, then web scraping by itself will be wiped out.

Building the Future

On the same thought is the growing trend of accepting “open data”. The open data policy, while long mused over, hasn’t been used at the scale it should be. The old belief was that closed data gives an edge over competitors. But that mindset is changing. Increasingly, websites are beginning to offer APIs and embrace open data. But what’s the advantage of doing so?

Selling APIs not only brings in the money, but is also useful in driving traffic back to the sites! APIs are also a more controlled, cleaner way of turning sites into services. Steadily, many successful sites like Twitter, LinkedIn etc. are offering access to their APIs as paid services and actively blocking scrapers and bots.

Yet, beyond these obvious challenges, there’s a glimmer of hope for web scraping. And this is based on a singular factor: the growing need for data!

With Internet and web technology spreading, massive amounts of data will be accessible on the web, particularly with the increased adoption of mobile internet. According to one report, by 2020, the number of mobile internet users will hit 3.8 billion, or around half of the world’s population!

Since ‘big data’ can be both structured and unstructured, web scraping tools will only get sharper and more incisive. There is fierce competition between those who provide web scraping solutions. With the rise of open source languages like Python, R & Ruby, customized scraping tools will only flourish, bringing in a new wave of data collection and aggregation methods.

Source: https://www.promptcloud.com/blog/Future-of-Web-Scraping

Wednesday, 12 November 2014

3 Reasons to Up Your Web Scraping Game

If you aren’t using a machine-learning-driven intelligent Web scraping solution yet, here are three reasons why you might want to abandon that entry-level Web-scraping software or cut your high-cost script-writing approach.

    You need to keep an eye on a large number of web sources that get updated frequently.

    Understanding what’s changed is at least as critical as the data itself.

    You don’t want maintenance and scheduling to drag you down.

Here’s what an intelligent Web-scraping solution can deliver – and why:

1. Better data monitoring of an ever-shifting Web

If you need to keep a watch over hundreds, thousands or even tens of thousands of sites, an intelligent Web scraper is a must, because:

    It can scale – easily adding new websites, coordinating extraction routines, and automating the normalization of data across different websites.

    It can navigate and extract data from websites efficiently. Script-based approaches can typically only view a Web page in isolation, making it difficult to optimize navigation across unique pages of a targeted site. More intelligent approaches can be trained to bypass unnecessary links and leave a lighter footprint on the sites you need to access. And they can monitor millions of precise Web data points quickly. This means you can monitor more pages on more sites with more frequent updates.

2. Critical alerts to Web data changes

A key sales executive suddenly drops off the management page of your main competitor. That can mean a big shakeup in the entire organization, which your sales team can jump on.

An intelligent Web scraper can alert you to this personnel shift because it can be set to monitor for just the changes; less powerful technologies or script-based approaches can’t. Whether you’re tracking price shifts, people moves, or product changes (or more) intelligent Web scraping delivers more profound insights.
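
The core of change monitoring is comparing two snapshots of the same page. The following Python sketch shows that comparison in its simplest form; it is an illustration of the idea, not how any particular product implements it, and the names are invented.

```python
def diff_snapshots(previous, current):
    """Return (removed, added) entries between two scrapes of the same page."""
    before, after = set(previous), set(current)
    return sorted(before - after), sorted(after - before)

# Hypothetical names scraped from a management page on two different days.
last_week = ["A. Rivera", "B. Chen", "C. Okafor"]
today = ["A. Rivera", "C. Okafor", "D. Novak"]

removed, added = diff_snapshots(last_week, today)
print(removed, added)  # ['B. Chen'] ['D. Novak']
```

An alert would then fire only when either list is non-empty, so unchanged pages generate no noise.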

3. Maintenance may become your biggest nightmare

You’ve purchased an entry-level tool and built out scrapers for a few hundred sites.  At first, everything seems fine. But, within weeks you begin to notice that your data is incomplete and not being updated as you’d expected. Why did your data deliveries disappear?

The reality is that these low-cost tools are simply not designed for mission-critical business applications. On the surface they look helpful and easy to use, but underneath they are script-based and highly dependent upon the HTML of a website. But websites change, and entry-level web scraping tools are simply not engineered to adapt to those changes.

And, most of these tools are simply not designed for enterprise use. They have limited reporting, if any, so the only way to know whether they’re successfully completing their tasks is by finding gaps in the data – often when it’s too late.

An intelligent web scraping approach doesn’t rely upon the HTML of a web page. It uses machine learning algorithms which view the web the same way a user might. A typical reader doesn’t get confused when a font or color is changed on a website, and neither do these algorithms. But simple approaches to web scraping are highly dependent on the specific HTML to help them understand the content of a page. So, when websites have design changes (on average once every 18 months), the software fails.

While entry-level web scraping software can be an easy solution for simple, one-time web scraping projects, the scripts they generate are fragile and the resources required for tracking and maintenance can become overwhelming when you need to regularly extract data from multiple sites.
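
The fragility described above is easy to reproduce. In this Python sketch, a selector written for one layout silently returns nothing after a hypothetical redesign, and a small check reports the gap instead of letting it go unnoticed. Both HTML snippets and the selector are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical product page before and after a redesign.
OLD_LAYOUT = '<html><body><div id="price"><span>19.99</span></div></body></html>'
NEW_LAYOUT = '<html><body><div id="product-price"><b>19.99</b></div></body></html>'

def scrape_field(page_source, xpath):
    """Return the matched element's text, or None when the selector misses."""
    match = ET.fromstring(page_source).find(xpath)
    return match.text if match is not None else None

def missing_fields(page_source, selectors):
    """List the field names whose selectors no longer match anything."""
    return [name for name, xpath in selectors.items()
            if scrape_field(page_source, xpath) is None]

SELECTORS = {"price": ".//div[@id='price']/span"}
print(missing_fields(OLD_LAYOUT, SELECTORS))  # []
print(missing_fields(NEW_LAYOUT, SELECTORS))  # ['price']
```

Running a check like this after every scrape at least turns a silent data gap into a visible failure.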

Case in point: Shopzilla assimilates data five times faster than outsourced Web scrapers

To demonstrate the power of intelligent Web scraping, here’s a real-life example from Shopzilla.  Shopzilla manages a premier portfolio of online shopping brands in the United States and Europe, connecting more than 40 million shoppers each month with millions of products from retailers worldwide. With the explosive growth of retail data on the Web, Shopzilla’s outsourced, custom-built approach, based on scripting, could not add the product lines of new retailers to its site in a timely fashion. It was taking up to two weeks to write the scripts needed to make a single site accessible.

By deploying Connotate’s intelligent web scraping platform on site, Shopzilla gained the ability to harness Web data’s rapid growth and keep up to date. Today, new sources are added in days, not weeks.  The platform continually monitors Web content from thousands of sites, delivering high volumes of data every day in a structured format. The result: 500 percent more data from new retailers. An added bonus: the company has reduced IT maintenance costs and its dependence on outsourced development timetables.

Case in point: Deep competitor intelligence in two languages

A global manufacturer needed to monitor competitors’ technology improvements in a field where market leadership hinges on an ability to quickly leverage these advances. That meant accessing scholarly journals and niche sites in multiple languages. Using the Connotate solution, it was able to access highly-targeted, keyword-driven university and industry research journals and blogs in German and English that are hard to reach because they do not support RSS feeds. Our solution also incorporated semantic analysis to tag and categorize data and help identify new technologies and products not currently in the keyword list. The firm enhanced its competitive edge with the up-to-the-minute, precise data it needed.

Is your Web scraping intelligent enough?

See what intelligent agents through an automated Web data extraction and monitoring solution can bring to your business. Contact us and speak with one of our experts.

Source:http://www.connotate.com/3-reasons-web-scraping-game-6579#.VGMjH2f4EuQ

Friday, 7 November 2014

Web Scraping the Solution to Data Harvesting

The internet is the number one information provider in the world and, of course, the largest. Web scraping is meant to extract and harvest useful information from the internet. It can be regarded as a multidisciplinary process that involves statistics, databases, data harvesting and data retrieval.

The web has expanded rapidly, causing an enormous growth of information. This has made it increasingly difficult to extract useful information. Web scraping confronts this problem by harvesting explicit information from a number of websites for knowledge discovery and easy access. It is important to realize that the query interfaces of web databases tend to share the same building blocks, and that the web offers unprecedented challenges and opportunities for data harvesting.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-solution-data-harvesting/