Web Scraping Tools & Tips

Intro to Web Scraping by Dan Nguyen

Meet Your Web Inspector by Dan Nguyen

How to Import Web Data into Google Docs by Amit Agarwal

Amit Agarwal holds an Engineering degree in Computer Science from I.I.T. and has previously worked at ADP Inc. for clients like Goldman Sachs and Merrill Lynch. In 2004, Amit quit his job to become India’s first and only Professional Blogger. Amit authors the hugely popular and award-winning Digital Inspiration blog where he writes how-to guides around consumer software and mobile apps. He has developed several popular web apps and Google Add-ons including Mail Merge for Gmail.  You can read his interview on Lifehacker and YourStory.


Samantha Sunne:


Samantha Sunne and Matt Wynne:

Getting data from the web: From the quick grab to the intricate scrape

Samantha Sunne and Matt Wynn IRE 2015 Philadelphia “Scraping” is a term for catching, collecting or coaxing data off the web. If, for example, you have a list of campaign donors, but you want it in a spreadsheet, you’re going to want to scrape it.

WEB TRICKS AND SECRETS These techniques may or may not qualify as “scraping,” but they do let you to gather data using just a web browser in a way you wouldn’t usually use it.

● Use a web inspector or page source to take a look at the data or sources that your web page is accessing. Try Samantha Sunne’s tutorial or Dan Nguyen’s guide to the web inspector.
● Some page source code or database results come in a format called JSON. It’s not very pretty to look at, but you can: ○ Convert it to a csv (a type of spreadsheet) http://sunlightfoundation.com/blog/2014/03/11/making­json­as­simple­as­a­spre adsheet/ ○ make it look nicer: http://codebeautify.org/jsonviewer
● Obtain metadata (data that describes other data, such as a timestamp on a photo) on things like websites, photos and webpages POINT­AND­CLICK TOOLS These are some scraping tools that may make your life easier. Or they might make your life harder. Try to figure out which it is, early on, so you don’t waste too much time. Extremely simple tools like Google Scraper are either going to work or they’re just not.
● Scraper ○ A very simple Google Chrome extension that scrapes text and tables that you select. Almost equivalent to copy­and­paste. ● DownThemAll ○ A Firefox add­on that downloads all the links, images or text you select on a webpage in one fell swoop.
● Kimono ○ A website and browser extension that uses the same point­and­click technology to create APIs (really, just scrapers that will run automatically whenever you want them to).
● import.io ○ A free app that will scrape web pages and sites manually or automatically. Has a slightly larger learning curve.
● Outwit hub ○ A similar app that identifies all the assets (text, images, etc.) in a web page or site and enables you to download them all.
● Samantha’s other recommendations: https://delicious.com/samanthasunne/scraping
● Scott Klein and Michelle Minkoff’s recommendations: datagrab
● If you’re looking for something specific, try searching Github. Lots of people have made scrapers they just put up for public use. For example, “scrape linkedin”

Ok, so there are lots of ways you can scrape data off the web without ever having to crack open a terminal. Why would you ever want to level up?
● Services come and go, but your code will stay the same.
● The web may evolve to leave some services in the dirt. But learn to code, and you’ll have a toolbox that can always get the job done (even if you do need to pick up new tools from time to time).
● Programming in general isn’t going to get less useful. Scraping is a fantastic way to back into learning a skill that will make you better at what you do. SO…. HOW? Any language can scrape from websites. PHP, Perl and Ruby, to name a few, all have tons of supported libraries and great documentation. I’m a Python guy, though, so we’ll stick to some useful libraries in that language.
● BeautifulSoup. Beautiful Soup understands websites and other formats (XML, specifically) for what they are. Rather than seeing the source of a page as a big blob of text, BeautifulSoup understands that tags represent different types of elements, that styles and classes are used to represent different levels of information, and so on.
● Mechanize. Mechanize can automate much of the interaction with a website we take for granted as we’re clicking around in our daily lives. It can fill out forms, cli ck links, navigate to urls and the like.
● Requests. Requests is a simple, modern library that basically accesses sites.
● Selenium. More and more often, sites are working outside of HTML, which really throws a monkeywrench into everything. Selenium is a library that literally cracks open a browser and cranks through commands. The learning curve is a little steeper, but it’s a good trick to know about when the time comes.