Quickstart to PHP Screen Scraping

With the increased use of RSS, screen scraping is becoming less necessary. However, there are still plenty of times when it can get you out of a bind, at least temporarily. I think it's easiest to do screen scraping with PHP and the curl extension (part of a typical PHP configuration).

Screen Scraping Solutions:

Worst solution:
I've seen so many bad methods for screen scraping. The worst I've seen is taking a sub-string, running if statements on it, and then following up with more nested sub-strings and if statements. This is one of those points in programming where you have to tell yourself there's an easier way, because it just doesn't feel right.

A little better but still awful:
Another method I've seen is using PHP's explode() followed by list() to get somewhat structured data and then sorting through that. Both this method and the previous one make for a terrible time when the page you're scraping changes its layout.

Almost good:
The latest method I've seen that I still consider bad is using the PHP DOM methods to parse the HTML. DOM is meant for reading and writing XML. This could be a decent solution, except that probably 99% of the pages on the internet are invalid HTML, so you never know what DOM will give back. This solution still breaks when the page changes, but as long as DOM is giving back something useful, you at least have a structured format to look through.

Best:
Regular expressions are what I consider the best solution. They don't give you structured HTML, but people don't usually write correctly structured HTML anyway. Usually in a line or two you can rip all the data you need out of the page.

Regular expressions were probably an obvious choice for those who know about them. For those who don't, here's a very nice Regex Cheat Sheet (don't read it yet; use it later on in the example). I'll try to explain the regular expression I used below, and you can look up on the cheat sheet why it works.
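
To show what "a line or two" can look like, here is a minimal sketch. The HTML fragment and the pattern are made up for illustration and have nothing to do with the lottery page used later.

```php
<?php
// Hypothetical HTML fragment standing in for a fetched page.
$html = '<ul><li><a href="/a.html">First</a></li>'
      . '<li><a href="/b.html">Second</a></li></ul>';

// Group 1 captures the href value, group 2 the link text.
// The ? after * makes each group non-greedy, so it stops at the
// first closing delimiter instead of the last one on the page.
$regex = '/<a href="(.*?)">(.*?)<\/a>/si';

preg_match_all($regex, $html, $matches);

// $matches[1] holds every href, $matches[2] every link text.
foreach ($matches[1] as $k => $href) {
    echo "{$matches[2][$k]} -> {$href}\n";
}
```

Everything the page-specific part needs to know lives in the one $regex string, which is the main appeal of this approach.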

Case Study: Michigan Lottery Website
Below is the PHP I used to screen scrape the Michigan Lottery website for my article Michigan Keno Stats Over 16 Years.

* note: I broke $regex and $variables into multiple lines to make them fit on this site; each can be written as a single line

<?php
$url = 'http://www.michigan.gov/lottery/0,1607,7-110-28916---LP,00.html';
$regex = '/\<\/form\>\<font face=\'arial, helvetica, sans-serif\' ';
$regex .= 'color=\'black\' size=\'-1\'\>(.*?)\<\/font\>\<\/td>';
$regex .= '.*?(\d{2})&nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&';
$regex .= 'nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&';
$regex .= 'nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&';
$regex .= 'nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&';
$regex .= 'nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&';
$regex .= 'nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&';
$regex .= 'nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&nbsp;&nbsp; (\d{2})&';
$regex .= 'nbsp;&nbsp; (\d{2})/msi';

// One request per year, 1990 through 2005.
for ($year = 1990; $year < 2006; $year++) {
	// POST fields for the archive form: Jan 1 to Dec 31 of $year, game K_7.
	$variables = "fpInternalProcName=::LOTTERY::lotteryArchiveUI&fpSelectMonthBegin=1&";
	$variables .= "fpSelectDayBegin=1&fpSelectYearBegin={$year}&fpSelectMonthEnd=12&";
	$variables .= "fpSelectDayEnd=31&fpSelectGame=K_7&fpSubmit=\"Lookup Drawing Results\"&";
	$variables .= "fpSelectYearEnd={$year}";
	$result = fgc($url, $variables);
	preg_match_all($regex, $result, $matches);
	// $matches[1] holds the dates, $matches[2] through $matches[23] the 22 numbers.
	// Print one drawing per line.
	for ($j = 0; $j < count($matches[1]); $j++) {
		for ($i = 1; $i < 24; $i++) {
			echo "{$matches[$i][$j]} ";
		}
		echo "\n";
	}
}
?>

I used this code along with a function I wrote so that I don't always have to remember the curl options:

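// Fetch $url with curl and return the response body as a string (false on failure).
// If $variables is given, it is sent as the body of a POST request.
function fgc($url, $variables = null) {
	$ch = curl_init();
	if ($variables) {
		curl_setopt($ch, CURLOPT_POST, 1);
		curl_setopt($ch, CURLOPT_POSTFIELDS, $variables);
	}
	curl_setopt($ch, CURLOPT_URL, $url);
	// Present a browser-like user agent.
	curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101");
	// Leave the response headers out of the result.
	curl_setopt($ch, CURLOPT_HEADER, 0);
	// Return the page from curl_exec() instead of printing it.
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	$result = curl_exec($ch);
	curl_close($ch);
	return $result;
}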
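
If you'd rather not build the POST string by hand, PHP's http_build_query() can assemble it from an array. This is only a sketch: the field names and values are copied from the $variables string above, $year is fixed to one example year, and whether the lottery site accepts the URL-encoded form was not tested here.

```php
<?php
$year = 1991; // example year; the loop above supplies this value

// Field names and values copied from the $variables string above.
$fields = [
    'fpInternalProcName' => '::LOTTERY::lotteryArchiveUI',
    'fpSelectMonthBegin' => 1,
    'fpSelectDayBegin'   => 1,
    'fpSelectYearBegin'  => $year,
    'fpSelectMonthEnd'   => 12,
    'fpSelectDayEnd'     => 31,
    'fpSelectGame'       => 'K_7',
    'fpSubmit'           => '"Lookup Drawing Results"',
    'fpSelectYearEnd'    => $year,
];

// http_build_query() joins the pairs with & and URL-encodes each value,
// so the spaces, quotes and colons no longer appear raw in the request body.
$variables = http_build_query($fields);
echo $variables, "\n";
```

The resulting string can be passed as the second argument to fgc() in place of the hand-built one.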

Results:
The results of running this code will be lines that look like these (It's simple to change the output to other formats):

Fri. Dec 20, 1991 01 06 10 13 15 16 19 21 24 27 28 35 36 37 38 43 53 57 61 64 68 69 
Mon. Dec 23, 1991 03 08 15 18 21 29 30 32 37 40 41 43 44 47 48 49 50 51 52 58 73 80 
Tue. Dec 24, 1991 02 03 04 07 13 15 18 20 23 24 30 32 39 42 46 59 66 67 69 71 74 75 
Thu. Dec 26, 1991 01 05 07 11 14 16 28 30 31 33 34 38 39 42 43 46 48 51 56 63 76 79 

The important part is that the useful information has been scraped off the website and put into arrays which you can easily manipulate. If the website design changes, you just have to change the single variable $regex to fix the rest of your code. Nice.
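
As one example of manipulating the arrays, the sketch below tallies how often each number was drawn. To stay short it uses a simplified, hypothetical page layout and only the first three numbers of two of the result rows shown above; the PREG_SET_ORDER flag is a choice made for this sketch, not something the script above uses.

```php
<?php
// Stand-in for a scraped page: a simplified, hypothetical layout holding
// the first three numbers of two of the rows shown above.
$page = "<b>Mon. Dec 23, 1991</b> 03 08 15\n"
      . "<b>Tue. Dec 24, 1991</b> 02 03 04\n";
$regex = '/<b>(.*?)<\/b> (\d{2}) (\d{2}) (\d{2})/';

// PREG_SET_ORDER groups the captures by row instead of by group number:
// $rows[0] is the whole first drawing, $rows[1] the second, and so on.
preg_match_all($regex, $page, $rows, PREG_SET_ORDER);

$counts = [];
foreach ($rows as $row) {
    // $row[0] is the full match, $row[1] the date, the rest are numbers.
    foreach (array_slice($row, 2) as $n) {
        $counts[$n] = ($counts[$n] ?? 0) + 1;
    }
}

// Most frequently drawn numbers first.
arsort($counts);
print_r($counts);
```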

How the Regex Works:
I viewed the source of the Michigan Lottery Keno Results website and noticed the repeating pattern that showed up on every row of the table of results. There are hundreds of ways you could write the regex to pick out the information you need, and some are more robust than others. I picked the fastest way that worked. The first two lines of $regex look for the date that the lottery numbers were drawn. The rest of the lines look for two-digit numbers (that's the (\d{2}) part) with 3 spaces between each one (the &nbsp;&nbsp;  part, which is two non-breaking spaces followed by a regular space).

You'll notice that some of the regex has extra backslashes \. These are called escapes, and they're there because some of the characters I used are reserved characters. I think I overdid the backslashes, but it doesn't matter: a backslash in front of a character that isn't a letter or digit just makes that character literal, so the extra ones are harmless. To match an actual backslash you have to write \\.
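
Since the number part of the pattern is the same piece repeated 22 times, it can also be generated instead of typed out. The sketch below is one way to do that, not the code used for the article; it also drops the optional backslashes in front of < and >, which should not change what the pattern matches.

```php
<?php
// One captured two-digit number, and the separator between numbers
// exactly as it appears in the pattern above.
$number    = '(\d{2})';
$separator = '&nbsp;&nbsp; ';

// 22 numbers per drawing: 22 groups joined by 21 separators.
$numbers = implode($separator, array_fill(0, 22, $number));

// The date part comes first, then anything up to the first number.
$regex = '/<\/form><font face=\'arial, helvetica, sans-serif\' '
       . 'color=\'black\' size=\'-1\'>(.*?)<\/font><\/td>'
       . '.*?' . $numbers . '/msi';

// Sanity check: how many number groups ended up in the pattern.
echo substr_count($regex, $number), "\n";
```

If the lottery ever changed how many numbers it draws, only the 22 would need to change.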

While testing my regular expressions, I like to use the Mac program RegExTest. This program is great and very simple to use. It's shareware but pretty much freeware, because there aren't any windows popping up nagging you to pay the guy or anything. I paid him anyway because it's a great time saver compared to running your PHP each time to test. One thing to note is that PHP says its preg_XXX functions are Perl-compatible regex, but that's not always the case. This program uses the actual Perl regular expression engine, so sometimes what works in it won't work in PHP. Usually the fix is to add some more escapes. If that doesn't work, check out preg_quote().
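
Here is a small sketch of what preg_quote() does. The literal text and the sample sentence are invented for illustration.

```php
<?php
// Literal text that contains regex metacharacters ( . ? = ( ) )
// as well as the / used as the pattern delimiter.
$literal = 'index.php?page=(1)/results';

// preg_quote() adds the escapes for you; the second argument makes it
// escape the delimiter too.
$regex = '/' . preg_quote($literal, '/') . ' (\d{2})/';

$text = 'see index.php?page=(1)/results 42 for details';
preg_match($regex, $text, $m);
echo $m[1], "\n";
```

Only the literal part goes through preg_quote(); the (\d{2}) group is added afterwards so it keeps its special meaning.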


Comments

just pipe the HTML through Tidy and you'll have something you can run the DOM against every time.

easy enough.

-LightningCrash

Can you please email me a copy you have.

I'm trying to extract information from a website much like you are and could use all the help I can get as I have never done it before.

Thanks in advance! I have been searching everywhere for someone who has done what I need to do too.

Unfortunately, I no longer have a copy and I can't find it online. Reggy.app is a similar product but it sucks in comparison.