Book review - Guide to Web Scraping with PHP

by Robert Basic on May 25, 2011.
It took me a while to grab myself a copy of Matthew Turland’s “Guide to Web Scraping with PHP”, but a few weeks ago a copy finally arrived and I had the pleasure of reading it. I planned to buy it as soon as the print copy was announced, but then realised that php|arch accepts only PayPal as the payment method, which doesn’t work from Serbia, so I had to postpone the shopping for better times. Fast forward 5-6 months and I found a copy on the Book Depository, which has no shipping costs! Yay!

My overall impression of the book is that it was worth the time and I’m really glad that I bought it. Matthew did a great job explaining all the tools we have at our disposal for writing web scrapers and how to use them. The chapter on HTTP at the beginning and the chapter with tips and tricks at the end fit in great with the rest of the chapters, which are full of code examples. For a first reading, I’d recommend going cover to cover to get an overall view of all the tools presented, but later the chapters can be read independently.

As I said, the first chapter (actually, the second one, the first one is the introductory chapter :p) deals with HTTP, especially the parts of it that are needed for understanding, using and creating web scrapers.

The book then continues with the different client libraries we can use to send HTTP requests and receive responses. Libraries like cURL or Zend_Http_Client are explained, but it is also explained how you can create your own client using streams (the author does note that you’d be better off with an existing one!). For each of the tools, the book describes how to handle things like authentication, redirects and timeouts, amongst others…
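To give a rough idea of what handling authentication, redirects and timeouts looks like with plain PHP streams, here is a minimal sketch. It is not taken from the book; the helper name `buildScraperContext()` and the concrete values are my own assumptions, and it only uses the standard HTTP context options.

```php
<?php
// Hypothetical helper (not from the book): builds an HTTP stream context
// for a scraper request. The limits below are assumed, tune them as needed.
function buildScraperContext(string $user = '', string $pass = '')
{
    $http = [
        'method'          => 'GET',
        'timeout'         => 15.0, // read timeout, in seconds
        'follow_location' => 1,    // follow Location redirects
        'max_redirects'   => 5,    // cap the length of the redirect chain
    ];

    // HTTP Basic authentication is just an extra request header.
    if ($user !== '') {
        $http['header'] = [
            'Authorization: Basic ' . base64_encode($user . ':' . $pass),
        ];
    }

    return stream_context_create(['http' => $http]);
}
```

The returned context can then be passed as the third argument to `file_get_contents()`, assuming `allow_url_fopen` is enabled. For anything beyond a quick script, an existing client like cURL is probably the saner choice, just as the author says.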

The second part of the book deals with preparing the documents for parsing and with the actual parsing of the data from these documents. Again, different tools are presented, along with an explanation of which one to use, when and why. If none of the parsing tools can help, an overview of the essentials of the PCRE extension is given, too.
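As an illustration of that last-resort PCRE approach, here is a small sketch that pulls links out of a sloppy HTML fragment. The markup and the pattern are made up for this example and are not from the book; a pattern like this only copes with simple cases, which is exactly why the proper parsing tools come first.

```php
<?php
// Hypothetical input: a fragment with inconsistent case and quoting.
$html = '<a href="/books/1">First</a> <A HREF=\'/books/2\'>Second</a>';

// Capture the quote character, the href value and the link text.
// i = case-insensitive, s = let the dot match newlines too.
$pattern = '/<a\s+[^>]*href=(["\'])(.*?)\1[^>]*>(.*?)<\/a>/is';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $match) {
    // $match[1] is the quote character, so the useful groups are 2 and 3.
    $links[] = ['url' => $match[2], 'text' => $match[3]];
}
```

After this runs, `$links` holds one entry per anchor, with the `url` and `text` keys filled from the captured groups.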

The book finishes with a nice “Tips and Tricks” chapter, which discusses real-time vs batch job scrapers, how to work with forms, the importance of unit testing… IMHO, without this last chapter, the book would not be complete.

I’m thinking hard right now about what bad things I could say about this book, but I can’t think of any. It is a guide, clear and straight to the point, explaining what tools are out there and which one to use, and how, for writing scrapers, and that’s exactly what I wanted to know.

Yep, I’d recommend this book to anyone interested in web scraping with PHP :)

New adventures ahead!

by Robert Basic on May 23, 2011.
After a month or two of pondering and thinking and planning and thinking and some more thinking, today I finally told the management at work that I’ll be leaving a month from today. To be precise, I won’t be extending my contract with them, which ends on June 24th.

Why? I don’t like the road the leadership of the company has taken (if this can be called a road at all…), the amount of energy the whole team is wasting on small and silly things, the fact that extra effort is not recognised, thanked or paid, and that there are currently 8 of us in a roughly 25 m² room.

I know, there are bad moments everywhere when one just has to suck it up and deal with it for the sake of the whole team/company, but I’ve been doing that for quite some time and I’ve had enough of it.

The only thing that makes me sad about this decision is that I’ll be leaving @milosija on his own here, a great mentor and friend from whom I’ve learnt a lot.

On the other hand, as of June 25th I’ll be starting my own company, which should be an exciting new experience. Still need to wrap my head around that one, so more on that in some future post…

DORS/CLUC 2011 recap

by Robert Basic on May 18, 2011.
This year’s DORS/CLUC has been and gone, the 18th in a row for the organisers, the first one (but not the last!) for me. For those of you who don’t know, DORS/CLUC is a conference about GNU/Linux and open source, which took place in the nice town of Zagreb, Croatia, in one of the auditoriums of FER (Faculty of Electrical Engineering and Computing). I attended the conference as a speaker with a lightning talk on and a regular talk on Zend Framework.

First of all, I’d like to thank Nikola Plejic from NeutrinoDev for having me crash at his place during these couple of days. I hope that I’ll have the chance to return the favor one day :) BTW, if you need some Drupal/PHP/Python work done, you should talk to Nikola, he’s a kick ass dev ;)

The overall feeling I have about this conference is GODDAMN THIS IS AWESOME I WANT MOAR!!!1! Seriously, great organising, great talks, great speakers, great people, great food, great town, great WiFi. Granted, I haven’t attended a lot of conferences so far, but I honestly doubt it can get a lot better than this.


I really can’t say anything bad about the organisation. Before the conference the organisers were responsive via email and Twitter to any questions/comments I had, and the conference website was updated regularly. During the conference they worked hard to keep everything in order, to reduce the delays and to keep the coffee machine full of hot coffee at all times :)

Talks and speakers

The first day had Mark Shuttleworth, from Canonical and outer space, as the keynote speaker, with a talk on open source and Ubuntu in general. Adnan Hodzic gave one of the best talks, about his experience organising DebConf11 in Banja Luka. Other talks covered web and mobile development, OpenStreetMap and some hardcore topics like LXC virtualization and MySQL fatal crash recovery.

The second day had Jan Wildeboer, Red Hat’s evangelist, as the keynote speaker, with a great talk on open source. Other talks were about some business wishy-washy (I skipped those :P), databases, open source in academia… Personally, I think the first day was better, but the second day was great and interesting too.

Although a bit exhausting, as both days had more than 10 hours of talks, it was really worth it. Now if only I had more hours in a day, so I could play around with and look more into all of the stuff I learned here.

In the end, I’m really glad I came here, and I’ll be here next year too. If you’re into open source (and you better be! :P), I can only recommend that you check out and keep an eye on DORS/CLUC next year.

