I downloaded a dump of the current articles in the English Wikipedia; the dump file is enwiki-20120104-pages-articles.xml. This is "current revisions only, no talk or user pages", and it has no images. I installed a LAMP server with MediaWiki in a virtual machine and began using MWDumper to import the dump into my database. I haven't had any errors, and I can browse Wikipedia on this local server and see the articles that have been imported so far.
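To keep an eye on the import, I check how many rows MWDumper has written into the local database with a small script. This is only a rough sketch of what I mean: it assumes the pymysql package, a database named wikidb with no table prefix, and made-up credentials, so adjust the names for your own setup.

```python
# Rough progress check against the local MediaWiki database while the import runs.
# Assumes pymysql is installed, the wiki database is called "wikidb", and the
# default table names (no prefix); user/password here are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="wikiuser",
                       password="secret", database="wikidb")
try:
    with conn.cursor() as cur:
        # Total rows imported into the core page table so far.
        cur.execute("SELECT COUNT(*) FROM page")
        total_pages = cur.fetchone()[0]

        # Rows that count as articles: main namespace (0), not redirects.
        cur.execute(
            "SELECT COUNT(*) FROM page "
            "WHERE page_namespace = 0 AND page_is_redirect = 0"
        )
        articles = cur.fetchone()[0]

    print("pages imported so far:", total_pages)
    print("main-namespace, non-redirect articles:", articles)
finally:
    conn.close()
```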
My problem is that I thought Wikipedia had about 3.85 million articles, but I've already imported 3.93 million pages into my database with MWDumper. I don't know how many are left, but when I browse my local Wikipedia, there are still a lot of red links. I looked at the talk page for MWDumper and saw that someone else complained that he or she expected 3.8 million pages and MWDumper imported 11 million.
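To see where the extra pages are coming from, I've been thinking about counting what's actually in the dump file itself. Something like the sketch below should work, assuming each <page> element carries an <ns> (namespace number) child and a <redirect/> child when it's a redirect, which I believe dumps from this era do, but I haven't verified against my exact file.

```python
# Sketch: stream the dump with iterparse (so the multi-GB file never sits in
# memory) and tally pages by namespace, plus how many are redirects.
# Assumes <page> elements contain <ns> and, for redirects, <redirect/> children.
import xml.etree.ElementTree as ET
from collections import Counter

DUMP = "enwiki-20120104-pages-articles.xml"

pages_per_namespace = Counter()
redirect_count = 0

context = ET.iterparse(DUMP, events=("start", "end"))
_, root = next(context)                      # the <mediawiki> root element

for event, elem in context:
    if event != "end" or elem.tag.rsplit("}", 1)[-1] != "page":
        continue
    namespace = None
    is_redirect = False
    for child in elem:
        name = child.tag.rsplit("}", 1)[-1]  # strip the xmlns prefix
        if name == "ns":
            namespace = child.text
        elif name == "redirect":
            is_redirect = True
    pages_per_namespace[namespace] += 1
    if is_redirect:
        redirect_count += 1
    root.clear()                             # drop finished pages from memory

print("pages per namespace:", dict(pages_per_namespace))
print("redirect pages:", redirect_count)
```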
I'm getting frustrated by how long this is taking. It's been importing for more than a week already, and I thought it would be done today. I'm wondering why there are so many more pages in the Wikipedia dump than there are articles in the English Wikipedia.
Why does a dump of the Wikipedia database contain more pages than Wikipedia has articles?