Some time ago I wrote an article arguing that it was time to switch to a whitelist approach to the GFW. But the whitelist I used back then has long since gone stale and browsing with it is no longer smooth, so I boasted: I'll write my own crawler, crawl China's domains, and keep the whitelist updated myself.
Well, the crawler did get written and collected more than ten thousand domains, but in the end I found that an existing project already does the job better, so the crawler was abandoned. In short: the whitelist is now more capable than ever — it just no longer comes from this crawler.
The crawler is written in Python. I didn't use any of the classic crawler frameworks — the crawler I wanted to write was too simple to justify one — so I built a small wheel of my own, which was also a chance to learn the language. Enough preamble; here is a record of the whole development process, for whatever reference it's worth.
Design
For our very first crawler, the first requirement is simply being able to fetch pages correctly. That is really all a crawler does: download a page, read its content, extract the links, and then follow those links to download more pages. So the basic goal falls out naturally:
Download pages, collect every link on each page, discard everything else, and keep following links — storing the links collected along the way.
Since I have a VPS inside China, the logic is blunt but simple: a domain the VPS can reach is certainly not walled; one it can't reach is either dead or blocked, and either way it stays off the list :)
(This logic isn't perfect — it runs into plenty of problems later, and I'll patch it up bit by bit as we go; treat this as a record of the thought process. There's always some hindsight bias in a write-up like this, but it should still have some reference value, right?)
How should the collected list of domains be managed? Considering it may need refreshing later — after all, a domain that is fine today might get walled tomorrow — I decided to manage the domains in a database. The setup is minimal: one database, one table, one domain name per row.
Also, if the crawler is stopped, we'd like it to pick up where it left off when restarted, so it has to be able to save its execution state. Links are cheap to collect but pages are slow to download one by one, so we need a caching mechanism for the pending links.
More precisely, a queue. I built a first-in, first-out queue backed by a plain array: when a page is fetched, every link in it is pulled out with a regular expression and appended to the tail of the array.
When the spider wants the next page to crawl, it takes a link from the head of the array, deleting it as it goes. If the array grows too large (say, beyond ten thousand entries), newly found links are simply not added; a small sketch of the idea follows below.
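Just to make the design concrete, here is a minimal sketch of that bounded FIFO queue. It is illustrative only — the class name and method names are made up for this example, and the real crawler just uses a plain Python list as shown later:

import collections

class LinkQueue:
    def __init__(self, limit=10000):
        # Cap the queue so it never grows past "too large".
        self.limit = limit
        self.items = collections.deque()

    def push(self, links):
        # Drop new links once the queue is full.
        if len(self.items) >= self.limit:
            return
        self.items.extend(links)

    def pop(self):
        # Take from the head, removing the item as we go.
        return self.items.popleft() if self.items else None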
At the same time, successfully crawling a page means its domain is reachable, so the domain joins the whitelist — after first checking whether it is already there, of course; if it is, its popularity counter goes up by 1.
As you can see, one database entry per domain also gives us a simple domain popularity ranking for free. The whitelist might grow to tens of thousands of entries one day, but in practice nowhere near that many are needed, so we can choose to export only, say, the top ten thousand.
With the basic logic in place, the next question is what classes to design — we are, after all, going to be object-oriented about this.
There is no problem object orientation cannot solve; if there is, instantiate another class.
First there is Spider, our cute little spider. It is responsible for everything crawling-related: fetching the next page to visit, starting the crawl from the seeds, reading links out of pages, and so on.
Second there is IO, dedicated to talking to the database. It wraps all database communication, which is convenient: the crawling code never has to worry about database details.
Oh, and one last thing: I'm using Python 3.
Implementation
With the design roughly sketched out, it's time to implement the concrete classes. Let's start with IO:
IO
This class handles communication with the database. I use the pymysql package for that, and since each entry carries a last-updated field, the datetime package supplies the timestamp written to the database.
I never formally studied database theory, but that doesn't stop me from deploying and using one.
We use a MySQL database here: create a database named whitelist, and inside it a table named WhiteList:
CREATE TABLE WhiteList (
    DomainRank int NOT NULL,
    Domain varchar(255) NOT NULL,
    LastUpdate date NOT NULL,
    PRIMARY KEY (Domain)
)
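The methods below assume a connection object self.conn opened when IO is constructed. Just to keep the examples self-contained, a minimal constructor might look roughly like this — the host, user and password here are placeholders, not the project's real settings:

import pymysql

class IO:
    def __init__(self):
        # Placeholder credentials; adjust to your own MySQL setup.
        self.conn = pymysql.connect(host='localhost',
                                    user='spider',
                                    password='secret',
                                    db='whitelist',
                                    charset='utf8')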
Then we implement the concrete methods:
def __updateDomain(self,domain):
    cur = self.conn.cursor()
    cur.execute('select * from WhiteList where Domain=%s',domain)
    data = cur.fetchall()
    if data:
        rank = data[0][self.__DomainRank]
        cur.execute('update WhiteList set DomainRank=%s,LastUpdate=%s where domain=%s',(rank + 1,datetime.datetime.now().strftime("%Y%m%d"),domain))
        self.conn.commit()
    else:
        cur.execute('insert into WhiteList (Domain,DomainRank,LastUpdate) values (%s,%s,%s)',(domain,1,datetime.datetime.now().strftime("%Y%m%d")))
        self.conn.commit()
    cur.close()
I'm only showing part of the implementation here; the complete code is in the Github repository linked at the end of the article — clone it and read at your leisure.
The method above is the one that updates a domain's entry. First get a cursor, fetch any existing data through it, and update the existing entry; if there is no matching entry, insert a new one instead. Remember to commit at the end, then close the cursor. I also use a few named constants in place of bare column numbers:
__DomainRank = 0
__Domain = 1
__LastUpdate = 2
That makes things much easier to read :) As you can see, the names carry a two-underscore prefix, which is how Python implements "private" variables and methods. Honestly I found this quite a trap — it was the first pit I fell into.
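For anyone who hasn't met this before, here is a quick generic illustration of what the double underscore actually does — plain Python behaviour, not project code:

class Demo:
    __secret = 42              # stored as _Demo__secret (name mangling)

    def reveal(self):
        return self.__secret   # works inside the class

d = Demo()
print(d.reveal())              # 42
print(d._Demo__secret)         # 42 -- "private" only by name mangling
# print(d.__secret)            # AttributeError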
Programmers count from zero, don't they? So pit number zero: self, self, self, self, self… self, everywhere, forever.
def saveDomain(self,domain):
    self.__updateDomain(domain)
Of course, so that code outside the class can add a domain, I expose a public wrapper method :) (one does wonder what the point is).
Spider
From here on, the highlight: our little spider. Since it relies on regular expressions, it uses Python's re package; since pages are decoded with the utf-8 codec, it also uses codecs; and to fetch the pages themselves, urllib3.
def __getSeeds(self):
    return ['www.hao123.com']
Start with the easiest piece: the spider needs a first seed to begin from — how else would it get going? Since I want breadth-first coverage, a link-aggregation page like this is the natural starting point. You could list more seeds here if needed, but I only ever used this one.
def __nextPage(self):
    if len(self.__domainList) == 0:
        self.__domainList = self.__getSeeds()

    url = self.__domainList.pop(0)

    pageContent = self.__getPage(url)
    topDomain = self.__topDomainRex.findall(url)

    self.__io.saveDomain(topDomain[0])

    self.__gatherDomainFromPage(pageContent)
This is the main working method: once the crawler is running, calling this "next page" method in a loop keeps it crawling indefinitely. Following our logic, it first checks the queue; if the queue is empty we must have only just started, so it falls back to the seeds.
After a page is fetched successfully, its domain is added to the database and the links on the page are collected; everything else is released. That's also why this little crawler doesn't need to care about robots.txt — it never collects any page content.
If some of the indentation in the code looks odd, it's because I deleted parts that aren't relevant yet; those parts come up again later.
def __gatherDomainFromPage(self,page):
    if len(self.__domainList) > 10000:
        return
    try:
        m = self.__domainRex.findall(page)
    except:
        #print('Get wrong data! skip it!')
        return
    domainList = []
    if m:
        for domain in m:
            domainList.append(domain[1])
        domainList = self.__deDuplicate(domainList)
        domainList = self.__checkDomainFromList(domainList)
        self.__domainList += domainList
This method collects domains from a page. If the queue already holds more than ten thousand links, nothing more is added — that's plenty. Otherwise, a regular expression pulls the links out of the page. The expression looks like this: self.__domainRex = re.compile(r'http(s)?://([\w-_]+\.[\w.-_]+)[\/\*]*'). One more wrinkle: .cn domains don't need testing at all — .cn is China's own country TLD and is certainly reachable!
def __checkDomainFromList(self,list):
    domainList = []
    for domain in list:
        m = self.__cnDomainRex.findall(domain)
        if m:
            continue
        else:
            domainList.append(domain)
    return domainList
So another regular expression strips the Chinese country domains out of the list. It looks like this: self.__cnDomainRex = re.compile(r'.cn(/)?$')
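To make the idea concrete, here is a tiny throwaway demonstration on made-up input. The domainRex below is a simplified stand-in for the pattern quoted above, and the sample URLs are invented for the example:

import re

domainRex = re.compile(r'http(s)?://([\w.-]+)')   # simplified stand-in
cnDomainRex = re.compile(r'.cn(/)?$')

page = '<a href="https://www.example.cn/">cn</a> <a href="http://blog.example.com/post">com</a>'
candidates = [m[1] for m in domainRex.findall(page)]
print(candidates)                                          # ['www.example.cn', 'blog.example.com']
# Keep only domains that do NOT end in .cn
print([d for d in candidates if not cnDomainRex.findall(d)])   # ['blog.example.com']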
def __getPage(self,url):
    http = urllib3.PoolManager(
        cert_reqs='CERT_REQUIRED',  # Force certificate check.
        ca_certs=certifi.where(),   # Path to the Certifi bundle.
    )
    data = ''
    try:
        data = http.request('GET', url,
                            timeout=10,
                            headers={'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}
                            ).data
        codeType = chardet.detect(data)
        data = data.decode(codeType['encoding'])
    except:
        pass
    return data
Now the heavyweight: the method that actually fetches pages. Some pages are served over SSL — that is, HTTPS — and urllib3 doesn't verify certificates out of the box, so an extra package is needed: certifi. The request also carries a disguised User-agent header, so the visit looks more like an ordinary Windows browser.
Later I hit another problem: some pages (jd.com, for example) came back garbled. It turned out Jingdong serves its pages in gbk while I had been decoding everything as utf-8, so I switched to chardet to sniff each page's encoding. A few encodings still get misidentified, but overall those errors are rare enough to ignore.
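If you want to be a bit more defensive than the code above, the same sniff-then-decode idea can be wrapped up like this — just a sketch, not what the repository ships:

import chardet

def decode_page(data: bytes) -> str:
    # Sniff the encoding instead of assuming utf-8; fall back if
    # detection fails (short or binary responses can confuse chardet).
    guess = chardet.detect(data)
    encoding = guess['encoding'] or 'utf-8'
    return data.decode(encoding, errors='replace')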
def __deDuplicate(self,list):
    result = []
    for item in list:
        try:
            result.index(item)
        except:
            result.append(item)
    return result
As said before, the primary goal is to have the spider collect as many different domains as possible — that's the breadth-first part — so the deduplication above strips repeated links from each batch before they enter the queue, which keeps the same pages from being fetched over and over (different pages of a site tend to link to the same places).
Finally, the two methods that write the queue out to a cache file and read it back in:
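Incidentally, the same deduplication can be written without leaning on exceptions; this is just an alternative sketch, not what the repository ships:

def deduplicate(items):
    # Same effect as the method above, but with a set for O(1)
    # membership tests instead of list.index() plus try/except.
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result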
def __cache(self):
    f = codecs.open('./domainlistCache','w','utf-8')
    for domian in self.__domainList:
        f.write(domian + '\n')
    f.close()

def __getLastTimeList(self):
    try:
        f = codecs.open('./domainlistCache', 'r','utf-8')
        for line in f.readlines():
            line = line.strip('\n')
            self.__domainList.append(line)
        f.close()
    except:
        pass
The first of these runs when the spider is stopped and the class is torn down; the second runs during initialisation at start-up. Together they preserve the crawler's crawling state.
At this point the crawler could more or less go live — this was, in fact, my first test version.
Finishing up
The project isn't finished yet, though: everything so far is single-threaded and blocking, and I hadn't even written the main loop — once started, the program would just sit there, completely stuck. So we need to turn it into a background service.
First, let's complete the main loop:
def start(self):
    self.__domainList = self.__getSeeds()
    # self.__cache()

    while True:
        threads = []
        self.__cache()
        for thread in range(20):
            threads.append(threading.Thread(target=self.__nextPage))
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
Twenty threads crawl concurrently at once. There is a catch, though: Python has a global interpreter lock (Google "GIL" for the details — the page I referenced back then has since disappeared), which "guarantees" that Python can never achieve true parallelism, so running "concurrent" Python on a multi-core CPU never puts more than one core to real work. (For an I/O-bound crawler the threads still pay off, since the lock is released while waiting on the network, but it is not real parallelism.)
That was the second Python pit I fell into.
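For what it's worth, the same "twenty workers per round" pattern can also be written with concurrent.futures, which does the start/join bookkeeping for you. This is a drop-in sketch of start() under that assumption, not the code the project actually uses:

from concurrent.futures import ThreadPoolExecutor

class Spider:
    # ...existing attributes and methods as shown above...

    def start(self):
        self.__domainList = self.__getSeeds()
        while True:
            self.__cache()
            # Submit one round of 20 crawl tasks; leaving the "with"
            # block waits for them all, just like the join() loop above.
            with ThreadPoolExecutor(max_workers=20) as pool:
                for _ in range(20):
                    pool.submit(self.__nextPage)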
OK — in short, the crawler still has to become a service, and that is what gets wired up around main:
def daemonize(pidfile, *,
              stdin='/dev/null',
              stdout='/dev/null',
              stderr='/dev/null'):

    if os.path.exists(pidfile):
        raise RuntimeError('Already running')

    # First fork (detaches from parent)
    try:
        if os.fork() > 0:
            raise SystemExit(0)   # Parent exit
    except OSError as e:
        raise RuntimeError('fork #1 failed.')

    # os.chdir('/')
    os.umask(0)
    os.setsid()
    # Second fork (relinquish session leadership)
    try:
        if os.fork() > 0:
            raise SystemExit(0)
    except OSError as e:
        raise RuntimeError('fork #2 failed.')

    # Flush I/O buffers
    sys.stdout.flush()
    sys.stderr.flush()

    # Replace file descriptors for stdin, stdout, and stderr
    with open(stdin, 'rb', 0) as f:
        os.dup2(f.fileno(), sys.stdin.fileno())
    with open(stdout, 'ab', 0) as f:
        os.dup2(f.fileno(), sys.stdout.fileno())
    with open(stderr, 'ab', 0) as f:
        os.dup2(f.fileno(), sys.stderr.fileno())

    # Write the PID file
    with open(pidfile, 'w') as f:
        print(os.getpid(), file=f)

    # Arrange to have the PID file removed on exit/signal
    atexit.register(lambda: os.remove(pidfile))

    # Signal handler for termination (required)
    def sigterm_handler(signo, frame):
        raise SystemExit(1)

    signal.signal(signal.SIGTERM, sigterm_handler)
This function sends a Python program to the background. It works with multiple processes: the parent exits while the double-forked child lives on in its own session, and the child's standard output and error are redirected to log files, so the process stays alive after the terminal goes away :)
Of course, this one function alone isn't enough:
if __name__=='__main__':
    PIDFILE = '/tmp/GFW-White-Domain-List-daemon.pid'

    main()
Here we set the path of the PID file, then run the main function:
def main():
    if len(sys.argv) == 1:
        help()
        return

    if sys.argv[1] == 'start':
        startSpider()
    elif sys.argv[1] == 'status':
        status()
    elif sys.argv[1] == 'stop':
        stop()
    elif sys.argv[1] == 'restart':
        stop()
        startSpider()
    elif sys.argv[1] == 'list':
        outPutList()
    elif sys.argv[1] == 'help':
        help()
    else:
        print('Unknown command {!r}'.format(sys.argv[1]), file=sys.stderr)
To make it easy to operate, I added a few helper functions so the service can be driven with commands like ./main.py start. I won't enumerate each function here; you can download the complete code and read them there.
def startSpider():
    print('WhiteList spider started!', file=sys.stderr)

    try:
        daemonize(PIDFILE,
                  stdout='/tmp/spider-log.log',
                  stderr='/tmp/spider-err.log')
    except RuntimeError as e:
        print(e, file=sys.stderr)
        raise SystemExit(1)

    io = IO.IO()
    spider = Spider.Spider(io)
    spider.start()
Here is the start function — simple, right? It first calls the daemonize function, then instantiates the crawler classes as usual: IO first, then the IO instance is passed into Spider, and finally its start() method is called.
With that, the program is essentially complete. Of course, next I should talk about the problems that had to be patched afterwards.
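The stop and status helpers aren't shown in this article; judging from the PID file that daemonize() writes, they presumably look something like this sketch (the repository's real versions may differ):

import os
import signal
import sys

def stop():
    # Read the PID written by daemonize() and ask the daemon to exit;
    # SIGTERM triggers the sigterm_handler registered above.
    if os.path.exists(PIDFILE):
        with open(PIDFILE) as f:
            os.kill(int(f.read()), signal.SIGTERM)
    else:
        print('Not running', file=sys.stderr)
        raise SystemExit(1)

def status():
    print('Running' if os.path.exists(PIDFILE) else 'Not running')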
Fixes
As the saying goes, once you've shipped, you debug your bugs in production. That's exactly what this was:
Endless loop
The first problem I ran into was domains being re-added far too often. Some pages act like traffic magnets: start from hao123, follow a link to some page, and that page has a friendly link back to hao123 — so the crawler crawls hao123 again, and again, and again. So a "well-known domain" check caps the repetition: if a domain has already been counted more than some threshold (ten thousand at first, later lowered to 2000), it must be a major site and there is no need to visit it yet again.
def __isWellKnown(self,domain):
    topDomain = self.__topDomainRex.findall(domain)
    self.__lock.acquire()
    result = self.__io.getDomainRank(topDomain)
    self.__lock.release()
    return result > 2000
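getDomainRank() on the IO side isn't shown above either; judging from the WhiteList schema it would be roughly this — a sketch, not the repository's exact code:

def getDomainRank(self,domain):
    # Look up the popularity counter for one domain; 0 if unknown.
    cur = self.conn.cursor()
    cur.execute('select DomainRank from WhiteList where Domain=%s', domain)
    data = cur.fetchall()
    cur.close()
    return data[0][0] if data else 0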
Thread Safety
As mentioned above, Python's concurrency is crippled — but being crippled doesn't make it thread-safe, so thread safety still has to be handled ourselves. The approach: wrap every call into the IO class (and every access to the shared queue) in lock acquire/release statements, for example:
def __nextPage(self):
    self.__lock.acquire()
    if len(self.__domainList) == 0:
        self.__domainList = self.__getSeeds()
        # print(self.__domainList)
    url = self.__domainList.pop(0)
    self.__lock.release()

    if self.__pingDomain(url):
        if self.__isWellKnown(url):
            return  # skip this domain if wellknowen
        pageContent = self.__getPage(url)
        topDomain = self.__topDomainRex.findall(url)

        self.__lock.acquire()
        self.__io.saveDomain(topDomain[0])
        self.__lock.release()

        self.__gatherDomainFromPage(pageContent)
The other methods are handled the same way, which guarantees that database accesses never conflict.
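For completeness, the lock those methods acquire is just an ordinary threading.Lock created when the Spider is built. A sketch of the relevant part of the constructor — the attribute names follow the code above, the rest is my assumption:

import threading

class Spider:
    def __init__(self, io):
        # The IO instance is injected (see startSpider() above), and one
        # lock guards both the link queue and the database calls.
        self.__io = io
        self.__domainList = []
        self.__lock = threading.Lock()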
Waiting problems
We said that "cannot be visited" is the criterion for being walled, but that phrase is obviously ambiguous. For a crawler, a timeout is the classic symptom of a walled site — yet sitting through a full timeout every single time takes forever. So we ping first: if the ping fails, the domain is dropped immediately. This uses the subprocess package; we also need the dnspython3 package to resolve the domain name to an IP, plus the shlex package:
def __pingDomain(self,domain):
    a = []
    isAvailable = True
    try:
        a = self.__resolver.query(domain)
    except:
        isAvailable = False

    if len(a) == 0:
        return False
    ip = str(a[0])
    if self.__chinaIP.isChinaIP(ip):
        cmd = "ping -c 1 " + ip
        args = shlex.split(cmd)
        try:
            subprocess.check_call(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        except subprocess.CalledProcessError:
            isAvailable = False
    else:
        isAvailable = False
    return isAvailable
This filters out a lot of slow or timing-out sites directly and speeds the crawl up considerably. Later I realised it could be simpler still: just look at the server's IP — if it isn't a Chinese IP, return false and be done with it; foreign sites are faster over the proxy anyway, aren't they?
So I went online, found a class that does that check, and dropped it in. I won't paste it here — it is copied code, after all.
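For illustration only, a class like the one I copied boils down to something like this — the two IP ranges here are placeholders, while the real class loads the full list of China's IP allocations:

import ipaddress

class ChinaIP:
    def __init__(self):
        # Placeholder ranges; a real implementation loads the complete
        # China IP allocation list (e.g. from APNIC data).
        self.networks = [ipaddress.ip_network(n) for n in
                         ('1.0.1.0/24', '114.114.114.0/24')]

    def isChinaIP(self, ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in self.networks)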
Country-code domains
Because I recognise domain names with regular expressions, country-code domains cause trouble. Plain ones like .cc are fine, but something like net.cc breaks my pattern, and strictly speaking .la is also a country-code domain that the expression I was using simply couldn't catch.
In the end I found the official list of ccTLDs (country code top-level domains) on ICANN's website, massaged it a little, and matched against it directly, so the exported list no longer contains pseudo second-level domains like net.la.
def outPutList():
    # count = 10000
    cctlds = TLDS.getCCTLDS()
    tlds = TLDS.getTLDS()
    io = IO.IO()
    list = io.getList()
    f = codecs.open('./whitelist.txt','w','utf-8')
    print('Output top 10000 domains in whitelist.txt\n', )
    i = 0
    skip = 0
    # its ugly ... but worked!
    # fk country code top-level domain!!!!
    for item in list:
        d = re.findall(r'.\w{2}$',item[1])
        if d:
            if cctlds.__contains__(d[0]):
                t = re.findall(r'^\w+',item[1])
                if tlds.__contains__(t[0]):
                    skip += 1
                    continue
        i += 1
        f.write(item[1]+'\n')
        if i == 10000:
            break
    print('done! got '+str(i)+' domains.\n and skip '+str(skip)+' error domain.\n')
The end
All right, the article is finally drawing to a close... I'm afraid it isn't as detailed as it could be — many of the specific problems I simply can't remember any more, which is what I get for putting off writing this until now.
This is, after all, the first program I have ever written in Python, so please forgive the less-than-elegant implementation. Then again:
At least it runs.
I've tried to write this in the order things were actually developed, so only the main core code is shown; for the details, see the Github repository.
By the way, as I said, the whitelist no longer relies on this crawler for updates; it now uses felixonmars's dnsmasq-china-list instead. That also pushed the whitelist past 20,000 entries, which makes it no longer suitable for use on mobile devices.
Well, that's it for this article.
Original article by R0uter's Blog (LogStudio): Writing a domain-name whitelist crawler in Python.
If you reproduce it, please keep this attribution and the source link: https://www.logcg.com/archives/1697.html