Enter a keyword and crawl all matching pages (51job listing pages; looking up a ranking on Baidu and Google)
优采云 · Published: 2021-10-11 20:36
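To extract information from a page on a website, the key is being able to request that page successfully. Some sites use encryption and similar techniques to block crawlers, which makes a successful request hard to achieve.
The page I crawled is the 51job job listing page. The key problem is how to reach the next page. 51job submits its search form via POST, so every form parameter has to be identified and written into the request.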
Opening the request connection
// Imports this method relies on. `log` is assumed to be a logger field of the enclosing class.
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

private Scanner openConnection(int i, String keyName, String link) {
    try {
        URL url = new URL("http://search.51job.com/jobsearch/keyword_search.php");
        // Form parameters, captured from the search form
        String parameter = "postchannel=0000&stype=2&jobarea=0100&district=&address=&lonlat=&radius=" +
            "&funtype_big=0000&funtype=0000&industrytype=00&issuedate=9&keywordtype=2&dis_keyword=" +
            "&keyword=&workyear=99&providesalary=99&cotype=99&degreefrom=99&jobterm=01&ord_field=0" +
            "&list_type=1&last_list_type=1&curr_page=&last_page=1&nStart=1&start_page=&total_page=86" +
            "&jobid_list=39297991~39298287~39298722~39298729~39297918~39297800~39298262~39297331~39297238~39297080~" +
            "39296848~39297361~39296644~39296315~39287153~39295409~39295407~39295397~39295396~39295391~" +
            "39287385~39293469~39287417~39285861~39281595~39281853~39279955~39281274~39280683~38748545~" +
            "37068616~38130945~39023955~36747022~36493173~39006183~38960955~38960944~38960615~38980334~" +
            "37888484~37584999~38998054~37585073~37332619~36882505~34976909~37307284~37307262~36999896~" +
            "36767409~39242127~7369258~35503114~35502793~35496087~35496083~35495350~35494140~35493224~" +
            "35492320~35487346~35468080~35457510~35457504~35457501~35398467~35380047~35347719~35347637~" +
            "34991677~20974922~20974918~37441300~35465051~39160193~39029414~38138399~39136977~36632495~" +
            "39266845~39270060~39266835~39097249~39082877~37663952~37662532~37662480~37663986~37662626~" +
            "37662589~37662556~37738455~39270625~38433053~38261468~38486743~39057636~34582292~36475553~" +
            "37257361~37257567~37257262~36741386~36711006~36498218~38914431~38734212~38674569~38787188~" +
            "39259469~38927584~39024252~39024230~39228632~35252232~38658258~38658243~38625335~39245388~" +
            "37319651~36852389~39136912~39159440~37456013~39256295~39214509~39253898~37376056~38561452~" +
            "38295890~39156937~26052225~38711016~39272058~39271701~37777885~38524663~39022301~39063658~" +
            "37777523~39018693~37897821~37023954~39242449~39242399~36227979~38635974~39100175~39200749~" +
            "39251242~39197848~39229735~39108206~38520680~38520612~37512047~37373955~36748357~36558807~" +
            "36553946~36994069~35651002~37645149~35650457~37547299~37547226~37547191~37547135~37325202~" +
            "38909563~37981021~36518439~38435329~38356348~39225954~38905834~39100737~38753876~38753837~" +
            "38648131~38909881~38909871~39253871~39139848~37756802~38207471~38715097~38714739~39228968~" +
            "39109760~39109531~39109511~38412880~39193350~38918885~38443045~38133816~35085561~38011368~" +
            "&jobid_count=2551&schTime=15&statCount=364" +
            "&statData=404|114|45|61|92|99|29|34|80|27|15|29|49|449|1|228|133|0|0|1|1|243|494|5|0|0|1|0|7|232|321|139|26|1|0|152|831|1|1|4|18|8|8|4|3|0|0|0|0|0|0|588|0|1|0|0|0|0|1|13|0|0|0|0|0|0|0|1|0|0|0|0|0|0|2|254|6|6|0|1|1|0|0|0|0|0|0|1|0|0|0|0|2|0|1|0|0|0|0|0|0|0|0|0|0|0|365|14|13|0|5|3|18|9|2|0|1|26|6|2|0|0|3|1|2|3|0|9|32|1|0|6|1|0|0|0|13|209|1|0|3|1|7|32|5|37|1|0|3|0|0|13|2|9|10|0|1|0|5|1|1|0|0|2" +
            "&fromType=";
        // Set the page number
        parameter = parameter.replace("curr_page=", "curr_page=" + String.valueOf(i));
        parameter = parameter.replace("fromType=", "fromType=" + String.valueOf(14));
        // Set the search keyword (for example "程序员", programmer).
        // Match on the leading '&' so that replacing "keyword=" does not also hit "dis_keyword=".
        String encodedKey = URLEncoder.encode(keyName, "GBK");
        parameter = parameter.replace("&dis_keyword=", "&dis_keyword=" + encodedKey);
        parameter = parameter.replace("&keyword=", "&keyword=" + encodedKey);

        // Open the connection and set the request headers
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        // Make the request look like it comes from a browser
        conn.setRequestProperty("Host", "search.51job.com");
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        // Declare the length of the POST body
        conn.setRequestProperty("Content-Length", Integer.toString(parameter.getBytes("GBK").length));
        conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB5; .NET CLR 1.1.4322; .NET CLR 2.0.50727; Alexa Toolbar; MAXTHON 2.0)");

        // Write the form body
        OutputStream o = conn.getOutputStream();
        OutputStreamWriter out = new OutputStreamWriter(o, "GBK");
        out.write(parameter);
        out.flush();
        out.close();

        // Get the response byte stream
        InputStream in = conn.getInputStream();
        // Wrap it in a Scanner for line-by-line parsing
        Scanner sc = new Scanner(in, "GBK");
        return sc;
    } catch (Exception e) {
        log.error(e, e);
        return null;
    }
}
This gets you the listing information for the keyword on the first page.
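Patching a long parameter string with `String.replace` is fragile, since a name such as `keyword=` is also the tail of `dis_keyword=`. One alternative, sketched below, is to keep the fields in an ordered map and encode them in a single pass. `FormBodySketch` and `encode` are hypothetical names, and the handful of fields shown is only illustrative, not the full set the 51job form expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class FormBodySketch {

    // Encode each name/value pair and join the pairs with '&'.
    // The map's iteration order decides the order of fields in the body.
    static String encode(Map<String, String> fields, String charset) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(field.getKey(), charset))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), charset));
            }
        } catch (UnsupportedEncodingException e) {
            // Surface an unknown charset name as an unchecked error.
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // Illustrative subset of fields (assumption: the real form needs many more).
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("keywordtype", "2");
        fields.put("dis_keyword", "java");
        fields.put("keyword", "java");
        fields.put("curr_page", String.valueOf(2));
        fields.put("fromType", String.valueOf(14));

        System.out.println(encode(fields, "GBK"));
    }
}
```

Because each field is addressed by its exact name, changing `keyword` can never touch `dis_keyword`, and moving to another page only requires updating the `curr_page` entry before encoding again.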
Once this step is done, you can parse out the information you are looking for, such as company details, job postings, and so on.
<p>while (sc.hasNextLine()) {<br /> String line = sc.nextLine();<br /> sp = line.indexOf("class=\"jobname\" >", sp + 1);<br /> if (sp != -1) {<br /> sp = line.indexOf("