parsing - Why does Nutch think it has already parsed all segments when it hasn't? -


I am using Nutch 1.6 to crawl a few forums and index them with Solr 1.6.2. I ran a test query on Solr and was surprised that there were only a few results. I was worried that there was a problem either with Nutch's parsing of the pages or with Solr's indexing. After snooping around, I discovered that Nutch has parsed far fewer pages than it has fetched:

  bin/nutch readseg -list -dir crawl-mothering2/segments/
  NAME            GENERATED  FETCHED  PARSED
  20130228001531  23         27       9
  20130228003940  1430       1434     661
  20130228001829  202        206      105
  20130228061337  1068       1090     475
  20130228091009  1          2
  20130228085956  34         34       25
  20130228090348  44         45       34
  20130228090851  7          7        6
  20130228080438  364        374      192
  20130228030933  1774       1795     903
  20130228084205  168        169      63

But when I try to parse the segments, I get this:

  bin/nutch parse crawl-mothering2/segments/*
  ParseSegment: starting at 2013-03-21 00:20:43
  ParseSegment: segment: crawl-mothering2/segments/20130228001531
  Exception in thread "main" java.io.IOException: Segment already parsed!
      at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:89)
      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1212)
      at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
      at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:209)
      at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:243)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:216)

What gives?

You cannot re-parse a segment that has already been parsed. To work around this, you have to delete some folders inside the segment first. Please check the mailing list discussion.
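As a rough sketch of what "deleting some folders" could look like: the function below assumes the segment lives on the local filesystem and that the parse output sits in the `crawl_parse`, `parse_data` and `parse_text` subdirectories, which is the usual Nutch 1.x segment layout but is not spelled out in this thread. The name `clear_parse_output` is made up for illustration. It leaves the fetched data (`content`, `crawl_fetch` and so on) alone. Back up the segment before trying anything like this.

```shell
#!/bin/sh
# Sketch only: clear the parse output of one segment so it can be parsed again.
# Assumption: parse output is stored in crawl_parse, parse_data and parse_text
# (usual Nutch 1.x layout) and the segment is on the local filesystem.
# clear_parse_output is a hypothetical helper name, not part of Nutch.
clear_parse_output() {
    segment="$1"
    if [ ! -d "$segment" ]; then
        echo "not a segment directory: $segment" >&2
        return 1
    fi
    for sub in crawl_parse parse_data parse_text; do
        # ${segment:?} aborts if the variable is empty, so we never touch "/$sub"
        rm -rf "${segment:?}/$sub"
    done
}

# Hypothetical usage (not executed here):
#   for seg in crawl-mothering2/segments/*; do
#       clear_parse_output "$seg" && bin/nutch parse "$seg"
#   done
```

Looping over the segments one at a time also avoids relying on how `bin/nutch parse` treats several paths expanded from a `*` glob. If the crawl directory is on HDFS rather than local disk, the `rm -rf` would have to be replaced with the matching Hadoop filesystem command.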

You will get a quick response to .
