DigitalPebble/storm-crawler: A scalable, mature and versatile web crawler based on Apache Storm

Stars: 769 | Watchers: 71 | Forks: 240 | Issues: 42

storm-crawler Recent Issues

Issue Title | State | Comments | Created Date | Updated Date | Closed Date
Use URLFrontier in archetype | closed | 0 | 2022-11-23 | 2022-11-28 | 2022-11-25
JSoupParserBolt improve performance of link extraction | closed | 0 | 2022-10-11 | 2022-11-28 | 2022-10-11
Blocking fetcher thread | open | 3 | 2022-09-02 | 2022-11-28 | -
Delete redirected pages | open | 1 | 2022-08-08 | 2022-11-28 | -
ES IndexerBold - Fix behaviour of afterBulk | open | 6 | 2022-07-16 | 2022-11-28 | -
ConcurrentModificationException thrown by metrics in Fetcher executor | open | 0 | 2022-07-15 | 2022-11-28 | -
Update xsoup from 0.3.2 to 0.3.4 | closed | 0 | 2022-07-07 | 2022-11-28 | 2022-07-07
Fix starvation and busy waiting of ES StatusUpdaterBolt | closed | 2 | 2022-07-07 | 2022-11-28 | 2022-07-12
[URLFrontier] URLFrontier extension not returning ID preventing Status-ACK making crawling impossible | closed | 6 | 2022-07-01 | 2022-11-28 | 2022-07-06
Number of StatusUpdaterBolt instances can be a multiple of the frontier nodes | closed | 0 | 2022-06-27 | 2022-11-28 | 2022-06-27
Dependency and maven plugin upgrades | closed | 0 | 2022-06-20 | 2022-11-28 | 2022-06-20
urlfrontier spout does not log the number of URLs obtained from a call to getURLs | closed | 0 | 2022-06-09 | 2022-11-28 | 2022-06-09
Use URLFrontier API 2.1 and support crawlIDs | closed | 1 | 2022-05-20 | 2022-11-28 | 2022-05-20
HttpProtocol doesn't consider http.content.limit in test for filesize | closed | 11 | 2022-05-18 | 2022-12-07 | 2022-06-09
ES look at / measure the new random sampler aggregation | open | 0 | 2022-05-16 | 2022-12-08 | -
Update jsoup from 1.14.3 to 1.15.1 | closed | 5 | 2022-05-16 | 2022-11-28 | 2022-05-16
Can configure multiple addresses for URLFrontier nodes | closed | 0 | 2022-05-03 | 2022-11-28 | 2022-05-03
Update to Apache Tika 2.4.0 | closed | 0 | 2022-05-03 | 2022-12-09 | 2022-05-05
FetcherBolt - customize fetcher delay based on metadata property | closed | 2 | 2022-05-02 | 2022-12-09 | 2022-06-27
JSoupParserBolt - toOutlinks method as protected | closed | 2 | 2022-05-02 | 2022-12-09 | 2022-05-03
Allow compatibility.mode for rest client to connect to ES8+ | closed | 0 | 2022-04-06 | 2022-12-07 | 2022-04-06
Upgrade to Storm 2.4.0 | closed | 0 | 2022-03-28 | 2022-11-28 | 2022-03-28
Setting "maxDepth": 0 in urlfilter.json prevents ES seed injection | closed | 1 | 2022-03-25 | 2022-11-28 | 2022-03-28
Enable _source for content index in ES archetype | closed | 0 | 2022-03-14 | 2022-11-28 | 2022-03-21
Investigate doc-value-only fields | open | 0 | 2022-03-11 | 2022-11-28 | -
Spout does not reconnect to URLFrontier if an exception occurs | closed | 0 | 2022-03-03 | 2022-11-28 | 2022-03-03
Issue with the order of emit and emitOutlink for redirections in FetcherBolt | closed | 5 | 2022-02-21 | 2022-11-28 | 2022-02-24
Upgrade Caffeine to 2.9.3 | closed | 0 | 2022-02-10 | 2022-11-28 | 2022-02-10
Bump Elasticsearch HLRC to 7.17.0 | closed | 0 | 2022-02-10 | 2022-12-07 | 2022-02-10
Newer Elasticsearch Version deprecate the REST High Level Client in favour of the Java API Client | open | 8 | 2022-02-09 | 2022-11-28 | -
Convert LinkParseFilter into a JSoupFilter | closed | 0 | 2022-02-04 | 2022-11-28 | 2022-02-04
ES client to use compression (configurable) | closed | 0 | 2022-01-26 | 2022-11-28 | 2022-01-26
Upgrade ES client to 7.15.2 | closed | 0 | 2022-01-26 | 2022-11-28 | 2022-01-26
Investigate the PointInTime API in Elaticsearch | closed | 1 | 2022-01-12 | 2022-11-28 | 2022-11-25
Log4J vulnerability - CVE-2021-44228 | closed | 7 | 2021-12-13 | 2022-11-28 | 2021-12-21
Remove selenium.instances.num | closed | 0 | 2021-11-22 | 2022-12-08 | 2021-11-22
[Elasticsearch] Fielddata is disabled on text fields by default. | closed | 4 | 2021-11-19 | 2022-11-28 | 2021-11-19
Follow refresh redirects | closed | 4 | 2021-11-17 | 2022-11-28 | 2021-11-19
Kyro Class is not registered: com.digitalpebble.stormcrawler.persistence.Status | closed | 2 | 2021-11-16 | 2022-11-28 | 2021-11-17
JSoupParserBolt cannot configure more than one JSoupFilters per worker | closed | 0 | 2021-11-16 | 2022-11-28 | 2021-11-17
Need to register Status class with Kryo | closed | 1 | 2021-11-09 | 2022-11-28 | 2021-11-17
http.content.limit has effect on robots.txt | closed | 3 | 2021-11-04 | 2022-11-28 | 2021-11-09
Performance degradation using version 2.1.0 | closed | 9 | 2021-10-15 | 2022-11-28 | 2021-10-22
Dependencies upgrades | closed | 1 | 2021-10-14 | 2022-11-28 | 2021-10-15
Use GH Actions to set up Maven CI | closed | 0 | 2021-10-12 | 2022-11-28 | 2021-10-12
Update Storm to 2.3.0 | closed | 0 | 2021-10-06 | 2022-12-10 | 2021-10-06
Issue with ConcurrentModificationException for Metadata in StatusMetricsBolt | closed | 5 | 2021-09-16 | 2022-11-28 | 2021-09-17
AggregationSpout does not release IsInQuery boolean sometimes | closed | 6 | 2021-09-09 | 2022-12-08 | 2022-06-06
Replace Guava caches with Caffeine ones | closed | 0 | 2021-08-09 | 2022-12-03 | 2021-08-12
Bug: StackOverFlow issue in CharsetIdentification | closed | 3 | 2021-07-14 | 2022-12-08 | 2021-07-15