{"_id":"extractor","_rev":"85-290be0bf1f787bdd22a6ec67a0215cba","name":"extractor","description":"A small utility library for retrieving and scraping web content. It targets scraping content with a unique attribute id, class or tag.","dist-tags":{"latest":"0.1.2"},"versions":{"0.1.0":{"name":"extractor","version":"0.1.0","description":"A small utility library for retrieving and scraping web content. It targets scraping content with a unique attribute id, class or tag.","main":"./extractor.js","repository":{"type":"git","url":"git://github.com/rsdoiel/extractor-js.git"},"author":{"name":"R. S. Doiel","email":"rsdoiel@gmail.com","url":"https://github.com/rsdoiel"},"maintainers":[{"name":"rsdoiel","email":"rsdoiel@gmail.com"}],"engines":{"node":">= 0.6","npm":">= 1"},"dependencies":{"jsdom":">= 0.2.10"},"scripts":{"test":"node extractor_test.js"},"bugs":{"url":"http://github.com/rsdoiel/extractor-js/issues"},"_npmUser":{"name":"rsdoiel","email":"rsdoiel@gmail.com"},"_id":"extractor@0.1.0","contributors":[{"name":"R. S. Doiel","email":"rsdoiel@gmail.com"}],"devDependencies":{},"optionalDependencies":{},"_engineSupported":true,"_npmVersion":"1.1.21","_nodeVersion":"v0.6.17","_defaultsLoaded":true,"dist":{"shasum":"9c06a64301dfb51f9168caca15c13cc8e2995f64","tarball":"https://registry.npmjs.org/extractor/-/extractor-0.1.0.tgz","integrity":"sha512-+SsfGJDVcmhvgqMbdT6QJs47pvww5GpFHwkVTRZiIOgftJT0Kfc/47g/QitiRiATVWv7wp8mnvUvETYDudCMtA==","signatures":[{"keyid":"SHA256:jl3bwswu80PjjokCgh0o2w5c2U4LhQAE57gj9cz1kzA","sig":"MEUCICHbEO2LTUhlJRap9iR27WKSnng2sY7o0tLPNzXyQiQAAiEAxmsvqStf+DZIIig3p4rWWGEz8BVl61mX+8MfzOtsGlY="}]},"readme":"extractor-js\n============\nrevision 0.1.0\n--------------\n\n# Overview\n\nPeriodically I wind up writing scripts to screen scrape or spider sites.  Node  is really nice for this. 
extractor-js is a utility module for these types of scripts.\n\n## Three methods\n\nextractor has three methods -\n\n* `FetchPage` - get a page from the local file system or via http/https\n* `Scrape` - a combination of FetchPage(), Cleaner(), and Transformer() which fetches the content, parses it via jsDOM/jQuery, extracts the parts matched by the jQuery selectors passed to it, builds an object with the same structure as the selector object, and passes that object to Scrape()'s callback. You can override the Cleaner()\nand Transformer() steps by passing in your own cleaner and transformer functions.\n* `Spider` - a utility implementation of Scrape using a fixed selector map for anchors, links, scripts and image tags.\n\n# Example (Scrape)\n\n```javascript\nvar extractor = require('extractor'),\n    pages = ['http://nodejs.org', 'http://golang.org'],\n    selector = {\n        'page_title': 'title',\n        'titles': 'h2',\n        'js': 'script[type=text/javascript]',\n        'css': 'link[type=text/css]',\n        'about_as_id': '#about',\n        'about_as_class': '.about'\n    };\n\npages.forEach(function (page) {\n    extractor.scrape(page, selector, function (err, data, env) {\n        if (err) throw err;\n\n        console.log(\"Processed \" + env.pathname);\n        console.log(\"Last modified \" + env.modified);\n        console.log(\"Page record: \" + JSON.stringify(data));\n    });\n});\n```\n\nThis example script would process the two pages in the pages array and log each processed page along with a JSON representation of the content described by the selectors.\n\n# Example (Spider)\n\nIn this example we spider the homepage of the NodeJS website and list the links found on the page.\n\n```javascript\nvar extractor = require('extractor');\n\nextractor.spider('http://nodejs.org', function (err, data, env) {\n    var i;\n    if (err) {\n        console.error(\"ERROR: \" + err);\n        return;\n    }\n    console.log(\"from -> \" + env.pathname);\n    console.log(\"data -> \" + 
JSON.stringify(data));\n    for (i = 0; i < data.links.length; i += 1) {\n        console.log(\"Link \" + i + \": \" + data.links[i].href);\n    }\n});\n```\n\n","directories":{}},"0.1.2":{"name":"extractor","version":"0.1.2","description":"A small utility library for retrieving and scraping web content. It targets scraping content with a unique attribute id, class or tag.","main":"./extractor.js","repository":{"type":"git","url":"https://github.com/rsdoiel/extractor-js.git"},"author":{"name":"R. S. Doiel","email":"rsdoiel@gmail.com","url":"https://github.com/rsdoiel"},"maintainers":[{"name":"rsdoiel","email":"rsdoiel@gmail.com"}],"engines":{"node":">= 0.7.12","npm":">= 1"},"dependencies":{"jsdom":">= 0.2.10"},"scripts":{"test":"node extractor_test.js"},"bugs":{"url":"http://github.com/rsdoiel/extractor-js/issues"},"contributors":[{"name":"R. S. Doiel","email":"rsdoiel@gmail.com"}],"readme":"[![build status](https://secure.travis-ci.org/rsdoiel/extractor-js.png)](http://travis-ci.org/rsdoiel/extractor-js)\nextractor-js\n============\n\n# Overview\n\nPeriodically I wind up writing scripts to screen scrape or spider sites. Node is really nice for this. extractor-js is a utility module for these types of scripts.\n\n## Three methods\n\nextractor has three methods -\n\n* `fetchPage` - get a page from the local file system or via http/https\n* `scrape` - a combination of FetchPage(), Cleaner(), and Transformer() which fetches the content, parses it via jsDOM/jQuery, extracts the parts matched by the jQuery selectors passed to it, builds an object with the same structure as the selector object, and passes that object to Scrape()'s callback. 
You can override the Cleaner()\nand Transformer() steps by passing in your own cleaner and transformer functions.\n* `spider` - a utility implementation of Scrape using a fixed selector map for anchors, links, scripts and image tags.\n\n# Example (Scrape)\n\n```javascript\nvar extractor = require('extractor'),\n    pages = ['http://nodejs.org', 'http://golang.org'],\n    selector = {\n        'page_title': 'title',\n        'titles': 'h2',\n        'js': 'script[type=text/javascript]',\n        'css': 'link[type=text/css]',\n        'about_as_id': '#about',\n        'about_as_class': '.about'\n    };\n\npages.forEach(function (page) {\n    extractor.scrape(page, selector, function (err, data, env) {\n        if (err) throw err;\n\n        console.log(\"Processed \" + env.pathname);\n        console.log(\"Last modified \" + env.modified);\n        console.log(\"Page record: \" + JSON.stringify(data));\n    });\n});\n```\n\nThis example script would process the two pages in the pages array and log each processed page along with a JSON representation of the content described by the selectors.\n\n# Example (Spider)\n\nIn this example we spider the homepage of the NodeJS website and list the links found on the page.\n\n```javascript\nvar extractor = require('extractor');\n\nextractor.spider('http://nodejs.org', function (err, data, env) {\n    var i;\n    if (err) {\n        console.error(\"ERROR: \" + err);\n        return;\n    }\n    console.log(\"from -> \" + env.pathname);\n    console.log(\"data -> \" + JSON.stringify(data));\n    for (i = 0; i < data.links.length; i += 1) {\n        console.log(\"Link \" + i + \": \" + data.links[i].href);\n    
}\n});\n```\n\n","_id":"extractor@0.1.2","dist":{"shasum":"261a5ef3d0ee57356d664cc75e166e1963fd2064","tarball":"https://registry.npmjs.org/extractor/-/extractor-0.1.2.tgz","integrity":"sha512-OmTX5kfT41tPGqLXaqxdZSmSANhZiVapPCczE6lwMB4Gv3QA5CexBZ2rq8Yag4YaPFJntmg492H192fA12NPiQ==","signatures":[{"keyid":"SHA256:jl3bwswu80PjjokCgh0o2w5c2U4LhQAE57gj9cz1kzA","sig":"MEUCIF8pfBklpfmzNEt+PpjAbdSdfPVEBWIRUJzrFF5XwXBGAiEA4z43xYNao/mw3GEFSNAA+IlAMMf/VAYsy2Yg9blNl94="}]},"_npmVersion":"1.1.59","_npmUser":{"name":"rsdoiel","email":"rsdoiel@gmail.com"}}},"maintainers":[{"name":"rsdoiel","email":"rsdoiel@gmail.com"}],"time":{"modified":"2022-06-17T23:00:32.636Z","created":"2011-08-21T19:05:22.642Z","0.0.3":"2011-08-21T19:05:24.268Z","0.0.4":"2011-08-25T05:24:38.959Z","0.0.5":"2011-11-02T17:54:22.594Z","0.0.6":"2011-11-23T00:17:30.320Z","0.0.6b":"2011-12-14T00:26:18.315Z","0.0.7":"2011-12-14T02:14:03.317Z","0.0.7b":"2011-12-14T03:15:11.876Z","0.0.7c":"2011-12-14T22:34:07.786Z","0.0.7d":"2012-01-05T17:48:18.619Z","0.0.7e":"2012-01-05T19:44:06.981Z","0.0.7f":"2012-01-05T23:40:01.931Z","0.0.7g":"2012-01-06T21:55:01.333Z","0.0.8":"2012-01-12T23:55:28.015Z","0.0.9":"2012-02-18T02:56:07.843Z","0.0.9b":"2012-02-18T03:59:56.886Z","0.0.9c":"2012-03-20T00:07:22.208Z","0.0.9d":"2012-03-22T17:27:48.958Z","0.1.0":"2012-05-18T05:25:23.805Z","0.1.1":"2012-06-20T17:20:33.842Z","0.1.2":"2012-09-11T22:56:56.921Z"},"author":{"name":"R. S. Doiel","email":"rsdoiel@gmail.com","url":"https://github.com/rsdoiel"},"repository":{"type":"git","url":"https://github.com/rsdoiel/extractor-js.git"},"users":{"srcloop":true}}