Can Javascript read the source of any web page?

I am working on screen scraping and want to retrieve the source code of a particular page.

How can I achieve this with JavaScript? Please help me.

--------------Solutions-------------

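A simple way to start is jQuery's .load(), which fetches a page and inserts the fragment matching the selector after the URL into the target element: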

$("#links").load("/Main_Page #jq-p-Getting-Started li");

More at jQuery Docs

A much more structured way to do screen scraping is to use YQL (Yahoo Query Language). It returns the scraped data structured as JSON or XML.
For example, let's scrape stackoverflow.com:

select * from html where url="http://stackoverflow.com"

will give you JSON (the output format I chose) like this:

"results": {
"body": {
"noscript": [
{
"div": {
"id": "noscript-padding"
}
},
{
"div": {
"id": "noscript-warning",
"p": "Stack Overflow works best with JavaScript enabled"
}
}
],
"div": [
{
"id": "notify-container"
},
{
"div": [
{
"id": "header",
"div": [
{
"id": "hlogo",
"a": {
"href": "/",
"img": {
"alt": "logo homepage",
"height": "70",
"src": "http://i.stackoverflow.com/Content/Img/stackoverflow-logo-250.png",
"width": "250"
}
……..

The beauty of this is that you can use projections and where clauses, so you get the scraped data back structured and limited to what you need (which ultimately means much less bandwidth over the wire).
For example,

select * from html where url="http://stackoverflow.com" and
xpath='//div/h3/a'

will get you

"results": {
"a": [
{
"href": "/questions/414690/iphone-simulator-port-for-windows-closed",
"title": "Duplicate: Is any Windows simulator available to test iPhone application? as a hobbyist who cannot afford a mac, i set up a toolchain kit locally on cygwin to compile objecti … ",
"content": "iphone\n simulator port for windows [closed]"
},
{
"href": "/questions/680867/how-to-redirect-the-web-page-in-flex-application",
"title": "I have a button control ....i need another web page to be redirected while clicking that button .... how to do that ? Thanks ",
"content": "How\n to redirect the web page in flex application ?"
},
…..

Now, to get only the questions, we run

select title from html where url="http://stackoverflow.com" and
xpath='//div/h3/a'

Note the title in the projection:

"results": {
"a": [
{
"title": "I don't want the function to be entered simultaneously by multiple threads, neither do I want it to be entered again when it has not returned yet. Is there any approach to achieve … "
},
{
"title": "I'm certain I'm doing something really obviously stupid, but I've been trying to figure it out for a few hours now and nothing is jumping out at me. I'm using a ModelForm so I can … "
},
{
"title": "when i am going through my project in IE only its showing errors A runtime error has occurred Do you wish to debug? Line 768 Error:Expected')' Is this is regarding any script er … "
},
{
"title": "I have a java batch file consisting of 4 execution steps written for analyzing any Java application. In one of the steps, I'm adding few libs in classpath that are needed for my co … "
},
{
……
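The queries above differ only in their projection and their xpath filter. A minimal sketch of composing them as strings follows. The helper name buildYqlQuery is hypothetical (it is not part of YQL or jQuery), and it does no escaping of quotes in its inputs, so it is only suitable for trusted values.

```javascript
// Hypothetical helper: composes a YQL "html" table query like the ones above.
// Assumption: pageUrl and xpath contain no quote characters (no escaping is done).
function buildYqlQuery(pageUrl, options) {
  var opts = options || {};
  // Projection: the listed columns, or "*" when none are given.
  var columns = opts.columns && opts.columns.length ? opts.columns.join(", ") : "*";
  var query = "select " + columns + ' from html where url="' + pageUrl + '"';
  // Optional where clause restricting the result to nodes matching an XPath.
  if (opts.xpath) {
    query += " and xpath='" + opts.xpath + "'";
  }
  return query;
}

console.log(buildYqlQuery("http://stackoverflow.com", { columns: ["title"], xpath: "//div/h3/a" }));
// prints: select title from html where url="http://stackoverflow.com" and xpath='//div/h3/a'
```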

Once you write your query, it generates a URL for you:

http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fdiv%2Fh3%2Fa'%0A%20%20%20%20&format=json&callback=cbfunc

in our case.
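If you would rather build that URL yourself, the sketch below shows one way to do it. The endpoint and the format and callback parameters are copied from the URL quoted above; the helper name buildYqlUrl is hypothetical, and the service itself may no longer be available.

```javascript
// Endpoint taken from the URL quoted above (may no longer be online).
var YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

// Hypothetical helper: URL-encodes a YQL query and appends the JSON format flag.
// callbackName is optional; when given, it is sent as the JSONP callback name.
function buildYqlUrl(query, callbackName) {
  var url = YQL_ENDPOINT + "?q=" + encodeURIComponent(query) + "&format=json";
  if (callbackName) {
    url += "&callback=" + encodeURIComponent(callbackName);
  }
  return url;
}

console.log(buildYqlUrl('select title from html where url="http://stackoverflow.com"', "cbfunc"));
```

encodeURIComponent leaves apostrophes unescaped, which matches the xpath='...' part of the generated URL above.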

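So ultimately you end up doing something like this (with callback=cbfunc in the URL changed to callback=? so that jQuery handles the JSONP callback itself):

// $.getJSON is asynchronous and returns a request object, so read the data in the callback.
$.getJSON(theAboveUrl, function (data) { var titleList = data; });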

and play with it.
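As one way to "play with it", here is a small sketch that pulls the title strings out of a results object shaped like the excerpts above. The helper name extractTitles is hypothetical, and the handling of a single non-array match is a defensive assumption rather than documented behaviour.

```javascript
// Hypothetical helper: returns the "title" of every link in a results object
// shaped like the excerpts above, i.e. { a: [ { title: "..." }, ... ] }.
function extractTitles(results) {
  var links = results && results.a;
  if (!links) {
    return [];
  }
  // Assumption: a single match might arrive as one object instead of an array.
  if (!Array.isArray(links)) {
    links = [links];
  }
  return links.map(function (link) {
    return link.title;
  });
}

var sample = { a: [{ title: "First question" }, { title: "Second question" }] };
console.log(extractTitles(sample)); // prints [ 'First question', 'Second question' ]
```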

Beautiful, isn’t it?

JavaScript can be used, as long as you grab whatever page you're after via a proxy on your own domain:

<html>
<head>
<script src="/js/jquery-1.3.2.js"></script>
</head>
<body>
<script>
// The proxy URL needs a scheme; without one it is treated as a relative path.
$.get("http://www.mydomain.com/?url=www.google.com", function(response) {
  alert(response);
});
</script>
</body>
</html>

You could simply use XMLHttpRequest (AJAX) to request the required URL; the HTML response will be available in the responseText property. If the URL is not on the same domain, the same-origin policy applies: modern browsers block the response unless the target server allows it with CORS headers, while older versions of Internet Explorer showed users a prompt saying something like "This page is trying to access a different domain. Do you want to allow this?"

If you absolutely need to use javascript, you could load the page source with an ajax request.

Note that with JavaScript, you can only retrieve pages that are on the same domain as the requesting page.
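Strictly speaking, browsers compare origins (scheme, host, and port) rather than just domain names. A minimal sketch of that comparison, using the standard URL class; the helper name isSameOrigin is hypothetical and only illustrates the rule, it is not how the browser enforces it:

```javascript
// Hypothetical helper: true when two absolute URLs share scheme, host, and port.
// Uses the standard URL class available in modern browsers and in Node.js.
function isSameOrigin(urlA, urlB) {
  return new URL(urlA).origin === new URL(urlB).origin;
}

console.log(isSameOrigin("http://example.com/a.html", "http://example.com/b.html")); // true
console.log(isSameOrigin("http://example.com/", "https://example.com/"));            // false (scheme differs)
console.log(isSameOrigin("http://example.com/", "http://www.example.com/"));         // false (host differs)
```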

As a security measure, Javascript can't read files from different domains. Though there might be some strange workaround for it, I'd consider a different language for this task.

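Using jQuery:

<html>
<head>
<script src="http://jqueryjs.googlecode.com/files/jquery-1.3.2.js" ></script>
</head>
<body>
<script>
// The URL needs a scheme, and the request succeeds only for a same-origin
// page or a server that allows cross-origin requests.
$.get("http://www.google.com", function(response) { alert(response); });
</script>
</body>
</html>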

You can create an XMLHttpRequest, request the page, and then read its responseText property to get the content.
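A minimal sketch of that approach, assuming a browser environment and a URL the page is allowed to read (same origin, or a server that permits cross-origin requests). The function name fetchPageSource and its callback convention are illustrative choices, not part of any library.

```javascript
// Hypothetical helper: requests a page and passes its source to a callback.
// callback(error, text) receives either an Error or the response body.
function fetchPageSource(url, callback) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", url, true); // true = asynchronous
  xhr.onload = function () {
    if (xhr.status >= 200 && xhr.status < 300) {
      callback(null, xhr.responseText); // the page source as a string
    } else {
      callback(new Error("Request failed with status " + xhr.status));
    }
  };
  xhr.onerror = function () {
    // Also fires when the browser blocks a cross-origin response.
    callback(new Error("Network error or blocked cross-origin request"));
  };
  xhr.send();
}

// Usage in a page (the path is a placeholder):
// fetchPageSource("/some/page.html", function (err, source) {
//   if (err) { console.error(err); } else { console.log(source); }
// });
```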

You can use the FileReader API to get a file; when selecting the file, enter the URL of your web page in the file selection box. Use this code:

function readFile() {
  var f = document.getElementById("yourfileinput").files[0];
  if (f) {
    var r = new FileReader();
    r.onload = function(e) {
      alert(r.result);
    };
    r.readAsText(f);
  } else {
    alert("file could not be found");
  }
}

You can bypass the same-origin policy by either creating a browser extension or saving the file as an .hta (HTML Application) in Windows.

Despite many comments to the contrary I believe that it is possible to overcome the same origin requirement with simple JavaScript.

I am not claiming that the following is original because I believe I saw something similar elsewhere a while ago.

I have only tested this with Safari on a Mac.

The following demonstration fetches the page named in the base tag and moves its innerHTML to a new window. My script adds the html tags itself, but with most modern browsers this could be avoided by using outerHTML.

<html>
<head>
<base href='http://apod.nasa.gov/apod/'>
<title>test</title>
<style>
body { margin: 0 }
textarea { outline: none; padding: 2em; width: 100%; height: 100% }
</style>
</head>
<body onload="w=window.open('#'); x=document.getElementById('t'); a='<html>\n'; b='\n</html>'; setTimeout('x.innerHTML=a+w.document.documentElement.innerHTML+b; w.close()',2000)">
<textarea id=t></textarea>
</body>
</html>

Category:javascript Time:2009-03-25 Views:0
