How to control indexing of your website in search engines

July 2nd, 2009

How to remove a web page from search engines using meta tags

This approach is suitable when you do not have root access to the server and cannot create a “robots.txt” file.

To prevent all robots from indexing a page on your site, place the following meta tag into the <head> section of your page:

<meta name="robots" content="noindex">

To allow other robots to index the page while preventing only Google’s robots from indexing it:

<meta name="googlebot" content="noindex">

When Google sees the noindex meta tag on a page, it will completely drop the page from its search results, even if other pages link to it. Other search engines, however, may interpret this directive differently. As a result, a link to the page can still appear in their search results.

If the content is currently in Google’s index, Google will remove it the next time it crawls the site. To expedite removal, use the URL removal request tool in Google Webmaster Tools.

What is a Robot Meta Tag?

You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not scan it for links to follow.

For example:

<html>
<head>
<title>Test Page</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>

There are two important considerations when using the robots <META> tag:

- Robots can ignore your <META> tag. Malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, in particular will pay no attention to it.

- The NOFOLLOW directive only applies to links on this page. It’s entirely likely that a robot will find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrive at your undesired page.

How to write a Robots Meta Tag?

Where to put it:
Like any <META> tag it should be placed in the HEAD section of an HTML page, as in the example above. You should put it in every page on your site, because a robot can encounter a deep link to any page on your site.

What to put into it:
The robots meta tag has two attributes: “NAME” and “CONTENT”.
The “NAME” attribute must be “ROBOTS”.
Valid values for the “CONTENT” attribute are: “INDEX”, “NOINDEX”, “FOLLOW”, “NOFOLLOW”. Multiple comma-separated values are allowed, but obviously only some combinations make sense. If there is no robots <META> tag, the default is “INDEX, FOLLOW”, so there’s no need to spell that out. That leaves:

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
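
To confirm that a page really carries the directives you intended, you can parse its HTML and read the tag back. The following is a minimal sketch using only the Python standard library; the names RobotsMetaParser and robots_directives are made up for this example, and it only inspects <META> tags, not HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" ...> tags."""

    def __init__(self, agent="robots"):
        super().__init__()
        self.agent = agent.lower()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() != self.agent:
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html, agent="robots"):
    """Return the set of lowercase directives addressed to `agent`."""
    parser = RobotsMetaParser(agent)
    parser.feed(html)
    return parser.directives


page = (
    "<html><head><title>Test Page</title>"
    '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
    "</head></html>"
)
print(sorted(robots_directives(page)))
```

Passing agent="googlebot" checks the Google-specific tag instead of the generic one.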

How to remove cached copies of web pages using robots meta tag?

Google automatically takes a “snapshot” of each page it crawls and archives it. This “cached” version allows a webpage to be retrieved for your end users if the original page is ever unavailable. The cached page appears to users exactly as it looked when Google last crawled it, and Google displays a message at the top of the page to indicate that it’s a cached version. Users can access the cached version by choosing the “Cached” link on the search results page.

Before you begin, you must do one of the following:

To update the cached version of a page:
Change the content of the page. The next time Google crawls the page, it will update the cached version.

To remove cached versions of a page from Google’s index and prevent Google from caching the page in the future:
Add a noarchive meta tag to that page. The next time Google crawls the site, it will see the tag and remove the cached page.

To prevent all search engines from showing a “Cached” link for your site, place this tag in the <HEAD> section of your page:

<meta name="robots" content="noarchive">

To prevent only Google from displaying one, use the following tag:

<meta name="googlebot" content="noarchive">

Once this is complete, you can use the URL removal tool in Webmaster Tools to request expedited removal of the cached content for a minimum of six months.

How to remove the snippets that appear below your pages in Google search results

A snippet is a text excerpt that appears below a page’s title in Google’s search results and describes the content of the page.

To prevent Google from displaying snippets for your page, place this tag in the <HEAD> section of your page:

<meta name="googlebot" content="nosnippet">

Note: Removing snippets also removes cached pages.

How to remove outdated pages from Google’s index by returning the proper server response

Google updates its entire index regularly. When Google crawls the web, it automatically finds new pages, removes outdated links, and reflects updates to existing pages, keeping the Google index fresh and as up-to-date as possible.

If outdated pages from your site appear in the search results, ensure that the pages return a status of either 404 (not found) or 410 (gone) in the header. These status codes tell Googlebot that the requested URL isn’t valid.
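
How you return these status codes depends on your server. As one illustration, here is a minimal sketch of a Python WSGI application that answers 410 for a set of retired URLs; the paths in RETIRED_PATHS and the helper status_for are hypothetical and exist only for this example.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical URLs that have been permanently removed from the site.
RETIRED_PATHS = {"/old-page.html", "/2008/promo.html"}


def app(environ, start_response):
    """Answer 410 Gone for retired pages and 200 OK for everything else."""
    path = environ.get("PATH_INFO", "/")
    if path in RETIRED_PATHS:
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"This page has been removed.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(path):
    """Call the app in-process and return the status line it sends."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = []
    body = app(environ, lambda status, headers: captured.append(status))
    list(body)  # consume the response body
    return captured[0]


print(status_for("/old-page.html"))
print(status_for("/"))
```

Checking the status line in-process like this is a quick way to confirm that a retired URL no longer answers 200 before a crawler visits it.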

How to remove images from Google Image Search using a robots.txt file?

To remove an image from Google’s image index, add a robots.txt file to the root of the server that blocks the image.

For example, if you want Google to exclude the logo.jpg image that appears on your site at www.yoursite.com/images/logo.jpg, add the following to your robots.txt file:

User-agent: Googlebot-Image
Disallow: /images/logo.jpg

To remove all the images on your site from Google’s index, place the following robots.txt file in your server root:

User-agent: Googlebot-Image
Disallow: /

Additionally, Google has introduced increased flexibility to the robots.txt file standard through the use of asterisks. Disallow patterns may include “*” to match any sequence of characters, and patterns may end in “$” to indicate the end of a name. To remove all files of a specific file type (for example, to include .jpg but not .gif images), you’d use the following robots.txt entry:

User-agent: Googlebot-Image
Disallow: /*.gif$
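
The “*” and “$” semantics described above can be approximated by translating a pattern into a regular expression. This is a rough sketch for experimenting with patterns, not Google’s actual matcher; the function names pattern_to_regex and is_blocked are invented for the example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a Disallow pattern with '*' and a trailing '$' into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    """Return True if `path` matches the Disallow `pattern`."""
    return pattern_to_regex(pattern).match(path) is not None


print(is_blocked("/*.gif$", "/images/logo.gif"))
print(is_blocked("/*.gif$", "/images/logo.jpg"))
```

Without the trailing “$”, a pattern behaves as a prefix match, so "/images/" covers every path that starts with /images/.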

How to remove entire or partial website content from search engines using a robots.txt file

July 2nd, 2009

Sometimes a publisher creates a new website and wants to remove the old website from search engines. The publisher can do this with the help of a “robots.txt” file.

The “robots.txt” file is a text file in the website’s server root. It is used to request that search engines remove your site and to prevent robots from crawling it in the future.

To prevent all robots from crawling your site, create a file named “robots.txt” in your server root and paste the following content into it:

User-agent: *
Disallow: /

To remove your site from Google only and prevent just Googlebot from crawling your site in the future, paste the following content into the file:

User-agent: Googlebot
Disallow: /

Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you’ll need a separate robots.txt file for each of these protocols. For example, to allow Googlebot to index all http pages but no https pages, you’d use the robots.txt files below.

For your http protocol (http://yourserver.com/robots.txt):

User-agent: *
Allow: /

For the https protocol (https://yourserver.com/robots.txt):

User-agent: *
Disallow: /

Note: If a robot discovers your site by other means, for example by following a link to your URL from another site, your content may still appear in Google’s index and search results. To entirely prevent a page from being added to the Google index even if other sites link to it, use a noindex meta tag.

Some More Examples:

Example 1:

The following example “/robots.txt” file specifies that no robots should visit any URL starting with “/India/delhi/” or “/test/”, or /prince.html:

# robots.txt for http://www.princejain.com/

User-agent: *
Disallow: /India/delhi/ # This is an infinite virtual URL space
Disallow: /test/ # these will soon disappear
Disallow: /prince.html

Example 2:
This example “/robots.txt” file specifies that no robots should visit any URL starting with “/India/delhi/”, except the robot called “Googlebot”:

# robots.txt for http://www.princejain.com/
User-agent: *
Disallow: /India/delhi/ # This is an infinite virtual URL space
# Googlebot knows where to go.
User-agent: Googlebot
Disallow:
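
Rules like these can be checked locally before you upload the file. The sketch below feeds the Example 2 rules to Python’s standard urllib.robotparser module; the robot name “SomeOtherBot” and the helper allowed are placeholders for this example, and only plain prefix rules are used here.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
# robots.txt for http://www.princejain.com/
User-agent: *
Disallow: /India/delhi/ # This is an infinite virtual URL space
# Googlebot knows where to go.
User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(agent, path):
    """Return True if `agent` may fetch `path` under RULES."""
    return parser.can_fetch(agent, "http://www.princejain.com" + path)


print(allowed("Googlebot", "/India/delhi/page.html"))
print(allowed("SomeOtherBot", "/India/delhi/page.html"))
print(allowed("SomeOtherBot", "/index.html"))
```

An empty Disallow line means nothing is disallowed, which is why Googlebot may fetch URLs that every other robot is told to avoid.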

Example 3:
This example indicates that no robots should visit this site further:

# go away
User-agent: *
Disallow: /

API Testing

June 24th, 2009

What is API Testing?
An API (Application Programming Interface) is a collection of software functions and procedures, called API calls, which can be executed by other software applications. API testing is mostly used for systems that expose a collection of APIs that need to be tested. The system could be system software, application software, or libraries.
API testing is different from unit, white box, and UI testing; the UI is rarely involved in API testing. The tester needs to set up the initial environment, invoke the API with the required set of parameters, and then analyze the result.
The initial environment means test environment and application setup: database creation, server configuration, config and properties file setup, and deployment of the application or any coding (if required).

People and companies often treat API testing as a synonym for unit or white box testing, but there is a huge difference between API, unit, and white box testing. The tester needs to, or may need to, write code during API testing.

Difference between API Testing and Unit Testing
Unit testing is an activity that is owned by the development team; developers are expected to build unit tests for each of their code modules (these are typically classes, functions, stored procedures, or some other ‘atomic’ unit of code), and to ensure that each module passes its unit tests before the code is included in a build.
Unit tests are typically designed by the developers to verify the functionality of each unit. The scope of unit testing often does not consider the system-level interactions of the various units; the developers simply verify that each unit in isolation performs as it should.

API testing is typically an activity owned by the QA team. API tests are often run after the build has been created, and it is common that the authors of the tests do not have access to the source code; they are essentially creating black box tests against an API rather than the traditional GUI.

In API testing, the QA team must consider the ‘full’ functionality of the system, as it will be used by the end user. This means that API tests must be far more extensive than unit tests, and take into consideration the sorts of ‘scenarios’ that the API will be used for, which typically involve interactions between several different modules within the application.

API testing is mostly black box testing, whereas unit testing is essentially a kind of white box testing. Unit test cases are typically designed by the developers, and their scope is limited to the unit under test. In API testing, test cases are designed by the QA team, and their scope is not limited to any specific unit; it normally covers the complete system.

So before starting API testing, a tester should be able to do the following:

* Find a way to approach the task.
* Do boundary analysis.
* Create or focus on the most likely usage scenarios (functional scenarios).
* Check return values.
* Focus also on negative testing to exercise exception and error handling.
* Check event triggers (optional; depends on the API type).
* Check modified resources (optional; depends on the API type).

Types of API and how to approach them:
An API can be called directly, or it can be called because of some event or in response to some exception. The output of an API could be some data or a status, or the API may simply wait for some other call to complete in an asynchronous environment.

A. If the API returns a value based on an input condition:
- In this case, test cases are based on the input and the corresponding output.
- This is relatively simple to test, as the input can be defined and the results can be validated against the expected return value.
- The tester can pass different combinations of values or parameters and validate these against known results.
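
A table of inputs and expected outputs keeps this kind of test easy to extend. The sketch below assumes a made-up API, clamp, standing in for the function under test; run_cases and the CASES table are likewise illustrative.

```python
def clamp(value, low, high):
    """Hypothetical API under test: limit value to the range [low, high]."""
    return max(low, min(value, high))


# Each case pairs the call arguments with the known expected result.
CASES = [
    ((5, 0, 10), 5),    # typical value inside the range
    ((-3, 0, 10), 0),   # below the lower bound
    ((42, 0, 10), 10),  # above the upper bound
    ((0, 0, 10), 0),    # exactly on the lower boundary
    ((10, 0, 10), 10),  # exactly on the upper boundary
]


def run_cases(api, cases):
    """Return the cases whose actual output differs from the expected one."""
    failures = []
    for args, expected in cases:
        actual = api(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


print(run_cases(clamp, CASES))
```

An empty list means every input produced its expected return value.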

B. If the API does not return anything:
- In this situation, the tester needs to identify some mechanism for checking the effect of the API on the system.
- For example, if you need to write test cases for a delete (list element) function, you will probably validate the size of the list and the absence of the deleted element from the list.
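
The delete example can be sketched as follows. The function delete_element is a hypothetical stand-in for the API under test, and check_delete verifies its side effects instead of a return value.

```python
def delete_element(items, value):
    """Hypothetical API under test: remove value from items in place."""
    items.remove(value)


def check_delete(api):
    """Validate the API through its effect on the list, not its return value."""
    items = ["a", "b", "c"]
    result = api(items, "b")
    assert result is None          # the API returns nothing
    assert len(items) == 2         # the list shrank by one element
    assert "b" not in items        # the deleted element is gone
    assert items == ["a", "c"]     # the remaining elements are untouched
    return True


print(check_delete(delete_element))
```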

C. If the API triggers some other API, event, or interrupt:
- If the API triggers some event or raises some interrupt, then you need to listen for those events and interrupts.
- The test suite should call the appropriate API, and the assertions should be made on the events and interrupts that the listener receives.
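
One way to do this is to register a listener that records whatever it receives and then assert on the recording. The Sensor class, its “overheat” event, and the 100-degree threshold below are all invented for illustration.

```python
class Sensor:
    """Hypothetical component whose API raises an event instead of returning data."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def set_temperature(self, value):
        # Fires an "overheat" event when the value crosses the threshold.
        if value > 100:
            for callback in self._listeners:
                callback("overheat", value)


def record_events(values):
    """Drive the API with `values` and return the events the listener saw."""
    sensor = Sensor()
    received = []
    sensor.subscribe(lambda name, payload: received.append((name, payload)))
    for value in values:
        sensor.set_temperature(value)
    return received


print(record_events([20, 150]))
```

The test asserts both that the event fires when it should and that it stays silent when it should not.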

D. If the API is used to update a data structure:
- Updating a data structure will have some effect on the system, and that effect should be validated.
- If you have other means of accessing the data structure, use them to validate that the data structure has been updated.

E. If the API modifies certain resources:
- If the API call modifies some resources, for example by updating a database, changing the registry, or killing a process, then the change should be validated by accessing those resources.
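
For a database, this means querying the table directly after the call. The sketch below uses an in-memory SQLite database; the users table and the deactivate_user API are assumptions made for the example.

```python
import sqlite3


def deactivate_user(conn, user_id):
    """Hypothetical API under test: marks a user inactive and returns nothing."""
    conn.execute("UPDATE users SET active = 0 WHERE id = ?", (user_id,))
    conn.commit()


def make_db():
    """Create a throwaway in-memory database with two active users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, active INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, 1), (2, 1)])
    conn.commit()
    return conn


def active_flags(conn):
    """Read the resource directly, bypassing the API under test."""
    return dict(conn.execute("SELECT id, active FROM users ORDER BY id"))


conn = make_db()
deactivate_user(conn, 2)
print(active_flags(conn))
```

Reading the table through a separate query also confirms that rows the call should not touch are left unchanged.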

Challenges of API Testing:

a. Parameter selection:
Ensuring that the test harness varies the parameters of the API calls in ways that verify functionality and expose failures. This includes assigning common parameter values as well as exploring boundary conditions.

b. Parameter combination:
Generating interesting parameter value combinations for calls with two or more parameters.

c. Setting the environment:
Determining the context in which an API call is made. This might include setting external environment conditions (files, peripheral devices, and so forth) and also internal stored data that affect the API.

d. Call sequencing:
Sequencing API calls to vary the order in which the functionality is exercised and to make the API produce useful results from successive calls.
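
Parameter selection and combination (items a and b) are often handled by listing a few interesting values per parameter, including boundary values, and generating every combination. The parameter names and values below are hypothetical.

```python
from itertools import product

# A few typical and boundary values for each parameter of an imagined call.
PAGE_SIZES = [0, 1, 50, 100, 101]   # 100 is assumed to be the documented maximum
SORT_ORDERS = ["asc", "desc", ""]   # the empty string probes a missing value
INCLUDE_DELETED = [True, False]


def parameter_combinations():
    """Yield one keyword-argument dict per combination of parameter values."""
    for size, order, deleted in product(PAGE_SIZES, SORT_ORDERS, INCLUDE_DELETED):
        yield {"page_size": size, "sort": order, "include_deleted": deleted}


combos = list(parameter_combinations())
print(len(combos))
```

The full product grows quickly as parameters are added, so larger APIs usually need a reduced selection such as pairwise combinations.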

Most common scenarios in API Testing:

a. Test Response: Exercise each API method in isolation, using only the mandatory elements and typical content. Whenever the system calls an API, the API sends a response, so the tester needs to check that the response is valid.

b. Test Limit: These tests exercise each API method using all optional elements and the maximum allowable content lengths and/or instances of repeated elements.

c. Test Business Logic: This is where the business application logic is simulated in the test code. Each API method will have a defined set of test cases that explore its interaction with and influence on other API methods and any underlying database.

d. Test Negative or Illegal: These tests contain a sampling of typical error scenarios, such as missing required elements, empty content, and content exceeding maximum limits, across a representative sampling of the API methods. The API should be intelligent enough to handle exceptions, errors, and missing parameters.

e. Test Load: Most applications are Web 2.0 applications that run on the internet and are accessed by many users, so the API should be able to take a heavy load and should not break at peak times. The tester also measures response time, throughput, latency, memory leaks, and any other factors the requirements call for.
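
The negative scenario (item d) can be sketched as a set of deliberately invalid inputs that the API is expected to reject. The create_user function and its 50-character limit are assumptions made for this example.

```python
def create_user(payload):
    """Hypothetical API under test: validates its input before accepting it."""
    if "name" not in payload:
        raise ValueError("name is required")
    if not payload["name"]:
        raise ValueError("name must not be empty")
    if len(payload["name"]) > 50:
        raise ValueError("name exceeds 50 characters")
    return {"status": "created", "name": payload["name"]}


def rejected(api, payload):
    """Return True if the API refuses the payload with a ValueError."""
    try:
        api(payload)
    except ValueError:
        return True
    return False


print(rejected(create_user, {}))                 # missing required element
print(rejected(create_user, {"name": ""}))       # empty content
print(rejected(create_user, {"name": "x" * 51})) # content over the limit
```

A valid payload should pass through the same helper without being rejected, which guards against an API that simply refuses everything.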

If you were to ask testers how to test an API, you would get several different perspectives. Everyone has a different way of thinking and a different approach to solving problems, and every approach has room for improvement.

Fifty Tech Startups by BusinessWeek

June 21st, 2009

BusinessWeek and market researcher YouNoodle have teamed up to identify 50 tech startups flying under the radar. The list includes fledgling tech companies, most started in 2005 or later, from the U.S., China, India, Israel, and Russia that are attracting some early buzz and are poised to grow beyond their regional or niche-market origins.

*A YouNoodle Score is a measurement, on a scale of 0 to 100, of a startup’s progress as an early-stage company. Typically, a 0-15 company is just getting started, a 30-60 company has experienced some very strong growth (through traffic, funding, or revenue), and a 90-plus company is a strong IPO or acquisition candidate. The score is based on a sophisticated algorithm using information from thousands of online sources: traffic, level of mainstream media coverage, funding, blogosphere activity, and other key factors.

Company | Headquarters | YouNoodle Score* | Total Funding (in Millions of Dollars) | Year Funded | CEO
Zynga | San Francisco | 97 | 39 | 2007 | Mark Pincus
Tudou | Shanghai | 96 | 84.5 | 2005 | Gary Wang
Ning | Palo Alto, Calif. | 96 | 104 | 2004 | Gina Bianchini
OpenDNS | San Francisco | 93 | 2.5 | 2005 | David Ulevitch
Etsy | Brooklyn, N.Y. | 91 | 31.6 | 2005 | Maria Thomas
Sonico | Buenos Aires | 91 | 4.3 | 2007 | Rodrigo Teijeiro
Scribd | San Francisco | 91 | 12.8 | 2007 | Trip Adler
Slide | San Francisco | 91 | 58 | 2005 | Max Levchin
RockYou | Redwood City, Calif. | 89 | 68.5 | 2006 | Lance Tokuda
Komli Media | Mumbai, India | 83 | 7 | 2006 | Amar Goel
Justin.tv | San Francisco | 81 | 4 | 2006 | Michael Siebel
Ibibo | Gurgaon, Haryana, India | 77 | N/A | 2007 | Ashish Kashyap
AdMob | San Mateo, Calif. | 76 | 47.2 | 2006 | Omar Hamoui
Jajah | Mountain View, Calif. | 68 | 28 | 2005 | Trevor Healy
Daylife | New York | 67 | 8.3 | 2007 | Upendra Shardanand
TheFind | Mountain View, Calif. | 66 | 26 | 2006 | Siva Kumar
QueBarato | Brazil | 65 | 6 | 2007 | Pending confirmation
Adconion Media Group | London | 64 | 80 | 2005 | T. Tyler Moebius
Kosmix | Mountain View, Calif. | 60 | 55 | 2005 | Venky Harinarayan
Evernote | Mountain View, Calif. | 60 | 13.5 | 2005 | Phil Libin
Yola | San Francisco | 60 | 25 | 2007 | Vinny Lingham
PBworks | San Mateo, Calif. | 57 | 2.5 | 2005 | Jim Groff
Spotify | Stockholm | 51 | 20 | 2006 | Daniel Ek
TokBox | San Francisco | 51 | 14 | 2007 | Ian Small
Loopt | Mountain View, Calif. | 48 | 13.3 | 2005 | Sam Altman
Xobni | San Francisco | 48 | 14.6 | 2006 | Jeff Bonforte
KupiVIP | Moscow | 48 | 11 | 2008 | Oskar Hartmann
Fon | Madrid | 44 | 48 | 2005 | Martin Varsavsky
Metaweb Technologies | San Francisco | 42 | 57 | 2005 | Thomas Layton
Huddle.net | London | 38 | 4 | 2006 | Alaisdair Mitchell
Mochi Media | San Francisco | 37 | 14 | 2005 | George Garrick
Boxee | New York | 36 | 4 | 2008 | Avner Ronen
Better Place | Palo Alto, Calif. | 33 | 200 | 2007 | Shai Agassi
Palantir Technologies | Palo Alto, Calif. | 31 | 36.7 | 2004 | Alex Karp
SecondMarket | New York | 31 | N/A | 2004 | Barry E. Silbert
Livescribe | Oakland, Calif. | 30 | 18.6 | 2007 | Jim Marggraff
Inrix | Kirkland, Wash. | 29 | 31.1 | 2004 | Bryan Mistele
Sermo | Cambridge, Mass. | 29 | 37.5 | 2006 | Daniel Palestrant
Modu | Kfar-Saba, Israel | 27 | 85 | 2007 | Dov Moran
SynapSense | Folsom, Calif. | 26 | 11 | 2006 | Peter Van Deventer
Pelago | Seattle | 25 | 22.4 | 2006 | Jeff Holden
Raydiance | Petaluma, Calif. | 24 | 20 | 2005 | Barry Schuler
Fusion-io | Salt Lake City | 23 | 66.5 | 2006 | David Bradford
Cloudera | Burlingame, Calif. | 22 | 11 | 2008 | Michael Olson
Bloom Energy | Sunnyvale, Calif. | 22 | N/A | 2002 | K.R. Sridhar
Positive Energy | Arlington, Va. | 22 | 15.5 | 2007 | Daniel Yates
Nila | Los Angeles | 22 | 0.6 | 2004 | Jim Sanfilippo
Monitise | London | 21 | 19 | 2004 | Alastair Lukies
Proclivity Systems | New York | 16 | 6.2 | 2006 | Sheldon Gilbert
Cotendo | San Carlos, Calif. | 12 | 7 | 2008 | Ronni Zehavi

List of Ad Networks in India

June 19th, 2009

Below are the ad networks I know of in India that display ads, directly or indirectly, on their own websites or on publisher (client) sites. This is a partial list, and suggestions are welcome.
1. Google AdSense
2. Network Play
3. Komli Media
4. mKhoj
5. Tyroo Media
6. Ozone Media
7. DGM India
8. Tribal Fusion
9. PubMatic
10. Ad Magnet
11. AdRevenue
12. Integrid Media
13. IndiAds
14. Kyphy
15. Social Media Exchange (Formerly Axill India)
16. AdChakra
17. Tonic Tag
18. Ruipzads
19. Sulekha
20. Quikr (free Classifieds site)
21. Oridian (A Ybrant Digital Network – The Group of Ad Networks)
22. Jivox
23. PayPod
24. Adaptive Ads (Glam Media Company)
25. mGinger