Coordination
Biblioteca de Catalunya
Collaboration
CESCA
Sponsored
Generalitat de Catalunya
Support
Fundació puntCAT
Member
IIPC
Powered by
HERITRIX
NutchWax
WERA
WEBCURATOR
FAQ (Frequently Asked Questions) about PADICAT

  • What is PADICAT?
  • What can I do to get my website to appear in PADICAT?
  • What can I do to avoid my website appearing in PADICAT?
  • When I visit some of the captured websites, why is it that I can't see some images or access some of the links?
  • What does PADICAT capture from each website?
  • I recommended my website as a part of the collection and I can't find it on the database - why?
  • What volume and capacity of data does PADICAT have?
  • What hardware does PADICAT use?
  • What software does PADICAT use?
  • Can PADICAT capture and display correctly any kind of website?
  • Does the language I use to search have a bearing on the search results?
  • Help with searching
  • What is the content of PADICAT?
  • Questions and suggestions




    What is PADICAT?

    It is an initiative of the Biblioteca de Catalunya which consists of capturing, processing and giving permanent access to all Catalan cultural, scientific and general-interest output in digital format. In short, the aim is to archive the Catalan internet.

    A complete and detailed explanation of its aims, objectives and functioning can be found in the section "What is it?".





    What can I do to get my website to appear in PADICAT?

    PADICAT captures websites in several ways: the systematic capture of websites under the .cat domain; the capture of the websites of institutions that have signed a collaboration agreement with the Biblioteca de Catalunya; the capture of websites identified as relevant while browsing; and the capture of websites recommended by users, once their relevance has been confirmed.

    If you would like your website to form part of the PADICAT collection, you can recommend it by completing the short form in the section Proposing a website.

    From the moment that a website becomes a part of the repository, it is captured at least twice a year, and this frequency may increase in the future.






    What can I do to avoid my website appearing in PADICAT?

    You can keep your website out of the collection simply by including a robots.txt file, which prevents our robot from visiting it.

    Our robot identifies itself as PADICAT and follows the Standard for Robot Exclusion (SRE). This means that it does not enter any website, or any part of a website, that is protected in this way, unless the institution and the Biblioteca de Catalunya have agreed and authorized it beforehand.
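
    As an illustration, the following Python sketch checks how a robots.txt file would apply to a robot named PADICAT, using the standard library's robots.txt parser. The robots.txt content and the example.cat address are hypothetical, and the sketch assumes that the user-agent token to match is exactly "PADICAT".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks the PADICAT robot from the whole site
# and leaves every other robot unrestricted.
ROBOTS_TXT = """\
User-agent: PADICAT
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "http://www.example.cat/menu.html"
padicat_ok = is_allowed(ROBOTS_TXT, "PADICAT", page)
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", page)
print(padicat_ok, other_ok)
```

    To exclude only part of a website, the Disallow line can name a directory (for example /private/) instead of the whole site.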







    When I visit some of the captured websites, why is it that I can’t see some images or access some of the links?

    The purpose of PADICAT is to preserve websites exactly as they were at the moment of capture. At the same time, it seeks to offer users the possibility of browsing the captured websites in the same way as if they were doing so on the real Internet.

    However, some elements often make it difficult to view these websites properly or to follow their hyperlinks. Three basic tips to avoid some of the anomalies in the viewing of captured websites are:

    • Don't use full URLs (including the domain) to link to pages or files of the same site. So, instead of:

      http://www.example.cat/images/logotipo.jpg

      or

      http://www.example.cat/menu.html,

      it would be more advisable to use:

      /images/logotipo.jpg

      and

      /menu.html

    • Don't use the HTML meta refresh tag to redirect to another page. Example:

      <html>
      <head>
      ....
      <meta http-equiv="refresh" content="2;url=http://example.cat">
      ....
      </head>
      ....
      </html>

    • Don't embed content from external sites, whether images, scripts or other elements.
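
    The three tips can be checked mechanically. The following Python sketch is only an illustration, not a tool used by PADICAT: the class name CaptureLint, the sample page and the cdn.example.org host are all hypothetical. It scans a page and reports meta refresh tags, full URLs pointing to the site itself, and resources embedded from other hosts.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Tags that embed a resource in the page (as opposed to plain links).
EMBEDDING_TAGS = {"img", "script", "link", "iframe"}


class CaptureLint(HTMLParser):
    """Collects markup that the three tips above advise against."""

    def __init__(self, site_host):
        super().__init__()
        self.site_host = site_host
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            self.warnings.append("meta refresh")
            return
        for name in ("href", "src"):
            value = attrs.get(name)
            if not value:
                continue
            host = urlsplit(value).hostname
            if host is None:
                continue  # relative path: nothing to report
            if host == self.site_host:
                self.warnings.append("full URL to own site: " + value)
            elif tag in EMBEDDING_TAGS:
                self.warnings.append("external resource: " + value)


# Hypothetical page that breaks all three tips.
SAMPLE = """
<html><head>
<meta http-equiv="refresh" content="2;url=http://example.cat">
<script src="http://cdn.example.org/lib.js"></script>
</head><body>
<img src="http://www.example.cat/images/logotipo.jpg">
<a href="/menu.html">Menu</a>
</body></html>
"""

lint = CaptureLint("www.example.cat")
lint.feed(SAMPLE)
for warning in lint.warnings:
    print(warning)
```

    Ordinary hyperlinks to other websites are deliberately not reported, since the tips concern embedded elements and links within the same site.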


    For a more detailed explanation of what causes these problems, and advice on how to avoid them if you are the owner of a website, see Can PADICAT capture and display correctly any kind of website?






    What does PADICAT capture from each website?

    PADICAT captures only the websites and parts of websites that are publicly accessible on the Internet. Apart from respecting the limitations that website owners may impose (see What can I do to avoid my website appearing in PADICAT?), PADICAT does not enter or capture any area that requires a password, a form submission or similar, such as areas reserved for the members of a professional association or for subscribers to a publication.






    I recommended my website as a part of the collection and I can't find it on the database - why?

    PADICAT currently has 4 ProLiant DL360 G4p servers working at 100% of their capacity around the clock. Even so, the large number of resources to be captured means that queues are formed, which can slow down the capture of proposed resources.






    What volume and capacity of data does PADICAT have?

    The volume of data stored in PADICAT can be consulted in the What do we have? section of our website, where the figures are updated periodically.






    What hardware does PADICAT use?

    PADICAT has at its disposal seven HP ProLiant DL360 G4p nodes dedicated to collecting and indexing websites, while searching and the viewing of results on the web interface are handled by a high-availability Linux cluster with load balancing and fault tolerance in case any of the nodes that make up the platform fails.





    The nodes are connected by fibre to a Storage Area Network (SAN), and the system is completed by a tape robot on which backup copies of the data are stored.









    What software does PADICAT use?

    Various programmes are used for capturing, indexing and giving access to the stored resources. Heritrix collects the websites just as they are seen by a user browsing the Internet, and stores them in compressed files in the ARC format. NutchWax and Hadoop then index the information gathered, so that the indexes can later be used to locate resources within the collection.
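
    To give an idea of what an ARC file stores, the following Python sketch parses the header line that precedes each captured resource in version 1 of the ARC format: the URL, the IP address, the capture date, the content type and the length in bytes, separated by spaces. The sample line is invented, and the sketch assumes that no field contains a space; it is not the parser used by Heritrix or NutchWax.

```python
from datetime import datetime
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one ARC version 1 record header line (five space-separated fields)."""
    url, ip_address, date, content_type, length = line.strip().split(" ")
    return ArcRecordHeader(
        url=url,
        ip_address=ip_address,
        # Capture dates are written as 14 digits: YYYYMMDDhhmmss.
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),
    )


# Invented example line; the address and values are placeholders.
header = parse_arc_header(
    "http://www.example.cat/menu.html 192.0.2.10 20080115103000 text/html 2048"
)
print(header.url, header.archive_date.isoformat(), header.length)
```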

    There are two interfaces for searching the captured resources: WERA, which enables keyword searches on the indexes generated by NutchWax, and Wayback, which allows direct lookup by URL.





    The programme Web Curator Tool is used for the cataloguing of the captured resources.

    All of the software used by PADICAT is free and open source, and has been developed by non-profit organizations associated with the International Internet Preservation Consortium (IIPC), of which the Biblioteca de Catalunya is a member.






    Can PADICAT capture and display correctly any kind of website?

    Owing to irregularities in the viewing software and inconsistencies during the archiving of websites (e.g. robots.txt exclusions), some websites may not be displayed correctly (external links, forms or search boxes, missing images) or may redirect to the current version of the website.

    Websites which follow HTML and accessibility standards shouldn’t have any problems either during capture or during viewing once they are stored in PADICAT. However, certain elements may complicate both the capture of resources and, above all, their subsequent viewing within the collection. Some recommendations:

    For the capture of a website:

    -robots.txt: as a general rule, PADICAT respects the exclusions that websites declare.

    For browsing and viewing of the captured version:

    Links:

    -embedded elements: images, scripts, etc. hosted on external websites will not be displayed correctly once the website has been captured by PADICAT. It is recommended that you save these files (logos, for example) in a directory on your own server and reference them with relative paths.
    -build links with relative paths or paths from the site root (e.g. /menu.html) rather than the complete URL.
    -don’t use scripts to build links dynamically.
    -avoid embedding Flash objects that contain absolute links.
    -avoid using the base href tag.
    -avoid linking to URLs that redirect to another site.
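
    As a rough illustration of the recommendation on paths, the following Python sketch turns a complete URL that points to your own site into a path from the site root, and leaves everything else untouched. The function name to_site_relative and the example.cat and example.org hosts are hypothetical.

```python
from urllib.parse import urlsplit


def to_site_relative(link: str, site_host: str) -> str:
    """Rewrite a complete URL on site_host as a path from the site root."""
    parts = urlsplit(link)
    if parts.hostname != site_host:
        return link  # external link, or already a relative path
    result = parts.path or "/"
    if parts.query:
        result += "?" + parts.query
    if parts.fragment:
        result += "#" + parts.fragment
    return result


for link in (
    "http://www.example.cat/images/logotipo.jpg",
    "http://www.example.cat/menu.html?lang=ca#top",
    "http://other.example.org/banner.png",
    "/menu.html",
):
    print(link, "->", to_site_relative(link, "www.example.cat"))
```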

    Interpreted languages:

    -avoid using local variables on the server that allow variations to the appearance of the site to be viewed, such as for instance, changes of language or dynamic changes to menus.

    Encoding:

    -PADICAT uses UTF-8 encoding to display characters. Websites which use a different encoding (e.g. Latin-1) without declaring it may show errors when viewed (in diacritics, for example). It is therefore recommended that the website specify the encoding it uses.
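
    The following Python sketch shows why an undeclared encoding matters. A text with an accented letter is stored as Latin-1 bytes; read as UTF-8, the accented letter cannot be decoded and is replaced by a substitution character, whereas reading the same bytes with the declared encoding recovers the text. The sample text is only an example.

```python
text = "Fundació puntCAT"

# The page is saved as Latin-1, where "ó" is the single byte 0xF3.
latin1_bytes = text.encode("latin-1")

# A viewer that assumes UTF-8 cannot decode that byte and substitutes U+FFFD.
garbled = latin1_bytes.decode("utf-8", errors="replace")

# With the encoding declared, the same bytes decode correctly.
correct = latin1_bytes.decode("latin-1")

print(ascii(garbled))
print(ascii(correct))
```

    In an HTML page, the encoding is typically declared in a meta tag in the head or in the Content-Type header sent by the server.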

    Accessibility recommendations:

    -we recommend avoiding frames, as they can complicate the indexing of the website and, therefore, its subsequent retrieval in text searches.
    -we recommend offering alternative ways of accessing the information in pages which use JavaScript, since some devices do not support it and some browsers have it disabled.

    Other recommendations for webmasters:

    -use pages that are not too heavy.
    -do not fill up the same page with too many images.
    -follow the norms of accessibility (frames, coding, etc.).
    -do not use spaces in filenames.






    Does the language I use to search have a bearing on the search results?

    The indexes which the software generates from the captured websites, and which are used for keyword searches, are the same for all interface languages; that is to say, they are independent of the language which the user chooses in the PADICAT search interface, and depend solely on the language in which the captured website is written.

    Therefore, search terms do not need to match the language in which you are browsing PADICAT. Even so, terms in Catalan will return a larger number of results.






    Help with searching

    Tips for searching

    -To search by free text, use the search by word.
    -To search by specific domain, use the search by URL.

    Tips for the advanced search

    -Type in one or more search terms.
    -If appropriate, specify the domain on which you wish to perform the search.
    -In order to limit the search to a period of time, specify the start and finish dates.
    -In order to limit the results to a kind of file, specify one.
    -In order to search within one event, select the corresponding collection; if you wish to search in all the resources, select "All".

    Combined and/or expert searches

    -A word can be complete or truncated (e.g. coun to find council and councilor).
    -If you type in several search terms, the system will retrieve the resources which contain all of them.
    -Use the AND operator to retrieve resources which contain all of the words typed in (e.g. councilor AND elections).
    -Use the OR operator to retrieve resources which contain one or other of the words typed in (e.g. education OR formation).
    -Use inverted commas (“”) to look for an exact phrase (e.g. “roda de ter”).
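
    The rules above can be modelled in a few lines of Python. This is only a sketch of the behaviour described, not the search engine used by PADICAT: the function names are hypothetical, straight double quotes stand in for inverted commas, and it assumes that OR separates alternatives while AND (or a plain space) requires every term within an alternative.

```python
import re


def words(text):
    """Split text into lower-case words."""
    return re.findall(r"\w+", text.lower())


def term_matches(term, doc_words):
    """A term matches any word that starts with it (truncated search)."""
    return any(word.startswith(term.lower()) for word in doc_words)


def phrase_matches(phrase, doc_words):
    """A quoted phrase must appear as consecutive whole words."""
    target = words(phrase)
    size = len(target)
    return any(
        doc_words[i:i + size] == target
        for i in range(len(doc_words) - size + 1)
    )


def matches(query, text):
    """Return True if text satisfies query under the rules sketched above."""
    doc_words = words(text)
    for alternative in re.split(r"\s+OR\s+", query):
        satisfied = True
        for phrase, term in re.findall(r'"([^"]+)"|(\S+)', alternative):
            if phrase:
                satisfied = satisfied and phrase_matches(phrase, doc_words)
            elif term != "AND":
                satisfied = satisfied and term_matches(term, doc_words)
        if satisfied:
            return True
    return False


print(matches("coun", "The council met yesterday"))
print(matches("councilor AND elections", "The councilor spoke"))
print(matches("education OR formation", "Formation courses"))
print(matches('"roda de ter"', "Ajuntament de Roda de Ter"))
```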





    What is the content of PADICAT?

    In the section What do we have? you can consult the number of websites contained in PADICAT and the number of captures of these websites carried out on different dates. The section also shows the number of files which make up the captures held in the repository. These files are mainly web pages: approximately 70% of them are HTML, 10% images, 2% PDF, etc. (for the exact breakdown of the kinds of files which make up the websites in PADICAT, see our press release).
    Finally, the space occupied is shown, which includes the size of the compressed ARC files which store the captures and their indexes.

    These figures are updated automatically when new resources are added to the collection.








    Questions and suggestions

    If you have a question which has not been answered here, or a suggestion for us, you can send it using the following form.

