Layout of comdev/projects.apache.org:

/scripts:
  - Contains scripts used for import and maintenance of foundation-wide data,
    such as committer IDs/names, project VPs, founding dates, reporting cycles
    etc. See README.txt for more info.
/data:
  - Contains data maintained by committees
/site:
  - Contains the HTML, images and JavaScript needed to run the site
/site/json:
  - Contains the JSON data storage calculated from /data and external sources
    (note: also used by reporter.apache.org, see getjson.py)
/site/json/foundation:
  - Contains foundation-wide JSON data (committers, chairs, podling evolution etc.)
/site/json/projects:
  - Contains project-specific data extracted from projects' DOAP files.

N.B. The directory structure should be owned by the www-data login (or whatever
login the webserver uses), because at least one of the scripts
(parsecommitteeinfo.py) may invoke SVN commands.

Suggested cron setup:

scripts/cronjobs/parsecommitters.py     - daily/hourly (whatever we need/want)
scripts/cronjobs/podlings.py            - daily
scripts/cronjobs/countaccounts.py       - weekly
scripts/cronjobs/parsereleases.py       - daily
scripts/cronjobs/parsecommitteeinfo.py  - daily

Webserver required:

To test the site locally, a webserver is required, or you'll get
"Cross origin requests are only supported for HTTP" errors.
An easy setup for development is to run

    python3 -m http.server 8888

(or "python -m SimpleHTTPServer 8888" under Python 2) from the site directory
to have the site available at http://localhost:8888/

Current crontab settings:

crontab root:
# m  h  dom mon dow  command
 10  5   *   *   *   cd /var/www/projects.apache.org/site/json && svn ci -m "updating projects data" --username projects_role --password `cat /root/.rolepwd` --non-interactive

crontab -l -u www-data:
# m  h  dom mon dow  command
 00  00  *   *   *   cd /var/www/projects.apache.org/scripts/cronjobs && ./python3logger.sh podlings.py
 01  00  *   *   *   cd /var/www/projects.apache.org/scripts/cronjobs && ./python3logger.sh parsecommitters.py
 02  00  *   *   *   cd /var/www/projects.apache.org/scripts/cronjobs && ./python3logger.sh countaccounts.py
 03  00  *   *   *   cd /var/www/projects.apache.org/scripts/cronjobs && ./python3logger.sh parsereleases.py
 00  01  *   *   *   cd /var/www/projects.apache.org/scripts/cronjobs && ./python3logger.sh parsecommitteeinfo.py
 00  02  *   *   *   cd /var/www/projects.apache.org/scripts/cronjobs && ./python3logger.sh parseprojects.py
# Run pubsubber
@reboot  cd /var/www/projects.apache.org/scripts/cronjobs && ./pubsubber.sh
@monthly cd /var/www/projects.apache.org/scripts/cronjobs && ./pubsubber.sh restart
# ensure that any new data files get picked up by the commit (which must be done by root)
 10  4   *   *   *   cd /var/www/projects.apache.org/scripts/cronjobs && ./svnadd.sh ../../site/json

There are additional jobs for reporter.a.o, which are documented in its source
code (README.txt).

Note: the puppet config for the VM is stored at:
https://github.com/apache/infrastructure-p6/blob/production/data/nodes/projects-vm3.apache.org.yaml
and
https://github.com/apache/infrastructure-p6/tree/production/modules/projects_pvm_asf

See also scripts/README.txt
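The JSON under /site/json is served as plain static files, so consumers such as
reporter.apache.org can read it over HTTP. A minimal sketch of such a consumer;
the file name foundation/committees.json is an assumption for illustration,
check /site/json/foundation for the actual files:

    #!/usr/bin/env python3
    # Minimal sketch: fetch one of the generated JSON files over HTTP.
    # "foundation/committees.json" is an assumed file name for illustration;
    # check /site/json/foundation for the real ones.
    import json
    import urllib.request

    URL = "https://projects.apache.org/json/foundation/committees.json"

    with urllib.request.urlopen(URL) as resp:
        data = json.load(resp)

    print("Entries:", len(data))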
Statistics (Snoot) Sources

https://cwiki.apache.org/confluence/display/COMDEV/Snoot

Updates to the list of sources are done by admins, with the following conventions:

- code repositories

  Rationale: get the git mirrors list from git.apache.org, but remove
  repositories for sites, since they contain too much generated (HTML) content
  that skews the real code statistics. (Note: sites in svn don't have the
  issue, only sites in git do, but filtering site repositories based on svn vs
  git is too complex and not really understandable.)

    wget -q -O - http://git.apache.org/index.txt | grep -v \\-site.git | grep -v \\-website.git | grep -v \\-www.git | grep -v \\-web.git > index.txt

  (Note: removals are detected by diff and must be managed manually; see the
  removal-diff sketch after this list.)

- issue trackers: Jira

    wget -q -O - https://issues.apache.org/jira/secure/BrowseProjects.jspa | sed -n 's/.*"\(.jira.browse.[^"]\+\)".*/https:\/\/issues.apache.org\1/p' | sort > jira.txt

  (Note: removals are detected by diff and must be managed manually; see the
  removal-diff sketch after this list.)

- issue trackers: Bugzilla

  TBD

- mailing lists

  https://lists.apache.org/api/preferences.lua
  TODO: parse the JSON and generate a txt file with one line per list:
  https://lists.apache.org/list.html?<list>@<project>.apache.org
  (see the parsing sketch after this list)

- irc

  TBD
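Both the git mirrors list and the Jira list above are regenerated from scratch
on each run, so removals only show up when the new file is diffed against the
previous one. A minimal sketch of that manual check; the file names are
placeholders for wherever the previous output is kept:

    #!/usr/bin/env python3
    # Minimal removal-diff sketch: report entries that disappeared between
    # two runs, so an admin can remove them from the Snoot sources by hand.
    # Usage: removals.py index.txt.old index.txt (file names are placeholders).
    import sys

    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return set(line.strip() for line in f if line.strip())

    old = read_lines(sys.argv[1])
    new = read_lines(sys.argv[2])

    for entry in sorted(old - new):
        print("Removed (handle manually):", entry)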
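For the mailing-lists TODO, a rough sketch of the parsing step follows. It
assumes the preferences.lua response contains a "lists" object keyed by domain
(e.g. community.apache.org), each mapping list names to message counts; verify
the actual structure of the live response before relying on this:

    #!/usr/bin/env python3
    # Sketch for the mailing-lists TODO: fetch preferences.lua and write one
    # line per list. The "lists"-keyed-by-domain structure is an assumption;
    # inspect the live response first.
    import json
    import urllib.request

    URL = "https://lists.apache.org/api/preferences.lua"

    with urllib.request.urlopen(URL) as resp:
        prefs = json.load(resp)

    with open("lists.txt", "w", encoding="utf-8") as out:
        for domain, lists in sorted(prefs.get("lists", {}).items()):
            for name in sorted(lists):
                out.write("https://lists.apache.org/list.html?%s@%s\n" % (name, domain))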