Stanford University's database analytics

Date: 2015-07-22 12:47:42

Broad Steps

Set up a new AWS VPC (this step is optional, so you don't have to follow along if you don't want to).

Stanford is running an entire AWS VPC devoted to analytics, which hosts:

the analytics report, API application, and dashboard application databases,
the ElasticMapReduce cluster,
the task scheduler (which we use Jenkins for),
the API servers, and
the dashboard app servers.

Our data VPC also has a peering connection to our prod VPC, so that the EMR cluster machines can get access to our production RDS read-replica, needed for some of the analytics tasks.

Note that none of this is necessary. Everything will work fine as long as you can set up a cluster, the app machines, and the databases, and they can all connect to each other as needed.

Upload your tracking logs somewhere (like S3) accessible to the Hadoop cluster you will create.

Tracking logs, in recent releases of edx-platform, are typically located on the app server at /edx/var/log/tracking/tracking.log-+%Y%m%d-%s. At Stanford (and edX), the tracking logs from all our app servers get synced to a single bucket in S3 (Stanford uses rsync). Whether the logs are pushed by the app servers or periodically synced by some other process, make sure there are no duplicate or missing tracking log files in this bucket, as that will affect the statistical calculations.
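The duplicate/missing check can be sketched in Python. This is a minimal illustration, not part of the edX tooling: it assumes you have already listed the bucket into (key, content digest) pairs, for example from S3 ETags, and that file names look like tracking.log-YYYYMMDD-<epoch seconds>. The function name audit_tracking_logs and the example keys are hypothetical.

```python
import re
from collections import defaultdict
from datetime import date, datetime, timedelta

# Assumed file name shape: tracking.log-YYYYMMDD-<epoch seconds>, optionally gzipped.
LOG_NAME = re.compile(r"tracking\.log-(\d{8})-\d+(?:\.gz)?$")


def audit_tracking_logs(objects, first_day, last_day):
    """Return (duplicates, missing_days) for an iterable of (key, digest) pairs.

    duplicates: sorted lists of keys that share the same content digest.
    missing_days: dates in [first_day, last_day] that have no log file at all.
    """
    keys_by_digest = defaultdict(list)
    days_seen = set()
    for key, digest in objects:
        match = LOG_NAME.search(key)
        if not match:
            continue  # not a tracking log, ignore it
        keys_by_digest[digest].append(key)
        days_seen.add(datetime.strptime(match.group(1), "%Y%m%d").date())

    duplicates = [sorted(keys) for keys in keys_by_digest.values() if len(keys) > 1]

    missing_days = []
    day = first_day
    while day <= last_day:
        if day not in days_seen:
            missing_days.append(day)
        day += timedelta(days=1)
    return duplicates, missing_days
```

A day with no files is only a hint that something went missing. A quiet server may legitimately produce no log, so treat the result as a list of things to investigate.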

Launch an ElasticMapReduce (EMR) Hadoop cluster.

Stanford keeps a long-running cluster around (1 m3.medium master node and 1 m3.medium core node) and scales the number of task instances up or down with each task run. The article on creating an EMR cluster has more details.

Note that this is somewhat different from edx.org, which provisions a new EMR cluster with every task run, using a custom ansible module driven by a shell script. Consult the edx-analytics-configuration repo if you are interested in this workflow.

Set up the analytics report, API application, and dashboard application MySQL databases.

It's pretty much standard RDS, but make sure the RDS security groups for the reports database (written to by the code in edx-analytics-pipeline and read by the code in edx-analytics-data-api) allow access from all the master and slave cluster machines and from all the data API servers. The security groups associated with EMR-Master and EMR-Slave were created for us when we launched an EMR cluster. The data API and dashboard (edx-analytics-dashboard) Django apps also need databases to function, and we just use the same DB server for all 3 databases.
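One quick way to confirm that the security groups are open is to attempt a TCP connection to the database port from a cluster machine or an API server. The sketch below is a generic check, not something from the edX repos. The helper name can_connect is hypothetical, and 3306 is simply the default MySQL port, so adjust it if your RDS instance listens elsewhere.

```python
import socket


def can_connect(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A successful connection only shows that the network path and security group rules allow traffic. It does not verify the MySQL credentials or grants.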

Set up the task scheduler.

The reports DB is filled periodically by the luigi tasks, so a scheduler is needed. We set up a Jenkins box because it provides a nice interface that allows us to schedule jobs periodically (and to view the console output) but also to run them on demand. We did a vanilla sudo apt-get install jenkins on an Ubuntu server. However, edx-analytics-pipeline needs to be checked out and installed on this Jenkins box, because the executable Python script remote-task supplied by the install is what kicks off the luigi tasks on the EMR cluster.

Set up the tasks themselves, and provide all the parameters and the sundry things needed by the tasks.

Task parameters can be supplied in 3 ways: on the command line of the remote-task command; via an overrides.cfg file that lives on the file system of the scheduler Jenkins box and is pointed to by a command-line parameter to remote-task (this is what Stanford does currently); or in an override.cfg kept in another repo, with the repo location supplied by yet another command-line parameter to remote-task.
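To illustrate how an overrides file layers on top of default settings, here is a small Python sketch that uses the standard configparser module. It assumes the overrides file is INI-style, as Luigi configuration files are. The section name hypothetical-task, the option names, and the bucket are made up for the example, and the sketch does not claim to reproduce how remote-task itself merges configuration.

```python
import configparser

# Hypothetical defaults, standing in for whatever ships with the pipeline.
DEFAULTS = """
[hypothetical-task]
output_root = s3://example-bucket/default/
n_reduce_tasks = 4
"""

# Hypothetical overrides file: only the options that differ are listed.
OVERRIDES = """
[hypothetical-task]
n_reduce_tasks = 12
"""


def effective_config(*sources):
    """Merge INI-formatted strings; later sources win option by option."""
    parser = configparser.ConfigParser()
    for text in sources:
        parser.read_string(text)
    return parser


config = effective_config(DEFAULTS, OVERRIDES)
print(config["hypothetical-task"]["n_reduce_tasks"])  # value taken from OVERRIDES
print(config["hypothetical-task"]["output_root"])     # default left untouched
```

Keeping only the differing options in the overrides file makes it easy to see at a glance how an installation deviates from the defaults.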

Sundry things are mainly kept in S3, such as MySQL credentials files for the reports database or .jar libraries needed by various tasks.

Set up the API app servers (from repo: https://github.com/edx/edx-analytics-data-api)

Once you're able to launch tasks, have them run to completion, and confirm there's data in your reports MySQL DB, you need to set up the data-api application servers to serve that data from the reports MySQL DB over a REST API. There are ansible roles available in the edx configuration repo (https://github.com/edx/configuration/tree/master/playbooks/roles/analytics-api) for this, and even a playbook that runs this role (ours is at https://github.com/Stanford-Online/configuration/blob/master/playbooks/edx-west/data-api.yml), so you don't need to do much except edit the vars files used by the playbook.

The data API app has a self-documenting front page (https:///docs/) that you can use to test that the data is being served correctly.
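Beyond the docs page, a scripted check can fetch an endpoint and parse the JSON. The following Python sketch uses only the standard library. The host name, endpoint path, and token are placeholders, and the "Authorization: Token ..." header scheme is an assumption about how your data API instance authenticates clients, so confirm it against your own deployment before relying on it.

```python
import json
import urllib.request


def build_request(base_url, path, token):
    """Build a GET request for base_url + path with a token header.

    The "Token <value>" authorization scheme is an assumption.
    """
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(
        url,
        headers={"Authorization": "Token " + token, "Accept": "application/json"},
    )


def fetch_json(request, timeout=10):
    """Send the request and decode the response body as JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, fetch_json(build_request("https://data-api.example.com", "/some/endpoint/", "your-token")) would return the decoded payload, where the host, path, and token are all hypothetical values to replace with your own.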

Set up the Insights app servers (from repo: https://github.com/edx/edx-analytics-dashboard)

Once you confirm that the data API is serving up data over REST, you can set up the Insights (dashboard) app, which is responsible for the UX / presentation of the analytics data. This app does not interact directly with the reports database; rather, it makes REST calls to the data API and interprets and displays the JSON returned.

There are ansible roles available in the edx configuration repo (https://github.com/edx/configuration/tree/master/playbooks/roles/analytics-insights) for this, and even a playbook that runs this role (ours is at https://github.com/Stanford-Online/configuration/blob/master/playbooks/edx-west/data-insights.yml), so you don't need to do much except edit the vars files used by the playbook.

Configure the OpenID Connect (OAuth2) parameters between Insights and edx-platform

The Insights app relies on the edx-platform instance for its authentication / authorization to create a more integrated user experience. In particular, when a user visits the Insights app, the app uses the OpenID Connect protocol to seamlessly create an Insights account that's linked with the user's edx-platform account. The user's course staff privileges are also propagated from edx-platform to Insights, so that users only see analytics data for courses in which they have staff privileges.

This means that some configuration is required in edx-platform to add Insights as an OpenID Connect client, and that configuration needs to be in sync with the configuration in the Insights app. See article for details.


Original article: http://www.cnblogs.com/zhaojianwei/p/4666875.html
