CC 03
The game rock, paper, scissors is a classic tool for making important decisions between two friends (read https://www.wikihow.com/Play-Rock,-Paper,-Scissors).

1. Create a function named winner_RPS. The function winner_RPS takes two parameters, p1 and p2.
• It returns p1 (the parameter) if p1 wins; p2 if p2 wins; and None if there is no winner.
• Try to write your solution such that winner_RPS has a single return statement.
• This code needs to be in the lesson.py module.

2. p1 and p2 can only be one of the following values:
"rock", "paper", "scissors"

3. Test your code. You can write your own tests and not rely on the testing framework. In main.py write a test function named test_RPS to verify that your code is working. The function doesn't have any parameters and should test your winner_RPS at least a few times (more than twice!). Try to figure out how to return True if it passes all your tests (False otherwise).
• Use the import statement in main.py (i.e. import lesson)

For example the following could be a test:
import lesson

t1 = 'rock'
t2 = 'paper'
if lesson.winner_RPS(t1, t2) != t2:
    print('Test FAIL')

You can also use random.choice to model the selection part of the game play:
import random

values = "rock,paper,scissors".split(',')
p1 = random.choice(values)
print(p1)

How would you test all possible cases? Is that even possible ?
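Since there are only nine possible (p1, p2) combinations, exhaustive testing is possible. Below is one hedged sketch of such a test; the function name test_RPS_exhaustive and the beats dictionary are illustrative, not part of the assignment:

import itertools
import lesson

def test_RPS_exhaustive():
    # key beats value
    beats = {'rock': 'scissors', 'paper': 'rock', 'scissors': 'paper'}
    ok = True
    for p1, p2 in itertools.product(beats, repeat=2):
        if p1 == p2:
            expected = None              # a tie has no winner
        elif beats[p1] == p2:
            expected = p1                # p1 beats p2
        else:
            expected = p2
        if lesson.winner_RPS(p1, p2) != expected:
            print('Test FAIL for', p1, p2)
            ok = False
    return ok

print(test_RPS_exhaustive())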

CC 04
Coding Challenges: Lucky 777

Prerequisites:
• Python Regular Expressions, Parts 1, 2, 3
• Python Remote I/O
• DSP: Jupyter Lesson

Hapax: one more thing...
When a word occurs only once in a body of work or an entire written record, it's called a hapax. However, there is disagreement on how narrow the set of works can be. Usually, a hapax can only appear once in an author's entire collection rather than just within a specific piece.

For example, Hamlet has a famous hapax, 'Hebenon', a poison. It is said that this is Shakespeare's only use of the word. However, if you look for hapaxes (aka hapax legomena) in a single piece of text, there are many: Hamlet has over 2700 words that occur only once. Let's extend this fun fact to find a unique set of words within a body of text that share some very specific attributes.

Let's classify all words in a body of text by how often they occur. A body of text is a lucky winner if it contains 7 words that each occur 7 times and are each 7 characters long. For this project you will create a notebook, import some text, and determine if the text is a 'winner'.

However, you will write your solution to be generic so that any number could be passed in (e.g. 4-letter words that occur only 4 times and there are a total of 4 of them).

All code will be in your Colaboratory Notebook and it will be graded using Gradescope.

Step 0: Starting Point, New Notebook
https://colab.research.google.com/notebooks/welcome.ipynb
Be sure you are logged into Google using your @gmail address so your notebook will be saved to your drive.

• Open a new Colab notebook via File -> New Python 3 notebook
• Name it INFO490-777
• Your notebook will be saved to your Google Drive in a special folder.

Step 1: Paste in Starter Code
In lesson.py there is some starter code for this project. Put this code into a new code cell in your notebook.

Step 2: Make Hamlet‘s text available via Google Drive
This step is a bit superfluous in that we are moving data from Project Gutenberg to your Google Drive and then accessing Hamlet from there. Why? Because it's useful to know the steps involved in making data accessible via Google. You can also use this method to access any data (csv files, images, etc.) located in your personal drive.

Many versions
A previous lesson also used a specific text of Hamlet (RemoteIO); however, there are many editions/versions of this famous play (you can even take classes that study the different versions). On Project Gutenberg you can see different versions:
http://www.gutenberg.org/ebooks/search/?query=hamlet

For this project we will use this version.
Please read the Director's and Scanner's notes to learn some of the details of this specific version of Hamlet.

Here's the easiest workflow (you are free to use any other method as well) for moving that document into your Google Drive space:
• Open a new tab in your browser and go to http://www.gutenberg.org/ebooks/2265
• Save the UTF-8 version to your computer (name it hamlet.txt)
• Go to your Google Drive account, select the New button, and use the 'File upload' option to upload hamlet.txt from your computer.
• Get the share link.
• The main thing is that you need the ID of the document. For example:
https://drive.google.com/open?id=19pOCDIXak04cTs7TLiEA3TKUCESU10ZM
• Note that this is NOT the URL you can use to fetch via remote I/O in Python. It is a 'browser' friendly URL.

Step 3: Define the following function, which returns the ID of Hamlet on your Google Drive:
def get_book_id():
    # replace this with your resource ID
    return '19pOCDIXak04cTs7TLiEA3TKUCESU10ZM'

Step 4: Finish the implementation of build_google_drive_url() (see lesson.py for the code)

This function builds the URL used to fetch a document saved on Google Drive. You will add the properly encoded request parameters to the base URL. If this sounds difficult, go back to the RemoteIO lesson -- the answer is there.
You can use the current implementation (which returns the Project Gutenberg URL) for partial credit.
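If it helps to see the shape of the solution: a direct-download Google Drive URL is typically the export endpoint with the file ID passed as an encoded request parameter. A minimal sketch, assuming the endpoint covered in the RemoteIO lesson (verify against that lesson):

from urllib.parse import urlencode

def build_google_drive_url(g_id):
    # assumption: Google Drive's download endpoint with export/id parameters
    base_url = 'https://drive.google.com/uc'
    params = {'export': 'download', 'id': g_id}
    return base_url + '?' + urlencode(params)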

Step 5: TEST it
Test your solution by downloading and reading your novel:
def get_hamlet():
    g_id = get_book_id()
    url = build_google_drive_url(g_id)
    return read_remote(url)

hamlet = get_hamlet()
print(hamlet[0:100])

Step 6: Take a break to find the answer to life.
It's common(?) knowledge that the answer to life is 42: https://www.independent.co.uk/life-style/history/42-the-answer-to-life-the-universe-and-everything-2205734.html.

Shakespeare must have known this as well. Run the following code (get_hamlet needs to be working):
ANSWER_TO_LIFE = 42

def answer_to_life():
    text = get_hamlet()
    idx = text.find('To be,')
    ans = text[idx:idx+ANSWER_TO_LIFE]
    return ans

print(answer_to_life())

Step 7: Implement the following:
def clean_hamlet(text):
    return text

• Remove everything before the start of the play (i.e. the play starts with the line: The Tragedie of Hamlet)
• Remove everything after the end of the play (i.e. the play ends after the final line; hint: the final line starts with FINIS)
• Hint: use the search method for regular expressions (see Regular Expressions Part 3)
• Remove any leading or trailing whitespace
• Be sure to test your code before moving on
• Do not hard code indices (e.g. return text[2345:4509])
• If you find yourself using \n\r\t, you're on the wrong path. The auto-grader uses the same version of Hamlet, but the whitespace is not the same as what's on Project Gutenberg -- this was not done on purpose; it's the result of what happens when you download/upload text documents between different architectures.
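A rough sketch of the approach, assuming the search patterns below (adjust them to what the actual text contains):

import re

def clean_hamlet(text):
    # locate the opening line of the play and the closing FINIS line
    start = re.search(r'The Tragedie of Hamlet', text)
    end = re.search(r'FINIS.*', text)
    if start and end:
        # keep everything from the start of the play through the FINIS line
        text = text[start.start():end.end()]
    return text.strip()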

Step 8: Implement the following:
def find_lucky(text, num):
    lucky = []
    return sorted(lucky)

The function find_lucky parses/tokenizes text (see rules below) and returns a sorted list of words if the text is 'lucky' (see the definition above). Otherwise, it returns the empty list.

The following rules apply to tokenize and classify words:
• Use the re module to tokenize the text
• A token is a word that contains only letters and/or apostrophes (e.g. who, do's, wrong'd).
• Normalize the token to lower case. For this lesson you can keep quoted words (it won't affect the answer), but ideally you would remove the quotes (e.g. 'happy' would become happy).

For example, if the parameter num is 7, then it returns an array of words ONLY if all the following conditions are true:
• each word has 7 characters
• each word occurs 7 times in the text
• there are 7 of these words

For example:
text = """
A boy, a cat, a rat and a dog were friends.
But the cat ate the rat. The dog ate the cat.
The boy? The boy and dog were friends.
"""
print(find_lucky(text, 3))

Should return 3 words ('boy', 'cat', 'dog')
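One hedged way to structure find_lucky, using re to tokenize and collections.Counter for the frequency table (the token pattern below is an assumption that follows the rules above):

import re
from collections import Counter

def find_lucky(text, num):
    # tokens: letters and/or apostrophes only, normalized to lower case
    tokens = [t.lower() for t in re.findall(r"[A-Za-z']+", text)]
    counts = Counter(tokens)
    # words that are num characters long and occur exactly num times
    lucky = [w for w, c in counts.items() if len(w) == num and c == num]
    # the text is 'lucky' only if there are exactly num such words
    return sorted(lucky) if len(lucky) == num else []

With the example text above, find_lucky(text, 3) returns ['boy', 'cat', 'dog'].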

Step 9: Test your function:
def test_777():
    hamlet = clean_hamlet(get_hamlet())
    print(find_lucky(hamlet, 7))

# comment me out before submitting!!
test_777()

See if Hamlet has any lucky numbers (put this code inside the function test_777):
for n in range(2, 10):
    print(n, find_lucky(hamlet, n))

Step 10: Submit notebook to Gradescope:
• Go to gradescope.com and sign up (or log in if you have already signed up). You MUST use your @illinois.edu address. Hit the "Sign Up" button at the top of the page.
• The class code is 9YGP8E
• Comment out any testing code that exists outside of any function.
• Download your notebook as a .py file
• Rename the file solution.py
• Submit that file to the Gradescope assignment named Lucky777

Final Submission:
Submission Process for Repl.it Credit:
To submit this repl.it lesson, the ONLY code that needs to pass is get_book_id (lesson.py). The testing framework will attempt to download and read it.
def get_book_id():
    # return the id of Hamlet stored on your Google Drive

You can tag any question you have with py777 on Piazza
10.25.2019
All rights reserved

Addendum:
As part of working out the details for the 777 assignment (the idea came from reading about finding a reference to the hapax Hebenon), the following fun fact turned up: William Shakespeare had a fascination with the number 7. So the question to ponder (after you finish this assignment) is: did Shakespeare hide this fun fact inside Hamlet, or is it purely coincidental?
Readings and References:
• https://books.google.com/books?id=rn18DwAAQBAJ&pg=PT154&lpg=PT154&dq=William+Shakespeare++%22number+7
• https://books.google.com/books?id=MwBNel_aX0wC&pg=PA67&lpg=PA67&dq=shakespeare+numerology
• http://www.richardking.net/numart-ws.htm
• https://www.celebrities-galore.com/celebrities/william-shakespeare/lucky-number/
CC 05
Coding Challenges: Finding Characters
Prerequisites:
• UP: Regular Expressions
• DSP: Jupyter
• DSP: Ngrams

One of the goals of the Cliff Note Generator was to generate a list of characters that are in a novel. We can actually use our current skill set and include the techniques discussed in the nGrams lesson to extract (with a good level of accuracy) the main characters of a novel. We will also make some improvements with some of the parsing, cleaning, and preparation of the data. It would be best to read this entire lesson before doing any coding. Also note that this lesson is a bit different in that you will be responsible for more of the code writing. What is being specified is a minimum. I highly recommend that you decompose any complex processes into multiple functions.
Step 0: Start a New Colab Notebook and name it INFO490-FindingCharacters

Step 1: Copy your working solution from the DSP Jupyter Lesson into a new code cell. Test it. The required functions are also given in lesson.py.

Step 2: Copy your working solution from the DSP Ngrams Lesson into a new code cell. Test it. These functions are also given in lesson.py. Note that load_stop_words is already finished.

Step 3: Finding the Characters
With this machinery in place, we are ready to find characters in a novel (I hope you are reading this with great anticipation) using different strategies. Each of the strategies has a function to implement that strategy.

Attempt #1
One attribute (or feature) of the text we are analyzing is that proper nouns are capitalized. Let's capitalize on this and find all single words in the text whose first character is an uppercase letter and that are NOT stop words.

Create and define the function find_characters_v1(text, stoplist, top):

• Tokenize and clean the text using the function split_text_into_tokens
• Filter the tokens so the list has no stop words in it (regardless of case). The parameter stoplist is the array returned from load_stop_words
• Create a new list of tokens (keeping the order) of words that are capitalized. You can test the first character of the token.
• Return the top words as a list of tuples (the first element is the word, the second is the count)
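A hedged sketch of that filtering pipeline, assuming the split_text_into_tokens helper from the earlier lessons and using Counter.most_common in place of your top_n function:

from collections import Counter

def find_characters_v1(text, stoplist, top):
    tokens = split_text_into_tokens(text)   # helper from the Jupyter/Ngrams lessons
    stop = set(w.lower() for w in stoplist)
    # keep capitalized tokens that are not stop words (case-insensitive)
    keep = [t for t in tokens if t.lower() not in stop and t[0].isupper()]
    return Counter(keep).most_common(top)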

For Huck Finn, you should get the following (the output is formatted for clarity):
HUCK_ID = "13F68-nA4W-0t3eNuIodh8fxTMZV5Nlpp"
text = read_google_doc(HUCK_ID)
stop = load_stop_words()
v1 = find_characters_v1(text, stop, 15)
print(v1)

You should see:
('Jim', 341),
('Well', 318),
('Tom', 217),
('Huck', 70),
('Yes', 68),
('Oh', 65),
('Miss', 63),
('Mary', 60),
('Aunt', 53),
('Now', 53),
('Sally', 46),
('CHAPTER', 43),
('Sawyer', 43),
('Jane', 43),
('Buck', 38),

Notice that with this very simple method we found 8 characters in the top 15. You also found an Aunt and a Miss. You might be inclined to start fiddling with the stop words. The ones you could add are 'CHAPTER' and 'Well' -- the interjection -- since we know those words do not provide much content in this context. But as we mentioned in the nGrams lesson, that's a dangerous game, since other novels might include some of these.

Attempt #2
Another feature of characters in a novel is that many of them have two names (Tom Sawyer, Aunt Polly, etc).

Create and define the function find_characters_v2(text, stoplist, top):

• Tokenize and clean the text using the function split_text_into_tokens
• Convert the list of tokens into a list of bigrams (using your bi_grams method)
• Keep only the bigrams in which both words are capitalized (just the first character).
• Neither word (in either lower or upper case) should be in stoplist (remember stoplist could be the empty list)
• Return the top bigrams as a list of tuples: the first element is the bigram tuple, the second is the count

Note that we are NOT removing the stopwords from the text (see the lesson on ngrams). We are now using the stopwords to make decisions about the text.
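The same sketch extends naturally to bigrams (again assuming your bi_grams helper, with Counter.most_common standing in for top_n):

from collections import Counter

def find_characters_v2(text, stoplist, top):
    tokens = split_text_into_tokens(text)
    stop = set(w.lower() for w in stoplist)
    pairs = bi_grams(tokens)                 # helper from the Ngrams lesson
    keep = [(a, b) for a, b in pairs
            if a[0].isupper() and b[0].isupper()
            and a.lower() not in stop and b.lower() not in stop]
    return Counter(keep).most_common(top)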

With the text of Huckleberry Finn, the following is the output with stopwords being the empty list:
v2 = find_characters_v2(text, [], 15)
print(v2)

(('Mary', 'Jane'), 41),
(('Tom', 'Sawyer'), 40),
(('Aunt', 'Sally'), 39),
(('Miss', 'Watson'), 20),
(('Miss', 'Mary'), 19),
(('Mars', 'Tom'), 16),
(('Huck', 'Finn'), 15),
(('Uncle', 'Silas'), 15),
(('Aunt', 'Polly'), 11),
(('Judge', 'Thatcher'), 10),
(('But', 'Tom'), 9),
(('Ben', 'Rogers'), 8),
(('So', 'Tom'), 8),
(('St', 'Louis'), 7),
(('Miss', 'Sophia'), 7)

That found 11 characters in the top 15 of the bigram frequency table. This method is pretty good, and it didn't need to consider stop words. What happens if you do consider stop words?

Note: in order to match these outputs, use the collections.Counter class. Otherwise, it's possible that your version of sorting will handle tuples with equal counts differently (unstable sorting).

Titles
Another feature of characters is that many of them have a title (also called an honorific) preceding them (Dr. Mr. Mrs. Miss Ms. Rev. Prof. Sir, etc.). We will look for bigrams that have these titles. However, we will NOT hard code the titles. We will let the data tell us what the 'titles' are.

Here's the process used to self-discover titles:
• Let's define a title as a capital letter followed by 1 to 3 lower case letters followed by a period. This is not perfect, but it captures a good majority of them.
• Create a list named title_tokens of the tokens in the text that match the above criteria (hint: use regular expressions).
• You now have to remove words that might have ended a sentence with those same title characteristics (e.g. Tom. Bill. Pat. Etc.). Use the same definition as above, but instead of ending with a period, the token must end with whitespace. The idea is that hopefully somewhere in the text the same name will appear without a period. It's very likely that you would encounter 'Tom' somewhere in the text without a period, but it's unlikely that Mr., Mrs., Dr., etc. would appear without a period. Let's call this list pseudo_titles.
• The set of titles is essentially the first list of tokens, title_tokens, with all the tokens in the second set (pseudo_titles) removed. For example, the first list might have 'Dr.', 'Tom.' and 'Mr.' in it, and the second set might have 'Tom' and 'Ted' in it. The final title list would include 'Dr' and 'Mr'.
• Name the function that encapsulates the above logic get_titles; it should return a list of titles.
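A minimal sketch of that logic (the exact regular expressions are one reasonable reading of the rules above; tweak them until you match the expected output below):

import re

def get_titles(text):
    # tokens shaped like a title: capital letter + 1-3 lower case letters + a period
    title_tokens = set(re.findall(r'\b([A-Z][a-z]{1,3})\.', text))
    # the same shape, but followed by whitespace instead of a period
    pseudo_titles = set(re.findall(r'\b([A-Z][a-z]{1,3})\s', text))
    return sorted(title_tokens - pseudo_titles)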

Once you have get_titles working, the following should work:
titles = get_titles(text)
print(titles)

You should get 7 computed titles in Huckleberry Finn:
['Col', 'Dr', 'Mr', 'Mrs', 'Otto', 'Rev', 'St']

Attempt #3
Create and define the function find_characters_v3(text, stoplist, top):
• Tokenize and clean the text
• Convert the list of tokens into a list of bigrams
• Keep only the bigrams in which the first word is a title and the second word is capitalized (hint: use the output of get_titles)
• The second word (in either lower or upper case) should not be in stoplist
• Return the top bigrams as a list of tuples: the first element is the bigram tuple, the second is the count

v3 = find_characters_v3(text, load_stop_words(), 15)
print(v3)

For Huck Finn, you should get the following:
(('St', 'Louis'), 7),
(('Mr', 'Lothrops'), 6),
(('Mrs', 'Phelps'), 4),
(('St', 'Petersburg'), 3),
(('Dr', 'Robinson'), 3),
(('Mr', 'Garrick'), 2),
(('Mr', 'Kean'), 2),
(('Mr', 'Wilks'), 2),
(('Mr', 'Mark'), 1),
(('Mrs', 'Judith'), 1),
(('Mr', 'Parker'), 1),
(('Dr', "Gunn's"), 1),
(('Col', 'Grangerford'), 1),
(('Dr', 'Armand'), 1),
(('St', 'Jacques'), 1)

Clearly, that yields a lot of good information, although looking at the counts, none of them are that prominent. We also found a few places as well as people.

Machine Learning?
You may have heard of the NLTK Python library, a popular choice for processing text. We will use both the NLTK and spaCy NLP libraries to do something similar in another lesson. These libraries include models built from large data sets that can extract entities (called NER, for named entity recognition). These entities include organizations, people, places, and money.

The models that were built essentially learned what features (like capitalization or title words) were important when analyzing text and came up with a model that attempts to do the same thing we did here. However, we hard coded the rules (use bigrams, remove stop words, look for capital letters, etc). This is sometimes referred to as a rule-based system. The analysis is built on manually crafted rules.

In machine learning (sometimes referred to as an automatic system), some of the algorithms essentially learn which features are important (or how much weight to apply to each feature) to build a model, and then use the model to classify tokens as named entities. The biggest issue is that these models may have been built with a very different text source (e.g. journal articles or a Twitter feed) than what you are processing. Also, the models themselves require a large set of resources (memory, CPU) that you may not have available. What you built in this lesson is efficient, fast, and fairly accurate.

Submission Guidelines:
You will upload your notebook to Gradescope.com for grading.
• Do NOT use any external Python library other than collections and re (nothing else).
• Do NOT use the zip function (we will soon, though).
• Try to solve all of these problems by yourself, with your own brain and a piece of paper. Surely there are solutions available, but copying will not make you a better programmer. This is not the time to copy or share code.
• Test the code you are writing against sample sentences instead of the full text; once you have it working, then try the full data set.
• You are free to write as many helper functions as you need. The following functions will be tested:
  • get_titles
  • find_characters_v[1-3]
• Each of the find_characters_v functions should use your top_n function.
• The output of find_characters_v should always be a list of tuples AND match the example output before you 'run tests'.

Before you submit:
• Be sure to comment out all print statements -- especially those inside of loops.
• To speed up the grading process, comment out any testing code/cells.
• When you download your notebook (as Python code), you must name it solution.py before you upload it.

Replit Credit:
Once you submit, return the URL of your shared Google notebook via the jupyter function in lesson.py
You can tag any question on Piazza with FindingChars.
CC 06
Coding Challenges:
Harry Potter and the Plotting of Characters (part 1)

Prerequisites:
• CC 05: Finding Characters
• Named Parameters
• NLP
• Numpy (part 1)
• Matplotlib Introduction
Do not start this lesson until all of the above lessons have been submitted successfully.
This project builds on the finding characters project. You will create a new notebook, but you can copy all of the working code from the previous challenge.
Plotting Characters across Chapters
This lesson will bring together your Numpy skills with what you learned about finding characters from the ngrams lesson to build a visualization like the one below. It shows the main characters of The Adventures of Huckleberry Finn and the cumulative count of their occurrences throughout the novel.

Lesson Assignment
We will build a similar graph for Harry Potter and the Sorcerer's Stone. This is part 1 of that process.

1. Create a New Notebook
Be sure you are logged into your Google Account using your @gmail email. Go to https://colab.research.google.com and create a new Python 3 notebook. Name it INFO490HP-P1. Be sure to save it into your personal drive space.

2. Access Remote Resource
In lesson.py is the Google Drive ID for the text of Harry Potter and the Sorcerer's Stone.
Write a function named get_harry_potter() that returns the text of that remote resource. You should follow the good coding convention of writing and using single-task helper functions (note that you have already done this in previous lessons). The following should work:
hp = get_harry_potter()
print(len(hp))

You must use valid Python code to gain access to remote resources. You cannot use any Jupyter specific code (e.g wget, curl, etc)

3. Clean Data
Write a function named clean_hp that does the following to its incoming string parameter:
• remove all header information up until the title of the book
• remove all leading and trailing whitespace
• you can keep 'THE END' as well as all the page numbers
• return the cleaned text
hp = clean_hp(get_harry_potter())
print(len(hp))

4. Find Characters
Copy your working solution for load_stop_words, bi_grams, top_n, find_characters_v1 and find_characters_v2 as well as any helper functions they depend upon.

Make the following changes:
def load_stop_words
• use spaCy to load its stop words
• add a named parameter (called add_pronouns) to the function with a default value of False. If add_pronouns is True, add the pronouns (found in lesson.py) to the returned list

def bi_grams
• use the nltk ngrams function inside your bi_grams function to turn the incoming list of tokens into a list of tuples
• remove your original implementation (or rename it to bi_grams_v1)

def split_text_into_tokens
• keep the same solution (using regular expressions to tokenize)
• augment the normalization step to strip off the possessive of any token that ends with 's (e.g. Harry's becomes Harry)

def find_characters_v1
• change the parameter stopwords to have a default value of an empty list
• change the parameter top to have a default value of 15

def find_characters_v2
• change the parameter stopwords to have a default value of an empty list
• change the parameter top to have a default value of 15
• return a two-element tuple where the first element is the combined elements of the bigram. So instead of returning, for example,
(('Uncle', 'Vernon'), 97) you would return
('Uncle Vernon', 97).
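A sketch of what the modified helpers might look like, assuming spaCy's English stop-word set, nltk.util.ngrams, and a PRONOUNS list as provided in lesson.py (the token regex is also an assumption):

import re
from nltk.util import ngrams
from spacy.lang.en.stop_words import STOP_WORDS

def load_stop_words(add_pronouns=False):
    words = list(STOP_WORDS)
    if add_pronouns:
        words += PRONOUNS        # the pronoun list given in lesson.py
    return words

def bi_grams(tokens):
    # nltk's ngrams returns a generator of tuples; make it a list
    return list(ngrams(tokens, 2))

def split_text_into_tokens(text):
    tokens = re.findall(r"[A-Za-z']+", text)
    # strip the possessive 's (e.g. Harry's -> Harry)
    return [t[:-2] if t.endswith("'s") else t for t in tokens]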

The following code should now work:
hp = clean_hp(get_harry_potter())
stop1 = load_stop_words(True)
stop2 = load_stop_words()
print(find_characters_v1(hp, stop1, 10))
print(find_characters_v2(hp, stop2, 10))

5. NLP for four
Write the function find_characters_nlp that has two parameters: the text to process, and top, which has a default value of 15. It does the following:
• use spaCy's named entity recognizer to pull out all people
• return the top list of characters found (just like v1 and v2)
• this is the only place you should be using spaCy to tokenize text

You should now run the following code and carefully analyze the results:
print(find_characters_nlp(hp))
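For reference, the NER step might look roughly like the sketch below, assuming the same spaCy English model used in the NLP lesson (en_core_web_sm is an assumption):

import spacy
from collections import Counter

def find_characters_nlp(text, top=15):
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    # keep the text of every entity the model labels as a person
    people = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    return Counter(people).most_common(top)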

A few questions for which you should remember the answers:
• What did you notice about the running time for v1, v2 and the nlp version?
• Which version found Hermione?
• Which version found Voldemort?
• How much do you have to increase the top parameter to find them?

You can use the time module for simple timing if you want to know the exact time spent on each algorithm:
import time

start = time.time()
print("hello")
end = time.time()
print(end - start)

I think we can agree that without any human intervention (other than writing and running code), we could build an algorithm that uses the results from above and decides that the following characters are central to Harry Potter and the Sorcerer's Stone:
• Harry
• Ron
• Hagrid
• Hermione

Note that we would probably miss Voldemort even though that character is important to the novel (we might miss that question on our 9th grade English test if we relied on our code to "read" the book for us). Can you think of any analysis that might bring Voldemort to the forefront? Make a post on Piazza with any ideas you have (try to keep all ideas on a single threaded post). This is a conversation starter, not a requirement.

6. Data By Chapter
Looking at the graph that we need to build for this lesson, it's clear that we are going to need occurrence counts for the four characters for each chapter. Ideally, our data would look like the following (the numbers are made up):
harry_by_chapter = [20, 79, 68, ...] # 17 numbers
ron_by_chapter = [ 0, 73, 14, ...] # 17 numbers
hagrid_by_chapters = [14, 0, 0, ...] # 17 numbers

Note that each column is the data for each chapter.

Write a function named split_into_chapters that uses a regular expression to split the parameter text into an array of chapters:
def split_into_chapters(text)
• return an array whose elements are the text of each chapter
• each element is trimmed of leading and trailing whitespace
• each element can start with the title of the chapter or with the first word of the chapter (ideally it would be the latter, but the regex is a bit more complicated)

Note: if you had to split a novel using the titles of the chapters (i.e. there is no single pattern that uniquely captures each chapter), you would need to do something like the following (the example shows only the first 2 chapters):
def split_into_chapters(text):
    # this is not the way you should solve this
    m1 = re.search(r"^YOU don't know about me", text, re.M)
    m2 = re.search(r"^WE went tiptoeing along", text, re.M)
    m3 = re.search(r"^WELL, I got a good going", text, re.M)
    chp1 = text[m1.span()[0]: m2.span()[0]]
    chp2 = text[m2.span()[0]: m3.span()[0]]
    return [chp1, chp2]

The ^ anchor (with re.M) insists that you match at the start of a line, so you uniquely capture the correct text. Clearly this is a last-resort solution.
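If the chapter headings in your copy of the text follow a single consistent pattern, re.split is much simpler. The heading pattern below is hypothetical; inspect the text and adjust it:

import re

def split_into_chapters(text):
    # hypothetical pattern: headings such as "CHAPTER ONE", "CHAPTER TWO", ...
    parts = re.split(r'CHAPTER\s+[A-Z]+', text)
    # drop anything before the first heading and trim whitespace
    return [p.strip() for p in parts[1:]]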

7. Character Counts.
Use NumPy to easily get the counts for the four main characters for each chapter. Create a new cell (we won't use a function for now) that creates an array holding the total number of occurrences of each name in each chapter.

The NumPy lesson has the function to get the counts from a string (and an example). As mentioned previously, when you are done your arrays should look like this (but as NumPy arrays, not Python lists):
harry_by_chapter = [20, 79, 68, ...] # 17 numbers
ron_by_chapter = [ 0, 73, 14, ...] # 17 numbers
hagrid_by_chapter = [14, 0, 0, ...] # 17 numbers

Note:
If a character is referenced using multiple names or nicknames (something our analysis has not handled), you could combine them:
harry = [20, 79, 68]
potter = [21, 2, 4]
harry_potter = [5, 2, 0]

hp_counts = harry + potter - harry_potter
# [36, 79, 72]

This adds all 'Harry' and 'Potter' references together but adjusts for the double counting when the full reference "Harry Potter" is made. Think of properly counting characters in the following sentences: "Harry?" "Is that you, Potter?" "I'm not kidding, Harry Potter, I need to see you NOW." There are 3 references to the same character (not counting pronouns). Do not do this, but it is something to keep in mind.

Finish this implementation:
def get_character_counts_v1(chapters):

    harry = ...
    ron = ...
    hagrid = ...
    hermione = ...

    return np.array([harry, ron, hagrid, hermione])
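As a hedged pointer for the counting itself: np.char.count does an element-wise substring count over an array of strings, which matches the idea described in the NumPy lesson (it may or may not be the exact function shown there). A tiny example:

import numpy as np

chapters = np.array(["Harry met Ron. Harry waved.", "Ron ran."])
print(np.char.count(chapters, "Harry"))   # -> [2 0]
print(np.char.count(chapters, "Ron"))     # -> [1 1]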

8. Plotting.
Using the same set up as in the Matplotlib lesson, plot each character:
def simple_graph_v1(plots):

    fig = plt.figure()
    subplot = fig.add_subplot(1, 1, 1)

    subplot.plot(plots[0])
    subplot.plot(plots[1])
    subplot.plot(plots[2])
    subplot.plot(plots[3])

    # this is important for testing
    return fig

Note that we are now calling the plot method on the returned subplot (a.k.a axes object) from the add_subplot method. In a previous lesson we called subplot.bar(x_pos, counts) to generate a bar graph. The plot method generates a line graph.

Once that is done the following should work (be sure to test this):
def pipeline_v1():

    hp = clean_hp(get_harry_potter())
    chapters = split_into_chapters(hp)

    plots = get_character_counts_v1(chapters)
    fig = simple_graph_v1(plots)
    return fig
You should see something like the following:

This doesn't really look like the graph we are aiming for, but it's a good start. We can see the counts for the four main characters of the novel, and we found the characters without reading a single word!! (Maybe we shouldn't celebrate this?)

Part 2 of this assignment will use Numpy and Matplotlib to do some data wrangling, fix the visualization, make the pipeline generic, and add some details.

Submission Credit

Notebook Prep:
Before submitting your notebook, be sure to comment out any print statements that print out a significant amount of text. Also, comment out any calls to find_characters_* (this will speed up the autograder):
#print(find_characters_v1(hp, stopwords, 10))
#print(find_characters_v2(hp, stopwords, 10))
#print(find_characters_nlp(hp, 10))

1. Save your notebook as a .py file and upload that file to Gradescope for grading (be sure you name the saved file solution.py). The Gradescope assignment name is HarryPotter-Part1.
Use the same spaCy English model used in the NLP lesson.
2. Be sure to share your notebook and have the function jupyter (lesson.py) return the full URL. Once that is done, you can hit submit.

You can tag any question with HPP1 on Piazza

CC 08
Coding Challenges:
Harry Potter and the Plotting of Characters (part 2)

Prerequisites:
• Harry Potter Part 1
• Numpy (part 2)
• Matplotlib (part 2)
Do not start this lesson until all of the above lessons have been submitted successfully.

This lesson builds on the Harry Potter part 1 lesson. You should copy the notebook you used for that lesson so you have access to where you left off. Be sure to comment out all code that references spaCy -- we will not use that module, nor will it be available on the grader.

Plotting Characters across Chapters
The last Harry Potter lesson had this for its pipeline:
def pipeline_v1():

    hp = clean_hp(get_harry_potter())
    chapters = split_into_chapters(hp)

    plots = get_character_counts_v1(chapters)
    fig = simple_graph_v1(plots)
    return fig

Hopefully you saw something like the following:

The goal for this lesson is to produce a graph similar to the following one, done for Huckleberry Finn:

Step 1. Better Data Pipeline
The issue with the first plot above (it has several) is that if we wanted to analyze five characters, we would have to add a new variable, another parameter to the function, and update the code as well. This is a symptom of poor design -- a "code smell".

It would be better to pass in a single array (a multi-dimensional array) where each element is an array of chapter counts for one character. Also, since our data is already in an array (i.e. chapters), it's a bit more Pythonic (Numponic?) to keep everything in arrays.

Write a function named get_character_counts_v2 whose parameter names is a list of characters (as strings) for which data will be prepared. It returns a single NumPy array that represents the character counts for each chapter. This function performs the same task as get_character_counts_v1 from the previous lesson.
def get_character_counts_v2(chapters, names):

    # use the same function as v1
    # use a comprehension to easily get things done:
    py_data = [ <CODE GOES HERE> for n in names ]
    counts = np.array(py_data)
    return counts

# test it
who = ["Ron", "Hagrid"]
print(get_character_counts_v2(chapters, who))

Look at the shape of the np.array that is returned by the function. Make sure you understand it. Also note how the data is different than the v1 version.
Once complete, implement the following function that puts all the parts together in a coherent data pipeline:
def pipeline_v2(names):
    hp = clean_hp(get_harry_potter())
    chapters = split_into_chapters(hp)

    np_hp_data = get_character_counts_v2(chapters, names)

    print(np_hp_data.shape)
    return np_hp_data

who = ["Harry", "Ron", "Hagrid", "Hermione"]
print(pipeline_v2(who))

Step 2. Better Data, Better Plotting?
Let's update the graph function to handle a single array:
def simple_graph_v2(counts):
    fig = plt.figure()
    subplot = fig.add_subplot(1, 1, 1)
    subplot.plot(counts)
    return fig  # return the figure

# test it
who = ["Harry", "Ron", "Hagrid", "Hermione"]
data = pipeline_v2(who)
simple_graph_v2(data)

You will get something like the following diagram.

Clearly something is wrong: the lines being drawn are the chapters. The issue is that if you pass a single 2D array to plot, you need to get the data into the proper shape first. Looking at the matplotlib documentation (which we all should be doing) (https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.plot.html#matplotlib.axes.Axes.plot), it says the following:

.. The parameters can also be 2-dimensional. Then, the columns represent separate data sets.

So we need to transform our data. Right now the columns represent chapters.
harry_by_chapter = [20, 79, 68, ...] # 17 numbers
ron_by_chapter = [ 0, 73, 14, ...] # 17 numbers
hagrid_by_chapters = [14, 0, 0, ...] # 17 numbers

We need each column to be the data for each character:
[
#HP #Ron #Hagrid
[20, 0, 14]
[79, 73, 0]
[68, 14, 0]
...
]

This is data wrangling!!

Update get_character_counts_v2 to use np.transpose on the data (see the NumPy lesson) to get the array into the correct format. Be sure to print out the results and make sure you understand the data.
def get_character_counts_v2(chapters, names):
    counts = ...
    # transpose the data here
    return counts
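A quick illustration of the transpose, using made-up numbers:

import numpy as np

counts = np.array([[20, 79, 68],   # Harry, by chapter
                   [ 0, 73, 14],   # Ron
                   [14,  0,  0]])  # Hagrid
print(counts.shape)              # (characters, chapters)
print(np.transpose(counts))      # rows are now chapters, columns are characters
print(np.transpose(counts).shape)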

You can now pass the transposed data into simple_graph_v2 and see the first graph you created using the individual characters (which is where we started, but now we have a more flexible pipeline and you're using NumPy, a close friend of Matplotlib):

Step 3. Smoothing out the rough edges
Although this graph does present good information, it is a bit harsh to look at. Instead of plotting the raw counts across the chapters, plot the cumulative sum for each of the characters. You can use NumPy's np.cumsum() method. It creates an array whose elements represent a running sum of the input:
values = [1, 0, 2, 0, 3]
print(np.cumsum(values))

You should see: [1 1 3 3 6]
Update the get_character_counts_v2 function to get the cumulative sums for each array:
def get_character_counts_v2(chapters, names):
    counts = ...
    # counts = np.cumsum( ... )
    # now transpose counts
    return counts

Note that you can do this either before or after the data transposition, but you will need to play with the axis parameter (since our data is two dimensional). Your final data should have columns of monotonically increasing values (always increasing or remaining constant, never decreasing). A simple example is given below (for 3 characters):
[
 [ 37   9   6]
 [ 37  63   6]     monotonically
 [ 37 113   6]     increasing
 [ 46 125   7]     values
 [ 51 129   7]
 ...
]
Your data must still have the shape of (# of chapters, # of characters). Your final graph (still very simple) should have the smooth look similar to the one that started off the lesson.
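A small demonstration of cumsum on a 2D array, where axis controls the direction of the running sum (numbers taken from the example above):

import numpy as np

data = np.array([[37,  9, 6],
                 [ 0, 54, 0],
                 [ 0, 50, 0]])
# axis=0 accumulates down the rows, so each column becomes a running total
print(np.cumsum(data, axis=0))
# [[ 37   9   6]
#  [ 37  63   6]
#  [ 37 113   6]]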
Your updated pipeline should now look like this (be sure to move this cell to the bottom of your notebook -- it should be the last cell). Note that we moved creating the figure into the pipeline (and the figure is returned as well):
def pipeline_v2(names):
    hp = clean_hp(get_harry_potter())
    chapters = split_into_chapters(hp)
    np_hp = get_character_counts_v2(chapters, names)
    fig = simple_graph_v2(np_hp)
    return fig

who = ["Harry", "Ron", "Hagrid", "Hermione"]
fig = pipeline_v2(who)
Step 4. Gussy it Up
Define a function named simple_graph_hp(counts, names) that adds all the necessary embellishments to make your graph look like the one above:
• set the figure's title to 'HP Characters <your initials here>', with your initials (without the <>)
• add a grid background
• set the x and y axis labels
• set the x axis tick markers to be one per chapter (e.g. 1 through 17), but do NOT hardcode the number 17 (get it from the data)
• add a legend for each of the characters (you can pick the location)
• style your graph using either fivethirtyeight or seaborn
• make sure the labels don't get cut off
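Here is one hedged sketch of those embellishments using standard Matplotlib calls (the axis labels, style choice, and legend location are placeholders; use your own):

import numpy as np
import matplotlib.pyplot as plt

def simple_graph_hp(counts, names):
    plt.style.use('fivethirtyeight')
    fig = plt.figure()
    subplot = fig.add_subplot(1, 1, 1)
    subplot.plot(counts)
    subplot.set_title('HP Characters <your initials here>')
    subplot.grid(True)
    subplot.set_xlabel('Chapter')
    subplot.set_ylabel('Cumulative mentions')
    # one tick per chapter, taken from the data, not hardcoded
    n_chapters = counts.shape[0]
    subplot.set_xticks(np.arange(n_chapters))
    subplot.set_xticklabels(np.arange(1, n_chapters + 1))
    subplot.legend(names, loc='upper left')
    fig.tight_layout()   # keeps the labels from getting cut off
    return fig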
Step 5: Update pipeline_v2
• update the pipeline to call simple_graph_hp (not simple_graph_v2)
• return the figure created from the pipeline:
who = ["Harry", "Ron", "Hagrid", "Hermione"]
fig = pipeline_v2(who)
If your graph looks good, you are good to submit.
Insights?
Can you find a few interesting facts about the characters in Harry Potter and the Sorcerer‘s Stone using only the visualization? You can place your insights in a comment.

Submission Credit
1. Prep for Submission
• Be sure to remove/comment out all code that uses NLTK or spaCy
• Share your notebook and return the URL in lesson.py (jupyter())

2. Upload for Grading
• Save your notebook as a .py file, upload that file to Gradescope for grading (be sure you name the saved file solution.py).
• The Gradescope assignment name is HarryPotter-Part2

3. Be Proud
You should feel pretty good about your programming chops. You proved that you can use NLP or your own algorithms to flesh out characters and plot their occurrences throughout a novel. You can also play with different characters -- your code should adapt to any number of characters you want to graph. That is powerful.
You can tag any question on Piazza with HPP2.

Notebook Prep Notes:
Before submitting your notebook, be sure to comment out any print statements that print out a significant amount of text. Also, comment out any calls to find_characters_* (this will speed up the autograder):
#print(find_characters_v1(hp, stopwords, 10))
#print(find_characters_v2(hp, stopwords, 10))
#print(find_characters_nlp(hp, 10))

Those functions won't work anyway, since neither spaCy nor NLTK is part of this assignment.

CC 07
Coding Challenges: Finding Mr. Average
The project is about using NumPy to find the most "average" NBA basketball player with respect to age, height, and weight among 502 players. You will copy a notebook that already has the data in it.

Step 1: Log into Google
Be sure you are logged into Google using your @gmail address so your notebook will be saved to your drive.

Step 2: Copy the Template
• Bring up the following notebook:
https://colab.research.google.com/drive/1tFyjhnjd2zaN_joDi-n7dt8X7qQ8PfgV
• Save it to your Google Drive:

Step 3: The notebook has the following data arrays defined as Python lists
players = ['Aaron Gordon', 'Aaron Holiday' ...
teams = ['ORL', 'IND',
years_old = [23, 22,
height_inches = [81, 73,
weight_pounds = [220, 185,

They hold the attributes of NBA (basketball) players. There's no need to worry if you are not a fan of basketball. Basically, it defines 4 attributes for over 500 players.

• Create a new code cell that changes each of these lists into NumPy arrays. You MUST name the variables as follows:
np_players = ...
np_teams = ...
np_years_old = ...
np_height_inches = ...
np_weight_pounds = ...

• In the same cell define two constants (you must use proper Python naming conventions). One will hold the number to convert inches into meters; the other will hold the number to convert pounds into kilos.

• Use the above constants and NumPy's element-wise operations to convert np_height_inches to np_height_meters and np_weight_pounds to np_weight_kilos. Feel free to revisit the NumPy lesson to refresh yourself on element-wise operations. If you find yourself using loops and logic, you should revisit the NumPy lesson.
np_height_meters = ...
np_weight_kilos = ...
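For reference, the unit constants are plain facts (1 inch = 0.0254 meters, 1 pound is about 0.453592 kilograms), and the conversion is a single element-wise multiplication. A sketch, assuming the list variables defined in the template notebook:

import numpy as np

np_height_inches = np.array(height_inches)
np_weight_pounds = np.array(weight_pounds)

INCHES_TO_METERS = 0.0254
POUNDS_TO_KILOS = 0.453592

# element-wise operations: no loops needed
np_height_meters = np_height_inches * INCHES_TO_METERS
np_weight_kilos = np_weight_pounds * POUNDS_TO_KILOS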

Step 4: Who's the shortest and tallest?
Using only NumPy functions and operations, define the following two functions. Each will return the index of the player:
def shortest_player(np_h):
    return -1

def tallest_player(np_h):
    return -1

You should be able to do the following:
print(players[shortest_player(np_height_meters)])

You can solve this in many different ways, but you MUST use NumPy. You can even use the function np.where to find the indices matching an arbitrary mask:
def example():
    mask = (np_years_old == 24) & (np_teams == 'MIL')
    idx = np.where(mask)
    print(np_players[idx])

Note that the index returned could be an np.ndarray, a scalar (numpy.int64), or a tuple, and indexing into np_players would still work.
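np.argmin and np.argmax are one straightforward route to the index (a sketch; the np.where approach above works too):

import numpy as np

def shortest_player(np_h):
    return np.argmin(np_h)   # index of the smallest height

def tallest_player(np_h):
    return np.argmax(np_h)   # index of the largest height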

Step 5: Who is Mr. Average?
Let's find out which NBA player is the closest to the average prototype (Mr. Ave) across three attributes: age, height (meters), and weight (kilos). Mr. Ave's age, height, and weight are the averages of the respective attributes.

Your goal is to find the player who is closest to Mr. Ave. However we need to define a few metrics.

Closeness: There are many metrics available to measure the distance between points. In fact, almost all machine learning, data mining, etc. algorithms (e.g. clustering, classification, neural networks, natural language processing) need a metric (i.e. an indicator) to measure 'error' in order to minimize it. These metrics are also called similarity metrics.

In a single dimension, the absolute value of the difference between the two values is used. In two dimensions, Euclidean distance (Pythagorean theorem, anyone?) can be used. We are going to use a metric called Manhattan distance. Note that both Euclidean and Manhattan distance generalize to multiple dimensions.

We will use Manhattan distance to find who is most like Mr. Ave. The picture below shows two players (P1 and P2). Since this is best explained using two dimensions, we can assume that the x axis is age and the y axis is weight. The distance between the two players can be defined as d1 + d2, where d1 is the distance between the x coordinates/attributes (age) and d2 is the distance between the y coordinates/attributes (weight):

So distance between P1 and P2 = | P1.x - P2.x | + | P1.y - P2.y |
The | is the symbol for absolute value. We use absolute value since we only care about the distance between the attributes.
In order to generalize this (i.e. easily write a formula) to more attributes (dimensions), we will change the notation. Each player can be defined as an array of attributes (a vector):
P1 = (x1, x2)
P2 = (y1, y2)
the first column (x1, y1) is the set of values for attribute 1 (age)
the second column (x2, y2) is the set of values for attribute 2 (weight)

We use capital X to denote the vector of attributes for point 1 (i.e. player 1)
We use capital Y to denote the vector of attributes for point 2 (i.e. player 2)
So the distance between X and Y using the Manhattan metric would be
D(X,Y) = | x1 - y1 | + | x2 - y2 |

If we had three dimensions (e.g. age, weight, height)
P1 = (x1, x2, x3)
P2 = (y1, y2, y3)
D(X,Y) = | x1 - y1 | + | x2 - y2 | + | x3 - y3 |

Formally, we can define Manhattan distance (also called the L1 norm, taxicab, or city-block distance) with the following formula:

D(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|   (a sum over all n attributes)

Sometimes k is used instead of n to denote the sample or instance (think row number).

As we saw in the NumPy lesson, there are aggregation functions that can be used to derive and build metrics/statistics on an np.array.
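To make the formula concrete with NumPy: the distances of every player to Mr. Ave can be computed at once with element-wise operations and aggregation. This is only a sketch of the idea, not the required structure of find_mr_average:

import numpy as np

def manhattan_to_average(x, y, z):
    # one |difference to the mean| per attribute, summed element-wise across players
    dist = (np.abs(x - x.mean())
            + np.abs(y - y.mean())
            + np.abs(z - z.mean()))
    return np.argmin(dist)   # index of the player closest to Mr. Ave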

Define and build the following function:
def find_mr_average(players, x, y, z, v=1):
    # x, y, z are parallel np.arrays;
    # each holds one of 3 attributes for the NBA players
    # return a tuple that contains the index of the player
    # who is closest to Mr. Ave and that player's name
    # v is only used if doing extra credit

Notes:
• Mr. Ave has 3 attributes: the average age, the average height, the average weight
• Use Manhattan distance as defined above to determine the distance from each player to Mr. Ave
• The tuple returned contains 2 values:
  • the first value is the index of the player most similar to Mr. Ave
  • the second value is the player's name
• Do not normalize the attributes (see extra credit)
• The named parameter v is used for the extra credit part; you can ignore it for now
Your function will be tested using the following parameter order (age, height, weight):
idx, player = find_mr_average(np_players, np_years_old, np_height_inches, np_weight_pounds)

It will also be tested using meters and kilos.

Step 6: Submit your notebook to Gradescope.com for grading:
You can earn bonus points (see directions below), but this is the minimum required to submit for 100% credit. For this project it is possible to earn partial credit and extra credit (we will make use of that named parameter).

• Go to gradescope.com and sign up (or log in if you have already signed up). You MUST use your @illinois.edu address
• Download your notebook as a .py file
• Rename the file solution.py
• Submit that file (solution.py) to the Gradescope assignment named Mr. Average

Issues with submissions:
If your code doesn‘t get past the submission stage, here are the most common issues:
• be sure to name the file you submit solution.py
• use functions to test your code. Do NOT put any print statements at the module level (all print statements need to be inside functions). If you don't, you may see the following error:
Your results.json file could not be parsed as JSON. Its contents are as follows:

Submission Process for Repl.it Credit:
To submit this repl.it lesson, write a function named jupyter that returns a string which is the URL for sharing your notebook. Be sure to share your notebook and paste the sharing URL as a string. Test that the URL works by opening a different browser in which you are not logged into Google and seeing if it comes up.
Inside lesson.py:
def jupyter():
    # share your notebook and paste the shared URL here as a string

Extra Credit
There's one BIG problem with the current solution: each of the NBA attributes has a different scale. In general, for distance metrics, you want a change in one attribute to contribute to the distance as much as the same relative change in another attribute.

For example, for players in the NBA, age will usually be less than 40, while height in meters will be less than 2.5. Let's set up an example:

Mr Ave: 28 yrs old, 2.0 meters tall
Bob: 29 yrs old, 2.0 meters tall
Ned: 28 yrs old, 3.0 meters tall

Using the raw data, both Bob and Ned are equally close to Mr. Ave. However, Bob is much closer to Mr. Ave when you consider scaled data.

For extra credit, augment your function such that if the named parameter v is 10, it does the following:
• scale each of the attributes to [0, 1] using the standard min-max formula: (value - min) / (max - min).
NumPy has an aggregate function to calculate the denominator (np.ptp -- see the lesson on NumPy) if you want to use it.
• Once each attribute is scaled, the order in which the attributes (x, y, z) are passed will not affect the calculation. This version will indeed reveal the true Mr. Ave.
• Your function has to work when v == 1 (Manhattan metric, do not scale) and when v == 10 (Manhattan metric, do scale).
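A sketch of the scaling step using np.ptp (peak-to-peak, i.e. max - min):

import numpy as np

def scale_0_1(a):
    # min-max scaling: (value - min) / (max - min)
    return (a - a.min()) / np.ptp(a)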
You can tag any question you have with pyAve on Piazza

11.15.2019
All rights reserved

Readings and References:
?https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d
?photo credit: https://www.probasketballtroops.com/average-nba-height
