cs262 Programming Languages (2) Lexical Analysis

Date: 2015-05-29 15:27:20


The important material in this lecture starts at 13-Specifying Tokens. But right at the start, this shows up:

def t_RANGLES(token):
    r'>'
    return token

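Nothing earlier in the course explained where this came from, so it was confusing, especially the r'>' part: what syntax is that? The first time through I gave up at this point. Later I learned that the course uses the PLY library, and found this in its documentation: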

When a function is used, the regular expression rule is specified in the function documentation string.

Only then did it click that r'>' is just a docstring, written as a raw string literal. It put on a disguise and I no longer recognized it.
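A quick way to convince yourself of this without PLY is to define the same function in plain Python and look at its __doc__ attribute. The sketch below only shows where the string ends up; that PLY reads the pattern from there is what the documentation quoted above says.

```python
def t_RANGLES(token):
    r'>'
    return token

# A string literal that is the first statement of a function body becomes
# the function's docstring; the r prefix only makes it a raw string.
print(t_RANGLES.__doc__ == '>')
```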

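The order in which tokens are defined matters. PLY tries function-defined rules in the order they appear in the file, so a more specific rule has to come before a more general one.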
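To see why the order matters, here is a minimal sketch that uses only the standard re module rather than PLY. It assumes, as the PLY documentation describes, that function rules are joined into one master regular expression in definition order. The helper name first_match is made up for this illustration.

```python
import re

HEX = r'0x[0-9a-f]+'
DEC = r'[0-9]+'

def first_match(patterns, text):
    """Return what the first fitting alternative matches at position 0."""
    master = re.compile('|'.join('(?:%s)' % p for p in patterns))
    return master.match(text).group(0)

# Hex rule listed first: the whole literal is consumed.
print(first_match([HEX, DEC], '0x19'))
# Decimal rule listed first: only the leading 0 is consumed.
print(first_match([DEC, HEX], '0x19'))
```

With the decimal rule first, the lexer would emit a NUM for 0 and then have to deal with the leftover x19, which is why the hex rule has to be defined first.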

Handling comments in HTML.

There is not much else to summarize for this lecture; it is mostly about PLY.

Homework:

# Hexadecimal Numbers
# 
# In this exercise you will write a lexical analyzer that breaks strings up
# into whitespace-separated identifiers and numbers. An identifier is a
# sequence of one or more upper- or lower-case letters. In this exercise,
# however, there are two types of numbers: decimal numbers, and
# _hexadecimal_ numbers. 
#
# Humans usually write numbers using "decimal" or "base 10" notation. The 
# number 234 means 2*10^2 + 3*10 + 4*1. 
#
# It is also possible to write numbers using other "bases", like "base 16"
# or "hexadecimal". Computers often use base 16 because 16 is a convenient
# power of two (i.e., it is a closer fit to the "binary" system that
# computers use internally). A hexadecimal number always starts with the
# two-character prefix "0x" so that you know not to mistake it for a decimal
# number. The number 0x234 means
#        2 * 16^2
#     + 3 * 16^1
#     + 4 * 16^0 
# = 564 decimal. 
#
# Because base 16 is larger than base 10, the letters 'a' through 'f' are
# used to represent the numbers '10' through '15'. So the hexadecimal
# number 0xb is the same as the decimal number 11. When read out loud, the
# "0x" is often pronounced like "hex". "0x" must always be followed by at
# least one hexadecimal digit to count as a hexadecimal number. 
# 
# Modern programming languages like Python can understand hexadecimal
# numbers natively! Try it: 
#
# print(0x234)  # uncomment me to see 564 printed
# print(0xb)    # uncomment me to see 11 printed 
#
# This provides an easy way to test your knowledge of hexadecimal. 
#
# For this assignment you must write token definition rules (e.g., t_ID,
# t_NUM_hex) that will break up a string of whitespace-separated
# identifiers and numbers (either decimal or hexadecimal) into ID and NUM
# tokens. If the token is an ID, you should store its text in the
# token.value field. If the token is a NUM, you must store its numerical
# value (NOT a string) in the token.value field. This means that if a
# hexadecimal string is found, you must convert it to a decimal value. 
#
# Hint 1: When presented with a hexadecimal string like "0x2b4", you can
# convert it to a decimal number in stages, reading it from left to right:
#       number = 0              # '0x'
#       number = number * 16 
#       number = number + 2     # '2'
#       number = number * 16 
#       number = number + 11    # 'b'
#       number = number * 16
#       number = number + 4     # '4'
# Of course, since you don't know the number of digits in advance, you'll 
# probably want some sort of loop. There are other ways to convert a
# hexadecimal string to a number. You may use any way that works. 
#
# Hint 2: The Python function ord() will convert a single letter into 
# an ordered internal numerical representation. This allows you to perform
# simple arithmetic on numbers:  
# 
# print(ord('c') - ord('a') == 2)

import ply.lex as lex

tokens = ('NUM', 'ID')

####
# Fill in your code here.
####

def t_NUM_hex(token):  # must be defined before t_NUM_decimal
    r'0x[0-9a-f]+'
    token.value = int(token.value, 16)  # base 16 also accepts the 0x prefix
    token.type = 'NUM'
    return token

def t_NUM_decimal(token):
    r'[0-9]+'
    token.value = int(token.value)  # won't work on hex numbers!
    token.type = 'NUM'
    return token

def t_ID(token):
    r'[a-zA-Z]+'
    return token

t_ignore = ' \t\v\r'

def t_error(t):
    print("Lexer: unexpected character " + t.value[0])
    t.lexer.skip(1)

# We have included some testing code to help you check your work. You will
# probably want to add your own additional tests. 
lexer = lex.lex() 

def test_lexer(input_string):
  lexer.input(input_string)
  result = [ ] 
  while True:
    tok = lexer.token()
    if not tok: break
    result = result + [(tok.type, tok.value)]
  return result

question1 = "0x19 equals 25" # 0x19 = (1*16) + 9
answer1 = [('NUM', 25), ('ID', 'equals'), ('NUM', 25)]

print(test_lexer(question1) == answer1)

question2 = "0xfeed MY 0xface"
answer2 = [('NUM', 65261), ('ID', 'MY'), ('NUM', 64206)]

print(test_lexer(question2) == answer2)

question3 = "tricky 0x0x0x"
answer3 = [('ID', 'tricky'), ('NUM', 0), ('ID', 'x'), ('NUM', 0), ('ID', 'x')]
print(test_lexer(question3) == answer3)


question4 = "in 0xdeed"
print(test_lexer(question4))

question5 = "where is the 0xbeef"
print(test_lexer(question5))

Hexadecimal Numbers
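Hint 1 and Hint 2 in the assignment describe a manual conversion, while the solution above takes the shortcut int(token.value, 16). For comparison, here is a minimal sketch of the manual loop. It assumes lowercase digits only, as in the t_NUM_hex pattern, and the function name hex_to_int is made up for this illustration.

```python
def hex_to_int(text):
    """Convert a string such as '0x2b4' to an integer, one digit at a time."""
    number = 0
    for ch in text[2:]:                      # skip the '0x' prefix
        if '0' <= ch <= '9':
            digit = ord(ch) - ord('0')       # '0'..'9' -> 0..9
        else:
            digit = ord(ch) - ord('a') + 10  # 'a'..'f' -> 10..15
        number = number * 16 + digit
    return number

# The manual loop agrees with the built-in conversion.
print(hex_to_int('0x2b4') == int('0x2b4', 16))
```

Either approach satisfies the assignment, which allows any conversion that works.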

# Email Addresses & Spam
#
# In this assignment you will write Python code to extract email
# addresses from a string of text. To avoid unsolicited commercial email
# (commonly known as "spam"), users sometimes add the text NOSPAM to an
# otherwise legal email address, trusting that humans will be smart enough
# to remove it but that machines will not. As we shall see, this provides
# only relatively weak protection. 
#
# For the purposes of this exercise, an email address consists of a
# word, an '@', and a domain name. A word is a non-empty sequence
# of upper- or lower-case letters. A domain name is a sequence of two or
# more words, separated by periods. 
#
# Example: wes@udacity.com
# Example: username@domain.name
# Example: me@this.is.a.very.long.domain.name
#
# If an email address has the text NOSPAM (uppercase only) anywhere in it,
# you should remove all such text. Example: 
# 'wes@NOSPAMudacity.com' -> 'wes@udacity.com' 
# 'wesNOSPAM@udacity.com' -> 'wes@udacity.com' 
#
# You should write a procedure addresses() that accepts as input a string.
# Your procedure should return a list of valid email addresses found within
# that string -- each of which should have NOSPAM removed, if applicable. 
#
# Hint 1: Just as we can FIND a regular expression in a string using
# re.findall(), we can also REPLACE or SUBSTITUTE a regular expression in a
# string using re.sub(regexp, new_text, haystack). Example: 
# 
# print(re.sub(r"[0-9]+", "NUMBER", "22 + 33 = 55"))
# "NUMBER + NUMBER = NUMBER" 
#
# Hint 2: Don't forget to escape special characters. 
#
# Hint 3: You don't have to write very much code to complete this exercise:
# you just have to put together a few concepts. It is possible to complete
# this exercise without using a lexer at all. You may use any approach that
# works. 


import ply.lex as lex
import re 

# Fill in your answer here. 

def addresses(haystack): 
    emails = re.findall(r'[a-zA-Z]+@[a-zA-Z]+(?:\.[a-zA-Z]+)+', haystack)
    return [re.sub('NOSPAM', '', email) for email in emails]

# We have provided a single test case for you. You will probably want to
# write your own. 
input1 = """louiseNOSPAMaston@germany.de (1814-1871) was an advocate for
democracy. irmgardNOSPAMkeun@NOSPAMweimar.NOSPAMde (1905-1982) wrote about
the early nazi era. rahelNOSPAMvarnhagen@berlin.de was honored with a 1994
deutsche bundespost stamp. seti@home is not actually an email address."""

output1 = ['louiseaston@germany.de', 'irmgardkeun@weimar.de', 'rahelvarnhagen@berlin.de']

print(addresses(input1) == output1)

Email Addresses And Spam
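The (?: ... ) in the pattern above is a non-capturing group, and it matters here: when a pattern contains a capturing group, re.findall returns the group contents instead of the whole match. A small sketch of the difference, using the first example address from the assignment (the variable names are only for illustration):

```python
import re

text = "contact wes@udacity.com today"

capturing = r'[a-zA-Z]+@[a-zA-Z]+(\.[a-zA-Z]+)+'
non_capturing = r'[a-zA-Z]+@[a-zA-Z]+(?:\.[a-zA-Z]+)+'

# With a capturing group, findall reports only what the group last matched.
print(re.findall(capturing, text))
# With a non-capturing group, findall reports the whole address.
print(re.findall(non_capturing, text))
```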


Original article: http://www.cnblogs.com/demoZ/p/4526059.html
