How to Download TED Subtitles
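TED is a series of talks, and it also works as material for practicing English listening. The thing is, I am more interested in the content of the talks than in improving my English, so I would still rather watch them with Chinese subtitles. I used to download subtitles through http://tedtalksubtitledownload.appspot.com/, but TED has redesigned its website, so that page no longer works. I therefore decided to build my own TED subtitle download tool (and treat it as Python programming practice along the way).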
As you can see, there is no user interface. Maybe later I will turn it into a web service as well.

    # Written for Python 3.
    import urllib.request


    def parseTime(line):
        # Extract the millisecond offset from the data-time='...' attribute.
        return line.split("data-time='")[1].split("'")[0]


    def parseSub(line):
        # Extract the text between the first '>' and the following '<'.
        return line.split(">")[1].split("<")[0]


    def time2Str(time):
        # Convert milliseconds to an SRT timestamp.
        # SRT time format: 00:02:52,184 --> 00:02:53,617
        millisecond = time % 1000
        second = (time // 1000) % 60
        minute = (time // 60000) % 60
        hour = time // 3600000
        return "%02d:%02d:%02d,%03d" % (hour, minute, second, millisecond)


    def getSubtitleList(link):
        # Return (abbreviation, language name) pairs for the available transcripts.
        if not link.endswith("/"):
            link = link + "/"
        url = link + "transcript"
        subtitleList = []
        flag = False
        with urllib.request.urlopen(url) as web:
            for raw in web:
                line = raw.decode("utf-8")
                # The start of transcript options
                if "talk-transcript__language" in line:
                    flag = True
                    continue
                # The end of transcript options
                if "</select>" in line and flag:
                    break
                # Parse options
                # Example: <option value='zh-cn'>Chinese, Simplified</option>
                if flag:
                    lang = line.split(">")[1].split("<")[0]
                    abbr = line.split("'")[1]
                    subtitleList.append((abbr, lang))
        return subtitleList


    def genSubtitle(link, lang):
        # Print the transcript in the given language in SRT format.
        if not link.endswith("/"):
            link = link + "/"
        url = link + "transcript?lang=" + lang
        times = []
        subtitles = []
        with urllib.request.urlopen(url) as web:
            for raw in web:
                line = raw.decode("utf-8")
                if "talk-transcript__fragment" in line:
                    # 12 seconds is for the video opening
                    times.append(int(parseTime(line)) + 12000)
                    subtitles.append(parseSub(line))
        size = len(subtitles)
        if size == 0:
            # No transcript fragments were found on the page.
            return
        # Since the last subtitle has no end time, we set 5 seconds directly.
        times.append(times[size - 1] + 5000)
        # SRT format example:
        # 45
        # 00:02:52,184 --> 00:02:53,617
        # Hello World
        for i in range(size):
            print(i + 1)
            print(time2Str(times[i]) + " --> " + time2Str(times[i + 1]))
            print(subtitles[i])
            print()


    def main():
        link = "http://www.ted.com/talks/charmian_gooch_meet_global_corruption_s_hidden_players/"
        # List the available subtitle languages, then print the English subtitles.
        for option in getSubtitleList(link):
            print(option)
        genSubtitle(link, "en")


    main()
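The timestamp arithmetic and the SRT layout used by time2Str and genSubtitle can be tried out without touching the network. What follows is only a minimal sketch for Python 3, not part of the script itself: `ms_to_srt` and `build_srt` are hypothetical names, the sample cues are made up, and the 12-second opening offset and 5-second final duration simply mirror the constants used in the script.

```python
def ms_to_srt(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    seconds, millis = divmod(ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def build_srt(cues, offset_ms=12000, last_duration_ms=5000):
    """Build SRT text from (start_ms, text) pairs sorted by start time.

    Each cue ends where the next one starts; the last cue has no
    successor, so it is given last_duration_ms (an assumption carried
    over from the script, not a property of the SRT format).
    """
    if not cues:
        return ""
    # Shift every start time by the opening offset, then add a final end time.
    starts = [start + offset_ms for start, _ in cues]
    starts.append(starts[-1] + last_duration_ms)
    blocks = []
    for i, (_, text) in enumerate(cues):
        blocks.append("%d\n%s --> %s\n%s\n" % (
            i + 1, ms_to_srt(starts[i]), ms_to_srt(starts[i + 1]), text))
    # A blank line separates consecutive SRT entries.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Made-up sample cues, for illustration only.
    sample = [(0, "Hello World"), (1433, "Second line")]
    print(build_srt(sample))
```

Returning the SRT text as a string instead of printing it line by line makes it easy to write the result to a file (for example a hypothetical `talk.srt`) and to check the formatting on its own.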