{"id":27254,"date":"2017-01-18T12:28:55","date_gmt":"2017-01-18T10:28:55","guid":{"rendered":"https:\/\/mamchenkov.net\/wordpress\/?p=27254"},"modified":"2017-01-18T12:28:55","modified_gmt":"2017-01-18T10:28:55","slug":"10000-most-common-english-words","status":"publish","type":"post","link":"https:\/\/mamchenkov.net\/wordpress\/2017\/01\/18\/10000-most-common-english-words\/","title":{"rendered":"10,000 most common English words"},"content":{"rendered":"<!-- google_ad_section_start -->\n<p>This <a href=\"https:\/\/github.com\/first20hours\/google-10000-english\">GitHub repository<\/a> contains a list of the 10,000 most common English words, sorted by frequency, as seen by the Google Machine Translation Team.<\/p>\n<blockquote><p>Here at Google Research we have been using word n-gram models for a variety of R&amp;D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google&#8217;s datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there&#8217;s no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more &#8211; resulting in a training corpus of one trillion words from public Web pages.<\/p>\n<p>We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That&#8217;s why we decided to share this enormous dataset with everyone. 
We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.<\/p><\/blockquote>\n<p>There are a few variations of the list &#8211; with and without the swear words and such. \u00a0I took a quick look at it and was surprised to find that &#8220;<em>cyprus<\/em>&#8221; is at <a href=\"https:\/\/github.com\/first20hours\/google-10000-english\/blob\/master\/google-10000-english.txt#L4993\">position 4,993<\/a> (pretty high), immediately after the word &#8220;<em>emails<\/em>&#8221;. \u00a0Weird!<\/p>\n<p>(found via the link from <a href=\"http:\/\/www.netmux.com\/blog\/cracking-12-character-above-passwords\">this article<\/a>)<\/p>\n<!-- google_ad_section_end -->\n","protected":false},"excerpt":{"rendered":"<!-- google_ad_section_start -->\n<p>This GitHub repository contains a list of the 10,000 most common English words, sorted by frequency, as seen by the Google Machine Translation Team. 
Here at Google Research we have been using word n-gram models for a variety of R&amp;D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and &hellip; <a href=\"https:\/\/mamchenkov.net\/wordpress\/2017\/01\/18\/10000-most-common-english-words\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">10,000 most common English words<\/span><\/a><\/p>\n<!-- google_ad_section_end -->\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"10,000 most common English words #language #stats #English #USA #research 
#security","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false,"_links_to":"","_links_to_target":""},"categories":[1,22,62],"tags":[2238,243,1117,200,1041,2395],"keyring_services":[],"class_list":["post-27254","post","type-post","status-publish","format-standard","hentry","category-general","category-cyprus","category-technology","tag-google-translate","tag-language","tag-research","tag-security","tag-statistics","tag-usa"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":16733,"url":"https:\/\/mamchenkov.net\/wordpress\/2012\/09\/20\/if-you-have-the-data-use-it\/","url_meta":{"origin":27254,"position":0},"title":"If you have the data, use it!","author":"Leonid Mamchenkov","date":"September 20, 2012","format":false,"excerpt":"Spending quite a bit of time on the web, I've boosted my tolerance levels to bad design, horrible user interfaces, and twisted logic. \u00a0However, there are still things that annoy the crap out of me. \u00a0Among the two most frequent are these: Google throwing me into Greek language. 
\u00a0Yes, I\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":9197,"url":"https:\/\/mamchenkov.net\/wordpress\/2005\/07\/05\/daily-del-icio-us-bookmarks\/","url_meta":{"origin":27254,"position":1},"title":"Daily del.icio.us bookmarks","author":"Leonid Mamchenkov","date":"July 5, 2005","format":false,"excerpt":"Shared bookmarks for del.icio.us user tvset on 2005-07-05 Welcome to Cyprus Naturists Tagged as: beach cyprus leasure naked nudity tourism travel vacations More on Cyprus naturist beaches Tagged as: beach cyprus leasure naked nudity tourism travel vacations Eric J. Heller Gallery -- Quantum mechanics in art Tagged as: art cool\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":29081,"url":"https:\/\/mamchenkov.net\/wordpress\/2018\/12\/11\/cyprus-national-internet-portal-for-open-data\/","url_meta":{"origin":27254,"position":2},"title":"Cyprus National Internet Portal for Open Data","author":"Leonid Mamchenkov","date":"December 11, 2018","format":false,"excerpt":"It is via this Cyprus Mail article that I've learned that not only Cyprus has an official Open Data portal, but that it's also the best in Europe: Cyprus is one of the top five European Union countries in the field of Open Data for 2018, while the new National\u2026","rel":"","context":"In 
&quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/mamchenkov.net\/wordpress\/wp-content\/uploads\/2018\/12\/open-data.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/mamchenkov.net\/wordpress\/wp-content\/uploads\/2018\/12\/open-data.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/mamchenkov.net\/wordpress\/wp-content\/uploads\/2018\/12\/open-data.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/mamchenkov.net\/wordpress\/wp-content\/uploads\/2018\/12\/open-data.jpg?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":13828,"url":"https:\/\/mamchenkov.net\/wordpress\/2011\/01\/04\/22-bullets\/","url_meta":{"origin":27254,"position":3},"title":"22 Bullets","author":"Leonid Mamchenkov","date":"January 4, 2011","format":false,"excerpt":"I've heard a few good things about \"22 Bullets\" so I put some effort into getting a copy. \u00a0The film is also originally in French, so I had to spend some time finding a viewable translation. \u00a0I don't really like dubbed films, so I opted for English sub-titles. 
\u00a0Too bad\u2026","rel":"","context":"In &quot;3 stars&quot;","block_context":{"text":"3 stars","link":"https:\/\/mamchenkov.net\/wordpress\/category\/movies\/movie-reviews\/3-stars\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/mamchenkov.net\/wordpress\/wp-content\/uploads\/2011\/01\/22_bullets-500x375.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":11894,"url":"https:\/\/mamchenkov.net\/wordpress\/2009\/10\/24\/google-docs-google-translate-and-the-web-integration\/","url_meta":{"origin":27254,"position":4},"title":"Google Docs, Google Translate, and the Web integration","author":"Leonid Mamchenkov","date":"October 24, 2009","format":false,"excerpt":"Google Docs recently got a pretty exciting feature - integration with Google Translate.\u00a0 But as exciting as it is, if you combine the new functionality with some bits of the previously available functionality, you can get truly mind-blowing results. Consider an example.\u00a0 You have a feedback form on your web\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"Google Docs Form","src":"https:\/\/i0.wp.com\/mamchenkov.net\/wordpress\/wp-content\/uploads\/2009\/10\/google_docs_form.jpeg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":12160,"url":"https:\/\/mamchenkov.net\/wordpress\/2010\/02\/18\/how-accurate-is-google-analytics\/","url_meta":{"origin":27254,"position":5},"title":"How accurate is Google Analytics?","author":"Leonid Mamchenkov","date":"February 18, 2010","format":false,"excerpt":"That's the question that I was asked recently by one of the co-workers. \u00a0 It is simple and not so simple at the same time. \u00a0It really depends on what you are looking for, what is the acceptable accuracy, and what is that you are comparing Google Analytics with. 
For\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_sharing_enabled":true,"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/posts\/27254","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/comments?post=27254"}],"version-history":[{"count":0,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/posts\/27254\/revisions"}],"wp:attachment":[{"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/media?parent=27254"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/categories?post=27254"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/tags?post=27254"},{"taxonomy":"keyring_services","embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/keyring_services?post=27254"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}